Dear Mike, et al,
My remarks are not necessarily related to tidyverse packages. The main point
is that there are various purposes and business cases for writing code, and
they may imply different trade-offs. Let me illustrate with some examples. I
will focus on non-standard evaluation and dependencies.
TL;DR version: (and this is my opinion, nobody has to agree).
1/Interactive use: user-level NSE ok (as in the not-a-pipe operator, dplyr
verbs), use any package you want.
2/Applications & local packages: avoid NSE within functions, package an
application with dependencies you need, write code with maintainers in mind.
3/Published R-packages: avoid NSE within functions, minimize dependencies to
what you cannot avoid.
Do Read version:
1/ One-off data analyses or exploratory data analyses. There are cases where
you don't need to guarantee that your code will run a few years from now:
you are the only user and once your task is done, you quickly need to move
on to the next. Especially in EDA, I write a lot of code that is nice to
keep in a structured project folder but most probably: 1) I will be its only
user and 2) I will use it only for this one small project so maintenance is
not an issue. Although I'm writing code in scripts, it is very close to
interactive work on the command-line.
In such cases I use whatever gets the job done, including dplyr, tidyr,
ggplot2, data.table, you name it. Here I basically don't care about
dependencies and if I write functions there are usually not many of them.
2/ Writing applications or packages for internal use. When you write an
application you are usually committing to a longer maintenance horizon and
more than one user. Good chance that you're not the user and also good
chance you're not the only developer. There are many implications to this
but since you need to maintain things for a longer term, dependencies can
become a liability. Fortunately, there are techniques to contain
dependencies, for example using packrat or by manually setting up a library
containing the packages your application depends on. You can even use a
docker instance. I have worked with custom libraries on several occasions.
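To make the "manually setting up a library" option concrete, here is a minimal sketch using only base R. The directory name "app-library" is a placeholder of mine, and the install.packages() call is commented out because it needs network access; a real application would pick its own location and package list.

```r
# Create a directory that will hold the application's own packages.
# (tempdir() is used only so the sketch runs anywhere; a real application
# would use a fixed folder next to its code.)
app_lib <- file.path(tempdir(), "app-library")
dir.create(app_lib, showWarnings = FALSE, recursive = TRUE)

# Put the application library first on the search path, so packages
# installed there take precedence over the user or site library.
old_paths <- .libPaths()
.libPaths(c(app_lib, old_paths))

# Install dependencies into that library only (needs network, so not run):
# install.packages("somepkg", lib = app_lib)

print(.libPaths())
```

Packages loaded with library() are then looked up in app_lib before any other library, which is what keeps the application isolated from upgrades elsewhere on the machine.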
Since you (or someone else) is going to maintain the application, it is
worthwhile to sit down and think what is the best way to set up code so it
remains maintainable. This includes questions like: can I easily understand
what happens when reading it? What expertise does the maintainer need to
understand it? Non-standard evaluation is generally much harder to reason
about than standard evaluated code. This makes debugging and extending code
harder in general.
Now some people will argue that something like filter(data, x>1) is easier
to understand than data[data$x > 1,,drop=FALSE]. I agree that on a very
shallow level, filter(data, x>1) is easy to follow, in the sense of "oh the
author probably wants to filter something here". But when you are debugging,
you need to understand in much greater detail what happens: you need to know
that 'x>1' is an expression, that will be evaluated in the context of
'data'. You need to know about environments and parent environments and so
on. All this knowledge can be avoided with data[data$x > 1,,drop=FALSE]. The
latter also requires knowledge, but the concepts are much simpler, I think.
Hence, I tend to avoid NSE when writing applications, although there may
still be good reasons to do it. Dependencies can be contained in various
ways so they are not such a big problem.
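A small sketch of the difference, using base R's subset() as a stand-in for an NSE verb so that no extra packages are needed. The helper filter_above() and the example data are hypothetical and only serve as illustration.

```r
data <- data.frame(x = c(0, 2, 3), y = c("a", "b", "c"))
threshold <- 1

# NSE: the expression `x > threshold` is captured unevaluated. `x` is then
# found among the columns of `data`, while `threshold` is found in the
# calling environment. The reader has to know both lookup rules.
nse_result <- subset(data, x > threshold)

# Standard evaluation: every value is passed explicitly, and the column is
# named by an ordinary character string.
filter_above <- function(data, column, threshold) {
  stopifnot(is.data.frame(data), column %in% names(data))
  keep <- data[[column]] > threshold
  data[keep, , drop = FALSE]
}
se_result <- filter_above(data, "x", threshold)

print(nse_result)
print(se_result)
```

Both calls keep the same rows here; the difference is in how much of R's evaluation model you need in your head to predict what happens once the function is called from inside another function.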
3/ Writing packages for CRAN. Now you are committing to long-term
maintenance, and usage by interactive users, application builders, and
possibly other package builders. Now a dependency becomes a direct liability
in the sense that the author of your dependency can change interfaces and
ask you to comply with the new version. Also, and especially because of
recursive dependencies, importing a package may give you a whole tail of
dependencies. This increases load time but also install-time, especially on
systems where you need to install from source. Light-weight packages
therefore have real advantages in applications that run many times (like a
standalone script that is fired by users of a web-application or scripts
that are scheduled to run in high frequency). It is also worth mentioning
that an Imports or Depends puts a burden on the maintainer of the package
you depend on: before submitting to CRAN, a pkg developer needs to check
against all reverse dependencies (preferably recursively).
So now, it is even more worthwhile to sit down and think about what is the
best way to set up your code. Well thought out code can be a pleasure to
maintain. Code that is hastily put together is a nightmare.
My philosophy is as follows: I depend on other packages only when they offer
something that I cannot fairly trivially do myself. This may have to do with
a statistical or numerical method I do not want or cannot implement, or it
can have something to do with performance for example. This does indeed
exclude much of the tidyverse almost automatically. Many tools in tidyverse
make already existing functionality easier for (interactive) use. But since
much of the functionality is already present in base R, and because I find
NSE hard to reason about in a programming context I have until now not used
any tidyverse packages as an Imports or Depends.
Hope this helps,
Best,
Mark
On Tue, 17 Jul 2018 at 23:10, Michael Hannon
<jmhannon.ucdavis at gmail.com> wrote:
Thanks, Mark. Your points are well-taken, but I wouldn't refer to
this as a "small side-track". You don't say so, but this could be
interpreted as a recommendation to avoid some or all of the
"tidyverse" in developing packages. I'm actually quite comfortable
doing the base-R-style programming you recommend. I've lately been
trying to make a point of using the "tidy" stuff, as that's what I'm
seeing almost exclusively from folks in my neighborhood these days.
("Resistance is few-tile...")
Also, it would seem to be a corollary that if the ultimate goal is to
make a package, then one shouldn't be using the convenience stuff
(pipes, dplyr, etc., etc.), even during the development stages. Can
you comment? Thanks.
-- Mike
On Tue, Jul 17, 2018 at 2:53 AM, Mark van der Loo
<mark.vanderloo at gmail.com> wrote:
Michael,
Just a small side-track here. I would avoid using the not-a-pipe operator
within functions or packages in general. It is great for interactive use,
but it does make debugging and hence long-term maintenance of functions
harder. There are two reasons for this. First, it hides intermediate
results, and second, it adds several layers to the call stack making the
output of functions like traceback() harder to interpret. I have documented
a simple example here: https://github.com/chriscardillo/norris/issues/1
(scroll down a bit).
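As a rough illustration of the "intermediate results" point (this is not the example from the linked issue), here is a hypothetical function written with named intermediate steps in base R. The function name, the column name and the data are made up; the piped form appears only as a comment for comparison.

```r
# A piped version might read (not run, needs magrittr and dplyr):
#   df %>% dplyr::filter(amount > 0) %>% dplyr::arrange(amount) %>% head(2)

smallest_positive <- function(df) {
  # Step 1: keep rows with a positive amount. When debugging, `positive`
  # can be inspected directly, for example from browser().
  positive <- df[df$amount > 0, , drop = FALSE]
  # Step 2: sort by amount. `ordered` is again a named, inspectable object.
  ordered <- positive[order(positive$amount), , drop = FALSE]
  # Step 3: return the first two rows.
  head(ordered, 2)
}

df <- data.frame(amount = c(3, -1, 2, 5))
result <- smallest_positive(df)
print(result)
```

If a step fails, the error points at a single line with a single call, and every earlier result is still available under its own name.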
Regarding learning about quosures and so on. If the literal names of data
frames are known, you could consider replacing
some_var <- next_data_frame %>% dplyr::select(-amount,...
with something simpler like
some_var <- next_data_frame[ !(names(next_data_frame) %in% c("amount", ... )) ]
which might also save you some dependencies.
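A runnable sketch of dropping columns by name in base R. The column names other than "amount" are invented for the example. It also shows why the name test should use %in% (or setdiff()) and not != when more than one name is excluded.

```r
next_data_frame <- data.frame(
  id     = 1:3,
  amount = c(10, 20, 30),
  note   = c("a", "b", "c"),
  date   = c("2018-07-15", "2018-07-16", "2018-07-17")
)
drop_cols <- c("amount", "note")

# Keep every column whose name is not in drop_cols.
some_var <- next_data_frame[!(names(next_data_frame) %in% drop_cols)]

# Equivalent formulation with setdiff().
some_var2 <- next_data_frame[setdiff(names(next_data_frame), drop_cols)]

# Pitfall: `!=` compares element by element and recycles drop_cols, so here
# it compares "id" with "amount", "amount" with "note", "note" with "amount"
# and "date" with "note". Every comparison is TRUE and nothing is dropped.
naive_keep <- names(next_data_frame) != drop_cols

print(names(some_var))
print(naive_keep)
```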
Hope this helps,
Best,
Mark
On Tue, 17 Jul 2018 at 11:28, Michael Hannon
<jmhannon.ucdavis at gmail.com> wrote:
Thanks to John and Zhian for their recent and informative comments.
Regarding check() and NSE: the moral seems to be that a little
learning is a dangerous thing. I'm off to try to bring quosure to
this issue.
-- Mike
On Mon, Jul 16, 2018 at 2:38 PM, Zhian Kamvar <zkamvar at gmail.com> wrote:
Using dplyr like that is for exploratory data analysis. You'll want to refer
to dplyr's "Programming with dplyr" vignette for using dplyr in a package:
https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
Hope that helps.
On Jul 16, 2018, at 22:13, Michael Hannon <jmhannon.ucdavis at gmail.com>
wrote:
Thanks, Georgi. I've changed my approach and now do what I gather is
recommended practice: put all external package names into the
"Imports" section of the DESCRIPTION file and then use the
fully-qualified names for functions from those packages, as:
dplyr::select()
The "check" operation is still not entirely "happy" with me, but it
doesn't flag any errors, and the package builds and runs.
BTW, one source of "complaints" from "check()" is evidently the use of
NSE in the tidyverse functions. For instance, the line:
next_data_frame %>% dplyr::select(-amount,
generates the message:
standardize_format: no visible binding for global variable 'amount'
where, of course, "amount" is one of the column headings in
"next_data_frame". There seems to be no harm done by this, and I plan
to ignore such messages, but if there's some additional wisdom that
applies here, I'd be happy to receive it.
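A sketch of where such a message comes from, assuming the codetools package is installed (it normally ships with R). Static code analysis sees a bare symbol that is not defined anywhere in the function. The two functions below are hypothetical, and base R's subset() stands in for the dplyr call so that no extra packages are needed.

```r
# NSE style: `amount` appears as a bare symbol in the function body.
drop_amount_nse <- function(df) subset(df, select = -amount)

# Standard evaluation: the column is named by a character string, so no
# undefined symbol appears in the body.
drop_amount_se <- function(df) df[setdiff(names(df), "amount")]

# findGlobals() lists the names a function uses but does not define itself.
globals_nse <- codetools::findGlobals(drop_amount_nse)
globals_se  <- codetools::findGlobals(drop_amount_se)
print(globals_nse)
print(globals_se)

# Both versions give the same result on real data.
df <- data.frame(id = 1:2, amount = c(10, 20))
print(identical(drop_amount_nse(df), drop_amount_se(df)))
```

Where the NSE form is kept, utils::globalVariables("amount") is one way to declare such names to the checker, and the "Programming with dplyr" vignette describes the dplyr-specific alternatives.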
-- Mike
On Sun, Jul 15, 2018 at 12:05 AM, Georgi Boshnakov
<georgi.boshnakov at manchester.ac.uk> wrote:
It seems that the R session used by 'check' doesn't look in the library
used by your interactive session. This discrepancy may happen since the
check tools do not load the same Renviron files as interactive sessions.
This may result in different libraries in interactive and 'check'
sessions. See ?Startup, especially section Note.
It is difficult to give more specific advice without details of your
setup.
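One way to gather such details is the small diagnostic sketch below. It only reports the library search path and a few environment variables described in ?Startup and ?.libPaths. Running it in the interactive session and again in a non-interactive one (for example with Rscript) and comparing the output may show where the two diverge; that this pinpoints the problem in a given setup is an assumption.

```r
# Collect the settings that determine where R looks for installed packages.
lib_info <- list(
  lib_paths      = .libPaths(),
  r_libs         = Sys.getenv("R_LIBS", unset = NA),
  r_libs_user    = Sys.getenv("R_LIBS_USER", unset = NA),
  r_environ_user = Sys.getenv("R_ENVIRON_USER", unset = NA),
  interactive    = interactive()
)
str(lib_info)
```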
Hope this helps,
Georgi Boshnakov