
I've written a big review of R. Can I get some feedback?

12 messages · Stephen H. Dawson, DSL, Toby Hocking, Ben Bolker +5 more

#
Hello,

For a while, I've been working on writing a very big review of R. I've finally finished my final proofread of it. Can I get some feedback? This seems the most appropriate place to ask. It's linked below.

https://github.com/ReeceGoding/Frustration-One-Year-With-R

If you think you've seen it before, that will be because it found some popularity on Hacker News before I was done proofreading it. The reception seems largely positive so far.

Thanks,
Reece Goding
#
Hi Reece,


Thanks for the article. What specific feedback do you seek for your writing?


Kindest Regards,
*Stephen Dawson, DSL*
/Executive Strategy Consultant/
Business & Technology
+1 (865) 804-3454
http://www.shdawson.com
On 4/9/22 15:52, Reece Goding wrote:
1 day later
#
You could take some of your observations and turn them into patches that
would help improve R. (discussion of such patches is one function of this
email list)

On Sun, Apr 10, 2022 at 9:05 AM Stephen H. Dawson, DSL via R-devel <
r-devel at r-project.org> wrote:

#
Yes, although I would say that the vast majority of the observations 
here, while true, are thoroughly baked in, through some combination of 
backward compatibility and R-core stubbornness [many of them may indeed 
have been discussed on this list over the years].

    I would say that *documentation* patches (e.g., relating to your 
comments about  the lack of examples for some basic functions) are most 
likely to succeed. Adding a few lines to the documentation here and 
there that will help out new users would have a big marginal value.

   There may be a few edge cases where R does something silently that 
you can successfully argue should *always* be an error, and introduce a 
patch to make it so (e.g. the upcoming change in R 4.2.0 that "Calling 
&& or || with either argument of length greater than one now gives a 
warning (which it is intended will become an error)").
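A minimal sketch of the behaviour in question (what you see depends on your R version; this is an illustration, not a patch):

```r
# Before R 4.2.0, && silently used only the first element of each
# argument; 4.2.0 makes it a warning, and the quoted NEWS entry says
# it is intended to become an error.
cond <- c(TRUE, FALSE)
res <- tryCatch(cond && TRUE,
                warning = function(w) "warning",
                error   = function(e) "error")
res  # TRUE on old versions, "warning" on 4.2.x, "error" once it changes
```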

   Some of the issues can be worked around with add-on packages that 
implement the desired functionality (again, it is entirely reasonable to 
argue that the design of the base language should be fixed, but it's not 
going to be ...)

   cheers
    Ben Bolker
On 4/11/22 4:51 PM, Toby Hocking wrote:

#
Hi Reece,

I'm not really sure what kind of review you're looking for (and I'm not
certain this is the right place for it, but hopefully it's OK enough). Also,
to channel Pascal, forgive me: I would have written a shorter response, but
I didn't have the time.

Firstly, it is fairly ... partisan, I suppose, for lack of a better term.

More importantly, from a usefulness perspective, you often notably don't
present the knowledge you gained at the end of the various frustrations you
had. As one example that jumped out at me, you say

"One day, you'll be tripped up by R's hierarchy of how it likes to simplify
mixed types outside of lists."

but you don't present your readers with the (well-defined) coercion
hierarchy so that they would, you know, not be tripped up by it as badly.
This is probably my largest issue with your document overall. It can give
the reader talking points about how R is bad (not all of which are even
incorrect, per se, as many expert R users will be happy to tell you), but
it won't really help people become better R users in many cases.
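For reference, the hierarchy itself is small enough to show in a few lines:

```r
# c() promotes mixed atomic types to the most general one present:
# logical -> integer -> double -> complex -> character
typeof(c(TRUE, 1L))            # "integer": TRUE is promoted to 1L
typeof(c(1L, 2.5))             # "double"
typeof(c(2.5, "a"))            # "character": 2.5 becomes "2.5"
typeof(c(TRUE, 1L, 2.5, "a"))  # "character" always wins
```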

Your article also, I suspect, fails to understand what a typical "novice R
user" is and what they want to do. By and large they want to analyze data
and create plots. They are analysts, NOT programmers (writing analysis
scripts is not programming in the typical sense, and I'm not the only one
who thinks that).

So the point you make early on in your explanation of why you do not strongly
recommend R For Data Science (which I had no part in writing and have not
read myself), namely:

"It deliberately avoids the fundamentals of programming (e.g. making
functions, loops, and if statements) until the second half. I therefore
suspect that any non-novice would be better off finding an introduction to
the relevant packages with their favourite search engine."

misses the point of R itself for what I'd claim is the "typical novice R
user".

Having read through your review, I'm confused why you were using R to do
some of the things I'm inferring you felt you needed it to do. If you
picked up R wanting a general-purpose language equally applicable to all
programming problem domains, you're going to have a bad time, mostly
because that is not what R is.

Finally, a (very) incomplete response to a few of the more specific points
raised in your review:

*Lists:*

The linked Stack Overflow question (
https://stackoverflow.com/questions/2050790/how-to-correctly-use-lists-in-r)
shows a pretty fundamental misunderstanding of what lists and atomic
vectors are/do in R. There is nothing wrong with this; asking questions we
don't know the answer to is how we learn. But I'm not sure the question
serves as well as a primer for R lists as you claim. The top answer at time
of writing discusses the C-level structure of R objects, which can, I
suppose, inform your knowledge of how lists work at the R level, but it is
neither necessary nor the most pedagogically useful way to present it.

*Strings:*

Strings are not idiomatically arrays of characters at the R level; they
are *atomic observed values within a (character) vector of data*. Yes,
deep down in the C code they are arrays of characters, but not at the R
level. As such, splitting the elements of a character vector into their
respective component individual characters is not (at all, in my
experience) a common operation. Within typical R usage (where charvec is
*a vector of data*), charvec[1] is much more likely to be intended to
select the *first observation of the data vector*, which it does. Given
what R is for, frankly I think it'd be fairly insane for charvec[1] to do
what substr does.

*Variable Manipulation*

Novice users shouldn't be calling eval. This is not to gatekeep it from
them, like we have some special "eval-callers" club that they're not
invited to. Rather, it is me saying that metaprogramming is not a
novice-difficulty task in R (or, I would expect, anywhere else really).

You also say "variable names" in this section where you mean "argument
names", and that distinction is both meaningful and important. *Variable
names* are not partially matched:
*Error: object 'x' not found*
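A short sketch of the distinction:

```r
# Argument names are partially matched:
f <- function(verbose = FALSE) verbose
f(verb = TRUE)  # TRUE: 'verb' unambiguously matches 'verbose'

# Variable names are not:
verbose_flag <- TRUE
tryCatch(verb_fl, error = function(e) conditionMessage(e))
# "object 'verb_fl' not found"
```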


*Subsetting:*

I'm fairly certain arrays (including 2-d matrices) are stored in column
order rather than row order because that has been the standard for linear
algebra on computers since before I knew what either of those things were...
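For example:

```r
m <- matrix(1:6, nrow = 2)  # filled down the columns (column-major)
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m[2, 1]  # 2: row 2 of column 1
m[3]     # 3: linear indexing also walks down the columns
```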

tail(x, 1) *is* the idiomatic way of getting the last element of a vector. The
people on Stack Overflow who told you this was "very slow" were misguided
at best. It takes ~6000 *nano*seconds on my laptop, compared to ~200
nanoseconds for x[length(x)]. Yes, that is a 30x speedup; no, it doesn't
matter in practice.
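Anyone who wants to check can time it themselves (microbenchmark is an add-on package, not base R):

```r
x <- seq_len(1e6)
identical(tail(x, 1), x[length(x)])  # TRUE: same answer either way
# To see the constant-factor gap (numbers will vary by machine):
# install.packages("microbenchmark")
# microbenchmark::microbenchmark(tail(x, 1), x[length(x)])
```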

I'm going to stop now because this is already too long, but this type of
response continues to be possible throughout.

Lastly, with regard to your mapply challenge, I quote directly from the
documentation (emphasis mine):

   ...: *arguments to vectorize over* (vectors or lists of strictly

          positive length, or all of zero length).  See also 'Details'.

   MoreArgs: a list of *other arguments* to 'FUN'.



... consists of the arguments you vectorize over, so FUN gets one element
of each thing in ... for each call. MoreArgs, then, is the set of arguments
to FUN *which you don't vectorize over*, i.e. where each call to FUN gets
the whole thing. That's it, that's the whole thing.


I don't disagree that this could be clearer (as Ben pointed out, a
documentation patch would be the way to address this), but it's not correct
to say the information isn't in there at all.


Best,

~G
On Mon, Apr 11, 2022 at 1:52 PM Toby Hocking <tdhock5 at gmail.com> wrote:

#
Any large community-based project is going to be driven by the willing volunteers. Duncan Murdoch
pointed this out long ago for R. Those who do are those who define what is done.

That said, I've felt for quite a long time that the multiplicity of ways in which R can do the same
tasks leads to confusion and errors. Arguably, a much stricter language definition that could execute
95% of existing user R scripts and programs would be welcome and provide a tool that is easier
to maintain and, with a great deal of luck, lead to better maintainability of user code.

And, as others have pointed out, backward compatibility is a millstone.

Whether anything will happen depends on who steps up to participate in R.

In the meantime, I believe it is important for all R users to report and try to fix those things
that are egregious faults, and documentation fixes are a very good starting point.

John Nash
On 2022-04-09 15:52, Reece Goding wrote:
#
JC,
Are you going to call this new abbreviated language by the name "Q" or keep calling it by the name "R", as "S" is taken?
As a goal, yes, it is easier to maintain a language that is sparse. It may sort of force programmers to go in particular ways to do things, and those ways could be very reliable.
But it will drive many programmers away from the language, as it will often not match their way of thinking about problems.
You can presumably build a brand new language with design goals. As you note, existing languages come with a millstone around their necks, or an albatross.
R is an extendable language. You can look at many of the packages, or even packages of packages such as the tidyverse, as examples of adding on functionality to do things other ways that have caught on. Some even partially supplant use of perfectly usable base R methods. Many end up being largely rewritten as libraries in another language such as a version of C to speed them up.
So I suspect limiting R from doing things multiple ways would encourage making more other ways and ignoring the base language.
But different ways of doing things is not just based on command names but on techniques within programming. Anyone who wants to can do a matrix multiplication using a direct primitive, but also by a nested loop and other ways. There is nothing wrong with allowing more ways.
Yes, there is a huge problem with teaching too much and with reading code others wrote.
But I suggest that there have been languages that tried to make you use relatively pure functional programming methods to solve everything. Others try to make you use object-oriented techniques. Obviously some older ones only allow procedural methods, and some remain in the GOTO stage.
Modern languages often seem to feel obligated to support multiple modes but then sometimes skimp on other things. R had a focus and it left out some things, while a language like Python had another focus and included many things R left out, while totally ignoring many it has. BOTH languages have later been extended through packages and modules because someone WANTED the darn features. People like having concepts they can use like sets and dictionaries, not just lists and vectors. They like having the ability to delay evaluation but also to force evaluation, and so on. If you do not include some things in the language for purist reasons, you may find it used anyway, and probably less reliably, as various volunteers supply the need.
Python just added versions of a PIPE. That opens up all kinds of new ways to do almost anything. In the process, they already mucked with a new way to create an anonymous function, and are now planning to add a new use for a single underscore as a placeholder. But a significant number of R users already steadily use the various kinds of pipes written before, using various methods, and that can break in many cases. Is it wiser to let a large user body rebel, or consider a built-in and efficient way to give them that feature?
What I wonder is, now that we have a pipe in R, will any of the other ways wither away and use it internally, or is it already too late and we are stuck now with even more incompatible ways to do about the same thing?
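For what it's worth, here is how the built-in pipe (available since R 4.1) compares with the magrittr pipe it may or may not supplant:

```r
# Native pipe: plain function-call rewriting, no package needed.
res <- c(1, 4, 9) |> sqrt() |> sum()
res  # 6
# magrittr pipe, near-equivalent here (requires the magrittr package):
# library(magrittr)
# c(1, 4, 9) %>% sqrt() %>% sum()
```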



#
I probably should have been more expansive. I'm mainly thinking the base language processor should
be "small" or "strict". Possible alternative expressions of ideas can be defined in macros and eventually
in replacement code that is perhaps more efficient. However, I think isolating these macro collections
is helpful. R already does this with packages, so I'm really talking about moving the boundary between
base language and packages downwards towards a stricter core.

Possibly this could be done within the existing R framework. However, I rather doubt that will happen.
More likely someone will propose a "new" language.

Sometimes I find myself going back and running Fortran 77 codes ....

JN
On 2022-04-12 11:29, Avi Gross via R-devel wrote:
#
I hear you, JC.
There is often tension between making what you have work well and making it perhaps more general or flexible, or adding in oodles of new features.
Experienced programmers sometimes realize that they can speed up their programs by replacing a call to a more general function like, say, paste() with a call to paste0() when the other features are not being used, or use other simpler and faster methods to just combine two strings that have no vectorized components.
So perhaps inevitably, you may develop ever more ways to do things, and often a wrapper is made that does very little more than call another function with the arguments in a different order or with defaults added.
If you know exactly what you want, why call read.csv() rather than directly call read.table(), and have the overhead of another function call that you don't need?
But realistically, programmer time and energy also count for something.
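To make that concrete (the defaults below are the ones read.csv() documents itself as passing to read.table()):

```r
# read.csv() is essentially read.table() with CSV-friendly defaults:
# header = TRUE, sep = ",", quote = "\"", dec = ".",
# fill = TRUE, comment.char = ""
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2", "3,4"), tmp)
df <- read.csv(tmp)
df$a  # 1 3
# And paste0() is paste() with sep = "" baked in:
paste("a", "b")   # "a b"
paste0("a", "b")  # "ab"
```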



#
Hi Gabriel,

Thanks for the feedback. Much of what you've said seems to agree with a common trend that I've seen in other feedback. Namely, you seem to agree with the many who have told me that using R as anything other than a tool for data analysis was a grave mistake. I'm increasingly starting to suspect that you're all right. I therefore have little to no counters to your points.

As for what you've said in reply to my "mapply challenge", I admit that your response is logical and may even be the best possible answer. However, I find it disturbing that the solution to my puzzle appears to rest on having a very careful and very specific understanding of what the words "vectorize over" mean in the documentation. You could well be right, but it doesn't sit well with me.

I'll further consider what you've said about the rest. I'm already making some changes.

Thanks again,
Reece
#
Actually, some of us use R as a quite effective tool for testing and evaluating optimization,
nonlinear least squares codes, and some other numerical algorithms, as well as for one-off,
or at least limited-run, cases. R allows a lot of flexibility in trying out ideas
for such tasks where the goal is to check that the solution is a good one: reliability
rather than speed.

My experience is that R is at least as good a programming environment for this as any other
I've used, e.g. Matlab, Fortran, Pascal, BASIC, C, Python, and a few others now of historical
interest only. Who's up for Modula2, APL, QNial, Algol, Algol68, forth, or that wonderful tool,
Assembler? I've used them all, and now prefer R. I might be tempted to try Julia once it
stabilizes. For now it seems to have a lot of similarities to a greased piglet at a fairground
competition.

I do almost no data analysis with R myself, though others do, I believe, use my software in that
pursuit. Given that some of your complaints about R nuisances have a basis, I'll simply note that all
programming languages have their annoyances. At least R is community-based and we can provide those
wonderful "small reproducible examples" and quite often get improvements, and if we offer good
patches, they do get into the distributed code. I've managed to do that in the space of a few
months with R. A reported bug in Excel ... still there I think after over 2 decades. I gave up
waiting.

As others have written in replies to your request, it is worth focusing on what may be fixable and
seeing what can be done, either to fix those issues or to document things to help users. And I
think it sensible to assume ALL of us are novices in areas beyond where we toil daily. R users
cover a huge range of interests, so experts in one aspect are beginners in another.

John Nash
On 2022-04-12 17:31, Reece Goding wrote:
#
Hi Reece,
We have used R to develop fairly complex data transformation pipelines (which include custom data validation, custom async task scheduling layer and other non-trivial components), and I regularly use R to prototype complex data structure algorithms before implementing them in something more low-level like C. It can be a pain sometimes of course, but R is very well suited for data-oriented design, especially when you leverage the excellent low-level work done by the tidyverse group (especially rlang, vctrs and purrr). Of course, if you are used to OOP everywhere it will be very tough, but it's the 21st century after all, not the nineties :)

Overall, I would say that there are three big issues with R that I don't think can be fixed (the fourth, smaller issue is the standard library, but that's a different discussion):

1. The type system is very weak and idiosyncratic, this makes more complex applications difficult and error-prone. 

2. The language itself is unsound. R's semantics have been developed on the fly, with new features driven by pragmatic necessity. R is essentially LISP with C-like syntax and a very leaky runtime; all this makes it very powerful but also offers you multiple ways to shoot yourself in the foot. A very simple example: lazy evaluation in R is everywhere, but lazy expressions can contain side effects, which equals FUN. R's philosophy here is very pragmatic: the user is responsible. And it works very well most of the time, until it doesn't. 

3. The performance is bad, very bad. This is due to the fact that R's runtime uses linked lists (arguably the worst data structure for modern hardware) everywhere, and you can't do even the simplest operation without performing multiple memory allocations. R's core team did some fantastic work in recent years, like the inclusion of the byte code compiler, lazy data types (ALTREP), etc., but there is only so much one can do under the circumstances. 
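A minimal sketch of the lazy-evaluation footgun from point 2 (the function and variable names are made up for illustration):

```r
f <- function(x) {
  message("inside f, before touching x")
  x  # the promise is forced here, so its side effect runs *now*
}
side_log <- character(0)
val <- f({ side_log <<- c(side_log, "side effect ran"); 42 })
val       # 42
side_log  # "side effect ran": fired during f's execution, not at the call site
```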

I do think one could make a next-gen R by giving it sound evaluation semantics, a consistent (stricter) type system, first-class support for metaprogramming hygiene, dropping the linked-list implementation in favour of something modern like immutable data structures, etc., and it would be a very nifty and powerful little data processing language. Unfortunately, it would also break all the existing code, rendering it fairly useless. Because if one goes through all that effort, one might as well just migrate to Julia. 

In the end, R might be frustrating at times and its implementation is dated, but it is still very much usable. And of course, it is the language we love and cherish, and so we carry on :)

Best, 

Taras