popular R packages

30 messages · Gabor Grothendieck, Thomas Adams, David Winsemius +13 more

#
I would like to get some idea of which R-packages are popular, and what R is
used for in general. Are there any statistics available on which R packages
are downloaded often, or is there something like a package-survey? Something
similar to http://popcon.debian.org/ maybe? Any tips are welcome! 

-----
Jeroen Ooms * Dept. of Methodology and Statistics * Utrecht University 

Visit http://www.jeroenooms.com to explore some of my
current projects.
#
This function will show which other packages depend on a particular
package:
+    pkg <- paste("\\b", pkg, "\\b", sep = "")
+    cat("Depends:", rownames(AP)[grep(pkg, AP[, "Depends"])], "\n")
+    cat("Suggests:", rownames(AP)[grep(pkg, AP[, "Suggests"])], "\n")
+ }
Depends: AER BootPR FinTS PerformanceAnalytics RBloomberg
StreamMetabolism TSfame TShistQuote VhayuR dyn dynlm fda fxregime
lmtest meboot party quantmod sandwich sde strucchange tripEstimation
tseries xts
Suggests: TSMySQL TSPostgreSQL TSSQLite TSdbi TSodbc UsingR Zelig
gsubfn playwith pscl tframePlus
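
The transcript above starts partway through the function definition. A self-contained version might look like the sketch below. The function name `reverse_deps`, the list return value, and the small `AP_demo` matrix are assumptions made for illustration; only the word-boundary matching against the "Depends" and "Suggests" columns comes from the post. By default `AP` would be the matrix returned by `available.packages()`, which needs network access.

```r
# Hypothetical helper: list packages whose Depends/Suggests fields mention `pkg`.
# `AP` is a character matrix shaped like the result of available.packages().
reverse_deps <- function(pkg, AP = available.packages()) {
  # \\b anchors the name at word boundaries so "zoo" does not match "zoology".
  pattern <- paste0("\\b", pkg, "\\b")
  list(
    Depends  = rownames(AP)[grep(pattern, AP[, "Depends"])],
    Suggests = rownames(AP)[grep(pattern, AP[, "Suggests"])]
  )
}

# Tiny made-up package matrix so the example runs offline.
AP_demo <- matrix(
  c("R (>= 2.0), zoo", NA,
    "zoo, sandwich",   "car",
    NA,                "zoo, lattice",
    "zoology",         NA),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("pkgA", "pkgB", "pkgC", "pkgD"), c("Depends", "Suggests"))
)

res <- reverse_deps("zoo", AP_demo)
print(res)
```

Against a live repository the call would simply be `reverse_deps("zoo")`. Package names containing a dot would need the dot escaped before building the pattern.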
On Sat, Mar 7, 2009 at 2:57 PM, Jeroen Ooms <j.c.l.ooms at uu.nl> wrote:
#
When the question arises "How many R-users there are?", the consensus  
seems to be that there is no valid method to address the question. The  
thread "R-business case" from 2004 can be found here:
https://stat.ethz.ch/pipermail/r-help/2004-March/047606.html

I did not see any material revision to that conclusion during the  
recent discussion of the New York Times article on the r-challenge to  
SAS.

Gmane tracks the volume of r-help activity (I realize that is not what
you asked for):
http://www.gmane.org/info.php?group=gmane.comp.lang.r.general

The distribution of r-packages is, well  ... distributed:
http://cran.r-project.org/mirrors.html

At least one of the participants in the 2004 thread suggested that it  
would be a "good thing" to track the numbers of downloads by package.  
I have not heard of any such system being installed in the mirror  
software and I see nothing that suggests data gathering in the CRAN  
Mirror How-to:
http://cran.r-project.org/mirror-howto.html

On the other hand I am not part of R-core, so you must await more  
authoritative opinion since a 5 year-old thread and amateur  
speculation is not much of a leg to stand on.

There are lexicographic packages for R. One approach to a de novo  
analysis would be to do some sort of natural language analysis of the  
r-help archives counting up either package names with non-English  
names or  close proximity of the words "library" or "package" to  
package names that overlap the 30,000 common English words. That would  
have the danger of inflating counts of the packages with the least  
adequate documentation or a paucity of good worked examples, but there  
are many readers of this list who suspect that new users don't look at  
the documentation, so who knows?
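
A much cruder version of that idea can be sketched in a few lines: count how often each name appears inside `library(...)` or `require(...)` calls in message text. The `messages` vector below is made up and stands in for archive bodies that would have to be downloaded separately; the proximity analysis on the words "library" and "package" described above is not attempted here.

```r
# Made-up stand-in for archived r-help message bodies.
messages <- c(
  "Try library(lme4) and then fit with lmer().",
  "I used library(nlme); also library(lme4) for comparison.",
  "The maps package draws it: library(maps)",
  "No package mentioned here."
)

# Match library(<name> or require(<name>, with optional quotes around the name.
pattern <- "(library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)"
hits <- regmatches(messages, gregexpr(pattern, messages, perl = TRUE))

# Keep only the captured package name from each match.
pkg_names <- sub(pattern, "\\2", unlist(hits), perl = TRUE)

# Tabulate mentions, most frequent first.
mention_counts <- sort(table(pkg_names), decreasing = TRUE)
print(mention_counts)
```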
#
I don't think "At least one of the participants in the 2004 thread 
suggested that it would be a "good thing" to track the numbers of 
downloads by package." is reasonable because I download R packages for 2 
home computers (laptop & desktop) and 2 at work (1 Linux & 1 Mac). There 
must be many such cases?

Tom
David Winsemius wrote:

  
    
#
Quite so. It certainly is the case that Dirk Eddelbuettel suggested it
would be very desirable, and I think Dirk's track record speaks for
itself. I never said (and I am sure Dirk never intended) that one  
could take the raw numbers as a basis for blandly asserting that  
<nnnn> copies of <ttt> package are currently installed.

When I update packages, the automated process takes hold and I go for  
a cup of coffee. I only have at the moment two computers with R  
installed and have not updated any binary packages on Windoze in over  
a year.  Nonetheless, I do think the relative numbers of package  
downloads might be interpretable, or at the very least, the basis for  
discussions over beer.
#
i have kept r installed on more than ten computers during the past few
years, some of them running win + more than one linux distro, all of
them having r, most often installed from a separate download.

i know of many cases where students download r for the purpose of a
course in statistics -- often an introductory course for students who
otherwise have little to do with stats. some of them do it more than
once during the semester, and many of them never use r again.

taking into account that basic statistics courses are taught to most
university students and that r is surely the most popular free
statistical computing environment, download-based usage estimates may be
a bit optimistic, unless 'usage' is taken to include 'learn-pass-forget'.

vQ
Tal Galili wrote:
#
I just did RSiteSearch("library(xxx)") with xxx = the names of 6 
packages familiar to me, with the following numbers of hits: 


hits package

 169 lme4
 165 nlme
   6 fda
   4 maps
   2 FinTS
   2 DierckxSpline
     

      Software could be written to (1) extract the names of current 
packages from CRAN then (2) perform queries similar to this on all such 
packages and summarize the results.  I don't have the time now to write 
code for this, but I've written similar code before for step (1);  it 
can be found in "scripts/TsayFiles.R" in the "FinTS" package on CRAN.  
For step (2), Sundar Dorai-Raj wrote code that is included in the
preliminary "RSiteSearch" package available from R-Forge via
'install.packages("RSiteSearch", repos="http://r-forge.r-project.org")'.

      Code to do this could probably be written (a) in a matter of 
seconds by many of those in the R Core team or (b) in a matter of hours 
by virtually any reader of this list using the examples I just cited.  
And it could provide numbers without a need to convince others to keep 
download statistics and make them available later. 
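
The two steps could be wired together roughly as follows. This is only a sketch: `tabulate_hits` and `hit_counter` are hypothetical names, and `fake_counter` replays the counts reported above instead of querying any search server, so nothing here reflects the actual interface of the R-Forge "RSiteSearch" package.

```r
# Build one "library(<pkg>)" query per package and tabulate the hit counts.
# `hit_counter` is a hypothetical function: query string in, number of hits out.
tabulate_hits <- function(pkgs, hit_counter) {
  queries <- sprintf("library(%s)", pkgs)
  hits <- vapply(queries, hit_counter, numeric(1), USE.NAMES = FALSE)
  out <- data.frame(hits = hits, package = pkgs, stringsAsFactors = FALSE)
  out[order(-out$hits), , drop = FALSE]
}

# Stand-in counter that replays the six counts reported above.
reported <- c(lme4 = 169, nlme = 165, fda = 6, maps = 4,
              FinTS = 2, DierckxSpline = 2)
fake_counter <- function(query) {
  pkg <- sub("^library\\((.*)\\)$", "\\1", query)
  unname(reported[pkg])
}

# Step (1) in real use: pkgs <- rownames(available.packages())
result <- tabulate_hits(c("fda", "lme4", "maps", "nlme"), fake_counter)
print(result)
```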

      Hope this helps. 
      Spencer Graves
Wacek Kusnierczyk wrote:
#
On Sat, 07 Mar 2009 18:04:24 -0500, David Winsemius wrote:

[ Snip ... ]
*Anything* might be the basis for discussions over beer (obvious 
corollary to Thermogoddamics' second principle....).

More seriously : I don't think relative numbers of package downloads can 
be interpreted in any reasonable way, because reasons for package 
download have a very wide range from curiosity ("what's this ?"), fun 
(think "fortunes"...), to vital need (think lme4 if/when a consensus on 
denominator DFs can be reached :-)...). What can you infer in good faith 
from such a mess ?

					Emmanuel Charpentier
#
So when we have messy data with measurement error, we should just give
up?  Doesn't sound very statistical! ;)

Hadley
#
On Sun, Mar 8, 2009 at 10:49 AM, hadley wickham <h.wickham at gmail.com> wrote:
Also I would think that the rankings would be meaningful since
the factors that cause the absolute numbers to be off would affect
all packages equally.
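
A toy illustration of that argument, with made-up numbers for three hypothetical packages: when every count is inflated by the same factor the ranking survives, and when the inflation differs by package it need not.

```r
# Made-up "true" usage counts for three hypothetical packages.
true_use <- c(pkgA = 500, pkgB = 200, pkgC = 50)

# Every package inflated by the same factor (say, each user has 4 machines).
uniform <- true_use * 4

# One package inflated far more than the others (say, fetched for a course).
uneven <- true_use * c(1, 1, 12)

# Compare the resulting rankings with the ranking of the true counts.
same_rank_uniform <- identical(rank(uniform), rank(true_use))
same_rank_uneven  <- identical(rank(uneven), rank(true_use))
print(c(uniform = same_rank_uniform, uneven = same_rank_uneven))
```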
#
On 08/03/2009 10:49 AM, hadley wickham wrote:
I think the situation is worse than messy.  If a client comes in with 
data that doesn't address the question they're interested in, I think 
they are better served to be told that, than to be given an answer that 
is not actually valid.  They should also be told how to design a study 
that actually does address their question.

You (and others) have mentioned Google Analytics as a possible way to 
address the quality of data; that's helpful.  But analyzing bad data 
will just give bad conclusions.

Duncan Murdoch
#
As long as we say 'package Foo is the most downloaded package on
CRAN', and not 'package Foo is the most used package for R', we can
leave it to the user to decide if the latter conclusion follows from
the former. In the absence of actual usage data I would think it a
good approximation. Not that I would risk my life on it.

 Pop music charts are now based on download counts, but I wouldn't
believe they represent the songs that are listened to the most times.
Nor would I go so far as to believe they represent the quality of the
songs...

 Should R have a 'Would you like to tell CRAN every time you do
library(foo) so we can do usage counts (no personal data is
transmitted blah blah) ?'? I don't think so....

Barry
#
On 08-Mar-09 15:14:03, Duncan Murdoch wrote:
The population of R users (which we would need to sample in order
to obtain good data) is probably more elusive than a fish population
in the ocean -- only partially visible at best, and with an unknown
proportion invisible.

At least in Fisheries research, there are long established capture
techniques (from trawling to netting to electro-fishing to ... )
which can be deployed, for research purposes, in such a way as to
potentially reach all members of a target population, with at least
a moderately good approximation to random sampling. What have we
for R?

Come to think of it, electro-fishing, ...

Suppose R were released with 2 types of cookie embedded in base R.
Each type is randomly configured, when R is first run, to be Active
or Inactive (probability of activation to be decided at the design
stage ... ). Type 1, if active, on a certain date generates an
event which brings it to the notice of R-Core (e.g. by clandestine
email or by inducing a bug report). Type 2 acts similarly on a later
date. If Type 2 acts, it carries with it information as to whether
there was a Type 1 action along with whether, apparently, the Type 1
action "succeeded".

We then have, in effect, an analogue of the Mark-Recapture technique
of population estimation (along with the usual questions about
equal catchability and so forth).
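
For readers unfamiliar with the technique, the simplest mark-recapture estimate (Lincoln-Petersen) is N = n1 * n2 / m, where n1 is the number seen on the first occasion, n2 the number seen on the second, and m the number seen on both. The simulation below is a sketch only: the population size and the two activation probabilities are invented, and it assumes the two events are independent with equal catchability.

```r
# Lincoln-Petersen estimate: n1 seen on the first date, n2 seen on the
# second date, m seen on both dates.
lincoln_petersen <- function(n1, n2, m) n1 * n2 / m

# Simulated population; size and activation probabilities are made up.
set.seed(1)
N <- 10000
type1 <- runif(N) < 0.10   # installations whose Type 1 event fired
type2 <- runif(N) < 0.10   # installations whose Type 2 event fired

estimate <- lincoln_petersen(sum(type1), sum(type2), sum(type1 & type2))
print(round(estimate))
```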

However, since this sort of thing (which I am not proposing seriously,
only for the sake of argument) is undoubtedly unethical (and would
do R's reputation no good if it came to light), I tentatively conclude
that the population of R users is likely to remain as elusive as ever.

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Mar-09                                       Time: 16:11:44
------------------------------ XFMail ------------------------------
#
Is this another discussion of what data might be collected and 
analyzed, and what could and could not be said if we only had such data? 

      Has anyone but me produced any actual data?  If so, I missed it.  
Hadley mentioned the 'fortunes' package.  My earlier methodology, 
"RSiteSearch('library(fortunes)')", produced 40 hits for 'fortunes', 
compared to 169 for 'lme4' and 2 for 'DierckxSpline'. 

      With anything like this, it would be wise to approach the problem 
from many different perspectives, recognizing that the strengths of one 
approach can help improve our understanding of what other analyses say 
about the question at hand. 

      Happy Sunday. 
      Spencer Graves
(Ted Harding) wrote:
#
On 08/03/2009 12:08 PM, Barry Rowlingson wrote:
But we don't even have that data, since CRAN is distributed across lots 
of mirrors.

Duncan Murdoch

#
Dear Barry,

As far as I understand, you're telling us that a bit of data
mining does no harm, whatever the data. Your example of pop music charts
might support your point (although my ears disagree ...) but I think it
is bad policy to indulge in white-noise analysis without a well-reasoned
motive to do so. It might give bad ideas to potential "statistics
patrons" (think a bit about the sorry state of financial markets :-().

More generally, I tend to be extremely wary about over-interpretation of
belly grumbles as the Voice of the Spirit ... which is a very powerful
urge of many statisticians and statistician's clients. Data mining can
be fine for exploratory musings, but a serious study needs a model,
i.e. a set of ideas and a way to reality-stress them.

As far as I can see (but I might be nearsighted), I see no model linking
package download to package use(s). Data may or may not become available
with more or less of an effort, but I can't see the point.

					Emmanuel Charpentier

On Sunday 08 March 2009 at 16:08 +0000, Barry Rowlingson wrote:
#
On 8 March 2009 at 13:27, Duncan Murdoch wrote:
| But we don't even have that data, since CRAN is distributed across lots 
| of mirrors.
On 8 March 2009 at 19:01, Emmanuel Charpentier wrote:
| As far as I can see (but I might be nearsighted), I see no model linking
| package download to package use(s). Data may or may not become available

Which is why Debian (and Ubuntu) use the _opt-in package_ popularity-contest
that collects data on packages used and submits that to a host collecting the
data.  This drives the so-called 'popcon' statistics.

Yes, and there are many ways in which one can criticise this data collection
process.   But I fail to see how __not having any data__ leads to more
informed decisions.

Once you have data, you have an option of using or discarding it. But if you
have no data, you have no option.  How is that better?

Dirk
#
Dirk Eddelbuettel wrote:
I've also created a package named PopCon here:

http://biostat.mc.vanderbilt.edu/twiki/pub/Main/JeffreyHorner/PopCon_0.1.tar.gz

I provided it to the list many months ago and got no response on its 
implementation or use. I encourage anyone to download it and understand 
how it can be used to implement a popularity contest for both packages 
and even functions and such.

Maybe R can sponsor a Popularity Contest day where everyone is 
encouraged to download the package and "push" some data to r-project.org 
or even crantastic.org that notes what useRs currently have loaded on 
their search path...
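
As a rough idea of what "loaded on their search path" means in practice, base R can already list the attached packages. The helper below is a sketch with a hypothetical name, not the interface of the PopCon package, and it only collects the names locally; nothing is sent anywhere.

```r
# Names of the packages attached to the current session's search path.
attached_packages <- function() {
  entries <- grep("^package:", search(), value = TRUE)
  sub("^package:", "", entries)
}

loaded_now <- attached_packages()
print(loaded_now)
```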

Best,


Jeff
#
On 9/03/2009, at 4:14 AM, Duncan Murdoch wrote:

            
Fortune?

	cheers,

		Rolf Turner

#
Rolf Turner wrote:
looking for fortunes?  got one for you:

    "A key reason that R is a good thing is because it is a language"

who/where is left as an (easy) exercise.

vQ
#
Dear Rolf,

Tukey put it nicely: "The combination of some data and an aching desire for
an answer does not ensure that a reasonable answer can be extracted from a
given body of data." Inasmuch as there are no current fortunes from Tukey, I
nominate this one.

Regards,
 John
#
On 9/03/2009, at 10:23 AM, John Fox wrote:

            
Indeed.  That is one of my favourites.  I second the nomination.

	cheers,

		Rolf
