[Bioc-devel] Package download stats inflated? (specifically cummeRbund)

2 messages · lgoff at csail.mit.edu, Hervé Pagès

#
Hi Bioc-devel,
I am the package maintainer for the cummeRbund package, and since I'm
not exactly sure whom I should ask this question, I decided to post
to the bioc-devel list.

Since this is my first Bioc package I have been keenly interested in  
the download stats that are tracked and visible on the Bioconductor  
website, here:

http://bioconductor.org/packages/stats/index.html

Specifically, I'm noticing that the number of downloads for the  
cummeRbund package seems to far outpace the number of unique IP  
addresses downloading the package:

http://bioconductor.org/packages/stats/bioc/cummeRbund.html

For a few months there was a mean of 10-20 downloads per
unique IP address, and the current month is on track for
about 36 downloads/IP (which looks to be about 8.7% of all BioC
package downloads this month so far).  Looking at several
other packages, this does not seem to be typical: most of the
packages in the top 30 list have a ratio of about 1.8-3 downloads/IP.

As ecstatic as these numbers make me, I'm certain that there is some  
underlying reason for this inflation that is not being appropriately  
represented here, but without anything else to go on, I'm not really  
sure where this is coming from.  I would obviously like to have an  
honest representation of the number of downloads for my package, and I  
was hoping that someone with access to these data could help me track  
down the cause of this download inflation (unless these numbers are a
true representation of the downloads, in which case I would very much
like to learn more about the demographics, if possible).

Any and all advice or information is appreciated!  Thanks to all, and  
a special thanks to everyone that helps to keep BioC such an amazing  
project.  I have enjoyed the benefits of Bioconductor for the past 5+
years and I'm very happy that I can finally start to contribute back  
to this wonderful project.  (Also, I look forward to meeting some of  
you at BioC 2012 this year!)

Thanks in advance!

Cheers,

Loyal Goff

(lgoff at csail.mit.edu)
NSF Postdoctoral Fellow
Computer Science and Artificial Intelligence Laboratory, MIT &
Stem Cells and Regenerative Biology Department, Harvard University &
The Broad Institute
#
Hi Loyal,

The high ratio between the number of downloads and the number of unique
IPs should not be a reason to doubt that these numbers are a true
representation of the downloads. We have seen this before. See for example the
stats for the ChIPpeakAnno package:

   http://bioconductor.org/packages/stats/bioc/ChIPpeakAnno.html

The package got downloaded 67k times in Oct/Nov 2011 from only 573
distinct IPs, so here the ratio is 117 downloads / IP.
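
To make the arithmetic explicit, here is a minimal Python sketch of the downloads-per-IP ratio. The function name `downloads_per_ip` is made up for illustration, it is not part of the Bioconductor stats code, and 67k is taken as exactly 67,000, which is only an approximation of the real count.

```python
def downloads_per_ip(total_downloads, distinct_ips):
    """Return the mean number of downloads per distinct IP address."""
    if distinct_ips <= 0:
        raise ValueError("distinct_ips must be positive")
    return total_downloads / distinct_ips


# ChIPpeakAnno, Oct/Nov 2011: roughly 67k downloads from 573 distinct IPs.
# 67_000 is an approximation of "67k", not the exact logged count.
ratio = downloads_per_ip(67_000, 573)
print(round(ratio))  # 117
```

The same function applied to most top-30 packages would give something in the 1.8-3 range mentioned earlier in the thread.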

The first time we saw this kind of massive repetitive downloading was
for the biomaRt package more than a year ago. We investigated it and
discovered that most downloads (> 95%) were coming from a single IP
(the IP itself was from a university somewhere in the US). We don't
know for sure why they needed to download the same package again and
again thousands of times every day for more than 20 days in a row, but
one explanation could be that they were using some kind of dumb script
to install biomaRt on each node of a big cluster. What's strange though
is that we saw the deluge of downloads for a single package (biomaRt)
and not for a subset of Bioconductor packages (it seems to me that
the people in charge of a cluster would typically install more than
one BioC package). But maybe they were testing a script on one package,
then realized they could improve it (to download each package only
once), and then used the improved script to actually deploy Bioconductor
on their cluster. Hard to know...
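
For anyone with access to per-download records, a check of this kind could be sketched as below. This is only an illustration under assumptions: the `(ip, package)` record layout, the function name `top_ip_share`, and the sample addresses and counts are all hypothetical, and this is not how the Bioconductor logs are actually stored or processed.

```python
from collections import Counter


def top_ip_share(records, package):
    """Return (ip, share) for the IP with the most downloads of `package`.

    `records` is an iterable of (ip, package) tuples, one per download.
    This layout is an assumption made for the example.
    """
    counts = Counter(ip for ip, pkg in records if pkg == package)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    ip, n = counts.most_common(1)[0]
    return ip, n / total


# Hypothetical data: 96 downloads from one address, 4 from other addresses.
records = [("10.0.0.1", "biomaRt")] * 96
records += [("10.0.0.%d" % i, "biomaRt") for i in range(2, 6)]

ip, share = top_ip_share(records, "biomaRt")
print(ip, share)  # 10.0.0.1 0.96
```

A share close to 1 for a single address would point to the kind of scripted, repetitive downloading described above rather than to many independent users.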

Anyway, because those massive repetitive downloads are possible, maybe
we should put more emphasis on the number of distinct IPs. This number is
probably more representative of the number of users and therefore is
a better indicator of how much a package is actually used.

Cheers,
H.
On 05/23/2012 02:54 PM, lgoff at csail.mit.edu wrote: