About identification of CRAN CHECK machines in logs

4 messages · Marcelo Perlin, Hadley Wickham

#
Hi,

I recently released two packages (RndTexExams and GetTDData) on CRAN and
I'm trying to track the number of downloads and the location of users.

I wrote a simple script to download and analyze the log files at
http://cran-logs.rstudio.com.
I noticed, however, that there is a spike in the number of downloads
whenever a new version of the packages is released. I believe that the
CRAN checks are included in the number of installations of the package in
the log files.

The log files include a column "ip_id", which assigns a daily unique id to
each new IP address. My question is, can CRAN set the ip_id of the CRAN
machines to a fixed value so that we can filter them out and keep only
"real" users in the data? Can anyone see any other way around it?


Thanks.
#
On Thu, Jun 9, 2016 at 9:24 AM, Marcelo Perlin <marceloperlin at gmail.com> wrote:
I don't think it's true that CRAN checks show up in those logs. Why would
CRAN be installing the package from a mirror?

Hadley
#
I don't know, Hadley. But you can see evidence of "something" systematically
installing the packages in the log data. For my two CRAN packages I
noticed a high correlation in the number of downloads.

Try the following script, which will pick 5 random packages from CRAN and
calculate the correlation matrix between their differenced number of
downloads. To avoid spurious correlations, I removed the weekends, since we
can expect some seasonality, and also the zero entries. It's crude, I know,
but it does show some positive associations between the number of
installations of the packages.

If not CRAN, who/what is downloading these packages and how can I set it
apart from the actual user installations?

Many thanks!

____
# get the list of packages currently on CRAN
df <- as.data.frame(available.packages(), stringsAsFactors = FALSE)

# choose 5 random packages
my.pkgs <- df$Package[sample(nrow(df), 5)]

#my.pkgs <- c('RndTexExams','GetTDData')

# daily download counts for the chosen packages
dl.df <- cranlogs::cran_downloads(my.pkgs, from = '2015-01-01', to = Sys.Date())

# mark zero entries as missing
dl.df$count[dl.df$count == 0] <- NA

# remove weekends (wday: 0 = Sunday, 6 = Saturday)
dl.df$wday <- as.POSIXlt(dl.df$date)$wday
dl.df <- dplyr::filter(dl.df, wday != 0, wday != 6)

# to wide (for corr): one column per package
dl.df <- tidyr::spread(dl.df, key = package, value = count)

# drop days where any package has a missing count
dl.df <- dl.df[complete.cases(dl.df), ]

# correlation of day-to-day changes (columns 1:2 are date and wday)
diff.mat <- diff(as.matrix(dl.df[, 3:ncol(dl.df)]))
cor(diff.mat)

___
On Thu, Jun 9, 2016 at 6:18 PM, Hadley Wickham <h.wickham at gmail.com> wrote:
#
On Fri, Jun 10, 2016 at 8:27 AM, Marcelo Perlin <marceloperlin at gmail.com> wrote:
Positive correlations between packages are not at all surprising:

* there are very strong seasonal patterns
* there are big jumps after releases of new versions of R
* some people like to have all packages installed locally

This is an intrinsic problem with download data. There's no way to
tell if a downloader is really using your package or not.
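
One partial mitigation for the common patterns listed above is to look at a package's share of all CRAN downloads on each day rather than its raw count, which nets out CRAN-wide swings. A minimal sketch, assuming two data frames shaped like cranlogs output ("date", "count", "package" for the packages; "date", "count" for the CRAN-wide total); download_share is a hypothetical helper and the numbers are invented.

```r
# Hypothetical helper: each package's daily downloads as a share of
# total CRAN downloads on the same day.
download_share <- function(pkg_df, total_df) {
  # rename the total column so it does not clash with the per-package count
  names(total_df)[names(total_df) == "count"] <- "total"
  merged <- merge(pkg_df, total_df, by = "date")
  merged$share <- merged$count / merged$total
  merged <- merged[order(merged$package, merged$date), ]
  rownames(merged) <- NULL
  merged[, c("date", "package", "count", "share")]
}

# Toy data: both packages double from one day to the next, and so does CRAN.
pkg_df <- data.frame(
  date = as.Date(c("2016-06-06", "2016-06-07", "2016-06-06", "2016-06-07")),
  count = c(10, 20, 5, 10),
  package = c("a", "a", "b", "b"),
  stringsAsFactors = FALSE
)
total_df <- data.frame(
  date = as.Date(c("2016-06-06", "2016-06-07")),
  count = c(1000, 2000)
)

shares <- download_share(pkg_df, total_df)
print(shares)
```

In the toy data the raw counts move together while the shares stay flat, which is the pattern a shared driver produces. With real data, the per-package frame could come from cranlogs::cran_downloads(my.pkgs, ...) and the total from the same call without a package argument, which, as far as I know, returns CRAN-wide daily totals. This still says nothing about whether a downloader actually uses the package.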

Hadley