
Issue with dataset inclusion in CRAN packages

3 messages · Frank E Harrell Jr, csrabak

#
I was glad to see the new rpart.plot package by Stephen Milborrow.  I was
however a bit concerned that Stephen distributed a dataset I created, and
renamed the dataset (from titanic3 to ptitanic) in the process [with some
justification, as some variables were omitted].  Fortunately Stephen
included the script he used to download the dataset from our web site, and
gave full credit to us.  What concerns me is that the rpart.plot package
does not contain many functions but the package is as large as packages
containing hundreds of functions.  This is due to the inclusion of the
dataset.  I would prefer that authors provide the URL so that users can
easily load the binary R data frame directly from our web site (we
even have an automated way to do this: require(Hmisc); getHdata(titanic3)).
This would allow users to profit from possible future data corrections and
would make the package much more compact.  Thanks for listening.  I'm
writing to r-help because this may apply to other R packages as well.

Frank


-----
Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context: http://r.789695.n4.nabble.com/Issue-with-dataset-inclusion-in-CRAN-packages-tp3626536p3626536.html
Sent from the R help mailing list archive at Nabble.com.
#
I was wrong about this.  The dataset is small.  Most of the space is taken up
by a nice tutorial on rpart.plot.  Still I would favor linking to datasets
rather than duplicating part of them.
Thanks
Frank
-----
Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context: http://r.789695.n4.nabble.com/Issue-with-dataset-inclusion-in-CRAN-packages-tp3626536p3626568.html
#
On 26/6/2011 17:43, Frank Harrell wrote:
Frank,

I can understand your concern and at first thought would even second it.

On the other hand, I think there are reasonable explanations for why
authors prefer to include the datasets, especially if the data will be
used in examples:

1) Documentation written around the datasets stays in sync with the data
frames shipped with the package;

2) In several environments access to the web may be restricted, and
getHdata or read.table("<url>") may not be allowed.
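
One way to reconcile both positions is to try the remote copy first and fall
back to the bundled copy when the download fails. The sketch below is only an
illustration: load_with_fallback is a hypothetical helper, not part of Hmisc
or rpart.plot, and the two loaders are stand-ins. In practice the remote
loader might wrap getHdata or read.table("<url>"), and the local loader might
return the data frame shipped with the package.

```r
# Hypothetical helper (not part of Hmisc or rpart.plot): call the remote
# loader first and fall back to the local loader if it signals an error.
load_with_fallback <- function(remote_loader, local_loader) {
  tryCatch(
    remote_loader(),
    error = function(e) {
      message("Remote load failed (", conditionMessage(e),
              "); using the bundled copy instead.")
      local_loader()
    }
  )
}

# Stand-in data and loaders, for illustration only.
bundled <- data.frame(survived = c(1, 0), pclass = c("1st", "3rd"))
offline_loader <- function() stop("no network access")
local_loader   <- function() bundled

# The remote loader fails here, so the bundled data frame is returned.
result <- load_with_fallback(offline_loader, local_loader)
```

With this arrangement the examples keep working offline, while users with web
access would still pick up any corrections made to the remote copy.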

my 0.019999...

Regards,

--
Cesar Rabak