Building package - tab delimited example data issue - R-help

Johannes Graumann

Thu, Dec 6, 2007 2:18 AM #

Hello,

I'm trying to integrate example data in the shape of a tab delimited ASCII
file into my package and therefore dropped it into the data subdirectory.
The build works out just fine, but when I attempt to install I get:

** building package indices ...
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings,  :
  line 1 did not have 500 elements
Calls: <Anonymous> ... <Anonymous> -> switch -> assign -> read.table -> scan
Execution halted
ERROR: installing package indices failed
** Removing '/usr/local/lib/R/site-library/MaxQuantUtils'
** Restoring previous '/usr/local/lib/R/site-library/MaxQuantUtils'

Accordingly the check delivers:

...
* checking whether package 'MaxQuantUtils' can be installed ... ERROR

Can anyone tell me what I'm doing wrong? build/install without the ASCII file
works just fine.

Joh

Peter Dalgaard

Thu, Dec 6, 2007 2:52 AM #

Johannes Graumann wrote:

If you had looked at help(data), you would have found a list of which
file formats it supports and how they are read. Hint: TAB-delimited
files are not among them. *Whitespace* separated files work, using
read.table(filename, header=TRUE), but that is not a superset of
TAB-delimited data if there are empty fields.
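
To make the empty-field problem concrete, here is a minimal sketch. The file contents are invented for illustration and are not taken from the MaxQuantUtils data. With the default whitespace separator, a run of adjacent TABs counts as a single separator, so a row with an empty field comes up short; with sep = "\t" the empty field is kept and read as NA.

```r
# Three tab-delimited lines; the last data row has an empty middle field.
tab_file <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\t2\t3", "4\t\t6"), tab_file)

# Whitespace-separated reading: the two adjacent TABs collapse into one
# separator, the row has too few elements, and scan() raises an error.
# tryCatch() captures the error message instead of stopping the script.
ws <- tryCatch(read.table(tab_file, header = TRUE),
               error = function(e) conditionMessage(e))

# Explicit TAB separator: the empty field is preserved and becomes NA.
tabbed <- read.table(tab_file, header = TRUE, sep = "\t")
print(ws)
print(tabbed)
```

The error captured in ws is of the same kind as the "did not have ... elements" message in the install log above.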

A nice trick is to figure out how to read the data from the command line
and drop the relevant code into a mydata.R file (assuming that the
actual data file is mydata.txt). This gets executed when the data is
loaded (by data(mydata) or when building the lazyload database) because
.R files have priority over .txt.

This is quite general and allows a nice way of incorporating data
management while retaining the original data source:

more ISwR/data/stroke.R

stroke <-  read.csv2("stroke.csv", na.strings=".")
names(stroke) <- tolower(names(stroke))
stroke <-  within(stroke,{
    sex <- factor(sex,levels=0:1,labels=c("Female","Male"))
    dgn <- factor(dgn)
    coma <- factor(coma, levels=0:1, labels=c("No","Yes"))
    minf <- factor(minf, levels=0:1, labels=c("No","Yes"))
    diab <- factor(diab, levels=0:1, labels=c("No","Yes"))
    han <- factor(han, levels=0:1, labels=c("No","Yes"))
    died <- as.Date(died, format="%d.%m.%Y")
    dstr <- as.Date(dstr,format="%d.%m.%Y")
    dead <- !is.na(died) & died < as.Date("1996-01-01")
    died[!dead] <- NA
})

head ISwR/data/stroke.csv

SEX;DIED;DSTR;AGE;DGN;COMA;DIAB;MINF;HAN
1;7.01.1991;2.01.1991;76;INF;0;0;1;0
1;.;3.01.1991;58;INF;0;0;0;0
1;2.06.1991;8.01.1991;74;INF;0;0;1;1
0;13.01.1991;11.01.1991;77;ICH;0;1;0;1
0;23.01.1996;13.01.1991;76;INF;0;1;0;1
1;13.01.1991;13.01.1991;48;ICH;1;0;0;1
0;1.12.1993;14.01.1991;81;INF;0;0;0;1
1;12.12.1991;14.01.1991;53;INF;0;0;1;1
0;.;15.01.1991;73;ID;0;0;0;1
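
Applied to the tab-delimited case from the original question, the same trick might look like the sketch below. The names mydata.txt and mydata.R follow the placeholders used above; the file contents and the temporary directory are made up. The call sys.source(..., chdir = TRUE) only stands in for what data() does when it sources an .R file with the working directory set to the data directory.

```r
# Hypothetical package layout: <pkg>/data/mydata.txt and <pkg>/data/mydata.R
data_dir <- file.path(tempfile("pkg"), "data")
dir.create(data_dir, recursive = TRUE)

# The original, tab-delimited data source (note the empty field in row 1).
writeLines(c("id\tlabel\tvalue", "1\t\t3.5", "2\tb\t4.0"),
           file.path(data_dir, "mydata.txt"))

# mydata.R: the code that knows how to read mydata.txt correctly.
writeLines('mydata <- read.delim("mydata.txt", na.strings = "")',
           file.path(data_dir, "mydata.R"))

# Stand-in for data(mydata): run the script from inside the data directory.
env <- new.env()
sys.source(file.path(data_dir, "mydata.R"), envir = env, chdir = TRUE)
print(env$mydata)
```

Because the .R file takes priority over the .txt file, the table is never handed to the default whitespace-separated reader.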

O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907

Berwin A Turlach

Thu, Dec 6, 2007 3:19 AM #

G'day Peter,

On Thu, 06 Dec 2007 11:52:46 +0100

Peter Dalgaard <P.Dalgaard at biostat.ku.dk> wrote:

On the other hand, "Writing R Extensions" has stated for a long time
(and still does):

The @file{data} subdirectory is for additional data files the package
makes available for loading using @code{data()}.  Currently, data files
can have one of three types as indicated by their extension: plain R
code (@file{.R} or @file{.r}), tables (@file{.tab}, @file{.txt}, or
@file{.csv}), or @code{save()} images (@file{.RData} or @file{.rda}).

Now in my book, .csv files contain comma separated values, .tab files
contain values separated by TABs and .txt files are "pure" text files,
presumably values separated by any kind of white space. 

Thus, I think that the expectation that TAB-delimited file formats
should work is not unreasonable; I was bitten by this a long time ago
too. Then I realised that the phrase "one of three types" should
probably be interpreted as implying that .tab, .txt and .csv files are
all of the same type and, apparently, should contain values separated
by whitespace.  I admit that I never tested whether .csv files would
lead to the same problems as TAB delimited .tab files. Rather, I decided
in the end that the safest option, i.e. to avoid misleading file
extensions, would be to use .rda files in the future.

Cheers,

	Berwin

=========================== Full address =============================
Berwin A Turlach                            Tel.: +65 6515 4416 (secr)
Dept of Statistics and Applied Probability        +65 6515 6650 (self)
Faculty of Science                          FAX : +65 6872 3919       
National University of Singapore     
6 Science Drive 2, Blk S16, Level 7          e-mail: statba at nus.edu.sg
Singapore 117546                    http://www.stat.nus.edu.sg/~statba

Johannes Graumann

Thu, Dec 6, 2007 4:03 AM #

On Thursday 06 December 2007 11:52:46 Peter Dalgaard wrote:

Thanks for your help. Very insightful, and your version of "RTFM" was not
too harsh either ;0)
Part of what I want to achieve by including the file is to be able to
showcase a read-in function for this particular data type. Is there a slick
way - sticking to your example - to reference 'stroke.csv' directly?
I'd like to put something analogous to the following into the examples of
some function.Rd
	# Use function to read in file:
	result <- function(<link to 'stroke.csv' in installed ISwR package>)
without having to resort to marking the example as "No Run".

Thanks for your help, Joh

Peter Dalgaard

Thu, Dec 6, 2007 4:47 AM #

Berwin A Turlach wrote:

Now had you lived in the Western world ... (Hey, what's that? New
address!) ... then you would have known better than to have any trust in
file extensions. At the time "they" apparently figured that the .CSV
standard was so good that it was even better to have two of them (double
standards are twice as good, right?), depending on whether you were in
England or in Denmark, I lost faith completely. (In this country you can
export to a text file with SAS and then NOT read it with SPSS and vice
versa on the same Windows machine).

Actually, R is a bit perverse about .csv too, since it expects a
_semicolon_ field separator, but not the comma decimal separator which
usually accompanies it. The reason for this is lost in the mists of time
-- the datasets in current versions of R do not include any .csv files.
There are, however, six .tab files, three of which are not
tab-separated, but I don't actually think there was ever a standard to
the effect that they should be (.tab just means that it is a _table_).
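
A small sketch of what this means in practice, assuming (as help(data) documents) that .csv files in the data directory are read with read.table(..., header = TRUE, sep = ";"). The sample values are invented.

```r
# A "continental" CSV: semicolon field separator, comma decimal separator.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("x;y", "1,5;a", "2,5;b"), csv_file)

# Semicolon separator but the default dec = ".": the fields split correctly,
# yet "1,5" is not recognised as a number and x stays non-numeric.
as_data <- read.table(csv_file, header = TRUE, sep = ";",
                      stringsAsFactors = FALSE)

# read.csv2() uses sep = ";" together with dec = ",", so x becomes numeric.
as_csv2 <- read.csv2(csv_file)

str(as_data)
str(as_csv2)
```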

So, you really need to read the help page for data, which does have the 
exact info. The passage you cite from the manual could do with a
rephrasing, although it probably isn't technically incorrect. As it
stands, it reminds me a bit of the old Monty Python sketch:

"Our *three* weapons are fear, surprise, and ruthless efficiency...and
an almost fanatical devotion to the Pope.... Our *four*...no...
*Amongst* our weapons.... Amongst our weaponry...are such elements as
fear, surprise.... I'll come in again"

(There really are 3 data TYPES, but 4 FORMATS and, er, diverse EXTENSIONS)



Johannes Graumann

Thu, Dec 6, 2007 6:37 AM #

Johannes Graumann wrote:

Answering my own question and staying with the same example:
        system.file("data/stroke.csv",package="ISwR")
allows direct access to the example file (name).

Joh

Peter Dalgaard

Thu, Dec 6, 2007 7:03 AM #

Johannes Graumann wrote:

Answering my own question and staying with the same example:
        system.file("data/stroke.csv",package="ISwR")
allows direct access to the example file (name).

Yes, but...

This works right until you turn on LazyData for your package, then you
end up with only

00Index  Rdata.rdb  Rdata.rds  Rdata.rdx

in the data directory. Use the "inst" source subdir for files you want
to have installed explicitly.

Also, in principle, it is

system.file("data", "stroke.csv", package="ISwR")


although platforms that do not understand "/" as the path separator are
rare nowadays.
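
A hedged sketch of the "inst" approach: a file placed under inst/extdata/ in the package sources is installed as extdata/ and is not affected by LazyData. The subdirectory name extdata is a common convention, and the stroke.csv location shown here is hypothetical (ISwR is not known to ship the file there). system.file() returns "" when nothing is found, so the example guards against that.

```r
# Hypothetical location: inst/extdata/stroke.csv in the package sources,
# which would be installed as <library>/ISwR/extdata/stroke.csv.
path <- system.file("extdata", "stroke.csv", package = "ISwR")

if (nzchar(path)) {
  # File found: read it the same way stroke.R does.
  stroke_raw <- read.csv2(path, na.strings = ".")
} else {
  # Package not installed, or no such file in it.
  message("stroke.csv not found in an installed ISwR; nothing read")
}

# The same lookup on a file that every installed package has.
desc <- system.file("DESCRIPTION", package = "utils")
print(file.exists(desc))
```

In an Rd examples section the first call, with the package's own name and file, can then be passed straight to the read-in function.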

Johannes Graumann

Thu, Dec 6, 2007 7:34 AM #

Peter Dalgaard wrote:

How would you do it in that case?

Well: you provided the example ;0) - sort of ...

Thanks for that hint!

Joh

David Winsemius

Thu, Dec 6, 2007 6:42 PM #

Peter Dalgaard <P.Dalgaard at biostat.ku.dk> wrote in
news:4757EF5E.6040904 at biostat.ku.dk:

Is there a place where I can file a claim for a new keyboard? Mine has beer 
all over it.

David Winsemius