read.table with ":" in column names (PR#8511)

5 messages · peverlorenvanthemaat@amc.uva.nl, Roger Bivand, Brian Ripley +1 more

peverlorenvanthemaat@amc.uva.nl

Fri, Jan 20, 2006 2:47 AM #

Full_Name: emiel ver loren
Version: 2.2.0
OS: Windows XP
Submission from: (NULL) (145.117.31.248)


Dear R-community and developers,

I have been trying to read in a tab-delimited file where the column names and
the row names are of the form "GO:0000051" (gene ontology IDs). When using:

[1] "GO.0000051"

[1] "GO:0000002"

Which means that ":" is transformed into a "." !! This seems like Excel when it
is trying to guess what I am really ment (and turning 1/1/1 into 1-1-2001).

Furthermore, I found the following quite strange as well:

V1         V2
1 GO:0000051 GO:0000280

[1] "8" "2"

[1] "GO:0000051"

I have found a way to work around it, but I am wondering what's happening....

The tab-delimited file looks like:

GO:0000051	GO:0000280	GO:0000740	
GO:0000002	0	0	0
GO:0000004	0	0	0
GO:0000012	0	0	0
GO:0000014	0	0	0
GO:0000015	0	0	0
GO:0000018	0	0	0
GO:0000019	0	0	0

Thanks for helping, and 

Emiel

Peter Dalgaard

Fri, Jan 20, 2006 3:22 AM #

peverlorenvanthemaat at amc.uva.nl writes:

This is what check.names=FALSE is for... (and NOT a bug, please don't
abuse the bug repository, use the mailing lists)

Yes, this is a bit nasty, but... What is happening is similar to this:

a b
1 A a

[1] "1" "1"

[1] "A"

[1] "1"

or this:

$a
[1] x
Levels: x

$b
[1] y
Levels: y

[1] "1" "1"

The thing is that as.character on a list will first coerce factors to
numeric, then numeric to character. I'm not sure whether there could
be a rationale for it, but it isn't S-PLUS compatible (not 6.2.1
anyway, which is the most recent one that I have access to).
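
A minimal sketch of the behaviour described above, assuming a list of two one-level factors like the one printed in this message (the name `lst` is illustrative, not from the original post). Coercing the whole list loses the factor labels, which is where the "1" "1" result reported above comes from; converting element by element keeps them.

```r
# A list of two one-level factors, mirroring the $a / $b output above.
lst <- list(a = factor("x"), b = factor("y"))

# Coercing the list as a whole works on the underlying integer codes,
# so the labels "x" and "y" do not appear in the result.
whole <- as.character(lst)
print(whole)

# Converting each element separately dispatches on the factor class
# and returns the labels.
by_element <- vapply(lst, as.character, character(1))
print(by_element)
```

The element-wise form is the usual way to get labels back when a list or data frame may contain factors.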

O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907

Roger Bivand

Fri, Jan 20, 2006 3:47 AM #

On Fri, 20 Jan 2006 peverlorenvanthemaat at amc.uva.nl wrote:

Wrong. 

?read.table says with reference to the check.names = TRUE argument that:

"check.names: logical.  If 'TRUE' then the names of the variables in the
          data frame are checked to ensure that they are syntactically
          valid variable names.  If necessary they are adjusted (by
          'make.names') so that they are, and also to ensure that there
          are no duplicates."

[1] "GO.0000051"

You can use "GO:0000051" as a column name if quoted, otherwise ":" is an 
operator, so the default value of the check.names argument is sound.

If you "ment" to do what you say, you should have set check.names=FALSE.
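
A minimal sketch of the two settings, assuming a small inline table standing in for the poster's file (the `txt` string and the object names `checked` and `raw` are illustrative). The header has one fewer field than the data rows, so the first column is used for row names, which are not altered by check.names.

```r
# Illustrative data standing in for the tab-delimited file in the report.
txt <- paste(
  "GO:0000051\tGO:0000280\tGO:0000740",
  "GO:0000002\t0\t0\t0",
  "GO:0000004\t0\t0\t0",
  sep = "\n"
)

# make.names() is what check.names = TRUE applies to the header fields.
print(make.names("GO:0000051"))

# Default (check.names = TRUE): column names are made syntactically valid.
checked <- read.table(text = txt, header = TRUE, sep = "\t")
print(colnames(checked))

# check.names = FALSE keeps the header fields exactly as in the file.
raw <- read.table(text = txt, header = TRUE, sep = "\t", check.names = FALSE)
print(colnames(raw))
print(rownames(raw))

# Non-syntactic names must be quoted with backticks (or used via [[ ]]).
print(raw$`GO:0000051`)
print(raw[["GO:0000280"]])
```

With check.names = FALSE the names survive, at the cost of needing backticks or `[[ ]]` whenever a column is referred to in code.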

Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no

Brian Ripley

Fri, Jan 20, 2006 5:12 AM #

On Fri, 20 Jan 2006, Peter Dalgaard wrote:

[...]

Nope.  It just coerces an INTSXP to a STRSXP.  as.character (and all other 
forms of coercion that I can think of quickly) ignores classes except when 
initially dispatching.

Note that these examples are special cases:

[1] "c(1, 2)" "c(1, 2)"

may also be unexpected but follows from the general (undocumented, I 
dare say) rules.
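
A small sketch of the rule as described here, using illustrative inputs (the exact command behind the output above is not preserved in this archive, so the lists below are assumptions): length-one elements are coerced directly, while longer elements are deparsed into a single string.

```r
# Elements of length two: each is deparsed to one string.
longer <- as.character(list(a = c(1, 2), b = c(1, 2)))
print(longer)

# Elements of length one: each is coerced to its character form.
single <- as.character(list(1, "a", TRUE))
print(single)
```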

My S-PLUS deparses:

[1] "structure(.Data = 1, .Label = \"x\", class = \"factor\")"
[2] "structure(.Data = 1, .Label = \"y\", class = \"factor\")"

which seems no better (and probably worse).

The only other consistent option I can see is for all coercion methods to 
dispatch at each element of a recursive object, which I suspect introduces 
a considerable overhead for very little gain.

One could perhaps argue for a data.frame method, since coercion operations 
on dataframes are rare and that is a case where people get factors where 
they wanted character columns.
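
Until such a method exists, a user-side workaround can be sketched as follows. `factors_to_character` is a hypothetical helper written for illustration; it is not part of base R and is not the data.frame method suggested above, only one way to turn factor columns into character columns before coercing.

```r
# Hypothetical helper (not in base R): convert factor columns to character.
factors_to_character <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  # Only assign when there is at least one factor column to convert.
  if (any(is_fac)) {
    df[is_fac] <- lapply(df[is_fac], as.character)
  }
  df
}

# Force factor columns so the example behaves the same in any R version.
df <- data.frame(a = "A", b = "a", stringsAsFactors = TRUE)
out <- factors_to_character(df)

# The columns are now character, so the labels are preserved.
print(vapply(out, class, character(1)))
print(vapply(out, as.character, character(1)))
```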

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Peter Dalgaard

Fri, Jan 20, 2006 5:36 AM #

Prof Brian Ripley <ripley at stats.ox.ac.uk> writes:

OK. I just meant that "de facto" it is like as.character(as.integer(f))

and unlike as.character(as.integer(f)), so I do stand corrected....

Same here. Arguably, we deparse too, we just discard attributes first.
Both S-PLUS and R will do

[1] "c(1, 2, 3, 4, 5)" "3"

Then again maybe not, but it is one of those things which have the
potential to break things in unexpected places if you change it.

Agreed.
