[Bioc-devel] Syntactically correct names in DataFrames
On 06/29/2012 02:35 PM, Michael Lawrence wrote:
On Fri, Jun 29, 2012 at 10:28 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
Hi Michael,
Here is a somewhat related issue with duplicated colnames (using
the latest IRanges devel):
> data.frame(aa=2:4, aa=LETTERS[2:4], check.names=FALSE)
aa aa
1 2 B
2 3 C
3 4 D
OK.
> DataFrame(aa=2:4, aa=LETTERS[2:4], check.names=FALSE)
DataFrame with 3 rows and 2 columns
aa aa
<integer> <character>
1 2 B
2 3 C
3 4 D
OK.
But then:
> DF <- DataFrame(aa=2:4, aa=LETTERS[2:4], check.names=FALSE)
> validObject(DF)
Error in validObject(DF) :
invalid class “DataFrame” object: duplicate column names
> DF[ , 2:1]
Error in validObject(.Object) :
invalid class “DataFrame” object: duplicate column names
Why?
Because it's a bug. I added check.names last release at Florian's
request and didn't test all of this. Thanks for finding these. In my
book, an error should be thrown when there are duplicate names and
isTRUE(check.names). Anyway, I checked in the fixes.
Thanks for the fix. I was worried that the validation rejecting
duplicated colnames was intentional. Looks like I don't need to worry
anymore.
Thanks again,
H.
Michael
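One way to read the rule stated above (duplicates are an error only when
isTRUE(check.names)), as a minimal base-R sketch. check_colnames() is a
hypothetical helper written for illustration; it is not the code that was
checked into IRanges.

```r
# Hypothetical helper illustrating the stated policy; not the IRanges fix.
check_colnames <- function(nms, check.names = TRUE) {
  # anyDuplicated() returns 0L when no name is repeated
  if (isTRUE(check.names) && anyDuplicated(nms) > 0L)
    stop("duplicate column names")
  invisible(TRUE)
}

# Duplicates are accepted when the caller opted out of name checking ...
check_colnames(c("aa", "aa"), check.names = FALSE)
# ... and rejected otherwise.
res <- try(check_colnames(c("aa", "aa")), silent = TRUE)
inherits(res, "try-error")
```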
> data.frame(list(aa=2:4, aa=LETTERS[2:4]), check.names=FALSE)
aa aa
1 2 B
2 3 C
3 4 D
OK.
> DataFrame(list(aa=2:4, aa=LETTERS[2:4]), check.names=FALSE)
DataFrame with 3 rows and 2 columns
aa aa.1
<integer> <character>
1 2 B
2 3 C
3 4 D
Not OK.
I also tend to think that automatic name mangling features generally
cause more problems than they solve (if they solve any problem at all).
Same thing with automatic coercion from character to factor (which I'm
glad DataFrame() is not trying to mimic).
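For reference, both behaviours can be seen with base R's data.frame()
alone. This is only an illustration of the defaults being discussed;
stringsAsFactors is passed explicitly because its default changed to
FALSE in R 4.0.0.

```r
# check.names = TRUE is the data.frame() default, so duplicates get mangled
names(data.frame(aa = 2:4, aa = LETTERS[2:4]))

# opting out keeps the names exactly as supplied
names(data.frame(aa = 2:4, aa = LETTERS[2:4], check.names = FALSE))

# character-to-factor coercion, requested explicitly here
is.factor(data.frame(x = c("a", "b"), stringsAsFactors = TRUE)$x)
```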
Cheers,
H.
On 06/28/2012 06:58 AM, Michael Lawrence wrote:
Hi Florian,
A guiding principle in the design of DataFrame was consistency with
data.frame, so that is why we check for syntactic validity of the column
names. The underlying reasons for this are probably historic and related
to the rough equivalence between lists and environments.
As for the error you encountered below, that seems to be fixed in devel.
Michael
On Thu, Jun 28, 2012 at 6:40 AM, Hahne, Florian <florian.hahne at novartis.com> wrote:
Hi all,
I have been playing around with the DataFrame class a bit and realized
that it always enforces syntactically correct column names. Since it is a
generalization of the basic R data.frames I am not quite sure why that has
to be the case.
Assuming I start with a regular data.frame with non-standard names:
foo <- data.frame("1a"=1:3, b=4:6, check.names=FALSE)
foo
1a b
1 1 4
2 2 5
3 3 6
Coercing this into a DataFrame forces a name change:
DataFrame(foo)
DataFrame with 3 rows and 2 columns
X1a b
<integer> <integer>
1 1 4
2 2 5
3 3 6
as(foo, "DataFrame")
DataFrame with 3 rows and 2 columns
X1a b
<integer> <integer>
1 1 4
2 2 5
3 3 6
My first intuition was to try this:
DataFrame(foo, check.names=FALSE)
DataFrame with 3 rows and 3 columns
Error in matrix(unlist(lapply(object, function(x) paste("<", class(x), :
length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In if (check.names) vnames <- make.names(vnames, unique = TRUE) :
the condition has length > 1 and only the first element will be used
Now apparently there are multiple things going on here. First of all,
check.names is recycled by the DataFrame constructor because it thinks
that it is just another variable to add to the DataFrame later. The
initializer method however seems to recognize it for the coercion into a
data.frame, but it complains because its length is >1. Also the show
method is broken because things don't really match anymore. The Data.Table
show method in IRanges seems to be the culprit here.
My simple question here is: why are syntactic names enforced at all? And
if that is a feature, couldn't there be a way to turn this off?
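The recycling described above can be reproduced in miniature with base
R. Both functions below are hypothetical stand-ins, not the IRanges
source: naive_table() takes only `...`, so check.names = FALSE is
captured as one more column and recycled, while fixed_table() declares
check.names as a formal argument and keeps it out of the data.

```r
# Hypothetical constructors (not the IRanges source).
naive_table <- function(...) {
  cols <- list(...)                      # check.names = FALSE lands here too
  n <- max(lengths(cols))
  lapply(cols, rep_len, length.out = n)  # short values are recycled to n
}

fixed_table <- function(..., check.names = TRUE) {
  cols <- list(...)                      # check.names is now a formal argument
  n <- max(lengths(cols))
  cols <- lapply(cols, rep_len, length.out = n)
  if (isTRUE(check.names))
    names(cols) <- make.names(names(cols), unique = TRUE)
  cols
}

# The flag shows up as a third column in the naive version ...
names(naive_table(`1a` = 1:3, b = 4:6, check.names = FALSE))
# ... and is consumed as an option in the fixed one.
names(fixed_table(`1a` = 1:3, b = 4:6, check.names = FALSE))
```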
A very simple fix would be this:
Index: DataFrame-class.R
===================================================================
--- DataFrame-class.R (revision 67116)
+++ DataFrame-class.R (working copy)
@@ -183,7 +183,7 @@
varlist <- unlist(varlist, recursive = FALSE, use.names = FALSE)
nms <- unlist(varnames[ncols > 0L])
if (check.names)
- nms <- make.names(nms, unique = TRUE)
+ nms <- make.unique(nms)
names(varlist) <- nms
} else names(varlist) <- character(0)
Of course I didn't check all of the downstream effects, but I don't really
see why anything should rely on syntactically correct names. In case there
is, the erratic check.names behavior certainly needs some fixing; after
all, it could just be a normal column name in the DataFrame.
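To make the effect of the one-line patch concrete, here is how the two
base R functions it swaps differ on a small, made-up vector of names:
make.names() rewrites non-syntactic names and then de-duplicates, while
make.unique() only de-duplicates.

```r
nms <- c("1a", "b", "b")

# current behaviour: syntactic AND unique
make.names(nms, unique = TRUE)

# proposed behaviour: unique only, "1a" is left alone
make.unique(nms)
```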
Thanks,
Florian
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319