Date: Wed, 7 Feb 2001 09:33:12 -0800 (PST)
From: Thomas Lumley <tlumley@u.washington.edu>
To: Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at>
cc: Peter Dalgaard BSA <p.dalgaard@biostat.ku.dk>, R-devel@r-project.org
Subject: Re: [Rd] RE: [R] Removing "row.names"
On Wed, 7 Feb 2001, Kurt Hornik wrote:
Peter Dalgaard BSA writes:
Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at> writes:
names(sampled) <- " "
and
dimnames(sampled)[[2]] <- " "
happily introduce non-unique variable names in the data frame.
Is the rule that row.names and names must be unique still on?
Argh ...
Splus 3.4 dispatches on dimnames<-, but not on names<- with the
following curious result:
d <- data.frame(a=1:3,b=4:6)
names(d)<-c(" "," ")
d
dimnames(d)[[1]] <- rep(" ",3)
Error in "dimnames<-.data.frame"(d, .A0): column names must be unique
Dumped
R dispatches similarly, but doesn't check the dimnames in
dimnames<-.data.frame. It could do so quite easily. Just add
|| any(duplicated(d[[1]])) || any(duplicated(d[[2]]))
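A minimal sketch of such a uniqueness check, written as a standalone function. The helper name `dimnames_are_unique` is hypothetical (it is not part of R), and the sketch assumes the check is applied to the replacement value, i.e. a two-element list of row names and column names.

```r
# Hypothetical helper, not part of base R: TRUE when neither the
# row names nor the column names in a dimnames list contain duplicates.
dimnames_are_unique <- function(value) {
  stopifnot(is.list(value), length(value) == 2L)
  !anyDuplicated(value[[1L]]) && !anyDuplicated(value[[2L]])
}

d <- data.frame(a = 1:3, b = 4:6)

# names<- accepts duplicated column names without complaint.
names(d) <- c(" ", " ")
dup_cols <- anyDuplicated(names(d)) > 0

# The check passes for distinct names and fails for the modified frame.
ok_before <- dimnames_are_unique(list(c("1", "2", "3"), c("a", "b")))
ok_after  <- dimnames_are_unique(dimnames(d))
```

A check of this kind would only guard assignments that go through `dimnames<-`; a plain `names<-` call, as above, would still bypass it unless it is given a method of its own.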
Thomas' view about what should be permitted seems to be different.
I wouldn't object to making it hard to create duplicated names(), but
I think it would be a bad idea to have data.frame() make up unique
names if it's given non-unique ones.
Maybe `check.names' could also be used for uniqueness testing?
In any case, I think we should specify what *exactly* a data frame is.
I think we should specify, and check.names is a logical way to
allow/forbid non-unique columns.
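For illustration, a short sketch of how `check.names` acts as such a switch in `data.frame()`, as documented in recent R releases (the behaviour of the version under discussion here may differ). With the default `check.names = TRUE`, duplicated names are adjusted via `make.names`; with `check.names = FALSE` they are kept as supplied.

```r
# Default (check.names = TRUE): names are checked and made unique.
checked <- data.frame(a = 1:2, a = 3:4)
names(checked)

# check.names = FALSE keeps the duplicated names as supplied.
unchecked <- data.frame(a = 1:2, a = 3:4, check.names = FALSE)
names(unchecked)

# Whether a frame currently carries duplicated column names.
has_dups <- anyDuplicated(names(unchecked)) > 0
```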
Having a new class would be messy: logically it shouldn't inherit from
data.frame, data.frame should inherit from it, but that would be a real
pain to set up.
Data frames were originally meant to be used in modeling functions.
The opening paragraph in Chapter 3 (Data for Models) in the White Book
says:
"This chapter describes the general structure for data that
will be used throughout the book. In particular, it introduces the
data frame, a class of objects to represent the data typically encountered
in fitting models."
However, data.frames may not be quite appropriate for representing
other types of tabular data (certainly a data.frame does not capture
the essence of, say, a "relational" table in the SQL sense, which
doesn't have the concept of row names). Several manifestations of
this problem are coercing character data to factors "at the drop of a
hat" (as someone wrote here or in s-news), the row.names issue now
being discussed, problems including general objects in the "cells" of
the data.frame, etc.
I think that the concept of a data.frame to represent data for fitting
models is fine, but we may (certainly I) have abused this concept. We
need other classes of tabular data objects in addition (not as a
replacement) to data.frames, together with coercion methods and
perhaps other utilities.