Removing "row.names"

2 messages · David James, Kurt Hornik

Wed, Feb 7, 2001 10:50 AM

Date: Wed, 7 Feb 2001 09:33:12 -0800 (PST)
From: Thomas Lumley <tlumley@u.washington.edu>
To: Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at>
cc: Peter Dalgaard BSA <p.dalgaard@biostat.ku.dk>, R-devel@r-project.org
Subject: Re: [Rd] RE: [R] Removing "row.names"
MIME-Version: 1.0

On Wed, 7 Feb 2001, Kurt Hornik wrote:

Thomas Lumley writes:

On Wed, 7 Feb 2001, Kurt Hornik wrote:

Peter Dalgaard BSA writes:

Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at> writes:

names(sampled) <- " "
and
dimnames(sampled)[[2]] <- " "

happily introduce non-unique variable names in the data frame.

Is the rule that row.names and names must be unique still on?

Argh ...

Splus 3.4 dispatches on dimnames<-, but not on names<- with the
following curious result:

d <- data.frame(a=1:3,b=4:6)
names(d)<-c(" "," ")
d

1 1 4
2 2 5
3 3 6

dimnames(d)[[1]] <- rep(" ",3)

Error in "dimnames<-.data.frame"(d, .A0): column names must be unique
Dumped

R dispatches similarly, but doesn't check the dimnames in
dimnames<-.data.frame. It could do so quite easily. Just add

|| any(duplicated(d[[1]])) || any(duplicated(d[[2]]))

at the appropriate spot.
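
The uniqueness test suggested above can be sketched as a standalone function. This is only an illustration, not the actual body of dimnames<-.data.frame: the helper name check_unique_dimnames is hypothetical, and it assumes the replacement value is a two-element list of row names and column names.

```r
# Hypothetical helper (not part of R): reject a dimnames list whose
# row names or column names contain duplicates.
check_unique_dimnames <- function(value) {
  if (!is.list(value) || length(value) != 2L)
    stop("invalid 'dimnames' given for data frame")
  if (any(duplicated(value[[1L]])))
    stop("row names must be unique")
  if (any(duplicated(value[[2L]])))
    stop("column names must be unique")
  invisible(value)
}

ok  <- list(c("r1", "r2", "r3"), c("a", "b"))
bad <- list(c("r1", "r2", "r3"), c(" ", " "))

check_unique_dimnames(ok)   # passes silently

# Capture the error message instead of aborting the script.
res <- tryCatch(check_unique_dimnames(bad),
                error = function(e) conditionMessage(e))
print(res)
```

Splitting the test into one condition per dimension lets the error say which set of names is at fault.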

Thomas' view about what should be permitted seems to be different.

I wouldn't object to making it hard to create duplicated names(), but
I think it would be a bad idea to have data.frame() make up unique
names if it's given non-unique ones.

Maybe `check.names' could also be used for uniqueness testing?

In any case, I think we should specify what *exactly* a data frame is.

I think we should specify, and check.names is a logical way to
allow/forbid non-unique columns.  
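
For illustration, this is how check.names behaves in current versions of R, where it does cover uniqueness. Whether it did so when this thread was written is not stated here, so treat the snippet as a sketch of the idea rather than a record of the behavior under discussion.

```r
# Default (check.names = TRUE): duplicated column names are made unique.
repaired <- data.frame(a = 1:3, a = 4:6)
print(names(repaired))

# check.names = FALSE: the names are kept exactly as supplied,
# so the data frame ends up with two columns called "a".
kept <- data.frame(a = 1:3, a = 4:6, check.names = FALSE)
print(names(kept))
print(anyDuplicated(names(kept)) > 0)
```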

Having a new class would be messy: logically it shouldn't inherit from
data.frame, data.frame should inherit from it, but that would be a real
pain to set up.

Data frames were originally meant to be used in modeling functions.
The opening paragraph in Chapter 3 (Data for Models) in the White Book
says:
 
  "This chapter describes the general structure for data that
  will be used throughout the book.  In particular, it introduces the
  data frame, a class of objects to represent the data typically encountered
  in fitting models."

However, data.frames may not be quite appropriate for representing
other types of tabular data (certainly a data.frame does not capture
the essence of, say, a "relational" table in the SQL sense, which doesn't
have the concept of row names).  Several manifestations of this problem are 
coercing character data to factors "at the drop of a hat" (as someone wrote 
here or in s-news), the row.names issue now being discussed, problems
including general objects in the "cells" of the data.frame, etc.

I think that the concept of a data.frame to represent data for fitting
models is fine, but we may (certainly I) have abused this concept.  We need 
other classes of tabular data objects in addition (not as a replacement) to 
data.frames, together with coercion methods and perhaps other utilities.
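
As a rough sketch of what such an additional class might look like, the following defines a minimal S3 class with a coercion method to data.frame. Everything here is an assumption made for illustration: the class name plain_table and its constructor are hypothetical and do not correspond to any existing R class or to a concrete proposal in this thread.

```r
# Hypothetical class "plain_table": a named list of equal-length columns
# with no row names and no conversion of character data to factors.
plain_table <- function(...) {
  cols <- list(...)
  if (length(unique(lengths(cols))) > 1L)
    stop("all columns must have the same length")
  structure(cols, class = "plain_table")
}

# Coercion method so that modeling code can still obtain a data.frame.
as.data.frame.plain_table <- function(x, ...) {
  data.frame(unclass(x), stringsAsFactors = FALSE)
}

tab <- plain_table(id = 1:3, label = c("x", "y", "z"))
df  <- as.data.frame(tab)
str(df)
```

Keeping the new class separate and providing as.data.frame() for it follows the "in addition, not as a replacement" idea: existing modeling functions keep working on data.frames, while the simpler table carries none of their conventions.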


David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 07974-0636

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Kurt Hornik

Thu, Feb 8, 2001 5:41 AM

David James writes:

Data frames were originally meant to be used in modeling functions.
The opening paragraph in Chapter 3 (Data for Models) in the White Book
says:

  "This chapter describes the general structure for data that
  will be used throughout the book.  In particular, it introduces the
  data frame, a class of objects to represent the data typically encountered
  in fitting models."

However, data.frames may not be quite appropriate for representing
other types of tabular data (certainly a data.frame does not capture
the essence of, say, a "relational" table in the SQL sense, which
doesn't have the concept of row names).  Several manifestations of
this problem are coercing character data to factors "at the drop of a
hat" (as someone wrote here or in s-news), the row.names issue now
being discussed, problems including general objects in the "cells" of
the data.frame, etc.

I think that the concept of a data.frame to represent data for fitting
models is fine, but we may (certainly I) have abused this concept.  We
need other classes of tabular data objects in addition (not as a
replacement) to data.frames, together with coercion methods and
perhaps other utilities.

Thomas had said that yes, it would be nice to have something with fewer
restrictions than data.frame, but that introducing a new class for
data.frame to inherit from would be too much of a pain to set up.

I interpret your comment as suggesting that we introduce a new class for
holding tabular data?  Do you have specific ideas on this?

-k