-----Original Message-----
From: Edzer J. Pebesma [mailto:e.pebesma at geo.uu.nl]
Sent: 13 April 2006 20:43
To: pedro at dpi.inpe.br
Cc: r-sig-geo at stat.math.ethz.ch; Matthew Dowle
Subject: Re: [R-sig-Geo] AttributeList and data.table
Pedro, you're very alert! I saw it too, and had similar thoughts.
However, I haven't had any complaints yet about the way AttributeLists
work right now; most of it is hidden behind the scenes anyway. If
data.table usage becomes widespread we can certainly provide coercion
functions between the two. Let's first wait until it actually hits
CRAN. I'm for instance curious what happens if you pass one to lm().
--
Edzer
pedro at dpi.inpe.br wrote:
Hi,
There is a fairly new package on CRAN called data.table. It implements
the class data.table, representing a data.frame without rownames, in
order to improve performance. So, it has the same objective as
class AttributeList. I confess that my knowledge of the functionality
available in both classes is superficial, but I think the two
could work together, or even be merged.
Best wishes,
Pedro Andrade
---------- Forwarded message ----------
Date: Wed, 12 Apr 2006 15:19:10 +0100
From: Matthew Dowle <mdowle at concordiafunds.com>
To: "'r-devel at r-project.org'" <r-devel at r-project.org>,
"'Cran at r-project.org'" <Cran at r-project.org>
Subject: [Rd] New class: data.table
Hi,
Following previous discussion on this list
(http://tolstoy.newcastle.edu.au/R/devel/05/12/3439.html) I have
created a package as suggested, and uploaded it to CRAN incoming :
data.table.tar.gz.
** Your comments and feedback will be very much appreciated. **
This class really does very little. The only reason for its existence
is that the white book specifies that data.frame must have rownames.
Most of the code is copied from base functions with the code
manipulating row.names removed.
A data.table is identical to a data.frame other than:
* it doesn't have rownames
* [,drop] by default is FALSE, so selecting a single row will
return a single row data.table not a vector
* The comma is optional inside [], so DT[3] returns the 3rd row as a
1 row data.table
* [] is like a call to subset()
* [,...], is like a call to with(). (not yet implemented)
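For contrast with the list above, here is a minimal base R sketch of the data.frame defaults it refers to. It uses only base R and a small made-up data.frame (the column values are illustrative, not from the message), so it shows the data.frame side of each point rather than data.table itself.

```r
# Base R only: the data.frame defaults that the list above contrasts with.
DF <- data.frame(colA = 1:5, colB = 6:10)

# [ , j] with a single column drops to a plain vector unless drop = FALSE.
v    <- DF[, "colA"]
kept <- DF[, "colA", drop = FALSE]
print(is.data.frame(v))
print(is.data.frame(kept))

# Without a comma, a data.frame index selects columns, not rows.
second_col <- DF[2]
third_row  <- DF[3, ]   # the comma is required to pick row 3
print(names(second_col))
print(nrow(third_row))

# subset() evaluates the condition inside the data.frame, which is the
# behaviour the message describes for DT[colA > 3].
big <- subset(DF, colA > 3)
print(big)
```

The point of the sketch is that a bare DF[3] and an undropped DF[, "colA"] behave differently from what the message proposes for data.table.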
Motivation:
* up to 10 times less memory
* up to 10 times faster to create, and copy
* simpler R code
* the white book defines rownames, so data.frame can't
... => new class
Examples:
nr = 1000000
D = rep(1:5,nr/5)
system.time(DF <<- data.frame(colA=D, colB=D)) # 2.08
system.time(DT <<- data.table(colA=D, colB=D)) # 0.15 (over 10 times faster to create)
identical(as.data.table(DF), DT)
identical(dim(DT),dim(DF))
object.size(DF)/object.size(DT) # 10 times
tt = subset(DF,colA>3)
ss = DT[colA>3]
identical(as.data.table(tt), ss)
mean(subset(DF,colA+colB>5,"colB"))
mean(DT[colA+colB>5]$colB)
tt = with(subset(DF,colA>3),colA+colB)
ss = with(DT[colA>3],colA+colB) # but could be: DT[colA>3,colA+colB] (not yet implemented)
identical(tt, ss)
tt = DF[with(DF,tapply(1:nrow(DF),colB,last)),] # select last row grouping by colB
ss = DT[tapply(1:nrow(DT),colB,last)] # but could be: DT[last,group=colB] (not yet implemented)
identical(as.data.table(tt), ss)
Lkp=1:3
tt = DF[with(DF,colA %in% Lkp),]
ss = DT[colA %in% Lkp] # expressions can see objects in the calling frame
identical(as.data.table(tt), ss)
In each case above there is either a space, time, or code brevity
advantage with the data.table.
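As a rough way to try one of these comparisons today, the sketch below repeats the with(subset(...)) example on a small vector. It assumes a current data.table release from CRAN, which may behave differently from the 2006 version described in this message, and it skips the data.table part if the package is not installed. The names base_sum and dt_sum are made up for the illustration.

```r
# Small stand-in for the example data; values are illustrative only.
D  <- rep(1:5, 2)
DF <- data.frame(colA = D, colB = D)

# Base R form from the examples above.
base_sum <- with(subset(DF, colA > 3), colA + colB)
print(base_sum)

# Assumption: a current data.table release is installed. In such releases
# the second argument of [ is evaluated within the filtered rows.
if (requireNamespace("data.table", quietly = TRUE)) {
  DT <- data.table::data.table(colA = D, colB = D)
  dt_sum <- DT[colA > 3, colA + colB]
  print(identical(base_sum, dt_sum))
}
```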
The motivation for the new class grew from the realization that
performance of data.frames can be improved by removing the