AttributeList and data.table

Tue, Apr 18, 2006 2:26 AM

data.table now appears to be on CRAN.  However, given Prof Ripley's mail to
r-devel on Friday: "row.names in data.frame", it would seem data.frame
itself can be changed after all, so data.table could be removed.

-----Original Message-----
From: Edzer J. Pebesma [mailto:e.pebesma at geo.uu.nl] 
Sent: 13 April 2006 20:43
To: pedro at dpi.inpe.br
Cc: r-sig-geo at stat.math.ethz.ch; Matthew Dowle
Subject: Re: [R-sig-Geo] AttributeList and data.table


Pedro, you're very alert! I saw it too, and had similar thoughts.
However, I haven't had any complaints yet about the way AttributeLists
work right now; most of it is hidden behind the scenes anyway. If
data.table usage becomes widespread we can certainly provide coercion
functions between the two. Let's first wait until it actually hits
CRAN. I'm for instance curious what happens if you pass one to lm().
--
Edzer

pedro at dpi.inpe.br wrote:

Hi,

There is a quite new package on CRAN called data.table. It implements
the class data.table representing a data.frame without rownames, in
order to improve performance. So, it has the same objective as the sp
class AttributeList. I confess that my knowledge of the functionality
available in both classes is very superficial, but I think the projects
could work together, or even be merged.

Best wishes,

Pedro Andrade

---------- Forwarded message ----------
Date: Wed, 12 Apr 2006 15:19:10 +0100
From: Matthew Dowle <mdowle at concordiafunds.com>
To: "'r-devel at r-project.org'" <r-devel at r-project.org>,
     "'Cran at r-project.org'" <Cran at r-project.org>
Subject: [Rd] New class: data.table

Hi,

Following previous discussion on this list
(http://tolstoy.newcastle.edu.au/R/devel/05/12/3439.html) I have 
created a package as suggested, and uploaded it to CRAN incoming : 
data.table.tar.gz.

** Your comments and feedback will be very much appreciated. **

From help(data.table) :

This class really does very little. The only reason for its existence
is that the white book specifies that data.frame must have rownames.

Most of the code is copied from base functions with the code 
manipulating row.names removed.

A data.table is identical to a data.frame other than:
 	* it doesn't have rownames
 	* [,drop] by default is FALSE, so selecting a single row will always
 	  return a single row data.table not a vector
 	* The comma is optional inside [], so DT[3] returns the 3rd row as a
 	  1 row data.table
 	* [] is like a call to subset()
 	* [,...], is like a call to with().  (not yet implemented)

Motivation:
 	* up to 10 times less memory
 	* up to 10 times faster to create, and copy
 	* simpler R code
 	* the white book defines rownames, so data.frame can't be changed
... => new class

Examples:
nr = 1000000
D = rep(1:5,nr/5)
system.time(DF <<- data.frame(colA=D, colB=D))  # 2.08
system.time(DT <<- data.table(colA=D, colB=D))  # 0.15  (over 10 times faster to create)
identical(as.data.table(DF), DT)
identical(dim(DT),dim(DF))
object.size(DF)/object.size(DT)                 # 10 times less memory

tt = subset(DF,colA>3)
ss = DT[colA>3]
identical(as.data.table(tt), ss)

mean(subset(DF,colA+colB>5,"colB"))
mean(DT[colA+colB>5]$colB)

tt = with(subset(DF,colA>3),colA+colB)
ss = with(DT[colA>3],colA+colB)                 # but could be: DT[colA>3,colA+colB]  (not yet implemented)
identical(tt, ss)

tt = DF[with(DF,tapply(1:nrow(DF),colB,last)),] # select last row grouping by colB
ss = DT[tapply(1:nrow(DT),colB,last)]           # but could be: DT[last,group=colB]  (not yet implemented)
identical(as.data.table(tt), ss)

Lkp=1:3
tt = DF[with(DF,colA %in% Lkp),]
ss = DT[colA %in% Lkp]                          # expressions inside the [] can see objects in the calling frame
identical(as.data.table(tt), ss)

In each case above there is either a space, time, or code brevity 
advantage with the data.table.

The motivation for the new class grew from the realization that 
performance of data.frames can be improved by removing the rownames.

See here for the previous discussion 
http://tolstoy.newcastle.edu.au/R/devel/05/12/3439.html.

Regards,
Matthew

______________________________________________
R-devel at r-project.org mailing list 
https://stat.ethz.ch/mailman/listinfo/r-devel

_______________________________________________
R-sig-Geo mailing list
R-sig-Geo at stat.math.ethz.ch 
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
