split() is slow on data.frame (PR#14123)
On Wed, Dec 9, 2009 at 5:44 PM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
On Wed, 9 Dec 2009, William Dunlap wrote:
Here are some differences between the current and proposed split.data.frame.
Adding 'drop=FALSE' fixes this case. See the inline correction below.
Thank you for the correction.
d<-data.frame(Matrix=I(matrix(1:10, ncol=2)),
Named=c(one=1,two=2,three=3,four=4,five=5), row.names=as.character(1001:1005))
group<-c("A","B","A","A","B")
split.data.frame(d,group)
$A
     Matrix.1 Matrix.2 Named
1001        1        6     1
1003        3        8     3
1004        4        9     4

$B
     Matrix.1 Matrix.2 Named
1002        2        7     2
1005        5       10     5
mysplit.data.frame(d,group) # lost row.names and 2nd column of Matrix
[1] "processing data.frame"
$A
     Matrix Named
[1,]      1     1
[2,]      3     3
[3,]      4     4

$B
     Matrix Named
[1,]      2     2
[2,]      5     5

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
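The loss of row names reported above comes down to how a single column is extracted. The following is only a minimal sketch reusing the example data from this message; the names `v` and `k` are illustrative and do not appear in the thread.

```r
# Same example data as in the message above.
d <- data.frame(Matrix = I(matrix(1:10, ncol = 2)),
                Named = c(one = 1, two = 2, three = 3, four = 4, five = 5),
                row.names = as.character(1001:1005))

# Default extraction drops the data.frame wrapper: the result is a bare
# vector, so the 1001..1005 row names are not carried along.
v <- d[, "Named"]
is.data.frame(v)

# With drop = FALSE the result is still a one-column data.frame and the
# row names survive.
k <- d[, "Named", drop = FALSE]
is.data.frame(k)
rownames(k)
```

Because split() dispatches on the class of its first argument, the first form is split as a plain vector and the second as a data.frame.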
-----Original Message-----
From: r-devel-bounces at r-project.org
[mailto:r-devel-bounces at r-project.org] On Behalf Of
pengyu.ut at gmail.com
Sent: Wednesday, December 09, 2009 2:10 PM
To: r-devel at stat.math.ethz.ch
Cc: R-bugs at r-project.org
Subject: [Rd] split() is slow on data.frame (PR#14123)
Please see the following code for a runtime comparison between
split() and mysplit.data.frame() (they do the same thing
semantically). mysplit.data.frame() is a fix of split() in terms of
performance. Could somebody include this fix (with possible checks
for corner cases) in a future version of R and let me know when it
has been included?
m=300000
n=6
k=30000
set.seed(0)
x=replicate(n,rnorm(m))
f=sample(1:k, size=m, replace=TRUE)
mysplit.data.frame<-function(x,f) {
  print('processing data.frame')
  v=lapply(
      1:dim(x)[[2]]
      , function(i) {
        split(x[,i],f)
        # Change to: split(x[,i,drop=FALSE],f)
      }
      )
  w=lapply(
      seq(along=v[[1]])
      , function(i) {
        result=do.call(
            cbind
            , lapply(v,
                function(vj) {
                  vj[[i]]
                }
                )
            )
        colnames(result)=colnames(x)
        return(result)
      }
      )
  names(w)=names(v[[1]])
  return(w)
}
system.time(split(as.data.frame(x),f))
system.time(mysplit.data.frame(as.data.frame(x),f))
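Since the request mentions checking corner cases, here is a rough sketch of how the per-group values from split() could be compared against a column-wise split on a small data set. The sizes are reduced from the original so it runs quickly, and the names `ref`, `colwise` and `same` are illustrative assumptions, not part of the proposed fix.

```r
set.seed(0)
m <- 1000; n <- 3; k <- 10
x <- as.data.frame(replicate(n, rnorm(m)))
f <- sample(1:k, size = m, replace = TRUE)

# Reference result: a list of data.frames, one per group.
ref <- split(x, f)

# Column-wise variant: for each column, a list of vectors, one per group.
colwise <- lapply(x, split, f)

# For every group and every column, the values should agree exactly.
same <- vapply(names(ref), function(g) {
  all(vapply(names(x), function(j) {
    identical(ref[[g]][[j]], colwise[[j]][[g]])
  }, logical(1)))
}, logical(1))
all(same)

# Timings can be compared on the same inputs.
system.time(split(x, f))
system.time(lapply(x, split, f))
```

This only checks the values for plain numeric columns. It does not cover row names, matrix-valued columns, or factor columns, which are the cases where the two approaches were shown to differ earlier in the thread.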
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901