I need an analogue of "uniq -c" for a data frame.
xtabs(), although dog slow, would have fit the bill nicely:
--8<---------------cut here---------------start------------->8---
x <- data.frame(a=1:32,b=1:32,c=1:32,d=1:32,e=1:32)
system.time(subset(as.data.frame(xtabs( ~. , x )), Freq != 0 ))
user system elapsed
12.788 4.288 17.224
--8<---------------cut here---------------end--------------->8---
but, alas, it fails on larger data:
system.time(subset(as.data.frame(xtabs( ~. , x )), Freq != 0 ))
Error in table(a = 1:32, b = 1:32, c = 1:32, d = 1:32, e = 1:32, f = 1:32, :
attempt to make a table with >= 2^31 elements
(apparently because the product of the numbers of possible values
across the columns is too large).
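To make the limit concrete, a quick sanity check of the arithmetic (assuming the wider, eight-column frame below):

```r
# Each of 8 columns takes 32 distinct values, so the full cross-tabulation
# needs one cell per combination -- far past R's 2^31 - 1 vector-length limit:
32^8          # = 2^40, about 1.1e12 cells
32^8 >= 2^31  # TRUE
```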
rle() seems to be what I really need, but I cannot figure out what it
returns for a simple example:
--8<---------------cut here---------------start------------->8---
x <- data.frame(a=1:32,b=1:32,c=1:32,d=1:32,e=1:32,f=1:32,g=1:32,h=1:32)
rle(x)
Have you looked at using table() directly? If I understand what you
want correctly, something like:
table(do.call(paste, x))
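A minimal sketch of that idiom on toy data (the frame and its column names here are just for illustration):

```r
# Paste each row into a single string, then count the strings.
x <- data.frame(a = c(1, 1, 2), b = c("u", "u", "v"))
tab <- table(do.call(paste, x))  # "1 u" occurs twice, "2 v" once
```

do.call(paste, x) treats the data frame as a list of columns, so each row becomes one space-separated string.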
Also, if you take a look at the development version of R, changes are
being put in place to allow much larger data sets.
Cheers,
Michael
On Tue, Oct 16, 2012 at 4:03 PM, Sam Steingold <sds at gnu.org> wrote:
The count.rows() function is the R analogue.
See
http://orgmode.org/worg/org-contrib/babel/examples/Rpackage.html#sec-6-1
No need to install the package - just copy and paste the function into an
R session.
On cases I've tried that are big enough to matter, it is a good deal
faster than the table( do.call( paste, x )) idiom.
HTH,
Chuck
* R. Michael Weylandt <zvpunry.jrlynaqg at tznvy.pbz> [2012-10-16 16:19:27 +0100]:
Have you looked at using table() directly? If I understand what you
want correctly something like:
table(do.call(paste, x))
I want to avoid paste (I would have to re-split later, which would be a
performance nightmare).
Also, if you take a look at the development version of R, changes are
being put in place to allow much larger data sets.
xtabs(), although dog slow, would have footed the bill nicely:
--8<---------------cut here---------------start------------->8---
x <- data.frame(a=1:32,b=1:32,c=1:32,d=1:32,e=1:32)
system.time(subset(as.data.frame(xtabs( ~. , x )), Freq != 0 ))
user system elapsed
12.788 4.288 17.224
--8<---------------cut here---------------end--------------->8---
you should not need "much larger data sets" for this.
x is sorted.
The problem is that xtabs() and by() and related functions are designed
for the case where all combinations of all factors exist. If you have a
dataset where only a few exist, you could use sparseby() from the
reshape package.
Syntax would be
sparseby(data=x, INDICES=x, FUN=nrow)
if you wanted a dataframe giving counts.
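For comparison, a base-R sketch of the same "only observed combinations" idea using aggregate() (my substitution for illustration, not sparseby() itself; toy data):

```r
# Sum an all-ones column within each observed combination of the grouping
# columns; combinations that never occur simply do not appear in the result.
x <- data.frame(a = c(1, 1, 2), b = c("u", "u", "v"))
counts <- aggregate(list(count = rep(1L, nrow(x))), by = x, FUN = sum)
```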
I just tried it, and on your two examples it gives a warning about
coercing a list to a logical vector; I guess all(list(TRUE, TRUE)) was
allowed when I wrote it, but isn't any more. I'll send a patch to the
maintainer.
Duncan Murdoch
* Duncan Murdoch <zheqbpu.qhapna at tznvy.pbz> [2012-10-16 12:47:36 -0400]:
On 16/10/2012 12:29 PM, Sam Steingold wrote:
x is sorted.
sparseby(data=x, INDICES=x, FUN=nrow)
this takes forever; apparently, it does not use the fact that x is
sorted (even then - it should not take more than a few minutes)...
It was more or less instantaneous on the examples you posted. It would
be a bit more honest to say "it was fast on the examples, but it was
very slow when I ran it on my real data, which consists of
100000000000000 cases."
Duncan Murdoch
* Duncan Murdoch <zheqbpu.qhapna at tznvy.pbz> [2012-10-16 14:22:51 -0400]:
On 16/10/2012 1:46 PM, Sam Steingold wrote:
* Duncan Murdoch <zheqbpu.qhapna at tznvy.pbz> [2012-10-16 12:47:36 -0400]:
On 16/10/2012 12:29 PM, Sam Steingold wrote:
x is sorted.
sparseby(data=x, INDICES=x, FUN=nrow)
this takes forever; apparently, it does not use the fact that x is
sorted (even then - it should not take more than a few minutes)...
It was more or less instantaneous on the examples you posted. It
would be a bit more honest to say "it was fast on the examples, but it
was very slow when I ran it on my real data, which consists of
100000000000000 cases."
sure, I did not mean any insult to your code, sorry.
all I was saying was that it was too slow for my purposes because it
ignores the fact that the data is sorted.
It turned out that paste+sort+rle+strsplit is fast enough.
(although there should be a way to avoid paste/strsplit!)
Thanks!
You said you wanted the equivalent of the Unix 'uniq -c', that xtabs()'s
results were roughly right, and that rle() might be what you want. rle()
is indeed the equivalent of 'uniq -c': both output the lengths of runs of
identical elements. If the data is sorted, they are equivalent to using
table() or xtabs().
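A small illustration of the parallel (toy vector):

```r
# 'uniq -c' on the sequence  a a b b b a  reports the runs: 2 a, 3 b, 1 a
r <- rle(c("a", "a", "b", "b", "b", "a"))
r$lengths  # 2 3 1
r$values   # "a" "b" "a"
```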
Since you have sorted data try the following
isFirstInRun <- function(x) UseMethod("isFirstInRun")
isFirstInRun.default <- function(x) c(TRUE, x[-1] != x[-length(x)])
isFirstInRun.data.frame <- function(x) {
stopifnot(ncol(x)>0)
retval <- isFirstInRun(x[[1]])
for(column in x) {
retval <- retval | isFirstInRun(column)
}
retval
}
i <- which(isFirstInRun(yourDataFrame))
Then I think
data.frame(Count=diff(c(i, 1L+nrow(yourDataFrame))), yourDataFrame[i,])
gives you what you want. E.g.,
> yourDataFrame <- data.frame(x1=c(1,1,2,2,1), x2=c(11,11,11,12,11))
> i <- which(isFirstInRun(yourDataFrame))
> i
[1] 1 3 4 5
> data.frame(Count=diff(c(i, 1L+nrow(yourDataFrame))), yourDataFrame[i,])
Count x1 x2
1 2 1 11
3 1 2 11
4 1 2 12
5 1 1 11
It should be pretty quick. If you have missing values in your data frame,
you will have to make some decisions about whether they should be
considered equal to each other or not and modify isFirstInRun.default
accordingly.
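One possible modification, assuming you want two NAs to count as equal (that choice is my assumption, not part of the original spec):

```r
# Variant of isFirstInRun.default in which NA followed by NA continues a run,
# while NA next to a non-NA starts a new one.
isFirstInRunNA <- function(x) {
  a <- x[-1]; b <- x[-length(x)]
  same <- (a == b)                        # NA wherever a or b is NA
  same[is.na(a) & is.na(b)] <- TRUE       # both NA: same run
  same[xor(is.na(a), is.na(b))] <- FALSE  # exactly one NA: new run
  c(TRUE, !same)
}
```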
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
Of Sam Steingold
Sent: Tuesday, October 16, 2012 10:46 AM
To: r-help at r-project.org; Duncan Murdoch
Subject: Re: [R] uniq -c
Note that the relative speeds of these, which all use basically the same run-length-encoding
algorithm, depend on the nature of the dataset. I made a million row data.frame with 10,000
unique users, 26 unique countries, and 6 unique languages with c. 3/4 million unique
rows. Then the times for methods 1, 2, and 3 were 0.7, 6.2, and 10.5 seconds,
respectively. With a million row data.frame with 100, 10, and 4 unique users, countries,
and languages, with 4000 unique rows, the times were 0.3, 1.4, and 0.7.
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
Of Sam Steingold
Sent: Wednesday, October 17, 2012 12:58 PM
To: r-help at r-project.org
Subject: Re: [R] uniq -c
* Sam Steingold <fqf at tah.bet> [2012-10-16 11:03:27 -0400]:
I need an analogue of "uniq -c" for a data frame.
Summary of options:
1. William:
isFirstInRun <- function(x) UseMethod("isFirstInRun")
isFirstInRun.default <- function(x) c(TRUE, x[-1] != x[-length(x)])
isFirstInRun.data.frame <- function(x) {
stopifnot(ncol(x)>0)
retval <- isFirstInRun(x[[1]])
for(column in x) {
retval <- retval | isFirstInRun(column)
}
retval
}
row.count.1 <- function (x) {
i <- which(isFirstInRun(x))
data.frame(x[i,], count=diff(c(i, 1L+nrow(x))))
}
147 seconds
2. http://orgmode.org/worg/org-contrib/babel/examples/Rpackage.html#sec-6-1
row.count.2 <- function (x) {
equal.to.previous <- rowSums( x[2:nrow(x),] != x[1:(nrow(x)-1),] )==0
tf.runs <- rle(equal.to.previous)
counts <- c(1, unlist(mapply(function(x,y) if (y) x+1 else (rep(1,x)),
tf.runs$length, tf.runs$value)))
counts <- counts[ c( diff( counts ) <= 0, TRUE ) ]
unique.rows <- which( c(TRUE, !equal.to.previous ) )
cbind(x[ unique.rows, ,drop=FALSE ], counts)
}
136 seconds
3. Michael: paste/strsplit
row.count.3 <- function (x) {
pa <- do.call(paste,x)
rl <- rle(pa)
sp <- strsplit(as.character(rl$values)," ")
data.frame(user = sapply(sp,"[",1),
country = sapply(sp,"[",2),
language = sapply(sp,"[",3),
count = rl$length)
}
here I know the columns and rely on absence of spaces in values.
27 seconds.
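One way to drop the strsplit step might be to index the original frame at the first row of each run instead of re-splitting the pasted strings; a sketch, not timed on the real data:

```r
row.count.3b <- function(x) {
  rl <- rle(do.call(paste, x))                   # run lengths of identical rows
  first <- cumsum(c(1L, head(rl$lengths, -1L)))  # index of each run's first row
  data.frame(x[first, , drop = FALSE], count = rl$lengths)
}
```

paste is still used for the comparison, but the output columns come straight from x, so nothing needs to be re-split (though pathological values containing spaces could still collide in the pasted key).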
Thanks to all who answered.
--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://www.PetitionOnline.com/tap12009/ http://thereligionofpeace.com http://ffii.org http://camera.org
A slave dreams not of Freedom, but of owning his own slaves.
In addition, adding a factor method for isFirstInRun speeds it up on
long factor variables by c. 60%.
isFirstInRun.factor <- function(x) isFirstInRun(as.integer(x))
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com