Hi Developers,
I am looking for another new project to help me get more up to speed
on R and to learn something outside of R internals. One recent R
issue I have run into is finding a fast implementation of the
equivalent to the following SAS code:
/* MDPC is an integer sort key made from two integer columns */
MDPC = (MD * 100000) + PCO;
/* sort the dataset by the key */
PROC SORT;
BY MDPC;
/* print out count and sum for each unique sort key (subgroup) */
/* use of BY MDPC requires sorting that data set by MDPC first in SAS */
PROC UNIVARIATE NOPRINT;
VAR MVE;
BY MDPC;
OUTPUT OUT=TMP0 N=XN SUM=XSUM;
Easy to do in R, but the problem is that the data set this is being
run on has 1,742,201 lines and takes up 196,868,713 bytes to store as
character data. The sort key easily has over 200,000 unique keys
(if not twice that).
My first R attempt was simple:
# sort the data.frame gd and the sort key
sorder <- order(MDPC)
gd <- gd[sorder,]
MDPC <- MDPC[sorder]
attach(gd)
# find the length and sum for each unique sort key
XN <- by(MVE, MDPC, length)
XSUM <- by(MVE, MDPC, sum)
GRPS <- levels(as.factor(MDPC))
Well the ordering and sorting was reasonably fast but the first "by"
statement was still running 4 hours later on my machine (a dual 2.6
gig Opteron with 4 gig of main memory). This same snippet of code in
SAS running on a slower machine takes about 5 minutes of system time.
I tried various simple R implementations of a "by_sorted" that I
thought might help
# walk sorted array once and keep track of beginning
# and ending points for each unique sort key value in p
# and run function fcn on that sub sequence in vector v
# store the results in a vector
by_sorted <- function(v, p, fcn) {
  key <- p[[1]]
  bp <- 1                           # start of the current run of equal keys
  r <- NULL
  for (i in 2:length(p)) {
    if (key != p[[i]]) {
      # note the parentheses: bp:i-1 would mean (bp:i)-1
      r <- c(r, fcn(v[bp:(i - 1)]))
      bp <- i
      key <- p[[i]]
    }
  }
  c(r, fcn(v[bp:i]))                # last group, and the return value
}
but they took "forever" to run as well (read: I killed those
attempts after 15 minutes of CPU time).
I had the same issue when trying "tapply".
So unless it already exists someplace, I need a really fast
implementation of "by" for very large sorted data sets (probably
written in Fortran or C) that will do the equivalent of what SAS does
with its "proc univariate by" approach, with close to the same
performance. The code should only have to walk the array once (i.e.
be linear in time in the number of rows of the vector). I have
similar issues with "merge" as well, since merging data frames
already sorted by the same sort key should be fast too, and does not
appear to be.
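To make the single-pass idea concrete, here is a minimal C sketch of a grouped count-and-sum over a key-sorted vector (hypothetical names, not an existing R internal): it walks the array once and emits N and SUM for each run of equal keys, like PROC UNIVARIATE BY on presorted data.

```c
/* Walk a key-sorted array once, emitting the count (N) and sum (SUM)
 * for each run of equal keys.  Returns the number of distinct keys;
 * out_n[] and out_sum[] must have room for that many entries. */
static int by_sorted_sum(const int *key, const double *v, int n,
                         int *out_n, double *out_sum)
{
    int g = 0;          /* index of the current output group */
    int i, start = 0;   /* start of the current run of equal keys */
    if (n == 0) return 0;
    for (i = 1; i <= n; i++) {
        if (i == n || key[i] != key[start]) {
            double s = 0.0;
            int j;
            for (j = start; j < i; j++) s += v[j];
            out_n[g] = i - start;
            out_sum[g] = s;
            g++;
            start = i;
        }
    }
    return g;
}
```

Because each element is visited exactly once, the cost is linear in the number of rows regardless of how many unique keys there are.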
Before I jump into this and create my own "sorted large data set"
versions of "by" and "merge", I wanted to be sure they would be of
interest to others. If they work well and are well implemented (a
big if, since I am really still just learning this - the whole point
of the project!), would something like this be of any interest for
internal use in R? Or is this something too specialized?
Is there an R function implemented in C or Fortran that would make a
good "model" to follow for implementing something like this?
Would/should they be extensions of the current implementations of
"merge" and "by", with the addition of an extra sorted=TRUE parameter
(defaulting to FALSE)?
Or am I simply barking up the wrong tree here?
Thanks,
Kevin
Any interest in "merge" and "by" implementations specifically for sorted data?
"Kevin B. Hendricks" <kevin.hendricks at sympatico.ca> writes:
[.........]
I wonder if split() would be of use here. Once you have sorted the
data frame gd and the sort keys MDPC, you could do:

gdList <- split(gd$MVE, MDPC)
xn <- sapply(gdList, length)
xsum <- sapply(gdList, sum)

+ seth
Which version of R are you looking at? R-devel has
o merge() works more efficiently when there are relatively few
matches between the data frames (for example, for 1-1
matching). The order of the result is changed for 'sort = FALSE'.
On Thu, 27 Jul 2006, Kevin B. Hendricks wrote:
[.........]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
Hi,

I was using my installed R, which is 2.3.1, for the first tests. I
moved to the r-devel tree (I svn up and rebuild every day) for my
"by" tests to see if it would work any better. I neglected to retest
"merge" with the devel version. So it appears "merge" is already
fixed and I just need to worry about "by".
On Jul 28, 2006, at 3:06 AM, Brian D Ripley wrote:
Which version of R are you looking at? R-devel has
o merge() works more efficiently when there are relatively few
matches between the data frames (for example, for 1-1
matching). The order of the result is changed for 'sort = FALSE'.
Thanks, Kevin
There was a performance comparison of several moving average
approaches here:

http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html

The author of that message ultimately wrote the caTools R package,
which contains some optimized versions. Not sure if these results
suggest anything of interest here, but it would be interesting if
various base routines could be sped up to the point that a simple
idiom is competitive with the caTools versions.
On 7/28/06, Kevin B. Hendricks <kevin.hendricks at sympatico.ca> wrote:
[.........]
Hi,
There was a performance comparison of several moving average approaches here: http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html
Thanks for that link. It is not quite the same thing, but it is very
similar. The idea is to somehow make functions that work well over
small sub-sequences of a much longer vector without resorting to
splitting the vector into many smaller vectors.

In my particular case, the problem was that my data frame had over 1
million lines and probably over 500,000 unique sort keys (i.e. think
of it as an R factor with over 500,000 levels). The implementation of
"by" uses "tapply", which in turn uses "split". So "split" simply ate
up all the time trying to create 500,000 vectors, each of short
length 1, 2, or 3, plus the associated garbage collection. A simple
loop that walked the short sequences of values (since the data frame
was already sorted), calculating what it needed, would work much
faster than splitting the original vector into so very many smaller
vectors (and the associated copying of data).

That problem is very similar to the calculation of basic stats on a
short moving window over a very long vector.
The author of that message ultimately wrote the caTools R package which contains some optimized versions.
I will look into that package and maybe use it for a model for what I want to do. Thanks, Kevin
"Kevin" == Kevin B Hendricks <kevin.hendricks at sympatico.ca>
on Fri, 28 Jul 2006 14:53:57 -0400 writes:
[.........]
Kevin> The idea is to somehow make functions that work well
Kevin> over small sub- sequences of a much longer vector
Kevin> without resorting to splitting the vector into many
Kevin> smaller vectors.
Kevin> In my particular case, the problem was that my data frame
Kevin> had over 1 million lines and probably over 500,000
Kevin> unique sort keys (i.e. think of it as an R factor with
Kevin> over 500,000 levels). The implementation of "by"
Kevin> uses "tapply", which in turn uses "split". So "split"
Kevin> simply ate up all the time trying to create 500,000
Kevin> vectors, each of short length 1, 2, or 3, plus the
Kevin> associated garbage collection.
Not that I have spent enough time thinking about this thread's
topic, but I have seen more than one case where using tapply()
unnecessarily slowed down computations.
I don't remember the details, but I know that in one case, replacing
tapply() by a few lines of code {one of which used lapply() IIRC}
sped up that computation by a factor (of 2? or more?).
I also vaguely remember that I thought about making tapply()
faster, but came to the conclusion that it could not be
sped up quickly, because it works in a quite more general
context than it was used in that application (and maybe yours?).
Kevin> A simple loop that walked the short sequences of
Kevin> values (since the data frame was already sorted),
Kevin> calculating what it needed, would work much faster
Kevin> than splitting the original vector into so very many
Kevin> smaller vectors (and the associated copying of data).
Kevin> That problem is very similar to the
Kevin> calculation of basic stats on a short moving window
Kevin> over a very long vector.
>> The author of that message ultimately wrote the caTools R
>> package which contains some optimized versions.
Kevin> I will look into that package and maybe use it for a
Kevin> model for what I want to do.
Kevin> Thanks,
Kevin> Kevin
On Fri, 28 Jul 2006, Kevin B. Hendricks wrote:
[.........]
Splus8.0 has something like what you are talking about
that provides a fast way to compute
sapply(split(xVector, integerGroupCode), summaryFunction)
for some common summary functions. The 'integerGroupCode'
is typically the codes from a factor, but you could compute
it in other ways. It needs to be a "small" integer in
the range 1:ngroups (like the 'bin' argument to tabulate).
Like tabulate(), which is called from table(), these are
meant to be called from other functions that can set up
appropriate group codes. E.g., groupSums or rowSums or
fancier things could be based on this.
They do not insist you sort the input in any way. That
would really only be useful for group medians and we haven't
written that one yet.
The typical prototype is

igroupSums <- function(x, group = NULL, na.rm = F, weights = NULL,
                       ngroups = if (is.null(group)) 1
                                 else max(as.integer(group), na.rm = T))
and the currently supported summary functions are
mean : igroupMeans
sum : igroupSums
prod : igroupProds
min : igroupMins
max : igroupMaxs
range : igroupRanges
any : igroupAnys
all : igroupAlls
(The plurals are not all properly spelled, and they
are plural to match related functions like rowSums.)
It might be useful to also have one to count the number
of non-missing values in each group of x's.
They are fast:
> x<-runif(2e6)
> i<-rep(1:1e6, 2)
> sys.time(sx <- igroupSums(x,i))
[1] 0.66 0.67
> length(sx)
[1] 1000000
On that machine R takes 44 seconds to go the lapply/split
route:
> unix.time(unlist(lapply(split(x,i), sum)))
[1] 43.24 0.78 44.11 0.00 0.00
To save setup time in the S code, out-of-range values
in the group argument (negatives, values greater than
ngroups, and NAs) mean that the corresponding element
in x is ignored.
----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146
"All statements in this message represent the opinions of the author and do
not necessarily reflect Insightful Corporation policy or position."
Hi Bill,
Splus8.0 has something like what you are talking about
that provides a fast way to compute
sapply(split(xVector, integerGroupCode), summaryFunction)
[.........]
They do not insist you sort the input in any way. That
would really only be useful for group medians and we haven't
written that one yet.
The sort is also useful for recoding each group into subgroups based
on some other numeric vector. This is the problem I run into trying
to build portfolios that can be used as benchmarks for long-term
stock returns.

Another issue I have is that recoding a long character string that I
use as a sort key (for accessing a subgroup of the data in the
data.frame) into a set of small integers is not fast. I can make a
fast implementation if the data is sorted by the key, but without
the sort, just converting my sort keys to the required small integer
codes would be expensive for very long vectors, since my small
integer codes would have to reflect the order of the data (i.e. be
increasing subportfolio numbers).
More specifically, I am now converting all of my SAS code to R, and
the problem is that I have lots of snippets of SAS that do the
following ...
PROC SORT;
BY MDSIZ FSIZ;
/* WRITE OUT THE MIN SIZE CUTOFF VALUES */
PROC UNIVARIATE NOPRINT;
VAR FSIZ;
BY MDSIZ;
OUTPUT OUT=TMPS1 MIN=XMIN;
where my sort key MDSIZ is a character string that is the
concatenation of the month-ending date (MD) and the size portfolio
of a particular firm (SIZ), and I want to find the cutoff points
(the mins) for each of the portfolios for every month-end date
across all traded firms.
The typical prototype is

igroupSums <- function(x, group = NULL, na.rm = F, weights = NULL,
                       ngroups = if (is.null(group)) 1
                                 else max(as.integer(group), na.rm = T))
and the currently supported summary functions are
mean : igroupMeans
sum : igroupSums
prod : igroupProds
min : igroupMins
max : igroupMaxs
range : igroupRanges
any : igroupAnys
all : igroupAlls
SAS is similar in that it also has a specific list of functions you
can request, including all of the basic stats from a PROC UNIVARIATE
(even higher-moment stuff such as skewness and kurtosis, robust
statistics, and statistical test results for each coded subgroup),
and the nice thing is that all combinations can be done with one
call.

SAS does require the presorting to do that, but it does run really
fast even for super long vectors with lots of sort keys.
Similarly, the next snippet of code will take the file and resort it
by the portfolio key and then the market-to-book ratio (MTB) for all
trading firms for all monthly periods since 1980. It will then split
each size portfolio for each month-ending date into 5 equal
portfolios based on market-to-book ratios (thus the need for the
sort). SAS returns a coded integer vector PMTB (made up of values 1
to 5, with 1 for the smallest MTB and 5 for the largest), repeated
for each subgroup of MDSIZ. PMTB matches the original vector in
length and therefore fits right into the data frame.
/* SPLIT INTO Market to Book QUINTILES BY MDSIZ */
PROC SORT;
BY MDSIZ MTB;
PROC RANK GROUPS=5 OUT=TMPS0;
VAR MTB;
RANKS PMTB;
BY MDSIZ;
The problem of assigning elements of a long data vector to
portfolios and subportfolios, based on the values of specific data
columns that must be calculated at each step and are not fixed or
hardcoded, is one that finance can run into (and therefore I run
into it).

So by sorting I could handle the need for "small integer" recoding,
and the small integers would have meaning (i.e. higher values would
represent larger MTB firms, etc).

That just leaves the problem of calculating stats on short sequences
of a longer vector.
They are fast:

> x <- runif(2e6)
> i <- rep(1:1e6, 2)
> sys.time(sx <- igroupSums(x,i))
[1] 0.66 0.67
> length(sx)
[1] 1000000

On that machine R takes 44 seconds to go the lapply/split route:

> unix.time(unlist(lapply(split(x,i), sum)))
[1] 43.24 0.78 44.11 0.00 0.00

Yes! That is exactly what I need. Are there plans for adding
something like that to R?

Thanks,

Kevin
On Fri, 28 Jul 2006, Kevin B. Hendricks wrote:
Hi Bill,
[.........]
The sort is also useful for recoding each group into subgroups based on some other numeric vector. This is the problem I run into trying to build portfolios that can be used as benchmarks for long-term stock returns. Another issue I have is that recoding a long character string that I use as a sort key (for accessing a subgroup of the data in the data.frame) into a set of small integers is not fast. I can make a fast implementation if the data is sorted by the key, but without the sort, just converting my sort keys to the required small integer codes would be expensive for very long vectors, since my small integer codes would have to reflect the order of the data (i.e. be increasing subportfolio numbers).
True, but the underlying grouped summary code
shouldn't require you to do the sorting. If

codes <- match(char, sort(unique(char)))

is too slow then you could try sorting the
data set by the 'char' column and doing

codes <- cumsum(c(TRUE, char[-1] != char[-length(char)]))

(loop over columns before doing cumsum if there
are several columns).

If that isn't fast enough, then you could sort
in the C code as well, but I think there could
be lots of cases where that is slower.

I've used this code for out-of-core applications,
where I definitely do not want to sort the whole
dataset.
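The cumsum-over-a-sorted-key trick costs one comparison per row; a minimal C sketch of the same idea (hypothetical function name, assuming integer keys) assigns dense codes 1..ngroups in data order:

```c
/* Assign small-integer group codes (1..ngroups) to a key-sorted
 * array in one pass: each row inherits the previous row's code,
 * bumped by one whenever the key changes -- the C analogue of
 * cumsum(c(TRUE, key[-1] != key[-n])).  Returns ngroups. */
static int group_codes_sorted(const int *key, int n, int *code)
{
    int i, g = 0;
    for (i = 0; i < n; i++) {
        if (i == 0 || key[i] != key[i - 1]) g++;  /* new run starts */
        code[i] = g;
    }
    return g;
}
```

Because the input is sorted, the resulting codes are automatically increasing with the key, which is exactly the "small integers with meaning" property described above.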
[.........]
---------------------------------------------------------------------------- Bill Dunlap Insightful Corporation bill at insightful dot com 360-428-8146 "All statements in this message represent the opinions of the author and do not necessarily reflect Insightful Corporation policy or position."
Hi Bill,
sum : igroupSums
Okay, after thinking about this ...
# assumes i is the small integer factor with n levels
# v is some long vector
# no sorting required
igroupSums <- function(v, i) {
  sums <- rep(0, max(i))
  for (j in 1:length(v)) {
    sums[[i[[j]]]] <- sums[[i[[j]]]] + v[[j]]
  }
  sums
}
if written in Fortran or C, this might be faster than using split.
It is at least linear in time with the length of vector v. This
approach could easily be made parallel across t threads, simply by
picking t starting points along v and running this routine in
parallel on each piece. You could even do it without thread locking
if the "sums" elements can be accessed atomically, or by creating
multiple copies of "sums" (one for each piece) and then doing a
final addition.

I still think I am missing some obvious way to do this, but ...

Am I thinking along the right lines?
Kevin
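The scatter-add loop above is exactly the kind of thing that gets fast in C; a minimal sketch (hypothetical name, following the S-PLUS convention of skipping out-of-range codes):

```c
/* Grouped sum without any sorting: sums[g-1] accumulates v[i] for
 * every element whose group code g is in 1..ngroups; out-of-range
 * codes are ignored.  One pass, linear in n, like the R sketch. */
static void igroup_sums(const double *v, const int *code, int n,
                        double *sums, int ngroups)
{
    int i;
    for (i = 0; i < ngroups; i++) sums[i] = 0.0;  /* zero accumulators */
    for (i = 0; i < n; i++) {
        int g = code[i];
        if (g >= 1 && g <= ngroups)
            sums[g - 1] += v[i];
    }
}
```

The same skeleton works for counts, mins, maxes, and means (accumulate sum and count, divide at the end); only the accumulator update changes.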
On Sat, 29 Jul 2006, Kevin B. Hendricks wrote:
[.........]
For sums you should look at rowsum(). It uses a hash table in C and
last time I looked was faster than using split(). It returns a
vector of the same length as the input, but that would easily be
fixed. The same approach would work for min, max, range, count, and
mean, but not for arbitrary functions.

-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
Hi Thomas,

Here is a comparison of performance times from my own igroupSums
versus using split and rowsum:

> x <- rnorm(2e6)
> i <- rep(1:1e6, 2)
> unix.time(suma <- unlist(lapply(split(x,i), sum)))
[1] 8.188 0.076 8.263 0.000 0.000
> names(suma) <- NULL
> unix.time(sumb <- igroupSums(x,i))
[1] 0.036 0.000 0.035 0.000 0.000
> all.equal(suma, sumb)
[1] TRUE
> unix.time(sumc <- rowsum(x,i))
[1] 0.744 0.000 0.742 0.000 0.000
> sumc <- sumc[,1]
> names(sumc) <- NULL
> all.equal(suma, sumc)
[1] TRUE

So my implementation of igroupSums is faster and already handles
NAs. I have also implemented igroupMins, igroupMaxs, igroupAnys,
igroupAlls, igroupCounts, igroupMeans, and igroupRanges. The igroup
functions I implemented do not handle weights yet, but they do
handle NAs properly.

Assuming I clean them up, is anyone in the R developer group
interested? Or would you rather I instead extend the rowsum approach
to create rowcount, rowmax, rowmin, etc. using a hash-function
approach? All of these approaches simply use different ways to map
group codes to integers and then do the functions the same way.

Thanks,

Kevin