
Any interest in "merge" and "by" implementations specifically for sorted data?

13 messages · Seth Falcon, Brian Ripley, Gabor Grothendieck +4 more

#
Hi Developers,

I am looking for another new project to help me get more up to speed
on R and to learn something outside of R internals.  One recent R
issue I have run into is finding a fast implementation of the
equivalent of the following SAS code:

/* MDPC is an integer sort key made from two integer columns */
MDPC = (MD * 100000) + PCO;

/* sort the dataset by the key */
PROC SORT;
   BY MDPC;

/* print out count and sum for each unique sort key (subgroup) */
/* use of BY MDPC requires sorting that data set by MDPC first in SAS */
PROC UNIVARIATE NOPRINT;
    VAR MVE;
    BY MDPC;
    OUTPUT OUT=TMP0 N=XN SUM=XSUM;

This is easy to do in R, but the data set it is being run on
has 1,742,201 lines and takes up 196,868,713 bytes to store as
character data.  The sort key easily has over 200,000 unique values
(if not twice that).

My first R attempt was a simple

# sort the data.frame gd and the sort key
sorder <- order(MDPC)
gd  <- gd[sorder,]
MDPC <- MDPC[sorder]
attach(gd)

# find the length and sum for each unique sort key
XN <- by(MVE, MDPC, length)
XSUM <- by(MVE, MDPC, sum)
GRPS <- levels(as.factor(MDPC))

Well, the ordering and sorting were reasonably fast, but the first "by"
statement was still running 4 hours later on my machine (a dual 2.6 GHz
Opteron with 4 GB of main memory).  The same snippet of code in SAS,
running on a slower machine, takes about 5 minutes of system time.
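For what it's worth, the same count-and-sum-per-key computation can also be sketched in base R without "by" at all, by recoding the key to small integers and using rowsum() and tabulate(); the data values below are made up for illustration:

```r
# Toy stand-ins for the real columns: MDPC is the integer sort key,
# MVE the value column to be counted and summed per key.
MDPC <- c(101, 101, 205, 205, 205, 340)
MVE  <- c(1.5, 2.0, 3.0, 0.5, 1.0, 4.0)

# Recode each key to a small integer code in 1:ngroups
codes <- match(MDPC, sort(unique(MDPC)))

XN   <- tabulate(codes)                # count per unique key
XSUM <- as.vector(rowsum(MVE, codes))  # sum per unique key
GRPS <- sort(unique(MDPC))             # the keys, in code order
```

Both tabulate() and rowsum() do their accumulation in compiled code, so this walks the data a small constant number of times rather than once per group.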

I tried various simple R implementations of a "by_sorted" that I  
thought might help

# walk sorted array once and keep track of beginning
# and ending points for each unique sort key value in p
# and run function fcn on that sub sequence in vector v
# store the results in a vector

by_sorted <- function(v, p, fcn) {
    key <- p[[1]]
    bp <- 1
    r <- NULL
    for (i in 2:length(p)) {
       if (key != p[[i]]) {
           # note the parentheses: bp:i-1 would mean (bp:i) - 1
           r <- c(r, fcn(v[bp:(i - 1)]))
           bp <- i
           key <- p[[i]]
       }
    }
    c(r, fcn(v[bp:length(p)]))
}
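One way to avoid the element-by-element interpreted loop, given that the key vector is already sorted, is to locate all group boundaries at once with rle() and then apply the function to each contiguous slice; a sketch (by_sorted2 is a hypothetical name):

```r
# Vectorised variant: find run boundaries in the sorted key once,
# then apply fcn to each contiguous slice of v.
by_sorted2 <- function(v, p, fcn) {
  runs <- rle(p)                         # run lengths of each unique key
  ends <- cumsum(runs$lengths)           # last index of each group
  starts <- c(1L, head(ends, -1L) + 1L)  # first index of each group
  mapply(function(b, e) fcn(v[b:e]), starts, ends)
}

by_sorted2(c(1, 2, 3, 4, 10), c(5, 5, 7, 7, 9), sum)  # 3 7 10
```

This still calls fcn once per group, so with hundreds of thousands of groups the per-call overhead remains; it only removes the per-element loop.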

but they took "forever" to run as well (read: I killed those
attempts after 15 minutes of CPU time).

I ran into the same issue when trying "tapply".

So unless it already exists someplace, I need a really fast
implementation of "by" for very large sorted data sets (probably
written in Fortran or C) that will do the equivalent of what SAS does
with its "proc univariate by" approach, with close to the same
performance.  The code should only have to walk the array once (i.e.
be linear in time in the number of rows of the vector).  I have
similar issues with "merge" as well, since merging data frames already
sorted by the same sort key should be fast too, and does not
appear to be.

Before I jump into this and create my own "sorted large data set"
versions of "by" and "merge", I wanted to be sure they would be of
interest to others.  If they work well and are well implemented (a
big if, since I am really still just learning this - the whole point
of the project!), would something like this be of any interest for
internal use in R?  Or is this something too specialized?
Is there an R function implemented in C or Fortran that would make a
good model to follow for implementing something like this?
Would/should they be extensions of the current implementations of
"merge" and "by", with an extra sorted=TRUE parameter (defaulting
to FALSE)?

Or am I simply barking up the wrong tree here?

Thanks,

Kevin
#
"Kevin B. Hendricks" <kevin.hendricks at sympatico.ca> writes:
I wonder if split() would be of use here.  Once you have sorted the
data frame gd and the sort keys MDPC, you could do:

gdList <- split(gd$MVE, MDPC)

xn <- sapply(gdList, length)
xsum <- sapply(gdList, sum)

+ seth
#
Which version of R are you looking at?   R-devel has

    o	merge() works more efficiently when there are relatively few
	matches between the data frames (for example, for 1-1
	matching).  The order of the result is changed for 'sort = FALSE'.
On Thu, 27 Jul 2006, Kevin B. Hendricks wrote:
#
Hi,

I was using my installed R, which is 2.3.1, for the first tests.  I
moved to the r-devel tree (I svn up and rebuild every day) for my "by"
tests to see if it would work any better.  I neglected to retest
"merge" with the devel version.

So it appears "merge" is already fixed and I just need to worry about
"by".
On Jul 28, 2006, at 3:06 AM, Brian D Ripley wrote:
Thanks,

Kevin
#
There was a performance comparison of several moving average
approaches here:
http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html

The author of that message ultimately wrote the caTools R package
which contains some optimized versions.

I am not sure whether these results suggest anything of interest here,
but it would be interesting if various base routines could be sped up
to the point that a simple idiom is competitive with the caTools versions.
On 7/28/06, Kevin B. Hendricks <kevin.hendricks at sympatico.ca> wrote:
#
Hi,
Thanks for that link.  It is not quite the same thing, but it is very
similar.

The idea is to somehow make functions that work well over small
subsequences of a much longer vector without resorting to splitting the
vector into many smaller vectors.

In my particular case, the problem was that my data frame had over 1
million lines and probably over 500,000 unique sort keys (i.e. think
of it as an R factor with over 500,000 levels).  The implementation
of "by" uses "tapply", which in turn uses "split".  So "split" simply
ate up all the time trying to create 500,000 vectors, each of short
length 1, 2, or 3, along with the associated garbage collection.

A simple loop that walks the short sequences of values (since the
data frame is already sorted), calculating what it needs, works
much faster than splitting the original vector into so very many
smaller vectors (with the associated copying of data).

That problem is very similar to the calculation of basic
stats over a short moving window on a very long vector.
I will look into that package and maybe use it as a model for what I
want to do.

Thanks,

Kevin
#
[.........]

    Kevin> The idea is to somehow make functions that work well
    Kevin> over small sub- sequences of a much longer vector
    Kevin> without resorting to splitting the vector into many
    Kevin> smaller vectors.

    Kevin> In my particular case, the problem was my data frame
    Kevin> had over 1 million lines had probably over 500,000
    Kevin> unique sort keys (ie. think of it as an R factor with
    Kevin> over 500,000 levels).  The implementation of "by"
    Kevin> uses "tapply" which in turn uses "split".  So "split"
    Kevin> simply ate up all the time trying to create 500,000
    Kevin> vectors each of short length 1, 2, or 3; and the
    Kevin> associated garbage collection.

Not that I have spent enough time thinking about this thread's
topic, but I have seen more than one case where using tapply()
unnecessarily slowed down computations.
I don't remember the details, but I know that in one case, replacing
tapply() with a few lines of code {one of which used lapply(), IIRC}
sped up that computation by a factor of 2 or more.

I also vaguely remember that I thought about making tapply()
faster, but came to the conclusion that it could not be
sped up quickly, because it works in a rather more general
context than it was used in that application (and maybe yours?).


    Kevin> I simple loop that walked the short sequence of
    Kevin> values (since the data frame was already sorted)
    Kevin> calculating what it needed, would work much faster
    Kevin> than splitting the original vector into so very many
    Kevin> smaller vectors (and the associated copying of data).

    Kevin> That problem is very similar problem to the
    Kevin> calculation of basic stats on a short moving window
    Kevin> over a very long vector.

    >> The author of that message ultimately wrote the caTools R
    >> package which contains some optimized versions.

    Kevin> I will look into that package and maybe use it for a
    Kevin> model for what I want to do.

    Kevin> Thanks,

    Kevin> Kevin

#
On Fri, 28 Jul 2006, Kevin B. Hendricks wrote:
Splus 8.0 has something like what you are talking about:
it provides a fast way to compute
    sapply(split(xVector, integerGroupCode), summaryFunction)
for some common summary functions.  The 'integerGroupCode'
is typically the codes from a factor, but you could compute
it in other ways.  It needs to be a "small" integer in
the range 1:ngroups (like the 'bin' argument to tabulate).
Like tabulate(), which is called from table(), these are
meant to be called from other functions that can set up
appropriate group codes.  E.g., groupSums or rowSums or
fancier things could be based on this.

They do not insist that you sort the input in any way.  That
would really only be useful for group medians, and we haven't
written that one yet.

The typical prototype is
function(x, group = NULL, na.rm = F, weights = NULL,
         ngroups = if (is.null(group)) 1
                   else max(as.integer(group), na.rm = T))

and the currently supported summary functions are
   mean : igroupMeans
   sum : igroupSums
   prod : igroupProds
   min : igroupMins
   max : igroupMaxs
   range : igroupRanges
   any : igroupAnys
   all : igroupAlls
(The plurals are not all properly spelled, and they
are plural to match related functions like rowSums.)
It might be useful to also have one that counts the number
of non-missing values in each group of x's.

They are fast:

   > x<-runif(2e6)
   > i<-rep(1:1e6, 2)
   > sys.time(sx <- igroupSums(x,i))
   [1] 0.66 0.67
   > length(sx)
   [1] 1000000

On that machine R takes 44 seconds to go the lapply/split
route:

   > unix.time(unlist(lapply(split(x,i), sum)))
   [1] 43.24  0.78 44.11  0.00  0.00

To save setup time in the S code, out-of-range values
in the group argument (negatives, values greater than
ngroups, and NAs) mean that the corresponding element
of x is ignored.
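That out-of-range convention (codes that are NA, non-positive, or greater than ngroups cause the corresponding x to be ignored) can be approximated in base R; this is only a sketch of the semantics, not the S-PLUS implementation, and group_sums is a made-up name:

```r
group_sums <- function(x, group,
                       ngroups = max(as.integer(group), na.rm = TRUE)) {
  g <- as.integer(group)
  # drop elements whose code is NA or outside 1:ngroups
  keep <- !is.na(g) & g >= 1 & g <= ngroups
  sums <- numeric(ngroups)            # groups with no data stay 0
  # rowsum() orders its rows by sort(unique(group)), matching the index
  sums[sort(unique(g[keep]))] <- rowsum(x[keep], g[keep])
  sums
}
```

For example, group_sums(c(1, 2, 3, 4), c(1, 2, NA, 2)) drops the NA-coded element and returns the per-group sums c(1, 6).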

----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."
#
Hi Bill,

The sort is also useful for recoding each group into subgroups based
on some other numeric vector.  This is the problem I run into when
trying to build portfolios that can be used as benchmarks for long-term
stock returns.

Another issue is that recoding a long character string that I use as
a sort key (for accessing a subgroup of the data in the data.frame)
into a set of small integers is not fast.  I can make a fast
implementation if the data is sorted by the key, but without the
sort, just converting my sort keys to the required small integer
codes would be expensive for very long vectors, since my small integer
codes would have to reflect the order of the data (i.e. be increasing
subportfolio numbers).

More specifically, I am now converting all of my SAS code to R, and
the problem is that I have lots of snippets of SAS that do the
following ...

PROC SORT;
   BY MDSIZ FSIZ;

/* WRITE OUT THE MIN SIZE CUTOFF VALUES */
PROC UNIVARIATE NOPRINT;
    VAR FSIZ;
    BY MDSIZ;
    OUTPUT OUT=TMPS1 MIN=XMIN;

where my sort key MDSIZ is a character string that is the
concatenation of the month-ending date (MD) and the size portfolio of
a particular firm (SIZ), and I want to find the cutoff points (the
minimums) for each portfolio, for every month-end date, across all
traded firms.
SAS is similar in that it also has a specific list of functions you
can request, including all of the basic stats from a PROC UNIVARIATE:
higher-moment stuff (skewness, kurtosis), robust statistics, and even
statistical test results for each coded subgroup.  The nice thing is
that all combinations can be done with one call.

To do that, SAS does require the presorting, but it does run
really fast even for super long vectors with lots of sort keys.

Similarly, the next snippet of code will take the file and re-sort it
by the portfolio key and then by the market-to-book ratio (MTB) for
all trading firms, for all monthly periods since 1980.  It will then
split each size portfolio for each month-ending date into 5 equal
portfolios based on market-to-book ratios (thus the need for the
sort).  SAS returns a coded integer vector PMTB (made up of 1s to 5s,
with 1s for the smallest MTB and 5s for the largest), repeated
for each subgroup of MDSIZ.  PMTB matches the original vector in
length and therefore fits right into the data frame.

/* SPLIT INTO Market to Book QUINTILES BY MDSIZ */
PROC SORT;
   BY MDSIZ MTB;
PROC RANK GROUPS=5 OUT=TMPS0;
    VAR MTB;
    RANKS PMTB;
    BY MDSIZ;

The problem of assigning elements of a long data vector to portfolios
and sub-portfolios, based on the values of specific data columns that
must be calculated at each step and are not fixed or hardcoded, is one
that finance runs into (and therefore I run into it).

So by sorting I could handle the need for "small integer" recoding,
and the small integers would have meaning (i.e. higher values would
represent larger-MTB firms, etc.).

That just leaves the problem of calculating stats on short
subsequences of a longer vector.
Yes!  That is exactly what I need.

Are there plans for adding something like that to R?

Thanks,

Kevin
#
On Fri, 28 Jul 2006, Kevin B. Hendricks wrote:
True, but the underlying grouped summary code
shouldn't require you to do the sorting.  If
    codes <- match(char, sort(unique(char)))
is too slow then you could try sorting the
data set by the 'char' column and doing
    codes <- cumsum(c(TRUE, char[-1] != char[-length(char)]))
(loop over columns before doing cumsum if there
are several columns).

If that isn't fast enough, then you could sort
in the C code as well, but I think there could
be lots of cases where that is slower.

I've used this code for out of core applications,
where I definitely do not want to sort the whole
dataset.
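The two recoding idioms can be checked side by side on a small sorted key vector; the cumsum form needs a leading TRUE so that the first element opens group 1 and the result has the same length as the input:

```r
# Sorted character keys (already grouped, as assumed above)
char <- c("a", "a", "b", "c", "c", "c")

# Idiom 1: match against the sorted unique values (no sort needed)
codes1 <- match(char, sort(unique(char)))

# Idiom 2: cumulative sum over the change points of the sorted vector;
# the leading TRUE starts group 1 at the first element
codes2 <- cumsum(c(TRUE, char[-1] != char[-length(char)]))

# both give 1 1 2 3 3 3
```

The cumsum form only works on sorted data, since it assigns a new code at every change point; match works on any ordering but pays for the sort of the unique values.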
#
Hi Bill,
Okay, after thinking about this ...

# assumes i is a small integer group code with values in 1:max(i)
# v is some long vector
# no sorting required

igroupSums <- function(v, i) {
   sums <- rep(0, max(i))
   for (j in seq_along(v)) {
       sums[i[j]] <- sums[i[j]] + v[j]
   }
   sums
}

if written in Fortran or C, it might be faster than using split.  It
is at least linear in time in the length of the vector v.  This
approach could easily be made parallel across t threads, simply by
picking t starting points along v and running this routine in
parallel on each piece.  You could even do it without thread locking
if the "sums" elements can be accessed atomically, or by creating
multiple copies of "sums" (one for each piece) and then doing a final
addition.
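The multiple-copies variant of that idea can be sketched serially in R: compute an independent partial sums vector per chunk, then add the partials, which gives the same answer because addition is associative (chunked_group_sums is a made-up name, and rowsum() stands in for the per-chunk accumulation loop):

```r
# Split v into nchunks contiguous pieces, accumulate per-group sums
# for each piece into its own copy, then add the partial results.
# Assumes i holds integer codes in 1:max(i) and nchunks <= length(v).
chunked_group_sums <- function(v, i, nchunks = 2) {
  n <- length(v)
  ngroups <- max(i)
  bounds <- floor(seq(0, n, length.out = nchunks + 1))
  total <- numeric(ngroups)
  for (k in seq_len(nchunks)) {
    idx <- (bounds[k] + 1):bounds[k + 1]
    # per-chunk partial sums (zeros where a group is absent)
    partial <- numeric(ngroups)
    partial[sort(unique(i[idx]))] <- rowsum(v[idx], i[idx])
    total <- total + partial          # the "final addition" step
  }
  total
}
```

In a real threaded C implementation each iteration of the outer loop would run concurrently with its own "partial" copy; only the final addition needs coordination.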

I still think I am missing some obvious way to do this but ...

Am I thinking along the right lines?

Kevin
#
On Sat, 29 Jul 2006, Kevin B. Hendricks wrote:
For sums you should look at rowsum().  It uses a hash table in C and,
last time I looked, was faster than using split().  It returns a
matrix with one row per group rather than a vector, but that is easily
fixed.

The same approach would work for min, max, range, count, mean, but not for 
arbitrary functions.

 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
#
Hi Thomas,

Here is a comparison of performance times for my own igroupSums
versus using split and rowsum:

 > x <- rnorm(2e6)
 > i <- rep(1:1e6,2)
 >
 > unix.time(suma <- unlist(lapply(split(x,i),sum)))
[1] 8.188 0.076 8.263 0.000 0.000
 >
 > names(suma)<- NULL
 >
 > unix.time(sumb <- igroupSums(x,i))
[1] 0.036 0.000 0.035 0.000 0.000
 >
 > all.equal(suma, sumb)
[1] TRUE
 >
 > unix.time(sumc <- rowsum(x,i))
[1] 0.744 0.000 0.742 0.000 0.000
 >
 > sumc <- sumc[,1]
 > names(sumc)<-NULL
 > all.equal(suma,sumc)
[1] TRUE


So my implementation of igroupSums is faster and already handles NAs.
I have also implemented igroupMins, igroupMaxs, igroupAnys,
igroupAlls, igroupCounts, igroupMeans, and igroupRanges.

The igroup functions I implemented do not handle weights yet, but they
do handle NAs properly.

Assuming I clean them up, is anyone in the R developer group interested?

Or would you rather I instead extend the rowsum approach to create
rowcount, rowmax, rowmin, etc. using a hash-function approach?

All of these approaches simply use different ways to map group
codes to integers and then compute the functions the same way.

Thanks,

Kevin