Any interest in "merge" and "by" implementations specifically for sorted data? - R-devel

Sat, Jul 29, 2006 1:05 PM #

Hi Bill,

So you wrote one routine that can calculate any single of a variety  
of stats and allows weights, is that right?  Can it return a data  
frame of any subset of requested stats as well (that is what I was  
thinking of doing anyway).

I think someone can easily calculate all of those things in one pass  
through the array and then allow the user to select which of the new  
columns of stats should be added to a data.frame that is returned to  
the user.

To test all of this, I simply wrote my own igroupSums and integrated  
it into r-devel based on the code in split.c.  I can easily modify it  
to handle the case of calculating  a variety of stats (even all at  
the same time if desired).  I do not deal with "weights" at all and  
ignored that for now.

Here is what your test case now shows on my machine with the latest R  
build
with my added igroupSums routine (added internally to R).

 > x <- rnorm(2e6)
 > i <- rep(1:1e6,2)
 > unix.time(Asums <- unlist(lapply(split(x,i),sum)))
[1] 8.940 0.112 9.053 0.000 0.000
 > names(Asums) <- NULL

My version of igroupSums does not keep the names so I remove them to  
make the results comparable.

Here is my my own internal function igroupSums

 > unix.time(Bsums <- igroupSums(x,i))
[1] 0.932 0.024 0.958 0.000 0.000
 >
 > all.equal(Asums, Bsums)
[1] TRUE


So the speed up is quite significant (9.053 seconds vs 0.858 seconds).

I will next modify my code to handle any single one of maxs, mins,  
sums, counts, anys, alls, means, prods, and ranges by user choice.   
Although I will leave the use of weights as unimplemented for now (I  
always get mixed up thinking about weights and basic stats and I  
never use them so ...)

In case others want to play around with this too, here is the R  
wrapper in igroupSums.R to put in src/library/base/R/


igroupSums <- function(x, f, drop = FALSE, ...) UseMethod("igroupSums")

igroupSums.default <- function(x, f, drop=FALSE, ...)
{
     if(length(list(...))) .NotYetUsed(deparse(...), error = FALSE)

     if (is.list(f)) f <- interaction(f, drop = drop)
     else if (drop || !is.factor(f)) # drop extraneous levels
         f <- factor(f)
     storage.mode(f) <- "integer"  # some factors have double
     if (is.null(attr(x, "class")))
         return(.Internal(igroupSums(x, f)))
     ## else
     r <- by(x,f,sum)
     r
}

igroupSums.data.frame <- function(x, f, drop = FALSE, ...)
     lapply(split(seq(length=nrow(x)), f, drop = drop, ...),
            function(ind) x[ind, , drop = FALSE])


And here is a very simple igroupSums.c to put in src/main/

It still needs a lot of work since it does not handle NAs in the  
vector x yet and still needs to be modified into a general routine to  
handle any single function of counts, sums, maxs, mins, means, prods,  
anys, alls, and ranges


#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#include "Defn.h"

SEXP attribute_hidden do_igroupSums(SEXP call, SEXP op, SEXP args,  
SEXP env)
{
     SEXP x, f, sums;
     int i, j, nobs, nlevs, nfac;

     checkArity(op, args);

     x = CAR(args);
     f = CADR(args);
     if (!isVector(x))
         errorcall(call, _("first argument must be a vector"));
     if (!isFactor(f))
         errorcall(call, _("second argument must be a factor"));
     nlevs = nlevels(f);
     nfac = LENGTH(CADR(args));
     nobs = LENGTH(CAR(args));
     if (nobs <= 0)
         return R_NilValue;
     if (nfac <= 0)
         errorcall(call, _("Group length is 0 but data length > 0"));
     if (nobs % nfac != 0)
         warningcall(call, _("data length is not a multiple of split  
variable"));

     PROTECT(sums = allocVector(TYPEOF(x), nlevs));
     switch (TYPEOF(x)) {
     case INTSXP:
         for (i=0; i < nlevs; i++) INTEGER(sums)[i] = 0;
         break;
     case REALSXP:
         for (i=0; i < nlevs; i++) REAL(sums)[i] = 0.0;
         break;
     default:
         UNIMPLEMENTED_TYPE("igroupSums", x);
     }
     for (i = 0;  i < nobs; i++) {
         j = INTEGER(f)[i % nfac];
         if (j != NA_INTEGER) {
             j--;
             switch (TYPEOF(x)) {
             case INTSXP:
                 INTEGER(sums)[j] = INTEGER(sums)[j] + INTEGER(x)[i];
                 break;
             case REALSXP:
                 REAL(sums)[j] = REAL(sums)[j] + REAL(x)[i];
                 break;
             default:
                 UNIMPLEMENTED_TYPE("igroupSums", x);
             }
         }
     }
     UNPROTECT(1);
     return sums;
}


If anyone is playing with this themselves, don't forget to update  
Internal.h and names.c to reflect the added routine before you make  
clean and then rebuild.

Once I finish, I will post me patches here and then if someone would  
like to modify them to implement "weights", please let me know.

Even if these never get added to R I can keep them in my own tree and  
use them for my own work.

Thanks again for all of your hints and guidance.  This alone will  
speed up my R code greatly!

Kevin

Tom Short

Sun, Jul 30, 2006 5:02 AM #

Kevin, starting with your idea of sorting first, you can get some speedups
just using R. Start by comparing the base case that Bill used:

[1] 11.00  0.16 11.28    NA    NA

Now, try sorting and using a loop:

+   for (j in 1:length(res)) {
+     res[j] <- FUN(x[startidx[j]:endidx[j]])
+   }
+   res
+ }

[1] 6.86 0.00 7.04   NA   NA

For the case of sum (or averages), you can vectorize this using cumsum as
follows. This won't work for median or max.

+   cum <- cumsum(x)
+   res <- cum[endidx]
+   res[2:length(res)] <- res[2:length(res)] - cum[endidx[1:(length(res) -
1)]]
+   res
+ }

[1] 0.20 0.00 0.21   NA   NA

You can also use Luke Tierney's byte compiler
(http://www.stat.uiowa.edu/~luke/R/compiler/) to speed up the loop for
functions where you can't vectorize:

Note: local functions used: FUN

[1] 3.84 0.00 3.91   NA   NA

- Tom

View this message in context: http://www.nabble.com/Any-interest-in-%22merge%22-and-%22by%22-implementations-specifically-for-sorted-data--tf2009595.html#a5562816
Sent from the R devel forum at Nabble.com.

Kevin B. Hendricks

Mon, Jul 31, 2006 6:41 AM #

Hi Tom,

I wonder how much time the sorting, reordering and creation os  
startidx and endidx would add to this time?

Either way, your code can nicely be used to quickly create the small  
integer factors I would need if the igroup functions get integrated.   
Thanks!

Yes that is a quite fast way to handle "sums".

That looks interesting.  Does it only work for specific operating  
systems and processors?  I will give it a try.

Thanks,

Kevin

Tom Short

Mon, Jul 31, 2006 8:57 AM #

Done interactively, sorting and indexing seemed fast. Here are some timings:

+            xs <- x[idx]
+            is <- i[idx]
+            res <- array(NA, 1e6)
+            idx <- which(diff(is) > 0)
+            startidx <- c(1, idx+1)
+            endidx <- c(idx, length(xs))
+          })
[1] 1.06 0.00 1.09   NA   NA

No, as far as I know, it works on all operating systems. Also, it gets a
little faster if you directly put the sum in the function:

+   for (j in 1:length(res)) {
+     res[j] <- sum(x[startidx[j]:endidx[j]])
+   }
+   res
+ }

[1] 2.67 0.03 2.95   NA   NA

- Tom

View this message in context: http://www.nabble.com/Any-interest-in-%22merge%22-and-%22by%22-implementations-specifically-for-sorted-data--tf2009595.html#a5578580
Sent from the R devel forum at Nabble.com.