On Tue, Feb 21, 2012 at 4:04 PM, Matthew Keller <mckellercran at gmail.com> wrote:
X <- read.big.matrix("file.loc.X", sep = " ", type = "double")
hap.indices <- bigsplit(X, 1:2)  # this runs for too long to be useful on these matrices
# I was then going to use a foreach loop to sum across the splits identified by bigsplit
How about just using foreach earlier in the process? E.g., split
file.loc.X into (80) sub-files and then run
read.big.matrix/bigsplit/sum inside %dopar%.
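A minimal sketch of that pre-split approach, with several stand-ins since I can't test your setup: base R's parallel::mclapply takes the place of foreach/%dopar%, plain read.table/tapply take the place of read.big.matrix/bigsplit, the 800x4 file is simulated, and the splitting is done in R here only for illustration (for a genuinely huge file you'd split with an external tool such as `split -l`):

```r
library(parallel)

# stand-in for file.loc.X: 800 rows, 4 columns
f <- tempfile()
M <- cbind(rep(1:4, each = 200), rep(1:2, times = 400), rnorm(800), rnorm(800))
write.table(M, f, row.names = FALSE, col.names = FALSE)

# split into sub-files of 100 lines each (in real use, do this outside R)
lines <- readLines(f)
subfiles <- tapply(lines, (seq_along(lines) - 1) %/% 100, function(chunk) {
  sf <- tempfile()
  writeLines(chunk, sf)
  sf
})

# per-subfile sums of column 3, keyed by the (col1, col2) combination
partial <- mclapply(subfiles, function(sf) {
  x <- read.table(sf)
  tapply(x[[3]], paste(x[[1]], x[[2]]), sum)
}, mc.cores = 2)  # mclapply forks; use mc.cores = 1 on Windows
```

Each element of `partial` is a named vector of group sums for one sub-file; a final pass adding entries that share a group name gives the overall tapply-style result.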
If splitting X beforehand is a problem, you could also use ?scan to
read in different chunks of the file, something like this (untested,
obviously):
# for X a matrix 800x4
lineind <- seq(1, 800, 100)  # index vector of the first line of each chunk
ReducedX <- foreach(i = 1:8) %dopar% {
  # skip is the number of lines to skip, hence lineind[i] - 1
  x <- scan('file.loc.X', what = list(double(0), double(0), double(0), double(0)),
            skip = lineind[i] - 1, nlines = 100)
  ...  # do your thing on x (aggregate/tapply etc.)
}
Hope this helps,
Elai.
SO - does anyone have ideas on how to deal with this problem - i.e.,
how to use a tapply()-like function on an enormous matrix? This isn't
necessarily a bigtabulate question (although if I screwed up using
bigsplit, let me know). If another package (e.g., an SQL package) can
do something like this efficiently, I'd like to hear about it and your
experiences using it.
Thank you in advance,
Matt
--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com