Skip to content

nested for loops too slow

4 messages · Morway, Eric, J Robertson-Burns, Thierry Onkelinx +1 more

#
The small example below works lighting-fast; however, when I run the same
script on my real problem, a 1Gb text file, the for loops have been running
for over 24 hrs and I have no idea if the processing is 10% done or 90%
done.  I have not been able to figure out a betteR way to code up the
material within the for loops at the end of the example below.  The
contents of divChng, the final product, are exactly what I'm after, but I
need help formulating more efficient R script, I've got two more 1Gb files
to process after the current one finishes, whenever that is...

I appreciate any insights/solutions, Eric

dat <- read.table(textConnection("ISEG  IRCH  div  gw
1  1  265  229
1  2  260  298
1  3  234  196
54  1  432  485
54  39  467  485
54  40  468  468
54  41  460  381
54  42  489  502
1  1  265  317
1  2  276  225
1  3  217  164
54  1  430  489
54  39  456  495
54  40  507  607
54  41  483  424
54  42  457  404
1  1  265  278
1  2  287  370
1  3  224  274
54  1  412  585
54  39  473  532
54  40  502  595
54  41  497  441
54  42  447  467
1  1  230  258
1  2  251  152
1  3  199  179
54  1  412  415
54  39  439  538
54  40  474  486
54  41  477  484
54  42  413  346
1  1  230  171
1  2  262  171
1  3  217  263
54  1  432  485
54  39  455  482
54  40  493  419
54  41  489  536
54  42  431  504
1  1  1002  1090
1  2  1222  1178
1  3  1198  1177
54  1  1432  1485
54  39  1876  1975
54  40  1565  1646
54  41  1455  1451
54  42  1427  1524
1  1  1002  968
1  2  1246  1306
1  3  1153  1158
54  1  1532  1585
54  39  1790  1889
54  40  1490  1461
54  41  1518  1536
54  42  1486  1585
1  1  1002  1081
1  2  1229  1262
1  3  1142  1241
54  1  1632  1659
54  39  1797  1730
54  40  1517  1466
54  41  1527  1589
54  42  1514  1612"),header=TRUE)

dat$seq <- ifelse(dat$ISEG==1 & dat$IRCH==1, 1, 0)
tmp <- diff(dat[dat$seq==1,]$div)!=0
dat$idx <- 0
dat[dat$seq==1,][c(TRUE,tmp),]$idx <- 1
dat$ts <- cumsum(dat$idx)
dat$iter <- ave(dat$seq, dat$ts,FUN=cumsum)
dat$ct <- seq(1:length(dat[,1]))

timeStep <- unique(dat$ts)
SEG <- unique(dat$ISEG)
divChng <- data.frame(ts=NA, ISEG=NA, divChng=NA, gwChng=NA, iter=NA)

#Can the following be rescripted for better harnessing R's processing power?

for (i in 1:length(timeStep)){
  for (j in 1:length(SEG)){
    datTS <- subset(dat,ts==timeStep[i] & ISEG==SEG[j] & IRCH==1)
    datGW <- subset(dat,ts==timeStep[i] & ISEG==SEG[j])
    grw <- aggregate(gw ~ iter, datGW, sum)

    DC <- max(datTS$div)-min(datTS$div)
    GRW <- max(grw$gw) - min(grw$gw)
    divChng <- rbind(divChng,c(datTS$ts[1], SEG[j], DC, GRW,
max(datTS$iter)))
  }
}
divChng <- divChng[!is.na(divChng$ISEG),]
#
You are certainly in Circle 2 of 'The R Inferno',
which I suspect is where almost all of the
computation time is coming from.

Instead of doing:

divChng <- rbind(divChng,c(datTS$ts[1], SEG[j], DC, GRW,
max(datTS$iter)))

it would be much better to create 'divChng' to
be the final length and then in the loop do:

divChng[count,] <- c(datTS$ts[1], SEG[j], DC, GRW,
max(datTS$iter))
count <- count + 1

You may also want to explore 'tapply'.
If this is something you will be doing
a lot of, then you should probably learn
the 'data.table' package.

http://www.burns-stat.com/documents/books/the-r-inferno/

Pat
On 12/04/2015 14:47, Morway, Eric wrote:
#
You don't need loops at all.

    grw <- aggregate(gw ~ ts + ISEG + iter, data = dat, FUN = sum)
    GRW <- aggregate(gw ~ ts + ISEG, data = grw, FUN = function(x){max(x) -
min(x)})
    DC <- aggregate(div ~ ts + ISEG, data = subset(dat, IRCH == 1), FUN =
function(x){max(x) - min(x)})
    iter <- aggregate(iter ~ ts + ISEG, data = subset(dat, IRCH == 1), FUN
= max)
    tmp <- merge(DC, iter)
    merge(tmp, GRW)

another option is to use the plyr package

    library(plyr)
    merge(
      ddply(
        subset(dat, IRCH == 1),
        c("ts", "ISEG"),
        summarize,
        divChng = max(div) - min(div),
        max.iter = max(iter)
      ),
      ddply(
        dat,
        c("ts", "ISEG"),
        summarize,
        gwChng = diff(range(ave(gw, iter, FUN = sum)))
      )
    )

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2015-04-12 15:47 GMT+02:00 Morway, Eric <emorway at usgs.gov>:

  
  
#
Well, sort of...

aggregate() is basically a wrapper for lapply(), which ultimately must loop
over the function call at the R interpreter level, as opposed to vectorized
functions that loop at the C level and hence can be orders of magnitude
faster. As a result, there is often little difference in efficiency between
explicit and *smart* (in the sense that Pat Burns has already pointed out
of not growing structures at each iteration,among other things)  for()
looping and apply-type calls. For some of us, the chief advantage of the
*apply idioms is that the code is more readable and maintainable, with R
handling fussy details of loop indexing, for example.lapply() is also more
in keeping with the functional programming paradigm.
?Others?
find both these "virtues" to be annoyances,
? ?
however, and prefer explicit *smart* looping. Chaque un ?
? ?
son go?
?t?
.

None of which necessarily denies the wisdom of the approach you've
suggested, however. It may indeed be considerably faster,
but timing will have to tell. I am just trying to correct
?(again) ?
the
? ?
widely held misperception
?that yo?
u
? seem to?
express
?.?


Cheers,
Bert


Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge is
certainly not wisdom."
Clifford Stoll



On Sun, Apr 12, 2015 at 1:48 PM, Thierry Onkelinx <thierry.onkelinx at inbo.be>
wrote: