ideas about how to reduce RAM & improve speed in trying to use lapply(strsplit())

6 messages · jim holtman, Joshua Wiley, Ian Gow +2 more

#
hi all,

I'm full of questions today :). Thanks in advance for your help!

Here's the problem:
x <- c('18x.6','12x.9','302x.3')

I want to get a vector that is c('18x','12x','302x')

This is easily done using this code:

unlist(lapply(strsplit(x,".",fixed=TRUE),function(x) x[1]))

So far so good. The problem is that x is a vector of length 132e6.
When I run the above code, it runs for > 30 minutes and takes > 23
GB of RAM (no kidding!).

Does anyone have ideas about how to speed up the code above and (more
importantly) reduce the RAM footprint? I'd prefer not to change the
file on disk using, e.g., awk, but I will do that as a last resort.

Best

Matt
#
Try this approach:
[1] "18x"  "12x"  "302x"
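[The code itself didn't survive the archive; a later reply notes that perl=TRUE "speeds up Jim's suggestion", so it was presumably a gsub() call along these lines. This is a reconstruction, not Jim's verbatim code:]

```r
x <- c('18x.6', '12x.9', '302x.3')
## keep everything before the first dot (reconstructed, not Jim's exact call)
gsub("^(.*?)\\..*$", "\\1", x)
# [1] "18x"  "12x"  "302x"
```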
On Sun, May 29, 2011 at 8:10 PM, Matthew Keller <mckellercran at gmail.com> wrote:
#
Hi Matt,

There are likely more efficient ways still, but this is a big
performance boost time-wise for me:

x <- c('18x.6','12x.9','302x.3')

gsub("\\.(.+$)", "", x)

x <- rep(x, 10^5)

## (the system.time() calls were stripped by the archive; they presumably
## timed the original strsplit() approach against the gsub() one, then
## checked that the two results agree)
user  system elapsed
   2.89    0.03    2.96
user  system elapsed
   0.57    0.00    0.59
[1] TRUE


Cheers,

Josh
On Sun, May 29, 2011 at 5:10 PM, Matthew Keller <mckellercran at gmail.com> wrote:
#
Not a new approach, but some benchmark data (the perl=TRUE speeds up Jim's
suggestion):
user  system elapsed
  1.203   0.018   1.222
user  system elapsed
  0.176   0.001   0.176
[1] TRUE
user  system elapsed
  0.292   0.001   0.291
[1] TRUE
user  system elapsed
  0.160   0.001   0.161
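[The system.time() calls behind these numbers were also stripped by the archive. A sketch of the comparison presumably being run -- the exact calls and variable names (r1, r2, r3) are assumptions -- together with the identical() checks that produced the TRUEs:]

```r
x <- rep(c('18x.6', '12x.9', '302x.3'), 10^5)

## original strsplit() approach
system.time(r1 <- unlist(lapply(strsplit(x, ".", fixed = TRUE),
                                function(p) p[1])))
## Josh's gsub() pattern
system.time(r2 <- gsub("\\.(.+$)", "", x))
## same idea with perl = TRUE (the speed-up mentioned above)
system.time(r3 <- gsub("^(.*?)\\..*$", "\\1", x, perl = TRUE))

identical(r1, r2)
# [1] TRUE
identical(r1, r3)
# [1] TRUE
```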
On 5/29/11 7:40 PM, "jim holtman" <jholtman at gmail.com> wrote:
#
God this listserv is awesome. Thanks to everyone for their ideas.
I'll speed & memory test tomorrow and change the code. Thanks again!

Matt
On Sun, May 29, 2011 at 6:44 PM, Ian Gow <iandgow at gmail.com> wrote:
#
On 2011-05-29 23:08, Matthew Keller wrote:
Since you're dealing with a vector of ~ 1e8 elements, you might
find that (at a probably small cost of time) you can reduce the
memory requirements by processing the vector in pieces:

## adjust n to suit trade-off between memory usage and time
n <- 100
k <- ceiling(length(x) / n)  # chunk size; ceiling() handles lengths not divisible by n
L <- vector("list", n)
for( i in 1:n ) {
   lo <- (i - 1) * k + 1
   if (lo > length(x)) break  # no elements left for this chunk
   y <- x[lo:min(i * k, length(x))]
   L[[i]] <- gsub("^(.*?)\\..*$","\\1",y, perl=TRUE)
}
newx <- unlist(L)
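[Not from the thread, but the same chunked processing can be sketched with split() and a grouping factor, which sidesteps the index arithmetic; chunk_size is an arbitrary illustration value:]

```r
x <- rep(c('18x.6', '12x.9', '302x.3'), 7)  # toy data; length need not divide evenly

chunk_size <- 5
## group indices 1..chunk_size -> 1, next chunk_size -> 2, etc.
pieces <- split(x, ceiling(seq_along(x) / chunk_size))
newx <- unlist(lapply(pieces, gsub, pattern = "^(.*?)\\..*$",
                      replacement = "\\1", perl = TRUE),
               use.names = FALSE)

identical(newx, gsub("^(.*?)\\..*$", "\\1", x, perl = TRUE))
# [1] TRUE
```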


Peter Ehlers