
Why are big data.frames slow? What can I do to get it faster?

6 messages · Marcus Jellinghaus, Uwe Ligges, Thomas Lumley +1 more

#
Hello,

I'm quite new to this list.
I have a high-frequency dataset with more than 500,000 records.
I want to edit a data.frame "test". My small program runs fine with a small
part of the dataset (just 100 records), but it is very slow with a huge
dataset. Of course it gets slower with more records, but when I change just
the size of the frame and keep the number of edited records fixed, I see
that it also gets slower.

Here is my program:

print(dim(test)[1])
Sys.time()
for(i in 1:100) {
  test[i,6] = paste(test[i,2],"-",test[i,3], sep = "")
}
Sys.time()

I join 2 currency symbols into a currency pair.
I always calculate only the first 100 lines.
When I load just 100 lines into the data.frame "test", it takes 1 second.
When I load 1,000 lines, editing 100 lines takes 2 seconds,
10,000 lines loaded and editing 100 lines takes 5 seconds,
100,000 lines loaded and editing 100 lines takes 31 seconds,
500,000 lines loaded and editing 100 lines takes 11 minutes(!!!).

My computer has 1 GB RAM, so that shouldn't be the reason.

Of course, I could work with many small data.frames instead of one big one, but
the program above is just the very first step and so I don't want to split.

Is there a way to edit big data.frames without waiting for a long time?


Thanks a lot for your help,


Marcus

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
Marcus Jellinghaus wrote:
Well, the point is, I guess, that addressing elements in a large data.frame
reasonably takes much more time than in a small one.

Maybe it's an idea to use vectorized operations instead of the loop,
which is preferable if your computation is easily vectorizable without a
big penalty in memory consumption:

 test[1:100, 6] <- paste(test[1:100, 2], "-", test[1:100, 3], sep = "")
or 
 test[ , 6] <- paste(test[ , 2], "-", test[ , 3], sep = "")
for the whole data.frame.
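
For reference, a minimal sketch of the two forms side by side. The tiny data frame and its column names (ccy1, ccy2, pair) are made up for illustration and are not the poster's data. The sketch only checks that the row-by-row loop and the vectorized assignment fill the column with the same values.

```r
# Hypothetical stand-in for the poster's data (column names are assumptions)
test <- data.frame(
  ccy1 = c("EUR", "USD", "GBP"),
  ccy2 = c("USD", "JPY", "CHF"),
  pair = NA_character_,
  stringsAsFactors = FALSE
)

# Row-by-row version, as in the original post
looped <- test
for (i in seq_len(nrow(looped))) {
  looped[i, 3] <- paste(looped[i, 1], "-", looped[i, 2], sep = "")
}

# Vectorized version: one call to paste(), one assignment
vectorized <- test
vectorized[, 3] <- paste(vectorized[, 1], "-", vectorized[, 2], sep = "")

# Both approaches must produce the same column
stopifnot(identical(looped$pair, vectorized$pair))
print(vectorized$pair)
```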

Uwe Ligges
#
On Sun, 6 Oct 2002, Marcus Jellinghaus wrote:

            
R 1.6.0 has faster data frame indexing.  Also, there's no need to do this one
line at a time:
  i<-1:100
  test[i,6]<-paste(test[i,2],test[i,3],sep="-")
should be quite a bit faster.

	-thomas
Thomas Lumley			Asst. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
^^^^^^^^^^^^^^^^^^^^^^^^
- NOTE NEW EMAIL ADDRESS


#
First I want to say "thank you" to everybody who replied.
I understand that vectorized operations are faster than the loop.
I also made sure not to use factors.

Since the loop runs 100 times in my example, the loop should take at most the
time of the vectorized operation multiplied by 100.
But the loop takes about 10 minutes, while the vectorized operation takes about 3
seconds (see below).
Why is that? Shouldn't the loop take at most 100 * 3 seconds = 5 minutes?

I'm interested in this because I think that I will have computations that
are easily vectorizable (like this example) and computations
that are not vectorizable, or only with great difficulty.

Marcus Jellinghaus
[1] 500000
[1] "2002-10-07 06:17:33 Eastern Sommerzeit"
[1] "2002-10-07 06:17:35 Eastern Sommerzeit"

[..]
[1] 500000
[1] "2002-10-07 06:05:29 Eastern Sommerzeit"
+   test[i,6] = paste(test[i,2],"-",test[i,3], sep = "")
+ }
[1] "2002-10-07 06:15:17 Eastern Sommerzeit"


-----Original Message-----
From: Uwe Ligges [mailto:ligges at statistik.uni-dortmund.de]
Sent: Sunday, October 06, 2002 1:58 PM
To: Marcus Jellinghaus
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Why are big data.frames slow? What can I do to get it
faster?
#
"Marcus Jellinghaus" <Marcus_Jellinghaus at gmx.de> writes:
You'll likely have to invoke the garbage collector a couple of times,
and there might also be issues of memory growth kicking in. Once you
get beyond some threshold, the machine starts swapping bits and pieces
of the workspace in and out of physical memory.

It's somewhat difficult to reproduce the behaviour, since you only give
part of the code necessary (e.g. how many *columns* do you have in
your data frame?) 

Something like this?

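N <- 100000
# Six numeric columns of random data
test <- as.data.frame(lapply(1:6, function(i) rnorm(N)))
# Vectorized: one assignment covering rows 1 to 100
system.time(test[1:100, 6] <- paste(test[1:100, 2], "-", test[1:100, 3], sep = ""))
# Loop: 100 separate single-row assignments
system.time(for (i in 1:100) test[i, 6] <- paste(test[i, 2], "-", test[i, 3], sep = ""))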

(Using N==500000 made my little desktop swap like crazy, but the above
gave something like 2s CPU time for the 1st case and 92s CPU + 23s
system for the other one with R 1.6.0)
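
A variation on the comparison above, as a rough sketch: it repeats the same 100 single-row edits on frames of increasing size, which is the effect described in the first post. The helper name time_loop, the column names and the sizes are assumptions chosen so the sketch finishes quickly. Timings depend heavily on the R version and the machine, so no expected numbers are given.

```r
# Time 100 single-row edits on a frame with n rows (hypothetical helper)
time_loop <- function(n, rows = 100) {
  test <- data.frame(a = rep("EUR", n), b = rep("USD", n),
                     pair = rep("", n), stringsAsFactors = FALSE)
  timing <- system.time(
    for (i in seq_len(rows)) {
      test[i, 3] <- paste(test[i, 1], "-", test[i, 2], sep = "")
    }
  )
  timing[["elapsed"]]
}

# Frame sizes are assumptions, kept small on purpose
sizes <- c(100, 1000, 10000)
elapsed <- vapply(sizes, time_loop, numeric(1))

# One row per frame size: the number of edited rows stays fixed at 100
print(data.frame(rows_loaded = sizes, seconds = elapsed))
```

If the elapsed time grows with rows_loaded while the number of edited rows stays at 100, the cost is coming from handling the whole frame on each assignment rather than from paste() itself.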
#
I wanted to know why non-vectorized operations are slow.
Thank you for your suggestions.
I did three things:
- Besides looking at the total computation time, I analyzed the
  garbage collection time (gc()).
- I told R to use more memory. I use version 1.6.0 and started it with the
  command "Rgui --min-vsize=600M --min-nsize=10M".
- I used test$Fieldname[i] instead of test[i, 6].
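
One possible way to separate garbage collection time from the rest, sketched with base R's gc.time() and system.time(). This is an assumption about how such a measurement could be done, not necessarily how the numbers in this message were obtained. The frame size and column names are made up.

```r
n <- 10000  # assumed size, small enough to run quickly
test <- data.frame(a = rep("EUR", n), b = rep("USD", n),
                   pair = rep("", n), stringsAsFactors = FALSE)

# gc.time() reports the cumulative time spent in garbage collection so far
gc_before <- gc.time()

# gcFirst = FALSE so that system.time() does not trigger an extra collection
total <- system.time(
  for (i in 1:100) {
    test[i, 3] <- paste(test[i, 1], "-", test[i, 2], sep = "")
  },
  gcFirst = FALSE
)

# Difference = garbage collection time spent during the loop
gc_spent <- gc.time() - gc_before

# The third entry of gc.time() is elapsed time, in seconds
cat("total elapsed:", total[["elapsed"]], "s\n")
cat("GC elapsed:   ", gc_spent[3], "s\n")
```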

My results show that it saves a lot of time when I use enough memory and the
field names. So thanks a lot!

Here are the details:
Without field names and without more memory:
GC time: 494 seconds, other calculations: 124 seconds, total: 619 seconds.

Without field names, with "Rgui --min-vsize=600M --min-nsize=10M":
GC time: 34 seconds, other calculations: 114 seconds, total: 148 seconds.

With field names, without more memory:
GC time: 0.5 seconds, other calculations: 2 seconds, total: 2.5 seconds
(but a long time for loading the matrix).

With field names, with "Rgui --min-vsize=600M --min-nsize=10M":
GC time: < 1 second, other calculations: < 1 second, total: < 1 second.
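
The two loop styles compared above can be sketched as follows, together with a third variant that is not from this thread: copying the columns into plain vectors, looping over those, and writing the result back once. The column names CCY1, CCY2 and CCYPair follow the summary at the end of this message; the data, the frame size and the helper make_frame are invented for illustration.

```r
n <- 10000  # assumed size, small enough to run quickly

# Hypothetical helper that builds a fresh frame for each variant
make_frame <- function() {
  data.frame(CCY1 = rep("EUR", n), CCY2 = rep("USD", n),
             CCYPair = rep("", n), stringsAsFactors = FALSE)
}

# Variant 1: matrix-style indexing, test[i, j]
by_index <- make_frame()
t_index <- system.time(for (i in 1:100) {
  by_index[i, 3] <- paste(by_index[i, 1], "-", by_index[i, 2], sep = "")
})

# Variant 2: field names, test$Fieldname[i]
by_name <- make_frame()
t_name <- system.time(for (i in 1:100) {
  by_name$CCYPair[i] <- paste(by_name$CCY1[i], "-", by_name$CCY2[i], sep = "")
})

# Variant 3 (not from the thread): loop over plain vectors, assign back once
by_vector <- make_frame()
t_vector <- system.time({
  ccy1 <- by_vector$CCY1
  ccy2 <- by_vector$CCY2
  pair <- by_vector$CCYPair
  for (i in 1:100) pair[i] <- paste(ccy1[i], "-", ccy2[i], sep = "")
  by_vector$CCYPair <- pair
})

# All three variants must fill the column identically
stopifnot(identical(by_index$CCYPair, by_name$CCYPair),
          identical(by_name$CCYPair, by_vector$CCYPair))

# Elapsed seconds per variant (values depend on R version and machine)
print(c(index = t_index[["elapsed"]],
        name = t_name[["elapsed"]],
        vector = t_vector[["elapsed"]]))
```

The third variant may be worth trying for computations that cannot be vectorized, since the loop body then touches only plain vectors and the data frame is modified a single time.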

Marcus Jellinghaus



    datetime                        CCY1               CCY2               Bid               Ask              CCYPair
 Min.   :2002-05-28 00:00:02   Length:500000      Length:500000      Min.   :  0.557   Min.   :  0.5574   Length:500000
 1st Qu.:2002-05-28 17:30:47   Mode  :character   Mode  :character   1st Qu.:  1.532   1st Qu.:  1.5319   Mode  :character
 Median :2002-05-29 14:43:02                                         Median :  4.047   Median :  4.0476
 Mean   :2002-05-29 14:42:36                                         Mean   : 38.664   Mean   : 38.6858
 3rd Qu.:2002-05-30 10:22:30                                         3rd Qu.: 32.888   3rd Qu.: 32.8891
 Max.   :2002-05-31 02:58:54                                         Max.   :182.150   Max.   :182.3000
