Hello,
I'm quite new to this list.
I have a high-frequency dataset with more than 500,000 records.
I want to edit a data.frame "test". My small program runs fine with a small
part of the dataset (just 100 records), but it is very slow with a huge
dataset. Of course it gets slower with more records, but when I change just
the size of the frame and keep the number of edited records fixed, I see
that it also gets slower.
Here is my program:
print(dim(test)[1])
Sys.time()
for(i in 1:100) {
test[i,6] = paste(test[i,2],"-",test[i,3], sep = "")
}
Sys.time()
I join two currency symbols into a currency pair.
I always calculate only the first 100 lines.
When I load just 100 lines into the data.frame "test", it takes 1 second.
When I load 1,000 lines, editing 100 lines takes 2 seconds,
with 10,000 lines loaded, editing 100 lines takes 5 seconds,
with 100,000 lines loaded, editing 100 lines takes 31 seconds,
with 500,000 lines loaded, editing 100 lines takes 11 minutes(!!!).
My computer has 1 GB of RAM, so that shouldn't be the reason.
Of course, I could work with many small data.frames instead of one big one, but
the program above is just the very first step, so I don't want to split.
Is there a way to edit big data.frames without waiting for a long time?
Thanks a lot for your help,
Marcus
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Why are big data.frames slow? What can I do to get it faster?
6 messages · Marcus Jellinghaus, Uwe Ligges, Thomas Lumley +1 more
Well, the point is, I guess, that addressing elements in a large data.frame reasonably takes much more time than in a small one.
Maybe it's an idea to use vectorized operations instead of the loop. That is preferable if your computation is easily vectorizable without a big penalty in memory consumption:
test[1:100, 6] <- paste(test[1:100, 2], "-", test[1:100, 3], sep = "")
or
test[ , 6] <- paste(test[ , 2], "-", test[ , 3], sep = "")
for the whole data.frame.
Uwe Ligges
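A minimal sketch of the comparison Uwe describes, assuming a small synthetic data frame. The column names and values are made up for illustration and are not Marcus's data. It fills the pair column once row by row and once with a single vectorized call, then checks that both give the same result.

```r
# Hypothetical stand-in for the real data: two character columns of currency codes.
n <- 10000
test <- data.frame(
  CCY1 = sample(c("EUR", "GBP", "USD"), n, replace = TRUE),
  CCY2 = sample(c("JPY", "CHF", "CAD"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
test$CCYPair <- NA_character_

# Row-by-row version: one data frame assignment per row.
loop_result <- test
for (i in 1:100) {
  loop_result[i, "CCYPair"] <- paste(loop_result[i, "CCY1"], "-",
                                     loop_result[i, "CCY2"], sep = "")
}

# Vectorized version: one assignment covering all 100 rows.
vec_result <- test
idx <- 1:100
vec_result[idx, "CCYPair"] <- paste(vec_result[idx, "CCY1"], "-",
                                    vec_result[idx, "CCY2"], sep = "")

# Compare the resulting columns of the two approaches.
print(identical(loop_result$CCYPair, vec_result$CCYPair))
```

The vectorized form calls the data frame assignment once instead of 100 times, which is the saving the reply points to.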
On Sun, 6 Oct 2002, Marcus Jellinghaus wrote:
Here is my program:
print(dim(test)[1])
Sys.time()
for(i in 1:100) {
test[i,6] = paste(test[i,2],"-",test[i,3], sep = "")
}
Sys.time()
R 1.6.0 has faster data frame indexing.
Also, there's no need to do this one line at a time:
i <- 1:100
test[i, 6] <- paste(test[i, 2], test[i, 3], sep = "-")
should be quite a bit faster.
-thomas
Thomas Lumley                   Asst. Professor, Biostatistics
tlumley at u.washington.edu     University of Washington, Seattle
^^^^^^^^^^^^^^^^^^^^^^^^ - NOTE NEW EMAIL ADDRESS
First I want to say "thank you" to everybody who replied.
I understand that vectorized operations are faster than the loop. I also made sure not to use factors.
Since the loop runs 100 times in my example, I would expect the loop to take at most the time of the vectorized operation multiplied by 100. But the loop takes about 10 minutes, while the vectorized operation takes about 3 seconds (see below).
Why is that? Shouldn't the loop take at most 100 * 3 seconds = 5 minutes?
I'm interested in this because I expect to have some computations that are easily vectorizable (like this example) and others that are difficult or impossible to vectorize.
Marcus Jellinghaus
print(dim(test)[1])
[1] 500000
Sys.time()
[1] "2002-10-07 06:17:33 Eastern Sommerzeit"
test[1:100,6] = paste(test[1:100,2],"-",test[1:100,3], sep = "")
Sys.time()
[1] "2002-10-07 06:17:35 Eastern Sommerzeit"
[..]
print(dim(test)[1])
[1] 500000
Sys.time()
[1] "2002-10-07 06:05:29 Eastern Sommerzeit"
for(i in 1:100) {
+ test[i,6] = paste(test[i,2],"-",test[i,3], sep = "")
+ }
Sys.time()
[1] "2002-10-07 06:15:17 Eastern Sommerzeit"
-----Original Message-----
From: Uwe Ligges [mailto:ligges at statistik.uni-dortmund.de]
Sent: Sunday, October 06, 2002 1:58 PM
To: Marcus Jellinghaus
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Why are big data.frames slow? What can I do to get it faster?
"Marcus Jellinghaus" <Marcus_Jellinghaus at gmx.de> writes:
First I want to say "thank you" to everybody who replied. I understand that vectorized operations are faster than the loop. I also made sure not to use factors. Since the loop runs 100 times in my example, the loop should only take the time of the vectorized operation multiplied by 100. But the loop takes about 10 minutes, the vectorized operation takes about 3 seconds. (See below) Why is that? Shouldn't the loop take at most 100 * 3 seconds = 5 minutes?
You'll likely have to invoke the garbage collector a couple of times, and there might also be issues of memory growth kicking in. Once you get beyond some threshold, the machine starts swapping bits and pieces of the workspace in and out of physical memory.
It's somewhat difficult to reproduce the behaviour, since you only give part of the code necessary (e.g. how many *columns* do you have in your data frame?). Something like this?
N <- 100000
test <- as.data.frame(lapply(1:6,function(i)rnorm(N)))
unix.time(test[1:100,6] <- paste(test[1:100,2],"-",test[1:100,3], sep = ""))
unix.time(for (i in 1:100) test[i,6] <- paste(test[i,2],"-",test[i,3], sep = ""))
(Using N==500000 made my little desktop swap like crazy, but the above gave something like 2s CPU time for the 1st case and 92s CPU + 23s system for the other one with R 1.6.0)
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
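To see how much of the loop's run time goes to garbage collection, as Peter suggests it might, something along these lines could be used. This is only a sketch: it uses `system.time()` rather than `unix.time()` (an assumption, since `unix.time` may not be available in current R versions), a smaller `N` than in the thread so that it finishes quickly, and made-up column names `V1` to `V6`.

```r
# Synthetic data frame, smaller than in the thread so the sketch runs quickly.
N <- 20000
test <- as.data.frame(lapply(1:6, function(i) rnorm(N)))
names(test) <- paste0("V", 1:6)  # hypothetical column names

# Cumulative time spent in the garbage collector before the loop.
gc_before <- gc.time()

# system.time() reports user CPU, system CPU and elapsed seconds for the expression.
loop_time <- system.time(
  for (i in 1:100) test[i, 6] <- paste(test[i, 2], "-", test[i, 3], sep = "")
)

# Cumulative garbage collection time after the loop.
gc_after <- gc.time()

print(loop_time)
# Garbage collection time accumulated while the loop ran (can be zero for small N).
print(gc_after - gc_before)
```

Increasing `N` step by step should show whether the garbage collection share grows with the size of the frame on a given machine.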
I wanted to know why non-vectorized operations are slow. Thank you for your suggestions.
I did three things:
- Besides looking at the total computation time, I analyzed the garbage collection time (gc()).
- I told R to use more memory. I use version 1.6.0 and started it with the command "Rgui --min-vsize=600M --min-nsize=10M".
- I used test$Fieldname[i] instead of test[i, 6].
My results show that it saves a lot of time when I use enough memory and the field names. So thanks a lot!
Here are the details:
Without field names and without more memory:
GC time: 494 seconds, other calculations: 124 seconds, total: 619 seconds.
Without field names, with "Rgui --min-vsize=600M --min-nsize=10M":
GC time: 34 seconds, other calculations: 114 seconds, total: 148 seconds.
With field names, without more memory:
GC time: 0.5 seconds, other calculations: 2 seconds, total: 2.5 seconds (but a long time for loading the matrix).
With field names, with "Rgui --min-vsize=600M --min-nsize=10M":
GC time: < 1 second, other calculations: < 1 second, total: < 1 second.
Marcus Jellinghaus
Peter Dalgaard writes:
You'll likely have to invoke the garbage collector a couple of times, and there might also be issues of memory growth kicking in. Once you get beyond some threshold, the machine starts swapping bits and pieces of the workspace in and out of physical memory,
Andy Liaw writes:
If you are on Windows and using an R version prior to 1.6.0, make sure R can use all 1 GB of the RAM, as the default is to use up to 256 MB or physical RAM, whichever is smaller. In R 1.6.0, that limit is raised to the smaller of 1 GB and physical RAM.
[..]
Extracting from a data frame one element at a time the way you did is expensive. I.e., test[i, 6] is slower than test$whatever[i].
Peter Dalgaard writes:
It's somewhat difficult to reproduce the behaviour, since you only give part of the code necessary (e.g. how many *columns* do you have in your data frame?)
summary(test)
datetime                      CCY1               CCY2               Bid               Ask                CCYPair
Min.   :2002-05-28 00:00:02   Length:500000      Length:500000      Min.   :  0.557   Min.   :  0.5574   Length:500000
1st Qu.:2002-05-28 17:30:47   Mode  :character   Mode  :character   1st Qu.:  1.532   1st Qu.:  1.5319   Mode  :character
Median :2002-05-29 14:43:02                                         Median :  4.047   Median :  4.0476
Mean   :2002-05-29 14:42:36                                         Mean   : 38.664   Mean   : 38.6858
3rd Qu.:2002-05-30 10:22:30                                         3rd Qu.: 32.888   3rd Qu.: 32.8891
Max.   :2002-05-31 02:58:54                                         Max.   :182.150   Max.   :182.3000
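For computations that have to stay in a loop, the difference between test[i, 6] and test$Fieldname[i] reported in this thread can be tried out with a sketch like the following. It assumes a synthetic data frame that reuses three of the column names from the summary above (`CCY1`, `CCY2`, `CCYPair`) with made-up values. Timings will vary by machine and R version, so none are quoted here.

```r
# Hypothetical data using the column names from the summary; values are made up.
n <- 20000
test <- data.frame(
  CCY1 = rep(c("EUR", "GBP"), length.out = n),
  CCY2 = rep(c("USD", "JPY", "CHF"), length.out = n),
  CCYPair = NA_character_,
  stringsAsFactors = FALSE
)

# Variant 1: row/column indexing on the data frame in every iteration.
by_index <- test
t_index <- system.time(
  for (i in 1:100) {
    by_index[i, "CCYPair"] <- paste(by_index[i, "CCY1"], by_index[i, "CCY2"], sep = "-")
  }
)

# Variant 2: access each column by name with $ and index the resulting vector.
by_name <- test
t_name <- system.time(
  for (i in 1:100) {
    by_name$CCYPair[i] <- paste(by_name$CCY1[i], by_name$CCY2[i], sep = "-")
  }
)

print(t_index)
print(t_name)
# Both variants should fill the column with the same values.
print(identical(by_index$CCYPair, by_name$CCYPair))
```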