equipment
"Ruud H. Koning" <info at rhkoning.com> writes:
Hello, it is likely that I will have to analyze a rather sizeable dataset: 60,000 records, 10 to 15 variables. I will have to compute descriptive statistics and estimate linear models, GLMs, and maybe a Cox proportional hazards model with time-varying covariates. In theory this is possible in R, but I would like to get some feedback on the equipment I should get for it. At the moment I have a Pentium III laptop running Windows 2000 with 384 MB of RAM. What CPU speed and/or how much memory should I get? Thanks for some ideas,
Ruud
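For a rough sense of the memory side of the question, here is a back-of-envelope sketch (not from the thread; it only uses base R and the fact that a numeric value occupies 8 bytes):

## Raw footprint of a 60000 x 15 numeric dataset, in MB:
60000 * 15 * 8 / 2^20          # about 6.9 MB
## Confirm against an actual object of that shape:
print(object.size(matrix(rnorm(60000 * 15), 60000, 15)), units = "MB")

So the raw data are small relative to 384 MB; the practical question is how many transient copies model fitting makes, which the gc() output in the reply below speaks to.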
Except for the time-varying Cox thing, this doesn't seem too hard:
> d <- as.data.frame(matrix(rnorm(60000*15), 60000, 15))
> names(d)
 [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
[13] "V13" "V14" "V15"
> system.time(lm(V15 ~ ., data = d))
[1] 2.62 0.61 3.24 0.00 0.00
> gc()
          used (Mb) gc trigger (Mb)
Ncells  431614 11.6     741108 19.8
Vcells 1079809  8.3    6817351 52.1

That's on the fastest machine I have access to, a 2.8 GHz Xeon (dual, but without a threaded BLAS library); it is about three times slower on a 900 MHz PIII. For a GLM you'll do similar operations iterated, say, 5 times, and if you have factors and interactions among your predictors, the cost grows essentially in proportion to the number of parameters in the model. Time-dependent Cox regression in full generality has complexity proportional to the square of the size of the data set (one regression computation per death), which could be prohibitive, but there are often simplifications, depending on the nature of the time dependency; see the sketch below.
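A minimal sketch of both points (the simulated sizes, variable names, and event rates here are invented; the survival package is assumed):

library(survival)
set.seed(1)

## GLM cost: each IRLS iteration is roughly one weighted lm()-sized fit,
## and glm() usually converges in about 5 iterations.
d <- as.data.frame(matrix(rnorm(60000 * 15), 60000, 15))
d$y <- rbinom(60000, 1, plogis(d$V1))
system.time(glm(y ~ ., data = d, family = binomial))

## Time-dependent covariates via the counting-process (start, stop]
## form: one row per interval on which a subject's covariates are
## constant, with the event indicator set on the row where it occurs.
n  <- 1000                          # invented number of subjects
t1 <- runif(n, 1, 2)                # time at which the covariate changes
t2 <- t1 + runif(n, 1, 2)           # end of follow-up
td <- data.frame(
  id    = rep(1:n, each = 2),
  start = as.vector(rbind(0, t1)),  # intervals (0, t1] and (t1, t2]
  stop  = as.vector(rbind(t1, t2)),
  x     = rnorm(2 * n),             # covariate value on each interval
  event = rep(0:1, n) * rbinom(2 * n, 1, 0.5)  # events only at end of follow-up
)
coxph(Surv(start, stop, event) ~ x, data = td)

Whether this stays tractable depends, as noted above, on how often the covariates change per subject: the data set grows by one row per change, whereas the completely general case needs a full risk-set computation per death.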
Peter Dalgaard
Dept. of Biostatistics, University of Copenhagen
Blegdamsvej 3, 2200 Cph. N, Denmark
Ph: (+45) 35327918   FAX: (+45) 35327907
(p.dalgaard at biostat.ku.dk)