#########################################################################
It worked all right, but I'm just wondering if there is a more efficient
way (it takes about 10 minutes to run the above scripts, for my 300,000 x
25 CSV file)?
It should be quicker not to convert to a data frame. You can just keep
the data as a list of vectors and lapply() the summary() function.
-thomas
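
A minimal sketch of the list-based approach suggested above. The temporary CSV file and its three columns are made up for illustration (the names merely echo the summary output in this thread); the real file and script are not shown here.

```r
# Hypothetical three-column CSV standing in for the real 300,000 x 25 file.
csv <- tempfile(fileext = ".csv")
writeLines(c("usage,mileage,sex",
             "S,5000,F",
             "SC,8000,M",
             "S,10000,M"), csv)

# scan() with a list for 'what' returns a list of vectors, one per field.
dat <- scan(csv, what = list(usage = "", mileage = 0, sex = ""),
            sep = ",", skip = 1, quiet = TRUE)

# Turn the character fields into factors so that summary() tabulates them,
# then summarise each component without ever building a data frame.
dat$usage <- factor(dat$usage)
dat$sex   <- factor(dat$sex)
summaries <- lapply(dat, summary)
print(summaries)
```

The result is a plain list of per-column summaries, so the as.data.frame() step is avoided entirely.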
 usage       mileage         sex        excess       ncd        drivers
 S :125788   Min.   :  288   F: 82208   0  :  4744   0:   880   1:100791
 SB: 12581   1st Qu.: 5000   M:217792   100:161311   1:  2819   2:175100
 SC:161524   Median : 8000              75 :133945   2:  5245   3: 19146
 ST:   107   Mean   : 7640                           3:  5230   4:  4156
             3rd Qu.:10000                           4:285826   5:   515
             Max.   :40000                                      6:    69
                                                                7:   223
 district        cargroup        car.age          wsclms     adclms
 6      :59053   8      :44524   Min.   :-1.000   0:294521   0:292852
 5      :57113   6      :39171   1st Qu.: 4.000   1:  5267   1:  6720
 7      :51166   9      :38965   Median : 7.000   2:   201   2:   405
 4      :50643   7      :35139   Mean   : 7.234   3:    11   3:    23
 3      :33041   10     :31091   3rd Qu.:10.000
 8      :16437   5      :27456   Max.   :30.000
 (Other):32547   (Other):83654
 ftclms     pdclms     piclms     adincur            pdincur
 0:298661    :281056    :281056   Min.   :    0.00   Min.   : -4985.2
 1:  1316   0: 15277   0: 18131   1st Qu.:    0.00   1st Qu.:     0.0
 2:    22   1:  3587   1:   809   Median :    0.00   Median :     0.0
 3:     1   2:    79   2:     4   Mean   :   21.25   Mean   :   225.4
            3:     1              3rd Qu.:    0.00   3rd Qu.:     0.0
                                  Max.   :13779.55   Max.   : 25050.0
                                                     NA's   :281056.0
 wsincur           ftincur             piincur            days
 Min.   :   0.00   Min.   :    0.000   Min.   :     0.0   Min.   :  0.0
 1st Qu.:   0.00   1st Qu.:    0.000   1st Qu.:     0.0   1st Qu.:123.0
 Median :   0.00   Median :    0.000   Median :     0.0   Median :340.0
 Mean   :   2.07   Mean   :    5.183   Mean   :   345.8   Mean   :248.7
 3rd Qu.:   0.00   3rd Qu.:    0.000   3rd Qu.:     0.0   3rd Qu.:364.0
 Max.   :2004.64   Max.   :25082.910   Max.   :484550.1   Max.   :365.0
                                       NA's   :281056.0
 minagen         primagen
 Min.   :17.00   Min.   :17.00
 1st Qu.:41.00   1st Qu.:43.00
 Median :56.00   Median :53.00
 Mean   :63.81   Mean   :53.25
 3rd Qu.:99.00   3rd Qu.:64.00
 Max.   :99.00   Max.   :93.00
#########################################################################
It worked all right, but I'm just wondering if there is a more efficient
way (it takes about 10 minutes to run the above scripts, for my 300,000 x
25 CSV file)?
For example, the CSV file has 25 columns but I don't need 3 of them (6, 7,
and 22). What I have done is to scan them in anyway, convert the list
into a data frame then remove the 3 columns. Just wonder if it is
possible to simply ignore them in scan() to make the process faster?
Cheers,
Kevin
------------------------------------------------------------------------------
/* Time is the greatest teacher, unfortunately it kills its students */
--
Ko-Kang Kevin Wang
Master of Science (MSc) Student
SLC Tutor and Lab Demonstrator
Department of Statistics
University of Auckland
New Zealand
Homepage: http://www.stat.auckland.ac.nz/~kwan022
Ph: 373-7599
x88475 (City)
x88480 (Tamaki)
#########################################################################
It might not make a lot of difference in your case where you are
reading many fields and want to ignore a few, but if you want to read
a few out of many, it would help to preprocess the input file using,
for example, awk as in the following, which would pick up fields 1, 2,
and 4:
> con <- pipe("awk -F , '{print $1,$2,$4}' ../Data/Rating.csv")
> rating <- scan(con, what = list(
+     usage = "",
+     mileage = 0,
+     excess = ""),
+     quiet = TRUE, skip = 1)
> close(con)
I do this sort of thing a lot using various utilities; so I've defined
the following function to take care of opening and closing the
connection:
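scanpipe <- function(x, ...) {
    con <- pipe(x)          # x is the shell command whose output is read
    out <- scan(con, ...)   # further arguments are passed on to scan()
    close(con)              # release the connection
    out
}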
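
As a usage sketch, the helper could be called as follows. This assumes a Unix-like system with awk on the PATH; the temporary CSV file and its contents are invented for illustration, and the helper is restated so the example runs on its own.

```r
# The helper from the message above, restated so the example is self-contained.
scanpipe <- function(x, ...) {
    con <- pipe(x)
    out <- scan(con, ...)
    close(con)
    out
}

# Hypothetical four-column CSV; the values are made up.
csv <- tempfile(fileext = ".csv")
writeLines(c("usage,mileage,sex,excess",
             "S,5000,F,100",
             "SC,8000,M,75"), csv)

# Let awk keep fields 1, 2 and 4; scan() then reads the space-separated
# output, skipping the header line.
cmd <- paste("awk -F , '{print $1,$2,$4}'", shQuote(csv))
rating <- scanpipe(cmd,
                   what = list(usage = "", mileage = 0, excess = ""),
                   skip = 1, quiet = TRUE)
str(rating)
```

Because awk prints its output fields separated by spaces, no sep argument is needed in the scan() call.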
-----------------------------------------------------------------
Pierre Kleiber Email: pkleiber at honlab.nmfs.hawaii.edu
Fishery Biologist Tel: 808 983-5399/737-7544
NOAA FISHERIES - Honolulu Laboratory Fax: 808 983-2902
2570 Dole St., Honolulu, HI 96822-2396
-----------------------------------------------------------------
For example, the CSV file has 25 columns but I don't need 3 of them (6, 7,
and 22). What I have done is to scan them in anyway, convert the list
into a data frame then remove the 3 columns. Just wonder if it is
possible to simply ignore them in scan() to make the process faster?
Yes: see the help page for scan(), which says:
If any of the types is `NULL', the corresponding field is skipped
(but a `NULL' component appears in the result).
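
A small sketch of what that looks like in practice, using a made-up four-column CSV in a temporary file (the column names and values are assumptions for illustration only):

```r
# Hypothetical four-column CSV; the values are made up.
csv <- tempfile(fileext = ".csv")
writeLines(c("usage,mileage,sex,excess",
             "S,5000,F,100",
             "SC,8000,M,75"), csv)

# A NULL entry in 'what' tells scan() to skip that field (here field 3, sex).
what <- list(usage = "", mileage = 0, sex = NULL, excess = "")
dat <- scan(csv, what = what, sep = ",", skip = 1, quiet = TRUE)

# The skipped field is still present in the result as a NULL component,
# so drop the NULL components if they are not wanted.
kept <- dat[!vapply(dat, is.null, logical(1))]
str(kept)
```

For the 25-column file discussed here, the same idea would mean putting NULL in positions 6, 7 and 22 of the what list.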
If you don't need a data frame, don't do the conversion. If you do need
one, you might well find that read.table() with colClasses set is faster
than scanning into a list and converting with as.data.frame().
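
A hedged sketch of the read.table() route. It relies on the documented colClasses value "NULL", which makes read.table() skip a column; whether this is actually faster for a given file is something to measure rather than assume. The temporary CSV file and its contents are invented for illustration.

```r
# Hypothetical four-column CSV; the values are made up.
csv <- tempfile(fileext = ".csv")
writeLines(c("usage,mileage,sex,excess",
             "S,5000,F,100",
             "SC,8000,M,75"), csv)

# Declaring the column types up front saves read.table() from guessing them;
# the string "NULL" (quoted) drops the third column while reading.
rating <- read.table(csv, header = TRUE, sep = ",",
                     colClasses = c("character", "numeric", "NULL", "character"))
str(rating)
```

The result is already a data frame containing only the wanted columns, so no separate conversion or column removal is needed.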
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595