Can I improve the efficiency of my scan() command? - R-help

Fri, Apr 11, 2003 2:14 PM

On Sat, 12 Apr 2003, Ko-Kang Kevin Wang wrote:

<snip>

It should be quicker not to convert to a data frame.  You can just keep
the data as a list of vectors and lapply() the summary() function.
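
A minimal sketch of that idea follows. The small list below is a
hypothetical stand-in for what scan(what = list(...)) returns, and the
conversion of character components to factors is an assumption added so
that summary() reports counts for them, as the data frame summary does.

```r
# Hypothetical stand-in for the list returned by scan(what = list(...));
# the values are made up for illustration.
rating <- list(
  usage   = c("S", "SC", "SB", "SC"),
  mileage = c(5000, 8000, 10000, 288),
  sex     = c("F", "M", "M", "F")
)

# Turn character components into factors so summary() gives level counts;
# numeric components are left untouched.
rating <- lapply(rating, function(x) if (is.character(x)) factor(x) else x)

# One summary per component, without ever building a data frame.
summaries <- lapply(rating, summary)
print(summaries)
```

Without the factor step, summary() on a character vector only reports
its length, class and mode.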

	-thomas

Ko-Kang Kevin Wang

Fri, Apr 11, 2003 2:23 PM

Hi,

Suppose I use the following code to read in a data set.

###############################################

> rating <- scan("../Data/Rating.csv",
+                what = list(
+                  usage = "",
+                  mileage = 0,
+                  sex = "",
+                  excess = "",
+                  ncd = "",
+                  primage = "",
+                  minage = "",
+                  drivers = "",
+                  district = "",
+                  cargroup = "",
+                  car.age = 0,
+                  wsclms = "",
+                  adclms = "",
+                  ftclms = "",
+                  pdclms = "",
+                  piclms = "",
+                  adincur = 0,
+                  pdincur = 0,
+                  wsincur = 0,
+                  ftincur = 0,
+                  piincur = 0,
+                  record = 0,
+                  days = 0,
+                  minagen = 0,
+                  primagen = 0),
+                sep=",", quiet = TRUE, skip = 1)

usage          mileage      sex        excess       ncd        drivers   
 S :125788   Min.   :  288   F: 82208   0  :  4744   0:   880   1:100791  
 SB: 12581   1st Qu.: 5000   M:217792   100:161311   1:  2819   2:175100  
 SC:161524   Median : 8000              75 :133945   2:  5245   3: 19146  
 ST:   107   Mean   : 7640                           3:  5230   4:  4156  
             3rd Qu.:10000                           4:285826   5:   515  
             Max.   :40000                                      6:    69  
                                                                7:   223  
    district        cargroup        car.age       wsclms     adclms    
 6      :59053   8      :44524   Min.   :-1.000   0:294521   0:292852  
 5      :57113   6      :39171   1st Qu.: 4.000   1:  5267   1:  6720  
 7      :51166   9      :38965   Median : 7.000   2:   201   2:   405  
 4      :50643   7      :35139   Mean   : 7.234   3:    11   3:    23  
 3      :33041   10     :31091   3rd Qu.:10.000                        
 8      :16437   5      :27456   Max.   :30.000                        
 (Other):32547   (Other):83654                                         
 ftclms     pdclms     piclms        adincur            pdincur        
 0:298661    :281056    :281056   Min.   :    0.00   Min.   : -4985.2  
 1:  1316   0: 15277   0: 18131   1st Qu.:    0.00   1st Qu.:     0.0  
 2:    22   1:  3587   1:   809   Median :    0.00   Median :     0.0  
 3:     1   2:    79   2:     4   Mean   :   21.25   Mean   :   225.4  
            3:     1              3rd Qu.:    0.00   3rd Qu.:     0.0  
                                  Max.   :13779.55   Max.   : 25050.0  
                                                     NA's   :281056.0  
    wsincur           ftincur             piincur              days      
 Min.   :   0.00   Min.   :    0.000   Min.   :     0.0   Min.   :  0.0  
 1st Qu.:   0.00   1st Qu.:    0.000   1st Qu.:     0.0   1st Qu.:123.0  
 Median :   0.00   Median :    0.000   Median :     0.0   Median :340.0  
 Mean   :   2.07   Mean   :    5.183   Mean   :   345.8   Mean   :248.7  
 3rd Qu.:   0.00   3rd Qu.:    0.000   3rd Qu.:     0.0   3rd Qu.:364.0  
 Max.   :2004.64   Max.   :25082.910   Max.   :484550.1   Max.   :365.0  
                                       NA's   :281056.0                  
    minagen         primagen    
 Min.   :17.00   Min.   :17.00  
 1st Qu.:41.00   1st Qu.:43.00  
 Median :56.00   Median :53.00  
 Mean   :63.81   Mean   :53.25  
 3rd Qu.:99.00   3rd Qu.:64.00  
 Max.   :99.00   Max.   :93.00  
                             
#########################################################################

It works all right, but I'm wondering whether there is a more efficient
way: the script above takes about 10 minutes to run on my 300,000 x 25
CSV file.

For example, the CSV file has 25 columns but I don't need 3 of them (6, 7,
and 22).  What I have done is to scan them in anyway, convert the list
into a data frame, then remove the 3 columns.  Is it possible to simply
ignore them in scan() to make the process faster?

Cheers,

Kevin

------------------------------------------------------------------------------
/* Time is the greatest teacher, unfortunately it kills its students */

--
Ko-Kang Kevin Wang
Master of Science (MSc) Student
SLC Tutor and Lab Demonstrator
Department of Statistics
University of Auckland
New Zealand
Homepage: http://www.stat.auckland.ac.nz/~kwan022
Ph: 373-7599
    x88475 (City)
    x88480 (Tamaki)

Pierre Kleiber

Fri, Apr 11, 2003 3:07 PM

Ko-Kang Kevin Wang wrote:

> +                  usage = "",
> +                  mileage = 0,
> +                  sex = "",
> +                  excess = "",
> +                  ncd = "",
> +                  primage = "",
> +                  minage = "",
> +                  drivers = "",
> +                  district = "",
> +                  cargroup = "",
> +                  car.age = 0,
> +                  wsclms = "",

[...]

It might not make a lot of difference in your case where you are
reading many fields and want to ignore a few, but if you want to read
a few out of many, it would help to preprocess the input file using,
for example, awk as in the following, which would pick up fields 1, 2,
and 4:

> con <- pipe("awk -F , '{print $1,$2,$4}' ../Data/Rating.csv")
> rating <- scan(con, what = list(
+                  usage = "",
+                  mileage = 0,
+                  excess = ""),
+            quiet = TRUE, skip = 1)
> close(con)

I do this sort of thing a lot using various utilities; so I've defined
the following function to take care of opening and closing the
connection:

scanpipe <- function(x,...) {
   con <- pipe(x)
   out <- scan(con,...)
   close(con)
   out
}
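
As a rough usage sketch, the function can be called with the awk
command directly. The temporary file and its contents are hypothetical,
and the example assumes a Unix-like system with awk on the PATH. The
copy of scanpipe() here is repeated so the snippet runs on its own; it
uses on.exit(), which is an added variation that closes the connection
even if scan() stops with an error.

```r
# Copy of scanpipe() so the snippet is self-contained; on.exit() is an
# added variation that closes the connection even when scan() fails.
scanpipe <- function(x, ...) {
  con <- pipe(x)
  on.exit(close(con))
  scan(con, ...)
}

# Hypothetical four-column CSV written to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("usage,mileage,sex,excess",
             "S,5000,F,100",
             "SC,8000,M,75"), tmp)

# awk keeps fields 1, 2 and 4 and writes them separated by spaces,
# which matches scan()'s default whitespace separator.
cmd <- paste("awk -F , '{print $1,$2,$4}'", shQuote(tmp))
rating <- scanpipe(cmd,
                   what = list(usage = "", mileage = 0, excess = ""),
                   quiet = TRUE, skip = 1)
str(rating)
unlink(tmp)
```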

-----------------------------------------------------------------
Pierre Kleiber             Email: pkleiber at honlab.nmfs.hawaii.edu
Fishery Biologist                     Tel: 808 983-5399/737-7544
NOAA FISHERIES - Honolulu Laboratory         Fax: 808 983-2902
2570 Dole St., Honolulu, HI 96822-2396
-----------------------------------------------------------------

Brian Ripley

Sat, Apr 12, 2003 12:14 AM

On Sat, 12 Apr 2003, Ko-Kang Kevin Wang wrote:

[...]

Yes: see the help page

     If any of the types is `NULL', the corresponding field is skipped
     (but a `NULL' component appears in the result).
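
A small sketch of what that looks like in practice. The four-column
file is hypothetical; in the original call the same thing would mean
writing primage = NULL, minage = NULL and record = NULL for columns 6,
7 and 22.

```r
# Hypothetical four-column CSV written to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("usage,mileage,sex,excess",
             "S,5000,F,100",
             "SC,8000,M,75"), tmp)

# A NULL entry in 'what' makes scan() skip that field (here: sex).
rating <- scan(tmp,
               what = list(usage = "", mileage = 0, sex = NULL, excess = ""),
               sep = ",", quiet = TRUE, skip = 1)

# The skipped field still appears as a NULL component in the result,
# so drop the NULL components before further processing.
rating <- rating[!sapply(rating, is.null)]
str(rating)
unlink(tmp)
```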

If you don't need a data frame, don't do the conversion.  If you do, you
might well find that read.table() with colClasses set is faster than
converting with as.data.frame().
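
For illustration, a sketch of the read.table() route on a hypothetical
four-column file. The column classes chosen here are assumptions; the
string "NULL" in colClasses skips a column, so the unwanted columns are
never stored.

```r
# Hypothetical four-column CSV written to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("usage,mileage,sex,excess",
             "S,5000,F,100",
             "SC,8000,M,75"), tmp)

# Declaring every column's class up front saves read.table() from
# guessing types; "NULL" (as a string) drops the third column.
rating <- read.table(tmp, header = TRUE, sep = ",",
                     colClasses = c("factor", "numeric", "NULL", "factor"))
str(rating)
unlink(tmp)
```

The result is already a data frame, so no as.data.frame() step and no
later column removal are needed.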

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595