48K csv files, 1000 lines each. How to redesign? (big picture)

This likely won't help, but I've had immense success using
RHIPE [^1]. I say it likely won't help because installing it
appears to be painful (though I have scripts for Amazon EMR
c4.2xlarge clusters).

My approach to this is:

1. Read the hundreds of GB of CSV files into RHIPE.
2. Chunk them into data tables [^2], each data table corresponding to
   the information of one subject.
3. Analyze this data. I've worked with tens of millions of
   subjects (the data tables for each of these had < 1000 rows).
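
For anyone without a Hadoop cluster at hand, the chunking in step 2 can be sketched locally with data.table alone. This is only an illustration under assumptions: the toy columns and values below are made up, and `split()` merely stands in for what the RHIPE job does at scale.

```r
library(data.table)

# Hypothetical toy data: one row per (subject, submission date) record.
dt <- data.table(
  cid     = c("a", "a", "b", "c", "c", "c"),
  subdate = as.Date(c("2015-01-02", "2015-01-01", "2015-01-05",
                      "2015-01-03", "2015-01-01", "2015-01-02")),
  sec     = c(10L, 20L, 5L, 7L, 8L, 9L)
)

# One data.table per subject, each sorted by date.
per_subject <- lapply(split(dt, by = "cid"), function(d) d[order(subdate)])

names(per_subject)   # the subject ids
per_subject[["a"]]   # the rows for subject "a"
```

Each list element plays the role of one key/value pair in the cluster job: the name is the subject id and the value is that subject's table.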

At the end of the following code, my data set consists of ~6MM
subjects, each with a few tens to a few hundred rows of data. The data
for each subject ('cid') is stored as a data table.

I can then compute across subjects very easily, upping the number of
compute nodes if I feel the need to do so (Elastic MapReduce makes
this simple).

Thanks
Saptarshi

```{r}
# `E` (passed as `setup` below) is not defined in this snippet; it is assumed
# to be an expression that loads data.table on each worker.
z <- rhwatch(
    map = expression({
        tryCatch({
            # Parse this block of input lines as one CSV chunk
            z <- fread(paste(unlist(map.values), collapse = "\n"),
                       colClasses = c('character', 'integer', 'character', 'character',
                                      'character', 'character', 'integer', 'integer',
                                      'integer', 'integer', 'integer', 'integer'))
            setnames(z, c("cid", "pcd", "arch", "ver", "osver", "subdate", "addons",
                          "contentcr", "mediancr", "plugincr", "browsercr", "sec"))
            z[, subdate := as.Date(subdate, "%Y%m%d")]
            # Emit one (cid, rows) pair per subject found in this chunk
            z[, rhcollect(.BY$cid, .SD), by = cid]
            # Additionally emits the whole chunk under a random key (1..1000)
            rhcollect(sample(1:1000, 1), z)
        }, error = function(e) { rhcounter("errors", as.character(e), 1) })
    })
    , reduce = expression(
        pre = { .r <- NULL },
        reduce = { .r <- rbind(.r, rbindlist(reduce.values)) },
        post = {
            # All rows collected for this key, in date order
            .r <- .r[order(subdate), ]
            rhcollect(reduce.key, .r)
        }
    )
    , mapred = list(mapred.reduce.tasks = 300)
    , input  = rhfmt("text", folders = "s3://mozilla-metrics/sguha/longlong/txt2/")
    , output = 's3://mozilla-metrics/sguha/tmp/64bitcrashesromain'
    , setup = E, read = FALSE)
```
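
With one table per subject, computing across subjects amounts to applying a function to each table and combining the results. A minimal local sketch of that pattern follows. The toy tables, the `summarise_subject` helper and the summary it computes are all assumptions for illustration; on the cluster the same function body would presumably go inside another map expression.

```r
library(data.table)

# Hypothetical stand-in for the job output: a named list, one table per cid.
per_subject <- list(
  a = data.table(subdate = as.Date(c("2015-01-01", "2015-01-02")),
                 sec = c(20L, 10L)),
  b = data.table(subdate = as.Date("2015-01-05"), sec = 5L)
)

# Summarise each subject independently (hypothetical helper).
summarise_subject <- function(d) {
  data.table(n_rows = nrow(d),
             first_date = min(d$subdate),
             total_sec = sum(d$sec))
}

# Stack the per-subject summaries; the list names become the `cid` column.
across <- rbindlist(lapply(per_subject, summarise_subject), idcol = "cid")
across
```

Because each subject is summarised independently, the work parallelises trivially, which is why adding compute nodes helps.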

[^1]: http://deltarho.org/

[^2]: https://cran.r-project.org/web/packages/data.table/index.html
