
48K csv files, 1000 lines each. How to redesign? (big picture)

6 messages · Paul Johnson, Roman Luštrik, Cristian Bologa +3 more

#
Hi

I asked this question on Stack Overflow, and 3 people voted it up
within 5 minutes; then the moderators froze it because it is too
interesting. Sorry, it is "too broad" and not specific. If you have
ever tried to store a giant collection of simulation exercises and
lived to tell the tale, I would be glad to hear about your experience.
(Here it was:
http://stackoverflow.com/questions/42394583/48k-csv-files-1000-lines-each-how-to-redesign-the-data-storage).

This is it:

One of the people I help decided to scale up a simulation exercise to
massive proportions. The usual sort of thing we do has 100 conditions
with 1000 runs each, and the result can "easily" fit into a single
file or data frame. We do that kind of thing with SAS, R, or Mplus.
This one is in R. I should have seen trouble coming when I heard that
the project was failing for lack of memory. We see that sometimes with
Bayesian models, where holding all of the results from the chains in
memory becomes too demanding. The fix in those cases has been to save
batches of iterations in separate files. Without paying attention to
the details, I suggested they write smaller files to disk as the
simulation proceeds.

Later, I realized the magnitude of my error. They had generated 48,000
output CSV files, each with 1000 lines and about 80 columns of real
numbers. These are written out as CSV files because the researchers
are comfortable with data they can see. Again, I was not paying
attention when they asked me how to analyze all of that. I was
thinking small data and told them to stack up the CSV files with a
shell script. The result is a 40+ GB CSV file. R can't hope to open
that on the computers we have around here.

I believe/hope that the analysis will never need to use all 40 GB of
data in one regression model :) I expect it is more likely they will
want to summarize smaller segments. The usual exercise of this ilk has
3-5 columns of simulation parameters and then 10 columns of results
from the analysis. In this project, the result is much more massive
because they have 10 columns of parameters, and all of the
mix-and-match combinations made the project expand.

I believe the best plan is to store the data in a "database"-like
structure. I would like your advice about which approach to take.

MySQL? It is not fully open anymore, so I'm not too enthusiastic.

PostgreSQL? It seems more and more popular, but I have not administered a server before.

SQLite3? Some admins here supply us with data for analysis in that
format, but we have never received anything larger than 1.5 GB.

HDF5 (maybe netCDF)? It used to be (say, in 2005) that these
specialized, science-style, database-like container formats worked
well. However, I have not heard them mentioned since I started helping
the social science students. Back when R started, we were using HDF5,
and one of my friends wrote the original R code to interact with HDF5.

My top priority is rapid data retrieval. I think if one of the
technicians can learn to retrieve a rectangular chunk, we can show the
researchers how to do the same.


Warm Regards
PJ
#
Upon a cursory read, an SQLite db can handle up to 140 TB, so I think
your 40 GB are safe. I would advocate SQLite because it's one file,
easy to understand, requires no installation, and is supported by R's
sqldf package (maybe more now).
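
Something along these lines (an untested sketch via DBI/RSQLite, which
sqldf builds on; the file, table, and column names are made up):

```{r}
library(DBI)
library(RSQLite)

## One self-contained database file on disk; no server to administer.
con <- dbConnect(SQLite(), "simulations.sqlite")

## Append each CSV as it is read, so the full 40 GB never sits in memory.
for (f in list.files("results", pattern = "\\.csv$", full.names = TRUE)) {
    dbWriteTable(con, "runs", read.csv(f), append = TRUE)
}

## Retrieve a rectangular chunk: a few columns for one parameter setting.
chunk <- dbGetQuery(con,
    "SELECT param1, param2, estimate FROM runs WHERE param1 = 0.5")

dbDisconnect(con)
```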

Cheers,
Roman
#
Hi Paul,

In your case, you might be better off with MonetDBLite.

https://www.monetdb.org/blog/monetdblite-r
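
A rough sketch of what that could look like (untested; MonetDBLite
speaks DBI, and the directory, table, and column names here are
placeholders):

```{r}
library(DBI)
library(MonetDBLite)

## An embedded, server-less column store living in a local directory.
con <- dbConnect(MonetDBLite::MonetDBLite(), "./simdb")

## Append the CSVs one at a time; a column store keeps queries that
## touch only a few of the 80 columns fast.
for (f in list.files("results", pattern = "\\.csv$", full.names = TRUE)) {
    dbWriteTable(con, "runs", read.csv(f), append = TRUE)
}

chunk <- dbGetQuery(con,
    "SELECT param1, estimate FROM runs WHERE param1 = 0.5")

dbDisconnect(con, shutdown = TRUE)
```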

Good luck,
Cristian


Cristian Bologa, Ph.D.
Research Professor,
Div. of Translational Informatics, 
Dept. of Internal Medicine,
Univ. of New Mexico, School of Medicine,
Innovation Discovery & Training Center, MSC09 5025,
700 Camino de Salud NE, Albuquerque, NM 87131
Telephone: +1 (505) 925-7534
Fax: +1 (505) 925-7625



#
In fact, 40 GB is not that much data to play with, and fortunately you
have multiple solutions:

1) Make R great again by adopting 64-bit R. This would dramatically
increase your memory limit (
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html)

2) Do streaming or out-of-memory computation using packages like
bigmemory (
ftp://cran.r-project.org/pub/R/web/packages/bigmemory/bigmemory.pdf);
see the sketch after this list.

3) If you are willing to adopt third-party platforms, you can also try
importing the data into Elasticsearch and playing with Kibana (catch:
it looks good on a resume). (https://www.elastic.co/products/kibana)
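
An untested sketch of option 2 with bigmemory (file names and indices
are placeholders; note a big.matrix holds a single numeric type, so
character columns would need encoding first):

```{r}
library(bigmemory)

## Parse the 40 GB CSV once into a file-backed matrix on disk.
x <- read.big.matrix("all_runs.csv", header = TRUE, type = "double",
                     backingfile = "all_runs.bin",
                     descriptorfile = "all_runs.desc")

## In a later session, attach the backing file instead of re-reading:
x <- attach.big.matrix("all_runs.desc")

## Rectangular chunks come back as ordinary in-memory matrices.
chunk <- x[1:1000, c(1, 3, 5)]
```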
#
Well, having lived to tell the tale, I would like to mention one
option that never seems as obvious as it should be. With simulation
exercises, you can save the seed and regenerate only the portions of
the data you want for a specific analysis. It can be especially fast
if the analysis can be done without actually saving data to a file.
This is a trade-off between compute speed and storage access/query
speed, and it depends, of course, on the complexity of the model
computation. The trade-off does not always work the way one is
inclined to think it should. (BTW, it is important to beware of the
details needed for regenerating simulations on clusters.)
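
A minimal sketch of the idea (the simulation function and design table
are made up for illustration):

```{r}
## Store only seeds and parameters; regenerate any run on demand.
run_one <- function(seed, n = 1000, beta = 0.5) {
    set.seed(seed)            # everything below is now reproducible
    x <- rnorm(n)
    y <- beta * x + rnorm(n)
    coef(lm(y ~ x))           # analyze without writing data to disk
}

## The whole "archive" is a small table of conditions and seeds.
design <- data.frame(condition = 1:100, seed = 1000 + 1:100)

## Reproduce and re-analyze run 42 exactly, with no stored CSV.
run_one(design$seed[42])

## On clusters, prefer a parallel-safe generator over plain set.seed(),
## e.g. RNGkind("L'Ecuyer-CMRG") with parallel::nextRNGStream().
```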

Paul Gilbert
#
This is likely not going to help, but I've had immense success using
RHIPE [^1]. I say likely not going to help because installing it
appears to be painful (though I have scripts for Amazon EMR
c4.2xlarge clusters).

My approach to this is:

1. Read the hundreds of GB of CSV files into RHIPE.
2. Chunk them as data tables [^2], each data table corresponding to
   the information of one subject.

Then analyze this data. I've worked with tens of millions of
subjects (the data tables for each of these were < 1000 rows).

At the end of the following code, my data set consists of ~6MM
subjects, each with a few tens to a few hundred rows of data. The data
for each subject ('cid') is stored as a data table.

I can then compute across subjects very easily, upping the number of
compute nodes if I feel the need to do so (Elastic MapReduce makes
this simple).

Thanks
Saptarshi

```{r}
library(Rhipe)        # RHIPE: R and Hadoop Integrated Programming Environment
library(data.table)

z <- rhwatch(
    map = expression({
        tryCatch({
            ## Parse the raw CSV lines handed to this map task
            z <- fread(paste(unlist(map.values), collapse = "\n"),
                       colClasses = c('character','integer','character',
                                      'character','character','character',
                                      'integer','integer','integer',
                                      'integer','integer','integer'))
            setnames(z, c("cid","pcd","arch","ver","osver","subdate",
                          "addons","contentcr","mediancr","plugincr",
                          "browsercr","sec"))
            z[, subdate := as.Date(subdate, "%Y%m%d")]
            ## Emit one key/value pair per subject id
            z[, rhcollect(.BY$cid, .SD), by = cid]
            rhcollect(sample(1:1000, 1), z)
        }, error = function(e) rhcounter("errors", as.character(e), 1))
    }),
    reduce = expression(
        pre = {
            .r <- NULL
        },
        reduce = {
            ## Stack all chunks that share a key (subject id)
            .r <- rbind(.r, rbindlist(reduce.values))
        },
        post = {
            .r <- .r[order(subdate), ]
            rhcollect(reduce.key, .r)
        }
    ),
    mapred = list(mapred.reduce.tasks = 300),
    setup  = E,   # E: a setup expression defined elsewhere (not shown)
    read   = FALSE,
    output = 's3://mozilla-metrics/sguha/tmp/64bitcrashesromain',
    input  = rhfmt("text", folders = "s3://mozilla-metrics/sguha/longlong/txt2/")
)
```
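
Afterwards, something like this pulls a few results back into the R
session for inspection (a sketch; the max argument limits how many
key/value pairs rhread returns):

```{r}
## Read back a handful of (cid, data.table) pairs from the job output.
a <- rhread("s3://mozilla-metrics/sguha/tmp/64bitcrashesromain", max = 10)
```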

[^1]: http://deltarho.org/

[^2]: https://cran.r-project.org/web/packages/data.table/index.html