
Manage huge database

3 messages · Martin Morgan, Thomas Lumley, José E. Lozano

#
"José E. Lozano" <lozalojo at jcyl.es> writes:
The Bioconductor package snpMatrix is designed for this type of
data. See

http://www.bioconductor.org/packages/2.2/bioc/html/snpMatrix.html

and if that looks promising.
Likely you'll quickly want a 64-bit (Linux or Mac) machine.

Martin

#
On Mon, 22 Sep 2008, Martin Morgan wrote:

            
<snip>
netCDF is another useful option -- we have been using the ncdf package for
large genomic datasets.  We read the data in one person at a time and
write to netCDF.  For analysis we can then read any subset.  Since we
have imputed SNP data as well as measured, this comes to about 2.5 million
variables on 4000 people for one of our data sets.
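Thomas's actual workflow uses the ncdf R package; as a language-neutral sketch of the same out-of-core idea (fixed-size records written one person at a time, arbitrary subsets read back by seeking, with the full matrix never held in memory), here is a minimal Python standard-library version. The toy sizes and the fake genotype values are made up for illustration -- this is not the ncdf API.

```python
import os
import struct
import tempfile

N_PEOPLE, N_SNPS = 100, 1000            # toy sizes; the real set was 4000 x ~2.5M
RECORD = struct.Struct("%dB" % N_SNPS)  # one unsigned byte per genotype call

path = os.path.join(tempfile.mkdtemp(), "genotypes.bin")

# Write phase: one person (row) at a time, so the full matrix is never in RAM.
with open(path, "wb") as f:
    for person in range(N_PEOPLE):
        genotypes = [(person + snp) % 3 for snp in range(N_SNPS)]  # fake 0/1/2 calls
        f.write(RECORD.pack(*genotypes))

# Read phase: seek directly to any subset -- here, one person's full row...
def read_person(fh, person):
    fh.seek(person * RECORD.size)
    return RECORD.unpack(fh.read(RECORD.size))

# ...or a single SNP across all people (a column), one seek per person.
def read_snp(fh, snp):
    column = []
    for p in range(N_PEOPLE):
        fh.seek(p * RECORD.size + snp)
        column.append(ord(fh.read(1)))
    return column

with open(path, "rb") as f:
    row = read_person(f, 42)
    col = read_snp(f, 7)

print(len(row), len(col))  # prints: 1000 100
```

Real netCDF adds named dimensions, chunking, and portable files on top of this, which is why it scales to millions of variables where a hand-rolled binary format becomes painful.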


 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
#
Hello:

I've been reading all the replies, and I think I have some good ideas to
work on.

Right now the code I wrote is running. It has been running as a batch
process for 20 hours now and has imported 1,750 rows out of 2,000. I will read
the docs for the Bioconductor package, I will check the gawk option, and I will
also try (time-consuming, I guess) to transform the variables in text format to
numbers, along with the other options I have just read about.
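The text-to-numbers transformation can shrink the data substantially: a two-character genotype string stored as a factor or string costs far more than a single small integer per call. As a sketch of the idea (the actual allele codes and column names in the file are unknown, so the recoding table here is hypothetical):

```python
import csv
import io

# Hypothetical recoding table -- the real file's genotype codes are unknown.
CODE = {"AA": 0, "AG": 1, "GA": 1, "GG": 2, "--": 255}  # 255 marks missing

# Stand-in for the real file: a tiny CSV with made-up column names.
raw = io.StringIO("id,snp1,snp2,snp3\n"
                  "p1,AA,AG,GG\n"
                  "p2,GG,--,AA\n")

recoded = []
for rec in csv.DictReader(raw):
    # One small integer per genotype instead of a 2-character string.
    recoded.append([CODE[rec[c]] for c in ("snp1", "snp2", "snp3")])

print(recoded)  # prints: [[0, 1, 2], [2, 255, 0]]
```

Done row by row like this, the recoding is itself a streaming pass, so it combines naturally with the one-person-at-a-time write strategy discussed above.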

I appreciate all the ideas and hints. I've written quite complex,
time-consuming programs, but I've never had to write one just for a task as
simple as reading a file.

Finally, unfortunately I work on a 32-bit Windows XP machine, so neither
Linux nor 64-bit Windows... :-)

Thanks for your help,
Jose Lozano

------------------------------------------
Jose E. Lozano Alonso
Observatorio de Salud Pública.
Direccion General de Salud Pública e I+D+I.
Junta de Castilla y León.
Direccion: Paseo de Zorrilla, nº 1. Despacho 3103. CP 47071. Valladolid.