Manage huge database
"José E. Lozano" <lozalojo at jcyl.es> writes:
Maybe you've not lurked on R-help for long enough :) Apologies!
Probably.
So, how much "design" is in this data? If none, and what you've basically got is a 2000x500000 grid of numbers, then maybe a more raw
Exactly, raw data, but a little more complex since all the 500000 variables are in text format, so the width is around 2,500,000.
http://cran.r-project.org/web/packages/RNetCDF/index.html
http://cran.r-project.org/web/packages/hdf5/index.html
Thanks, I will check. Right now I am reading the file line by line. It's time-consuming, but since I will do it only once, just to rearrange the data into smaller tables to query, it's OK.
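For readers who want to try the same one-off rearrangement, here is a minimal sketch of reading a wide comma-separated file a few lines at a time and writing it out as narrower column blocks. It is not the code used in this thread: the function name split_wide_file, its arguments, and the output naming scheme are hypothetical, and it assumes every line has the same number of comma-separated fields with no empty trailing field.

```r
# Hypothetical helper (not from the thread): split a wide comma-separated
# file into several narrower files, reading only a few lines at a time.
# Assumes every line has the same number of fields and no empty trailing field.
split_wide_file <- function(path, out_prefix,
                            cols_per_block = 1000L, lines_per_chunk = 100L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  out_files <- character(0)
  repeat {
    lines <- readLines(con, n = lines_per_chunk)
    if (length(lines) == 0L) break            # end of file reached
    fields <- strsplit(lines, ",", fixed = TRUE)
    n_cols <- length(fields[[1]])
    starts <- seq(1L, n_cols, by = cols_per_block)
    for (b in seq_along(starts)) {
      idx <- starts[b]:min(starts[b] + cols_per_block - 1L, n_cols)
      # Re-join only the columns that belong to block b, one string per row
      block <- vapply(fields, function(f) paste(f[idx], collapse = ","),
                      character(1))
      out <- sprintf("%s_%03d.csv", out_prefix, b)
      out_con <- file(out, open = "a")        # append, so chunks accumulate
      writeLines(block, out_con)
      close(out_con)
      out_files <- union(out_files, out)
    }
  }
  out_files
}
```

Because the output files are opened in append mode, running the function twice with the same prefix would duplicate rows, so the block files should be deleted before a rerun.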
Thinking back to your 4GB file with 1,000,000,000 entries, that's only 3 bytes per entry (+1 for the comma). What is this data? There may be more efficient ways to handle it.
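The arithmetic behind that estimate can be checked in a few lines. This sketch uses the 2000x500000 grid mentioned earlier in the thread and treats "4GB" as 4 * 10^9 bytes, which is an assumption. The in-memory figures use R's standard element sizes (8 bytes per double, 4 per integer, 1 per raw) and ignore object overhead.

```r
# Figures taken from the thread; "4GB" is assumed to mean 4e9 bytes.
n_rows     <- 2000
n_cols     <- 500000
n_entries  <- n_rows * n_cols          # 1e9 entries
file_bytes <- 4e9

# 3 characters of data plus 1 comma per entry
bytes_per_entry <- file_bytes / n_entries

# Approximate size of the same grid held in memory, by storage type
gb_as_double  <- n_entries * 8 / 1e9   # numeric (double) matrix
gb_as_integer <- n_entries * 4 / 1e9   # integer matrix
gb_as_raw     <- n_entries * 1 / 1e9   # raw (one byte per entry)

print(c(bytes_per_entry = bytes_per_entry,
        double = gb_as_double, integer = gb_as_integer, raw = gb_as_raw))
```

The spread between the double and raw figures is one reason a compact, purpose-built representation can matter for data of this shape.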
It is genetic DNA data (genotyped individuals), hence the large number of columns to analyze.
The Bioconductor package snpMatrix is designed for this type of data. See http://www.bioconductor.org/packages/2.2/bioc/html/snpMatrix.html and if that looks promising
source('http://bioconductor.org/biocLite.R')
biocLite('snpMatrix')
Likely you'll quickly want a 64-bit (Linux or Mac) machine. Martin
Best Regards,
Jose Lozano
------------------------------------------
Jose E. Lozano Alonso
Observatorio de Salud Pública.
Direccion General de Salud Pública e I+D+I.
Junta de Castilla y León.
Direccion: Paseo de Zorrilla, nº1. Despacho 3103. CP 47071. Valladolid.
Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793