Message-ID: <6ph3ajs71ps.fsf@lamprey.fhcrc.org>
Date: 2008-09-22T13:24:31Z
From: Martin Morgan
Subject: Manage huge database
In-Reply-To: <47A455630022EC00@mtacsbs.csbs.jcyl.es> (added by postmaster@jcyl.es) (José E. Lozano's message of "Mon, 22 Sep 2008 12:26:29 +0200")
"José E. Lozano" <lozalojo at jcyl.es> writes:
>> Maybe you've not lurked on R-help for long enough :) Apologies!
>
> Probably.
>
>> So, how much "design" is in this data? If none, and what you've
>> basically got is a 2000x500000 grid of numbers, then maybe a more raw
>
> Exactly, raw data, but a little more complex since all the 500000 variables
> are in text format, so the width is around 2,500,000.
>
>> http://cran.r-project.org/web/packages/RNetCDF/index.html
>> http://cran.r-project.org/web/packages/hdf5/index.html
>
> Thanks, I will check. Right now I am reading the file line by line. It's
> time-consuming, but since I will do it only once, just to rearrange the data
> into smaller tables to query, it's OK.
>
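A rough sketch of the one-off, line-by-line pass described above, which
splits a wide file into narrower files holding a block of columns each.
It is only an illustration: the function name, the file names and the
block size are hypothetical, and it assumes comma-separated fields, the
same number of fields on every line, and no empty trailing field.

```r
## Sketch only: split a wide comma-separated file into narrower files,
## one block of columns per output file. Assumes every line has the
## same number of fields.
split_wide_file <- function(infile, outprefix, block = 10000L, sep = ",") {
    con <- file(infile, open = "r")
    on.exit(close(con), add = TRUE)
    outs <- list()                      # output connections, opened lazily
    on.exit(for (o in outs) close(o), add = TRUE)
    ## read one line at a time so memory use stays small
    while (length(line <- readLines(con, n = 1L)) == 1L) {
        fields <- strsplit(line, sep, fixed = TRUE)[[1]]
        grp <- ceiling(seq_along(fields) / block)   # block index per field
        if (length(outs) == 0L)
            outs <- lapply(seq_len(max(grp)), function(i)
                file(sprintf("%s_%03d.csv", outprefix, i), open = "w"))
        pieces <- split(fields, grp)
        for (i in seq_along(pieces))
            writeLines(paste(pieces[[i]], collapse = sep), outs[[i]])
    }
    invisible(length(outs))             # number of files written
}

## hypothetical usage
## split_wide_file("genotypes.txt", "genotypes_part", block = 10000L)
```

R limits how many connections can be open at once, so the block size
has to be large enough to keep the number of output files modest; with
500000 columns, blocks of 10000 give 50 files.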
>> Thinking back to your 4GB file with 1,000,000,000 entries, that's
>> only 3 bytes per entry (+1 for the comma). What is this data? There
>> may be more efficient ways to handle it.
>
> It is genetic DNA data (genotyped individuals), hence the large number of
> columns to analyze.

The Bioconductor package snpMatrix is designed for this type of
data. See

  http://www.bioconductor.org/packages/2.2/bioc/html/snpMatrix.html

and, if that looks promising, install it with

  > source('http://bioconductor.org/biocLite.R')
  > biocLite('snpMatrix')

Likely you'll quickly want a 64-bit (Linux or Mac) machine.
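
A back-of-envelope estimate of why, using the 2000 x 500000 grid
mentioned earlier in the thread. The one-byte-per-genotype figure is an
assumption about compact storage, not a measured value; the 4- and
8-byte figures are R's integer and double sizes.

```r
## Rough memory estimate for a 2000 x 500000 genotype matrix
n_ind <- 2000
n_snp <- 500000
entries <- n_ind * n_snp              # 1e9 genotypes

gib <- function(bytes) bytes / 2^30   # bytes -> GiB

gib(entries * 1)   # one byte per genotype (assumed compact storage): about 0.93
gib(entries * 4)   # stored as integer: about 3.7
gib(entries * 8)   # stored as double: about 7.5
```

Even the compact form takes up a large share of the address space
available to a 32-bit process, before any copies are made during
analysis.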
Martin
> Best Regards,
> Jose Lozano
> ------------------------------------------
> Jose E. Lozano Alonso
> Observatorio de Salud Pública.
> Direccion General de Salud Pública e I+D+I.
> Junta de Castilla y León.
> Direccion: Paseo de Zorrilla, nº 1. Despacho 3103. CP 47071. Valladolid.
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793