Message-ID: <47A455630022EC00@mtacsbs.csbs.jcyl.es> (added by postmaster@jcyl.es)
Date: 2008-09-22T10:26:29Z
From: José E. Lozano
Subject: Manage huge database
In-Reply-To: <d8ad40b50809220141l5274bf8fw29d36784de519eab@mail.gmail.com>

> Maybe you've not lurked on R-help for long enough :) Apologies!

Probably.

> So, how much "design" is in this data? If none, and what you've
> basically got is a 2000x500000 grid of numbers, then maybe a more raw

Exactly, raw data, but a little more complex: all 500,000 variables are
stored as text, so each line is around 2,500,000 characters wide.

> http://cran.r-project.org/web/packages/RNetCDF/index.html
> http://cran.r-project.org/web/packages/hdf5/index.html

Thanks, I will check. Right now I am reading the file line by line. It's
time-consuming, but since I will only do it once, just to rearrange the data
into smaller tables to query, it's OK.
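
A minimal sketch of that kind of one-off pass, written in Python rather than R
and assuming a comma-delimited file; the function name, its parameters and the
output file names are hypothetical, not what I actually run:

```python
import csv
import os
from contextlib import ExitStack


def split_columns(src_path, out_dir, cols_per_chunk, delimiter=","):
    """Stream a wide delimited text file one row at a time and write
    consecutive groups of columns to separate, narrower files.

    Returns the list of chunk file paths, in column order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with ExitStack() as stack:
        src = stack.enter_context(open(src_path, newline=""))
        writers = []
        for row in csv.reader(src, delimiter=delimiter):
            if not writers:
                # Open one output file per column group, sized from the
                # first non-empty row (assumes all rows have equal width).
                n_chunks = -(-len(row) // cols_per_chunk)  # ceiling division
                for k in range(n_chunks):
                    path = os.path.join(out_dir, f"chunk_{k:04d}.csv")
                    out = stack.enter_context(open(path, "w", newline=""))
                    writers.append(csv.writer(out, delimiter=delimiter))
                    paths.append(path)
            # Only one input row is held in memory at any time.
            for k, writer in enumerate(writers):
                start = k * cols_per_chunk
                writer.writerow(row[start:start + cols_per_chunk])
    return paths
```

Something like split_columns("genotypes.csv", "chunks", 10000) would turn
500,000 columns into 50 narrower files. In practice an identifier column
would probably need to be repeated in every chunk so the tables can be joined
again later; the sketch leaves that out.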

> Thinking back to your 4GB file with 1,000,000,000 entries, that's
> only 3 bytes per entry (+1 for the comma). What is this data? There
> may be more efficient ways to handle it.

It is genetic DNA data (genotyped individuals), hence the large number of
columns to analyze.

Best Regards,
Jose Lozano
------------------------------------------
Jose E. Lozano Alonso
Observatorio de Salud Pública.
Direccion General de Salud Pública e I+D+I.
Junta de Castilla y León.
Direccion: Paseo de Zorrilla, nº1. Despacho 3103. CP 47071. Valladolid.