Manage huge database

2 messages · Barry Rowlingson, José E. Lozano

#
2008/9/22 José E. Lozano <lozalojo at jcyl.es>:
Maybe you've not lurked on R-help for long enough :) Apologies!

A bit more googling tells me both MySQL and PostgreSQL have limits of
a few thousand on the number of columns in a table, not a few hundred
thousand. An insightful comment on one mailing list is:

"Of course, the real bottom line is that if you think you need more than
order-of-a-hundred columns, your database design probably needs revision
anyway ;-)"

 So, how much "design" is in this data? If none, and what you've
basically got is a 2000x500000 grid of numbers, then maybe a more raw
binary-type format will help - HDF or netCDF? Although I'm not sure
how much R support for reading slices of these formats exists, you may
be able to use an external utility to write slices out on demand.
Random access to parts of these files is pretty fast.

http://cran.r-project.org/web/packages/RNetCDF/index.html
http://cran.r-project.org/web/packages/hdf5/index.html
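As a hedged sketch of the netCDF route, assuming the RNetCDF package from CRAN (the file name, dimension names and variable name below are illustrative, not from the thread): the grid is written once as a 2-D variable, after which arbitrary row/column slices can be read on demand without loading the whole file.

```r
library(RNetCDF)   # assumed installed from CRAN

## Write a small demo grid once (4 rows x 3 columns)
nc <- create.nc("grid_demo.nc")
dim.def.nc(nc, "row", 4)
dim.def.nc(nc, "col", 3)
var.def.nc(nc, "geno", "NC_DOUBLE", c("row", "col"))
var.put.nc(nc, "geno", matrix(1:12, nrow = 4))
close.nc(nc)

## Random access: read only rows 2..3 of column 1
nc <- open.nc("grid_demo.nc")
slice <- var.get.nc(nc, "geno", start = c(2, 1), count = c(2, 1))
close.nc(nc)
```

For a 2000 x 500000 grid the same `start`/`count` call pulls out just the block you need, which is the random-access behaviour discussed above.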

 Thinking back to your 4GB file with 1,000,000,000 entries, that's
only 3 bytes per entry (+1 for the comma). What is this data? There
may be more efficient ways to handle it.

 Hope *that* helps...

Barry
#
Probably.
Exactly: raw data, but a little more complex, since all 500,000 variables
are in text format, so each line is around 2,500,000 characters wide.
Thanks, I will check. Right now I am reading the file line by line. It's
time-consuming, but since I will only do it once, just to rearrange the data
into smaller tables to query, it's OK.
It's genetic DNA data (genotyped individuals), hence the large number of
columns to analyze.
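That one-off, line-by-line pass can be sketched in R roughly as follows (the file name, comma delimiter and chunk size are illustrative assumptions, not from the original post; a tiny stand-in file is created so the snippet runs as-is):

```r
## Create a tiny stand-in for the huge genotype file (illustrative only)
writeLines(c("id1,A,G,T", "id2,C,G,A", "id3,T,T,G"), "genotypes_demo.csv")

con <- file("genotypes_demo.csv", open = "r")
chunk_size <- 2            # lines per chunk; tune to available memory
n_rows <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break          # end of file
  fields <- strsplit(lines, ",", fixed = TRUE)
  ## ...here each chunk would be split column-wise and appended to
  ## smaller per-block tables...
  n_rows <- n_rows + length(lines)
}
close(con)
```

Reading through a text connection this way keeps memory use bounded by `chunk_size`, at the cost of a single slow sequential pass.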

Best Regards,
Jose Lozano
------------------------------------------
Jose E. Lozano Alonso
Observatorio de Salud Pública.
Dirección General de Salud Pública e I+D+I.
Junta de Castilla y León.
Address: Paseo de Zorrilla, nº 1. Despacho 3103. CP 47071. Valladolid.