
Manage huge database

3 messages · Martin Morgan, Thomas Lumley, José E. Lozano

#
"José E. Lozano" <lozalojo at jcyl.es> writes:
The Bioconductor package snpMatrix is designed for this type of
data. See

http://www.bioconductor.org/packages/2.2/bioc/html/snpMatrix.html

and if that looks promising.
Likely you'll quickly want a 64-bit (Linux or Mac) machine.

Martin

#
On Mon, 22 Sep 2008, Martin Morgan wrote:

            
<snip>
netCDF is another useful option -- we have been using the ncdf package for
large genomic datasets.  We read the data in one person at a time and
write to netCDF.  For analysis we can then read any subset.  Since we
have imputed SNP data as well as measured, this comes to about 2.5 million
variables on 4000 people for one of our data sets.
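Thomas's actual workflow uses the ncdf R package; as a language-neutral sketch of the same out-of-core idea (fixed-size records written one person at a time, arbitrary subsets read back by seeking, with the full matrix never held in memory), here is a minimal Python standard-library version. The toy sizes and the fake genotype values are made up for illustration -- this is not the ncdf API.

```python
import os
import struct
import tempfile

N_PEOPLE, N_SNPS = 100, 1000            # toy sizes; the real set was 4000 x ~2.5M
RECORD = struct.Struct("%dB" % N_SNPS)  # one unsigned byte per genotype call

path = os.path.join(tempfile.mkdtemp(), "genotypes.bin")

# Write phase: one person (row) at a time, so the full matrix is never in RAM.
with open(path, "wb") as f:
    for person in range(N_PEOPLE):
        genotypes = [(person + snp) % 3 for snp in range(N_SNPS)]  # fake 0/1/2 calls
        f.write(RECORD.pack(*genotypes))

# Read phase: seek directly to any subset -- here, one person's full row...
def read_person(fh, person):
    fh.seek(person * RECORD.size)
    return RECORD.unpack(fh.read(RECORD.size))

# ...or a single SNP across all people (a column), one seek per person.
def read_snp(fh, snp):
    column = []
    for p in range(N_PEOPLE):
        fh.seek(p * RECORD.size + snp)
        column.append(ord(fh.read(1)))
    return column

with open(path, "rb") as f:
    row = read_person(f, 42)
    col = read_snp(f, 7)

print(len(row), len(col))  # prints: 1000 100
```

Real netCDF adds named dimensions, chunking, and portable files on top of this, which is why it scales to millions of variables where a hand-rolled binary format becomes painful.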


 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
#
Hello:

I've been reading all the replies, and I think I have some good ideas to
work on.

Right now the code I wrote is running. It has been running as a batch
process for 20 hours now and has imported 1,750 rows out of 2,000. I will read
the docs for the Bioconductor package, I will check the gawk option, and I will
also try (time-consuming, I guess) to transform the variables in text format to
numbers, along with the other options I have just read about.
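The text-to-numbers transformation can shrink the data substantially: a two-character genotype string stored as a factor or string costs far more than a single small integer per call. As a sketch of the idea (the actual allele codes and column names in the file are unknown, so the recoding table here is hypothetical):

```python
import csv
import io

# Hypothetical recoding table -- the real file's genotype codes are unknown.
CODE = {"AA": 0, "AG": 1, "GA": 1, "GG": 2, "--": 255}  # 255 marks missing

# Stand-in for the real file: a tiny CSV with made-up column names.
raw = io.StringIO("id,snp1,snp2,snp3\n"
                  "p1,AA,AG,GG\n"
                  "p2,GG,--,AA\n")

recoded = []
for rec in csv.DictReader(raw):
    # One small integer per genotype instead of a 2-character string.
    recoded.append([CODE[rec[c]] for c in ("snp1", "snp2", "snp3")])

print(recoded)  # prints: [[0, 1, 2], [2, 255, 0]]
```

Done row by row like this, the recoding is itself a streaming pass, so it combines naturally with the one-person-at-a-time write strategy discussed above.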

I appreciate all the ideas and hints. I've written quite complex,
time-consuming programs, but I've never had to write one just for a task as
simple as reading a file.

Finally, unfortunately I work on a 32-bit Windows XP machine, so neither
Linux nor 64-bit Windows... :-)

Thanks for your help,
Jose Lozano

------------------------------------------
Jose E. Lozano Alonso
Observatorio de Salud Pública.
Direccion General de Salud Pública e I+D+I.
Junta de Castilla y León.
Direccion: Paseo de Zorrilla, nº 1. Despacho 3103. CP 47071. Valladolid.