
help with loading National Comorbidity Survey

2 messages · Jim Hurd, Thomas Lumley

On Sat, 1 Oct 2005, Jim Hurd wrote:
> If you mean the NCS 1 data file from that link (da06694-0001.dta)
> then I don't have this problem.

I have been able to load the .dta file under Windows on a computer with 
1 GB of RAM.  The maximum memory use was about 350 MB.  It was very slow -- 
about half an hour.  This is because the processing of missing values and 
of factor levels is very inefficient in read.dta when dealing with very 
wide data frames.  It makes calls to [.data.frame, [<-.data.frame, etc., 
for each column, so the time is probably quadratic in the number of 
columns.
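
To illustrate the pattern being described (this is not read.dta's actual
code, just a sketch of the cost): each single-bracket column replacement
on a data frame dispatches to [<-.data.frame, which in R of this era
duplicates the whole frame, so n columns cost roughly n copies of n
columns each.

```r
## Sketch of the quadratic pattern: one column replaced per iteration,
## each replacement going through [<-.data.frame and copying the frame.
n  <- 200
df <- as.data.frame(matrix(0, nrow = 10, ncol = n))
system.time(
  for (j in seq_len(n)) df[j] <- factor(df[[j]])  # full copy per column
)
```

Doubling n should roughly quadruple the elapsed time, which is the
signature of quadratic scaling.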

The call to .External that does the actual reading took less than 1% of 
the time.  If you only want a hundred or so of the 3000 variables, it may 
be worth using that .External() call directly to read the data, then 
subsetting it and working out how to apply the factor levels and so on.
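
A sketch of that approach.  The internal entry-point name
("do_readStata") is taken from the foreign package's read.dta source of
this period; it is not a documented interface, so check it against your
installed version before relying on it.  The variable selection below is
hypothetical.

```r
## Bare read via the internal entry point, skipping read.dta's slow
## per-column missing-value and factor post-processing.
library(foreign)
raw <- .External("do_readStata", "da06694-0001.dta", PACKAGE = "foreign")

## `raw` is a plain named list of columns at this point.
wanted <- names(raw)[1:100]        # hypothetical: the ~100 variables needed
sub    <- as.data.frame(raw[wanted])

## Value labels are carried in attributes on the full result
## (e.g. "label.table"); apply them to `sub` afterwards as needed.
```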

read.dta clearly needs a different algorithm to handle very wide data sets 
efficiently.
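
One shape such an algorithm could take (a sketch, not a patch, assuming
value labels arrive as a named list mapping label names to codes, as in
the label.table attribute read.dta attaches): do all per-column work on
a plain list, where modifying one element does not copy the others, and
construct the data frame once at the end.

```r
## Linear-in-columns conversion: per-column work on a plain list,
## single data-frame construction at the end.
convert_columns <- function(cols, label.table = NULL) {
  out <- lapply(names(cols), function(nm) {
    x   <- cols[[nm]]
    lab <- label.table[[nm]]   # named integer vector: label name -> code
    if (!is.null(lab))
      x <- factor(x, levels = unname(lab), labels = names(lab))
    x
  })
  names(out) <- names(cols)
  as.data.frame(out)           # one construction, not one per column
}
```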

 	-thomas