Running out of memory when importing SPSS files

5 messages · dobomode, Uwe Ligges, Thomas Lumley +1 more

#
Hello R-help,

I am trying to import a large dataset from SPSS into R. The SPSS file
is in .SAV format and is about 1GB in size. I use read.spss to import
the file and get an error saying that I have run out of memory. I am
on a Mac OS X 10.5 system with 4GB of RAM. Monitoring the R process
tells me that R runs out of memory when reaching about 3GB of RAM so I
suppose the remaining 1GB is used up by the OS.

Why would a 1GB SPSS file take up more than 3GB of memory in R? Is it
perhaps because R is converting each SPSS column to a less memory-
efficient data type? In general, what is the best strategy to load
large datasets in R?

Thanks!

P.S.

I exported the SPSS .SAV file to .CSV and tried importing the comma-
delimited file. Same results: the import was much slower, but
eventually I ran out of memory again...
#
dobomode wrote:
Because SPSS stores data in a compressed way?

 > Is it
Use a 64-bit version of R and have sufficient amount of RAM in your system.
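
A quick way to check whether the running R session is a 64-bit build, as a minimal sketch (the variable names pointer_bytes and is_64bit are only illustrative):

```r
# Pointer size in bytes: 8 on a 64-bit build of R, 4 on a 32-bit build.
pointer_bytes <- .Machine$sizeof.pointer
is_64bit <- pointer_bytes == 8

# R.version$arch names the architecture the build was compiled for.
cat("R architecture:", R.version$arch, "\n")
cat("64-bit build:  ", is_64bit, "\n")
```

A 32-bit build cannot address much more than 3GB per process regardless of installed RAM, which would match the limit seen in the original post.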

Uwe Ligges
#
I found the culprit. I had a number of variables in the SPSS file that
were a variable length string data type (255 characters). This seemed
to force R into creating 255-byte variables which eventually choked my
machine's memory...
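
To illustrate the effect on a small scale, here is a rough sketch that compares the memory used by short strings with the same strings padded to 255 characters. The data are made up, and trimming with sub() after the fact is only an assumption about how one might shrink such columns; it does not reduce the memory needed during the import itself.

```r
n <- 10000

# Hypothetical short values, as they might look once trailing blanks are removed.
short <- sprintf("id%05d", seq_len(n))

# The same values left-justified and padded with blanks to 255 characters,
# mimicking a fixed-width 255-character string variable.
padded <- sprintf("%-255s", short)

# Strip the trailing blanks again.
trimmed <- sub(" +$", "", padded)

print(object.size(padded), units = "MB")
print(object.size(trimmed), units = "MB")
```

With millions of rows and several such variables, the padded form alone can account for gigabytes.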


On Feb 18, 5:34 pm, Uwe Ligges <lig... at statistik.tu-dortmund.de>
wrote:
#
On Wed, 18 Feb 2009, Uwe Ligges wrote:

            
Or because R uses quite a lot more memory to read a data set than to store it. Either way, even if the data set eventually took up only 1GB in R, you would probably still not be able to work usefully with it on a 32-bit machine.

You need to either use a 64-bit system or avoid loading the whole data set.  Unfortunately read.spss can't read the data selectively [something I'd like to fix, sometime], but if you had a .csv file you could read a subset of columns or rows using read.table.
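
As a small sketch of the read.table approach: setting an entry of colClasses to "NULL" skips that column, and skip/nrows select a block of rows. The demo file and its column names (id, age, comment) are invented for the example.

```r
# Write a tiny demo CSV to a temporary file (stand-in for the exported data).
csv_path <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:6,
                   age = c(34, 51, 27, 45, 62, 38),
                   comment = c("a", "b", "c", "d", "e", "f"),
                   stringsAsFactors = FALSE)
write.csv(demo, csv_path, row.names = FALSE)

# Subset of columns: "NULL" in colClasses drops the third column on input.
subset_cols <- read.table(csv_path, header = TRUE, sep = ",",
                          colClasses = c("integer", "numeric", "NULL"))

# Subset of rows: read the column names once, then skip the header line
# plus the first two data rows and read the next two rows.
col_names <- names(read.table(csv_path, header = TRUE, sep = ",", nrows = 1))
rows_3_4 <- read.table(csv_path, header = FALSE, sep = ",",
                       skip = 3, nrows = 2, col.names = col_names)

print(subset_cols)
print(rows_3_4)
```

Giving colClasses explicitly also spares read.table from guessing the column types, which tends to reduce memory use while reading.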

A better bet is likely to be putting the data set into a database (SQLite is easiest) and reading subsets of the data that way.  That's how I handle data sets of a few GB (on a laptop with 1GB of memory).
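
One way this could look, as a sketch only: it assumes the DBI and RSQLite packages are installed, and the table name (survey), the column names and the chunk size are made up. The CSV is loaded into SQLite in chunks so that the full data set is never held in memory, and only the needed subset is queried back into R.

```r
library(DBI)

# Tiny demo CSV standing in for the large exported file.
csv_path <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:6,
                   age = c(34, 51, 27, 45, 62, 38),
                   region = c("north", "south", "north", "east", "south", "north"),
                   stringsAsFactors = FALSE)
write.csv(demo, csv_path, row.names = FALSE)

con <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".sqlite"))

# Read the header once, then stream the rest of the file in chunks.
input <- file(csv_path, open = "r")
col_names <- scan(text = readLines(input, n = 1), what = "", sep = ",", quiet = TRUE)
col_classes <- c("integer", "numeric", "character")
chunk_rows <- 4  # use a much larger value for real data

repeat {
  # read.table signals an error once the input is exhausted; treat that as the end.
  chunk <- tryCatch(
    read.table(input, header = FALSE, sep = ",", nrows = chunk_rows,
               col.names = col_names, colClasses = col_classes),
    error = function(e) NULL)
  if (is.null(chunk) || nrow(chunk) == 0) break
  dbWriteTable(con, "survey", chunk, append = TRUE)
}
close(input)

# Pull back only the rows and columns needed for the analysis.
north <- dbGetQuery(con, "SELECT id, age FROM survey WHERE region = 'north'")
print(north)

dbDisconnect(con)
```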


       -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
#
2009/2/19 Thomas Lumley <tlumley at u.washington.edu>:
You could try using package memisc and only bring in the variables you
need to analyse.

See spss.system.file() and the additional subset() methods in memisc.
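
A rough sketch of that workflow, assuming memisc is installed. The file name survey.sav and the variable names age and region are hypothetical placeholders; the block does nothing unless such a file exists.

```r
library(memisc)

sav_path <- "survey.sav"  # hypothetical path to the SPSS system file

if (file.exists(sav_path)) {
  # Creates an importer object that describes the file; the data
  # themselves are not loaded into memory at this point.
  importer <- spss.system.file(sav_path)

  # List the variable names available in the file.
  print(names(importer))

  # Load only the variables needed (age and region are placeholder names).
  ds <- subset(importer, select = c(age, region))

  # Convert the resulting data set to an ordinary data frame.
  df <- as.data.frame(ds)
  str(df)
}
```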

Paul Bivand

---------------------------------------------------------
Paul Bivand
Head of Analysis and Statistics
Inclusion

Inclusion has launched a new website; please visit: www.cesi.org.uk