Problem with read.spss() and as.data.frame(), or: alternative to subset()?

5 messages · Dirk Enzmann, Martin Maechler, Thomas Lumley +1 more

#
Trying to select a subset of cases (rows of data) I encountered several 
problems:

Firstly, because I did not read the help to read.spss() thoroughly 
enough, I treated the data read as a data frame. For example,

dr2000 <- read.spss('myfile.sav')
d <- subset(dr2000,RBINZ99 > 0)

and thus received an error message (Object "RBINZ99" not found), because 
dr2000 is not a data.frame but a list (shown by class(dr2000)).

d <- subset(dr2000,dr2000$RBINZ99)

didn't help either, because now d is empty (dim = NULL).

Thus, I tried to use the option "to.data.frame=T" of read.spss():

dr2000 <- read.spss('myfile.sav',to.data.frame=T)

However, now R "crashes" ('R for Windows GUI front-end has found an 
error and must be closed') (the error message is in German).

Finally, I tried again using read.spss() without the option 
'to.data.frame=T' (as before) and tried to convert dr2000 to a data 
frame by using

d <- as.data.frame(dr2000)

However, R crashes again (with the same error message).

Of course, I could use SPSS first and save only the cases with RBINZ99 > 
0, but this is not always possible (all users of the data must have SPSS 
available and we have to use different selection criteria). Is there 
another possibility to solve the problem by using R? I want to select 
certain rows (cases) based on the values of one "variable" of dr2000, 
but keep all columns (variables) - although dr2000 is not a data frame?

And: R should not crash but rather give a warning.

------------------------
R version 2.1.1 Patched (2005-07-15)
Package Foreign Version 0.8-10

Operating system: Windows XP Professional (5.1 (Build 2600))
CPU: Pentium Model 2 Stepping 9
RAM: 512 MB

*************************************************
Dr. Dirk Enzmann
Institute of Criminal Sciences
Dept. of Criminology
Edmund-Siemers-Allee 1
D-20146 Hamburg
Germany

phone: +49-040-42838.7498 (office)
        +49-040-42838.4591 (Billon)
fax:   +49-040-42838.2344
email: dirk.enzmann at jura.uni-hamburg.de
www: 
http://www2.jura.uni-hamburg.de/instkrim/kriminologie/Mitarbeiter/Enzmann/Enzmann.html
#
The selection problem can be solved by

dr2000=read.spss('myfile')
d=lapply(dr2000,subset,dr2000$RBINZ99 > 0)

however, there is still the problem that R crashes when using

d = as.data.frame(dr2000)

or

dr2000=read.spss('myfile',to.data.frame=T)

Any suggestions why? I checked whether all components of dr2000 are of 
the same length and the sort of object of each component. This is not 
the problem: Each component has the same length (9232) and there are 66 
components of the class 'character', 981 of the class 'factor', and 479 
of the class 'numeric'.
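
A minimal sketch of the row selection and the length/class checks described above, using a small invented list in place of the real read.spss() result (the component names SEX and ID and all values are assumptions for illustration). It uses which() so that NA comparisons are dropped rather than propagated:

```r
# Toy stand-in for the list returned by read.spss(); names and values
# other than RBINZ99 are invented for illustration.
dr2000 <- list(
  RBINZ99 = c(0, 2, NA, 5),
  SEX     = factor(c("m", "f", "f", "m")),
  ID      = c("a", "b", "c", "d")
)

# Row positions where the condition holds; which() ignores NA results.
keep <- which(dr2000$RBINZ99 > 0)

# Apply the same row index to every component, so all "columns" are kept.
d <- lapply(dr2000, function(col) col[keep])

# Checks of the kind mentioned above: equal lengths and a count per class.
lens <- vapply(d, length, integer(1))
stopifnot(length(unique(lens)) == 1)
table(vapply(dr2000, function(col) class(col)[1], character(1)))
```

The result d is still a plain list, so this avoids as.data.frame() entirely; whether that is acceptable depends on what is done with d afterwards.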
#
    Dirk> The selection problem can be solved by
    Dirk> dr2000=read.spss('myfile')
    Dirk> d=lapply(dr2000,subset,dr2000$RBINZ99 > 0)

    Dirk> however, there is still the problem that R crashes when using

    Dirk> d = as.data.frame(dr2000)

which is a bug in R, or at least in your R installation.

However, we can't do anything about it at the moment, because we
can't even try to reproduce it...

So dr2000 is a list; what length() does it have? What names()?
What does str(dr2000) look like?

What happens for as.data.frame(dr2000[1:10])?
And with '100' or '1000' instead of '10'?

Maybe try to find a small version of 'dr2000' which still has
the problem, and show us that one,
e.g. by making it available via http://... if it is still large,
otherwise (if it's small), maybe even posting the result of
dump(..).
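
The narrowing-down suggested above could be sketched roughly as follows. The helper try_prefix() and the toy list are hypothetical, invented for illustration; note that tryCatch() only catches R-level errors, so a hard crash of the R process would still end the session at the failing size:

```r
# Hypothetical helper: try as.data.frame() on the first n components and
# report whether it completed without an R-level error.
try_prefix <- function(x, n) {
  n <- min(n, length(x))
  tryCatch({
    as.data.frame(x[seq_len(n)])
    TRUE
  }, error = function(e) FALSE)
}

# Toy list standing in for dr2000 (invented data: 1500 numeric components).
toy <- lapply(seq_len(1500), function(i) rnorm(5))
names(toy) <- paste0("V", seq_along(toy))

# Try increasing prefixes, as suggested: 10, 100, 1000, then everything.
sizes <- c(10, 100, 1000, length(toy))
results <- vapply(sizes, function(n) try_prefix(toy, n), logical(1))
names(results) <- sizes
results

# Once a small failing subset is found, dump() writes a text
# representation of it that can be posted or made available.
small <- toy[1:3]
dump_file <- tempfile(fileext = ".R")
dump("small", file = dump_file)
```

Running the sizes in increasing order means the last size printed before a crash gives an upper bound on where the problem starts.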

Regards,
Martin
2 days later
#
On Wed, 21 Sep 2005, Martin Maechler wrote:

            
I suspect this is the same stack overflow in coerce.c:substituteList that 
was reported in PR#8141

 	-thomas
Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
#
On Fri, 23 Sep 2005, Thomas Lumley wrote:

            
Apparently not (it had only about 1500 columns rather than 198000).  After 
taking it offline I was able to make it work on 1 GB machines under Windows 
and Linux, and Dirk succeeded using --max-mem-size=640M on Windows.  So it 
looks like it was a problem with total memory usage; I have yet to find 
out what exactly.
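
To get a rough idea of how much memory such a list occupies before and after conversion, something like the following sketch may help. The toy data are invented and far smaller than the real file, and this only measures object sizes; it does not show the working memory used during the conversion, nor does it identify the cause of the problem:

```r
# Toy list standing in for dr2000 (invented data: 50 numeric components).
toy <- replicate(50, rnorm(1000), simplify = FALSE)
names(toy) <- paste0("V", seq_along(toy))

# Size of each component in bytes, then the total in megabytes.
bytes <- vapply(toy, function(col) as.numeric(object.size(col)), numeric(1))
total_mb <- sum(bytes) / 1024^2
total_mb

# Size of the converted data frame, for comparison with the list.
df <- as.data.frame(toy)
df_mb <- as.numeric(object.size(df)) / 1024^2
df_mb
```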