foreign(read.spss) in rw2000 and re2001beta - R-devel

Sat, Nov 6, 2004 7:48 AM

I encountered something strange with read.spss (package foreign,
version 0.7 with R 2.0.0 and version 0.8 with R 2.0.1 beta, Windows XP).

I made a test file test.sav with SPSS version 11.5.1
containing only one numeric variable, with a value label
for one value not occurring in the file. According to ?read.spss
this should result in a factor, but it results in all NA. Using the
argument use.value.labels=FALSE, everything is read as expected.

test <- read.spss("test.sav", to.data.frame=TRUE)
test    # -> only NA's

Kjetil

Kjetil Halvorsen.

Peace is the most effective weapon of mass construction.
               --  Mahdi Elmandjra

Brian Ripley

Sat, Nov 6, 2004 7:55 AM

On Sat, 6 Nov 2004, Kjetil Brinchmann Halvorsen wrote:

Please clarify: the same thing in both versions (read.spss has not changed 
between them)?

Can you make that file available please?

Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Peter Dalgaard

Sat, Nov 6, 2004 9:14 AM

Kjetil Brinchmann Halvorsen <kjetil@acelerate.com> writes:

Er, what do you mean "all NA"? If the only factor level corresponds to
a value that isn't present, wouldn't you expect to get a factor with
one level and all values missing? What does str() say about the
resulting object?
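
The situation described above can be imitated in base R without any SPSS file. This is only a sketch: the vector x and the label "Not applicable" are made-up stand-ins, and the factor() call is an assumption about the kind of conversion involved, not read.spss's actual code.

```r
# Hypothetical stand-in for the SPSS variable: values 1..3, none of them labelled
x <- c(1, 2, 3, 2, 1)

# One value label, attached to a code (9) that never occurs in the data
value_labels <- c("Not applicable" = 9)

# Turning the variable into a factor using only the labelled codes as levels
f <- factor(x, levels = value_labels, labels = names(value_labels))

str(f)          # a factor with a single level
all(is.na(f))   # every observation is NA, since no value matches the level
```

Under these assumptions the result is a legitimate factor, but one whose only level never occurs, so every value prints as NA.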

O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907

Thomas Lumley

Sat, Nov 6, 2004 10:05 AM

On Sat, 6 Nov 2004, Peter Dalgaard wrote:

It should result in a factor all of whose values are missing. And it does.

I have modified read.spss (but not committed the changes yet)  so that it 
does not create a factor when there are missing levels.
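
One possible reading of that rule, sketched in base R. The helper name factor_if_fully_labelled is hypothetical and not part of foreign, and it assumes that "missing levels" means values in the data that have no label; the uncommitted change may well work differently.

```r
# Hypothetical helper, not part of the foreign package.
# Assumption: "missing levels" means data values that carry no value label.
factor_if_fully_labelled <- function(x, value_labels) {
  observed <- unique(x[!is.na(x)])
  if (length(value_labels) > 0 && all(observed %in% value_labels)) {
    # Every observed value has a label: safe to treat the variable as a factor
    factor(x, levels = value_labels, labels = names(value_labels))
  } else {
    # Leave the values alone, but keep the labels for later use
    attr(x, "value.labels") <- value_labels
    x
  }
}

sex <- factor_if_fully_labelled(c(1, 2, 2, 1), c(male = 1, female = 2))
age <- factor_if_fully_labelled(c(34, 51, 99), c(refused = 99))

is.factor(sex)   # fully labelled, so converted
is.factor(age)   # only one code labelled, so left numeric
```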

The problem is that SPSS uses value labels for two different things: for 
factors and for labelling a subset of values (e.g. different types of 
missing value). It is hard for R to guess which the user intends.

You can always set use.value.labels=FALSE.  You still get the value labels 
read in and then you can decide what to do with them.
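
As an illustration of deciding what to do with the labels afterwards, here is a small base R sketch for the "types of missing" case. The data are invented, and the attribute name "value.labels" is an assumption about where the labels end up; check attributes() on your own object.

```r
# Hypothetical variable as it might look after reading without factor conversion;
# the "value.labels" attribute name is an assumption.
income <- c(1200, 98, 3100, 99, 2500)
attr(income, "value.labels") <- c("Refused" = 98, "Don't know" = 99)

labs <- attr(income, "value.labels")

# Record why each value is missing, then blank out the labelled codes
why_missing  <- names(labs)[match(income, labs)]
income_clean <- ifelse(income %in% labs, NA, income)

data.frame(income_clean, why_missing)
```

Here the labels are treated as annotations on a few special codes rather than as factor levels, which is the second of the two SPSS usages mentioned above.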

 	-thomas

Peter Dalgaard

Sat, Nov 6, 2004 11:04 AM

Thomas Lumley <tlumley@u.washington.edu> writes:

Hmm, could we handle this more elegantly? At the very least, we should
probably try to keep the labels as an attribute, so that you have the
option of doing something constructive with them afterwards. Perhaps
we need an entire new data class (say "labeled") and an as.factor()
method doing basically what happens now.
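
A rough sketch of what such a class might look like, purely for discussion. The names labeled() and labeled_as_factor() are hypothetical and nothing like this exists in foreign. Since as.factor() in base R is an ordinary function rather than an S3 generic, a plain helper stands in for the proposed method here.

```r
# Hypothetical constructor for a "labeled" vector: the values are kept as they
# are and the labels travel along as an attribute.
labeled <- function(x, value_labels) {
  structure(x, value.labels = value_labels, class = "labeled")
}

# Plain helper standing in for an as.factor() method. Values without a label
# become NA, which mirrors the behaviour reported at the top of the thread.
labeled_as_factor <- function(x) {
  labs <- attr(x, "value.labels")
  factor(as.vector(unclass(x)), levels = labs, labels = names(labs))
}

v <- labeled(c(1, 2, 9, 1), c(yes = 1, no = 2))
f <- labeled_as_factor(v)

as.character(f)   # the unlabelled 9 is lost only at conversion time
```

The point of the design is that the conversion to a factor becomes an explicit, later step, so the original codes and labels are both still available until the user chooses.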

BTW, I take it that you mean "when there are values with no labels",
not "labels for values that are not present in data", right?


Thomas Lumley

Sat, Nov 6, 2004 12:15 PM

On Sat, 6 Nov 2004, Peter Dalgaard wrote:

They are kept as an attribute if use.value.labels=TRUE, whether or not a 
factor is created. I had thought they were always kept as an attribute 
even if use.value.labels=FALSE (as happens with read.dta), but this turns 
out not to be the case. This is easily fixed, and I will do so.


I have thought about a class that would make sense for data with some 
values labelled but not all. This is related to some other data 
representation issues that a colleague is asking about, and we may come up 
with a reasonably elegant and general solution.

 	-thomas