Hi R experts,

I'm trying to migrate from SPSS to R, though I have some problems getting R to distinguish between the different kinds of missing values. I want to distinguish between data that are missing because a respondent refused to answer and data that are missing because the question didn't apply to that respondent. In other words, I want to create data values where I control what counts as valid and what counts as missing, so I can study both the valid and the missing observations.

SPSS does this quite smoothly. It looks something like this in SPSS:

Get paid appropriately, considering efforts and achievements

N   Valid     947
    Missing   558

                                                          Valid  Cumulative
                                    Frequency  Percent  Percent     Percent
Valid    Agree strongly                    98      6,5     10,3        10,3
         Agree                            408     27,1     43,1        53,4
         Neither agree nor disagree       126      8,4     13,3        66,7
         Disagree                         259     17,2     27,3        94,1
         Disagree strongly                 56      3,7      5,9       100,0
         Total                            947     62,9    100,0
Missing  Not applicable                   534     35,5
         Don't know                         1       ,1
         No answer                         23      1,5
         Total                            558     37,1
Total                                    1505    100,0

(If the table gets messy and you can't read it in your email program, there is a nicely formatted SPSS table at https://stat.ethz.ch/pipermail/r-help/1998-October/002942.html, where K. Mueller asked an almost identical question in 1998!)

SPSS metacategorizes the values of my variables as either Missing or Valid. This means that, besides differentiating between missing and valid, the categories within missing are treated separately.
At the moment I'm only able to get this information from R:

> describe(ess3dk$PDAPRP)
ess3dk$PDAPRP : Get paid appropriately, considering efforts and achievements
      n missing  unique
   1505       0       8

Agree strongly (98, 7%), Agree (408, 27%)
Neither agree nor disagree (126, 8%), Disagree (259, 17%)
Disagree strongly (56, 4%), Not applicable (534, 35%)
Don't know (1, 0%), No answer (23, 2%)

Then I can recode 'Not applicable', 'Don't know' and 'No answer' as missing:

> ess3dk[ess3dk$PDAPRP == "Not applicable" |
+        ess3dk$PDAPRP == "Don't know" |
+        ess3dk$PDAPRP == "No answer", "PDAPRP"] <- NA

But that just piles 'Not applicable', 'Don't know' and 'No answer' together as 'missing':

> describe(ess3dk$PDAPRP)
ess3dk$PDAPRP : Get paid appropriately, considering efforts and achievements
      n missing  unique
    947     558       5

Agree strongly (98, 10%), Agree (408, 43%)
Neither agree nor disagree (126, 13%), Disagree (259, 27%)
Disagree strongly (56, 6%)

Is there a smart way in R to differentiate between missing and valid and at the same time treat the categories within both missing and valid as answers (like SPSS did above)?

I'm using an SPSS data set (.sav/.por) from the European Social Survey (the ESS):
http://ess.nsd.uib.no/index.jsp?module=download&year=2007&country=&download=%5CDirect+Data+download%5C2007%5C01%23ESS3+-+integrated+file%2C+edition+2.0%5C.%5CESS3e02.spss.zip

I import it via spss.get like this:

> ess3dk <- spss.get("filename.sav", lowernames = FALSE, datevars = NULL,
+                    use.value.labels = TRUE, to.data.frame = TRUE,
+                    max.value.labels = Inf, force.single = TRUE,
+                    allow = NULL, charfactor = FALSE)

I have read the help files for spss.get and read.spss to see if this subject was mentioned, and I have searched this mailing list. I found one question that is almost the same, at https://stat.ethz.ch/pipermail/r-help/1998-October/002942.html (from October 1998!), but it never received an answer.
Here is some self-contained, reproducible code:

library(Hmisc)  # provides describe()

# I create a simple data frame
dataFrame <- data.frame(ONE = c(2, 1, 3, 2, NA, 4, 2),
                        TWO = c("yes", "?", "No", "X", "No", "?", "X"),
                        AGE = c(42, 18, 49, 62, NA, 19, 82))

# Then I have a look at the TWO column. Here I can see every answer.
describe(dataFrame$TWO)

# Now I classify the answers "X" and "?" as missing, because I want to know
# the valid percent (yes and no), but I don't want to delete the "X" and
# "?" answers.
dataFrame[dataFrame$TWO == "?" | dataFrame$TWO == "X", "TWO"] <- NA

# Then I have another look at the TWO column. Now I can't see how many
# answered "X" and how many answered "?".
describe(dataFrame$TWO)

My question is whether it's possible in R to work with a metacategory of valid and not valid answers, as described above. In other words I want, as is possible in SPSS, to distinguish between a percentage within the valid answers and a percentage overall.

I normally use this method to quickly get an overview of missing and valid answers and of the percentage distribution within each group, so I would like to find a smart solution to this problem.

I would really appreciate an answer or some help to point me in the right direction.

Thanks in advance

Regards
Ericka Lujndstr?m
Study missing data: differentiate between a percentage within the valid answers and within the different missing answers
4 messages · James Reilly, Frank E Harrell Jr, Eric Fail
On 3/3/08 8:21 PM, Ericka Lundstr?m wrote:
> I'm trying to migrate from SPSS to R, though I have some problems
> getting R to distinguish between the different kinds of missing values.
...
> Is there a smart way in R to differentiate between missing and valid
> and at the same time treat both the categories within missing and
> valid as answers (like SPSS did above)?
The Hmisc package has some support for special missing values, for
instance when reading in SAS datasets using sas.get. I don't believe
spss.get offers the same facility, though.
You can define special missing values for a variable manually, which
might seem a bit involved, but this could easily be automated. For your
example, try:
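# In R >= 4.0, data.frame() no longer converts strings to factors,
# so make sure TWO is a factor first
dataFrame$TWO <- factor(dataFrame$TWO)

# Flag the observations holding a special missing code
special <- dataFrame$TWO %in% c("?", "X")

# Record which code each flagged observation held, and where
attr(dataFrame$TWO, "special.miss") <-
  list(codes = as.character(dataFrame$TWO[special]),
       obs   = which(special))
class(dataFrame$TWO) <- c("factor", "special.miss")

# Set the flagged observations to NA
is.na(dataFrame$TWO) <- special

# Then describe gives new percentages
describe(dataFrame$TWO)
dataFrame$TWO
      n missing       ?       X  unique
      3       4       2       2       2

No (2, 67%), yes (1, 33%)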
HTH,
James
James Reilly Department of Statistics, University of Auckland Private Bag 92019, Auckland, New Zealand
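The reply above notes that the manual steps could easily be automated. A minimal sketch of what that might look like follows. The function name set_special_miss and its arguments are hypothetical (this is not an Hmisc function), and it simply repeats the same attribute and class assignments shown above using base R only.

```r
# Hypothetical helper, not part of Hmisc: mark selected codes as
# "special" missing values, following the manual steps shown above.
set_special_miss <- function(x, codes) {
  x <- as.factor(x)                     # the approach expects a factor
  special <- !is.na(x) & x %in% codes   # positions holding a special code
  attr(x, "special.miss") <- list(
    codes = as.character(x[special]),   # which code each position held
    obs   = which(special)              # the positions themselves
  )
  class(x) <- c("factor", "special.miss")
  is.na(x) <- special                   # the flagged positions become NA
  x
}

dataFrame <- data.frame(
  ONE = c(2, 1, 3, 2, NA, 4, 2),
  TWO = c("yes", "?", "No", "X", "No", "?", "X"),
  AGE = c(42, 18, 49, 62, NA, 19, 82)
)
dataFrame$TWO <- set_special_miss(dataFrame$TWO, c("?", "X"))

sum(is.na(dataFrame$TWO))                         # how many values are missing
table(attr(dataFrame$TWO, "special.miss")$codes)  # breakdown by missing code
```

With Hmisc loaded, describe(dataFrame$TWO) should then report the special codes in the same way as in the manual version. Applying the helper to every column of a data frame (for instance with lapply) would be one way to cover a whole survey file, assuming the same missing codes apply to each variable.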
James Reilly wrote:
...
Thanks for pointing out how this can be done with Hmisc, James. If the foreign package can sense SPSS special missing values in general, it would not be hard to add the special.miss mechanism to spss.get in Hmisc.

Frank
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University
On Mon, 03 Mar 2008 22:02:17 +1300, James Reilly wrote
...
Dear James Reilly,

Thanks a lot for your answer. Now I can get, or make, 'metacategories' for my data, which is wonderful! Though I actually only needed two metacategories: one for missing answers and one for valid answers. Anyhow, it looks like R is treating "X" and "?" as missing, or as subcategories of missing.

One thing I still need is for R to give me both a percentage within the valid answers (or unique) and a percentage overall. Is that in any way possible? With special.miss I only get the distribution within n [No (2, 67%), yes (1, 33%)]. I don't get an overall percentage [? (2, 29%), No (2, 29%), X (2, 29%), yes (1, 14%)].

Hasn't someone developed a package for this feature? Karsten Mueller asked about this 10 years ago: https://stat.ethz.ch/pipermail/r-help/1998-October/002942.html

I hope someone has the time to help me. And again, thanks to James Reilly for his answer!

All the best
Ericka Lujndstr?m
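The two percentages asked for here (one overall, one within the valid answers only) can be computed in base R without any special missing-value machinery. The following is a sketch under stated assumptions rather than an established solution: freq_table and its missing_codes argument are hypothetical names, and the missing codes are assumed to be still present in the data as ordinary values, not yet recoded to NA.

```r
# Hypothetical helper: an SPSS-style frequency table that reports both an
# overall percentage and a percentage within the valid answers.
freq_table <- function(x, missing_codes) {
  x <- as.character(x)
  x[is.na(x)] <- "NA"                  # keep true NAs visible as their own row
  counts <- table(x)
  is_missing <- names(counts) %in% c(missing_codes, "NA")
  out <- data.frame(
    Answer    = names(counts),
    Status    = ifelse(is_missing, "Missing", "Valid"),
    Frequency = as.vector(counts),
    stringsAsFactors = FALSE
  )
  # Percentage over all observations, valid or not
  out$Percent <- round(100 * out$Frequency / sum(out$Frequency), 1)
  # Percentage over valid observations only; left as NA for missing codes
  valid_n <- sum(out$Frequency[!is_missing])
  out$ValidPercent <- ifelse(is_missing, NA_real_,
                             round(100 * out$Frequency / valid_n, 1))
  out[order(is_missing), ]             # valid rows first, then missing rows
}

TWO <- c("yes", "?", "No", "X", "No", "?", "X")
freq_table(TWO, missing_codes = c("?", "X"))
```

For the example column, "No" accounts for 2 of 7 observations overall (28.6%) and 2 of 3 valid answers (66.7%), which matches the two distributions quoted in the message above. A cumulative column could be added with cumsum() on the valid rows if the full SPSS layout is wanted.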