Still can't find missing data - How do I get NA in xtabs with factors?

2 messages · Farley, Robert, Uwe Ligges

Original

Fri, May 29, 2009 11:14 AM

Let's see if I understand this.  Do I iterate through
    x <- factor(x, levels=c(levels(x), NA), exclude=NULL)
for each of the few hundred variables (x) in my data frame?


I tried to do this all at once and failed:

ToyData

    Data1 Data2  Data3 Weight
101   Sam   Red Banana    1.1
102   Sam Green Banana    2.1
103   Sam  Blue Orange    2.1
104  Fred   Red Orange    2.1
105  Fred Green  Guava    2.1
106  Fred  Blue  Guava    2.1
107  <NA>   Red   Pear   50.1
108  <NA> Green   Pear   50.1
109  <NA>  Blue   <NA> 1000.2

ToyData <- factor(ToyData, levels(c(levels(ToyData), NA), exclude=NULL, na.action=na.pass))

Error in levels(c(levels(ToyData), NA), exclude = NULL, na.action = na.pass) :
  unused argument(s) (exclude = NULL, na.action = function (object, ...)

ToyData <- factor(ToyData, levels(c(levels(ToyData), NA)))
ToyData

 Data1  Data2  Data3 Weight
  <NA>   <NA>   <NA>   <NA>
Levels:

But it didn't work.  Don't I need to do this separately for each variable?



Is there a way to get read.spss to insert "NA" levels for each variable when I create the data frame?  Is this because SPSS (and STATA) allow "NA" as an "undeclared level" and R does not?


Will this be a problem with read.dta as well?




Robert Farley
Metro
www.Metro.net


-----Original Message-----
From: William Dunlap [mailto:wdunlap at tibco.com]
Sent: Thursday, May 28, 2009 20:39
To: Farley, Robert
Subject: RE: [R] Still can't find missing data

In R factors don't save space over character vectors - only
one copy of any given string is kept in memory in either case.
Factors do let you order the levels in the way you want and
that is often important in presentations.

You can add NA to the list of levels of a factor by doing
    x <- factor(x, levels=c(levels(x), NA), exclude=NULL)
where 'x' represents each factor in your dataset.  After
doing that is.na(x) will be all FALSE and you may not
want that for other situations.
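
A minimal sketch of what the suggestion above appears to intend, with the extended level set passed through the `levels =` argument of factor(). The vector `x` here is made-up illustration data, and the comparison with base R's addNA() is an addition, not something proposed in the thread.

```r
# A factor with missing values; by default NA is not one of its levels.
x <- factor(c("Sam", "Fred", NA, "Sam"))
levels(x)            # "Fred" "Sam"

# Extend the level set with NA and keep it by setting exclude = NULL.
x_na <- factor(x, levels = c(levels(x), NA), exclude = NULL)

# addNA() from base R performs the same conversion in one call.
x_add <- addNA(x)

table(x_na)          # tabulates the NA level alongside the others
any(is.na(x_na))     # FALSE: NA is now a level, not a missing value
```

The last line is the caveat mentioned above: once NA is a level, is.na() no longer flags those entries, so code that relies on is.na() needs the unconverted factor.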

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Farley, Robert
Sent: Thursday, May 28, 2009 5:27 PM
To: R-help
Subject: Re: [R] Still can't find missing data

That seems to work for the toy data.  How do I implement this
change with my real data, which are read from very large
Stata and SPSS files and keep the factor definitions?  Won't
I be losing information (and creating a larger dataset) by
not using the factor levels?

How do I recover the factor values?  I read my datafile
(read.spss with use.value.labels = FALSE) and got this:

              connector
Mode_orig_only            1            9
          1       17.814338     0.000000
          3       49.128982     0.000000
          4      525.978899     0.000000
          5      913.295370     0.000000
          6      114.302764     0.000000
          7      298.151438     0.000000
          8       93.088049     0.000000
          9      233.794168     0.000000
          10      20.764539     0.000000
          11     424.120506     0.000000
          12       8.054528     0.000000
          13       6.010790     0.000000
          14    1832.748525     0.000000
          15   10191.284139     0.000000
          16    2099.771923     0.000000
          17    1630.148576     0.000000
          <NA>     0.000000  9491.013249

which does have the "NA" row, but not the factor labels.  If
I read the file with use.value.labels=TRUE I can see what I'm
summarizing, but not the NAs.  Can't I have both?

The top summary also omits (of course) every level of the
summarized variable that has zero weight.

The same summary using factors:

                                                                   connector
Mode_orig_only                                                  OD Passenger    Connector
  Walked/Biked                                                     17.814338     0.000000
  I flew in from another a place/connected                          0.000000     0.000000
  Amtrak                                                           49.128982     0.000000
  Bus - Chartered bus or van                                      525.978899     0.000000
  Bus - Hotel Courtesy van                                        913.295370     0.000000
  Bus - MTA (Metro) or other public transit bus                   114.302764     0.000000
  Bus - Scheduled airport bus or van (e.g. Airport bus or Disn    298.151438     0.000000
  Bus - Union Station Flyaway                                      93.088049     0.000000
  Bus - Van Nuys Flyaway                                          233.794168     0.000000
  Green line/light rail                                            20.764539     0.000000
  Limousine/town car                                              424.120506     0.000000
  Metrolink                                                         8.054528     0.000000
  Motorcycle                                                        6.010790     0.000000
  On-call shuttle/van (e.g. Super Shuttle, Prime Time)           1832.748525     0.000000
  Car/truck/van - Private                                       10191.284139     0.000000
  Car/truck/van - Rental                                         2099.771923     0.000000
  Taxi                                                           1630.148576     0.000000
  ..Refused                                                         0.000000     0.000000
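
One possible way to get both the labels and the NA row is to read the numeric codes, attach the labels by hand, and then make NA an explicit level. The sketch below uses made-up codes and weights. The lookup `mode_labels` is a hypothetical three-entry table whose code-to-label pairing is only inferred by lining up the two summaries above (codes 1 and 3 carry the same weights as "Walked/Biked" and "Amtrak").

```r
# Hypothetical numeric codes, as read.spss(use.value.labels = FALSE) would give them.
mode_code <- c(1, 3, 3, NA, 1, NA)
weight    <- c(2, 5, 1, 10, 4, 20)

# Assumed code-to-label lookup for codes 1..3 (inferred, not taken from the SPSS file).
mode_labels <- c("Walked/Biked",
                 "I flew in from another a place/connected",
                 "Amtrak")

# Attach the labels first, then promote NA to an explicit level with addNA().
mode_f <- addNA(factor(mode_code, levels = 1:3, labels = mode_labels))

# The weighted table keeps the labels, the unused level, and the NA row.
tab <- xtabs(weight ~ mode_f)
tab
```

Declaring the levels explicitly also keeps levels that never occur in the data, which the purely numeric summary drops.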

Robert Farley
Metro
www.Metro.net

-----Original Message-----
From: William Dunlap [mailto:wdunlap at tibco.com]
Sent: Thursday, May 28, 2009 16:26
To: Farley, Robert
Subject: RE: [R] Still can't find missing data

Try reading it in with read.table's argument stringsAsFactors=FALSE.

I think the underlying problem is that exclude= is used only if
the classifying variables are not already factors.  I haven't studied
the help file well enough to see if that is what it is documented
to do, but it seems misleading.
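
A small illustration of the factor() behaviour behind this remark, not of xtabs itself: `exclude` takes effect when a factor is built from a character vector, so keeping text columns as character leaves that decision open. The vector `v` and the inline two-row data set are made up for the example and stand in for Toy.csv.

```r
v <- c("Sam", "Fred", NA)

# By default NA is dropped from the levels when the factor is created ...
levels(factor(v))                  # "Fred" "Sam"

# ... while exclude = NULL keeps it as a level.
levels(factor(v, exclude = NULL))

# Reading text columns as character postpones the conversion to factor.
d <- read.table(text = "ID_Num,Data1,Weight
101,Sam,1
107,NA,50",
  header = TRUE, sep = ",", row.names = "ID_Num",
  stringsAsFactors = FALSE)
class(d$Data1)                     # "character"
```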

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Farley, Robert
Sent: Thursday, May 28, 2009 4:10 PM
To: R-help
Subject: Re: [R] Still can't find missing data

In this toy data, each of the tables should sum to 1111.
None of the tables shows NA columns or rows.

################################
ToyData <- read.table("C:/Data/R/Toy.csv", header=TRUE,
    sep=",", na.strings="NA", dec=".", row.names="ID_Num")

ToyData

    Data1 Data2  Data3 Weight
101   Sam   Red Banana      1
102   Sam Green Banana      2
103   Sam  Blue Orange      2
104  Fred   Red Orange      2
105  Fred Green  Guava      2
106  Fred  Blue  Guava      2
107  <NA>   Red   Pear     50
108  <NA> Green   Pear     50
109  <NA>  Blue   <NA>   1000

xtabs(Weight ~  Data1 + Data2, exclude=NULL,
    na.action=na.pass, ToyData)
      Data2
Data1  Blue Green Red
  Fred    2     2   2
  Sam     2     2   1

xtabs(Weight ~  Data1 + Data2, exclude=NULL,
    na.action=na.pass,drop.unused.levels = FALSE, ToyData)
      Data2
Data1  Blue Green Red
  Fred    2     2   2
  Sam     2     2   1

xtabs(Weight ~  Data1 + Data3, exclude=NULL,
    na.action=na.pass,drop.unused.levels = FALSE, ToyData)
      Data3
Data1  Banana Guava Orange Pear
  Fred      0     4      2    0
  Sam       3     0      2    0
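
For a self-contained version of this example, the sketch below rebuilds the toy data inline instead of reading C:/Data/R/Toy.csv, and wraps the classifying variables in addNA() inside the formula. Using addNA() this way is an assumption added here, not something tried in the thread. It is one way to make the rows with missing Data1 or Data3 appear in the table.

```r
# Toy data rebuilt from the listing above, with text columns stored as factors.
ToyData <- data.frame(
  Data1  = c("Sam", "Sam", "Sam", "Fred", "Fred", "Fred", NA, NA, NA),
  Data2  = c("Red", "Green", "Blue", "Red", "Green", "Blue",
             "Red", "Green", "Blue"),
  Data3  = c("Banana", "Banana", "Orange", "Orange", "Guava", "Guava",
             "Pear", "Pear", NA),
  Weight = c(1, 2, 2, 2, 2, 2, 50, 50, 1000),
  row.names = 101:109,
  stringsAsFactors = TRUE
)

# addNA() turns NA into an explicit level, so no row is dropped as missing.
tab <- xtabs(Weight ~ addNA(Data1) + addNA(Data3), data = ToyData)
tab
sum(tab)  # 1111
```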





Robert Farley
Metro
www.Metro.net


-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Dieter Menne
Sent: Thursday, May 28, 2009 05:46
To: r-help at r-project.org
Subject: Re: [R] Still can't find missing data




Farley, Robert wrote:

I can't get the syntax that will allow me to show NA values (rows) in the xtabs.

lengthy non-reproducible example removed

If you want a reproducible answer, prepare a reproducible result. And check that the syntax is

na.action=na.pass

Dieter




--
View this message in context:
http://www.nabble.com/Still-can%27t-find-missing-data-tp23730627p23761006.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Uwe Ligges

Sat, May 30, 2009 6:41 AM

Farley, Robert wrote:

Yes, for all being factors.

Best,
Uwe Ligges
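
Since factor() works on one vector at a time, the conversion has to be applied column by column, but that does not require typing it out for each of several hundred variables. A minimal sketch, assuming a small made-up data frame `dat` in place of the real SPSS or Stata data and using base R's addNA() for the per-column step:

```r
# Hypothetical stand-in for a data frame read from SPSS or Stata.
dat <- data.frame(
  Data1  = factor(c("Sam", "Fred", NA)),
  Data3  = factor(c("Banana", NA, "Pear")),
  Weight = c(1.1, 2.1, 50.1)
)

# Find the factor columns and convert only those; numeric columns are untouched.
is_fac <- vapply(dat, is.factor, logical(1))
dat[is_fac] <- lapply(dat[is_fac], addNA)

# Every factor column now carries NA as an explicit level.
sapply(dat[is_fac], function(col) anyNA(levels(col)))

xtabs(Weight ~ Data1 + Data3, data = dat)
```

Assigning into `dat[is_fac]` keeps the object a data frame, whereas passing the whole data frame to factor() treats it as a single vector, which is why the attempt quoted below returns one element per column.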

I tried to do this all at once and failed:

ToyData

    Data1 Data2  Data3 Weight
101   Sam   Red Banana    1.1
102   Sam Green Banana    2.1
103   Sam  Blue Orange    2.1
104  Fred   Red Orange    2.1
105  Fred Green  Guava    2.1
106  Fred  Blue  Guava    2.1
107  <NA>   Red   Pear   50.1
108  <NA> Green   Pear   50.1
109  <NA>  Blue   <NA> 1000.2

ToyData <- factor(ToyData, levels(c(levels(ToyData), NA), exclude=NULL, na.action=na.pass))

Error in levels(c(levels(ToyData), NA), exclude = NULL, na.action = na.pass) :
  unused argument(s) (exclude = NULL, na.action = function (object, ...)

ToyData <- factor(ToyData, levels(c(levels(ToyData), NA)))
ToyData

 Data1  Data2  Data3 Weight
  <NA>   <NA>   <NA>   <NA>
Levels:
But it didn't work.  Don't I need to do this separately for each variable?


