Skip to content

Deleting rows and columns containing NA's and "" only

6 messages · syrvn, Petr Savicky, David Winsemius

#
Hello,

I use read.xls from the gdata package to read in xlsx files. Sometimes these
data.frames contain NA columns
and rows only. I know how to get rid of those ones but here is the R output
of a test data set read in with read.xls
A          B         X         D               X.1         X.2
1 test      1         NA                        NA    
2 <NA>   asd    NA      asdasd    NA    
3                          NA      asdasd    NA    
4                          NA                        NA         NA

t1[1,2], t1[4,5] and t1[4,6] are NA in text form in the excel sheet. I don't
understand why in the first column it is <NA> while in the last two is not.
I basically want to get rid of column 5 and 6 and row 4 as they do not
contain any relevant information. If i do a is.na.data.frame(t1):
A     B    X     D  X.1   X.2
[1,] FALSE FALSE TRUE FALSE TRUE FALSE
[2,]  TRUE FALSE TRUE FALSE TRUE FALSE
[3,] FALSE FALSE TRUE FALSE TRUE FALSE
[4,] FALSE FALSE TRUE FALSE TRUE FALSE

does not give me the result I hoped to get.

It seems that <NA> and NA are treated as NA but in t1[4,6] it is treated as
FALSE because if I do
[1] "NA "

one can see that there is a whitespace after NA which is, however, not in
the excel sheet for sure.

I do not know how to deal with that...

Cheers

--
View this message in context: http://r.789695.n4.nabble.com/Deleting-rows-and-columns-containing-NA-s-and-only-tp4384173p4384173.html
Sent from the R help mailing list archive at Nabble.com.
#
On Mon, Feb 13, 2012 at 07:48:11AM -0800, syrvn wrote:
Hi.

I do not know, how "NA " appeared, however, it is possible
to change them to real NA as follows.

  # some data frame
  df <- structure(list(a = c(NA, 2L, 3L, 4L), b = c("a", NA, "c", "NA "),
    c = structure(c(1L, 2L, NA, 4L), .Label = c("e", "f", "g", "h"),
    class = "factor")), .Names = c("a", "b", "c"), row.names = c(NA, -4L),
    class = "data.frame")
  df

     a    b    c
  1 NA    a    e
  2  2 <NA>    f
  3  3    c <NA>
  4  4  NA     h

  df[4, 2] # this is not NA, but "NA "

  [1] "NA "

  # replace all "NA " by NA in column 2

  df[which(df[,2] == "NA "), 2] <- NA
  df

     a    b    c
  1 NA    a    e
  2  2 <NA>    f
  3  3    c <NA>
  4  4 <NA>    h

Hope this helps.

Petr Savicky.
#
Hi,

thanks for you suggestion.

I finally solved it in a different way using apply and is.na for TRUE NA's
and if(as.character(x) == "NA") etc.

However, I just spotted that read.xls seems to have problems reading in
special characters such as < or >.

Is there any workaround for that?

--
View this message in context: http://r.789695.n4.nabble.com/Deleting-rows-and-columns-containing-NA-s-and-only-tp4384173p4384663.html
Sent from the R help mailing list archive at Nabble.com.
#
On Feb 13, 2012, at 1:05 PM, syrvn wrote:

            
Or could it be that you have problems in not reading the help pages  
carefully?

 > read.table(text="<NA>")
     V1
1 <NA>

 > read.table(text="<NA>", na.strings=c("<NA>","NA"))
   V1
1 NA
#
On Feb 13, 2012, at 1:57 PM, syrvn wrote:

            
Please read the help page for read.xls more carefully.
David Winsemius, MD
West Hartford, CT