Subsetting on multiple criteria (AND condition) in R

5 messages · arun, Marc Schwartz, William Dunlap +1 more

#
Hi,
Try:
table(as.character(non_us[,"COUNTRY"]))
A.K.
On Tuesday, January 14, 2014 3:17 PM, Jeff Johnson <mrjefftoyou at gmail.com> wrote:
I'm running the following to get what I would expect is a subset of
countries that are not equal to "US" AND COUNTRY is not in one of my
validcountries values.

non_us <- subset(mydf, (COUNTRY %in% validcountries) & COUNTRY != "US",
select = COUNTRY, na.rm=TRUE)

however, when I then do table(non_us) I get:
non_us
   AE AN AR AT AU BB BD BE BH BM BN BO BR BS CA CH CM CN CO CR CY DE DK DO EC ES
 0  3  0  2  1 31  4  1  1  1 45  1  1  4  5 86  3  1  8  1  2  1  8  2  1  2  4
FI FR GB GR GU HK ID IE IL IN IO IT JM JP KH KR KY LU LV MO MX MY NG NL NO NZ PA
 2  4 35  3  3 14  3  5  2  5  1  2  1 15  1 11  2  2  1  1 23  7  1  6  1  3  1
PE PG PH PR PT RO RU SA SE SG TC TH TT TW TZ US ZA
 2  1  1  8  1  1  1  1  1 18  1  1  2 11  1  0  3
Notice US appears as the second to last. I expected it to NOT appear.

Do you know if I'm using incorrect syntax? Is the & symbol equivalent to
AND (notice I have 2 criteria for subsetting)? Also, is COUNTRY != "US"
valid syntax? I don't get errors, but then again I don't get what I expect
back.

Thanks in advance!
#
On Jan 14, 2014, at 1:38 PM, Jeff Johnson <mrjefftoyou at gmail.com> wrote:

            
Review the Details section of ?subset, where you will find the following:

"Factors may have empty levels after subsetting; unused levels are not automatically removed. See droplevels for a way to drop all unused levels from a data frame."


Your syntax is fine and the behavior is as expected.
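
As a minimal sketch of the droplevels() route mentioned in ?subset: the data below are made up for illustration, and the names toy_df and valid are hypothetical stand-ins for mydf and validcountries, which were not posted.

```r
# Hypothetical stand-ins for mydf and validcountries from the question.
toy_df <- data.frame(COUNTRY = factor(c("US", "CA", "CA", "GB", "XX")))
valid  <- c("US", "CA", "GB")

# Same two-condition filter as in the question; & is the elementwise AND.
non_us <- subset(toy_df, (COUNTRY %in% valid) & COUNTRY != "US",
                 select = COUNTRY)

# The rows for "US" and "XX" are gone, but both levels are still attached
# to the factor, so table() would report them with a count of 0.
levels_before <- levels(non_us$COUNTRY)

# droplevels() removes every unused level from the factors in a data frame.
non_us <- droplevels(non_us)
levels_after <- levels(non_us$COUNTRY)

table(non_us$COUNTRY)
```

After droplevels(), table() should list only the countries that actually occur in the subset.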

Regards,

Marc Schwartz
#
Here is a reproducible example of your problem where you do not
want to see a table entry for "Medium".
  > tmp_df <- data.frame(Size=factor(rep(c("Small","Medium","Large"),1:3), levels=c("Small","Medium","Large")))
  > non_medium <- subset(tmp_df, Size != "Medium", select=Size)
  > table(non_medium)
  non_medium
   Small Medium  Large 
       1      0      3

The problem arises because, by default, when you take a subset of a factor
all the levels of the factor are retained and table(factor) makes an entry for
every level.  If you want to drop the unused levels in a factor (and retain the
order of the remaining levels) you can pass it through the factor function:
  > table(Size=factor(non_medium$Size))
  Size
  Small Large 
      1     3

You can also subset the factor with the drop=TRUE argument to drop the unused
levels when you make the subset:
  > table(Size=tmp_df$Size[tmp_df$Size != "Medium", drop=TRUE])
  Size
  Small Large 
      1     3

Some will say to use as.character on the factor or not to use factors at all.  That
works if you are OK with the entries in the table being in alphabetic order and
not a semantic order of your choosing.
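
To illustrate the ordering difference, here is a small sketch that rebuilds the tmp_df example from above so it runs on its own. The names by_character and by_factor are introduced only for this illustration.

```r
# Same example data as above, rebuilt so this snippet is self-contained.
tmp_df <- data.frame(Size = factor(rep(c("Small", "Medium", "Large"), 1:3),
                                   levels = c("Small", "Medium", "Large")))
non_medium <- subset(tmp_df, Size != "Medium", select = Size)

# as.character() discards the levels, so table() sorts the labels itself.
by_character <- table(as.character(non_medium$Size))
names(by_character)   # "Large" "Small"

# factor() drops the unused level but keeps the original level order.
by_factor <- table(factor(non_medium$Size))
names(by_factor)      # "Small" "Large"
```

Both tables hold the same counts; only the order of the entries differs.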

Bill Dunlap
TIBCO Software
wdunlap tibco.com