Selecting groups with R

13 messages · jlwoodard, Don McKenzie, Fredrik Karlsson +4 more

#
I have a data set similar to the following:

Color  Score
RED      10
RED      13
RED      12
WHITE   22
WHITE   27
WHITE   25
BLUE     18
BLUE     17
BLUE     16

and I am trying to select just the values of Color that are equal to RED
or WHITE, excluding the BLUE.

I've tried the following:
myComp1<-subset(dataset, Color =="RED" | Color == "WHITE")
myComp1<-subset(dataset, Color != "BLUE")
myComp1<-dataset[which(dataset$Color != "BLUE"),]

Each of the above lines successfully excludes the BLUE subjects, but the
"BLUE" category is still present in my data set; that is, if I try
table(Color)  I get 

RED  WHITE  BLUE
82     151      0

If I try to do a t-test (since I've presumably gone from three groups to two
groups), I get:
Error in if (stderr < 10 * .Machine$double.eps * max(abs(mx), abs(my)))
stop("data are essentially constant") : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In mean.default(y) : argument is not numeric or logical: returning NA

and describe.by(score,Color) gives me descriptives for RED and WHITE, and
BLUE also shows up as NULL.

How can I eliminate the BLUE category completely so I can do a t-test using
Color (with just the RED and WHITE subjects)?

Many thanks in advance!!

John
#
dataset[dataset$Color != "BLUE",]
On 21-Aug-09, at 3:08 PM, jlwoodard wrote:

            
Don McKenzie, Research Ecologist
Pacific Wildland Fire Sciences Lab
US Forest Service

Affiliate Professor
School of Forest Resources, College of the Environment
CSES Climate Impacts Group
University of Washington

desk: 206-732-7824
cell: 206-321-5966
dmck at u.washington.edu
donaldmckenzie at fs.fed.us
#
Hi John,

I would guess that your Color column is a factor, with three levels
("RED","BLUE","WHITE"), which means that they will all be included in
the output of a table() call, even if they are empty. Try

dataset <- transform(dataset, Color=as.character(Color))

or something similar and then create the table.

/Fredrik
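
A minimal sketch of the behaviour described above, using the nine-row sample from the original post. The name `dataset` follows the thread; `stringsAsFactors = TRUE` is an assumption added here so that Color is a factor regardless of the R version in use.

```r
# Rebuild the sample data from the original post, with Color as a factor.
dataset <- data.frame(
  Color = rep(c("RED", "WHITE", "BLUE"), each = 3),
  Score = c(10, 13, 12, 22, 27, 25, 18, 17, 16),
  stringsAsFactors = TRUE
)

# Subsetting removes the BLUE rows but keeps BLUE as a (now empty) level.
myComp1 <- subset(dataset, Color != "BLUE")
levels(myComp1$Color)
table(myComp1$Color)

# Converting to character discards the level information,
# so table() only reports values that actually occur.
myComp1$Color <- as.character(myComp1$Color)
table(myComp1$Color)
```

The second table() call should list only RED and WHITE, because a character vector carries no record of values that are absent.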
On Fri, Aug 21, 2009 at 11:08 PM, jlwoodard<john.woodard at wayne.edu> wrote:

  
    
#
On Aug 21, 2009, at 6:08 PM, jlwoodard wrote:

            
You are being bitten by the behavior of factors.
How.... did you do the "t-test"?
dataset$Color <- as.character(dataset$Color)
#
On Aug 21, 2009, at 6:16 PM, Don McKenzie wrote:

            
Will return a data.frame with Color still a factor with three levels.
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
#
Right, but he just wanted to eliminate "BLUE" as far as I could see.   
Your solution does more, of course.
On 21-Aug-09, at 3:33 PM, David Winsemius wrote:

            
Don McKenzie, Research Ecologist
#
On Aug 21, 2009, at 6:35 PM, jlwoodard wrote:

            
?t.test

t.test expects two numeric vectors, not a numeric vector and a  
grouping indicator.

 > t.test(dataset[dataset$Color=="RED", "Score"], dataset[dataset$Color=="WHITE", "Score"])

	Welch Two Sample t-test

data:  dataset[dataset$Color == "RED", "Score"] and dataset[dataset$Color == "WHITE", "Score"]
t = -7.6485, df = 3.298, p-value = 0.003305
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -18.143205  -7.856795
sample estimates:
mean of x mean of y
  11.66667  24.66667
#
On Aug 21, 2009, at 6:36 PM, Don McKenzie wrote:

            
Read his message again. He already showed three methods all of which  
gave results identical to the one you offered. He asked to be shown  
why the 0's were appearing in table().

He thought that aspect was also the cause of his problems with t.test,
although it wasn't. t.test was coercing his character vector,
dataset$Color, to numeric NA's and then complaining about a lack of
variability in the vector.
David Winsemius, MD
#
David Winsemius wrote:
Thank you again, David!  I also just realized I could have replaced the
comma with a tilde, as in
t.test(Score~Color).  What a difference a character makes!

John
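
A short sketch comparing the two calling styles discussed above, on the sample data from the original post. The object names `two_groups`, `res_vectors`, and `res_formula` are illustrative and not from the thread, and `stringsAsFactors = TRUE` is an assumption so that Color is a factor.

```r
# Sample data from the original post, with Color as a factor.
dataset <- data.frame(
  Color = rep(c("RED", "WHITE", "BLUE"), each = 3),
  Score = c(10, 13, 12, 22, 27, 25, 18, 17, 16),
  stringsAsFactors = TRUE
)

# Drop the BLUE rows; the empty BLUE level is still attached to Color.
two_groups <- subset(dataset, Color != "BLUE")

# Two-vector form: pass the RED scores and the WHITE scores separately.
res_vectors <- t.test(two_groups$Score[two_groups$Color == "RED"],
                      two_groups$Score[two_groups$Color == "WHITE"])

# Formula form: Score grouped by Color (tilde instead of comma).
res_formula <- t.test(Score ~ Color, data = two_groups)

# Both calls should describe the same Welch two-sample test.
all.equal(unname(res_vectors$statistic), unname(res_formula$statistic))
```

The formula form works here even though the BLUE level has not been removed, since the grouping variable is re-factored inside the formula method.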
#
David Winsemius wrote:

            
or factor(dataset$Color), even.  As has been pointed out already, 
t.test.formula et al. do this internally, so there was really no problem 
to begin with.

BTW, the subsetting behaviour of factors is deliberate. It is not always 
by definition that some categories come out empty in a subgroup (think, 
e.g., a question in a questionnaire, stratified by age), and it is much 
easier to remove empty levels when you don't want them than to get them 
back in when you do.
2 days later
#
To drop empty factor levels from a subset, I use the following:

a.subset <- subset(dataset, Color!='BLUE')
ifac <- sapply(a.subset,is.factor)
a.subset[ifac] <- lapply(a.subset[ifac],factor)

Mike

  Color Score
1   RED    10
2   RED    13
3   RED    12
4 WHITE    22
5 WHITE    27
6 WHITE    25
7  BLUE    18
8  BLUE    17
9  BLUE    16

       Score
Color   10 12 13 16 17 18 22 25 27
  BLUE   0  0  0  1  1  1  0  0  0
  RED    1  1  1  0  0  0  0  0  0
  WHITE  0  0  0  0  0  0  1  1  1

  Color Score
1   RED    10
2   RED    13
3   RED    12
4 WHITE    22
5 WHITE    27
6 WHITE    25

       Score
Color   10 12 13 22 25 27
  BLUE   0  0  0  0  0  0
  RED    1  1  1  0  0  0
  WHITE  0  0  0  1  1  1

       Score
Color   10 12 13 22 25 27
  RED    1  1  1  0  0  0
  WHITE  0  0  0  1  1  1
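
The commands that produced the output above did not survive in this archive. One plausible way to reproduce it is sketched below, using the sample data from the original post. The use of `droplevels()` at the end is an addition, not something from the thread: it was added to base R after this thread was written, so it is assumed here that a newer R version is available. The name `b.subset` is likewise illustrative.

```r
# Sample data from the original post, with Color as a factor.
dataset <- data.frame(
  Color = rep(c("RED", "WHITE", "BLUE"), each = 3),
  Score = c(10, 13, 12, 22, 27, 25, 18, 17, 16),
  stringsAsFactors = TRUE
)
table(dataset)                      # all three colours have counts

# After subsetting, BLUE survives as a row of zeros.
a.subset <- subset(dataset, Color != "BLUE")
table(a.subset)

# Approach from the message above: re-create every factor column,
# which rebuilds the levels from the values actually present.
ifac <- sapply(a.subset, is.factor)
a.subset[ifac] <- lapply(a.subset[ifac], factor)
table(a.subset)                     # BLUE row is gone

# Alternative, assuming a newer R: droplevels() does the same in one call.
b.subset <- droplevels(subset(dataset, Color != "BLUE"))
identical(levels(b.subset$Color), levels(a.subset$Color))
```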
#
jlwoodard wrote:
A simpler example.  See "details" in the help file for factor() for an
explanation.
x
 blue   red white 
    1     1     2
y
 blue   red white 
    0     1     2

  red white 
    1     2
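
The commands behind this simpler example were also lost in the archive. The following is a guess at a reconstruction that gives the same three tables; the exact values used to build `x` are an assumption chosen to match the counts shown.

```r
# A factor with three levels: one blue, one red, two white.
x <- factor(c("blue", "red", "white", "white"))
table(x)

# Subsetting keeps all three levels, so blue is reported with a count of 0.
y <- x[x != "blue"]
table(y)

# Calling factor() again rebuilds the levels from the remaining values.
table(factor(y))
```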