boxplot by factor (Package base version 2.1.1) (PR#7976)

3 messages · Liaw, Andy, Peter Dalgaard, Gabor Grothendieck

Liaw, Andy

Tue, Jun 28, 2005 5:37 AM #

The issue is not with boxplot, but with split.  boxplot.formula()
calls boxplot(split(mf[[response]], mf[-response]), ...),
but look at what split() returns when there are empty levels in
the factor:

f <- factor(gl(3, 6), levels=1:5)
y <- rnorm(f)
split(y, f)

$"1"
[1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520

$"2"
[1] -1.1296642 -0.4808355 -0.2789933  0.1220718  0.1287742 -0.7573801

$"3"
[1]  1.2320902  0.5090700 -1.5508074  2.1373780  1.1681297 -0.7151561

The "culprit" is the following in split.default():

    f <- factor(f)

which drops empty levels in f, if there are any.  BTW, ?split doesn't
mention what it does in such a situation.  Perhaps it should?

If this is to be "fixed", I suppose an additional argument, e.g.,
drop=TRUE, can be added, and the corresponding line mentioned
above changed to something like:

    if (drop || !is.factor(f)) f <- factor(f)

Then this additional argument can be passed on from boxplot.formula() to
split().
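
As a rough illustration of the proposal above, here is a minimal sketch of a split-like helper with a drop argument. The name split_sketch is hypothetical and this is not the code of split.default(); it only mimics the one line under discussion, assuming f is a single grouping vector.

```r
# Hypothetical stand-in for split.default() with the proposed 'drop' argument.
split_sketch <- function(x, f, drop = TRUE) {
    # Re-factor only when asked to drop empty levels, or when f is not a factor
    if (drop || !is.factor(f)) f <- factor(f)
    # Build one list element per level; empty levels give zero-length vectors
    out <- lapply(levels(f), function(lev) x[!is.na(f) & f == lev])
    names(out) <- levels(f)
    out
}

f <- factor(gl(3, 6), levels = 1:5)
y <- rnorm(length(f))
names(split_sketch(y, f, drop = TRUE))   # "1" "2" "3"
names(split_sketch(y, f, drop = FALSE))  # "1" "2" "3" "4" "5"
```

With drop = TRUE the helper reproduces the behaviour described above; with drop = FALSE the empty levels "4" and "5" survive as zero-length elements.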

Just my $0.02...

Andy

From: mwtoews at sfu.ca

I consider this to be an old bug, which also persists in Splus 7. It  
is unnecessary, and annoying.

## Section 1: Consider a simple data frame with three possible factor levels

d <- data.frame(a=sort(rnorm(10)*10),
                b=factor(c(rep("A",4), rep("C",6)), levels=c("A","B","C")))
tapply(d$a, d$b, mean) # returns three results, which I would expect
plot(a ~ b, d) # plots only two of the three levels, ignoring that
               # "B" is in the second position

# if I tried to plot a blank in between the two boxplots:
plot(a ~ b, d, at=1:3) # nope: error
plot(a ~ b, d, at=c(1,3)) # nope: out of range (also xlim does
                          # nothing for the formula boxplot method)

# to make this work with the current R/Splus implementation, I have
# to add a zero:
d <- rbind(d, data.frame(a=0,b="B")) # which I don't want to do,
                                     # since there are no "B"
plot(a ~ b, d) # yuk!

## Section 2: Why is this important? Consider another realistic example of [synthetic] daily temperature

temp <- 5 - 10*cos(1:365*2*pi/365) + rnorm(365)*3
# jday is Julian day [1,365]
d1 <- data.frame(year=2005, jday=1:365, date=NA, month=NA, temp)
d1$date <- as.Date(paste(d1$year, d1$jday), "%Y %j")
d1$month <- factor(months(d1$date,TRUE), levels=month.abb)
plot(temp ~ month, d1) # perfect, in a perfect meteorological world

# now let's remove some data
d2 <- d1[!d1$month %in% c("Mar","Apr","May","Sep"),]
tapply(d2$temp,d2$month,mean)  # perfect
plot(temp ~ month, d2) # ugly, not 12 months, etc. (despite having
                       # 12 levels)

# again the only cure is to add zeros to the missing months
# (unnecessary forgery of data)
d3 <- d2
for (i in c("Mar","Apr","May","Sep")) {
     d3 <- rbind(d3,NA)
     d3$month[nrow(d3)] <- i
     d3$temp[nrow(d3)] <- 0
}
plot(temp ~ month, d3) # still ugly, but at least has 12 months!

## Section 3: Solution
The obvious solution is to leave a blank where a boxplot should go,
similar to tapply. This would have 1:n positions, where n is the
number of levels of the factor, not the number of levels that have
one or more observations.  The position should also have a label
under the tick mark.
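
A minimal sketch of what "1:n positions" could look like, assuming the groups are assembled by hand so that every level gets a slot. The names groups and bp are illustrative only; plot = FALSE is used so that the summary can be inspected without opening a graphics device.

```r
# Same toy data as in Section 1: level "B" has no observations
d <- data.frame(a = sort(rnorm(10) * 10),
                b = factor(c(rep("A", 4), rep("C", 6)),
                           levels = c("A", "B", "C")))

# One list element per factor level, empty levels included
groups <- lapply(levels(d$b), function(lev) d$a[d$b == lev])
names(groups) <- levels(d$b)

# plot = FALSE returns the per-group statistics without drawing anything
bp <- boxplot(groups, plot = FALSE)
bp$names  # "A" "B" "C"
bp$n      # 4 0 6
```

Calling boxplot(groups) with plotting enabled should then reserve position 2 for the empty level "B".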
I don't see any reason why the missing data should be completely
ignored. Users wishing to not plot the blanks where the data could go
can simply type (for back-compatibility):

d2$month <- factor(d2$month) # from 12 to 8 levels

which will produce the same 8-level plot as above.
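
The effect of that one-liner on the level count can be checked without plotting. This is a sketch under one assumption: the month labels are built from month.abb rather than months(), so the result does not depend on the locale. The names dates, month and kept are illustrative.

```r
# One entry per day of 2005, labelled with an English month abbreviation
dates <- seq(as.Date("2005-01-01"), as.Date("2005-12-31"), by = "day")
month <- factor(month.abb[as.POSIXlt(dates)$mon + 1], levels = month.abb)

# Remove four months; subsetting a factor keeps all of its levels
kept <- month[!month %in% c("Mar", "Apr", "May", "Sep")]
nlevels(kept)          # 12
nlevels(factor(kept))  # 8
```

So the unused levels stay attached to the data until the user explicitly re-factors, which is what makes the back-compatible route a single line.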

## Section 4: Conclusion
I consider this to be a bug with regard to data representation, and
this function is not consistent with other functions like `tapply'.
Considering that the back-compatibility solution is very simple, and
most users would probably prefer a result including all levels (NULL
or real values in each), I feel this is an appropriate improvement (and
easy to fix in the code). At the very least, include an option to
honour the factor levels.

Thanks.
-mt

--please do not edit the information below--

Version:
platform = powerpc-apple-darwin8.1.0
arch = powerpc
os = darwin8.1.0
system = powerpc, darwin8.1.0
status = Patched
major = 2
minor = 1.1
year = 2005
month = 06
day = 26
language = R

Locale:
en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

Search Path:
.GlobalEnv, package:methods, package:stats, package:graphics,  
package:grDevices, package:utils, package:datasets, Autoloads,  
package:base

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Peter Dalgaard

Tue, Jun 28, 2005 5:57 AM #

"Liaw, Andy" <andy_liaw at merck.com> writes:

Alternatively, I suspect that the intention was as.factor() rather
than factor(). It does require a bit of care to fix it that way,
though. There could be problems with empty levels popping up in
unexpected places.
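
The difference between the two calls is easy to see on a factor that already carries an unused level. A small sketch (the object f is illustrative):

```r
# "B" is a declared level with no observations
f <- factor(c("A", "C"), levels = c("A", "B", "C"))

levels(factor(f))     # "A" "C"      (unused level dropped)
levels(as.factor(f))  # "A" "B" "C"  (factor returned unchanged)
```

Any code downstream of such a change would then see zero-length groups for "B", which is presumably the kind of "unexpected place" to watch for.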

   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907

Gabor Grothendieck

Tue, Jun 28, 2005 8:25 AM #

Based on Andy's comment, a workaround can consist of
not using boxplot.formula, e.g. using the data frame d
defined by the original poster (see above):

	boxplot( by(d, d$b, function(x)x$a) )

On 6/28/05, Liaw, Andy <andy_liaw at merck.com> wrote:

The issue is not with boxplot, but with split.  boxplot.formula()
calls boxplot(split(mf[[response]], mf[-response]), ...),
but look at what split() returns when there are empty levels in
the factor:

f <- factor(gl(3, 6), levels=1:5)
y <- rnorm(f)
split(y, f)

$"1"
[1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520

$"2"
[1] -1.1296642 -0.4808355 -0.2789933  0.1220718  0.1287742 -0.7573801

$"3"
[1]  1.2320902  0.5090700 -1.5508074  2.1373780  1.1681297 -0.7151561

The "culprit" is the following in split.default():

   f <- factor(f)

which drops empty levels in f, if there are any.  BTW, ?split doesn't
mention what it does in such a situation.  Perhaps it should?

If this is to be "fixed", I suppose an additional argument, e.g.,
drop=TRUE, can be added, and the corresponding line mentioned
above changed to something like:

   if (drop || !is.factor(f)) f <- factor(f)

Then this additional argument can be passed on from boxplot.formula() to
split().

Just my $0.02...

Andy
