This seems trivially fixable using something like
median.data.frame <- function(x, na.rm=FALSE) {
sapply(x, function(y, na.rm=FALSE) if(is.factor(y)) NA else
median(y, na.rm=na.rm), na.rm=na.rm)
}
Paul Johnson <pauljohn32 at gmail.com> 28/04/2011 06:20 >>>
On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns
<pburns at pburns.seanet.com> wrote:
Here are some data frames: df3.2 <- data.frame(1:3, 7:9) df4.2 <- data.frame(1:4, 7:10) df3.3 <- data.frame(1:3, 7:9, 10:12) df4.3 <- data.frame(1:4, 7:10, 10:13) df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17) df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18) Now here are some commands and their answers:
median(df4.4)
[1] 8.5 11.5
median(df3.2[c(1,2,3),])
[1] 2 8
median(df3.2[c(1,3,2),])
[1] 2 NA Warning message: In mean.default(X[[2L]], ...) : argument is not numeric or logical: returning NA The sessionInfo is below, but it looks to me like the present behavior started in 2.10.0. Sometimes it gets the right answer. I'd be grateful to hear how it does that -- I can't figure it out.
Hello, Pat. Nice poetry there! I think I have an actual answer, as opposed to the usual crap I spew. I would agree if you said median.data.frame ought to be written to work columnwise, similar to mean.data.frame. apply and sapply always give the correct answer
apply(df3.3, 2, median)
X1.3 X7.9 X10.12
2 8 11
apply(df3.2, 2, median)
X1.3 X7.9 2 8
apply(df3.2[c(1,3,2),], 2, median)
X1.3 X7.9 2 8 mean.data.frame is now implemented as mean.data.frame <- function(x, ...) sapply(x, mean, ...) I think we would suggest this for medians: ?????????????????????? median.data.frame <- function(x,...) sapply(x, median, ...) ????????????????????? It works, see:
median.data.frame(df3.2[c(1,3,2),])
X1.3 X7.9 2 8 Would our next step be to enter that somewhere in R bugzilla? (I'm not joking--I'm that naive). I think I can explain why the current median works intermittently in those cases you mention. Give it a small set of pre-sorted data, all is well. median.default uses a sort function, and it is confused when it is given a data.frame object rather than just a vector. I put a browser() at the top of median.default
median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at <tmp>#4: if (is.factor(x)) stop("need numeric data")
Browse[2]> n
debug at <tmp>#4: NULL
Browse[2]> n
debug at <tmp>#6: if (length(names(x))) names(x) <- NULL
Browse[2]> n
debug at <tmp>#6: names(x) <- NULL
Browse[2]> n
debug at <tmp>#8: if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x)))
return(x[FALSE][NA])
Browse[2]> n
debug at <tmp>#8: if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at <tmp>#8: NULL
Browse[2]> n
debug at <tmp>#12: n <- length(x)
Browse[2]> n
debug at <tmp>#13: if (n == 0L) return(x[FALSE][NA])
Browse[2]> n
debug at <tmp>#13: NULL
Browse[2]> n
debug at <tmp>#15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
[1] 2 NA
Warning message:
In mean.default(X[[2L]], ...) :
argument is not numeric or logical: returning NA
Note the sort there in step 16. I think that's what is killing us.
If you are lucky, give it a small data frame that is in order, like
df3.2, the sort doesn't produce gibberish. When I get to that point, I
will show you the sort's effect.
First, the case that "works". I moved the browser() down, because I
got tired of looking at the same old not-yet-erroneous output.
median(df3.2)
Called from: median.default(df3.2)
Browse[1]> n
debug at <tmp>#15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Interactively, type
Browse[2]> sort(x, partial = half + 0L:1L)
NA NA NA NA NA NA
1 1 7 NULL NULL NULL NULL
2 2 8 <NA> <NA> <NA> <NA>
3 3 9 <NA> <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
But it still gives you a "right" answer:
Browse[2]> n
[1] 2 8
But if you give it data out of order, the second column turns to NA,
and that causes doom.
median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at <tmp>#15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Interactively:
Browse[2]> sort(x, partial = half + 0L:1L)
NA NA NA NA NA NA
1 1 NULL 7 NULL NULL NULL
3 3 <NA> 9 <NA> <NA> <NA>
2 2 <NA> 8 <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
Browse[2]> n
[1] 2 NA
Warning message:
In mean.default(X[[2L]], ...) :
argument is not numeric or logical: returning NA
Here's a larger test case. Note columns 1 and 3 turn to NULL
df8.8 <- data.frame(a=2:8, b=1:7)
median(df8.8)
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> sort(x, partial = half + 0L:1L)
NA NA NA NA NA NA NA NA NA NA NA NA NA NA
1 NULL 2 NULL 1 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
2 <NA> 3 <NA> 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA> 4 <NA> 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA> 5 <NA> 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 <NA> 6 <NA> 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 <NA> 7 <NA> 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
7 <NA> 8 <NA> 7 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
Run ?sort and you see it was not intended for data.frames.
In conclusion, I think median applied to a data.frame causes undefined
behavior because median is not intending to deal with several columns
at once.
I don't see any changes in median.default that would explain the
changes you see. Compare:
From R-2.13, src/library/stats/R/median.R
median.default <- function(x, na.rm = FALSE)
{
if(is.factor(x)) stop("need numeric data")
## all other objects only need sort() & mean() to be working
if(length(names(x))) names(x) <- NULL # for e.g., c(x = NA_real_)
if(na.rm) x <- x[!is.na(x)] else if(any(is.na(x)))
return(x[FALSE][NA])
n <- length(x)
if (n == 0L) return(x[FALSE][NA])
half <- (n + 1L) %/% 2L
if(n %% 2L == 1L) sort(x, partial = half)[half]
else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
}
From R-2.9.2
median.default <- function(x, na.rm = FALSE)
{
if(is.factor(x)) stop("need numeric data")
## all other objects only need sort() & mean() to be working
if(length(names(x))) names(x) <- NULL # for e.g., c(x = NA_real_)
if(na.rm) x <- x[!is.na(x)] else if(any(is.na(x)))
return(x[FALSE][NA])
n <- length(x)
if (n == 0L) return(x[FALSE][NA])
half <- (n + 1L) %/% 2L
if(n %% 2L == 1L) sort(x, partial = half)[half]
else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
}
pj
Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas ______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel ******************************************************************* This email and any attachments are confidential. Any use...{{dropped:8}}