Skip to content
Prev 39978 / 63421 Next

median and data frames

On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns
<pburns at pburns.seanet.com> wrote:
Hello, Pat.

Nice poetry there!  I think I have an actual answer, as opposed to the
usual crap I spew.

I would agree if you said median.data.frame ought to be written to
work columnwise, similar to mean.data.frame.

apply and sapply  always give the correct answer
X1.3   X7.9 X10.12
     2      8     11
X1.3 X7.9
   2    8
X1.3 X7.9
   2    8

mean.data.frame is now implemented as

mean.data.frame <- function(x, ...) sapply(x, mean, ...)

I think we would suggest this for medians:

??????????????????????

median.data.frame <- function(x,...) sapply(x, median, ...)

?????????????????????

It works, see:
X1.3 X7.9
   2    8

Would our next step be to enter that somewhere in R bugzilla? (I'm not
joking--I'm that naive).

I think I can explain why the current median works intermittently in
those cases you mention.  Give it a small set of pre-sorted data, all
is well.  median.default uses a sort function, and it is confused when
it is given a data.frame object rather than just a vector.


I put a browser() at the top of median.default
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at <tmp>#4: if (is.factor(x)) stop("need numeric data")
Browse[2]> n
debug at <tmp>#4: NULL
Browse[2]> n
debug at <tmp>#6: if (length(names(x))) names(x) <- NULL
Browse[2]> n
debug at <tmp>#6: names(x) <- NULL
Browse[2]> n
debug at <tmp>#8: if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x)))
return(x[FALSE][NA])
Browse[2]> n
debug at <tmp>#8: if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at <tmp>#8: NULL
Browse[2]> n
debug at <tmp>#12: n <- length(x)
Browse[2]> n
debug at <tmp>#13: if (n == 0L) return(x[FALSE][NA])
Browse[2]> n
debug at <tmp>#13: NULL
Browse[2]> n
debug at <tmp>#15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
    partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA


Note the sort there in step 16. I think that's what is killing us.

If you are lucky, give it a small  data frame that is in order, like
df3.2, the sort doesn't produce gibberish. When I get to that point, I
will show you the sort's effect.

First, the case that "works". I moved the browser() down, because I
got tired of looking at the same old not-yet-erroneous output.
Called from: median.default(df3.2)
Browse[1]> n
debug at <tmp>#15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
    partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively, type

Browse[2]> sort(x, partial = half + 0L:1L)
  NA NA   NA   NA   NA   NA
1  1  7 NULL NULL NULL NULL
2  2  8 <NA> <NA> <NA> <NA>
3  3  9 <NA> <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

But it still gives you a "right" answer:

Browse[2]> n
[1] 2 8


But if  you give it data out of order, the second column turns to NA,
and that causes doom.
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at <tmp>#15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
    partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively:

Browse[2]> sort(x, partial = half + 0L:1L)
  NA   NA NA   NA   NA   NA
1  1 NULL  7 NULL NULL NULL
3  3 <NA>  9 <NA> <NA> <NA>
2  2 <NA>  8 <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA


Here's a larger test case. Note columns 1 and 3 turn to NULL
median(df8.8)

debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
    partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> sort(x, partial = half + 0L:1L)
    NA NA   NA NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
1 NULL  2 NULL  1 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
2 <NA>  3 <NA>  2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA>  4 <NA>  3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA>  5 <NA>  4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 <NA>  6 <NA>  5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 <NA>  7 <NA>  6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
7 <NA>  8 <NA>  7 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

Run ?sort and you see it was not intended for data.frames.

In conclusion, I think median applied to a data.frame causes undefined
behavior because median is not intending to deal with several columns
at once.


I don't see any changes in median.default that would explain the
changes you see. Compare:
median.default <- function(x, na.rm = FALSE)
{
    if(is.factor(x)) stop("need numeric data")
    ## all other objects only need sort() & mean() to be working
    if(length(names(x))) names(x) <- NULL # for e.g., c(x = NA_real_)
    if(na.rm) x <- x[!is.na(x)] else if(any(is.na(x))) return(x[FALSE][NA])
    n <- length(x)
    if (n == 0L) return(x[FALSE][NA])
    half <- (n + 1L) %/% 2L
    if(n %% 2L == 1L) sort(x, partial = half)[half]
    else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
}
median.default <- function(x, na.rm = FALSE)
{
    if(is.factor(x)) stop("need numeric data")
    ## all other objects only need sort() & mean() to be working
    if(length(names(x))) names(x) <- NULL # for e.g., c(x = NA_real_)
    if(na.rm) x <- x[!is.na(x)] else if(any(is.na(x))) return(x[FALSE][NA])
    n <- length(x)
    if (n == 0L) return(x[FALSE][NA])
    half <- (n + 1L) %/% 2L
    if(n %% 2L == 1L) sort(x, partial = half)[half]
    else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
}



pj