Skip to content

rootogram for normal distributions

6 messages · Hugo Mildenberger, Hadley Wickham, Achim Zeileis +1 more

#
Using R-2.12.1 and latticeExtra-0.6-14, I would like to understand 
why a rootogram displaying samples from the Poisson distribution looks like I 
expected it, whereas a rootogram using the normal distribution does not:

library(latticeExtra)
rootogram(~rpois(1000, lambda = 50), dfun = function(x) dpois(x, lambda = 50))

rootogram(~rnorm(1000), dfun = function(x) dnorm(x,mean(x),sd(x)))

I probably can't attach figures here. Thus a textual description of what I get may 
suffice: With increasing sample size, the rootogram using random samples 
from the Poisson distribution shows decreasing differences (bars are quickly 
approaching the zero line), whereas the displayed differences for random 
samples of the normal distribution are always large. The differences even increase 
with sample size, i.e,  the hanging bars tend to vanish for very large samples.
#
On Sun, 16 Jan 2011, Hugo Mildenberger wrote:

            
The normal distribution is a continuous distribution, i.e., the frequency 
for each observed value will essentially be 1/n and not converge to the 
density function. Hence, you would need to look at histogram or smoothed 
densities. Rootograms, on the other hand, are intended for discrete 
distributions.
Z
#
I don't think that's true - rootograms are useful for both continuous
and discrete distributions.  See (e.g.) p 314 at
http://www.edwardtufte.com/tufte/tukey, where Tukey himself uses a
rootogram with a normal distribution.

Hadley
#
On Sun, 16 Jan 2011, Hadley Wickham wrote:

            
OK, let me rephrase: Rootograms as implemented in rootogram() are intended 
for discrete distributions. At least that's my reading. But maybe I've 
missed a trick that you can point us to.
Z
#
Thank you very much for your qualified answers, and also for the 
link to the Tukey paper. I appreciate Tukey's writings very much. 

Looking at the lattice code (below), a possible implementation might 
involve  binning, not so?

I see a problematic part here:

   xx <- sort(unique(x))

Unique certainly works well with Poisson distributed data, but is 
essentially a no-op when confronted with continous floating-point 
numbers.

Best

Hugo
function (x, y = table(x), dfun = NULL, transformation = sqrt, 
    hang = TRUE, ...) 
{
    plot.line <- trellis.par.get("plot.line")
    stopifnot(is.function(dfun))
    yy <- transformation(y/sum(y))
    xx <- sort(unique(x))
    dotArgs <- list(...)
    dfunArgs <- names(formals(dfun))
    if (!("..." %in% dfunArgs)) 
        dotArgs <- dotArgs[dfunArgs[-1]]
    dd <- transformation(do.call(dfun, c(list(xx), dotArgs)))
    list(xlim = range(xx), ylim = if (hang) range(dd, dd - yy, 
        0) else range(dd, yy, 0), dx = diff(xx), dy = diff(dd))
}
function (x, y = table(x), dfun = NULL, col = plot.line$col, 
    lty = plot.line$lty, lwd = plot.line$lwd, alpha = plot.line$alpha, 
    transformation = sqrt, hang = TRUE, ...) 
{
    plot.line <- trellis.par.get("plot.line")
    ref.line <- trellis.par.get("reference.line")
    stopifnot(is.function(dfun))
    yy <- transformation(y/sum(y))
    xx <- sort(unique(x))
    dotArgs <- list(...)
    dfunArgs <- names(formals(dfun))
    if (!("..." %in% dfunArgs)) 
        dotArgs <- dotArgs[dfunArgs[-1]]
    dd <- transformation(do.call(dfun, c(list(xx), dotArgs)))
    panel.abline(h = 0, col = ref.line$col, lty = ref.line$lty, 
        lwd = ref.line$lwd, alpha = ref.line$alpha)
    panel.segments(xx, if (hang) 
        dd
    else 0, xx, if (hang) 
        (dd - yy)
    else yy, col = col, lty = lty, lwd = lwd, alpha = alpha, 
        ...)
    panel.lines(xx, dd)
}
<environment: namespace:latticeExtra>

        
On Sunday 16 January 2011 15:59:58 Achim Zeileis wrote:
#
On Sun, Jan 16, 2011 at 11:58 AM, Hugo Mildenberger
<Hugo.Mildenberger at web.de> wrote:
Yes, thanks to Hadley for the nice reference, I hadn't seen it before.
True, but as Achim said, rootogram() is intended to work with data
arising from discrete distributions, not continuous ones. I see now
that this is not as explicit as it could be in the help page (although
"frequency distribution" gives a hint), which I will try to improve.

I don't think automatic handling of continuous distributions is simple
(because it is not clear how you would specify the reference
distribution). However, a little preliminary work will get you close
with the current implementation:

xnorm <- rnorm(1000)

## 'discretize' by binning and replacing data by bin midpoints
h <- hist(xnorm, plot = FALSE) # add arguments for more control
xdisc <- with(h, rep(mids, counts))

## Option 1: Assume bin probabilities proportional to dnorm()
norm.factor <- sum(dnorm(h$mids, mean(xnorm), sd(xnorm)))

rootogram(~ xdisc,
          dfun = function(x) {
              dnorm(x, mean(xnorm), sd(xnorm)) / norm.factor
          })

## Option 2: Compute probabilities explicitly using pnorm()

## pdisc <- diff(pnorm(h$breaks)) ## or estimated:
pdisc <- diff(pnorm(h$breaks, mean = mean(xnorm), sd = sd(xnorm)))
pdisc <- pdisc / sum(pdisc)

rootogram(~ xdisc,
          dfun = function(x) {
              f <- factor(x, levels = h$mids)
              pdisc[f]
          })

-Deepayan