
density of hist(freq = FALSE) inversely affected by data magnitude

4 messages · James, William Dunlap

#
Hi,

I have a couple of observations, a question or two, and perhaps a
suggestion related to the plotting of density on the y-axis within the
hist() function when freq=FALSE.  I was using the function and trying
to develop an intuitive understanding of what the density is telling
me.  After reading through this fairly helpful post:

http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis

I finally realized that in the case where freq = FALSE, the y-axis
isn't really telling me the density.  It's actually indicating the
relative frequency divided by the bin size.  I assume this is for the
case where the bins may be of non-regular size.

from hist.default:

dens <- counts/(n * diff(breaks))

So the count in each bin is divided by the product of the total number
of observations (n) and the size of the bin.  The problem, as I see it,
is that the density ends up being scaled by the inverse of the bin
size, and the bin size is proportional to the magnitude of the data.
Therefore the magnitude of the data is directly affecting the density,
which seems problematic.
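
One way to check this formula against what hist() actually returns is
sketched below.  The seed and the uniform sample are arbitrary choices
made for illustration (they are not taken from hist.default), and
plot = FALSE is used so that only the computed values are inspected.

```r
set.seed(1)                                # arbitrary seed, for reproducibility only
x <- runif(100)                            # illustrative data (an assumption)
h <- hist(x, plot = FALSE)                 # compute the histogram without drawing it

n <- length(x)
dens <- h$counts / (n * diff(h$breaks))    # the formula quoted from hist.default

all.equal(dens, h$density)                 # compare with the density hist() reports
sum(h$density * diff(h$breaks))            # total area of the bars
```

Read this way, each bar's area (height times width) is the fraction of
observations in that bin, so the areas add up to 1.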

For example*:

set.seed(4444)
x <- runif(100)
y <- x / 1000

par(mfrow = c(2, 1))
hist(x, prob = TRUE)
hist(y, prob = TRUE)

The density values on the y-axis of the second plot are
1000 times larger, simply because the y data is 1000 times smaller.
Again, that seems problematic.  It seems to me that the density
should be unit-less, but here it's affected by the magnitude of the
data.
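
The effect can also be inspected numerically rather than by eye.  The
sketch below reuses the seed and data from the example and compares the
two histograms without plotting them.  It assumes the default break
selection gives both histograms the same number of bins, in which case
the two ratios should come out close to 1000.

```r
set.seed(4444)                             # seed and data from the example above
x <- runif(100)
y <- x / 1000

hx <- hist(x, plot = FALSE)
hy <- hist(y, plot = FALSE)

diff(hx$breaks)[1] / diff(hy$breaks)[1]    # ratio of bin widths (x relative to y)
max(hy$density) / max(hx$density)          # ratio of bar heights (y relative to x)
hx$counts / length(x)                      # proportion of observations per bin, x
hy$counts / length(y)                      # proportion of observations per bin, y
```

The proportions per bin are what stays the same between the two data
sets; the bar heights differ because the bin widths differ.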

So, my question is, why is density calculated this way?

For the case where all the bins are of the same size, I would think
density should simply be calculated as:

dens <- counts / n

Of course, that might be somewhat misleading for the case where the
bin sizes vary.  So then why not calculate density as:

dens <- counts / (n * diff(breaks) / min(diff(breaks)))

Dividing diff(breaks) by min(diff(breaks)) removes the scaling effect
of the magnitude of the data, and simply leaves the relative
difference in bin size.

For the case where all the bins are the same size, the calculation is
equivalent to dens <- counts / n

For all other cases, the density is scaled by the size of the bin, but
unaffected by the magnitude of the data.
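
To see what this proposal would do in practice, here is a rough sketch.
The helper relative_density() is a hypothetical name introduced only
for this illustration (it is not part of R), and the unequal breaks are
an arbitrary choice.

```r
# Hypothetical helper implementing the proposed rescaling (not part of R).
relative_density <- function(h) {
  w <- diff(h$breaks)
  h$counts / (sum(h$counts) * w / min(w))
}

set.seed(4444)
x <- runif(100)
breaks <- c(0, 0.1, 0.2, 0.5, 1)           # deliberately unequal bins (assumption)

hx <- hist(x, breaks = breaks, plot = FALSE)
hy <- hist(x / 1000, breaks = breaks / 1000, plot = FALSE)

rx <- relative_density(hx)
ry <- relative_density(hy)

all.equal(rx, ry)                          # proposed values do not depend on scale
sum(rx * diff(hx$breaks))                  # bar area under the proposal
sum(hx$density * diff(hx$breaks))          # bar area under hist()'s density
```

Under the proposal the values are indeed unaffected by the magnitude of
the data, but the bar areas sum to the smallest bin width rather than
to 1, so the bars can no longer be overlaid on a density curve.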

So, what am I misunderstanding?  Why is density calculated as it is,
and what does it mean?

Thanks,


James


*example from http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis
#
The probability density function is not unitless - it is the derivative of the
[cumulative] probability distribution function, so it has units of delta-probability-mass
over delta-x.  It must integrate to 1 (over all possible x).  hist(x, freq=FALSE)
or hist(x, prob=TRUE) displays an estimate of the density function, and the following
example shows how the scale matches what you get from the presumed
population density function.

f <- function (n, sd)   # the name f is assumed here so the function can be called
{
    x <- rnorm(n, sd = sd)
    hist(x, freq = FALSE) # estimated density
    s <- seq(min(x), max(x), length.out = 129)
    lines(s, dnorm(s, sd = sd), col = "red") # overlay expected density for this sample
}
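
The same point can be checked without any sampling, using the uniform
case from the first message.  This is only a sketch: the evaluation
points 0.5 and 0.0005 are arbitrary values inside each range.

```r
# Density of a Uniform(0, 1) variable, and of the same variable divided by 1000.
height_x <- dunif(0.5,    min = 0, max = 1)       # density on [0, 1]
height_y <- dunif(0.0005, min = 0, max = 0.001)   # density on [0, 0.001]

height_y / height_x        # the narrower range has the taller density
height_x * 1               # area = height * width for x
height_y * 0.001           # area = height * width for y
```

Both areas are 1, which is only possible if the density grows by the
same factor by which the range of the data shrinks.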
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Bill,

Thank you.  I got it.  It can take a fair amount of work to
interpret the density, especially with odd or irregular bin sizes.

Thanks again,

James
On Tue, Jan 22, 2013 at 5:33 PM, William Dunlap <wdunlap at tibco.com> wrote:
#
I think it is a fair bit of work to interpret the freq=TRUE (prob=FALSE)
version of hist() when the bins have unequal sizes.  E.g.,
in the following the bins are sized so that each contains
an equal number of observations.  The resulting flat
frequency plot is hard for me to interpret.  The density plot
is easy.

  > x <- rnorm(1000, sd=50)
  > hist(x, breaks=quantile(x,(0:10)/10), prob=TRUE)
  > hist(x, breaks=quantile(x,(0:10)/10), prob=FALSE)
  Warning message:
  In plot.histogram(r, freq = freq1, col = col, border = border, angle = angle,  :
    the AREAS in the plot are wrong -- rather use freq=FALSE
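
The numbers behind these two plots can be inspected directly.  The
following is a sketch with an arbitrary seed (an assumption, since the
session above does not set one) and plot = FALSE so nothing is drawn.

```r
set.seed(1)                                # arbitrary seed (assumption)
x <- rnorm(1000, sd = 50)
h <- hist(x, breaks = quantile(x, (0:10) / 10), plot = FALSE)

h$counts                                   # roughly equal counts per bin, by construction
round(diff(h$breaks), 1)                   # bin widths: narrow in the middle, wide in the tails
h$density                                  # counts / (n * width): tall where bins are narrow
sum(h$density * diff(h$breaks))            # bar areas still sum to 1
```

The counts carry no information about shape here, while the density
values recover the bell shape because they account for bin width.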

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com