log y 'axis' of histogram
On 31/08/10 03:37, Derek M Jones wrote:
Hadley,
I have counts ranging over 4-6 orders of magnitude with peaks occurring at various 'magic' values. Using a log scale for the y-axis enables the smaller peaks, which would otherwise be almost invisible bumps along the x-axis, to be seen.
That doesn't justify the use of a _histogram_ - and regardless of
The usage highlights meaningful characteristics of the data. What better justification for any method of analysis and display is there?
what distributional display you use, logging the counts imposes some pretty heavy restrictions on the shape of the distribution (e.g. that it must not drop to zero).
Does there have to be a recognized statistical distribution to use R? In my case I am using R for all of the analysis and graphics in a new book. This means that sometimes I have to deal with data sets that are more or less a jumble of numbers with patterns in a few places. For instance, for the numeric values of integer constants appearing as one operand of the binary bitwise-AND operator (see figure 1224.1 of www.knosof.co.uk/cbook/usefigtab.pdf; raw data at www.knosof.co.uk/cbook/bandcons.hist.gz), qplot(band, binwidth=8, geom="histogram") + scale_y_log() does a good job of highlighting the peaks.
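For readers without ggplot2 to hand, the same kind of display can be roughed out in base R. This is only a sketch: the `band` vector below is synthetic stand-in data (an assumption, not the real bandcons data), and the choice to drop empty bins is one possible way of coping with the fact that a zero count has no finite logarithm.

```r
# Synthetic stand-in for 'band' (assumption): three 'magic' values plus noise.
set.seed(1)
band <- c(rep(c(1, 255, 511), times = c(50000, 2000, 30)),
          sample(0:1023, 500, replace = TRUE))

# Bin with width 8, matching binwidth=8 in the qplot() call.
breaks <- seq(0, 1024, by = 8)
h <- hist(band, breaks = breaks, right = FALSE, plot = FALSE)

# log(0) is -Inf, so empty bins cannot be placed on a log axis; drop them.
keep <- h$counts > 0

# Vertical bars on a logged y-axis; base R starts each bar at the bottom
# of the plotting region, which is the arbitrary cut-off a log scale forces.
plot(h$mids[keep], h$counts[keep], type = "h", log = "y",
     xlab = "band", ylab = "count (log scale)")
```

The small peaks at 255 and 511 remain visible next to the much larger one near 1, which is the effect described above.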
It may be useful for your purposes, but that doesn't necessarily make it a meaningful graphic.
Doesn't being useful for my purpose make it meaningful, at least for me and I hope my readers?
Hadley is correct about the problem of where to end the bars when trying to draw a log-histogram: basically you have to decide to cut them off somewhere. He is also right that a log-histogram is perhaps not a great graphic to use.

However, they are used and indeed there is one in the Fieller, Flenley, Olbricht paper (published in Applied Statistics, now JRSS C) for example. I haven't searched for others, but certainly when I wrote a log-histogram routine it wasn't because I thought of doing such a plot all on my own.

A number of authors, including Barndorff-Nielsen in at least some of his papers (I haven't gone back and checked all his older work) just plot the midpoints of the tops of the log-histogram. (That is an option in logHist).

Another approach is to fit an empirical density to the data and plot the log-density. That matches the advice often seen in this forum that plotting empirical density functions is preferable to drawing histograms.

My feeling is that either of these two approaches is probably preferable to using log-histograms for the reasons Hadley enunciated. When plotting data plus a fitted curve, the midpoints approach does have the advantage of distinguishing data and theoretical curve more clearly.

Overall the idea of a plot with a logged y-axis is definitely a good one and its use is endemic in literature concerned with heavy-tailed distributions, particularly finance. The advantage is the clarity offered regarding tail behaviour, where for example exponential tails in the density correspond to straight lines in the logged y-axis plot.

Hope this helps.

David Scott
_________________________________________________________________
David Scott
Department of Statistics
The University of Auckland, PB 92019
Auckland 1142, NEW ZEALAND
Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
Email: d.scott at auckland.ac.nz, Fax: +64 9 373 7018
Director of Consulting, Department of Statistics
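The two alternatives mentioned in the message above (plotting only the midpoints of the log-histogram tops, and plotting the log of an empirical density) can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the logHist implementation: the data are synthetic, generated as the difference of two exponential variates so that the true density, 0.5 * exp(-|x|), has exponential tails, and the number of bins is an arbitrary choice.

```r
# Synthetic data (assumption): difference of two Exp(1) variates, which is
# Laplace distributed with density 0.5 * exp(-abs(x)).
set.seed(42)
x <- rexp(5000) - rexp(5000)

# Histogram on the density scale, computed without drawing it.
h <- hist(x, breaks = 40, plot = FALSE)
keep <- h$density > 0          # empty bins have no finite log

# Approach 1: midpoints of the tops of the log-histogram bars.
plot(h$mids[keep], log(h$density[keep]),
     xlab = "x", ylab = "log density")

# Approach 2: log of a kernel density estimate, overlaid as a line.
d <- density(x)
ok <- d$y > 0                  # guard against zeros in the far tails
lines(d$x[ok], log(d$y[ok]))

# Theoretical log-density: two straight lines with slopes +1 and -1.
curve(log(0.5) - abs(x), add = TRUE, lty = 2)
```

With points for the data and lines for the fitted and theoretical curves, the two are easy to tell apart, and the exponential tails show up as the straight-line segments on either side of zero.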