
deviance vs entropy

5 messages · RemoteAPL, Thomas Lumley, Warren R. Greiff

#
Hello,

The question looks simple. It's probably even a stupid one. But I have spent several
hours searching the Internet and downloaded tons of papers where deviance is mentioned,
and I haven't found an answer.

Well, the use of entropy when I split a node of a classification tree is clear to me.
It makes sense, because entropy is a good old measure of how uniform a distribution is.
And of course we want the distribution in a node to be non-uniform, at best representing one class only.

Where does deviance come from at all? I look at the formula and see that the only
difference from entropy is the use of the *number* of points of each class, instead of the
*probability*, as the multiplier of log(p_ik). So it looks like deviance and entropy differ
by a factor of 1/N (or 2/N), where N is the total number of cases. Then WHY say "deviance"?
Is there a historical reason? Or, most likely, I do not understand something very basic. Please help.

Thanks,
Alexander Skomorokhov,

#
On Thu, 15 Feb 2001, RemoteAPL wrote:

Entropy is, as you say, a measure of non-uniformity. Deviance (which is
based on the loglikelihood function) is a measure of evidence.  A given
level of improvement in classification is much stronger evidence for a
split if it is based on a large number of points.  For example, with 2
points you can always find a split that gives perfect classification. With
2000 points it is very impressive to be able to get perfect classification
with one split.
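A small sketch makes the contrast concrete (Python for illustration; a two-class node with made-up counts). Note that the node deviance -2 SUM n_i log(p_i) is just the entropy scaled by 2n:

```python
import math

def entropy(counts):
    """Entropy of the class proportions at a node (impurity per point)."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def deviance(counts):
    """Node deviance -2 * sum_i n_i * log(p_i): entropy scaled by 2n."""
    n = sum(counts)
    return -2 * sum(c * math.log(c / n) for c in counts if c > 0)

# A 50/50 two-class node split perfectly into two pure children.
# Pure children contribute zero, so the improvement is the parent's own value.
for n in (2, 2000):
    parent = [n // 2, n // 2]
    print(n, entropy(parent), deviance(parent))
```

The entropy improvement is log(2) for both node sizes, while the deviance improvement grows in proportion to the number of points: the 2000-point split carries far more evidence.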

	-thomas

Thomas Lumley			Asst. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
I'm not quite sure what you have in mind, but I'm inferring from your comments that by "deviance"
you mean:

   SUM p_i log (p_i/q_i)  (or 2 SUM p_i log (p_i/q_i))

In information theoretic terms this is the relative entropy:

   D(p||q) = SUM p_i log p_i - SUM p_i log q_i = H(p:q) - H(p)

where H(p) = -SUM p_i log p_i is the entropy of p, and H(p:q) = -SUM p_i log q_i is the
cross entropy.  If q is the uniform distribution over the N classes, then the cross
entropy reduces to:

   H(p:q) = -SUM p_i log (1/N) = log(N) SUM p_i = log(N)

and you get:

   D(p||q) = log(N) - H(p).

I'm guessing that in the things you've read, when they are talking about deviance, q can be
(and generally is) something other than the uniform distribution.  For example, p is often
the empirical distribution of a data sample, and q is the distribution corresponding to some
induced model.  Then D(p||q) is a measure of how far the model is from the observed data.

Note that the cross entropy corresponds (up to a factor of -1/N, for N data points) to the
log likelihood, log L(data;q), of the data (which has an empirical distribution of p)
having been produced by the induced model q.  The entropy corresponds in the same way to
the log likelihood, log L(data;p), of the data having been produced by a saturated model,
one which fits the empirical distribution of the data perfectly.  So the deviance:

   N D(p||q) = log L(data;p) - log L(data;q) = log [L(data;p) / L(data;q)]

is a measure of how far the model is from being as good as it could be.  Statisticians tend to think
in terms of the ratio of likelihoods, Machine Learning folks tend to think in terms of relative
entropy (entropy - cross_entropy, or KL-divergence).  Statisticians are interested in deviance
because (with the factor of 2) it is asymptotically chi-square for many modeling families.  In
information theoretic terms it's nice to think of the deviance as the number of bits extra that it
would take to transmit the data for a system assuming the distribution q, relative to a system that
had assumed p, which is the best system for transmitting that particular data set.
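The correspondence between the likelihood ratio and the relative entropy can be checked numerically; here is a Python sketch, where the sample and the induced model q are made up:

```python
import math
from collections import Counter

data = list("aaabbc")                        # a made-up sample
N = len(data)
counts = Counter(data)
p = {k: v / N for k, v in counts.items()}    # empirical (saturated) distribution
q = {k: 1 / len(counts) for k in counts}     # an induced model; uniform here

loglik_p = sum(c * math.log(p[k]) for k, c in counts.items())  # saturated log likelihood
loglik_q = sum(c * math.log(q[k]) for k, c in counts.items())  # model log likelihood
kl = sum(p[k] * math.log(p[k] / q[k]) for k in p)              # relative entropy D(p||q)

# The log likelihood ratio equals N times the relative entropy.
print(loglik_p - loglik_q, N * kl)
```

Replacing q with any other distribution over the same symbols leaves the identity intact; only the size of the divergence changes.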

Then again, maybe I've misunderstood you completely.  Please set me straight if I have.

-warren
#
----- Original Message -----
From: "Thomas Lumley" <tlumley at u.washington.edu>
To: "RemoteAPL" <remoteapl at obninsk.com>
Cc: <r-help at hypatia.math.ethz.ch>
Sent: Thursday, February 15, 2001 7:41 PM
Subject: Re: [R] deviance vs entropy
Thomas,

Thanks a lot, I got your explanation.  My problem was that I was thinking of a single
node, where it doesn't matter whether you calculate deviance or entropy.  But when you
compare the parent and child nodes it does matter indeed.  Entropy shows the same
improvement whether you perfectly split a mixture of 2 classes of 10 or of 1000 cases,
while deviance says that in the second case you may have more cake (or beer :-) in the
evening.  Also, now I understand that while constructing the tree we have fewer and
fewer points at each split, and deviance helps to see that the next split adds
complexity but probably doesn't add much classification quality.

Well, now I read in a paper: "Other approaches for choosing future splits involve
entropy measures and the Gini index".  Why is entropy used anyway?  Could you point me
to a paper comparing all these measures?

Thanks again for your help.
Alexander.

#
Warren,

Thank you for your answer; it gave my brain some food.  Let me ask more...
> I'm not quite sure what you have in mind, but I'm inferring from your comments that by
> "deviance" you mean:
I am sorry for my language.  I meant in particular the deviance that is calculated when
we select a split of a node while building a classification tree.  As far as I know it
should be:

 -2 SUM n_i log (p_i)

where n_i is the number of points of class c_i at this node and p_i = n_i/N, where N is
the total number of cases at the node.  Probably it relates in some way to what you
wrote above.  Could you tell me more about p_i and q_i in your formula?
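For example, here is how I compute the deviance improvement of one split (a Python sketch with made-up counts):

```python
import math

def node_deviance(counts):
    """-2 * sum_i n_i * log(p_i), with p_i = n_i / N at the node."""
    n = sum(counts)
    return -2 * sum(c * math.log(c / n) for c in counts if c > 0)

# made-up example: a parent node with 30 class-A and 10 class-B points,
# split into two children
parent = [30, 10]
left, right = [25, 2], [5, 8]
improvement = node_deviance(parent) - node_deviance(left) - node_deviance(right)
print(improvement)   # positive: the split reduces the deviance
```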
> If q is the uniform distribution, then the cross entropy reduces to:
I will probably understand this and the next statements once I understand the first
formula.
> I'm guessing that in the things you've read, when they are talking about deviance,
> q can be (and generally is) something other than the uniform distribution.  For
> example, p is often the empirical distribution of a data sample, and q is the
> distribution corresponding to some induced model.
It sounds interesting.  Could you please repeat this in terms of classification trees?
I mean, what are the "induced model" and the "corresponding distribution" if we are
speaking of CART?
> Machine Learning folks tend to think in terms of relative entropy
> (entropy - cross_entropy, or KL-divergence).
What does "KL" stand for?
> Statisticians are interested in deviance because (with the factor of 2) it is
> asymptotically chi-square for many modeling families.
That's probably the most important argument PRO.
> In information theoretic terms it's nice to think of the deviance as the number of
> bits extra that it would take to transmit the data for a system assuming the
> distribution q, relative to a system that had assumed p, which is the best system
> for transmitting that particular data set.
Very interesting!  I must think this over more.
> Then again, maybe I've misunderstood you completely.  Please set me straight if I have.
I see that you sit quite straight.  I am afraid that I am lying horizontally :-)

Regards,
Alexander.
