Hello, the question may look simple, perhaps even stupid. But I spent several hours searching the Internet and downloaded tons of papers where deviance is mentioned, and I haven't found an answer. The use of entropy when splitting a node of a classification tree is clear to me: entropy is a good old measure of how uniform a distribution is, and we want each node's class distribution to be as pure as possible, ideally representing one class only. But where does deviance come from at all? I look at the formula and see that the only difference from entropy is the use of the *number* of points of each class, instead of the *probability*, as the multiplier of log(p_ik). So deviance and entropy seem to differ by a factor of 1/N (or 2/N), where N is the total number of cases. Then WHY say "deviance"? Is there a historical reason? Or, more likely, am I missing something very basic. Please help. Thanks, Alexander Skomorokhov
deviance vs entropy
5 messages · RemoteAPL, Thomas Lumley, Warren R. Greiff
On Thu, 15 Feb 2001, RemoteAPL wrote:
[original question quoted above]
Entropy is, as you say, a measure of non-uniformity. Deviance (which is based on the log-likelihood function) is a measure of evidence. A given level of improvement in classification is much stronger evidence for a split if it is based on a large number of points. For example, with 2 points you can always find a split that gives perfect classification. With 2000 points it is very impressive to be able to get perfect classification with one split.

-thomas

Thomas Lumley, Asst. Professor, Biostatistics, University of Washington, Seattle (tlumley at u.washington.edu)
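Thomas's point can be checked numerically. Here is a minimal Python sketch (the class counts and helper names are invented for illustration): the entropy of a node depends only on the class proportions, so a 50/50 node scores log 2 whether it holds 2 points or 2000, while the deviance also scales with the number of cases, so the same perfect split carries far more evidence at n = 2000 than at n = 2.

```python
from math import log

def entropy(counts):
    """Shannon entropy (nats) of the class distribution given by counts."""
    n = sum(counts)
    return -sum(c / n * log(c / n) for c in counts if c > 0)

def deviance(counts):
    """Node deviance: -2 * sum_i n_i * log(p_i), with p_i = n_i / n."""
    n = sum(counts)
    return -2 * sum(c * log(c / n) for c in counts if c > 0)

# A perfectly mixed two-class node; a perfect split sends each class to a
# pure child, where both entropy and deviance are zero.  The *reduction*
# achieved by the split is therefore the parent's own value.
for n in (2, 2000):
    parent = [n // 2, n // 2]
    # entropy is log(2) ~ 0.693 for any 50/50 node, regardless of n;
    # deviance grows in proportion to n
    print(n, entropy(parent), deviance(parent))
```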
RemoteAPL wrote: [original question quoted above]
I'm not quite sure what you have in mind, but I'm inferring from your comments that by "deviance" you mean:

SUM p_i log (p_i/q_i)   (or 2 SUM p_i log (p_i/q_i))

In information-theoretic terms this is the relative entropy:

D(p||q) = SUM p_i log p_i - SUM p_i log q_i = H(p:q) - H(p)

where H(p) = -SUM p_i log p_i is the entropy of p, and H(p:q) = -SUM p_i log q_i is the cross entropy. If q is the uniform distribution over N cells, the cross entropy reduces to:

H(p:q) = -SUM p_i log (1/N) = log(N) SUM p_i = log(N)

so you get D(p||q) = log(N) - H(p).

I'm guessing that in the things you've read, when they talk about deviance, q can be (and generally is) something other than the uniform distribution. For example, p is often the empirical distribution of a data sample, and q is the distribution corresponding to some induced model. Then D(p||q) is a measure of how far the model is from the observed data.

Note that the cross entropy corresponds to the likelihood, L(data;q), of the data (which has empirical distribution p) having been produced by the induced model q, and the entropy corresponds to the likelihood, L(data;p), of the data having been produced by a saturated model, one which fits the empirical distribution of the data perfectly. Writing n for the number of observations, log L(data;p) - log L(data;q) = n * D(p||q), so the deviance

2 log [ L(data;p) / L(data;q) ] = 2n * D(p||q)

is a measure of how far the model is from being as good as it could be. Statisticians tend to think in terms of the ratio of likelihoods; Machine Learning folks tend to think in terms of relative entropy (cross entropy minus entropy, or KL-divergence). Statisticians are interested in deviance because (with the factor of 2) it is asymptotically chi-square for many modeling families. In information-theoretic terms it's nice to think of the deviance as the number of extra bits it would take to transmit the data for a system assuming the distribution q, relative to a system that had assumed p, which is the best system for transmitting that particular data set.
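These identities are easy to verify numerically. A small Python sketch (the distributions p and q are arbitrary examples): it checks that D(p||q) equals the cross entropy minus the entropy, and that for uniform q over N cells the cross entropy is exactly log(N).

```python
from math import log

def H(p):
    """Entropy: -sum p_i log p_i (natural log)."""
    return -sum(pi * log(pi) for pi in p if pi > 0)

def cross_H(p, q):
    """Cross entropy: -sum p_i log q_i."""
    return -sum(pi * log(qi) for pi, qi in zip(p, q) if pi > 0)

def KL(p, q):
    """Relative entropy D(p||q) = sum p_i log(p_i / q_i)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]      # e.g. empirical class proportions at a node
q = [1/3, 1/3, 1/3]      # uniform reference distribution over N = 3 cells

# D(p||q) = H(p:q) - H(p); with q uniform, H(p:q) = log(N) = log(3)
print(KL(p, q), cross_H(p, q) - H(p), cross_H(p, q), log(3))
```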
Then again, maybe I've misunderstood you completely. Please set me straight if I have. -warren
----- Original Message ----- From: "Thomas Lumley" <tlumley at u.washington.edu> To: "RemoteAPL" <remoteapl at obninsk.com> Cc: <r-help at hypatia.math.ethz.ch> Sent: Thursday, February 15, 2001 7:41 PM Subject: Re: [R] deviance vs entropy
[original question and Thomas Lumley's reply quoted above]
Thomas, thanks a lot. I follow your explanation now. My problem was that I was thinking about a single node, where it doesn't matter whether you calculate deviance or entropy. But when you compare the parent and child nodes it does matter indeed. Entropy shows the same improvement whether you perfectly split a mixture of 2 classes with 10 cases or with 1000 cases, while deviance says that in the second case you may have more cake (or beer :-) in the evening. Also, I now understand that while growing the tree we have fewer and fewer points at each split, and deviance helps to see that the next split adds complexity but probably doesn't add much classification quality. Now, though, I read in a paper: "Other approaches for choosing future splits involve entropy measures and the Gini index." Why is entropy used anyway? Could you point me to a paper comparing all these measures? Thanks again for your help. Alexander.
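On the Gini question: entropy and the Gini index are both impurity measures that vanish for a pure node and peak at an even mixture, so they usually pick similar splits. A small Python comparison (the proportions are made up for illustration):

```python
from math import log

def entropy_impurity(p):
    """Entropy impurity: -sum p_i log p_i over the class proportions p."""
    return -sum(pi * log(pi) for pi in p if pi > 0)

def gini_impurity(p):
    """Gini index: 1 - sum p_i^2, the chance two random draws disagree."""
    return 1 - sum(pi * pi for pi in p)

# Both measures are 0 for a pure node and largest for a 50/50 mixture.
for p in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(p, entropy_impurity(p), gini_impurity(p))
```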
Warren, thank you for your answer. It gave me food for thought. Let me ask more...
I'm not quite sure what you have in mind, but I'm inferring from your comments that by "deviance" you mean: SUM p_i log (p_i/q_i) (or 2 SUM p_i log (p_i/q_i))
I am sorry for my language. I meant specifically the deviance calculated when we select the split of a node while building a classification tree. As far as I know it should be: -2 SUM n_i log(p_i), where n_i is the number of points of class c_i at the node and p_i = n_i/N, with N the total number of cases at the node. It probably relates somehow to what you wrote above. Could you say more about what p_i and q_i are in your formula?
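Note that this node formula differs from the node entropy only by the factor from the original question: -2 SUM n_i log(p_i) = 2N * (-SUM p_i log p_i), i.e. the node deviance is exactly 2N times the node entropy (in natural logs). A quick Python check (the class counts are made up for illustration):

```python
from math import log

def node_deviance(counts):
    """-2 * sum_i n_i * log(n_i / N): the deviance used at a tree node."""
    N = sum(counts)
    return -2 * sum(n * log(n / N) for n in counts if n > 0)

def node_entropy(counts):
    """-sum_i p_i log p_i with p_i = n_i / N (natural log)."""
    N = sum(counts)
    return -sum(n / N * log(n / N) for n in counts if n > 0)

counts = [30, 50, 20]   # hypothetical class counts at one node, N = 100
N = sum(counts)
# The two quantities agree: deviance = 2N * entropy
print(node_deviance(counts), 2 * N * node_entropy(counts))
```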
D(p||q) = SUM p_i log p_i - SUM p_i log q_i = H(p:q) - H(p), where H(p) is the entropy of p, and H(p:q) is the cross entropy. If q is the uniform distribution, then the cross entropy reduces to log(N).
I will probably understand this and the following statements once I understand the first formula.
I'm guessing that in the things you've read, when they talk about deviance, q can be (and generally is) something other than the uniform distribution. For example, p is often the empirical distribution of a data sample, and q is the distribution corresponding to some induced model. Then D(p||q) is a measure of how far the model is from the observed data.
That sounds interesting. Could you please restate this in terms of classification trees? That is, what are the "induced model" and the "corresponding distribution" if we are speaking of CART?
Statisticians tend to think in terms of the ratio of likelihoods; Machine Learning folks tend to think in terms of relative entropy (cross entropy minus entropy, or KL-divergence).

What does "KL" stand for?

Statisticians are interested in deviance because (with the factor of 2) it is asymptotically chi-square for many modeling families.

That's probably the most important argument PRO.
In information-theoretic terms it's nice to think of the deviance as the number of extra bits it would take to transmit the data for a system assuming the distribution q, relative to a system that had assumed p, which is the best system for transmitting that particular data set.

Very interesting! I must think this over more.
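That "extra bits" reading can be made concrete. A short Python sketch (the distributions are invented for illustration): the average code length when the data follow p but the code was designed for q exceeds the optimal length by exactly the KL-divergence D(p||q), here measured in bits.

```python
from math import log2

def expected_code_length(p, q):
    """Mean bits per symbol when data distributed as p is coded for q."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # actual (empirical) symbol frequencies
q = [1/3, 1/3, 1/3]     # distribution the coder assumed

# Penalty for assuming q instead of the best distribution p:
extra_bits = expected_code_length(p, q) - expected_code_length(p, p)
print(extra_bits)       # equals D(p||q) in bits, and is always >= 0
```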
Then again, maybe I've misunderstood you completely. Please set me straight if I have.

I see that you sit quite straight. I am afraid that I lie horizontally :-) Regards, Alexander.