please help generate a square correlation matrix

Curses, my laptop is hallucinating again.  Hope I can get through this.
So we're talking about correlations between binary variables.
Suppose we have two 0-1-valued variables, x and y.
Let A <- sum(x*y)  # number of cases where x and y are both 1.
Let B <- sum(x)-A  # number of cases where x is 1 and y is 0
Let C <- sum(y)-A # number of cases where y is 1 and x is 0
Let D <- sum((1-x)*(1-y)) # number of cases where x and y are both 0.
(also D = length(x)-A-B-C)
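For anyone who wants to try this at the console, here is a minimal sketch of the four counts in R. The vectors x and y are made-up example data, not anything from the original question, and the table() call is only there as a cross-check.

```r
# Hypothetical example data: two 0/1 vectors of equal length.
x <- c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0)
y <- c(1, 0, 1, 1, 0, 0, 0, 0, 0, 1)

A <- sum(x * y)              # x = 1 and y = 1
B <- sum(x) - A              # x = 1 and y = 0
C <- sum(y) - A              # x = 0 and y = 1
D <- length(x) - A - B - C   # x = 0 and y = 0

# Cross-check against the 2-by-2 contingency table.
# With levels ordered (1, 0), column-major order of the table is A, C, B, D.
tab <- table(x = factor(x, levels = c(1, 0)),
             y = factor(y, levels = c(1, 0)))
stopifnot(all(as.vector(tab) == c(A, C, B, D)))
stopifnot(D == sum((1 - x) * (1 - y)))
```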

All the information is summarised in the 2-by-2 contingency table.
Some years ago, Nathan Rountree and I supervised Yun-Sing Koh's
data-mining PhD.
She surveyed the data mining literature and found some 37 different
"interestingness measures" for two-variable associations  -- if I
remember correctly; there were a lot of them.  They fell into a much
smaller number of qualitatively similar groups.
At any rate, the Pearson correlation between x and y is
(A*D - B*C)/sqrt((A+B)*(C+D)*(A+C)*(B+D))
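A quick way to convince yourself of that formula is to compare it with cor() on some simulated 0/1 data. The function name phi and the simulated vectors below are my own choices for illustration; the counts are converted to double so the products cannot overflow R's integers.

```r
# Pearson correlation of two 0/1 vectors from the 2-by-2 counts.
phi <- function(x, y) {
  x <- as.numeric(x)
  y <- as.numeric(y)
  A <- sum(x * y)              # both 1
  B <- sum(x) - A              # x = 1, y = 0
  C <- sum(y) - A              # x = 0, y = 1
  D <- length(x) - A - B - C   # both 0
  (A * D - B * C) / sqrt((A + B) * (C + D) * (A + C) * (B + D))
}

# Hypothetical simulated data, only to compare against cor().
set.seed(1)
x <- rbinom(200, 1, 0.3)
y <- rbinom(200, 1, 0.4)
stopifnot(isTRUE(all.equal(phi(x, y), cor(x, y))))
```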

So what happens when we delete the rows where x = 0 and y = 0?
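Right, it forces D to 0, leaving A, B, and C unchanged.
And looking at the numerator, A*D - B*C becomes -B*C, which cannot be positive.
  If you delete rows with x = 0 y = 0 you MUST get a negative correlation
  (or an undefined one, 0/0, if B or C happens to be 0).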

Quite a modest "true" correlation (based on all the data) like -0.2
can masquerade as quite a strong "zero-suppressed" correlation like
-0.6.  Even +0.2 can turn into -0.4.   (These figures are from a
particular simulation run and may not apply in your case.)
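To see the sign flip with concrete numbers, here is a small deterministic sketch. The counts A = 30, B = 20, C = 20, D = 130 are invented for illustration and are not the simulation run mentioned above.

```r
# Hypothetical cell counts for the 2-by-2 table (made up for illustration).
A <- 30; B <- 20; C <- 20; D <- 130

# Expand the counts back into two 0/1 vectors.
x <- rep(c(1, 1, 0, 0), times = c(A, B, C, D))
y <- rep(c(1, 0, 1, 0), times = c(A, B, C, D))

full <- cor(x, y)                    # uses all rows

keep <- x == 1 | y == 1              # drop rows where x = 0 and y = 0
suppressed <- cor(x[keep], y[keep])  # "zero-suppressed" correlation

print(c(full = full, suppressed = suppressed))
```

With these particular counts the full-data correlation is 3500/7500, about +0.47, while the zero-suppressed one is -400/1000 = -0.4.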

Now one of the reasons why Yun-Sing Koh, Nathan Rountree, and I were
interested in interestingness measures is perhaps coincidentally
related to the file drawer/underreporting problem: it's quite common
for rows where x = 0 and y = 0 never to have been reported to you, so
we were hoping there were measures immune to that.  I have argued for
years that "till record analysis" for supermarkets &c is badly flawed
by two facts: (a) it is hard to measure how much of a product people
WOULD have bought if only you had offered it for sale (although you
can make educated guesses) and (b) till records provide no evidence on
what the people who walked out without buying anything wanted (was the
price too high?  could they not find it?).  Problem (a) leads to a
commercial variant of the Signor-Lipps effect: "when x and/or y were
available for purchase" is not the same as "the period for which data
were recorded", thus inflating D, perhaps massively.  Methods
developed for handling the Signor-Lipps effect in paleontology can be
used to estimate when x and y were available, helping you to recover a
more realistic N=A+B+C+D.  I really should have published that.

All of which is a long-winded way of saying that
- Pearson correlations on binary columns can be computed very efficiently
- the rows with x=0 and y=0 may be very informative, even essential for analysis
- delete them at your peril.
- really, delete them at your peril.
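As for the efficient computation, here is one possible sketch for a whole matrix of 0/1 columns, using crossprod() to get every pairwise A at once. The function name binary_cor and the simulated matrix X are assumptions of mine for illustration; all rows are kept, including the all-zero ones.

```r
# Hypothetical 0/1 data: rows are cases, columns are variables.
set.seed(42)
X <- matrix(rbinom(1000 * 5, 1, 0.3), nrow = 1000,
            dimnames = list(NULL, paste0("v", 1:5)))

binary_cor <- function(X) {
  storage.mode(X) <- "double"  # avoid integer overflow in the products
  n <- nrow(X)
  s <- colSums(X)              # number of 1s in each column
  A <- crossprod(X)            # A[i, j]: columns i and j both 1
  B <- s - A                   # B[i, j] = s[i] - A[i, j]: i is 1, j is 0
  C <- t(B)                    # C[i, j] = s[j] - A[i, j]: i is 0, j is 1
  D <- n - A - B - C           # D[i, j]: columns i and j both 0
  (A * D - B * C) / sqrt((A + B) * (C + D) * (A + C) * (B + D))
}

R <- binary_cor(X)
stopifnot(isTRUE(all.equal(R, cor(X), check.attributes = FALSE)))
```

A column that is all 0s or all 1s gives 0/0 here, so its entries come out as NaN, much as cor() gives NA for a constant column.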
On Sat, 27 Jul 2024 at 23:07, Richard O'Keefe <raoknz at gmail.com> wrote: