
Correlation when one variable has zero variance (polychoric?)

4 messages · John Fox, Jose Quesada

#
Hi,

I'm running this for a simulation study, so many combinations of  
parameters produce many predictions that I need to correlate with data.

The problem
----------------
I'm using rating data with 3 to 5 categories (e.g., too low, correct, too  
high). The underlying continuous scales should be normal, so I chose the  
polychoric correlation. I'm using the polycor package in its latest  
version, 0.7.4.

The problem is that sometimes the models always predict the same value  
(i.e., the median). Example frequency table:

          2    3    4    5
     3   28  179  141   50

That is, there is no variability in one of the variables (its only value  
is 3, the median).

The Pearson product-moment correlation is the covariance divided by the  
product of the standard deviations of the two variables. If the standard  
deviation of one of the variables is zero, then the denominator is zero  
and the correlation cannot be computed. R returns NA and a warning.

If I add jitter to the variable with no variability, then I get a  
virtually zero, but calculable, Pearson correlation.
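Both behaviors can be reproduced directly; here is a minimal sketch (the variable names are made up for illustration):

```r
# One variable is constant: 400 predictions that are all category 3.
predictions <- rep(3, 400)
data <- sample(1:5, 400, replace = TRUE)

# cor() divides the covariance by the product of the standard deviations;
# sd(predictions) is 0, so the result is NA, with a warning that the
# standard deviation is zero.
r <- cor(predictions, data)
is.na(r)  # TRUE

# jitter() gives the constant variable a tiny nonzero variance, so the
# correlation becomes computable -- and is essentially zero, since the
# jitter is just random noise uncorrelated with the data.
r_jit <- cor(jitter(predictions), data)
```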

However, when I use the polychoric correlation (using the default  
settings), I get just the opposite: a very high correlation!
[1] 0.999959

This is very counterintuitive. I also ran the same analysis in 2005 (I  
don't know what has changed in the polycor package since then) and the  
results were different; I think back then I contrasted them with SAS and  
they were the same. Maybe the approximation fails in extreme cases where  
most of the cells are zero? Maybe the approximation was not used in the  
first releases of the package? But it seems that the ML estimator doesn't  
work at all (at least in the current version of the package) with such  
tables, where most cells are zero because one variable has no variability:
Error in tab * log(P) : non-conformable arrays
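The degenerate table can be written down explicitly; with only a single nonzero row, the table is effectively one-dimensional (a sketch of the check, not polycor code):

```r
# The two-way table from the example: the model always predicts 3, while
# the data fall in categories 2 through 5.
tab <- matrix(c(28, 179, 141, 50), nrow = 1,
              dimnames = list("3", c("2", "3", "4", "5")))

# Only one row has a nonzero marginal, so there is no information in the
# table about the row thresholds or the correlation.
rowSums(tab)           # a single row total of 398
sum(rowSums(tab) > 0)  # 1
```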

I've seen some posts where sparse tables were trouble, eg:  
http://www.nabble.com/polychor-error-td5954345.html#a5954345
  "You're expecting a lot out of ML to get estimates of the first couple of  
thresholds for rows and the first for columns. [which were mostly zeroes]"

Are the polychoric estimates using the approximation completely wrong? Is  
there any way to compute a polychoric correlation with such a dataset?  
What should I conclude from data like these?
Maybe using correlation is not the right thing to do.

Thanks,
-Jose
#
Dear Jose,
This is simply a bug in polychor(), which currently does the following test:

  if (r < 1) stop("the table has fewer than 2 rows")
  if (c < 2) stop("the table has fewer than 2 columns")

That is, my intention was to check (r < 2) and report an error. Actually, it
would probably be better to return NA and report a warning.
> Maybe the approximation fails in extreme cases where most of the cells
> are zero? Maybe the approximation was not used in the first releases of
> the package?

I don't entirely follow this. Are you referring to the table above with one
row, more generally to tables with zero marginals, or to tables in which
there are interior zeroes? When there are zero marginals, the ML estimate
cannot be unique, since there is zero information about one or more of the
thresholds.

> Are the polychoric estimates using the approximation completely wrong?

Yes. If there is a zero marginal, then the correlation shouldn't have been
computed in the first place (and was, due to the error that I mentioned).

> Is there any way to compute a polychoric correlation with such a dataset?

I'd say no. There is no information in the data about the correlation.

> What should I conclude from data like these?

That the data aren't informative about the parameters of interest.
Presumably the normally distributed latent variables that underlie the table
have some correlation, but you can't estimate it from the data.

I'll fix polychor() (and put in a test for 0 marginals as well as single-row
or -column tables) -- thanks for the bug report.

Regards,
John
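A sketch of the corrected guard, with the NA-and-warning behavior John suggests (hypothetical code, not the actual polycor source):

```r
# Hypothetical helper illustrating the fix described above: check for fewer
# than 2 rows/columns *and* for zero marginals, returning NA with a warning
# rather than stopping with an error.
check_table <- function(tab) {
  if (nrow(tab) < 2) {  # polychor() mistakenly tested 'r < 1' here
    warning("the table has fewer than 2 rows; returning NA")
    return(NA)
  }
  if (ncol(tab) < 2) {
    warning("the table has fewer than 2 columns; returning NA")
    return(NA)
  }
  if (any(rowSums(tab) == 0) || any(colSums(tab) == 0)) {
    warning("the table has zero marginals; returning NA")
    return(NA)
  }
  TRUE
}
```

With the one-row table from the example, this returns NA with a warning instead of silently producing a near-1 correlation.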
#
Dear John,
John> I don't entirely follow this. Are you referring to the table above
John> with one row, more generally to tables with zero marginals, or to
John> tables in which there are interior zeroes?

I have plenty of those tables, and I think quite a few of them have zero 
marginals (the case I posted might be a bit extreme). I have 400 observations, 
so no matter how centered the distributions are, some observations will fall 
outside the center.

The results I got in 2005 cannot be reproduced now in 2007 with the same code; 
I guess this could be due to the bug you describe (maybe it was introduced 
later?). In 2007, I got many correlations as high as the one I described and I 
was wondering what the problem was. I don't have SAS available anymore, so I 
cannot run the code I wrote in SAS to compare.

Where can I get the new code for polychor?

I'm in a predicament here; the data I'm analyzing are from a flight simulation 
and are extremely expensive to get, so running more experiments is out of the 
question.

Any pointers as to how I could analyze this dataset (i.e., one where there 
might be zero marginals)?

Thanks

-Jose
#
Dear Jose,
As I said, there's no basis for estimating polychoric correlations and all
thresholds when there are zero marginals. If there is more than one row and
column remaining with nonzero marginals, then you could simply eliminate the
rows/columns with zero marginals, but tables with only one nonzero row or
column have no information about the correlation. I'll think about doing
this -- i.e., removing zero rows and columns -- automatically and issuing a
warning.
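The clean-up step described above can be sketched as follows (the function name is made up; this is not polycor code):

```r
# Hypothetical sketch of the automatic clean-up: drop rows and columns
# whose marginal totals are zero, warning when anything is removed.
drop_zero_margins <- function(tab) {
  keep_rows <- rowSums(tab) > 0
  keep_cols <- colSums(tab) > 0
  if (!all(keep_rows) || !all(keep_cols))
    warning("rows/columns with zero marginals were removed")
  tab[keep_rows, keep_cols, drop = FALSE]
}

# A 3 x 3 table with one empty row and one empty column shrinks to 2 x 2.
# A table left with a single nonzero row still carries no information
# about the correlation, so it should become NA downstream.
tab <- matrix(c(5, 0, 3,
                0, 0, 0,
                2, 0, 4), nrow = 3, byrow = TRUE)
dim(drop_zero_margins(tab))  # 2 2
```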
No program, not even SAS, can magically estimate a correlation from a table
with one row or column. If polychor() did that in 2005, the answer it
provided was erroneous.
I plan to upload a new version of the polycor package to CRAN as soon as I
have a chance -- probably sometime this week. But you already have the code
for polychor() and can modify it yourself: Just fix the test so that it
checks for < 2 rather than < 1 row, and return NA (and issue a warning) in
this case.
I'm sorry, but as I said there's no magic solution here. The data, however
expensive, don't have information relevant to estimating the correlation.

Regards,
 John