Which columns give rise to linear dependency? - R-help

Tue, Nov 5, 2002 4:24 AM #

Short version

If I have a data frame X and I suspect
that there is a dependency between
the columns, how do I confirm that,
and how do I tell which subset of columns
is involved?

==================================

Long version

A colleague had been trying to use
the SPSS RELIABILITY procedure.
It told her that the determinant of the
matrix was small. She asked me what that meant
and I told her that one of her variables was a 
linear combination of others.
I agreed to investigate further and imported
the datasets into R. The rows of each X represent
people, and the columns items. The x_{ij} are binary (coded
0/1). Three of the datasets gave the
error message from SPSS. I confirmed that
the matrix involved was indeed var(X)
and that det(var(X)) agreed with SPSS.
I expected to find
that the smallest eigenvalues would
be zero, but in two of the datasets that was not true.
In the third dataset I traced the problem quickly
to a pair of items which were 
perfectly correlated.
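
The gap between a small determinant and a zero eigenvalue can be reproduced on simulated data. The following is a minimal sketch in R, assuming independent 0/1 items generated with rbinom(); the seed, sample size and number of items are arbitrary choices and nothing here comes from the datasets discussed in this thread. Each 0/1 item has variance of at most about 0.25, and the determinant is the product of the eigenvalues, so with a dozen items it is tiny even though the matrix has full rank.

```r
# Simulated, independent 0/1 items (hypothetical data, not the thread's).
set.seed(1)
n <- 200   # people
p <- 12    # items
X <- matrix(rbinom(n * p, 1, 0.5), n, p)

S  <- var(X)
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values

det(S)        # very small: the product of p eigenvalues, all well below 1
prod(ev)      # the same quantity, computed from the eigenvalues
min(ev)       # the smallest eigenvalue is nevertheless clearly above zero
qr(S)$rank    # numerical rank of var(X); equals p when there is no dependency
```

On data like these a small det(var(X)) says more about the scale and the number of items than about rank, which is one reason to look at the eigenvalues or the rank directly.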

1 I suspect that det(var(X)) is a poor test of
  whether X is of reduced rank. I have also looked at kappa(X)
  which gives values of 10 and 17 for the two offending scales, 
  but I have no feel for whether that is high (bad?).
2 I thought that by doing svd(X) and then
  examining V I could answer my question.
  However the elements of V, specifically
  the last column, did not show the pattern I
  had hoped for: most values effectively
  zero and the rest adding to zero.
  This did work for the third dataset though.
3 I think that SPSS was trying to invert
  var(X) in order to compute the multiple
  correlation of each item with the others.
  Is there any neat way of doing that in R?
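
Points 2 and 3 can be sketched in R on simulated data. This is an illustration under stated assumptions, not a diagnosis of the datasets in this thread: the data are made up, item6 is constructed as 1 - item1, and the columns are centred before svd() because var(X) works with centred columns, so a dependency that involves a constant is visible only after centring. The squared multiple correlations use the standard identity R^2_j = 1 - 1/(R^{-1})_jj, where R is the correlation matrix, and that requires R to be invertible.

```r
# Hypothetical 0/1 data with one exact dependency: item6 = 1 - item1.
set.seed(2)
n <- 100
X <- matrix(rbinom(n * 5, 1, 0.5), n, 5)
X <- cbind(X, 1 - X[, 1])
colnames(X) <- paste0("item", 1:6)

# Point 2: right singular vector for the smallest singular value.
Xc <- scale(X, center = TRUE, scale = FALSE)   # centre, do not rescale
s  <- svd(Xc)
s$d                          # the last singular value is numerically zero
v  <- s$v[, ncol(X)]         # last column of V
names(v) <- colnames(X)
round(v, 3)                  # non-zero entries mark the columns involved

# Point 3: squared multiple correlation of each item with the others,
# computed on the five columns that are not exactly dependent.
R   <- cor(X[, 1:5])
smc <- 1 - 1 / diag(solve(R))
round(smc, 3)

# Cross-check the first value against an explicit regression.
r2_lm <- summary(lm(item1 ~ ., data = as.data.frame(X[, 1:5])))$r.squared
all.equal(unname(smc[1]), r2_lm)
```

If cor(X) is exactly singular, solve() stops with an error, so the dependent column has to be found and dropped (for instance from the singular vector above) before the multiple correlations can be computed this way.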

I am using R 1.5.1 on Windows 98, if that makes
a difference.

If anyone wants to look at one of the datasets
I have her permission to make it available.
Point your browser at http://www.aghmed.fsnet.co.uk/r.html
     

Michael Dewey
michael.dewey at nottingham.ac.uk
http://www.nottingham.ac.uk/~mhzmd/home.html



-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

John Fox

Tue, Nov 5, 2002 7:03 AM #

Dear Michael,

There are several ways of finding near dependencies. For example, Belsley, 
Kuh, and Welsch in Regression Diagnostics (1980) use the singular-value 
decomposition. Here are a couple of simple approaches:

(1) Use the principal-component analysis of the standardized X-matrix. Very 
small component variances correspond to near collinearities, and the 
corresponding principal-component coefficients give you a linear combination 
of the standardized x's that is nearly equal to 0.

(2) Look at the variance-inflation factors. Very large VIFs correspond to 
variables that are nearly linearly dependent on others; regress each such 
variable on the others to see what the dependencies are. (Some of these 
regressions will be redundant.)
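
Both approaches can be sketched in base R. The example below is only an illustration on simulated data: item6 is a hypothetical near copy of item1 (three responses flipped), prcomp() with scale. = TRUE stands in for the principal-component analysis of the standardized X-matrix, and the VIFs are taken as the diagonal of the inverse correlation matrix, which assumes that matrix is invertible.

```r
# Hypothetical 0/1 data with a near dependency: item6 copies item1
# except in three randomly chosen rows.
set.seed(3)
n  <- 150
X  <- matrix(rbinom(n * 5, 1, 0.5), n, 5)
x6 <- X[, 1]
flip <- sample(n, 3)
x6[flip] <- 1 - x6[flip]
X <- cbind(X, x6)
colnames(X) <- paste0("item", 1:6)

# (1) Principal components of the standardized columns.
pc <- prcomp(X, scale. = TRUE)
round(pc$sdev^2, 3)            # component variances; the last is very small
round(pc$rotation[, 6], 2)     # coefficients of the near-zero combination

# (2) Variance-inflation factors from the inverse correlation matrix.
vif <- diag(solve(cor(X)))
round(vif, 1)                  # item1 and item6 stand out

# Regress the variable with the largest VIF on the others.
worst <- names(which.max(vif))
fit <- lm(reformulate(setdiff(colnames(X), worst), response = worst),
          data = as.data.frame(X))
round(coef(fit), 2)            # the large coefficient names the partner item
```

Because item1 and item6 are near copies of each other, regressing either on the rest tells the same story, which is the redundancy mentioned under (2).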

I hope that this helps,
  John

-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: jfox at mcmaster.ca
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox
-----------------------------------------------------


Ott Toomet

Tue, Nov 5, 2002 8:54 AM #

Hi,

 | From: "Michael Dewey" <Michael.Dewey at nottingham.ac.uk>
 | 
 | Short version
 | 
 | If I have a data frame X and I suspect
 | that there is a dependency between
 | the columns how do I confirm that,
 | and how do I tell which subset of columns
 | is involved?

In similar cases I have used the condition number of the matrix (it is
basically the square root of the ratio of the largest to the smallest
eigenvalue of a matrix, e.g. X'X where X is your (normalized) data
frame).  I add the data columns one by one and watch what happens to
the condition number.  A normal value is around 20.

In R, the condition number is estimated by kappa().
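
A minimal sketch of the column-by-column check, on simulated data. Several things here are assumptions made for the illustration: "normalized" is read as centring and scaling each column with scale(), item4 is constructed as 1 - item2, the helper cond_number() is hypothetical and takes the exact ratio of the largest to the smallest singular value instead of the kappa() estimate, and the cut-off of 1e6 is arbitrary.

```r
# Hypothetical 0/1 data in which item4 is the mirror image of item2.
set.seed(4)
n <- 100
X <- matrix(rbinom(n * 5, 1, 0.5), n, 5)
X[, 4] <- 1 - X[, 2]
colnames(X) <- paste0("item", 1:5)

Z <- scale(X)   # centre and scale every column

# Ratio of largest to smallest singular value of M, i.e. the square root
# of the eigenvalue ratio of M'M.
cond_number <- function(M) {
  d <- svd(M, nu = 0, nv = 0)$d
  max(d) / min(d)
}

# Condition number after adding the columns one at a time.
cond <- sapply(seq_len(ncol(Z)),
               function(k) cond_number(Z[, 1:k, drop = FALSE]))
names(cond) <- colnames(Z)
signif(cond, 3)

# First column at which the condition number explodes.
first_bad <- names(cond)[which(cond > 1e6)[1]]
first_bad
```

The column at which the value jumps is one member of the dependent set; the others are among the columns added before it.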

Perhaps this helps.

Ott
