On 08/08/02 13:23, Huan Huang wrote:
Dear Prof. Harrell and R list, I have done the variable clustering and summary scores. Thanks a lot for your kind help. But it hasn't solved the collinearity problem in my dataset. After the clustering and transcan, there is still very strong collinearity between the summary scores. The objective of my project is to find the influential variables. I believe variable reduction is not appropriate while the collinearity exists. I am thinking about principal component regression and variable reduction based on it (Rudolf J. Freund and William J. Wilson (1998), p. 215). Does anybody have a suggestion on variable reduction under this condition? I will appreciate any information.
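[A minimal sketch of the principal-component regression idea mentioned above, in base R. The data and variable names (x1..x4, y) are made up for illustration; the point is only that regressing on a few leading components sidesteps the collinearity among the raw predictors.]

```r
## Simulated data with two nearly-collinear pairs of predictors
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1
x3 <- rnorm(n)
x4 <- x3 + rnorm(n, sd = 0.05)   # nearly collinear with x3
y  <- 2 * x1 - 1.5 * x3 + rnorm(n)
X  <- cbind(x1, x2, x3, x4)

## PCA on the standardized predictors
pc <- prcomp(X, scale. = TRUE)
summary(pc)                      # proportion of variance per component

## Regress y on the first k components instead of the raw variables
k   <- 2
fit <- lm(y ~ pc$x[, 1:k])
summary(fit)

## The loadings (pc$rotation) show which original variables drive each
## retained component -- one way to trace back to influential variables
round(pc$rotation[, 1:k], 2)
```

The components are orthogonal by construction, so the coefficients in `fit` do not suffer from the variance inflation that the raw, collinear predictors would cause.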
I'm not sure what you mean by "resuction," but when I and many other psychologists face this kind of problem - reducing a set of variables - we often use factor analysis. A good program is factanal in the mva library. Varimax rotation (the default) usually picks out a sensible set of factors, although of course other rotations may be more informative for a given case. You can sort the loadings if you want (look at the various options for loadings() and print()).

There are no fixed rules for this sort of thing. Sometimes one variable winds up in the wrong place by chance. The strategy I use is to figure out a sensible grouping of variables before I use them to predict anything, so that I am not biased by knowing the results. So I feel free to move or remove variables that don't make sense. Some people may prefer a more rigid approach, which further reduces the temptation to cheat.

Having found the grouping of variables, you can do three different things:

1. Define "scores" by simply adding up the (standardized?) scores of the variables in each group (those with high loadings on the same factor, perhaps).

2. Use the factor scores themselves as variables.

3. Use a single representative variable from each group. This seems to be what you were suggesting, but I'm having trouble thinking of a situation where this would be better than #1 or #2.

Whatever you do, you need to figure out how many groups there are, and prcomp() or princomp() is often helpful here. (And take a look at biplot() - a really nice tool for looking at the first two principal components.) The factanal() program also reports a chi-square fit statistic, so in principle you could use that to figure out how many factors there are. However, that method usually gives more factors than are meaningful, especially when you have a large data set.
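[A hedged sketch of the workflow described above, on simulated data; the variable names v1..v4 are invented for illustration. Note that factanal was in the mva library in R of that era; in modern R it lives in the standard stats package, so no library() call is needed.]

```r
## Simulate four observed variables driven by two latent factors
set.seed(2)
n  <- 200
f1 <- rnorm(n); f2 <- rnorm(n)
d  <- data.frame(v1 = f1 + rnorm(n, sd = 0.4),
                 v2 = f1 + rnorm(n, sd = 0.4),
                 v3 = f2 + rnorm(n, sd = 0.4),
                 v4 = f2 + rnorm(n, sd = 0.4))

## Factor analysis with varimax rotation; ask for factor scores too
fa <- factanal(d, factors = 2, rotation = "varimax",
               scores = "regression")
print(fa, sort = TRUE, cutoff = 0.3)   # sorted loadings, small ones hidden

## Option 1: sum the standardized variables within a group
score1 <- rowSums(scale(d[, c("v1", "v2")]))

## Option 2: use the factor scores themselves as variables
head(fa$scores)

## prcomp() and biplot() help judge how many groups there are
pc <- prcomp(d, scale. = TRUE)
biplot(pc)                             # first two principal components
```

The chi-square fit statistic mentioned in the text appears in the printed factanal output (`fa$STATISTIC`, with `fa$PVAL`), testing whether the chosen number of factors is sufficient.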
Jonathan Baron, Professor of Psychology, University of Pennsylvania
Home page: http://www.sas.upenn.edu/~baron
R page: http://finzi.psych.upenn.edu/