Em Ter 26 abr. 2016, ?s 16:17, Leonardo Ferreira Fontenelle escreveu:
Hello everyone,
I have a dataset from a household survey, with sampling weights, and I
wanted to create an assets-based indicator of economic level for the
households. The idea was to run PCA (with sampling weights), and the
first principal component would be the economic level indicator. In
this email, I am calling "surveydesignobj" the object returned by
svydesign(), and "svyprcompobj" the object returned by
svyprcomp(formula, surveydesignobj, center = TRUE, scale. = TRUE,
scores = TRUE).
What I can't understand is why the first principal component stored in
svyprcompobj$x is so different from what I get from
predict(svyprcompobj, surveydesignobject). By "different" I mean
different distributions and only moderate correlation (~ 0.5) between
them. I also tried recreating the first principal component "by hand",
by summing the (centered or not) variables after multiplying them by the
loadings (svyprcompobj$rotation) and dividing them by their scales (from
svyprcompobj$scale); the resulting vector was highly correlated (> 0.99)
with the first principal component obtained from predict().
Peaking in the svyprcomp() code I saw that the function runs PCA in the
data after multiplying it by
sqrt(samplingweights/mean(samplingweights)), and latter divides
svyprcompobj$x by?sqrt(samplingweights/mean(samplingweights)) before
returning it. I also noticed that there is no predict.svyprcomp(), only
predict.prcomp().
Given that different methods provide different values, I'd like to know
if there is only one correct method (which?), or if it's a matter of
interpreting differently the results of each method.
Thanks in advance,
When I sent my previous email I was away from my computer, and couldn't
provide some code in R to exemplify what I meant. The following code
illustrates the point with more accessibled data:
library("survey")
data(api)
dclus2 <- svydesign(ids = ~ dnum + snum, fpc = ~ fpc1 + fpc2, data =
apiclus2)
pc <- svyprcomp(~ api99 + api00 + ell + hsg + meals + emer, design =
dclus2, scale. = TRUE, scores = TRUE)
dclus2$variables$pc1 <- pc$x[, "PC1"]
dclus2$variables$pc2 <- predict(pc, apiclus2)[, "PC1"]
mycoef <- pc$rotation[, "PC1"] / pc$scale
dclus2$variables$pc3 <- with(apiclus2, api99 * mycoef["api99"] + api00 *
mycoef["api00"] + ell * mycoef["ell"] +
hsg * mycoef["hsg"] + meals * mycoef["hsg"] + emer *
mycoef["emer"])
cov.wt(dclus2$variables[, paste0("pc", 1:3)], wt = weights(dclus2), cor
= TRUE)$cor # correlation matrix
summary(dclus2$variables[, paste0("pc", 1:3)])
bw1 <- sqrt(coef(svyvar(~ pc1, dclus2))) / 3
bw2 <- sqrt(coef(svyvar(~ pc2, dclus2))) / 3
bw3 <- sqrt(coef(svyvar(~ pc3, dclus2))) / 3
plot(svysmooth(~ pc1, dclus2, bandwidth = bw1), xlim = c(-2.5, 7.5),
ylim = c(0, 0.75))
lines(svysmooth(~ pc2, dclus2, bandwidth = bw2), col = 2)
lines(svysmooth(~ pc3, dclus2, bandwidth = bw3), col = 3)
legend("topright", legend = c("pc$x[, \"PC1\"]", "predict(pc,
apiclus2)[, \"PC1\"]", "sum(variables * loadings / scale)"), col = 1:3,
lty = 1)
Thanks,
Leonardo Ferreira Fontenelle
http://lattes.cnpq.br/9234772336296638