lm and R-squared (newbie)
On Thu, Dec 15, 2011 at 8:35 AM, PtitBleu <ptit_bleu at yahoo.fr> wrote:
Hello,

I have two data frames (data1 and data4), read with dec="." and sep=";":
http://r.789695.n4.nabble.com/file/n4199964/data1.txt data1.txt
http://r.789695.n4.nabble.com/file/n4199964/data4.txt data4.txt

When I plot them with

plot(data1$nx, data1$ny, col = "red")
points(data4$nx, data4$ny, col = "blue")

the results look very similar (at least to me), but the R-squared values from summary(lm(data1$ny ~ data1$nx)) and summary(lm(data4$ny ~ data4$nx)) are very different (0.48 versus 0.89). Could someone explain the reason?

To be complete: I am looking for a simple indicator that tells me whether it is worthwhile to keep the values provided by lm, and I thought R-squared could do the job. My assumption was that if R-squared is far from 1, the data are not good enough for a linear fit. It seems that I'm wrong.
The problem is outliers. Try using a robust measure instead. If we replace the Pearson correlation with the Spearman (rank) correlation, the two R-squared values are much closer:
# R^2 based on Pearson correlations
cor(fitted(lm(ny ~ nx, data4)), data4$ny)^2
[1] 0.8916924
cor(fitted(lm(ny ~ nx, data1)), data1$ny)^2
[1] 0.4868575
# R^2 based on Spearman (rank) correlations
cor(fitted(lm(ny ~ nx, data4)), data4$ny, method = "spearman")^2
[1] 0.8104026
cor(fitted(lm(ny ~ nx, data1)), data1$ny, method = "spearman")^2
[1] 0.7266705
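To see why a single gross outlier can drag down the Pearson-based R^2 while leaving the rank-based version largely intact, here is a small self-contained sketch on simulated data (the variable names and the injected outlier value are made up for illustration; your data1/data4 are not needed):

```r
# Synthetic illustration: one gross outlier and its effect on
# Pearson- vs Spearman-based R^2.
set.seed(1)
x <- 1:50
y <- 2 * x + rnorm(50)      # clean, nearly perfect linear relationship

y_out <- y
y_out[50] <- -100           # inject a single gross outlier

# Pearson-based R^2 is pulled down sharply by the one bad point...
r2_pearson  <- cor(x, y_out)^2

# ...while the Spearman (rank) version moves much less, because
# only the ranks involving the corrupted point change.
r2_spearman <- cor(x, y_out, method = "spearman")^2

r2_pearson
r2_spearman
```

If you want a fit that is itself robust to such points, rather than just a robust measure of fit quality, one standard option is rlm() from the MASS package in place of lm().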
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com