HELP! Excel and R give me totally different regression results using the exact same data
On Nov 7, 2012, at 11:47 AM, frauke wrote:
Okay. Sorry for being vague in my earlier message. I had missed a few lines from your message because they were hiding well in my own email. I am really on the learning side with this, so it will take some time. Sorry. There seem to be two issues: (1) Me preparing the data incorrectly and (2) the data not being fit for regression. Right?
Well, the second point might be more correctly stated as: the data do not meet the conditions for valid inference using linear regression. Since the goals of the exercise have never been stated, it is difficult to say whether other regression methods might be more applicable.
Ad1. Point about header taken. As to using characters in a matrix, I extract the data from data files from the National Weather Service. I extract observations together with dates and location names. Each row consists of date, location and observations. I chose to store them in matrices because I can combine them to arrays. A matrix can only have one type of data, so I chose to leave them all as characters.
That is generally the reason people use data.frames.
When I proceed to do a regression analysis I transform the observations into numbers using as.numeric(). Do you have a different suggestion? Will R give me different results if I store characters in a matrix?
It shouldn't, but it seems unnecessarily convoluted and prone to errors.
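As a rough sketch of the data.frame approach suggested above: the column names and values below are invented for illustration and are not taken from the National Weather Service files.

```r
# Hypothetical records, one row per date/location; all values are invented.
obs <- data.frame(
  date     = as.Date(c("2012/11/05", "2012/11/06", "2012/11/07"),
                     format = "%Y/%m/%d"),
  location = c("STN_A", "STN_A", "STN_B"),
  stage    = c(1.91, 1.90, 1.93),
  stringsAsFactors = FALSE
)

# Each column keeps its own type, so no as.numeric() round trip is needed.
str(obs)

# Binding the same values into a matrix coerces everything to character.
m <- cbind(format(obs$date), obs$location, obs$stage)
typeof(m)
```

With the data.frame, the numeric column can go straight into lm() without any conversion step.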
Even though such excerpts from a long script aren't very informative, to be complete:
collection <- matrix(rep(NA,25),ncol=25) #collection will be a row of the output matrix later on.
#extract dates
collection[1] <- paste(year,"/", substring(.file,125,126), "/", substring(.file, 127, 128), sep="")
That is only going to change the first element of 'collection'. You should study the help page for "[". If you were changing the first column it would need to be a different call on the LHS.
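The difference between a single index and a row/column index can be seen in a small sketch. The second matrix, `out`, is a made-up example and is not part of the original script.

```r
# A 1 x 25 matrix, as in the excerpt: a single index addresses one element.
collection <- matrix(NA_character_, ncol = 25)
collection[1] <- "2012/11/07"
sum(!is.na(collection))    # exactly one cell has been filled

# With more than one row the distinction becomes visible.
out <- matrix(NA_character_, nrow = 3, ncol = 4)
out[1]   <- "a"    # vector-style index: only the element in row 1, column 1
out[, 2] <- "b"    # empty row index: every row of column 2
out[3, ] <- "c"    # empty column index: every column of row 3
out
```

See ?"[" for the full indexing rules.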
#extract observations
collection[start.write+i]<-(substring(input , fields[[i]][1] ,
fields[[i]][2]))
Again, possibly not what you thought you were doing. Lack of context prevents further analysis.
Ad2. You mention heteroscedasticity and non-normality of residuals. To keep it short I had provided just a subset of the data I have (100 of 4000 matrix rows). But the same is true for the whole dataset. I attached the whole thing this time. test_complete.txt <http://r.789695.n4.nabble.com/file/n4648759/test_complete.txt> How do I deal with this?
str(dat)
'data.frame':	3548 obs. of  5 variables:
 $ V1: num  1.91 1.9 1.93 2.16 1.9 1.87 1.87 2.01 2.8 2.11 ...
 $ V2: num  1.86 1.9 1.91 1.88 1.87 1.88 6.94 2.01 2.03 2.09 ...
 $ V3: num  1.89 1.94 1.9 1.85 1.86 1.88 2.01 2 2.03 2.06 ...
 $ V4: num  1.92 1.96 1.91 1.83 1.85 1.87 2.01 2.03 2.04 2.03 ...
 $ V5: num  2.1 2 1.93 1.92 1.85 1.86 2.02 2.15 2.08 2.03 ...
lm(V1 ~ ., data=dat)
Call:
lm(formula = V1 ~ ., data = dat)
Coefficients:
(Intercept)           V2           V3           V4           V5
     0.1291       0.3378       0.2079       0.2635       0.1460
summary( lm(V1 ~ ., data=dat))
Call:
lm(formula = V1 ~ ., data = dat)
Residuals:
     Min       1Q   Median       3Q      Max
-13.3116  -0.1825  -0.0304   0.0959  27.0989
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.12906    0.03840   3.361 0.000784 ***
V2           0.33783    0.01768  19.111  < 2e-16 ***
V3           0.20789    0.01686  12.329  < 2e-16 ***
V4           0.26346    0.01784  14.768  < 2e-16 ***
V5           0.14596    0.01672   8.728  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.781 on 3543 degrees of freedom
Multiple R-squared: 0.7693, Adjusted R-squared: 0.7691
F-statistic: 2954 on 4 and 3543 DF, p-value: < 2.2e-16
with(dat, plot(V2, V1) )
[Attached plot: Rplot.png <https://stat.ethz.ch/pipermail/r-help/attachments/20121107/ecd2057a/attachment.png>]
There appears to be quite a bit of "structure" in that plot. And a rather similar structure in with(dat, plot(V3, V1) )
I admit I am pretty clueless in this case. Can I do meaningful regression at all? (I didn't expect test[,3] to be a good predictor but had hopes for test[,2].)
What are these data and what are the scientific questions? You appear to think a) I can look over your shoulder and see your display and b) deduce your goals from extremely fragmentary evidence. I have a lower opinion of my ability to accomplish those tasks.
The residuals are definitely not normally distributed.
Not generally the biggest concern. But again you provide no code. Nabble-users are unfortunately notorious in rhelp for not reading the Posting Guide, and some do not seem even to understand that rhelp is not Nabble.
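For what it is worth, a self-contained residual check that could be posted might look something like the sketch below. It uses simulated stand-in data (the variable names and the noise model are assumptions), not the attached dataset.

```r
set.seed(1)
# Simulated stand-in data; NOT the poster's dataset.
n   <- 200
x   <- runif(n, 1, 10)
y   <- 0.5 + 0.8 * x + rnorm(n, sd = 0.2 * x)   # noise grows with x
fit <- lm(y ~ x)

res <- residuals(fit)
fv  <- fitted(fit)

# Crude check for non-constant variance: compare residual spread in the
# lower and upper halves of the fitted values.
lo <- res[fv <= median(fv)]
hi <- res[fv >  median(fv)]
c(sd_low = sd(lo), sd_high = sd(hi))

# Formal test of residual normality.
shapiro.test(res)

# Graphical versions of the same checks:
# plot(fit, which = 1)   # residuals vs fitted
# plot(fit, which = 2)   # normal Q-Q plot
```

A clearly larger spread in the upper half points to heteroscedasticity, which matters more for inference than the normality test does.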
They do not seem to be related to either of the two predictors.
Well, that second outcome would be the expected (even the desired) outcome of a regression wouldn't it? You would want the relationships to be in the prediction and the residuals to have zero correlations with
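That least-squares residuals are uncorrelated with the predictors by construction can be checked directly. The sketch below uses simulated data with made-up variable names, not the poster's data.

```r
set.seed(42)
# Simulated stand-in data; not the poster's dataset.
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(100)

fit <- lm(y ~ x1 + x2, data = d)
res <- residuals(fit)

# With an intercept in the model, these are zero up to rounding error.
cor(res, d$x1)
cor(res, d$x2)
```

So finding no linear relationship between the residuals and the predictors says nothing about whether the model is adequate; plots of residuals against fitted values are more informative.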
What is the conclusion from that? Thanks for your patience!
I'm rapidly running out of patience, however. Please read the Posting Guide more thoroughly than you appear to have done so far.
David Winsemius, MD Alameda, CA, USA