Skip to content

HELP! Excel and R give me totally different regression results using the exact same data

7 messages · frauke, Rui Barradas, David Winsemius

#
Hallo, 

I am totally confused why Excel and R give me totally different regression
results for the data below. If you know the solution, please enlighten me. 

In Excel I used LINEST() and Data>Data Analysis>Regression  and both
(fortunately) gave the same result.
The coefficients were:
/b0=1.16, b1=0.957, b2=0.024, R2=0.0027, adjusted R2=-0.017/

In R I used model<-lm(as.numeric(test[,1]) /(wave sign here)/
as.numeric(test[,2]) + as.numeric(test[,3]))
The result is:
/Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)               3.514     28.913   0.122    0.904
as.numeric(test[, 2])    0.423      1.027   0.412    0.681
as.numeric(test[, 3])    6.629      4.194   1.581    0.117

Residual standard error: 29.19 on 98 degrees of freedom
Multiple R-squared: 0.02706,	Adjusted R-squared: 0.007204 
F-statistic: 1.363 on 2 and 98 DF,  p-value: 0.2607 
/

Does anybody know what is going on?

Thank you for your time! Frauke



test<-
       obs     fcst48 rate24.last           
  [1,] "1.93"  "6.9"  "-0.13"               
  [2,] "5.23"  "6.6"  "6.52"                
  [3,] "5.25"  "6.8"  "5.85"                
  [4,] "5.26"  "6.6"  "1.27"                
  [5,] "5.26"  "6.8"  "-0.0700000000000003" 
  [6,] "5.34"  "6.6"  "-0.95"               
  [7,] "5.37"  "6.9"  "0.78"                
  [8,] "5.56"  "6.7"  "3.05"                
  [9,] "5.60"  "7"    "1.64"                
 [10,] "5.61"  "6.9"  "2.6"                 
 [11,] "5.74"  "6.7"  "4.09"                
 [12,] "5.92"  "6.8"  "1.95"                
 [13,] "5.95"  "6.8"  "-5.87"               
 [14,] "6.08"  "6.6"  "0.1"                 
 [15,] "6.13"  "6.6"  "2.6"                 
 [16,] "6.16"  "6.6"  "-6.15"               
 [17,] "6.26"  "6.5"  "1.4"                 
 [18,] "6.32"  "7"    "9.78"                
 [19,] "6.33"  "6.5"  "0.28"                
 [20,] "6.35"  "6.6"  "-0.0899999999999999" 
 [21,] "6.36"  "6.6"  "-0.13"               
 [22,] "6.37"  "6.9"  "-0.04"               
 [23,] "6.39"  "6.5"  "-2.14"               
 [24,] "6.39"  "6.6"  "0.15"                
 [25,] "6.39"  "6.7"  "2.29"                
 [26,] "6.40"  "6.6"  "-1.84"               
 [27,] "6.42"  "6.7"  "-2.25"               
 [28,] "6.43"  "6.6"  "-0.0300000000000002" 
 [29,] "6.45"  "6.8"  "0.220000000000001"   
 [30,] "6.47"  "6.5"  "-0.18"               
 [31,] "6.51"  "6.9"  "-1.8"                
 [32,] "6.55"  "6.7"  "-1.38"               
 [33,] "6.56"  "6.6"  "1.33"                
 [34,] "6.58"  "6.7"  "7.86"                
 [35,] "6.58"  "6.8"  "-0.220000000000001"  
 [36,] "6.61"  "6.8"  "-0.59"               
 [37,] "6.62"  "6.7"  "-0.24"               
 [38,] "6.62"  "6.7"  "-0.0999999999999996" 
 [39,] "6.64"  "6.6"  "-0.220000000000001"  
 [40,] "6.65"  "6.7"  "-9.63"               
 [41,] "6.69"  "6.8"  "-0.18"               
 [42,] "6.72"  "6.9"  "-0.739999999999999"  
 [43,] "6.73"  "6.8"  "0.83"                
 [44,] "6.76"  "6.6"  "-1.56"               
 [45,] "6.76"  "6.7"  "-1.64"               
 [46,] "6.79"  "6.8"  "-0.140000000000001"  
 [47,] "6.80"  "6.9"  "0.38"                
 [48,] "6.81"  "6.8"  "6.46"                
 [49,] "6.81"  "6.9"  "-0.0199999999999996" 
 [50,] "6.82"  "6.9"  "0.54"                
 [51,] "6.84"  "6.9"  "-0.19"               
 [52,] "6.84"  "7"    "-2.11"               
 [53,] "6.85"  "6.7"  "-0.170000000000001"  
 [54,] "6.88"  "6.9"  "0.0600000000000005"  
 [55,] "6.90"  "6.7"  "-0.0800000000000001" 
 [56,] "6.90"  "7"    "-0.18"               
 [57,] "6.91"  "6.6"  "-4.65"               
 [58,] "6.91"  "6.6"  "3"                   
 [59,] "6.92"  "6.6"  "-0.219999999999999"  
 [60,] "6.94"  "6.8"  "1.4"                 
 [61,] "6.99"  "6.7"  "-0.44"               
 [62,] "6.99"  "6.9"  "1.16"                
 [63,] "7.02"  "6.8"  "-1.72"               
 [64,] "7.03"  "6.8"  "-0.46"               
 [65,] "7.03"  "6.9"  "-0.24"               
 [66,] "7.04"  "7"    "-0.11"               
 [67,] "7.06"  "6.9"  "-0.13"               
 [68,] "7.07"  "6.7"  "-0.0199999999999996" 
 [69,] "7.08"  "6.8"  "0.23"                
 [70,] "7.09"  "6.7"  "0.66"                
 [71,] "7.09"  "7"    "-0.55"               
 [72,] "7.12"  "7"    "-0.17"               
 [73,] "7.15"  "6.7"  "-3.05"               
 [74,] "7.17"  "6.7"  "-0.140000000000001"  
 [75,] "7.26"  "6.9"  "-0.170000000000001"  
 [76,] "7.31"  "7"    "-1.57"               
 [77,] "7.36"  "6.8"  "-0.2"                
 [78,] "7.41"  "6.6"  "-0.29"               
 [79,] "7.42"  "6.6"  "-3.39"               
 [80,] "7.42"  "6.8"  "2.11"                
 [81,] "7.53"  "6.9"  "0.18"                
 [82,] "7.66"  "6.6"  "0"                   
 [83,] "7.74"  "6.7"  "10.31"               
 [84,] "7.87"  "6.9"  "-0.0700000000000003" 
 [85,] "7.89"  "6.6"  "0.24"                
 [86,] "8.07"  "6.8"  "-0.25"               
 [87,] "8.17"  "6.6"  "0.97"                
 [88,] "8.38"  "6.8"  "-0.15"               
 [89,] "8.38"  "6.9"  "0.7"                 
 [90,] "8.43"  "6.8"  "0.0999999999999979"  
 [91,] "8.92"  "6.8"  "0"                   
 [92,] "9.20"  "7"    "0.18"                
 [93,] "9.79"  "6.6"  "0.33"                
 [94,] "9.87"  "6.8"  "0"                   
 [95,] "10.72" "6.8"  "0.13"                
 [96,] "11.61" "6.8"  "1.88"                
 [97,] "16.44" "6.8"  "-0.00999999999999801"
 [98,] "16.55" "6.6"  "2.96"                
 [99,] "16.88" "6.7"  "1.16"                
[100,] "17.72" "6.7"  "-0.0699999999999994" 
[101,] "17.91" "6.7"  "-0.24"               
[102,] "19.03" "6.8"  "-0.430000000000001"  



--
View this message in context: http://r.789695.n4.nabble.com/HELP-Excel-and-R-give-me-totally-different-regression-results-using-the-exact-same-data-tp4648648.html
Sent from the R help mailing list archive at Nabble.com.
#
On Nov 6, 2012, at 4:20 PM, frauke wrote:

            
I was unable to reproduce this result using the data you offered. Code is below. I'm guessing you failed to include a header argument in read table abd as a result were actually doing your regression on the integers underneath the factor representation. (Who knows what you did wrong in Excel.)
Have you looked at the data? Looks to me that assumptions of constant variance and normality of errors are  violated.
test<- read.table(text='
      obs     fcst48 rate24.last           
 [1,] "1.93"  "6.9"  "-0.13"               
 [2,] "5.23"  "6.6"  "6.52"                
 [3,] "5.25"  "6.8"  "5.85"                
 [4,] "5.26"  "6.6"  "1.27"                
 [5,] "5.26"  "6.8"  "-0.0700000000000003" 
 [6,] "5.34"  "6.6"  "-0.95"               
 [7,] "5.37"  "6.9"  "0.78"                
 [8,] "5.56"  "6.7"  "3.05"                
 [9,] "5.60"  "7"    "1.64"                
[10,] "5.61"  "6.9"  "2.6"                 
[11,] "5.74"  "6.7"  "4.09"                
[12,] "5.92"  "6.8"  "1.95"                
[13,] "5.95"  "6.8"  "-5.87"               
[14,] "6.08"  "6.6"  "0.1"                 
[15,] "6.13"  "6.6"  "2.6"                 
[16,] "6.16"  "6.6"  "-6.15"               
[17,] "6.26"  "6.5"  "1.4"                 
[18,] "6.32"  "7"    "9.78"                
[19,] "6.33"  "6.5"  "0.28"                
[20,] "6.35"  "6.6"  "-0.0899999999999999" 
[21,] "6.36"  "6.6"  "-0.13"               
[22,] "6.37"  "6.9"  "-0.04"               
[23,] "6.39"  "6.5"  "-2.14"               
[24,] "6.39"  "6.6"  "0.15"                
[25,] "6.39"  "6.7"  "2.29"                
[26,] "6.40"  "6.6"  "-1.84"               
[27,] "6.42"  "6.7"  "-2.25"               
[28,] "6.43"  "6.6"  "-0.0300000000000002" 
[29,] "6.45"  "6.8"  "0.220000000000001"   
[30,] "6.47"  "6.5"  "-0.18"               
[31,] "6.51"  "6.9"  "-1.8"                
[32,] "6.55"  "6.7"  "-1.38"               
[33,] "6.56"  "6.6"  "1.33"                
[34,] "6.58"  "6.7"  "7.86"                
[35,] "6.58"  "6.8"  "-0.220000000000001"  
[36,] "6.61"  "6.8"  "-0.59"               
[37,] "6.62"  "6.7"  "-0.24"               
[38,] "6.62"  "6.7"  "-0.0999999999999996" 
[39,] "6.64"  "6.6"  "-0.220000000000001"  
[40,] "6.65"  "6.7"  "-9.63"               
[41,] "6.69"  "6.8"  "-0.18"               
[42,] "6.72"  "6.9"  "-0.739999999999999"  
[43,] "6.73"  "6.8"  "0.83"                
[44,] "6.76"  "6.6"  "-1.56"               
[45,] "6.76"  "6.7"  "-1.64"               
[46,] "6.79"  "6.8"  "-0.140000000000001"  
[47,] "6.80"  "6.9"  "0.38"                
[48,] "6.81"  "6.8"  "6.46"                
[49,] "6.81"  "6.9"  "-0.0199999999999996" 
[50,] "6.82"  "6.9"  "0.54"                
[51,] "6.84"  "6.9"  "-0.19"               
[52,] "6.84"  "7"    "-2.11"               
[53,] "6.85"  "6.7"  "-0.170000000000001"  
[54,] "6.88"  "6.9"  "0.0600000000000005"  
[55,] "6.90"  "6.7"  "-0.0800000000000001" 
[56,] "6.90"  "7"    "-0.18"               
[57,] "6.91"  "6.6"  "-4.65"               
[58,] "6.91"  "6.6"  "3"                   
[59,] "6.92"  "6.6"  "-0.219999999999999"  
[60,] "6.94"  "6.8"  "1.4"                 
[61,] "6.99"  "6.7"  "-0.44"               
[62,] "6.99"  "6.9"  "1.16"                
[63,] "7.02"  "6.8"  "-1.72"               
[64,] "7.03"  "6.8"  "-0.46"               
[65,] "7.03"  "6.9"  "-0.24"               
[66,] "7.04"  "7"    "-0.11"               
[67,] "7.06"  "6.9"  "-0.13"               
[68,] "7.07"  "6.7"  "-0.0199999999999996" 
[69,] "7.08"  "6.8"  "0.23"                
[70,] "7.09"  "6.7"  "0.66"                
[71,] "7.09"  "7"    "-0.55"               
[72,] "7.12"  "7"    "-0.17"               
[73,] "7.15"  "6.7"  "-3.05"               
[74,] "7.17"  "6.7"  "-0.140000000000001"  
[75,] "7.26"  "6.9"  "-0.170000000000001"  
[76,] "7.31"  "7"    "-1.57"               
[77,] "7.36"  "6.8"  "-0.2"                
[78,] "7.41"  "6.6"  "-0.29"               
[79,] "7.42"  "6.6"  "-3.39"               
[80,] "7.42"  "6.8"  "2.11"                
[81,] "7.53"  "6.9"  "0.18"                
[82,] "7.66"  "6.6"  "0"                   
[83,] "7.74"  "6.7"  "10.31"               
[84,] "7.87"  "6.9"  "-0.0700000000000003" 
[85,] "7.89"  "6.6"  "0.24"                
[86,] "8.07"  "6.8"  "-0.25"               
[87,] "8.17"  "6.6"  "0.97"                
[88,] "8.38"  "6.8"  "-0.15"               
[89,] "8.38"  "6.9"  "0.7"                 
[90,] "8.43"  "6.8"  "0.0999999999999979"  
[91,] "8.92"  "6.8"  "0"                   
[92,] "9.20"  "7"    "0.18"                
[93,] "9.79"  "6.6"  "0.33"                
[94,] "9.87"  "6.8"  "0"                   
[95,] "10.72" "6.8"  "0.13"                
[96,] "11.61" "6.8"  "1.88"                
[97,] "16.44" "6.8"  "-0.00999999999999801"
[98,] "16.55" "6.6"  "2.96"                
[99,] "16.88" "6.7"  "1.16"                
[100,] "17.72" "6.7"  "-0.0699999999999994" 
[101,] "17.91" "6.7"  "-0.24"               
[102,] "19.03" "6.8"  "-0.430000000000001"    ', header=TRUE)

 test[] <- lapply( test, as.numeric)

 plot(x=test[[2]], y=test[[1]])
 plot(x=test[[3]], y=test[[1]])
 plot(resid(lm(obs ~ ., data=test)))
 plot(x=test[[2]], resid(lm(obs ~ ., data=test))) 
 plot(x=test[[3]], resid(lm(obs ~ ., data=test)))

# This show heteroscedasticity and non-normality of residuals around the predictions.
 plot(predict(lm(obs ~ ., data=test)), resid(lm(obs ~ ., data=test)))
lm(obs ~ ., data=test)

Call:
lm(formula = obs ~ ., data = test)

Coefficients:
(Intercept)       fcst48  rate24.last  
 10.9546962   -0.5097615   -0.0004139
Call:
lm(formula = obs ~ ., data = test)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5074 -1.1413 -0.6548 -0.1731 11.5415 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.9546962 13.7494829   0.797    0.428
fcst48      -0.5097615  2.0350479  -0.250    0.803
rate24.last -0.0004139  0.1045771  -0.004    0.997

Residual standard error: 2.779 on 99 degrees of freedom
Multiple R-squared: 0.0006371,	Adjusted R-squared: -0.01955 
F-statistic: 0.03156 on 2 and 99 DF,  p-value: 0.9689
#
Hi David, hi Rui,

thanks for your quick replies. I have replicated David's R results and
confirmed them with Minitab. Though I'm not sure what you are trying to tell
me with the code you wrote, David. Do you mean, I should use a dataframe
rather than a matrix, or use the "data=" part of the lm() function?

Rui seems to be right, too. Excel's regression function doesn't work; I
cannot replicate the Minitab and R results with it. According to the
Microsoft website this is probably because the x- and y-values overlap. I am
truly astonished that such a major bug doesn't at least have a major red
flag to it. 

Thank you! Frauke



--
View this message in context: http://r.789695.n4.nabble.com/HELP-Excel-and-R-give-me-totally-different-regression-results-using-the-exact-same-data-tp4648648p4648723.html
Sent from the R help mailing list archive at Nabble.com.
#
On Nov 7, 2012, at 8:53 AM, frauke wrote:

            
What I thought I was demonstrating was:

a) we had different degrees of freedom suggesting that you made a major error in data preparation. I offered my hypothesis for how this happened, but since you failed to provide the requested code showing your data input steps, it was only a guess.

b) I used a dataframe because that is the simplest way of preparing data for presentation to lm().

You can use a matrix to store data and present to lm(), but storing character representations of numeric data in a matrix ( as you appeared to be attempting) seems just plain ...  wrong.
Many people are astonished, incredulous, aghast, astounded, (what is the right adjective?) that MS has allowed many errors to persist despite negative reviews by statisticians and mathematicians for decades.

On the other hand my reading of the commentary suggest a different interpretation of the error conditions. MS says:

"Case 1: The x-value and y-value ranges overlap

If the x-value and y-value ranges overlap, the LINEST worksheet function produces incorrect values in all result cells. Normal statistical probability disallows the values in the x and y ranges to overlap (duplicate each other). Do not overlap the x- and y-value ranges when referencing cells in the formula."

I think that mean not that the mathematical ranges overlap but rather that the error occurs when the spreadsheet ranges overlap.
#
Okay. Sorry for being vague in my earlier message. I had missed a few lines
from your message because they were hiding well in my own email. I am really
on the learning side with this, so it will take some time. Sorry.

There seem to be two issues: (1) Me preparing the data incorrectly and (2)
the data not being fit for regression. Right?

Ad1. Point about header taken. As to using characters in a matrix, I extract
the data from data files from the National Weather Service. I extract
observations together with dates and location names. Each row comes consists
of date, location and observations.  I chose to store them in matrices
because I can combine them to arrays. A matrix can only have one type of
data, so I chose to leave them all as characters. When I proceed to do a
regression analysis I transform the observations  into numbers using
as.numeric(). Do you have a different suggestion? Will R give me different
results if I store characters in a matrix?
Even though such excerpts from a long script aren't very informative, to be
complete:
collection <- matrix(rep(NA,25),ncol=25)        #collection will be a row of
the output matrix later on. 
#extract dates

collection[1]<-paste(year,"/",substring(.file,125,126),"/",substring(.file,127,128),sep="")
#extract observations
            collection[start.write+i]<-(substring(input , fields[[i]][1] ,
fields[[i]][2]))

Ad2.  You mention heteroscedasticity and non-normality of residuals. To keep
it short I had provided just a subset of the data I have (100 of 4000 matrix
rows). But the same is true for the whole dataset. I attached the whole
thing this time.  test_complete.txt
<http://r.789695.n4.nabble.com/file/n4648759/test_complete.txt>  How do I
deal with this? I admit I am pretty clueless in this case. Can I do
meaningful regression at all? (I didn't expect test[,3] to be good predictor
but had hopes for test[,2]. 

The residuals are definitely not normally distributed. They do not seem to
related to either of the two predictors. What is the conclusion from that? 

Thanks for your patience!
Frauke






--
View this message in context: http://r.789695.n4.nabble.com/HELP-Excel-and-R-give-me-totally-different-regression-results-using-the-exact-same-data-tp4648648p4648759.html
Sent from the R help mailing list archive at Nabble.com.
#
On Nov 7, 2012, at 11:47 AM, frauke wrote:

            
Well. the second point might be more correctly stated that the data do not meet the conditions for valid inference using linear regression. Since the goals of the exercise have never been stated, it is difficult to say whether other regression methods migh be more applicable.
That is generally the reason people use data.frames.
It shouldn't, but it seems unnecessarily convoluted and prone to errors.
That is only going to change the first element of 'collection'. You should study the help page for "[". If you were changing the first column it would need to be a different call on the LHS.
Again, possibly not what you thought you were doing.Lack of context prevents further analysis.
'data.frame':	3548 obs. of  5 variables:
 $ V1: num  1.91 1.9 1.93 2.16 1.9 1.87 1.87 2.01 2.8 2.11 ...
 $ V2: num  1.86 1.9 1.91 1.88 1.87 1.88 6.94 2.01 2.03 2.09 ...
 $ V3: num  1.89 1.94 1.9 1.85 1.86 1.88 2.01 2 2.03 2.06 ...
 $ V4: num  1.92 1.96 1.91 1.83 1.85 1.87 2.01 2.03 2.04 2.03 ...
 $ V5: num  2.1 2 1.93 1.92 1.85 1.86 2.02 2.15 2.08 2.03 ...
Call:
lm(formula = V1 ~ ., data = dat)

Coefficients:
(Intercept)           V2           V3           V4           V5  
     0.1291       0.3378       0.2079       0.2635       0.1460
Call:
lm(formula = V1 ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.3116  -0.1825  -0.0304   0.0959  27.0989 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.12906    0.03840   3.361 0.000784 ***
V2           0.33783    0.01768  19.111  < 2e-16 ***
V3           0.20789    0.01686  12.329  < 2e-16 ***
V4           0.26346    0.01784  14.768  < 2e-16 ***
V5           0.14596    0.01672   8.728  < 2e-16 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 

Residual standard error: 1.781 on 3543 degrees of freedom
Multiple R-squared: 0.7693,	Adjusted R-squared: 0.7691 
F-statistic:  2954 on 4 and 3543 DF,  p-value: < 2.2e-16
Hit <Return> to see next plot: 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Rplot.png
Type: image/png
Size: 139409 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20121107/ecd2057a/attachment.png>
-------------- next part --------------


There appears to be quite a bit of "structure" in that plot.And a rather similar structure in 

with(dat, plot(V3, V1) )
What are these data and what are the scientific questions? You appear to think a) I can look over your shoulder and see your display and b) deduce your goals from extremely fragmentary evidence. I have a lower opinion of my ability to accomplish those tasks.
Not generally the biggest concern. But again you provide no code. Nabble-users are unfortunately notorious in rhelp for not reading the Posting Guide, and some do not seem even  to understand that rhelp is not Nabble.
Well, that second outcome would be the expected (even the desired) outcome of a regression wouldn't it? You would want the relationships to be in the prediction and the residuals to have zero correlations with
I'm rapidly running out of patience, however. Please read the PostingGuide more thoroughly than you appear to have done so far.
David Winsemius, MD
Alameda, CA, USA