Simple lm/regression question

4 messages · James Annan, Achim Zeileis, Peter Dalgaard

#
I am trying to use lm for a simple linear fit with weights. The results 
I get from IDL (which I am more familiar with) seem correct and 
intuitive, but the "lm" function in R gives outputs that seem strange to me.

Unweighted case:

 > x<-1:4
 > y<-(1:4)^2
 > summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
  1  2  3  4
  1 -1 -1  1

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -5.0000     1.7321  -2.887   0.1020
x             5.0000     0.6325   7.906   0.0156 *

So far, so good - IDL does much the same:

IDL> vec=linfit(x,y,sigma=sig)
IDL> print,vec,sig
      -5.00000      5.00000
       1.73205     0.632456

Now, if the dependent variable has known (measurement) uncertainties 
(10, say) then it is appropriate to use weights defined as the inverse 
of the variances, right?

 > summary(lm(y~x,weights=c(.01,.01,.01,.01)))

Call:
lm(formula = y ~ x, weights = c(0.01, 0.01, 0.01, 0.01))

Residuals:
    1    2    3    4
  0.1 -0.1 -0.1  0.1

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -5.0000     1.7321  -2.887   0.1020
x             5.0000     0.6325   7.906   0.0156 *

But here the *residuals* have changed and are no longer the actual 
response minus fitted values. (They seem to be scaled by the sqrt of the 
weights). The uncertainties on the parameter estimates, however, have 
*not* changed, which seems very odd to me.

The behaviour of IDL is rather different, and more intuitive to me:

IDL> vec=linfit(x,y,sigma=sig,measure_errors=[1,1,1,1])
IDL> print,vec,sig
      -5.00000      5.00000
       1.22474     0.447214

IDL> vec=linfit(x,y,sigma=sig,measure_errors=[10,10,10,10])
IDL> print,vec,sig
      -5.00000      5.00000
       12.2474      4.47214

Note that the uncertainties are 10* larger when the errors are 10* larger.

My question is really how to replicate these results in R. I suspect 
there is some terminology or convention that I'm confused by. But in the 
lm help page documentation, even the straightforward definition of 
"residual" seems incompatible with what the code actually returns:

	residuals: the residuals, that is response minus fitted values.

But, in the weighted case, this is not actually true...

James
#
On Mon, 6 Feb 2012, James Annan wrote:

If you think that in y = x'b + e the error has

E(e^2) = sigma^2 * w

then you should use

weights = 1/w

in R because that is how the weights enter the objective function

sum 1/w (y - x'b)^2

The summary() shows under "Residuals" the contributions to the objective 
function, i.e. sqrt(1/w) (y - x'b) in the notation above.

However, by using the residuals extractor function you can get the 
unweighted residuals:

residuals(lm(y~x,weights=c(.01,.01,.01,.01)))
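To see the two kinds of residuals side by side (a minimal sketch with the toy data from above; weighted.residuals() is the stats helper that returns the sqrt(w)-scaled values that summary() prints):

```r
x <- 1:4
y <- (1:4)^2
fit <- lm(y ~ x, weights = c(.01, .01, .01, .01))

residuals(fit)           # response minus fitted values:  1 -1 -1  1
weighted.residuals(fit)  # sqrt(w) * residuals, as shown by summary():  0.1 -0.1 -0.1  0.1
```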
lm() interprets the weights as precision weights, not as case weights.

Thus, the scaling in the variances is done by the number of (non-zero) 
weights, not by the sum of weights.
Your IDL output appears to use sandwich standard errors. If you load the sandwich and 
lmtest packages you can do

coeftest(m, vcov = sandwich)

where 'm' is the fitted lm object.

Hope that helps,
Z
#
On Feb 6, 2012, at 10:57 , Achim Zeileis wrote:

Actually, I think the issue is slightly different: IDL assumes that the errors _are_ something (notice that setting measure_errors to 1 is not equivalent to omitting them), whereas R assumes that they are _proportional_ to the inverse weights, and proportionality to c(.01,.01,.01,.01) is no different from proportionality to c(1,1,1,1)...

There are a couple of ways to avoid the use of the estimated multiplicative dispersion parameter in R: one is to extract cov.unscaled from the summary; another is to use summary.glm with dispersion = 1. I'm not quite sure how they interact with weights (and I don't have the time to check just now).
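The dispersion = 1 route can be sketched as follows (a sketch only, using the toy data from the start of the thread): refit with glm(), whose summary() accepts a dispersion argument, so the standard errors become sqrt(diag(cov.unscaled) * dispersion) with no scale estimated from the residuals.

```r
x <- 1:4
y <- (1:4)^2

# Known measurement error of 10 on each point => precision weights 1/10^2
fit <- glm(y ~ x, weights = rep(0.01, 4))

# Fix the dispersion at 1 instead of estimating it
summary(fit, dispersion = 1)$coefficients[, "Std. Error"]
# (Intercept) 12.2474, x 4.4721 -- matching IDL with measure_errors = 10
```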
#
On 6/2/12 19:36 , peter dalgaard wrote:

Yes, I think this is correct.

Thanks, this appears to give me what I am looking for. In fact it seems 
that cov and cov.unscaled give identical results which *do* account for 
the weights in each case. The default seems to be weights = 1 (i.e., an 
actual measurement error of 1). Anyway, this now reconciles with what I 
was used to in IDL.

eg:

 > summary(lm(y~x))$cov
             (Intercept)    x
(Intercept)         1.5 -0.5
x                  -0.5  0.2
 > summary(lm(y~x))$cov.unscaled
             (Intercept)    x
(Intercept)         1.5 -0.5
x                  -0.5  0.2
 > summary(lm(y~x,weights=c(10,10,10,10)))$cov
             (Intercept)     x
(Intercept)        0.15 -0.05
x                 -0.05  0.02
 > summary(lm(y~x,weights=c(10,10,10,10)))$cov.unscaled
             (Intercept)     x
(Intercept)        0.15 -0.05
x                 -0.05  0.02
 > summary(lm(y~x,weights=c(10,10,1,1)))$cov.unscaled
             (Intercept)           x
(Intercept)   0.2669039 -0.13167260
x            -0.1316726  0.07829181
 > summary(lm(y~x,weights=c(10,10,1,1)))$cov
             (Intercept)           x
(Intercept)   0.2669039 -0.13167260
x            -0.1316726  0.07829181

and the last case in IDL:

IDL> vec=linfit(x,y,sigma=sig,measure_errors=[sqrt(.1),sqrt(.1),1,1],covar=cov)
IDL> print,cov
      0.266904    -0.131673
     -0.131673    0.0782918

(the other parameters also agree!)
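Worth noting: a summary.lm object has no $cov component, so summary(fit)$cov picks up $cov.unscaled via R's partial matching of list names, which is why the two printouts above agree. The IDL sigma values can be reproduced directly from it (a short sketch with the same toy data):

```r
x <- 1:4
y <- (1:4)^2

# Measurement error of 1 on each point: weights = 1/1^2
fit1  <- lm(y ~ x, weights = rep(1, 4))
# Measurement error of 10 on each point: weights = 1/10^2
fit10 <- lm(y ~ x, weights = rep(0.01, 4))

# sqrt of the diagonal of the unscaled covariance matches IDL's sigma
sqrt(diag(summary(fit1)$cov.unscaled))   # 1.224745 0.447214
sqrt(diag(summary(fit10)$cov.unscaled))  # 12.24745 4.472136
```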