
naive "collinear" weighted linear regression

8 messages · Thomas Lumley, Mauricio O Calvao, David Winsemius +1 more

#
Hi there

Sorry for what may be a naive or dumb question.

I have the following data:

 > x <- c(1,2,3,4) # predictor vector

 > y <- c(2,4,6,8) # response vector. Notice that it is an exact, 
perfect straight line through the origin and slope equal to 2

 > error <- c(0.3,0.3,0.3,0.3) # I have (equal) ``errors'', for 
instance, in the measured responses

Of course the best fit coefficients should be 0 for the intercept and 2 
for the slope. Furthermore, it seems completely plausible (or not?) 
that, since the y_i have associated non-vanishing ``errors'' 
(dispersions), there should be corresponding non-vanishing ``errors'' 
associated with the best fit coefficients, right?

When I try:

 > fit_mod <- lm(y~x,weights=1/error^2)

I get

Warning message:
In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
   extra arguments weigths are just disregarded.

Keeping on, despite the warning message, which I did not quite 
understand, when I type:

 > summary(fit_mod)

I get

Call:
lm(formula = y ~ x, weigths = 1/error^2)

Residuals:
          1          2          3          4
-5.067e-17  8.445e-17 -1.689e-17 -1.689e-17

Coefficients:
              Estimate Std. Error   t value Pr(>|t|)
(Intercept) 0.000e+00  8.776e-17 0.000e+00        1
x           2.000e+00  3.205e-17 6.241e+16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.166e-17 on 2 degrees of freedom
Multiple R-squared:     1,      Adjusted R-squared:     1
F-statistic: 3.895e+33 on 1 and 2 DF,  p-value: < 2.2e-16


Naively, shouldn't the Std. Error column be different from zero? What 
I have in mind, which surely is not what Std. Error means, is that if I 
carried out a large simulation, assuming each response y_i a Gaussian 
random variable with mean y_i and standard deviation 2*error=0.6, and 
then making an ordinary least squares fitting of the slope and 
intercept, I would end up with a mean for these simulated coefficients 
which should be 2 and 0, respectively, and, that's the point, a 
non-vanishing standard deviation for these fitted coefficients, right?? 
This somehow is what I expected should be an estimate or, at least, a 
good indicator, of the degree of uncertainty which I should assign to 
the fitted coefficients; it seems to me these deviations, thus 
calculated as a result of the simulation, will certainly not be zero (or 
  3e-17, for that matter). So this Std. Error does not provide what I, 
naively, think should be given as a measure of the uncertainties or 
errors in the fitted coefficients...
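The simulation described in the paragraph above can be sketched directly in R (a minimal sketch; the replicate count and the use of sd = 0.3, the stated per-point error, are my assumptions):

```r
## Monte Carlo version of the thought experiment: perturb each y_i with
## Gaussian noise of sd 0.3, refit by ordinary least squares, and look
## at the spread of the fitted coefficients.
set.seed(1)
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)
sigma <- 0.3   # assumed standard deviation of each response

coefs <- replicate(10000, {
  y_sim <- y + rnorm(length(y), mean = 0, sd = sigma)
  coef(lm(y_sim ~ x))
})

rowMeans(coefs)       # close to the true values 0 and 2
apply(coefs, 1, sd)   # clearly non-zero: roughly 0.37 and 0.13
```

Those Monte Carlo spreads agree with the textbook formulas sigma*sqrt(1/n + xbar^2/Sxx) for the intercept and sigma/sqrt(Sxx) for the slope, which for these data give about 0.37 and 0.13.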

What am I not getting right??

Thanks and sorry for the naive and non-expert question!
#
On Nov 11, 2009, at 7:45 PM, Mauricio Calvao wrote:

Which means those x, y, and "error" figures did not come from an  
experiment, but rather from theory???
(Actually the weights are for adjusting for sampling, and I do not  
see any sampling in your "design".)
Well, not precisely 2 and 0, but rather something very close, i.e.,  
within "experimental error". Please note that numbers in the range of  
10e-17 are effectively zero from a numerical-analysis perspective.

http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f

 > .Machine$double.eps ^ 0.5
[1] 1.490116e-08
You are trying to impose an error structure on a data situation that  
you constructed artificially to be perfect.
That if you input "perfection" into R's linear regression program, you  
get appropriate warnings?
You are a Professor of physics, right? You do experiments, right? You  
replicate them. So perhaps I'm the one who should be puzzled.
1 day later
#
On Wed, 11 Nov 2009, David Winsemius wrote:

No, the weights are not for adjusting for sampling. You will get the wrong standard errors if you put sampling weights into the `weights` argument.

The warning is saying that he had misspelt 'weights' as 'weigths' (and so the code he posted wasn't the same as the code he ran).


      -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
#
David Winsemius <dwinsemius <at> comcast.net> writes:
The fact is I am trying to compare the results of:
(1) lm under R and 
(2) the Java applet at http://omnis.if.ufrj.br/~carlos/applets/reta/reta.html 
(3) the Fit method of the ROOT system used by CERN,
(4) the Numerical Recipes functions for weighted linear regression

The last three all provide, for the "fake" data set I furnished in my first
post in this thread, the same results; in particular they give errors or
uncertainties in the estimated intercept and slope which, as seems intuitive,
are not zero at all, but of the order 0.1 or 0.2, whereas lm under R issues a
"Std. Error" which is effectively zero. Independently of terminology, which
sure is of utmost importance, the data I provided should give rise to a best
fit straight line with intercept zero and slope 2, but also with
non-vanishing errors associated with them. How do I get this from lm?

I only want, for instance, calculation of the so-called covariance matrix for
the estimated coefficients, as given, for instance, in Equation (2.3.2) of the
second edition of Draper and Smith, "Applied regression analysis"; this is a
standard statistical result, right? So why does R not directly provide it as a
summary from an lm object???
http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
I know this all too well; it is obviously a trivial supernewbie issue, which
I overcame a long time ago...
Unfortunately you eschewed answering objectively any of my questions; I insist
they do make sense. Don't mention the data are perfect; this does not help to
make any progress in understanding the choice of convenient summary info the lm
method provides, as compared to what, in my humble opinion and in this specific
particular case, it should provide: the covariance matrix of the estimated
coefficients...
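For illustration (a sketch, not part of the original exchange): the Draper and Smith covariance matrix can be computed by hand in R, if the stated errors of 0.3 are taken as known rather than estimated from the residuals:

```r
x <- c(1, 2, 3, 4)
error <- c(0.3, 0.3, 0.3, 0.3)

X <- cbind(1, x)        # design matrix with an intercept column
W <- diag(1 / error^2)  # weights = inverse variances, taken as known

# Covariance matrix of the estimated coefficients, (X' W X)^{-1},
# with the dispersion fixed at 1 rather than estimated from residuals
V <- solve(t(X) %*% W %*% X)
sqrt(diag(V))   # standard errors: about 0.37 (intercept) and 0.13 (slope)
```

These are exactly the order-0.1-to-0.2 uncertainties the other packages report.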
#
On Nov 14, 2009, at 1:50 PM, Mauricio O Calvao wrote:

It's really not that difficult to get the variance covariance matrix.  
What is not so clear is why you think differential weighting of a set  
that has a perfect fit should give meaningfully different results than  
a fit that has no weights.

?lm
?vcov

 >  y <- c(2,4,6,8) # response vect
 > fit_mod <- lm(y~x,weights=1/error^2)
Error in eval(expr, envir, enclos) : object 'error' not found
 > error <- c(0.3,0.3,0.3,0.3)
 > fit_mod <- lm(y~x,weights=1/error^2)
 > vcov(fit_mod)
               (Intercept)             x
(Intercept)  2.396165e-30 -7.987217e-31
x           -7.987217e-31  3.194887e-31

Numerically those are effectively zero.

 > fit_mod <- lm(y~x)
 > vcov(fit_mod)
             (Intercept) x
(Intercept)           0 0
x                     0 0
#
Mauricio O Calvao wrote:

The point is that R (as well as almost all other mainstream statistical 
software) assumes that a "weight" means that the variance of the 
corresponding observation is the general variance divided by the weight 
factor. The general variance is still determined from the residuals, and 
if they are zero to machine precision, well, there you go. I suspect you 
  get closer to the mark with glm, which allows you to assume that the 
dispersion is known:

 > summary(glm(y~x,family="gaussian"),dispersion=0.3^2)

or

 > summary(glm(y~x,family="gaussian",weights=1/error^2),dispersion=1)
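A minimal sketch of the contrast (reusing the data from the first post):

```r
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)
error <- rep(0.3, 4)

# lm estimates the dispersion from the residuals, which are zero to
# machine precision here, so the Std. Errors come out effectively zero
coef(summary(lm(y ~ x, weights = 1/error^2)))[, "Std. Error"]

# With dispersion = 1 the weights 1/error^2 act as absolute inverse
# variances, so the Std. Errors reflect the stated 0.3 measurement error
fit <- glm(y ~ x, family = gaussian, weights = 1/error^2)
coef(summary(fit, dispersion = 1))[, "Std. Error"]
```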
#
Peter Dalgaard <p.dalgaard <at> biostat.ku.dk> writes:
Excellent; both of these commands provide Std. Errors which now coincide with my
naive expectation: though the data fall perfectly on a straight line, since they
have some associated "uncertainties" (only) in the response variables
(homoskedasticity), the estimated coefficients should have some kind of
non-vanishing "uncertainties" as well, should they not?

Now, forgive me, but I did not get the explanation for the distinct meanings of
Std. Error between simply calling summary(lm(y~x,weights=1/error^2)), which I had
done before, and your suggested calls; could you rephrase and dwell a little bit
more upon this point? What exactly does the dispersion option mean?

Also, could you suggest some specific reference for me to read about this? I
have your excellent book "Introductory statistics with R", 1st edition, but was
not able (perhaps I have missed some point) to find this kind of distinction
there... Is this theme specifically what statisticians call generalized linear
models (glm), as opposed to (ordinary) linear models? If so, which good
references could you suggest? I thought of the following books and would feel
much obliged should you give me your impressions about them, if any, or about
any other relevant references at all:

1) Faraway, "Linear models with R"
2) Faraway, "Extending the linear model with R: generalized linear..."
3) Fox, "An R and S-Plus companion.."
4) Uusipaikka, "Confidence intervals in generalized linear regression models"

Thank you very much!!
#
David Winsemius <dwinsemius <at> comcast.net> writes:
Again, David, what I have in mind is: since there are errors or uncertainties in
the response variables (despite the perfect collinearity of the data), which I
assume are Gaussian, if I make a large enough number of simulations of four
response values, there will undoubtedly be a dispersion in the best fit
intercept and slope obtained from a usual unweighted least squares procedure,
right? Then, if I calculate the arithmetic means of these simulated intercepts
and slopes, I would certainly check that they would be 0 and 2, respectively.
However, and THAT IS THE POINT, there will also be a standard deviation
associated with each of these two coefficients, and that is what I would
assign as the measure of uncertainty in the estimation of the coefficients.
This is not, as Dalgaard has called attention to, what the simple command
summary(lm(y~x,weights=1/error^2)) provides in its Std. Error. However, as
Dalgaard also recalled, the command
summary(glm(y~x,family=gaussian,weights=1/error^2),dispersion=1) does provide
Std. Errors in the coefficients which look plausible (at least to me) and
which, at any rate, do coincide with results from other packages (Numerical
Recipes, ROOT and possibly GSL...)