(For the record: I prefer something like my original explanation of
the problem with (C), instead of (D)+(E)+(F):
"With summarized data the standard errors get smaller with
increasing numbers of observations w_i. However, when for instance all
w_i are multiplied by the same constant larger than one, the
reported standard errors do not get smaller, since the w_i are defined
only up to an arbitrary positive multiplicative constant. Hence the
reported standard errors tend to be too large, and the reported t
values and the reported number of significance stars too small.
Obviously, the reported number of observations and the reported
number of degrees of freedom are also too small."
Note that with heteroskedasticity, _the_ residual standard error
has no meaning.)
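A minimal R sketch of the point about the arbitrary multiplicative constant, reusing the data from the example in the first message of this thread. The helper se() and the factor 5 are introduced here for illustration only. Under these assumptions, rescaling the weights leaves the reported standard errors unchanged, whereas actually having five times as many observations makes them smaller.

```r
x <- c(1, 2, 3, 4)
y <- c(1, 2, 5, 4)
w <- c(1, 1, 1, 10)

# hypothetical helper: extract the standard errors of the coefficients
se <- function(fit) coef(summary(fit))[, "Std. Error"]

# weights= with w and with 5 * w: the constant cancels out
se_w  <- se(lm(y ~ x, weights = w))
se_5w <- se(lm(y ~ x, weights = 5 * w))
same_se <- isTRUE(all.equal(se_w, se_5w))

# the data written out in full, once with w copies and once with 5 * w copies
se_full  <- se(lm(rep(y, times = w) ~ rep(x, times = w)))
se_full5 <- se(lm(rep(y, times = 5 * w) ~ rep(x, times = 5 * w)))
smaller_se <- all(se_full5 < se_full)

print(same_se)     # weights are only defined up to a constant
print(smaller_se)  # more real observations do shrink the standard errors
```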
Finally, about the original text: (B) and (C) mention only y_i, not
x_i, while this is about entire observations. Maybe this can be remedied
as well?
Arie
On Tue, Nov 28, 2017 at 1:01 PM, peter dalgaard <pdalgd at gmail.com> wrote:
My local R-devel version now has (in ?lm)
Non-'NULL' 'weights' can be used to indicate that different
observations have different variances (with the values in
'weights' being inversely proportional to the variances); or
equivalently, when the elements of 'weights' are positive integers
w_i, that each response y_i is the mean of w_i unit-weight
observations (including the case that there are w_i observations
equal to y_i and the data have been summarized). However, in the
latter case, notice that within-group variation is not used.
Therefore, the sigma estimate and residual degrees of freedom may
be suboptimal; in the case of replication weights, even wrong.
Hence, standard errors and analysis of variance tables should be
treated with care.
OK?
-pd
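A small R sketch of what "within-group variation is not used" means in the proposed text. The data are made up for illustration (they are not from the thread): three replicate responses at x = 4 are summarized into their mean with weight 3. Under these assumptions the coefficients agree, but the weighted fit has fewer residual degrees of freedom and its residual sum of squares omits the within-group part.

```r
# raw data: x = 4 has three replicate responses 3, 4, 5 (illustrative values)
raw <- data.frame(x = c(1, 2, 3, 4, 4, 4),
                  y = c(1, 2, 5, 3, 4, 5))

# summarize to group means, with the group sizes as weights
means  <- tapply(raw$y, raw$x, mean)
counts <- tapply(raw$y, raw$x, length)
agg <- data.frame(x = as.numeric(names(means)),
                  y = as.numeric(means),
                  w = as.numeric(counts))

fit_raw <- lm(y ~ x, data = raw)
fit_agg <- lm(y ~ x, data = agg, weights = w)

# same coefficients
print(all.equal(coef(fit_raw), coef(fit_agg)))

# different residual degrees of freedom and residual sums of squares;
# the difference in RSS is the within-group sum of squares at x = 4
print(c(df_raw = df.residual(fit_raw), df_agg = df.residual(fit_agg)))
print(deviance(fit_raw) - deviance(fit_agg))
```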
On 12 Oct 2017, at 13:48 , Arie ten Cate <arietencate at gmail.com> wrote:
OK. We now have three suggestions to repair the text:
- remove the text
- add "not" at the beginning of the text
- add at the end of the text a warning; something like:
"Note that in this case the standard errors of the parameter estimates
are in general not correct, and hence neither are the t values and the
p values. Also the number of degrees of freedom is not correct. (The
parameter estimates are correct.)"
A remark about the glm example: the Reference manual says: "For a
binomial GLM prior weights are used to give the number of trials when
the response is the proportion of successes ....". Hence in the
binomial case the weights are frequencies.
With y <- 0.51 and w <- 100 you get the same result.
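A sketch in R that puts the two binomial forms side by side: the 0/1 responses with weights 49 and 51 from the earlier message, and the single proportion 0.51 with weight 100. The helper est() and the tolerance 1e-6 are choices made here for illustration. The comparison is done up to numerical tolerance because both fits are iterative.

```r
# responses 0 and 1 with frequencies 49 and 51
y2 <- c(0, 1)
w2 <- c(49, 51)
fit_cases <- glm(y2 ~ 1, weights = w2, family = binomial)

# one proportion of successes with the number of trials as weight
y1 <- 0.51
w1 <- 100
fit_prop <- glm(y1 ~ 1, weights = w1, family = binomial)

# hypothetical helper: estimate and standard error of the intercept
est <- function(fit) unname(coef(summary(fit))[1, c("Estimate", "Std. Error")])

print(est(fit_cases))
print(est(fit_prop))
print(all.equal(est(fit_cases), est(fit_prop), tolerance = 1e-6))
```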
Arie
On Mon, Oct 9, 2017 at 5:22 PM, peter dalgaard <pdalgd at gmail.com> wrote:
AFAIR, it is a little more subtle than that.
If you have replication weights, then the estimates are right, it is "just" that the SE from summary.lm() are wrong. Somehow, the text should reflect this.
It is of some importance when you put glm() into the mix, because you can in fact get correct results from things like
y <- c(0,1)
w <- c(49,51)
glm(y~1, weights=w, family=binomial)
-pd
On 9 Oct 2017, at 07:58 , Arie ten Cate <arietencate at gmail.com> wrote:
Yes. Thank you; I should have quoted it.
I suggest removing this text or adding the word "not" at the beginning.
Arie
On Sun, Oct 8, 2017 at 4:38 PM, Viechtbauer Wolfgang (SP)
<wolfgang.viechtbauer at maastrichtuniversity.nl> wrote:
Ah, I think you are referring to this part from ?lm:
"(including the case that there are w_i observations equal to y_i and the data have been summarized)"
I see; indeed, I don't think this is what 'weights' should be used for (the other part before that is correct). Sorry, I misunderstood the point you were trying to make.
Best,
Wolfgang
-----Original Message-----
From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Arie ten Cate
Sent: Sunday, 08 October, 2017 14:55
To: r-devel at r-project.org
Subject: [Rd] Discourage the weights= option of lm with summarized data
Indeed: Using 'weights' is not meant to indicate that the same
observation is repeated 'n' times. As I showed, this gives erroneous
results. Hence I suggested that it is discouraged rather than
encouraged in the Details section of lm in the Reference manual.
Arie
-----Original Message-----
On Sat, 7 Oct 2017, wolfgang.viechtbauer at maastrichtuniversity.nl wrote:
Using 'weights' is not meant to indicate that the same observation is
repeated 'n' times. It is meant to indicate different variances (or to
be precise, that the variance of the last observation in 'x' is
sigma^2 / n, while the first three observations have variance
sigma^2).
Best,
Wolfgang
-----Original Message-----
From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Arie ten Cate
Sent: Saturday, 07 October, 2017 9:36
To: r-devel at r-project.org
Subject: [Rd] Discourage the weights= option of lm with summarized data
In the Details section of lm (linear models) in the Reference manual,
it is suggested to use the weights= option for summarized data. This
must be discouraged rather than encouraged. The motivation for this is
as follows.
With summarized data the standard errors get smaller with increasing
numbers of observations. However, the standard errors in lm do not get
smaller when for instance all weights are multiplied by the same
constant larger than one, since the inverse weights are merely
proportional to the error variances.
Here is an example of the estimated standard errors being too large
with the weights= option. The p value and the number of degrees of
freedom are also wrong. The parameter estimates are correct.
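n <- 10
x <- c(1, 2, 3, 4)
y <- c(1, 2, 5, 4)
w <- c(1, 1, 1, n)                 # the last observation occurs n times
xb <- c(x, rep(x[4], n - 1))       # restore the original data
yb <- c(y, rep(y[4], n - 1))
print(summary(lm(yb ~ xb)))             # full data
print(summary(lm(y ~ x, weights = w)))  # summarized data with weights=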
Compare with PROC REG in SAS, with a WEIGHT statement (like R) and a
FREQ statement (for summarized data).
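For readers without SAS, a rough R sketch of what treating the weights as frequencies would amount to in this particular example. This is an assumption-laden illustration, not a description of how PROC REG is implemented: because the repeated observations are exact copies, the residual sum of squares is the same in both fits, so only the residual degrees of freedom need to be replaced by sum(w) minus the number of coefficients. The names df_freq and se_adj are introduced here.

```r
n <- 10
x <- c(1, 2, 3, 4)
y <- c(1, 2, 5, 4)
w <- c(1, 1, 1, n)

fit_w    <- lm(y ~ x, weights = w)                     # summarized data
fit_full <- lm(rep(y, times = w) ~ rep(x, times = w))  # data written out in full

se_w    <- coef(summary(fit_w))[, "Std. Error"]
se_full <- coef(summary(fit_full))[, "Std. Error"]

# degrees of freedom when w counts observations (assumed frequency interpretation)
df_freq <- sum(w) - length(coef(fit_w))

# rescale the reported standard errors from df.residual(fit_w) to df_freq
se_adj <- se_w * sqrt(df.residual(fit_w) / df_freq)

print(all.equal(unname(se_adj), unname(se_full)))
```

The rescaling is only valid when the weights really are counts of identical observations; with variance weights there is no such correction to make.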
Arie