
analysis of data with observation weights

9 messages · Michal Bojanowski, Peter Dalgaard, John Fox +2 more

#
Dear R-users,

Recently I had to analyze a dataset from a household survey. The sample design
ensured that each household in the population had the same probability of being
sampled. However, the data were gathered from only one adult in each
household, who was randomly chosen by an interviewer (via a "Kish grid"). To
equalize the probabilities for each INDIVIDUAL, a casewise weighting factor is
introduced. It is proportional to the reciprocal of the number of adults in the
household and rescaled so its sum equals the sample size. This weighting factor
is necessary to make inferences about the population of individuals.

I had no problems with estimating models which use count data, because I could
construct contingency tables with something like:

tapply(weight, a.bunch.of.factors, sum)
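
For illustration, here is a minimal sketch of such a weighted table. The data frame `d` and its columns `sex`, `region` and `weight` are made up for the example and are not from the survey described above. It also shows that xtabs() appears to give the same cell totals when the weight is placed on the left-hand side of the formula.

```r
# Hypothetical survey data: two factors and a casewise weight
d <- data.frame(
  sex    = factor(c("f", "m", "f", "m", "f", "m")),
  region = factor(c("a", "a", "b", "b", "a", "b")),
  weight = c(0.5, 1.5, 1.0, 1.0, 1.5, 0.5)
)

# Weighted cell counts: sum the weights within each combination of factors
tab1 <- with(d, tapply(weight, list(sex = sex, region = region), sum))

# The same table via xtabs(), which sums the left-hand side within cells
tab2 <- xtabs(weight ~ sex + region, data = d)

print(tab1)
print(tab2)
```

Either table could then be passed on to a count-data model; note that tapply() returns NA for empty cells, whereas xtabs() returns 0.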

Unfortunately I couldn't come up with a good way of building other kinds of
models for those data. Is there some way (apart from writing new functions from
scratch) to perform modelling tasks like lm() that will take the weights into
account?

(As far as I know there are only basic functions weighted.mean() and cov.wt()
for weighted means and weighted covariance/correlation matrices respectively.)
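
A short sketch of those two functions, using made-up vectors `x`, `y` and `w` (assumptions for illustration only). The weights are rescaled to sum to one before being passed to cov.wt(), which should be safe regardless of how a given R version normalizes them.

```r
# Hypothetical data and casewise weights
x <- c(1, 2, 3, 4)
y <- c(2, 1, 4, 3)
w <- c(2, 1, 1, 2)

# Weighted mean of a single variable
wm <- weighted.mean(x, w)

# Weighted centre, covariance matrix and correlation matrix
cw <- cov.wt(cbind(x, y), wt = w / sum(w), cor = TRUE)

print(wm)
print(cw$center)  # weighted means of x and y
print(cw$cov)     # weighted covariance matrix
print(cw$cor)     # weighted correlation matrix
```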


Thank you in advance for any suggestions.


Michal


~,~`~,~`~,~`~,~`~,~`~,~`~,~`~,~
Michal Bojanowski
Institute for Social Studies
University of Warsaw
Poland
http://www.iss.uw.edu.pl

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
Dear Michal,

As far as I know (and I'd be happy to be wrong), there's no *general* way 
of introducing case weights in R. The glm function, however, accommodates 
case weights via its weights argument, and this might be sufficient to do 
what you want to do. You'll have to be careful with inferences, though.

Perhaps someone else on the list can provide additional information.

John
At 05:22 PM 11/14/2002 +0100, Michal Bojanowski wrote:

            
-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: jfox at mcmaster.ca
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox
-----------------------------------------------------

#
Hello John,
Thursday, November 14, 2002, 9:07:51 PM, you wrote:
JF> Dear Michal,

JF> As far as I know (and I'd be happy to be wrong), there's no *general* way 
JF> of introducing case weights in R. The glm function, however, accommodates 
JF> case weights via its weights argument, and this might be sufficient to do 
JF> what you want to do. You'll have to be careful with inferences, though.

JF> Perhaps someone else on the list can provide additional information.

JF> John

Thank you for your answer, Professor Fox.

I did perform an "experiment" (which follows) using the 'weights' argument, but in the lm() function. The
help page states that this argument should contain the weights used in the weighted
regression fitting process. I don't feel strong in WLS, I must say (to put it
diplomatically), so I don't know whether it is possible to use the 'weights' argument to solve
my problem.

I generated the data:

x <- rep(c(1,2), c(6,4))
y <- rep(c(1,2,3,4),c(2,3,3,2))

# which look like
cbind(x,y)

# now I fit a model
summary(m <- lm(y~x))

# now when I create "collapsed" data
x1 <- rep(c(1,2), c(3,2))
y1 <- rep(c(1,2,3,4), c(1,1,2,1))

# with frequencies
w <- c(2,3,1,2,2)

# which look like
cbind(x1,y1,w)

# and fit a model
summary(m1 <- lm(y1~x1, weights=w))

I'm getting the same coefficients, but different standard errors. I guess this is
what you had in mind.
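
One way to see where the difference comes from, as a sketch that applies to integer frequency weights only (it is not a fix for sampling weights): both fits solve the same normal equations and have the same weighted residual sum of squares, but lm() takes the number of rows, not the sum of the weights, as the sample size. The residual degrees of freedom are therefore 3 in the collapsed fit and 8 in the full fit, and rescaling by their ratio recovers the full-data standard errors. The names `se`, `se1`, `adj` and `se1_adj` are introduced here for illustration.

```r
# Full data and the fit from the message above
x <- rep(c(1, 2), c(6, 4))
y <- rep(c(1, 2, 3, 4), c(2, 3, 3, 2))
m <- lm(y ~ x)

# Collapsed data with frequencies used as weights
x1 <- rep(c(1, 2), c(3, 2))
y1 <- rep(c(1, 2, 3, 4), c(1, 1, 2, 1))
w  <- c(2, 3, 1, 2, 2)
m1 <- lm(y1 ~ x1, weights = w)

# Standard errors reported by each fit
se  <- sqrt(diag(vcov(m)))
se1 <- sqrt(diag(vcov(m1)))

# The residual variance is RSS / df in both fits, with the same RSS
# but a different df, so the standard errors differ by sqrt(df1 / df)
adj     <- sqrt(df.residual(m1) / df.residual(m))
se1_adj <- se1 * adj

print(rbind(full = se, collapsed = se1, rescaled = se1_adj))
```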

I guess I need a book on WLS... Thank you for the answer anyway.


Michal



ps. Also, I would like to thank you for your fine lecture about S/R in Ann Arbor
this summer -- which I attended with great pleasure.





~,~`~,~`~,~`~,~`~,~`~,~`~,~`~,~
Michal Bojanowski
Institute for Social Studies
University of Warsaw
Poland
http://www.iss.uw.edu.pl

#
Michal Bojanowski <bojaniss at poczta.onet.pl> writes:
Thomas Lumley once did a brief but very good writeup on the various
kinds of weighting. I forget whether it was for one of the open
mailing lists or in connection with a discussion in R-core.

One thing I remember from it was the need to distinguish between the
various reasons for weighting. The one used in lm/glm is based on the
idea that some measurements are more precise than others and therefore
deserve more weight, so basically the weight is the inverse variance
of an observation. However, you might want to weight observations
differently even if their variance is the same, e.g. to obtain a
method that is stable against differences in population structure,
even if the model is slightly wrong. (Some rather subtle issues are
involved here and I'm not sure I'm representing them adequately.)
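
A small sketch of the inverse-variance interpretation, under the assumption that the variances `v` are known (all values here are made up). Fitting lm() with weights `w = 1 / v` should match ordinary least squares after multiplying every row, including the intercept column, by `sqrt(w)`, which is the transformation that gives each observation unit variance.

```r
# Hypothetical measurements with assumed known variances
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.1, 1.9, 3.2, 3.8, 5.3, 5.9)
v <- c(1, 1, 4, 4, 9, 9)
w <- 1 / v                      # inverse-variance weights

# Weighted least squares as done by lm()
wls <- lm(y ~ x, weights = w)

# Equivalent OLS on rows rescaled by sqrt(w); the column s plays
# the role of the intercept, so the automatic intercept is removed
s   <- sqrt(w)
ols <- lm(I(s * y) ~ 0 + s + I(s * x))

print(cbind(wls = coef(wls), ols = coef(ols)))
print(cbind(wls = sqrt(diag(vcov(wls))), ols = sqrt(diag(vcov(ols)))))
```

Coefficients and standard errors agree between the two fits, which is why these weights describe precision and not how many population members a sampled case represents.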
#
Dear Peter and Michal,

I was under the impression that the weights argument in lm specifies 
inverse-variance weights, but that the weights argument in glm specifies 
case weights. Inverse-variance weights, which produce a WLS solution, are 
inappropriate for Michal's problem. I checked and now see that the weights 
arguments for both lm and glm are inverse-variance weights, so the 
procedure that I suggested was incorrect.

Sorry,
  John
At 11:48 PM 11/14/2002 +0100, Peter Dalgaard BSA wrote:
-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: jfox at mcmaster.ca
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox
-----------------------------------------------------

#
On Thu, 14 Nov 2002, John Fox wrote:

            
The weights argument to lm and glm will give the right point estimates.
The standard errors  will potentially be wrong.  This can be fixed with
`sandwich' standard errors, so one option is to use gee() with each
observation being in a `group' on its own. Similarly, the `robust'
standard errors in coxph() will allow probability-weighted survival
analyses.

The sandwich standard errors used by gee() are not quite the same as the
ones used by survey samplers, but they are very similar and they are
consistent estimates of the same thing.

The usual linear model standard errors are often pretty good even for
probability weighting as long as  important covariates aren't strongly
associated with the weights.
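
As a rough illustration of the idea, here is a hand-rolled sandwich estimate for a weighted linear model in base R. This is a sketch on simulated data, with every observation treated as its own group; it is not the code that gee() or coxph() run, and it makes no small-sample correction. The names `bread`, `meat` and `V` are introduced only for this example.

```r
# Simulated data (hypothetical): outcome, covariate, probability weight
set.seed(1)
n <- 200
x <- rnorm(n)
w <- runif(n, 0.5, 2)
y <- 1 + 2 * x + rnorm(n)

# Point estimates from the weighted fit
fit <- lm(y ~ x, weights = w)

X <- model.matrix(fit)
e <- y - fitted(fit)                  # raw (unweighted) residuals

bread <- solve(crossprod(X, w * X))   # (X'WX)^-1
meat  <- crossprod(X * (w * e))       # sum of w_i^2 e_i^2 x_i x_i'
V     <- bread %*% meat %*% bread     # sandwich covariance matrix

se_sandwich <- sqrt(diag(V))
se_model    <- sqrt(diag(vcov(fit)))  # usual model-based standard errors

print(cbind(model = se_model, sandwich = se_sandwich))
```

In this simulation the weights are unrelated to the covariate, so the two sets of standard errors would be expected to be close, in line with the remark above.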


	-thomas

#
Hello Thomas,
Friday, November 15, 2002, 4:35:13 PM, you wrote:
TL> The weights argument to lm and glm will give the right point estimates.
TL> The standard errors  will potentially be wrong.  This can be fixed with
TL> `sandwich' standard errors, so one option is to use gee() with each
TL> observation being in a `group' on its own. Similarly, the `robust'
TL> standard errors in coxph() will allow probability-weighted survival
TL> analyses.

TL> The sandwich standard errors used by gee() are not quite the same as the
TL> ones used by survey samplers, but they are very similar and they are
TL> consistent estimates of the same thing.

TL> The usual linear model standard errors are often pretty good even for
TL> probability weighting as long as  important covariates aren't strongly
TL> associated with the weights.


TL>         -thomas

Where can I find the gee() function? It is not in the base package or in
any of the packages I have installed.

I use R 1.5.1.


Thank you.

~,~`~,~`~,~`~,~`~,~`~,~`~,~`~,~`~,~`~,~`~,~`~,~`~,~`~,~`~,~`~,~

Michal Bojanowski       mailto:mbojanowski at samba.iss.uw.edu.pl
Polish General Social Survey
Institute for Social Studies
University of Warsaw
http://www.iss.uw.edu.pl/

#
Surprisingly enough it is in the gee package.
On Fri, 15 Nov 2002, bojaniss wrote:

            
I suggest upgrading.
#
On Fri, 15 Nov 2002, bojaniss wrote:

            
In the gee package.

	-thomas
