Skip to content

Bootstrap CIs for weighted means of paired differences

11 messages · Ivan, David Winsemius

#
Hi,

I am trying to compute bootstrap confidence intervals for weighted means of
paired differences with the boot package. Unfortunately, the weighted mean
estimate lies out of the confidence bounds and hence I am obviously doing
something wrong.

Appreciate any help. Thanks. Here is a reproducible example:


library(boot)
set.seed(1111)
x <- rnorm(50)
y <- rnorm(50)
weights <- runif(50)
weights <- weights / sum(weights)
dataset <- cbind(x,y,weights)
vw_m_diff <- function(dataset,w, d) {
    differences <- dataset[d,1]-dataset[d,2]
    weights <- w[d]
    return(weighted.mean(x=differences, w=weights))
}
res_boot <- boot(dataset, statistic=vw_m_diff, R = 1000, w=dataset[,3])
boot.ci(res_boot)

*BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS*
*Based on 1000 bootstrap replicates*

*CALL : *
*boot.ci <http://boot.ci>(boot.out = res_boot)*

*Intervals : *
*Level      Normal              Basic         *
*95%   (-0.8365, -0.3463 )   (-0.8311, -0.3441 )  *

*Level     Percentile            BCa          *
*95%   (-0.3276,  0.1594 )   (-0.4781, -0.3477 )  *

weighted.mean(x=dataset[,1]-dataset[,2], w=dataset[,3])

*[1] -0.07321734*
#
On Nov 14, 2014, at 12:15 PM, ivan wrote:

            
My understanding of the principle underlying the design of the bootstrapped function was that the data was the first argument and the index vector was the second. (I admit to not knowing what it would do with a third argument. So I would have guessed that you wanted:

 vw_m_diff <- function(dataset,w) {
     differences <- dataset[d,1]-dataset[d,2]
    weights <- dataset[w, "weights"]
    return(weighted.mean(x=differences, w=weights))
  } 

I get what appears to me as a sensible set of estimates (since they seem centered on zero) although I further admit I do not know what the theoretic CI _should_ be for this problem:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = res_boot)

Intervals : 
Level      Normal              Basic         
95%   (-0.5657,  0.4962 )   (-0.5713,  0.5062 )  

Level     Percentile            BCa          
95%   (-0.6527,  0.4249 )   (-0.5579,  0.5023 )  
Calculations and Intervals on Original Scale
David Winsemius
Alameda, CA, USA
#
On Nov 14, 2014, at 3:18 PM, David Winsemius wrote:

            
I'm sorry. That was the code I first editted. This is the code that produced that output:

 vw_m_diff <- function(dataset,w) {
      differences <- dataset[w,1]-dataset[w,2]
     weights <- dataset[w, "weights"]
     return(weighted.mean(x=differences, w=weights))
   }
David Winsemius
Alameda, CA, USA
3 days later
#
Hi David,

thanks a lot for the response. I see that this works. I am not sure, however, what the appropriate way to do this is. It also works if you do not define weights in the boot() function (weighted bootstrap) but rather in the vw_m_diff function (ordinary bootstrap), i.e., 


vw_m_diff <- function(dataset,d) {
  differences <- dataset[d,1]-dataset[d,2]
  weights <- dataset[d, "weights"]
  return(weighted.mean(x=differences, w=weights))
}

with 

boot(dataset, statistic=vw_m_diff, R = 1000)

I guess this is rather a statistical question and hence I will have to look further into it. 

In any case, thanks a lot for your help.

Best
On 15 Nov 2014, at 17:27, David Winsemius <dwinsemius at comcast.net> wrote:

            

  
  
#
On Nov 19, 2014, at 6:08 AM, i.petzev wrote:

            
I don't undertand how that is in the slightest way different than what I suggested. You've only substituted 'd' for 'w' as a formal parameter in the function. That should not change the processed results at all.
#
Hi David,

sorry, I was not clear. The difference comes from defining or not defining ?w? in the boot() function. The results with your function and your approach are thus:

set.seed(1111)
x <- rnorm(50)
y <- rnorm(50)
weights <- runif(50)
weights <- weights / sum(weights)
dataset <- cbind(x,y,weights)

vw_m_diff <- function(dataset,w) {
  differences <- dataset[w,1]-dataset[w,2]
  weights <- dataset[w, "weights"]
  return(weighted.mean(x=differences, w=weights))
}
res_boot <- boot(dataset, statistic=vw_m_diff, R = 1000, w=dataset[,3])
boot.ci(res_boot)

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = res_boot)

Intervals : 
Level      Normal              Basic         
95%   (-0.5657,  0.4962 )   (-0.5713,  0.5062 )  

Level     Percentile            BCa          
95%   (-0.6527,  0.4249 )   (-0.5579,  0.5023 )  
Calculations and Intervals on Original Scale

********************************************************************************************************************

However, without defining ?w? in the bootstrap function, i.e., running an ordinary and not a weighted bootstrap, the results are:

res_boot <- boot(dataset, statistic=vw_m_diff, R = 1000)
boot.ci(res_boot)

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = res_boot)

Intervals : 
Level      Normal              Basic         
95%   (-0.6265,  0.4966 )   (-0.6125,  0.5249 )  

Level     Percentile            BCa          
95%   (-0.6714,  0.4661 )   (-0.6747,  0.4559 )  
Calculations and Intervals on Original Scale
On 19 Nov 2014, at 17:49, David Winsemius <dwinsemius at comcast.net> wrote:

            

  
  
#
On Nov 20, 2014, at 2:23 AM, i.petzev wrote:

            
Right. You never were clear about what you wanted and your examples was so statistically symmetric that it is still hard to see what is needed. The examples below show CI's that are arguably equivalent. I can be faulted for attempting to provide code that produced a sensible answer to a vague question to which I was only guessing at the intent.
I hope you are not saying that because those CI's are different that there is some meaning in that difference. Bootstrap runs will always be "different" than each other unless you use set.seed(.) before the runs.
David Winsemius
Alameda, CA, USA
#
I am aware of the fact that bootstrapping produces different CIs with every
run. I still believe that there is a difference between both types of
procedures. My understanding is that setting "w" in the boot() function
influences the "importance" of observations or how the bootstrap selects
the observations. I.e, observation i does not have the same probability of
being chosen as observation j when "w" is defined in the boot() function.
If you return res_boot you will notice that with "w" being set in the
boot() function, the function call states "weighted bootstrap". If not, it
states "ordinary nonparametric bootstrap". But maybe I am wrong.

On Thu, Nov 20, 2014 at 8:19 PM, David Winsemius <dwinsemius at comcast.net>
wrote:

  
  
#
On Nov 21, 2014, at 6:52 AM, ivan wrote:

            
OK. So in the the second call w affects the probability of a case being sent to the boot-function as well as being used in the boot-function; while with the "non-weighted call" the w's are only affecting the individual mean estimates. So the second one is different. And as I suggested earlier you never described the goals of the investigation or the meaning of the variables. 

   I can tell you that when Davison and Hinkley offered examples of using a bootstrap for a weighted bootstrap mean, they compared a stratified analysis with an example where the weighting was only used on the inner function (example 3.2, practical 3.14 pp 72, 131 of their book) with one where the strata parameter was used. But so far I don't think you have ever described what sort of weights these actually are. In that example the weights were the inverse variances of the sample groups. They didn't use a 'weights' parameter in the boot call. I'm do not know if it was part of the S package that was being used at the time.

I tried to find an example of a weighted bootstrap in V&R 4e but did not see one. Prof Ripley is the maintainer of the boot package. In the V&R book, Angelo Canty is given the credit for writing the boot package for S. I think you should consult the code, first. And you should also look at the `stype` parameter where "w" is one option.
#
On 21 Nov 2014, at 19:18, David Winsemius <dwinsemius at comcast.net> wrote:

            

  
  
#
Ok, thanks for the suggestions. I will look into that. And you are absolutely right that I should have been more clear about what type of weighting I want. So to clarify: I run time series regressions of returns of company i on two different sets of explanatory variables. Then I extract the respective intercepts of the two regressions and take the difference between both. I repeat this for the whole sample of companies and then compute the market value weighted average of those differences.
On 21 Nov 2014, at 19:18, David Winsemius <dwinsemius at comcast.net> wrote: