Skip to content

How to generate SE for the proportion value using a randomization process in R?

9 messages · Rui Barradas, Marna Wagley

#
Hi All,
I was trying to estimate standard error (SE) for the proportion value using
some kind of randomization process (bootstrapping or jackknifing) in R, but
I could not figure it out.

Is there any way to generate SE for the proportion?

The example of the data and the code I am using is attached for your
reference. I would like to generate the value of proportion with a SE using
a 1000 times randomization.

dat<-structure(list(Sample = structure(c(1L, 12L, 13L, 14L, 15L, 16L,
17L, 18L, 19L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), .Label = c("id1",
"id10", "id11", "id12", "id13", "id14", "id15", "id16", "id17",
"id18", "id19", "Id2", "id3", "id4", "id5", "id6", "id7", "id8",
"id9"), class = "factor"), Time1 = c(0L, 1L, 1L, 1L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L), Time2 = c(1L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L,
1L, 1L)), .Names = c("Sample", "Time1", "Time2"), class = "data.frame",
row.names = c(NA,
-19L))
daT<-data.frame(dat %>%
  mutate(Time1.but.not.in.Time2 = case_when(
            Time1 %in% "1" & Time2 %in% "0"  ~ "1"),
Time2.but.not.in.Time1 = case_when(
            Time1 %in% "0" & Time2 %in% "1"  ~ "1"),
 BothTimes = case_when(
            Time1 %in% "1" & Time2 %in% "1"  ~ "1")))
 daT
 summary(daT)

cols.num <- c("Time1.but.not.in.Time2","Time2.but.not.in.Time1",
"BothTimes")
daT[cols.num] <- sapply(daT[cols.num],as.numeric)
summary(daT)
ProportionValue<-sum(daT$BothTimes, na.rm=T)/sum(daT$Time1, na.rm=T)
ProportionValue
standard error??
#
Hello,

Something like this, using base package boot?


library(boot)

bootprop <- function(data, index){
   d <- data[index, ]
   sum(d[["BothTimes"]], na.rm = TRUE)/sum(d[["Time1"]], na.rm = TRUE)
}

R <- 1e3
set.seed(2020)
b <- boot(daT, bootprop, R)
b
b$t0     # original
sd(b$t)  # bootstrapped estimate of the SE of the sample prop.
hist(b$t, freq = FALSE)


Hope this helps,

Rui Barradas

?s 21:57 de 22/01/21, Marna Wagley escreveu:
#
Dear Rui,
I was wondering whether we have to square root of SD to find SE, right?

bootprop <- function(data, index){
   d <- data[index, ]
   sum(d[["BothTimes"]], na.rm = TRUE)/sum(d[["Time1"]], na.rm = TRUE)
}

R <- 1e3
set.seed(2020)
b <- boot(daT, bootprop, R)
b
b$t0     # original
sd(b$t)  # bootstrapped estimate of the SE of the sample prop.
sd(b$t)/sqrt(1000)
pandit*(1-pandit)

hist(b$t, freq = FALSE)
On Fri, Jan 22, 2021 at 3:07 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:

            

  
  
#
Hello,

Inline.

?s 07:47 de 23/01/21, Marna Wagley escreveu:
No, we don't. var already divides by n, don't divide again.
This is the code, that can be seen by running the function name at a 
command line.


sd
#function (x, na.rm = FALSE)
#sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
#    na.rm = na.rm))
#<bytecode: 0x55f3ce900848>
#<environment: namespace:stats>
Try plotting the normal densities for both cases, the red line is 
clearly wrong.


f <- function(x, xbar, s){
   dnorm(x, mean = xbar, sd = s)
}

hist(b$t, freq = FALSE)
curve(f(x, xbar = b$t0, s = sd(b$t)), from = 0, to = 1, col = "blue", 
add = TRUE)
curve(f(x, xbar = b$t0, s = sd(b$t)/sqrt(R)), from = 0, to = 1, col = 
"red", add = TRUE)


Hope this helps,

Rui Barradas
#
Yes Rui, I can see we don't need to divide by square root of sample size.
The example is great to understand it.
Thank you.
Marna
On Sat, Jan 23, 2021 at 12:28 AM Rui Barradas <ruipbarradas at sapo.pt> wrote:

            

  
  
5 days later
#
Hi Rui,
I am sorry for asking you several questions.

In the given example, randomizations (reshuffle) were done 1000 times, and
its 1000 proportion values (results) are stored and it can be seen using
b$t; but I was wondering how the table was randomized (which rows have been
missed/or repeated in each randomizing procedure?).

Is there any way we can see the randomized table and its associated
results? Here in this example, I randomized (or bootstrapped) the table
into three times (R=3) so I would like to store these three tables and look
at them later to know which rows were repeated/missed. Is there any
possibility?
The example data and the code is given below.

Thank you for your help.

####
library(boot)
dat<-structure(list(Sample = structure(c(1L, 12L, 13L, 14L, 15L, 16L,
17L, 18L, 19L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), .Label = c("id1",
"id10", "id11", "id12", "id13", "id14", "id15", "id16", "id17",
"id18", "id19", "Id2", "id3", "id4", "id5", "id6", "id7", "id8",
"id9"), class = "factor"), Time1 = c(0L, 1L, 1L, 1L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L), Time2 = c(1L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L,
1L, 1L)), .Names = c("Sample", "Time1", "Time2"), class = "data.frame",
row.names = c(NA,
-19L))
daT<-data.frame(dat %>%
  mutate(Time1.but.not.in.Time2 = case_when(
            Time1 %in% "1" & Time2 %in% "0"  ~ "1"),
Time2.but.not.in.Time1 = case_when(
            Time1 %in% "0" & Time2 %in% "1"  ~ "1"),
 BothTimes = case_when(
            Time1 %in% "1" & Time2 %in% "1"  ~ "1")))
cols.num <- c("Time1.but.not.in.Time2","Time2.but.not.in.Time1",
"BothTimes")
daT[cols.num] <- sapply(daT[cols.num],as.numeric)
summary(daT)

bootprop <- function(data, index){
   d <- data[index, ]
   sum(d[["BothTimes"]], na.rm = TRUE)/sum(d[["Time1"]], na.rm = TRUE)
}

R <- 3
set.seed(2020)
b <- boot(daT, bootprop, R)
b
b$t0     # original
b$t
sd(b$t)  # bootstrapped estimate of the SE of the sample prop.
hist(b$t, freq = FALSE)

str(b)
b$data
b$seed
b$sim
b$strata
################


On Sat, Jan 23, 2021 at 12:36 AM Marna Wagley <marna.wagley at gmail.com>
wrote:

  
  
#
Hello,

I don't know why you would need to see the indices but rewrite the 
function bootprop as

bootprop_ind <- function(data, index){
   d <- data[index, ]
   #sum(d[["BothTimes"]], na.rm = TRUE)/sum(d[["Time1"]], na.rm = TRUE)
   index
}


and call in the same way. It will now return a matrix of indices with R 
= 1000 rows and 19 columns.

Hope this helps,

Rui Barradas


?s 19:29 de 28/01/21, Marna Wagley escreveu:
#
Thank you Rui,
This is great. How about the following?

SimilatedData<-boot.array(b, indices=T)

seems it is giving the rows ID which are used in the calculation, isn't it?
On Thu, Jan 28, 2021 at 12:21 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:

            

  
  
#
Hello,

Yes, sorry for my previous post, I had forgotten about boot.array.
That's a much better solution for your problem.

Rui Barradas

?s 20:29 de 28/01/21, Marna Wagley escreveu: