Help with simulation of unbalanced clustered data
6 messages · Jeff Newmiller, Chao Liu, Abby Spurdle
Dear R experts,
I want to simulate some unbalanced clustered data. The number of clusters is 20 and the average number of observations per cluster is 30. However, I would like to first create data in which each cluster has 10% more observations than specified (i.e., 33 rather than 30). I then want to randomly exclude an appropriate number of observations (i.e., 60) to arrive at the specified average number of observations per cluster (i.e., 30). The probability of excluding an observation within each cluster was not uniform (i.e., some clusters had no cases removed and others had more excluded). Therefore in the end I still have 600 observations in total. How can I do that in R? Thank you for your help!
Best,
Liu
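The procedure described in the question can be sketched roughly as follows. This is only one possible reading of it: the sketch assumes the removal probability varies by cluster through a made-up weight vector `w_cluster` (five clusters get weight 0 and are never thinned). The seed, the weights, and all variable names are illustrative assumptions, not something specified in the thread.

```r
set.seed(1)                              # hypothetical seed, for reproducibility only
n_clusters <- 20
n_target   <- 30                         # desired average cluster size
n_start    <- round(1.1 * n_target)      # 10% more per cluster, i.e. 33

# Balanced starting data: 20 clusters x 33 rows = 660 rows
dd <- data.frame(
  cluster = rep(seq_len(n_clusters), each = n_start),
  id      = rep(seq_len(n_start), times = n_clusters),
  x       = rnorm(n_clusters * n_start),
  y       = rnorm(n_clusters * n_start)
)

# Assumed cluster-level removal weights: 5 clusters are never thinned (weight 0),
# the other 15 get arbitrary positive weights
w_cluster <- sample(c(rep(0, 5), runif(15)))

# Number of rows to remove so that the average cluster size is n_target
n_remove <- nrow(dd) - n_clusters * n_target   # 660 - 600

# Draw the rows to drop; each row inherits the weight of its cluster.
# Note: with replace = FALSE, sample() draws sequentially, so the weights
# are relative weights, not exact inclusion probabilities.
drop <- sample(nrow(dd), n_remove, prob = w_cluster[dd$cluster])
dd_unbal <- dd[-drop, ]

sizes <- tabulate(dd_unbal$cluster, nbins = n_clusters)
sizes          # unequal cluster sizes
mean(sizes)    # average cluster size after removal
```

Because exactly `n_remove` rows are dropped, the total is fixed at 600 while the individual cluster sizes vary. How the weights should actually vary is a modelling decision that the sketch does not settle.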
This is R-help, not R-do-my-work-for-me. It is also not a homework help line. The Posting Guide is required reading. Assuming this is not homework, since each step in your problem definition can be mapped to a fairly basic operation in R (the sample function and indexing being key tools), you should be showing your work with a reproducible example that illustrates where you are stuck or why the result you are getting does not exhibit the desired properties.
On December 15, 2020 6:48:12 PM PST, Chao Liu <psychaoliu at gmail.com> wrote:
> Dear R experts, I want to simulate some unbalanced clustered data. [...]
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Sent from my phone. Please excuse my brevity.
Thank you for the reminder, Jeff. I am new to R-help and so please bear with my ignorance. This is not homework and here is a reproducible example. The number of observations per cluster doesn't follow the condition specified above though, I just used this to convey my idea.
y <- rnorm(20)
x <- rnorm(20)
z <- rep(1:5, 4)
w <- rep(1:4, each=5)
dd <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
id cluster x y
1 1 1 0.30003855 0.65325768
2 2 1 -1.00563626 -0.12270866
3 3 1 0.01925927 -0.41367651
4 4 1 -1.07742065 -2.64314895
5 5 1 0.71270333 -0.09294102
6 1 2 1.08477509 0.43028470
7 2 2 -2.22498770 0.53539884
8 3 2 1.23569346 -0.55527835
9 4 2 -1.24104450 1.77950291
10 5 2 0.45476927 0.28642442
11 1 3 0.65990264 0.12631586
12 2 3 -0.19988983 1.27226678
13 3 3 -0.64511396 -0.71846622
14 4 3 0.16532102 -0.45033862
15 5 3 0.43881870 2.39745248
16 1 4 0.88330282 0.01112919
17 2 4 -2.05233698 1.63356842
18 3 4 -1.63637927 -1.43850664
19 4 4 1.43040234 -0.19051680
20 5 4 1.04662885 0.37842390
After randomly adding and deleting some data, the unbalanced data become
like this:
id cluster x y
1 1 1 0.895 -0.659
2 2 1 -0.160 -0.366
3 1 2 -0.528 -0.294
4 2 2 -0.919 0.362
5 3 2 -0.901 -0.467
6 1 3 0.275 0.134
7 2 3 0.423 0.534
8 3 3 0.929 -0.953
9 4 3 1.67 0.668
10 5 3 0.286 0.0872
11 1 4 -0.373 -0.109
12 2 4 0.289 0.299
13 3 4 -1.43 -0.677
14 4 4 -0.884 1.70
15 5 4 1.12 0.386
16 1 5 -0.723 0.247
17 2 5 0.463 -2.59
18 3 5 0.234 0.893
19 4 5 -0.313 -1.96
20 5 5 0.848 -0.0613
Here is what I tried:
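picked <- sample(unique(dd$cluster), round(0.2*length(unique(dd$cluster)))) # pick the clusters to thin once, so both steps use the same clusters
rows <- which(dd$cluster %in% picked) # %in% also works if more than one cluster is picked
dd[-sample(rows, round(0.5*length(rows))), ] # drop about half of the rows in the picked clusters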
I know it is very inefficient. Also, it just randomly deletes rows and
does nothing to add rows so that the total number of observations
stays the same. Thank you for your help!
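One pitfall with dropping rows by negative index, as in the attempt above: if the number of rows to drop rounds to zero, the index becomes `-integer(0)`, which selects no rows at all rather than all of them. A minimal sketch that avoids this with a logical keep-mask is given here; `thin_clusters` and its arguments `frac_clusters` and `frac_rows` are hypothetical names introduced only for illustration.

```r
# Toy balanced data as in the message above: 4 clusters of 5 rows each
set.seed(2020)  # hypothetical seed
dd <- data.frame(id = rep(1:5, 4), cluster = rep(1:4, each = 5),
                 x = rnorm(20), y = rnorm(20))

# Pitfall: a negative index of length zero selects nothing
nrow(dd[-integer(0), ])   # 0 rows, not 20

thin_clusters <- function(dd, frac_clusters = 0.2, frac_rows = 0.5) {
  cl <- unique(dd$cluster)
  # Pick the clusters once, so the same set is used in every later step
  picked <- cl[sample.int(length(cl), round(frac_clusters * length(cl)))]
  rows <- which(dd$cluster %in% picked)
  # Indexing via sample.int() avoids sample()'s special case for length-1 input
  drop <- rows[sample.int(length(rows), round(frac_rows * length(rows)))]
  # A logical mask keeps every row when 'drop' is empty
  dd[!(seq_len(nrow(dd)) %in% drop), , drop = FALSE]
}

thinned <- thin_clusters(dd)
table(thinned$cluster)
```

This only removes rows. It does not address how cluster sizes should be generated in the first place.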
Best,
Liu
On Wed, Dec 16, 2020 at 8:50 AM Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
> This is R-help, not R-do-my-work-for-me. [...]
Sigh. You still haven't read the Posting Guide? HTML email causes problems with this mailing list so do send email using your mail client's plain text option.
You assert that
> The probability of excluding an observation within each cluster was not uniform
but having a different number excluded can either be due to having a different probability or due to equal probability but different random chance associated with the same probability.
> (i.e., some clusters had no cases removed and others had more excluded)
so this could occur various ways. If you meant for the probability to vary, just how should it vary? Also, changing your requirements mid-stream makes it very difficult to see what you really want to accomplish.
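The distinction drawn above can be illustrated with a small simulation. In this sketch, case (a) removes 60 of 660 rows with equal probability for every row, and case (b) uses cluster-specific weights. The weight vector `p` is an arbitrary assumption (plain `runif` draws), since the thread never says how the probability should vary.

```r
set.seed(42)                           # hypothetical seed
cluster <- rep(1:20, each = 33)        # cluster label for each of the 660 rows

# (a) Equal probability for every row: counts still differ by chance
drop_equal    <- sample(length(cluster), 60)
removed_equal <- tabulate(cluster[drop_equal], nbins = 20)

# (b) Probability varies by cluster through assumed weights p
p            <- runif(20)
drop_vary    <- sample(length(cluster), 60, prob = p[cluster])
removed_vary <- tabulate(cluster[drop_vary], nbins = 20)

# Number of rows removed per cluster under each scheme
rbind(equal = removed_equal, varying = removed_vary)
```

Under (a) each cluster loses 3 rows on average, yet individual clusters typically lose more or fewer. Unequal counts alone therefore do not show that the removal probability differed between clusters.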
On December 16, 2020 6:56:12 AM PST, Chao Liu <psychaoliu at gmail.com> wrote:
> Thank you for the reminder, Jeff. [...]
Hi Chao Liu,
I'm having difficulty following your question, and examples.
And also, I don't see the motivation for increasing, then decreasing
the sample sizes.
Intuitively, one would compute the correct sample sizes, first time round...
But I thought I'd add some comments, just in case they're useful.
If the problem relates to memberships (in clusters), then the problem
can be simplified.
All one needs is an integer vector, where each value is the index of
the cluster.
To compute random memberships of 600 observations in 20 clusters, one could run:
m <- sample (1:20, 600, TRUE)
To compute the number of observations per cluster, one could then run:
table (m)
In the above code, the probability of an observation being assigned to
each cluster, is uniform.
Non-uniform sampling can be achieved by supplying a 4th argument to
the sample function, which is a numeric vector of weights.
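As a sketch of the non-uniform case just mentioned: the 4th argument of `sample` is `prob`. The weights `w` here are arbitrary placeholders, and building a data frame from the resulting sizes is an added illustration rather than something spelled out in the message.

```r
set.seed(123)                      # hypothetical seed
w <- runif(20)                     # assumed cluster weights (need not sum to 1)

# Random memberships of 600 observations in 20 clusters, non-uniform
m <- sample(1:20, 600, replace = TRUE, prob = w)

# Observations per cluster; unlike table(), tabulate() keeps empty clusters as 0
sizes <- tabulate(m, nbins = 20)
sizes

# Optional: build clustered data directly from the sizes
dd <- data.frame(
  cluster = rep(seq_along(sizes), times = sizes),
  id      = sequence(sizes),       # 1..n within each cluster
  x       = rnorm(sum(sizes)),
  y       = rnorm(sum(sizes))
)
```

With this approach the total is always 600, so the average cluster size is 30 without a separate oversample-then-remove step.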
On Wed, Dec 16, 2020 at 10:08 PM Chao Liu <psychaoliu at gmail.com> wrote:
> Dear R experts, I want to simulate some unbalanced clustered data. [...]
Thank you for your help Abby!
On Wed, Dec 16, 2020 at 11:32 PM Abby Spurdle <spurdle.a at gmail.com> wrote:
> Hi Chao Liu, I'm having difficulty following your question, and examples. [...]