Creating data
On Wed, Mar 25, 2009 at 1:46 PM, G. Jay Kerns <gkerns at ysu.edu> wrote:
Dear Scott,
On Tue, Mar 24, 2009 at 4:04 PM, Scott F. Kiesling <kiesling+ at pitt.edu> wrote:
For a simple example, let's say I want to create a dataset with students from different countries and academic departments who took an English test. I want to make some differences (significant and not) and possibly even interactions among the scores by country and department.
I've been meaning to do this for a while (for quizzes, demos, etc) but didn't yet have the chance. ?Below is something I quickly put together; I am sure that it can be improved. You seemed to be looking for only one dependent variable but in my case I am looking for several. Dep. vars: ?y1 - y4 (correlated) Fixed effects: dept (4 levels) and region (3 levels). Covariate: ?x_cov The model is Y = X*B + Eps: X is the model matrix: ?~ x_cov + region + dept + region:dept B is the parameter matrix Eps ~ multivariate normal (or t), mean 0, variance/covariance Sigma I used deviation contrasts; it would be easy to use treatment contrasts, too. If multicollinearity is desired, the same trick as for Eps can be used to generate multiple correlated x-variables. I didn't spend time thinking about the parameters - clearly the most important part (after all, there are 50+ of them in a bloated model like this). They are randomly generated which means that they will display *radically* different behavior, even on different runs. ?In practice, one would want to choose a parameter matrix thoughtfully and stick with it. Everything is in BigData, at the end.
Slightly off-topic but you may want to look at the 'exams' package by Achim Zeileis and Bettina Gruen http://cran.r-project.org/web/packages/exams/ The vignette for the package gives examples of incorporating the data set generating process into the document that will generate the exam questions. Even if you do not choose to use LaTeX to generate the exam it is still helpful to encapsulate the data generation process in something beyond an R script with some comments. I know for myself that even with comments in a script I soon forget exactly what is being simulated in a script with a name like 'simulate.R'.
##################################
# need pkgs corpcor, sfsmisc, mvtnorm
set.seed(1)
n <- 500 ? # sample size
##################################
# generate indep. vars
# covariate
x_cov <- rnorm(n, mean = 37, sd = 15)
# fixed effects
region <- sample(c("Amer", "Euro", "Asia"), size = n, prob = c(7,9,4),
replace = TRUE)
dept <- sample(LETTERS[1:4], size = n, prob = c(2,3,3,4), replace = TRUE)
region <- factor(region)
dept <- factor(dept)
D <- data.frame(x_cov = x_cov, region = region, dept = dept)
head(D) ?# the data
##################################
# model matrix
X <- model.matrix( ~ x_cov + (region + dept)^2,
? ? ? ? ? ? ? ? ? data = D,
? ? ? ? ? ? ? ? ? contrasts = list( region="contr.sum", dept="contr.sum"))
head(X) ?# the model matrix
# num of parameters, for each dep variable
k <- dim(X)[2]
##################################
# specify parameters
# should think about these
b1 <- rnorm(k, sd = 6)
b2 <- rcauchy(k, scale = 9)
b3 <- rnorm(k, sd = 12)
b4 <- rcauchy(k, scale = 15)
B <- matrix( c(b1, b2, b3, b4), ncol = 4)
B ? # the parameter matrix
##################################
# generate error
# a correlation matrix
r1 <- c( ?1.0, ?0.4, -0.5, ?0.7 )
r2 <- c( ?0.4, ?1.0, ?0.2, -0.1 )
r3 <- c( -0.5, ?0.2, ?1.0, ?0.0 )
r4 <- c( ?0.7, -0.1, ?0.0, ?1.0 )
Corr <- matrix( c(r1, r2, r3, r4), ncol = 4)
library(sfsmisc)
Corr <- posdefify(Corr)
# variances
# tune Y correlation with the 1750
eps_sigma <- 1750*c(15, 2, 21, 5)
library(corpcor)
Sigma <- rebuild.cov(r = Corr, v = eps_sigma)
library(mvtnorm)
Eps <- rmvnorm(n, sigma = Sigma)
#Eps <- rmvt(n, df = 2, sigma = Sigma)
##################################
# put it all together
Y <- data.frame( X %*% B + Eps )
names(Y) <- c("y1", "y2", "y3", "y4")
BigData <- cbind(Y,D)
head(BigData)
***************************************************
G. Jay Kerns, Ph.D.
Associate Professor
Department of Mathematics & Statistics
Youngstown State University
Youngstown, OH 44555-0002 USA
Office: 1035 Cushwa Hall
Phone: (330) 941-3310 Office (voice mail)
-3302 Department
-3170 FAX
E-mail: gkerns at ysu.edu
http://www.cc.ysu.edu/~gjkerns/
_______________________________________________ R-sig-teaching at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-teaching