
Simulating Data with predefined reg-coefficients and R2

On Wed, Nov 19, 2008 at 2:19 AM, markus <m.kossner at tu-bs.de> wrote:
Do you want to simulate data such that the least squares estimates of
the regression coefficients are exactly b and the R2 is exactly the
value you specify or do you want to simulate data according to a model
for which the "true but unknown" regression coefficients are b and the
variance of the random noise is a particular value?

The second scenario is easier than the first but both are possible.

To simulate from a "true" model X %*% beta + epsilon where
Var(epsilon) = sigma^2 * diag(n) you simply add random noise to the
vector of true responses.  Because the lm function in R can take a
matrix of responses (each column corresponding to a response vector)
it is best to simulate a matrix of y values as

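# assign r to be the number of replicates desired
n <- nrow(X)
# drop() turns the n x 1 product into a vector so that it recycles
# down each column; adding an n x 1 matrix to an n x r matrix is an error
ymat <- drop(X %*% beta) + matrix(rnorm(n * r, sd = sigma), nrow = n)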
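
As a minimal sketch of the whole step, the lines below build a small
design matrix, simulate r response columns and fit them all with a
single call to lm.  The values of n, r, sigma and beta and the single
covariate x are made up for illustration; they are not from the
original question.

```r
set.seed(42)                 # arbitrary seed, for reproducibility only
n <- 100                     # number of observations (illustrative)
r <- 5                       # number of replicates desired
sigma <- 2                   # standard deviation of the noise (illustrative)
x <- rnorm(n)                # hypothetical covariate
X <- cbind(1, x)             # model matrix with an intercept column
beta <- c(1, 3)              # "true" coefficients (illustrative)

# drop() makes the n x 1 product a plain vector so it recycles down columns
ymat <- drop(X %*% beta) + matrix(rnorm(n * r, sd = sigma), nrow = n)

# one call fits all r response columns against the same predictors
fits <- lm(ymat ~ x)
coef(fits)                   # a 2 x r matrix, one column per replicate
```

Each column of coef(fits) should scatter around beta, with the spread
governed by sigma and n.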

If you want the first scenario, where you simulate data such that the
least squares estimates are exactly b (or as close to b as floating
point computation allows), then you should use the QR decomposition of
X.  The Q matrix from the QR decomposition is an orthogonal matrix
corresponding to a rigid transformation (a rotation or reflection) of
the response space.  Assuming X is n by p with full column rank, after
that transformation the first p elements of the response determine the
coefficients and the remaining n - p elements correspond to the noise.
In that basis you can set the required coefficients and a noise term
of exactly the desired length.  Because Q preserves lengths, the
residual sum of squares, and hence the R2, comes out as specified.
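
One way this construction might look in code is sketched below.  It
assumes a model with an intercept, so that R2 = 1 - RSS/TSS with a
centered total sum of squares, and a full-rank X.  The covariates x1
and x2 and the values of n, b and R2 are invented for illustration.
The noise vector is built in the rotated basis with zeros in its first
p positions and is then mapped back with qr.qy, so it is orthogonal to
the columns of X and cannot change the estimates.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 50
x1 <- rnorm(n)                   # hypothetical covariates
x2 <- rnorm(n)
X <- cbind(1, x1, x2)            # model matrix including the intercept
b <- c(2, -1, 0.5)               # desired least squares estimates
R2 <- 0.7                        # desired R-squared
p <- ncol(X)

fit <- drop(X %*% b)             # fitted values implied by b
ssreg <- sum((fit - mean(fit))^2)
rss <- ssreg * (1 - R2) / R2     # residual sum of squares that gives R2

qrX <- qr(X)
u <- rnorm(n - p)                # noise coordinates in the rotated basis
u <- u * sqrt(rss / sum(u^2))    # rescale to length sqrt(rss)
e <- drop(qr.qy(qrX, c(rep(0, p), u)))  # rotate back; orthogonal to X
y <- fit + e

m <- lm(y ~ x1 + x2)
coef(m)                          # should agree with b up to rounding
summary(m)$r.squared             # should agree with R2 up to rounding
```

Drawing a fresh u gives a different data set with the same estimates
and the same R2.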