Skip to content

Trouble Replicating Unstructured Mixed Procedure in R

7 messages · Steven McKinney, Charles Determan Jr, David Duffy +1 more

#
Since SAS does not publish its source code,
replicating SAS code is not always possible
(nor always desirable).  

R code is completely open, so can be studied,
debated and replicated or modified - very useful when
people want to engage in scientific discussions
of statistical issues.  Doing good science and
data analysis is challenging when you are working with 
a black box of mysterious computer code.  That's
why statisticians have worked so hard for years to
set up open source computational tools such as R.


Steven McKinney
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre
#
Here's a typical agreement that users of SAS must agree to:


Subject to the provisions contained herein, EMPLOYEE may use the SAS copyrighted computer software products which LICENSEE has provided in accordance with its agreement with SAS.

EMPLOYEE acknowledges that these products are copyrighted and that SAS retains all title and ownership rights to the products.  EMPLOYEE agrees not to copy or permit others to copy the products, in whole or in part.

EMPLOYEE agrees to use the products under this agreement only on a computer which is owned or leased by LICENSEE and controlled by LICENSEE.  EMPLOYEE further agrees that the products must remain under EMPLOYEE's control, and that resale or other transfer is explicitly prohibited.

EMPLOYEE agrees to use the products only for EMPLOYEE's or LICENSEE's own data processing requirements, and not for commercial time-sharing, rental or service bureau use.

EMPLOYEE agrees not to create, or attempt to create, or permit or help others to create, the source code from the products furnished under this agreement.  EMPLOYEE agrees that it will not reverse engineer or decompile the products.


(source: http://www.mcmaster.ca/uts/software_downloads/docs/SAS/saslicendform.doc )

Note that last paragraph.  You can find it in other SAS end user license agreements.

So anyone who tries to "replicate PROC MIXED for repeated measures set as unstructured in R"
is then subject to legal action by the largest wealthiest statistical software company ever in
existence.  I personally am not up for that challenge, especially when the code has
debateable merits.

I'd rather write code from scratch using sound statistical first principles,
which I can do thanks to the amazing amount of hard work by the R core group,
none of whom have ever asked me to sign any agreement (though they do insist
that I distribute source code and the GNU General Public License with any
copies I modify and/or distribute).


Steven McKinney

Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre




From: Charles Determan Jr [mailto:deter088 at umn.edu]
Sent: January-26-12 4:50 PM
To: Steven McKinney
Cc: r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] Trouble Replicating Unstructured Mixed Procedure in R

So am I to assume that this implies that there isn't any known way to replicate PROC MIXED for repeated measures set as unstructured in R?

Charles
On Thu, Jan 26, 2012 at 6:36 PM, Steven McKinney <smckinney at bccrc.ca> wrote:
Since SAS does not publish its source code,
replicating SAS code is not always possible
(nor always desirable).

R code is completely open, so can be studied,
debated and replicated or modified - very useful when
people want to engage in scientific discussions
of statistical issues.  Doing good science and
data analysis is challenging when you are working with
a black box of mysterious computer code.  That's
why statisticians have worked so hard for years to
set up open source computational tools such as R.


Steven McKinney
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre
#
On Thu, 26 Jan 2012, Charles Determan Jr wrote:

            
If you Google on "Orthodont" and R, you will find several analyses of this 
dataset, including one in John Maindonald's book.  In those analyses, age 
has been treated as a continuous covariate, rather than the SAS example's 
approach.
#
Hi Charles,

Caveat emptor: I have not read John Maindonalds analysis of this data.
 There may well be problems with this, but here are some of the things
I would try with the data.

By the way, if you want us to help you fit the same model as SAS, it
would help to know what SAS is fitting.  If you could provide the
formula for the model and covariance structure, that would help.  If
you do not know, perhaps first try to replicate in SAS using something
more explicit than the 'repeated' option.

Cheers,

Josh

dat <- structure(list(distance = c(26, 25, 29, 31, 21.5, 22.5, 23, 26.5,
23, 22.5, 24, 27.5, 25.5, 27.5, 26.5, 27, 20, 23.5, 22.5, 26,
24.5, 25.5, 27, 28.5, 22, 22, 24.5, 26.5, 24, 21.5, 24.5, 25.5,
23, 20.5, 31, 26, 27.5, 28, 31, 31.5, 23, 23, 23.5, 25, 21.5,
23.5, 24, 28, 17, 24.5, 26, 29.5, 22.5, 25.5, 25.5, 26, 23, 24.5,
26, 30, 22, 21.5, 23.5, 25, 21, 20, 21.5, 23, 21, 21.5, 24, 25.5,
20.5, 24, 24.5, 26, 23.5, 24.5, 25, 26.5, 21.5, 23, 22.5, 23.5,
20, 21, 21, 22.5, 21.5, 22.5, 23, 25, 23, 23, 23.5, 24, 20, 21,
22, 21.5, 16.5, 19, 19, 19.5, 24.5, 25, 28, 28), age = c(8L,
10L, 12L, 14L, 8L, 10L, 12L, 14L, 8L, 10L, 12L, 14L, 8L, 10L,
12L, 14L, 8L, 10L, 12L, 14L, 8L, 10L, 12L, 14L, 8L, 10L, 12L,
14L, 8L, 10L, 12L, 14L, 8L, 10L, 12L, 14L, 8L, 10L, 12L, 14L,
8L, 10L, 12L, 14L, 8L, 10L, 12L, 14L, 8L, 10L, 12L, 14L, 8L,
10L, 12L, 14L, 8L, 10L, 12L, 14L, 8L, 10L, 12L, 14L, 8L, 10L,
12L, 14L, 8L, 10L, 12L, 14L, 8L, 10L, 12L, 14L, 8L, 10L, 12L,
14L, 8L, 10L, 12L, 14L, 8L, 10L, 12L, 14L, 8L, 10L, 12L, 14L,
8L, 10L, 12L, 14L, 8L, 10L, 12L, 14L, 8L, 10L, 12L, 14L, 8L,
10L, 12L, 14L), Subject = structure(c(12L, 12L, 12L, 12L, 13L,
13L, 13L, 13L, 14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L, 16L, 16L,
16L, 16L, 17L, 17L, 17L, 17L, 18L, 18L, 18L, 18L, 19L, 19L, 19L,
19L, 20L, 20L, 20L, 20L, 21L, 21L, 21L, 21L, 22L, 22L, 22L, 22L,
23L, 23L, 23L, 23L, 24L, 24L, 24L, 24L, 25L, 25L, 25L, 25L, 26L,
26L, 26L, 26L, 27L, 27L, 27L, 27L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L,
6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L,
10L, 10L, 11L, 11L, 11L, 11L), .Label = c("F01", "F02", "F03",
"F04", "F05", "F06", "F07", "F08", "F09", "F10", "F11", "M01",
"M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10",
"M11", "M12", "M13", "M14", "M15", "M16"), class = "factor"),
    Sex = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Female", "Male"
    ), class = "factor")), .Names = c("distance", "age", "Subject",
"Sex"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26",
"27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
"38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48",
"49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59",
"60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70",
"71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81",
"82", "83", "84", "85", "86", "87", "88", "89", "90", "91", "92",
"93", "94", "95", "96", "97", "98", "99", "100", "101", "102",
"103", "104", "105", "106", "107", "108"))

require(ggplot2)
require(mgcv)

## intercepts and slopes seem different between sexes
## but there is no evidence of a nonlinear relationship between
## age and distance
ggplot(dat, aes(x = age, y = distance, colour = Sex)) +
  geom_point() +
  stat_smooth(method = "gam", formula = y ~ s(x))

## reorder data by initial distance value
dat$Subject <- with(dat, reorder(Subject,
  distance[ifelse(age == min(age), TRUE, NA)],
  FUN = mean, na.rm = TRUE))

## slight evidence that lower intercepts may be
## associated with more positive slopes
ggplot(dat, aes(x = age, y = distance, colour = Sex)) +
  geom_point() +
  stat_smooth(method = "gam", formula = y ~ s(x)) +
  facet_wrap(~ Subject)

## lme4 package for mixed effects models
require(lme4)

mnull <- lmer(distance ~ 1 + (1 | Subject), data = dat)
m1 <- update(mnull, . ~ . + age * Sex)
m2 <- update(m2, . ~ . + (0 + age | Subject))
m3 <- lmer(distance ~ age * Sex + (1 + age | Subject), data = dat)

## compare different models, m1 seems good
anova(mnull, m1, m2, m3)

plot(dat$distance, fitted(m1))

## examine residuals and random effects
qqnorm(resid(m1))
plot(dat$age, resid(m1))
qqnorm(ranef(m1)$Subject[,1])

## view the summary
summary(m1)
## Linear mixed model fit by REML
## Formula: distance ~ (1 | Subject) + age + Sex + age:Sex
##    Data: dat
##    AIC   BIC logLik deviance REMLdev
##  445.8 461.9 -216.9    428.7   433.8
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 3.2986   1.8162
##  Residual             1.9221   1.3864
## Number of obs: 108, groups: Subject, 27

## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 17.37273    1.18349  14.679
## age          0.47955    0.09347   5.130
## SexMale     -1.03210    1.53740  -0.671
## age:SexMale  0.30483    0.12142   2.511

## Correlation of Fixed Effects:
##             (Intr) age    SexMal
## age         -0.869
## SexMale     -0.770  0.669
## age:SexMale  0.669 -0.770 -0.869
On Thu, Jan 26, 2012 at 5:37 PM, Charles Determan Jr <deter088 at umn.edu> wrote: