Hi Jarrod,
that was exactly it. I hadn't checked the $VCV mcmclist, but I'll do
so in the future
as the mistake is blindingly obvious that way: http://imgur.com/nlA0QwZ
After adding fix=2 to R1 in my starting values, my parallel chains
converged as well.
For future amateurs reading this:
prior <- list(
R=list(V=diag(2), nu=1.002, fix=2),
G=list(G1=list(V=diag(2), nu=1, alpha.mu=c(0,0), alpha.V=diag(2)*1000))
)
start <- list(
liab=c(rnorm( nrow(krmh.1)*2 )),
R = list(R1 = rIW(diag(2), 10 , fix = 2)), ### FIX=2 HERE AS WELL
G = list(G1 = rIW(diag(2), 10 ))
)
Thank you so much. This sort of remote diagnosis with this little information
seems borderline psychic to me.
Now, onwards to modelling the pedigrees :-)
Best wishes,
Ruben
PS.: I never actually figured out what kind of variability to use
for rIW() in my starting values,
but it doesn't seem to matter and that's what I should want I
suppose. Actually, even with the mis-specified
starting values my posterior looked pretty similar to now.
On 29 Aug 2014, at 09:13, Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
Hi Ruben,
Actually I might know what it is. When you sample different
starting values do you inadvertently sample a new residual variance
for the unidentified za part? You need to make sure that this is
always fixed at the same value (otherwise the model is different).
This is not a problem under the trait:units specification.
Cheers,
Jarrod.
Quoting Jarrod Hadfield <j.hadfield at ed.ac.uk> on Fri, 29 Aug 2014
08:04:42 +0100:
Hi Ruben,
Can you share your data and I will take a look. Its definitely not
Monte Carlo error.
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Thu, 28 Aug 2014
22:44:47 +0200:
Hi Jarrod,
those two matched up quite well yes. I just completed another 20
chains, using more variable
starting values. There's still two fixed effects
traitza_children:spouses and :male which haven't converged
according to multi-chain (gelman), but have according to geweke.
The offending traces: http://imgur.com/Qm6Ovfr
These specific effects aren't of interest to me, so if this
doesn't affect the rest of my estimates, I can be happy
with this, but I can't conclude that, can I?
I'm now also doing a run to see how it deals with the more
intensely zero-inflated data when including
the unmarried.
Thanks a lot for all that help,
Ruben
Potential scale reduction factors:
Point est. Upper C.I.
(Intercept) 1.00 1.00
traitza_children 1.40 1.65
male 1.00 1.00
spouses 1.00 1.00
paternalage.mean 1.00 1.00
paternalage.factor(25,30] 1.00 1.00
paternalage.factor(30,35] 1.00 1.00
paternalage.factor(35,40] 1.00 1.00
paternalage.factor(40,45] 1.00 1.00
paternalage.factor(45,50] 1.00 1.00
paternalage.factor(50,55] 1.00 1.00
paternalage.factor(55,90] 1.00 1.00
traitza_children:male 1.33 1.54
traitza_children:spouses 2.21 2.83
traitza_children:paternalage.mean 1.01 1.02
traitza_children:paternalage.factor(25,30] 1.05 1.08
traitza_children:paternalage.factor(30,35] 1.08 1.13
traitza_children:paternalage.factor(35,40] 1.15 1.25
traitza_children:paternalage.factor(40,45] 1.15 1.26
traitza_children:paternalage.factor(45,50] 1.26 1.43
traitza_children:paternalage.factor(50,55] 1.15 1.25
traitza_children:paternalage.factor(55,90] 1.14 1.23
Multivariate psrf
8.99
Iterations = 100001:149951
Thinning interval = 50
Number of chains = 20
Sample size per chain = 1000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive
SE Time-series SE
(Intercept) 1.36326 0.04848
0.0003428 0.0003542
traitza_children -0.76679 0.28738
0.0020321 0.0016682
male 0.09980 0.01633
0.0001155 0.0001222
spouses 0.12333 0.01957
0.0001384 0.0001414
paternalage.mean 0.07215 0.02194
0.0001551 0.0001596
paternalage.factor(25,30] -0.03381 0.04184
0.0002959 0.0003066
paternalage.factor(30,35] -0.08380 0.04270
0.0003019 0.0003118
paternalage.factor(35,40] -0.16502 0.04569
0.0003231 0.0003289
paternalage.factor(40,45] -0.16738 0.05090
0.0003599 0.0003697
paternalage.factor(45,50] -0.18383 0.05880
0.0004158 0.0004242
paternalage.factor(50,55] -0.18241 0.07277
0.0005146 0.0005302
paternalage.factor(55,90] -0.40612 0.09875
0.0006983 0.0007467
traitza_children:male 0.12092 0.08223
0.0005815 0.0004697
traitza_children:spouses 0.64881 0.21132
0.0014942 0.0008511
traitza_children:paternalage.mean -0.02741 0.08550
0.0006046 0.0006221
traitza_children:paternalage.factor(25,30] -0.17296 0.18680
0.0013209 0.0013750
traitza_children:paternalage.factor(30,35] -0.19027 0.19267
0.0013624 0.0013901
traitza_children:paternalage.factor(35,40] -0.24911 0.21282
0.0015049 0.0014391
traitza_children:paternalage.factor(40,45] -0.29772 0.23403
0.0016548 0.0015956
traitza_children:paternalage.factor(45,50] -0.51782 0.28589
0.0020215 0.0017602
traitza_children:paternalage.factor(50,55] -0.46126 0.32064
0.0022673 0.0021397
traitza_children:paternalage.factor(55,90] -0.38612 0.41461
0.0029317 0.0027396
2. Quantiles for each variable:
2.5% 25%
50% 75% 97.5%
(Intercept) 1.26883 1.33106
1.36322 1.39575 1.4589722
traitza_children -1.20696 -0.95751
-0.81076 -0.63308 -0.0365042
male 0.06785 0.08878
0.09970 0.11085 0.1320168
spouses 0.08467 0.11030
0.12343 0.13643 0.1617869
paternalage.mean 0.02950 0.05751
0.07202 0.08683 0.1153881
paternalage.factor(25,30] -0.11581 -0.06174
-0.03397 -0.00574 0.0473783
paternalage.factor(30,35] -0.16656 -0.11250
-0.08358 -0.05519 0.0003065
paternalage.factor(35,40] -0.25518 -0.19530
-0.16500 -0.13440 -0.0757366
paternalage.factor(40,45] -0.26887 -0.20164
-0.16675 -0.13335 -0.0677407
paternalage.factor(45,50] -0.30080 -0.22320
-0.18339 -0.14440 -0.0687967
paternalage.factor(50,55] -0.32663 -0.23034
-0.18227 -0.13317 -0.0415547
paternalage.factor(55,90] -0.60202 -0.47303
-0.40454 -0.33994 -0.2139128
traitza_children:male -0.01083 0.06634
0.11024 0.16109 0.3295892
traitza_children:spouses 0.37857 0.51072
0.59398 0.71395 1.2127940
traitza_children:paternalage.mean -0.19138 -0.08250
-0.02985 0.02493 0.1468989
traitza_children:paternalage.factor(25,30] -0.57457 -0.28481
-0.16489 -0.05151 0.1728148
traitza_children:paternalage.factor(30,35] -0.61499 -0.30350
-0.17736 -0.06299 0.1555147
traitza_children:paternalage.factor(35,40] -0.74251 -0.36752
-0.22966 -0.10777 0.1151897
traitza_children:paternalage.factor(40,45] -0.84165 -0.42691
-0.27729 -0.14322 0.1032436
traitza_children:paternalage.factor(45,50] -1.21782 -0.66568
-0.48420 -0.32873 -0.0476720
traitza_children:paternalage.factor(50,55] -1.21327 -0.63623
-0.43432 -0.24957 0.0955360
traitza_children:paternalage.factor(55,90] -1.33772 -0.62227
-0.35364 -0.11050 0.3361684
(Intercept)
traitza_children
18814.05
16359.33
male
spouses
18132.98
19547.05
paternalage.mean
paternalage.factor(25,30]
19238.72
18974.81
paternalage.factor(30,35]
paternalage.factor(35,40]
18874.33
19406.63
paternalage.factor(40,45]
paternalage.factor(45,50]
19075.18
19401.77
paternalage.factor(50,55]
paternalage.factor(55,90]
18960.11
17893.23
traitza_children:male
traitza_children:spouses
18545.55
14438.51
traitza_children:paternalage.mean
traitza_children:paternalage.factor(25,30]
18464.09
16943.43
traitza_children:paternalage.factor(30,35]
traitza_children:paternalage.factor(35,40]
16827.44
17230.04
traitza_children:paternalage.factor(40,45]
traitza_children:paternalage.factor(45,50]
17144.78
18191.67
traitza_children:paternalage.factor(50,55]
traitza_children:paternalage.factor(55,90]
17466.60
18540.59
### current script:
# bsub -q mpi -W 24:00 -n 21 -R np20 mpirun -H localhost -n 21 R
--slave -f "/usr/users/rarslan/rpqa/krmh_main/children.R"
setwd("/usr/users/rarslan/rpqa/")
library(doMPI)
cl <-
startMPIcluster(verbose=T,workdir="/usr/users/rarslan/rpqa/krmh_main/")
registerDoMPI(cl)
Children = foreach(i=1:clusterSize(cl),.options.mpi =
list(seed=1337) ) %dopar% {
library(MCMCglmm);library(data.table)
setwd("/usr/users/rarslan/rpqa/krmh_main/")
source("../1 - extraction functions.r")
load("../krmh1.rdata")
krmh.1 = recenter.pat(na.omit(krmh.1[spouses>0, list(idParents,
children, male, spouses, paternalage)]))
samples = 1000
thin = 50; burnin = 100000
nitt = samples * thin + burnin
prior <- list(
R=list(V=diag(2), nu=1.002, fix=2),
G=list(G1=list(V=diag(2), nu=1, alpha.mu=c(0,0), alpha.V=diag(2)*1000))
)
start <- list(
liab=c(rnorm( nrow(krmh.1)*2 )),
R = list(R1 = rIW(diag(2), 10 )),
G = list(G1 = rIW(diag(2), 10 ))
)
( m1 = MCMCglmm( children ~ trait * (male + spouses +
paternalage.mean + paternalage.factor),
rcov=~idh(trait):units,
random=~idh(trait):idParents,
family="zapoisson",
start = start,
prior = prior,
data=krmh.1,
pr = F, saveX = F, saveZ = F,
nitt=nitt,thin=thin,burnin=burnin)
)
m1$Residual$nrt<-2
m1
}
save(Children,file = "Children.rdata")
closeCluster(cl)
mpi.quit()
On 28 Aug 2014, at 20:59, Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
Hi,
The posteriors for the two models look pretty close to me. Are
the scale reduction factors really as high as previously
reported? Before you had 1.83 for traitza_children:spouses, but
the plot suggests that it should be close to 1?
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Thu, 28 Aug 2014
19:59:16 +0200:
Sure! Thanks a lot.
I am using ~idh(trait):units already, sorry for saying that
incorrectly in my last email.
These models aren't the final thing, I will replace the
paternalage.factor variable
with its linear equivalent if that seems defensible (does so
far) and in this model it seems
okay to remove the za-effects for all predictors except spouses.
So a final model would have fewer fixed effects. I also have
datasets of 200k+ and 5m+,
but I'm learning MCMCglmm with this smaller one because my
wrong turns take less time.
I've uploaded a comparison coef plot of two models:
http://i.imgur.com/sHUfnmd.png
m7 is with the default starting values, m1 is with the
specification I sent in my last email. I don't
know if such differences are something to worry about.
I don't know what qualifies as highly overdispersed, here's a
plot of the outcome for ever
married people (slate=real data):
http://imgur.com/14MywgZ
here's with everybody born (incl. some stillborn etc.):
http://imgur.com/knRGa1v
I guess my approach (generating an overdispersed poisson with
the parameters from
the data and checking if it has as excess zeroes) is not the
best way to diagnose zero-inflation,
but especially in the second case it seems fairly clear-cut.
Best regards,
Ruben
Iterations = 50001:149951
Thinning interval = 50
Sample size = 2000
DIC: 31249.73
G-structure: ~idh(trait):idParents
post.mean l-95% CI u-95% CI eff.samp
children.idParents 0.006611 4.312e-08 0.0159 523.9
za_children.idParents 0.193788 7.306e-02 0.3283 369.3
R-structure: ~idh(trait):units
post.mean l-95% CI u-95% CI eff.samp
children.units 0.1285 0.1118 0.1452 716.1
za_children.units 0.9950 0.9950 0.9950 0.0
Location effects: children ~ trait * (male + spouses +
paternalage.mean + paternalage.factor)
post.mean l-95% CI
u-95% CI eff.samp pMCMC
(Intercept) 1.3413364
1.2402100 1.4326099 1789 <5e-04 ***
traitza_children -0.8362879
-1.2007980 -0.5016730 1669 <5e-04 ***
male 0.0994902
0.0679050 0.1297394 2000 <5e-04 ***
spouses 0.1236033
0.0839000 0.1624939 2000 <5e-04 ***
paternalage.mean 0.0533892
0.0119569 0.0933960 2000 0.015 *
paternalage.factor(25,30] -0.0275822
-0.1116421 0.0537359 1842 0.515
paternalage.factor(30,35] -0.0691025
-0.1463214 0.0122393 1871 0.097 .
paternalage.factor(35,40] -0.1419933
-0.2277379 -0.0574678 1845 <5e-04 ***
paternalage.factor(40,45] -0.1364952
-0.2362714 -0.0451874 1835 0.007 **
paternalage.factor(45,50] -0.1445342
-0.2591767 -0.0421178 1693 0.008 **
paternalage.factor(50,55] -0.1302972
-0.2642965 0.0077061 2000 0.064 .
paternalage.factor(55,90] -0.3407879
-0.5168972 -0.1493652 1810 <5e-04 ***
traitza_children:male 0.0926888
-0.0147379 0.2006142 1901 0.098 .
traitza_children:spouses 0.5531197
0.3870616 0.7314289 1495 <5e-04 ***
traitza_children:paternalage.mean 0.0051463
-0.1279396 0.1460099 1617 0.960
traitza_children:paternalage.factor(25,30] -0.1538957
-0.4445749 0.1462955 1781 0.321
traitza_children:paternalage.factor(30,35] -0.1747883
-0.4757851 0.1162476 1998 0.261
traitza_children:paternalage.factor(35,40] -0.2261843
-0.5464379 0.0892582 1755 0.166
traitza_children:paternalage.factor(40,45] -0.2807543
-0.6079678 0.0650281 1721 0.100 .
traitza_children:paternalage.factor(45,50] -0.4905843
-0.8649214 -0.1244174 1735 0.010 **
traitza_children:paternalage.factor(50,55] -0.4648579
-0.9215759 -0.0002083 1687 0.054 .
traitza_children:paternalage.factor(55,90] -0.3945406
-1.0230155 0.2481568 1793 0.195
---
Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
describe(krmh.1[spouses>0,])
vars n mean sd median trimmed mad min
max range skew kurtosis se
children 2 6829 3.81 2.93 4.00 3.61 2.97
0.00 16.00 16.00 0.47 -0.46 0.04
male 3 6829 0.46 0.50 0.00 0.45 0.00
0.00 1.00 1.00 0.14 -1.98 0.01
spouses 4 6829 1.14 0.38 1.00 1.03 0.00
1.00 4.00 3.00 2.87 8.23 0.00
paternalage 5 6829 3.65 0.80 3.57 3.60 0.80
1.83 7.95 6.12 0.69 0.70 0.01
paternalage_c 6 6829 0.00 0.80 -0.08 -0.05 0.80
-1.82 4.30 6.12 0.69 0.70 0.01
paternalage.mean 7 6829 0.00 0.68 -0.08 -0.05 0.59
-1.74 4.30 6.04 0.95 1.97 0.01
paternalage.diff 8 6829 0.00 0.42 0.00 -0.01 0.38
-1.51 1.48 2.99 0.17 0.17 0.01
table(krmh.1$paternalage.factor)
[0,25] (25,30] (30,35] (35,40] (40,45] (45,50] (50,55] (55,90]
309 1214 1683 1562 1039 623 269 130
On 28 Aug 2014, at 19:05, Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
Hi Ruben,
It might be hard to detect (near) ECPs with so many fixed
effects (can you post the model summary (and give us the mean
and standard deviation of any continuous covariates)). Also,
the complementary log-log link (which is the za specification)
is non-symmetric and runs into problems outside the range -35
to 3.5 so there may be a problem there, particularly if you
use rcov=~trait:units and the Poisson part is highly
over-dispersed. You could try rcov=~idh(trait):units and fix
the non-identifiable za residual variance to something smaller
than 1 (say 0.5) - it will mix slower but it will reduce the
chance of over/underflow.
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Thu, 28 Aug
2014 18:45:30 +0200:
1) it did not return an error with rcov = ~trait:units
because you used R1=rpois(2,1)+1 and yet this specification
only fits a single variance (not a 2x2 covariance matrix).
R1=rpois(2,1)+1 is a bit of a weird specification since it
has to be integer. I would obtain starting values using rIW().
I agree it's a weird specification, I was a bit lost and
thought I could get away with just putting some random
numbers in the starting value.
I didn't do R1=rpois(2,1)+1 though, I did
R1=diag(rpois(2,1)+1), so I got a 2x2 matrix, but yes, bound
to be integer.
I didn't know starting values should come from a conjugate
distribution, though that probably means I didn't think about
it much.
I'm now doing
start <- list(
liab=c(rnorm( nrow(krmh.1)*2 )),
R = list(R1 = rIW( diag(2), nrow(krmh.1)) ),
G = list(G1 = rIW( diag(2), nrow(krmh.1)) )
)
Is this what you had in mind?
I am especially unsure if I am supposed to use such a low
sampling variability (my sample size is probably not even
relevant for the starting values) and if I should start from
diag(2).
And, I am still happily confused that this specification
still doesn't lead to errors with respect to rcov =
~trait:units . Does this mean I'm doing it wrong?
3) a) how many effective samples do you have for each
parameter? and b) are you getting extreme category
problems/numerical issues? If you store the latent variables
(pl=TUE) what is their range for the Zi/za part?
My parallel run using the above starting values isn't finished yet.
a) After applying the above starting values I get, for the
location effects 1600-2000 samples for a 2000 sample chain
(with thin set to 50). G and R-structure are from 369
(za_children.idParents) to 716 (and 0 for the fixed part).
Effective sample sizes were similar for my run using the
starting values for G/R that I drew from rpois, and using 40
chains I of course get
b) I don't think I am getting extreme categories. I would
probably be getting extreme categories if I included the
forever-alones (they almost never have children), but this
way no.
I wasn't sure how to examine the range of the latents
separately for the za part, but for a single chain it looks
okay:
quantile(as.numeric(m1$Liab),probs = c(0,0.01,0,0.99,1))
0% 1% 0% 99% 100%
-4.934111 -1.290728 -4.934111 3.389847 7.484206
Well, all considered now that I use the above starting value
specification I get slightly different estimates for all
za-coefficients. Nothing major, but still leading me to
think my estimates aren't exactly independent of the starting
values I use. I'll see what the parallel run yields.
Thanks a lot,
Ruben
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Wed, 27 Aug
2014 19:23:42 +0200:
Hi Jarrod,
thanks again. I was able to get it running with your advice.
Some points of confusion remain:
- You wrote that zi/za models would return an error with
rcov = ~trait:units + starting values. This did not happen
in my case, so I didn't build MCMCglmm myself with your
suggested edits. Also, have you considered putting your own
MCMCglmm repo on Github? Your users would be able to
install pre-releases and I'd think you'd get some
time-saving pull requests too.
- In my attempts to get my models to run properly, I messed
up a prior and did not use fix=2 in my prior specification
for my za models. This led to crappy convergence, it's much
better now and for some of my simpler models I think I
won't need parallel chains. I'm reminded of Gelman's folk
theorem of statistical computing.
- I followed your advice, but of course I could not set the
true values as starting values, but wanted to set random,
bad starting values. I pasted below what I arrived at, I'm
especially unsure whether I specified the starting values
for G and R properly (I think not).
start <- list(
liab=c(rnorm( nrow(krmh.1)*2 )),
R = list(R1 = diag(rpois(2, 1)+1)),
G = list(G1 = diag(rpois(2, 1)+1))
)
However, even though I may not need multiple chains for
some of my simpler models, I've now run into conflicting
diagnostics. The geweke.diag for each chain (and
examination of the traces) gives
satisfactory diagnostics. Comparing multiple chains using
gelman.diag, however, leads to one bad guy, namely the
traitza_children:spouses interaction.
I think this implies that I've got some starting value
dependence for this parameter, that won't be easily
rectified through longer burnin?
Do you have any ideas how to rectify this?
I am currently doing sequential analyses on episodes of
selection and in historical human data only those who marry
have a chance at having kids. I exclude the unmarried
from my sample where I predict number of children, because
I examine that in a previous model and the zero-inflation
(65% zeros, median w/o unmarried = 4) when including the
unmarried is so excessive.
Number of spouses is easily the strongest predictor in the
model, but only serves as a covariate here. Since my other
estimates are stable across chains and runs and agree well
across models and with theory, I'm
inclined to shrug this off. But probably I shouldn't ignore
this sign of non-convergence?
Potential scale reduction factors:
Point est. Upper C.I.
(Intercept) 1.00 1.00
traitza_children 1.27 1.39
male 1.00 1.00
spouses 1.00 1.00
paternalage.mean 1.00 1.00
paternalage.factor(25,30] 1.00 1.00
paternalage.factor(30,35] 1.00 1.00
paternalage.factor(35,40] 1.00 1.00
paternalage.factor(40,45] 1.00 1.00
paternalage.factor(45,50] 1.00 1.00
paternalage.factor(50,55] 1.00 1.00
paternalage.factor(55,90] 1.00 1.00
traitza_children:male 1.22 1.32
traitza_children:spouses 1.83 2.13
traitza_children:paternalage.mean 1.02 1.02
traitza_children:paternalage.factor(25,30] 1.03 1.05
traitza_children:paternalage.factor(30,35] 1.05 1.08
traitza_children:paternalage.factor(35,40] 1.10 1.15
traitza_children:paternalage.factor(40,45] 1.12 1.17
traitza_children:paternalage.factor(45,50] 1.19 1.28
traitza_children:paternalage.factor(50,55] 1.12 1.18
traitza_children:paternalage.factor(55,90] 1.11 1.17
Multivariate psrf
7.27
Best regards,
Ruben
On 26 Aug 2014, at 13:04, Jarrod Hadfield
<j.hadfield at ed.ac.uk> wrote:
Hi Ruben,
There are 400 liabilities in a zapoisson model (2 per
datum). This code should work:
g <-sample(letters[1:10], size = 200, replace = T)
pred <- rnorm(200)
l1<-rnorm(200, -1, sqrt(1))
l2<-rnorm(200, -1, sqrt(1))
y<-VGAM::rzapois(200, exp(l1), exp(-exp(l2)))
# generate zero-altered data with an intercept of -1
(because the intercept and variance are the same for both
processes this is just standard Poisson)
dat<-data.frame(y=y, g = g, pred = pred)
start.1<-list(liab=c(l1,l2), R = list(R1=diag(2)),
G=list(G1=diag(2)))
prior.1<-list(R=list(V=diag(2), nu=1.002, fix=2),
G=list(G1=list(V=diag(2), nu=2, alpha.mu=c(0,0),
alpha.V=diag(2)*1000)))
m1<-MCMCglmm(y~trait + pred:trait, random=~us(trait):g,
family="zapoisson",rcov=~idh(trait):units, data=dat,
prior=prior.1, start= start.1)
However, there are 2 bugs in the current version of
MCMCglmm that return an error message when the
documentation implies it should be fine:
a) it should be possible to have R=diag(2) rather than R =
list(R1=diag(2)). This bug cropped up when I implemented
block-diagonal R structures. It can be fixed by inserting:
if(!is.list(start$R)){
start$R<-list(R1=start$R)
}
on L514 of MCMCglmm.R below
if(!is.list(prior$R[[1]])){
prior$R<-list(R1=prior$R)
}
b) rcov=~trait:units models for zi/za models will return
an error when passing starting values. To fix this insert
if(diagR==3){
if(dim(start)[1]!=1){
stop("V is the wrong dimension for some
strart$G/start$R elements")
}
start<-diag(sum(nfl))*start[1]
}
on L90 of priorfromat.R below
if(is.matrix(start)==FALSE){
start<-as.matrix(start)
}
I will put these in the new version.
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25
Aug 2014 21:52:30 +0200:
Hi Jarrod,
thanks for these pointers.
You will need to provide over-dispersed starting values
for multiple-chain convergence diagnostics to be useful
(GLMM are so simple I am generally happy if the output
of a single run looks reasonable).
Oh, I would be happy with single chains, but since
computation would take weeks this way, I wanted to
parallelise and I would use the multi-chain convergence
as a criterion that my parallelisation was proper
and is as informative as a single long chain. There don't
seem to be any such checks built-in ? I was analysing my
40 chains for a bit longer than I like to admit until I
noticed they were identical (effectiveSize
and summary.mcmc.list did not yell at me for this).
# use some very bad starting values
I get that these values are bad, but that is the goal for
my multi-chain aim, right?
I can apply this to my zero-truncated model, but am again
getting stuck with the zero-altered one.
Maybe I need only specify the Liab values for this?
At least I'm getting nowhere with specifying R and G
starting values here. When I got an error, I always
went to the MCMCglmm source to understand why the checks
failed, but I didn't always understand
what was being checked and couldn't get it to work.
Here's a failing example:
l<-rnorm(200, -1, sqrt(1))
t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
g = sample(letters[1:10], size = 200, replace = T)
pred = rnorm(200)
y<-rpois(200,exp(l)-t)
y[1:40] = 0
# generate zero-altered data with an intercept of -1
dat<-data.frame(y=y, g = g, pred = pred)
set.seed(1)
start_true = list(Liab=l, R = 1, G = 1 )
m1<-MCMCglmm(y~1 + pred,random = ~ g,
family="zapoisson",rcov=~us(trait):units, data=dat,
start= start_true)
# use true latent variable as starting values
set.seed(1)
# use some very bad starting values
start_rand = list(Liab=rnorm(200), R = rpois(1, 1)+1, G =
rpois(1, 1)+1 )
m2<-MCMCglmm(y~1 + pred,random = ~
g,rcov=~us(trait):units, family="zapoisson", data=dat,
start = start_rand)
Best,
Ruben
On 25 Aug 2014, at 18:29, Jarrod Hadfield
<j.hadfield at ed.ac.uk> wrote:
Hi Ruben,
Sorry - I was wrong when I said that everything is
Gibbs sampled conditional on the latent variables. The
location effects (fixed and random effects) are also
sampled conditional on the (co)variance components so
you should add them to the starting values. In the case
where the true values are used:
m1<-MCMCglmm(y~1, family="ztpoisson", data=dat,
start=list(Liab=l,R=1))
Cheers,
Jarrod
Quoting Jarrod Hadfield <j.hadfield at ed.ac.uk> on Mon, 25
Aug 2014 17:14:14 +0100:
Hi Ruben,
You will need to provide over-dispersed starting values
for multiple-chain convergence diagnostics to be useful
(GLMM are so simple I am generally happy if the output
of a single run looks reasonable).
With non-Gaussian data everything is Gibbs sampled
conditional on the latent variables, so you only need
to pass them:
l<-rnorm(200, -1, sqrt(1))
t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
y<-rpois(200,exp(l)-t)+1
# generate zero-truncated data with an intercept of -1
dat<-data.frame(y=y)
set.seed(1)
m1<-MCMCglmm(y~1, family="ztpoisson", data=dat,
start=list(Liab=l))
# use true latent variable as starting values
set.seed(1)
m2<-MCMCglmm(y~1, family="ztpoisson", data=dat,
start=list(Liab=rnorm(200)))
# use some very bad starting values
plot(mcmc.list(m1$Sol, m2$Sol))
# not identical despite the same seed because of
different starting values but clearly sampling the same
posterior distribution:
gelman.diag(mcmc.list(m1$Sol, m2$Sol))
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25
Aug 2014 18:00:08 +0200:
Dear Jarrod,
thanks for the quick reply. Please, don't waste time
looking into doMPI ? I am happy that I
get the expected result, when I specify that
reproducible seed, whyever that may be.
I'm pretty sure that is the deciding factor, because I
tested it explicitly, I just have no idea
how/why it interacts with the choice of family.
That said, is setting up different RNG streams for my
workers (now that it works) __sufficient__
so that I get independent chains and can use
gelman.diag() for convergence diagnostics?
Or should I still tinker with the starting values myself?
I've never found a worked example of supplying
starting values and am thus a bit lost.
Sorry for sending further questions, I hope someone
else takes pity while
you're busy with lectures.
Best wishes
Ruben
On 25 Aug 2014, at 17:29, Jarrod Hadfield
<j.hadfield at ed.ac.uk> wrote:
Hi Ruben,
I do not think the issue is with the starting values,
because even if the same starting values were used
the chains would still differ because of the
randomness in the Markov Chain (if I interpret your
`identical' test correctly). This just involves a
call to GetRNGstate() in the C++ code (L 871
ofMCMCglmm.cc) so I think for some reason
doMPI/foreach is not doing what you expect. I am not
familiar with doMPI and am in the middle of writing
lectures so haven't got time to look into it
carefully. Outside of the context of doMPI I get the
behaviour I expect:
l<-rnorm(200, -1, sqrt(1))
t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
y<-rpois(200,exp(l)-t)+1
# generate zero-truncated data with an intercept of -1
dat<-data.frame(y=y)
set.seed(1)
m1<-MCMCglmm(y~1, family="ztpoisson", data=dat)
set.seed(2)
m2<-MCMCglmm(y~1, family="ztpoisson", data=dat)
set.seed(2)
m3<-MCMCglmm(y~1, family="ztpoisson", data=dat)
plot(mcmc.list(m1$Sol, m2$Sol))
# different, as expected
plot(mcmc.list(m2$Sol, m3$Sol))
# the same, as expected
Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon,
25 Aug 2014 16:58:06 +0200:
Dear list,
sorry for bumping my old post, I hope to elicit a
response with a more focused question:
When does MCMCglmm automatically start from
different values when using doMPI/foreach?
I have done some tests with models of varying
complexity. For example, the script in my last
post (using "zapoisson") yielded 40 identical chains:
identical(mcmclist[1], mcmclist[30])
TRUE
A simpler (?) model (using "ztpoisson" and no
specified prior), however, yielded different chains
and I could use them to calculate gelman.diag()
Changing my script to the version below, i.e.
seeding foreach using .options.mpi=list( seed= 1337)
so as to make RNGstreams reproducible (or so I
thought), led to different chains even for the
"zapoisson" model.
In no case have I (successfully) tried to supplant
the default of MCMCglmm's "start" argument.
Is starting my models from different RNGsubstreams
inadequate compared to manipulating
the start argument explicitly? If so, is there any
worked example of explicit starting value manipulation
in parallel computation?
I've browsed the MCMCglmm source to understand how
the default starting values are generated,
but didn't find any differences with respect to RNG
for the two families "ztpoisson" and "zapoisson"
(granted, I did not dig very deep).
Best regards,
Ruben Arslan
# bsub -q mpi -W 12:00 -n 41 -R np20 mpirun -H
localhost -n 41 R --slave -f
"/usr/users/rarslan/rpqa/rpqa_main/rpqa_children_parallel.R"
library(doMPI)
cl <-
startMPIcluster(verbose=T,workdir="/usr/users/rarslan/rpqa/rpqa_main/")
registerDoMPI(cl)
Children_mcmc1 =
foreach(i=1:clusterSize(cl),.options.mpi =
list(seed=1337) ) %dopar% {
library(MCMCglmm);library(data.table)
load("/usr/users/rarslan/rpqa/rpqa1.rdata")
nitt = 130000; thin = 100; burnin = 30000
prior.m5d.2 = list(
R = list(V = diag(c(1,1)), nu = 0.002),
G=list(list(V=diag(c(1,1e-6)),nu=0.002))
)
rpqa.1 = na.omit(rpqa.1[spouses>0, list(idParents,
children, male, urban, spouses, paternalage.mean,
paternalage.factor)])
(m1 = MCMCglmm( children ~ trait * (male + urban +
spouses + paternalage.mean + paternalage.factor),
rcov=~us(trait):units,
random=~us(trait):idParents,
family="zapoisson",
prior = prior.m5d.2,
data=rpqa.1,
pr = F, saveX = F, saveZ = F,
nitt=nitt,thin=thin,burnin=burnin))
}
library(coda)
mcmclist =
mcmc.list(lapply(Children_mcmc1,FUN=function(x) {
x$Sol}))
save(Children_mcmc1,mcmclist, file =
"/usr/users/rarslan/rpqa/rpqa_main/rpqa_mcmc_kids_za.rdata")
closeCluster(cl)
mpi.quit()
On 04 Aug 2014, at 20:25, Ruben Arslan
<rubenarslan at gmail.com> wrote:
Dear list,
would someone be willing to share her or his
efforts in parallelising a MCMCglmm analysis?
I had something viable using harvestr that seemed
to properly initialise
the starting values from different random number
streams (which is desirable,
as far as I could find out), but I ended up being
unable to use harvestr, because
it uses an old version of plyr, where
parallelisation works only for multicore, not for
MPI.
I pasted my working version, that does not do
anything about starting values or RNG
at the end of this email. I can try to fumble
further in the dark or try to update harvestr,
but maybe someone has gone through all this already.
I'd also appreciate any tips for elegantly
post-processing such parallel data, as some of my
usual
extraction functions and routines are hampered by
the fact that some coda functions
do not aggregate results over chains. (What I get
from a single-chain summary in MCMCglmm
is a bit more comprehensive, than what I managed to
cobble together with my own extraction
functions).
The reason I'm parallelising my analyses is that
I'm having trouble getting a good effective
sample size for any parameter having to do with the
many zeroes in my data.
Any pointers are very appreciated, I'm quite
inexperienced with MCMCglmm.
Best wishes
Ruben
# bsub -q mpi-short -W 2:00 -n 42 -R np20 mpirun -H
localhost -n 41 R --slave -f
"rpqa/rpqa_main/rpqa_children_parallel.r"
library(doMPI)
cl <- startMPIcluster()
registerDoMPI(cl)
Children_mcmc1 = foreach(i=1:40) %dopar% {
library(MCMCglmm)
load("rpqa1.rdata")
nitt = 40000; thin = 100; burnin = 10000
prior = list(
R = list(V = diag(c(1,1)), nu = 0.002),
G=list(list(V=diag(c(1,1e-6)),nu=0.002))
)
MCMCglmm( children ~ trait -1 +
at.level(trait,1):male + at.level(trait,1):urban +
at.level(trait,1):spouses +
at.level(trait,1):paternalage.mean +
at.level(trait,1):paternalage.factor,
rcov=~us(trait):units,
random=~us(trait):idParents,
family="zapoisson",
prior = prior,
data=rpqa.1,
pr = F, saveX = T, saveZ = T,
nitt=nitt,thin=thin,burnin=burnin)
}
library(coda)
mcmclist =
mcmc.list(lapply(Children_mcmc1,FUN=function(x) {
x$Sol}))
save(Children_mcmc1,mcmclist, file =
"rpqa_mcmc_kids_za.rdata")
closeCluster(cl)
mpi.quit()
--
Ruben C. Arslan
Georg August University G?ttingen
Biological Personality Psychology and Psychological
Assessment
Georg Elias M?ller Institute of Psychology
Go?lerstr. 14
37073 G?ttingen
Germany
Tel.: +49 551 3920704
https://psych.uni-goettingen.de/en/biopers/team/arslan
[[alternative HTML version deleted]]
--
The University of Edinburgh is a charitable body,
registered in
Scotland, with registration number SC005336.
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.