Multiple imputations : wicked dataset ? Wicked computers ? Am I cursed ? (or stupid ?)

3 messages · Emmanuel Charpentier, Gerard M. Keogh

Wed, Apr 22, 2009 3:31 PM #

Dear list,

I'd like to use multiple imputations to try and save a somewhat badly
mangled dataset (lousy data collection, worse than lousy monitoring, you
know that drill... especially when I am consulted for the first time
about one year *after* data collection).

My dataset has 231 observations of 53 variables, of which only a very
few has no missing data. Most variables have 5-10% of missing data, but
the whole dataset has only 15% complete cases (40% when dropping the 3
worst cases, which might be regained by other means).

To try my hand at multiple imputation, I created a small extract of the
original dataset :

DataTest<-Data2[c("ctr","cadredp","persmob","trea","taller","tbox","nbpradio","nbpventil","sitrea")]

[1] 231   9

ctr   cadredp   persmob      trea    taller      tbox  nbpradio
nbpventil 
 "factor"  "factor" "numeric" "numeric" "numeric" "numeric" "numeric"
"numeric" 
   sitrea 
"ordered"

ctr   cadredp   persmob      trea    taller      tbox  nbpradio
nbpventil 
        0         5        17        13         5        17        27
60 
   sitrea 
        0

[1] 147

0   1   2   3   4   5 
147  45  22  14   2   1

I tried five different packages for generating multiple imputations in
order to assess their respective ease of use. I am aware of the
differences in algorithms, but I have (currently) no reason to favor or
exclude one.

*No* *single* *one* was able to create but one imputation set !

Suspecting a possible hardware/installation problem, I repeated the same
sequence on another computer ... with identical results.

Here are the gory details (after each try, I left R with q("no") and
restarted a fresh session.  ESS is your friend...) :

Common characteristics :

R version 2.9.0 (2009-04-17) 
x86_64-pc-linux-gnu 

locale:
LC_CTYPE=fr_FR.UTF-8;LC_NUMERIC=C;LC_TIME=fr_FR.UTF-8;LC_COLLATE=fr_FR.UTF-8;LC_MONETARY=C;LC_MESSAGES=fr_FR.UTF-8;LC_PAPER=fr_FR.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=fr_FR.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] datasets  grDevices graphics  stats     utils     methods
base     

other attached packages:
[1] MASS_7.2-46        boot_1.2-36        lme4_0.999375-28
Matrix_0.999375-25
[5] odfWeave_0.7.10    XML_2.3-0          lattice_0.17-22   

loaded via a namespace (and not attached):
[1] grid_2.9.0  nlme_3.1-90

Note : R has been recently obtained from the CRAN repository with Ubuntu
8.10. The packages are mostly hand compiled (install.packages(...)).

Frank Harrel's aregImpute (in package Hmisc) :

+tbox+nbpradio+nbpventil+sitrea,data=DataTest,n.impute=5))
Erreur dans matxv(X, xcof) : columns in a (35) must be <= length of b
(33)
De plus : Warning message:
In f$xcoef[, 1] * f$xcenter :
  la taille d'un objet plus long n'est pas multiple de la taille d'un
objet plus court
Timing stopped at: 0.028 0 0.028

Examples #1 and #2 of the manual gave expected results.

Package "mi" (Andrew Gelman and a Columbia team, early development
(version=0.06.something), but manual advertises nice features) :

Erreur dans dimnames(AveVar) <- list(NULL, NULL, c(paste("mean(",
sim.varnames,  : 
  la longueur de 'dimnames' [3] n'est pas ?gale ? l'?tendue du tableau
Timing stopped at: 0.08 0 0.08

"textbook example" (example for function mi) : did not converge after6
iterations. Did not converge after 50 iterations.

Package "Amelia" (Gary King and, apparently, the Zelig gang) :

+
ords=names(DataTest)[sapply(DataTest,
+                                       data.class)=="ordered"],
+
noms=names(DataTest)[sapply(DataTest,
+                                       data.class)=="factor"]))
-- Imputation 1 --

 1  2 Erreur dans am.inv(a = g11) : 
  le mineur dominant d'ordre 22 n'est pas d?fini positif
De plus : Warning message:
In amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors,  : 

The number of catagories in one of the variables marked nominal has
greater than 10 categories. Check nominal specification.


Timing stopped at: 0.048 0 0.048

Note : various repeats crashed at various points (one of tre trials
started to simulate, and stopped at obs 177...), but the numerical
accuracy seemed involved every time.

The "textbook example" (example for amelia function) runs perfectly.

Package "mice" (Stef van Buuren & Karin Groothuis-Oudshoorn) :

iter imp variable
  1   1  cadredpErreur dans mice.impute.logreg(c(1L, 2L, 2L, 1L, 1L, 2L,
2L, 1L, 2L, 2L,  : 
  dims [produit 28] ne correspond pas ? la longueur de l'objet [31]
De plus : Warning message:
In beta + rv %*% rnorm(ncol(rv)) :
  la taille d'un objet plus long n'est pas multiple de la taille d'un
objet plus court
Timing stopped at: 0.092 0 0.094

The "textbook example" (example for mice) runs almost perfectly
(warnings in the second example)

Package "mix" (Joseph Schafer, maintained by Brian Ripley) :

user  system elapsed 
  0.012   0.000   0.009 
Warning messages:
1: In storage.mode(w) <- "integer" :
  NAs introduits lors de la conversion automatique
2: In max(w[!is.na(w[, j]), j]) :
  aucun argument pour max ; -Inf est renvoy?
3: NAs introduits lors de la conversion automatique 
4: In max(w[!is.na(w[, j]), j]) :
  aucun argument pour max ; -Inf est renvoy?
5: NAs introduits lors de la conversion automatique 
6: In max(w[!is.na(w[, j]), j]) :
  aucun argument pour max ; -Inf est renvoy?
7: NAs introduits lors de la conversion automatique

Erreur dans rep(prior, s$ncells) : argument 'times' incorrect
Timing stopped at: 0 0 0 
### indeed : s$ncells is NA !
### Therefore, the rest crashes :

Erreur dans rep(prior, s$ncells) : argument 'times' incorrect
Timing stopped at: 0 0 0

Erreur dans imp.mix(s, newtheta) : objet 'newtheta' introuvable
Timing stopped at: 0.004 0 0.001

The "textbook example" (example for da.mix) runs perfectly (and fast !).

=========================================================================

I might be particularly stupid and misunderstanding manual and the
"textbook" examples of one or two packages, but five !

Visual/manual/graphical examination of my dataset does not suggest an
explanation.

I am not aware of anything "fishy" in my setup (one of the machines is a
bit aging, but the one used for the reported tests is almost new.

Could someone offer an explanation ? Or must I recourse to sacrifying a
black goat at midnight next new moon ?

Any hint will be appreciated (and by the way, next new moon is in about
3 days...).

					Emmanuel Charpentier

4 days later

Emmanuel Charpentier

Mon, Apr 27, 2009 12:49 PM #

Answering to myself (for future archive users' sake), more to come
(soon) :

Le jeudi 23 avril 2009 ? 00:31 +0200, Emmanuel Charpentier a ?crit :

[ Big snip ... ]

It turns out that my problems were caused by ... the dataset. Two very
important variables (i. e. of strong influence on the outcomes and
proxies) are ill-distributed :
- one is a modus operandi (two classes)
- the second is center (23 classes, alas...)

My data are quite ill-distributed : some centers have contributed a
large number of observations, some other very few. Furthermore, while
few variables are quite badly known, the "missingness pattern" is such
as :
- some centers have no directly usable information (= complete cases)
under one of the modi operandi
- some other have no complete case at all...

Therefore, any model-based prediction method using the whole dataset
(recommended for multiple imputations, since one should not use for
inference a richer set of data than what was imputed (seen this
statement in a lot of references)) fails miserably.

Remembering some fascinating readings (incl. V&R) and an early (20 years
ago) excursion in AI (yes, did that, didn't even got the T-shirt...), I
have attempted (with some success) to use recursive partitioning for
prediction. This (non-)model has some very interestind advantages in my
case :
- model-free
- distribution-free (quite important here : you should see my density
curves... and I won't speak about the outliers !)
- handles missing data gracefully (almost automagically)
- automatic selection and ranking of the pertinent variables
- current implementation in R has some very nice features, such as
surrogate splits if a value is missing, auto-pruning by
cross-validation, etc ...

It has also some drawbacks :
- no (easy) method for inference
- not easy to abstract (you can't just publish an ANOVA table and a
couple of p-values...)
- no "well-established" (i. e. acknowledged by journal reviewers) =>
difficult to publish

These latter point do not bother me in my case. So I attempted to use
this for imputation.

Since mice is based on a "chained equations" approach and allows the
end-user to write its own imputation functions, I wrote a set of such
imputers to be called within the framework of the Gibbs sampler. They
proceed as follow :
- build a regression or classification tree of the relevant variable
using the rest of the dataset
- predict the relevant variable for *all* the dataset,
- compute "residuals" from known values of the relevant variable and
their prediction
- impute values to missing data as prediction + a random residual.

It works. It's a tad slower than prediction using
normal/logistic/multinomial modelling (about a factor of 3, but for y
first trial, I attempted to err on the side of excessive precision ==>
deeper trees). It does not exhibit any "obvious" statistical
misfeatures.

But I have questions :

1) What is known of such imputation by regression/classification trees
(aka recursive partitionning) ? A quick research didn't turn up very
much : the idea has been evoked here and there, but I am not aware of
any published solution. In particular, I have no knowledge of any
theoretical (i. e. probability) wotrk on their properties ?

2) Where could I find published datasets having been used to validate
other imputation methods ?

3) Do you think that these functions should be published ?

Sincerely yours,

					Emmanuel Charpentier

PS :

The goat has endured the fear of her life, but is still alive... will
probably start worshipping the Moon, however.

Gerard M. Keogh

Tue, Apr 28, 2009 2:25 AM #

Emmanuel,

Friedman's (Annals of Stats 1991) MARS program implements recursive
partitioning in a regression context - a version of it written by Trevor
Hastie was available in R but I don't know what package it's now in - I
only have base stuff available (long story).

MARS, like recursive partitioning is a data exploration tool that builds up
an approximation to a nonlinear regression function using piecewise
regression splines. Each splines is split and replaced by a pair and the
GCV score computed - if the split reduces the GCV then the split is
accepted - in this way the method is adaptive.

MARS is very powerful and was used for  time  series research by LEwis &
Bonnie Ray (JASA 1991) - Bonnie has later papers as well. The main flaw
with MARS and I suppose a key reason why it doesn't feature more is that
there is no physical/biological underlying model that the researcher in
trying to make sense of - MARS just finds the best curve. Interpretation of
the result can therefore be a problem. However, MARS does provide an
"anova" type decomposition of the curve and this can certainly help in
making sense of the underlying relationships.

To use it (or related methods such as Generalised Additive Models GAMs) for
imputation then is a question of taste. If you're happy that the regression
curve is sufficient explanation then MARS is worth looking at - if you want
to know more about the physical model, well ...
Finally, MARS will treat all missing data as missing at random so if there
are specific conditional effects there have to be included as categorical
predictors a priori. As MARS is based on least squares it's only optimal
for Gaussian errors - it can be used on categorical data as well - another
variation called PolyMARS also implements MARS for categorical/multinomial
data.

Hope this is of interest!

Gerard







                                                                           
             Emmanuel                                                      
             Charpentier                                                   
             <charpent at bacbuc.                                          To 
             dyndns.org>               r-help at stat.math.ethz.ch            
             Sent by:                                                   cc 
             r-help-bounces at r-                                             
             project.org                                           Subject 
                                       Re: [R] Multiple imputations :      
                                       wicked dataset. Need advice for     
             27/04/2009 20:49          follow-up to a possible solution.   
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           




Answering to myself (for future archive users' sake), more to come
(soon) :

Le jeudi 23 avril 2009 ? 00:31 +0200, Emmanuel Charpentier a ?crit :

[ Big snip ... ]

It turns out that my problems were caused by ... the dataset. Two very
important variables (i. e. of strong influence on the outcomes and
proxies) are ill-distributed :
- one is a modus operandi (two classes)
- the second is center (23 classes, alas...)

My data are quite ill-distributed : some centers have contributed a
large number of observations, some other very few. Furthermore, while
few variables are quite badly known, the "missingness pattern" is such
as :
- some centers have no directly usable information (= complete cases)
under one of the modi operandi
- some other have no complete case at all...

Therefore, any model-based prediction method using the whole dataset
(recommended for multiple imputations, since one should not use for
inference a richer set of data than what was imputed (seen this
statement in a lot of references)) fails miserably.

Remembering some fascinating readings (incl. V&R) and an early (20 years
ago) excursion in AI (yes, did that, didn't even got the T-shirt...), I
have attempted (with some success) to use recursive partitioning for
prediction. This (non-)model has some very interestind advantages in my
case :
- model-free
- distribution-free (quite important here : you should see my density
curves... and I won't speak about the outliers !)
- handles missing data gracefully (almost automagically)
- automatic selection and ranking of the pertinent variables
- current implementation in R has some very nice features, such as
surrogate splits if a value is missing, auto-pruning by
cross-validation, etc ...

It has also some drawbacks :
- no (easy) method for inference
- not easy to abstract (you can't just publish an ANOVA table and a
couple of p-values...)
- no "well-established" (i. e. acknowledged by journal reviewers) =>
difficult to publish

These latter point do not bother me in my case. So I attempted to use
this for imputation.

Since mice is based on a "chained equations" approach and allows the
end-user to write its own imputation functions, I wrote a set of such
imputers to be called within the framework of the Gibbs sampler. They
proceed as follow :
- build a regression or classification tree of the relevant variable
using the rest of the dataset
- predict the relevant variable for *all* the dataset,
- compute "residuals" from known values of the relevant variable and
their prediction
- impute values to missing data as prediction + a random residual.

It works. It's a tad slower than prediction using
normal/logistic/multinomial modelling (about a factor of 3, but for y
first trial, I attempted to err on the side of excessive precision ==>
deeper trees). It does not exhibit any "obvious" statistical
misfeatures.

But I have questions :

1) What is known of such imputation by regression/classification trees
(aka recursive partitionning) ? A quick research didn't turn up very
much : the idea has been evoked here and there, but I am not aware of
any published solution. In particular, I have no knowledge of any
theoretical (i. e. probability) wotrk on their properties ?

2) Where could I find published datasets having been used to validate
other imputation methods ?

3) Do you think that these functions should be published ?

Sincerely yours,

                                                             Emmanuel
Charpentier

PS :

The goat has endured the fear of her life, but is still alive... will
probably start worshipping the Moon, however.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

**********************************************************************************
The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer. It is the policy of the Department of Justice, Equality and Law Reform and the Agencies and Offices using its IT services to disallow the sending of offensive material.
Should you consider that the material contained in this message is offensive you should contact the sender immediately and also mailminder[at]justice.ie.

Is le haghaidh an duine n? an eintitis ar a bhfuil s? d?rithe, agus le haghaidh an duine n? an eintitis sin amh?in, a bhearta?tear an fhaisn?is a tarchuireadh agus f?adfaidh s? go bhfuil ?bhar faoi r?n agus/n? faoi phribhl?id inti. Toirmisctear aon athbhreithni?, atarchur n? leathadh a dh?anamh ar an bhfaisn?is seo, aon ?s?id eile a bhaint aisti n? aon ghn?omh a dh?anamh ar a hiontaoibh, ag daoine n? ag eintitis seachas an faighteoir beartaithe. M? fuair t? ? seo tr? dhearmad, t?igh i dteagmh?il leis an seolt?ir, le do thoil, agus scrios an t-?bhar as aon r?omhaire. Is ? beartas na Roinne Dl? agus Cirt, Comhionannais agus Athch?irithe Dl?, agus na nOif?g? agus na nGn?omhaireachta? a ?s?ideann seirbh?s? TF na Roinne, seoladh ?bhair chol?il a dh?chead?.
M?s rud ? go measann t? gur ?bhar col?il at? san ?bhar at? sa teachtaireacht seo is ceart duit dul i dteagmh?il leis an seolt?ir l?ithreach agus le mailminder[ag]justice.ie chomh maith.
***********************************************************************************