missing values

10 messages · Jonathan Baron, falissard, Giordano Sanchez +3 more


Giordano Sanchez

Sun, Apr 24, 2005 3:15 AM #

Hello,

I have several years of climatic data with many missing values. I would like 
to know which tools in R are best suited to estimating these missing values. 
(I am new to R and fairly new to statistics.)

Thanks,

G

Sun, Apr 24, 2005 4:06 AM #

Turns out that this is not a simple question.  Depending on what
you want to do, some statistical methods will just deal with
missing data and use what is available, in different ways, e.g.,
cor().  For other purposes, you might want to "impute" (fill in)
the missing values, and then there are many ways to do this,
depending on what else you have (correlated variables?) and what
assumptions you are willing to make.  Two methods (among many)
that I have found useful are in aregImpute() and transcan(), both
in the Hmisc package.
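
As a small illustration of the first point (methods that "just deal with" missing data), base R's cor() can either drop incomplete rows or use every available pair, depending on its `use` argument. The data frame below is made up purely for the example; it is a sketch, not a recommendation for any particular data set.

```r
# Made-up data: three variables with a few missing values (illustrative only)
d <- data.frame(
  a = c(1, 2, NA, 4, 5, 6),
  b = c(2, 4, 6, NA, 10, 12),
  c = c(6, 5, 4, 3, NA, 1)
)

# Default use = "everything": any pair involving an NA yields NA
r_default <- cor(d)

# Listwise deletion: only rows with no NA in any column are used
r_complete <- cor(d, use = "complete.obs")

# Pairwise deletion: each correlation uses all rows complete for that pair
r_pairwise <- cor(d, use = "pairwise.complete.obs")

print(r_default)
print(r_complete)
print(r_pairwise)
```

Pairwise deletion keeps more data, but each correlation may then rest on a different subset of rows, which is one reason imputation is sometimes preferred.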

To learn more, see my R search page:
http://finzi.psych.upenn.edu/

and I also have an example of aregImpute() in 
http://www.psych.upenn.edu/~baron/rpsych/rpsych.html

but see the help files first.

I found the following article very helpful when I was a beginner
with respect to this topic (which is still close to true):

Schafer, J. L., & Graham, J. W. (2002).  Missing data: Our view
of the state of the art.  Psychological Methods, 7, 147-177.

Jon

Jonathan Baron, Professor of Psychology, University of Pennsylvania
Home page: http://www.sas.upenn.edu/~baron

falissard

Sun, Apr 24, 2005 7:20 AM #

Hello,

The mice package http://web.inter.nl.net/users/S.van.Buuren/mi/hmtl/mice.htm
is also potentially interesting.
It works with R 1.9 but not always with newer versions.
Best regards,

Bruno

------------------------------------------------------------------------
Bruno Falissard
Département de santé publique
Hôpital Paul Brousse
14 Avenue Paul Vaillant Couturier
94804 Villejuif cedex, France
tel : (+33) 6 81 82 70 76
fax : (+33) 1 45 59 34 18
web site : http://perso.wanadoo.fr/bruno.falissard/
------------------------------------------------------------------------

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html

Giordano Sanchez

Tue, Apr 26, 2005 2:58 AM #

Hello,

Thanks for the instructive responses. But two questions arise.
First of all, I can't manage to load the library "mice".
I'm using R 2.0.1 on Debian.
I tried just copying the package into my library, /usr/lib/R/library,
but when I do
    > library()
    ...
    mice       ** No title available (pre-2.0.0 install?) **
    ...
and when I do
    > library(mice)
    Error in library(mice) : 'mice' is not a valid package -- installed < 2.0.0?
    >

The second question is more statistical:
aregImpute() seems to give good results, but I would like to compare the 
different methods not just graphically. Is that possible?
I also have other meteorological stations whose data are correlated with the 
data of the station I'm using. Can I use those data to improve my imputation 
method?

Regards,

Giordano
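
On the loading error: copying a package folder into the library by hand skips R's install step, which is a likely (though not confirmed in this thread) cause of the "not a valid package" message. A minimal sketch of installing through R itself follows. The helper name `ensure_package` is made up for this example, and it assumes the package is available from a repository such as CRAN for the R version in use; a downloaded source archive can instead be installed from the shell with `R CMD INSTALL`.

```r
# Sketch: load a package, installing it through R's own mechanism if needed.
# ensure_package() is a hypothetical helper, not part of R or of mice.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Needs network access and a configured repository (assumption)
    install.packages(pkg)
  }
  # TRUE if the package can now be loaded, FALSE otherwise
  requireNamespace(pkg, quietly = TRUE)
}

# Directories where R looks for installed packages
print(.libPaths())

# Usage (commented out because it may download from a repository):
# ensure_package("mice")
# library(mice)
```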

falissard

Tue, Apr 26, 2005 3:25 AM #

Hello,
In my experience, mice works fine with R 1.9 but not necessarily with newer
versions...
Bruno

----------------------------------------------------------------------------
Bruno Falissard
INSERM U669, PSIGIAM
"Paris Sud Innovation Group in Adolescent Mental Health"
Maison de Solenn
97 Boulevard de Port Royal
75679 Paris cedex 14, France
tel : (+33) 6 81 82 70 76
fax : (+33) 1 45 59 34 18
web site : http://perso.wanadoo.fr/bruno.falissard/
----------------------------------------------------------------------------
 

Tue, Apr 26, 2005 3:44 AM #

On 04/26/05 09:58, Giordano Sanchez wrote:

Hello,
 
 Thanks for the instructive responses. But two questions arise.
 First of all, I can't manage to load the library "mice".
 I'm using R 2.0.1 on Debian.

The package called norm also has functions for missing data.
When I tried it, the values it gave were not sensible for my
problem, but I may have done something wrong.  (This was a simple 
problem that did not involve multiple imputation.)
 
 The second question is more statistical:
 aregImpute() seems to give good results, but I would like to compare the
 different methods not just graphically. Is that possible?

What different methods?  Compare how?  Are you assuming that we
remember your last post?

 I also have other meteorological stations whose data are correlated with the
 data of the station I'm using. Can I use those data to improve my imputation
 method?

This sounds like exactly what aregImpute() is good for, or
transcan(), depending on whether you need to make inferences (and 
hence do multiple imputation).
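
One way to compare imputation methods numerically rather than graphically is to hide some values that are actually known, impute them, and score each method against the hidden truth. The sketch below uses only base R and simulated data: the two "stations", the 20% masking rate, and the two simple methods (mean fill-in and regression on a neighbouring station) are all assumptions made for illustration. They stand in for whatever methods are being compared and are not aregImpute() or transcan() themselves.

```r
set.seed(1)

# Simulated temperatures at a target station and a correlated neighbour
n <- 200
neighbour <- rnorm(n, mean = 15, sd = 5)
target_true <- 2 + 0.9 * neighbour + rnorm(n, sd = 1)

# Hide 20% of the known target values so the imputations can be scored
hidden <- sample(n, size = 40)
target <- target_true
target[hidden] <- NA

# Method A: fill in with the mean of the observed target values
imp_mean <- rep(mean(target, na.rm = TRUE), length(hidden))

# Method B: regression on the neighbouring station (lm drops the NA rows)
fit <- lm(target ~ neighbour)
imp_reg <- predict(fit, newdata = data.frame(neighbour = neighbour[hidden]))

# Root mean squared error against the values that were hidden
rmse <- function(est, truth) sqrt(mean((est - truth)^2))
rmse_mean <- rmse(imp_mean, target_true[hidden])
rmse_reg  <- rmse(imp_reg,  target_true[hidden])

print(c(mean = rmse_mean, regression = rmse_reg))
```

The same masking scheme can score any method that returns filled-in values. Note that a single regression fill-in like Method B understates uncertainty, which is why multiple imputation is the usual choice when inferences are needed.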

Jon


Frank E Harrell Jr

Tue, Apr 26, 2005 4:24 AM #

Jonathan Baron wrote:

For those interested I have preprints of a paper comparing MICE, 
aregImpute, and transcan on the basis of simulations.

Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University

Tue, Apr 26, 2005 4:54 AM #

On 26-Apr-05 Jonathan Baron wrote:

Hi Jonathan,
Would you be kind enough to give sufficient detail to reproduce
such a case? I've used 'norm' (and 'cat' and 'mix') quite
extensively, without encountering non-sensible results (at any
rate in situations where the packages were not being abused,
which one can do in certain circumstances -- imputing missing
values can depend quite strongly on supplying realistic constraints,
and on not expecting too much when the proportion of missing data
is substantial: this methodology does not have magical powers!).

best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 26-Apr-05                                       Time: 12:47:42
------------------------------ XFMail ------------------------------

Ramon Diaz-Uriarte

Tue, Apr 26, 2005 7:25 AM #

Dear Giordano,

Library Hmisc, by Frank Harrell, contains several functions for imputation 
which I have found extremely useful.

Best,

R.

Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

http://ligarto.org/rdiaz
PGP KeyID: 0xE89B3462
(http://ligarto.org/rdiaz/0xE89B3462.asc)





Tue, Apr 26, 2005 8:12 AM #

On 04/26/05 12:54, Ted Harding wrote:

Would you be kind enough to give sufficient detail to reproduce
 such a case? I've used 'norm' (and 'cat' and 'mix') quite
 extensively, without encountering non-sensible results (at any
 rate in situations where the packages were not being abused,
 which one can do in certain circumstances -- imputing missing
 values can depend quite strongly on supplying realistic constraints,
 and on not expecting too much when the proportion of missing data
 is substantial: this methodology does not have magical powers!).

OK.  Here you go.  First the data without any names:

41,43,41,43,44
43,40,40,42,41
43,44,NA,43,44
42,43,NA,44,44
41,44,42,42,42
43,43,41,42,42
47,48,46,47,46
39,35,35,39,38
40,39,36,40,38
40,40,40,40,40
48,46,46,48,46
45,45,42,44,45
41,40,40,41,41
40,39,37,40,38
41,42,40,41,41
41,42,41,43,43
46,46,45,46,46
40,40,41,40,41
39,41,40,41,41
40,43,38,40,39
37,36,37,36,39
45,46,45,46,46
43,44,42,43,44
42,42,48,42,43
45,46,45,46,45
37,36,36,36,38
37,34,39,37,39
NA,43,41,44,43
45,44,45,44,45
38,38,37,39,38
45,44,44,44,45
NA,42,43,43,43
45,45,44,44,45
40,35,37,40,38
43,43,43,43,43
39,34,37,36,39
38,38,38,39,39
43,41,40,42,43
46,43,42,45,45
46,45,41,44,44
40,40,38,39,40
39,37,39,38,39

Now the commands I used in norm, and the result:

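library(norm)
# The data as posted above have no header row, hence header=FALSE
m1 <- as.matrix(read.csv("test.data", header=FALSE))
s1 <- prelim.norm(m1)              # preprocess; records missingness patterns
thetahat <- em.norm(s1)            # ML estimates of mean/covariance via EM
rngseed(1234564)                   # seed must be set before imp.norm()
ximp <- imp.norm(s1,thetahat,m1)   # one random draw of the missing values
ximp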

1  41.00000 43 41.00000 43 44
2  43.00000 40 40.00000 42 41
3  43.00000 44 43.72409 43 44
4  42.00000 43 43.36864 44 44
5  41.00000 44 42.00000 42 42
6  43.00000 43 41.00000 42 42
7  47.00000 48 46.00000 47 46
8  39.00000 35 35.00000 39 38
9  40.00000 39 36.00000 40 38
10 40.00000 40 40.00000 40 40
11 48.00000 46 46.00000 48 46
12 45.00000 45 42.00000 44 45
13 41.00000 40 40.00000 41 41
14 40.00000 39 37.00000 40 38
15 41.00000 42 40.00000 41 41
16 41.00000 42 41.00000 43 43
17 46.00000 46 45.00000 46 46
18 40.00000 40 41.00000 40 41
19 39.00000 41 40.00000 41 41
20 40.00000 43 38.00000 40 39
21 37.00000 36 37.00000 36 39
22 45.00000 46 45.00000 46 46
23 43.00000 44 42.00000 43 44
24 42.00000 42 48.00000 42 43
25 45.00000 46 45.00000 46 45
26 37.00000 36 36.00000 36 38
27 37.00000 34 39.00000 37 39
28 44.13337 43 41.00000 44 43
29 45.00000 44 45.00000 44 45
30 38.00000 38 37.00000 39 38
31 45.00000 44 44.00000 44 45
32 41.25152 42 43.00000 43 43
33 45.00000 45 44.00000 44 45
34 40.00000 35 37.00000 40 38
35 43.00000 43 43.00000 43 43
36 39.00000 34 37.00000 36 39
37 38.00000 38 38.00000 39 39
38 43.00000 41 40.00000 42 43
39 46.00000 43 42.00000 45 45
40 46.00000 45 41.00000 44 44
41 40.00000 40 38.00000 39 40
42 39.00000 37 39.00000 38 39

What seemed odd to me (though maybe they aren't) were the imputed
values in rows 3 and 4.  They seemed high, given what I know about the
rater in question and the students.  Here is the output of transcan for
the same cases, which looks more in line with what I expected:

1  41.00000 43 41.00000 43 44
2  43.00000 40 40.00000 42 41
3  43.00000 44 43.09469 43 44
4  42.00000 43 43.39897 44 44
5  41.00000 44 42.00000 42 42
6  43.00000 43 41.00000 42 42
7  47.00000 48 46.00000 47 46
8  39.00000 35 35.00000 39 38
9  40.00000 39 36.00000 40 38
10 40.00000 40 40.00000 40 40
11 48.00000 46 46.00000 48 46
12 45.00000 45 42.00000 44 45
13 41.00000 40 40.00000 41 41
14 40.00000 39 37.00000 40 38
15 41.00000 42 40.00000 41 41
16 41.00000 42 41.00000 43 43
17 46.00000 46 45.00000 46 46
18 40.00000 40 41.00000 40 41
19 39.00000 41 40.00000 41 41
20 40.00000 43 38.00000 40 39
21 37.00000 36 37.00000 36 39
22 45.00000 46 45.00000 46 46
23 43.00000 44 42.00000 43 44
24 42.00000 42 48.00000 42 43
25 45.00000 46 45.00000 46 45
26 37.00000 36 36.00000 36 38
27 37.00000 34 39.00000 37 39
28 43.80165 43 41.00000 44 43
29 45.00000 44 45.00000 44 45
30 38.00000 38 37.00000 39 38
31 45.00000 44 44.00000 44 45
32 42.91116 42 43.00000 43 43
33 45.00000 45 44.00000 44 45
34 40.00000 35 37.00000 40 38
35 43.00000 43 43.00000 43 43
36 39.00000 34 37.00000 36 39
37 38.00000 38 38.00000 39 39
38 43.00000 41 40.00000 42 43
39 46.00000 43 42.00000 45 45
40 46.00000 45 41.00000 44 44
41 40.00000 40 38.00000 39 40
42 39.00000 37 39.00000 38 39

The commands here were

library(Hmisc)
s.imp <- transcan(m1,asis="*",data=m1,imputed=TRUE,long=TRUE,pl=FALSE)
s.na <- is.na(m1) # logical matrix marking which ratings are imputed
m1[which(s.na)] <- unlist(s.imp$imputed) # fill NAs in column-major order

(I wish I could find a more elegant way to replace the NAs.)
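
A slightly tidier variant is to index with the logical matrix directly and drop which(). The sketch below uses base R only; the small matrix and the `imputed` list are made-up stand-ins that assume s.imp$imputed holds one vector of imputed values per column, in row order.

```r
# Small matrix with NAs in two columns (made-up values, filled column-wise)
m <- matrix(c(41, NA, 43, NA,
              43, 44, 43, 42,
              41, NA, 41, 43),
            nrow = 4,
            dimnames = list(NULL, c("r1", "r2", "r3")))

# Hypothetical stand-in for s.imp$imputed: one vector of imputed values
# per column, in row order (empty where a column has no NAs)
imputed <- list(r1 = c(42.5, 43.1), r2 = numeric(0), r3 = 40.8)

# is.na(m) is a logical matrix; indexing with it walks the matrix column by
# column, which matches the order of unlist(imputed)
filled <- m
filled[is.na(m)] <- unlist(imputed)
print(filled)
```

This relies on the imputed values being stored column by column. Hmisc also documents impute() methods for transcan fits, which may be the more direct route; see the package help for the exact arguments.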

Jon