Hello, I have climatic data from various years with many missing values. I would like to know which tools in R are best suited to estimating these missing values. (I am new to R and quite new to statistics.) Thanks, G
missing values
10 messages · Jonathan Baron, falissard, Giordano Sanchez +3 more
Turns out that this is not a simple question. Depending on what you want to do, some statistical methods will just deal with missing data and use what is available, in different ways, e.g., cor(). For other purposes, you might want to "impute" (fill in) the missing values, and then there are many ways to do this, depending on what else you have (correlated variables?) and what assumptions you are willing to make.

Two methods (among many) that I have found useful are in aregImpute() and transcan(), both in the Hmisc package. To learn more, see my R search page: http://finzi.psych.upenn.edu/ and I also have an example of aregImpute() in http://www.psych.upenn.edu/~baron/rpsych/rpsych.html but see the help files first.

I found the following article very helpful when I was a beginner with respect to this topic (which is still close to true):

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.

Jon
Jonathan Baron, Professor of Psychology, University of Pennsylvania Home page: http://www.sas.upenn.edu/~baron
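As a small illustration of the "use what is available" option mentioned for cor(), here is a minimal sketch on made-up data. The matrix `x` and its column names are invented for the example; only the `use` argument of cor() is the point.

```r
# Toy data: three correlated series with a few values set to missing.
set.seed(1)
n <- 50
a <- rnorm(n)
x <- cbind(a = a, b = a + rnorm(n, sd = 0.5), c = a + rnorm(n, sd = 0.5))
x[c(3, 10), "b"] <- NA
x[c(7, 10, 20), "c"] <- NA

# Default: any NA in a pair of columns gives NA for that correlation.
r_default <- cor(x)

# Listwise deletion: use only rows that are complete in every column.
r_complete <- cor(x, use = "complete.obs")

# Pairwise deletion: each correlation uses the rows complete for that pair.
r_pairwise <- cor(x, use = "pairwise.complete.obs")

colSums(is.na(x))   # how many values are missing per column
```

Neither deletion option fills in anything; they only decide which rows enter each calculation. Imputation, as with aregImpute() or transcan(), is a separate step.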
Hello,

The mice package http://web.inter.nl.net/users/S.van.Buuren/mi/hmtl/mice.htm is also potentially interesting. It works with R 1.9 but not always with newer versions.

Best regards,
Bruno

Bruno Falissard
Département de santé publique, Hôpital Paul Brousse
14 Avenue Paul Vaillant Couturier, 94804 Villejuif cedex, France
web site: http://perso.wanadoo.fr/bruno.falissard/
1 day later
Hello,
Thanks for the instructive responses. But two questions arise.
First of all, I can't manage to load the "mice" library.
I'm using R 2.0.1 on Debian.
I tried just copying the package into my library, /usr/lib/R/library,
but when I do > library()
...
mice ** No title available (pre-2.0.0 install?) **
...
and when I do > library(mice)
Error in library(mice) : 'mice' is not a valid package --installed < 2.0.0?
The second question is more statistical:
aregImpute() seems to give good results, but I would like to compare the
different methods, not just graphically. Is that possible?
I also have other meteorological stations whose data are correlated with the
data from the station I'm using. Can I use those data to improve my imputation
method?
Regards,
Giordano
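A message like the one above typically appears when a package directory has been copied into the library instead of being installed, so R finds no install-time metadata. A minimal sketch of checking for a package and installing it properly follows; `pkg_status` is a hypothetical helper name, and the tarball file name is only a placeholder.

```r
# Report whether a package can be found in the current library paths.
# `pkg_status` is a hypothetical helper name, not part of R.
pkg_status <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) "installed" else "not installed"
}

pkg_status("stats")   # a base package, always present

# If a package is missing, install it rather than copying its directory.
# Both lines are commented out because they need network access or a
# local source tarball (the file name below is only a placeholder):
# install.packages("mice")
# install.packages("mice_x.y.tar.gz", repos = NULL, type = "source")
```

From a shell, `R CMD INSTALL` on the source tarball does the same job. Whether a given version of mice builds under a given version of R is a separate question that this sketch does not settle.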
Hello,

In my experience, mice works fine with R 1.9 but not necessarily with newer versions...

Bruno

Bruno Falissard
INSERM U669, PSIGIAM "Paris Sud Innovation Group in Adolescent Mental Health"
Maison de Solenn, 97 Boulevard de Port Royal, 75679 Paris cedex 14, France
web site: http://perso.wanadoo.fr/bruno.falissard/
On 04/26/05 09:58, Giordano Sanchez wrote:
> First of all, I can't manage to load the "mice" library.
> I'm using R 2.0.1 on Debian.

The package called norm also has functions for missing data. When I tried it, the values it gave were not sensible for my problem, but I may have done something wrong. (This was a simple problem that did not involve multiple imputation.)

> The second question is more statistical:
> aregImpute() seems to give good results, but I would like to compare the
> different methods, not just graphically. Is that possible?

What different methods? Compare how? Are you assuming that we remember your last post?

> I also have other meteorological stations whose data are correlated with the
> data from the station I'm using. Can I use those data to improve my imputation
> method?

This sounds like exactly what aregImpute() is good for, or transcan(), depending on whether you need to make inferences (and hence do multiple imputation).

Jon
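To make the idea of borrowing strength from a correlated station concrete, here is a minimal sketch using plain linear regression on simulated data. This is a deliberately simpler stand-in, not what aregImpute() or transcan() do internally, and every name and number in it is invented for illustration.

```r
# Simulated daily temperatures at a target station and one neighbour.
# All names and numbers here are made up for illustration.
set.seed(42)
n <- 200
neighbour <- 15 + 8 * sin(2 * pi * (1:n) / 365) + rnorm(n)
target <- 1.5 + 0.9 * neighbour + rnorm(n, sd = 0.7)

# Knock out 20% of the target series to mimic gaps in the record.
missing <- sample(n, size = 40)
stations <- data.frame(target = target, neighbour = neighbour)
stations$target[missing] <- NA

# Fit on the days where both stations reported (lm drops NA rows by default).
fit <- lm(target ~ neighbour, data = stations)

# Fill each gap with the fitted conditional mean.
filled <- stations$target
filled[missing] <- predict(fit, newdata = stations[missing, , drop = FALSE])

sum(is.na(filled))   # no gaps remain
```

A single fill-in like this treats the imputed values as if they had been observed, so it understates uncertainty. That is the reason multiple imputation is recommended when inferences are needed.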
For those interested I have preprints of a paper comparing MICE, aregImpute, and transcan on the basis of simulations.
Frank E Harrell Jr, Professor and Chair
Department of Biostatistics, School of Medicine, Vanderbilt University
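For readers wondering what a simulation-based comparison can look like in its simplest form, here is a hedged sketch: hide values whose truth is known, impute them by two methods, and score each against the truth. It is not the design of the paper mentioned above; the two methods (mean fill and regression fill) and all data are stand-ins chosen so that the example runs in base R.

```r
set.seed(123)
n <- 300
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n, sd = 0.5)   # "true" complete data

# Hide 25% of y at random so the truth is known for every imputed value.
hidden <- sample(n, size = 75)
y_obs <- y
y_obs[hidden] <- NA

# Method 1: replace each gap with the mean of the observed values.
imp_mean <- rep(mean(y_obs, na.rm = TRUE), length(hidden))

# Method 2: predict each gap from the correlated variable x.
fit <- lm(y_obs ~ x)
imp_reg <- predict(fit, newdata = data.frame(x = x[hidden]))

# Root mean squared error against the values that were hidden.
rmse <- function(est, truth) sqrt(mean((est - truth)^2))
scores <- c(mean = rmse(imp_mean, y[hidden]),
            regression = rmse(imp_reg, y[hidden]))
scores
```

Error in the imputed values is only one possible criterion. When the goal is inference, the bias and interval coverage of the final estimates across many simulated data sets are usually more relevant.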
On 26-Apr-05 Jonathan Baron wrote:
> The package called norm also has functions for missing data. When I tried
> it, the values it gave were not sensible for my problem, but I may have done
> something wrong. (This was a simple problem that did not involve multiple
> imputation.)

Hi Jonathan,

Would you be kind enough to give sufficient detail to reproduce such a case? I've used 'norm' (and 'cat' and 'mix') quite extensively, without encountering non-sensible results (at any rate in situations where the packages were not being abused, which one can do in certain circumstances -- imputing missing values can depend quite strongly on supplying realistic constraints, and on not expecting too much when the proportion of missing data is substantial: this methodology does not have magical powers!).

best wishes,
Ted.

(Ted Harding)
Dear Giordano, The Hmisc package, by Frank Harrell, contains several functions for imputation which I have found extremely useful. Best, R.
Ramón Díaz-Uriarte
Bioinformatics Unit, Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3, 28029 Madrid (Spain)
http://ligarto.org/rdiaz
On 04/26/05 12:54, Ted Harding wrote:
> Would you be kind enough to give sufficient detail to reproduce
> such a case? I've used 'norm' (and 'cat' and 'mix') quite
> extensively, without encountering non-sensible results (at any
> rate in situations where the packages were not being abused,
> which one can do in certain circumstances -- imputing missing
> values can depend quite strongly on supplying realistic constraints,
> and on not expecting too much when the proportion of missing data
> is substantial: this methodology does not have magical powers!).
OK. Here you go. First the data without any names:
41,43,41,43,44
43,40,40,42,41
43,44,NA,43,44
42,43,NA,44,44
41,44,42,42,42
43,43,41,42,42
47,48,46,47,46
39,35,35,39,38
40,39,36,40,38
40,40,40,40,40
48,46,46,48,46
45,45,42,44,45
41,40,40,41,41
40,39,37,40,38
41,42,40,41,41
41,42,41,43,43
46,46,45,46,46
40,40,41,40,41
39,41,40,41,41
40,43,38,40,39
37,36,37,36,39
45,46,45,46,46
43,44,42,43,44
42,42,48,42,43
45,46,45,46,45
37,36,36,36,38
37,34,39,37,39
NA,43,41,44,43
45,44,45,44,45
38,38,37,39,38
45,44,44,44,45
NA,42,43,43,43
45,45,44,44,45
40,35,37,40,38
43,43,43,43,43
39,34,37,36,39
38,38,38,39,39
43,41,40,42,43
46,43,42,45,45
46,45,41,44,44
40,40,38,39,40
39,37,39,38,39
Now the commands I used in norm, and the result:
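library(norm)                            # prelim.norm(), em.norm(), imp.norm()
m1 <- as.matrix(read.csv("test.data"))   # ratings matrix; NA marks missing
s1 <- prelim.norm(m1)                    # summarize the missingness patterns
thetahat <- em.norm(s1)                  # ML estimates of the normal model via EM
rngseed(1234564)                         # must be set before imp.norm()
ximp <- imp.norm(s1,thetahat,m1)         # draw one random imputation of the NAs
ximp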
1 41.00000 43 41.00000 43 44
2 43.00000 40 40.00000 42 41
3 43.00000 44 43.72409 43 44
4 42.00000 43 43.36864 44 44
5 41.00000 44 42.00000 42 42
6 43.00000 43 41.00000 42 42
7 47.00000 48 46.00000 47 46
8 39.00000 35 35.00000 39 38
9 40.00000 39 36.00000 40 38
10 40.00000 40 40.00000 40 40
11 48.00000 46 46.00000 48 46
12 45.00000 45 42.00000 44 45
13 41.00000 40 40.00000 41 41
14 40.00000 39 37.00000 40 38
15 41.00000 42 40.00000 41 41
16 41.00000 42 41.00000 43 43
17 46.00000 46 45.00000 46 46
18 40.00000 40 41.00000 40 41
19 39.00000 41 40.00000 41 41
20 40.00000 43 38.00000 40 39
21 37.00000 36 37.00000 36 39
22 45.00000 46 45.00000 46 46
23 43.00000 44 42.00000 43 44
24 42.00000 42 48.00000 42 43
25 45.00000 46 45.00000 46 45
26 37.00000 36 36.00000 36 38
27 37.00000 34 39.00000 37 39
28 44.13337 43 41.00000 44 43
29 45.00000 44 45.00000 44 45
30 38.00000 38 37.00000 39 38
31 45.00000 44 44.00000 44 45
32 41.25152 42 43.00000 43 43
33 45.00000 45 44.00000 44 45
34 40.00000 35 37.00000 40 38
35 43.00000 43 43.00000 43 43
36 39.00000 34 37.00000 36 39
37 38.00000 38 38.00000 39 39
38 43.00000 41 40.00000 42 43
39 46.00000 43 42.00000 45 45
40 46.00000 45 41.00000 44 44
41 40.00000 40 38.00000 39 40
42 39.00000 37 39.00000 38 39
What seemed odd to me, and maybe they aren't, were the imputed
values in rows 3 and 4. They seemed high, knowing the rater in
question and the students. Here is the output of transcan, for
the same cases, which looks more in line with what I expected:
1 41.00000 43 41.00000 43 44
2 43.00000 40 40.00000 42 41
3 43.00000 44 43.09469 43 44
4 42.00000 43 43.39897 44 44
5 41.00000 44 42.00000 42 42
6 43.00000 43 41.00000 42 42
7 47.00000 48 46.00000 47 46
8 39.00000 35 35.00000 39 38
9 40.00000 39 36.00000 40 38
10 40.00000 40 40.00000 40 40
11 48.00000 46 46.00000 48 46
12 45.00000 45 42.00000 44 45
13 41.00000 40 40.00000 41 41
14 40.00000 39 37.00000 40 38
15 41.00000 42 40.00000 41 41
16 41.00000 42 41.00000 43 43
17 46.00000 46 45.00000 46 46
18 40.00000 40 41.00000 40 41
19 39.00000 41 40.00000 41 41
20 40.00000 43 38.00000 40 39
21 37.00000 36 37.00000 36 39
22 45.00000 46 45.00000 46 46
23 43.00000 44 42.00000 43 44
24 42.00000 42 48.00000 42 43
25 45.00000 46 45.00000 46 45
26 37.00000 36 36.00000 36 38
27 37.00000 34 39.00000 37 39
28 43.80165 43 41.00000 44 43
29 45.00000 44 45.00000 44 45
30 38.00000 38 37.00000 39 38
31 45.00000 44 44.00000 44 45
32 42.91116 42 43.00000 43 43
33 45.00000 45 44.00000 44 45
34 40.00000 35 37.00000 40 38
35 43.00000 43 43.00000 43 43
36 39.00000 34 37.00000 36 39
37 38.00000 38 38.00000 39 39
38 43.00000 41 40.00000 42 43
39 46.00000 43 42.00000 45 45
40 46.00000 45 41.00000 44 44
41 40.00000 40 38.00000 39 40
42 39.00000 37 39.00000 38 39
The commands here were
library(Hmisc) # transcan()
s.imp <- transcan(m1,asis="*",data=m1,imputed=TRUE,long=TRUE,pl=FALSE)
s.na <- is.na(m1) # which ratings are imputed
m1[s.na] <- unlist(s.imp$imputed) # fill the NAs, column by column
(I wish I could find a more elegant way to replace the NAs.)
Jon
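To put the two listings side by side, the sketch below copies only the four imputed cells from each output and takes their difference. One point worth keeping in mind when reading it: imp.norm() draws each missing value at random from its predictive distribution given the parameter estimates, so its results include random noise and change with the seed, whereas the transcan values here come from a single-imputation run. The data frame layout and column names are this example's own.

```r
# Imputed values copied from the two listings above.
imputed <- data.frame(
  row      = c(3, 4, 28, 32),
  column   = c(3, 3, 1, 1),
  norm     = c(43.72409, 43.36864, 44.13337, 41.25152),
  transcan = c(43.09469, 43.39897, 43.80165, 42.91116)
)
imputed$difference <- imputed$norm - imputed$transcan
imputed

max(abs(imputed$difference))
```

On these numbers the two methods nearly agree for row 4, and the largest gap is in row 32 rather than in rows 3 and 4.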