Skip to content

Filling in missing values in a column based on previous and following values

8 messages · fd, Gerrit Eichner, John McKown +1 more

fd
#
I have the following .csv file containing about 40000 values (here only an
extract and simplified version):

NAME  ;	YEAR; ID;	VALUE; CUMMB
Sample1;	1998;	 354;	45;	   45
Sample1;	1999;	 354;	23;	   68
Sample1;	2000;	 NA;	66;	   134
Sample1;	2001;	 NA;	98;	   232
Sample1;	2002;	 NA;	36;	   268
Sample1;	2003;	 NA;	59;	   327
Sample1;	2004;	 NA;	64;	   391
Sample1;	2005;	 354;	23;	   414
Sample1;	2006;	 354;	69;	   483
Sample1;	2007;	 354;	94;	   577
Sample1;	2008;	 354;	24;	   601
Sample2;	1964;	1342;	7;	      7
Sample2;	1965;	1342;	24;	     31
Sample3;	2002;	 859;	90;	     90
Sample3;	2003;	  NA;	93;	    183
Sample3;	2004;	  NA;	53;	    236
Sample3;	2005;	 859;	98;	    334

What I would like to do is to replace the NA values in ID with the values
from the ID. E.g. all values in ID from Sample 1 should have the value 354;
all values in ID from Sample 3 should have the value 859 etc.

Is there a simple way to do this? 

Thanks for your help.



--
View this message in context: http://r.789695.n4.nabble.com/Filling-in-missing-values-in-a-column-based-on-previous-and-following-values-tp4694993.html
Sent from the R help mailing list archive at Nabble.com.
#
Hello, Florian,

function na.locf() from package zoo mightdo what you want.

Hth --  Gerrit

---------------------------------------------------------------------
Dr. Gerrit Eichner                   Mathematical Institute, Room 212
gerrit.eichner at math.uni-giessen.de   Justus-Liebig-University Giessen
Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392 Giessen, Germany
Fax: +49-(0)641-99-32109        http://www.uni-giessen.de/cms/eichner
---------------------------------------------------------------------
On Mon, 4 Aug 2014, fd wrote:

            
#
this is great, thanks! 

one problem I noticed though is that it fills all NA values in every column, is it possible to specify only one column, e.g. only ID (I have NA values in another column I want to keep)

Kind regards,
Florian

 

Am 04.08.2014 um 11:31 schrieb Gerrit Eichner <Gerrit.Eichner at math.uni-giessen.de>:
#
On Mon, 4 Aug 2014, Florian Denzinger wrote:

            
Yes, of course. Just access only one column, not all: Something like

yourdataframe$ID <- na.locf( yourdataframe$ID)

should replace the ID-column with the modified version you want.

Regards  --  Gerrit
#
On Mon, Aug 4, 2014 at 6:34 AM, Florian Denzinger
<florian.denzinger at uzh.ch> wrote:
Florian,

I'd be interested in seeing your solution. I need as many techniques
in my "bag of tricks" as I can find.
#
Hi John,

I only meant the subsetting of the dataframe by using the zoo function suggested by Gerrit! Not a new solution/function, I am afraid. 
Regards,
F

Am 04.08.2014 um 14:13 schrieb John McKown <john.archie.mckown at gmail.com>:
#
On Mon, Aug 4, 2014 at 9:01 AM, Florian Denzinger
<florian.denzinger at uzh.ch> wrote:
I take it that means you _don't_ have a solution. If not, or even if
you do, my solution is below. There likely is a better way to do it.
I'm still learning R and there is a _LOT_ of things in all those
packages which I don't know.

library(reshape2); # needed for dcast()
#
# create variable
florian <- data.frame(
NAME=c('Sample1','Sample1','Sample1','Sample1','Sample1','Sample1','Sample1',
'Sample1','Sample1','Sample1','Sample1','Sample2','Sample2','Sample3','Sample3',
'Sample3','Sample3'),
YEAR=c( 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 1964,
1965, 2002, 2003, 2004, 2005),
ID=c( 354 , 354 ,NA ,NA ,NA ,NA ,NA , 354 , 354 , 354 , 354 , 1342 ,
1342 , 859 ,NA ,NA , 859 ),
VALUE=c( 45 , 23 , 66 , 98 , 36 , 59 , 64 , 23 , 69 , 94 , 24 , 7 , 24 , 90 ,
93 , 53 , 98 ),
CUMMB=c( 45 , 68 , 134 , 232 , 268 , 327 , 391 , 414 , 483 , 577 , 601 , 7 , 31
, 90 , 183 , 236 , 334 ));
#
# start of solution: y is a temporary table I need to
# find the non-NA ID values per NAME value.
# I actually use max() but ASSuME that all non-NA value are ==
# !is.na() is used to remove NA values from consideration.
y <- dcast(florian[!is.na(florian$ID),],NAME ~
.,value.var="ID",fun.aggregate=max,fill=-1,drop=FALSE);
names(y) <- c("NAME","IDy"); # Nicer names that dcast() makes.
#
# The real work is done in merge()
florian <- merge(x=florian,y=y,by=c("NAME"),all.x=TRUE);
florian$ID <- florian$IDy; #copy merged values to ID column
florian$IDy <- NULL; # remove temporary column
rm(y); # erase temporary frame.