R-help Is there a R function that will insert missing sequence number(s) and then fill a missing observation with the preceding value. For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2)) seq count 1 1 4 2 2 7 3 4 3 4 5 5 5 7 2 What I need is seq count 1 1 4 2 2 7 3 3 7 4 4 3 5 5 5 6 6 5 7 7 2 Jeff
Inserting missing seq number
9 messages · Eric Berger, Avi Gross, PIKAL Petr +4 more
Hi Jeff, tidyr::fill() will do what you want aDf <- data.frame(seq=1:7) bDf <- left_join(aDf,df,by="seq") tidyr::fill(bDf,count) HTH, Eric On Wed, Mar 30, 2022 at 5:47 AM Jeff Reichman <reichmanj at sbcglobal.net> wrote:
R-help Is there a R function that will insert missing sequence number(s) and then fill a missing observation with the preceding value. For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2)) seq count 1 1 4 2 2 7 3 4 3 4 5 5 5 7 2 What I need is seq count 1 1 4 2 2 7 3 3 7 4 4 3 5 5 5 6 6 5 7 7 2 Jeff
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Jeff, There may well be such a function somewhere but I would have requested something less ambitious than a function that does exactly that. As I see it, there are many ways to do what you specifically want, but much depends on exact conditions in your data. Your example shows seq values ascending from 1 to 7 with missing rows for values 3 and 6. Can we assume the data will have everything in order with gaps of exactly 1, or might there be gaps of 2 or more? Is the sequence always beginning with 1, or might you have numbers from say 666 to 999? Can the first or last entries be missing? How would we know? One of many approaches I can outline is to create a second data structure containing all valid integers between highest and lowest. In your case, from 1 to 7. Again, one way is to make a data.frame similar to the above and use some functions to make the appropriate merge. You would end up, perhaps, with a data.frame with more rows and the new rows might contain an NA for the count column. Now, yes, there are many packages out there that have functions for filling in missing values. Some allow you to fill in the previous value, and some may allow a mean or interpolation and so on. But consider a simple loop of indices in the enlarged data.frame of values of seq. Each step in the loop makes note of the value stored in count on the previous pass and if the currently pass has an NA, changes it to the value being held. Lots more details, but as an outline, this may suffice. And, yes, if you make a simple set of program lines that performs what you want, and make it general enough, you should be able to write an R function that accepts a data.frame and perhaps an indication of which column to do this with and calculates everything needed to return the enhanced data.frame as a result. You will then have a function that does this! I am sure eventually someone else will throw something like a dplyr function or two or suggest you could have searched for something like: "R fill in missing info" -----Original Message----- From: Jeff Reichman <reichmanj at sbcglobal.net> To: R-help at r-project.org Sent: Tue, Mar 29, 2022 10:47 pm Subject: [R] Inserting missing seq number R-help Is there a R function that will insert missing sequence number(s) and then fill a missing observation with the preceding value. For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2)) ? seq count 1? ? 1? ? ? ? 4 2? ? 2? ? ? ? 7 3? ? 4? ? ? ? 3 4? ? 5? ? ? ? 5 5? ? 7? ? ? ? 2 What I need is ? seq count 1? ? 1? ? ? ? 4 2? ? 2? ? ? ? 7 3? ? 3? ? ? ? 7 4? ? 4? ? ? ? 3 5? ? 5? ? ? ? 5 6? ? 6? ? ? ? 5 7? ? 7? ? ? ? 2 Jeff ______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hallo Jeff Similar result you will get by na.locf function from zoo package df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2)) vec <-data.frame(seq =1:7) df <- merge(vec, df, all=T) library(zoo) df$count <- na.locf(df$count) Cheers Petr
-----Original Message----- From: R-help <r-help-bounces at r-project.org> On Behalf Of Eric Berger Sent: Wednesday, March 30, 2022 5:36 AM To: reichmanj at sbcglobal.net Cc: r-help at r-project.org Subject: Re: [R] Inserting missing seq number Hi Jeff, tidyr::fill() will do what you want aDf <- data.frame(seq=1:7) bDf <- left_join(aDf,df,by="seq") tidyr::fill(bDf,count) HTH, Eric On Wed, Mar 30, 2022 at 5:47 AM Jeff Reichman <reichmanj at sbcglobal.net> wrote:
R-help Is there a R function that will insert missing sequence number(s) and then fill a missing observation with the preceding value. For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2)) seq count 1 1 4 2 2 7 3 4 3 4 5 5 5 7 2 What I need is seq count 1 1 4 2 2 7 3 3 7 4 4 3 5 5 5 6 6 5 7 7 2 Jeff
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
Petr & Eric Thank you for the two options. Figured there was an easy straight forward way. Jeff -----Original Message----- From: PIKAL Petr <petr.pikal at precheza.cz> Sent: Wednesday, March 30, 2022 12:59 AM To: reichmanj at sbcglobal.net Cc: r-help at r-project.org Subject: RE: [R] Inserting missing seq number Hallo Jeff Similar result you will get by na.locf function from zoo package df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2)) vec <-data.frame(seq =1:7) df <- merge(vec, df, all=T) library(zoo) df$count <- na.locf(df$count) Cheers Petr
-----Original Message----- From: R-help <r-help-bounces at r-project.org> On Behalf Of Eric Berger Sent: Wednesday, March 30, 2022 5:36 AM To: reichmanj at sbcglobal.net Cc: r-help at r-project.org Subject: Re: [R] Inserting missing seq number Hi Jeff, tidyr::fill() will do what you want aDf <- data.frame(seq=1:7) bDf <- left_join(aDf,df,by="seq") tidyr::fill(bDf,count) HTH, Eric On Wed, Mar 30, 2022 at 5:47 AM Jeff Reichman <reichmanj at sbcglobal.net> wrote:
R-help Is there a R function that will insert missing sequence number(s) and then fill a missing observation with the preceding value. For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2)) seq count 1 1 4 2 2 7 3 4 3 4 5 5 5 7 2 What I need is seq count 1 1 4 2 2 7 3 3 7 4 4 3 5 5 5 6 6 5 7 7 2 Jeff
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
stats::approx can do the job:
approx(x=df$seq, df$count, xout=1:7, method="constant", f=0)
$x [1] 1 2 3 4 5 6 7 $y [1] 4 7 7 3 5 5 2 -Bill On Tue, Mar 29, 2022 at 7:47 PM Jeff Reichman <reichmanj at sbcglobal.net> wrote:
R-help Is there a R function that will insert missing sequence number(s) and then fill a missing observation with the preceding value. For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2)) seq count 1 1 4 2 2 7 3 4 3 4 5 5 5 7 2 What I need is seq count 1 1 4 2 2 7 3 3 7 4 4 3 5 5 5 6 6 5 7 7 2 Jeff
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
That is not only an elegant solution, but exactly what the OP requested. I assume it can be generalized: approx(x=df$seq,df$count,xout=min(df$seq):max(df$seq),method="constant",f=0) Jim
On Thu, Mar 31, 2022 at 2:41 AM Bill Dunlap <williamwdunlap at gmail.com> wrote:
stats::approx can do the job:
approx(x=df$seq, df$count, xout=1:7, method="constant", f=0)
$x [1] 1 2 3 4 5 6 7 $y [1] 4 7 7 3 5 5 2 -Bill On Tue, Mar 29, 2022 at 7:47 PM Jeff Reichman <reichmanj at sbcglobal.net> wrote:
R-help Is there a R function that will insert missing sequence number(s) and then fill a missing observation with the preceding value. For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2)) seq count 1 1 4 2 2 7 3 4 3 4 5 5 5 7 2 What I need is seq count 1 1 4 2 2 7 3 3 7 4 4 3 5 5 5 6 6 5 7 7 2 Jeff
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Several perfectly fine "off the shelf" solutions have been offered for this post, so what follows here should be considered only for amusement. This seemed like a simple problem to me, so I wondered if one could write simple code in plain old base R to solve it. Of course, one can. What follow are two such approaches. First, as several have already noted, one can use the merge() trick to obtain the full 'seq' vector and the corresponding 'count' vector with NA's for the missing seq values. Here is a slightly more complicated example than the OP posted that allows for more than one missing 'seq' value in a row and starts at something other than 1(one could use diff if NA's only occur one at a time I think). dat <- data.frame(seq = c(3,4,6,9,10), count = c(4,7,3,5,2))
dat
seq count 1 3 4 2 4 7 3 6 3 4 9 5 5 10 2 The merge() trick than gives: dat <- merge(data.frame( seq = seq.int(dat[1,'seq'], tail(dat,1)[1,'seq'])), dat, all.x = TRUE)
dat
seq count
1 3 4
2 4 7
3 5 NA
4 6 3
5 7 NA
6 8 NA
7 9 5
8 10 2
So focusing on the 'count' vector, one needs a 'fill in' function to
appropriately fill in the missing values. Here are two. fillin1() does
this in the obvious way, moving sequentially from the beginning to the
end, filling in the previous value whenever NA is encountered:
fillin1 <- function(x){
for(i in seq_along(x))
if(is.na(x[i]))x[i] <- x[i-1]
x
}
fillin2() is a bit trickier, working recursively from end to
beginning. Still, it's only a few lines of code, and might be improved
in some way I didn't think of:
fillin2 <- function(x){
if(length(x) > 1){
z <- Recall(head(x, -1))
if(is.na(tail(x,1))){
x <-c(z,tail(z,1))
} else x[-length(x)] <- z
}
x
}
It might be interesting to compare the performance of all the
suggestions, but I'm too lazy to do that and will compare it only to
Bill's suggestion of approx(). To make the comparison fairer, I'll
remove unnecessary overhead and put 'seq' and 'count' in the global
environment.
seq <- c(3,4,6, 9, 10)
count <- c(4,7,3,5,2)
I'll also use the post merged 'count' from the above merge()
merged_count <- dat$count
merged_count
[1] 4 7 NA 3 NA NA 5 2 First, check that the fillinx functions actually work:
fillin1(merged_count)
[1] 4 7 7 3 3 3 5 2
fillin2(merged_count)
[1] 4 7 7 3 3 3 5 2 ## and of course!
approx(x=seq, y=count, xout=3:10, method="constant", f=0)
$x [1] 3 4 5 6 7 8 9 10 $y [1] 4 7 7 3 3 3 5 2 Timing the execution of each 5000 times:
system.time(replicate(5000,approx(x=seq, y=count, xout=3:10, method="constant", f=0)))
user system elapsed 0.062 0.001 0.063
system.time(replicate(5000, fillin1(merged_count)))
user system elapsed 0.008 0.000 0.007
system.time(replicate(5000, fillin2(merged_count)))
user system elapsed 0.222 0.001 0.223 I was not surprised that the recursive solution was the slowest, but was a little surprised that approx() was considerably slower than the iterative fillin1() . Of course, one shouldn't invest too much faith in this little exercise. Results might change drastically for very long vectors with different proportions/patterns of missings, for example. And approx is much more general and the above comparison may be a cheat of sorts anyway since I've omitted the overhead of the merge(), which I assumed would be small for reasonably sized examples. That might be wrong. Anyway, as I said, for amusement only. Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Wed, Mar 30, 2022 at 8:41 AM Bill Dunlap <williamwdunlap at gmail.com> wrote:
stats::approx can do the job:
approx(x=df$seq, df$count, xout=1:7, method="constant", f=0)
$x [1] 1 2 3 4 5 6 7 $y [1] 4 7 7 3 5 5 2 -Bill On Tue, Mar 29, 2022 at 7:47 PM Jeff Reichman <reichmanj at sbcglobal.net> wrote:
R-help Is there a R function that will insert missing sequence number(s) and then fill a missing observation with the preceding value. For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2)) seq count 1 1 4 2 2 7 3 4 3 4 5 5 5 7 2 What I need is seq count 1 1 4 2 2 7 3 3 7 4 4 3 5 5 5 6 6 5 7 7 2 Jeff
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Bill Thank you. This is a third option to look into. Jeff From: Bill Dunlap <williamwdunlap at gmail.com> Sent: Wednesday, March 30, 2022 10:40 AM To: reichmanj at sbcglobal.net Cc: R-help at r-project.org Subject: Re: [R] Inserting missing seq number stats::approx can do the job:
approx(x=df$seq, df$count, xout=1:7, method="constant", f=0)
$x [1] 1 2 3 4 5 6 7 $y [1] 4 7 7 3 5 5 2 -Bill
On Tue, Mar 29, 2022 at 7:47 PM Jeff Reichman <reichmanj at sbcglobal.net <mailto:reichmanj at sbcglobal.net> > wrote:
R-help Is there a R function that will insert missing sequence number(s) and then fill a missing observation with the preceding value. For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2)) seq count 1 1 4 2 2 7 3 4 3 4 5 5 5 7 2 What I need is seq count 1 1 4 2 2 7 3 3 7 4 4 3 5 5 5 6 6 5 7 7 2 Jeff ______________________________________________ R-help at r-project.org <mailto:R-help at r-project.org> mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.