Inserting missing seq number
Several perfectly fine "off the shelf" solutions have been offered for this post, so what follows here should be considered only for amusement. This seemed like a simple problem to me, so I wondered if one could write simple code in plain old base R to solve it. Of course, one can. What follow are two such approaches.

First, as several have already noted, one can use the merge() trick to obtain the full 'seq' vector and the corresponding 'count' vector with NA's for the missing seq values. Here is a slightly more complicated example than the OP posted that allows for more than one missing 'seq' value in a row and starts at something other than 1 (one could use diff if NA's only occur one at a time, I think).

dat <- data.frame(seq = c(3,4,6,9,10), count = c(4,7,3,5,2))
dat
  seq count
1   3     4
2   4     7
3   6     3
4   9     5
5  10     2

The merge() trick then gives:

dat <- merge(data.frame(seq = seq.int(dat[1,'seq'], tail(dat,1)[1,'seq'])),
             dat, all.x = TRUE)
dat
  seq count
1   3     4
2   4     7
3   5    NA
4   6     3
5   7    NA
6   8    NA
7   9     5
8  10     2
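As an aside, the same NA-padded 'count' vector can also be had without merge(), by indexing with match(). What follows is only a sketch, assuming 'seq' holds unique whole numbers; the names dat0, full_seq and padded are made up for illustration and do not appear elsewhere in this thread.

```r
## Sketch: build the NA-padded 'count' vector with match() instead of merge().
## Assumes 'seq' contains unique whole numbers. dat0, full_seq and padded
## are illustrative names only.
dat0 <- data.frame(seq = c(3,4,6,9,10), count = c(4,7,3,5,2))

## Every value from the smallest to the largest 'seq'
full_seq <- seq.int(min(dat0$seq), max(dat0$seq))

## match() returns NA where a value of full_seq is absent from dat0$seq,
## and indexing 'count' with NA yields NA
padded <- data.frame(seq = full_seq,
                     count = dat0$count[match(full_seq, dat0$seq)])
padded
```

The 'count' column of padded should then hold the same values as the merged 'count' shown above.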
So focusing on the 'count' vector, one needs a 'fill in' function to
appropriately fill in the missing values. Here are two. fillin1() does
this in the obvious way, moving sequentially from the beginning to the
end, filling in the previous value whenever NA is encountered:
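fillin1 <- function(x){
  ## Assumes x[1] is not NA (true after the merge() above); a leading NA
  ## has no previous value to copy and the assignment would fail.
  for(i in seq_along(x))
    if(is.na(x[i])) x[i] <- x[i-1]
  x
}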
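For what it's worth, the same forward fill can also be written without an explicit loop. The following is a sketch rather than anything proposed in this thread, and fillin_vec is a made-up name. It relies on cumsum(!is.na(x)) counting how many non-missing values have been seen so far, which is the position of the value to carry forward among the non-missing ones.

```r
## Sketch of a loop-free forward fill (fillin_vec is a hypothetical name).
fillin_vec <- function(x){
  seen <- cumsum(!is.na(x))   # number of non-NA values up to each position
  ## Prepend NA so that positions before the first non-NA value
  ## (seen == 0) map to NA instead of being dropped.
  c(NA, x[!is.na(x)])[seen + 1]
}

fillin_vec(c(4, 7, NA, 3, NA, NA, 5, 2))
## [1] 4 7 7 3 3 3 5 2
```

Unlike the loop, this version leaves a leading NA as NA instead of failing on it.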
fillin2() is a bit trickier, working recursively from end to
beginning. Still, it's only a few lines of code, and might be improved
in some way I didn't think of:
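fillin2 <- function(x){
  if(length(x) > 1){
    ## fill everything but the last element first
    z <- Recall(head(x, -1))
    if(is.na(tail(x,1))){
      ## last value missing: repeat the last filled value
      x <- c(z, tail(z,1))
    } else x[-length(x)] <- z
  }
  x
}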
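Since the two functions take rather different routes, a quick randomized check that they agree may be reassuring. This is only a sketch: the seed, vector length and number of NA's are arbitrary choices, and the first element is never set to NA because neither function has a previous value to use there. Both definitions are repeated so the snippet runs on its own.

```r
## Both functions as given above, repeated so the check is self-contained.
fillin1 <- function(x){
  for(i in seq_along(x))
    if(is.na(x[i])) x[i] <- x[i-1]
  x
}
fillin2 <- function(x){
  if(length(x) > 1){
    z <- Recall(head(x, -1))
    if(is.na(tail(x,1))){
      x <- c(z, tail(z,1))
    } else x[-length(x)] <- z
  }
  x
}

set.seed(1)   # arbitrary seed, for reproducibility only
agree <- replicate(200, {
  x <- sample(1:9, 30, replace = TRUE)
  x[sample(2:30, 10)] <- NA   # x[1] is never set to NA (see note above)
  identical(fillin1(x), fillin2(x))
})
all(agree)
```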
It might be interesting to compare the performance of all the
suggestions, but I'm too lazy to do that and will compare it only to
Bill's suggestion of approx(). To make the comparison fairer, I'll
remove unnecessary overhead and put 'seq' and 'count' in the global
environment.
seq <- c(3,4,6, 9, 10)
count <- c(4,7,3,5,2)
I'll also use the post-merge 'count' vector from the merge() above:
merged_count <- dat$count
merged_count
[1]  4  7 NA  3 NA NA  5  2

First, check that the fillinx functions actually work:
fillin1(merged_count)
[1] 4 7 7 3 3 3 5 2
fillin2(merged_count)
[1] 4 7 7 3 3 3 5 2
## and of course!
approx(x=seq, y=count, xout=3:10, method="constant", f=0)
$x
[1]  3  4  5  6  7  8  9 10

$y
[1] 4 7 7 3 3 3 5 2

Timing the execution of each 5000 times:
system.time(replicate(5000,approx(x=seq, y=count, xout=3:10, method="constant", f=0)))
   user  system elapsed
  0.062   0.001   0.063
system.time(replicate(5000, fillin1(merged_count)))
   user  system elapsed
  0.008   0.000   0.007
system.time(replicate(5000, fillin2(merged_count)))
   user  system elapsed
  0.222   0.001   0.223

I was not surprised that the recursive solution was the slowest, but was a little surprised that approx() was considerably slower than the iterative fillin1(). Of course, one shouldn't invest too much faith in this little exercise. Results might change drastically for very long vectors with different proportions/patterns of missings, for example. And approx() is much more general, and the above comparison may be a cheat of sorts anyway, since I've omitted the overhead of the merge(), which I assumed would be small for reasonably sized examples. That might be wrong.

Anyway, as I said, for amusement only.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Wed, Mar 30, 2022 at 8:41 AM Bill Dunlap <williamwdunlap at gmail.com> wrote:
stats::approx can do the job:
approx(x=df$seq, df$count, xout=1:7, method="constant", f=0)
$x
[1] 1 2 3 4 5 6 7

$y
[1] 4 7 7 3 5 5 2

-Bill

On Tue, Mar 29, 2022 at 7:47 PM Jeff Reichman <reichmanj at sbcglobal.net> wrote:
R-help

Is there an R function that will insert missing sequence number(s) and then fill a missing observation with the preceding value? For example

df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2))

  seq count
1   1     4
2   2     7
3   4     3
4   5     5
5   7     2

What I need is

  seq count
1   1     4
2   2     7
3   3     7
4   4     3
5   5     5
6   6     5
7   7     2

Jeff

______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.