Skip to content

Inserting missing seq number

9 messages · Eric Berger, Avi Gross, PIKAL Petr +4 more

#
R-help

Is there a R function that will insert missing sequence number(s) and then
fill a missing observation with the preceding value.

For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2))

  seq count
1    1        4
2    2        7
3    4        3
4    5        5
5    7        2

What I need is

  seq count
1    1        4
2    2        7
3    3        7
4    4        3
5    5        5
6    6        5
7    7        2

Jeff
#
Hi Jeff,
tidyr::fill() will do what you want

aDf <- data.frame(seq=1:7)
bDf <- left_join(aDf,df,by="seq")
tidyr::fill(bDf,count)

HTH,
Eric


On Wed, Mar 30, 2022 at 5:47 AM Jeff Reichman <reichmanj at sbcglobal.net>
wrote:

  
  
#
Jeff,

There may well be such a function somewhere but I would have requested
something less ambitious than a function that does exactly that.

As I see it, there are many ways to do what you specifically want, but much
depends on exact conditions in your data.

Your example shows seq values ascending from 1 to 7 with missing rows
for values 3 and 6. Can we assume the data will have everything in order
with gaps of exactly 1, or might there be gaps of 2 or more? Is the sequence
always beginning with 1, or might you have numbers from say 666 to 999?

Can the first or last entries be missing? How would we know?

One of many approaches I can outline is to create a second data structure
containing all valid integers between highest and lowest. In your case,
from 1 to 7. Again, one way is to make a data.frame similar to the above
and use some functions to make the appropriate merge. You would end
up, perhaps, with a data.frame with more rows and the new rows might
contain an NA for the count column.

Now, yes, there are many packages out there that have functions for
filling in missing values. Some allow you to fill in the previous value,
and some may allow a mean or interpolation and so on.

But consider a simple loop of indices in the enlarged data.frame of
values of seq. Each step in the loop makes note of the value stored in
count on the previous pass and if the currently pass has an NA, changes
it to the value being held. Lots more details, but as an outline, this
may suffice. And, yes, if you make a simple set of program lines
that performs what you want, and make it general enough, you should be able
to write an R function that accepts a data.frame and perhaps an indication
of which column to do this with and calculates everything needed to
return the enhanced data.frame as a result. You will then have a function
that does this!

I am sure eventually someone else will throw something like a dplyr function or two
or suggest you could have searched for something like: "R fill in missing info"

-----Original Message-----
From: Jeff Reichman <reichmanj at sbcglobal.net>
To: R-help at r-project.org
Sent: Tue, Mar 29, 2022 10:47 pm
Subject: [R] Inserting missing seq number


R-help



Is there a R function that will insert missing sequence number(s) and then

fill a missing observation with the preceding value.



For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2))



? seq count

1? ? 1? ? ? ? 4

2? ? 2? ? ? ? 7

3? ? 4? ? ? ? 3

4? ? 5? ? ? ? 5

5? ? 7? ? ? ? 2



What I need is



? seq count

1? ? 1? ? ? ? 4

2? ? 2? ? ? ? 7

3? ? 3? ? ? ? 7

4? ? 4? ? ? ? 3

5? ? 5? ? ? ? 5

6? ? 6? ? ? ? 5

7? ? 7? ? ? ? 2



Jeff



______________________________________________

R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see

https://stat.ethz.ch/mailman/listinfo/r-help

PLEASE do read the posting guide http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.
#
Hallo Jeff

Similar result you will get by na.locf function from zoo package

df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2))
vec <-data.frame(seq =1:7)
df <- merge(vec, df, all=T)
library(zoo)
df$count <- na.locf(df$count)

Cheers
Petr
#
Petr & Eric

Thank you for the two options.  Figured there was an easy straight forward
way.

Jeff

-----Original Message-----
From: PIKAL Petr <petr.pikal at precheza.cz> 
Sent: Wednesday, March 30, 2022 12:59 AM
To: reichmanj at sbcglobal.net
Cc: r-help at r-project.org
Subject: RE: [R] Inserting missing seq number

Hallo Jeff

Similar result you will get by na.locf function from zoo package

df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2))
vec <-data.frame(seq =1:7)
df <- merge(vec, df, all=T)
library(zoo)
df$count <- na.locf(df$count)

Cheers
Petr
#
stats::approx can do the job:
$x
[1] 1 2 3 4 5 6 7

$y
[1] 4 7 7 3 5 5 2

-Bill

On Tue, Mar 29, 2022 at 7:47 PM Jeff Reichman <reichmanj at sbcglobal.net>
wrote:

  
  
#
That is not only an elegant solution, but exactly what the OP
requested. I assume it can be generalized:

approx(x=df$seq,df$count,xout=min(df$seq):max(df$seq),method="constant",f=0)

Jim
On Thu, Mar 31, 2022 at 2:41 AM Bill Dunlap <williamwdunlap at gmail.com> wrote:
#
Several perfectly fine "off the shelf" solutions have been offered for
this post, so what follows here should be considered only for
amusement.

This seemed like a simple problem to me, so I wondered if one could
write simple code in plain old base R to solve it. Of course, one can.
What follow are two such approaches.

First, as several have already noted, one can use the merge() trick to
obtain the full 'seq' vector and the corresponding 'count' vector with
NA's for the missing seq values. Here is a slightly more complicated
example than the OP posted that allows for more than one missing 'seq'
value in a row and starts at something other than 1(one could use diff
if NA's only occur one at a time I think).

dat <- data.frame(seq = c(3,4,6,9,10), count = c(4,7,3,5,2))
seq count
1   3     4
2   4     7
3   6     3
4   9     5
5  10     2

The merge() trick than gives:
dat <- merge(data.frame(
   seq = seq.int(dat[1,'seq'], tail(dat,1)[1,'seq'])),
   dat, all.x = TRUE)
seq count
1   3     4
2   4     7
3   5    NA
4   6     3
5   7    NA
6   8    NA
7   9     5
8  10     2

So focusing on the 'count' vector, one needs a 'fill in' function to
appropriately fill in the missing values. Here are two. fillin1() does
this in the obvious way, moving sequentially from the beginning to the
end, filling in the previous value whenever NA is encountered:

fillin1 <- function(x){
   for(i in seq_along(x))
      if(is.na(x[i]))x[i] <- x[i-1]
   x
}

fillin2() is a bit trickier, working recursively from end to
beginning. Still, it's only a few lines of code, and might be improved
in some way I didn't think of:

fillin2 <- function(x){
   if(length(x) > 1){
      z <- Recall(head(x, -1))
      if(is.na(tail(x,1))){
         x <-c(z,tail(z,1))
      } else x[-length(x)] <- z
   }
   x
}

It might be interesting to compare the performance of all the
suggestions, but I'm too lazy to do that and will compare it only to
Bill's suggestion of approx(). To make the comparison fairer, I'll
remove unnecessary overhead and put 'seq' and 'count' in the global
environment.


seq <- c(3,4,6, 9, 10)
count <- c(4,7,3,5,2)

I'll also use the post merged 'count' from the above merge()
merged_count <- dat$count
[1]  4  7 NA  3 NA NA  5  2

First, check that the fillinx functions actually work:
[1] 4 7 7 3 3 3 5 2
[1] 4 7 7 3 3 3 5 2

## and of course!
$x
[1]  3  4  5  6  7  8  9 10

$y
[1] 4 7 7 3 3 3 5 2

Timing the execution of each 5000 times:
user  system elapsed
  0.062   0.001   0.063
user  system elapsed
  0.008   0.000   0.007
user  system elapsed
  0.222   0.001   0.223

I was not surprised that the recursive solution was the slowest, but
was a little surprised that approx() was considerably slower than the
iterative fillin1() . Of course, one shouldn't invest too much faith
in this little exercise. Results might change drastically for very
long vectors with different proportions/patterns of missings, for
example. And approx is much more general and the above comparison may
be a cheat of sorts anyway since I've omitted the overhead of the
merge(), which I assumed would be small for reasonably sized examples.
That might be wrong.

Anyway, as I said, for amusement only.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Wed, Mar 30, 2022 at 8:41 AM Bill Dunlap <williamwdunlap at gmail.com> wrote:
#
Bill 

 

Thank you. This is a third option to look into.

 

Jeff

 

From: Bill Dunlap <williamwdunlap at gmail.com> 
Sent: Wednesday, March 30, 2022 10:40 AM
To: reichmanj at sbcglobal.net
Cc: R-help at r-project.org
Subject: Re: [R] Inserting missing seq number

 

stats::approx can do the job:
$x
[1] 1 2 3 4 5 6 7

$y
[1] 4 7 7 3 5 5 2

 

-Bill
On Tue, Mar 29, 2022 at 7:47 PM Jeff Reichman <reichmanj at sbcglobal.net <mailto:reichmanj at sbcglobal.net> > wrote:
R-help

Is there a R function that will insert missing sequence number(s) and then
fill a missing observation with the preceding value.

For example df <- data.frame(seq = c(1,2,4,5,7), count = c(4,7,3,5,2))

  seq count
1    1        4
2    2        7
3    4        3
4    5        5
5    7        2

What I need is

  seq count
1    1        4
2    2        7
3    3        7
4    4        3
5    5        5
6    6        5
7    7        2

Jeff

______________________________________________
R-help at r-project.org <mailto:R-help at r-project.org>  mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.