finding and describing missing data runs in a time series

Not at a computer to test this but perhaps

rle(is.na(x))

might help. 

Michael

Hi -

I am trying to find and describe missing data in a time series. For instance, in the openair package, there is a data frame called "mydata":
library(openair)
head(mydata)

                 date   ws  wd nox no2 o3 pm10    so2      co pm25
1 1998-01-01 00:00:00 0.60 280 285  39  1   29 4.7225  3.3725   NA
2 1998-01-01 01:00:00 2.16 230  NA  NA NA   37     NA      NA   NA
3 1998-01-01 02:00:00 2.76 190  NA  NA  3   34 6.8300  9.6025   NA
4 1998-01-01 03:00:00 2.16 170 493  52  3   35 7.6625 10.2175   NA
5 1998-01-01 04:00:00 2.40 180 468  78  2   34 8.0700  8.9125   NA
6 1998-01-01 05:00:00 3.00 190 264  42  0   16 5.5050  3.0525   NA

So for example, for pm25, I would like to be able to detect that there is a run of NA's starting at 1998-01-01 00:00:00 that lasts for 2887 hourly observations. Then I would be able to know that there is an NA at 2910 and so on. The key information I am looking for is when the NA runs start and how long they are. The closest thing I know of is timePlot in the openair package with statistic="frequency", but it only gives monthly summary data and does not tell me whether the missing data are clumped together or dispersed.

VR

Jim

James T. Durant, MSPH CIH
Emergency Response Coordinator
US Agency for Toxic Substances and Disease Registry
Atlanta, GA 30341
770-378-1695


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
You might consider an approach based on

  rle(is.na(mydata$pm25))

See ?rle

Example:

  X <- c(1,2,3,NA,NA,NA,4,5,NA,6,7,8,NA,NA,NA,NA,NA)
  X
  # [1]  1  2  3 NA NA NA  4  5 NA  6  7  8 NA NA NA NA NA
  rle(is.na(X))
  # Run Length Encoding
  #   lengths: int [1:6] 3 3 2 1 3 5
  #   values : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
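
To get from the rle() output to the start and length of each NA run, one option is to take the cumulative sum of the run lengths. The following is only a sketch: na_runs is a hypothetical helper name (it is not part of base R or openair), and the optional index argument is an assumption about how one might pass in the date column.

```r
# Hypothetical helper (not from base R or openair): one row per run of NAs.
# 'index' is an optional vector of labels (e.g. dates), same length as x.
na_runs <- function(x, index = seq_along(x)) {
  r      <- rle(is.na(x))
  ends   <- cumsum(r$lengths)        # last position of every run
  starts <- ends - r$lengths + 1L    # first position of every run
  keep   <- r$values                 # TRUE only for the NA runs
  data.frame(
    start_pos = starts[keep],        # row number where the NA run begins
    start     = index[starts[keep]], # label (e.g. date) at that row
    length    = r$lengths[keep]      # number of consecutive NAs
  )
}

X <- c(1, 2, 3, NA, NA, NA, 4, 5, NA, 6, 7, 8, NA, NA, NA, NA, NA)
runs <- na_runs(X)
print(runs)
# NA runs start at positions 4, 9 and 13, with lengths 3, 1 and 5
```

For the openair data, something like na_runs(mydata$pm25, mydata$date) should then list the starting date and length of each gap, and table(runs$length) would give a rough idea of whether the gaps are clumped or dispersed.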

Ted.

-------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at wlandres.net>
Date: 13-Feb-2012  Time: 08:51:19
This message was sent by XFMail