finding and describing missing data runs in a time series
3 messages · Durant, James T. (ATSDR/DTEM/PRMSB), R. Michael Weylandt (michael.weylandt at gmail.com), Ted Harding
Not at a computer to test this, but perhaps rle(is.na(x)) might help.

Michael
On Feb 12, 2012, at 7:36 PM, "Durant, James T. (ATSDR/DTEM/PRMSB)" <hzd3 at cdc.gov> wrote:
Hi - I am trying to find and describe missing data in a time series. For instance, in the library openair, there is a data frame called "mydata":

    library(openair)
    head(mydata)
                     date   ws  wd nox no2 o3 pm10    so2      co pm25
    1 1998-01-01 00:00:00 0.60 280 285  39  1   29 4.7225  3.3725   NA
    2 1998-01-01 01:00:00 2.16 230  NA  NA NA   37     NA      NA   NA
    3 1998-01-01 02:00:00 2.76 190  NA  NA  3   34 6.8300  9.6025   NA
    4 1998-01-01 03:00:00 2.16 170 493  52  3   35 7.6625 10.2175   NA
    5 1998-01-01 04:00:00 2.40 180 468  78  2   34 8.0700  8.9125   NA
    6 1998-01-01 05:00:00 3.00 190 264  42  0   16 5.5050  3.0525   NA

So, for example, for pm25 I would like to be able to detect that there are NAs starting at 1998-01-01 0:00:00 that run for 2887 hourly observations. Then I would be able to know that there is an NA at 2910, and so on. The key information I am looking for is when the NAs start and how long each run is.

The closest thing I know of is timePlot in the openair package with statistic="frequency", but it only gives monthly summary data and does not tell me whether the missing data are clumped together or dispersed.

VR

Jim

James T. Durant, MSPH CIH
Emergency Response Coordinator
US Agency for Toxic Substances and Disease Registry
Atlanta, GA 30341
770-378-1695
You might consider an approach based on

    rle(is.na(mydata$pm25))

See ?rle

Example:

    X <- c(1,2,3,NA,NA,NA,4,5,NA,6,7,8,NA,NA,NA,NA,NA)
    X
    #  [1]  1  2  3 NA NA NA  4  5 NA  6  7  8 NA NA NA NA NA
    rle(is.na(X))
    # Run Length Encoding
    #   lengths: int [1:6] 3 3 2 1 3 5
    #   values : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE

Ted.

-------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at wlandres.net>
Date: 13-Feb-2012  Time: 08:51:19
This message was sent by XFMail
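The rle() output gives run lengths but not where each run begins, which is the other half of what the original question asks for. A minimal sketch of one way to get both follows. The helper name na_runs and its index argument are hypothetical, introduced here for illustration and not taken from the thread. Start positions come from the cumulative sum of the run lengths.

```r
# Hypothetical helper (not from the thread): one row per run of NAs in x.
# 'index' is an optional vector of labels (e.g. timestamps) aligned with x.
na_runs <- function(x, index = seq_along(x)) {
  r      <- rle(is.na(x))
  ends   <- cumsum(r$lengths)        # last position of every run
  starts <- ends - r$lengths + 1L    # first position of every run
  keep   <- r$values                 # TRUE where the run consists of NAs
  data.frame(start_pos = starts[keep],
             start     = index[starts[keep]],
             length    = r$lengths[keep])
}

# Same example vector as in the reply above
X <- c(1, 2, 3, NA, NA, NA, 4, 5, NA, 6, 7, 8, NA, NA, NA, NA, NA)
runs <- na_runs(X)
runs
```

For X this reports NA runs starting at positions 4, 9 and 13 with lengths 3, 1 and 5, matching the TRUE entries in the rle() output. Assuming openair is installed and mydata is loaded as in the question, na_runs(mydata$pm25, mydata$date) should give the start timestamp of each run in the start column.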