[RsR] Seasonal adjustment; What's a good `blackbox' robust lm() function?
On Sun, Jun 8, 2008 at 10:19 AM, Ajay Shah <ajayshah at mayin.org> wrote:
There is no support for x12arima or Tramo/Seats in R. I am doing a
Check out: http://finzi.psych.upenn.edu/R/Rhelp02a/archive/40254.html Also more below.
poor man's seasonal adjustment procedure: For monthly data, I do a regression with no intercept and 12 dummy variables (i.e. a month factor), and then calculating the residual. If I did this using lm() the procedure is always feasible as long as there are atleast 12 data points. When I started using this, I noticed that quite a few time series have extreme values and these were making the function generate wonky values. So I thought I should switch to a robust lm(). I was surprised to find that MASS:rlm() does not always converge. For quite a few series I find that it breaks. I wonder what the gurus on this list think. I can come back with concrete numerical examples of time-series where MASS:rlm() does not work, if that's interesting. More generally, if you have any suggestions on how to do seasonal adjustment, they would be most interesting.
Noting that months(chron.object) and quarters(chron.object) gives an ordered factor, here are a few approaches to the coefs: library(chron) library(quantreg) rq(cpi ~ months(as.chron(date)) -1, data = n, method = "fn") or what is basically the same, the median tapply(n$cpi, list(month = months(as.chron(n$date))), median, na.rm = TRUE) or a trimmed mean: tapply(n$cpi, list(month = months(as.chron(n$date))), mean, trim = 0.1, na.rm = TRUE) Note that we can simplify the adjustment function using ave. The following subtracts out monthly trimmed means from each data point: ave(n$cpi, months(as.chron(n$date)), FUN = function(x) x - mean(x, trim = 0.1, na.rm = TRUE))
My code is:
# A vector of data "x" and a vector of dates "dates" is supplied.
# You must choose whether you want to focus on monthly or quarterly
# seasonality. The function will then do the work of mapping from
# "dates" into the relevant vector of dummy variables.
#
# x is assumed to be in levels.
# A dummy variable regression is done.
# NA values for x are correctly handled, but NA values for the
# dates are not tolerated.
# It is typical for you to send in c(NA,diff(log(levels))) in which case
# the dummy variables regression is additive seasonality in log
# differences.
# You get back a vector which has the same length as what came in.
# In addition, you also get back the results of the dummy variable
# regression.
library(MASS)
date2month <- function(dates) {
factor(as.POSIXlt(dates)$mon, levels=0:11,
labels=c("Jan", "Feb", "Mar", "Apr",
"May", "Jun", "Jul", "Aug",
"Sep", "Oct", "Nov", "Dec"))
}
date2quarter <- function(dates) {
factor(c(1,1,1,2,2,2,3,3,3,4,4,4)[as.numeric(date2month(dates))],
levels=1:4, labels=c("Q1","Q2","Q3","Q4"))
}
sa.dummyvariableregression <- function(x, dates, seasonality="monthly") {
d <- data.frame(x=x, date=dates)
if (seasonality == "monthly") {
d$pfactor <- date2month(d$date)
} else {
d$pfactor <- date2quarter(d$date)
}
non.na.locations <- !is.na(d$x)
d <- d[non.na.locations,]
# At this point, we're holding the clean non-NA data.
# Dummy variable regression: do rlm() if you can but use lm() otherwise.
m <- rlm(x ~ -1 + pfactor, data=d)
if (! m$converged) {
cat("rlm failed to converge; switching to lm().\n")
m <- lm(x ~ -1 + pfactor, data=d)
}
# Extract residuals
sa <- m$residuals # m$residuals is one short
sa <- sa + mean(coef(m)) # Intercepts have soaked up the mean
# Place back into the original layout with NAs in the right places
sa.full <- rep(NA, length(non.na.locations))
sa.full[non.na.locations] <- sa
# Send back the goodies.
list(sa=sa.full, m=m)
}
Here's some data to experiment with this. It's a data frame "n"
containing n$date and n$cpi.
n <- structure(list(date = structure(c(7395, 7425, 7456, 7486, 7517,
7548, 7578, 7609, 7639, 7670, 7701, 7729, 7760, 7790, 7821, 7851,
7882, 7913, 7943, 7974, 8004, 8035, 8066, 8095, 8126, 8156, 8187,
8217, 8248, 8279, 8309, 8340, 8370, 8401, 8432, 8460, 8491, 8521,
8552, 8582, 8613, 8644, 8674, 8705, 8735, 8766, 8797, 8825, 8856,
8886, 8917, 8947, 8978, 9009, 9039, 9070, 9100, 9131, 9162, 9190,
9221, 9251, 9282, 9312, 9343, 9374, 9404, 9435, 9465, 9496, 9527,
9556, 9587, 9617, 9648, 9678, 9709, 9740, 9770, 9801, 9831, 9862,
9893, 9921, 9952, 9982, 10013, 10043, 10074, 10105, 10135, 10166,
10196, 10227, 10258, 10286, 10317, 10347, 10378, 10408, 10439,
10470, 10500, 10531, 10561, 10592, 10623, 10651, 10682, 10712,
10743, 10773, 10804, 10835, 10865, 10896, 10926, 10957, 10988,
11017, 11048, 11078, 11109, 11139, 11170, 11201, 11231, 11262,
11292, 11323, 11354, 11382, 11413, 11443, 11474, 11504, 11535,
11566, 11596, 11627, 11657, 11688, 11719, 11747, 11778, 11808,
11839, 11869, 11900, 11931, 11961, 11992, 12022, 12053, 12084,
12112, 12143, 12173, 12204, 12234, 12265, 12296, 12326, 12357,
12387, 12418, 12449, 12478, 12509, 12539, 12570, 12600, 12631,
12662, 12692, 12723, 12753, 12784, 12815, 12843, 12874, 12904,
12935, 12965, 12996, 13027, 13057, 13088, 13118, 13149, 13180,
13208, 13239, 13269, 13300, 13330, 13361, 13392, 13422, 13453,
13483, 13514, 13545, 13573, 13604, 13634, 13665, 13695, 13726,
13757, 13787, 13818, 13848, 13879, 13910, 13939, 13970), class = "Date"),
cpi = c(38.88, 39.31, 39.96, 40.82, 41.04, 41.25, 42.12,
42.76, 42.98, 43.63, 43.63, 43.41, 43.63, 44.06, 45.14, 46.22,
46.87, 47.73, 48.16, 48.6, 48.6, 49.24, 49.46, 49.46, 49.89,
50.54, 50.97, 52.27, 52.27, 52.48, 52.7, 52.7, 52.48, 52.05,
52.27, 52.48, 52.92, 53.13, 54, 54.64, 55.29, 55.94, 56.59,
57.24, 57.02, 56.8, 57.24, 57.67, 58.1, 58.75, 59.83, 60.69,
61.34, 62.2, 62.42, 62.85, 62.42, 62.42, 62.85, 63.28, 63.71,
64.79, 66.09, 67.6, 68.03, 68.47, 68.9, 69.33, 68.47, 68.03,
68.25, 68.9, 69.98, 70.84, 71.92, 73.22, 74.08, 74.3, 74.73,
75.38, 75.59, 75.59, 75.59, 75.81, 76.46, 76.03, 76.67, 77.32,
77.54, 77.97, 78.83, 79.05, 80.35, 82.94, 82.51, 82.07, 82.72,
84.02, 86.18, 88.77, 89.2, 90.71, 93.52, 94.6, 92.66, 90.71,
89.63, 89.42, 89.63, 90.5, 90.71, 91.58, 92.01, 92.66, 94.38,
94.6, 93.09, 93.09, 92.87, 93.74, 94.6, 95.03, 95.46, 96.11,
95.68, 95.9, 96.98, 97.19, 96.33, 96.11, 95.68, 96.11, 96.76,
97.41, 98.7, 100, 100.65, 100.43, 101.08, 101.94, 101.3,
100.86, 100.65, 101.08, 101.3, 101.94, 102.81, 103.89, 104.54,
104.75, 105.18, 105.62, 104.54, 104.32, 104.54, 105.18, 106.48,
106.7, 107.34, 108.21, 107.78, 107.78, 108.64, 108.86, 108.42,
108.86, 108.86, 108.86, 108.86, 109.72, 110.58, 111.66, 112.74,
112.96, 113.61, 113.39, 112.53, 113.61, 113.39, 113.39, 114.25,
113.82, 114.25, 116.2, 116.63, 117.06, 118.36, 119.44, 118.79,
119, 119, 119, 120, 121, 123, 124, 124, 125, 127, 127, 127,
127, 128, 127, 128, 129, 130, 132, 133, 133, 134, 134, 134,
134, 135, 137, 138)), .Names = c("date", "cpi"), row.names = 4:220, class = "data.frame", na.action = structure(c(1L,
2L, 3L, 221L), .Names = c("1", "2", "3", "221"), class = "omit"))
--
Ajay Shah http://www.mayin.org/ajayshah
ajayshah at mayin.org http://ajayshahblog.blogspot.com
<*(:-? - wizard who doesn't know the answer.
_______________________________________________ R-SIG-Robust at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-robust