How to clean errors in Yahoo historical quotes?
This is a follow-up question to the previous thread on time gaps. In that first thread, I had identified gaps in the time series. I was focusing on GENZ because I was seeing aberrant results when working with that series. The time gaps were the first problem I detected, and I assumed they were the cause.

While the code provided by Josh Ulrich works beautifully to get compatible series, I still got aberrant results when working with GENZ :-( This has now been traced to errors in the series itself, probably related to a problem with the adjustment algorithm (see below). Thanks to Jeff Ryan, I am also able to compare the same data as reported by Google, which in this specific instance is not affected.

The general question is then: given that downloaded data can be affected by errors, how should they be cleaned? I can see ways to do that, especially by direct observation and manual cleaning, but again I don't want to reinvent the wheel. Also, is it worth contacting Yahoo to have the series cleaned at the source? (My gut feeling is no.) And yes, I understand that both Yahoo and Google data are free and so come with no guarantee.

First, the data as downloaded from Yahoo via getYahooData in package TTR (the corresponding Yahoo chart is OK, BTW):
genz['2001-04-24::2001-05-08']
             Open   High    Low  Close  Volume Unadj.Close Div Split Adj.Div
2001-04-24 51.900 52.000 50.225 50.525 2839400      101.05  NA    NA      NA
2001-04-25 50.495 53.345 50.495 51.765 4161600      103.53  NA    NA      NA
2001-04-26 51.875 53.255 51.145 52.685 1956000      105.37  NA    NA      NA
2001-04-27 52.880 55.000 52.750 53.755 3028800      107.51  NA    NA      NA
2001-04-30 27.025 27.585 26.545 27.245 3396800       54.49  NA    NA      NA  << not very likely
2001-05-01 54.625 55.750 52.860 55.515 2792000      111.03  NA    NA      NA
2001-05-02 55.540 55.625 51.875 54.080 3466600      108.16  NA    NA      NA
2001-05-03 51.375 51.835 51.100 51.315 5412600      102.63  NA    NA      NA
2001-05-04 49.000 51.515 48.750 50.900 4066400      101.80  NA    NA      NA
2001-05-07 25.485 26.495 25.375 26.315 3185500       52.63  NA    NA      NA  << not very likely
2001-05-08 52.800 53.275 52.110 52.980 1884000      105.96  NA    NA      NA

Then the data as downloaded from Google via getSymbols in package quantmod:
GENZ['2001-04-24::2001-05-08']
           GENZ.Open GENZ.High GENZ.Low GENZ.Close GENZ.Volume
2001-04-24     51.90     52.00    50.22      50.52     5678800
2001-04-25     50.50     53.34    50.50      51.76     8323400
2001-04-26     51.88     53.26    51.14      52.68     3912200
2001-04-27     52.88     55.00    52.75      53.76     6057400
2001-04-30     54.05     55.17    53.09      54.48     6793600  << I can believe this one
2001-05-01     54.62     55.75    52.86      55.52     5584000
2001-05-02     55.54     55.62    51.88      54.08     6933200
2001-05-03     51.38     51.84    51.10      51.32    10825200
2001-05-04     49.00     51.52    48.75      50.90     8132800
2001-05-07     50.97     53.00    50.75      52.64     6371000
2001-05-08     52.80     53.28    52.11      52.98     3768200
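
For illustration, here is a minimal base-R sketch of one way such isolated errors might be flagged automatically rather than by eye. It is only a sketch under stated assumptions: the helper name flag_isolated_jumps and the tolerance of 0.3 (in log terms) are hypothetical choices, not part of TTR or quantmod, and the closes are simply copied from the Yahoo listing above. The idea is to mark a point that differs sharply from both of its neighbours while those neighbours agree with each other.

```r
# Adjusted closes copied by hand from the Yahoo listing in this post.
dates <- as.Date(c("2001-04-24", "2001-04-25", "2001-04-26", "2001-04-27",
                   "2001-04-30", "2001-05-01", "2001-05-02", "2001-05-03",
                   "2001-05-04", "2001-05-07", "2001-05-08"))
adj_close <- c(50.525, 51.765, 52.685, 53.755, 27.245, 55.515,
               54.080, 51.315, 50.900, 26.315, 52.980)

# Hypothetical helper (not from any package): flag interior points whose
# absolute log ratio to BOTH neighbours exceeds `tol`, while the two
# neighbours themselves differ by less than `tol`. The first and last
# points are never flagged because they have only one neighbour.
flag_isolated_jumps <- function(x, tol = 0.3) {
  n <- length(x)
  flagged <- rep(FALSE, n)
  if (n < 3) return(flagged)
  i <- 2:(n - 1)
  to_prev    <- abs(log(x[i] / x[i - 1]))
  to_next    <- abs(log(x[i] / x[i + 1]))
  neighbours <- abs(log(x[i + 1] / x[i - 1]))
  flagged[i] <- to_prev > tol & to_next > tol & neighbours < tol
  flagged
}

suspect <- flag_isolated_jumps(adj_close)
print(dates[suspect])  # the two rows marked "not very likely" above
```

A check like this only finds single-day outliers. A genuine split or a multi-day error would not be caught, and an unusually large but real one-day move could be flagged by mistake, so the flagged rows would still need to be compared against a second source such as the Google series before being corrected.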