really puzzled by this R script
2 messages · johnzli at comcast.net, R. Michael Weylandt
This isn't really a finance question...

Your problem is that you use names() instead of just getting the row numbers from outlierTest. When you convert those names to integers and then try to remove rows by those numbers, the removal is done by position, so it doesn't actually hit the right row. An outlier therefore remains and the infinite loop is triggered.

To put it concretely, add a browser() at the top of the while loop and note this:

    x <- 1:20
    y <- x; y[c(3, 14)] <- 1000*c(1, 1.05); y <- jitter(y)
    dat <- data.frame(x = x, y = y)
    rmOutlier(y ~ x, dat[-1,])

    ## Once in browser note this:
    ret                 # should give "14" because the 14th spot is a problem by construction
    fullData[ - ret, ]  # Still has the outlier at "14" because it's not in the 14th row!

If you just use the names and index appropriately, you should be fine.

Further help is more suited to R-help though, as this isn't very financial... just a heads up though: you'll also probably get told off for mentioning outlier removal on R-help: it's something of a rite of passage.

Michael
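The mismatch between row names and row positions can be reproduced without car at all. The following is a minimal sketch in base R on made-up data (the objects dat, sub, wrong and right are illustrative, not from the original script): once the first row is dropped, the row named "14" sits in position 13, so removing by the number 14 takes out a different row.

```r
# Toy data: 20 rows, then drop the first row so names and positions disagree.
dat <- data.frame(x = 1:20, y = (1:20)^2)
sub <- dat[-1, ]

# Row names are carried over from dat, so position 13 is still named "14".
name_at_13 <- rownames(sub)[13]

# This mimics as.numeric(names(outlier$rstudent)) in the posted script.
ret <- as.numeric("14")

# Positional removal: drops the 14th row of sub, which is the row named "15".
wrong <- sub[-ret, ]

# Name-based removal: drops the row that is actually called "14".
right <- sub[!(rownames(sub) %in% "14"), ]

# Check which version still contains the row named "14".
still_there_wrong <- "14" %in% rownames(wrong)
still_there_right <- "14" %in% rownames(right)
```

When the subset starts at row 1, names and positions coincide, which would explain why cases 1) and 2) in the original post run fine.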
On Sat, Feb 25, 2012 at 1:27 AM, <johnzli at comcast.net> wrote:
Dear all,
I wrote an R script that tries to identify outliers. It returns a non-empty vector containing the indices of the outliers, or a NULL object if there are no outliers.
I have been puzzled by the strange behavior of this function. Let's say we have 10 outliers in a data frame of 1000 rows.
1) If I run rmOutlier(y ~ x1 + x2, xyData), where xyData is a data frame with column names "y", "x1", "x2", the program runs fine and returns the indices of those 10 outliers.
2) If I run rmOutlier(y ~ x1 + x2, xyData[1:200, ]), or any subset that starts from the first row (i.e., 1:xx), the program runs fine.
3) If I run the script on a subset that does not start from the first row, e.g., rmOutlier(y ~ x1 + x2, xyData[100:1000, ]), and no outlier falls within xyData[100:1000, ], the program runs fine.
However, in case 3), if any outlier falls within xyData[100:1000, ], the program runs in an infinite loop (the "while" loop in the script). Troubleshooting indicates that outlierTest(lm(lm_form, data = fullData[-ret, ])) always returns the same set of outlier indices, and fullData[-ret, ] seems to have no effect.
What went wrong here? Any help will be greatly appreciated.
Thank you.
John Li
This is the script:
"rmOutlier" <- function(lm_form, fullData) {
  # Find and return Outliers indices based on Bonferroni Outlier Test
  # The program returns a non-empty vector or a NULL object
  # AUTHOR:
  # John Li
  # Date: Feb 17, 2012
  # Revised: Feb 24, 2012
  #
  require(car, quietly = TRUE)
  #sanity check
  stopifnot(is.data.frame(fullData), ncol(fullData) > 1, length(names(fullData)) == ncol(fullData))
  outlier <- outlierTest( lm(lm_form, data = fullData), n.max = Inf)
  if ( outlier$signif ) {
    ex <- c(as.numeric(names(outlier$rstudent)))
    ret <- ex
  }
  else {
    ret <- NULL
    return(ret)
  }
  while( outlier$signif ) {
    outlier <- outlierTest( lm(lm_form, data = fullData[-ret, ]), n.max = Inf) # fullData[-ret, ] seems not work
    if ( outlier$signif ) {
      ex <- as.numeric(names(outlier$rstudent))
      ret <- c(ret, ex)
    }
  }
  return(ret)
}
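One possible way to apply the advice in the reply is to keep track of flagged observations by row name and to drop them by matching names rather than by negative position. The sketch below is only an illustration under that assumption: the function name rmOutlierByName and the test data are hypothetical, and it returns row names as character strings instead of numeric indices. It relies on outlierTest() from car labelling its results with the row names of the data, as the posted script already assumes.

```r
library(car)  # provides outlierTest()

# Hypothetical helper (the name is an assumption, not from the thread):
# tracks flagged observations by row name instead of by position.
rmOutlierByName <- function(lm_form, fullData) {
  stopifnot(is.data.frame(fullData), ncol(fullData) > 1)
  flagged <- character(0)
  repeat {
    # Drop previously flagged rows by matching row names, so it does
    # not matter whether the subset starts at row 1.
    keep <- !(rownames(fullData) %in% flagged)
    fit <- lm(lm_form, data = fullData[keep, , drop = FALSE])
    outlier <- outlierTest(fit, n.max = Inf)
    # Stop as soon as no remaining observation is significant.
    if (!outlier$signif) break
    flagged <- c(flagged, names(outlier$rstudent))
  }
  if (length(flagged) == 0) NULL else flagged
}

# Usage on made-up data with one planted outlier in the row named "14".
set.seed(1)
dat <- data.frame(x = 1:30)
dat$y <- 2 * dat$x + rnorm(30)
dat$y[14] <- 1000
res <- rmOutlierByName(y ~ x, dat[-1, ])
```

Because every pass either stops or flags at least one row that has not been flagged before, the loop cannot keep rediscovering the same outlier. The sketch does not guard against the case where so many rows are removed that lm() can no longer be fitted.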
_______________________________________________
R-SIG-Finance at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only. If you want to post, subscribe first.
-- Also note that this is not the r-help list where general R questions should go.