Removing Outliers Function

10 messages · kirtau, Ravi Varadhan, David Winsemius +3 more

#
I am working on a function that will remove outliers for regression analysis.
I am treating a data point as an outlier if its studentized residual is
above 3 or below -3. The code below is what I have so far for the function:

x = c(1:20)
y = c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
data1 = data.frame(x,y)

 
rm.outliers = function(dataset,dependent,independent){
    dataset$predicted = predict(lm(dependent~independent))
    dataset$stdres = rstudent(lm(dependent~independent))
    m = 1
    for(i in 1:length(dataset$stdres)){
      dataset$outlier_counter[i] = if(dataset$stdres[i] >= 3 |
dataset$stdres[i] <= -3) {m} else{0}
    }
    j = length(which(dataset$outlier_counter >= 1))
    while(j>=1){
      print(dataset[which(dataset$outlier_counter >= 1),])
      dataset = dataset[which(dataset$outlier_counter == 0),]
      dataset$predicted = predict(lm(dependent~independent))
      dataset$stdres = rstudent(lm(dependent~independent))
        m = m+1
        for(k in 1:length(dataset$stdres)){
          dataset$outlier_counter[k] = if(dataset$stdres[k] >= 3 |
dataset$stdres[k] <= -3) {m} else{0}
        }
      j = length(which(dataset$outlier_counter >= 1))
    }
    return(dataset)
}

The problem I run into is that I receive this error when I type

rm.outliers(data1,data1$y,data1$x)

"    x  y predicted   stdres outlier_counter
16 16 85  22.98647 24.04862               1
Error in `$<-.data.frame`(`*tmp*`, "predicted", value = c(0.114285714285714,  : 
  replacement has 20 rows, data has 19"

Note: the outlier_counter variable is used to state which "round" of the
loop the datapoint was marked as an outlier.
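
The error above has a mechanical cause: `data1$y` and `data1$x` are evaluated once, when the function is called, so `lm(dependent~independent)` keeps refitting all 20 values even after a row has been dropped from `dataset`. A minimal sketch of one way around this is to pass a formula and refit on whatever rows remain. The names `rm_outliers`, `model_formula`, and `cutoff` are hypothetical, and the sketch assumes the data have no missing values. It only addresses the mechanics of the error, not whether automatic removal is a sound procedure.

```r
# Hypothetical rewrite; rm_outliers, model_formula and cutoff are not from the post.
rm_outliers <- function(dataset, model_formula, cutoff = 3) {
  round_no <- 1
  repeat {
    # Refit on the rows that remain, so the residuals always match nrow(dataset)
    fit <- lm(model_formula, data = dataset)
    stdres <- rstudent(fit)
    # Treat undefined residuals as "not flagged" so the test below never sees NA
    flagged <- !is.na(stdres) & abs(stdres) >= cutoff
    if (!any(flagged)) break
    # Report the rows dropped in this round, as the original function did
    print(cbind(dataset[flagged, , drop = FALSE],
                stdres = stdres[flagged], round = round_no))
    dataset <- dataset[!flagged, , drop = FALSE]
    round_no <- round_no + 1
  }
  dataset
}

x <- 1:20
y <- c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
data1 <- data.frame(x, y)

cleaned <- rm_outliers(data1, y ~ x)
```

Passing `y ~ x` together with `data = dataset` means the model variables are looked up in the current, already shortened data frame on every pass, which is what the original loop was missing.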

This would be a HUGE help to me and a few buddies who run a lot of different
regression tests.

Thanks, and if the question is still confusing please ask

 

-----
- AK
#
On Feb 8, 2011, at 9:11 PM, kirtau wrote:

            
The solution is about 3 or 4 lines of code to make the function, but  
removing outliers like this is simply statistical malpractice. Maybe  
it's a good thing that R has a shallow learning curve.
#
David,

Please allow me to digress a lot here.  You are one of the few (including yours truly!) who use the phrase "shallow learning curve" to indicate difficulty of learning (I assume this is what you meant). I always felt that "steep learning curve" was incorrect.  If you plotted the amount of learning on the Y-axis and time on the X-axis, a steep learning curve would mean that one learns very quickly, which is just the opposite of what is actually meant. 

Best,
Ravi.
____________________________________________________________________

Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvaradhan at jhmi.edu


----- Original Message -----
From: David Winsemius <dwinsemius at comcast.net>
Date: Tuesday, February 8, 2011 10:09 pm
Subject: Re: [R] Removing Outliers Function
To: kirtau <kirtau at live.com>
Cc: r-help at r-project.org
#
Exactly right. I use the phrase to catch the unwary's attention. I  
think the effect is properly placed on the y-axis.

IIRC, Ben Bolker (or was it Bert Gunter?) has also commented in the
R-help or r-devel pages on this curious inversion of functional meaning.
#
On 02/09/2011 03:43 PM, David Winsemius wrote:
I certainly agree with both of you as a matter of illustration. However, 
I have heard the phrase (mis)used to indicate a situation in which the 
learner had to learn a lot quickly, or in the concrete imagery of "a 
steep learning curve must be hard to climb". Language is a wonderful 
tool, even if we sometimes break things with it.

Jim
#
I have two questions,

1) If the solution is only three or four lines of code, is there any way you
can share those lines, without disrespecting me further?

2) Can you explain why you feel that this is "statistical malpractice"?

-----
- AK
#
I have two questions, 

1) If the solution is only three or four lines of code, is there any way you
can share those lines instead of stating that the solution is easy and
providing no code? I prefer not to use an R package but to have a "raw
function". 

2) Can you explain why you feel that this is "statistical malpractice"?

-----
- AK
#
On Feb 9, 2011, at 1:25 PM, kirtau wrote:

            
You are proposing to systematically distort your data (apparently  
without even examining it)  before conducting an inferential process.  
The old FLA GIGO is operative here. The data arose from some process  
in nature and the outliers are just as important as the inliers. If  
you want methods that are robust to "outliers" you should look at the  
Robust Statistics Task View:
http://cran.r-project.org/web/views/Robust.html
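
As one hedged illustration of that alternative (not something posted in the thread), the sketch below fits the original poster's data with `rlm()` from the MASS package, which uses a Huber M-estimator by default, and compares it with ordinary least squares. The object names `ols_fit` and `robust_fit` are made up for the example.

```r
library(MASS)  # provides rlm(); MASS is a recommended package shipped with R

x <- 1:20
y <- c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
data1 <- data.frame(x, y)

ols_fit    <- lm(y ~ x, data = data1)   # every point gets full weight
robust_fit <- rlm(y ~ x, data = data1)  # large residuals are downweighted

# Compare the two sets of coefficients side by side
print(rbind(ols = coef(ols_fit), robust = coef(robust_fit)))

# Final weights from the robust fit: values well below 1 mark the
# observations the fit leaned on least; no row is deleted
print(round(robust_fit$w, 2))
```

All 20 observations stay in the data; the unusual ones simply count for less, and the weights show which ones those were.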
#
If you insist ...

1. You are reinventing wheels (poorly).

RSiteSearch("outlier tests",restr="fun")
## RSiteSearch is a handy interface to search facilities on CRAN.
# Go to the site directly for more. Or use Google or other search engines.

will show you that an R package, outliers, already exists that does all
the tests you can imagine -- and more.

2. For why this is a BAD idea, you would need to read up on the
voluminous literature. Talking to a local statistician might be a
better alternative. But here's a hint: AFAIK, the FDA allows no such
tests in the submissions of clinical trial data because it would bias
the results. (Correction welcome if this statement is wrong).

Cheers,
Bert
On Wed, Feb 9, 2011 at 10:25 AM, kirtau <kirtau at live.com> wrote:

  
    
#
For your number 2, look at the outliers data set in the TeachingDemos package and run the first set of examples. Yes, it uses a different rule than the one you use, but still a common one. Think about what is happening in the example: doesn't that make you a little nervous about methods that automatically discard "outliers"?
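
A separate small simulation (not the TeachingDemos example, and with arbitrary settings chosen only for illustration) points at the same concern. It generates data from a straight line with well-behaved normal errors, so nothing is contaminated, and records how often the "studentized residual beyond 3" rule would still flag at least one point.

```r
set.seed(1)    # arbitrary seed, only so the run is reproducible
n_sims <- 2000 # number of simulated data sets (assumed value)
n <- 20        # same sample size as the data in the original post

flagged_any <- replicate(n_sims, {
  x <- 1:n
  y <- 1 + x + rnorm(n)  # true model is linear with N(0, 1) errors
  # TRUE if the +/- 3 rule would discard at least one point
  any(abs(rstudent(lm(y ~ x))) >= 3)
})

# Share of clean data sets in which the rule removes something
print(mean(flagged_any))
```

Every point flagged here is legitimate data, so a function that deletes such points automatically trims the tails of the error distribution and makes the refitted model look more precise than it is.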