
removing outlier function / dataset update

3 messages · kirtau, Ista Zahn

#
Hi,

I have a few lines of code that remove outliers from a regression when the
studentized residuals are above 3 or below -3. I have to do this multiple
times and have tried to write a function to cut down on the copying,
pasting and replacing.

I run into trouble with the function and receive the error
"Error in `$<-.data.frame`(`*tmp*`, "varpredicted", value = c(0.114285714285714,  : 
  replacement has 20 rows, data has 19"

Any help would be appreciated. The code is listed below.

Thank you for your time!

x = c(1:20)
y = c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
data1 = data.frame(x,y)

# remove outliers for regression by studentized residuals being greater than 3
data1$predicted = predict(lm(data1$y~data1$x))
data1$stdres = rstudent(lm(data1$y~data1$x));
i=length(which(data1$stdres>3|data1$stdres< -3))
while(i >= 1){
	remove<-which(data1$stdres>3|data1$stdres< -3)
	print(data1[remove,])
	data1 = data1[-remove,]
	data1$predicted = predict(lm(data1$y~data1$x))
	data1$stdres = rstudent(lm(data1$y~data1$x))
	i = with(data1,length(which(stdres>3|stdres< -3)))
 }

# attempt to create a function to perform the same idea as above
rm.outliers = function(dataset,var1, var2) {

  dataset$varpredicted = predict(lm(var1~var2))
  dataset$varstdres = rstudent(lm(var1~var2))
  i = length(which(dataset$varstdres > 3 | dataset$varstdres < -3))
  while(i >= 1){
    removed = which(dataset$varstdres > 3 | dataset$varstdres < -3)
    print(dataset[removed,])
    dataset = dataset[-removed,]
    dataset$varpredicted = predict(lm(var1~var2))
    dataset$varstdres = rstudent(lm(var1~var2))
    i = with(dataset,length(varstdres > 3 | varstdres < -3))
  }
}
#
Hi,
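x and y are being picked up from your global environment, not from the
x and y in dataset, so they still hold 20 values after a row has been
dropped from dataset. Here is a version that seems to work; it expects
var1 and var2 as column names in quotes, e.g. rm.outliers(data1, "y", "x"):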

rm.outliers = function(dataset, var1, var2) {
    # var1 (response) and var2 (predictor) are column names given as strings
    form = as.formula(paste(var1, var2, sep=" ~ "))
    fit = lm(form, data=dataset)
    dataset$varpredicted = predict(fit)
    dataset$varstdres = rstudent(fit)
    i = length(which(dataset$varstdres > 3 | dataset$varstdres < -3))
    while(i >= 1){
        removed = which(dataset$varstdres > 3 | dataset$varstdres < -3)
        print(dataset[removed,])
        dataset = dataset[-removed,]
        # refit on the remaining rows
        fit = lm(form, data=dataset)
        dataset$varpredicted = predict(fit)
        dataset$varstdres = rstudent(fit)
        # which() is needed here; length() of the logical vector is never 0
        i = with(dataset, length(which(varstdres > 3 | varstdres < -3)))
    }
    dataset  # return the cleaned data frame
}
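
To see why the formula has to be resolved inside the data frame, here is a minimal sketch. The toy data and the helper names fit.vectors and fit.names are made up for illustration and are not part of the original code. It compares passing the columns themselves with passing their names:

```r
# Hypothetical toy data, not the data from the thread.
d <- data.frame(x = 1:10, y = c(2, 4, 5, 8, 11, 12, 13, 17, 18, 40))

# Passing the columns themselves: var1 and var2 hold the caller's
# full-length vectors, so dropping a row from `dataset` does not shorten them.
fit.vectors <- function(dataset, var1, var2) {
    dataset <- dataset[-nrow(dataset), ]   # drop the last row
    length(predict(lm(var1 ~ var2)))       # fit still uses all original values
}

# Passing column names: the formula is looked up inside `dataset`,
# so the fit follows whatever rows are left.
fit.names <- function(dataset, var1, var2) {
    dataset <- dataset[-nrow(dataset), ]   # drop the last row
    form <- as.formula(paste(var1, var2, sep = " ~ "))
    length(predict(lm(form, data = dataset)))
}

n.vectors <- fit.vectors(d, d$y, d$x)   # one fitted value per original row
n.names <- fit.names(d, "y", "x")       # one fitted value per remaining row
```

The first version returns more fitted values than the shortened data frame has rows, which is the same kind of mismatch as in the "replacement has 20 rows, data has 19" error.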


Best,
Ista
On Wed, Jan 26, 2011 at 11:36 AM, kirtau <kirtau at live.com> wrote:

  
    
#
First off, thank you for the help with the global environment.

However, I have tried to run the code and now get a new error:
"Error in formula.default(eval(parse(text = x)[[1L]])) : invalid formula"
I am not sure what to make of it. I have tried a few different workarounds
with no luck.

Any help will continue to be appreciated! 

-----
- AK
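
One possible cause of the "invalid formula" error, offered as an assumption rather than a diagnosis: if the code was copied from an email in which a long line had been wrapped inside the quoted separator string, the separator passed to paste() contains a line break. The pasted text then no longer parses as a single formula. A minimal sketch, using the column names "y" and "x" purely for illustration:

```r
# Separator on one line: a valid formula string.
good <- paste("y", "x", sep = " ~ ")    # "y ~ x"

# Separator with a line break in it, as a wrapped sep string would give.
bad <- paste("y", "x", sep = "\n~ ")    # "y\n~ x"

f <- as.formula(good)

# The wrapped string parses as two expressions (`y`, then `~ x`),
# so converting it to a formula fails; the exact message depends on the R version.
res <- tryCatch(as.formula(bad), error = function(e) "error")
```

If this is the cause, rejoining the wrapped lines so that sep=" ~ " sits on a single line should be enough. Passing the column names without quotes would be another thing to check, since paste() needs them as character strings.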