
Capping outliers

3 messages · Aher, jim holtman

#
Hi Experts,

I am new to R and am using the sample code below to cap outliers at
percentile values. I am working with large data (30,000 observations and 150
variables), and the loop that detects outliers and caps them at the
upper/lower percentile value takes a long time to run.
Is there anything wrong with the code? Can anyone suggest improvements to the
script to speed it up?

min_pctle_cut <- 0.01
max_pctle_cut <- 0.99
library(outliers)

n <- 100
x1 <- runif(n) 
x2 <- runif(n) 
x3 <- x1 + x2 + runif(n)/10 
x4 <- x1 + x2 + x3 + runif(n)/10 
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE)) 
x6 <- factor(1*(x5=='a' | x5=='c')) 
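# Build the data frame directly: cbind() would coerce the factors x5 and x6
# to their integer codes, so every column would pass the is.numeric test below
x <- data.frame(x1, x2, x3, x4, x5, x6)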

z <- x[,sapply(x,is.numeric)]

qs <- sapply(z, function(z) quantile(z,
 	c(min_pctle_cut, max_pctle_cut), na.rm = TRUE)) 


#Loop below taking time for execution

system.time(for (i in 1:ncol(z)) {
  for (j in 1:nrow(z)) {
    if (z[j, i] < qs[1, i]) z[j, i] <- qs[1, i]
    if (z[j, i] > qs[2, i]) z[j, i] <- qs[2, i]
  }
})



--
View this message in context: http://r.789695.n4.nabble.com/Capping-outliers-tp4094647p4094647.html
Sent from the R help mailing list archive at Nabble.com.
#
You can easily vectorize this code using pmin/pmax.
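
As a rough illustration of the idea (not code from this thread), here is a minimal sketch that caps a single numeric vector. The names `v`, `lims` and `capped` and the seed are made up for the example; the 1% and 99% cut-offs are the ones from the question.

```r
# Minimal sketch: cap one numeric vector at its 1st and 99th percentiles.
set.seed(1)                      # arbitrary seed, only for reproducibility
v <- runif(100)                  # hypothetical sample data
lims <- quantile(v, c(0.01, 0.99), na.rm = TRUE)

# pmin() pulls every value above the upper limit down to that limit;
# pmax() then lifts every value below the lower limit up to that limit.
# Both work on the whole vector at once, so no loop over elements is needed.
capped <- pmax(lims[[1]], pmin(lims[[2]], v))

range(capped)                    # smallest and largest value after capping
```

The scalar limits are recycled against the vector, which is what replaces the inner `for (j in 1:nrow(z))` loop.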

On Nov 22, 2011, at 1:06, Aher <ajit.aher at cedar-consulting.com> wrote:

            
#
Here is the solution using pmin/pmax, timed on 10,000 rows. The original
double loop comes first, then the vectorized version.

> qs <- sapply(z, function(z) quantile(z,
+        c(min_pctle_cut, max_pctle_cut), na.rm = TRUE))
> system.time(for (i in 1:ncol(z))
+ {
+        for (j in 1:nrow(z))
+ {
+ if (z[j,i] < qs[1,i]) z[j,i]=qs[1,i]
+ if (z[j,i] > qs[2,i]) z[j,i]=qs[2,i]
+ }
+ })
   user  system elapsed
   6.64    0.00    7.76
> system.time({
+     for (i in 1:ncol(z)) zNew[[i]] <- pmax(qs[1,i], pmin(qs[2,i], z[[i]]))
+ })
   user  system elapsed
   0.02    0.00    0.00
[1] TRUE
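
For anyone who wants to rerun the comparison, the following is a self-contained sketch assembled from the pieces shown in this thread. Several things in it are assumptions rather than the original session: the seed, the smaller row count (2,000, so the reference loop finishes quickly), the three sample columns, the name `zLoop`, and the use of `all.equal` for the final check.

```r
# Sketch: compare the element-wise loop with the pmin/pmax version.
set.seed(42)                     # arbitrary seed (assumption)
n <- 2000                        # smaller than the 10,000 rows timed above
z <- data.frame(x1 = runif(n), x2 = runif(n))
z$x3 <- z$x1 + z$x2 + runif(n) / 10

# 2 x ncol(z) matrix: row 1 holds the lower limits, row 2 the upper limits
qs <- sapply(z, quantile, probs = c(0.01, 0.99), na.rm = TRUE)

# Reference result: the double loop from the question, run on a copy
zLoop <- z
for (i in seq_along(zLoop)) {
  for (j in seq_len(nrow(zLoop))) {
    if (zLoop[j, i] < qs[1, i]) zLoop[j, i] <- qs[1, i]
    if (zLoop[j, i] > qs[2, i]) zLoop[j, i] <- qs[2, i]
  }
}

# Vectorized result: one pmin/pmax call per column, run on another copy
zNew <- z
for (i in seq_along(z)) {
  zNew[[i]] <- pmax(qs[1, i], pmin(qs[2, i], z[[i]]))
}

# Check that both approaches produce the same capped data frame
isTRUE(all.equal(zLoop, zNew))
```

Working on copies (`zLoop`, `zNew`) keeps the uncapped `z` available, so both versions start from the same input. Wrapping either loop in `system.time()` gives timings for your own machine.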

        
On Tue, Nov 22, 2011 at 6:24 AM, Jim Holtman <jholtman at gmail.com> wrote: