Capping outliers
Here is the solution using pmin/pmax for 10,000 rows.
min_pctle_cut <- 0.01
max_pctle_cut <- 0.99
library(outliers)
n <- 10000
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- factor(1*(x5=='a' | x5=='c'))
data1 <- cbind(x1,x2,x3,x4,x5,x6)
x <- data.frame(data1)
z <- x[,sapply(x,is.numeric)]
zNew <- z # save for 2nd test
qs <- sapply(z, function(z) quantile(z,
+ c(min_pctle_cut, max_pctle_cut), na.rm = TRUE))
#Loop below taking time for execution system.time(for (i in 1:ncol(z))
+ {
+ for (j in 1:nrow(z))
+ {
+ if (z[j,i] < qs[1,i]) z[j,i]=qs[1,i]
+ if (z[j,i] > qs[2,i]) z[j,i]=qs[2,i]
+ }
+ })
user system elapsed
6.64 0.00 7.76
system.time({
+ for (i in 1:ncol(z)) zNew[[i]] <- pmax(qs[1,i], pmin(qs[2,i], z[[i]])) + }) user system elapsed 0.02 0.00 0.00
all(z == zNew) # are they the same?
[1] TRUE
On Tue, Nov 22, 2011 at 6:24 AM, Jim Holtman <jholtman at gmail.com> wrote:
You can easily vectorize this code using pmin/pmax. Sent from my iPad On Nov 22, 2011, at 1:06, Aher <ajit.aher at cedar-consulting.com> wrote:
Hi Experts,
I am new to R, using following sample code for capping outliers using
percentile information. ?Working on large data (30000 observations and 150
variables), loop I am using in the below mentioned code for detecting
outliers and capping to upper /lower percentile value is taking much time
for the execution.
Is there anything wrong with code, can anyone suggest improvement in the
script to enhance performance!
min_pctle_cut <- 0.01
max_pctle_cut <- 0.99
library(outliers)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- factor(1*(x5=='a' | x5=='c'))
data1 <- cbind(x1,x2,x3,x4,x5,x6)
x <- data.frame(data1)
z <- x[,sapply(x,is.numeric)]
qs <- sapply(z, function(z) quantile(z,
? ?c(min_pctle_cut, max_pctle_cut), na.rm = TRUE))
#Loop below taking time for execution
system.time(for (i in 1:ncol(z))
{
? ?for (j in 1:nrow(z))
{
if (z[j,i] < qs[1,i]) z[j,i]=qs[1,i]
if (z[j,i] > qs[2,i]) z[j,i]=qs[2,i]
}
})
--
View this message in context: http://r.789695.n4.nabble.com/Capping-outliers-tp4094647p4094647.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it.