Winsorizing Multiple Variables

4 messages · Karl Healey, David Winsemius, Michael Conklin +1 more

#
Hi All,

I want to take a matrix (or data frame) and winsorize each variable,
so that I can, for example, correlate the winsorized variables.

The code below will winsorize a single vector, but when applied to  
several vectors, each ends up sorted independently in ascending order  
so that a given observation is no longer on the same row for each  
vector.

So I need to winsorize each variable but then return it to its original
order, or find another solution that takes a data frame, winsorizes each
variable, and returns a new data frame with all the variables in the
original order.

Thanks for any help!

-Karl


#The function I'm working from

win<-function(x,tr=.2,na.rm=F){

    if(na.rm)x<-x[!is.na(x)]
    y<-sort(x)
    n<-length(x)
    ibot<-floor(tr*n)+1
    itop<-length(x)-ibot+1
    xbot<-y[ibot]
    xtop<-y[itop]
    y<-ifelse(y<=xbot,xbot,y)
    y<-ifelse(y>=xtop,xtop,y)
    win<-y
    win
}

#Produces an example data frame; ss is the observation id, var1-var4
#are the variables I want to winsorize.

ss=c(1:5);var1=rnorm(5);var2=rnorm(5);var3=rnorm(5);var4=rnorm(5)
as.data.frame(cbind(ss,var1,var2,var3,var4))->data
data

#Winsorizes each variable, but sorts them independently so the
#observations no longer line up.

sapply(data,win)


___________________________
M. Karl Healey
Ph.D. Student

Department of Psychology
University of Toronto
Sidney Smith Hall
100 St. George Street
Toronto, ON
M5S 3G3

karl at psych.utoronto.ca
#
It might work better to determine the top and bottom cutoffs for each
column with quantile(), using an appropriate quantile option, and then
process each variable "in place" with your ifelse logic.
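
A minimal sketch of the "in place" idea, assuming the order-statistic cutoffs of the original win() are to be kept rather than replaced by quantile(): the cutoffs come from a sorted copy, and the clamping is applied to the unsorted vector. The name win_inplace is hypothetical. Unlike the original, n here counts only non-missing values, and NAs are left in position so the vector keeps its length.

```r
# Sketch only: win_inplace is a hypothetical name, not from the thread.
win_inplace <- function(x, tr = 0.2) {
  y <- sort(x)                 # sort() drops NAs; used only to find the cutoffs
  n <- length(y)               # number of non-missing values
  ibot <- floor(tr * n) + 1
  itop <- n - ibot + 1
  xbot <- y[ibot]
  xtop <- y[itop]
  # Clamp the original, unsorted vector; NAs stay NA in their positions
  pmin(pmax(x, xbot), xtop)
}

x <- c(10, 1, 7, 3, 100)
win_inplace(x)                 # 10 3 7 3 10: same order as x, extremes pulled in
```

Because the result has the same length and order as the input, sapply(data, win_inplace) would keep each observation on its own row.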

I did find a somewhat different definition of winsorization, with no
sorting, in this code copied from a Patrick Burns posting from earlier
this year on R-SIG-Finance:

function(x, winsorize=5) {
    s <- mad(x) * winsorize
    top <- median(x) + s
    bot <- median(x) - s
    x[x > top] <- top
    x[x < bot] <- bot
    x
}
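
To try the function quoted above, it needs a name. The sketch below assigns it to the hypothetical name winsor_mad and runs it on a small vector with no missing values. Note that this definition clamps at the median plus or minus a multiple of the MAD, not at fixed percentiles, so it is not a drop-in replacement for win().

```r
# Sketch only: winsor_mad is a hypothetical name for the function quoted above.
winsor_mad <- function(x, winsorize = 5) {
  s <- mad(x) * winsorize      # mad() applies its default scaling constant
  top <- median(x) + s
  bot <- median(x) - s
  x[x > top] <- top
  x[x < bot] <- bot
  x                            # same length and order as the input
}

x <- c(4, 1, 100, 2, 3)
winsor_mad(x)                  # only the 100 is pulled in; order is unchanged
```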
#
Don't sort y; set y<-x instead. Calculate xbot and xtop using
xtemp<-quantile(y,c(tr,1-tr),na.rm=na.rm)
xbot<-xtemp[1]
xtop<-xtemp[2]
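
Put together, the suggestion might look like the sketch below. The names win2 and winsorize_df, and the id argument, are hypothetical; the id column is assumed to be called ss as in the example data. quantile() interpolates by default, so the cutoffs can differ slightly from the order-statistic cutoffs in the original win().

```r
# Sketch only: win2 and winsorize_df are hypothetical names.
win2 <- function(x, tr = 0.2, na.rm = FALSE) {
  # Cutoffs from quantile(); x itself is never sorted
  cut <- quantile(x, c(tr, 1 - tr), na.rm = na.rm, names = FALSE)
  pmin(pmax(x, cut[1]), cut[2])   # NAs, if any, stay in place
}

winsorize_df <- function(d, id = "ss", ...) {
  vars <- setdiff(names(d), id)   # leave the id column untouched
  d[vars] <- lapply(d[vars], win2, ...)
  d
}

set.seed(1)                       # arbitrary seed, for reproducibility only
data <- data.frame(ss = 1:5, var1 = rnorm(5), var2 = rnorm(5))
wdata <- winsorize_df(data)
cor(wdata$var1, wdata$var2)       # rows still line up after winsorizing
```

Clamping with pmin() and pmax() instead of dropping NAs keeps every column the same length, which is what lets the result go back into the data frame.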

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Karl Healey
Sent: Friday, January 16, 2009 2:51 PM
To: r-help at r-project.org
Subject: [R] Winsorizing Multiple Variables

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
Thanks to Michael for giving a nice solution to Karl's question.

This identified a bug in the psych package winsor function, which has
now been fixed in version 1.0.63 (the current development version).
Although my winsor.means function in 1.0.62 (and earlier) worked
correctly, my winsor function when applied to matrices or data.frames
gave an incorrect result.

Bill
At 1:24 PM -0800 1/16/09, Michael Conklin wrote: