Logical subset of the columns in a dataframe

Hi R-helpers,

I've been struggling with a problem for most of the day (!) so am finally
resorting to R-help.

I would like to subset the columns of my dataframe based on the frequency
with which the columns contain non-zero values. For example, let's say that
I want to retain only those columns which contain non-zero values in at
least 1% of their rows.

In Excel I would calculate a row at the bottom of my data sheet and use the
following function

=countif(range,">0")

to identify the number of non-zero cells in each column. Then, I would
divide that by the number of rows to obtain the frequency of non-zero values
in each column. Then, I would delete those columns with frequencies < 0.01.

But, I'd like to do this in R. I think the missing link is an analog to
Excel's countif function. Any ideas?
Use something like

     DF[sapply(DF, function(x) mean(x != 0) >= 0.01)]

Since logical values are converted to 0/1, mean() gives the frequency 
(and sum() the count).
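
A small worked sketch of the same idea, in case it helps. The data frame and the names nonzero_count, nonzero_freq and kept are made up for illustration, and the test is x != 0 (use x > 0 instead to mirror the Excel formula exactly):

```r
# Hypothetical toy data: 200 rows, three columns with different non-zero rates.
DF <- data.frame(
  a = c(rep(1, 50), rep(0, 150)),   # 50 of 200 rows non-zero
  b = c(3, rep(0, 199)),            # 1 of 200 rows non-zero
  c = c(2, 5, 7, rep(0, 197))       # 3 of 200 rows non-zero
)

# Count of non-zero cells per column (the COUNTIF analog).
nonzero_count <- sapply(DF, function(x) sum(x != 0))

# Fraction of non-zero cells per column: the mean of a logical vector.
nonzero_freq <- sapply(DF, function(x) mean(x != 0))

# Keep only the columns whose non-zero frequency is at least 1%.
kept <- DF[nonzero_freq >= 0.01]
names(kept)
```

Note that columns containing NA would give NA here; whether to pass na.rm = TRUE depends on how missing values should be treated, which the original question does not say.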
Thanks! Mark

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
One approach to such a problem would be to use a logical vector inside  
the function colSums.

?colSums

 > DF <- data.frame(XX= runif(20), YY=runif(20))

 > colSums(DF > 0.5)
XX YY
11  9

 > colSums(DF > -Inf)
XX YY
20 20

 > colSums(DF > 0.5)/colSums(DF > -Inf)  # could have used DF >= min(DF) in the denominator
   XX   YY
0.55 0.45
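
Applied to the original question (non-zero values, 1% threshold), the colSums approach might look like the sketch below. The toy data and the names nonzero_count, nonzero_freq and kept are hypothetical, and nrow(DF) is used as the denominator on the assumption that there are no missing values:

```r
# Hypothetical toy data: 200 rows, three columns with different non-zero rates.
DF <- data.frame(
  a = c(rep(1, 50), rep(0, 150)),   # 50 of 200 rows non-zero
  b = c(3, rep(0, 199)),            # 1 of 200 rows non-zero
  c = c(2, 5, 7, rep(0, 197))       # 3 of 200 rows non-zero
)

# DF != 0 is a logical matrix; colSums() counts the TRUE cells per column.
nonzero_count <- colSums(DF != 0)

# Divide by the number of rows to get the frequency per column.
nonzero_freq <- nonzero_count / nrow(DF)

# Keep the columns at or above 1%; drop = FALSE keeps a data frame
# even if only one column survives.
kept <- DF[, nonzero_freq >= 0.01, drop = FALSE]
names(kept)
```

colMeans(DF != 0) would give the frequencies in one step.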
David Winsemius

On Jan 28, 2009, at 11:11 AM, Mark Na wrote:

> Hi R-helpers,
>
> I've been struggling with a problem for most of the day (!) so am finally
> resorting to R-help.
>
> I would like to subset the columns of my dataframe based on the frequency
> with which the columns contain non-zero values. For example, let's say that
> I want to retain only those columns which contain non-zero values in at
> least 1% of their rows.
>
> In Excel I would calculate a row at the bottom of my data sheet and use the
> following function
>
> =countif(range,">0")
>
> to identify the number of non-zero cells in each column. Then, I would
> divide that by the number of rows to obtain the frequency of non-zero values
> in each column. Then, I would delete those columns with frequencies < 0.01.

I don't think that would do what you describe unless you were only  
working with single column ranges. Functions on ranges in Excel are  
not calculated by column.
