Logical subset of the columns in a dataframe
3 messages · Mark, Brian Ripley, David Winsemius
On Wed, 28 Jan 2009, Mark Na wrote:
Hi R-helpers,

I've been struggling with a problem for most of the day (!) so am finally resorting to R-help.

I would like to subset the columns of my dataframe based on the frequency with which the columns contain non-zero values. For example, let's say that I want to retain only those columns which contain non-zero values in at least 1% of their rows.

In Excel I would calculate a row at the bottom of my data sheet and use the following function

=countif(range,">0")

to identify the number of non-zero cells in each column. Then, I would divide that by the number of rows to obtain the frequency of non-zero values in each column. Then, I would delete those columns with frequencies < 0.01.

But, I'd like to do this in R. I think the missing link is an analog to Excel's countif function. Any ideas?
Use something like
DF[sapply(DF, function(x) mean(x != 0) >= 0.01)]
Since logical values are converted to 0/1, mean() gives the frequency
(and sum() the count).
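As a minimal sketch of the idea above, assuming a small made-up data frame (the names DF, a, b, c and the 0.5 threshold are illustrative only and do not come from the original post): the comparison x != 0 yields a logical vector whose sum() is the count of non-zero entries, which is the analog of Excel's countif, and whose mean() is the corresponding frequency.

```r
# Hypothetical toy data: three columns with differing shares of non-zero values
DF <- data.frame(a = c(0, 0, 0, 0),
                 b = c(1, 0, 0, 0),
                 c = c(2, 3, 0, 5))

# Count of non-zero entries per column (analog of Excel's COUNTIF)
nz_count <- sapply(DF, function(x) sum(x != 0))

# Frequency of non-zero entries per column
nz_freq <- sapply(DF, function(x) mean(x != 0))

# Keep only the columns whose non-zero frequency reaches the threshold
threshold <- 0.5   # illustrative; the question used 0.01
kept <- DF[nz_freq >= threshold]
names(kept)
```

If the columns may contain NA, the frequency itself becomes NA unless the call is written as mean(x != 0, na.rm = TRUE).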
Thanks! Mark
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
One approach to such a problem would be to use a logical vector inside the function colSums.

?colSums

> DF <- data.frame(XX= runif(20), YY=runif(20))
> colSums(DF > 0.5)
XX YY
11  9
> colSums(DF > -Inf)
XX YY
20 20
>
> colSums(DF> 0.5)/colSums(DF > -Inf)  #could have used DF >= min(DF) in the denominator
  XX   YY
0.55 0.45
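Carrying this through to the original question, one possible sketch (assuming all columns are numeric; the data frame, its column contents, and the names freq and DF_kept are invented for illustration) replaces the ratio of two colSums calls with colMeans and uses the result to select columns:

```r
# Hypothetical data: 200 rows, with two columns that are mostly zero
DF <- data.frame(XX = runif(200),                  # runif() never returns exactly 0
                 YY = c(1, rep(0, 199)),           # non-zero in 0.5% of rows
                 ZZ = c(rep(1, 4), rep(0, 196)))   # non-zero in 2% of rows

# DF != 0 is a logical matrix; colMeans() gives the share of TRUE per column
freq <- colMeans(DF != 0)

# Retain the columns that are non-zero in at least 1% of rows
DF_kept <- DF[, freq >= 0.01, drop = FALSE]
names(DF_kept)
```

Here drop = FALSE keeps the result a data frame even if only one column survives the filter.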
David Winsemius

On Jan 28, 2009, at 11:11 AM, Mark Na wrote:

> In Excel I would calculate a row at the bottom of my data sheet and use the
> following function
>
> =countif(range,">0")
>
> to identify the number of non-zero cells in each column. Then, I would
> divide that by the number of rows to obtain the frequency of non-zero values
> in each column. Then, I would delete those columns with frequencies < 0.01.

I don't think that would do what you describe unless you were only working with single column ranges. Functions on ranges in Excel are not calculated by column.