Median on Aggregated data
3 messages · Satsangi, Vivek (GE Capital), David Winsemius, William Dunlap
On Nov 18, 2009, at 4:55 PM, Satsangi, Vivek (GE Capital) wrote:
Folks, I have the following code, which works fine on smaller data sets. On larger datasets it runs out of memory and is far too slow, because we are essentially creating large vectors with rep() and then calling median() on them. (I learned this approach from a post on the web.) Below that, I have written the corresponding SAS code. The SAS code runs fast because I can simply tell proc summary (via the weight statement) that the Counts variable is a frequency. So the question is: is there a simple way to do the same thing in R? I have to run this on a large dataset; for a small one it is not a problem.
I'm not sure, and I see no reproducible dataset (that I recognize), but Harrell's Hmisc::wtd.quantile might be an alternative approach.
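A minimal sketch of that suggestion, on a made-up frequency table (the values of x and counts are hypothetical, not the poster's data). It assumes the Hmisc package is installed and that the counts are whole-number frequencies; the rep() expansion is kept only as a reference answer for the comparison.

```r
# Toy frequency table (hypothetical values, not the poster's data)
x      <- c(10, 20, 30, 40)
counts <- c(3, 2, 4, 1)

# Reference answer from the memory-hungry expansion used in the question
expanded <- median(rep(x, counts))

# Same quantity computed from the counts directly; requires Hmisc
if (requireNamespace("Hmisc", quietly = TRUE)) {
  weighted <- unname(Hmisc::wtd.quantile(x, weights = counts, probs = 0.5))
  stopifnot(isTRUE(all.equal(weighted, expanded)))
}
```

With large counts the weighted call never builds the expanded vector, which is where the memory goes in the original loop.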
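---------------------- Begin R code ------------------------------------
## Assumes dfpaydex has factor columns PaydexNormingCategory, SIC and Paydex,
## plus an integer Counts column.
N <- 1005 * 14
myNorm <- data.frame(PaydexNormingCategory = numeric(N),
                     SIC = numeric(N), CatMedian = numeric(N))
k <- 1
#j = 7941; ## For testing only
for (j in levels(dfpaydex$SIC)) {
  for (i in levels(dfpaydex$PaydexNormingCategory)) {
    ## Rows belonging to this (norming category, SIC) cell
    myData <- dfpaydex[(dfpaydex$PaydexNormingCategory == i) &
                       (dfpaydex$SIC == j), ]
    ## rep() expands every row Counts times: this is the memory bottleneck
    myMedian <- with(myData,
                     levels(Paydex)[median(rep(as.numeric(Paydex), Counts))])
    ## Fill row k of the result (myNorm[k] would index a column)
    myNorm[k, ] <- c(as.numeric(i), as.numeric(j), as.numeric(myMedian))
    k <- k + 1
  }
}
---------------------- End R code ------------------------------------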
---------------------- Begin SAS code ------------------------------------
proc summary data=SASUser.PaydexNormfull nway;
  class PaydexNormingCategory SIC;
  weight Counts;
  var Paydex;
  output out=outstat (drop=_type_ _freq_)
         median= / autoname;
run;
---------------------- End SAS code ------------------------------------
Thanks for your guidance!
Vivek Satsangi
GE Capital
Americas
GE imagination at work
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
You could use S+. Its median function has a weights argument. E.g.,
   > median(c(1,2,3,4e4), weights=c(1e8,1e8,1,2e8))
   [1] 3
   > median(c(1,2,3,4e4), weights=c(1e8,1e8,1,2e8+10))
   [1] 40000
   > median(c(1,2,3,4e4), weights=c(1e8,1e8,1,2e8+1))
   [1] 20001.5

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
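For readers without S+, the same frequency-weighted median can be sketched in base R by walking the cumulative counts instead of expanding the data. The helper name freq_median and the toy data frame below are hypothetical; only the column names are taken from the original post. The sketch assumes the counts are non-negative whole-number frequencies and that the values are numeric.

```r
# Median of x where x[i] occurs counts[i] times, without expanding the data.
# freq_median is a hypothetical helper name, not a base-R or package function.
freq_median <- function(x, counts) {
  keep <- !is.na(x) & !is.na(counts) & counts > 0
  x <- x[keep]
  counts <- as.numeric(counts[keep])  # doubles avoid integer overflow in cumsum
  if (length(x) == 0) return(NA_real_)
  ord <- order(x)
  x <- x[ord]
  cum <- cumsum(counts[ord])          # running total of observations
  n <- cum[length(cum)]
  # Positions of the middle order statistic(s) in the virtual expanded vector
  lo <- x[which(cum >= floor((n + 1) / 2))[1]]
  hi <- x[which(cum >= ceiling((n + 1) / 2))[1]]
  (lo + hi) / 2
}

# The three S+ calls quoted above, redone with the helper
m1 <- freq_median(c(1, 2, 3, 4e4), c(1e8, 1e8, 1, 2e8))       # 3
m2 <- freq_median(c(1, 2, 3, 4e4), c(1e8, 1e8, 1, 2e8 + 10))  # 40000
m3 <- freq_median(c(1, 2, 3, 4e4), c(1e8, 1e8, 1, 2e8 + 1))   # 20001.5

# Per-cell medians on a toy table (hypothetical values)
df <- data.frame(
  PaydexNormingCategory = c(1, 1, 1, 2, 2),
  SIC                   = c(7941, 7941, 7941, 7941, 7941),
  Paydex                = c(50, 60, 80, 40, 70),
  Counts                = c(2, 5, 1, 3, 3)
)
cells <- split(df, list(df$PaydexNormingCategory, df$SIC), drop = TRUE)
result <- do.call(rbind, lapply(cells, function(g) {
  data.frame(PaydexNormingCategory = g$PaydexNormingCategory[1],
             SIC                   = g$SIC[1],
             CatMedian             = freq_median(g$Paydex, g$Counts))
}))
```

Each cell needs only a sort and a cumulative sum over its distinct rows, so memory use scales with the number of rows in the aggregated table and not with the sum of Counts.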