Mean-Centering Question
9 messages · Ray DiGiacomo, Jr., Elizabeth Fuller Bettini, David Winsemius +2 more
On Dec 8, 2012, at 3:54 PM, Ray DiGiacomo, Jr. wrote:
Hello, I'm trying to create a custom function that "mean-centers" data and can be applied across many columns. Here is an example dataset, which is similar to my dataset:
dat <- read.table(text="Location,TimePeriod,Units,AveragePrice
Los Angeles,5/1/11,61,5.42
Los Angeles,5/8/11,49,4.69
Los Angeles,5/15/11,40,5.05
New York,5/1/11,259,6.4
New York,5/8/11,187,5.3
New York,5/15/11,177,5.7
Paris,5/1/11,672,6.26
Paris,5/8/11,514,5.3
Paris,5/15/11,455,5.2", header=TRUE, sep=",")
I want to mean-center the "Units" and "AveragePrice" Columns.
So, I created this function:
specialFunction <- function(x){ log(x) - colMeans(log(x), na.rm = T) }
I needed to modify this to avoid errors relating to how colMeans is
expecting its arguments:
specialFunction2 <- function(x){ log(x) - mean(log(x), na.rm = T) }
aggregate(dat[3:4], dat[1], FUN=specialFunction2)
     Location    Units.1    Units.2    Units.3 AveragePrice.1 AveragePrice.2 AveragePrice.3
1 Los Angeles  0.2136827 -0.0053709 -0.2083118      0.0717903     -0.0728730      0.0010827
2    New York  0.2354659 -0.0902535 -0.1452124      0.1014743     -0.0871168     -0.0143575
3       Paris  0.2193320 -0.0487031 -0.1706289      0.1173316     -0.0491417     -0.0681899
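A note on why the colMeans() version had to change for aggregate(): aggregate() applies FUN to each column of each group separately, so FUN receives a plain vector, and colMeans() needs an object with at least two dimensions. A minimal sketch using the Los Angeles Units values from the example data (the names v, res and centered are illustrative only, not from the thread):

```r
# One group of one column, as aggregate() would pass it to FUN
v <- log(c(61, 49, 40))

# colMeans() on a bare vector fails because v has no dim attribute;
# capture the error message instead of stopping
res <- tryCatch(colMeans(v), error = function(e) conditionMessage(e))
print(res)

# mean() is the vector equivalent, so this is what works inside aggregate()
centered <- v - mean(v)
print(centered)
```

The centered values should agree with the first row of the aggregate() output above (Units.1 to Units.3).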
If I use only "one" column in the first argument of the "by" function,
everything is fine. For example, the following code will work fine:
by(data[c("Units")],
data["Location"],
specialFunction)
But the following code will "not" work, because I have "two" columns
in the first argument...
by(data[c("Units", "AveragePrice")],
data["Location"],
specialFunction)
OK. So then I tried this with your function and was surprised to see
that it also works:
> by(dat[c("Units", "AveragePrice")],
+ dat["Location"],
+ specialFunction)
Location: Los Angeles
Units AveragePrice
1 0.21368 0.0717903
2 2.27351 -2.3517586
3 -0.20831 0.0010827
------------------------------------------------------------------
Location: New York
Units AveragePrice
4 0.23547 0.101474
5 3.47628 -3.653655
6 -0.14521 -0.014357
------------------------------------------------------------------
Location: Paris
Units AveragePrice
7 0.21933 0.11733
8 4.52537 -4.62322
9 -0.17063 -0.06819
Does anyone have any ideas as to what I am doing wrong?
I guess I don't. I cannot reproduce the problem, and my other methods
worked as well. This also works with your version and with mine, but I
get the deprecation message for `mean.data.frame` from mine:
> lapply( split(dat[3:4], dat[1]) , FUN=specialFunction )
$`Los Angeles`
Units AveragePrice
1 0.21368 0.0717903
2 2.27351 -2.3517586
3 -0.20831 0.0010827
$`New York`
Units AveragePrice
4 0.23547 0.101474
5 3.47628 -3.653655
6 -0.14521 -0.014357
$Paris
Units AveragePrice
7 0.21933 0.11733
8 4.52537 -4.62322
9 -0.17063 -0.06819
Please note that I'm trying to get the following results (for the "Los Angeles" group):
Los Angeles "Units" variable (Mean-Centered)
0.213682659
-0.005370907
-0.208311751
Los Angeles "AveragePrice" variable (Mean-Centered)
0.071790268
-0.072872965
0.001082696
David Winsemius, MD Alameda, CA, USA
On Dec 8, 2012, at 7:06 PM, Elizabeth Fuller Bettini wrote:
please remove me from this list.
You subscribed and only you know the password that allows you to control the subscription options. Please use the links at the bottom of every posting to Rhelp.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD Alameda, CA, USA
Hi,
It works for me also:
by(dat1[c("Units","AveragePrice")],dat1[,1],specialFunction)
#dat1[, 1]: Los Angeles
#       Units AveragePrice
#1  0.2136827  0.071790268
#2  2.2735148 -2.351758623
#3 -0.2083118  0.001082696
#----------------------------------------------
#or
by(cbind(Units=dat1[,3],AveragePrice=dat1[,4]),dat1[,1],specialFunction)
#INDICES: Los Angeles
#       Units AveragePrice
#1  0.2136827  0.071790268
#2  2.2735148 -2.351758623
#3 -0.2083118  0.001082696
#--------------------------------------------
A.K.
Hi,
You could also use:
newFunction1<-function(x) {t(t(log(x))-colMeans(log(x)))}
res1<-by(dat1[c("Units","AveragePrice")],dat1["Location"],newFunction1)
res1
#Location: Los Angeles
#         Units AveragePrice
#1  0.213682659  0.071790268
#2 -0.005370907 -0.072872965
#3 -0.208311751  0.001082696
#------------------------------------------------------------
#Location: New York
#        Units AveragePrice
#4  0.23546592   0.10147433
#5 -0.09025352  -0.08711684
#6 -0.14521240  -0.01435749
#------------------------------------------------------------
#Location: Paris
#        Units AveragePrice
#7  0.21933200   0.11733164
#8 -0.04870308  -0.04914172
#9 -0.17062892  -0.06818992
newFunction <- function(x) { sweep(log(x), 2, colMeans(log(x)), "-") }
res<-by(dat1[c("Units","AveragePrice")],dat1["Location"],newFunction)
res
#Location: Los Angeles
#         Units AveragePrice
#1  0.213682659  0.071790268
#2 -0.005370907 -0.072872965
#3 -0.208311751  0.001082696
#------------------------------------------------------------
#Location: New York
#        Units AveragePrice
#4  0.23546592   0.10147433
#5 -0.09025352  -0.08711684
#6 -0.14521240  -0.01435749
#------------------------------------------------------------
#Location: Paris
#        Units AveragePrice
#7  0.21933200   0.11733164
#8 -0.04870308  -0.04914172
#9 -0.17062892  -0.06818992
# identical(res, res1) will be FALSE, because the list elements of res are data frames while those of res1 are matrices.
A.K.
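A small sketch of why the double-transpose version gives the right answer and why identical() still reports FALSE. It uses only the Los Angeles rows of the example data; the names la, m, viaT and viaSweep are illustrative and not from the thread. After t(), each column holds one observation (Units, AveragePrice), so recycling the two column means down each column lines them up correctly.

```r
# Los Angeles rows of the example data
la <- data.frame(Units = c(61, 49, 40), AveragePrice = c(5.42, 4.69, 5.05))
m  <- colMeans(log(la))            # one mean per variable

# t() of a data frame returns a matrix with variables as rows, so the
# length-2 vector m is recycled as (m[1], m[2]) down every column
viaT <- t(t(log(la)) - m)

# sweep() keeps the data frame class
viaSweep <- sweep(log(la), 2, m, "-")

print(class(viaT))
print(class(viaSweep))
print(identical(viaT, viaSweep))   # different classes, so not identical

# The numbers themselves agree once names and classes are set aside
print(all.equal(unname(viaT), unname(as.matrix(viaSweep))))
```

Comparing with all.equal() on the underlying numbers, rather than identical() on the objects, is usually the more useful check here.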
----- Original Message -----
From: "Ray DiGiacomo, Jr." <rayd at liondatasystems.com>
To: R Help <r-help at r-project.org>
Cc:
Sent: Saturday, December 8, 2012 11:11 PM
Subject: Re: [R] Mean-Centering Question
Hi David and Arun,
Thanks for looking into this. I think I have found a solution.
The "by" function will run ok without errors but the values returned in the
second row of the "Los Angeles" output are both incorrect. These incorrect
values (2.27351 and -2.3517586) were shown in red in the original message.
I think my original custom function was causing the incorrect values
because the subtraction inside the original custom function was subtracting
frames that had different dimensions and I think there was some "recycling"
happening.
Using the "sweep" function fixes the problem. This is what I did to fix
things:
# here is my "new" custom function
newFunction <- function(x) { sweep(log(x), 2, colMeans(log(x)), "-") }
# this gives the correct values
by(PullData[c("Units","AveragePrice")],
   PullData[c("StoreLocation")],
   newFunction)
- Ray
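The recycling Ray describes can be seen directly on the Los Angeles rows. When a length-2 vector is subtracted from a 3-row, 2-column data frame, R recycles the vector down the columns in turn, so the first column is paired with (m1, m2, m1) and the second with (m2, m1, m2); only the second row ends up with the wrong mean. A minimal sketch (the names la, lg, m, wrong and right are illustrative, not from the thread):

```r
# Los Angeles rows of the example data
la <- data.frame(Units = c(61, 49, 40), AveragePrice = c(5.42, 4.69, 5.05))
lg <- log(la)
m  <- colMeans(lg)     # length-2 vector: mean log Units, mean log AveragePrice

# Data frame minus vector: m is recycled to 6 values and consumed column by
# column, so row 2 of Units gets the AveragePrice mean and vice versa
wrong <- lg - m

# sweep() with MARGIN = 2 pairs each mean with its own column
right <- sweep(lg, 2, m, "-")

print(wrong)
print(right)

# Row 2 of the "wrong" result is log(49) minus the mean of log(AveragePrice)
print(log(49) - m[["AveragePrice"]])
```

With three rows the misalignment only touches the middle row, which is why rows 1 and 3 of the original output looked correct.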
If you are willing to rethink the definition of your special function, the process can be simplified. The function lmc() log-mean centers a single grouped numeric vector. Then sapply() can be used to center a batch of them.
lmc <- function(x, g) unsplit(lapply(split(log(x), g), scale, scale=FALSE), g)
dat2 <- data.frame(dat[,1:2], sapply(dat[,3:4], lmc, g=dat[,1]))
dat2
     Location X..TimePeriod        Units AveragePrice
1 Los Angeles        5/1/11  0.213682659  0.071790268
2 Los Angeles        5/8/11 -0.005370907 -0.072872965
3 Los Angeles       5/15/11 -0.208311751  0.001082696
4    New York        5/1/11  0.235465925  0.101474328
5    New York        5/8/11 -0.090253520 -0.087116841
6    New York       5/15/11 -0.145212404 -0.014357487
7       Paris        5/1/11  0.219331999  0.117331641
8       Paris        5/8/11 -0.048703076 -0.049141723
9       Paris       5/15/11 -0.170628923 -0.068189918
----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352
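The same per-vector idea can also be sketched with base R's ave(), which applies a function within groups and returns the result in the original row order. This is an alternative to the split()/unsplit() approach above, not the method posted in the thread; the names center_log and dat3 are made up for illustration.

```r
dat <- read.table(text = "Location,TimePeriod,Units,AveragePrice
Los Angeles,5/1/11,61,5.42
Los Angeles,5/8/11,49,4.69
Los Angeles,5/15/11,40,5.05
New York,5/1/11,259,6.4
New York,5/8/11,187,5.3
New York,5/15/11,177,5.7
Paris,5/1/11,672,6.26
Paris,5/8/11,514,5.3
Paris,5/15/11,455,5.2", header = TRUE, sep = ",")

# Log-transform one numeric vector, then subtract its group mean within
# each level of g; ave() keeps the original row order
center_log <- function(x, g) ave(log(x), g, FUN = function(v) v - mean(v))

# Apply to every numeric column of interest and keep the identifying columns
dat3 <- data.frame(dat[, 1:2], lapply(dat[, 3:4], center_log, g = dat$Location))
print(dat3)
```

Because each column is handled as a vector within its group, no column-versus-row recycling question arises.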