Joining two datasets - recursive procedure?

11 messages · Jeff Newmiller, Luca Meyer, Bert Gunter

Luca Meyer

Wed, Mar 18, 2015 9:05 AM #

Thanks for your input, Michael,

The continuous variable I have measures quantities (down to the third
decimal place), so unfortunately the values are not frequencies.

Any more specific suggestions on how that could be tackled?

Thanks & kind regards,

Luca


===

Michael Friendly wrote:

I'm not sure I understand completely what you want to do, but
if the data were frequencies, it sounds like a task for fitting a
loglinear model with the model formula

~ V1*V2 + V3

On 3/18/2015 2:17 AM, Luca Meyer wrote:

Hello,

I am facing a quite challenging task (at least to me) and I was wondering
if someone could advise how R could assist me to speed the task up.

I am dealing with a dataset with 3 discrete variables and one continuous
variable. The discrete variables are:

V1: 8 modalities
V2: 13 modalities
V3: 13 modalities

The continuous variable V4 is a decimal number always greater than zero in
the marginals of each of the 3 variables but it is sometimes equal to zero
(and sometimes negative) in the joint tables.

I have got 2 files:

=> one with the distribution of all possible combinations of V1xV2 (some of
which are zero or negative) and
=> one with the marginal distribution of V3.

I am trying to build the long and narrow dataset V1xV2xV3 in such a way
that each V1xV2 cell does not get modified and V3 fits as closely as
possible to its marginal distribution. Does it make sense?

To be even more specific, my 2 input files look like the following.

FILE 1
V1,V2,V4
A, A, 24.251
A, B, 1.065
(...)
B, C, 0.294
B, D, 2.731
(...)
H, L, 0.345
H, M, 0.000

FILE 2
V3, V4
A, 1.575
B, 4.294
C, 10.044
(...)
L, 5.123
M, 3.334

What I need to achieve is a file such as the following

FILE 3
V1, V2, V3, V4
A, A, A, ???
A, A, B, ???
(...)
D, D, E, ???
D, D, F, ???
(...)
H, M, L, ???
H, M, M, ???

Please notice that FILE 3 needs to be such that if I aggregate on V1+V2 I
recover exactly FILE 1 and that if I aggregate on V3 I can recover a file
as close as possible to FILE 2 (ideally the same file).

Can anyone suggest how I could do that with R?

Thank you very much indeed for any assistance you are able to provide.

Kind regards,

Luca
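
One possible reading of the request, sketched under assumptions: if the V3 split is taken to be the same in every V1xV2 cell, each cell total from FILE 1 can simply be multiplied by the V3 shares from FILE 2. This mirrors the fitted values of the ~ V1*V2 + V3 structure mentioned above, applied directly to quantities rather than counts. The data frames file1 and file2 below are hypothetical stand-ins (they are not the poster's data), and the sketch assumes both files share the same grand total.

```r
# Hypothetical stand-ins for FILE 1 (V1 x V2 totals) and FILE 2 (V3 marginal).
# Both are made up so that their grand totals agree (28.341).
file1 <- data.frame(
  V1 = c("A", "A", "B", "B"),
  V2 = c("A", "B", "A", "B"),
  V4 = c(24.251, 1.065, 0.294, 2.731)
)
file2 <- data.frame(V3 = c("A", "B", "C"), V4 = c(8.0, 10.341, 10.0))

# Pair every FILE 1 row with every FILE 2 row.
idx <- expand.grid(i = seq_len(nrow(file1)), j = seq_len(nrow(file2)))

# Split each V1 x V2 total according to the overall V3 shares.
file3 <- data.frame(
  V1 = file1$V1[idx$i],
  V2 = file1$V2[idx$i],
  V3 = file2$V3[idx$j],
  V4 = file1$V4[idx$i] * file2$V4[idx$j] / sum(file2$V4)
)

# Aggregating back should reproduce both inputs.
aggregate(V4 ~ V1 + V2, file3, sum)
aggregate(V4 ~ V3, file3, sum)
```

Negative or zero cells in FILE 1 are carried through proportionally, so a negative cell total produces negative entries for every V3 level. If the two grand totals differ, the V3 marginal can only be matched up to that difference.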

Jeff Newmiller

Wed, Mar 18, 2015 6:38 PM #

I don't understand your description. The standard practice on this list is to provide a reproducible R example [1] of the kind of data you are working with (and any code you have tried) to go along with your description. In this case, that would be two dputs of your input data frames and a dput of an output data frame (generated by hand from your input data frame). (Probably best to not use the full number of input values just to keep the size down.) We could then make an attempt to generate code that goes from input to output.

Of course, if you post that hard work using HTML then it will get corrupted (much like the text below from your earlier emails) and we won't be able to use it. Please learn to post from your email software using plain text when corresponding with this mailing list.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.
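
For readers unfamiliar with the dput() workflow described above, here is a minimal sketch. The data frame small is a hypothetical miniature, not the poster's data. The idea is that dput() prints R code that rebuilds the object, so the printed text can be pasted into a plain-text email and run by anyone on the list.

```r
# Hypothetical miniature of an input data frame.
small <- data.frame(
  v1 = c("A", "A", "B"),
  v2 = c("A", "B", "A"),
  v4 = c(24.251, 1.065, 0.294),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object; paste that text into the email.
dput(small)

# Round trip: capture the printed text, evaluate it, and compare with the original.
txt <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = txt))
all.equal(small, rebuilt)
```

Posting a small subset (for example via head()) keeps the message short while still letting others run the code.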

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

2 days later

Luca Meyer

Sat, Mar 21, 2015 4:53 AM #

Hi Jeff & other R-experts,

Thank you for your note. I have tried to solve the issue myself, without
success.

Following your suggestion, I am providing a sample of the dataset I am
using below (also downloadable in plain text from
https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0):

# this is an extract of the overall dataset (n=1200 cases)
f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
"B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
"B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835, 3.43806581506388,
0.002733567617055, 1.42917483425029, 1.05786640463504, 0.000420548864162308,
2.37232740842861, 3.01835841813241, 0, 1.13430282139936, 0.928725667117666,
0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))

I need to find an automated procedure that allows me to adjust the v3 marginals
while keeping the v1xv2 marginals unchanged.

That is: modify the v4 values you can find by running:

aggregate(f1[,c("v4")],list(f1$v3),sum)

while keeping constant the values you can find by running:

aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)

Now does it make sense?

Please notice I have tried to build some syntax that modifies the values
within each v1xv2 combination by computing the sum of v4 and the row
percentages in terms of v4, and that is where my effort is blocked. I am not
really sure how I should proceed. Any suggestions?

Thanks,

Luca



Bert Gunter

Sat, Mar 21, 2015 5:18 AM #

1. Still not sure what you mean, but maybe look at ?ave and ?tapply,
for which ave() is a wrapper.

2. You still need to heed the rest of Jeff's advice.

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll
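
As a rough illustration of the two functions suggested above: tapply() returns one value per group, while ave() returns one value per row, which is convenient for the cell sums and row percentages mentioned earlier in the thread. The data frame d is a hypothetical toy example in the same shape as f1.

```r
# Hypothetical toy data in the same shape as the poster's f1.
d <- data.frame(
  v1 = c("A", "A", "A", "A"),
  v2 = c("A", "B", "A", "B"),
  v3 = c("B", "B", "C", "C"),
  v4 = c(3, 1, 1, 3)
)

# tapply(): one sum per v1 x v2 cell, returned as a table.
cell_table <- tapply(d$v4, list(d$v1, d$v2), sum)

# ave(): the same cell sum, repeated on every row that belongs to the cell.
d$cell_sum <- ave(d$v4, d$v1, d$v2, FUN = sum)

# Share of each v3 level within its v1 x v2 cell (the "row percentage").
d$share <- d$v4 / d$cell_sum

cell_table
d
```

Cells whose total is zero, which do occur in the posted extract, would give NaN shares here and need separate handling.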


Luca Meyer

Sat, Mar 21, 2015 6:49 AM #

Hi Bert,

Thank you for your message. I am looking into ave() and tapply() as you
suggested, but at the same time I have prepared an example of input and
output files, just in case you or someone else would like to make an
attempt to generate code that goes from input to output.

Please see below or download it from
https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0

# this is (an extract of) the INPUT file I have:
f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
"B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
"B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806, 0.00273, 1.42917,
1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872, 0)),
.Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))

# this is (an extract of) the OUTPUT file I would like to obtain:
f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
"B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
"B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806, 0.00295, 1.77918,
1.05786, 0.0002, 2.37232, 3.01835, 0, 1.13430, 0.92872, 0)),
.Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))

# please notice that while the aggregated v4 on v3 has changed ...
aggregate(f1[,c("v4")],list(f1$v3),sum)
aggregate(f2[,c("v4")],list(f2$v3),sum)

# ... the aggregated v4 over v1xv2 has remained unchanged:
aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
aggregate(f2[,c("v4")],list(f2$v1,f2$v2),sum)

Thank you very much in advance for your assistance.

Luca
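
The two claims in the comments above can be checked programmatically rather than by eye. The sketch below rebuilds f1 and f2 with the same rounded values (using data.frame() instead of structure(), and without the original row names, purely for brevity) and compares the aggregated tables.

```r
# Same key columns and rounded v4 values as in the message above.
keys <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2)
)
f1 <- cbind(keys, v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
                         2.37232, 3.01835, 0, 1.13430, 0.92872, 0))
f2 <- cbind(keys, v4 = c(17.83529, 3.43806, 0.00295, 1.77918, 1.05786, 0.0002,
                         2.37232, 3.01835, 0, 1.13430, 0.92872, 0))

# v1 x v2 cell sums: these should agree between input and output.
cells1 <- tapply(f1$v4, list(f1$v1, f1$v2), sum)
cells2 <- tapply(f2$v4, list(f2$v1, f2$v2), sum)
all.equal(cells1, cells2)

# v3 marginal: how much was moved between the v3 levels.
v3_shift <- tapply(f2$v4, f2$v3, sum) - tapply(f1$v4, f1$v3, sum)
v3_shift
```

all.equal() is used instead of == because the sums are floating-point numbers and may differ in the last digits.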


Bert Gunter

Sat, Mar 21, 2015 7:53 AM #

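## z can be any perturbation; subtracting its mean within each v1 x v2 cell
## makes the adjustment sum to zero per cell, so the cell totals are unchanged
z <- rnorm(nrow(f1)) ## or anything you want
z1 <- f1$v4 + z - with(f1,ave(z,v1,v2,FUN=mean))

## sums by v1 and by v2 agree between v4 and z1 ...
aggregate(v4~v1,f1,sum)
aggregate(z1~v1,f1,sum)
aggregate(v4~v2,f1,sum)
aggregate(z1~v2,f1,sum)
## ... while the sums by v3 differ
aggregate(v4~v3,f1,sum)
aggregate(z1~v3,f1,sum)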


Cheers,
Bert
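
The code above uses a random z, so the resulting v3 marginal is arbitrary. The same idea (adjustments that sum to zero within each v1 x v2 cell) can be aimed at a specific v3 marginal. The following is only a sketch and not something proposed in the thread: the function retarget_v3, the toy data, and the target values are all hypothetical, and it assumes every v1 x v2 cell contains every v3 level exactly once, a non-zero grand total, and a target with the same grand total as the current marginal.

```r
# Sketch only; see the assumptions stated in the text above.
# `target` is a named numeric vector giving the desired v3 marginal.
retarget_v3 <- function(d, target) {
  current <- tapply(d$v4, d$v3, sum)
  delta <- setNames(as.vector(target[names(current)] - current), names(current))
  stopifnot(abs(sum(delta)) < 1e-8)  # grand totals must agree

  # Share of the grand total held by each row's v1 x v2 cell.
  w <- ave(d$v4, d$v1, d$v2, FUN = sum) / sum(d$v4)

  # Within a cell the adjustments are w * delta, which sum to zero over v3,
  # so cell totals stay fixed; across cells w sums to 1 for each v3 level,
  # so each v3 total moves by exactly delta.
  d$v4 <- d$v4 + unname(delta[as.character(d$v3)]) * w
  d
}

# Hypothetical toy data: two cells (A-A and B-A), two v3 levels.
toy <- data.frame(
  v1 = c("A", "A", "B", "B"),
  v2 = c("A", "A", "A", "A"),
  v3 = c("B", "C", "B", "C"),
  v4 = c(6, 2, 1, 1)
)
out <- retarget_v3(toy, target = c(B = 6, C = 4))

tapply(out$v4, list(out$v1, out$v2), sum)  # cell totals, same as in toy
tapply(out$v4, out$v3, sum)                # v3 marginal, now equal to target
```

Spreading the change in proportion to the cell totals is one choice among many. Any weights that sum to one across cells would satisfy both constraints, and cells with a zero total are left untouched under this particular choice.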

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll

On Sat, Mar 21, 2015 at 6:49 AM, Luca Meyer <lucam1968 at gmail.com> wrote:

Hi Bert,

Thank you for your message. I am looking into ave() and tapply() as you
suggested but at the same time I have prepared a example of input and output
files, just in case you or someone else would like to make an attempt to
generate a code that goes from input to output.

Please see below or download it from
https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0

# this is (an extract of) the INPUT file I have:
f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
"B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
"B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806,0.00273, 1.42917,
1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names =
c(2L,
9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))

# this is (an extract of) the OUTPUT file I would like to obtain:
f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
"B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
"B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806,0.00295, 1.77918,
1.05786, 0.0002, 2.37232, 3.01835, 0, 1.13430, 0.92872,
0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names =
c(2L,
9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))

# please notice that while the aggregated v4 on v3 has changed ?
aggregate(f1[,c("v4")],list(f1$v3),sum)
aggregate(f2[,c("v4")],list(f2$v3),sum)

# ? the aggregated v4 over v1xv2 has remained unchanged:
aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
aggregate(f2[,c("v4")],list(f2$v1,f2$v2),sum)

Thank you very much in advance for your assitance.

Luca

2015-03-21 13:18 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:

1. Still not sure what you mean, but maybe look at ?ave and ?tapply,
for which ave() is a wrapper.

2. You still need to heed the rest of Jeff's advice.

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll




On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer <lucam1968 at gmail.com> wrote:

Hi Jeff & other R-experts,

Thank you for your note. I have tried myself to solve the issue without
success.

Following your suggestion, I am providing a sample of the dataset I am
using below (also downloadble in plain text from
https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0):

#this is an extract of the overall dataset (n=1200 cases)
f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
"B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
"B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835,
3.43806581506388,
0.002733567617055, 1.42917483425029, 1.05786640463504,
0.000420548864162308,
2.37232740842861, 3.01835841813241, 0, 1.13430282139936,
0.928725667117666,
0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names
=
c(2L,
9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))

I need to find a automated procedure that allows me to adjust v3
marginals
while maintaining v1xv2 marginals unchanged.

That is: modify the v4 values you can find by running:

aggregate(f1[,c("v4")],list(f1$v3),sum)

while maintaining costant the values you can find by running:

aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)

Now does it make sense?

Please note that I have tried to write code that modifies the values
within each v1xv2 combination by computing the sum of v4 and each row's
percentage of that sum, and that is where I am stuck. I am not really
sure how I should proceed. Any suggestions?
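
The "cell sum and row percentage" step described above can be sketched with ave(). This is only an illustration under assumptions: it uses the rounded f1 extract posted in this thread, rebuilt with data.frame(), and the names `cell_sum` and `share` are hypothetical. Cells whose sum is zero are given a share of 0 to avoid dividing 0 by 0.

```r
# The rounded f1 extract from the thread, rebuilt with data.frame().
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0),
  stringsAsFactors = FALSE
)

# Sum of v4 within each v1 x v2 cell, repeated on every row of the cell.
cell_sum <- ave(f1$v4, f1$v1, f1$v2, FUN = sum)

# Each row's share of its cell; cells that sum to zero get a share of 0.
share <- ifelse(cell_sum == 0, 0, f1$v4 / cell_sum)

print(cbind(f1, cell_sum, share))
```

Any adjustment that only changes `share` within a cell, while the shares in that cell still sum to 1, leaves the v1xv2 totals untouched.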

Thanks,

Luca


2015-03-19 2:38 GMT+01:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:

I don't understand your description. The standard practice on this list is
to provide a reproducible R example [1] of the kind of data you are working
with (and any code you have tried) to go along with your description. In
this case, that would be two dputs of your input data frames and a dput of
an output data frame (generated by hand from your input data frame).
(Probably best to not use the full number of input values just to keep the
size down.) We could then make an attempt to generate code that goes from
input to output.
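
For readers unfamiliar with dput(), a minimal sketch of the round trip follows. The `small` data frame is a made-up stand-in, not data from this thread. The printed text is what would be pasted into a plain-text email, and evaluating it rebuilds the object.

```r
# A made-up stand-in for a small extract of the real data.
small <- data.frame(v1 = c("A", "B"), v4 = c(1.5, 0),
                    stringsAsFactors = FALSE)

# dput() writes R code that recreates the object; capture it as text.
txt <- paste(capture.output(dput(small)), collapse = "\n")
cat(txt, "\n")

# Anyone can paste that text back into R to rebuild the data frame.
rebuilt <- eval(parse(text = txt))
print(all.equal(small, rebuilt))
```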

Of course, if you post that hard work using HTML then it will get
corrupted (much like the text below from your earlier emails) and we won't
be able to use it. Please learn to post from your email software using
plain text when corresponding with this mailing list.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.


Bert Gunter

Sat, Mar 21, 2015 10:13 AM #

... or cleaner:

z1 <- with(f1,v4 + z -ave(z,v1,v2,FUN=mean))
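
A quick way to see why this formula leaves the v1xv2 totals alone: within each cell, z minus its cell mean sums to zero, so whatever is added to one row is taken from the others. The sketch below checks that numerically. It assumes the rounded f1 extract from this thread (rebuilt with data.frame()) and a fixed seed chosen only for reproducibility.

```r
# The rounded f1 extract from the thread, rebuilt with data.frame().
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0),
  stringsAsFactors = FALSE
)

set.seed(1)                      # arbitrary seed, only for reproducibility
z <- rnorm(nrow(f1))             # any perturbation will do

# Centre z within each v1 x v2 cell before adding it to v4.
f1$z1 <- with(f1, v4 + z - ave(z, v1, v2, FUN = mean))

before <- aggregate(v4 ~ v1 + v2, f1, sum)
after  <- aggregate(z1 ~ v1 + v2, f1, sum)

# Cell totals agree up to floating-point error; v3 totals generally do not.
print(all.equal(before$v4, after$z1))
print(aggregate(cbind(v4, z1) ~ v3, f1, sum))
```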


Just out of curiosity, was this homework? (in which case I probably
should not have provided you an answer -- that is, assuming that I
HAVE provided an answer).

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll


Luca Meyer

Sun, Mar 22, 2015 2:00 AM #

Hi Bert, hello R-experts,

I am close to a solution but I still need one hint w.r.t. the following
procedure (available also from
https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0)

rm(list=ls())

# this is (an extract of) the INPUT file I have:
f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A",
"B", "C"), v3 = c("B", "B", "B", "C", "C", "C", "B", "B", "B", "C", "C",
"C"), v4 = c(18.18530, 3.43806,0.00273, 1.42917, 1.05786, 0.00042, 2.37232,
3.01835, 0, 1.13430, 0.92872, 0)), .Names = c("v1", "v2", "v3", "v4"),
class = "data.frame", row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L,
167L, 197L, 204L, 206L))

# this is the procedure that Bert suggested (slightly adjusted):
z <- rnorm(nrow(f1)) ## or anything you want
z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5)
aggregate(v4~v1*v2,f1,sum)
aggregate(z1~v1*v2,f1,sum)
aggregate(v4~v3,f1,sum)
aggregate(z1~v3,f1,sum)

My question to you is: how can I set z so that I can obtain specific values
for z1-v4 in the v3 aggregation?
In other words, how can I configure the procedure so that e.g. B=29 and
C=2.56723 after running the procedure:
aggregate(z1~v3,f1,sum)

Thank you,

Luca

PS: To avoid any doubt about who I am, this is my web page:
http://lucameyer.wordpress.com/



Bert Gunter

Sun, Mar 22, 2015 7:55 AM #

I would have thought that this is straightforward given my previous email...

Just set z to what you want -- e.g., all B values to 29/number of B's,
and all C values to 2.567/number of C's (etc. for more categories).

A slick but sort of cheat way to do this programmatically -- in the
sense that it relies on the implementation of factor() rather than its
API -- is:

y <- f1$v3  ## to simplify the notation; could be done using with()
z <- (c(29,2.567)/table(y))[c(y)]

Then proceed to z1 as I previously described
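
One caveat worth checking before relying on this: z enters z1 only through its deviation from the v1xv2 cell mean, so on the posted extract, plugging the targets themselves into z appears to move the v3 totals by a different amount than intended. The sketch below is a variant, not the recipe above: it sets z to the per-row shortfall (target minus current v3 total, divided by the number of rows at that level). It rests on two assumptions that hold for the extract: the targets add up to the current grand total (29 + 2.56723 = 31.56723), and every v1xv2 cell contains one row per v3 level. The names `target` and `per_row` are hypothetical.

```r
# The rounded f1 extract from the thread, rebuilt with data.frame().
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0),
  stringsAsFactors = FALSE
)

target <- c(B = 29, C = 2.56723)   # desired v3 totals (from the question)
lev <- names(target)

# Current v3 totals and row counts, in the same order as 'target'.
current <- as.vector(tapply(f1$v4, f1$v3, sum)[lev])
n_rows  <- as.vector(table(f1$v3)[lev])

# Amount each row of a given v3 level must move to close the gap.
per_row <- setNames((target - current) / n_rows, lev)
z <- unname(per_row[f1$v3])

# Same centring step as before, so v1 x v2 cell totals are preserved.
f1$z1 <- with(f1, v4 + z - ave(z, v1, v2, FUN = mean))

print(aggregate(z1 ~ v3, f1, sum))
print(aggregate(cbind(v4, z1) ~ v1 + v2, f1, sum))
```

Note that this shifts every row of a level by the same amount, so rows that start at zero can become negative; whether that is acceptable depends on the data. For unbalanced cells or targets that do not conserve the grand total, an iterative fitting approach would be needed instead.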

-- Bert


Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll


On March 18, 2015 9:05:37 AM PDT, Luca Meyer <lucam1968 at gmail.com>
wrote:

Thanks for you input Michael,

The continuous variable I have measures quantities (down to the
3rd
decimal level) so unfortunately are not frequencies.

Any more specific suggestions on how that could be tackled?

Thanks & kind regards,

Luca


===

Michael Friendly wrote:
I'm not sure I understand completely what you want to do, but
if the data were frequencies, it sounds like task for fitting a
loglinear model with the model formula

~ V1*V2 + V3

On 3/18/2015 2:17 AM, Luca Meyer wrote:

* Hello,

*>>* I am facing a quite challenging task (at least to me) and I
was
wondering
*>* if someone could advise how R could assist me to speed the
task
up.
*>>* I am dealing with a dataset with 3 discrete variables and one
continuous
*>* variable. The discrete variables are:
*>>* V1: 8 modalities
*>* V2: 13 modalities
*>* V3: 13 modalities
*>>* The continuous variable V4 is a decimal number always greater
than
zero in
*>* the marginals of each of the 3 variables but it is sometimes
equal
to zero
*>* (and sometimes negative) in the joint tables.
*>>* I have got 2 files:
*>>* => one with distribution of all possible combinations of
V1xV2
(some of
*>* which are zero or neagtive) and
*>* => one with the marginal distribution of V3.
*>>* I am trying to build the long and narrow dataset V1xV2xV3 in
such
a way
*>* that each V1xV2 cell does not get modified and V3 fits as
closely
as
*>* possible to its marginal distribution. Does it make sense?
*>>* To be even more specific, my 2 input files look like the
following.
*>>* FILE 1
*>* V1,V2,V4
*>* A, A, 24.251
*>* A, B, 1.065
*>* (...)
*>* B, C, 0.294
*>* B, D, 2.731
*>* (...)
*>* H, L, 0.345
*>* H, M, 0.000
*>>* FILE 2
*>* V3, V4
*>* A, 1.575
*>* B, 4.294
*>* C, 10.044
*>* (...)
*>* L, 5.123
*>* M, 3.334
*>>* What I need to achieve is a file such as the following
*>>* FILE 3
*>* V1, V2, V3, V4
*>* A, A, A, ???
*>* A, A, B, ???
*>* (...)
*>* D, D, E, ???
*>* D, D, F, ???
*>* (...)
*>* H, M, L, ???
*>* H, M, M, ???
*>>* Please notice that FILE 3 need to be such that if I aggregate
on
V1+V2 I
*>* recover exactly FILE 1 and that if I aggregate on V3 I can
recover
a file
*>* as close as possible to FILE 3 (ideally the same file).
*>>* Can anyone suggest how I could do that with R?
*>>* Thank you very much indeed for any assistance you are able to
provide.
*>>* Kind regards,
*>>* Luca*

      [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Bert Gunter

Sun, Mar 22, 2015 8:05 AM #

Oh, wait a minute ...

You still want the marginals for the other columns to stay as they were
originally?

If so, then this is impossible in general: the sum of all the values
must stay what it was originally, so you cannot choose your values for
V3 arbitrarily. The V3 targets have to add up to that same total.

Or at least, that seems to be what you are trying to do.
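
One way to see the constraint described above is to compare the grand
total implied by the v1 x v2 cells with the sum of the requested v3
totals. The sketch below rebuilds the 12-row extract of f1 posted earlier
in the thread (v4 rounded to 5 decimals) and uses the targets mentioned in
the thread (B=29, C=2.56723). The names cell_totals, grand_total,
target_v3 and feasible are introduced here for illustration only.

```r
# Extract of f1 as posted in the thread (v4 rounded to 5 decimals).
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0),
  stringsAsFactors = FALSE
)

# The v1 x v2 cell totals are to stay fixed, so their sum is fixed too.
cell_totals <- aggregate(v4 ~ v1 + v2, f1, sum)
grand_total <- sum(cell_totals$v4)

# Requested v3 totals (assumed from the thread: B = 29, C = 2.56723).
target_v3 <- c(B = 29, C = 2.56723)

# The request can only be met if both sums agree.
feasible <- isTRUE(all.equal(sum(target_v3), grand_total))
feasible
```

For this extract the two targets add up to the original total, so the
request is at least consistent. Targets that add up to anything else
cannot be met without changing at least one v1 x v2 cell.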

-- Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll

On Sun, Mar 22, 2015 at 7:55 AM, Bert Gunter <bgunter at gene.com> wrote:

I would have thought that this is straightforward given my previous email...

Just set z to what you want -- e.g., all B values to 29/number of B's,
and all C values to 2.567/number of C's (etc. for more categories).

A slick but sort of cheat way to do this programmatically -- in the
sense that it relies on the implementation of factor() rather than its
API -- is:

y <- f1$v3  ## to simplify the notation; could be done using with()
z <- (c(29,2.567)/table(y))[c(y)]

Then proceed to z1 as I previously described
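
The lookup above works on this data because v3 is stored as a character
vector, so c(y) indexes the table by level name. A sketch of a variant
that makes the name lookup explicit, and so should behave the same
whether v3 is character or factor, is shown below. The helper
per_row_share() and the vectors y_chr and y_fac are hypothetical names,
not part of the original code.

```r
# Targets named by v3 level (values taken from the message above).
target <- c(B = 29, C = 2.567)

# The v3 column of the posted extract, once as character, once as factor.
y_chr <- rep(rep(c("B", "C"), each = 3), times = 2)
y_fac <- factor(y_chr)

# Spread each level's target evenly over the rows of that level,
# looking levels up by name rather than by position.
per_row_share <- function(y, target) {
  key    <- as.character(y)
  counts <- table(key)
  unname(target[key]) / as.numeric(counts[key])
}

z_chr <- per_row_share(y_chr, target)
z_fac <- per_row_share(y_fac, target)

identical(z_chr, z_fac)   # same result for both storage types
sum(z_chr)                # equals sum(target)
```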

-- Bert


Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll




On Sun, Mar 22, 2015 at 2:00 AM, Luca Meyer <lucam1968 at gmail.com> wrote:

Hi Bert, hello R-experts,

I am close to a solution but I still need one hint w.r.t. the following
procedure (available also from
https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0)

rm(list=ls())

# this is (an extract of) the INPUT file I have:
f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A",
"B", "C"), v3 = c("B", "B", "B", "C", "C", "C", "B", "B", "B", "C", "C",
"C"), v4 = c(18.18530, 3.43806,0.00273, 1.42917, 1.05786, 0.00042, 2.37232,
3.01835, 0, 1.13430, 0.92872, 0)), .Names = c("v1", "v2", "v3", "v4"), class
= "data.frame", row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L,
197L, 204L, 206L))

# this is the procedure that Bert suggested (slightly adjusted):
z <- rnorm(nrow(f1)) ## or anything you want
z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5)
aggregate(v4~v1*v2,f1,sum)
aggregate(z1~v1*v2,f1,sum)
aggregate(v4~v3,f1,sum)
aggregate(z1~v3,f1,sum)

My question to you is: how can I set z so that I can obtain specific values
for z1-v4 in the v3 aggregation?
In other words, how can I configure the procedure so that e.g. B=29 and
C=2.56723 after running the procedure:
aggregate(z1~v3,f1,sum)

Thank you,

Luca

PS: to avoid any doubts you might have about who I am the following is my
web page: http://lucameyer.wordpress.com/



Luca Meyer

Sun, Mar 22, 2015 8:21 AM #

Hi Bert,

Thanks again for your assistance.

Unfortunately, when I apply the additional code you suggested I get
B=40.23326 & C=-8.66603 and not B=29 & C=2.56723. Any idea why that
might be happening?

Please see below or on
https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0 the code I
am running:

rm(list=ls())

# this is (an extract of) the INPUT file I have:
f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
"B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
"B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806,0.00273, 1.42917,
1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names =
c(2L,
9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))

# this is the procedure that Bert suggested (slightly adjusted):

y <- f1$v3  ## to simplify the notation; could be done using with()
z <- (c(29,2.567)/table(y))[c(y)]
# z <- rnorm(nrow(f1)) ## or anything you want
z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5)
aggregate(v4~v1*v2,f1,sum)
aggregate(z1~v1*v2,f1,sum)
aggregate(v4~v3,f1,sum)
aggregate(z1~v3,f1,sum)

Thanks again,

Luca
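
A possible reading of these numbers, offered as a sketch rather than as
the thread's answer: in z1 = v4 + z - ave(z, v1, v2), z acts as a shift
added to the existing v4 values. Feeding it the target levels (29/6 and
2.567/6) therefore moves the B total up by 3 * (29/6 - 2.567/6) = 13.2165
and the C total down by the same amount, instead of setting them to 29
and 2.567. Under the assumption that every v1 x v2 cell holds exactly one
B row and one C row (true for this extract), using the required change
per row as z gives the requested totals. The names current, n_rows,
z_level and z_shift are introduced here for illustration.

```r
# Extract of f1 as posted in the thread (v4 rounded to 5 decimals).
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0),
  stringsAsFactors = FALSE
)

target  <- c(B = 29, C = 2.56723)                  # requested v3 totals
current <- sapply(split(f1$v4, f1$v3), sum)        # existing v3 totals
n_rows  <- sapply(split(f1$v4, f1$v3), length)     # rows per v3 level

# (1) The posted attempt: z holds target levels per row.
z_level  <- unname((c(B = 29, C = 2.567) / n_rows)[f1$v3])
z1_level <- with(f1, v4 + z_level - ave(z_level, v1, v2, FUN = mean))
sapply(split(z1_level, f1$v3), sum)    # B = 40.23326, C = -8.66603

# (2) z holds the required change per row instead.
z_shift  <- unname(((target - current) / n_rows)[f1$v3])
z1_shift <- with(f1, v4 + z_shift - ave(z_shift, v1, v2, FUN = mean))
sapply(split(z1_shift, f1$v3), sum)    # B = 29, C = 2.56723

# The v1 x v2 totals are the same before and after.
aggregate(v4 ~ v1 + v2, f1, sum)
aggregate(z1_shift ~ v1 + v2, f1, sum)
```

This shortcut relies on every v1 x v2 cell containing exactly one row per
v3 level and on the targets summing to the original total. It also pushes
the C rows whose v4 is at or near 0 below zero, which may or may not be
acceptable for the real data.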


2015-03-22 15:55 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:

I would have thought that this is straightforward given my previous
email...

Just set z to what you want -- e.g., all B values to 29/number of B's,
and all C values to 2.567/number of C's (etc. for more categories).

A slick but sort of cheat way to do this programmatically -- in the
sense that it relies on the implementation of factor() rather than its
API -- is:

y <- f1$v3  ## to simplify the notation; could be done using with()
z <- (c(29,2.567)/table(y))[c(y)]

Then proceed to z1 as I previously described

-- Bert


Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll



