
[R-pkg-devel] Formula modeling

7 messages · pikappa.devel at gmail.com, Richard M. Heiberger, Ben Bolker +2 more

#
Dear R-package-devel subscribers,

 

My question concerns a package design issue relating to the usage of
formulas.

 

I am interested in describing via formulas systems of the form:

 

d = p + x + y 

s = p + w + y

p = z + y

q = min(d,s).

 

The context in which I am working is that of market models with, primarily,
panel data. In the above system, one may think of the first equation as
demand, the second as supply, and the third as an equation (co-)determining
prices. The fourth equation is implicitly used by the estimation method, and
it does not need to be specified when programming the R formula. If you need
more information about the system, you may check the package diseq.

Currently, I am using constructors to build market model objects. In a
constructor call, I pass [i] the right-hand sides of the first three
equations as strings, [ii] an argument indicating whether the equations of
the system have correlated shocks, [iii] the identifiers of the dataset used
(one for the subjects of the panel and one for time), and [iv] the quantity
(q) and price (p) variables. These four arguments contain all the necessary
information for constructing a model.

 

I would now like to re-implement model construction using formulas, which
would be more familiar to most R users. I am currently
considering passing all the above information with a single formula of the
form:

 

q | p | subject | time | rho ~ p + x + y | p + w + y | z + y 

 

where subject and time are the identifiers, and rho indicates whether
correlated or independent shocks should be used.

 

I am unaware of other packages that use formulas in this way (for instance,
passing the identifiers in the formula), and I wonder if this would go
against any good practices. Would it be better to exclude some of the
necessary elements for constructing the model? This might make the resulting
formulas more similar to those of models with multiple responses or multiple
parts. I am not sure, though, how one would use such model formulas without
all the relevant information. Is there any suggested design alternative that
I could check?

 

I would appreciate any suggestions and discussion!

 

Kind regards,

Pantelis
#
I am responding to a subset of what you asked.  There are packages which use multiple formulas
in their argument sequence.


What you have as a single formula with | as a separator
q | p | subject | time | rho ~ p + x + y | p + w + y | z + y 
I think would be better as a comma-separated list of formulas

q , p , subject , time , rho ~ p + x + y , p + w + y , z + y 

because in R notation | is usually an operator, not a separator.

lattice uses formulas and the | is used as a conditioning operator.

nlme and lme4 can have multiple formulas in the same calling sequence.

lme4 is newer.  From its ?lme4-package help page:
'lme4' covers approximately the same ground as the earlier 'nlme'
     package.

lme4 is probably the model to follow for the package design.
#
I don't work with models like this, but I would find it more natural to 
express the multiple formulas in a list:

   list(d ~ p + x + y, s ~ p + w + y, p ~ z + y)

I'd really have no idea how either of the proposals below should be parsed.

Of course, if people working with models like this are used to working 
with notation like yours, that would be a strong argument to use your 
notation.
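
A minimal base-R sketch of how a list of formulas like the one above could be taken apart. The names eqs and rhs_vars are illustrative assumptions, not part of any package:

```r
# The three structural equations as a plain list of formulas.
eqs <- list(d ~ p + x + y, s ~ p + w + y, p ~ z + y)

# Name each equation by its left-hand side (element 2 of a two-sided formula).
names(eqs) <- vapply(eqs, function(f) deparse(f[[2]]), character(1))

# Variables appearing on the right-hand side (element 3) of each equation.
rhs_vars <- lapply(eqs, function(f) all.vars(f[[3]]))
```

Each equation stays a regular formula, so it can still be handed to model.frame() or terms() individually.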

Duncan Murdoch
On 07/10/2021 5:51 p.m., Richard M. Heiberger wrote:
#
There's a Formula package on CRAN 
<https://cran.r-project.org/web/packages/Formula/index.html> that's 
designed for this use case.

   lme4 and nlme don't use it, but implement their own formula 
manipulation machinery. (The cleanest version of this machinery is 
actually in glmmTMB at 
https://github.com/glmmTMB/glmmTMB/blob/master/glmmTMB/R/reformulas.R .)

   I would probably recommend Duncan's or Richard's approach, but if you 
want to keep your original syntax then the Formula package is probably 
the way to go.
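
A short sketch of what this might look like with the original syntax, assuming the CRAN package Formula is installed. Only Formula(), length() and formula() from that package are used, and the pairing of parts shown here is just one possible reading of the model:

```r
# Assumes the CRAN package 'Formula' is installed.
library(Formula)

f <- Formula(q | p | subject | time | rho ~ p + x + y | p + w + y | z + y)

# Number of parts on the left-hand and right-hand side.
n_parts <- length(f)

# First LHS part paired with the first RHS part.
first_eq <- formula(f, lhs = 1, rhs = 1)

# One-sided formula holding only the third RHS part.
price_rhs <- formula(f, lhs = 0, rhs = 3)
```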
On 10/7/21 5:58 PM, Duncan Murdoch wrote:

  
    
#
On 07/10/2021 5:58 p.m., Duncan Murdoch wrote:
There's a disadvantage to this proposal.  I'd assume that "p" means the 
same in all 3 formulas, but with the notation I give, it could refer to 
3 unrelated variables, because each of the formulas would have its own 
environment, and they could all be different.  I guess you could make it 
a requirement that they all use the same environment, but that's likely 
going to be confusing to users, who won't know what it means.
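
The environment issue can be illustrated with a small base-R sketch. The helper functions make_demand() and make_supply() are hypothetical and exist only to create formulas in two different environments:

```r
# Two hypothetical helpers, each building a formula inside its own call frame.
make_demand <- function() {
  p <- "object called p inside make_demand()"
  d ~ p + x + y
}
make_supply <- function() {
  p <- "a different object that happens to be called p"
  s ~ p + w + y
}

demand <- make_demand()
supply <- make_supply()

# Each formula remembers the environment in which it was created.
same_env <- identical(environment(demand), environment(supply))

# The symbol p resolves to a different object depending on the formula.
p_demand <- get("p", envir = environment(demand))
p_supply <- get("p", envir = environment(supply))
```

Here same_env is FALSE, and looking up p through the two formulas gives two unrelated objects.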

Another possibility that wouldn't have this problem (but in my opinion 
is kind of ugly) is to use R vector construction notation:

   c(d, s, p) ~ c(p + x + y, p + w + y, z + y)
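
A sketch of how a formula in this notation could be split back into one formula per equation, using only base R. The variable names are illustrative assumptions. Because everything comes from a single formula, all pieces share one environment:

```r
f <- c(d, s, p) ~ c(p + x + y, p + w + y, z + y)

# f[[2]] and f[[3]] are unevaluated calls to c(); dropping the first element
# (the function name) leaves the individual parts.
lhs_parts <- as.list(f[[2]])[-1]
rhs_parts <- as.list(f[[3]])[-1]
stopifnot(length(lhs_parts) == length(rhs_parts))

# Rebuild one formula per equation by evaluating a `~` call in the
# environment of f, so that every equation shares that environment.
eqs <- Map(
  function(l, r) eval(call("~", l, r), environment(f)),
  lhs_parts, rhs_parts
)
names(eqs) <- vapply(lhs_parts, deparse, character(1))
```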

Duncan Murdoch
#
Hi,

The different environments can potentially be an issue in the future. I was not aware of the vector construction notation, and I think this is what I was mainly looking for. 

I could provide two initialization methods. One will use the ugly vector notation that one could use to bind the whole model with a particular environment. The second can be more user-friendly and use the comma-separated list of formulas. Essentially, the second will prepare the vector formula and call the first initialization method.

The (|) operator comment makes sense, and I would also want to avoid this to the extent that it is feasible.  So, I am currently thinking of something along the lines of:

c(d, s, p | subject | time) ~ c(p + x + y, p + w + y, z + y)

This is very similar to how the function lme4::lmer uses the bar to separate expressions for design matrices from grouping factors. Actually, the subject and time variables are needed for subsetting prices for various operations required for the model matrix. 
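
A rough base-R sketch of how the identifiers could be pulled out of such a formula. The helper split_bars() is hypothetical, and the sketch assumes the bars appear only in the last element of the left-hand side:

```r
# Hypothetical helper: split an expression on top-level `|` operators,
# left to right.
split_bars <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("|"))) {
    c(split_bars(e[[2]]), split_bars(e[[3]]))
  } else {
    list(e)
  }
}

f <- c(d, s, p | subject | time) ~ c(p + x + y, p + w + y, z + y)

# Elements of the left-hand c() call, without the function name.
lhs_parts <- as.list(f[[2]])[-1]
last <- split_bars(lhs_parts[[length(lhs_parts)]])

# The first piece of the last element is the response; the rest are identifiers.
responses <- vapply(c(lhs_parts[-length(lhs_parts)], last[1]), deparse, character(1))
ids <- vapply(last[-1], deparse, character(1))
```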

Thanks for the suggestions; they are very helpful!

Best,
Pantelis

-----Original Message-----
From: Duncan Murdoch <murdoch.duncan at gmail.com> 
Sent: Friday, October 8, 2021 2:04 AM
To: Richard M. Heiberger <rmh at temple.edu>; pikappa.devel at gmail.com
Cc: r-package-devel at r-project.org
Subject: Re: [R-pkg-devel] [External] Formula modeling
#
On Fri, 8 Oct 2021, pikappa.devel at gmail.com wrote:

            
xyplot() and glm(), this is a bit hard to parse visually. One could 
imagine making a mistake that s corresponds to x, rather than p+w+y.

I wonder if there is a way to write something along the lines of

~c( d~p+x+y,
     s~p+w+y,
     p~z+y |subject | time
    )

A quick experiment with R shows that this is treated like a formula, so ~c 
becomes a way to group formulas.
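
The experiment can be reproduced along these lines in base R. The variable names are illustrative. Note that, by R's operator precedence, ~ binds less tightly than |, so the bars end up inside the right-hand side of the last equation:

```r
g <- ~c(d ~ p + x + y,
        s ~ p + w + y,
        p ~ z + y | subject | time)

class(g)    # "formula"
length(g)   # 2, since the formula is one-sided

# The arguments of c() are unevaluated calls to `~`, one per equation.
parts <- as.list(g[[2]])[-1]
is_eq <- vapply(
  parts,
  function(e) is.call(e) && identical(e[[1]], as.name("~")),
  logical(1)
)

# Right-hand side of the last equation: ((z + y) | subject) | time
last_rhs <- parts[[3]][[3]]
```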

best

Vladimir Dergachev