Dear Steffen (and other gRs)
I think it is absolutely the way ahead. Some quick questions and
comments
1. At first sight it seems confusing to use parentheses to mark block
components, I think I'd prefer
some special symbol (eg & ?) or the b() notation:
~ (A+B)*C|D & B*E|E & D|E or equivalently
~ b((A+B)*C|D, B*E|E, D|E)
2. Syntax constraints
* Interactions between continuous variables of max order 2, eg
X*Y*Z is illegal
* (I suppose X*X is equivalent to X^2?)
* Higher order continuous interactions could be disallowed or
ignored (prefer the former)
* Categorical variables are 'factors' in R (sorry for the
Rothamsted ambience here)
* I suppose A*A is illegal if A is a factor, or is it just
equivalent to A?
* Conditioning symbol | is followed by a simple variable list eg
(X,Y,A)
* no directed cycles for chain graph models
* more ?
eg ~ A*(X+Y+Z)^2|(X,Y,Z)
3. Functions can be used, eg ~Z+log(X), sqrt(x-min(x))
4. Ramifications of ':'
* My understanding is that the use of ':' rather than '*' relates to
different parametrisations of the same space.
In principle when specifying a model this should be irrelevant.
Or do we want to commit ourselves to a certain
parametrisation - if so, why?
* I suppose if ':' is allowed we should also allow %in% and /
(nested).
5. Question to the (ha)R(d)-core: can the existing R formula parser be
used with these formulae? Or how should it be done?
If we need a special parser, what should this return?
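A minimal sketch bearing on question 5, assuming only base R: a formula containing | and parentheses is already syntactically valid R, and the parser returns an ordinary call tree that can be walked. The helper split_components() is hypothetical (it is not part of gRbase or any package); it splits the right-hand side at top-level '+' into parenthesised chain components.

```r
# The formula is only parsed, never evaluated, so A, B, ... need not exist.
f <- ~ (A:B + C:D | D) + (B*E | E)

# For a one-sided formula, element 1 is `~` and element 2 is the right-hand side.
rhs <- f[[2]]
top_op <- as.character(rhs[[1]])   # operator applied last on the RHS

# Hypothetical helper: split at top-level '+' and strip one layer of parentheses.
split_components <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("+")) && length(e) == 3) {
    c(split_components(e[[2]]), split_components(e[[3]]))
  } else if (is.call(e) && identical(e[[1]], as.name("("))) {
    list(e[[2]])
  } else {
    list(e)
  }
}

comps <- split_components(rhs)

# Each component is a call to `|`: element 2 is the interaction part, element 3 the parents.
parents <- lapply(comps, function(e) all.vars(e[[3]]))
```

A special parser could plausibly return something like comps: one language object per chain component, to be interpreted by the model-fitting code.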
Best regards
David
-----Original Message-----
From: r-sig-gr-bounces at stat.math.ethz.ch
[mailto:r-sig-gr-bounces at stat.math.ethz.ch] On Behalf Of Steffen
Lauritzen
Sent: 19. august 2004 11:39
To: gRlist
Subject: [R--gR] Modelformulae
Dear gR-folks
The Danish gR-gang have been talking about describing a model language
for graphical models that
1) could specify at least chain graph models, based on the most general
hierarchical mixed models as described in Lauritzen (1996) [my book],
section 6.4, pages 199-216. (More general than MIM-models).
2) did not confuse people who were accustomed to glim-type notation and
formulae
3) did not conflict too much with existing formula conventions (MIM,
ggm)
4) was clear and unambiguous, and immediately understandable without too
much explanation
5) did not conflict too much with the whole idea and setup of graphical
interaction models
6) accommodates the idea of multiple response variables
Here is a first attempt. It may well work, but I would appreciate having
response back if I have overlooked some nasty conflicts or bad sides to
this.
The whole issue is somewhat plagued by the "coincidental" fact that
*intrinsically multivariate* log-linear models via "the Poisson trick"
can be described through univariate response models for the counts.
Below I will first describe the basic general setup, then some
conventions which enable people to use alternative, more traditional
approaches, without ambiguity.
What do you all think of this? Please reply to the entire list...;-)
If it works, the suggestion would be for gRbase to adopt it and abandon
MIM-notation altogether, as the latter is slightly different in style.
Hopefully it can also be extended to cover BUGS-type models without too
many direct conflicts.
Best regards
Steffen
--
Steffen L. Lauritzen
Department of Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, United Kingdom
Tel: +44 1865 272877; Fax: +44 1865 272595
email: steffen at stats.ox.ac.uk URL: www.stats.ox.ac.uk/~steffen/
---------
The following signs are (at least) permissible: ~, + , * , : , ^ ,
. and |
~ indicates the beginning of a formula. Implicitly think of
log f ~ ....
| denotes parenthood in graph, equiv to normalising/conditioning
+ denotes multiplicative combination (log-additive). Chain components
must be contained within parentheses.
* or : denotes (tensor)product of interaction terms, decomposed into
terms of lower order or not, i.e. A*B*C specifies all subsets of ABC,
whereas A:B:C only uses ABC.
strength of bindings (*,:) > + > |
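A small sketch, assuming only base R and its standard formula rules (stats::terms), of how the existing machinery treats these operators. It is an illustration of current R behaviour, not of the proposed gm() semantics. Note that R's parser lets ':' bind tighter than '*', whereas the proposal treats them as equally strong; both bind tighter than '+', and '+' binds tighter than '|'.

```r
# Term labels under R's standard formula expansion.
labels_of <- function(f) attr(terms(f), "term.labels")

star  <- labels_of(~ A * B * C)   # all non-empty subsets of {A, B, C}
colon <- labels_of(~ A:B:C)       # only the three-way term

# Operator precedence as seen by the parser (nothing is evaluated here).
e <- quote(A * B + C:D | D)
outer_op <- as.character(e[[1]])        # operator applied last
inner_op <- as.character(e[[2]][[1]])   # top operator of the left operand
```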
examples of legal formulae (same model with three chain components
specified)
m <- gm( log f ~ (A:B+C:D|D)+(B*E|E)+(D*E|E))
m <- gm( ~ (A:B+C*D|D), ~(B*E|E)+(D*E|E))
hierarchical models, as in CoCoCg and Lauritzen (1996), cf. p. 213
~ A+B:X+B*Y+A*B*X^2+A*X:Y+Y^2 not a mim-model
~ A+B:X+A*Y+A*(X+Y)^2 = mim(A+B/AX+BY/AXY)
some different models
m1<- gm(~A*B+C*D|B*D) equiv gm(~A*B+C*D+B*D|B*D)
m2<-gm(~((B+D)*E)|E)
m<-b(m1,m2)
m <- gm( ~ (A*B)+(C*D|D)+(B*E+D*E|E))
m<- gm( ~ (A*B)+(C|D)+(B+D|E))
CONVENTION for compatibility with standard regression and ggm:
Y~X+U:A is the same as ~(Y:X+Y:U:A |XUA) = ~(Y:(X+U:A) |XUA),
that is: *If* there is a variable on the left hand side of ~, this is a
response to the variables on the right hand side, and the interaction
structure is the product of right and left hand sides.
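The equality of the two interaction parts, Y:X+Y:U:A and Y:(X+U:A), can be checked with R's standard term expansion. This is a sketch assuming base R only; it covers the part to the left of |, not the conditioning on XUA, which stats::terms knows nothing about.

```r
# Term labels under R's standard formula expansion.
labels_of <- function(f) attr(terms(f), "term.labels")

expanded <- labels_of(~ Y:X + Y:U:A)
factored <- labels_of(~ Y:(X + U:A))   # ':' distributes over '+'

same_terms <- setequal(expanded, factored)
```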
Work still needs to be done to identify when models are legal, the same,
and parse them for proper and correct analysis.
Is this the way ahead?
_______________________________________________
R-sig-gR mailing list
R-sig-gR at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-gr
[R--gR] Modelformulae
4 messages · DED (David George Edwards), Steffen Lauritzen, David Meyer
just 2 comments:
~ (A+B)*C|D & B*E|E & D|E or equivalently ~ b((A+B)*C|D, B*E|E, D|E)
I think you will have to use the b() notation. The '&' operator will probably confuse the formula parser.
* Conditioning symbol | is followed by a simple variable list eg (X,Y,A)
eg ~ A*(X+Y+Z)^2|(X,Y,Z)
This is illegal in R. You will need to use a separator like + or again a grouping function:
~ A * (X+Y+Z) | X + Y + Z or
~ A * (X+Y+Z) | l(X,Y,Z)
just my 2c
David
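Both remarks can be checked against the parser directly. The sketch below assumes only base R; parses() is a hypothetical helper, and l() is the grouping function suggested above, which need not exist for the text to parse. The '&' form does parse, but '&' binds tighter than '|', so the blocks the parser sees are not the intended ones.

```r
# Hypothetical helper: TRUE if the text is syntactically valid R.
parses <- function(txt) {
  !inherits(try(parse(text = txt), silent = TRUE), "try-error")
}

ok_amp   <- parses("~ (A+B)*C | D & B*E | E & D | E")  # valid syntax
ok_list  <- parses("~ A*(X+Y+Z)^2 | (X,Y,Z)")          # comma inside plain parentheses
ok_plus  <- parses("~ A*(X+Y+Z)^2 | X + Y + Z")
ok_group <- parses("~ A*(X+Y+Z)^2 | l(X,Y,Z)")

# How the '&' version is actually grouped: '|' is applied last and is left-associative.
e <- quote((A+B)*C | D & B*E | E & D | E)
last_operand <- e[[3]]        # the trailing E stands alone
middle_block <- e[[2]][[3]]   # the call E & D, not B*E | E
```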
Dr. David Meyer
Department of Information Systems
Vienna University of Economics and Business Administration
Augasse 2-6, A-1090 Wien, Austria, Europe
Fax: +43-1-313 36x746
Tel: +43-1-313 36x4393
HP: http://wi.wu-wien.ac.at/Wer_sind_wir/meyer/
Just a quick reply to some of David E's comments (David M's comments have been recorded):
2. Syntax constraints
* Interactions between continuous variables of max order 2, eg X*Y*Z is illegal
* (I suppose X*X is equivalent to X^2?)
This needs thinking, but you are probably right. Generally the formulas on the left hand side of the conditioning symbols should refer to vector spaces, and one should keep this in mind whenever formulas are abbreviated and combined in strange ways. Somebody (Svante and I?) should look carefully into this.
For the moment, in my head X^2 is span(1, x, x^2). Generally X is span(1, x). For vector spaces U and V, U:V denotes the span of all componentwise products of basis vectors for U and V, embedded into suitable (tensor) product spaces. A "factor" A is span(e_alpha, alpha in levels of A), where e_alpha = 1 in cell alpha and 0 otherwise. A:B is then what we want it to be, and X:X is X^2.
X:Y:Z (or X*Y*Z) should be deemed illegal unless one of them is a factor. But some care may have to be taken with what products mean and how they are interpreted.
* Higher order continuous interactions could be disallowed or ignored (prefer the former)
OK
* Categorical variables are 'factors' in R (sorry for the Rothamsted ambience here)
OK, probably no way to get rid of this...
* I suppose A*A is illegal if A is a factor, or is it just equivalent to A?
with the above definition, A*A=A, but X*X is not X, when X is not a factor.
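For comparison, a sketch of what R's standard term expansion does today, assuming base R only. stats::terms works symbolically and does not know whether a variable is a factor, so it collapses a repeated variable in both cases. The proposed rule that X*X differs from X for a continuous X would therefore need handling outside the standard expansion.

```r
# Term labels under R's standard formula expansion.
labels_of <- function(f) attr(terms(f), "term.labels")

aa <- labels_of(~ A * A)   # repeated variable collapses to a single term
xx <- labels_of(~ X:X)     # likewise, regardless of the variable's type
```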
* Conditioning symbol | is followed by a simple variable list eg (X,Y,A)
yes, but not with parentheses (as commented by David M)
* no directed cycles for chain graph models
* more ?
eg ~ A*(X+Y+Z)^2|(X,Y,Z)
Here (X+Y+Z)^2= span(1,x,y,z)*span(1,x,y,z)= X^2+Y^2+Z^2+X:Y+X:Z+Y:Z
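A sketch contrasting this reading with R's standard expansion, assuming base R only. In standard R, (X+Y+Z)^2 gives main effects and two-way interactions but no squared terms; the squares have to be requested explicitly with I(). The second formula is one possible way to spell out the proposed meaning in today's notation, not an established convention.

```r
# Term labels under R's standard formula expansion.
labels_of <- function(f) attr(terms(f), "term.labels")

# Standard R: no X^2, Y^2, Z^2 terms.
r_terms <- labels_of(~ (X + Y + Z)^2)

# Spelling out the squares explicitly (assumed equivalent of the proposed reading).
with_squares <- labels_of(~ (X + Y + Z)^2 + I(X^2) + I(Y^2) + I(Z^2))
```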
3. Functions can be used, eg ~Z+log(X), sqrt(x-min(x))
Careful! Translation to vector spaces needs to be clear.
4. Ramifications of ':'
* My understanding is that the use of ':' rather than '*' relates to different parametrisations of the same space.
A little more than that. It also specifies a lattice of models, rather than a single one, namely the hierarchical submodels obtained by removing terms of higher order.
In principle when specifying a model this should be irrelevant.
Old Rothamsted ambience: Nelder implicitly specifies much more than a model with a formula, namely a full ANOVA of the data.... He never writes so in the theoretical part, but does so in his examples. Confusing, but true... I a
Or do we want to commit ourselves to a certain
parametrisation - if so, why?
No, not other than by the above. With a given parametrisation it becomes particularly easy to analyse all models by very few computations (only a single one in the normal case).
* I suppose if ':' is allowed we should also allow %in% and / (nested).
In principle yes, but we have to identify what it means. Split models?
Best regards
Steffen
[...]
* Conditioning symbol | is followed by a simple variable list eg (X,Y,A)
yes, but not with parentheses (as commented by David M)
and neither with commas: things like fun(~ A + B | C, D) are clearly misunderstood by the parser.
-d
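A short sketch of why fun(~ A + B | C, D) goes wrong, assuming base R only; fun is just the placeholder name from the example and is never called. The parser ends the formula at the comma, so D arrives as a second argument to fun rather than as a conditioning variable. Joining the conditioning variables with '+' keeps them inside the formula.

```r
# quote() only parses the call; fun does not need to exist.
cl <- quote(fun(~ A + B | C, D))
n_args     <- length(cl) - 1            # number of arguments passed to fun
second_arg <- cl[[3]]                   # what the parser made of D
cond_seen  <- all.vars(cl[[2]][[2]][[3]])  # variables to the right of '|'

# With '+' as separator the whole list stays to the right of '|'.
g <- quote(fun(~ A + B | C + D))
cond_plus <- all.vars(g[[2]][[2]][[3]])
```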