
problems with glm (PR#771)

1 message · Jim Lindsey

Simple: I do not trust R factor levels (among others, for reasons
outlined below).
  For any serious work involving factor variables, I use GLIM4.
Good. That patch was posted after I submitted my bug report.
It *has* estimated the diagonals under the fitted mover-stayer model
(that is, if it were doing it right). They are estimates of certain
parameters in this model. (Another way to fit the same model is to use
a 4-level factor variable on the diagonal instead of weighting it out,
but that does not supply the desired estimates.) R incorrectly calls
them fitted values in its objects and printouts. GLIM4 clearly prints
them out in a distinct way so that there is no confusion.
Sorry. Typo.

Here is what the docs say:

The levels of a factor are by default sorted, but the sort order
may well depend on the locale at the time of creation, and should
not be assumed to be ASCII.

Because this is the only documentation on factor variables and it does
not mention the factor function explicitly, I take it to be a general
statement.
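A small illustration of the documented behaviour (invented data; the point is that the default level order comes from sorting, not from the order in the data):

```r
## Default: levels are sorted, so the first level (the baseline under
## treatment contrasts) depends on the sort order and the locale,
## not on the order in which the values appear.
x <- factor(c("low", "high", "medium"))
levels(x)   # sorted order, e.g. "high" "low" "medium" in a C locale

## Stating the intended order explicitly avoids any locale surprises.
x <- factor(c("low", "high", "medium"),
            levels = c("low", "medium", "high"))
levels(x)   # "low" "medium" "high"
```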
I do not see what aov has to do with it when discussing glms. It is
for lm and, as far as I know, produces the old-fashioned material for
tests that more and more scientific journals are rightly refusing. In
say a standard high-dimensional contingency table (or a factorial
design), the first thing that scientists and industry people want is
to know which levels of factor variables yield scientifically
important effects.  Only then do they want measures of statistical
significance. What are required are first the appropriate coefficients
and then the corresponding intervals.
  The idea about predictions may be theoretically fine, at least in
simple models, but is practically useless in realistically complex
cases involving multidimensional contingency tables (or multifactorial
designs, from which factor variables take their name, for
that matter), where one wants to extract the relative effects of the
various factor levels and interactions. The predictions are virtually
useless to disentangle such effects.
  I teach that all different constraint sets are equivalent because
they give the same predictions so choose that set which is most
appropriate or most easily interpretable for the problem at hand. If
one level of a factor is of particular interest, use baseline
constraints (so-called treatment contrasts); if levels are rather
symmetric, use mean value constraints (so-called sum contrasts). In
other cases, special contrasts need to be set up. Contrasts are a
property of a variable, not of a design, so it would be nice to have
an automatic way of setting them for each factor variable rather than
a global contrast option. But I know of no software that allows this.
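For concreteness, the two constraint sets described above correspond in R to contr.treatment and contr.sum; a minimal sketch with an invented factor and response:

```r
f <- gl(3, 2, labels = c("A", "B", "C"))
y <- c(1.2, 1.0, 2.1, 2.3, 3.0, 3.2)

## Baseline (treatment) constraints: coefficients are differences
## from the baseline level, here "A".
coef(lm(y ~ f, contrasts = list(f = "contr.treatment")))

## Mean value (sum) constraints: coefficients are differences from
## the overall mean; the last level's coefficient is not printed.
coef(lm(y ~ f, contrasts = list(f = "contr.sum")))
```

(For what it is worth, C(f, sum) attaches a contrast matrix to an individual factor, which comes some way toward per-variable contrasts.)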

R makes it extremely difficult to understand codings. They are very
obscure and easily lead to errors of interpretation.  Everything seems
to be done to discourage their use (as opposed simply to looking at
tests) when their correct interpretation is essential. That is my
whole point.
^^^^^^^^^^^^^^^

Does this mean that they thought twice about it or is it a typo like
my "factors" above?
I stated above, in a part that you cut, to pay attention to the
contrasts options that I set. "Mean value constraints" is standard
terminology in the fields where I have worked for the simple reason
that the coefficients give differences from the mean value.
"Conventional constraints" is also commonly used but is less
informative.

What does the "number of the *contrast*" mean and how do you easily
find out what it is after the levels have been sorted around? In a log
linear model (or complex factorial design), they do refer to differences
("contrasts") of each level from the mean value. Single levels are at
least as relevant here as for any other contrast set.
They should be CC GL LY. The coefficient for WM is minus the sum of
the other three. The coefficients will be differences from the mean
value (on the log scale for log linear models) and are symmetric in
all factor levels (in the sense that no particular category is chosen
as special). Some statistical software packages very nicely print out
the coefficients for all levels instead of arbitrarily dropping the
last one.
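Under mean value (sum) constraints the omitted coefficient can be recovered by hand, since the level effects are constrained to sum to zero. A self-contained sketch, reusing the four labels above with an invented response:

```r
f <- gl(4, 3, labels = c("CC", "GL", "LY", "WM"))
y <- rnorm(12)   # placeholder response, for illustration only
fit <- lm(y ~ f, contrasts = list(f = "contr.sum"))

## R reports coefficients f1..f3 (CC, GL, LY); WM's effect is minus
## their sum, because the effects must add to zero.
b <- coef(fit)[c("f1", "f2", "f3")]
c(b, WM = -sum(b))

## dummy.coef() prints the effects for all four levels directly.
dummy.coef(fit)
```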
You have lost me here. The answer is, of course, No. What a
suggestion! It does not mean that at all. It compares CC to the mean
not to WM. To me, that looks like the appropriate label for baseline
constraints with the last category as baseline (what R calls treatment
but is more usually placebo!).
Here is a simple and unrealistic (as no one would ever fit the model
to these data) example that illustrates what can happen in more
complex tables:

## Two-column binomial response: first column successes, second failures.
## The second row has zero total count, so it is effectively weighted out.
y <- matrix(c(2,0,7,3,0,9), ncol=2)
## Three-level factor, one observation per level.
x <- gl(3,1)
glm(y ~ x, family=binomial)
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._