
OT: A test with dependent samples.


#
I am appealing to the general collective wisdom of this list in
respect of a statistics (rather than R) question.  This question comes
to me from a friend who is a veterinary oncologist.  In a study that
she is writing up there were 73 cats who were treated with a drug
called piroxicam.  None of the cats were observed to be subject to
vomiting prior to treatment; 12 of the cats were subject to vomiting
after treatment commenced.  She wants to be able to say that the
treatment had a ``significant'' impact with respect to this unwanted
side-effect.

Initially she did a chi-squared test.  (Presumably on the matrix
matrix(c(73,0,61,12),2,2) --- she didn't give details and I didn't
pursue this.)  I pointed out to her that because of the dependence ---
same 73 cats pre- and post-treatment --- the chi-squared test is
inappropriate.

So what *is* appropriate?  There is a dependence structure of some sort,
but it seems to me to be impossible to estimate.

After mulling it over for a long while (I'm slow!) I decided that a
non-parametric approach, along the following lines, makes sense:

We have 73 independent pairs of outcomes (a,b) where a or b is 0
if the cat didn't barf, and is 1 if it did barf.

We actually observe 61 (0,0) pairs and 12 (0,1) pairs.

If there is no effect from the piroxicam, then (0,1) and (1,0) are
equally likely.  So given that the outcome is in {(0,1),(1,0)} the
probability of each is 1/2.

Thus we have a sequence of 12 (0,1)-s where (under the null hypothesis)
the probability of each entry is 1/2.  Hence the probability of this
sequence is (1/2)^12 = 0.00024.  So the p-value of the (one-sided) test
is 0.00024.  Hence the result is ``significant'' at the usual levels,
and my vet friend is happy.
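
In R terms, this is just an exact one-sided binomial test on the 12
discordant pairs; a minimal sketch:

 > binom.test(12, 12, p = 0.5, alternative = "greater")

which reports the one-sided p-value (1/2)^12 = 0.0002441406.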

I would very much appreciate comments on my reasoning.  Have I made any
goof-ups, missed any obvious pit-falls?  Gone down a wrong garden path?

Is there a better approach?

Most importantly (!!!): Is there any literature in which this approach
is spelled out?  (The journal in which she wishes to publish will
almost surely demand a citation.  They *won't* want to see the
reasoning spelled out in the paper.)

I would conjecture that this sort of scenario must arise reasonably
often in medical statistics and the suggested approach (if it is
indeed valid and sensible) would be ``standard''.  It might even have
a name!  But I have no idea where to start looking, so I thought I'd
ask this wonderfully learned list.

Thanks for any input.

	cheers,

		Rolf Turner

#
In the biomedical arena, at least as I learned from Rosner's
introductory text, the usual approach to analyzing paired 2 x 2 tables
is McNemar's test.

?mcnemar.test

 > mcnemar.test(matrix(c(73,0,61,12),2,2))

	McNemar's Chi-squared test with continuity correction

data:  matrix(c(73, 0, 61, 12), 2, 2)
McNemar's chi-squared = 59.0164, df = 1, p-value = 1.564e-14

The help page has citation to Agresti.
#
73 cats were treated: none barfed before treatment, and 12 did after.

This gives the table:

| After      | Yes | No | Total |
|------------+-----+----+-------|
| Before Yes |   0 |  0 |     0 |
| Before No  |  12 | 61 |    73 |
|------------+-----+----+-------|
| Total      |  12 | 61 |    73 |


and a McNemar Test will assess symmetry with chi-square = 12 on 1 d.f., 
rejecting symmetry at conventional p-values.
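
A sketch of that calculation in R, entering the table above (rows =
Before, columns = After):

tab <- matrix(c(0, 12, 0, 61), 2, 2,
              dimnames = list(Before = c("Yes", "No"),
                              After  = c("Yes", "No")))
mcnemar.test(tab, correct = FALSE)
# McNemar's chi-squared = 12, df = 1, p-value = 0.000532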

But I think symmetry is an unreasonable null in this context, as I guess 
that one would not medicate a barfing cat. Certainly in the human oncology 
context with which I am familiar, it would be most unusual to posit 
symmetry of mucositis before and after chemotherapy when an agent that 
might induce mucositis is to be given.

I'd try to elicit an upper bound for the acceptable fraction of such side 
effects and then perform a test using that fraction as the alternative.

Failing that (because such elicitations are sometimes met with a blank 
stare) and even in addition to that, I'd calculate the point estimate and 
the 95% CI (or maybe even the 90% CI) and present those along with some 
interpretative advice. prop.test(12,73) would do it.
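
A sketch of both suggestions in R; the 25% upper bound below is purely
hypothetical and stands in for whatever fraction gets elicited:

prop.test(12, 73)                  # point estimate and 95% CI
binom.test(12, 73, p = 0.25,
           alternative = "less")   # test against the elicited bound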

HTH,

Chuck
On Tue, 10 Feb 2009, David Winsemius wrote:

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
#
Ah, experimental units, again ... a subject little taught by statisticians
that is often the crux of the matter. As here.

The cat is the experimental unit. There are 73 of them. 12 of them
experienced vomiting after treatment. What's a confidence interval for
the true proportion based on our sample of 73? binom.test(12,73) gives
us .088 to .27 for an exact 2-sided interval (and a P value < 2.2e-16
for the null = 0).

Seems rather convincing -- and simple -- to me!

-- Bert Gunter

#
on 02/10/2009 03:33 PM Rolf Turner wrote:
Rolf,

I am a little confused, perhaps due to lack of sleep (sick dog with CHF).

Typically in this type of study, essentially looking at the
efficacy/safety profile of a treatment, there are two options.

One does a two arm randomized study, whereby "subjects" are randomized
to one of two treatments. The two treatments may both be "active" or one
may be a placebo. Then a typical two sample comparison of the primary
hypothesis is made. In this setting, you would have a second group of 73
cats who received a comparative treatment (or a placebo) to compare
against the 16.4% observed in this treatment group.

For example, say that patients were undergoing cancer treatment, which
has nausea and vomiting as a side effect. Due to the side effect, it is
common to see a reduction in dosing, which of course reduces treatment
effectiveness. You might want to study a treatment that favorably
reduces that side effect, to enable improved treatment dosing and
patient tolerance.

The other option would be to perform a single sample study, where there
is an a priori hypothesis, based upon prior work, of the expected
incidence of the adverse event or perhaps a "clinically acceptable"
incidence of the adverse event. This would seem to be the scenario
indicated above.

What is lacking is some a priori expectation of the incidence of the
event in question, so that one can show that you have reduced the
incidence from the expected.

50% would not make sense here, though if it did, a single sample
binomial test would be used, presuming a two-sided hypothesis:
 > binom.test(12, 73, p = 0.5)$p.value
[1] 4.802197e-09


That none of them had vomiting prior to treatment seems to be of little
interest here. You could just as easily argue that there was a
significant increase in the incidence of vomiting from 0% to 16.4% due
to the treatment.

What am I missing?

Regards,

Marc Schwartz
#
Still seems that McNemar's test is the appropriate test for the
matched design, but my first answer sent the input to the function
incorrectly, and it's not clear that a normal theory test would be
accurate in all instances. The matrix should have 61 cats with no
vomiting under either condition, 12 cats vomiting only on piroxicam,
no cats vomiting only prior to treatment, and no cats vomiting under
both conditions:

 > mcnemar.test(matrix(c(61,0,12,0),2,2))

	McNemar's Chi-squared test with continuity correction

data:  matrix(c(61, 0, 12, 0), 2, 2)
McNemar's chi-squared = 10.0833, df = 1, p-value = 0.001496

It's very close to 1/2 the value of Bolker's calculation, but the
"sidedness" is not specified in the output. Rosner's text
"Fundamentals of Biostatistics" also describes an exact analogue of
the Normal theory test.
#
David,

If you really want to do a test on these data, I would suggest a
Fisher's Exact test, since you want to use hypergeometric
probabilities. You would probably want to try the CMH test, if the
function allows a single table and actually uses hypergeometric
probabilities.

My suggestion would be to calculate the frequency of vomiting for
animals that didn't vomit before, calculate the CIs, and then use some
historical data on the vomiting rate for non-treated cats and see
whether it falls inside the CIs for your treated animals. If it does,
then you might conclude that the vomiting rate for treated cats is
similar to that for non-treated cats.
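
A sketch of that comparison in R; the 5% historical vomiting rate is
purely hypothetical:

ci <- binom.test(12, 73)$conf.int  # exact 95% CI for the treated rate
hist_rate <- 0.05                  # hypothetical rate for non-treated cats
hist_rate >= ci[1] & hist_rate <= ci[2]
# FALSE here: 0.05 falls below the lower bound of about 0.088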

Murray M Cooper, Ph.D.
Richland Statistics
9800 N 24th St
Richland, MI, USA 49083
Mail: richstat at earthlink.net

----- Original Message ----- 
From: "David Winsemius" <dwinsemius at comcast.net>
To: "Rolf Turner" <r.turner at auckland.ac.nz>
Cc: "R-help Forum" <r-help at r-project.org>
Sent: Tuesday, February 10, 2009 4:50 PM
Subject: Re: [R] OT: A test with dependent samples.
#
Respectfully, I must disagree. (And it's not my cats, but those of
Turner's colleague.)  I particularly disagree with using a Fisher's
exact test as did Turner, as it would double the sample size
improperly (even though the FET is known to be conservative).

Your strategy appears very much in the Bayesian tradition with a very
informative prior. The strategy of calculating CIs and seeing if they
overlap appears pretty non-standard, I must say.

It appears to me that the exact version of the McNemar test offered by
Rosner is equivalent to his explication of the Wilcoxon signed rank
test, which also tests for the null of symmetry of deviation on either
side of the signs of difference in rankings.

-- David Winsemius
On Feb 10, 2009, at 9:03 PM, Murray Cooper wrote:

#
On Tue, 2009-02-10 at 21:03 -0500, Murray Cooper wrote:
Hi R-masters,

Well,

I think this is a complex problem, because we have neither a control
group nor a randomized study.

But I think the solution is a Bayesian approach.

I don't know the probability of vomiting in a cat, but it isn't 0, so
I think the prior is a Beta(1, 73) (i.e. 0+1 and 72+1).

The likelihood is obviously a Beta(13, 61) (i.e. 12+1 and 60+1).

So the posterior is Beta(1,73)*Beta(13,61) = Beta(14,134).

The expected value of the posterior is about 0.1, which in 72 cats is
7.2, i.e. about 7 cats, almost half of the 12 observed in the study.
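
A sketch of this in R, taking the Beta(14,134) above at face value:

14 / (14 + 134)                  # posterior mean, about 0.095
qbeta(c(0.025, 0.975), 14, 134)  # 95% credible interval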
#
Rolf,

as you explicitly asked for a comment on your proposal: It is generally
equivalent to McNemar's test and maybe even more appropriate because of
the asymptotics involved in the chi-squared distribution, which might
not be too good with n=12. In some more detail:

McNemar's test basically considers the difference between the
off-diagonal elements, normalized by their sum. You do the same,
ignoring all (0,0) and (1,1) pairs (the latter do not occur in your
example anyway). binom.test(12,12,alternative="greater") gives you the
one-sided p-value.

If, however, you do not use the exact binomial distribution (with n=12)
but the normal approximation, you find E(X)=6 and VAR(X)=3 with X the
number of sequences (0,1), and you observed x=12. This gives you a
z-score of (12-6)/sqrt(3) and a corresponding one-sided p-value:
 > pnorm((12 - 6) / sqrt(3), lower.tail = FALSE)
[1] 0.0002660028

Further, McNemar's test WITHOUT continuity correction gives

 > mcnemar.test(matrix(c(61, 0, 12, 0), 2, 2), correct = FALSE)

	McNemar's Chi-squared test

data:  matrix(c(61, 0, 12, 0), 2, 2) 
McNemar's chi-squared = 12, df = 1, p-value = 0.000532

It comes as no surprise that the reported (two-sided!) p-value is
exactly twice the (one-sided!) p-value from the binomial test with
normal approximation:
 > 2 * pnorm((12 - 6) / sqrt(3), lower.tail = FALSE)
[1] 0.0005320055

I do not want to stress the pros and cons of continuity corrections, but
if neglected, you see that you get the same results (except that McNemar
is generally two-sided), due to the relation of normal and chi-squared
distribution. 
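
The relation referred to is simply that the square of a standard
normal is chi-squared on 1 d.f.; a quick check:

z <- (12 - 6) / sqrt(3)
z^2                                 # 12, the McNemar chi-squared
pchisq(z^2, 1, lower.tail = FALSE)  # 0.000532, the two-sided p-value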

If you use the binomial test, you can forget about asymptotics and
continuity correction, that's why I indeed consider your approach
superior to the chi-squared approximation used in McNemar's test. But
you might fail to find a reference for the exact approach...


You briefly asked in a later mail about testing for p=0. Indeed, _any_
incident will disprove this hypothesis, and the p-value reported by

 > binom.test(1, 73, p = 0)

	Exact binomial test

data:  1 and 73 
number of successes = 1, number of trials = 73, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0 
95 percent confidence interval:
 0.0003467592 0.0739763232 
sample estimates:
probability of success 
            0.01369863 

is not wrong (as R just tells us it is BELOW a certain value), but could
be refined by saying "p-value = 0". BTW, I am not sure what
"p-value=TRUE" tells me, as derived from binom.test(0,73,p=0). I
personally don't care about either, as to test H0: p=0, I would not use
any software but rather stress some basic statistical theory.


There remains the question of whether McNemar (or your proposal) is
generally appropriate here. Like others, I have doubts, except maybe
if observations were made on two subsequent days, on the second of
which the drug was administered. How could you otherwise be sure that
the increase in incidences is not due to the progression of the
cancer, say? That would be difficult to rule out. Also, from my
experience and in contrast to your point of view, I'd rather assume it
likely that a vet is much more reluctant to apply the new treatment to
an already vomiting cat; your "baseline" with 0 out of 73 at least
seems suspicious, and until proven wrong, I'd take the assumption that
this is not by chance only.
I understand that this is an observational study with no
control/randomization, but concluding a side-effect from a statistically
significant change in incidence rates from this study seems questionable
to me. What I'd propose (and used in comparable situations) is to
confine yourself to the confidence interval and let the investigator
decide / discuss whether the lower bound of this is higher than what she
would expect under control or alternative treatment. She needs to have
some experience, if not historical data, and using a test or just this
confidence interval, you will only get a hint on a possible side-effect,
no formal proof (which is always difficult from observational studies).
Discussing the CI would seem fairer and even stronger to me then. In
other words: Mathematically, you'll get an appropriate test, while
conceptually, I'm far from being convinced.

Hope this makes sense, 
Michael




#
Thank you ***VERY*** much.  That clarifies a lot of things which were
vaguely knocking around in my mind but about which I could not get my
thoughts properly organized.

Thanks again.

	cheers,

		Rolf Turner

On 11/02/2009, at 11:27 PM,
Meyners,Michael,LAUSANNE,AppliedMathematics wrote:
#
Dear List,

Catching up with my backlog, I stumbled upon this :

On Wed, 11 Feb 2009 10:33:13 +1300, Rolf Turner wrote:

I read with interest the answers given, but got frustrated by (among
other points) seeing the main point unanswered: what in hell do you
want to *test*? And to prove what?

Classical test theory (Neyman and Pearson's sin, pride and glory)
gives you a (conventionally accepted) way to check if your data
support your assertions. It starts by computing somehow the
probability of getting your data by sheer chance under the hypothesis
of your assertion being *false* (i. e. the dreaded "null hypothesis");
if this probability is "too low" (less than 1 in 20, according to
R. A. Fisher's whim, now idolized as a standard), it proceeds by
asserting that this "too low" probability means "impossible" and, by
way of modus tollens (a-->b and not(b) ==> not(a), in propositional
logic), rejects your null hypothesis. Therefore, what your test
"proves" is just the negation of your null hypothesis.

The "null" hypothesis that "the drug does not cause cats to barf"
implies that the *probability* of seeing a cat barfing is zero. Any barf
is enough to disprove it and your alternative ("some cat(s) may barf
after having the drug") is therefore "supported" at all conventional
(and unconventional) levels. (See below for reasons for which this
reasoning is partially false).

Did you really bother to treat 73 cats to learn this? In this case,
you've way too much money to burn and time on your hands. You might
have learned that much cheaper by treating cats one at a time and
stopping at the first barf. You could even have obtained a (low
precision) estimate of the post-treatment barf probability, by
remembering that the distribution of the number of cats treated is
negative binomial...
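
A sketch of that sequential idea in R; the count of 7 non-barfing cats
before the first barf is purely hypothetical:

k <- 7                  # hypothetical: 7 cats treated before the first barf
p_hat <- 1 / (k + 1)    # crude point estimate of the barf probability
dgeom(k, prob = p_hat)  # geometric (negative binomial) likelihood at p_hat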

This (point- and interval-) estimation is probably much more
interesting than a nonsensical "test". Getting some precision on this
estimation might well be worth treating 73 cats. In this case, both
classical ("Fisherian") and Bayesian points of view give "interesting"
answers. You may note that the "classical" confidence interval and the
Bayesian credible interval with a noninformative prior have (nearly)
the same numerical bounds, with very different significations (pick
your poison, but be aware that the Bayesian point of view | gospel |
madness is quite ill-accepted in most medical journals nowadays...).
But the "test significance level" is still 0, meaning that this test
is sheer, pure, unadulterated, analytical-quality nonsense. Because
your "null" has no possible realistic meaning.
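
A sketch of that numerical comparison in R, taking the Jeffreys
Beta(1/2, 1/2) as the noninformative prior (the two intervals are
close, though not exactly equal):

binom.test(12, 73)$conf.int                 # exact (Clopper-Pearson) 95% CI
qbeta(c(0.025, 0.975), 12 + 0.5, 61 + 0.5)  # Jeffreys 95% credible interval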

Now, another "null" has been suggested : "the cats have the same
probability of barfing before and after drug administration", leading
you to a symmetry (McNemar) test. This is much more interesting, and
might have some value ... unless, as has been suggested, your
subjects are not "random cats" but "cats that do not barf before drug
administration". In this case, your whole experiment is biased, and your
null is effectively (almost) the same as before (i. e. "the drug does
not cause non-previously-barfing-cats to barf"), in which case the same
grilling can be applied to it.

In both cases, the null hypothesis tested is so far away from any
"reasonable" hypothesis that the test turns into a farce. A much
better way to present these results to a referee would be to give a
(well-motivated) point- and interval-estimation and plainly refuse to
"test" it against nonsense (explaining why, of course). Medical
journal editors and referees have spent way too much time complying
with dead statisticians' fetishes, turning the whole methodological
problem in (para-)clinical research into an exercise of rubbing blue
mud in the same place and at the same time as their ancestors did,
with no place for statistical (and probabilistic) thinking...

Another, more interesting, problem would be to know whether taking
the drug in question does not cause an "unacceptable" probability of
barfs. This would entail 1) defining the lowest "unacceptable" amount
of barfing, 2) defining the Type I error risk and power, 3) computing
the number of subjects necessary for a non-superiority trial against a
prespecified fixed hypothesis, and 4) effectively running and
analysing such a trial. Such a test of a *realistic* hypothesis would
indeed be worthy.
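
A sketch of step 3 under the usual normal approximation; the 25%
"unacceptable" bound, the 10% anticipated rate, and the 5% / 80% error
settings are all purely hypothetical:

p0 <- 0.25; p1 <- 0.10                # hypothetical bound and anticipated rate
za <- qnorm(0.95); zb <- qnorm(0.80)  # one-sided alpha = 0.05, power = 0.80
n  <- (za * sqrt(p0 * (1 - p0)) + zb * sqrt(p1 * (1 - p1)))^2 / (p0 - p1)^2
ceiling(n)                            # about 42 cats under these settings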

======== Cut here for a loosely related methodological rant ========

In any case, if I refereed this paper, I would not "buy" it: the
causal link is very badly established, because the intervention is not
the *drug*, it is *receiving the drug*, which is *quite* different,
even in animals (maybe especially in animals).

I happen to have worked on a very similar setup (testing an
antiemetic on dogs receiving platinum salts, which is a bit more
intricate than the present setup). I *know* for a fact that the
combination of placebo "platinum salts" and placebo "antiemetic"
*will* cause some dogs to barf (and show other stress manifestations):
it is even enough to have one dog in the same room (or even the same
(small) building) start barfing to set off barfing in other, totally
untouched (not even by placebos) animals. You see, stress can be
communicated by sight, sound and smell, and it's bloody hard to
isolate animals from each other on all three aspects... Furthermore,
if you were able to obtain such an isolation, you'd get animals
subject to a *major* stress: loneliness. Dogs are social beasts,
y'know...


And I don't suppose cats are any less sensitive (O Bubastis, forgive
them, for they do not know what they do...).

Therefore, to study the possible emetic effect of the *drug*, the
simple (maybe simplistic) way to test the (semi-realistic) null "the
*drug* doesn't modify the emetic risk entailed by taking any drug"
would be to compare the "barfing rates" of a *random* sample of cats
receiving the drug and another, *distinct*, *random* *control* sample
of cats receiving a suitable placebo (same presentation, same colour,
same smell, same "galenic" formulation, etc.). I'd blind any person in
contact with any of the animals to the exact nature of the
intervention (dogs and cats will somehow "feel" your expectations,
don't ask me how, but they can be worse than human patients in this
respect...). And I'd analyze this exactly as a clinical trial (this
*is* a veterinary clinical trial, indeed).
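
A sketch of that two-arm analysis in R; the placebo-arm counts are
purely hypothetical:

barfed  <- c(12, 3)   # barfing cats: drug arm, hypothetical placebo arm
treated <- c(73, 73)
prop.test(barfed, treated)                   # approximate two-sample comparison
fisher.test(matrix(c(12, 61, 3, 70), 2, 2))  # exact test on the same 2 x 2 table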

This simplistic scheme does not account for individual sensitivities
of animals. Various ways exist to "absorb" this, the simplest being of
course a properly randomized cross-over trial. The null becomes, of
course, triple: "the emetic properties of the administration do not
depend on the product administered", "the emetic properties of an
administration do not depend on a previous administration", and "the
emetic properties of a product do not depend on the previous
administration of another product". The interpretation of such an
experiment may become ... delicate.

Other schemes are possible: e. g., repeated experiments on the same
animals may allow one to *describe* individual sensitivities and the
variability thereof (analysis by mixed models including a "subject"
factor). However, I'd be very wary of the validity of such an
experiment, given the possibility (almost certainty...) of inducing
stereotyped behaviour in the subjects.
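
A sketch of such a mixed-model analysis with lme4; the data frame d
(one row per administration, with columns barf, product and cat) is
hypothetical:

library(lme4)
fit <- glmer(barf ~ product + (1 | cat), data = d, family = binomial)
summary(fit)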

And the importance of the ultimate goal would have to be bloody
mighty in order to justify such a treatment being inflicted on cats
(or dogs, for that matter)...

One might note that my previous proposal of a non-superiority trial
does not involve a control. That's because this trial has a
*pragmatic* goal: checking the acceptability of the administration of
a drug against an a priori set of criteria. It does not allow
inferences on the effect of the drug, and *postulates* that the
non-administration of the drug will result in nothing of interest.
This allows us to pull a realistic, interesting, null hypothesis out
of our hats.

On the other hand, the controlled plans, by virtue of having a
control, allow us to be analytic and separate the effect of the
administration from the effect of the drug itself: this latter might
indeed be zero, the associated null hypothesis isn't nonsensical, and
the test of this null isn't worthless.

======== End of the loosely related methodological rant ========

In any case, my point is: hypothesis testing is *not* the alpha and
omega of biostatistics, and other methods of describing and analysing
experimental results are often much more interesting, notwithstanding
the fetishes of journal referees. Furthermore, testing of impossible
or worthless hypotheses leads to worthless conclusions. Corollary: do
not test for the sake of testing, because "everybody does it" or
because a referee throws a tantrum; test realistic hypotheses, whose
rejection has at least some relation to your subject matter.

The two cents of someone tired of reading utter nonsense in prestigious
journals...

					Emmanuel Charpentier