
Regression with few observations per factor level

7 messages · V. Coudrain, Lars Westerberg, Chris Howden +3 more

#
Dear Krzysztof,
It is a good idea. Would you know of some R functions that are well suited for this kind of simulation?



#
Why not take the opportunity of getting to know ABC some more? Rasmus 
Bååth wrote a piece on Tiny Data and ABC which might suit your problem 
very well.
http://www.r-bloggers.com/tiny-data-approximate-bayesian-computation-and-the-socks-of-karl-broman/
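For a flavour of how ABC works, here is a minimal rejection-sampling sketch in R; the observed count, the Poisson model, and the uniform prior are all made up for illustration, in the spirit of the Tiny Data post:

```r
# Minimal ABC rejection sketch: estimate a Poisson mean from one observed count.
# All numbers here are illustrative, not from the original problem.
set.seed(1)
obs   <- 3                               # the single "tiny" observation
prior <- runif(100000, 0, 20)            # draws from a uniform prior on lambda
sim   <- rpois(length(prior), prior)     # simulate one count per prior draw
posterior <- prior[sim == obs]           # keep draws that reproduce the data
quantile(posterior, c(0.025, 0.5, 0.975))  # approximate posterior summary
```

With more data you would compare summary statistics within a tolerance rather than demanding exact matches, but the accept/reject logic is the same.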

Cheers
/Lars
On 2014-10-22 08:19, V. Coudrain wrote:
#
A good place to start is by looking at your residuals to determine
whether the normality assumptions are being met; if not, then some form
of GLM that correctly models the residuals, or a non-parametric method,
should be used.
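As a minimal sketch of that first step (the data frame and values below are made up for illustration), the residual check might look like:

```r
# Illustrative residual diagnostics for a tiny linear model.
# With n = 4 both the QQ-plot and the test have very little power.
df  <- data.frame(x = 1:4, y = c(2.1, 3.9, 6.2, 7.8))  # made-up data
fit <- lm(y ~ x, data = df)
qqnorm(resid(fit)); qqline(resid(fit))   # visual check of normality
shapiro.test(resid(fit))                 # formal test (interpret cautiously)
```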

But just as important is considering how you intend to use your
data and exactly what they are. Regardless of what the statistics say,
if you only have 4 data points are you really confident in making broad
generalisations with them? And writing a paper with your name on it?
Just a couple of data points could change everything, particularly if
the scale isn't bounded, so outliers can have a big impact. If the data
points are some form of average I would be more confident with only 4
of them, but if they are raw values I would be very cautious about any
conclusions you draw.

Another reason I would be cautious of a result using only 4 data points
is that the p values may be very poorly estimated. Although not widely
discussed, we often use the Central Limit Theorem to assume parameter
estimates are normally distributed when calculating the p value
(because parameters can be thought of as weighted averages, the CLT
applies to them). With only 4 data points we can't invoke the magic of
the CLT, and since there is no way to test whether the parameter
estimates are normal, we take quite a risk assuming we have accurate
p values at small sample sizes.

Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP Commercialisation and
Innovation, Data Analysis, Modelling and Training

(mobile) 0410 689 945
(fax / office)
chris at trickysolutions.com.au

Disclaimer: The information in this email and any attachments to it are
confidential and may contain legally privileged information. If you are not
the named or intended recipient, please delete this communication and
contact us immediately. Please note you are not authorised to copy,
use or disclose this communication or any attachments without our
consent. Although this email has been checked by anti-virus software,
there is a risk that email messages may be corrupted or infected by
viruses or other
interferences. No responsibility is accepted for such interference. Unless
expressly stated, the views of the writer are not those of the
company. Tricky Solutions always does our best to provide accurate
forecasts and analyses based on the data supplied, however it is
possible that some important predictors were not included in the data
sent to us. Information provided by us should not be solely relied
upon when making decisions and clients should use their own judgement.
On 22 Oct 2014, at 17:20, V. Coudrain <v_coudrain at voila.fr> wrote:

#
Dear All,

Please do not take any offence; I would really like to be removed from this mailing list. Can someone let me know how this can be done?

Best Regards,

--
Nicholas Hamilton
School of Materials Science and Engineering
University of New South Wales (Australia)
--
www.ggtern.com
On 23 Oct 2014, at 10:24 am, Chris Howden <chris at trickysolutions.com.au> wrote:

#
On 22 October 2014 17:24, Chris Howden <chris at trickysolutions.com.au> wrote:

Doing that could be very tricky indeed; I defy anyone, without knowledge of
how the data were generated, to detect departures from normality in such a
small data set. Try qqnorm(rnorm(4)) a few times and you'll see what I mean.
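To see this for yourself, here is a quick sketch that draws four QQ-plots of genuinely Gaussian samples of size 4; the plots vary wildly even though every sample really is normal:

```r
# Four QQ-plots of truly Gaussian samples of size 4.
# They look very different from each other, showing how hopeless it is
# to judge normality from 4 residuals.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  qqnorm(rnorm(4), main = paste("Gaussian sample", i))
}
par(op)  # restore plotting layout
```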

Second, one usually considers the distribution of the response when
deciding to fit a GLM, rather than first checking whether residuals from
an LM are non-Gaussian and only then moving on. The decision to use a
GLM should be motivated directly by the data and the question at hand.
Perhaps sometimes we can get away with fitting the LM, but that usually
involves some thought, in which case one has probably already considered
the GLM as well.

G

#
On 23/10/2014, at 18:17, Gavin Simpson wrote:

I agree completely with Gavin. If you have four data points and fit a two-parameter linear model, and in addition select a one-parameter exponential-family distribution (as implied in selecting a GLM family), you don't have many degrees of freedom left. I don't think such models would be accepted in many journals. Forget the regression and get more data. Some people suggested here that an acceptable model could be possible if your data points are not single observations but means of several observations. That is true: then you can proceed, but consult a statistician about the way to proceed.

Cheers, Jari Oksanen
#
I think there are actually 4 data points per level of some factor (after
seeing some of the other non-threaded emails - why can't people use email
clients that preserve threads?**); but yes, either way this is a small
data set, and trying to decide if residuals are normal or not is going
to be nigh on impossible.

I like the suggestion that someone made to actually do some simulation to
work out whether you have any power to detect an effect of a given size;
it seems pointless doing the analysis if your conclusion would be "well, I
didn't detect an effect, but I have no power so I don't even know if I
should have been able to detect an effect if one were present". You'd
then be no worse off than if you hadn't run the analysis or collected
the data.
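Such a power simulation might look like the following sketch in R; the two-level design, effect size, and error SD are all assumptions chosen for illustration, not taken from the original poster's data:

```r
# Illustrative power simulation: 4 observations per level of a 2-level factor.
# Simulate data under an assumed effect size, fit the model many times,
# and record how often the effect is detected at alpha = 0.05.
set.seed(42)
power_sim <- function(effect, n_per_level = 4, sigma = 1, nsim = 2000) {
  pvals <- replicate(nsim, {
    g <- factor(rep(c("A", "B"), each = n_per_level))
    y <- ifelse(g == "B", effect, 0) + rnorm(2 * n_per_level, sd = sigma)
    summary(lm(y ~ g))$coefficients["gB", "Pr(>|t|)"]
  })
  mean(pvals < 0.05)  # estimated power
}
power_sim(effect = 1)  # power to detect a difference of 1 SD
```

If the estimated power is low for any plausible effect size, that tells you the analysis cannot answer the question with the data in hand.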

G

** He says, hoping to heck that GMail preserves the threading information...
On 23 October 2014 14:00, Jari Oksanen <jari.oksanen at oulu.fi> wrote: