
Regression analysis with small but complete dataset (fully representing reality)?

4 messages · Patrick (Malone Quantitative), sree datta, Phillip Alday +1 more

#
Diana,

cc'ing the list again in case anyone else has input

I was asking if the missingness was structural--for example, hours per shift if
someone is unemployed at the time of measurement. In that scenario, you
could have missing "values" but still completely observed *data*.

Normally, I would assume that questions about missing data refer to
incomplete observation, but you clearly have a special situation, which is
why I asked.

If your population data is completely observed, again, you don't need
inferential statistics.

If not, you do indeed have a sample of the data, not the population, even
though you have most of it. I believe there are corrections (the finite
population correction, for example) that need to be made to inferential
statistics when you have sampled a large share of a small population. I
don't have experience with that, but it might get you started.
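The corrections Pat mentions are usually called the finite population
correction (FPC): the usual standard error is multiplied by
sqrt((N - n) / (N - 1)). A quick numeric sketch in Python (the values for
s, n, and N are made up for illustration):

```python
import math

def fpc_standard_error(s, n, N):
    """Standard error of the mean with the finite population correction.

    s: sample standard deviation, n: sample size, N: population size.
    The usual SE (s / sqrt(n)) is shrunk by sqrt((N - n) / (N - 1)),
    because once you have observed most of a small population there is
    little unobserved variability left.
    """
    plain_se = s / math.sqrt(n)
    correction = math.sqrt((N - n) / (N - 1))
    return plain_se * correction

# With 70 of 100 units observed, the correction factor is about 0.55,
# so the corrected SE is roughly half the uncorrected one.
se = fpc_standard_error(s=10.0, n=70, N=100)
```

Note the limiting case: with the whole population observed (n = N) the
corrected SE is exactly zero, which is another way of saying no inference
is needed.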

Pat
On Fri, Dec 25, 2020 at 9:55 AM Diana Michl <dianamichl at aikq.de> wrote:

#
Hi Diana

In addition to using descriptive statistics, I would also recommend Partial
Least Squares (PLS) regression, which was specifically designed for the
problem of small sample sizes with many variables (your dependent variable
can be continuous, binary, or multinomial in PLS). I have successfully used
PLS regression in the medical/healthcare arena for rare and orphan disease
analyses, where the affected population is very small and getting data from
30 patients represents anywhere from 25% to 60% of the overall population.

I strongly recommend this excellent resource (a detailed 235-page PDF) by
Gaston Sanchez on his website:
https://www.gastonsanchez.com/PLS_Path_Modeling_with_R.pdf

Hope this helps. If you have any questions or need additional information
please get back to me and I can help you in identifying whether PLS
regression would be relevant and helpful for you.

Sree


On Fri, Dec 25, 2020 at 12:08 PM Patrick (Malone Quantitative) <
malone at malonequantitative.com> wrote:

#
I think there is some confusion about what's meant by "complete"--do
you mean that

- all possible combinations of predictors occur?
- you observed all possible individuals in a population?
- you observed all possible individuals in a 'cohort' but there might be
future cohorts (e.g. all students in a given degree program in a given
year, but there will be more students in other years)?
- something else entirely?

The first three possibilities can obviously overlap and which aspect you
focus on depends on your exact inferential question. For example, if you
observed all students in a given degree program in a given year, then
you might want to make statements about those students (which would be a
descriptive task, as Pat mentioned) or you might want to make statements
about the entire abstract population of students who may in the future
be in that degree program (in which case you would have an inferential
task). That distinction may not be obvious in the original research
question, but one of the hardest things in statistics is figuring out
what the actual statistical problem is, which may or may not be obvious
from the research question. :)

If you're doing descriptive stats, then you don't need any special
methods. The usual summary statistics -- mean, median, mode for central
tendency; range, standard deviation, median absolute deviation,
histogram for variability -- will do the trick.
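As a concrete illustration (in Python, with made-up numbers), all of those
summaries are one-liners in the standard library:

```python
import statistics

# A small, fully observed set of values (illustrative only).
x = [4, 7, 7, 8, 10, 12, 15, 15, 15, 21]

central = {
    "mean": statistics.mean(x),
    "median": statistics.median(x),
    "mode": statistics.mode(x),
}
spread = {
    "range": max(x) - min(x),
    "stdev": statistics.stdev(x),  # sample standard deviation
    # median absolute deviation: median of |x_i - median(x)|
    "mad": statistics.median(abs(v - statistics.median(x)) for v in x),
}
```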

If you're doing inferential stats with small data, then there are a few
intertwined issues:

- the amount of inference you can perform is inherently limited because
the amount of information present is inherently limited. (This is of
course always true, regardless of how much data you have!)

- regularization of various forms is your friend and can even help you
fit otherwise 'impossible' models. Ridge regression, LASSO, elastic net
are all examples of regularized methods; mixed models also perform
regularization in the random effects, but it's a bit different.

- if you have prior knowledge from other means (strong theory, other
data, etc.), then Bayesian methods can help you integrate that into the
statistical procedure.
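A minimal scikit-learn sketch of the penalized methods from the list above
(the simulated data and the alpha values are illustrative; in real use,
tune the penalty strength by cross-validation):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)

# More predictors than observations: ordinary least squares is not
# identifiable here, but the penalized fits are -- this is the sense in
# which regularization lets you fit otherwise 'impossible' models.
X = rng.normal(size=(15, 40))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=15)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
coefs = {name: m.fit(X, y).coef_ for name, m in models.items()}

# The L1 penalty in the lasso sets many coefficients exactly to zero,
# effectively doing variable selection; ridge only shrinks them.
n_zero = int(np.sum(coefs["lasso"] == 0))
```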

Note that you can also use priors as a form of regularization, see e.g.
https://jakevdp.github.io/blog/2015/07/06/model-complexity-myth/ for a
good overview of lots of relevant details for the tips above.
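One concrete way to see the prior/regularization connection (also covered
in the linked post): the ridge estimate is the MAP estimate under an
independent Gaussian prior on the coefficients. A quick numeric check,
with no intercept for simplicity (data simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(size=20)

alpha = 2.0
# Closed-form ridge / Gaussian-MAP solution: (X'X + alpha I)^{-1} X'y.
# alpha plays the role of (noise variance / prior variance).
beta_map = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)

ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
```

The two coefficient vectors agree to numerical precision, which is the
"priors as regularization" point in miniature.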

Elsewhere in the thread, PLS was suggested. PLS is an interesting
technique, but it doesn't solve the small data problem. In some sense,
you can think of PLS as a generalization of PCA, where the components
are determined not on the basis of shared variation within the
predictors, but rather shared variation between the predictors and the
response variable. The PLS package
(https://cran.r-project.org/web/packages/pls/index.html) has decent
documentation. PLS is really useful if you want to identify specific
combinations of predictors that can be combined into a single predictive
factor.  Both PLS and PCA are often used for 'dimensionality reduction'
where you transform your original variables into a new set of variables,
ordered by something like explanatory power. (That is a massive
oversimplification.) Then you can drop the low-ranked variables and thus
reduce the number of variables you're dealing with. In other words, PLS
and PCA can be useful for reducing the number of variables you're
dealing with, which can sidestep the small data problem. This is great
for prediction, but if you want to do inference on model parameters,
then it makes things a bit more complicated. It really depends on what
you want to do.
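A sketch of the dimensionality-reduction workflow described above, using
PCA (the simpler of the two) on simulated data where 10 observed variables
are noisy mixtures of 2 latent factors -- all numbers illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# 10 observed variables driven by 2 latent factors plus small noise.
latent = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(50, 10))

pca = PCA().fit(X)
# Components are ordered by explained variance; keep just enough of the
# top-ranked ones to cover 95% of the variance and drop the rest.
k = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95) + 1)
X_reduced = PCA(n_components=k).fit_transform(X)
```

With only two real factors underneath, almost all the variance lands in
the first couple of components, so the 10 original variables reduce to a
much smaller working set.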

All that said, I'm not seeing anything here that's particular to mixed
models (nor actually anything involving mixed models at all...), so you
might have better luck finding information if you look beyond the mixed
models mailing list. :)

Best,
Phillip
On 25/12/20 6:07 pm, Patrick (Malone Quantitative) wrote:
#
Hi all,

sorry it took me a while to respond, the holidays... Thanks very much 
for your help and suggestions!

@Pat: Right, I get it. The data is completely observed and the missing 
data is not structural. I mostly get what you're saying about not needing 
inferential statistics. I thought, though, that they give information 
about relationships between variables which descriptive statistics just 
can't. Like, descriptive stats can tell me means, IQRs, maybe frequency 
distributions--but regressions can show how some variables /predict/ 
others. Or correlations show how (strongly) variables relate to one 
another and whether that's likely significant or random. I could really 
use methods that can do that. But if it's not possible with a dataset 
such as mine, then that's the way it is.

@Sree: Maybe partial least squares is what I'm looking for! I've never 
done this or heard of it. Is it much like ordinary least squares?
Thanks very much for the link, I'll look into it. I'll see how far I get 
and will gladly get back to you once I'm there. It will take a few days. 
My data indeed sounds similar to yours, except that my set never 
represents less than about 70% of all existing cases.

Best

Diana


On 26.12.2020 at 07:27, sree datta wrote: