How many samples ACTUALLY used in regression?

6 messages · Ben Bolker, Marc Schwartz, Cade, Brian +2 more

#
Dear All,

is there a simple way that covers all regression models to extract the number of samples from a data frame/matrix actually used in a regression model?

For instance, I might have a data frame of 100 rows and 4 columns (1 response + 3 explanatory variables).  If 3 samples have one or more NAs in the explanatory variable columns, these samples will be dropped in any model:

my.model = lm(y ~ x + w + z, my.data)
my.model = glm(y ~ x + w + z, my.data, family = binomial)
my.model = polr(y ~ x + w + z, my.data)
?

I don't seem to be able to find one single method that works in the exact same way -- irrespective of the model type -- to interrogate my.model to see how many samples of my.data were actually used.  Is there such a function, or do I need to hack something together?

Best wishes

Federico
#
Federico Calboli <f.calboli <at> imperial.ac.uk> writes:
my.model = lm(y ~ x + w + z, my.data)
my.model = glm(y ~ x + w + z, my.data, family = binomial)
my.model = polr(y ~ x + w + z, my.data)
I haven't tested it (don't want to bother to put together the
test data), but does nrow(model.frame(my.model)) work ?
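
A minimal sketch of that suggestion, using made-up data (the variable names and the positions of the NAs are assumptions for illustration only). It relies on the default na.action = na.omit, so the three incomplete rows are dropped before fitting:

```r
# Hypothetical data: 100 rows, 3 of which have an NA in a predictor
set.seed(1)
my.data <- data.frame(y  = rnorm(100),
                      yb = rbinom(100, 1, 0.5),   # binary response for glm()
                      x  = rnorm(100),
                      w  = rnorm(100),
                      z  = rnorm(100))
my.data$x[c(5, 50)] <- NA
my.data$w[75] <- NA

fit.lm  <- lm(y ~ x + w + z, data = my.data)
fit.glm <- glm(yb ~ x + w + z, data = my.data, family = binomial)

# Rows that survived NA handling and entered each fit
n.lm  <- nrow(model.frame(fit.lm))    # 97
n.glm <- nrow(model.frame(fit.glm))   # 97
```

Whether this carries over to other model classes depends on each class providing a model.frame() method, so it should be checked per model type.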
#
On Mar 18, 2013, at 7:36 AM, Federico Calboli <f.calboli at imperial.ac.uk> wrote:

I don't know that this would be universal to all possible R model implementations, but should work for those that at least abide by "certain standards"[1] relative to the internal use of ?model.frame.

In the case where model functions use 'model = TRUE' as the default in their call (e.g. lm(), glm() and MASS::polr()), the returned model object will have a component called 'model', such that:

  nrow(my.model$model)

returns the number of rows in the internally created data frame.

Note that 'model = TRUE' is not the default for many functions, for example Terry's coxph() in survival or Frank's lrm() in rms. 

Note also that the value of 'na.action' in the modeling function call may affect whether the number of rows in the retained 'model' data frame is really the correct value.

You can also use model.frame(), independently matching arguments passed to the model function, to replicate what takes place internally in many modeling functions. The result of model.frame() will be a data frame, again, subject to similar limitations as above.
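
A rough sketch of the points above, on made-up data (names and NA positions are assumptions for illustration only). It shows the 'model' component with and without 'model = TRUE', and a stand-alone model.frame() call under two different 'na.action' settings:

```r
# Hypothetical data: 100 rows, 3 with an NA in z
set.seed(2)
my.data <- data.frame(y = rnorm(100), x = rnorm(100),
                      w = rnorm(100), z = rnorm(100))
my.data$z[c(10, 20, 30)] <- NA

fit.keep <- lm(y ~ x + w + z, data = my.data)                 # model = TRUE is the default
fit.drop <- lm(y ~ x + w + z, data = my.data, model = FALSE)  # no 'model' component kept

n.keep     <- nrow(fit.keep$model)     # 97
no.model   <- is.null(fit.drop$model)  # TRUE, so nrow() on it would be of no use

# Building the model frame directly, matching the arguments of the fit
n.omit <- nrow(model.frame(y ~ x + w + z, data = my.data, na.action = na.omit))  # 97
n.pass <- nrow(model.frame(y ~ x + w + z, data = my.data, na.action = na.pass))  # 100
```

The last line is the 'na.action' caveat in practice: with na.pass the incomplete rows stay in the model frame, so its row count no longer reflects the complete cases.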

Regards,

Marc Schwartz

[1]: http://developer.r-project.org/model-fitting-functions.txt
#
On 18/03/2013 14:51, Cade, Brian wrote:
Not very reliable (what about zero weights, for example?), and the 
component is usually 'residuals'.

No one has so far mentioned nobs(), which seems to me to be the closest.
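
A small sketch of the zero-weights point, on made-up data (the column names, including 'wts', are assumptions for illustration only). For lm() fits, nobs() leaves out cases with zero weight, whereas the model frame still contains them:

```r
# Hypothetical data: 100 rows, 3 with an NA in x, 5 complete rows with zero weight
set.seed(3)
my.data <- data.frame(y = rnorm(100), x = rnorm(100), w = rnorm(100))
my.data$x[1:3] <- NA
my.data$wts <- 1
my.data$wts[4:8] <- 0

fit <- lm(y ~ x + w, data = my.data, weights = wts)

n.frame <- nrow(model.frame(fit))  # 97: only the NA rows are dropped
n.obs   <- nobs(fit)               # 92: zero-weight cases are excluded as well
```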

#
On 18 Mar 2013, at 15:07, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:

Given a my.data where 3 out of 100 rows will be discarded due to NAs

test = lm(formula = y ~ x + w, data = my.data, model = TRUE)
nobs(test) 
[1] 97 # as expected

But if I substitute 1 NA in one of the rows of the housing data:

house.plr = polr(formula = Sat ~ Infl + Type + Cont, data = housing, weights = Freq)
nobs(house.plr)
[1] 1661

because of weights (which would not be taken into account in a glm() fit).
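
For comparison, a sketch on the unmodified housing data from MASS (no NA substituted, so the numbers differ from the output above). As far as I can tell, nobs() on a polr fit reports the sum of the weights, while the model frame has one row per row of data actually used:

```r
library(MASS)  # provides polr() and the housing data

house.plr <- polr(Sat ~ Infl + Type + Cont, data = housing, weights = Freq)

n.weighted <- nobs(house.plr)               # effective number of observations
n.rows     <- nrow(model.frame(house.plr))  # raw number of rows used

n.weighted == sum(housing$Freq)  # weighted count equals the total of Freq
n.rows == nrow(housing)          # no rows dropped, as there are no NAs here
```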

Because I only care about the raw number of observations, is there a (hopefully) trivial way of getting nobs(polr.fit) to behave like nobs(glm.fit)?

BW

Federico