Hello,
I'm interested in correcting for and measuring unobserved
heterogeneity ("missing variables") using R. In particular, I'm
searching for a simple way to measure the amount of unobserved
heterogeneity remaining in a series of increasingly complex models
(adding additional variables to each new model) on the same data.
I have a static database of 400,000 or so individual mortgage
loans, each of which is observed monthly from origination (t=0) until
termination (a binary yes/no variable). In my update database, there
are up to 60 months of observed data for each loan in the static
database, and an individual loan has an "average life" of roughly 36
months.
Each loan has static covariates observed at origination, such as
original loan amount and credit score, as well as time-varying
covariates (TVC) such as age, interest rates, and house prices.
Because these TVC change each month, I've constructed a modeling
database that merges the static database with the update database.
The resulting "loan-month" modeling database has one observation
for every loan-month, and the static covariates remain the same for
all loan-months for a given loan. Thus, the modeling database has
roughly 14.4 million loan-month records. A loan is considered
"active" as long as it has not yet terminated or been censored; my
interest is in predicting termination.
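
A minimal sketch of the loan-month layout described above, using toy data. All table and column names (static, updates, loan_id, orig_amount, credit_score, age_months, rate, terminated) are hypothetical stand-ins, not the actual database schema. The 14.4 million figure is simply 400,000 loans times an average life of about 36 months.

```r
# Hypothetical static table: one row per loan, covariates fixed at origination.
static <- data.frame(
  loan_id      = c(1, 2, 3),
  orig_amount  = c(200000, 350000, 125000),
  credit_score = c(720, 680, 750)
)

# Hypothetical update table: one row per loan per observed month.
# Loan 1 terminates in month 3; loans 2 and 3 are censored.
updates <- data.frame(
  loan_id    = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  age_months = c(1, 2, 3, 1, 2, 1, 2, 3, 4),
  rate       = c(6.0, 5.9, 5.7, 6.0, 5.9, 6.0, 5.9, 5.7, 5.8),
  terminated = c(0, 0, 1, 0, 0, 0, 0, 0, 0)
)

# One record per loan-month; static covariates repeat within each loan.
loan_month <- merge(static, updates, by = "loan_id")
loan_month <- loan_month[order(loan_month$loan_id, loan_month$age_months), ]
print(loan_month)
```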
This type of data is often referred to as "event history" or
"discrete hazard" data. The standard R package to apply to such data
is "survival", with which I could estimate a Cox proportional hazard
model using coxph. The advantage of such an approach is that
unobserved heterogeneity is easily addressed using the "frailty" term.
The disadvantages, at least for my purposes, are two-fold.
First, my audience is unfamiliar with hazard models. Second, my
monthly data has many "ties" (many terminations in the same month),
so I've been told that coxph won't work well on a large dataset with
many ties.
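
For reference, a small sketch of how coxph lets one choose the tie-handling approximation through its ties argument. The data are simulated and the column names (months, terminated, credit) are hypothetical; this only illustrates the option and makes no claim about how it would behave on a dataset of the size described here.

```r
library(survival)
set.seed(7)

# Simulate hypothetical loans: a continuous latent termination time is
# rounded up to whole months, which produces many tied event times.
n      <- 300
credit <- rnorm(n)                                  # standardized score (illustrative)
latent <- rexp(n, rate = 0.03 * exp(0.5 * credit))
loans  <- data.frame(
  months     = pmin(ceiling(latent), 60),           # follow-up capped at 60 months
  terminated = as.integer(latent <= 60),            # 0 = censored at month 60
  credit     = credit
)

# Same model under two tie-handling approximations.
fit_efron   <- coxph(Surv(months, terminated) ~ credit, data = loans, ties = "efron")
fit_breslow <- coxph(Surv(months, terminated) ~ credit, data = loans, ties = "breslow")
print(c(efron = unname(coef(fit_efron)), breslow = unname(coef(fit_breslow))))
```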
On the other hand, because the data is measured discretely each
month, many references suggest applying generalized linear models
(GLM, "logit"-type models) or even generalized additive models
(GAM, "logit"-type models that incorporate nonlinearity in individual
covariates). The advantage to this approach is that GLM and GAM are
readily available in R, and my audience is very familiar with logit-
type models.
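
A minimal sketch of the discrete-time approach on simulated loan-month data: each loan contributes one row per month until it terminates or is censored at 60 months, and an ordinary logit is fit to those rows. The variable names and the simulation parameters are assumptions chosen only for illustration.

```r
set.seed(1)

n_loans <- 500
credit  <- rnorm(n_loans)   # standardized credit score (illustrative)

# Simulate one loan: monthly termination draws, truncated at the first event.
simulate_loan <- function(i) {
  age  <- 1:60
  p    <- plogis(-4 + 0.03 * age + 0.5 * credit[i])   # monthly termination probability
  y    <- rbinom(60, 1, p)
  last <- if (any(y == 1)) which(y == 1)[1] else 60   # stop at termination or month 60
  data.frame(loan_id = i, age = age[1:last], credit = credit[i],
             terminated = y[1:last])
}
loan_month <- do.call(rbind, lapply(seq_len(n_loans), simulate_loan))

# Discrete-time hazard model: a plain logit on the loan-month records.
fit <- glm(terminated ~ age + credit, family = binomial, data = loan_month)
print(summary(fit)$coefficients)
```

A GAM version would swap the linear age term for a smooth, for example mgcv::gam(terminated ~ s(age) + credit, family = binomial, data = loan_month).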
The disadvantage, however, is that I am totally unfamiliar with
ways to correct for and measure unobserved heterogeneity using GLM/
GAM-type models. I've been told that unobserved heterogeneity in the
hazard framework is analogous to random effects in the GLM/GAM
framework, but there seem to be a number of R packages that address
this issue in different ways.
So, I'd greatly appreciate suggestions on a simple way to
incorporate unobserved heterogeneity into a GLM/GAM-type model. I'm
not much of a statistician, so simple examples are always helpful.
I'm also happy to track down specific article/book references, if
folks think those might be of help.
Many thanks,
Kyle
---
kyle at hotmail . com
(email altered in obvious ways)
GLM/GAM and unobserved heterogeneity
2 messages · Kyle G. Lundstedt, Spencer Graves
7 days later
Have you considered "lmer" in library(lme4)? See, for example, sec. 4
on "Two-level models for binary data" in vignette("MlmSoftRev") with
library(mlmRev), in addition to www.r-project.org -> "Documentation:
Newsletter" -> "R News Volume 5/1" -> "Fitting Linear Mixed Models in R"
by Doug Bates, pp. 27-30.
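
A rough sketch of what such a fit might look like. It assumes a current version of lme4, in which binary-response mixed models are fit with glmer() rather than lmer(). The data are simulated, and grp, x, and z are hypothetical names; which unit the random intercept should be attached to in the mortgage data is left open. The idea is to compare the estimated random-intercept variance across nested models as one simple gauge of how much group-level heterogeneity remains unexplained.

```r
library(lme4)
set.seed(42)

n_grp <- 100                                  # number of groups (hypothetical)
n_per <- 30                                   # observations per group
g     <- rep(seq_len(n_grp), each = n_per)    # group index for each row

z_grp <- rnorm(n_grp)                         # group-level covariate
u_grp <- rnorm(n_grp, sd = 0.5)               # truly unobserved group effect

d <- data.frame(grp = factor(g), x = rnorm(n_grp * n_per), z = z_grp[g])
d$y <- rbinom(nrow(d), 1, plogis(-1 + 0.5 * d$x + 1.0 * d$z + u_grp[g]))

# Model 1 omits z, so its effect is absorbed by the random intercept.
m1 <- glmer(y ~ x + (1 | grp), family = binomial, data = d)
# Model 2 adds z; the remaining intercept variance should shrink.
m2 <- glmer(y ~ x + z + (1 | grp), family = binomial, data = d)

# Extract the estimated random-intercept variance from a fitted model.
re_var <- function(m) as.data.frame(VarCorr(m))$vcov[1]
print(c(without_z = re_var(m1), with_z = re_var(m2)))
```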
If you have more questions after reviewing this material, please
submit another question, preferably following the posting guide
(http://www.R-project.org/posting-guide.html). The posting guide is not
just another symbol of bureaucracy. It was written to help questioners
improve the chances that they will get the information they want
quickly. I believe it is quite effective when it is used. Many people
get answers to their questions in minutes, but that requires a question
that a potential respondent can understand, and for which they can
formulate a sensible answer, in seconds.
spencer graves
Spencer Graves, PhD Senior Development Engineer PDF Solutions, Inc. 333 West San Carlos Street Suite 700 San Jose, CA 95110, USA spencer.graves at pdf.com www.pdf.com <http://www.pdf.com> Tel: 408-938-4420 Fax: 408-280-7915