I thought about a couple of approaches (see details below), but none seems very
satisfactory. This issue keeps reminding me of the LASSO and other shrinkage
methods, but the twist here is that it is not the beta for a covariate that is
shrunk to zero, but different covariates in each subject.
Is there any obvious solution I am missing? Any suggestions?
Thanks,
************
Approach 1: the final statistic to judge predictive quality is Goodman &
Kruskal's tau (or concentration coefficient) for IxJ contingency tables.
Since every subject with m "present" covariates yields m possible contingency
tables, and there are many subjects with multiple present covariates, the
number of possible contingency tables is astronomical, and an exhaustive
search is not feasible (nor do I see an obvious way to simplify the problem
from tau's definition, because we have 12 categories to predict from the 8
covariates). I would use a genetic algorithm to try to find a decent
solution.
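For reference, the concentration coefficient used as the fitness statistic is straightforward to compute from a single contingency table. A minimal numpy sketch (the table values and the function name are made up for illustration):

```python
import numpy as np

def goodman_kruskal_tau(table):
    """Goodman & Kruskal's tau for predicting the column variable
    from the row variable of an I x J contingency table of counts."""
    p = table / table.sum()            # cell proportions p_ij
    p_row = p.sum(axis=1)              # row marginals p_i.
    p_col = p.sum(axis=0)              # column marginals p_.j
    # proportional reduction in prediction error for the column variable
    num = (p**2 / p_row[:, None]).sum() - (p_col**2).sum()
    den = 1.0 - (p_col**2).sum()
    return num / den

# a made-up 2x3 table for illustration
tab = np.array([[30.0, 10.0, 5.0],
                [5.0, 20.0, 30.0]])
print(goodman_kruskal_tau(tab))
```

tau is 0 under independence and 1 when the row category determines the column category, so a GA could use it directly as the fitness of a candidate configuration of covariates.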
Approach 2: set this up as a multinomial loglinear model and fit it (using
multinom) to the original data set. Do not treat the covariates as factors;
code "present" as 1 and "absent" as 0.
For each subject with several (say, k) "present" covariates, predict the class
membership (predict.multinom) for each of the k covariate vectors obtained by
subtracting, say, 0.1 from each of the non-zero covariates except one. Set as
the new covariate vector for that subject the one that gives the highest
predicted probability to the correct class.
Repeat the model fitting and modify the covariates as in the last step
(rescaling at the end, so that the maximum covariate value is always one for
each subject) until only one non-zero covariate remains (if that ever
happens!).
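One shrink-and-choose pass of this scheme can be sketched in numpy. This is only an illustration of the candidate-generation step: a toy fixed softmax model stands in for the refitted multinom model (the refit between passes is omitted), and the names `softmax_proba`, `shrink_step`, `W`, and the example data are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_proba(X, W):
    """Toy stand-in for predict.multinom: class probabilities
    from a fixed linear model with weight matrix W."""
    z = X @ W
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def shrink_step(x, y, W, step=0.1):
    """One pass for a single subject with true class y: build one
    candidate per non-zero covariate (subtract `step` from all the
    other non-zero covariates, then rescale so the max is 1) and
    keep the candidate giving class y the highest probability."""
    nz = np.flatnonzero(x)
    if len(nz) <= 1:
        return x                              # already one covariate left
    best_x, best_p = x, -np.inf
    for keep in nz:
        cand = x.copy()
        others = nz[nz != keep]
        cand[others] = np.maximum(cand[others] - step, 0.0)
        cand = cand / cand.max()              # rescale: max covariate is 1
        p = softmax_proba(cand[None, :], W)[0, y]
        if p > best_p:
            best_x, best_p = cand, p
    return best_x

# made-up example: 8 covariates, 12 classes, one subject of class 3
W = rng.normal(size=(8, 12))
x = np.array([1.0, 0.7, 0.0, 0.4, 0.0, 0.9, 0.0, 0.2])
for _ in range(50):                           # iterate shrink passes
    x = shrink_step(x, 3, W)
print(x, np.count_nonzero(x))
```

In the real procedure the model would be refit on the updated data between passes, so the probabilities (and hence the chosen candidates) would change as the covariates move.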
This seems to me like a very clumsy approach, and I am not sure whether there
is any reason to expect it to arrive at a reasonable solution; I thought it
could be a way of smoothly moving each covariate (except one), within subject,
"along its path of least resistance" toward zero.
(Note: in both approaches further simplification can be achieved by applying
the same transformation or mutation (with the GA) to all subjects that belong
to the same class and have the same initial configuration of covariates. This
way I also forcefully prevent identical subjects from ending up with different
final configurations.)
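The grouping in that note amounts to keying subjects by their (class, initial covariate pattern) pair and searching once per key. A small sketch with made-up data:

```python
from collections import defaultdict

# made-up subjects: (class label, initial covariate pattern)
subjects = [
    (3, (1.0, 0.7, 0.0, 0.4)),
    (3, (1.0, 0.7, 0.0, 0.4)),   # identical to the first subject
    (5, (0.0, 1.0, 0.2, 0.0)),
]

# group subject indices by (class, pattern); each group is transformed
# as a unit, so identical subjects share one final configuration
groups = defaultdict(list)
for i, (cls, pattern) in enumerate(subjects):
    groups[(cls, pattern)].append(i)

print(dict(groups))
```

This also shrinks the search space: the GA (or the shrink-refit loop) only has to handle one representative per group.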