Random Forest weighting

3 messages · Raghu Naik, Liaw, Andy

#
Folks,

I have a query around weighting in Random Forest (RF). I know that several
earlier emails in this group have raised this issue, but I did not find an
answer to my query.

I am working on a dataset (dataset1) of 4 million records that can be
reduced to a dataset (dataset2) of approximately 1500 unique records, with
frequency counts that sum to the 4 million records in dataset1. Because of
size issues, I cannot work with dataset1 in R and am therefore working with
dataset2.

Each record indicates whether or not a patient chose a particular drug,
together with 14 comorbidity (Yes / No) variables; I am using RF to
understand the comorbidity drivers of the drug adoption (yes/no)
classification.

In the full dataset (dataset1), the drug adoption incidence is ~11%. In
the reduced dataset (dataset2), the drug adoption incidence increases to
~38%.
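
The gap between the two incidence figures can be illustrated with a small sketch. The data frame below is a made-up toy stand-in for dataset2 (the column names `comorb_a`, `adopted`, and `freq` are assumptions, not the poster's actual variables); it only shows how counting each unique record once can give a very different incidence from counting each record by its frequency.

```r
# Hypothetical toy stand-in for dataset2: one row per unique record,
# with `freq` giving how many dataset1 rows that record represents.
dataset2 <- data.frame(
  comorb_a = c("No", "No", "Yes", "Yes"),
  adopted  = c("No", "Yes", "No", "Yes"),
  freq     = c(800, 50, 90, 60)
)

# Incidence over unique records: every row counts once.
unweighted_incidence <- mean(dataset2$adopted == "Yes")

# Incidence at the dataset1 level: every row counts `freq` times.
weighted_incidence <- with(dataset2, sum(freq[adopted == "Yes"]) / sum(freq))

print(c(unweighted = unweighted_incidence, weighted = weighted_incidence))
```

In this toy case, half of the unique records are adopters, but they account for only 110 of the 1000 underlying rows.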

My question is: if I am using the reduced dataset (dataset2), how should
I inform RF that the adoption incidence in the full dataset was 11%?
Should that be supplied as a class prior, e.g. classwt=c(Yes=.11, No=.89)?
My understanding is that RF does not allow case weighting.
Or can this be handled through the sampsize argument, by oversampling?
What proportions should one use for this (e.g., sampsize=c(Yes=100,
No=100))?



I would appreciate any feedback or pointers to any earlier thread that I may
have overlooked.

Regards,

Raghu
#
If I understand your situation correctly, you may be able to make use of
the "strata" and "sampsize" arguments in randomForest() to get bootstrap
samples that resemble the original data distribution.  "strata" names the
factor to stratify on, and "sampsize" gives the number of cases to draw
from each stratum.
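
A minimal sketch of what such a call might look like, assuming the randomForest package is installed. The data are simulated placeholders (the names `comorb_a`, `comorb_b`, `comorb_c`, and `adopted` are hypothetical), and the per-stratum sizes of 445 and 55 are just one choice that yields an 11% "Yes" share in each tree's sample. Note that stratifying on the outcome only restores the class proportions; it does not use the per-record frequency counts within a class.

```r
library(randomForest)

set.seed(1)

# Simulated placeholder for dataset2: 1500 rows, roughly 38% "Yes".
n <- 1500
yes_no <- c("No", "Yes")
dataset2 <- data.frame(
  comorb_a = factor(sample(yes_no, n, replace = TRUE), levels = yes_no),
  comorb_b = factor(sample(yes_no, n, replace = TRUE), levels = yes_no),
  comorb_c = factor(sample(yes_no, n, replace = TRUE), levels = yes_no),
  adopted  = factor(sample(yes_no, n, replace = TRUE, prob = c(0.62, 0.38)),
                    levels = yes_no)
)

# Per-stratum sample sizes, listed in the order of levels(dataset2$adopted).
# 55 / (445 + 55) = 0.11, mirroring the full-data incidence.
sampsize <- c(No = 445, Yes = 55)

predictors <- c("comorb_a", "comorb_b", "comorb_c")
fit <- randomForest(
  x        = dataset2[, predictors],
  y        = dataset2$adopted,
  strata   = dataset2$adopted,  # stratify the bootstrap on the outcome
  sampsize = sampsize,          # cases drawn per stratum for each tree
  ntree    = 200
)

print(fit)
```

Each entry of `sampsize` has to be no larger than the number of cases available in its stratum, so the totals here are kept well below the class counts in the simulated data.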

Best,
Andy 
