Folks,

I have a question about weighting in Random Forest (RF). I know that
several earlier emails in this group have raised this issue, but I did
not find an answer to my question.

I am working on a dataset (dataset1) of 4 million records that can be
reduced to a dataset (dataset2) of approximately 1500 unique records
with frequency counts that sum to the original 4 million. Because of
size issues, I cannot work with dataset1 in R, so I am working with
dataset2.

Each record indicates whether or not a patient chose a particular drug,
along with 14 binary (Yes/No) comorbidity variables. I am using RF as a
classifier to understand which comorbidities drive drug adoption
(yes/no).

In the full dataset (dataset1), the drug adoption incidence is ~11%. In
the reduced dataset (dataset2), where each unique record counts once,
the drug adoption incidence increases to ~38%.
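
To make the gap between the two incidence figures concrete, here is a
small base R sketch. The data frame `toy` and its columns `adopt` and
`freq` are made up for illustration and are not my actual data; the
point is only that counting unique rows and weighting by frequency can
give very different incidences.

```r
# Hypothetical toy stand-in for dataset2: one row per unique record,
# with `freq` holding how many dataset1 records that row represents.
toy <- data.frame(
  adopt = factor(c("Yes", "No", "Yes", "No", "No"), levels = c("No", "Yes")),
  freq  = c(10, 500, 45, 400, 45)
)

# Incidence over unique rows (what RF sees if dataset2 is used as-is).
unweighted <- mean(toy$adopt == "Yes")

# Incidence weighted by the frequency counts (what dataset1 would show).
weighted <- sum(toy$freq[toy$adopt == "Yes"]) / sum(toy$freq)

print(c(unweighted = unweighted, weighted = weighted))
```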

My question is: if I am using the reduced dataset (dataset2), how
should I inform RF that the adoption incidence in the full dataset was
11%? Should that be supplied as a class prior through classwt, with
classwt=c(Yes=.11, No=.89)? My understanding is that RF does not allow
case weighting.

Or can this be handled with the sampsize argument through oversampling?
What proportions should one use for this (e.g., sampsize=c(Yes=100,
No=100))?
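
For concreteness, the two options I have in mind would look roughly
like the sketch below. It assumes the randomForest package; the
simulated `x` and `y` are hypothetical stand-ins for dataset2, and the
per-class sample size of 50 is chosen only to fit the toy data. As far
as I can tell, classwt and sampsize are matched to the factor levels by
position rather than by name, so I have written them in level order
(No, Yes). This is a sketch of the question, not something I know to be
the right approach.

```r
library(randomForest)

set.seed(1)
n <- 200

# Hypothetical stand-in for dataset2: 14 binary comorbidity flags.
x <- as.data.frame(matrix(rbinom(n * 14, 1, 0.3), ncol = 14))

# Roughly 38% "Yes", mimicking the incidence seen in the reduced data.
y <- factor(rep(c("No", "Yes"), times = c(124, 76)), levels = c("No", "Yes"))

# Option 1: class priors set to the full-data incidence (order: No, Yes).
rf_prior <- randomForest(x, y, classwt = c(0.89, 0.11))

# Option 2: stratified sampling with a fixed number of draws per class.
rf_strat <- randomForest(x, y, strata = y, sampsize = c(50, 50))

print(rf_prior$confusion)
print(rf_strat$confusion)
```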

I would appreciate any feedback or pointers to any earlier thread that
I may have overlooked.

Regards,
Raghu