Folks, I have a query about weighting in Random Forest (RF). I know that several earlier emails in this group have raised this issue, but I did not find an answer to my query.

I am working on a dataset (dataset1) of 4 million records that can be reduced to a dataset (dataset2) of approximately 1500 unique records, with frequency counts that add up to the same 4 million. Because of size issues I cannot work with dataset1 in R, so I am working with dataset2. Each record indicates whether or not a patient chose a particular drug, together with 14 comorbidity (Yes/No) variables; I am using RF classification to understand the comorbidity drivers of drug adoption (yes/no).

In the full dataset (dataset1), the drug adoption incidence is ~11%. In the reduced dataset (dataset2), the incidence increases to ~38%.

My question is: if I am using the reduced dataset (dataset2), how should I inform RF that the adoption incidence in the full dataset was 11%? Should that be supplied as a class prior with classwt=c(Yes=.11, No=.89)? My understanding is that RF does not allow case weighting. Or can this be handled with the sampsize argument through oversampling? What proportions should one use for this (e.g., sampsize=c(Yes=100, No=100))?

I would appreciate any feedback or pointers to any earlier thread that I may have overlooked.

Regards, Raghu
Random Forest weighting
3 messages · Raghu Naik, Liaw, Andy
If I understand your situation correctly, you may be able to use the "strata" and "sampsize" arguments in randomForest() to draw bootstrap samples that resemble the original data distribution. Together they let you specify stratified sampling, with "strata" naming the stratification variable.

Best, Andy
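For illustration, here is a minimal sketch of how the "strata" and "sampsize" suggestion might be applied. It is not from the thread: the data are synthetic stand-ins for dataset2, and the column names (comorb1 to comorb14), the 0.8 sampling fraction, and ntree = 200 are all assumptions. It stratifies on the outcome so that each tree's sample contains roughly 11% "Yes" cases. As far as I know, randomForest() rejects sampsize entries larger than the corresponding stratum, so the per-class sizes are derived from the available counts. The sketch only restores the class proportions; it does not make use of the per-record frequency counts.

```r
library(randomForest)

set.seed(1)

# Synthetic stand-in for dataset2 (assumption): 1500 rows, 14 Yes/No
# comorbidity flags, and an outcome with roughly 38% "Yes".
n <- 1500
x <- as.data.frame(matrix(rbinom(n * 14, 1, 0.5), ncol = 14))
x[] <- lapply(x, factor, levels = c(0, 1), labels = c("No", "Yes"))
names(x) <- paste0("comorb", 1:14)
y <- factor(ifelse(runif(n) < 0.38, "Yes", "No"), levels = c("No", "Yes"))

# Choose per-class sample sizes so each tree sees ~11% "Yes",
# while staying within the number of rows available in each class.
target_yes <- 0.11
n_no  <- floor(0.8 * sum(y == "No"))
n_yes <- round(n_no * target_yes / (1 - target_yes))
stopifnot(n_yes <= sum(y == "Yes"))

# Stratify on the outcome; names in sampsize are matched to class labels.
fit <- randomForest(x, y,
                    strata = y,
                    sampsize = c(No = n_no, Yes = n_yes),
                    ntree = 200,
                    importance = TRUE)

print(fit)
importance(fit)
```

With real data, x and y would be the comorbidity columns and the adoption outcome from dataset2. The variable importance output is what one would inspect for the "drivers" question.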
1 day later
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20081205/9c10c347/attachment.pl>