Subsampling-oversampling from a data frame
# Perhaps I misunderstand your original need, but....
## I added a few lines to your data and used dput() to get the data below
## (which I named "df")
df<- structure(list(age = c(15L, 20L, 15L, 10L, 10L, 12L, 17L, 17L,
11L, 12L, 16L, 20L, 23L, 14L, 22L, 16L, 10L, 11L, 21L, 10L, 13L,
17L), sex = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L), .Label = c("f",
"m"), class = "factor"), class = structure(c(2L, 1L, 2L, 2L,
2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L,
2L, 1L), .Label = c("high", "low"), class = "factor")), .Names = c("age",
"sex", "class"), class = "data.frame", row.names = c(NA, -22L
))
## The following line uses which(), sample(), and rbind(), along with some
## indexing, to build a new data frame; see ?which, ?sample, and ?rbind for
## more info.
## To get a feel for the indexing, play with it: type in df[1:3, 1:2] as an example.
new_df <- rbind(df[sample(which(df$class=="low"), 4),],
df[sample(which(df$class=="high"), 4),])
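Now replace 4 with the number of rows you want from each class. Note that sample() draws without replacement by default, so to oversample a class beyond the rows it actually has, add replace = TRUE.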
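If you would rather not hard-code the sizes, the same idea can be wrapped in a small function. What follows is only a sketch: resample_classes() and its p_high argument are names made up here for illustration, the toy data frame is randomly generated (it just mirrors the 1200 rows and the 0.3/0.7 split mentioned in the question), and it assumes the class column contains exactly the values "low" and "high". It keeps every "low" row and draws as many "high" rows as are needed to reach the requested proportion.

```r
# Hypothetical helper (not from the original thread): keep all "low" rows and
# sample "high" rows so that they make up a fraction p_high of the result.
resample_classes <- function(data, p_high = 0.5) {
  low_idx  <- which(data$class == "low")
  high_idx <- which(data$class == "high")
  # number of "high" rows so that n_high / (n_high + n_low) is about p_high
  n_high <- round(length(low_idx) * p_high / (1 - p_high))
  # draw with replacement only when more rows are requested than exist
  pick <- high_idx[sample.int(length(high_idx), n_high,
                              replace = n_high > length(high_idx))]
  data[c(low_idx, pick), ]
}

set.seed(1)  # only to make the illustration reproducible
# Made-up data with the same size and class split as in the question
toy <- data.frame(
  age   = sample(10:23, 1200, replace = TRUE),
  sex   = sample(c("f", "m"), 1200, replace = TRUE),
  class = rep(c("low", "high"), times = c(360, 840))
)

balanced <- resample_classes(toy, p_high = 0.5)
prop.table(table(balanced$class))  # class proportions in the new data frame
```

With the toy data, the balanced result keeps all 360 "low" rows and a random 360 of the 840 "high" rows. Asking for p_high = 0.9 instead would need more "high" rows than exist, so the function falls back to sampling with replacement, which is the oversampling case.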
hgwelec wrote:
Thank you for your answer. The problem is that I am learning R now, so I do not know how I could do this. I have found the following code, but unfortunately it does not work (it is meant to create a distribution of 0.1 "low" class and 0.9 "high"):
data[c(rownames(data.df[data.df$class=="high",]), sample(rownames(data[data.df$class=="low"]), 0.1)) , ]
Dear members,

Consider the following data frame (first 4 rows shown):

age sex class
15  m   low
20  f   high
15  f   low
10  m   low

In my original data set I have 1200 rows and a class distribution of low=0.3 and high=0.7.

My question: how can I create a new data frame like the one shown above, but with the 'high' class subsampled so that in the new data frame the class distribution is low=0.5 and high=0.5?

I tried looking at the sample function and its prob option, but none of the examples I have seen deal with an imbalanced class problem like the one shown above.

Thank you in advance