data frame subset too slow

Hi all,

First, I don't have much experience with R, so be gentle. I am dealing 
with a dataset (tens of thousands of lines, each with about 10 columns of 
data). I have to create subsets of this data based on certain 
conditions (for example, rows whose first column also appears in another 
dataset). Here is how I did it:

# import data
dat <- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
list <- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
# create sub data
subdat <- dat[dat[1] %in% list[1],]

The third line is meant to create a new data frame containing the rows of 
dat whose first-column value also appears in the first column of list. 
The code runs fine on small test data. When I tried it on my real data 
(~80k lines, ~15 MB), it took forever (a few hours). I don't know why it 
takes that long, but I don't think it should. Even with a for loop in 
C++, I think I could get this done in a few minutes.
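
For reference, here is a minimal sketch of the subsetting step on made-up data, assuming the intent is to keep the rows of dat whose first column appears in the first column of the other table. The data frames and the names lst and keep are hypothetical stand-ins, not the real files. The point it illustrates is that dat[1] is a one-column data frame, whereas dat[[1]] is the column itself as a vector; %in% returns one TRUE/FALSE per element of its left-hand side, so the vector form gives one value per row. Whether this difference explains the run time on the real data is not established here.

```r
# Hypothetical stand-ins for the tables read from test.txt and list.txt
dat <- data.frame(id = c("a", "b", "c", "d"), value = 1:4,
                  stringsAsFactors = FALSE)
lst <- data.frame(id = c("b", "d", "e"), stringsAsFactors = FALSE)

# dat[[1]] extracts the first column as a plain vector (dat[1] would
# return a one-column data frame instead).
keep <- dat[[1]] %in% lst[[1]]   # one logical value per row of dat

# Keep only the rows whose first-column value was found in lst
subdat <- dat[keep, ]
print(subdat)
```

Calling the second table lst rather than list also avoids reusing the name of the base R function list().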

Does anyone have any ideas, advice, or suggestions?

Thanks so much in advance and Happy New Year to all of you.

D.