Sorry for the late response. I was away on vacation and was unable to keep working on the code. Anyway, I was unable to provide the *str* of that specific data since it is all inside a big package with lots of inputs/outputs. Glancing quickly through the code, I narrowed the problem down (with a bad guess) to the data frame. But it turned out that the data frame was not the cause. After carefully checking through the package, I found a double for loop. I replaced that double for loop, and now, instead of running for ~13 hrs, the package runs in ~13 min for a similar dataset.
Thanks for all your help,
D.
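The package code is not shown in this thread, so the following is only a hypothetical sketch of the kind of change described above: a double for loop that compares every element of one vector against every element of another, replaced by a single vectorized %in% call. The names genes, wanted, slow_match, and fast_match are invented for illustration and are not from the package.

```r
# Hypothetical illustration only: the actual loop in the package is not shown
# in the thread, and all names below are made up.
set.seed(1)
genes  <- paste0("gene", sample(1:500, 2000, replace = TRUE))
wanted <- paste0("gene", sample(1:500, 200, replace = TRUE))

# Double for loop: up to length(x) * length(table) comparisons,
# each one executed in interpreted R code.
slow_match <- function(x, table) {
  keep <- logical(length(x))
  for (i in seq_along(x)) {
    for (j in seq_along(table)) {
      if (x[i] == table[j]) {
        keep[i] <- TRUE
        break
      }
    }
  }
  keep
}

# Vectorized equivalent: a single call to %in%, which runs in compiled code.
fast_match <- function(x, table) x %in% table

keep_slow <- slow_match(genes, wanted)
keep_fast <- fast_match(genes, wanted)

# Both approaches flag the same elements.
print(identical(keep_slow, keep_fast))
```

Wrapping each call in system.time() on data of realistic size would be one way to compare the two versions.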
On 12/30/10 11:40 AM, jim holtman wrote:
If you want the data in the first column of the dataframe, then you should be using '[['. Notice what comes back in each of these cases:
str(dat)
'data.frame': 80000 obs. of 5 variables:
 $ sample.1.200..n..TRUE.: int 25 199 70 124 93 157 49 137 192 57 ...
 $ runif.n.              : num 0.7725 0.0263 0.0728 0.7594 0.2792 ...
 $ runif.n..1            : num 0.4304 0.8608 0.0882 0.5666 0.1721 ...
 $ runif.n..2            : num 0.3797 0.1191 0.0481 0.3297 0.0649 ...
 $ runif.n..3            : num 0.0895 0.0441 0.0403 0.9679 0.3986 ...
str(dat[1])
'data.frame': 80000 obs. of 1 variable:
 $ sample.1.200..n..TRUE.: int 25 199 70 124 93 157 49 137 192 57 ...
str(dat[[1]])
int [1:80000] 25 199 70 124 93 157 49 137 192 57 ...
str(dat$sample.1.200..n..TRUE)
int [1:80000] 25 199 70 124 93 157 49 137 192 57 ...
str(dat[,1])
int [1:80000] 25 199 70 124 93 157 49 137 192 57 ...
You will get different classes of values. We would really need to see the output of 'str' on your data structures to see what might be happening. Your data is not that big and most subsetting/extractions should be in less than a second unless there is something funny in your data. So provide the 'str' so we can see.
On Thu, Dec 30, 2010 at 11:28 AM, Duke<duke.lists at gmx.com> wrote:
Hi Jim,
Is this really a problem for me to use [1] instead of [[1]]? Will this make it run slower? Also, if I use dat$V1 %in% list$V1, will it be fine? Anyway, my data and list are basically gene lists (tab delimited):
$ head test.txt
Xkr4 chr1 - 3204562 3661579 3206102 3661429 3 3204562,3411782,3660632, 3207049,3411982,3661579,
Rp1 chr1 - 4280926 4399322 4283061 4399268 4 4280926,4341990,4342282,4399250, 4283093,4342162,4342918,4399322,
Rp1_2 chr1 - 4333587 4350395 4334680 4342906 4 4333587,4341990,4342282,4350280, 4340172,4342162,4342918,4350395,
Sox17 chr1 - 4481008 4486494 4481796 4483487 5 4481008,4483180,4483852,4485216,4486371, 4482749,4483547,4483944,4486023,4486494,
Mrpl15 chr1 - 4763278 4775807 4764532 4775758 5 4763278,4767605,4772648,4774031,4775653, 4764597,4767729,4772814,4774186,4775807,
Mrpl15_2 chr1 - 4763278 4775807 4775807 4775807 4 4763278,4767605,4772648,4775653, 4764597,4767729,4772814,4775807,
$ head list.txt
GeneNames Chr Start End
0610007C21Rik chr5 31351012 31356996
0610007L01Rik chr5 130695613 130719635
0610007L01Rik_2 chr5 130698204 130719635
0610007P08Rik chr13 63916627 64001609
0610007P08Rik_2 chr13 63916641 63970963
0610007P14Rik chr12 87156404 87165495
Thanks,
D.
On 12/30/10 11:13 AM, jim holtman wrote:
You should be using dat[[1]]. Here is an example with 80000 rows that takes about 0.02 seconds to get the subset. Please provide the 'str' of what your data looks like.
n<- 80000 # rows to create
dat<- data.frame(sample(1:200, n, TRUE), runif(n), runif(n), runif(n), runif(n))
lst<- data.frame(sample(1:100, n, TRUE), runif(n), runif(n), runif(n), runif(n))
str(dat)
'data.frame': 80000 obs. of 5 variables:
 $ sample.1.200..n..TRUE.: int 39 116 69 163 51 125 144 32 28 4 ...
 $ runif.n.              : num 0.519 0.793 0.549 0.77 0.272 ...
 $ runif.n..1            : num 0.691 0.89 0.783 0.467 0.357 ...
 $ runif.n..2            : num 0.705 0.254 0.584 0.998 0.279 ...
 $ runif.n..3            : num 0.873 1 0.678 0.702 0.455 ...
str(lst)
'data.frame': 80000 obs. of 5 variables:
 $ sample.1.100..n..TRUE.: int 38 83 38 70 77 44 81 55 32 1 ...
 $ runif.n.              : num 0.0621 0.7374 0.074 0.4281 0.0516 ...
 $ runif.n..1            : num 0.879 0.294 0.146 0.884 0.58 ...
 $ runif.n..2            : num 0.648 0.745 0.825 0.507 0.799 ...
 $ runif.n..3            : num 0.2523 0.1679 0.9728 0.0478 0.0967 ...
system.time({
+ dat.sub<- dat[dat[[1]] %in% lst[[1]],]
+ })
user system elapsed
0.02 0.00 0.01
str(dat.sub)
'data.frame': 39803 obs. of 5 variables:
 $ sample.1.200..n..TRUE.: int 39 69 51 32 28 4 69 3 48 69 ...
 $ runif.n.              : num 0.5188 0.5494 0.2718 0.5566 0.0893 ...
 $ runif.n..1            : num 0.691 0.783 0.357 0.619 0.717 ...
 $ runif.n..2            : num 0.705 0.584 0.279 0.789 0.192 ...
 $ runif.n..3            : num 0.873 0.678 0.455 0.843 0.383 ...
On Thu, Dec 30, 2010 at 10:23 AM, Duke<duke.lists at gmx.com> wrote:
Hi all,
First, I don't have much experience with R, so be gentle. OK, I am dealing with a dataset (~ tens of thousands of lines, each line ~ 10 columns of data). I have to create some subsets of this data based on certain conditions (for example, same first column as another dataset, etc.). Here is how I did it:
# import data
dat<- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
list<- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
# create sub data
subdat<- dat[dat[1] %in% list[1],]
So the third line is meant to create a new data frame containing the rows whose first column appears in both dat and list. There is no problem with the code, as it runs just fine with (small) test data. When I tried it with my real data (~80k lines, ~15MB in size), it takes forever (a few hours). I don't know why it takes that long, but I think it shouldn't. I think even with a for loop in C++, I could get this done in, say, a few minutes. Does anyone have any idea/advice/suggestion?
Thanks so much in advance, and Happy New Year to all of you.
D.
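As a minimal sketch of the '[' versus '[[' point raised in the replies, the example below uses two tiny stand-in data frames rather than the real files. The gene names are taken from the samples posted in the thread, but the column layout (in particular a GeneNames header in dat) and the name lst, used here to avoid masking the base function list, are assumptions made for illustration.

```r
# Small stand-ins for test.txt and list.txt; the column layout is assumed.
dat <- data.frame(GeneNames = c("Xkr4", "Rp1", "Sox17", "Mrpl15"),
                  Chr = c("chr1", "chr1", "chr1", "chr1"),
                  stringsAsFactors = FALSE)
lst <- data.frame(GeneNames = c("Rp1", "Mrpl15", "0610007C21Rik"),
                  Chr = c("chr1", "chr1", "chr5"),
                  stringsAsFactors = FALSE)

# dat[1] and lst[1] are one-column data frames (lists of length 1), so %in%
# returns a single logical value instead of one value per row.
n_single <- length(dat[1] %in% lst[1])
print(n_single)

# dat[[1]] and lst[[1]] are plain vectors, so %in% returns one value per row.
keep <- dat[[1]] %in% lst[[1]]
print(length(keep))

# Row subset based on the per-row result.
subdat <- dat[keep, ]
print(subdat$GeneNames)
```

Because the single-bracket form yields only one logical value, it cannot select individual rows the way the original line intended; the double-bracket form (or dat$GeneNames %in% lst$GeneNames) gives the per-row comparison.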
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.