Efficient way to create new column based on comparison with another dataframe
Thanks Denes, I should have thought of foverlaps as an option. I wonder how fast it is compared to my solution!

My particular solution does not need data.table in order to work. It just loops through Chr.Arms (Chromosome Arms, which always has 39 rows) and assigns the proper arm to all rows of mapfile that lie between Start and End on a particular Chr. This is in contrast to my first solution, where I was trying to loop through mapfile (which could be millions of rows) and assign each row one at a time. That's why I used data.frame.

For some reason, yesterday, data.table was acting funny on the computer I remote into, so I need to figure out why once I can get on it. Then I want to time my solution and one with foverlaps to see which is faster.

Thanks,
Gaius
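For anyone who wants to run that comparison, here is a rough timing sketch, not a benchmark of the real data. It assumes data.table is installed. The simulated tables, the arm boundaries, the row count n and the helper names loop_arms() and overlap_arms() are all made up for illustration.

```r
library(data.table)

set.seed(1)
n <- 1e5  # assumed size; the real mapfile has millions of rows

# simulated inputs (hypothetical): 3 chromosomes with 2 arms each
Chr.Arms <- data.frame(Chr = rep(c(1, 2, 3), each = 2),
                       Arm = rep(c("p", "q"), 3),
                       Start = rep(c(0, 5001), 3),
                       End = rep(c(5000, 10000), 3),
                       stringsAsFactors = FALSE)
mapfile <- data.frame(Name = paste0("S", seq_len(n)),
                      Chr = sample(c(1, 2, 3), n, replace = TRUE),
                      Position = as.numeric(sample(0:10000, n, replace = TRUE)),
                      stringsAsFactors = FALSE)

# base R: loop over the few rows of Chr.Arms
loop_arms <- function(mapfile, Chr.Arms) {
  mapfile$Arm <- NA_character_
  for (i in seq_len(nrow(Chr.Arms))) {
    cur.row <- Chr.Arms[i, ]
    hit <- mapfile$Chr == cur.row$Chr &
      mapfile$Position >= cur.row$Start &
      mapfile$Position <= cur.row$End
    mapfile$Arm[hit] <- cur.row$Arm
  }
  mapfile
}

# data.table: interval join with foverlaps()
overlap_arms <- function(mapfile, Chr.Arms) {
  m <- as.data.table(mapfile)
  a <- as.data.table(Chr.Arms)
  m[, Position2 := Position]   # dummy end point so Position is an interval
  setkey(a, Chr, Start, End)
  res <- foverlaps(m, a, by.x = c("Chr", "Position", "Position2"),
                   type = "within")
  res[, Position2 := NULL]
  res
}

t.loop <- system.time(res.loop <- loop_arms(mapfile, Chr.Arms))
t.fov  <- system.time(res.fov  <- overlap_arms(mapfile, Chr.Arms))
print(rbind(loop = t.loop, foverlaps = t.fov)[, "elapsed"])

# both approaches should agree on the arm of every SNP
check <- merge(res.loop[, c("Name", "Arm")],
               as.data.frame(res.fov)[, c("Name", "Arm")],
               by = "Name", suffixes = c(".loop", ".fov"))
stopifnot(nrow(check) == n, all(check$Arm.loop == check$Arm.fov))
```

Timings will depend heavily on the machine and on n, so it is worth raising n toward the real row count before drawing conclusions.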
On Sun, Jan 31, 2016 at 2:17 AM, Dénes Tóth <toth.denes at ttk.mta.hu> wrote:
Hi,
I have not followed this thread from the beginning, but have you tried the
foverlaps() function from the data.table package?
Something along the lines of:
---
# create the tables (use as.data.table() or setDT() if you
# start with a data.frame)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
Start = c(0, 5001), End = c(5000, 10000))
# add a dummy variable to be able to define Position as an interval
mapfile[, Position2 := Position]
# add keys
setkey(mapfile, Chr, Position, Position2)
setkey(Chr.Arms, Chr, Start, End)
# use data.table::foverlaps (see ?foverlaps)
mapfile <- foverlaps(mapfile, Chr.Arms, type = "within")
# remove the dummy variable
mapfile[, Position2 := NULL]
# recreate original order
setorder(mapfile, Chr, Name)
---
BTW, there is a typo in your *SOLUTION*. I guess you wanted to write
data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000,
1000), key = "Chr") instead of data.frame(Name = c("S1", "S2", "S3"), Chr =
1, Position = c(3000, 6000, 1000), key = "Chr").
HTH,
Denes
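A related option that is not from this thread: data.table versions from 1.9.8 onward (released after these messages) also support non-equi joins. These can add the Arm column by reference, without the dummy Position2 column and without reordering mapfile. A minimal sketch on the same toy data, assuming such a version is installed:

```r
library(data.table)

mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000))

# update join: for every arm, find the mapfile rows on the same Chr with
# Start <= Position <= End and copy that arm's label into a new Arm column
mapfile[Chr.Arms, on = .(Chr, Position >= Start, Position <= End),
        Arm := i.Arm]

print(mapfile)
```

Rows of mapfile that fall inside no arm would simply keep NA in Arm.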
On 01/30/2016 07:48 PM, Gaius Augustus wrote:
I'll look into the Intervals idea. The data.table code posted might not
work (because I don't believe it would put the rows in the correct order
if the chromosomes are interspersed). However, it did make me think about
possibly assigning based on values...
*SOLUTION*
mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")
for(i in 1:nrow(Chr.Arms)){
  cur.row <- Chr.Arms[i, ]
  mapfile$Arm[ mapfile$Chr == cur.row$Chr &
               mapfile$Position >= cur.row$Start &
               mapfile$Position <= cur.row$End] <- cur.row$Arm
}
This took out the need for the intermediate table/vector. This worked for
me, and was VERY fast. Took <5 minutes on a dataframe with 35 million rows.
Thanks for the help,
Gaius
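To check the concern about row order when chromosomes are interspersed, the same loop can be run on a small made-up table. This is only a sketch: the two-chromosome data below are invented, and stringsAsFactors = FALSE is an added assumption so that Arm is stored as text rather than as a factor.

```r
# invented example: two chromosomes, rows deliberately not sorted by Chr
mapfile <- data.frame(Name = c("S1", "S2", "S3", "S4"),
                      Chr = c(2, 1, 2, 1),
                      Position = c(7000, 6000, 100, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = c(1, 1, 2, 2),
                       Arm = c("p", "q", "p", "q"),
                       Start = c(0, 5001, 0, 4001),
                       End = c(5000, 10000, 4000, 8000),
                       stringsAsFactors = FALSE)

mapfile$Arm <- NA_character_  # rows outside every arm stay NA
for (i in seq_len(nrow(Chr.Arms))) {
  cur.row <- Chr.Arms[i, ]
  mapfile$Arm[mapfile$Chr == cur.row$Chr &
              mapfile$Position >= cur.row$Start &
              mapfile$Position <= cur.row$End] <- cur.row$Arm
}
print(mapfile)
```

Because the assignment goes through a logical index into mapfile itself, rows are never moved: S1 to S4 stay in their original order and receive q, q, p, p.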
On Sat, Jan 30, 2016 at 10:50 AM, Gaius Augustus <gaiusjaugustus at gmail.com> wrote:
I'll look into the Intervals idea. The data.table code posted might not
work (because I don't believe it would put the rows in the correct order
if the chromosomes are interspersed). However, it did make me think about
possibly assigning based on values...
Something like:
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")
for(i in 1:nrow(Chr.Arms)){
  cur.row <- Chr.Arms[i, ]
  # assign this arm, by reference, to the matching rows only
  mapfile[ Chr == cur.row$Chr & Position >= cur.row$Start &
           Position <= cur.row$End, Arm := cur.row$Arm]
}
This might take out the need for the intermediate table/vector. Not sure
yet if it'll work, but we'll see. I'm interested to know if anyone else
has any ideas, too.
Thanks,
Gaius
On Fri, Jan 29, 2016 at 11:34 PM, Ulrik Stervbo <ulrik.stervbo at gmail.com> wrote:
Hi Gaius,
Could you use data.table and loop over the small Chr.arms?
library(data.table)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")
Arms <- data.table()
for(i in 1:nrow(Chr.Arms)){
  cur.row <- Chr.Arms[i, ]
  # keep only the rows on this arm's chromosome and within its interval
  Arm <- mapfile[ Chr == cur.row$Chr & Position >= cur.row$Start &
                  Position <= cur.row$End]
  Arm <- Arm[ , Arm:=cur.row$Arm][]
  Arms <- rbind(Arms, Arm)
}
# Or use plyr to loop over each possible arm
library(plyr)
Arms <- ddply(Chr.Arms, .variables = "Arm", function(cur.row, mapfile){
mapfile <- mapfile[ Position >= cur.row$Start & Position <=
cur.row$End]
mapfile <- mapfile[ , Arm:=cur.row$Arm][]
return(mapfile)
}, mapfile = mapfile)
I have just started to use data.table and I have the feeling the code
above can be greatly improved - maybe the loop can be dropped entirely?
Hope this helps
Ulrik
On Sat, 30 Jan 2016 at 03:29 Gaius Augustus <gaiusjaugustus at gmail.com>
wrote:
I have two dataframes. One has chromosome arm information, and the other
has SNP position information. I am trying to assign each SNP an arm
identity. I'd like to create this new column by comparing each SNP with
the reference file.
*1) Mapfile (has millions of rows)*
Name  Chr  Position
S1    1    3000
S2    1    6000
S3    1    1000
*2) Chr.Arms file (has 39 rows)*
Chr  Arm  Start  End
1    p    0      5000
1    q    5001   10000
*R Script that works, but slow:*
Arms <- c()
for (line in 1:nrow(Mapfile)){
  Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
    Mapfile$Position[line] > Chr.Arms$Start &
    Mapfile$Position[line] < Chr.Arms$End]
}
Mapfile$Arm <- Arms
*Output Table:*
Name  Chr  Position  Arm
S1    1    3000      p
S2    1    6000      q
S3    1    1000      p
In words: for each line of the Mapfile I want to (1) find the right Chr,
(2) find the Chr.Arms row where START < POSITION < END, then take the ARM
value from that row and place it in a new column.
This R script works, but surely there is a more time/processing efficient
way to do it.
Thanks in advance for any help,
Gaius
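One base R possibility, offered as a sketch rather than a tested answer for the real files, is to look up each position with findInterval() one chromosome at a time. The helper name assign_arm() is hypothetical. It assumes the arms within a chromosome do not overlap, and it treats Start and End as inclusive, which differs from the strict inequalities described above.

```r
# hypothetical helper: returns one arm label per row of mapfile
assign_arm <- function(mapfile, Chr.Arms) {
  arm <- rep(NA_character_, nrow(mapfile))
  for (chr in unique(Chr.Arms$Chr)) {
    arms <- Chr.Arms[Chr.Arms$Chr == chr, ]
    arms <- arms[order(arms$Start), ]        # findInterval() needs sorted breaks
    rows <- which(mapfile$Chr == chr)
    pos  <- mapfile$Position[rows]
    idx  <- findInterval(pos, arms$Start)    # index of last Start <= position
    idx[idx == 0] <- NA                      # position lies before the first arm
    ok <- !is.na(idx) & pos <= arms$End[idx] # position must not pass the arm's End
    arm[rows[ok]] <- as.character(arms$Arm)[idx[ok]]
  }
  arm
}

# toy data from the question
Mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000))
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000))

Mapfile$Arm <- assign_arm(Mapfile, Chr.Arms)
print(Mapfile)
```

The loop runs once per chromosome instead of once per SNP, and the result is written back in the original row order, so interspersed chromosomes are not a problem.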
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.