Efficient way to create new column based on comparison with another dataframe

Sun, Jan 31, 2016 11:17 AM

Thanks Denes,
I should have thought of foverlaps as an option.  I wonder how fast it is
compared to my solution!

My particular solution does not need data.table in order to work.  It just
loops through the ChrArms (Chromosome Arms, which always has 39 rows) and
assigns the proper arm to all rows within mapfile that lie within Start and
End on a particular Chr.  This is opposed to my first solution, where I was
trying to loop through mapfile (which could be millions of rows) and assign
each row one at a time.  That's why I used data.frame.

For some reason, yesterday, data.table was acting funny on the computer I
remote to, so I need to figure out why that is once I can get on it.  Then
I want to time my solution and one with foverlaps to see if one is faster.
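A minimal sketch of how such a timing comparison might look, assuming simulated data: the table sizes, the three made-up chromosomes, and the helper names loop_assign() and overlap_assign() are all hypothetical and not from the thread. It times the loop over the arms against foverlaps() with system.time().

```r
library(data.table)

set.seed(1)
n <- 1e5  # hypothetical size; the real mapfile has millions of rows

# Simulated inputs (assumption): 3 chromosomes, 2 arms each
chr_arms <- data.frame(Chr = rep(1:3, each = 2), Arm = rep(c("p", "q"), 3),
                       Start = rep(c(0, 5001), 3),
                       End = rep(c(5000, 10000), 3),
                       stringsAsFactors = FALSE)
map_df <- data.frame(Name = paste0("S", seq_len(n)),
                     Chr = sample(1:3, n, replace = TRUE),
                     Position = as.numeric(sample(0:10000, n, replace = TRUE)),
                     stringsAsFactors = FALSE)

# Approach 1: loop over the (small) arms table, base R only
loop_assign <- function(map, arms) {
  map$Arm <- NA_character_
  for (i in seq_len(nrow(arms))) {
    hit <- map$Chr == arms$Chr[i] &
      map$Position >= arms$Start[i] & map$Position <= arms$End[i]
    map$Arm[hit] <- arms$Arm[i]
  }
  map
}

# Approach 2: overlap join with data.table::foverlaps
overlap_assign <- function(map, arms) {
  map_dt <- as.data.table(map)    # copies, so the input is left untouched
  arms_dt <- as.data.table(arms)
  map_dt[, Position2 := Position] # treat each position as a zero-width interval
  setkey(arms_dt, Chr, Start, End)
  out <- foverlaps(map_dt, arms_dt,
                   by.x = c("Chr", "Position", "Position2"), type = "within")
  out[, Position2 := NULL]
  out
}

t_loop <- system.time(res_loop <- loop_assign(map_df, chr_arms))
t_fov  <- system.time(res_fov  <- overlap_assign(map_df, chr_arms))
print(rbind(loop = t_loop, foverlaps = t_fov)[, "elapsed"])
```

Timings will depend heavily on the machine, the number of rows, and the number of arms, so a run like this says little about the real data until it is repeated there.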

Thanks,
Gaius

On Sun, Jan 31, 2016 at 2:17 AM, Dénes Tóth <toth.denes at ttk.mta.hu> wrote:

Hi,

I have not followed this thread from the beginning, but have you tried the
foverlaps() function from the data.table package?

Something along the lines of:

---
# create the tables (use as.data.table() or setDT() if you
# start with a data.frame)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000))

# add a dummy variable to be able to define Position as an interval
mapfile[, Position2 := Position]

# add keys
setkey(mapfile, Chr, Position, Position2)
setkey(Chr.Arms, Chr, Start, End)

# use data.table::foverlaps (see ?foverlaps)
mapfile <- foverlaps(mapfile, Chr.Arms, type = "within")

# remove the dummy variable
mapfile[, Position2 := NULL]

# recreate original order
setorder(mapfile, Chr, Name)

---

BTW, there is a typo in your *SOLUTION*. I guess you wanted to write
data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000,
1000), key = "Chr") instead of data.frame(Name = c("S1", "S2", "S3"), Chr =
1, Position = c(3000, 6000, 1000), key = "Chr").

HTH,
  Denes



On 01/30/2016 07:48 PM, Gaius Augustus wrote:

I'll look into the Intervals idea.  The data.table code posted might not
work (because I don't believe it would put the rows in the correct order
if the chromosomes are interspersed), however, it did make me think about
possibly assigning based on values...

*SOLUTION*

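# (data.frame() has no key argument; key = "Chr" just adds a column
# named "key")
mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr",
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr",
                       stringsAsFactors = FALSE)

# start with an all-NA column so that unmatched rows stay NA
mapfile$Arm <- NA_character_
for (i in seq_len(nrow(Chr.Arms))) {
   cur.row <- Chr.Arms[i, ]
   in.arm <- mapfile$Chr == cur.row$Chr &
             mapfile$Position >= cur.row$Start &
             mapfile$Position <= cur.row$End
   mapfile$Arm[in.arm] <- cur.row$Arm
}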

This took out the need for the intermediate table/vector.  This worked for
me, and was VERY fast.  Took <5 minutes on a dataframe with 35 million
rows.

Thanks for the help,
Gaius

On Sat, Jan 30, 2016 at 10:50 AM, Gaius Augustus <gaiusjaugustus at gmail.com> wrote:

I'll look into the Intervals idea.  The data.table code posted might not
work (because I don't believe it would put the rows in the correct order
if the chromosomes are interspersed), however, it did make me think about
possibly assigning based on values...

Something like:
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")

for (i in seq_len(nrow(Chr.Arms))) {
   cur.row <- Chr.Arms[i, ]
   # assign this arm, by reference, to the rows that fall inside it
   mapfile[Chr == cur.row$Chr & Position >= cur.row$Start &
           Position <= cur.row$End, Arm := cur.row$Arm]
}

This might take out the need for the intermediate table/vector.  Not sure
yet if it'll work, but we'll see.  I'm interested to know if anyone else
has any ideas, too.

Thanks,
Gaius

On Fri, Jan 29, 2016 at 11:34 PM, Ulrik Stervbo <ulrik.stervbo at gmail.com> wrote:

Hi Gaius,

Could you use data.table and loop over the small Chr.arms?

library(data.table)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")

Arms <- data.table()
for (i in seq_len(nrow(Chr.Arms))) {
   cur.row <- Chr.Arms[i, ]
   # rows on this chromosome that fall inside the current arm
   Arm <- mapfile[Chr == cur.row$Chr & Position >= cur.row$Start &
                  Position <= cur.row$End]
   Arm <- Arm[ , Arm:=cur.row$Arm][]
   Arms <- rbind(Arms, Arm)
}

# Or use plyr to loop over each possible arm (split on Chr and Arm so
# that each piece is a single row of Chr.Arms)
library(plyr)
Arms <- ddply(Chr.Arms, .variables = c("Chr", "Arm"),
              function(cur.row, mapfile){
   mapfile <- mapfile[Chr == cur.row$Chr & Position >= cur.row$Start &
                      Position <= cur.row$End]
   mapfile <- mapfile[ , Arm:=cur.row$Arm][]
   return(mapfile)
}, mapfile = mapfile)

I have just started to use data.table and I have the feeling the code
above can be greatly improved - maybe the loop can be dropped entirely?
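One way the loop might be dropped is a non-equi update join. This is only a sketch, and it assumes a data.table version that supports non-equi joins in `on=` (as far as I know, 1.9.8 or later, which is newer than what was available when this thread was written). It reuses the toy tables from above without keys.

```r
library(data.table)

mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000))

# Update join: for every row of Chr.Arms, find the mapfile rows on the
# same Chr whose Position lies in [Start, End], and copy that row's Arm
# into mapfile by reference. Rows that match no arm get NA.
mapfile[Chr.Arms, on = .(Chr, Position >= Start, Position <= End),
        Arm := i.Arm]

print(mapfile)
```

Because the assignment happens by reference, mapfile keeps its original row order and no intermediate table is built.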

Hope this helps
Ulrik

On Sat, 30 Jan 2016 at 03:29 Gaius Augustus <gaiusjaugustus at gmail.com>
wrote:

I have two dataframes. One has chromosome arm information, and the other
has SNP position information. I am trying to assign each SNP an arm
identity.  I'd like to create this new column based on comparing it to
the reference file.

*1) Mapfile (has millions of rows)*

Name    Chr   Position
S1      1      3000
S2      1      6000
S3      1      1000

*2) Chr.Arms   file (has 39 rows)*

Chr    Arm    Start   End
1      p      0       5000
1      q      5001    10000


*R Script that works, but slow:*
Arms  <- c()
for (line in 1:nrow(Mapfile)){
       Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
                                   Mapfile$Position[line] > Chr.Arms$Start &
                                   Mapfile$Position[line] < Chr.Arms$End]
}
Mapfile$Arm <- Arms


*Output Table:*

Name   Chr   Position   Arm
S1      1     3000      p
S2      1     6000      q
S3      1     1000      p


In words: I want each line to look up the location (1) find the right
Chr, 2) find the line where START < POSITION < END), then get the ARM
information and place it in a new column.
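The lookup described above can also be sketched in base R with findInterval(), which locates each position among the sorted arm starts in one vectorized call per chromosome. This is an illustration under assumptions: the helper name assign_arms() is hypothetical, arms on the same chromosome are assumed not to overlap, and the bounds are treated as inclusive (Start <= Position <= End) rather than strict.

```r
Mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000),
                       stringsAsFactors = FALSE)

# Hypothetical helper: returns one arm label (or NA) per row of `map`
assign_arms <- function(map, arms) {
  arm <- rep(NA_character_, nrow(map))
  for (chr in unique(arms$Chr)) {
    a <- arms[arms$Chr == chr, ]
    a <- a[order(a$Start), ]          # findInterval needs sorted breaks
    rows <- which(map$Chr == chr)
    # index of the last arm whose Start is <= Position (0 if none)
    idx <- findInterval(map$Position[rows], a$Start)
    idx[idx == 0] <- NA
    # keep only positions that also lie at or before that arm's End
    inside <- !is.na(idx) & map$Position[rows] <= a$End[idx]
    arm[rows[inside]] <- a$Arm[idx[inside]]
  }
  arm
}

Mapfile$Arm <- assign_arms(Mapfile, Chr.Arms)
print(Mapfile)
```

The loop here runs once per chromosome rather than once per SNP, so its length does not grow with the number of rows in Mapfile.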

This R script works, but surely there is a more time/processing
efficient way to do it.

Thanks in advance for any help,
Gaius


______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
