Hello all,
I have written a for loop that acts on a data frame with close to 3 million rows
and 6 columns, and I would like to pass it to apply() to speed the process up
(I let the loop run for 2 days before stopping it, and it had only gone
through 200,000 rows), but I am really struggling to find a way to pass the
arguments. Below are the loop and the head of the data frame I am working on.
Any hints would be much appreciated, thank you! (I have searched for this
but could not find any other posts doing quite what I want.)
Paul
x <- as.numeric(all.tf7[1,2])
for (i in 2:nrow(all.tf7)) {
  if (all.tf7[i,1]==all.tf7[i-1,1] & (all.tf7[i,2]-x)<115341)
    all.tf7[i,6] <- all.tf7[i-1,6]
  else if (all.tf7[i,1]==all.tf7[i-1,1] & (all.tf7[i,2]-x)>=115341) {
    all.tf7[i,6] <- (all.tf7[i-1,6]+1)
    x <- as.numeric(all.tf7[i,2])
  }
  else if (all.tf7[i,1]!=all.tf7[i-1,1]) {
    all.tf7[i,6] <- (all.tf7[i-1,6]+1)
    x <- as.numeric(all.tf7[i,2])
  }
}
# the aim here is to attribute a bin number to each row so that I can then
# split the dataframe according to those bins.
chrom chromStart chromEnd         name cumsum bin
chr1       10089    10309       ZBTB33  10089   1
chr1       10132    10536  TAF7_(SQ-8)  20221   1
chr1       10133    10362     Pol2-4H8  30354   1
chr1       10148    10418 MafF_(M8194)  40502   1
chr1       10382    10578       ZBTB33  50884   1
chr1       16132    16352         CTCF  67016   1
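Much of the loop's running time is likely spent on the repeated all.tf7[i,6] <- ... assignments, since every element-wise write to a data frame is expensive. A minimal sketch of the same logic on plain vectors follows. The helper name assign_bins and its width argument are hypothetical (they do not come from the thread), and the sketch assumes the rows are already sorted by chrom and chromStart and that bin numbering starts at 1.

```r
# Hypothetical helper, not from the thread: same rules as the loop above,
# but working on atomic vectors instead of data-frame cells.
assign_bins <- function(chrom, start, width = 115341) {
  chrom <- as.character(chrom)   # avoid factor-code comparisons
  start <- as.numeric(start)
  n <- length(start)
  bin <- integer(n)
  if (n == 0L) return(bin)
  bin[1] <- 1L                   # assumption: numbering starts at 1
  anchor <- start[1]             # chromStart of the first row of the current bin
  for (i in seq_len(n)[-1]) {
    if (chrom[i] != chrom[i - 1] || start[i] - anchor >= width) {
      # new chromosome, or at least `width` past the anchor: open a new bin
      bin[i] <- bin[i - 1] + 1L
      anchor <- start[i]
    } else {
      bin[i] <- bin[i - 1]
    }
  }
  bin
}

# Toy data (chrom and chromStart values as in Jean's modified example)
toy <- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
  chromStart = c(10089, 10132, 10133, 10148, 210382, 216132)
)
toy$bin <- assign_bins(toy$chrom, toy$chromStart)   # bins: 1 1 2 2 3 3
pieces <- split(toy, toy$bin)                       # one data frame per bin
```

The loop is kept because each bin boundary depends on where the previous bin started, which a plain difference between consecutive rows does not capture. For example, starts of 0, 60000, 120000 and 180000 on one chromosome give bins 1 1 2 2 here, whereas no consecutive difference reaches 115341.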
--
View this message in context: http://r.789695.n4.nabble.com/apply-with-multiple-conditions-tp4635098.html
Sent from the R help mailing list archive at Nabble.com.
apply with multiple conditions
10 messages · Jean V Adams, Paul Guilhamon, pguilha +1 more
Thanks for your reply Jean,
I think your interpretation is correct, but when I run your code I end up with
the data frame below, and obviously the bins created there don't correspond to
a chromStart change of 115341:
  chrom chromStart chromEnd         name cumsum bin
1  chr1      10089    10309       ZBTB33  10089   1
2  chr1      10132    10536  TAF7_(SQ-8)  20221   2
3  chr2      10133    10362     Pol2-4H8  30354   3
4  chr2      10148    10418 MafF_(M8194)  40502   4
5  chr2     210382   210578       ZBTB33  50884   5
6  chr2     216132   216352         CTCF  67016   6
The first two rows should have the same bin number (same chrom, <115341 diff),
then rows 3 & 4 should be in another bin (different chrom from rows 1 & 2,
<115341 diff), and rows 5 & 6 in another one (same chrom but >115341 difference
between row 4 and row 5). It seems the new.bin line of your code isn't quite
doing what it should, but I can't pinpoint the error there...
Paul
On 2 July 2012 14:19, Jean V Adams <jvadams at usgs.gov> wrote:
Paul,
My interpretation is that you are trying to assign a new bin number to a row
every time the variable chrom changes and every time the variable chromStart
changes by 115341 or more. Is that right? If so, you don't need a loop at
all. Check out the code below. I made a couple changes to the all.tf7
example data frame so that it would have two changes in bin number, one
based on the chrom variable and one based on the chromStart variable.
Jean
all.tf7 <- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
  chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
  chromEnd = c(10309, 10536, 10362, 10418, 210578, 216352),
  name = c("ZBTB33", "TAF7_(SQ-8)", "Pol2-4H8", "MafF_(M8194)",
    "ZBTB33", "CTCF"),
  cumsum = c(10089, 20221, 30354, 40502, 50884, 67016),
  bin = rep(NA, 6)
)
# assign a new bin every time chrom changes and every time chromStart
# changes by 115341 or more
L <- nrow(all.tf7)
prev.chrom <- c(NA, all.tf7$chrom[-L])
delta.start <- c(NA, all.tf7$chromStart[-1] - all.tf7$chromStart[-L])
new.bin <- is.na(prev.chrom) | all.tf7$chrom != prev.chrom |
  delta.start >= 115341
all.tf7$bin <- cumsum(new.bin)
all.tf7
pguilha <paul.guilhamon at gmail.com> wrote on 07/02/2012 06:25:13 AM:
[...]
Hello,
Sorry to intrude, but I think it's a factor issue. Try changing the
disjunction to (in multiline edit):
new.bin <- is.na(prev.chrom) | df$chrom != levels(df$chrom)[prev.chrom] |
  delta.start >= 115341
It should work now.
Hope this helps,
Rui Barradas
On 02-07-2012 20:03, pguilha wrote:
Jean,
It's crazy, I'm still getting 1,2,3,4,5,6 in the bin column...
Also (this is an unrelated problem, I think), unless I've misunderstood it, I
think your code will only create a new bin if the difference between chromStart
at positions i and i-1 is >=115341. What I want is for a new bin to be created
each time the difference between chromStart at i and i-j is >=115341, where
'i-j' corresponds to the first row of the last bin. I'm not sure if I'm being
clear... chromStart values correspond to coordinates along a chromosome, so I
basically want to cut up each chromosome into sections/bins of approximately
115341.
Thanks again for all your efforts with this, they're much appreciated!
Paul
On 2 July 2012 19:36, Jean V Adams [via R] <ml-node+s789695n4635185h87 at n4.nabble.com> wrote:
Paul,
Try this (I changed some of the object names, but the meat of the code is
the same):
df <- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
  chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
  chromEnd = c(10309, 10536, 10362, 10418, 210578, 216352),
  name = c("ZBTB33", "TAF7_(SQ-8)", "Pol2-4H8", "MafF_(M8194)",
    "ZBTB33", "CTCF"),
  cumsum = c(10089, 20221, 30354, 40502, 50884, 67016)
)
# assign a new bin every time chrom changes and every time chromStart
# changes by 115341 or more
L <- nrow(df)
prev.chrom <- c(NA, df$chrom[-L])
delta.start <- c(NA, df$chromStart[-1] - df$chromStart[-L])
new.bin <- is.na(prev.chrom) | df$chrom != prev.chrom |
  delta.start >= 115341
df$bin <- cumsum(new.bin)
df
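The factor issue Rui points to can be reproduced with a short sketch. It assumes an R version before 4.0.0, where data.frame() turned character columns into factors by default; stringsAsFactors = TRUE is set explicitly here so the behaviour is the same on newer versions. The as.character() variant at the end is an alternative to Rui's levels() lookup, not something proposed in the thread, and the object names prev.codes, bin.codes and bin.chr are made up for the illustration.

```r
# Toy data from the thread, with chrom forced to a factor
df <- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
  chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
  stringsAsFactors = TRUE
)
L <- nrow(df)
delta.start <- c(NA, diff(df$chromStart))

# c(NA, <factor>) keeps only the integer codes (1, 1, 2, ...), so the
# comparison below is between labels such as "chr1" and codes such as 1,
# and every row is flagged as the start of a new bin.
prev.codes <- c(NA, df$chrom[-L])
bin.codes <- cumsum(is.na(prev.codes) | df$chrom != prev.codes |
  delta.start >= 115341)

# Comparing character values on both sides avoids the mismatch.
chrom <- as.character(df$chrom)
prev.chrom <- c(NA, chrom[-L])
bin.chr <- cumsum(is.na(prev.chrom) | chrom != prev.chrom |
  delta.start >= 115341)

bin.codes  # 1 2 3 4 5 6
bin.chr    # 1 1 2 2 3 3
```

This only addresses the 1,2,3,4,5,6 result. The bins are still based on the difference between consecutive rows, not on the distance from the first row of the current bin.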
pguilha <[hidden email]> wrote on 07/02/2012 10:23:36 AM:
Jean,
that's exactly what it should be, but yes, I copied and pasted from your email,
so I don't see how I could have introduced an error in there...
Paul
On 2 July 2012 15:57, Jean V Adams [via R] <[hidden email]> wrote:
Paul,
Are you submitting the exact code that I included in my previous e-mail?
When I submit that code, I get this ...
  chrom chromStart chromEnd         name cumsum bin
1  chr1      10089    10309       ZBTB33  10089   1
2  chr1      10132    10536  TAF7_(SQ-8)  20221   1
3  chr2      10133    10362     Pol2-4H8  30354   2
4  chr2      10148    10418 MafF_(M8194)  40502   2
5  chr2     210382   210578       ZBTB33  50884   3
6  chr2     216132   216352         CTCF  67016   3
Jean
Paul Guilhamon <[hidden email]> wrote on 07/02/2012 08:59:00 AM:
[...]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.