[Bioc-devel] could bsseq::data.frame2GRanges be added to GenomicRanges

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20131006/4943fd37/attachment.pl>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20131006/18b6928b/attachment.pl>
+1

--t
On Oct 6, 2013, at 6:28 AM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:

bsseq::data.frame2GRanges does the obvious step of converting a data.frame
to GRanges.  It has a couple of bells and whistles where strand can be
ignored and additional columns (apart from genomic location) may be ignore
in the output object.

I (and now quite a few other people) use this function almost every day.  I
have seen other implementations in other packages, suggesting this is not
just something I (we) use.

I suggests adding this function to GenomicRanges.  I am happy to support it
going forward.

Using this function we could also add an as(x, "GRanges") method for
x=data.frame, but I still suggest keeping the basic function for the
extended functionality it provides.

Best,
Kasper

   [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20131006/3f823500/attachment.pl>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20131006/29247b7e/attachment.pl>
For any tabular data structure with "chr[om]" and one or more of starts, ends, widths, and strands, there _is_ an obvious mapping, though!  And I personally always have an optional argument to keep the rest as mcols(). It just seems so straightforward. 

The generic granges() method also suggests that people often want a way for  something that is not itself a GRanges to at least emit one. 

granges(foo) is also a lot easier than 

with(foo, GRanges(fairly,
                               Involved(),
                               Constructor,
                               ...)

It would be nice if everything were a bed, wig, BAM or similar. Then rtracklayer would be all anyone needs. But, sometimes in the course of events, an unenlightened soul will have their library emit a table of coordinates bearing information that would be more useful as a GRanges. Given the frequency with which this occurs, it's nice to have that generic. 

Anyways, this is all your fault!  If you hadn't built such a marvelously useful data structure, people wouldn't want to use it in ways you never intended. :-)

No good deed goes unpunished,

--t
On Oct 6, 2013, at 1:26 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:

I'm still unconvinced that there is an obvious, general path from
data.frame -> GRanges. It's usually easy enough to just call GRanges(),
often of the pattern with(df, GRanges(...)). Moreover, it's unusual for me
to encounter genomic data in data.frames.

On Sun, Oct 6, 2013 at 8:37 AM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:

Also, it goes without saying that I am happy to provide a patch for
GenomicRanges, and check that it passes R CMD check, to minimize the work
of the maintainer.

Kasper

On Sun, Oct 6, 2013 at 9:28 AM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:

bsseq::data.frame2GRanges does the obvious step of converting a
data.frame
to GRanges.  It has a couple of bells and whistles where strand can be
ignored and additional columns (apart from genomic location) may be
ignore
in the output object.

I (and now quite a few other people) use this function almost every day.
I have seen other implementations in other packages, suggesting this is
not just something I (we) use.

I suggests adding this function to GenomicRanges.  I am happy to support
it going forward.

Using this function we could also add an as(x, "GRanges") method for
x=data.frame, but I still suggest keeping the basic function for the
extended functionality it provides.

Best,
Kasper
       [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
   [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20131006/897f67d1/attachment.pl>
Hi,

+1 from me, too ... I've also had a similar conversion function
(data.[frame|table] <--> GRanges) in my toolbelt which I found quite
useful over the years.

-steve

On Sun, Oct 6, 2013 at 5:00 PM, Kasper Daniel Hansen
This is a convenience function, which provably has saved tons of time for
me and others.  I get lots of data from various excel/cvs files lying
around various places, and these files _always_ have a clear path to a
GRanges.  Perhaps you never have to deal with this kind of data, but we are
a few experienced users who find it extremely handy and would like it to be
in a more centralized place.

On Sun, Oct 6, 2013 at 4:26 PM, Michael Lawrence
<lawrence.michael at gene.com>wrote:

I'm still unconvinced that there is an obvious, general path from
data.frame -> GRanges. It's usually easy enough to just call GRanges(),
often of the pattern with(df, GRanges(...)). Moreover, it's unusual for me
to encounter genomic data in data.frames.

On Sun, Oct 6, 2013 at 8:37 AM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:

Also, it goes without saying that I am happy to provide a patch for
GenomicRanges, and check that it passes R CMD check, to minimize the work
of the maintainer.

Kasper

On Sun, Oct 6, 2013 at 9:28 AM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:

bsseq::data.frame2GRanges does the obvious step of converting a
data.frame
to GRanges.  It has a couple of bells and whistles where strand can be
ignored and additional columns (apart from genomic location) may be
ignore
in the output object.

I (and now quite a few other people) use this function almost every day.
 I have seen other implementations in other packages, suggesting this is
not just something I (we) use.

I suggests adding this function to GenomicRanges.  I am happy to support
it going forward.

Using this function we could also add an as(x, "GRanges") method for
x=data.frame, but I still suggest keeping the basic function for the
extended functionality it provides.

Best,
Kasper

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20131007/b40a726d/attachment.pl>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20131007/763501a8/attachment.pl>
Hi GRangers,

I'll add this to GenomicRanges. Hope you don't mind if I rename
it makeGRangesFromDataFrame(). This is more consistent
with the current naming scheme for "specialized constructors"
e.g.:

   makeTranscriptDbFrom*()
   makeGRangesListFromFeatureFragments()
   readGAlignment*FromBam()
   etc...

I'll add some flexibility to let the user choose the columns to
import (we sometimes see start/stop instead of start/end), but by
default it'll do what bsseq::data.frame2GRanges currently does.

'keepColumns' and 'ignoreStrand' will become 'keep.columns' and
'ignore.strand'.

H.
nb.  Somehow I got a typo in there:

aGR <- df2GR(slow)
mm9toHg19 <- import('mm9ToHg19.over.chain')

## was full <- liftOver(slow, mm9toHg19)
full <- liftOver(aGR, mm9toHg19)
length(unlist(full))/length(slow)
## [1] 0.549

That should remind me to re-run everything.
In my defense... well, it's slow to re-run that :-/

apologies

On Mon, Oct 7, 2013 at 10:58 AM, Tim Triche, Jr. <tim.triche at gmail.com>wrote:

GRs are a great data structure.  But, standard bioinformatic file formats
(BED, WIG, BAM) don't always fit 1:1 with the "organic" beginnings of some
projects. The GenomicRanges infrastructure isn't on the radar of every R
developer, and some useful data can be found in ugly formats.  Wouldn't it
be handy if users could easily turn those blobs of data into something that
export() can handle?

To better understand Michael's point of view, and as someone who has seen
firsthand the nontrivial amount of work required to maintain rtracklayer as
a high-performance import library, I wrote a few trivial Perl scripts to
convert some mouse data from assorted wacky tabular formats to standard
BED6 files. Besides, once I have BEDs, I can Tabix them, which speeds up
operations.

I noticed that, when I imported the resulting BED files, where I had
cloned the base position into identical "chromStart" and "chromEnd"
coordinates, the import.bed() function assumed that I meant for them to be
UCSC-style, and therefore gave everything negative widths.  (On the bright
side, this also explained why liftOver wasn't doing anything useful with
the results)

packageVersion('rtracklayer')
## [1] '1.21.12'

wacky <- import('converted.theirSillyFormat.mm9.bed.gz', genome='mm9')
mm9toHg19 <- import('mm9ToHg19.over.chain')
empty <- liftOver(wacky, mm9toHg19)
length(unlist(empty))/length(wacky)
## [1] 0

## cursing ensues

That's not so good.  If I wasn't already aware of the insanity of
UCSC-style indexing, this could have been a problem in and of itself.  (As
it was, I fixed it)

slow <- read.table('theirSillyFormat.mm9.txt.gz')
## time passes...

aGR <- df2GR(slow)
mm9toHg19 <- import('mm9ToHg19.over.chain')
full <- liftOver(slow, mm9toHg19)
length(unlist(full))/length(slow)
## [1] 0.549

## better late than never

So: turning nonstandard data into standard data with "standard" (grr) UCSC
assumptions took longer than simply brute-forcing the issue with
read.table().

After importing the files with the "appropriate" (chrom, base-1, base)
indexing, I then went to liftOver() a related GRanges.  (Note that the
related GRanges was from a BED file submitted by guys at the Broad, so any
wacky formatting wasn't an issue in this case; here I wanted to control for
any other possible fubars).

foo <- import('an.RRBS.file.mm9.bed.gz', genome='mm9')
mm9toHg19 <- import('mm9ToHg19.over.chain')
bar <- liftOver(foo, mm9toHg19)

length(unlist(bar)) / length(foo)
## [1] 0.622

Needless to say, this was a hell of a lot faster than importing the
corresponding file as a table.  However, for the wacky file formats, it was
more time & trouble to decode all the assumptions prior to liftOver() than
it would have been to use granges(read.table('wacky.file.format.csv.gz')).
  Ugly and sad, but still true!

So, in conclusion, sometimes it might just be better to import one of
those wacky file formats to a data.frame D or what have you, and use
granges(D).

Or, since I'm one of the people that wrote a df2GR() function, I just use
that.

--t

On Mon, Oct 7, 2013 at 9:23 AM, Steve Lianoglou <lianoglou.steve at gene.com>wrote:

Hi,

+1 from me, too ... I've also had a similar conversion function
(data.[frame|table] <--> GRanges) in my toolbelt which I found quite
useful over the years.

-steve

On Sun, Oct 6, 2013 at 5:00 PM, Kasper Daniel Hansen
<kasperdanielhansen at gmail.com> wrote:
This is a convenience function, which provably has saved tons of time
for
me and others.  I get lots of data from various excel/cvs files lying
around various places, and these files _always_ have a clear path to a
GRanges.  Perhaps you never have to deal with this kind of data, but we
are
a few experienced users who find it extremely handy and would like it
to be
in a more centralized place.

On Sun, Oct 6, 2013 at 4:26 PM, Michael Lawrence
<lawrence.michael at gene.com>wrote:

I'm still unconvinced that there is an obvious, general path from
data.frame -> GRanges. It's usually easy enough to just call GRanges(),
often of the pattern with(df, GRanges(...)). Moreover, it's unusual
for me
to encounter genomic data in data.frames.

On Sun, Oct 6, 2013 at 8:37 AM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:

Also, it goes without saying that I am happy to provide a patch for
GenomicRanges, and check that it passes R CMD check, to minimize the
work
of the maintainer.

Kasper

On Sun, Oct 6, 2013 at 9:28 AM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:

bsseq::data.frame2GRanges does the obvious step of converting a
data.frame
to GRanges.  It has a couple of bells and whistles where strand can
be
ignored and additional columns (apart from genomic location) may be
ignore
in the output object.

I (and now quite a few other people) use this function almost every
day.
  I have seen other implementations in other packages, suggesting
this is
not just something I (we) use.

I suggests adding this function to GenomicRanges.  I am happy to
support
it going forward.

Using this function we could also add an as(x, "GRanges") method for
x=data.frame, but I still suggest keeping the basic function for the
extended functionality it provides.

Best,
Kasper

         [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

         [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
*He that would live in peace and at ease, *
*Must not speak all he knows, nor judge all he sees.*
  *
*
Benjamin Franklin, Poor Richard's Almanack<http://archive.org/details/poorrichardsalma00franrich>

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319