Hi all,
I have a feeling I asked this before, but cannot find the mail, or the answer I received, so sorry if we're covering old ground.
I have a package which can deal with data.frames, RangedData objects or Granges objects.
Assuming I have:
library( IRanges )
library( GenomicRanges )
# Setup a data.frame
df = data.frame( start=c( 1, 2 ),
end=c( 3, 4 ),
strand=c( '+', '+' ),
name=c( 'tim', 'yates' ),
more=c( T, F ) )
# Setup a RangedData object
rd = as( df, 'RangedData' )
# Setup a GRanges object
gr = as( rd, 'GRanges' )
Then the following is the behaviour when trying to use $ for accessing the column 'name':
print( df$name )
# [1] tim yates
# Levels: tim yates
print( rd$name )
# [1] tim yates
# Levels: tim yates
print( gr$name )
# Error in gr$meta : $ operator not defined for this S4 class
Using it for assignment also gives problems:
df$foo = c( 'bar', 'baz' )
# Adds new column 'foo' to df with values 'bar' and 'baz'
rd$foo = c( 'bar', 'baz' )
# Adds new column 'foo' to rd with values 'bar' and 'baz'
gr$foo = c( 'bar', 'baz' )
# Error in `$<-`(`*tmp*`, "foo", value = c("bar", "baz")) :
# no method for assigning subsets of this S4 class
Now, if I add the following two methods to my code:
`$.GRanges` = function( x, name ) {
elementMetadata( x )[[ name ]]
}
`$<-.GRanges` = function( x, name, value ) {
elementMetadata( x )[[ name ]] = value
x
}
Then all the above functions behave in a similar fashion, however I don't want to do this in my package, as I believe it will be a gateway to namespace hell?
Is there any plan to add this accessor operator to Granges so that access is normalised across all three of these types?
Thanks!
Tim Yates
--------------------------------------------------------
This email is confidential and intended solely for the u...{{dropped:12}}
[Bioc-devel] Meta data access with $ for GRanges vs RangedData vs data.frame
15 messages · Tim Yates, Stefano Berri, Steve Lianoglou +3 more
Obviously, my accessor method was wrong, and missing a check for non-meta
columns:
`$.GRanges` = function( x, name ) {
if( name %in% c( 'seqnames', 'ranges', 'strand' ) ) {
do.call( name, list( x ) )
}
else {
elementMetadata( x )[[ name ]]
}
}
Sorry about that (it's amazing what you see as soon as you click 'send')
Cheers,
Tim Yates
On 19/03/2012 12:02, "Tim Yates" <TYates at picr.man.ac.uk> wrote:
Hi all,
I have a feeling I asked this before, but cannot find the mail, or the
answer I received, so sorry if we're covering old ground.
I have a package which can deal with data.frames, RangedData objects or
Granges objects.
Assuming I have:
library( IRanges )
library( GenomicRanges )
# Setup a data.frame
df = data.frame( start=c( 1, 2 ),
end=c( 3, 4 ),
strand=c( '+', '+' ),
name=c( 'tim', 'yates' ),
more=c( T, F ) )
# Setup a RangedData object
rd = as( df, 'RangedData' )
# Setup a GRanges object
gr = as( rd, 'GRanges' )
Then the following is the behaviour when trying to use $ for accessing
the column 'name':
print( df$name )
# [1] tim yates
# Levels: tim yates
print( rd$name )
# [1] tim yates
# Levels: tim yates
print( gr$name )
# Error in gr$meta : $ operator not defined for this S4 class
Using it for assignment also gives problems:
df$foo = c( 'bar', 'baz' )
# Adds new column 'foo' to df with values 'bar' and 'baz'
rd$foo = c( 'bar', 'baz' )
# Adds new column 'foo' to rd with values 'bar' and 'baz'
gr$foo = c( 'bar', 'baz' )
# Error in `$<-`(`*tmp*`, "foo", value = c("bar", "baz")) :
# no method for assigning subsets of this S4 class
Now, if I add the following two methods to my code:
`$.GRanges` = function( x, name ) {
elementMetadata( x )[[ name ]]
}
`$<-.GRanges` = function( x, name, value ) {
elementMetadata( x )[[ name ]] = value
x
}
Then all the above functions behave in a similar fashion, however I don't
want to do this in my package, as I believe it will be a gateway to
namespace hell?
Is there any plan to add this accessor operator to Granges so that access
is normalised across all three of these types?
Thanks!
Tim Yates
--------------------------------------------------------
This email is confidential and intended solely for the ...{{dropped:19}}
Hi.
I am not sure this is the right place to ask, but could not find anywhere
better.
I am author of CNAnorm. I have a problem when MACOSX users want to install
packages.
it works fine if they do
source("http://bioconductor.org/biocLite.R")
biocLite("CNAnorm")
but this install the "stable" version 1.0.0
It is also fine if they download CNAnorm_1.1.8.tar.gz and then
R CMD INSTALL CNAnorm_1.1.8.tar.gz
but this requires XCode and a fortran compiler which is not so straight
forward for some users
If they download MacOS 10.5 (Leopard) binary CNAnorm_1.1.8.tgz
and then
R CMD INSTALL CNAnorm_1.1.8.tgz
they get
* installing to library
?/Library/Frameworks/R.framework/Versions/2.14/Resources/library?
* installing *binary* package ?CNAnorm? ...
* DONE (CNAnorm)
but then when starting CNAnorm
library(CNAnorm)
Loading required package: DNAcopy
**************************************************************************
The plan to change the data format for CNA object has been postponed
in order to ensure backward compatibility with older versions of DNAcopy
**************************************************************************
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared object
'/Library/Frameworks/R.framework/Versions/2.14/Resources/library/CNAnorm/lib
s/x86_64/CNAnorm.so':
dlopen(/Library/Frameworks/R.framework/Versions/2.14/Resources/library/CNAno
rm/libs/x86_64/CNAnorm.so, 6): Library not loaded:
/Library/Frameworks/R.framework/Versions/2.15/Resources/lib/libgfortran.2.dy
lib
Referenced from:
/Library/Frameworks/R.framework/Versions/2.14/Resources/library/CNAnorm/libs
/x86_64/CNAnorm.so
Reason: image not found
In addition: Warning message:
package ?CNAnorm? was built under R version 2.15.0
Error: package/namespace load failed for ?CNAnorm?
The user has R 2.14.1
What am I doing wrong? Is there a way to install the development version of
CNAnorm using biocLite?
Thank you in advance
cheers
Stefano
Hi Tim,
On Mon, Mar 19, 2012 at 8:02 AM, Tim Yates <TYates at picr.man.ac.uk> wrote:
[snip]
`$.GRanges` ? = function( x, name ) ? ? ? ?{
?elementMetadata( x )[[ name ]]
}
`$<-.GRanges` = function( x, name, value ) {
?elementMetadata( x )[[ name ]] = value
?x
}
Then all the above functions behave in a similar fashion, however I don't want to do this in my package, as I believe it will be a gateway to namespace hell?
Is there any plan to add this accessor operator to Granges so that access is normalised across all three of these types?
This does seem to come up every now and again. Most recently (I think) in this post: http://thread.gmane.org/gmane.science.biology.informatics.conductor/39169/focus=39189 There is a *technically* correct reason as to why there's no `$` accessor on GRanges that is explained there by Martin (it has to do w/ GRanges extending the Vector class), but still ... it seems that some of us feel like we still want it, even if it's bad for us. :-) -steve
Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Hi Stefano,
On Mon, Mar 19, 2012 at 8:50 AM, Stefano Berri <S.Berri at leeds.ac.uk> wrote:
Hi.
I am not sure this is the right place to ask, but could not find anywhere
better.
I am author of CNAnorm. I have a problem when MACOSX users want to install
packages.
it works fine if they do
source("http://bioconductor.org/biocLite.R")
biocLite("CNAnorm")
but this install the "stable" version 1.0.0
It is also fine if they download CNAnorm_1.1.8.tar.gz and then
R CMD INSTALL CNAnorm_1.1.8.tar.gz
but this requires XCode and a fortran compiler which is not so straight
forward for some users
If they download MacOS 10.5 (Leopard) binary ?CNAnorm_1.1.8.tgz
and then
R CMD INSTALL CNAnorm_1.1.8.tgz
they get
* installing to library
?/Library/Frameworks/R.framework/Versions/2.14/Resources/library?
* installing *binary* package ?CNAnorm? ...
* DONE (CNAnorm)
but then when starting CNAnorm
library(CNAnorm)
Loading required package: DNAcopy
**************************************************************************
? The plan to change the data format for CNA object has been postponed
?in order to ensure backward compatibility with older versions of DNAcopy
**************************************************************************
Error in dyn.load(file, DLLpath = DLLpath, ...) :
?unable to load shared object
'/Library/Frameworks/R.framework/Versions/2.14/Resources/library/CNAnorm/lib
s/x86_64/CNAnorm.so':
dlopen(/Library/Frameworks/R.framework/Versions/2.14/Resources/library/CNAno
rm/libs/x86_64/CNAnorm.so, 6): Library not loaded:
/Library/Frameworks/R.framework/Versions/2.15/Resources/lib/libgfortran.2.dy
lib
?Referenced from:
/Library/Frameworks/R.framework/Versions/2.14/Resources/library/CNAnorm/libs
/x86_64/CNAnorm.so
?Reason: image not found
In addition: Warning message:
package ?CNAnorm? was built under R version 2.15.0
Error: package/namespace load failed for ?CNAnorm?
The user has R 2.14.1
What am I doing wrong? Is there a way to install the development version of
CNAnorm using biocLite?
You aren't doing anything wrong ... this is the intended behavior. The package versions that biocLite installs are tied to the version of R that is being used. Someone who is running R-2.14.x is "tied" to the bioconductor 2.9 "release train." If you want the development version of packages, you need to be running R-devel. When R-2.15.0 comes out, it will be "tied" to bioconductor 2.10 packages. For the time being, if the mac users want to install the devel version of CNAnorm, they can either: (1) just download and install the source manually -- which usually isn't recommended and you might be opening the door to dependency hell, but ... it can be done. As you point out, the users will also need XCode and a gfortran compiler, which they can get from here: http://r.research.att.com/gfortran-4.2.3.dmg (2) they can grab the R-2.15-beta installer from the link below and install the devel versions of *all* bioconductor packages using biocLite: http://r.research.att.com/R-2.15-branch-leopard.pkg HTH, -steve
Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Actually, just an additional note about fortran compilers: On Mon, Mar 19, 2012 at 8:59 AM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
[snip]
For the time being, if the mac users want to install the devel version of CNAnorm, they can either: (1) just download and install the source manually -- which usually isn't recommended and you might be opening the door to dependency hell, but ... it can be done. As you point out, the users will also need XCode and a gfortran compiler, which they can get from here: http://r.research.att.com/gfortran-4.2.3.dmg
Instead of grabbing that installer directly, macosx folk should read the info on the following page to make sure they are downloading the correct version of their OS: http://r.research.att.com/tools/#gcc42 -steve
Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20120319/41889fc2/attachment.pl>
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20120319/4080b233/attachment.pl>
Tim,
On Mon, Mar 19, 2012 at 2:34 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
[snip]
At some point I'd like to discuss the pros and cons of data.table vs. data.frame for applications that talk to databases and emit SEs and/or Ranges.
[/snip] While I'm not really sure what your question is, I'm pretty sure the answer is that you should use data.table ;-) -steve
Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Hi Tim, Steve,
On 03/19/2012 05:50 AM, Steve Lianoglou wrote:
Hi Tim, On Mon, Mar 19, 2012 at 8:02 AM, Tim Yates<TYates at picr.man.ac.uk> wrote: [snip]
`$.GRanges` = function( x, name ) {
elementMetadata( x )[[ name ]]
}
`$<-.GRanges` = function( x, name, value ) {
elementMetadata( x )[[ name ]] = value
x
}
Then all the above functions behave in a similar fashion, however I don't want to do this in my package, as I believe it will be a gateway to namespace hell?
Is there any plan to add this accessor operator to Granges so that access is normalised across all three of these types?
This does seem to come up every now and again. Most recently (I think) in this post: http://thread.gmane.org/gmane.science.biology.informatics.conductor/39169/focus=39189 There is a *technically* correct reason as to why there's no `$` accessor on GRanges that is explained there by Martin (it has to do w/ GRanges extending the Vector class), but still ... it seems that some of us feel like we still want it, even if it's bad for us. :-)
Also note that, generally speaking, you can't expect to be able to
manipulate a GRanges exactly like a RangedData or like a data.frame.
There are important differences:
> rd
RangedData with 3 rows and 3 value columns across 1 space
space ranges | strand score id
<factor> <IRanges> | <character> <numeric> <character>
1 1 [1, 4] | + 5.1 K
2 1 [2, 5] | + 5.2 L
3 1 [3, 6] | + 5.3 M
> gr <- as(rd, "GRanges")
> names(gr) <- letters[1:3]
> gr
GRanges with 3 ranges and 2 elementMetadata cols:
seqnames ranges strand | score id
<Rle> <IRanges> <Rle> | <numeric> <character>
a 1 [1, 4] + | 5.1 K
b 1 [2, 5] + | 5.2 L
c 1 [3, 6] + | 5.3 M
---
seqlengths:
1
6
> dim(rd)
[1] 3 3
> dim(as.data.frame(rd))
[1] 3 7
> dim(gr)
NULL
> length(rd)
[1] 1
> length(as.data.frame(rd))
[1] 7
> length(gr)
[1] 3
> names(rd)
[1] "1"
> names(as.data.frame(rd))
[1] "space" "start" "end" "width" "strand" "score" "id"
> names(gr)
[1] "a" "b" "c"
Each of them has its own notion of dim, length and names.
Therefore each of them needs to have its own notion of what $ does.
For GRanges the notion of $ would be to be compatible with its
notion of length and names i.e. gr$c would access element named "c".
This is already the case for GRangesList, DNAStringSet, and probably
98% of the classes defined in the IRanges/GenomicRanges/Biostrings
infrastructure (I wish I could say 100%):
> grl <- GRangesList(tx1=gr, tx2=gr)
> grl$tx2
GRanges with 3 ranges and 2 elementMetadata cols:
seqnames ranges strand | score id
<Rle> <IRanges> <Rle> | <numeric> <character>
a 1 [1, 4] + | 5.1 K
b 1 [2, 5] + | 5.2 L
c 1 [3, 6] + | 5.3 M
---
seqlengths:
1
6
> DNAStringSet(c(aa="aa", bb="acgtatt"))$bb
7-letter "DNAString" instance
seq: ACGTATT
I can see it's tempting to sacrifice consistency for convenience
but there are so many classes in the IRanges/GenomicRanges that IMO
this would be a dangerous slope in the long run.
Cheers,
H.
-steve
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20120319/18ce1e51/attachment.pl>
Hi Michael,
On 03/19/2012 01:57 PM, Michael Lawrence wrote:
On Mon, Mar 19, 2012 at 12:37 PM, Herv? Pag?s <hpages at fhcrc.org
<mailto:hpages at fhcrc.org>> wrote:
Hi Tim, Steve,
On 03/19/2012 05:50 AM, Steve Lianoglou wrote:
Hi Tim,
On Mon, Mar 19, 2012 at 8:02 AM, Tim Yates<TYates at picr.man.ac.uk
<mailto:TYates at picr.man.ac.uk>> wrote:
[snip]
`$.GRanges` = function( x, name ) {
elementMetadata( x )[[ name ]]
}
`$<-.GRanges` = function( x, name, value ) {
elementMetadata( x )[[ name ]] = value
x
}
Then all the above functions behave in a similar fashion,
however I don't want to do this in my package, as I believe
it will be a gateway to namespace hell?
Is there any plan to add this accessor operator to Granges
so that access is normalised across all three of these types?
This does seem to come up every now and again. Most recently (I
think)
in this post:
http://thread.gmane.org/gmane.__science.biology.informatics.__conductor/39169/focus=39189
<http://thread.gmane.org/gmane.science.biology.informatics.conductor/39169/focus=39189>
There is a *technically* correct reason as to why there's no `$`
accessor on GRanges that is explained there by Martin (it has to
do w/
GRanges extending the Vector class), but still ... it seems that
some
of us feel like we still want it, even if it's bad for us. :-)
Also note that, generally speaking, you can't expect to be able to
manipulate a GRanges exactly like a RangedData or like a data.frame.
There are important differences:
> rd
RangedData with 3 rows and 3 value columns across 1 space
space ranges | strand score id
<factor> <IRanges> | <character> <numeric> <character>
1 1 [1, 4] | + 5.1 K
2 1 [2, 5] | + 5.2 L
3 1 [3, 6] | + 5.3 M
> gr <- as(rd, "GRanges")
> names(gr) <- letters[1:3]
> gr
GRanges with 3 ranges and 2 elementMetadata cols:
seqnames ranges strand | score id
<Rle> <IRanges> <Rle> | <numeric> <character>
a 1 [1, 4] + | 5.1 K
b 1 [2, 5] + | 5.2 L
c 1 [3, 6] + | 5.3 M
---
seqlengths:
1
6
> dim(rd)
[1] 3 3
> dim(as.data.frame(rd))
[1] 3 7
> dim(gr)
NULL
> length(rd)
[1] 1
> length(as.data.frame(rd))
[1] 7
> length(gr)
[1] 3
> names(rd)
[1] "1"
> names(as.data.frame(rd))
[1] "space" "start" "end" "width" "strand" "score" "id"
> names(gr)
[1] "a" "b" "c"
Each of them has its own notion of dim, length and names.
Therefore each of them needs to have its own notion of what $ does.
For GRanges the notion of $ would be to be compatible with its
notion of length and names i.e. gr$c would access element named "c".
This is already the case for GRangesList, DNAStringSet, and probably
98% of the classes defined in the IRanges/GenomicRanges/__Biostrings
infrastructure (I wish I could say 100%):
> grl <- GRangesList(tx1=gr, tx2=gr)
> grl$tx2
GRanges with 3 ranges and 2 elementMetadata cols:
seqnames ranges strand | score id
<Rle> <IRanges> <Rle> | <numeric> <character>
a 1 [1, 4] + | 5.1 K
b 1 [2, 5] + | 5.2 L
c 1 [3, 6] + | 5.3 M
---
seqlengths:
1
6
> DNAStringSet(c(aa="aa", bb="acgtatt"))$bb
7-letter "DNAString" instance
seq: ACGTATT
I can see it's tempting to sacrifice consistency for convenience
but there are so many classes in the IRanges/GenomicRanges that IMO
this would be a dangerous slope in the long run.
We need to balance the concerns of the developers and the users.
Developers who are building on top of low-level infrastructure need data
structures with consistent API and behaviors, whereas for users,
convenience is more of a concern.
For any List derivative, I would be against having '$' mean anything
other than what it does now. However, $ on a non-List Vector is not
generally useful for extracting elements of the vector, and indeed R has
disallowed the use of $ on atomic vectors.
IMO the difference between being a List derivative or a non-List Vector is too subtle to take this as the dividing line. This dividing line wouldn't even necessarily give you what you want e.g. IRanges or DNAStringSet are List derivatives but few users/developers know or care about this. If $ gives you an elementMetadata column of a GRanges, how could we justify that it doesn't do so for an IRanges too? Because IRanges and GRanges are so close and use the same notion of length and names, this would be an unfortunate inconsistency, a much more unfortunate one than the current inconsistencies between RangedData and GRanges. I have concerns that going on that slope will give us a situation where some of the 140+ classes in the IRanges/GenomicRanges/Biostrings /ShortRead infrastructure do one thing with $, while others do something else, with no clear dividing line, so every time I need to use $ on an object, I need to check the man page for that kind of object (I'm not very good at remembering the specificities of 140+ containers for such a standard operator as $). I'd rather have everybody do the same thing. Sacrificing clean design for convenience might seem like a good idea at first sight but we've seen notable examples in base R where this turned out to not be such a good deal in the long run... and in some cases the convenience was finally abandoned (e.g. name partial matching for list element access). The cost of the return ticket can be high, sometimes it's just impossible (e.g. name mangling in unlist()). In our own code, I recently had to go thru a deprecation/defunct cycle for the coercion methods from AtomicList to atomic vectors in IRanges. It will take 1 year to completely get rid of this and in the meantime it broke a lot of code.
Thus, '$' is available for other uses, such as extracting elements from a List associated with a Vector (like elementMetadata). This might be confusing to people, because they might start thinking that the Vector object is actually a List, or something, but in practice this has not happened with RangedData. There have been few queries about the "shape" of RangedData. People seem to just get it. Even with the proposed $ accessor, GRanges would remain comparatively simple. There is also something to be said for preserving the consistency between RangedData and GRanges, as evidenced by the multiple discussions on this topic.
I don't like having to type elementMetadata either everytime I need to access an elementMetadata column. But at least I know I can rely on this to work *anywhere* (i.e. any Vector derivative), *even* on a RangedData. Note that both rd$score and elementMetadata(rd)$score work on a RangedData but they do different things. Having gr$score being a shortcut for elementMetadata(gr)$score on a GRanges object would actually introduce another inconsistency between RangedData and GRanges, rather than eliminate one. H.
Michael
Cheers,
H.
-steve
--
Herv? Pag?s
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
_________________________________________________
Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/__listinfo/bioc-devel
<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20120319/df8a95f8/attachment.pl>
On 03/19/2012 04:00 PM, Michael Lawrence wrote:
On Mon, Mar 19, 2012 at 3:27 PM, Herv? Pag?s <hpages at fhcrc.org
<mailto:hpages at fhcrc.org>> wrote:
Hi Michael,
On 03/19/2012 01:57 PM, Michael Lawrence wrote:
On Mon, Mar 19, 2012 at 12:37 PM, Herv? Pag?s <hpages at fhcrc.org
<mailto:hpages at fhcrc.org>
<mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
Hi Tim, Steve,
On 03/19/2012 05:50 AM, Steve Lianoglou wrote:
Hi Tim,
On Mon, Mar 19, 2012 at 8:02 AM, Tim
Yates<TYates at picr.man.ac.uk <mailto:TYates at picr.man.ac.uk>
<mailto:TYates at picr.man.ac.uk <mailto:TYates at picr.man.ac.uk>>__>
wrote:
[snip]
`$.GRanges` = function( x, name ) {
elementMetadata( x )[[ name ]]
}
`$<-.GRanges` = function( x, name, value ) {
elementMetadata( x )[[ name ]] = value
x
}
Then all the above functions behave in a similar
fashion,
however I don't want to do this in my package, as I
believe
it will be a gateway to namespace hell?
Is there any plan to add this accessor operator to
Granges
so that access is normalised across all three of
these types?
This does seem to come up every now and again. Most
recently (I
think)
in this post:
http://thread.gmane.org/gmane.____science.biology.informatics.____conductor/39169/focus=39189
<http://thread.gmane.org/gmane.__science.biology.informatics.__conductor/39169/focus=39189>
<http://thread.gmane.org/__gmane.science.biology.__informatics.conductor/39169/__focus=39189
<http://thread.gmane.org/gmane.science.biology.informatics.conductor/39169/focus=39189>>
There is a *technically* correct reason as to why
there's no `$`
accessor on GRanges that is explained there by Martin
(it has to
do w/
GRanges extending the Vector class), but still ... it
seems that
some
of us feel like we still want it, even if it's bad for
us. :-)
Also note that, generally speaking, you can't expect to be
able to
manipulate a GRanges exactly like a RangedData or like a
data.frame.
There are important differences:
> rd
RangedData with 3 rows and 3 value columns across 1 space
space ranges | strand score id
<factor> <IRanges> | <character> <numeric> <character>
1 1 [1, 4] | + 5.1 K
2 1 [2, 5] | + 5.2 L
3 1 [3, 6] | + 5.3 M
> gr <- as(rd, "GRanges")
> names(gr) <- letters[1:3]
> gr
GRanges with 3 ranges and 2 elementMetadata cols:
seqnames ranges strand | score id
<Rle> <IRanges> <Rle> | <numeric> <character>
a 1 [1, 4] + | 5.1 K
b 1 [2, 5] + | 5.2 L
c 1 [3, 6] + | 5.3 M
---
seqlengths:
1
6
> dim(rd)
[1] 3 3
> dim(as.data.frame(rd))
[1] 3 7
> dim(gr)
NULL
> length(rd)
[1] 1
> length(as.data.frame(rd))
[1] 7
> length(gr)
[1] 3
> names(rd)
[1] "1"
> names(as.data.frame(rd))
[1] "space" "start" "end" "width" "strand" "score" "id"
> names(gr)
[1] "a" "b" "c"
Each of them has its own notion of dim, length and names.
Therefore each of them needs to have its own notion of what
$ does.
For GRanges the notion of $ would be to be compatible with its
notion of length and names i.e. gr$c would access element
named "c".
This is already the case for GRangesList, DNAStringSet, and
probably
98% of the classes defined in the
IRanges/GenomicRanges/____Biostrings
infrastructure (I wish I could say 100%):
> grl <- GRangesList(tx1=gr, tx2=gr)
> grl$tx2
GRanges with 3 ranges and 2 elementMetadata cols:
seqnames ranges strand | score id
<Rle> <IRanges> <Rle> | <numeric> <character>
a 1 [1, 4] + | 5.1 K
b 1 [2, 5] + | 5.2 L
c 1 [3, 6] + | 5.3 M
---
seqlengths:
1
6
> DNAStringSet(c(aa="aa", bb="acgtatt"))$bb
7-letter "DNAString" instance
seq: ACGTATT
I can see it's tempting to sacrifice consistency for convenience
but there are so many classes in the IRanges/GenomicRanges
that IMO
this would be a dangerous slope in the long run.
We need to balance the concerns of the developers and the users.
Developers who are building on top of low-level infrastructure
need data
structures with consistent API and behaviors, whereas for users,
convenience is more of a concern.
For any List derivative, I would be against having '$' mean anything
other than what it does now. However, $ on a non-List Vector is not
generally useful for extracting elements of the vector, and
indeed R has
disallowed the use of $ on atomic vectors.
IMO the difference between being a List derivative or a non-List Vector
is too subtle to take this as the dividing line. This dividing line
wouldn't even necessarily give you what you want e.g. IRanges or
DNAStringSet are List derivatives but few users/developers know or
care about this.
I think this hits at the core of the issue. Very few users understand
the low-level intricacies of the IRanges infrastructure and are instead
guided by their intuition. Empirically, it seems few have tried to
manipulate the element metadata of IRanges, probably due to a lack of a
use case. Similarly, most people do not even realize that one could use
elementMetadata on a RangedData, nor would they often need to do so.
They instead of think in high-level notions such as having "values" on
ranges, where GRanges and RangedData could have more consistency.
This is all about choosing a balance between high-level convenience and
low-level consistency, and one's opinion probably strongly depends on
how often one needs (and maybe forgets) to type "values()" in a day, vs.
how often one writes "setMethod(...)" in a day.
Looking back at Tim's question, perhaps we need some sort of
compatibility API for package authors, so that RangedData and GRanges
could be treated identically for many operations. I've found that there
is often a need to get a value column using the same API. This is not
presently possible.
Yes most of the time RangedData and GRanges are interchangeable. It would be good to know why users/developers need to be able to switch back and forth between the 2 representations. Maybe that could be avoided by adding the functionalities that are missing on either side. I guess bringing the 2 containers exactly on par won't be possible (otherwise that would mean they are 100% redundant?), and in some rare situations, switching back and forth will still be the only solution. Having specific use cases would help. I would still prefer trying to work in that direction rather than encouraging the users/developers to switch back and forth or to write code that works on both containers. In the case of Tim's package for example, it's of course a good idea to write tools that support different types of input, but maybe the input could be coerced to 1 type (the "native" type) and all the internal code written to work on that "native" type only? It's a technique I personally use a lot in my own code and I find it makes the code cleaner and *much* easier to maintain e.g. it's straightforward to add support for additional types and also I'm not at the mercy of subtle changes in the API of *one* of the many types I support (I only need to know and care about the API of the "native" type). Even if the initial coercion introduces some overhead, I find this approach really worth it. Cheers, H.
Michael
If $ gives you an elementMetadata column of a GRanges,
how could we justify that it doesn't do so for an IRanges too?
Because IRanges and GRanges are so close and use the same notion of
length and names, this would be an unfortunate inconsistency,
a much more unfortunate one than the current inconsistencies between
RangedData and GRanges.
I have concerns that going on that slope will give us a situation
where some of the 140+ classes in the IRanges/GenomicRanges/__Biostrings
/ShortRead infrastructure do one thing with $, while others do
something else, with no clear dividing line, so every time I need
to use $ on an object, I need to check the man page for that kind
of object (I'm not very good at remembering the specificities of
140+ containers for such a standard operator as $). I'd rather
have everybody do the same thing.
Sacrificing clean design for convenience might seem like a good idea
at first sight but we've seen notable examples in base R where this
turned out to not be such a good deal in the long run... and in some
cases the convenience was finally abandoned (e.g. name partial matching
for list element access). The cost of the return ticket can be high,
sometimes it's just impossible (e.g. name mangling in unlist()). In
our own code, I recently had to go thru a deprecation/defunct cycle
for the coercion methods from AtomicList to atomic vectors in IRanges.
It will take 1 year to completely get rid of this and in the meantime
it broke a lot of code.
Thus, '$' is available for
other uses, such as extracting elements from a List associated
with a
Vector (like elementMetadata). This might be confusing to people,
because they might start thinking that the Vector object is
actually a
List, or something, but in practice this has not happened with
RangedData. There have been few queries about the "shape" of
RangedData.
People seem to just get it. Even with the proposed $ accessor,
GRanges
would remain comparatively simple. There is also something to be
said
for preserving the consistency between RangedData and GRanges, as
evidenced by the multiple discussions on this topic.
I don't like having to type elementMetadata either everytime I need
to access an elementMetadata column. But at least I know I can rely
on this to work *anywhere* (i.e. any Vector derivative), *even* on a
RangedData. Note that both rd$score and elementMetadata(rd)$score
work on a RangedData but they do different things. Having gr$score
being a shortcut for elementMetadata(gr)$score on a GRanges object
would actually introduce another inconsistency between RangedData
and GRanges, rather than eliminate one.
H.
Michael
Cheers,
H.
-steve
--
Herv? Pag?s
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
<mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
<tel:%28206%29%20667-5791>
Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
<tel:%28206%29%20667-1319>
___________________________________________________
Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
<mailto:Bioc-devel at r-project.__org
<mailto:Bioc-devel at r-project.org>> mailing list
https://stat.ethz.ch/mailman/____listinfo/bioc-devel
<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
<https://stat.ethz.ch/mailman/__listinfo/bioc-devel
<https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
--
Herv? Pag?s
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20120320/97b50065/attachment.pl>