
[Bioc-devel] Meta data access with $ for GRanges vs RangedData vs data.frame

15 messages · Tim Yates, Stefano Berri, Steve Lianoglou +3 more

#
Hi all,

I have a feeling I asked this before, but cannot find the mail, or the answer I received, so sorry if we're covering old ground.

I have a package which can deal with data.frames, RangedData objects or GRanges objects.

Assuming I have:

library( IRanges )
library( GenomicRanges )

# Setup a data.frame
df = data.frame( start=c( 1, 2 ),
                 end=c( 3, 4 ),
                 strand=c( '+', '+' ),
                 name=c( 'tim', 'yates' ),
                 more=c( T, F ) )
# Setup a RangedData object
rd = as( df, 'RangedData' )
# Setup a GRanges object
gr = as( rd, 'GRanges' )

Then the following is the behaviour when trying to use $ for accessing the column 'name':

print( df$name )
# [1] tim   yates
# Levels: tim yates

print( rd$name )
# [1] tim   yates
# Levels: tim yates

print( gr$name )
# Error in gr$name : $ operator not defined for this S4 class

Using it for assignment also gives problems:

df$foo = c( 'bar', 'baz' )
# Adds new column 'foo' to df with values 'bar' and 'baz'
rd$foo = c( 'bar', 'baz' )
# Adds new column 'foo' to rd with values 'bar' and 'baz'
gr$foo = c( 'bar', 'baz' )
# Error in `$<-`(`*tmp*`, "foo", value = c("bar", "baz")) :
#  no method for assigning subsets of this S4 class
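(For reference, the metadata column is still reachable without a `$` method, via the elementMetadata accessor that GRanges does support; this is a sketch using the objects defined above:)

```r
# elementMetadata() returns the metadata columns as a DataFrame,
# which does support `$` and `[[`, for both reading and assignment.
elementMetadata( gr )$name                     # read the 'name' column
elementMetadata( gr )$foo = c( 'bar', 'baz' )  # assignment works too
```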

Now, if I add the following two methods to my code:

`$.GRanges` = function( x, name ) {
  elementMetadata( x )[[ name ]]
}
`$<-.GRanges` = function( x, name, value ) {
  elementMetadata( x )[[ name ]] = value
  x
}

Then all three types behave consistently. However, I don't want to do this in my package, as I believe it would be a gateway to namespace hell?

Is there any plan to add this accessor operator to GRanges so that access is normalised across all three of these types?

Thanks!

Tim Yates 
#
Obviously, my accessor method was wrong, and missing a check for non-meta
columns:

`$.GRanges` = function( x, name ) {
  if( name %in% c( 'seqnames', 'ranges', 'strand' ) ) {
    do.call( name, list( x ) )
  } else {
    elementMetadata( x )[[ name ]]
  }
}
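For what it's worth, if one did want to wire this up "properly" rather than via S3-style functions, the S4 route would be setMethod on `$` (a sketch only, and not a recommendation, given the namespace worries above):

```r
# S4 methods for `$` and `$<-` on GRanges; the generic's signature
# is function( x, name ) and function( x, name, value ) respectively.
setMethod( '$', 'GRanges', function( x, name ) {
  elementMetadata( x )[[ name ]]
} )
setReplaceMethod( '$', 'GRanges', function( x, name, value ) {
  elementMetadata( x )[[ name ]] = value
  x
} )
```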

Sorry about that (it's amazing what you see as soon as you click 'send')


Cheers,

Tim Yates
On 19/03/2012 12:02, "Tim Yates" <TYates at picr.man.ac.uk> wrote:

#
Hi.

I am not sure this is the right place to ask, but could not find anywhere
better.

I am the author of CNAnorm. I have a problem when Mac OS X users want to
install the package.

It works fine if they do

source("http://bioconductor.org/biocLite.R")
biocLite("CNAnorm")

but this installs the "stable" version 1.0.0.

It is also fine if they download CNAnorm_1.1.8.tar.gz and then

R CMD INSTALL CNAnorm_1.1.8.tar.gz

but this requires XCode and a Fortran compiler, which is not so
straightforward for some users.

If they download the Mac OS X 10.5 (Leopard) binary CNAnorm_1.1.8.tgz
and then

R CMD INSTALL CNAnorm_1.1.8.tgz

they get

* installing to library
'/Library/Frameworks/R.framework/Versions/2.14/Resources/library'
* installing *binary* package 'CNAnorm' ...

* DONE (CNAnorm)

but then when starting CNAnorm

library(CNAnorm)
Loading required package: DNAcopy

**************************************************************************
   The plan to change the data format for CNA object has been postponed
 in order to ensure backward compatibility with older versions of DNAcopy
**************************************************************************

Error in dyn.load(file, DLLpath = DLLpath, ...) :
  unable to load shared object
'/Library/Frameworks/R.framework/Versions/2.14/Resources/library/CNAnorm/libs/x86_64/CNAnorm.so':
  dlopen(/Library/Frameworks/R.framework/Versions/2.14/Resources/library/CNAnorm/libs/x86_64/CNAnorm.so, 6): Library not loaded:
/Library/Frameworks/R.framework/Versions/2.15/Resources/lib/libgfortran.2.dylib
  Referenced from:
/Library/Frameworks/R.framework/Versions/2.14/Resources/library/CNAnorm/libs/x86_64/CNAnorm.so
  Reason: image not found
In addition: Warning message:
package 'CNAnorm' was built under R version 2.15.0
Error: package/namespace load failed for 'CNAnorm'


The user has R 2.14.1

What am I doing wrong? Is there a way to install the development version of
CNAnorm using biocLite?

Thank you in advance

cheers

Stefano
#
Hi Tim,
On Mon, Mar 19, 2012 at 8:02 AM, Tim Yates <TYates at picr.man.ac.uk> wrote:
[snip]
This does seem to come up every now and again. Most recently (I think)
in this post:

http://thread.gmane.org/gmane.science.biology.informatics.conductor/39169/focus=39189

There is a *technically* correct reason as to why there's no `$`
accessor on GRanges that is explained there by Martin (it has to do w/
GRanges extending the Vector class), but still ... it seems that some
of us feel like we still want it, even if it's bad for us. :-)

-steve
#
Hi Stefano,
On Mon, Mar 19, 2012 at 8:50 AM, Stefano Berri <S.Berri at leeds.ac.uk> wrote:
You aren't doing anything wrong ... this is the intended behavior.

The package versions that biocLite installs are tied to the version of
R that is being used. Someone who is running R-2.14.x is "tied" to the
bioconductor 2.9 "release train." If you want the development version
of packages, you need to be running R-devel.

When R-2.15.0 comes out, it will be "tied" to bioconductor 2.10 packages.
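A quick way to see which release train a given R session is tied to is biocVersion() (a sketch, assuming the BiocInstaller package that biocLite.R sets up is recent enough to provide it):

```r
# After sourcing biocLite.R, BiocInstaller reports the Bioconductor
# version this R installation is tied to.
source( 'http://bioconductor.org/biocLite.R' )
library( BiocInstaller )
biocVersion()  # e.g. '2.9' under R-2.14.x
```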

For the time being, if the mac users want to install the devel version
of CNAnorm, they can either:

(1) just download and install the source manually -- which usually
isn't recommended and you might be opening the door to dependency
hell, but ... it can be done. As you point out, the users will also
need XCode and a gfortran compiler, which they can get from here:

http://r.research.att.com/gfortran-4.2.3.dmg

(2) they can grab the R-2.15-beta installer from the link below and
install the devel versions of *all* bioconductor packages using
biocLite:

http://r.research.att.com/R-2.15-branch-leopard.pkg

HTH,
-steve
#
Actually, just an additional note about fortran compilers:


On Mon, Mar 19, 2012 at 8:59 AM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
[snip]
Instead of grabbing that installer directly, macosx folk should read
the info on the following page to make sure they are downloading the
correct version of their OS:

http://r.research.att.com/tools/#gcc42

-steve
#
Tim,
On Mon, Mar 19, 2012 at 2:34 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
[snip]
[/snip]

While I'm not really sure what your question is, I'm pretty sure the
answer is that you should use data.table ;-)

-steve
#
Hi Tim, Steve,
On 03/19/2012 05:50 AM, Steve Lianoglou wrote:
Also note that, generally speaking, you can't expect to be able to
manipulate a GRanges exactly like a RangedData or like a data.frame.
There are important differences:

 > rd
RangedData with 3 rows and 3 value columns across 1 space
      space    ranges |      strand     score          id
   <factor> <IRanges> | <character> <numeric> <character>
1        1    [1, 4] |           +       5.1           K
2        1    [2, 5] |           +       5.2           L
3        1    [3, 6] |           +       5.3           M
 > gr <- as(rd, "GRanges")
 > names(gr) <- letters[1:3]
 > gr
GRanges with 3 ranges and 2 elementMetadata cols:
     seqnames    ranges strand |     score          id
        <Rle> <IRanges>  <Rle> | <numeric> <character>
   a        1    [1, 4]      + |       5.1           K
   b        1    [2, 5]      + |       5.2           L
   c        1    [3, 6]      + |       5.3           M
   ---
   seqlengths:
    1
    6

 > dim(rd)
[1] 3 3
 > dim(as.data.frame(rd))
[1] 3 7
 > dim(gr)
NULL

 > length(rd)
[1] 1
 > length(as.data.frame(rd))
[1] 7
 > length(gr)
[1] 3

 > names(rd)
[1] "1"
 > names(as.data.frame(rd))
[1] "space"  "start"  "end"    "width"  "strand" "score"  "id"
 > names(gr)
[1] "a" "b" "c"

Each of them has its own notion of dim, length and names.
Therefore each of them needs to have its own notion of what $ does.
For GRanges the notion of $ would be to be compatible with its
notion of length and names i.e. gr$c would access element named "c".
This is already the case for GRangesList, DNAStringSet, and probably
98% of the classes defined in the IRanges/GenomicRanges/Biostrings
infrastructure (I wish I could say 100%):

 > grl <- GRangesList(tx1=gr, tx2=gr)
 > grl$tx2
GRanges with 3 ranges and 2 elementMetadata cols:
     seqnames    ranges strand |     score          id
        <Rle> <IRanges>  <Rle> | <numeric> <character>
   a        1    [1, 4]      + |       5.1           K
   b        1    [2, 5]      + |       5.2           L
   c        1    [3, 6]      + |       5.3           M
   ---
   seqlengths:
    1
    6

 > DNAStringSet(c(aa="aa", bb="acgtatt"))$bb
   7-letter "DNAString" instance
seq: ACGTATT

I can see it's tempting to sacrifice consistency for convenience,
but there are so many classes in the IRanges/GenomicRanges
infrastructure that IMO this would be a dangerous slope in the long run.

Cheers,
H.

#
Hi Michael,
On 03/19/2012 01:57 PM, Michael Lawrence wrote:
IMO the difference between being a List derivative or a non-List Vector
is too subtle to take this as the dividing line. This dividing line
wouldn't even necessarily give you what you want e.g. IRanges or
DNAStringSet are List derivatives but few users/developers know or
care about this. If $ gives you an elementMetadata column of a GRanges,
how could we justify that it doesn't do so for an IRanges too?
Because IRanges and GRanges are so close and use the same notion of
length and names, this would be an unfortunate inconsistency,
a much more unfortunate one than the current inconsistencies between
RangedData and GRanges.

I have concerns that going on that slope will give us a situation
where some of the 140+ classes in the IRanges/GenomicRanges/Biostrings
/ShortRead infrastructure do one thing with $, while others do
something else, with no clear dividing line, so every time I need
to use $ on an object, I need to check the man page for that kind
of object (I'm not very good at remembering the specificities of
140+ containers for such a standard operator as $). I'd rather
have everybody do the same thing.

Sacrificing clean design for convenience might seem like a good idea
at first sight but we've seen notable examples in base R where this
turned out to not be such a good deal in the long run... and in some
cases the convenience was finally abandoned (e.g. name partial matching
for list element access). The cost of the return ticket can be high;
sometimes it's just impossible (e.g. name mangling in unlist()). In
our own code, I recently had to go through a deprecation/defunct cycle
for the coercion methods from AtomicList to atomic vectors in IRanges.
It will take a year to completely get rid of this, and in the meantime
it broke a lot of code.

I don't like having to type elementMetadata either, every time I need
to access an elementMetadata column. But at least I know I can rely
on this to work *anywhere* (i.e. any Vector derivative), *even* on a
RangedData. Note that both rd$score and elementMetadata(rd)$score
work on a RangedData but they do different things. Having gr$score
being a shortcut for elementMetadata(gr)$score on a GRanges object
would actually introduce another inconsistency between RangedData
and GRanges, rather than eliminate one.

H.

#
On 03/19/2012 04:00 PM, Michael Lawrence wrote:
Yes, most of the time RangedData and GRanges are interchangeable. It
would be good to know why users/developers need to be able to switch
back and forth between the 2 representations. Maybe that could be
avoided by adding the functionalities that are missing on either side.
I guess bringing the 2 containers exactly on par won't be possible
(otherwise that would mean they are 100% redundant?), and in some
rare situations, switching back and forth will still be the only
solution. Having specific use cases would help.

I would still prefer trying to work in that direction rather than
encouraging the users/developers to switch back and forth or to write
code that works on both containers. In the case of Tim's package for
example, it's of course a good idea to write tools that support
different types of input, but maybe the input could be coerced to
1 type (the "native" type) and all the internal code written to work
on that "native" type only? It's a technique I personally use a lot
in my own code and I find it makes the code cleaner and *much*
easier to maintain e.g. it's straightforward to add support for
additional types and also I'm not at the mercy of subtle changes
in the API of *one* of the many types I support (I only need to
know and care about the API of the "native" type). Even if the
initial coercion introduces some overhead, I find this approach
really worth it.
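The pattern described above might look like the following sketch (toNative and myAnalysis are hypothetical names, not part of any package):

```r
library( GenomicRanges )

# Coerce any supported input to the one "native" type up front;
# everything downstream only ever sees a GRanges.
toNative = function( x ) {
  if( is( x, 'GRanges' ) )    return( x )
  if( is( x, 'RangedData' ) ) return( as( x, 'GRanges' ) )
  if( is.data.frame( x ) )    return( as( as( x, 'RangedData' ), 'GRanges' ) )
  stop( "unsupported input type: ", class( x ) )
}

myAnalysis = function( x ) {
  gr = toNative( x )
  # ... all internal code written against the GRanges API only ...
  gr
}
```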

Cheers,
H.