Skip to content

[Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

14 messages · Vincent Carey, Tim Triche, Jr., Michael Lawrence +2 more

#
For sorting the chromosome names in "natural" order, you can use
the makeSeqnameIds() helper function in GenomicRanges:

   seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(seqlevels(gr))]

It recognizes several naming conventions (chr-prefixed or not,
roman, etc...) e.g.:

   > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
   [1] 4 1 3 5 2

   > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
   [1] 4 1 3 5 2

I still need to improve the documentation of this function, put
more examples, and add unit tests. It's used internally by the
makeTranscriptDb* functions in GenomicFeatures to assign internal
ids to the chromosomes so that all the TranscriptDb extractors
can then return GRanges and GRangesList objects with the seqlevels
in the "natural" order.

Cheers,
H.
On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:

  
    
#
On 09/17/2013 11:30 AM, Michael Lawrence wrote:
That would be confusing because order() does something different:

   > order(c("chrY", "chr1", "chr9", "chr2"))
   [1] 2 4 3 1

   > makeSeqnameIds(c("chrY", "chr1", "chr9", "chr2"))
   [1] 4 1 3 2

makeSeqnameIds() is more like rank() except on how it deals with
duplicates:

   > makeSeqnameIds(c("chrY", "chr1", "chr9", "chr2", "chr9"))
   [1] 4 1 3 2 3

Whatever 'ties.method' you use, rank() will put "chrY" at rank 5.

IIRC makeSeqnameIds() needs to support duplicates because of how it's
used internally in the makeTranscriptDb* functions.

Alternatively we could add sortSeqlevels() as a wrapper to
makeSeqnameIds() so you would be able to do:

   seqlevels(gr) <- sortSeqlevels(seqlevels(gr))

Does that work?

Thanks,
H.

  
    
#
I have been doing this in a kludgey fashion for a long long time, and would favor restoreNaturalOrder() as a clear name for what it does.  It would be great to have this in BioC-2.13 release. 

Thanks Michael and Herve and Kasper for bringing up and solving this cleanly.

--t
On Sep 17, 2013, at 11:30 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:

            
#
On 09/17/2013 10:58 AM, Herv? Pag?s wrote:
I meant:

   seqlevels(gr) <- seqlevels(gr)[order(makeSeqnameIds(seqlevels(gr)))]

Probably the source of Michael's confusion... sorry.

sortSeqlevels() just added to GenomicRanges 1.13.44:

   seqlevels <- c("chr2RHet", "chr4", "chrUextra", "chrYHet",
                   "chrM", "chrXHet", "chr2LHet", "chrU",
                   "chr3L", "chr3R", "chr2R", "chrX")

Then:

   > sortSeqlevels(seqlevels)
    [1] "chr2R"     "chr3L"     "chr3R"     "chr4"      "chrX" 
"chrU"
    [7] "chrM"      "chr2LHet"  "chr2RHet"  "chrXHet"   "chrYHet" 
"chrUextra"

   > sortSeqlevels(seqlevels, X.is.sexchrom=TRUE)
    [1] "chr2R"     "chr3L"     "chr3R"     "chr4"      "chrX" 
"chrU"
    [7] "chrM"      "chr2LHet"  "chr2RHet"  "chrXHet"   "chrYHet" 
"chrUextra"

H.

  
    
#
OK I just made sortSeqlevels() a generic with a method for character
vectors and a default method. So you can do sortSeqlevels() directly on
a Seqinfo object, a GRanges object and anything that contains a Seqinfo
component:

   > gr
   GRanges with 0 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
     ---
     seqlengths:
       chr2RHet      chr4 chrUextra   chrYHet ...     chr3R     chr2R 
    chrX
             NA        NA        NA        NA ...        NA        NA 
      NA

   > sortSeqlevels(gr)
   GRanges with 0 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
     ---
     seqlengths:
          chr2R     chr3L     chr3R      chr4 ...   chrXHet   chrYHet 
chrUextra
             NA        NA        NA        NA ...        NA        NA 
      NA

H.
On 09/17/2013 01:05 PM, Michael Lawrence wrote:

  
    
#
Hi Kasper,
On 09/17/2013 06:05 PM, Kasper Daniel Hansen wrote:
sortSeqlevels() only sorts the seqlevels. Fixing them (if by this
you mean make them adhere to some naming covention) is another story
and a more complicated one.
We tried to tackled this with the seqnameStyle() which is only a
half-baked solution at the moment. We need to revisit the current
seqnameStyle() approach. Yes in the future we could have something
like fixSeqlevels() for doing both: sorting and renaming.

Cheers,
H.