[Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20130917/4487b6fe/attachment.pl>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20130917/4d8548ca/attachment.pl>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20130917/49c2277f/attachment.pl>
For sorting the chromosome names in "natural" order, you can use
the makeSeqnameIds() helper function in GenomicRanges:

   seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(seqlevels(gr))]

It recognizes several naming conventions (chr-prefixed or not,
roman, etc...) e.g.:

   > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
   [1] 4 1 3 5 2

   > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
   [1] 4 1 3 5 2

I still need to improve the documentation of this function, put
more examples, and add unit tests. It's used internally by the
makeTranscriptDb* functions in GenomicFeatures to assign internal
ids to the chromosomes so that all the TranscriptDb extractors
can then return GRanges and GRangesList objects with the seqlevels
in the "natural" order.

Cheers,
H.
Actually, what I (and I think several others) just want is
   fixSeqnames()
which sorts and fixes them to some convention.  I don't really care which
one, but I want to do some easy harmonization, preferably in line with
other bioc tools.  I know this is hard to do for all organisms, but having
something that deals with the major ones (human, mouse, fruit fly, yeast,
some monkeys) would be extremely valuable.  If this function existed, I
could just throw all my GRanges at it.

Kasper

On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence <lawrence.michael at gene.com
wrote:

When I hit this I usually just do:

seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))

It would be nice if there were a function for sorting chromosome names
according to a specified naming convention. Like sortSeqlevels(gr,
UCSCNaming()) or the alternative replacement syntax: seqinfo(gr) <-
sort(seqinfo(gr), UCSCNaming()). Although sort is only single-dispatch.

Michael

On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
<stvjc at channing.harvard.edu>wrote:

I am replying to this as Michael mentions that this is a powerful
operation.  Here's my
problem that I believe is related, but I do not see a straightforward
solution

bhmm19uni
GRanges with 559456 ranges and 4 metadata columns:
          seqnames                 ranges strand   |              name
score
             <Rle>              <IRanges>  <Rle>   |       <character>
<numeric>
        1     chr1 [ 67229025,  67341224]      *   | 13_Heterochrom/lo
     0
        2     chr1 [203057955, 203060154]      *   |      12_Repressed
     0
        3     chr1 [  8458427,   8486226]      *   | 13_Heterochrom/lo
     0
        4     chr1 [ 16904427,  16904826]      *   | 5_Strong_Enhancer
     0
        5     chr1 [ 25289427,  25293426]      *   |       11_Weak_Txn
     0
      ...      ...                    ...    ... ...               ...
   ...
   570766    chr22   [51100931, 51101330]      *   |   2_Weak_Promoter
     0
   570767    chr22   [51101331, 51101530]      *   |   6_Weak_Enhancer
     0
   570768    chr22   [51101531, 51101730]      *   |   2_Weak_Promoter
     0
   570769    chr22   [51101731, 51178130]      *   | 13_Heterochrom/lo
     0
   570770    chr22   [51178131, 51178330]      *   | 15_Repetitive/CNV
     0
                           thick     numcode
                       <IRanges> <character>
        1 [ 67001613,  67113812]          13
        2 [201324578, 201326777]          12
        3 [  8381014,   8408813]          13
        4 [ 16777014,  16777413]           5
        5 [ 25162014,  25166013]          11
      ...                    ...         ...
   570766   [49447797, 49448196]           2
   570767   [49448197, 49448396]           6
   570768   [49448397, 49448596]           2
   570769   [49448597, 49524996]          13
   570770   [49524997, 49525196]          15
   ---
   seqlengths:
         chr1     chr10     chr11      chr5 ...      chr2     chr21
  chr4
    249250621 135534747 135006516 180915260 ... 243199373  48129895
191154276

If I try to make a karyogram with ggbio, the chromosomes are in an
unnatural order.  What is the right approach to getting the seqinfo in a
natural order with respect to chromosome names and lengths?  Reassigning
seqlevels seems dangerous but I haven't experimented fully.  It must be a
common use case but I do not see it in doc.

On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
lawrence.michael at gene.com> wrote:

On Mon, May 20, 2013 at 10:36 PM, Herv? Pag?s <hpages at fhcrc.org> wrote:

Michael,

On 05/20/2013 08:44 PM, Michael Lawrence wrote:

On Mon, May 20, 2013 at 4:13 PM, Herv? Pag?s <hpages at fhcrc.org
<mailto:hpages at fhcrc.org>> wrote:

     Hi,

     On 05/20/2013 03:15 PM, Michael Lawrence wrote:

         On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
         <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>> wrote:

             Hi Michael --

             On 5/20/2013 5:56 AM, Michael Lawrence wrote:

                 While it's cool that seqlevels<- has become so
flexible,
                 I still claim
                 that
                 rename/keep/drop would be a lot easier to read,
because
                 they describe the
                 high-level operation, and there's no reason to
deprecate
                 them. They're
                 also
                 easier to remember. For example, for dropping with
                 seqlevels<-, the user
                 needs
                 to remember that "force" is necessary to drop. Much
                 easier to just say
                 "dropSeqlevels(), please". And reimplementing
                 keepSeqlevels is still too
                 complicated. Is it such a maintenance burden to have
                 these simple
                 wrappers sit
                 on top of the low-level manipulators?

                 Another suggestion: renameSeqlevels should not
require
a
                 named vector (in
                 cases
                 where it is unnamed, require that it is parallel to
the
                 existing
                 seqlevels, as
                 with seqlevels<-).

             I didn't really indicate what drove my desire to see
             keepSeqlevels
             deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq
were
             developed more
             or less independently, and have different contracts. The
             different
             contracts are confusing to the user, as is creating a
usable
             help system
             when there are multiple end points for what is a common
             operation. The help
             pages of each were inadequate in different ways. And
because
             the code was
             developed independently, support for different objects
was
             inconsistent. So
             actually, yes, the maintenance (and use) burden for the
             previous state of
             affairs was substantial.

             On the other hand, I agree that keepSeqlevels is
convenient
             as a simple
             wrapper around seqlevels<-, in the same way that setNames
             and names<- are
             both useful.

             So we could iterate to

                 keepSeqlevels <-
                     function(x, value, ...)
                 {
                     seqlevels(x, force=TRUE) <- value
                     x
                 }

             but I'd be less enthusiastic about maintaining the
original
             contract of
             keepSeqlevels, where keepSeqlevels(gr1, gr2), would have
             worked for two
             GRanges objects.

     Why would this be called keepSeqlevels() if, depending on how
it's
used,
     it will either add, drop, rename, or permute the seqlevels?
     Couldn't this be called setSeqlevels?

I thought that new2old was necessary for permuting.

The seqlevels setter has no 'new2old' arg.

  You're right that
adding should be disallowed for keepSeqlevels(). Adding seqlevels is
not
a common operation. The two common operations are:
* Permuting, usually because the data were imported in non-canonical
order (seqnameStyle was designed to address this, no?).

Permuting is straightforward with seqlevels<-:

   > seqlevels(gr)
   [1] "chr1" "chr2" "chr3"

   > seqlevels(gr) <- rev(seqlevels(gr))

   > seqlevels(gr)
   [1] "chr3" "chr2" "chr1"

  * Subsetting, either by keeping or dropping.

Also straightforward:

   > seqlevels(gr, force=TRUE) <- "chr2"

   > seqlevels(gr)
   [1] "chr2"

  Two main reasons: taking a
small slice of the data for prototyping, or removing problematic
chromosomes (sex, circular). This goes back to bsapply and the
exclude
argument.

There are other important use cases:

   - Dropping seqlevels that are *not* in use. This happens for example
     when subsetting a BAM file with Rsamtools::filterBam to keep only
     the alignments located on a given chromosome. The entire sequence
     dictionary (in the header) is propagated to the resulting BAM file
     so when you read the file with readGAlignments() you end up with a
     bunch of useless seqlevels that you generally start by dropping.
     You don't want to drop any alignment so force=FALSE is your
friend.

This is sort of tangential to the discussion, but do you really want to
do
this? I would preserve the universe as given by the BAM.

   - An even more common operation is renaming: 90% of the times that's
     what I use seqlevels<- for, and that's what I tell people to use
     when they need to rename their chromosomes. Also straightforward:

       > seqlevels(gr)
       [1] "chr1" "chr2" "chr3" "chrM"

       > seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
       > seqlevels(gr)
       [1] "1" "2" "3" "M"

       > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
       > seqlevels(gr)
       [1] "1"  "2"  "3"  "MT"

     Note that this is simpler to use than renameSeqlevels() which
     always required you to build the entire right value as a named
     vector.

Sure, renaming is a common use case. Not sure how I forgot that.

As I wrote earlier in the thread, renameSeqlevels should be changed so
as
not to require naming of the vector.

So the added-value of keepSeqlevels() seems to boil down to:

   (a) it always uses force=TRUE

   (b) it preserves the original object and returns the modified object

If you want to restrict the use of keepSeqlevels() to permuting and
dropping, you'll also have to disallow renaming (in addition to
adding).
Then its name will more or less reflect what it does (if "keep" means
"permute" or "drop"). The final result will be something that does
less
than setSeqlevels() and that is not necessarily easier to read: both
will set on the object whatever vector of seqlevels they are supplied,
except that one will fail when doing so would actually result in
adding
or renaming seqlevels.

So it looks like seqlevels<- is pretty powerful. Now I see that if I
scroll
down to the examples in the man page, I find the actual documentation.
It's
cool to learn that we can replace seqlevels on TxDb objects. My argument
has always been that since these low-level operations are so powerful,
it's
nice to have high-level operations to clarify the code. We have to think
hard to know what the RHS will do in that replacement. It's probably one
of
the most complex replacement operations; far more complex than the
typical
one, including levels<- on factor. There's nothing wrong with that; it's
just complexity that we should be able to hide.

Michael

H.

Michael

     H.

         Well, I wasn't even aware of that feature, so you've made
your
         point about
         the documentation ;) Sounds like a good compromise to me.

         Thanks for understanding,
         Michael

             Martin

                 Michael

                 On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
                 <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                 <mailto:mtmorgan at fhcrc.org <mailto:
mtmorgan at fhcrc.org

                 wrote:

                       On 05/18/2013 05:42 PM, Martin Morgan wrote:

                           Some of the most common operations are
                 straight-forward, e.g.,

                                 > gr = GRanges(paste0("chr", c(1:22,
                 "X", "Y")), IRanges(1,
                 100))
                                 > seqlevels(gr) = sub("chr", "ch",
                 seqlevels(gr))

                       To get a more comprehensive example I should
have
                 followed my own
                 advice and
                       grabbed from the help page!

                             ## Rename:
                             seqlevels(txdb) <- sub("chr", "",
                 seqlevels(txdb))
                             seqlevels(txdb)

                             seqlevels(txdb) <- paste0("CH",
seqlevels(txdb))
                             seqlevels(txdb)

                             seqlevels(txdb)[seqlevels(__****__txdb)
==

                 "CHM"] <- "M"

                             seqlevels(txdb)

                       --
                       Computational Biology / Fred Hutchinson Cancer
                 Research Center
                       1100 Fairview Ave. N.
                       PO Box 19024 Seattle, WA 98109

                       Location: Arnold Building M1 B861
                       Phone: (206) 667-2793
<tel:%28206%29%20667-2793>
                 <tel:%28206%29%20667-2793>

             --
             Dr. Martin Morgan, PhD

             Fred Hutchinson Cancer Research Center
             1100 Fairview Ave. N.
             PO Box 19024 Seattle, WA 98109

                  [[alternative HTML version deleted]]

         ______________________________**___________________
         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.**org<
Bioc-devel at r-project.org>

         mailing list
         https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<
https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

         <https://stat.ethz.ch/mailman/**listinfo/bioc-devel<
https://stat.ethz.ch/mailman/listinfo/bioc-devel>

     --
     Herv? Pag?s

     Program in Computational Biology
     Division of Public Health Sciences

     Fred Hutchinson Cancer Research Center
     1100 Fairview Ave. N, M1-B514
     P.O. Box 19024
     Seattle, WA 98109-1024

     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>

--
Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

         [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

         [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

	[[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20130917/d66b6c33/attachment.pl>
Sounds great. Could we have a more intuitive name? Like naturalOrder().
That would be confusing because order() does something different:

   > order(c("chrY", "chr1", "chr9", "chr2"))
   [1] 2 4 3 1

   > makeSeqnameIds(c("chrY", "chr1", "chr9", "chr2"))
   [1] 4 1 3 2

makeSeqnameIds() is more like rank() except on how it deals with
duplicates:

   > makeSeqnameIds(c("chrY", "chr1", "chr9", "chr2", "chr9"))
   [1] 4 1 3 2 3

Whatever 'ties.method' you use, rank() will put "chrY" at rank 5.

IIRC makeSeqnameIds() needs to support duplicates because of how it's
used internally in the makeTranscriptDb* functions.

Alternatively we could add sortSeqlevels() as a wrapper to
makeSeqnameIds() so you would be able to do:

   seqlevels(gr) <- sortSeqlevels(seqlevels(gr))

Does that work?

Thanks,
H.
And then the method to fix the GRanges could be restoreNaturalOrder().
Just some suggestions...

On Tue, Sep 17, 2013 at 10:58 AM, Herv? Pag?s <hpages at fhcrc.org
<mailto:hpages at fhcrc.org>> wrote:

    For sorting the chromosome names in "natural" order, you can use
    the makeSeqnameIds() helper function in GenomicRanges:

       seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(__seqlevels(gr))]

    It recognizes several naming conventions (chr-prefixed or not,
    roman, etc...) e.g.:

       > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
       [1] 4 1 3 5 2

       > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
       [1] 4 1 3 5 2

    I still need to improve the documentation of this function, put
    more examples, and add unit tests. It's used internally by the
    makeTranscriptDb* functions in GenomicFeatures to assign internal
    ids to the chromosomes so that all the TranscriptDb extractors
    can then return GRanges and GRangesList objects with the seqlevels
    in the "natural" order.

    Cheers,
    H.

    On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:

        Actually, what I (and I think several others) just want is
            fixSeqnames()
        which sorts and fixes them to some convention.  I don't really
        care which
        one, but I want to do some easy harmonization, preferably in
        line with
        other bioc tools.  I know this is hard to do for all organisms,
        but having
        something that deals with the major ones (human, mouse, fruit
        fly, yeast,
        some monkeys) would be extremely valuable.  If this function
        existed, I
        could just throw all my GRanges at it.

        Kasper

        On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence
        <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>

            wrote:

            When I hit this I usually just do:

            seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))

            It would be nice if there were a function for sorting
            chromosome names
            according to a specified naming convention. Like
            sortSeqlevels(gr,
            UCSCNaming()) or the alternative replacement syntax:
            seqinfo(gr) <-
            sort(seqinfo(gr), UCSCNaming()). Although sort is only
            single-dispatch.

            Michael

            On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
            <stvjc at channing.harvard.edu
            <mailto:stvjc at channing.harvard.edu>>__wrote:

                I am replying to this as Michael mentions that this is a
                powerful
                operation.  Here's my
                problem that I believe is related, but I do not see a
                straightforward
                solution

                    bhmm19uni

                GRanges with 559456 ranges and 4 metadata columns:
                           seqnames                 ranges strand   |
                            name
                score
                              <Rle>              <IRanges>  <Rle>   |
                     <character>
                <numeric>
                         1     chr1 [ 67229025,  67341224]      *   |
                13_Heterochrom/lo
                      0
                         2     chr1 [203057955, 203060154]      *   |
                    12_Repressed
                      0
                         3     chr1 [  8458427,   8486226]      *   |
                13_Heterochrom/lo
                      0
                         4     chr1 [ 16904427,  16904826]      *   |
                5_Strong_Enhancer
                      0
                         5     chr1 [ 25289427,  25293426]      *   |
                     11_Weak_Txn
                      0
                       ...      ...                    ...    ... ...
                             ...
                    ...
                    570766    chr22   [51100931, 51101330]      *   |
                2_Weak_Promoter
                      0
                    570767    chr22   [51101331, 51101530]      *   |
                6_Weak_Enhancer
                      0
                    570768    chr22   [51101531, 51101730]      *   |
                2_Weak_Promoter
                      0
                    570769    chr22   [51101731, 51178130]      *   |
                13_Heterochrom/lo
                      0
                    570770    chr22   [51178131, 51178330]      *   |
                15_Repetitive/CNV
                      0
                                            thick     numcode
                                        <IRanges> <character>
                         1 [ 67001613,  67113812]          13
                         2 [201324578, 201326777]          12
                         3 [  8381014,   8408813]          13
                         4 [ 16777014,  16777413]           5
                         5 [ 25162014,  25166013]          11
                       ...                    ...         ...
                    570766   [49447797, 49448196]           2
                    570767   [49448197, 49448396]           6
                    570768   [49448397, 49448596]           2
                    570769   [49448597, 49524996]          13
                    570770   [49524997, 49525196]          15
                    ---
                    seqlengths:
                          chr1     chr10     chr11      chr5 ...
                  chr2     chr21
                   chr4
                     249250621 135534747 135006516 180915260 ...
                243199373  48129895
                191154276

                If I try to make a karyogram with ggbio, the chromosomes
                are in an
                unnatural order.  What is the right approach to getting
                the seqinfo in a
                natural order with respect to chromosome names and
                lengths?  Reassigning
                seqlevels seems dangerous but I haven't experimented
                fully.  It must be a
                common use case but I do not see it in doc.

                On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
                lawrence.michael at gene.com
                <mailto:lawrence.michael at gene.com>> wrote:

                    On Mon, May 20, 2013 at 10:36 PM, Herv? Pag?s
                    <hpages at fhcrc.org <mailto:hpages at fhcrc.org>> wrote:

                        Michael,

                        On 05/20/2013 08:44 PM, Michael Lawrence wrote:

                            On Mon, May 20, 2013 at 4:13 PM, Herv? Pag?s
                            <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
                            <mailto:hpages at fhcrc.org
                            <mailto:hpages at fhcrc.org>>> wrote:

                                  Hi,

                                  On 05/20/2013 03:15 PM, Michael
                            Lawrence wrote:

                                      On Mon, May 20, 2013 at 3:11 PM,
                            Martin Morgan
                                      <mtmorgan at fhcrc.org
                            <mailto:mtmorgan at fhcrc.org>
                            <mailto:mtmorgan at fhcrc.org
                            <mailto:mtmorgan at fhcrc.org>>> wrote:

                                          Hi Michael --

                                          On 5/20/2013 5:56 AM, Michael
                            Lawrence wrote:

                                              While it's cool that
                            seqlevels<- has become so

                    flexible,

                                              I still claim
                                              that
                                              rename/keep/drop would be
                            a lot easier to read,

            because

                                              they describe the
                                              high-level operation, and
                            there's no reason to

                    deprecate

                                              them. They're
                                              also
                                              easier to remember. For
                            example, for dropping with
                                              seqlevels<-, the user
                                              needs
                                              to remember that "force"
                            is necessary to drop. Much
                                              easier to just say
                                              "dropSeqlevels(), please".
                            And reimplementing
                                              keepSeqlevels is still too
                                              complicated. Is it such a
                            maintenance burden to have
                                              these simple
                                              wrappers sit
                                              on top of the low-level
                            manipulators?

                                              Another suggestion:
                            renameSeqlevels should not

            require

                    a

                                              named vector (in
                                              cases
                                              where it is unnamed,
                            require that it is parallel to

            the

                                              existing
                                              seqlevels, as
                                              with seqlevels<-).

                                          I didn't really indicate what
                            drove my desire to see
                                          keepSeqlevels
                                          deprecated. keepSeqlevels,
                            seqlevels<-, and isActiveSeq

                    were

                                          developed more
                                          or less independently, and
                            have different contracts. The
                                          different
                                          contracts are confusing to the
                            user, as is creating a

                    usable

                                          help system
                                          when there are multiple end
                            points for what is a common
                                          operation. The help
                                          pages of each were inadequate
                            in different ways. And

                    because

                                          the code was
                                          developed independently,
                            support for different objects

            was

                                          inconsistent. So
                                          actually, yes, the maintenance
                            (and use) burden for the
                                          previous state of
                                          affairs was substantial.

                                          On the other hand, I agree
                            that keepSeqlevels is

            convenient

                                          as a simple
                                          wrapper around seqlevels<-, in
                            the same way that setNames
                                          and names<- are
                                          both useful.

                                          So we could iterate to

                                              keepSeqlevels <-
                                                  function(x, value, ...)
                                              {
                                                  seqlevels(x,
                            force=TRUE) <- value
                                                  x
                                              }

                                          but I'd be less enthusiastic
                            about maintaining the

            original

                                          contract of
                                          keepSeqlevels, where
                            keepSeqlevels(gr1, gr2), would have
                                          worked for two
                                          GRanges objects.

                                  Why would this be called
                            keepSeqlevels() if, depending on how

            it's

                            used,
                                  it will either add, drop, rename, or
                            permute the seqlevels?
                                  Couldn't this be called setSeqlevels?

                            I thought that new2old was necessary for
                            permuting.

                        The seqlevels setter has no 'new2old' arg.

                           You're right that

                            adding should be disallowed for
                            keepSeqlevels(). Adding seqlevels is

                    not

                            a common operation. The two common
                            operations are:
                            * Permuting, usually because the data were
                            imported in non-canonical
                            order (seqnameStyle was designed to address
                            this, no?).

                        Permuting is straightforward with seqlevels<-:

                            > seqlevels(gr)
                            [1] "chr1" "chr2" "chr3"

                            > seqlevels(gr) <- rev(seqlevels(gr))

                            > seqlevels(gr)
                            [1] "chr3" "chr2" "chr1"

                           * Subsetting, either by keeping or dropping.

                        Also straightforward:

                            > seqlevels(gr, force=TRUE) <- "chr2"

                            > seqlevels(gr)
                            [1] "chr2"

                           Two main reasons: taking a

                            small slice of the data for prototyping, or
                            removing problematic
                            chromosomes (sex, circular). This goes back
                            to bsapply and the

            exclude

                            argument.

                        There are other important use cases:

                            - Dropping seqlevels that are *not* in use.
                        This happens for example
                              when subsetting a BAM file with
                        Rsamtools::filterBam to keep only
                              the alignments located on a given
                        chromosome. The entire sequence
                              dictionary (in the header) is propagated
                        to the resulting BAM file
                              so when you read the file with
                        readGAlignments() you end up with a
                              bunch of useless seqlevels that you
                        generally start by dropping.
                              You don't want to drop any alignment so
                        force=FALSE is your

            friend.

                    This is sort of tangential to the discussion, but do
                    you really want to

            do

                    this? I would preserve the universe as given by the BAM.

                            - An even more common operation is renaming:
                        90% of the times that's
                              what I use seqlevels<- for, and that's
                        what I tell people to use
                              when they need to rename their
                        chromosomes. Also straightforward:

                                > seqlevels(gr)
                                [1] "chr1" "chr2" "chr3" "chrM"

                                > seqlevels(gr) <- sub("^chr", "",
                        seqlevels(gr))
                                > seqlevels(gr)
                                [1] "1" "2" "3" "M"

                                > seqlevels(gr)[seqlevels(gr) == "M"] <-
                        "MT"
                                > seqlevels(gr)
                                [1] "1"  "2"  "3"  "MT"

                              Note that this is simpler to use than
                        renameSeqlevels() which
                              always required you to build the entire
                        right value as a named
                              vector.

                    Sure, renaming is a common use case. Not sure how I
                    forgot that.

                    As I wrote earlier in the thread, renameSeqlevels
                    should be changed so

            as

                    not to require naming of the vector.

                        So the added-value of keepSeqlevels() seems to
                        boil down to:

                            (a) it always uses force=TRUE

                            (b) it preserves the original object and
                        returns the modified object

                        If you want to restrict the use of
                        keepSeqlevels() to permuting and
                        dropping, you'll also have to disallow renaming
                        (in addition to

            adding).

                        Then its name will more or less reflect what it
                        does (if "keep" means
                        "permute" or "drop"). The final result will be
                        something that does

            less

                        than setSeqlevels() and that is not necessarily
                        easier to read: both
                        will set on the object whatever vector of
                        seqlevels they are supplied,
                        except that one will fail when doing so would
                        actually result in

            adding

                        or renaming seqlevels.

                    So it looks like seqlevels<- is pretty powerful. Now
                    I see that if I
                    scroll
                    down to the examples in the man page, I find the
                    actual documentation.
                    It's
                    cool to learn that we can replace seqlevels on TxDb
                    objects. My argument
                    has always been that since these low-level
                    operations are so powerful,
                    it's
                    nice to have high-level operations to clarify the
                    code. We have to think
                    hard to know what the RHS will do in that
                    replacement. It's probably one
                    of
                    the most complex replacement operations; far more
                    complex than the

            typical

                    one, including levels<- on factor. There's nothing
                    wrong with that; it's
                    just complexity that we should be able to hide.

                    Michael

                    H.

                            Michael

                                  H.

                                      Well, I wasn't even aware of that
                            feature, so you've made

            your

                                      point about
                                      the documentation ;) Sounds like a
                            good compromise to me.

                                      Thanks for understanding,
                                      Michael

                                          Martin

                                              Michael

                                              On Sat, May 18, 2013 at
                            6:00 PM, Martin Morgan
                                              <mtmorgan at fhcrc.org
                            <mailto:mtmorgan at fhcrc.org>
                            <mailto:mtmorgan at fhcrc.org
                            <mailto:mtmorgan at fhcrc.org>>
                                              <mailto:mtmorgan at fhcrc.org
                            <mailto:mtmorgan at fhcrc.org> <mailto:

            mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>

                                              wrote:

                                                    On 05/18/2013 05:42
                            PM, Martin Morgan wrote:

                                                        Some of the most
                            common operations are
                                              straight-forward, e.g.,

                                                              > gr =
                            GRanges(paste0("chr", c(1:22,
                                              "X", "Y")), IRanges(1,
                                              100))
                                                              >
                            seqlevels(gr) = sub("chr", "ch",
                                              seqlevels(gr))

                                                    To get a more
                            comprehensive example I should

            have

                                              followed my own
                                              advice and
                                                    grabbed from the
                            help page!

                                                          ## Rename:

                              seqlevels(txdb) <- sub("chr", "",
                                              seqlevels(txdb))
                                                          seqlevels(txdb)

                              seqlevels(txdb) <- paste0("CH",
                            seqlevels(txdb))
                                                          seqlevels(txdb)

                              seqlevels(txdb)[seqlevels(__**__**__txdb)

            ==

                                              "CHM"] <- "M"

                                                          seqlevels(txdb)

                                                    --
                                                    Computational
                            Biology / Fred Hutchinson Cancer
                                              Research Center
                                                    1100 Fairview Ave. N.
                                                    PO Box 19024
                            Seattle, WA 98109

                                                    Location: Arnold
                            Building M1 B861
                                                    Phone: (206)
                            667-2793 <tel:%28206%29%20667-2793>

            <tel:%28206%29%20667-2793>

                                              <tel:%28206%29%20667-2793>

                                          --
                                          Dr. Martin Morgan, PhD

                                          Fred Hutchinson Cancer
                            Research Center
                                          1100 Fairview Ave. N.
                                          PO Box 19024 Seattle, WA 98109

                                               [[alternative HTML
                            version deleted]]

                              ________________________________**___________________
                            Bioc-devel at r-project.org
                            <mailto:Bioc-devel at r-project.org>
                            <mailto:Bioc-devel at r-project.
                            <mailto:Bioc-devel at r-project.>*__*org<

                    Bioc-devel at r-project.org
                    <mailto:Bioc-devel at r-project.org>>

                                      mailing list
                            https://stat.ethz.ch/mailman/___**_listinfo/bioc-devel
                            <https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel><

                    https://stat.ethz.ch/mailman/____listinfo/bioc-devel
                    <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

                              <https://stat.ethz.ch/mailman/__**listinfo/bioc-devel <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><

                    https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                    <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>

                                  --
                                  Herv? Pag?s

                                  Program in Computational Biology
                                  Division of Public Health Sciences

                                  Fred Hutchinson Cancer Research Center
                                  1100 Fairview Ave. N, M1-B514
                                  P.O. Box 19024
                                  Seattle, WA 98109-1024

                                  E-mail: hpages at fhcrc.org
                            <mailto:hpages at fhcrc.org>
                            <mailto:hpages at fhcrc.org
                            <mailto:hpages at fhcrc.org>>
                                  Phone: (206) 667-5791
                            <tel:%28206%29%20667-5791>
                            <tel:%28206%29%20667-5791>
                                  Fax: (206) 667-1319
                            <tel:%28206%29%20667-1319>
                            <tel:%28206%29%20667-1319>

                        --
                        Herv? Pag?s

                        Program in Computational Biology
                        Division of Public Health Sciences
                        Fred Hutchinson Cancer Research Center
                        1100 Fairview Ave. N, M1-B514
                        P.O. Box 19024
                        Seattle, WA 98109-1024

                        E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
                        Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
                        Fax: (206) 667-1319 <tel:%28206%29%20667-1319>

                              [[alternative HTML version deleted]]

                    _________________________________________________
                    Bioc-devel at r-project.org
                    <mailto:Bioc-devel at r-project.org> mailing list
                    https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                    <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

                      [[alternative HTML version deleted]]

            _________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            mailing list
            https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

                 [[alternative HTML version deleted]]

        _________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        mailing list
        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

    --
    Herv? Pag?s

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
    Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
    Fax: (206) 667-1319 <tel:%28206%29%20667-1319>

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
I have been doing this in a kludgey fashion for a long long time, and would favor restoreNaturalOrder() as a clear name for what it does.  It would be great to have this in BioC-2.13 release. 

Thanks Michael and Herve and Kasper for bringing up and solving this cleanly.

--t

Sounds great. Could we have a more intuitive name? Like naturalOrder(). And
then the method to fix the GRanges could be restoreNaturalOrder(). Just
some suggestions...

On Tue, Sep 17, 2013 at 10:58 AM, Herv? Pag?s <hpages at fhcrc.org> wrote:

For sorting the chromosome names in "natural" order, you can use
the makeSeqnameIds() helper function in GenomicRanges:

 seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(**seqlevels(gr))]

It recognizes several naming conventions (chr-prefixed or not,
roman, etc...) e.g.:

makeSeqnameIds(c("Y", "1", "9", "M", "2"))
 [1] 4 1 3 5 2

makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
 [1] 4 1 3 5 2

I still need to improve the documentation of this function, put
more examples, and add unit tests. It's used internally by the
makeTranscriptDb* functions in GenomicFeatures to assign internal
ids to the chromosomes so that all the TranscriptDb extractors
can then return GRanges and GRangesList objects with the seqlevels
in the "natural" order.

Cheers,
H.

On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:

Actually, what I (and I think several others) just want is
  fixSeqnames()
which sorts and fixes them to some convention.  I don't really care which
one, but I want to do some easy harmonization, preferably in line with
other bioc tools.  I know this is hard to do for all organisms, but having
something that deals with the major ones (human, mouse, fruit fly, yeast,
some monkeys) would be extremely valuable.  If this function existed, I
could just throw all my GRanges at it.

Kasper

On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence <
lawrence.michael at gene.com

wrote:
When I hit this I usually just do:
seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))

It would be nice if there were a function for sorting chromosome names
according to a specified naming convention. Like sortSeqlevels(gr,
UCSCNaming()) or the alternative replacement syntax: seqinfo(gr) <-
sort(seqinfo(gr), UCSCNaming()). Although sort is only single-dispatch.

Michael

On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
<stvjc at channing.harvard.edu>**wrote:

I am replying to this as Michael mentions that this is a powerful
operation.  Here's my
problem that I believe is related, but I do not see a straightforward
solution

bhmm19uni
GRanges with 559456 ranges and 4 metadata columns:
         seqnames                 ranges strand   |              name
score
            <Rle>              <IRanges>  <Rle>   |       <character>
<numeric>
       1     chr1 [ 67229025,  67341224]      *   | 13_Heterochrom/lo
    0
       2     chr1 [203057955, 203060154]      *   |      12_Repressed
    0
       3     chr1 [  8458427,   8486226]      *   | 13_Heterochrom/lo
    0
       4     chr1 [ 16904427,  16904826]      *   | 5_Strong_Enhancer
    0
       5     chr1 [ 25289427,  25293426]      *   |       11_Weak_Txn
    0
     ...      ...                    ...    ... ...               ...
  ...
  570766    chr22   [51100931, 51101330]      *   |   2_Weak_Promoter
    0
  570767    chr22   [51101331, 51101530]      *   |   6_Weak_Enhancer
    0
  570768    chr22   [51101531, 51101730]      *   |   2_Weak_Promoter
    0
  570769    chr22   [51101731, 51178130]      *   | 13_Heterochrom/lo
    0
  570770    chr22   [51178131, 51178330]      *   | 15_Repetitive/CNV
    0
                          thick     numcode
                      <IRanges> <character>
       1 [ 67001613,  67113812]          13
       2 [201324578, 201326777]          12
       3 [  8381014,   8408813]          13
       4 [ 16777014,  16777413]           5
       5 [ 25162014,  25166013]          11
     ...                    ...         ...
  570766   [49447797, 49448196]           2
  570767   [49448197, 49448396]           6
  570768   [49448397, 49448596]           2
  570769   [49448597, 49524996]          13
  570770   [49524997, 49525196]          15
  ---
  seqlengths:
        chr1     chr10     chr11      chr5 ...      chr2     chr21
 chr4
   249250621 135534747 135006516 180915260 ... 243199373  48129895
191154276

If I try to make a karyogram with ggbio, the chromosomes are in an
unnatural order.  What is the right approach to getting the seqinfo in a
natural order with respect to chromosome names and lengths?  Reassigning
seqlevels seems dangerous but I haven't experimented fully.  It must be
a
common use case but I do not see it in doc.

On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
lawrence.michael at gene.com> wrote:

On Mon, May 20, 2013 at 10:36 PM, Herv? Pag?s <hpages at fhcrc.org>
wrote:

Michael,

On 05/20/2013 08:44 PM, Michael Lawrence wrote:

On Mon, May 20, 2013 at 4:13 PM, Herv? Pag?s <hpages at fhcrc.org
<mailto:hpages at fhcrc.org>> wrote:

    Hi,

    On 05/20/2013 03:15 PM, Michael Lawrence wrote:

        On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
        <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>> wrote:

            Hi Michael --

            On 5/20/2013 5:56 AM, Michael Lawrence wrote:

                While it's cool that seqlevels<- has become so
flexible,

                I still claim
                that
                rename/keep/drop would be a lot easier to read,
because

                they describe the
                high-level operation, and there's no reason to
deprecate

                them. They're
                also
                easier to remember. For example, for dropping with
                seqlevels<-, the user
                needs
                to remember that "force" is necessary to drop. Much
                easier to just say
                "dropSeqlevels(), please". And reimplementing
                keepSeqlevels is still too
                complicated. Is it such a maintenance burden to have
                these simple
                wrappers sit
                on top of the low-level manipulators?

                Another suggestion: renameSeqlevels should not
require

a

                named vector (in
                cases
                where it is unnamed, require that it is parallel to
the

                existing
                seqlevels, as
                with seqlevels<-).

            I didn't really indicate what drove my desire to see
            keepSeqlevels
            deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq
were

            developed more
            or less independently, and have different contracts. The
            different
            contracts are confusing to the user, as is creating a
usable

            help system
            when there are multiple end points for what is a common
            operation. The help
            pages of each were inadequate in different ways. And
because

            the code was
            developed independently, support for different objects
was

            inconsistent. So
            actually, yes, the maintenance (and use) burden for the
            previous state of
            affairs was substantial.

            On the other hand, I agree that keepSeqlevels is
convenient

            as a simple
            wrapper around seqlevels<-, in the same way that
setNames
            and names<- are
            both useful.

            So we could iterate to

                keepSeqlevels <-
                    function(x, value, ...)
                {
                    seqlevels(x, force=TRUE) <- value
                    x
                }

            but I'd be less enthusiastic about maintaining the
original

            contract of
            keepSeqlevels, where keepSeqlevels(gr1, gr2), would have
            worked for two
            GRanges objects.

    Why would this be called keepSeqlevels() if, depending on how
it's

used,
    it will either add, drop, rename, or permute the seqlevels?
    Couldn't this be called setSeqlevels?

I thought that new2old was necessary for permuting.
The seqlevels setter has no 'new2old' arg.

 You're right that

adding should be disallowed for keepSeqlevels(). Adding seqlevels is
not

a common operation. The two common operations are:
* Permuting, usually because the data were imported in non-canonical
order (seqnameStyle was designed to address this, no?).
Permuting is straightforward with seqlevels<-:

seqlevels(gr)
  [1] "chr1" "chr2" "chr3"

seqlevels(gr) <- rev(seqlevels(gr))

seqlevels(gr)
  [1] "chr3" "chr2" "chr1"

 * Subsetting, either by keeping or dropping.

Also straightforward:

seqlevels(gr, force=TRUE) <- "chr2"

seqlevels(gr)
  [1] "chr2"

 Two main reasons: taking a

small slice of the data for prototyping, or removing problematic
chromosomes (sex, circular). This goes back to bsapply and the
exclude

argument.
There are other important use cases:

  - Dropping seqlevels that are *not* in use. This happens for
example
    when subsetting a BAM file with Rsamtools::filterBam to keep only
    the alignments located on a given chromosome. The entire sequence
    dictionary (in the header) is propagated to the resulting BAM
file
    so when you read the file with readGAlignments() you end up with
a
    bunch of useless seqlevels that you generally start by dropping.
    You don't want to drop any alignment so force=FALSE is your
friend.

This is sort of tangential to the discussion, but do you really want
to
do

this? I would preserve the universe as given by the BAM.

   - An even more common operation is renaming: 90% of the times
that's
    what I use seqlevels<- for, and that's what I tell people to use
    when they need to rename their chromosomes. Also straightforward:

seqlevels(gr)
      [1] "chr1" "chr2" "chr3" "chrM"

seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
seqlevels(gr)
      [1] "1" "2" "3" "M"

seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
seqlevels(gr)
      [1] "1"  "2"  "3"  "MT"

    Note that this is simpler to use than renameSeqlevels() which
    always required you to build the entire right value as a named
    vector.

Sure, renaming is a common use case. Not sure how I forgot that.
As I wrote earlier in the thread, renameSeqlevels should be changed so
as

not to require naming of the vector.

So the added-value of keepSeqlevels() seems to boil down to:
  (a) it always uses force=TRUE

  (b) it preserves the original object and returns the modified
object

If you want to restrict the use of keepSeqlevels() to permuting and
dropping, you'll also have to disallow renaming (in addition to
adding).

Then its name will more or less reflect what it does (if "keep" means
"permute" or "drop"). The final result will be something that does
less

than setSeqlevels() and that is not necessarily easier to read: both
will set on the object whatever vector of seqlevels they are supplied,
except that one will fail when doing so would actually result in
adding

or renaming seqlevels.

So it looks like seqlevels<- is pretty powerful. Now I see that if I
scroll
down to the examples in the man page, I find the actual documentation.
It's
cool to learn that we can replace seqlevels on TxDb objects. My
argument
has always been that since these low-level operations are so powerful,
it's
nice to have high-level operations to clarify the code. We have to
think
hard to know what the RHS will do in that replacement. It's probably
one
of
the most complex replacement operations; far more complex than the
typical

one, including levels<- on factor. There's nothing wrong with that; it's
just complexity that we should be able to hide.

Michael

H.

Michael

    H.

        Well, I wasn't even aware of that feature, so you've made
your

        point about
        the documentation ;) Sounds like a good compromise to me.

        Thanks for understanding,
        Michael

            Martin

                Michael

                On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
                <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                <mailto:mtmorgan at fhcrc.org <mailto:
mtmorgan at fhcrc.org

                wrote:

                      On 05/18/2013 05:42 PM, Martin Morgan wrote:

                          Some of the most common operations are
                straight-forward, e.g.,

gr = GRanges(paste0("chr", c(1:22,
                "X", "Y")), IRanges(1,
                100))
seqlevels(gr) = sub("chr", "ch",
                seqlevels(gr))

                      To get a more comprehensive example I should
have

                followed my own
                advice and
                      grabbed from the help page!

                            ## Rename:
                            seqlevels(txdb) <- sub("chr", "",
                seqlevels(txdb))
                            seqlevels(txdb)

                            seqlevels(txdb) <- paste0("CH",
seqlevels(txdb))
                            seqlevels(txdb)

                            seqlevels(txdb)[seqlevels(__****
**__txdb)
==

                "CHM"] <- "M"

                            seqlevels(txdb)

                      --
                      Computational Biology / Fred Hutchinson Cancer
                Research Center
                      1100 Fairview Ave. N.
                      PO Box 19024 Seattle, WA 98109

                      Location: Arnold Building M1 B861
                      Phone: (206) 667-2793
<tel:%28206%29%20667-2793>

                <tel:%28206%29%20667-2793>

            --
            Dr. Martin Morgan, PhD

            Fred Hutchinson Cancer Research Center
            1100 Fairview Ave. N.
            PO Box 19024 Seattle, WA 98109

                 [[alternative HTML version deleted]]

        ______________________________****___________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.***
*org<
Bioc-devel at r-project.org>

         mailing list
        https://stat.ethz.ch/mailman/_****_listinfo/bioc-devel<https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel>
<
https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/****listinfo/bioc-devel<https://stat.ethz.ch/mailman/**listinfo/bioc-devel>
<
https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>

    --
    Herv? Pag?s

    Program in Computational Biology
    Division of Public Health Sciences

    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
    Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
    Fax: (206) 667-1319 <tel:%28206%29%20667-1319>

--
Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
        [[alternative HTML version deleted]]

______________________________**_________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
        [[alternative HTML version deleted]]

______________________________**_________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
       [[alternative HTML version deleted]]

______________________________**_________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
--
Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
   [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
For sorting the chromosome names in "natural" order, you can use
the makeSeqnameIds() helper function in GenomicRanges:

   seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(seqlevels(gr))]
I meant:

   seqlevels(gr) <- seqlevels(gr)[order(makeSeqnameIds(seqlevels(gr)))]

Probably the source of Michael's confusion... sorry.

sortSeqlevels() just added to GenomicRanges 1.13.44:

   seqlevels <- c("chr2RHet", "chr4", "chrUextra", "chrYHet",
                   "chrM", "chrXHet", "chr2LHet", "chrU",
                   "chr3L", "chr3R", "chr2R", "chrX")

Then:

   > sortSeqlevels(seqlevels)
    [1] "chr2R"     "chr3L"     "chr3R"     "chr4"      "chrX" 
"chrU"
    [7] "chrM"      "chr2LHet"  "chr2RHet"  "chrXHet"   "chrYHet" 
"chrUextra"

   > sortSeqlevels(seqlevels, X.is.sexchrom=TRUE)
    [1] "chr2R"     "chr3L"     "chr3R"     "chr4"      "chrX" 
"chrU"
    [7] "chrM"      "chr2LHet"  "chr2RHet"  "chrXHet"   "chrYHet" 
"chrUextra"

H.
It recognizes several naming conventions (chr-prefixed or not,
roman, etc...) e.g.:

   > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
   [1] 4 1 3 5 2

   > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
   [1] 4 1 3 5 2

I still need to improve the documentation of this function, put
more examples, and add unit tests. It's used internally by the
makeTranscriptDb* functions in GenomicFeatures to assign internal
ids to the chromosomes so that all the TranscriptDb extractors
can then return GRanges and GRangesList objects with the seqlevels
in the "natural" order.

Cheers,
H.

On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:
Actually, what I (and I think several others) just want is
   fixSeqnames()
which sorts and fixes them to some convention.  I don't really care which
one, but I want to do some easy harmonization, preferably in line with
other bioc tools.  I know this is hard to do for all organisms, but
having
something that deals with the major ones (human, mouse, fruit fly, yeast,
some monkeys) would be extremely valuable.  If this function existed, I
could just throw all my GRanges at it.

Kasper

On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence
<lawrence.michael at gene.com
wrote:

When I hit this I usually just do:

seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))

It would be nice if there were a function for sorting chromosome names
according to a specified naming convention. Like sortSeqlevels(gr,
UCSCNaming()) or the alternative replacement syntax: seqinfo(gr) <-
sort(seqinfo(gr), UCSCNaming()). Although sort is only single-dispatch.

Michael

On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
<stvjc at channing.harvard.edu>wrote:

I am replying to this as Michael mentions that this is a powerful
operation.  Here's my
problem that I believe is related, but I do not see a straightforward
solution

bhmm19uni
GRanges with 559456 ranges and 4 metadata columns:
          seqnames                 ranges strand   |              name
score
             <Rle>              <IRanges>  <Rle>   |       <character>
<numeric>
        1     chr1 [ 67229025,  67341224]      *   | 13_Heterochrom/lo
     0
        2     chr1 [203057955, 203060154]      *   |      12_Repressed
     0
        3     chr1 [  8458427,   8486226]      *   | 13_Heterochrom/lo
     0
        4     chr1 [ 16904427,  16904826]      *   | 5_Strong_Enhancer
     0
        5     chr1 [ 25289427,  25293426]      *   |       11_Weak_Txn
     0
      ...      ...                    ...    ... ...               ...
   ...
   570766    chr22   [51100931, 51101330]      *   |   2_Weak_Promoter
     0
   570767    chr22   [51101331, 51101530]      *   |   6_Weak_Enhancer
     0
   570768    chr22   [51101531, 51101730]      *   |   2_Weak_Promoter
     0
   570769    chr22   [51101731, 51178130]      *   | 13_Heterochrom/lo
     0
   570770    chr22   [51178131, 51178330]      *   | 15_Repetitive/CNV
     0
                           thick     numcode
                       <IRanges> <character>
        1 [ 67001613,  67113812]          13
        2 [201324578, 201326777]          12
        3 [  8381014,   8408813]          13
        4 [ 16777014,  16777413]           5
        5 [ 25162014,  25166013]          11
      ...                    ...         ...
   570766   [49447797, 49448196]           2
   570767   [49448197, 49448396]           6
   570768   [49448397, 49448596]           2
   570769   [49448597, 49524996]          13
   570770   [49524997, 49525196]          15
   ---
   seqlengths:
         chr1     chr10     chr11      chr5 ...      chr2     chr21
  chr4
    249250621 135534747 135006516 180915260 ... 243199373  48129895
191154276

If I try to make a karyogram with ggbio, the chromosomes are in an
unnatural order.  What is the right approach to getting the seqinfo
in a
natural order with respect to chromosome names and lengths?
Reassigning
seqlevels seems dangerous but I haven't experimented fully.  It must
be a
common use case but I do not see it in doc.

On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
lawrence.michael at gene.com> wrote:

On Mon, May 20, 2013 at 10:36 PM, Herv? Pag?s <hpages at fhcrc.org>
wrote:

Michael,

On 05/20/2013 08:44 PM, Michael Lawrence wrote:

On Mon, May 20, 2013 at 4:13 PM, Herv? Pag?s <hpages at fhcrc.org
<mailto:hpages at fhcrc.org>> wrote:

     Hi,

     On 05/20/2013 03:15 PM, Michael Lawrence wrote:

         On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
         <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>> wrote:

             Hi Michael --

             On 5/20/2013 5:56 AM, Michael Lawrence wrote:

                 While it's cool that seqlevels<- has become so
flexible,
                 I still claim
                 that
                 rename/keep/drop would be a lot easier to read,
because
                 they describe the
                 high-level operation, and there's no reason to
deprecate
                 them. They're
                 also
                 easier to remember. For example, for dropping with
                 seqlevels<-, the user
                 needs
                 to remember that "force" is necessary to drop. Much
                 easier to just say
                 "dropSeqlevels(), please". And reimplementing
                 keepSeqlevels is still too
                 complicated. Is it such a maintenance burden to
have
                 these simple
                 wrappers sit
                 on top of the low-level manipulators?

                 Another suggestion: renameSeqlevels should not
require
a
                 named vector (in
                 cases
                 where it is unnamed, require that it is parallel to
the
                 existing
                 seqlevels, as
                 with seqlevels<-).

             I didn't really indicate what drove my desire to see
             keepSeqlevels
             deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq
were
             developed more
             or less independently, and have different contracts.
The
             different
             contracts are confusing to the user, as is creating a
usable
             help system
             when there are multiple end points for what is a common
             operation. The help
             pages of each were inadequate in different ways. And
because
             the code was
             developed independently, support for different objects
was
             inconsistent. So
             actually, yes, the maintenance (and use) burden for the
             previous state of
             affairs was substantial.

             On the other hand, I agree that keepSeqlevels is
convenient
             as a simple
             wrapper around seqlevels<-, in the same way that
setNames
             and names<- are
             both useful.

             So we could iterate to

                 keepSeqlevels <-
                     function(x, value, ...)
                 {
                     seqlevels(x, force=TRUE) <- value
                     x
                 }

             but I'd be less enthusiastic about maintaining the
original
             contract of
             keepSeqlevels, where keepSeqlevels(gr1, gr2), would
have
             worked for two
             GRanges objects.

     Why would this be called keepSeqlevels() if, depending on how
it's
used,
     it will either add, drop, rename, or permute the seqlevels?
     Couldn't this be called setSeqlevels?

I thought that new2old was necessary for permuting.

The seqlevels setter has no 'new2old' arg.

  You're right that
adding should be disallowed for keepSeqlevels(). Adding seqlevels is
not
a common operation. The two common operations are:
* Permuting, usually because the data were imported in non-canonical
order (seqnameStyle was designed to address this, no?).

Permuting is straightforward with seqlevels<-:

   > seqlevels(gr)
   [1] "chr1" "chr2" "chr3"

   > seqlevels(gr) <- rev(seqlevels(gr))

   > seqlevels(gr)
   [1] "chr3" "chr2" "chr1"

  * Subsetting, either by keeping or dropping.

Also straightforward:

   > seqlevels(gr, force=TRUE) <- "chr2"

   > seqlevels(gr)
   [1] "chr2"

  Two main reasons: taking a
small slice of the data for prototyping, or removing problematic
chromosomes (sex, circular). This goes back to bsapply and the
exclude
argument.

There are other important use cases:

   - Dropping seqlevels that are *not* in use. This happens for
example
     when subsetting a BAM file with Rsamtools::filterBam to keep
only
     the alignments located on a given chromosome. The entire
sequence
     dictionary (in the header) is propagated to the resulting BAM
file
     so when you read the file with readGAlignments() you end up
with a
     bunch of useless seqlevels that you generally start by dropping.
     You don't want to drop any alignment so force=FALSE is your
friend.

This is sort of tangential to the discussion, but do you really
want to
do
this? I would preserve the universe as given by the BAM.

   - An even more common operation is renaming: 90% of the times
that's
     what I use seqlevels<- for, and that's what I tell people to use
     when they need to rename their chromosomes. Also
straightforward:

       > seqlevels(gr)
       [1] "chr1" "chr2" "chr3" "chrM"

       > seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
       > seqlevels(gr)
       [1] "1" "2" "3" "M"

       > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
       > seqlevels(gr)
       [1] "1"  "2"  "3"  "MT"

     Note that this is simpler to use than renameSeqlevels() which
     always required you to build the entire right value as a named
     vector.

Sure, renaming is a common use case. Not sure how I forgot that.

As I wrote earlier in the thread, renameSeqlevels should be changed so
as
not to require naming of the vector.

So the added-value of keepSeqlevels() seems to boil down to:

   (a) it always uses force=TRUE

   (b) it preserves the original object and returns the modified
object

If you want to restrict the use of keepSeqlevels() to permuting and
dropping, you'll also have to disallow renaming (in addition to
adding).
Then its name will more or less reflect what it does (if "keep" means
"permute" or "drop"). The final result will be something that does
less
than setSeqlevels() and that is not necessarily easier to read: both
will set on the object whatever vector of seqlevels they are
supplied,
except that one will fail when doing so would actually result in
adding
or renaming seqlevels.

So it looks like seqlevels<- is pretty powerful. Now I see that if I
scroll
down to the examples in the man page, I find the actual documentation.
It's
cool to learn that we can replace seqlevels on TxDb objects. My
argument
has always been that since these low-level operations are so powerful,
it's
nice to have high-level operations to clarify the code. We have to
think
hard to know what the RHS will do in that replacement. It's
probably one
of
the most complex replacement operations; far more complex than the
typical
one, including levels<- on factor. There's nothing wrong with that;
it's
just complexity that we should be able to hide.

Michael

H.

Michael

     H.

         Well, I wasn't even aware of that feature, so you've made
your
         point about
         the documentation ;) Sounds like a good compromise to me.

         Thanks for understanding,
         Michael

             Martin

                 Michael

                 On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
                 <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                 <mailto:mtmorgan at fhcrc.org <mailto:
mtmorgan at fhcrc.org

                 wrote:

                       On 05/18/2013 05:42 PM, Martin Morgan wrote:

                           Some of the most common operations are
                 straight-forward, e.g.,

                                 > gr = GRanges(paste0("chr",
c(1:22,
                 "X", "Y")), IRanges(1,
                 100))
                                 > seqlevels(gr) = sub("chr", "ch",
                 seqlevels(gr))

                       To get a more comprehensive example I should
have
                 followed my own
                 advice and
                       grabbed from the help page!

                             ## Rename:
                             seqlevels(txdb) <- sub("chr", "",
                 seqlevels(txdb))
                             seqlevels(txdb)

                             seqlevels(txdb) <- paste0("CH",
seqlevels(txdb))
                             seqlevels(txdb)

                             seqlevels(txdb)[seqlevels(__****__txdb)
==

                 "CHM"] <- "M"

                             seqlevels(txdb)

                       --
                       Computational Biology / Fred Hutchinson
Cancer
                 Research Center
                       1100 Fairview Ave. N.
                       PO Box 19024 Seattle, WA 98109

                       Location: Arnold Building M1 B861
                       Phone: (206) 667-2793
<tel:%28206%29%20667-2793>
                 <tel:%28206%29%20667-2793>

             --
             Dr. Martin Morgan, PhD

             Fred Hutchinson Cancer Research Center
             1100 Fairview Ave. N.
             PO Box 19024 Seattle, WA 98109

                  [[alternative HTML version deleted]]

         ______________________________**___________________
         Bioc-devel at r-project.org
<mailto:Bioc-devel at r-project.**org<
Bioc-devel at r-project.org>

         mailing list
         https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<
https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

         <https://stat.ethz.ch/mailman/**listinfo/bioc-devel<
https://stat.ethz.ch/mailman/listinfo/bioc-devel>

     --
     Herv? Pag?s

     Program in Computational Biology
     Division of Public Health Sciences

     Fred Hutchinson Cancer Research Center
     1100 Fairview Ave. N, M1-B514
     P.O. Box 19024
     Seattle, WA 98109-1024

     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>

--
Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

         [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

         [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

    [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20130917/ad0aca9b/attachment.pl>
OK I just made sortSeqlevels() a generic with a method for character
vectors and a default method. So you can do sortSeqlevels() directly on
a Seqinfo object, a GRanges object and anything that contains a Seqinfo
component:

   > gr
   GRanges with 0 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
     ---
     seqlengths:
       chr2RHet      chr4 chrUextra   chrYHet ...     chr3R     chr2R 
    chrX
             NA        NA        NA        NA ...        NA        NA 
      NA

   > sortSeqlevels(gr)
   GRanges with 0 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
     ---
     seqlengths:
          chr2R     chr3L     chr3R      chr4 ...   chrXHet   chrYHet 
chrUextra
             NA        NA        NA        NA ...        NA        NA 
      NA

H.
Thanks, Herv?, that's great.

Would it make sense to have a sort or sortSeqlevels method for Seqinfo
and then a sortSeqlevels,ANY method that just does seqinfo(x) <-
sort(seqinfo(x))?

Michael

On Tue, Sep 17, 2013 at 12:56 PM, Herv? Pag?s <hpages at fhcrc.org
<mailto:hpages at fhcrc.org>> wrote:

    On 09/17/2013 10:58 AM, Herv? Pag?s wrote:

        For sorting the chromosome names in "natural" order, you can use
        the makeSeqnameIds() helper function in GenomicRanges:

            seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(__seqlevels(gr))]

    I meant:

       seqlevels(gr) <-
    seqlevels(gr)[order(__makeSeqnameIds(seqlevels(gr)))__]

    Probably the source of Michael's confusion... sorry.

    sortSeqlevels() just added to GenomicRanges 1.13.44:

       seqlevels <- c("chr2RHet", "chr4", "chrUextra", "chrYHet",
                       "chrM", "chrXHet", "chr2LHet", "chrU",
                       "chr3L", "chr3R", "chr2R", "chrX")

    Then:

       > sortSeqlevels(seqlevels)
        [1] "chr2R"     "chr3L"     "chr3R"     "chr4"      "chrX" "chrU"
        [7] "chrM"      "chr2LHet"  "chr2RHet"  "chrXHet"   "chrYHet"
    "chrUextra"

       > sortSeqlevels(seqlevels, X.is.sexchrom=TRUE)
        [1] "chr2R"     "chr3L"     "chr3R"     "chr4"      "chrX" "chrU"
        [7] "chrM"      "chr2LHet"  "chr2RHet"  "chrXHet"   "chrYHet"
    "chrUextra"

    H.

        It recognizes several naming conventions (chr-prefixed or not,
        roman, etc...) e.g.:

            > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
            [1] 4 1 3 5 2

            > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
            [1] 4 1 3 5 2

        I still need to improve the documentation of this function, put
        more examples, and add unit tests. It's used internally by the
        makeTranscriptDb* functions in GenomicFeatures to assign internal
        ids to the chromosomes so that all the TranscriptDb extractors
        can then return GRanges and GRangesList objects with the seqlevels
        in the "natural" order.

        Cheers,
        H.

        On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:

            Actually, what I (and I think several others) just want is
                fixSeqnames()
            which sorts and fixes them to some convention.  I don't
            really care which
            one, but I want to do some easy harmonization, preferably in
            line with
            other bioc tools.  I know this is hard to do for all
            organisms, but
            having
            something that deals with the major ones (human, mouse,
            fruit fly, yeast,
            some monkeys) would be extremely valuable.  If this function
            existed, I
            could just throw all my GRanges at it.

            Kasper

            On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence
            <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>

                wrote:

                When I hit this I usually just do:

                seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))

                It would be nice if there were a function for sorting
                chromosome names
                according to a specified naming convention. Like
                sortSeqlevels(gr,
                UCSCNaming()) or the alternative replacement syntax:
                seqinfo(gr) <-
                sort(seqinfo(gr), UCSCNaming()). Although sort is only
                single-dispatch.

                Michael

                On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
                <stvjc at channing.harvard.edu
                <mailto:stvjc at channing.harvard.edu>>__wrote:

                    I am replying to this as Michael mentions that this
                    is a powerful
                    operation.  Here's my
                    problem that I believe is related, but I do not see
                    a straightforward
                    solution

                        bhmm19uni

                    GRanges with 559456 ranges and 4 metadata columns:
                               seqnames                 ranges strand
                    |              name
                    score
                                  <Rle>              <IRanges>  <Rle>
                    |       <character>
                    <numeric>
                             1     chr1 [ 67229025,  67341224]      *
                    | 13_Heterochrom/lo
                          0
                             2     chr1 [203057955, 203060154]      *
                    |      12_Repressed
                          0
                             3     chr1 [  8458427,   8486226]      *
                    | 13_Heterochrom/lo
                          0
                             4     chr1 [ 16904427,  16904826]      *
                    | 5_Strong_Enhancer
                          0
                             5     chr1 [ 25289427,  25293426]      *
                    |       11_Weak_Txn
                          0
                           ...      ...                    ...    ...
                    ...               ...
                        ...
                        570766    chr22   [51100931, 51101330]      *
                    |   2_Weak_Promoter
                          0
                        570767    chr22   [51101331, 51101530]      *
                    |   6_Weak_Enhancer
                          0
                        570768    chr22   [51101531, 51101730]      *
                    |   2_Weak_Promoter
                          0
                        570769    chr22   [51101731, 51178130]      *
                    | 13_Heterochrom/lo
                          0
                        570770    chr22   [51178131, 51178330]      *
                    | 15_Repetitive/CNV
                          0
                                                thick     numcode
                                            <IRanges> <character>
                             1 [ 67001613,  67113812]          13
                             2 [201324578, 201326777]          12
                             3 [  8381014,   8408813]          13
                             4 [ 16777014,  16777413]           5
                             5 [ 25162014,  25166013]          11
                           ...                    ...         ...
                        570766   [49447797, 49448196]           2
                        570767   [49448197, 49448396]           6
                        570768   [49448397, 49448596]           2
                        570769   [49448597, 49524996]          13
                        570770   [49524997, 49525196]          15
                        ---
                        seqlengths:
                              chr1     chr10     chr11      chr5 ...
                      chr2     chr21
                       chr4
                         249250621 135534747 135006516 180915260 ...
                    243199373  48129895
                    191154276

                    If I try to make a karyogram with ggbio, the
                    chromosomes are in an
                    unnatural order.  What is the right approach to
                    getting the seqinfo
                    in a
                    natural order with respect to chromosome names and
                    lengths?
                    Reassigning
                    seqlevels seems dangerous but I haven't experimented
                    fully.  It must
                    be a
                    common use case but I do not see it in doc.

                    On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
                    lawrence.michael at gene.com
                    <mailto:lawrence.michael at gene.com>> wrote:

                        On Mon, May 20, 2013 at 10:36 PM, Herv? Pag?s
                        <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
                        wrote:

                            Michael,

                            On 05/20/2013 08:44 PM, Michael Lawrence wrote:

                                On Mon, May 20, 2013 at 4:13 PM, Herv?
                                Pag?s <hpages at fhcrc.org
                                <mailto:hpages at fhcrc.org>
                                <mailto:hpages at fhcrc.org
                                <mailto:hpages at fhcrc.org>>> wrote:

                                      Hi,

                                      On 05/20/2013 03:15 PM, Michael
                                Lawrence wrote:

                                          On Mon, May 20, 2013 at 3:11
                                PM, Martin Morgan
                                          <mtmorgan at fhcrc.org
                                <mailto:mtmorgan at fhcrc.org>
                                <mailto:mtmorgan at fhcrc.org
                                <mailto:mtmorgan at fhcrc.org>>> wrote:

                                              Hi Michael --

                                              On 5/20/2013 5:56 AM,
                                Michael Lawrence wrote:

                                                  While it's cool that
                                seqlevels<- has become so

                        flexible,

                                                  I still claim
                                                  that
                                                  rename/keep/drop would
                                be a lot easier to read,

                because

                                                  they describe the
                                                  high-level operation,
                                and there's no reason to

                        deprecate

                                                  them. They're
                                                  also
                                                  easier to remember.
                                For example, for dropping with
                                                  seqlevels<-, the user
                                                  needs
                                                  to remember that
                                "force" is necessary to drop. Much
                                                  easier to just say
                                                  "dropSeqlevels(),
                                please". And reimplementing
                                                  keepSeqlevels is still too
                                                  complicated. Is it
                                such a maintenance burden to
                                have
                                                  these simple
                                                  wrappers sit
                                                  on top of the
                                low-level manipulators?

                                                  Another suggestion:
                                renameSeqlevels should not

                require

                        a

                                                  named vector (in
                                                  cases
                                                  where it is unnamed,
                                require that it is parallel to

                the

                                                  existing
                                                  seqlevels, as
                                                  with seqlevels<-).

                                              I didn't really indicate
                                what drove my desire to see
                                              keepSeqlevels
                                              deprecated. keepSeqlevels,
                                seqlevels<-, and isActiveSeq

                        were

                                              developed more
                                              or less independently, and
                                have different contracts.
                                The
                                              different
                                              contracts are confusing to
                                the user, as is creating a

                        usable

                                              help system
                                              when there are multiple
                                end points for what is a common
                                              operation. The help
                                              pages of each were
                                inadequate in different ways. And

                        because

                                              the code was
                                              developed independently,
                                support for different objects

                was

                                              inconsistent. So
                                              actually, yes, the
                                maintenance (and use) burden for the
                                              previous state of
                                              affairs was substantial.

                                              On the other hand, I agree
                                that keepSeqlevels is

                convenient

                                              as a simple
                                              wrapper around
                                seqlevels<-, in the same way that
                                setNames
                                              and names<- are
                                              both useful.

                                              So we could iterate to

                                                  keepSeqlevels <-
                                                      function(x, value,
                                ...)
                                                  {
                                                      seqlevels(x,
                                force=TRUE) <- value
                                                      x
                                                  }

                                              but I'd be less
                                enthusiastic about maintaining the

                original

                                              contract of
                                              keepSeqlevels, where
                                keepSeqlevels(gr1, gr2), would
                                have
                                              worked for two
                                              GRanges objects.

                                      Why would this be called
                                keepSeqlevels() if, depending on how

                it's

                                used,
                                      it will either add, drop, rename,
                                or permute the seqlevels?
                                      Couldn't this be called setSeqlevels?

                                I thought that new2old was necessary for
                                permuting.

                            The seqlevels setter has no 'new2old' arg.

                               You're right that

                                adding should be disallowed for
                                keepSeqlevels(). Adding seqlevels is

                        not

                                a common operation. The two common
                                operations are:
                                * Permuting, usually because the data
                                were imported in non-canonical
                                order (seqnameStyle was designed to
                                address this, no?).

                            Permuting is straightforward with seqlevels<-:

                                > seqlevels(gr)
                                [1] "chr1" "chr2" "chr3"

                                > seqlevels(gr) <- rev(seqlevels(gr))

                                > seqlevels(gr)
                                [1] "chr3" "chr2" "chr1"

                               * Subsetting, either by keeping or dropping.

                            Also straightforward:

                                > seqlevels(gr, force=TRUE) <- "chr2"

                                > seqlevels(gr)
                                [1] "chr2"

                               Two main reasons: taking a

                                small slice of the data for prototyping,
                                or removing problematic
                                chromosomes (sex, circular). This goes
                                back to bsapply and the

                exclude

                                argument.

                            There are other important use cases:

                                - Dropping seqlevels that are *not* in
                            use. This happens for
                            example
                                  when subsetting a BAM file with
                            Rsamtools::filterBam to keep
                            only
                                  the alignments located on a given
                            chromosome. The entire
                            sequence
                                  dictionary (in the header) is
                            propagated to the resulting BAM
                            file
                                  so when you read the file with
                            readGAlignments() you end up
                            with a
                                  bunch of useless seqlevels that you
                            generally start by dropping.
                                  You don't want to drop any alignment
                            so force=FALSE is your

                friend.

                        This is sort of tangential to the discussion,
                        but do you really
                        want to

                do

                        this? I would preserve the universe as given by
                        the BAM.

                                - An even more common operation is
                            renaming: 90% of the times
                            that's
                                  what I use seqlevels<- for, and that's
                            what I tell people to use
                                  when they need to rename their
                            chromosomes. Also
                            straightforward:

                                    > seqlevels(gr)
                                    [1] "chr1" "chr2" "chr3" "chrM"

                                    > seqlevels(gr) <- sub("^chr", "",
                            seqlevels(gr))
                                    > seqlevels(gr)
                                    [1] "1" "2" "3" "M"

                                    > seqlevels(gr)[seqlevels(gr) ==
                            "M"] <- "MT"
                                    > seqlevels(gr)
                                    [1] "1"  "2"  "3"  "MT"

                                  Note that this is simpler to use than
                            renameSeqlevels() which
                                  always required you to build the
                            entire right value as a named
                                  vector.

                        Sure, renaming is a common use case. Not sure
                        how I forgot that.

                        As I wrote earlier in the thread,
                        renameSeqlevels should be changed so

                as

                        not to require naming of the vector.

                            So the added-value of keepSeqlevels() seems
                            to boil down to:

                                (a) it always uses force=TRUE

                                (b) it preserves the original object and
                            returns the modified
                            object

                            If you want to restrict the use of
                            keepSeqlevels() to permuting and
                            dropping, you'll also have to disallow
                            renaming (in addition to

                adding).

                            Then its name will more or less reflect what
                            it does (if "keep" means
                            "permute" or "drop"). The final result will
                            be something that does

                less

                            than setSeqlevels() and that is not
                            necessarily easier to read: both
                            will set on the object whatever vector of
                            seqlevels they are
                            supplied,
                            except that one will fail when doing so
                            would actually result in

                adding

                            or renaming seqlevels.

                        So it looks like seqlevels<- is pretty powerful.
                        Now I see that if I
                        scroll
                        down to the examples in the man page, I find the
                        actual documentation.
                        It's
                        cool to learn that we can replace seqlevels on
                        TxDb objects. My
                        argument
                        has always been that since these low-level
                        operations are so powerful,
                        it's
                        nice to have high-level operations to clarify
                        the code. We have to
                        think
                        hard to know what the RHS will do in that
                        replacement. It's
                        probably one
                        of
                        the most complex replacement operations; far
                        more complex than the

                typical

                        one, including levels<- on factor. There's
                        nothing wrong with that;
                        it's
                        just complexity that we should be able to hide.

                        Michael

                        H.

                                Michael

                                      H.

                                          Well, I wasn't even aware of
                                that feature, so you've made

                your

                                          point about
                                          the documentation ;) Sounds
                                like a good compromise to me.

                                          Thanks for understanding,
                                          Michael

                                              Martin

                                                  Michael

                                                  On Sat, May 18, 2013
                                at 6:00 PM, Martin Morgan
                                                  <mtmorgan at fhcrc.org
                                <mailto:mtmorgan at fhcrc.org>
                                <mailto:mtmorgan at fhcrc.org
                                <mailto:mtmorgan at fhcrc.org>>

                                  <mailto:mtmorgan at fhcrc.org
                                <mailto:mtmorgan at fhcrc.org> <mailto:

                mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>

                                                  wrote:

                                                        On 05/18/2013
                                05:42 PM, Martin Morgan wrote:

                                                            Some of the
                                most common operations are
                                                  straight-forward, e.g.,

                                                                  > gr =
                                GRanges(paste0("chr",
                                c(1:22,
                                                  "X", "Y")), IRanges(1,
                                                  100))
                                                                  >
                                seqlevels(gr) = sub("chr", "ch",
                                                  seqlevels(gr))

                                                        To get a more
                                comprehensive example I should

                have

                                                  followed my own
                                                  advice and
                                                        grabbed from the
                                help page!

                                                              ## Rename:

                                  seqlevels(txdb) <- sub("chr", "",
                                                  seqlevels(txdb))

                                  seqlevels(txdb)

                                  seqlevels(txdb) <- paste0("CH",
                                seqlevels(txdb))

                                  seqlevels(txdb)

                                  seqlevels(txdb)[seqlevels(__**__**__txdb)

                ==

                                                  "CHM"] <- "M"

                                  seqlevels(txdb)

                                                        --
                                                        Computational
                                Biology / Fred Hutchinson
                                Cancer
                                                  Research Center
                                                        1100 Fairview
                                Ave. N.
                                                        PO Box 19024
                                Seattle, WA 98109

                                                        Location: Arnold
                                Building M1 B861
                                                        Phone: (206)
                                667-2793 <tel:%28206%29%20667-2793>

                <tel:%28206%29%20667-2793>

                                                  <tel:%28206%29%20667-2793>

                                              --
                                              Dr. Martin Morgan, PhD

                                              Fred Hutchinson Cancer
                                Research Center
                                              1100 Fairview Ave. N.
                                              PO Box 19024 Seattle, WA 98109

                                                   [[alternative HTML
                                version deleted]]

                                  ________________________________**___________________
                                Bioc-devel at r-project.org
                                <mailto:Bioc-devel at r-project.org>
                                <mailto:Bioc-devel at r-project.
                                <mailto:Bioc-devel at r-project.>*__*org<

                        Bioc-devel at r-project.org
                        <mailto:Bioc-devel at r-project.org>>

                                          mailing list
                                https://stat.ethz.ch/mailman/___**_listinfo/bioc-devel
                                <https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel><

                        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

                                  <https://stat.ethz.ch/mailman/__**listinfo/bioc-devel <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><

                        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>

                                      --
                                      Herv? Pag?s

                                      Program in Computational Biology
                                      Division of Public Health Sciences

                                      Fred Hutchinson Cancer Research Center
                                      1100 Fairview Ave. N, M1-B514
                                      P.O. Box 19024
                                      Seattle, WA 98109-1024

                                      E-mail: hpages at fhcrc.org
                                <mailto:hpages at fhcrc.org>
                                <mailto:hpages at fhcrc.org
                                <mailto:hpages at fhcrc.org>>
                                      Phone: (206) 667-5791
                                <tel:%28206%29%20667-5791>
                                <tel:%28206%29%20667-5791>
                                      Fax: (206) 667-1319
                                <tel:%28206%29%20667-1319>
                                <tel:%28206%29%20667-1319>

                            --
                            Herv? Pag?s

                            Program in Computational Biology
                            Division of Public Health Sciences
                            Fred Hutchinson Cancer Research Center
                            1100 Fairview Ave. N, M1-B514
                            P.O. Box 19024
                            Seattle, WA 98109-1024

                            E-mail: hpages at fhcrc.org
                            <mailto:hpages at fhcrc.org>
                            Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
                            Fax: (206) 667-1319 <tel:%28206%29%20667-1319>

                                  [[alternative HTML version deleted]]

                        _________________________________________________
                        Bioc-devel at r-project.org
                        <mailto:Bioc-devel at r-project.org> mailing list
                        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

                          [[alternative HTML version deleted]]

                _________________________________________________
                Bioc-devel at r-project.org
                <mailto:Bioc-devel at r-project.org> mailing list
                https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

                 [[alternative HTML version deleted]]

            _________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            mailing list
            https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

    --
    Herv? Pag?s

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
    Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
    Fax: (206) 667-1319 <tel:%28206%29%20667-1319>

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20130917/e9f523b9/attachment.pl>
Hi Kasper,
Nice.  I thought you had made something like this.  What would be
extremely useful is a function that just takes a GRanges and return a
fixed version - less typing, and easier to understand what goes on.
sortSeqlevels() only sorts the seqlevels. Fixing them (if by this
you mean make them adhere to some naming covention) is another story
and a more complicated one.
We tried to tackled this with the seqnameStyle() which is only a
half-baked solution at the moment. We need to revisit the current
seqnameStyle() approach. Yes in the future we could have something
like fixSeqlevels() for doing both: sorting and renaming.

Cheers,
H.
Best,
Kasper

On Tue, Sep 17, 2013 at 4:22 PM, Herv? Pag?s <hpages at fhcrc.org
<mailto:hpages at fhcrc.org>> wrote:

    OK I just made sortSeqlevels() a generic with a method for character
    vectors and a default method. So you can do sortSeqlevels() directly on
    a Seqinfo object, a GRanges object and anything that contains a Seqinfo
    component:

       > gr
       GRanges with 0 ranges and 0 metadata columns:
          seqnames    ranges strand
             <Rle> <IRanges>  <Rle>
         ---
         seqlengths:
           chr2RHet      chr4 chrUextra   chrYHet ...     chr3R
    chr2R    chrX
                 NA        NA        NA        NA ...        NA
      NA      NA

       > sortSeqlevels(gr)
       GRanges with 0 ranges and 0 metadata columns:
          seqnames    ranges strand
             <Rle> <IRanges>  <Rle>
         ---
         seqlengths:
              chr2R     chr3L     chr3R      chr4 ...   chrXHet
    chrYHet chrUextra
                 NA        NA        NA        NA ...        NA
      NA      NA

    H.

    On 09/17/2013 01:05 PM, Michael Lawrence wrote:

        Thanks, Herv?, that's great.

        Would it make sense to have a sort or sortSeqlevels method for
        Seqinfo
        and then a sortSeqlevels,ANY method that just does seqinfo(x) <-
        sort(seqinfo(x))?

        Michael

        On Tue, Sep 17, 2013 at 12:56 PM, Herv? Pag?s <hpages at fhcrc.org
        <mailto:hpages at fhcrc.org>
        <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:

             On 09/17/2013 10:58 AM, Herv? Pag?s wrote:

                 For sorting the chromosome names in "natural" order,
        you can use
                 the makeSeqnameIds() helper function in GenomicRanges:

                     seqlevels(gr) <-
        seqlevels(gr)[makeSeqnameIds(____seqlevels(gr))]

             I meant:

                seqlevels(gr) <-
             seqlevels(gr)[order(____makeSeqnameIds(seqlevels(gr)))____]

             Probably the source of Michael's confusion... sorry.

             sortSeqlevels() just added to GenomicRanges 1.13.44:

                seqlevels <- c("chr2RHet", "chr4", "chrUextra", "chrYHet",
                                "chrM", "chrXHet", "chr2LHet", "chrU",
                                "chr3L", "chr3R", "chr2R", "chrX")

             Then:

                > sortSeqlevels(seqlevels)
                 [1] "chr2R"     "chr3L"     "chr3R"     "chr4"
          "chrX" "chrU"
                 [7] "chrM"      "chr2LHet"  "chr2RHet"  "chrXHet"
        "chrYHet"
             "chrUextra"

                > sortSeqlevels(seqlevels, X.is.sexchrom=TRUE)
                 [1] "chr2R"     "chr3L"     "chr3R"     "chr4"
          "chrX" "chrU"
                 [7] "chrM"      "chr2LHet"  "chr2RHet"  "chrXHet"
        "chrYHet"
             "chrUextra"

             H.

                 It recognizes several naming conventions (chr-prefixed
        or not,
                 roman, etc...) e.g.:

                     > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
                     [1] 4 1 3 5 2

                     > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM",
        "chrII"))
                     [1] 4 1 3 5 2

                 I still need to improve the documentation of this
        function, put
                 more examples, and add unit tests. It's used internally
        by the
                 makeTranscriptDb* functions in GenomicFeatures to
        assign internal
                 ids to the chromosomes so that all the TranscriptDb
        extractors
                 can then return GRanges and GRangesList objects with
        the seqlevels
                 in the "natural" order.

                 Cheers,
                 H.

                 On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:

                     Actually, what I (and I think several others) just
        want is
                         fixSeqnames()
                     which sorts and fixes them to some convention.  I don't
                     really care which
                     one, but I want to do some easy harmonization,
        preferably in
                     line with
                     other bioc tools.  I know this is hard to do for all
                     organisms, but
                     having
                     something that deals with the major ones (human, mouse,
                     fruit fly, yeast,
                     some monkeys) would be extremely valuable.  If this
        function
                     existed, I
                     could just throw all my GRanges at it.

                     Kasper

                     On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence
                     <lawrence.michael at gene.com
        <mailto:lawrence.michael at gene.com>
        <mailto:lawrence.michael at gene.__com
        <mailto:lawrence.michael at gene.com>>

                         wrote:

                         When I hit this I usually just do:

                         seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))

                         It would be nice if there were a function for
        sorting
                         chromosome names
                         according to a specified naming convention. Like
                         sortSeqlevels(gr,
                         UCSCNaming()) or the alternative replacement
        syntax:
                         seqinfo(gr) <-
                         sort(seqinfo(gr), UCSCNaming()). Although sort
        is only
                         single-dispatch.

                         Michael

                         On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
                         <stvjc at channing.harvard.edu
        <mailto:stvjc at channing.harvard.edu>
                         <mailto:stvjc at channing.__harvard.edu
        <mailto:stvjc at channing.harvard.edu>>>__wrote:

                             I am replying to this as Michael mentions
        that this
                             is a powerful
                             operation.  Here's my
                             problem that I believe is related, but I do
        not see
                             a straightforward
                             solution

                                 bhmm19uni

                             GRanges with 559456 ranges and 4 metadata
        columns:
                                        seqnames                 ranges
        strand
                             |              name
                             score
                                           <Rle>              <IRanges>
          <Rle>
                             |       <character>
                             <numeric>
                                      1     chr1 [ 67229025,  67341224]
              *
                             | 13_Heterochrom/lo
                                   0
                                      2     chr1 [203057955, 203060154]
              *
                             |      12_Repressed
                                   0
                                      3     chr1 [  8458427,   8486226]
              *
                             | 13_Heterochrom/lo
                                   0
                                      4     chr1 [ 16904427,  16904826]
              *
                             | 5_Strong_Enhancer
                                   0
                                      5     chr1 [ 25289427,  25293426]
              *
                             |       11_Weak_Txn
                                   0
                                    ...      ...                    ...
            ...
                             ...               ...
                                 ...
                                 570766    chr22   [51100931, 51101330]
              *
                             |   2_Weak_Promoter
                                   0
                                 570767    chr22   [51101331, 51101530]
              *
                             |   6_Weak_Enhancer
                                   0
                                 570768    chr22   [51101531, 51101730]
              *
                             |   2_Weak_Promoter
                                   0
                                 570769    chr22   [51101731, 51178130]
              *
                             | 13_Heterochrom/lo
                                   0
                                 570770    chr22   [51178131, 51178330]
              *
                             | 15_Repetitive/CNV
                                   0
                                                         thick     numcode
                                                     <IRanges> <character>
                                      1 [ 67001613,  67113812]          13
                                      2 [201324578, 201326777]          12
                                      3 [  8381014,   8408813]          13
                                      4 [ 16777014,  16777413]           5
                                      5 [ 25162014,  25166013]          11
                                    ...                    ...         ...
                                 570766   [49447797, 49448196]           2
                                 570767   [49448197, 49448396]           6
                                 570768   [49448397, 49448596]           2
                                 570769   [49448597, 49524996]          13
                                 570770   [49524997, 49525196]          15
                                 ---
                                 seqlengths:
                                       chr1     chr10     chr11
          chr5 ...
                               chr2     chr21
                                chr4
                                  249250621 135534747 135006516
        180915260 ...
                             243199373  48129895
                             191154276

                             If I try to make a karyogram with ggbio, the
                             chromosomes are in an
                             unnatural order.  What is the right approach to
                             getting the seqinfo
                             in a
                             natural order with respect to chromosome
        names and
                             lengths?
                             Reassigning
                             seqlevels seems dangerous but I haven't
        experimented
                             fully.  It must
                             be a
                             common use case but I do not see it in doc.

                             On Tue, May 21, 2013 at 11:00 AM, Michael
        Lawrence <
        lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>
                             <mailto:lawrence.michael at gene.__com
        <mailto:lawrence.michael at gene.com>>> wrote:

                                 On Mon, May 20, 2013 at 10:36 PM, Herv?
        Pag?s
                                 <hpages at fhcrc.org
        <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
        <mailto:hpages at fhcrc.org>>>

                                 wrote:

                                     Michael,

                                     On 05/20/2013 08:44 PM, Michael
        Lawrence wrote:

                                         On Mon, May 20, 2013 at 4:13
        PM, Herv?
                                         Pag?s <hpages at fhcrc.org
        <mailto:hpages at fhcrc.org>
                                         <mailto:hpages at fhcrc.org
        <mailto:hpages at fhcrc.org>>
                                         <mailto:hpages at fhcrc.org
        <mailto:hpages at fhcrc.org>

                                         <mailto:hpages at fhcrc.org
        <mailto:hpages at fhcrc.org>>>> wrote:

                                               Hi,

                                               On 05/20/2013 03:15 PM,
        Michael
                                         Lawrence wrote:

                                                   On Mon, May 20, 2013
        at 3:11
                                         PM, Martin Morgan
                                                   <mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>
                                         <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>>
                                         <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>
                                         <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>>>> wrote:

                                                       Hi Michael --

                                                       On 5/20/2013 5:56 AM,
                                         Michael Lawrence wrote:

                                                           While it's
        cool that
                                         seqlevels<- has become so

                                 flexible,

                                                           I still claim
                                                           that

        rename/keep/drop would
                                         be a lot easier to read,

                         because

                                                           they describe the
                                                           high-level
        operation,
                                         and there's no reason to

                                 deprecate

                                                           them. They're
                                                           also
                                                           easier to
        remember.
                                         For example, for dropping with
                                                           seqlevels<-,
        the user
                                                           needs
                                                           to remember that
                                         "force" is necessary to drop. Much
                                                           easier to
        just say
                                                           "dropSeqlevels(),
                                         please". And reimplementing
                                                           keepSeqlevels
        is still too
                                                           complicated.
        Is it
                                         such a maintenance burden to
                                         have
                                                           these simple
                                                           wrappers sit
                                                           on top of the
                                         low-level manipulators?

                                                           Another
        suggestion:
                                         renameSeqlevels should not

                         require

                                 a

                                                           named vector (in
                                                           cases
                                                           where it is
        unnamed,
                                         require that it is parallel to

                         the

                                                           existing
                                                           seqlevels, as
                                                           with
        seqlevels<-).

                                                       I didn't really
        indicate
                                         what drove my desire to see
                                                       keepSeqlevels
                                                       deprecated.
        keepSeqlevels,
                                         seqlevels<-, and isActiveSeq

                                 were

                                                       developed more
                                                       or less
        independently, and
                                         have different contracts.
                                         The
                                                       different
                                                       contracts are
        confusing to
                                         the user, as is creating a

                                 usable

                                                       help system
                                                       when there are
        multiple
                                         end points for what is a common
                                                       operation. The help
                                                       pages of each were
                                         inadequate in different ways. And

                                 because

                                                       the code was
                                                       developed
        independently,
                                         support for different objects

                         was

                                                       inconsistent. So
                                                       actually, yes, the
                                         maintenance (and use) burden
        for the
                                                       previous state of
                                                       affairs was
        substantial.

                                                       On the other
        hand, I agree
                                         that keepSeqlevels is

                         convenient

                                                       as a simple
                                                       wrapper around
                                         seqlevels<-, in the same way that
                                         setNames
                                                       and names<- are
                                                       both useful.

                                                       So we could
        iterate to

                                                           keepSeqlevels <-

        function(x, value,
                                         ...)
                                                           {
                                                               seqlevels(x,
                                         force=TRUE) <- value
                                                               x
                                                           }

                                                       but I'd be less
                                         enthusiastic about maintaining the

                         original

                                                       contract of
                                                       keepSeqlevels, where
                                         keepSeqlevels(gr1, gr2), would
                                         have
                                                       worked for two
                                                       GRanges objects.

                                               Why would this be called
                                         keepSeqlevels() if, depending
        on how

                         it's

                                         used,
                                               it will either add, drop,
        rename,
                                         or permute the seqlevels?
                                               Couldn't this be called
        setSeqlevels?

                                         I thought that new2old was
        necessary for
                                         permuting.

                                     The seqlevels setter has no
        'new2old' arg.

                                        You're right that

                                         adding should be disallowed for
                                         keepSeqlevels(). Adding
        seqlevels is

                                 not

                                         a common operation. The two common
                                         operations are:
                                         * Permuting, usually because
        the data
                                         were imported in non-canonical
                                         order (seqnameStyle was designed to
                                         address this, no?).

                                     Permuting is straightforward with
        seqlevels<-:

                                         > seqlevels(gr)
                                         [1] "chr1" "chr2" "chr3"

                                         > seqlevels(gr) <-
        rev(seqlevels(gr))

                                         > seqlevels(gr)
                                         [1] "chr3" "chr2" "chr1"

                                        * Subsetting, either by keeping
        or dropping.

                                     Also straightforward:

                                         > seqlevels(gr, force=TRUE) <-
        "chr2"

                                         > seqlevels(gr)
                                         [1] "chr2"

                                        Two main reasons: taking a

                                         small slice of the data for
        prototyping,
                                         or removing problematic
                                         chromosomes (sex, circular).
        This goes
                                         back to bsapply and the

                         exclude

                                         argument.

                                     There are other important use cases:

                                         - Dropping seqlevels that are
        *not* in
                                     use. This happens for
                                     example
                                           when subsetting a BAM file with
                                     Rsamtools::filterBam to keep
                                     only
                                           the alignments located on a given
                                     chromosome. The entire
                                     sequence
                                           dictionary (in the header) is
                                     propagated to the resulting BAM
                                     file
                                           so when you read the file with
                                     readGAlignments() you end up
                                     with a
                                           bunch of useless seqlevels
        that you
                                     generally start by dropping.
                                           You don't want to drop any
        alignment
                                     so force=FALSE is your

                         friend.

                                 This is sort of tangential to the
        discussion,
                                 but do you really
                                 want to

                         do

                                 this? I would preserve the universe as
        given by
                                 the BAM.

                                         - An even more common operation is
                                     renaming: 90% of the times
                                     that's
                                           what I use seqlevels<- for,
        and that's
                                     what I tell people to use
                                           when they need to rename their
                                     chromosomes. Also
                                     straightforward:

                                             > seqlevels(gr)
                                             [1] "chr1" "chr2" "chr3" "chrM"

                                             > seqlevels(gr) <-
        sub("^chr", "",
                                     seqlevels(gr))
                                             > seqlevels(gr)
                                             [1] "1" "2" "3" "M"

                                             >
        seqlevels(gr)[seqlevels(gr) ==
                                     "M"] <- "MT"
                                             > seqlevels(gr)
                                             [1] "1"  "2"  "3"  "MT"

                                           Note that this is simpler to
        use than
                                     renameSeqlevels() which
                                           always required you to build the
                                     entire right value as a named
                                           vector.

                                 Sure, renaming is a common use case.
        Not sure
                                 how I forgot that.

                                 As I wrote earlier in the thread,
                                 renameSeqlevels should be changed so

                         as

                                 not to require naming of the vector.

                                     So the added-value of
        keepSeqlevels() seems
                                     to boil down to:

                                         (a) it always uses force=TRUE

                                         (b) it preserves the original
        object and
                                     returns the modified
                                     object

                                     If you want to restrict the use of
                                     keepSeqlevels() to permuting and
                                     dropping, you'll also have to disallow
                                     renaming (in addition to

                         adding).

                                     Then its name will more or less
        reflect what
                                     it does (if "keep" means
                                     "permute" or "drop"). The final
        result will
                                     be something that does

                         less

                                     than setSeqlevels() and that is not
                                     necessarily easier to read: both
                                     will set on the object whatever
        vector of
                                     seqlevels they are
                                     supplied,
                                     except that one will fail when doing so
                                     would actually result in

                         adding

                                     or renaming seqlevels.

                                 So it looks like seqlevels<- is pretty
        powerful.
                                 Now I see that if I
                                 scroll
                                 down to the examples in the man page, I
        find the
                                 actual documentation.
                                 It's
                                 cool to learn that we can replace
        seqlevels on
                                 TxDb objects. My
                                 argument
                                 has always been that since these low-level
                                 operations are so powerful,
                                 it's
                                 nice to have high-level operations to
        clarify
                                 the code. We have to
                                 think
                                 hard to know what the RHS will do in that
                                 replacement. It's
                                 probably one
                                 of
                                 the most complex replacement
        operations; far
                                 more complex than the

                         typical

                                 one, including levels<- on factor. There's
                                 nothing wrong with that;
                                 it's
                                 just complexity that we should be able
        to hide.

                                 Michael

                                 H.

                                         Michael

                                               H.

                                                   Well, I wasn't even
        aware of
                                         that feature, so you've made

                         your

                                                   point about
                                                   the documentation ;)
        Sounds
                                         like a good compromise to me.

                                                   Thanks for understanding,
                                                   Michael

                                                       Martin

                                                           Michael

                                                           On Sat, May
        18, 2013
                                         at 6:00 PM, Martin Morgan

        <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                                         <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>>
                                         <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>
                                         <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>>>

                                           <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>
                                         <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>> <mailto:

        mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
        <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>

                                                           wrote:

                                                                 On
        05/18/2013
                                         05:42 PM, Martin Morgan wrote:

        Some of the
                                         most common operations are

        straight-forward, e.g.,

           > gr =
                                         GRanges(paste0("chr",
                                         c(1:22,
                                                           "X", "Y")),
        IRanges(1,
                                                           100))
                                                                           >
                                         seqlevels(gr) = sub("chr", "ch",
                                                           seqlevels(gr))

                                                                 To get
        a more
                                         comprehensive example I should

                         have

                                                           followed my own
                                                           advice and
                                                                 grabbed
        from the
                                         help page!

        ## Rename:

                                           seqlevels(txdb) <- sub("chr", "",
                                                           seqlevels(txdb))

                                           seqlevels(txdb)

                                           seqlevels(txdb) <- paste0("CH",
                                         seqlevels(txdb))

                                           seqlevels(txdb)

        seqlevels(txdb)[seqlevels(__**____**__txdb)

                         ==

                                                           "CHM"] <- "M"

                                           seqlevels(txdb)

                                                                 --

        Computational
                                         Biology / Fred Hutchinson
                                         Cancer
                                                           Research Center
                                                                 1100
        Fairview
                                         Ave. N.
                                                                 PO Box
        19024
                                         Seattle, WA 98109

        Location: Arnold
                                         Building M1 B861
                                                                 Phone:
        (206)
                                         667-2793 <tel:%28206%29%20667-2793>

                         <tel:%28206%29%20667-2793>

        <tel:%28206%29%20667-2793>

                                                       --
                                                       Dr. Martin
        Morgan, PhD

                                                       Fred Hutchinson
        Cancer
                                         Research Center
                                                       1100 Fairview Ave. N.
                                                       PO Box 19024
        Seattle, WA 98109

          [[alternative HTML
                                         version deleted]]

        __________________________________**___________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>

        <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>
                                         <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>.
                                         <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>.>__*__*org<

        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
                                 <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>>

                                                   mailing list
        https://stat.ethz.ch/mailman/_____**_listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/___**_listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/___**_listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel>><

        https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

        <https://stat.ethz.ch/mailman/____**listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__**listinfo/bioc-devel>
        <https://stat.ethz.ch/mailman/__**listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/**listinfo/bioc-devel>><

        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

                                               --
                                               Herv? Pag?s

                                               Program in Computational
        Biology
                                               Division of Public Health
        Sciences

                                               Fred Hutchinson Cancer
        Research Center
                                               1100 Fairview Ave. N, M1-B514
                                               P.O. Box 19024
                                               Seattle, WA 98109-1024

                                               E-mail: hpages at fhcrc.org
        <mailto:hpages at fhcrc.org>
                                         <mailto:hpages at fhcrc.org
        <mailto:hpages at fhcrc.org>>
                                         <mailto:hpages at fhcrc.org
        <mailto:hpages at fhcrc.org>

                                         <mailto:hpages at fhcrc.org
        <mailto:hpages at fhcrc.org>>>
                                               Phone: (206) 667-5791
        <tel:%28206%29%20667-5791>
                                         <tel:%28206%29%20667-5791>
                                         <tel:%28206%29%20667-5791>
                                               Fax: (206) 667-1319
        <tel:%28206%29%20667-1319>
                                         <tel:%28206%29%20667-1319>
                                         <tel:%28206%29%20667-1319>

                                     --
                                     Herv? Pag?s

                                     Program in Computational Biology
                                     Division of Public Health Sciences
                                     Fred Hutchinson Cancer Research Center
                                     1100 Fairview Ave. N, M1-B514
                                     P.O. Box 19024
                                     Seattle, WA 98109-1024

                                     E-mail: hpages at fhcrc.org
        <mailto:hpages at fhcrc.org>
                                     <mailto:hpages at fhcrc.org
        <mailto:hpages at fhcrc.org>>
                                     Phone: (206) 667-5791
        <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791>
                                     Fax: (206) 667-1319
        <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>

                                           [[alternative HTML version
        deleted]]

        ___________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
                                 <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>> mailing list
        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>

                                   [[alternative HTML version deleted]]

                         ___________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
                         <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>> mailing list
        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>

                          [[alternative HTML version deleted]]

                     ___________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>
                     mailing list
        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
                     <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>

             --
             Herv? Pag?s

             Program in Computational Biology
             Division of Public Health Sciences
             Fred Hutchinson Cancer Research Center
             1100 Fairview Ave. N, M1-B514
             P.O. Box 19024
             Seattle, WA 98109-1024

             E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
        <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
             Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
        <tel:%28206%29%20667-5791>
             Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
        <tel:%28206%29%20667-1319>

    --
    Herv? Pag?s

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
    Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
    Fax: (206) 667-1319 <tel:%28206%29%20667-1319>

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20130918/28b8095a/attachment.pl>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20130918/6d570d27/attachment.pl>