[Bioc-devel] Best object structure for representing a pairwise genome alignment ?

3 messages · Charles Plessy, Vincent Carey, Hervé Pagès

#
Dear Bioc developers,

I am currently analysing pairwise genome alignments with Bioconductor, 
and I represent them with a GRanges object on the first genome, 
containing one element per alignment block, and storing the coordinates 
on the other genome in a metadata column containing another GRanges object.

Something like this.

GRanges object with 36582 ranges and 2 metadata columns:
           seqnames      ranges strand |     score                query
              <Rle>   <IRanges>  <Rle> | <numeric>            <GRanges>
       [1]       S1     162-550      + |       861    XSR:909374-909853
       [2]       S1    833-3738      + |      7238    XSR:910181-913291
       [3]       S1   3769-4212      + |      1165    XSR:913510-913953
       [4]       S1   4246-4381      + |       359    XSR:914134-914275
       [5]       S1   4532-5990      + |      2977 chr2:6694031-6695569
       ...      ...         ...    ... .       ...                  ...
   [36578]      S99 17228-17759      - |       793 chr1:2375870-2376379
   [36579]      S99 16417-16935      - |       632 chr1:2376612-2377077
   [36580]      S99 12370-12759      - |       773 chr1:2379949-2380343
   [36581]      S99   5270-5384      - |       295   chr1:843397-843511
   [36582]      S99   1949-3053      - |      2105   chr1:845358-846326
   -------
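
For readers who want to reproduce this layout, here is a minimal sketch of how such an object might be built. It assumes only that the GenomicRanges package is installed; the variable names `target` and `query` are illustrative, and the three rows are copied from the printout above rather than taken from any real pipeline.

```r
library(GenomicRanges)

# Alignment blocks on the first genome (first three rows of the printout).
target <- GRanges(
  seqnames = c("S1", "S1", "S1"),
  ranges   = IRanges(start = c(162, 833, 3769), end = c(550, 3738, 4212)),
  strand   = "+"
)

# Matching blocks on the second genome, in the same order.
query <- GRanges(
  seqnames = "XSR",
  ranges   = IRanges(start = c(909374, 910181, 913510),
                     end   = c(909853, 913291, 913953))
)

# Metadata columns: a numeric score and the paired GRanges object.
target$score <- c(861, 7238, 1165)
target$query <- query

# Subsetting keeps each block together with its partner.
second_block <- target[2]
start(second_block$query)
```

Because the second genome lives in a metadata column, operations that act on the ranges (overlaps, sorting, subsetting) are driven by the first genome only, and the paired coordinates simply follow along.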

Using "Pairwise genome alignment" as a keyword in a search engine, I 
found that the CNEr package does something similar, although it 
uses a dedicated "GRangePairs" object for the purpose.

Before I start to invest time in either direction, I wanted to check on 
this mailing list whether other solutions already exist, in 
particular closer to the core packages.

Have a nice day,

Charles
#
Starting from

PairwiseAlignments-class      package:Biostrings       R Documentation

PairwiseAlignments, PairwiseAlignmentsSingleSubject, and
PairwiseAlignmentsSingleSubjectSummary objects

Description:

     The 'PairwiseAlignments' class is a container for storing a set of
     pairwise alignments.

     The 'PairwiseAlignmentsSingleSubject' class is a container for
     storing a set of pairwise alignments with a single subject.

     The 'PairwiseAlignmentsSingleSubjectSummary' class is a container
     for storing the summary of a set of pairwise alignments.

Usage:

     ## Constructors:
     ## When subject is missing, pattern must be of length 2
     ## S4 method for signature 'XString,XString'
     PairwiseAlignments(pattern, subject,
       type = "global", substitutionMatrix = NULL,
       gapOpening = 0, gapExtension = 1)
     ## S4 method for signature 'XStringSet,missing'
     PairwiseAlignments(pattern, subject,
       type = "global", substitutionMatrix = NULL,
       gapOpening = 0, gapExtension = 1)
     ## S4 method for signature 'character,character'
     PairwiseAlignments(pattern, subject,
       type = "global", substitutionMatrix = NULL,
       gapOpening = 0, gapExtension = 1,
       baseClass = "BString")

...

my question would be whether this is a relevant starting place. Clearly
the focus is not on coordinates, but perhaps a structure that maintains
genomic content and coordinates together would be of use?


On Fri, Sep 18, 2020 at 2:49 AM Charles Plessy <charles.plessy at oist.jp>
wrote:

  
    
3 days later
#
Hi Charles, Vince,

Yes, a PairwiseAlignments object will contain the sequences of the 2 
genomes being aligned, so it will be big. This could be mitigated by using 
one object per chromosome instead of trying to represent the full genome 
alignment in a single object, but then you lose the ability to 
represent regions that align across chromosomes.

Other downsides of using PairwiseAlignments are:
- You lose the nice/simple block-to-block mapping that GRangePairs 
gives you, together with the easy/straightforward way to annotate the 
links between blocks (via the metadata columns of the GRangePairs).
- A PairwiseAlignments object can only represent replacements and indels 
while the block-to-block mapping in a GRangePairs object can support 
rearrangements (in addition to indels and replacements).
- The GRangePairs approach even allows you to represent a many-to-many 
relationship between the blocks/regions of the 2 genomes, something that 
a PairwiseAlignments-based approach cannot do.

So the GRangePairs approach seems more flexible.

Maybe a better way to support an arbitrary relationship between the 
blocks/regions of the 2 genomes would be to use a 3-slot data structure: 
2 slots for 2 GRanges objects defining regions on the 2 genomes + 1 slot 
for representing the links between the regions defined on each genome 
(these links could be stored in a Hits object). Note that this is a 
classic bipartite graph. This would particularly make sense if the mapping 
between the regions is expected to be many-to-many. This kind of 
container would be able to represent a side-by-side comparison of 2 
arbitrary genomes in its most general form, not just a pairwise genome 
alignment, which is more restrictive.
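
A rough sketch of what such a 3-slot container might look like follows. The class name `GenomePairLinks`, its slot names, and all coordinates are hypothetical and exist only for illustration; this is not an existing Bioconductor class. It relies on `GRanges()` from GenomicRanges and on `Hits()`, `from()`, `to()`, `nLnode()` and `nRnode()` from S4Vectors.

```r
library(GenomicRanges)
library(S4Vectors)

# Hypothetical container: regions on each genome plus the links between them.
setClass("GenomePairLinks",
  representation(target = "GRanges", query = "GRanges", links = "Hits"),
  validity = function(object) {
    # The Hits node counts must match the number of regions on each side.
    if (nLnode(object@links) != length(object@target))
      return("left nodes of 'links' must match length of 'target'")
    if (nRnode(object@links) != length(object@query))
      return("right nodes of 'links' must match length of 'query'")
    TRUE
  }
)

# Toy regions (made-up coordinates).
target <- GRanges("S1", IRanges(start = c(100, 500, 900), width = 200))
query  <- GRanges("chr1", IRanges(start = c(1000, 5000), width = 200))

# Many-to-many links: target 1 -> query 1, target 2 -> query 1 and 2,
# target 3 -> query 2.
links <- Hits(from = c(1L, 2L, 2L, 3L), to = c(1L, 1L, 2L, 2L),
              nLnode = length(target), nRnode = length(query))

x <- new("GenomePairLinks", target = target, query = query, links = links)

# Regions of the second genome linked to the second target region.
linked_to_2 <- x@query[to(x@links)[from(x@links) == 2L]]
```

Annotations on individual links (scores, orientation and so on) could then go in the metadata columns of the Hits object, keeping the two sets of regions free of duplicated rows.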

Cheers,
H.
On 9/18/20 02:41, Vincent Carey wrote: