[Bioc-devel] GAlignments Sorting Causes C Stack Error
Hi,
No ordering was defined until now for GAlignments. Starting with
GAlignments 1.11.7, the elements of a GAlignments object 'x' are
ordered based on the order of the elements in 'granges(x)', that
is, by chromosome, then by strand, then by start, then by end.
Note that I chose a simple ordering definition for GAlignments objects
but it has the following important caveat: == does NOT look at the
cigar. That means that 2 elements in a GAlignments object are considered
equal even if their cigars are different:
> gal <- GAlignments(Rle(factor("chr1"), 2), c(5L, 5L), c("10M",
"5M2I5M"), Rle(strand("+"), 2))
> gal
GAlignments object with 2 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end width
<Rle> <Rle> <character> <integer> <integer> <integer> <integer>
[1] chr1 + 10M 10 5 14 10
[2] chr1 + 5M2I5M 12 5 14 10
njunc
<integer>
[1] 0
[2] 0
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> gal[1] == gal
[1] TRUE TRUE
Another ordering definition would be an ordering that looks at
chromosome/strand/start/end/cigar instead of just
chromosome/strand/start/end (i.e. the cigar is used to break ties when
looking at
chromosome/strand/start/end only leads to a tie). That would address
the above caveat but it feels weird to use lexicographic ordering of
the cigars in order to break ties. So I'm kind of hesitant to do this.
Anyway now <=, <, ==, !=, pcompare(), order(), sort(), rank(), etc...
work.
Still no match(), selfmatch(), duplicated(), or unique() for GAlignments
objects. Note that these things should adhere to the same notion of
equality as == so they would also ignore the cigar right now if we
wanted to have them. Maybe that's another argument in favor of an
ordering based on chromosome/strand/start/end/cigar.
H.
On 01/08/2017 05:00 PM, Dario Strbenac wrote:
Good day, When sort is used on a GAlignments object, a stack error is shown, no matter how small the object is.
testAlignments
GAlignments object with 3 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end width njunc
<Rle> <Rle> <character> <integer> <integer> <integer> <integer> <integer>
700666F:126:C8768ANXX:3:2204:3175:99484 chr14 + 71S27M 98 18386040 18386066 27 0
700666F:126:C8768ANXX:1:1107:8115:31928 chr14 + 40S60M 100 18915005 18915064 60 0
700666F:126:C8768ANXX:1:2206:7564:34686 chr14 + 40S50M 90 18915005 18915054 50 0
-------
seqinfo: 23 sequences from an unspecified genome
sort(testAlignments)
Error: C stack usage 7970544 is too close to the limit I use up-to-date packages.
sessionInfo()
R version 3.3.2 (2016-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 8 (jessie) locale: [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 [6] LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicAlignments_1.10.0 Rsamtools_1.26.1 Biostrings_2.42.1 XVector_0.14.0 [5] SummarizedExperiment_1.4.0 Biobase_2.34.0 GenomicRanges_1.26.2 GenomeInfoDb_1.10.1 [9] IRanges_2.8.1 S4Vectors_0.12.1 BiocGenerics_0.20.0 loaded via a namespace (and not attached): [1] lattice_0.20-34 bitops_1.0-6 grid_3.3.2 zlibbioc_1.20.0 Matrix_1.2-7.1 BiocParallel_1.8.1 [7] tools_3.3.2 -------------------------------------- Dario Strbenac University of Sydney Camperdown NSW 2050 Australia
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319