Hi, I have encountered myself in a strange situation when using the function locateVariants() from VariantAnnotation with an input VRanges object. The problem is that some of the expected coding annotations are not showing up when using locateVariants() with default parameters. After investigating this situation I think I found the reason, which does not look like a bug but I would like that you give me some clarification about the logic behind using locateVariants() with VRanges objects. The documentation of the VRanges-class says that in this class of objects "The strand is always constrained to be positive (+).". I guess there may be a good reason for this but I could not find it in the documentation or googling about it. This means that when you coerce a CollapsedVCF object (obtained, for example, from a VCF file via readVcf()) to a VRanges object, even though variants in the VCF may have no strand, they get a positive strand in the VRanges object. The problem arises then, when you call locateVariants() with this VRanges object, because features on the negative strand are never going to overlap with the variants since, by default, the argument ignore.strand=FALSE. Let me illustrate this with a toy example. Consider the SNP rs1129038 (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at chr15:28356859 with allels A/G. It is located on the 3' UTR of the gene HERC2 coded on the negative strand of the human reference genome. Let's build a toy VRanges object having this variant: library(VariantAnnotation) vr <- VRanges(seqnames="chr15", ranges=IRanges(28356859, 28356859), ref="A", alt="G", refDepth=5, altDepth=7, totalDepth=12, sampleNames="A") strand(vr) factor-Rle of length 1 with 1 run Lengths: 1 Values : + Levels(3): + - * Let's build now its CollapsedVCF counterpart by using the corresponding coercion method and set the strand to "*": vcf <- asVCF(vr) strand(vcf) <- "*" Now run locateVariants() on both objects with UCSC annotations: library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene locateVariants(vcf, txdb, region=AllVariants()) GRanges object with 2 ranges and 9 metadata columns: seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <character> <IntegerList> [1] chr15 [28356859, 28356859] * | threeUTR 50 50 1 55386 [2] chr15 [28356859, 28356859] * | threeUTR 50 50 1 55387 GENEID PRECEDEID FOLLOWID <character> <CharacterList> <CharacterList> [1] 8924 [2] 8924 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths locateVariants(vr, txdb, region=AllVariants()) GRanges object with 1 range and 9 metadata columns: seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <integer> <IntegerList> [1] chr15 [28356859, 28356859] + | intergenic <NA> <NA> 1 <NA> GENEID PRECEDEID FOLLOWID <character> <CharacterList> <CharacterList> [1] <NA> 100132565,100289656,100616223,... 2567 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths Note that while we get the 3' UTR annotation from the strandless VCF object we do not get it from the VRanges object with the positive strand. To make my point clear: this positive strand shows up when you coerce a strandless VCF object to a VRanges one, because positive strandness seems to be the convention for VRanges objects: as(vcf, VRanges) VRanges object with 1 range and 1 metadata column: seqnames ranges strand ref alt totalDepth refDepth altDepth <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle> <integerOrRle> <integerOrRle> [1] chr15 [28356859, 28356859] + A G 12 5 7 sampleNames softFilterMatrix | QUAL <factorOrRle> <matrix> | <numeric> [1] A | <NA> ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths hardFilters: NULL Of course, if I run locateVariants() with the argument ignore.strand=TRUE, then I get the expected annotation: locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE) GRanges object with 2 ranges and 9 metadata columns: seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <character> <IntegerList> [1] chr15 [28356859, 28356859] + | threeUTR 677 677 1 55386 [2] chr15 [28356859, 28356859] + | threeUTR 677 677 1 55387 GENEID PRECEDEID FOLLOWID <character> <CharacterList> <CharacterList> [1] 8924 [2] 8924 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths So, my question is, given that VRanges objects are enforced to have a positive strand, would not be better to have ignore.strand=TRUE as default in locateVariants? Alternatively, I would recommend that locateVariants() issues a warning, or maybe an error, when the input object is VRanges and ignore.strand=FALSE. Finally, out of curiosity, why a VRanges object enforces the positive strand in all its genomic ranges? Would not be better just taking the strand of the CollapsedVCF object when coercing the CollapsedVCF object to VRanges? thanks!! robert.
[Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness
3 messages · Robert Castelo, Michael Lawrence
This changed recently. VariantAnnotation in devel no longer enforces a strand on VRanges, or at least it allows the "*" case. On Fri, May 22, 2015 at 11:33 AM, Robert Castelo <robert.castelo at upf.edu> wrote:
Hi, I have encountered myself in a strange situation when using the function locateVariants() from VariantAnnotation with an input VRanges object. The problem is that some of the expected coding annotations are not showing up when using locateVariants() with default parameters. After investigating this situation I think I found the reason, which does not look like a bug but I would like that you give me some clarification about the logic behind using locateVariants() with VRanges objects. The documentation of the VRanges-class says that in this class of objects "The strand is always constrained to be positive (+).". I guess there may be a good reason for this but I could not find it in the documentation or googling about it. This means that when you coerce a CollapsedVCF object (obtained, for example, from a VCF file via readVcf()) to a VRanges object, even though variants in the VCF may have no strand, they get a positive strand in the VRanges object. The problem arises then, when you call locateVariants() with this VRanges object, because features on the negative strand are never going to overlap with the variants since, by default, the argument ignore.strand=FALSE. Let me illustrate this with a toy example. Consider the SNP rs1129038 ( http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at chr15:28356859 with allels A/G. It is located on the 3' UTR of the gene HERC2 coded on the negative strand of the human reference genome. Let's build a toy VRanges object having this variant: library(VariantAnnotation) vr <- VRanges(seqnames="chr15", ranges=IRanges(28356859, 28356859), ref="A", alt="G", refDepth=5, altDepth=7, totalDepth=12, sampleNames="A") strand(vr) factor-Rle of length 1 with 1 run Lengths: 1 Values : + Levels(3): + - * Let's build now its CollapsedVCF counterpart by using the corresponding coercion method and set the strand to "*": vcf <- asVCF(vr) strand(vcf) <- "*" Now run locateVariants() on both objects with UCSC annotations: library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene locateVariants(vcf, txdb, region=AllVariants()) GRanges object with 2 ranges and 9 metadata columns: seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <character> <IntegerList> [1] chr15 [28356859, 28356859] * | threeUTR 50 50 1 55386 [2] chr15 [28356859, 28356859] * | threeUTR 50 50 1 55387 GENEID PRECEDEID FOLLOWID <character> <CharacterList> <CharacterList> [1] 8924 [2] 8924 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths locateVariants(vr, txdb, region=AllVariants()) GRanges object with 1 range and 9 metadata columns: seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <integer> <IntegerList> [1] chr15 [28356859, 28356859] + | intergenic <NA> <NA> 1 <NA> GENEID PRECEDEID FOLLOWID <character> <CharacterList> <CharacterList> [1] <NA> 100132565,100289656,100616223,... 2567 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths Note that while we get the 3' UTR annotation from the strandless VCF object we do not get it from the VRanges object with the positive strand. To make my point clear: this positive strand shows up when you coerce a strandless VCF object to a VRanges one, because positive strandness seems to be the convention for VRanges objects: as(vcf, VRanges) VRanges object with 1 range and 1 metadata column: seqnames ranges strand ref alt totalDepth refDepth altDepth <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle> <integerOrRle> <integerOrRle> [1] chr15 [28356859, 28356859] + A G 12 5 7 sampleNames softFilterMatrix | QUAL <factorOrRle> <matrix> | <numeric> [1] A | <NA> ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths hardFilters: NULL Of course, if I run locateVariants() with the argument ignore.strand=TRUE, then I get the expected annotation: locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE) GRanges object with 2 ranges and 9 metadata columns: seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <character> <IntegerList> [1] chr15 [28356859, 28356859] + | threeUTR 677 677 1 55386 [2] chr15 [28356859, 28356859] + | threeUTR 677 677 1 55387 GENEID PRECEDEID FOLLOWID <character> <CharacterList> <CharacterList> [1] 8924 [2] 8924 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths So, my question is, given that VRanges objects are enforced to have a positive strand, would not be better to have ignore.strand=TRUE as default in locateVariants? Alternatively, I would recommend that locateVariants() issues a warning, or maybe an error, when the input object is VRanges and ignore.strand=FALSE. Finally, out of curiosity, why a VRanges object enforces the positive strand in all its genomic ranges? Would not be better just taking the strand of the CollapsedVCF object when coercing the CollapsedVCF object to VRanges? thanks!! robert.
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
ok, thanks for the clarification, I was using release and did not try devel. actually, this also affects the function predictCoding() and there, using ignore.strand=TRUE is not appropriate, so my workaround by now in release is to coerce the input VRanges to GRanges and set strand to '*' before calling locateVariants() or predictCoding(). Just mentioning in case somebody encountered the same situation in release. cheers, robert.
On 05/22/2015 09:49 PM, Michael Lawrence wrote:
This changed recently. VariantAnnotation in devel no longer enforces a
strand on VRanges, or at least it allows the "*" case.
On Fri, May 22, 2015 at 11:33 AM, Robert Castelo <robert.castelo at upf.edu
<mailto:robert.castelo at upf.edu>> wrote:
Hi,
I have encountered myself in a strange situation when using the
function locateVariants() from VariantAnnotation with an input
VRanges object. The problem is that some of the expected coding
annotations are not showing up when using locateVariants() with
default parameters.
After investigating this situation I think I found the reason, which
does not look like a bug but I would like that you give me some
clarification about the logic behind using locateVariants() with
VRanges objects.
The documentation of the VRanges-class says that in this class of
objects "The strand is always constrained to be positive (+).". I
guess there may be a good reason for this but I could not find it in
the documentation or googling about it.
This means that when you coerce a CollapsedVCF object (obtained, for
example, from a VCF file via readVcf()) to a VRanges object, even
though variants in the VCF may have no strand, they get a positive
strand in the VRanges object.
The problem arises then, when you call locateVariants() with this
VRanges object, because features on the negative strand are never
going to overlap with the variants since, by default, the argument
ignore.strand=FALSE.
Let me illustrate this with a toy example. Consider the SNP
rs1129038
(http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at
chr15:28356859 with allels A/G. It is located on the 3' UTR of the
gene HERC2 coded on the negative strand of the human reference
genome. Let's build a toy VRanges object having this variant:
library(VariantAnnotation)
vr <- VRanges(seqnames="chr15",
ranges=IRanges(28356859, 28356859),
ref="A", alt="G",
refDepth=5, altDepth=7,
totalDepth=12, sampleNames="A")
strand(vr)
factor-Rle of length 1 with 1 run
Lengths: 1
Values : +
Levels(3): + - *
Let's build now its CollapsedVCF counterpart by using the
corresponding coercion method and set the strand to "*":
vcf <- asVCF(vr)
strand(vcf) <- "*"
Now run locateVariants() on both objects with UCSC annotations:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
locateVariants(vcf, txdb, region=AllVariants())
GRanges object with 2 ranges and 9 metadata columns:
seqnames ranges strand | LOCATION LOCSTART
LOCEND QUERYID TXID CDSID
<Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
<character> <IntegerList>
[1] chr15 [28356859, 28356859] * | threeUTR 50 50
1 55386
[2] chr15 [28356859, 28356859] * | threeUTR 50 50
1 55387
GENEID PRECEDEID FOLLOWID
<character> <CharacterList> <CharacterList>
[1] 8924
[2] 8924
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
locateVariants(vr, txdb, region=AllVariants())
GRanges object with 1 range and 9 metadata columns:
seqnames ranges strand | LOCATION LOCSTART
LOCEND QUERYID TXID CDSID
<Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
<integer> <IntegerList>
[1] chr15 [28356859, 28356859] + | intergenic <NA> <NA>
1 <NA>
GENEID PRECEDEID FOLLOWID
<character> <CharacterList> <CharacterList>
[1] <NA> 100132565,100289656,100616223,... 2567
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Note that while we get the 3' UTR annotation from the strandless VCF
object we do not get it from the VRanges object with the positive
strand. To make my point clear: this positive strand shows up when
you coerce a strandless VCF object to a VRanges one, because
positive strandness seems to be the convention for VRanges objects:
as(vcf, VRanges)
VRanges object with 1 range and 1 metadata column:
seqnames ranges strand ref
alt totalDepth refDepth altDepth
<Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
<integerOrRle> <integerOrRle>
[1] chr15 [28356859, 28356859] + A
G 12 5 7
sampleNames softFilterMatrix | QUAL
<factorOrRle> <matrix> | <numeric>
[1] A | <NA>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
hardFilters: NULL
Of course, if I run locateVariants() with the argument
ignore.strand=TRUE, then I get the expected annotation:
locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE)
GRanges object with 2 ranges and 9 metadata columns:
seqnames ranges strand | LOCATION LOCSTART
LOCEND QUERYID TXID CDSID
<Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
<character> <IntegerList>
[1] chr15 [28356859, 28356859] + | threeUTR 677
677 1 55386
[2] chr15 [28356859, 28356859] + | threeUTR 677
677 1 55387
GENEID PRECEDEID FOLLOWID
<character> <CharacterList> <CharacterList>
[1] 8924
[2] 8924
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
So, my question is, given that VRanges objects are enforced to have
a positive strand, would not be better to have ignore.strand=TRUE as
default in locateVariants?
Alternatively, I would recommend that locateVariants() issues a
warning, or maybe an error, when the input object is VRanges and
ignore.strand=FALSE.
Finally, out of curiosity, why a VRanges object enforces the
positive strand in all its genomic ranges? Would not be better just
taking the strand of the CollapsedVCF object when coercing the
CollapsedVCF object to VRanges?
thanks!!
robert.
_______________________________________________
Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
Robert Castelo, PhD Associate Professor Dept. of Experimental and Health Sciences Universitat Pompeu Fabra (UPF) Barcelona Biomedical Research Park (PRBB) Dr Aiguader 88 E-08003 Barcelona, Spain telf: +34.933.160.514 fax: +34.933.160.550