Skip to content
Prev 17045 / 21318 Next

[Bioc-devel] VariantAnnotation::readVcf() sets the wrong seqlevelsStyle in devel

Hi Robert,

Yes seqlevelsStyle's new behavior is slightly different and less 
forgiving. The thing is that it will generally reveal dormant issues 
which is not such a bad thing after all.

Note that it doesn't seem completely straightforward to retrieve the 
reference genome/assembly directly from the VCF header. AFAICT this 
information is either missing or weirdly formatted. For example the 
headers of the 1000genomes VCF files located at 
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ 
contain

   ##reference=1000Genomes-NCBI37

or in the ex2.vcf file included in VariantAnnotation it's:

   ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta

so not clear that importing this in the genome field of the returned VCF 
object would be that helpful.

Thanks for pointing me to the VariantAnnotation vignette. I'll fix the 
calls to readVcf() to use GRCh37 instead of hg19. Seems like one call 
(on ex2.vcf) is using the wrong genome: ex2.vcf is based on hg18/NCBI36, 
not on hg19/GRCh37. Will fix that too.

Sure readVcf() could probably be improved to perform some sanity checks 
by making sure that the user-supplied genome is compatible with the 
chromosome names. However that still won't prevent the user from 
specifying the wrong genome (e.g. GRCh37 instead of NCBI36) like in the 
ex2.vcf case. Anyway this is a feature request for readVcf().

In the end I'm not sure what's the purpose of specifying the genome 
anyway. What does it give us? Maybe the vignette and examples in 
VariantAnnotation should stop doing that? Better to not specify the 
genome than specifying the wrong one.

Best,
H.
On 8/6/20 07:42, Robert Castelo wrote: