Thanks -- It is good to know more about the complications of adding
seqlevelsStyle elements.
I am not sure how pervasive this will be in SNP annotation in the future.
The "new API" for dbSNP references SPDI annotation conventions:
https://api.ncbi.nlm.nih.gov/variation/v0/
At least one dbSNP build 152 resource uses this nomenclature. The one
referenced below is the "go-to" resource for current rsid-coordinate
correspondence, as far as I know.
library(VariantAnnotation)
*0/0 packages newly attached/loaded, see sessionInfo() for details.*
mypar = GRanges("NC_000001.11", IRanges(100000,120000)) # note seqnames
ftp://ftp.ncbi.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.38.gz
",
+ genome="GRCh38", param=mypar)
GRanges object with 3 ranges and 5 metadata columns:
                   seqnames    ranges strand | paramRangeID            REF
                      <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
  rs1331956057 NC_000001.11    100000      * |         <NA>              C
  rs1252351580 NC_000001.11    100036      * |         <NA>              T
  rs1238523913 NC_000001.11    100051      * |         <NA>              T
                              ALT      QUAL      FILTER
               <DNAStringSetList> <numeric> <character>
  rs1331956057                  T      <NA>           .
  rs1252351580                  G      <NA>           .
  rs1238523913                  C      <NA>           .
  -------
  seqinfo: 1 sequence from GRCh38 genome; no seqlengths
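Until seqlevelsStyle() knows about this naming, a result like the one above can be renamed by hand. The following is only a minimal sketch under stated assumptions: refseq2ucsc is a hypothetical hand-written lookup (only its chr1 entry comes from the output above), and the three ranges are rebuilt locally as single-base stand-ins instead of being read from the VCF.

```r
library(GenomicRanges)
library(GenomeInfoDb)

# Stand-in for the ranges shown above (ids and positions copied from the output).
gr <- GRanges("NC_000001.11",
              IRanges(c(100000, 100036, 100051), width = 1))
names(gr) <- c("rs1331956057", "rs1252351580", "rs1238523913")

# Hypothetical lookup: names are RefSeq accessions, values are UCSC names.
# This mapping is valid for GRCh38 only; other assemblies use other versions.
refseq2ucsc <- c("NC_000001.11" = "chr1")

# renameSeqlevels() takes a named vector: names = current levels, values = new.
gr_ucsc <- renameSeqlevels(gr, refseq2ucsc)
seqlevels(gr_ucsc)
```

After the rename, the usual seqlevelsStyle() machinery should be able to take over, since the object then carries names in a style it already recognizes.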
On Fri, Dec 13, 2019 at 11:01 AM Robert Castelo <robert.castelo at upf.edu>
wrote:
hi Hervé,
i didn't know about this new sequence style until Vince posted his
message and we briefly talked about it at the European BioC meeting this
week in Brussels. however, i didn't know that the style was specific to
a particular assembly. i have no use case of this at the moment,
i.e., i have not encountered myself any annotation or BAM file with
chromosome names written that way, so i don't know how pressing this
issue is, maybe Vince can tell us how widespread such chromosome naming
style may become in the near future.
naively, i'd think that it would be a matter of adding a
reference-specific column, i.e., 'GRCh38.p13', 'GRCh37.p13', etc., but i
can imagine that maybe the "reference style" concept might not be the
appropriate placeholder to map all different chromosome names of all
different individual human genomes uploaded to NCBI. maybe we should
wait until we have a specific use case .. Vince?
robert.
On 12/11/19 10:06 PM, Pages, Herve wrote:
Hi Vince, Robert,
Looks like Vince wants the RefSeq accession, e.g. NC_000017.11 for chrom
17 in GRCh38.
@Robert: Is this what you're also interested in?
The problem is that the RefSeq accessions are specific to a particular
assembly (e.g. NC_000017.11 for chrom 17 in GRCh38 but NC_000017.10 for
the same chrom in GRCh37).
Currently seqlevelsStyle() doesn't know how to distinguish between
different assemblies of the same organism. Not saying it couldn't but it
would require some thinking and some significant refactoring. It
wouldn't be just a matter of adding a column to
genomeStyles()$Homo_sapiens.
H.
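The assembly dependence described above can be illustrated with a lookup keyed on both assembly and chromosome. This is a sketch in base R under stated assumptions, not how GenomeInfoDb works or would be refactored: refseq_map and ucsc_to_refseq() are hypothetical names, and the table holds only the two chr17 accessions quoted in this message.

```r
# Hypothetical lookup table: one row per (assembly, chromosome) pair.
# Only the two chr17 accessions mentioned in the thread are filled in.
refseq_map <- data.frame(
  assembly = c("GRCh37", "GRCh38"),
  UCSC     = c("chr17", "chr17"),
  RefSeq   = c("NC_000017.10", "NC_000017.11"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate UCSC names to RefSeq accessions for ONE
# assembly. Names missing from the table come back as NA.
ucsc_to_refseq <- function(ucsc, assembly, map = refseq_map) {
  sub <- map[map$assembly == assembly, ]
  unname(setNames(sub$RefSeq, sub$UCSC)[ucsc])
}

ucsc_to_refseq("chr17", "GRCh37")
ucsc_to_refseq("chr17", "GRCh38")
```

The same UCSC name yields a different accession per assembly, which is why a single extra column in a per-organism table cannot carry this style.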
On 12/10/19 14:19, Robert Castelo wrote:
I second this, and would suggest to name the style as 'GRC' for "Genome
Reference Consortium".
thanks Vince for bringing this up, being able to easily switch between
genome styles is great.
if 'paste0()' in R is one of the most influential contributions to
statistical computing,
i think that 'seqlevelsStyle()' from the GenomeInfoDb package is one of
the most influential contributions to human genetics, if you think of
the time invested by researchers in parsing and changing between
different styles of chromosome names :)
robert.
On 06/12/2019 15:03, Vincent Carey wrote:
I raised this issue previously with little response.
I'd propose that we add a column or two to genomeStyles()$Homo_sapiens
head(genomeStyles()$Homo_sapiens, 2)
  circular auto   sex NCBI UCSC dbSNP Ensembl
1    FALSE TRUE FALSE    1 chr1   ch1       1
2    FALSE TRUE FALSE    2 chr2   ch2       2
that includes the values for "NCBI reference sequence names"
See
for one report on chr17,
and
for a table that includes the GenBank labels.
Should I just file a PR at