[Bioc-devel] how to get genomic sequences?
Hi Roger,

You can use one of the Biostrings-based genome data packages for this. Those packages contain the full genomic sequences for some organisms. Here is how to proceed (with R-devel + Bioc-devel).

1) Install BSgenome
===================
source("http://bioconductor.org/biocLite.R")
biocLite("BSgenome")
library(BSgenome)
available.genomes()
 [1] "BSgenome.Celegans.UCSC.ce2"
 [2] "BSgenome.Dmelanogaster.BDGP.Release5"
 [3] "BSgenome.Dmelanogaster.FlyBase.r51"
 [4] "BSgenome.Dmelanogaster.UCSC.dm2"
 [5] "BSgenome.Hsapiens.UCSC.hg16"
 [6] "BSgenome.Hsapiens.UCSC.hg17"
 [7] "BSgenome.Hsapiens.UCSC.hg18"
 [8] "BSgenome.Mmusculus.UCSC.mm7"
 [9] "BSgenome.Mmusculus.UCSC.mm8"
[10] "BSgenome.Scerevisiae.UCSC.sacCer1"

2) Install and load a specific genome
=====================================
biocLite("BSgenome.Hsapiens.UCSC.hg18") # can take a long time (850M to download)
library(BSgenome.Hsapiens.UCSC.hg18)
ls(2)
[1] "Hsapiens"
Hsapiens
Homo sapiens genome:
Single sequences (DNAString objects, see '?seqnames'):
chr1 chr2 chr3 chr4 chr5
chr6 chr7 chr8 chr9 chr10
chr11 chr12 chr13 chr14 chr15
chr16 chr17 chr18 chr19 chr20
chr21 chr22 chrX chrY chrM
chr5_h2_hap1 chr6_cox_hap1 chr6_qbl_hap2 chr1_random chr2_random
chr3_random chr4_random chr5_random chr6_random chr7_random
chr8_random chr9_random chr10_random chr11_random chr13_random
chr15_random chr16_random chr17_random chr18_random chr19_random
chr21_random chr22_random chrX_random
Multiple sequences (BStringViews objects, see '?mseqnames'):
upstream1000 upstream2000 upstream5000
(use the '$' or '[[' operator to access a given sequence)
3) Use getSeq() to retrieve the genomic sequence in a given chromosome, at given start and end
==============================================================================================
getSeq(Hsapiens, "chrX", 100, 150)
[1] "CCTGAGCCAGCAGTGGCAACCCAATGGGGTCCCTTTCCATACTGTGGAAGC"

If you need to retrieve a big chunk (> 100000 nucleotides), then it's much more efficient to use as.BStringViews=TRUE:
getSeq(Hsapiens, "chrX", 100, 5000000, as.BStringViews=TRUE)
Views on a 154913754-letter DNAString subject
Subject: CTAACCCTAACCCTAACCCTAACCCTAACCCTAA...TGTGGGTGTGTGGGTGTGGTGTGTGGGTGTGGT
Views:
start end width
[1] 100 5000000 4999901 [CCTGAGCCAGCAGTGGCAACCCAA...CCTATTATTGACTTCACTTGAGCT]
See ?getSeq (from BSgenome package) for more info...
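
Since the data is a table of chromosome/start/end rows, one way to apply the getSeq() call from step 3 to every row is sketched below. This is only a sketch: fetch_regions, toy_genome and toy_fetch are hypothetical names made up for illustration and are not part of BSgenome. The toy genome stands in for Hsapiens so the code runs without the 850M download. Coordinates are 1-based and inclusive, as in the getSeq() example (100 to 150 gives 51 letters).

```r
# Hypothetical helper (not part of BSgenome): apply a fetch function to each
# row of a table of regions and return the sequences as a character vector.
fetch_regions <- function(regions, fetch) {
  mapply(function(chr, start, end) as.character(fetch(chr, start, end)),
         regions$chr, regions$start, regions$end,
         USE.NAMES = FALSE)
}

# Toy stand-in for a genome, so the sketch runs without downloading hg18.
toy_genome <- list(chrA = "ACGTACGTAC", chrB = "TTTTGGGGCC")
toy_fetch <- function(chr, start, end) substr(toy_genome[[chr]], start, end)

# Table of regions: one row per sequence to retrieve.
regions <- data.frame(chr   = c("chrA", "chrB"),
                      start = c(2, 5),
                      end   = c(4, 8),
                      stringsAsFactors = FALSE)

seqs <- fetch_regions(regions, toy_fetch)
print(seqs)

# With BSgenome.Hsapiens.UCSC.hg18 loaded, the fetch function would wrap the
# call form shown in step 3 (assumption: one getSeq() call per row):
# seqs <- fetch_regions(regions, function(chr, s, e) getSeq(Hsapiens, chr, s, e))
```

Passing the fetch function as an argument keeps the loop independent of the genome package, so it can be tried on a small example first.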
Finally, there have been some important improvements + changes in the devel versions
of Biostrings and BSgenome so I strongly suggest you use Bioc-devel for this.
Let me know if you need further help.
Cheers,
H.
Roger Liu wrote:
Hi, I have a set of data with chromosome numbers and sequence coordinates, such as chr10, start 1000, end 2000. I tried using biomaRt to retrieve the genomic sequences for my dataset, but without success. I used the seqType argument as seqType="genomic", and it reported the error "The type of sequence specified with seqType is not available. Please select from: cdna, peptide, 3utr, 5utr", but I have seen this "genomic" value for seqType in the help file. So what's going on there? Or can anyone recommend a package that can help me retrieve genomic sequences from hg18 given a chromosome number and sequence coordinates (start and end)? Thanks. ZRL
_______________________________________________ Bioc-devel at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel