An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioc-devel/attachments/20070327/ec746121/attachment.pl
[Bioc-devel] how to get genomic sequences?
4 messages · Roger Liu, Hervé Pagès, Sean Davis
Hi Roger, You can use one of the Biostrings-based genome data packages for this. Those packages contain the full genomic sequences for some organisms. Here is how to proceed (with R-devel + Bioc-devel). 1) Install BSgenome ===================
source("http://bioconductor.org/biocLite.R")
biocLite("BSgenome")
library(BSgenome)
available.genomes()
[1] "BSgenome.Celegans.UCSC.ce2" [2] "BSgenome.Dmelanogaster.BDGP.Release5" [3] "BSgenome.Dmelanogaster.FlyBase.r51" [4] "BSgenome.Dmelanogaster.UCSC.dm2" [5] "BSgenome.Hsapiens.UCSC.hg16" [6] "BSgenome.Hsapiens.UCSC.hg17" [7] "BSgenome.Hsapiens.UCSC.hg18" [8] "BSgenome.Mmusculus.UCSC.mm7" [9] "BSgenome.Mmusculus.UCSC.mm8" [10] "BSgenome.Scerevisiae.UCSC.sacCer1" 2) Install and load a specific genome =====================================
biocLite("BSgenome.Hsapiens.UCSC.hg18") # can take a long time (850M to download)
library(BSgenome.Hsapiens.UCSC.hg18)
ls(2)
[1] "Hsapiens"
Hsapiens
Homo sapiens genome:
Single sequences (DNAString objects, see '?seqnames'):
chr1 chr2 chr3 chr4 chr5
chr6 chr7 chr8 chr9 chr10
chr11 chr12 chr13 chr14 chr15
chr16 chr17 chr18 chr19 chr20
chr21 chr22 chrX chrY chrM
chr5_h2_hap1 chr6_cox_hap1 chr6_qbl_hap2 chr1_random chr2_random
chr3_random chr4_random chr5_random chr6_random chr7_random
chr8_random chr9_random chr10_random chr11_random chr13_random
chr15_random chr16_random chr17_random chr18_random chr19_random
chr21_random chr22_random chrX_random
Multiple sequences (BStringViews objects, see '?mseqnames'):
upstream1000 upstream2000 upstream5000
(use the '$' or '[[' operator to access a given sequence)
3) Use getSeq() to retrieve the genomic sequence in a given chromosome, at given start and end
==============================================================================================
getSeq(Hsapiens, "chrX", 100, 150)
[1] "CCTGAGCCAGCAGTGGCAACCCAATGGGGTCCCTTTCCATACTGTGGAAGC" If you need to retrieve a big chunk (> 100000 nucleotides), then it's much more efficient to use as.BStringViews=TRUE:
getSeq(Hsapiens, "chrX", 100, 5000000, as.BStringViews=TRUE)
Views on a 154913754-letter DNAString subject
Subject: CTAACCCTAACCCTAACCCTAACCCTAACCCTAA...TGTGGGTGTGTGGGTGTGGTGTGTGGGTGTGGT
Views:
start end width
[1] 100 5000000 4999901 [CCTGAGCCAGCAGTGGCAACCCAA...CCTATTATTGACTTCACTTGAGCT]
See ?getSeq (from BSgenome package) for more info...
Finally, there have been some important improvements + changes in the devel versions
of Biostrings and BSgenome so I strongly suggest you use Bioc-devel for this.
Let me know if you need further help.
Cheers,
H.
Roger Liu wrote:
Hi, I have a set of data with chromosome number and coordinates of the sequences such as,chr10, start 1000, end 2000. I have tried using biomart to retrieve the genomic sequences for my dataset, but I didn't get success, I used seqType argument as: seqType="genomic", it reported error as"The type of sequence specified with seqType is not available. Please select from: cdna, peptide, 3utr, 5utr", but I have seen this "genomic" argument for seqType in the help file. So what's going on there? Or anyone can recommend a package which can help me retrieve the genomic sequences from hg18 with known chromosome number and sequences coordinates(start and end). Thanks. ZRL [[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Roger Liu wrote:
Hi, I have a set of data with chromosome number and coordinates of the sequences such as,chr10, start 1000, end 2000. I have tried using biomart to retrieve the genomic sequences for my dataset, but I didn't get success, I used seqType argument as: seqType="genomic", it reported error as"The type of sequence specified with seqType is not available. Please select from: cdna, peptide, 3utr, 5utr", but I have seen this "genomic" argument for seqType in the help file. So what's going on there? Or anyone can recommend a package which can help me retrieve the genomic sequences from hg18 with known chromosome number and sequences coordinates(start and end).
URLs like so: http://genome.ucsc.edu/cgi-bin/das/hg18/dna?segment=chr1:1000,2000;segment=chr2:1000,2000 will get you sequences of interest. These come as simple well-formed XML, so you can use the XML package to parse them pretty easily. You can do one-at-a-time, or string together a bunch. Sean
Sean Davis wrote:
Roger Liu wrote:
Hi,
I have a set of data with chromosome number and coordinates of the sequences
such as,chr10, start 1000, end 2000.
I have tried using biomart to retrieve the genomic sequences for my dataset,
but I didn't get success, I used seqType argument as:
seqType="genomic", it reported error as"The type of sequence specified with
seqType is not available. Please select from: cdna, peptide, 3utr, 5utr",
but I have seen this "genomic" argument for seqType in the help file. So
what's going on there?
Or anyone can recommend a package which can help me retrieve the genomic
sequences from hg18 with known chromosome number and sequences
coordinates(start and end).
URLs like so: http://genome.ucsc.edu/cgi-bin/das/hg18/dna?segment=chr1:1000,2000;segment=chr2:1000,2000 will get you sequences of interest. These come as simple well-formed XML, so you can use the XML package to parse them pretty easily. You can do one-at-a-time, or string together a bunch.
Just a quick followup and a skeleton function... The function takes a
dataframe with three columns, chrom, start, end and a single character
argument, genome, which needs to match the UCSC genome browser notation
for the genome build. It fetches the das DNA one at a time and returns
the original dataframe and a fourth column with the sequence. There
isn't any error checking, etc., and it fetches one sequence at a time.
Both of these issues could be eliminated with a little work.
getSequences <- function(locs,genome='hg18') {
require(XML)
chrom <- locs$chrom
starts <- locs$start
ends <- locs$end
seqs <- vector('character',length=length(chrom))
for(i in 1:length(chrom)) {
url <-
sprintf('http://genome.ucsc.edu/cgi-bin/das/%s/dna?segment=%s:%d,%d',genome,chrom[i],starts[i],ends[i])
results <- xmlRoot(xmlTreeParse(url))
seqs[i] <- gsub('\n','',xmlValue(results[[1]][[1]]))
}
return(data.frame(chrom=chrom,start=starts,end=ends,sequence=seqs))
}
And you could use it like so:
mySeqs <-
getSequences(data.frame(chrom=c('chr1','chr2'),starts=c(1000,1000),ends=c(2000,1500)))
Herve Page's solution is undoubtedly the better one, but if memory is a
limitation, this solution might be of use.
Sean