[Bioc-devel] how to get genomic sequences?

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioc-devel/attachments/20070327/ec746121/attachment.pl
Hi Roger,

You can use one of the Biostrings-based genome data packages for this.
Those packages contain the full genomic sequences for some organisms.

Here is how to proceed (with R-devel + Bioc-devel).

1) Install BSgenome
===================
source("http://bioconductor.org/biocLite.R")
biocLite("BSgenome")
library(BSgenome)
available.genomes()
[1] "BSgenome.Celegans.UCSC.ce2"
 [2] "BSgenome.Dmelanogaster.BDGP.Release5"
 [3] "BSgenome.Dmelanogaster.FlyBase.r51"
 [4] "BSgenome.Dmelanogaster.UCSC.dm2"
 [5] "BSgenome.Hsapiens.UCSC.hg16"
 [6] "BSgenome.Hsapiens.UCSC.hg17"
 [7] "BSgenome.Hsapiens.UCSC.hg18"
 [8] "BSgenome.Mmusculus.UCSC.mm7"
 [9] "BSgenome.Mmusculus.UCSC.mm8"
[10] "BSgenome.Scerevisiae.UCSC.sacCer1"

2) Install and load a specific genome
=====================================
biocLite("BSgenome.Hsapiens.UCSC.hg18") # can take a long time (850M to download)
library(BSgenome.Hsapiens.UCSC.hg18)
ls(2)
[1] "Hsapiens"
Hsapiens
Homo sapiens genome:

  Single sequences (DNAString objects, see '?seqnames'):
    chr1           chr2           chr3           chr4           chr5
    chr6           chr7           chr8           chr9           chr10
    chr11          chr12          chr13          chr14          chr15
    chr16          chr17          chr18          chr19          chr20
    chr21          chr22          chrX           chrY           chrM
    chr5_h2_hap1   chr6_cox_hap1  chr6_qbl_hap2  chr1_random    chr2_random
    chr3_random    chr4_random    chr5_random    chr6_random    chr7_random
    chr8_random    chr9_random    chr10_random   chr11_random   chr13_random
    chr15_random   chr16_random   chr17_random   chr18_random   chr19_random
    chr21_random   chr22_random   chrX_random

  Multiple sequences (BStringViews objects, see '?mseqnames'):
    upstream1000  upstream2000  upstream5000

  (use the '$' or '[[' operator to access a given sequence)

3) Use getSeq() to retrieve the genomic sequence in a given chromosome, at given start and end
==============================================================================================
getSeq(Hsapiens, "chrX", 100, 150)
[1] "CCTGAGCCAGCAGTGGCAACCCAATGGGGTCCCTTTCCATACTGTGGAAGC"

If you need to retrieve a big chunk (> 100000 nucleotides), then it's much more efficient
to use as.BStringViews=TRUE:
getSeq(Hsapiens, "chrX", 100, 5000000, as.BStringViews=TRUE)
Views on a 154913754-letter DNAString subject
Subject: CTAACCCTAACCCTAACCCTAACCCTAACCCTAA...TGTGGGTGTGTGGGTGTGGTGTGTGGGTGTGGT
Views:
    start     end   width
[1]   100 5000000 4999901 [CCTGAGCCAGCAGTGGCAACCCAA...CCTATTATTGACTTCACTTGAGCT]

See ?getSeq (from BSgenome package) for more info...

Finally, there have been some important improvements + changes in the devel versions
of Biostrings and BSgenome so I strongly suggest you use Bioc-devel for this.
Let me know if you need further help.

Cheers,
H.
Hi,

I have a set of data with chromosome number and coordinates of the sequences
such as,chr10, start 1000, end 2000.
I have tried using biomart to retrieve the genomic sequences for my dataset,
but I didn't get success, I used seqType argument as:
seqType="genomic", it reported error as"The type of sequence specified with
seqType is not available. Please select from: cdna, peptide, 3utr, 5utr",
but I have seen this "genomic" argument for seqType in the help file. So
what's going on there?

Or anyone can recommend a package which can help me retrieve the genomic
sequences from hg18 with known chromosome number and sequences
coordinates(start and end).

Thanks.

ZRL

	[[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
Hi,

I have a set of data with chromosome number and coordinates of the sequences
such as,chr10, start 1000, end 2000.
I have tried using biomart to retrieve the genomic sequences for my dataset,
but I didn't get success, I used seqType argument as:
seqType="genomic", it reported error as"The type of sequence specified with
seqType is not available. Please select from: cdna, peptide, 3utr, 5utr",
but I have seen this "genomic" argument for seqType in the help file. So
what's going on there?

Or anyone can recommend a package which can help me retrieve the genomic
sequences from hg18 with known chromosome number and sequences
coordinates(start and end).

URLs like so:

http://genome.ucsc.edu/cgi-bin/das/hg18/dna?segment=chr1:1000,2000;segment=chr2:1000,2000

will get you sequences of interest.  These come as simple well-formed 
XML, so you can use the XML package to parse them pretty easily.  You 
can do one-at-a-time, or string together a bunch.

Sean
Roger Liu wrote:

Hi,

I have a set of data with chromosome number and coordinates of the sequences
such as,chr10, start 1000, end 2000.
I have tried using biomart to retrieve the genomic sequences for my dataset,
but I didn't get success, I used seqType argument as:
seqType="genomic", it reported error as"The type of sequence specified with
seqType is not available. Please select from: cdna, peptide, 3utr, 5utr",
but I have seen this "genomic" argument for seqType in the help file. So
what's going on there?

Or anyone can recommend a package which can help me retrieve the genomic
sequences from hg18 with known chromosome number and sequences
coordinates(start and end).

URLs like so:

http://genome.ucsc.edu/cgi-bin/das/hg18/dna?segment=chr1:1000,2000;segment=chr2:1000,2000

will get you sequences of interest.  These come as simple well-formed 
XML, so you can use the XML package to parse them pretty easily.  You 
can do one-at-a-time, or string together a bunch.

Just a quick followup and a skeleton function...  The function takes a 
dataframe with three columns, chrom, start, end and a single character 
argument, genome, which needs to match the UCSC genome browser notation 
for the genome build.  It fetches the das DNA one at a time and returns 
the original dataframe and a fourth column with the sequence.  There 
isn't any error checking, etc., and it fetches one sequence at a time.  
Both of these issues could be eliminated with a little work.

getSequences <- function(locs,genome='hg18') {
  require(XML)
  chrom <- locs$chrom
  starts <- locs$start
  ends <- locs$end
  seqs <- vector('character',length=length(chrom))
  for(i in 1:length(chrom)) {
    url <- 
sprintf('http://genome.ucsc.edu/cgi-bin/das/%s/dna?segment=%s:%d,%d',genome,chrom[i],starts[i],ends[i])
    results <- xmlRoot(xmlTreeParse(url))
    seqs[i] <- gsub('\n','',xmlValue(results[[1]][[1]]))
  }
  return(data.frame(chrom=chrom,start=starts,end=ends,sequence=seqs))
}

And you could use it like so:

mySeqs <- 
getSequences(data.frame(chrom=c('chr1','chr2'),starts=c(1000,1000),ends=c(2000,1500)))

Herve Page's solution is undoubtedly the better one, but if memory is a 
limitation, this solution might be of use.

Sean