Skip to content
Prev 1025 / 21312 Next

[Bioc-devel] how to get genomic sequences?

Sean Davis wrote:
Just a quick followup and a skeleton function...  The function takes a 
dataframe with three columns, chrom, start, end and a single character 
argument, genome, which needs to match the UCSC genome browser notation 
for the genome build.  It fetches the das DNA one at a time and returns 
the original dataframe and a fourth column with the sequence.  There 
isn't any error checking, etc., and it fetches one sequence at a time.  
Both of these issues could be eliminated with a little work.

getSequences <- function(locs,genome='hg18') {
  require(XML)
  chrom <- locs$chrom
  starts <- locs$start
  ends <- locs$end
  seqs <- vector('character',length=length(chrom))
  for(i in 1:length(chrom)) {
    url <- 
sprintf('http://genome.ucsc.edu/cgi-bin/das/%s/dna?segment=%s:%d,%d',genome,chrom[i],starts[i],ends[i])
    results <- xmlRoot(xmlTreeParse(url))
    seqs[i] <- gsub('\n','',xmlValue(results[[1]][[1]]))
  }
  return(data.frame(chrom=chrom,start=starts,end=ends,sequence=seqs))
}


And you could use it like so:

mySeqs <- 
getSequences(data.frame(chrom=c('chr1','chr2'),starts=c(1000,1000),ends=c(2000,1500)))

Herve Page's solution is undoubtedly the better one, but if memory is a 
limitation, this solution might be of use.

Sean