Skip to content
Prev 7476 / 21312 Next

[Bioc-devel] custom genome

On 05/09/2015 08:36 AM, Vincent Carey wrote:
Vince is right that the question should probably be asked on the support forum, 
but for what it's worth I downloaded the file(?) to my current working directory

   url = "ftp://ftp.plantgdb.org/pub/Genomes/PpGDB/Ppatens_152.fa.gz"
   fl = basename(url)
   download.file(url, fl, mode="wb")

It's small enough that it can be read in to memory.

   library(Biostrings)
   dna = readDNAStringSet(fl)

but it would be convenient to have quick random access to to the sequences, so I 
exported it to "2bit' format

   library(BSgenome); library(rtracklayer)
   twobit = TwoBitFile(sub(".gz", ".2bit", fl)))
   export(dna, twobit)

and then I have ready access to the sequences, e.g.,

 > seqinfo(twobit)
Seqinfo object with 2106 sequences from an unspecified genome:
   seqnames       seqlengths isCircular genome
   scaffold_1        5387533       <NA>   <NA>
   scaffold_2        4973254       <NA>   <NA>
   scaffold_3        4236991       <NA>   <NA>
   scaffold_4        4046325       <NA>   <NA>
   scaffold_5        3779828       <NA>   <NA>
   ...                   ...        ...    ...
   scaffold_11864       1003       <NA>   <NA>
   scaffold_11874       1002       <NA>   <NA>
   scaffold_11881       1000       <NA>   <NA>
   scaffold_11890       1000       <NA>   <NA>
   scaffold_11893       1000       <NA>   <NA>
 > query = GRanges("scaffold_109", IRanges(seq(1, 1000, by=100), width=50))
 > getSeq(twobit, query)
   A DNAStringSet instance of length 10
      width seq
  [1]    50 GTTACAGATTGGTAAGATCTTGTAGAGGCTCTTTTACTCTGGGCATATAT
  [2]    50 AAAAAAATTGATAGTATAAGAGAAGGAAATTCTTTCAAGAGACAAATTTG
  [3]    50 TTTTTACAATATTGGAAAAAAGAAAAAAAAAAAAAAAAAATGAGAAAAGA
  [4]    50 CATATTTAAAGGGAGTAATTGAAAAACATGTAGTAAATGCTAAGATAGAA
  [5]    50 AATATTTGATATATAATTCACTATATAATATAGTGAGAGTTATTTTATTA
  [6]    50 GTTTAAAATGGCCATTGTTGAAGTTGATTTTAGGTTGATAATAGTTTTTA
  [7]    50 TTATTTCTTTTATTTGATTTTTAGAACATATATTATATAAATCTTATAAA
  [8]    50 ATATGGATGAATGGAGATGGATGATGAAGTTCTCCATTGCTATTTCCAAA
  [9]    50 TATGATCACTCTTTCAGGGCATTTAAACTTAAGAGACTAAGATCAAGAAG
[10]    50 CTTTATAAAAAAAAATATAGGCATGCTATTGATGCAAAGATATATCAGGA
 > query$seq = getSeq(twobit, query)

Hope that's helpful; I might have made different decisions with a larger genome 
or on Windows, so if you re-post your question on the support forum (which I 
hope you do) it would be helpful to include that information as well as your 
intended use.

Martin