[Bioc-devel] custom genome
On 05/09/2015 08:36 AM, Vincent Carey wrote:
did you put this request on the forum? did you follow the steps with BSforge and find an instruction that could not be followed or led to an error? if you did, please put an account of that event on the support.bioconductor.org and you will get assistance. read the posting guide and provide sessionInfo() so that your errors can be reproduced.
Vince is right that the question should probably be asked on the support forum,
but for what it's worth I downloaded the file(?) to my current working directory
url = "ftp://ftp.plantgdb.org/pub/Genomes/PpGDB/Ppatens_152.fa.gz"
fl = basename(url)
download.file(url, fl, mode="wb")
It's small enough that it can be read in to memory.
library(Biostrings)
dna = readDNAStringSet(fl)
but it would be convenient to have quick random access to to the sequences, so I
exported it to "2bit' format
library(BSgenome); library(rtracklayer)
twobit = TwoBitFile(sub(".gz", ".2bit", fl)))
export(dna, twobit)
and then I have ready access to the sequences, e.g.,
> seqinfo(twobit)
Seqinfo object with 2106 sequences from an unspecified genome:
seqnames seqlengths isCircular genome
scaffold_1 5387533 <NA> <NA>
scaffold_2 4973254 <NA> <NA>
scaffold_3 4236991 <NA> <NA>
scaffold_4 4046325 <NA> <NA>
scaffold_5 3779828 <NA> <NA>
... ... ... ...
scaffold_11864 1003 <NA> <NA>
scaffold_11874 1002 <NA> <NA>
scaffold_11881 1000 <NA> <NA>
scaffold_11890 1000 <NA> <NA>
scaffold_11893 1000 <NA> <NA>
> query = GRanges("scaffold_109", IRanges(seq(1, 1000, by=100), width=50))
> getSeq(twobit, query)
A DNAStringSet instance of length 10
width seq
[1] 50 GTTACAGATTGGTAAGATCTTGTAGAGGCTCTTTTACTCTGGGCATATAT
[2] 50 AAAAAAATTGATAGTATAAGAGAAGGAAATTCTTTCAAGAGACAAATTTG
[3] 50 TTTTTACAATATTGGAAAAAAGAAAAAAAAAAAAAAAAAATGAGAAAAGA
[4] 50 CATATTTAAAGGGAGTAATTGAAAAACATGTAGTAAATGCTAAGATAGAA
[5] 50 AATATTTGATATATAATTCACTATATAATATAGTGAGAGTTATTTTATTA
[6] 50 GTTTAAAATGGCCATTGTTGAAGTTGATTTTAGGTTGATAATAGTTTTTA
[7] 50 TTATTTCTTTTATTTGATTTTTAGAACATATATTATATAAATCTTATAAA
[8] 50 ATATGGATGAATGGAGATGGATGATGAAGTTCTCCATTGCTATTTCCAAA
[9] 50 TATGATCACTCTTTCAGGGCATTTAAACTTAAGAGACTAAGATCAAGAAG
[10] 50 CTTTATAAAAAAAAATATAGGCATGCTATTGATGCAAAGATATATCAGGA
> query$seq = getSeq(twobit, query)
Hope that's helpful; I might have made different decisions with a larger genome
or on Windows, so if you re-post your question on the support forum (which I
hope you do) it would be helpful to include that information as well as your
intended use.
Martin
On Fri, May 8, 2015 at 7:34 PM, Eric P <pedereri at gmail.com> wrote:
Hi
I have 2-3 genomes that are not listed that I would like to use with
bioconductor and I do not really understand how to prepare one properly. I
know you have to use BSforge, but after that I need some help. One of the
genomes is *Physcomitrella patens*, which is actually being updated at the
moment, so I am waiting for version 3.1 to come out, but if someone has
time they could add version 1.6 or 3.0.
However, the other two genomes I have to do myself as they will not be
published for a while so I could just as well try a third. Any help would
be appreciated, thanks. Also, if this was already posted I apologize but I
could not find enough information on this subject when I searched the
forums.
--
Cheers,
Eric
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793