Forgive the really naive questions, but my C/C++ and Biostrings skills are pretty rough. I have been playing with Rcpp to interface to an external library for reading sequence formats. I am reading DNA sequences one-at-a-time and would like to package a set of those up as a DNAStringSet. I have at least two options: 1) Return a character vector back to R and then create a DNAStringSet in R. This works now and works fine. 2) Create the DNAStringSet in the cpp code. Are there advantages to using choice #2? If so, where would I start to get the necessary Rcpp/Biostrings glue in place? I know how to make new S4 objects in Rcpp, so it is really a question of knowing what class hierarchy I need to instantiate and, roughly, how. Any pointers or code examples? Thanks, Sean
[Bioc-devel] Creating Biostrings DNAStringSet from Rcpp/C++
3 messages · Sean Davis, Michael Lawrence, Hervé Pagès
You're going to want to use the Biostrings C API. You can see TwoBitFile_read() in src/twoBit.c in rtracklayer for an example. It's not trivial. Michael
On Tue, Jan 5, 2016 at 5:21 AM, Sean Davis <seandavi at gmail.com> wrote:
Forgive the really naive questions, but my C/C++ and Biostrings skills are pretty rough. I have been playing with Rcpp to interface to an external library for reading sequence formats. I am reading DNA sequences one-at-a-time and would like to package a set of those up as a DNAStringSet. I have at least two options: 1) Return a character vector back to R and then create a DNAStringSet in R. This works now and works fine. 2) Create the DNAStringSet in the cpp code. Are there advantages to using choice #2? If so, where would I start to get the necessary Rcpp/Biostrings glue in place? I know how to make new S4 objects in Rcpp, so it is really a question of knowing what class hierarchy I need to instantiate and, roughly, how. Any pointers or code examples? Thanks, Sean
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Hi Sean, The advantage of using choice #2 is better performance (memory and speed). In addition to Michael's pointer, there is this post https://stat.ethz.ch/pipermail/r-devel/2013-June/066890.html on R-devel that provides some details and pseudo-code. Unfortunately a few things have changed since 2013: 1) You need to replace: cachedXVectorList with XVectorList_holder cachedCharSeq with Chars_holder cache_XVectorList with hold_XVectorList get_cachedXRawList_elt with get_elt_from_XRawList_holder 2) For readability you probably want to rename the variables accordingly. So replace: cached_ans with ans_holder cached_ans_elt with ans_elt_older 3) The 'seq' member of a Chars_holder struct was renamed 'ptr'. So replace: cached_ans_elt->seq with ans_elt_older->ptr Also since you want the result in a DNAStringSet object, you'll need to replace any occurrence of BStringSet with DNAStringSet. One extra difficulty with DNAStringSet objects is that the incoming string data needs to be encoded before it's stored in the object. This encoding can be performed by replacing memcpy((char *) cached_ans_elt->seq, foo, INTEGER(width)[i] * sizeof(char)); with Ocopy_bytes_to_i1i2_with_lkup(0, ans_elt_holder.length - 1, (char *) ans_elt_holder.ptr, ans_elt_holder.length, foo, INTEGER(width)[i] * sizeof(char), INTEGER(lkup), LENGTH(lkup)); where 'lkup' is an integer vector that you need to prepare at the R level with: lkup <- get_seqtype_conversion_lookup("B", "DNA") and that you then need to pass all the way down to the read_text_file_in_DNAStringSet() function as an extra argument: SEXP read_text_file_in_DNAStringSet(FILE *FN, SEXP lkup) { ... } Let me know if you need further help with this. H.
On 01/05/2016 05:31 AM, Michael Lawrence wrote:
You're going to want to use the Biostrings C API. You can see TwoBitFile_read() in src/twoBit.c in rtracklayer for an example. It's not trivial. Michael On Tue, Jan 5, 2016 at 5:21 AM, Sean Davis <seandavi at gmail.com> wrote:
Forgive the really naive questions, but my C/C++ and Biostrings skills are pretty rough. I have been playing with Rcpp to interface to an external library for reading sequence formats. I am reading DNA sequences one-at-a-time and would like to package a set of those up as a DNAStringSet. I have at least two options: 1) Return a character vector back to R and then create a DNAStringSet in R. This works now and works fine. 2) Create the DNAStringSet in the cpp code. Are there advantages to using choice #2? If so, where would I start to get the necessary Rcpp/Biostrings glue in place? I know how to make new S4 objects in Rcpp, so it is really a question of knowing what class hierarchy I need to instantiate and, roughly, how. Any pointers or code examples? Thanks, Sean
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319