[Bioc-devel] C library or C package API for regular expressions
On 25.1.2016 at 23:34, Hervé Pagès wrote:
Hi Jiri, On 01/25/2016 09:40 AM, Jiří Hon wrote:
Hi Martin, On 25.1.2016 at 13:08, Morgan, Martin wrote:
There is a discussion at http://stackoverflow.com/questions/23556205/using-boost-regex-with-rcpp pointing to http://gallery.rcpp.org/articles/boost-regular-expressions/ There is a Bioconductor example, flowCore, that bundles the regex library under flowCore/src/ (see https://github.com/Bioconductor-mirror/flowCore ). A second example is in the mzR package.
Thank you for pointing me to the flowCore and mzR packages, these examples are really helpful.
A real question is, do you really need this functionality at the C level?
I think it's unavoidable in my case for performance reasons. I am trying to detect all possible overlapping motifs in DNA that are composed of elements matching some regular expression.
I think Martin's question is: are you sure you need this at the C
level? What makes you think that calling a regex engine from C will
perform better than calling it from R?
Note that using a regex for finding motifs in a DNA sequence has 2
fundamental problems:
(1) It doesn't always find all the matches. For example if 2 matches
are overlapping, it only returns the 1st of the 2 matches:
> library(Biostrings)
> matchPattern("ATAAT", "CCATAATAATGATAAT")
Views on a 16-letter BString subject
subject: CCATAATAATGATAAT
views:
    start end width
[1]     3   7     5 [ATAAT]
[2]     6  10     5 [ATAAT]
[3]    12  16     5 [ATAAT]
> gregexpr("ATAAT", "CCATAATAATGATAAT")[[1]]
[1]  3 12
attr(,"match.length")
[1] 5 5
attr(,"useBytes")
[1] TRUE

(2) It's inefficient on a long DNA sequence:
> library(BSgenome.Hsapiens.UCSC.hg19)
> chr1 <- BSgenome.Hsapiens.UCSC.hg19$chr1
> system.time(m1 <- matchPattern("ATAAT", chr1))
user system elapsed
0.946 0.000 0.940
> chr1c <- as.character(chr1)
> system.time(m2 <- gregexpr("ATAAT", chr1c)[[1]])
user system elapsed
4.109 0.000 4.109
This was actually the very first motivating use case for developing
the Biostrings package. It's important to realize that using the regex
engine at the C level wouldn't make much difference.
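
Problem (1) can be worked around with any regex engine by restarting the search one character after the start of each hit instead of after its end. The following is only a minimal sketch of that idea, assuming C++11 std::regex (the thread itself discusses Boost.Regex and PCRE); the function name overlapping_starts is hypothetical. Two caveats: restarting mid-string changes how anchors such as ^ behave, and the sketch reports at most one match per start position.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the 1-based start positions of all matches,
// including overlapping ones. After each hit the search restarts one
// character past the hit's start rather than past its end.
std::vector<int> overlapping_starts(const std::string& subject,
                                    const std::regex& re) {
    std::vector<int> starts;
    std::string::const_iterator from = subject.cbegin();
    std::smatch m;
    while (from != subject.cend() &&
           std::regex_search(from, subject.cend(), m, re)) {
        std::string::const_iterator hit = m[0].first;
        starts.push_back(static_cast<int>(hit - subject.cbegin()) + 1);
        from = hit + 1;  // step by one so overlapping hits are not skipped
    }
    return starts;
}
```

For the subject CCATAATAATGATAAT and the pattern ATAAT this yields the starts 3, 6 and 12, the same positions that matchPattern() reports above. It does nothing about problem (2): the extra restarts make it slower than a plain regex scan, not faster.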
matchPattern() and family don't support regex though. However when
working with DNA motifs, the motifs can often be described with IUPAC
ambiguity letters. For example, instead of describing the motif
with the regular expression AT(A|G|T)T(A|C)GG.G, you can describe it with
ATDTMGGNG. Then you can use matchPattern() on this pattern with
fixed=FALSE to find all the matches. Additionally you can use the
'max.mismatch' and/or 'with.indels' arguments to allow a small number
of mismatches and/or indels. See ?matchPattern for more information
and examples.
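
To illustrate what matching with IUPAC ambiguity letters amounts to at the C/C++ level, here is a minimal sketch. It is not the Biostrings implementation; the names iupac_mask and match_iupac are hypothetical, and the rule used (two letters match when their base sets share at least one base) is an assumption made for this example. It checks every start position, so overlapping matches are reported.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Bitmask of the bases an IUPAC letter stands for: A=1, C=2, G=4, T=8.
// Returns 0 for characters outside the IUPAC nucleotide alphabet.
unsigned iupac_mask(char c) {
    switch (c) {
        case 'A': return 1;
        case 'C': return 2;
        case 'G': return 4;
        case 'T': return 8;
        case 'M': return 1 | 2;      // A or C
        case 'R': return 1 | 4;      // A or G
        case 'W': return 1 | 8;      // A or T
        case 'S': return 2 | 4;      // C or G
        case 'Y': return 2 | 8;      // C or T
        case 'K': return 4 | 8;      // G or T
        case 'V': return 1 | 2 | 4;  // A, C or G
        case 'H': return 1 | 2 | 8;  // A, C or T
        case 'D': return 1 | 4 | 8;  // A, G or T
        case 'B': return 2 | 4 | 8;  // C, G or T
        case 'N': return 15;         // any base
        default:  return 0;
    }
}

// Hypothetical helper: 1-based start positions where `pattern` matches
// `subject`. Assumed rule: two letters match if their masks share a bit.
std::vector<int> match_iupac(const std::string& pattern,
                             const std::string& subject) {
    std::vector<int> starts;
    if (pattern.empty() || pattern.size() > subject.size()) return starts;
    for (std::size_t i = 0; i + pattern.size() <= subject.size(); ++i) {
        std::size_t j = 0;
        while (j < pattern.size() &&
               (iupac_mask(pattern[j]) & iupac_mask(subject[i + j])) != 0) {
            ++j;
        }
        if (j == pattern.size()) starts.push_back(static_cast<int>(i) + 1);
    }
    return starts;
}
```

With this sketch, match_iupac("ATDTMGGNG", "CCATATAGGCGTT") returns the single start position 3. The loop is a naive O(n*m) scan; it is meant to show the matching rule, not to compete with matchPattern().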
Of course this has its own limitations: you can only do this for a
subclass of regular expressions. For example regular expressions that
use * or + to allow for repetitions cannot be replaced by a sequence
with just IUPAC codes, so the string matching tools in Biostrings
cannot be used in that case.
Cheers,
H.
Thank you, Hervé, for your tips. I'm aware of the limited power of regular expressions, but using matchPattern doesn't solve my problem. The reason for using a regexp library at the C level is that I plan to call it a million times (on short DNA parts), and I suppose it would be better to avoid the calling and for-loop overhead. Therefore I wanted to get an idea of the possible regex C APIs I can use, or whether one is usually bundled. Jirka
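
The "many calls on short DNA parts" scenario can be sketched as follows: the pattern is compiled once and the loop over the fragments runs entirely in compiled code. This is only an illustration under stated assumptions. It uses C++11 std::regex rather than the Boost.Regex or PCRE libraries discussed in this thread, the function name count_matches is hypothetical, and no claim is made about how fast std::regex is compared with those libraries.

```cpp
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: counts the non-overlapping matches of `pattern`
// in each fragment, reusing a single compiled regex for all fragments.
std::vector<int> count_matches(const std::vector<std::string>& fragments,
                               const std::string& pattern) {
    // Compile once, outside the loop.
    const std::regex re(pattern,
                        std::regex::ECMAScript | std::regex::optimize);
    std::vector<int> counts;
    counts.reserve(fragments.size());
    for (const std::string& s : fragments) {
        std::sregex_iterator first(s.begin(), s.end(), re);
        std::sregex_iterator last;  // default-constructed end iterator
        counts.push_back(static_cast<int>(std::distance(first, last)));
    }
    return counts;
}
```

In an R package, a function like this would typically be called once from R with the whole vector of fragments, so the R-level call overhead is paid once rather than per fragment.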
A secondary question is that if several packages are using this functionality, then perhaps the library could be bundled separately and made available just once; zlibbioc does something like this (sort of; zlib is only needed on Windows). The flowCore and mzR maintainers (cc'd) might be a valuable resource in this regard.
Efficient regexp algorithms seem useful to me for solving many bioinformatics problems. So it would be natural to have a package with a C API to the most efficient regexp libraries.
Martin
________________________________________
From: Bioc-devel <bioc-devel-bounces at r-project.org> on behalf of Jiří Hon <xhonji01 at stud.fit.vutbr.cz>
Sent: Monday, January 25, 2016 4:33 AM
To: Charles Determan
Cc: bioc-devel at r-project.org
Subject: Re: [Bioc-devel] C library or C package API for regular expressions

Hi Charles,

thank you a lot for your helpful hint. There is still one thing that I'm not sure about: the Boost manual says that Boost.Regex is not header-only [1]. So, as the BH package contains only headers, I will have to bundle the Boost.Regex library into the package code anyway. Am I right?

Jiri

[1] http://www.boost.org/doc/libs/1_60_0/more/getting_started/unix-variants.html#header-only-libraries
On 23.1.2016 at 13:35, Charles Determan wrote:
Hi Jiri, I believe you can use the BH package. It contains most of the Boost
headers.
Regards, Charles

On Saturday, January 23, 2016, Jiří Hon <xhonji01 at stud.fit.vutbr.cz> wrote:
Dear package developers, I would like to ask you for advice. What is the most seamless way to use regular expressions in the C/C++ code of an R/Bioconductor package? Is it allowed to bundle some C/C++ library for that (like PCRE or Boost.Regex)? Or is there an existing C API of some package I can depend on and import? Thank you a lot for your attention and please have a nice day :) Jiri Hon
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel