error handling in strcapture
Once again, nice catch. I've committed a check for this.
Michael
On Tue, Oct 4, 2016 at 2:37 PM, William Dunlap <wdunlap at tibco.com> wrote:
It is also not catching the cases where the number of capture expressions does not match the number of entries in proto. I think all of the following should give an error about the mismatch.
strcapture("(.)(.)", c("ab", "cde", "fgh", "ij", "lm"),
proto=list(A="",B="",C=""))
   A  B  C
1  a  b cd
2  d fg  f
3 ij  i  j
4  l  m ab
Warning message:
In matrix(as.character(unlist(str)), ncol = ntokens, byrow = TRUE) :
  data length [15] is not a sub-multiple or multiple of the number of rows [4]
strcapture("(.)(.)(.)", c("abc", "def", "ghi", "jkl", "mno"),
proto=list(A="",B=""))
    A   B
1   a   b
2 def   d
3   f ghi
4   h   i
5   j   k
6 mno   m
7   o abc
Warning message:
In matrix(as.character(unlist(str)), ncol = ntokens, byrow = TRUE) :
  data length [20] is not a sub-multiple or multiple of the number of rows [7]
strcapture("(.)(.)(.)", c("abc", "def"), proto=list(A=""))
  A
1 a
2 c
3 d
4 f
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Tue, Oct 4, 2016 at 2:21 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
Hi Bill,
This is a bug in regexec() and I will commit a fix. Thanks for the report,
Michael
On Tue, Oct 4, 2016 at 1:40 PM, William Dunlap <wdunlap at tibco.com> wrote:
I noticed a problem in strcapture from R-devel (2016-09-27 r71386) when the text contains a missing value and perl=TRUE.
{
# NA in text input should map to row of NA's in output, without warning
r9p <- strcapture(perl = TRUE, "(.).* ([[:digit:]]+)",
                  c("One 1", NA, "Fifty 50"),
                  data.frame(Initial=factor(), Number=numeric()))
e9p <- structure(list(Initial = structure(c(2L, NA, 1L), .Label =
c("F", "O"), class = "factor"),
Number = c(1, NA, 50)),
row.names = c(NA, -3L),
class = "data.frame")
all.equal(e9p, r9p)
}
#Error in if (any(ind)) { : missing value where TRUE/FALSE needed
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Wed, Sep 21, 2016 at 2:32 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
The new behavior is that it yields NAs when the pattern does not match (like strptime), and for empty captures in a matching pattern it yields the empty string, which is consistent with regmatches().
Michael
On Wed, Sep 21, 2016 at 2:21 PM, William Dunlap <wdunlap at tibco.com> wrote:
If there are any matches then strcapture can see if the pattern has the same number of capture expressions as the prototype has columns and give an error if not. That seems appropriate.
If there are no matches, then there is no easy way to see if the prototype is compatible with the pattern, so should strcapture just assume the best and fill in the prototype with NA's? Should there be warnings? This is kind of like strptime(), which silently gives NA's when the format does not match the text input.
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Wed, Sep 21, 2016 at 2:10 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
Hi Bill,
Thanks, another good suggestion. strcapture() now returns NAs for non-matches. It's nice to have someone kicking the tires on that function.
Michael
On Wed, Sep 21, 2016 at 12:11 PM, William Dunlap via R-devel <r-devel at r-project.org> wrote:
Michael, thanks for looking at my first issue with utils::strcapture. Another issue is how it deals with lines that don't match the pattern. Currently it gives an error:
strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"),
proto=list(Name="", Number=0))
Error in strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"),  :
  number of matches does not always match ncol(proto)
First, isn't the 'number of matches' the number of parenthesized subpatterns in the regular expression? I thought that if the entire pattern matches then the subpatterns without matches would be shown as matches at position 0 with length 0. Hence either the pattern is compatible with the prototype or it isn't; it does not depend on the text input. E.g.,
regexec("^(([[:alpha:]]+)|([[:digit:]]+))$", c("Twelve", "12", "Z280"))
[[1]]
[1] 1 1 1 0
attr(,"match.length")
[1] 6 6 6 0
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 1 1 0 1
attr(,"match.length")
[1] 2 2 0 2
attr(,"useBytes")
[1] TRUE
[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
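The point above, that the capture count belongs to the pattern and not to the text, can be checked directly from the regexec() result. This is a minimal sketch using only the example already shown; the variable names are illustrative.

```r
m <- regexec("^(([[:alpha:]]+)|([[:digit:]]+))$", c("Twelve", "12", "Z280"))

# One entry per input string: the whole match plus one per subpattern,
# or a single -1 when the pattern does not match at all.
n_entries <- lengths(m)

# TRUE where the whole pattern matched.
matched <- vapply(m, function(mi) mi[[1L]] != -1L, logical(1))

# Capture count implied by the matching inputs (entries minus the whole match).
n_capture <- unique(n_entries[matched]) - 1L
n_capture
```

Every matching line reports the same count here, so a mismatch with the prototype could be reported once, regardless of which lines failed to match.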
Second, an error message like 'some lines were bad' is not very helpful. Should it put NA's in all the columns of the current output row if the input line didn't match the pattern and perhaps warn the user that there were problems? The user could then look for rows of NA's to see where the problems were.
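The NA-row idea can be sketched as follows. This is only an illustration, not how strcapture is implemented: capture_or_na() and its ncapture argument are hypothetical names, and the sketch returns a character matrix rather than filling in a prototype.

```r
# Hypothetical helper (not part of utils): collect the capture groups into a
# character matrix, leaving a row of NA for each input line that does not match.
capture_or_na <- function(pattern, x, ncapture) {
  # regmatches() gives character(0) for lines where the pattern did not match.
  m <- regmatches(x, regexec(pattern, x))
  out <- matrix(NA_character_, nrow = length(x), ncol = ncapture)
  for (i in seq_along(m)) {
    if (length(m[[i]]) == ncapture + 1L) {
      out[i, ] <- m[[i]][-1L]  # drop the whole-match entry, keep the captures
    }
  }
  out
}

res <- capture_or_na("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"), 2L)

# Rows that are entirely NA mark the input lines that did not match.
bad <- which(rowSums(is.na(res)) == ncol(res))
bad
```

With this shape of output the user can find the problem lines by looking for all-NA rows, instead of getting a single error for the whole input.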
Bill Dunlap
TIBCO Software
wdunlap tibco.com
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel