So I have a really long string and I want to work with multiple matches. I can only seem to get the first position of the first match using regexpr
. How can I g
gregexpr and regmatches, as suggested in Dason's answer, allow extracting multiple instances of a regex pattern in a string. Furthermore, this solution has the advantage of relying exclusively on the {base} package of R rather than requiring an additional package.
Nevertheless, I'd like to suggest an alternative solution based on the stringr package. In general, this package makes it easier to work with character strings by providing most of the functionality of the various string-support functions of base R (not just the regex-related functions), through a set of intuitively named functions with a consistent API. Indeed, stringr functions do not merely replace base R functions; in many cases they introduce additional features. For example, the regex-related functions of stringr are vectorized over both the string and the pattern.
Specifically for the question of extracting multiple patterns in a long string, either str_extract_all or str_match_all can be used, as shown below. Depending on whether the input is a single string or a vector of strings, the logic can be adapted using list/matrix subscripts, unlist, or other approaches such as lapply, sapply, etc. The point is that the stringr functions return structures that can be used to access just what we want.
library(stringr)

# Simulate HTML input. (Bogus HTML tags mark the target texts; the demo works
# the same for actual HTML patterns, the regular expression is just a bit more complex.)
htmlInput <- paste("Lorem ipsum dolor<blah>MATCH_ONE<blah> sit amet, purus",
"sollicitudin<blah>MATCH2<blah>mauris, <blah>MATCH Nr 3<blah>vitae donec",
"risus ipsum, aenean quis, sapien",
"in lorem, condimentum ornare viverra",
"suscipit <blah>LAST MATCH<blah> ipsum eget ac. Non senectus",
"dolor mauris tellus, dui leo purus varius")
# str_extract_all() returns the full match, so it may need a bit of extra work
# to remove the leading and trailing parts
str_extract_all(htmlInput, "(<blah>)([^<]+)<")
# [[1]]
# [1] "<blah>MATCH_ONE<" "<blah>MATCH2<" "<blah>MATCH Nr 3<" "<blah>LAST MATCH<"
str_match_all(htmlInput, "<blah>([^<]+)<")[[1]][, 2]
# [1] "MATCH_ONE" "MATCH2" "MATCH Nr 3" "LAST MATCH"
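When the input is a vector of strings rather than a single string, the same call returns one matrix per element, so the subscript has to be applied to each of them. The following is only a sketch of one way to do that: the `pages` vector is made-up data, and the names `per_string` and `all_targets` are purely illustrative.

```r
library(stringr)

# Hypothetical input: each element may contain zero or more targets
pages <- c("a<blah>ONE<blah>b<blah>TWO<blah>c",
           "no targets here",
           "x<blah>THREE<blah>y")

# str_match_all() returns a list with one matrix per input string;
# column 1 is the full match, column 2 is the first capture group
per_string <- lapply(str_match_all(pages, "<blah>([^<]+)<"),
                     function(m) m[, 2])

# Flatten to a single character vector when it does not matter
# which input string each match came from
all_targets <- unlist(per_string)
```

Keeping `per_string` as a list preserves the link between each match and its source string; `unlist` is the shortcut when only the matches themselves matter.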
Using gregexpr allows for multiple matches.
> x <- c("only one match", "match1 and match2", "none here")
> m <- gregexpr("match[0-9]*", x)
> m
[[1]]
[1] 10
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 1 12
attr(,"match.length")
[1] 6 6
attr(,"useBytes")
[1] TRUE
[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
And if you're looking to extract the matches themselves, you can use regmatches to do that for you.
> regmatches(x, m)
[[1]]
[1] "match"
[[2]]
[1] "match1" "match2"
[[3]]
character(0)
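Since both gregexpr and regmatches return a list with one element per input string, a little post-processing is usually needed. As a rough sketch in base R only, reusing the same `x` as above (the names `n_matches`, `all_matches` and `starts` are just illustrative):

```r
x <- c("only one match", "match1 and match2", "none here")
m <- gregexpr("match[0-9]*", x)
matches <- regmatches(x, m)

# Number of matches found in each input string
n_matches <- lengths(matches)

# All matches as one flat character vector
all_matches <- unlist(matches)

# Start positions per string; gregexpr() reports -1 for "no match",
# so those entries are filtered out
starts <- lapply(m, function(pos) as.integer(pos[pos > 0]))
```

Here `n_matches` tells you which strings had no match at all, which is easier to test for than the -1 sentinel in the raw gregexpr output.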