Question
Assume I have text strings that look something like this:
A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3
Here I want to identify sequences of markers (A is a marker, I3 is a marker, etc.) that lead up to a subsequence consisting only of IX markers (i.e. I1, I2, or I3) that contains an I3. This subsequence can have a length of 1 (i.e. be a single I3 marker) or be of unlimited length, but it always needs to contain at least one I3 marker and can only contain IX markers. The sequence that leads up to the IX subsequence can include I1 and I2, but never I3.
In the string above I need to identify:
A-B-C-I1-I2-D-E-F
which leads up to the I1-I3 subsequence which contains I3
and
D-D-D-D
which leads up to the I1-I1-I2-I1-I1-I3-I3 subsequence that contains at least 1 I3.
Here are a few additional examples:
A-B-I3-C-I3
From this string we should identify A-B, because it is followed by a subsequence of length 1 that contains I3, and also C, because it is followed by a subsequence of length 1 that contains I3.
and:
I3-A-I3
Here A should be identified, because it is followed by a subsequence of length 1 that contains I3. The first I3 itself will not be identified, because we are only interested in sequences that are followed by a subsequence of IX markers that contains I3.
How can I write a generic function/regex that accomplishes this task?
Answer 1:
Use strsplit:
> x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
> strsplit(x, "(?:-?I\\d+)*-?\\bI3-?(?:I\\d+-?)*")
[[1]]
[1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*")
[[1]]
[1] "A-B" "C"
or
> strsplit("A-B-I3-C-I3", "(?:-?I\\d+)*-?\\bI3\\b-?(?:I3-?)*")
[[1]]
[1] "A-B" "C"
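The I3-A-I3 example from the question starts with a match, and strsplit returns an empty first element in that case. A minimal sketch of a wrapper that drops such empty pieces is shown here. The name leading_sequences is hypothetical, and perl = TRUE is an assumption added so that the pattern is interpreted as a PCRE expression.

```r
# Hypothetical helper (not part of the original answer): wraps the
# strsplit() call shown above and removes empty pieces.
leading_sequences <- function(x) {
  # Same pattern as in the second strsplit() call of the answer.
  pattern <- "(?:-?I\\d+)*-?\\bI3\\b-?(?:I\\d+-?)*"
  # perl = TRUE is an assumption; the answer uses the default engine.
  pieces <- strsplit(x, pattern, perl = TRUE)[[1]]
  # A match at the start of the string yields "", so filter it out.
  pieces[nzchar(pieces)]
}

leading_sequences("I3-A-I3")
# [1] "A"
leading_sequences("A-B-I3-C-I3")
# [1] "A-B" "C"
```

Filtering with nzchar keeps the regex itself unchanged, so the wrapper should behave like the answer's calls for inputs that do not start with an I3 run.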
Answer 2:
You can identify the sequences that contain I3 with the following regex:
(?:I\\d-?)*I3(?:-?I\\d)*
So you can split your text with this regex to get the desired result.
See demo https://regex101.com/r/bJ3iA3/4
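As a rough illustration of splitting with this regex in R: the pattern does not consume the hyphens around the matched run, so the pieces come back with hyphens at their edges and need trimming. The sketch below assumes perl = TRUE, and split_on_i3_runs is a hypothetical name, not something from the answer.

```r
# Hypothetical helper (not part of the original answer): splits on the
# I3-containing runs and tidies up the resulting pieces.
split_on_i3_runs <- function(x) {
  # The regex from the answer, written as an R string literal.
  pattern <- "(?:I\\d-?)*I3(?:-?I\\d)*"
  pieces <- strsplit(x, pattern, perl = TRUE)[[1]]
  # The separating hyphens stay attached to the pieces, so strip them.
  pieces <- gsub("^-+|-+$", "", pieces)
  # Drop the empty piece produced when the string starts with a match.
  pieces[nzchar(pieces)]
}

x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3"
split_on_i3_runs(x)
# [1] "A-B-C-I1-I2-D-E-F" "D-D-D-D"
split_on_i3_runs("I3-A-I3")
# [1] "A"
```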
Answer 3:
Try the following expression: (.*?)(?:I[0-9]-)*I3(?:-I[0-9])*.
See the match groups:
https://regex101.com/r/yA6aV9/1
Source: https://stackoverflow.com/questions/31449589/identifying-substrings-based-on-complex-rules