Question
I'm trying to use strsplit() in R to break a string into pieces based on commas, but I don't want to split up anything in parentheses. I think the answer is a regex but I'm struggling to get the code right.
So for example:
x <- "This is it, isn't it (well, yes)"
> strsplit(x, ", ")
[[1]]
[1] "This is it" "isn't it (well" "yes)"
When what I would like is:
[1] "This is it" "isn't it (well, yes)"
Answer 1:
We can use a PCRE regex with (*SKIP)(*FAIL) to reject any , that comes after a ( and before the closing ), and then split on every remaining , followed by zero or more whitespace characters (\\s*):
strsplit(x, '\\([^)]+,(*SKIP)(*FAIL)|,\\s*', perl=TRUE)[[1]]
#[1] "This is it" "isn't it (well, yes)"
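To check that the pattern also copes with several commas inside one pair of parentheses, it can be wrapped in a small helper and tried on a second string. This is only a sketch: the helper name split_outside_parens and the second test string are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper (the name is illustrative): split on commas,
# skipping any comma that follows a "(" with no ")" in between.
split_outside_parens <- function(s) {
  strsplit(s, "\\([^)]+,(*SKIP)(*FAIL)|,\\s*", perl = TRUE)[[1]]
}

split_outside_parens("This is it, isn't it (well, yes)")
# expected: "This is it" "isn't it (well, yes)"

split_outside_parens("a, b (c, d, e), f")
# expected: "a" "b (c, d, e)" "f"
```

Note that perl = TRUE is required here, because (*SKIP) and (*FAIL) are PCRE backtracking verbs and are not understood by R's default regex engine.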
Answer 2:
I would suggest another regex with (*SKIP)(*F) to ignore all the (...) substrings and only match the commas outside of parenthesized substrings:
x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too)"
strsplit(x, "\\([^()]*\\)(*SKIP)(*F)|\\h*,\\h*", perl = TRUE)
See IDEONE demo
You can read more in the Stack Overflow question "How do (*SKIP) or (*F) work on regex?". The regex matches:
- \\( - an opening bracket
- [^()]* - zero or more characters other than ( and )
- \\) - a closing bracket
- (*SKIP)(*F) - the verbs that discard the match so far and advance the current regex index to the position after the closing bracket
- | - or...
- \\h*,\\h* - a comma surrounded with zero or more horizontal whitespace characters.
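The breakdown above can be tried directly. The sketch below runs the pattern on the sample string and on a second, made-up string (an assumption for illustration) with spaces on both sides of the commas, to show that \\h* also consumes the whitespace before a comma.

```r
# Pattern from this answer: skip whole (...) substrings, then split on
# commas together with any horizontal whitespace around them.
pat <- "\\([^()]*\\)(*SKIP)(*F)|\\h*,\\h*"

x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too)"
parts <- strsplit(x, pat, perl = TRUE)[[1]]
print(parts)
# expected three pieces:
#   "This is it"
#   "isn't it (well, yes)"
#   "and (well, this, that, and this, too)"

# Illustrative input with whitespace on both sides of each comma.
parts2 <- strsplit("a , b (c , d)", pat, perl = TRUE)[[1]]
print(parts2)
# expected: "a" "b (c , d)"
```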
Answer 3:
A different approach:
Adding on to @Wiktor's sample string,
x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too). Let's look, does it work?"
Now the magic:
> strsplit(x, ", |(?>\\(.*?\\).*?\\K(, |$))", perl = TRUE)
[[1]]
[1] "This is it"
[2] "isn't it (well, yes)"
[3] "and (well, this, that, and this, too). Let's look"
[4] "does it work?"
So how does , |(?>\\(.*?\\).*?\\K(, |$)) match?
- | matches either of the alternatives on either side:
  - on the left, the string ", "
  - and on the right, (?>\\(.*?\\).*?\\K(, |$)):
    - (?> ... ) sets up an atomic group, which does not allow backtracking to reevaluate what it matches.
    - In this case, it looks for an open parenthesis (\\(),
    - then any character (.) repeated from 0 to infinity times (*), but as few as possible (?), i.e. .* is evaluated lazily.
    - The previous .*? repetition is then limited by the first close parenthesis (\\)),
    - followed by another run of any character, again repeated as few times as possible (.*?),
    - with a \\K at the end, which throws away the match so far and sets the starting point of a new match.
    - The previous .*? is limited by a capturing group (( ... )) with an | that either
      - selects an actual text string, ", ",
      - or moves \\K to the end of the line, $, if there are no more commas.
*Whew.*
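As a sanity check, the atomic-group pattern can be compared with a (*SKIP)(*F) pattern on the same sample string. This is a sketch for this one input only; the variable names are assumptions, and agreement here does not show that the two patterns behave the same on every string.

```r
x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too). Let's look, does it work?"

# Pattern explained above: atomic group plus \K.
by_atomic <- strsplit(x, ", |(?>\\(.*?\\).*?\\K(, |$))", perl = TRUE)[[1]]

# A (*SKIP)(*F) pattern that skips whole (...) substrings, for comparison.
by_skip <- strsplit(x, "\\([^()]*\\)(*SKIP)(*F)|\\h*,\\h*", perl = TRUE)[[1]]

print(by_atomic)
print(identical(by_atomic, by_skip))
```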
If my explanation is confusing, see the docs linked above, and check out regex101.com, where you can put in the above regex (with single escapes, \, instead of R-style double escapes, \\) and a test string to see what it matches and get an explanation of what it's doing. You'll need to set the g (global) modifier in the box next to the regex box to show all matches and not just the first.
Happy strsplitting!
Source: https://stackoverflow.com/questions/35347537/using-strsplit-in-r-ignoring-anything-in-parentheses