Question
I need to split a paragraph into sentences, and that is where I got a bit confused with the regex.
I have already read the question that this one is marked as a duplicate of, but the issue here is different.
Here is an example of the string I need to split:
hello! how are you? how is life
live life, live free. "isnt it?"
Here is the code I tried:
$sentence_array = preg_split('/([.!?\r\n|\r|\n])+(?![^"]*")/', $paragraph, -1);
What I need is:
array (
[0] => "hello"
[1] => "how are you"
[2] => "how is life"
[3] => "live life, live free"
[4] => ""isnt it?""
)
What I get is:
array(
[0] => "hello! how are you? how is life live life, live free. "isnt it?""
)
When I do not have any quotes in the string, the split works as required.
Any help is appreciated. Thank you.
Answer 1:
Your regular expression has several problems; the main one is confusing group constructs with character classes. A pipe | inside a character class is a literal |. It has no special meaning there.
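To make the point about the pipe concrete, here is a small sketch (the variable names and the 'one|two' sample string are only illustrative) that compares a character class with a group and shows that the class from the question also splits on a literal pipe:

```php
<?php
// Inside a character class, | is an ordinary character, not alternation.
$classMatchesPipe = preg_match('/^[a|b]$/', '|'); // the class contains a, | and b
$groupMatchesPipe = preg_match('/^(a|b)$/', '|'); // the group means "a or b"

// The class from the question therefore treats a literal pipe as a separator too.
$parts = preg_split('/[.!?\r\n|\r|\n]+/', 'one|two');

var_dump($classMatchesPipe, $groupMatchesPipe, $parts);
```

The first match succeeds, the second does not, and the sample string is cut at the pipe.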
What you need is this:
("[^"]*")|[!?.]+\s*|\R+
This first tries to match a string enclosed in double quotation marks and captures it, quotes included. Failing that, it matches a run of punctuation marks from the set [!?.] plus any trailing whitespace and splits on it. Failing that, it matches one or more newline sequences of any kind (\R).
PHP:
var_dump(preg_split('~("[^"]*")|[!?.]+\s*|\R+~', <<<STR
hello! how are you? how is life
live life, live free. "isnt it?"
STR
, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY));
Output:
array(5) {
[0]=>
string(5) "hello"
[1]=>
string(11) "how are you"
[2]=>
string(11) "how is life"
[3]=>
string(20) "live life, live free"
[4]=>
string(10) ""isnt it?""
}
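The two flags matter here. The following sketch (variable names are illustrative) runs the same pattern with and without PREG_SPLIT_DELIM_CAPTURE to show what happens to the quoted sentence:

```php
<?php
$paragraph = "hello! how are you? how is life\nlive life, live free. \"isnt it?\"";
$pattern = '~("[^"]*")|[!?.]+\s*|\R+~';

// With PREG_SPLIT_DELIM_CAPTURE, the quoted sentence captured by group 1
// is returned as its own element.
$withCapture = preg_split($pattern, $paragraph, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// Without it, the quoted sentence is treated as a plain delimiter and discarded.
$withoutCapture = preg_split($pattern, $paragraph, -1, PREG_SPLIT_NO_EMPTY);

print_r($withCapture);
print_r($withoutCapture);
```

Without the capture flag the result has only four elements, because the quoted string is consumed as a separator. PREG_SPLIT_NO_EMPTY removes the empty string that would otherwise follow the final match.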
Answer 2:
I consider the problem of splitting on certain punctuation already solved, except that it fails when double quotes are involved. One way to phrase a solution: split after such punctuation, or after such punctuation followed by a double quote.
The split should happen where the previous character is one of your markers and the next character is not a double quote, or where the previous two characters are a marker followed by a double quote. This gives the following pattern, which uses lookarounds:
(?<=[.!?\r\n])(?=[^"])|(?<=[.!?\r\n]")(?=.)
Code sample:
$input = "hello! how \"are\" \"you?\" how is life\nlive life, live free. \"isnt it?\"";
$sentence_array = preg_split('/(?<=[.!?\r\n])(?=[^"])|(?<=[.!?\r\n]\")(?=.)/', $input, -1);
print_r($sentence_array);
Output:
Array ( [0] => hello! [1] => how "are" "you?" [2] => how is life
[3] => live life, live free. [4] => "isnt it?" )
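Because the lookarounds match zero-width positions, nothing is consumed: each piece keeps its punctuation, and most pieces keep the whitespace that followed the previous marker. If that whitespace is unwanted, one possible clean-up step (a sketch, not part of the original pattern; the variable names are illustrative) is to trim every piece afterwards:

```php
<?php
$input = "hello! how \"are\" \"you?\" how is life\nlive life, live free. \"isnt it?\"";
$pieces = preg_split('/(?<=[.!?\r\n])(?=[^"])|(?<=[.!?\r\n]")(?=.)/', $input);

// Trim surrounding whitespace from each piece and drop any piece left empty.
$sentences = array_values(array_filter(array_map('trim', $pieces), 'strlen'));

print_r($sentences);
```

The punctuation stays attached to each sentence, which differs from the output requested in the question; stripping it would need an additional step such as rtrim with a character list.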
Source: https://stackoverflow.com/questions/52551031/php-preg-split-with-multiple-patterns-not-splitting-quoted-string