php - preg_split() with multiple patterns not splitting quoted string

Submitted by 喜夏-厌秋 on 2019-12-24 08:08:23

Question


I need to split a paragraph into sentences, and that is where I got confused with the regex.

I have already looked at the question that this one was marked as a duplicate of, but the issue here is different.

Here is an example of the string I need to split:

hello! how are you? how is life
live life, live free. "isnt it?"

Here is the code I tried:

$sentence_array = preg_split('/([.!?\r\n|\r|\n])+(?![^"]*")/', $paragraph, -1);

What I need is:

array (  
  [0] => "hello"  
  [1] => "how are you"  
  [2] => "how is life"  
  [3] => "live life, live free"  
  [4] => ""isnt it?""  
)

What I get is:

array(
  [0] => "hello! how are you? how is life live life, live free. "isnt it?""
)

When there are no quotes in the string, the split works as required.

Any help is appreciated. Thank you.


Answer 1:


Your regular expression has several problems, the main one being that it confuses group constructs with character classes. Inside a character class, a pipe | is a literal |; it has no special meaning.
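To make the point about character classes concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration. The last call is my own reading of why the original pattern returns a single element, not something stated above: the negative lookahead rejects every delimiter that has a double quote anywhere after it.

```php
<?php
// Inside a character class, "|" is an ordinary character, not alternation.
$classMatchesPipe = preg_match('/^[a|b]$/', '|');   // 1: the class matches a literal pipe
$groupMatchesPipe = preg_match('/^(a|b)$/', '|');   // 0: in a group, "|" separates alternatives

// So the class from the question also treats "|" as a delimiter.
$parts = preg_split('/[.!?\r\n|\r|\n]+/', 'a|b.c'); // ['a', 'b', 'c']

// Assumption being illustrated: (?![^"]*") fails whenever a double quote
// appears later in the string, so no delimiter before it can split.
$unsplit = preg_split('/[.!?]+(?![^"]*")/', 'one. two. "three"');

var_dump($classMatchesPipe, $groupMatchesPipe);
print_r($parts);
print_r($unsplit); // a single element holding the whole string
```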

What you need is this:

("[^"]*")|[!?.]+\s*|\R+

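This first tries to match a string enclosed in double quotation marks and captures it, quotes included, so that PREG_SPLIT_DELIM_CAPTURE returns it as an element of the result. Failing that, it matches a run of punctuation marks from the set [!?.] plus any trailing whitespace and splits there. Failing that, it matches one or more newline sequences (\R). PREG_SPLIT_NO_EMPTY drops the empty strings that would otherwise appear between adjacent delimiters.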

PHP:

var_dump(preg_split('~("[^"]*")|[!?.]+\s*|\R+~', <<<STR
hello! how are you? how is life
live life, live free. "isnt it?"
STR
, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY));

Output:

array(5) {
  [0]=>
  string(5) "hello"
  [1]=>
  string(11) "how are you"
  [2]=>
  string(11) "how is life"
  [3]=>
  string(20) "live life, live free"
  [4]=>
  string(10) ""isnt it?""
}
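If this is needed in more than one place, the call can be wrapped in a small function. The following is only a sketch: the name split_sentences and the extra trim step are my own additions, not part of the answer.

```php
<?php
// Hypothetical helper (the name is illustrative): wraps the pattern above.
function split_sentences(string $paragraph): array
{
    $pattern = '~("[^"]*")|[!?.]+\s*|\R+~';
    $flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;
    $parts = preg_split($pattern, $paragraph, -1, $flags);
    if ($parts === false) {
        return []; // preg_split() returns false on a regex error
    }
    // Trim stray whitespace and drop pieces that end up empty.
    $trimmed = array_map('trim', $parts);
    $nonEmpty = array_filter($trimmed, static function (string $s): bool {
        return $s !== '';
    });
    return array_values($nonEmpty);
}

$result = split_sentences("hello! how are you? how is life\nlive life, live free. \"isnt it?\"");
print_r($result);
```

Note that with this pattern a quoted string always becomes its own element, even when it sits in the middle of a sentence, because the quoted part itself acts as a (captured) delimiter.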



Answer 2:


As I see it, your splitting on punctuation already works, except when double quotes are involved. The fix can be phrased as: split after such punctuation, or after such punctuation followed by a double quote.

The split should happen where the previous character is one of your markers and the next character is not a double quote, or where the previous two characters are a marker followed by a double quote. Because lookarounds are zero-width, the punctuation itself stays in the resulting pieces. This gives the following pattern:

(?<=[.!?\r\n])(?=[^"])|(?<=[.!?\r\n]")(?=.)

Code sample:

$input = "hello! how \"are\" \"you?\" how is life\nlive life, live free. \"isnt it?\"";
$sentence_array = preg_split('/(?<=[.!?\r\n])(?=[^"])|(?<=[.!?\r\n]\")(?=.)/', $input, -1);
print_r($sentence_array);

Array ( [0] => hello! [1] => how "are" "you?" [2] => how is life
    [3] => live life, live free. [4] => "isnt it?" )
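Since the split points are zero-width, the pieces keep the whitespace around them (a leading space, or the trailing newline after "how is life"). A minimal sketch of cleaning that up, assuming trimmed sentences are what is wanted; the variable names are illustrative:

```php
<?php
$input = "hello! how \"are\" \"you?\" how is life\nlive life, live free. \"isnt it?\"";
$pattern = '/(?<=[.!?\r\n])(?=[^"])|(?<=[.!?\r\n]")(?=.)/';

// Split at the zero-width positions, then strip surrounding whitespace
// from each piece; the punctuation and quotes are left in place.
$sentences = array_map('trim', preg_split($pattern, $input));
print_r($sentences);
```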


Source: https://stackoverflow.com/questions/52551031/php-preg-split-with-multiple-patterns-not-splitting-quoted-string
