Splitting paragraphs into sentences with regexp and PHP

Posted by 大城市里の小女人 on 2019-12-18 11:55:33

Question


I'm a regexp noob and I'm trying to split paragraphs into sentences. In my language we use quite a lot of abbreviations (like: bl.a.) in the middle of sentences, so I have come to the conclusion that what I need to do is look for punctuation marks that are followed by a single space and then a word that starts with a capital letter, like:

[sentence1]...anymore. However...[sentence2]

So a paragraph like:

Der er en lang og bevæget forhistorie bag lov om varsling m.v. i forbindelse med afskedigelser af større omfang. Det er ikke en bureaukratisk lovtekst blandt så mange andre.

Should end in this output:

[0] => Der er en lang og bevæget forhistorie bag lov om varsling m.v. i forbindelse med afskedigelser af større omfang.
[1] => Det er ikke en bureaukratisk lovtekst blandt så mange andre.

and NOT this:

[0] => Der er en lang og bevæget forhistorie bag lov om varsling m.v. 
[1] => i forbindelse med afskedigelser af større omfang.
[2] => Det er ikke en bureaukratisk lovtekst blandt så mange andre.

I have found a solution that does the first part of this with the positive lookbehind feature:

$regexp = '/(?<=[.!?] | [.!?][\'"])/';

and then

$sentences = preg_split($regexp, $paragraph, -1, PREG_SPLIT_NO_EMPTY);

which is a great starting point, but splits way too many times because of the many abbreviations.

I have tried to do this:

(?<=[.!?]\s[A-Z] | [.!?][\'"])

to target every occurrence of either

. or ! or ?

followed by a space and a capital letter, but that did not work.

Does anyone know if there is a way to accomplish what I am trying to do?


Answer 1:


Unicode RegExp for splitting sentences: (?<=[.?!;])\s+(?=\p{Lu})

Explained demo here: http://regex101.com/r/iR7cC8
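
As a minimal sketch of how this pattern could be used with preg_split, assuming the input is valid UTF-8: the lookbehind requires a sentence-ending mark before the split point, the lookahead requires an uppercase letter after it, and only the whitespace in between is consumed. The function name split_sentences is a hypothetical helper for illustration, not something from the answer.

```php
<?php
// Hypothetical helper name; the pattern is the one quoted in the answer.
function split_sentences(string $paragraph): array
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \p{Lu} also matches non-ASCII capitals such as Æ, Ø and Å.
    $pattern = '/(?<=[.?!;])\s+(?=\p{Lu})/u';

    $parts = preg_split($pattern, $paragraph, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (for example invalid UTF-8);
    // in that case fall back to the unsplit paragraph.
    return $parts === false ? [$paragraph] : $parts;
}

$paragraph = 'Der er en lang og bevæget forhistorie bag lov om varsling m.v. '
    . 'i forbindelse med afskedigelser af større omfang. '
    . 'Det er ikke en bureaukratisk lovtekst blandt så mange andre.';

// "m.v. i" is not split because "i" is lowercase.
print_r(split_sentences($paragraph));
```

With the paragraph from the question this yields the two sentences asked for. An abbreviation that happens to be followed by a capitalized word (a proper noun, say) would still be split, so this reduces the problem rather than eliminating it.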




Answer 2:


Searching for such a pattern still seems unreliable, but since sentences may also end at a line break, I would try just the following:

[.\!\?][\s\n\r\t][A-Z] 

I don't think you actually meant to use the lookarounds, did you? (The ! and ? appear together, so I escape them with \, which tells the regex to ignore any special meaning.)
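
One thing to keep in mind when passing a plain pattern like this to preg_split: everything the pattern matches is treated as the delimiter and dropped from the result. The sketch below compares the plain pattern with a variant that wraps the same character classes in lookarounds. The variable names are illustrative, and [A-Z] is kept from the answer, so it assumes sentences start with an ASCII capital (it would miss Æ, Ø and Å).

```php
<?php
$paragraph = 'Der er en lang og bevæget forhistorie bag lov om varsling m.v. '
    . 'i forbindelse med afskedigelser af større omfang. '
    . 'Det er ikke en bureaukratisk lovtekst blandt så mange andre.';

// Plain pattern: the matched ". D" is consumed as the delimiter,
// so the first sentence loses its full stop and the second its first letter.
$consumed = preg_split('/[.!?]\s[A-Z]/', $paragraph, -1, PREG_SPLIT_NO_EMPTY);

// Lookaround variant: the punctuation and the capital letter are only
// asserted, so the whitespace between them is all that gets removed.
$kept = preg_split('/(?<=[.!?])\s+(?=[A-Z])/', $paragraph, -1, PREG_SPLIT_NO_EMPTY);

print_r($consumed);
print_r($kept);
```

Here the second element of $consumed starts with "et er ikke", while $kept holds both sentences intact.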



Source: https://stackoverflow.com/questions/15853097/splitting-paragraphs-into-sentences-with-regexp-and-php
