I have random text stored in $sentences. Using regex, I want to split the text into sentences:
function splitSentences($text) {
$re = \
I believe it is impossible to build a bullet-proof sentence splitter, considering that user-generated content is not always grammatically and syntactically correct. Moreover, reaching 100% correct results is impossible due to the technical imperfections of scraping/content-extraction tools, which may fail to produce clean content and leave whitespace or punctuation rubbish behind. Finally, business is now biased towards a good-enough strategy: if you manage to split the text correctly 95% of the time, that is in most cases considered a success.
Now, any sentence-splitting task is an NLP task, and one, two, or even three regexes are not enough. Rather than devising your own regex chain, I'd advise using an existing NLP library for this.
The following is a rough list of the rules used to split sentences.
- Each linebreak separates sentences.
- The end of the text indicates the end of a sentence if not otherwise ended through proper punctuation.
- Sentences must be at least two words long, unless a linebreak or end-of-text.
- An empty line is not a sentence.
- Each question- or exclamation mark or combination thereof, is considered the end of a sentence.
- A single period is considered the end of a sentence, unless...
- It is preceded by one word, or...
- It is followed by one word.
- A sequence of multiple periods is not considered the end of a sentence.
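As an illustration only (not the library's actual implementation), a fraction of these rules can be sketched with a single preg_split call. The function name naiveSplit and the single-letter abbreviation guard are my own assumptions:

```php
<?php
// Naive sketch only: split after a chain of ., ! or ? followed by whitespace,
// unless the period closes a single-letter "word" (a crude guard for
// abbreviations like U.S.A.). Covers only a few of the rules listed above.
function naiveSplit(string $text): array {
    $parts = preg_split('~(?<=[.!?])(?<!\b\p{L}\.)\s+~u', trim($text));
    return array_values(array_filter(array_map('trim', $parts), 'strlen'));
}

print_r(naiveSplit("Hello there! How are you? The U.S.A. is large. Fine."));
// [0] => Hello there!  [1] => How are you?
// [2] => The U.S.A. is large.  [3] => Fine.
```

Note how the fixed-length negative lookbehind keeps "U.S.A. is large." together: the period after "A" is preceded by a word boundary and a single letter, so no split happens there.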
Usage example (the Sentence class from the php-sentence library):
$Sentence = new \Sentence;
$sentences = $Sentence->split($text); // Split into array of sentences
$count = $Sentence->count($text); // Count the number of sentences
Sample code (a ClassifierBasedTokenizer with a custom end-of-sentence classifier, based on the NlpTools documentation example):
use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Classifiers\ClassifierInterface;
use \NlpTools\Documents\DocumentInterface;

class EndOfSentence implements ClassifierInterface
{
    public function classify(array $classes, DocumentInterface $d) {
        list($token, $before, $after) = $d->getDocumentData();
        $dotcnt = count(explode('.', $token)) - 1;
        $lastdot = substr($token, -1) == '.';
        if (!$lastdot) // assume that all sentences end in full stops
            return 'O';
        if ($dotcnt > 1) // to catch some naive abbreviations U.S.A.
            return 'O';
        return 'EOW';
    }
}
$tok = new ClassifierBasedTokenizer(
new EndOfSentence(),
new WhitespaceTokenizer()
);
$text = "We are what we repeatedly do.
Excellence, then, is not an act, but a habit.";
print_r($tok->tokenize($text));
// Array
// (
// [0] => We are what we repeatedly do.
// [1] => Excellence, then, is not an act, but a habit.
// )
IMPORTANT NOTE: Most NLP tokenization models I tested do not handle glued sentences well (i.e. sentences with no space after the terminating punctuation). However, if you add a space after each punctuation chain, sentence-splitting quality improves. Just add this before sending the text to the sentence-splitting function:
$txt = preg_replace('~\p{P}+~', "$0 ", $txt);
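The pattern above inserts a space after every punctuation chain, even where a space already follows, and it does not set the u modifier, so multibyte punctuation may be missed on UTF-8 input. A slightly tightened variant (my own refinement, not part of the original snippet) only inserts a space when the punctuation is glued to the next character:

```php
<?php
$txt = "Hello!How are you?I am fine,thanks.";
// Insert a space only after a punctuation chain directly followed by a
// non-space character; the u modifier enables Unicode punctuation matching.
$txt = preg_replace('~\p{P}+(?=\S)~u', "$0 ", $txt);
echo $txt; // Hello! How are you? I am fine, thanks.
```

The (?=\S) lookahead also prevents a stray trailing space when the text ends in punctuation.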