I have random text stored in $sentences. Using regex, I want to split the text into sentences:
function splitSentences($text) {
$re = \
I believe it is impossible to build a bullet-proof sentence splitter, considering that user-generated content is not always grammatically and syntactically correct. Moreover, reaching 100% correct results is impossible due to the technical imperfections of scraping/content-extraction tools, which may fail to produce clean content and leave whitespace or punctuation rubbish behind. Finally, business is now biased towards a good-enough strategy: if you manage to split the text correctly 95% of the time, that is in most cases considered a success.
Now, any sentence-splitting task is an NLP task, and one, two, or even three regexes are not enough. Rather than devising your own regex chain, I'd advise using an existing NLP library for this.
The following is a rough list of the rules used to split sentences.
- Each linebreak separates sentences.
- The end of the text indicates the end of a sentence if not otherwise ended through proper punctuation.
- Sentences must be at least two words long, unless a linebreak or end-of-text.
- An empty line is not a sentence.
- Each question- or exclamation mark or combination thereof, is considered the end of a sentence.
- A single period is considered the end of a sentence, unless...
- It is preceded by one word, or...
- It is followed by one word.
- A sequence of multiple periods is not considered the end of a sentence.
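As an illustration only (not the library's actual implementation), a fraction of these rules can be sketched with a single preg_split call. The function name naiveSplit and the single-letter abbreviation guard are my own assumptions:

```php
<?php
// Naive sketch only: split after a chain of ., ! or ? followed by whitespace,
// unless the period closes a single-letter "word" (a crude guard for
// abbreviations like U.S.A.). Covers only a few of the rules listed above.
function naiveSplit(string $text): array {
    $parts = preg_split('~(?<=[.!?])(?<!\b\p{L}\.)\s+~u', trim($text));
    return array_values(array_filter(array_map('trim', $parts), 'strlen'));
}

print_r(naiveSplit("Hello there! How are you? The U.S.A. is large. Fine."));
// [0] => Hello there!  [1] => How are you?
// [2] => The U.S.A. is large.  [3] => Fine.
```

Note how the fixed-length negative lookbehind keeps "U.S.A. is large." together: the period after "A" is preceded by a word boundary and a single letter, so no split happens there.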
Usage example (the Sentence class from the php-sentence library):
$Sentence = new \Sentence;
$sentences = $Sentence->split($text); // Split into array of sentences
$count = $Sentence->count($text); // Count the number of sentences
Sample code (a ClassifierBasedTokenizer with a custom end-of-sentence classifier, based on the NlpTools documentation example):
use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Classifiers\ClassifierInterface;
use \NlpTools\Documents\DocumentInterface;

class EndOfSentence implements ClassifierInterface
{
    public function classify(array $classes, DocumentInterface $d) {
        list($token, $before, $after) = $d->getDocumentData();
        $dotcnt = count(explode('.', $token)) - 1;
        $lastdot = substr($token, -1) == '.';
        if (!$lastdot) // assume that all sentences end in full stops
            return 'O';
        if ($dotcnt > 1) // to catch some naive abbreviations U.S.A.
            return 'O';
        return 'EOW';
    }
}
$tok = new ClassifierBasedTokenizer(
new EndOfSentence(),
new WhitespaceTokenizer()
);
$text = "We are what we repeatedly do.
Excellence, then, is not an act, but a habit.";
print_r($tok->tokenize($text));
// Array
// (
// [0] => We are what we repeatedly do.
// [1] => Excellence, then, is not an act, but a habit.
// )
IMPORTANT NOTE: Most NLP tokenization models I tested do not handle glued sentences well (i.e. sentences with no space after the terminating punctuation). However, if you add a space after each punctuation chain, sentence-splitting quality improves. Just add this before sending the text to the sentence-splitting function:
$txt = preg_replace('~\p{P}+~', "$0 ", $txt);
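The pattern above inserts a space after every punctuation chain, even where a space already follows, and it does not set the u modifier, so multibyte punctuation may be missed on UTF-8 input. A slightly tightened variant (my own refinement, not part of the original snippet) only inserts a space when the punctuation is glued to the next character:

```php
<?php
$txt = "Hello!How are you?I am fine,thanks.";
// Insert a space only after a punctuation chain directly followed by a
// non-space character; the u modifier enables Unicode punctuation matching.
$txt = preg_replace('~\p{P}+(?=\S)~u', "$0 ", $txt);
echo $txt; // Hello! How are you? I am fine, thanks.
```

The (?=\S) lookahead also prevents a stray trailing space when the text ends in punctuation.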