How can I split a text into sentences using the Stanford parser?

终归单人心 2020-11-27 14:52
How can I split a text or paragraph into sentences using the Stanford parser?
Is there a method that extracts sentences, such as getSentencesFromString()?

      
      
        
12 answers
借酒劲吻你 (original poster) 2020-11-27 15:39
Using the .NET (C#) package:
This splits the text into sentences, keeps parentheses as written rather than normalizing them to -LRB-/-RRB-, and preserves the original spacing and punctuation:

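using System.IO;
using edu.stanford.nlp.process; // PTBTokenizer, TokenizerFactory, CoreLabelTokenFactory, DocumentPreprocessor

// Note: UTF8Reader and InputStreamWrapper are not Stanford classes. They are assumed
// to come from the IKVM runtime that the .NET package is built on.
public class NlpDemo
{
    // invertible=true keeps the original text and whitespace on every token;
    // the normalize* options stop brackets from being rewritten (e.g. "(" to "-LRB-").
    public static readonly TokenizerFactory TokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(),
                "normalizeParentheses=false,normalizeOtherBrackets=false,invertible=true");

    public void ParseFile(string fileName)
    {
        using (var stream = File.OpenRead(fileName))
        {
            SplitSentences(stream);
        }
    }

    public void SplitSentences(Stream stream)
    {
        // Wrap the .NET stream so the Java-side preprocessor can read it as UTF-8.
        var preProcessor = new DocumentPreprocessor(new UTF8Reader(new InputStreamWrapper(stream)));
        preProcessor.setTokenizerFactory(TokenizerFactory);

        // Each iteration yields one sentence as a list of tokens.
        foreach (java.util.List sentence in preProcessor)
        {
            ProcessSentence(sentence);
        }
    }

    // Print the sentence with its original spaces and punctuation.
    public void ProcessSentence(java.util.List sentence)
    {
        System.Console.WriteLine(edu.stanford.nlp.util.StringUtils.joinWithOriginalWhiteSpace(sentence));
    }
}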


Input:
 - This sentence's characters possess a certain charm, one often found in punctuation and prose. This is a second sentence? It is indeed.

Output:
3 sentences ('?' is considered an end-of-sentence delimiter)

Note: for a sentence like "Mrs. Havisham's class was impeccable (as far as one could see!) in all aspects.", the tokenizer correctly recognizes that the period after "Mrs" does not end the sentence. However, it incorrectly treats the "!" inside the parentheses as an end-of-sentence marker and splits off "in all aspects." as a second sentence.
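One possible workaround for the over-splitting described in the note is to post-process the sentence strings and glue a fragment back onto the previous sentence when it starts with a lowercase letter. The sketch below is only a heuristic: SentenceMerger and MergeLowercaseContinuations are hypothetical names, not part of the Stanford library, and the helper works on plain strings (for example, the output of joinWithOriginalWhiteSpace), not on token lists.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Stanford NLP.
public static class SentenceMerger
{
    // Merges a sentence into the previous one when it begins with a lowercase
    // letter, on the assumption that such a fragment is a continuation that
    // was split off too early.
    public static List<string> MergeLowercaseContinuations(IEnumerable<string> sentences)
    {
        var merged = new List<string>();
        foreach (var sentence in sentences)
        {
            var trimmed = sentence.TrimStart();
            bool continuesPrevious = merged.Count > 0
                && trimmed.Length > 0
                && char.IsLower(trimmed[0]);

            if (continuesPrevious)
            {
                // Re-join with a single space; the original whitespace between
                // the two pieces is not preserved by this sketch.
                merged[merged.Count - 1] = merged[merged.Count - 1].TrimEnd() + " " + trimmed;
            }
            else
            {
                merged.Add(sentence);
            }
        }
        return merged;
    }
}
```

Passing the two fragments from the note, "Mrs. Havisham's class was impeccable (as far as one could see!)" and "in all aspects.", yields a single merged sentence. The trade-off is that a sentence that legitimately starts with a lowercase word (a brand name such as "iPhone", for instance) would be merged as well, so this is only suitable where such sentences are rare.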
    
             
                                                        
            
            
              
                