Get indices of original text from nltk word_tokenize


I am tokenizing a text with nltk.word_tokenize, and I would also like to get the index in the original raw text of the first character of every token, i.e.


                      
3 Answers

野趣味 (2020-12-16 22:51)

You can also do this:

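import nltk  # word_tokenize needs NLTK's Punkt tokenizer data to be installed

def spans(txt):
    """Yield (token, start, end) for every token word_tokenize finds in txt."""
    tokens = nltk.word_tokenize(txt)
    offset = 0
    for token in tokens:
        # Search from the end of the previous token so that a repeated
        # word is matched at its next occurrence, not its first one.
        offset = txt.find(token, offset)
        yield token, offset, offset + len(token)
        offset += len(token)


s = "And now for something completely different and."
for token in spans(s):
    print(token)
    # Each span must slice the original string back to the token.
    assert token[0] == s[token[1]:token[2]]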


This prints:

('And', 0, 3)
('now', 4, 7)
('for', 8, 11)
('something', 12, 21)
('completely', 22, 32)
('different', 33, 42)
('and', 43, 46)
('.', 46, 47)
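One caveat with the find-based approach: word_tokenize rewrites double quotes as `` and '', so those tokens never occur in the raw text and str.find returns -1 for them. The following is a minimal sketch of one way to cope with that, not part of NLTK. The helper name token_spans is hypothetical, and the token list is written out by hand as an assumption about what word_tokenize typically returns for this sentence, so the example runs without NLTK installed.

```python
def token_spans(tokens, text):
    """Return (token, start, end) triples by scanning text left to right.

    Assumes tokens came from a Treebank-style tokenizer that may have
    rewritten a plain double quote as `` or ''.
    """
    spans = []
    offset = 0
    for token in tokens:
        # Surface forms the token may have in the raw text.
        candidates = [token]
        if token in ("``", "''"):
            candidates.append('"')
        hits = [(text.find(c, offset), c) for c in candidates]
        hits = [(pos, c) for pos, c in hits if pos != -1]
        if not hits:
            raise ValueError(f"cannot align token {token!r} after offset {offset}")
        # Take the earliest match so the scan never skips ahead needlessly.
        start, surface = min(hits)
        end = start + len(surface)
        spans.append((token, start, end))
        offset = end
    return spans


# Hand-written tokens (assumed tokenizer output for this sentence).
tokens = ["He", "said", "``", "hi", "''", "."]
text = 'He said "hi".'
result = token_spans(tokens, text)
for item in result:
    print(item)
```

For the quote tokens, the span points at the original " character, so slicing the raw text with a span gives the surface form rather than the rewritten token.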

                                                                        
                                                        
            
            
              
                
陌清茗 (2020-12-16 23:04)

I think what you are looking for is the span_tokenize() method.
Apparently the default word_tokenize function does not provide it.
Here is a code example with another tokenizer.

from nltk.tokenize import WhitespaceTokenizer
s = "Good muffins cost $3.88\nin New York."
# span_tokenize yields (start, end) offsets into s instead of token strings.
span_generator = WhitespaceTokenizer().span_tokenize(s)
spans = list(span_generator)
print(spans)


Which gives:

[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36)]


To get just the start offsets:

offsets = [span[0] for span in spans]
print(offsets)

Which gives:

[0, 5, 13, 18, 24, 27, 31]
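To make it concrete what these spans are, the same whitespace-delimited offsets can be reproduced with only the standard library. This is a sketch for illustration, not NLTK's implementation, and the function name whitespace_spans is hypothetical.

```python
import re


def whitespace_spans(text):
    """Return (start, end) offsets of every run of non-whitespace characters."""
    return [match.span() for match in re.finditer(r"\S+", text)]


s = "Good muffins cost $3.88\nin New York."
spans = whitespace_spans(s)
print(spans)

# Slicing the original string with each span recovers the token text.
tokens = [s[start:end] for start, end in spans]
print(tokens)
```

Because each span indexes into the untouched original string, punctuation stays attached to its word here ("York."), which is the main difference from word_tokenize.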


For more information on the different tokenizers available, see the nltk.tokenize API docs.
生来不讨喜 (2020-12-16 23:07)

pytokenizations has a useful function, get_original_spans, that returns the span of each token in the original text:

# pip install pytokenizations
import tokenizations
tokens = ["hello", "world"]
text = "Hello world"
tokenizations.get_original_spans(tokens, text)
# [(0, 5), (6, 11)]


This function can also handle noisy text, where the tokens differ from the original in case, accents, or whitespace:

tokens = ["a", "bc"]
original_text = "å\n \tBC"
tokenizations.get_original_spans(tokens, original_text)
# [(0, 1), (4, 6)]


See the documentation for other useful functions.
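To show roughly what tolerant matching involves, here is a minimal standard-library sketch that ignores case and accents when locating tokens. It is not the algorithm pytokenizations uses, the names fold and loose_spans are hypothetical, and it only covers the simple cases shown above.

```python
import unicodedata


def fold(ch):
    """Lowercase one character and drop its combining accents."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def loose_spans(tokens, text):
    """Return one (start, end) span per token, or None when no match is found."""
    # Fold character by character so folded[i] still corresponds to text[i].
    folded = [fold(ch) for ch in text]
    spans = []
    pos = 0
    for token in tokens:
        target = "".join(fold(ch) for ch in token)
        found = None
        for start in range(pos, len(text)):
            if not target:
                break
            # Accumulate folded characters until they are as long as the target.
            acc, end = "", start
            while end < len(text) and len(acc) < len(target):
                acc += folded[end]
                end += 1
            if acc == target:
                found = (start, end)
                break
        spans.append(found)
        if found is not None:
            pos = found[1]
    return spans


simple = loose_spans(["hello", "world"], "Hello world")
noisy = loose_spans(["a", "bc"], "\u00e5\n \tBC")
print(simple)
print(noisy)
```

The scan is quadratic in the worst case and does no fuzzy matching beyond case and accent folding, so for real data a dedicated alignment library is the safer choice.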
                                                                        
                                                        
            
            
              
                