get indices of original text from nltk word_tokenize

走了就别回头了 · 2020-12-16 22:32

I am tokenizing a text using nltk.word_tokenize and I would like to also get the index, in the original raw text, of the first character of every token.
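
For example (a made-up snippet just to illustrate the goal; word_tokenize itself only returns the token strings, not their positions):

    import nltk

    raw = "I've found a medicine for my disease."
    tokens = nltk.word_tokenize(raw)
    # ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
    # Desired: the start index of each token in raw,
    # e.g. 'found' -> 5, 'medicine' -> 13, ...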

3 Answers
  • 野趣味 (OP)
    2020-12-16 22:51

    You can also do this:

    import nltk

    def spans(txt):
        # Tokenize, then scan forward through the raw string so each token
        # is located at or after the end of the previous one.
        tokens = nltk.word_tokenize(txt)
        offset = 0
        for token in tokens:
            offset = txt.find(token, offset)
            yield token, offset, offset + len(token)
            offset += len(token)
    
    
    s = "And now for something completely different and."
    for token in spans(s):
        print token
        assert token[0]==s[token[1]:token[2]]
    

    And you get:

    ('And', 0, 3)
    ('now', 4, 7)
    ('for', 8, 11)
    ('something', 12, 21)
    ('completely', 22, 32)
    ('different', 33, 42)
    ('.', 42, 43)
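
    One caveat (my own note, not part of the answer above): word_tokenize rewrites some characters, e.g. a straight double quote " comes back as the tokens `` and '', so txt.find(token, offset) returns -1 for those tokens and every span after them shifts. Below is a minimal sketch that maps such tokens back to their surface form before searching; spans_robust and QUOTE_FIXES are just names for this sketch, and the mapping may need extending for your text:

    import nltk

    # Assumed mapping: word_tokenize turns a straight double quote into
    # the tokens `` (opening) and '' (closing).
    QUOTE_FIXES = {'``': '"', "''": '"'}

    def spans_robust(txt):
        offset = 0
        for token in nltk.word_tokenize(txt):
            surface = QUOTE_FIXES.get(token, token)  # what actually appears in txt
            offset = txt.find(surface, offset)
            yield token, offset, offset + len(surface)
            offset += len(surface)

    s = 'He said "hello" to me.'
    for token, start, end in spans_robust(s):
        print(token, start, end)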
    
