get indices of original text from nltk word_tokenize

走了就别回头了 · 2020-12-16 22:32

I am tokenizing a text using nltk.word_tokenize and I would like to also get the index, in the original raw text, of the first character of every token.
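
For example (a made-up snippet just to illustrate the goal; word_tokenize itself only returns the token strings, not their positions):

    import nltk

    raw = "I've found a medicine for my disease."
    tokens = nltk.word_tokenize(raw)
    # ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
    # Desired: the start index of each token in raw,
    # e.g. 'found' -> 5, 'medicine' -> 13, ...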

3 Answers
  • 野趣味 (OP)
    2020-12-16 22:51

    You can also do this:

    import nltk

    def spans(txt):
        # Tokenize, then scan forward through the raw string so each token
        # is located at or after the end of the previous one.
        tokens = nltk.word_tokenize(txt)
        offset = 0
        for token in tokens:
            offset = txt.find(token, offset)
            yield token, offset, offset + len(token)
            offset += len(token)
    
    
    s = "And now for something completely different and."
    for token in spans(s):
        print token
        assert token[0]==s[token[1]:token[2]]
    

    And you get:

    ('And', 0, 3)
    ('now', 4, 7)
    ('for', 8, 11)
    ('something', 12, 21)
    ('completely', 22, 32)
    ('different', 33, 42)
    ('.', 42, 43)
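
    One caveat (my own note, not part of the answer above): word_tokenize rewrites some characters, e.g. a straight double quote " comes back as the tokens `` and '', so txt.find(token, offset) returns -1 for those tokens and every span after them shifts. Below is a minimal sketch that maps such tokens back to their surface form before searching; spans_robust and QUOTE_FIXES are just names for this sketch, and the mapping may need extending for your text:

    import nltk

    # Assumed mapping: word_tokenize turns a straight double quote into
    # the tokens `` (opening) and '' (closing).
    QUOTE_FIXES = {'``': '"', "''": '"'}

    def spans_robust(txt):
        offset = 0
        for token in nltk.word_tokenize(txt):
            surface = QUOTE_FIXES.get(token, token)  # what actually appears in txt
            offset = txt.find(surface, offset)
            yield token, offset, offset + len(surface)
            offset += len(surface)

    s = 'He said "hello" to me.'
    for token, start, end in spans_robust(s):
        print(token, start, end)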
    
