get indices of original text from nltk word_tokenize

走了就别回头了 2020-12-16 22:32

I am tokenizing a text with nltk.word_tokenize, and I would also like to get, for every token, the index of its first character in the original raw text, i.e.

i         


        
3 Answers
  • 2020-12-16 22:51

    You can also do this:

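    import nltk

    def spans(txt):
        """Yield (token, start, end) for each token word_tokenize finds in txt."""
        tokens = nltk.word_tokenize(txt)
        offset = 0
        for token in tokens:
            # Search from the end of the previous token so that a repeated
            # word maps to its next occurrence, not the first one.
            offset = txt.find(token, offset)
            yield token, offset, offset + len(token)
            offset += len(token)


    s = "And now for something completely different."
    for token in spans(s):
        print(token)
        assert token[0] == s[token[1]:token[2]]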
    

    And get:

    ('And', 0, 3)
    ('now', 4, 7)
    ('for', 8, 11)
    ('something', 12, 21)
    ('completely', 22, 32)
    ('different', 33, 42)
    ('.', 42, 43)
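
One caveat with the find-based approach: by default word_tokenize rewrites double quotes as `` and '', so those tokens never occur verbatim in the raw text and txt.find returns -1 for them. The variant below is a minimal sketch of one way to tolerate that, assuming it is acceptable to report no span for such tokens. The helper name align_tokens is hypothetical, and the token list is written out by hand to mimic the tokenizer's output so the example runs without NLTK.

```python
def align_tokens(tokens, text):
    """Yield (token, start, end); start and end are None when the token
    does not occur verbatim in text at or after the current position."""
    offset = 0
    for token in tokens:
        start = text.find(token, offset)
        if start == -1:
            # The tokenizer rewrote this token (e.g. '"' became '``'),
            # so report no span and keep the search position unchanged.
            yield token, None, None
            continue
        end = start + len(token)
        yield token, start, end
        offset = end


# Hand-written token list imitating word_tokenize's quote rewriting.
text = 'He said "hi" twice.'
tokens = ["He", "said", "``", "hi", "''", "twice", "."]
result = list(align_tokens(tokens, text))
for item in result:
    print(item)
```

Leaving the offset untouched on a miss keeps the following tokens aligned instead of restarting the search from a bogus position.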
    
  • 2020-12-16 23:04

    I think what you are looking for is the span_tokenize() method. Apparently it is not supported by the default tokenizer, so here is a code example with another tokenizer.

    from nltk.tokenize import WhitespaceTokenizer
    s = "Good muffins cost $3.88\nin New York."
    span_generator = WhitespaceTokenizer().span_tokenize(s)
    spans = [span for span in span_generator]
    print(spans)
    

    Which gives:

    [(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36)]
    

    To get just the start offsets:

    offsets = [span[0] for span in spans]
    print(offsets)
    # [0, 5, 13, 18, 24, 27, 31]
    

    For more information on the different tokenizers available, see the nltk.tokenize API docs.
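
To make clear what a span tokenizer returns, here is a rough standard-library stand-in for whitespace span tokenization. It is only a sketch: whitespace_spans is a hypothetical helper, not NLTK's implementation, and it assumes a token is any maximal run of non-whitespace characters.

```python
import re


def whitespace_spans(text):
    """Return (start, end) offsets for every run of non-whitespace characters."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


s = "Good muffins cost $3.88\nin New York."
spans = whitespace_spans(s)
# Slicing the original string with each span recovers the token text.
tokens = [s[start:end] for start, end in spans]
print(spans)
print(tokens)
```

For this sentence the spans agree with the WhitespaceTokenizer output shown above, and slicing with them keeps punctuation attached to words ("York.").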

  • 2020-12-16 23:07

    pytokenizations has a useful function, get_original_spans, that returns the spans:

    # $ pip install pytokenizations
    import tokenizations
    tokens = ["hello", "world"]
    text = "Hello world"
    tokenizations.get_original_spans(tokens, text)
    # [(0,5), (6,11)]
    

    This function can handle noisy texts:

    tokens = ["a", "bc"]
    original_text = "å\n \tBC"
    tokenizations.get_original_spans(tokens, original_text)
    # [(0,1), (4,6)]
    

    See the documentation for other useful functions.
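
For intuition about how already-normalized tokens can be mapped back onto noisy text, here is a small standard-library sketch. It is not pytokenizations' algorithm; it assumes the noise is limited to letter case and accents, and the names _fold and noisy_spans are hypothetical.

```python
import unicodedata


def _fold(ch):
    """Lowercase a character and strip its combining accents."""
    decomposed = unicodedata.normalize("NFKD", ch.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def noisy_spans(tokens, text):
    """Return one (start, end) span in text per token, or None if not found."""
    # Build a folded copy of text plus, for each folded character,
    # the index of the original character it came from.
    folded_chars, origin = [], []
    for i, ch in enumerate(text):
        for c in _fold(ch):
            folded_chars.append(c)
            origin.append(i)
    folded = "".join(folded_chars)

    spans, offset = [], 0
    for token in tokens:
        needle = "".join(_fold(ch) for ch in token)
        start = folded.find(needle, offset) if needle else -1
        if start == -1:
            spans.append(None)
            continue
        end = start + len(needle)
        # Translate folded positions back to original character indices.
        spans.append((origin[start], origin[end - 1] + 1))
        offset = end
    return spans


# "\u00e5" is the precomposed character å (a with ring above).
print(noisy_spans(["a", "bc"], "\u00e5\n \tBC"))
print(noisy_spans(["hello", "world"], "Hello world"))
```

Both calls yield the same spans as the get_original_spans examples above; inputs with inserted or deleted characters would need a diff-based alignment instead.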
