Get indices of original text from nltk word_tokenize

走了就别回头了 2020-12-16 22:32

I am tokenizing a text using nltk.word_tokenize, and for every token I would also like to get the index of its first character in the original raw text.

3 Answers
  •  陌清茗 (OP)
    2020-12-16 23:04

    I think what you are looking for is the span_tokenize() method. Apparently this is not supported by the default tokenizer. Here is a code example using another tokenizer.

    from nltk.tokenize import WhitespaceTokenizer

    s = "Good muffins cost $3.88\nin New York."
    # span_tokenize() yields a (start, end) character span for each token
    spans = list(WhitespaceTokenizer().span_tokenize(s))
    print(spans)
    

    Which gives:

    [(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36)]
    

    To get just the start offsets:

    offsets = [span[0] for span in spans]
    print(offsets)  # [0, 5, 13, 18, 24, 27, 31]
    

    For more information on the different tokenizers available, see the nltk.tokenize API docs.
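
If the tokens have to come from a tokenizer that does not expose spans, one possible workaround is to align each token back to the raw text by searching forward from the end of the previous match. The sketch below is only an illustration: token_offsets is a hypothetical helper, not part of nltk, and the token list is written by hand to stand in for tokenizer output. It assumes every token appears verbatim in the text, which does not hold when a tokenizer rewrites characters such as quotation marks.

```python
def token_offsets(text, tokens):
    """Return the start index in `text` of each token, scanning left to right."""
    offsets = []
    cursor = 0  # position in text just past the previously matched token
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            # The token does not occur verbatim, e.g. the tokenizer rewrote it.
            raise ValueError(f"token {token!r} not found after index {cursor}")
        offsets.append(start)
        cursor = start + len(token)
    return offsets


s = "Good muffins cost $3.88\nin New York."
# Hand-written token list standing in for tokenizer output (an assumption).
tokens = ["Good", "muffins", "cost", "$", "3.88", "in", "New", "York", "."]
offsets = token_offsets(s, tokens)
print(offsets)
```

Searching from a moving cursor rather than from the start of the text keeps repeated tokens from all mapping to the first occurrence.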
