get indices of original text from nltk word_tokenize

走了就别回头了 2020-12-16 22:32

I am tokenizing a text using nltk.word_tokenize and I would like to also get the index in the original raw text of the first character of every token.
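
For illustration, the desired behaviour looks like this (the sample sentence here is hypothetical, and nltk's punkt tokenizer data is assumed to be downloaded):

    import nltk

    text = "hello, my name is John."
    tokens = nltk.word_tokenize(text)
    # tokens -> ['hello', ',', 'my', 'name', 'is', 'John', '.']
    # Desired: the character offset of each token's first character in text,
    # i.e. 'hello' -> 0, ',' -> 5, 'my' -> 7, 'name' -> 10, 'is' -> 15, 'John' -> 18, '.' -> 22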

3 Answers
  •  生来不讨喜
    2020-12-16 23:07

    pytokenizations has a useful function, get_original_spans, that returns these spans:

    # $ pip install pytokenizations
    import tokenizations

    tokens = ["hello", "world"]
    text = "Hello world"
    print(tokenizations.get_original_spans(tokens, text))
    # => [(0, 5), (6, 11)]
    

    This function can handle noisy texts:

    tokens = ["a", "bc"]
    original_text = "å\n \tBC"
    tokenizations.get_original_spans(tokens, original_text)
    >>> [(0,1), (4,6)]
    

    See the documentation for other useful functions.
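
    Putting this together with the question's tokenizer, a minimal sketch (the sample sentence is only an illustration, and nltk's punkt data is assumed to be downloaded) that feeds nltk.word_tokenize output into get_original_spans:

    import nltk
    import tokenizations

    text = "hello, my name is John."
    tokens = nltk.word_tokenize(text)
    spans = tokenizations.get_original_spans(tokens, text)
    # spans[i] is the (start, end) character offset of tokens[i] in text,
    # so the index of each token's first character is spans[i][0]
    starts = [start for start, _ in spans]
    print(list(zip(tokens, starts)))
    # expected: [('hello', 0), (',', 5), ('my', 7), ('name', 10), ('is', 15), ('John', 18), ('.', 22)]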
