I am tokenizing a text using nltk.word_tokenize, and I would also like to get the index in the original raw text of the first character of every token.
pytokenizations has a useful function, get_original_spans, that returns the span of each token in the original text:
# $ pip install pytokenizations
import tokenizations
tokens = ["hello", "world"]
text = "Hello world"
tokenizations.get_original_spans(tokens, text)
>>> [(0,5), (6,11)]
This function can also handle noisy text, for example differences in case, accents, and whitespace between the tokens and the original:
tokens = ["a", "bc"]
original_text = "å\n \tBC"
tokenizations.get_original_spans(tokens, original_text)
>>> [(0,1), (4,6)]
See the documentation for other useful functions.
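If only the index of the first character is needed and every token appears verbatim in the raw text, a dependency-free sketch is to scan the text left to right with str.find. The helper name token_start_offsets below is hypothetical, not part of nltk or pytokenizations, and it assumes the tokenizer did not rewrite any characters (nltk.word_tokenize can, for instance, change double quotes into `` and ''), in which case the lookup raises an error rather than guessing.

```python
def token_start_offsets(tokens, text):
    """Return the index in `text` of the first character of each token.

    Assumes tokens occur verbatim and in order in `text` (hypothetical helper).
    """
    offsets = []
    cursor = 0  # never search before the end of the previous token
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        offsets.append(start)
        cursor = start + len(token)
    return offsets


text = "Hello world, hello again"
tokens = ["Hello", "world", ",", "hello", "again"]
print(token_start_offsets(tokens, text))  # [0, 6, 11, 13, 19]
```

Advancing the cursor past each match keeps repeated tokens from all resolving to their first occurrence. With get_original_spans, the same start indices are simply the first element of each returned span.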