I have a problem with text matching when I tokenize text: the tokenizer splits specific words, dates and numbers. How can I prevent it from splitting phrases like "run in my family", "30 minute w
You can use the MWETokenizer:
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer
tokenizer = MWETokenizer([('20-30', 'minutes', 'a', 'day')])
tokenizer.tokenize(word_tokenize('Yes 20-30 minutes a day on my bike, it works great!!'))
[out]:
['Yes', '20-30_minutes_a_day', 'on', 'my', 'bike', ',', 'it', 'works', 'great', '!', '!']
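If you don't want the tokens joined with underscores at all, MWETokenizer also accepts a separator argument, so you can keep the phrase intact in one step. A minimal sketch, using the "run in my family" phrase from the question and a hand-built token list:

```python
from nltk.tokenize import MWETokenizer

# Join the matched multi-word expression with a space instead of
# the default '_' separator, so the phrase survives verbatim.
tokenizer = MWETokenizer([('run', 'in', 'my', 'family')], separator=' ')
tokens = ['It', 'seems', 'to', 'run', 'in', 'my', 'family', '.']
print(tokenizer.tokenize(tokens))
# ['It', 'seems', 'to', 'run in my family', '.']
```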
A more principled approach, since you don't know how `word_tokenize` will split the words you want to keep:
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer
def multiword_tokenize(text, mwe):
    # Initialize the MWETokenizer with the tokenized form of each MWE
    protected_tuples = [word_tokenize(word) for word in mwe]
    protected_tuples_underscore = ['_'.join(word) for word in protected_tuples]
    tokenizer = MWETokenizer(protected_tuples)
    # Tokenize the text.
    tokenized_text = tokenizer.tokenize(word_tokenize(text))
    # Replace the underscored protected words with the original MWE
    for i, token in enumerate(tokenized_text):
        if token in protected_tuples_underscore:
            tokenized_text[i] = mwe[protected_tuples_underscore.index(token)]
    return tokenized_text
mwe = ['20-30 minutes a day', '!!']
print(multiword_tokenize('Yes 20-30 minutes a day on my bike, it works great!!', mwe))
[out]:
['Yes', '20-30 minutes a day', 'on', 'my', 'bike', ',', 'it', 'works', 'great', '!!']
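You don't have to know all the phrases up front, either: MWETokenizer exposes add_mwe for registering expressions incrementally. A small sketch with a made-up phrase:

```python
from nltk.tokenize import MWETokenizer

# Start with an empty tokenizer and register phrases as you discover them.
tokenizer = MWETokenizer()
tokenizer.add_mwe(('30', 'minute', 'walk'))
print(tokenizer.tokenize(['a', '30', 'minute', 'walk', 'daily']))
# ['a', '30_minute_walk', 'daily']
```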