How to prevent splitting specific words or phrases and numbers in NLTK?

独厮守ぢ · 2020-12-11 07:47

I have a problem with text matching: when I tokenize text, it splits specific words, dates and numbers. How can I prevent phrases like "run in my family", "30 minute w

2 Answers

  •  刺人心 (OP)
     2020-12-11 08:39

    You can use NLTK's `MWETokenizer`:

    from nltk import word_tokenize
    from nltk.tokenize import MWETokenizer
    
    # word_tokenize keeps '20-30' as a single token, so list it that way in the MWE;
    # matched tokens are re-joined with the separator (default '_')
    tokenizer = MWETokenizer([('20-30', 'minutes', 'a', 'day')])
    tokenizer.tokenize(word_tokenize('Yes 20-30 minutes a day on my bike, it works great!!'))
    

    [out]:

    ['Yes', '20-30_minutes_a_day', 'on', 'my', 'bike', ',', 'it', 'works', 'great', '!', '!']
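    The joining character is configurable via `MWETokenizer`'s optional `separator` parameter, so for simple cases you can merge with a space directly and skip any post-processing. A minimal sketch on a pre-tokenized list (the token list below is what `word_tokenize` would produce for this input):

```python
from nltk.tokenize import MWETokenizer

# separator=' ' re-joins the matched multi-word expression with spaces
# instead of the default underscore.
tokenizer = MWETokenizer([('20-30', 'minutes', 'a', 'day')], separator=' ')
tokens = ['Yes', '20-30', 'minutes', 'a', 'day', 'on', 'my', 'bike']
print(tokenizer.tokenize(tokens))
# → ['Yes', '20-30 minutes a day', 'on', 'my', 'bike']
```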
    

    A more principled approach, since you don't know how `word_tokenize` will split the words you want to keep:

    from nltk import word_tokenize
    from nltk.tokenize import MWETokenizer
    
    def multiword_tokenize(text, mwe):
        # Initialize the MWETokenizer
        protected_tuples = [word_tokenize(word) for word in mwe]
        protected_tuples_underscore = ['_'.join(word) for word in protected_tuples]
        tokenizer = MWETokenizer(protected_tuples)
        # Tokenize the text.
        tokenized_text = tokenizer.tokenize(word_tokenize(text))
        # Replace the underscored protected words with the original MWE
        for i, token in enumerate(tokenized_text):
            if token in protected_tuples_underscore:
                tokenized_text[i] = mwe[protected_tuples_underscore.index(token)]
        return tokenized_text
    
    mwe = ['20-30 minutes a day', '!!']
    print(multiword_tokenize('Yes 20-30 minutes a day on my bike, it works great!!', mwe))
    

    [out]:

    ['Yes', '20-30 minutes a day', 'on', 'my', 'bike', ',', 'it', 'works', 'great', '!!']
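    The same idea covers the other phrase from the question. A self-contained sketch (the example sentence and its token list are hypothetical; `word_tokenize` would yield the same tokens for this input):

```python
from nltk.tokenize import MWETokenizer

# Protect the phrase "run in my family" from the question.
tokenizer = MWETokenizer([('run', 'in', 'my', 'family')], separator=' ')
tokens = ['Heart', 'problems', 'run', 'in', 'my', 'family', '.']
print(tokenizer.tokenize(tokens))
# → ['Heart', 'problems', 'run in my family', '.']
```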
    
