nltk wordpunct_tokenize vs word_tokenize

Submitted by 断了今生、忘了曾经 on 2020-02-18 08:05:31

Question


Does anyone know the difference between nltk's wordpunct_tokenize and word_tokenize? I'm using nltk=3.2.4, and nothing in the docstring of wordpunct_tokenize explains the difference. I couldn't find this information in the nltk documentation either (perhaps I didn't search in the right place!). I would have expected the first one to get rid of punctuation tokens or the like, but it doesn't.


Answer 1:


wordpunct_tokenize is based on a simple regexp tokenization. It is defined as

wordpunct_tokenize = WordPunctTokenizer().tokenize

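which you can find in the NLTK source. In effect, it splits the input with the regular expression \w+|[^\w\s]+, so each token is either a run of word characters or a run of characters that are neither word characters nor whitespace.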
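To see what that pattern does on its own, here is a minimal sketch that applies it with the standard re module. The helper name regex_wordpunct is made up for illustration and is not part of nltk; it only approximates wordpunct_tokenize under the assumption that the pattern quoted above is the whole story.

```python
import re

# The pattern quoted in the answer: a run of word characters, or a run of
# characters that are neither word characters nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def regex_wordpunct(text):
    """Hypothetical helper: approximate wordpunct_tokenize with plain re."""
    return WORDPUNCT_PATTERN.findall(text)

tokens = regex_wordpunct("Don't tell her, 'Hey', she'll say!")
print(tokens)
```

Note that adjacent punctuation characters come out as a single token (the quote and comma after Hey form "',"), because the second alternative matches a whole run.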

word_tokenize, on the other hand, is based on a TreebankWordTokenizer (see the NLTK documentation). It tokenizes text following the Penn Treebank conventions. Here is a silly example that shows how the two differ.

>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize
>>> sent = "I'm a dog and it's great! You're cool and Sandy's book is big. Don't tell her, you'll regret it! 'Hey', she'll say!"
>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re", 
 'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 'tell',
 'her', ',', 'you', "'ll", 'regret', 'it', '!', "'Hey", "'", ',', 'she', "'ll", 'say', '!']
>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'",
 're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don',
 "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', "'", 
 'Hey', "',", 'she', "'", 'll', 'say', '!']

As we can see, wordpunct_tokenize splits at pretty much every special symbol and emits the symbols as separate tokens (adjacent symbols stay together, as in "',"). word_tokenize, on the other hand, keeps things like 're together. It doesn't seem to be all that smart, though: as we can see, it fails to separate the initial single quote from 'Hey'.
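Neither function discards punctuation, which is what the question had expected. If that is the goal, one possible approach is to filter the token list afterwards. The sketch below is an assumption about what "getting rid of punctuation tokens" should mean (drop any token with no letter or digit in it), and drop_punctuation is a hypothetical helper, not an nltk function. The token lists are shortened copies of the outputs above.

```python
def drop_punctuation(tokens):
    """Hypothetical helper: keep only tokens with at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

# Shortened copies of the two outputs shown above.
from_word_tokenize = ['Do', "n't", 'tell', 'her', ',', 'you', "'ll", 'regret', 'it', '!']
from_wordpunct = ['Don', "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!']

words_a = drop_punctuation(from_word_tokenize)
words_b = drop_punctuation(from_wordpunct)
print(words_a)
print(words_b)
```

With the word_tokenize output, the contraction pieces such as "n't" and "'ll" survive the filter. With the wordpunct_tokenize output, the apostrophes are dropped and bare fragments such as 't' and 'll' remain.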

Interestingly, if we write the sentence like this instead (single quotes as string delimiter and double quotes around "Hey"):

>>> sent = 'I\'m a dog and it\'s great! You\'re cool and Sandy\'s book is big. Don\'t tell her, you\'ll regret it! "Hey", she\'ll say!'

we get

>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re", 
 'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 
 'tell', 'her', ',', 'you', "'ll", 'regret', 'it', '!', '``', 'Hey', "''", 
 ',', 'she', "'ll", 'say', '!']

so word_tokenize does split off double quotes; however, it also converts them to `` (opening) and '' (closing). wordpunct_tokenize doesn't do this:

>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'", 
 're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don', 
 "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', '"', 
 'Hey', '",', 'she', "'", 'll', 'say', '!']
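If the converted quote tokens from word_tokenize get in the way, they can be mapped back after tokenizing. This is only a sketch: restore_double_quotes and QUOTE_MAP are hypothetical names, and the code assumes every `` or '' token stood for a plain double quote in the original text, which it cannot actually verify.

```python
# Treebank-style quote tokens, as seen in the word_tokenize output above.
QUOTE_MAP = {"``": '"', "''": '"'}

def restore_double_quotes(tokens):
    """Hypothetical helper: turn `` and '' tokens back into a plain double quote."""
    return [QUOTE_MAP.get(tok, tok) for tok in tokens]

# Tail of the word_tokenize output shown above.
tokens = ['!', '``', 'Hey', "''", ',', 'she', "'ll", 'say', '!']
restored = restore_double_quotes(tokens)
print(restored)
```

Only whole tokens are replaced, so contraction pieces such as "'ll" are left alone.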


Source: https://stackoverflow.com/questions/50240029/nltk-wordpunct-tokenize-vs-word-tokenize
