NLTK tokenizer and Stanford CoreNLP tokenizer cannot distinguish 2 sentences without a space after the period (.)

Posted by 我的未来我决定 on 2021-02-09 08:17:29

Question


I have 2 sentences in my dataset:

w1 = "I am Pusheen the cat.I am so cute."   # no space after period
w2 = "I am Pusheen the cat. I am so cute."  # with space after period

When I use the NLTK tokenizers (both word and sentence), NLTK cannot split cat.I into separate tokens.

Here is the word tokenizer:

>>> nltk.word_tokenize(w1, 'english')
['I', 'am', 'Pusheen', 'the', 'cat.I', 'am', 'so', 'cute']
>>> nltk.word_tokenize(w2, 'english')
['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute']

and the sentence tokenizer:

>>> nltk.sent_tokenize(w1, 'english')
['I am Pusheen the cat.I am so cute']
>>> nltk.sent_tokenize(w2, 'english')
['I am Pusheen the cat.', 'I am so cute']

How can I fix this, i.e. make NLTK tokenize w1 the same way as w2? In my dataset, words and punctuation are sometimes stuck together.

Update: I tried Stanford CoreNLP 3.7.0; it also fails to split 'cat.I' into 'cat', '.', 'I'.

meow@meow-server:~/projects/stanfordcorenlp$ java edu.stanford.nlp.process.PTBTokenizer sample.txt
I
am
Pusheen
the
cat.I
am
so
cute
.
PTBTokenizer tokenized 9 tokens at 111.21 tokens per second.

Answer 1:


This is by design: a period with no space after it usually does not mark the end of a sentence (think of the periods in "version 4.3", "i.e.", "A.M.", etc.). If sentence ends with no space after the full stop are common in your corpus, you'll have to preprocess the text with a regular expression or similar before sending it to NLTK.

A reasonable rule of thumb is that a lowercase letter followed by a period followed by an uppercase letter usually marks the end of a sentence. To insert a space after the period in such cases, you could use a regular expression, e.g.

import re
w1 = re.sub(r'([a-z])\.([A-Z])', r'\1. \2', w1)
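To see what this rule does and does not touch, here is a minimal sketch that wraps the same pattern in a helper and runs it on the strings from the question and on the counterexamples mentioned above. The helper name add_space_after_fused_period is made up for illustration; only the regular expression comes from the answer.

```python
import re

# Same pattern as in the answer: lowercase letter, period, uppercase letter.
_FUSED_BOUNDARY = re.compile(r'([a-z])\.([A-Z])')


def add_space_after_fused_period(text):
    """Insert a space after a period squeezed between a lowercase
    and an uppercase letter (hypothetical helper, for illustration)."""
    return _FUSED_BOUNDARY.sub(r'\1. \2', text)


w1 = "I am Pusheen the cat.I am so cute."   # no space after period
w2 = "I am Pusheen the cat. I am so cute."  # with space after period

fixed = add_space_after_fused_period(w1)
print(fixed)        # I am Pusheen the cat. I am so cute.
print(fixed == w2)  # True

# Periods that are not lowercase-period-uppercase are left unchanged.
for sample in ["version 4.3", "i.e. this", "at 9 A.M. sharp"]:
    assert add_space_after_fused_period(sample) == sample
```

After this step w1 is identical to w2, which the question shows is tokenized as desired. The rule is only a heuristic: it would also insert a space into strings such as "obj.Method", so it is worth checking a sample of your corpus before applying it everywhere.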


Source: https://stackoverflow.com/questions/44858741/nltk-tokenizer-and-stanford-corenlp-tokenizer-cannot-distinct-2-sentences-withou
