Does the NLTK sentence tokenizer assume correct punctuation and spacing?

拈花ヽ惹草 submitted on 2019-12-13 17:56:35

Question


I'm trying to split sentences using NLTK, and I've noticed that it treats sentences with no whitespace between them as a single sentence. For instance:

from nltk.tokenize import sent_tokenize

text = 'Today is Monday.I went shopping.'
sentences = sent_tokenize(text)
# 1) Today is Monday.I went shopping.

text = 'Today is Monday. I went shopping.'
sentences = sent_tokenize(text)
# 1) Today is Monday.
# 2) I went shopping.

Is there a way to properly split mispunctuated/misspaced sentences?


Answer 1:


While sentence segmentation is not very complicated for most Western languages, as you've encountered it still goes wrong every now and then. There are several tools for this (OpenNLP and CoreNLP both have their own modules for it, for example); sent_tokenize from NLTK is fairly rudimentary and probably regex-based. You can 'repair' your output with something like the following:

import re

s = 'Today is Monday.I went shopping.Tomorrow is Tuesday.'

# Find every word-character, dot, word-character pattern and record the index
# just after the dot, i.e. the start of the next (unspaced) sentence.
slices = []
for match in re.finditer(r'\w\.\w', s):
    slices.append(match.start() + 2)
slices.append(len(s))

# Cut the string at each recorded position.
offset = 0
subsentences = []
for pos in sorted(slices):
    subsent = s[offset:pos]
    offset += len(subsent)
    subsentences.append(subsent)
print(subsentences)

This splits the string on a word character followed by a dot followed by a word character. Note that word characters also include digits, so you may want to change \w to [a-zA-Z] or similar, and perhaps also change the . to any punctuation character; a variant along those lines is sketched below.
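For example, a minimal sketch of that variant (the sample string is made up for illustration) restricts both sides of the boundary to letters and accepts . ! or ? as the splitting punctuation:

import re

s = 'Is it 2.5 km?I think so.Yes.'

# Restricting both sides to letters keeps decimals like "2.5" from being
# treated as sentence boundaries; [.!?] also catches ? and ! endings.
slices = [m.start() + 2 for m in re.finditer(r'[a-zA-Z][.!?][a-zA-Z]', s)]
slices.append(len(s))

offset = 0
subsentences = []
for pos in slices:
    subsentences.append(s[offset:pos])
    offset = pos
print(subsentences)  # ['Is it 2.5 km?', 'I think so.', 'Yes.']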




Answer 2:


The Punkt tokenizer (which NLTK is using here) is easy to train as long as you have enough plain text, but as far as I can tell it doesn't consider splitting sentences on internal periods like the one in shopping.Tomorrow. A minimal sketch of training it on your own text is shown below.
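For reference, a rough sketch of training Punkt on your own plain text; the file path here is hypothetical, standing in for any large corpus with conventional spacing:

from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical path: a large plain-text file with normal spacing and
# punctuation from your target domain.
with open('my_corpus.txt', encoding='utf-8') as f:
    training_text = f.read()

# Passing raw text to the constructor runs Punkt's unsupervised training.
tokenizer = PunktSentenceTokenizer(training_text)
print(tokenizer.tokenize('Today is Monday. I went shopping.'))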

Although training data would be easy to create by removing spaces between sentences in an existing corpus, I can't think of a commonly-used sentence segmenter that would support this case, since they're either trained with one sentence per line (OpenNLP) or rely on a previous tokenization step that wouldn't split this into three tokens (CoreNLP, many others).

If you have any way to make the data look more like newspaper text in extraction/pre-processing steps (see my answer to another question: https://stackoverflow.com/a/44860394/461847), it becomes much easier to use standard tools and you could save yourself a lot of hassle.

Instead of post-processing your detected sentences as @igor suggests, I would suggest pre-processing the text to insert spaces at positions that look like sentence boundaries and then running sent_tokenize(), since there are probably cases like dates that contain periods but aren't sentence boundaries. It obviously depends on your data. A rough sketch of that pre-processing follows.
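As one possible illustration, the lowercase-period-uppercase heuristic below is just an assumption about what a missing boundary looks like; adjust it to your data:

import re
from nltk.tokenize import sent_tokenize

def add_boundary_spaces(text):
    # Insert a space after a period sitting between a lowercase and an
    # uppercase letter -- a rough guess at a missing sentence boundary.
    # Decimals ("2.5") and abbreviations like "U.S." are left alone because
    # they don't match the lowercase-period-uppercase pattern.
    return re.sub(r'(?<=[a-z])\.(?=[A-Z])', '. ', text)

print(sent_tokenize(add_boundary_spaces('Today is Monday.I went shopping.')))
# ['Today is Monday.', 'I went shopping.']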



Source: https://stackoverflow.com/questions/51693199/does-the-nltk-sentence-tokenizer-assume-correct-punctuation-and-spacing
