Does the NLTK sentence tokenizer assume correct punctuation and spacing?

拈花ヽ惹草 submitted on 2019-12-13 17:56:35

Question


I'm trying to split sentences using NLTK, and I've noticed that it treats sentences with no whitespace between them as a single sentence. For instance:

from nltk.tokenize import sent_tokenize

text = 'Today is Monday.I went shopping.'
sentences = sent_tokenize(text)
# 1) Today is Monday.I went shopping.

text = 'Today is Monday. I went shopping.'
sentences = sent_tokenize(text)
# 1) Today is Monday.
# 2) I went shopping.

Is there a way to properly split mispunctuated/misspaced sentences?


Answer 1:


While sentence segmentation is not very complicated for most Western languages, as you've encountered it still goes wrong every now and then. There are several tools for this (OpenNLP and CoreNLP both have their own modules for it, for example); sent_tokenize from NLTK is fairly rudimentary and probably regex-based. You can 'repair' your output with something like the following:

import re

s = 'Today is Monday.I went shopping.Tomorrow is Tuesday.'

# Find every word-character, dot, word-character pattern and record the index
# just after the dot, i.e. the start of the next (unspaced) sentence.
slices = []
for match in re.finditer(r'\w\.\w', s):
    slices.append(match.start() + 2)
slices.append(len(s))

# Cut the string at each recorded position.
offset = 0
subsentences = []
for pos in sorted(slices):
    subsent = s[offset:pos]
    offset += len(subsent)
    subsentences.append(subsent)
print(subsentences)

This splits the string on a word character followed by a dot followed by a word character. Note that word characters also include digits, so you may want to change \w to [a-zA-Z] or similar, and perhaps also change the . to any punctuation character; a variant along those lines is sketched below.
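For example, a minimal sketch of that variant (the sample string is made up for illustration) restricts both sides of the boundary to letters and accepts . ! or ? as the splitting punctuation:

import re

s = 'Is it 2.5 km?I think so.Yes.'

# Restricting both sides to letters keeps decimals like "2.5" from being
# treated as sentence boundaries; [.!?] also catches ? and ! endings.
slices = [m.start() + 2 for m in re.finditer(r'[a-zA-Z][.!?][a-zA-Z]', s)]
slices.append(len(s))

offset = 0
subsentences = []
for pos in slices:
    subsentences.append(s[offset:pos])
    offset = pos
print(subsentences)  # ['Is it 2.5 km?', 'I think so.', 'Yes.']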




Answer 2:


The Punkt tokenizer (which NLTK is using here) is easy to train as long as you have enough plain text, but as far as I can tell it doesn't consider splitting sentences on internal periods like the one in shopping.Tomorrow. A minimal sketch of training it on your own text is shown below.
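For reference, a rough sketch of training Punkt on your own plain text; the file path here is hypothetical, standing in for any large corpus with conventional spacing:

from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical path: a large plain-text file with normal spacing and
# punctuation from your target domain.
with open('my_corpus.txt', encoding='utf-8') as f:
    training_text = f.read()

# Passing raw text to the constructor runs Punkt's unsupervised training.
tokenizer = PunktSentenceTokenizer(training_text)
print(tokenizer.tokenize('Today is Monday. I went shopping.'))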

Although training data would be easy to create by removing spaces between sentences in an existing corpus, I can't think of a commonly-used sentence segmenter that would support this case, since they're either trained with one sentence per line (OpenNLP) or rely on a previous tokenization step that wouldn't split this into three tokens (CoreNLP, many others).

If you have any way to make the data look more like newspaper text in extraction/pre-processing steps (see my answer to another question: https://stackoverflow.com/a/44860394/461847), it becomes much easier to use standard tools and you could save yourself a lot of hassle.

Instead of post-processing your detected sentences as @igor suggests, I would suggest pre-processing the text to insert spaces at positions that look like sentence boundaries and then running sent_tokenize(), since there are probably cases like dates that contain periods but aren't sentence boundaries. It obviously depends on your data. A rough sketch of that pre-processing follows.
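As one possible illustration, the lowercase-period-uppercase heuristic below is just an assumption about what a missing boundary looks like; adjust it to your data:

import re
from nltk.tokenize import sent_tokenize

def add_boundary_spaces(text):
    # Insert a space after a period sitting between a lowercase and an
    # uppercase letter -- a rough guess at a missing sentence boundary.
    # Decimals ("2.5") and abbreviations like "U.S." are left alone because
    # they don't match the lowercase-period-uppercase pattern.
    return re.sub(r'(?<=[a-z])\.(?=[A-Z])', '. ', text)

print(sent_tokenize(add_boundary_spaces('Today is Monday.I went shopping.')))
# ['Today is Monday.', 'I went shopping.']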



Source: https://stackoverflow.com/questions/51693199/does-the-nltk-sentence-tokenizer-assume-correct-punctuation-and-spacing
