Figuring out where to add punctuation in bad user-generated content?

Question

Is there a way to use NLP or an existing library to add missing punctuation to bad user-generated content?

For example, this string:

Today is Tuesday I went to work on Monday Friday was off

would become:

Today is Tuesday. I went to work on Monday. Friday was off.

Answer 1:

I think this problem falls under sentence boundary disambiguation (http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation). I have used OpenNLP's variant and was satisfied with the results.

Answer 2:

I've played briefly with this problem (with only partial success).

Your example text is missing only periods; if that's the only punctuation you're interested in restoring, @Rahul's suggestion of looking at sentence boundary disambiguation techniques is probably appropriate. If you're hoping to restore other punctuation as well, you might need something a little different. For example, you might want to transform:

Im still busy but ill call you when I can Feeling any better than yesterday

into:

I'm still busy but I'll call you when I can. Feeling any better than yesterday?

Note that both sentences are relatively grammatical (which might greatly affect the accuracy of your punctuation restoration system).

My recommendation is to train a character-based n-gram language model and use it to score candidate punctuation insertions in a Levenshtein (edit) distance calculation. LingPipe's Spelling-correction Tutorial is a good place to start. Their edit-distance calculator is easy to customize so that it allows only insertions and, in your case, only insertions of the specific punctuation characters you're interested in. Note: I'd estimate that an n-gram length of 8-12 characters would be appropriate here; you could go a little larger, but my guess is that you're not likely to see large improvements beyond that range.
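To make the idea concrete, here is a minimal Python sketch of scoring punctuation insertions with a character n-gram model. It is not LingPipe's API; the names `CharNGramModel` and `restore_punctuation` are made up for illustration. It assumes add-k smoothing, a small beam search, and insertions only directly before a space or at the end of the input, so it would not restore apostrophes inside words. The default order is small only because toy training data is small; with a real corpus you would raise it toward the 8-12 range mentioned above.

```python
import math
from collections import defaultdict


class CharNGramModel:
    """Character n-gram language model with add-k smoothing (a simplification)."""

    def __init__(self, order=5, k=0.1):
        self.order = order
        self.k = k
        self.ngram_counts = defaultdict(int)    # (context, next_char) -> count
        self.context_counts = defaultdict(int)  # context -> count
        self.vocab = set()

    def train(self, texts):
        for text in texts:
            # "^" pads the start and "$" marks the end of each training text.
            padded = "^" * (self.order - 1) + text + "$"
            self.vocab.update(padded)
            for i in range(self.order - 1, len(padded)):
                context = padded[i - self.order + 1:i]
                self.ngram_counts[(context, padded[i])] += 1
                self.context_counts[context] += 1

    def logprob(self, history, char):
        """Log P(char | last order-1 characters of history)."""
        n = self.order - 1
        context = history[-n:] if n else ""
        num = self.ngram_counts.get((context, char), 0) + self.k
        # +1 reserves some probability mass for characters never seen in training.
        den = self.context_counts.get(context, 0) + self.k * (len(self.vocab) + 1)
        return math.log(num / den)


def restore_punctuation(text, model, marks=".?", beam_width=8):
    """Insertion-only search: optionally add one mark before each space or the end."""
    start = "^" * (model.order - 1)
    beams = [(0.0, start)]  # (log-probability, output so far including padding)
    for ch in text + "$":
        candidates = []
        for score, out in beams:
            options = [ch]  # keep the input character unchanged
            if ch in " $":
                # Or insert one punctuation mark in front of it.
                options += [m + ch for m in marks]
            for opt in options:
                s, o = score, out
                for c in opt:
                    s += model.logprob(o, c)
                    o += c
                candidates.append((s, o))
        candidates.sort(key=lambda b: b[0], reverse=True)
        beams = candidates[:beam_width]
    best = beams[0][1]
    return best[len(start):-1]  # strip the start padding and the "$" end marker
```

You would train it with something like `model.train(list_of_punctuated_sentences)` and then call `restore_punctuation("today is tuesday i went to work on monday", model)`. Because every inserted character costs some probability, a model trained on too little text will tend to insert nothing, which is one more reason to train on plenty of in-domain data.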

As always when training any NLP model, your performance will improve if you can train your model on text that matches your target domain fairly closely. If you don't have enough in-domain data, you could combine a large standard corpus (e.g. newswire text) with a smaller in-domain set, and upweight your in-domain data somewhat (just replicating it n times and shuffling randomly with the out-of-domain text often works pretty well).
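The replicate-and-shuffle trick can be sketched in a few lines of Python. The function name `build_training_set` and the default `weight=3` are assumptions for illustration; the right weight depends on how much in-domain data you have and is worth tuning on held-out text.

```python
import random


def build_training_set(in_domain, out_of_domain, weight=3, seed=0):
    """Combine two corpora, repeating the in-domain texts `weight` times."""
    combined = list(out_of_domain) + list(in_domain) * weight
    # Shuffle with a fixed seed so the mix is reproducible.
    random.Random(seed).shuffle(combined)
    return combined
```

For example, `build_training_set(forum_posts, newswire_sentences, weight=5)` would return a list in which each forum post appears five times among the newswire sentences (both argument names are placeholders for your own lists of strings).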

Source: https://stackoverflow.com/questions/22948506/figuring-out-where-to-add-punctuation-in-bad-user-generated-content

Tags

ruby

algorithm

language-agnostic

nlp