Split text file at sentence boundary

Question

I have to process a text file (an e-book). I'd like to process it so that there is one sentence per line (a "newline-separated file", yes?). How would I do this using sed, the UNIX utility? Does it have a symbol for "sentence boundary", like the symbol for "word boundary" (I think the GNU version has that)? Please note that a sentence can end in a period, an ellipsis, a question mark or an exclamation mark, and the last two can be combined (for example, ?, !, !? and !!!!! are all valid "sentence terminators"). The input file is formatted in such a way that some sentences contain newlines, which have to be removed.

I thought about a script like s/...|. |[!?]+ |/\n/g (unescaped for better reading). But it does not remove the newlines from inside the sentences.

How about in C#? Would it be noticeably faster if I used regular expressions, as in sed? (I think not.) Is there another, faster way?

Either way (sed or C#) is fine. Thank you.

Answer 1:

Regex is a good option, and I used it for a long time.

A regex that worked well for me is:

 string[] sentences = Regex.Split(sentence, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");

However, regex is not efficient. Also, although the logic works for ideal cases, it does not work well in a production environment.

For example, if my text is,

U.S.A. is a wonderful nation. Most people feel happy living there.

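A naive regex that splits at every period breaks this into 5 pieces, although logically it should be split into only two sentences. The lookaround pattern above happens to get this input right, because it only splits on whitespace that is followed by a capital letter, but the same rule still splits wrongly after an abbreviation followed by a capitalized word, as in "Dr. Smith".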
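The counts for this example can be checked from a shell. This is only a rough sketch, assuming GNU sed (for `\n` in the replacement); the function names `count_naive` and `count_heuristic` are made up for illustration, and the second rule is a simplified stand-in for the whitespace-plus-capital-letter idea, not the exact C# pattern.

```shell
#!/bin/sh
# Example text from the answer above.
text='U.S.A. is a wonderful nation. Most people feel happy living there.'

# Naive rule: break at every period, then count the non-blank pieces.
count_naive() {
    printf '%s\n' "$1" | tr '.' '\n' | grep -c '[^[:space:]]'
}

# Heuristic rule: break only where a terminator is followed by spaces
# and a capital letter (assumes GNU sed for \n in the replacement).
count_heuristic() {
    printf '%s\n' "$1" | sed -E 's/([.?!]) +([A-Z])/\1\n\2/g' | grep -c '[^[:space:]]'
}

echo "naive: $(count_naive "$text")"          # naive: 5
echo "heuristic: $(count_heuristic "$text")"  # heuristic: 2
```

The heuristic gets this input right only because "is" starts with a lowercase letter; an abbreviation followed by a capitalized word would still be split.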

This is what made me look for a machine learning technique, and in the end SharpNLP worked quite well for me.

 // Folder that contains the pre-trained model file EnglishSD.nbin.
 private string mModelPath = @"C:\Users\ATS\Documents\Visual Studio 2012\Projects\Google_page_speed_json\Google_page_speed_json\bin\Release\";
 private OpenNLP.Tools.SentenceDetect.MaximumEntropySentenceDetector mSentenceDetector;

 private string[] SplitSentences(string paragraph)
 {
     // Create the detector lazily, on first use.
     if (mSentenceDetector == null)
     {
         mSentenceDetector = new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
     }

     return mSentenceDetector.SentenceDetect(paragraph);
 }

This example uses SharpNLP together with EnglishSD.nbin, a pre-trained model for sentence detection.

If I apply the same input to this method, it correctly splits the text into two logical sentences.

You can also tokenize, POS-tag, chunk, etc., using the SharpNLP project.

For step by step integration of SharpNLP into your C# application, read through the detailed article I have written. It will explain to you the integration with code snippets.

Thanks

Answer 2:

Sentence splitting is a non-trivial problem for which machine learning algorithms have been developed. But splitting on whitespace between [.\?!]+ and a capital letter [A-Z] might be a good heuristic. Remove the newlines first with tr, then apply the RE:

tr '\r\n' ' ' | sed 's/\([.?!]\)\s\s*\([A-Z]\)/\1\n\2/g'

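The output should be one sentence per line. Inspect the output and refine the RE if you find errors. (E.g., "Mr. Ed" would be handled incorrectly, because the RE splits after "Mr."; maybe compile a list of such abbreviations.) Note that \s and \n in the replacement are GNU sed extensions.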
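As a rough sketch of how the pipeline above could be extended with an abbreviation list: the function below joins the lines, temporarily protects the space after a few known abbreviations, splits, and then restores the space. The function name `split_sentences`, the `__NOSPLIT__` marker and the five abbreviations are assumptions chosen for illustration, and the script assumes GNU sed.

```shell
#!/bin/sh
# split_sentences: read text on stdin, print one sentence per line.
# Assumes GNU sed (\b, and \n in the replacement, are GNU extensions).
# The abbreviation list and the __NOSPLIT__ marker are illustrative only.
# Steps: join lines, squeeze whitespace, trim, protect abbreviations,
# split after . ? ! when a capital letter follows, restore the spaces.
split_sentences() {
    # Turn CR and LF into spaces, then add back one final newline for sed.
    { tr '\r\n' '  '; echo; } |
    sed -E \
        -e 's/[[:space:]]+/ /g' \
        -e 's/^ //; s/ $//' \
        -e 's/\b(Mr|Mrs|Ms|Dr|Prof)\. /\1.__NOSPLIT__/g' \
        -e 's/([.?!]) ([A-Z])/\1\n\2/g' \
        -e 's/__NOSPLIT__/ /g'
}

printf 'Mr. Ed is a horse.\nHe talks\nto Wilbur!? Really...  Yes.\n' | split_sentences
# Mr. Ed is a horse.
# He talks to Wilbur!?
# Really...
# Yes.
```

For a whole file this could be run as `split_sentences < book.txt > sentences.txt`, where book.txt is a placeholder name. An abbreviation that really does end a sentence ("... said the Dr. Then ...") would be missed, so the list still needs tuning against the actual text.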

Whether C# or sed is faster can only be determined experimentally.

Answer 3:

You could use something like this to extract the sentences:

var sentences = Regex.Matches(input, @"[\w ,]+[\.!?]+");
foreach (Match match in sentences)
{
  Console.WriteLine(match.Value);
}

This should match sentences containing words, spaces and commas and ending with (any number of) periods, exclamation and question marks.
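For comparison, roughly the same matching approach can be tried from the shell with `grep -o`. This is a sketch under a few assumptions: a grep that supports `-o` (GNU and BSD grep do), newlines inside sentences already removed (grep works line by line), and `[[:alnum:]_ ,]` used as an approximation of the .NET class `[\w ,]`. The function name `extract_sentences` is made up for illustration.

```shell
#!/bin/sh
# extract_sentences: read text on stdin and print each match on its own line.
# [[:alnum:]_ ,] stands in for the .NET class [\w ,] (an approximation).
extract_sentences() {
    grep -oE '[[:alnum:]_ ,]+[.!?]+' |
    sed 's/^ *//'    # drop the space left over from the previous sentence
}

printf '%s\n' 'Hello, world! How are you?? Fine.' | extract_sentences
# Hello, world!
# How are you??
# Fine.
```

Any character outside the class, such as an apostrophe, a quote or a hyphen, cuts a match short: "It's fine." comes out as "s fine.", so the class would likely need widening for real text.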

Answer 4:

You can check my tutorial: http://code.google.com/p/graph-expression/wiki/SentenceSplitting. The basic idea is to define a set of split characters and, for every candidate split, pre- and post-conditions under which a split is impossible. This simple heuristic works very well.

Answer 5:

The task you're interested in is often referred to as 'sentence segmentation'. As larsmans said, it's a non-trivial problem, but heuristic approaches often perform reasonably well, at least for English.

It sounds like you're primarily interested in English, so the regex heuristics already presented may perform adequately for your needs. If you'd like a somewhat more accurate solution (at the cost of just a little more complexity), you might consider using LingPipe, an open-source NLP framework. I've had pretty good luck with LingPipe, the few times I've used it.

See http://alias-i.com/lingpipe/demos/tutorial/sentences/read-me.html for a detailed tutorial on sentence segmentation.

Source: https://stackoverflow.com/questions/5620514/split-text-file-at-sentence-boundary

Tags

sed

nlp

text-segmentation