Split text into sentences even when Mr. or Mrs. appears in the text

Posted by 感情迁移 on 2021-01-03 07:24:26

Question


I want to split a text into sentences using the full stop (.) as the delimiter.

For instance:

Mr. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.

If I split the above text, I get 3 sentences:

1. Mr.

2. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character.

3. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.


I want Mr. to stay attached to the sentence that follows it, so that the text splits into two sentences, not three:

1. Mr. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character.

2. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.

Kindly help me. I appreciate the instant feedback from the community.

Thanks.


Answer 1:


If you are looking for a way to avoid splitting sentences after an abbreviation (like a.m.), that's a difficult natural language problem.

If you just want to split sentences without worrying about Mr. or Mrs. (and have a character that won't likely show up in the text, like *), here's a simple way:

  1. replace all instances of Mr. and Mrs. with Mr* and Mrs*
  2. split text on .
  3. in the resulting array, replace all instances of Mr* and Mrs* with Mr. and Mrs.

Here's a version that uses NUL as a sentinel character, as it's pretty much impossible for it to show up in text unintentionally:

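using System;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<string> Splitter(string sentences)
{
    // NUL stands in for the period of "Mr." / "Mrs." so the split skips it.
    char sentinel = '\0';
    return sentences.Replace("Mr.", "Mr" + sentinel)
        .Replace("Mrs.", "Mrs" + sentinel)
        // Split on period + space; the delimiter is dropped from each piece.
        .Split(new[] { ". " }, StringSplitOptions.None)
        // Restore the protected periods.
        .Select(s => s.Replace("Mr" + sentinel, "Mr.")
                        .Replace("Mrs" + sentinel, "Mrs."));
}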

If you're the paranoid sort of person who thinks any particular character is liable to show up in your text, feel free to use a GUID for the sentinel.
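
A minimal sketch of that GUID variant, assuming the same replace, split, and restore steps as the NUL version. The class name `GuidSentinelSplitter` is made up for illustration; `Guid.NewGuid().ToString("N")` yields 32 hexadecimal digits, so the sentinel contains no period of its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GuidSentinelSplitter
{
    public static List<string> Split(string sentences)
    {
        // A fresh GUID per call; the "N" format is 32 hex digits with no
        // punctuation, so the sentinel cannot be split on by accident.
        string sentinel = Guid.NewGuid().ToString("N");

        return sentences.Replace("Mr.", "Mr" + sentinel)
            .Replace("Mrs.", "Mrs" + sentinel)
            // As in the NUL version, the ". " delimiter is dropped.
            .Split(new[] { ". " }, StringSplitOptions.None)
            .Select(s => s.Replace("Mr" + sentinel, "Mr.")
                          .Replace("Mrs" + sentinel, "Mrs."))
            .ToList();
    }
}
```

Because the split is on a period followed by a space, only the final piece keeps its trailing period.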




Answer 2:


The only way I can think of right now is to add intelligence to the split function, so that it knows when to treat . as a delimiter and when not to.

You can do this as follows, where <dot> is an escape string of your choice:

  1. Replace all occurrences of <dot> by <dot><dot>.
  2. Replace all Mr. (and other entries in the dictionary) by Mr<dot>.
  3. Split the text using the remaining dots.
  4. Replace all Mr<dot> (and other entries) by Mr.
  5. Replace all occurrences of <dot><dot> by <dot>.

Of course, you can use another escape character or string.

You can keep a dictionary of abbreviations, preferably in a file, so that you can use a different dictionary for each language.
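
The steps above can be sketched roughly as follows. This is a variation, not the exact procedure: instead of doubling the escape string, it uses two distinct escape codes (a control character followed by "0" or "1") so that unescaping stays unambiguous. The class name, the choice of `\u0001` as escape character, and the abbreviation list passed by the caller are all assumptions made for illustration. Abbreviations are matched as plain substrings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DictionarySplitter
{
    // Hypothetical escape codes: Esc + "0" encodes a literal Esc found in the
    // input, Esc + "1" encodes a period that belongs to an abbreviation.
    private const string Esc = "\u0001";
    private const string EscLiteral = Esc + "0";
    private const string EscDot = Esc + "1";

    public static List<string> Split(string text, IEnumerable<string> abbreviations)
    {
        // Escape any escape characters that already occur in the input.
        string work = text.Replace(Esc, EscLiteral);

        // Protect every period inside each known abbreviation,
        // e.g. "Mr." becomes "Mr" + EscDot.
        foreach (string abbr in abbreviations)
        {
            work = work.Replace(abbr, abbr.Replace(".", EscDot));
        }

        // Split on the remaining periods, then undo both escape codes.
        return work.Split('.')
            .Select(s => s.Replace(EscDot, ".").Replace(EscLiteral, Esc).Trim())
            .Where(s => s.Length > 0)
            .ToList();
    }
}
```

The abbreviation list could be loaded per language with something like `File.ReadAllLines(path)`. The splitting period is dropped from each piece, so `Split("Mr. Bean is a series. Mrs. Smith wakes at 6 a.m. daily.", new[] { "Mr.", "Mrs.", "a.m." })` yields two pieces without their final periods.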




Answer 3:


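using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static IEnumerable<string> Splitter(string sentences)
{
    // Split on every period that is not immediately preceded by "mr" or "mrs"
    // (case-insensitive); the negative look-behind keeps those periods intact.
    foreach (string s in
        Regex.Split(sentences, @"(?<!mr|mrs)\.", RegexOptions.IgnoreCase))
    {
        // Skip empty pieces and put back the period that the split consumed.
        if (!String.IsNullOrWhiteSpace(s)) yield return s.Trim() + ".";
    }
}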

A simple regex-based answer using negative look-behind.
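
As a hedged extension of the look-behind idea, the pattern can be built from a list of abbreviations. The sketch below makes two additions that are not in the answer: the abbreviation list (including "Ms" and "Dr") is hypothetical, and a look-ahead requires whitespace or end of text after the period, so that decimals such as 3.50 are not split. The `\b` word boundary keeps words that merely end in "ms" or "dr" from being treated as abbreviations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindSplitter
{
    // Hypothetical abbreviation list; extend it for your own text.
    private static readonly string[] Abbreviations = { "Mr", "Mrs", "Ms", "Dr" };

    public static List<string> Split(string text)
    {
        // Builds (?<!\b(?:Mr|Mrs|Ms|Dr))\.(?=\s|$):
        // a period not preceded by a listed abbreviation and followed by
        // whitespace or the end of the text.
        string alternation = string.Join("|", Abbreviations.Select(Regex.Escape));
        string pattern = @"(?<!\b(?:" + alternation + @"))\.(?=\s|$)";

        return Regex.Split(text, pattern, RegexOptions.IgnoreCase)
            .Select(s => s.Trim())
            .Where(s => s.Length > 0)
            // Put back the period that the split consumed.
            .Select(s => s + ".")
            .ToList();
    }
}
```

Like the answer's version, this appends a period to every piece, so it assumes each sentence in the input ends with one.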



Source: https://stackoverflow.com/questions/5325800/split-text-into-sentence-even-mr-mrs-exists-in-a-text
