Question
I have a problem: I want to split a text into sentences using the full stop (.).
For instance:
Mr. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.
If I split the above text, I get 3 sentences:
1. Mr.
2. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character.
3. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.
I want to keep Mr. with the sentence that follows it, so that the text splits into two sentences rather than three:
1. Mr. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character.
2. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.
Kindly help me. I appreciate the instant feedback from the community.
Thanks.
Answer 1:
If you are looking for a way to avoid splitting sentences after an abbreviation (like a.m.), that's a difficult natural language problem.
If you just want to split sentences without breaking on Mr. or Mrs. (and you have a character that is unlikely to show up in the text, like *), here's a simple way:
- replace all instances of Mr. and Mrs. with Mr* and Mrs*
- split text on .
- in the resulting array, replace all instances of Mr* and Mrs* with Mr. and Mrs.
Here's a version that uses NUL as a sentinel character, as it's pretty much impossible for it to show up in text unintentionally:
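using System;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<string> Splitter(string sentences)
{
    // NUL stands in for the period of "Mr." and "Mrs." while splitting.
    char sentinel = '\0';
    return sentences.Replace("Mr.", "Mr" + sentinel)
                    .Replace("Mrs.", "Mrs" + sentinel)
                    // Split on ". "; this removes the period from every sentence but the last.
                    .Split(new[] { ". " }, StringSplitOptions.None)
                    // Put the protected periods back.
                    .Select(s => s.Replace("Mr" + sentinel, "Mr.")
                                  .Replace("Mrs" + sentinel, "Mrs."));
}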
If you're the paranoid sort of person who thinks any particular character is liable to show up in your text, feel free to use a GUID for the sentinel.
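As a rough sketch of that GUID variant: the class name `GuidSentinelSplitter` is made up for illustration, and the only real change from the NUL version is that the sentinel is a freshly generated GUID string. The "N" format is used so the sentinel contains only hex digits and no dots or dashes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper name; same approach as above with a GUID sentinel.
public static class GuidSentinelSplitter
{
    public static List<string> Split(string sentences)
    {
        // 32 hex digits, no dashes or dots, regenerated on every call.
        string sentinel = Guid.NewGuid().ToString("N");

        return sentences.Replace("Mr.", "Mr" + sentinel)
                        .Replace("Mrs.", "Mrs" + sentinel)
                        .Split(new[] { ". " }, StringSplitOptions.None)
                        .Select(s => s.Replace("Mr" + sentinel, "Mr.")
                                      .Replace("Mrs" + sentinel, "Mrs."))
                        .ToList();
    }
}
```

As with the NUL version, splitting on ". " removes the period from every sentence except the last one.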
Answer 2:
The only way I can think of right now is to add some intelligence to the split function, so that it knows when to treat . as a delimiter and when not to.
You can do this like:
- Replace all occurrences of <dot> by <dot><dot>.
- Replace all Mr. (and other entries in the dictionary) by Mr<dot>.
- Split the text using the remaining dots.
- Replace all Mr<dot> (and other...) by Mr.
- Replace all occurrences of <dot><dot> by <dot>.
Of course you can use another escape character/string.
You can keep a dictionary of translations, preferably in a file, so you can use a different dictionary for each language.
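The five steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: the class name `DictionarySplitter` is invented, the dictionary is passed in as a plain list of abbreviations (each written with its periods, such as "Mr."), and the text is assumed to end with a period, since a period is appended to every piece after splitting.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper name; a sketch of the escape-and-split steps.
public static class DictionarySplitter
{
    // Escape string from the steps above; any token unlikely to occur in the text works.
    private const string Dot = "<dot>";

    public static List<string> Split(string text, IEnumerable<string> abbreviations)
    {
        var entries = abbreviations.ToList();

        // Step 1: escape literal occurrences of the token itself.
        string work = text.Replace(Dot, Dot + Dot);

        // Step 2: hide the periods of every dictionary entry ("Mr." becomes "Mr<dot>").
        foreach (string abbr in entries)
            work = work.Replace(abbr, abbr.Replace(".", Dot));

        // Step 3: split on the remaining periods, dropping empty pieces.
        var pieces = work.Split('.')
                         .Select(p => p.Trim())
                         .Where(p => p.Length > 0);

        var result = new List<string>();
        foreach (string piece in pieces)
        {
            // The split removed the sentence-ending period, so add it back.
            string sentence = piece + ".";

            // Step 4: restore the dictionary entries.
            foreach (string abbr in entries)
                sentence = sentence.Replace(abbr.Replace(".", Dot), abbr);

            // Step 5: unescape the token.
            result.Add(sentence.Replace(Dot + Dot, Dot));
        }
        return result;
    }
}
```

A call such as `DictionarySplitter.Split(text, new[] { "Mr.", "Mrs." })` would keep "Mr." attached to the sentence that follows it. To follow the suggestion about per-language files, the list could be loaded with something like `File.ReadAllLines("abbreviations.en.txt")`, where the file name is purely illustrative. The matching is a plain, case-sensitive substring replace, so it does not solve the harder case of a sentence that actually ends with an abbreviation.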
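Answer 3:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static IEnumerable<string> Splitter(string sentences)
{
    // Split on every period that is not directly preceded by "mr" or "mrs" (case-insensitive).
    foreach (string s in
        Regex.Split(sentences, @"(?<!(?:mr|mrs))\.", RegexOptions.IgnoreCase))
    {
        // Skip empty pieces and restore the period that the split removed.
        if (!String.IsNullOrWhiteSpace(s)) yield return s.Trim() + ".";
    }
}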
A simple regex-based answer using negative look-behind.
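A possible variation on the same look-behind idea, offered only as a sketch: instead of splitting on the period itself, split on the whitespace that follows a period, so each sentence keeps its own full stop and periods inside tokens such as 3.14 are left alone. The class name `LookbehindSplitter` and the `titles` parameter are invented for this example, and at least one title is assumed to be supplied.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper name; a variation on the negative look-behind approach.
public static class LookbehindSplitter
{
    // titles are given without the trailing period, e.g. "Mr", "Mrs".
    public static string[] Split(string text, params string[] titles)
    {
        // Build "Mr|Mrs|..." from the supplied titles, escaping regex metacharacters.
        string alternation = string.Join("|", titles.Select(t => Regex.Escape(t)));

        // Match whitespace that follows a period, unless that period ends one of the titles.
        string pattern = @"(?<=\.)(?<!\b(?:" + alternation + @")\.)\s+";

        return Regex.Split(text, pattern, RegexOptions.IgnoreCase);
    }
}
```

Called as `LookbehindSplitter.Split(text, "Mr", "Mrs")`, this should return the two sentences from the question with their periods intact. Like the original, it only knows the titles it is given.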
Source: https://stackoverflow.com/questions/5325800/split-text-into-sentence-even-mr-mrs-exists-in-a-text