Split a sentence into words, but having trouble with the punctuation in C#

Posted by 删除回忆录丶 on 2019-12-30 03:05:28

Question


I have seen a few similar questions, but here is what I am trying to achieve.

Given a string, str = "The moon is our natural satellite, i.e. it rotates around the Earth!", I want to extract the words and store them in an array. The expected array elements would be:

the 
moon 
is 
our 
natural 
satellite 
i.e. 
it  
rotates 
around 
the 
earth

I tried using String.Split(' ', '\t', '\r'), but this does not work correctly. I also tried removing periods, commas, and other punctuation marks, but I want a string like "i.e." to come out intact. What is the best way to achieve this? I also tried Regex.Split, to no avail:

string[] words = Regex.Split(line, @"\W+");

I would appreciate any nudges in the right direction.


Answer 1:


A regex solution.

(\b[^\s]+\b)

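If you also want to keep the trailing period on "i.e.", you could use this instead. The lookbehind re-attaches one more character only when the match ends in a period followed by a single word character.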

((\b[^\s]+\b)((?<=\.\w).)?)

Here's the code I'm using.

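  using System;
  using System.Text.RegularExpressions;

  var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
  // Match each word; the optional group keeps the final period of "i.e."
  var matches = Regex.Matches(input, @"((\b[^\s]+\b)((?<=\.\w).)?)");

  foreach (Match match in matches)
  {
     Console.WriteLine(match.Value);
  }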

Results:

The
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
Earth



Answer 2:


I suspect the solution you're looking for is much more complex than you think. You need some form of actual language analysis, or at a minimum a dictionary, to determine whether a period is part of a word or ends a sentence. Have you considered that it may do both? A sentence that ends with "etc." is one such case.

Consider adding a dictionary of allowed "words that contain punctuation." This may be the simplest way to solve your problem.
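
The dictionary idea above could be sketched roughly as follows. This is only an illustration under assumptions: the class name AllowListTokenizer, the PunctuatedWords set, and its three entries are hypothetical and not part of the answer. Each token is split on whitespace, then trailing punctuation is peeled off one character at a time unless the token is already a known punctuated word.

```csharp
using System;
using System.Collections.Generic;

public static class AllowListTokenizer
{
    // Hypothetical allow-list of "words that contain punctuation".
    // Extend it with whatever abbreviations your text actually uses.
    private static readonly HashSet<string> PunctuatedWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "i.e.", "e.g.", "etc." };

    public static List<string> Tokenize(string sentence)
    {
        var words = new List<string>();
        var tokens = sentence.Split(new[] { ' ', '\t', '\r', '\n' },
                                    StringSplitOptions.RemoveEmptyEntries);
        foreach (var token in tokens)
        {
            var word = Clean(token);
            if (word.Length > 0)
            {
                words.Add(word);
            }
        }
        return words;
    }

    private static string Clean(string token)
    {
        // Skip leading punctuation such as an opening quote or bracket.
        int start = 0;
        while (start < token.Length && char.IsPunctuation(token[start]))
        {
            start++;
        }
        string candidate = token.Substring(start);

        // Peel trailing punctuation one character at a time, stopping as soon
        // as what remains is a known punctuated word (so "i.e.," becomes "i.e.").
        while (candidate.Length > 0
               && !PunctuatedWords.Contains(candidate)
               && char.IsPunctuation(candidate[candidate.Length - 1]))
        {
            candidate = candidate.Substring(0, candidate.Length - 1);
        }
        return candidate;
    }

    public static void Main()
    {
        var str = "The moon is our natural satellite, i.e. it rotates around the Earth!";
        foreach (var word in Tokenize(str))
        {
            Console.WriteLine(word);
        }
    }
}
```

For the sentence in the question, this keeps "i.e." whole while "satellite," and "Earth!" lose their punctuation. It does not resolve the ambiguity raised above: a sentence that ends with "etc." keeps its period, because the sketch cannot tell which role the period plays.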




Answer 3:


This works for me.

var str = "The moon is our natural satellite, i.e. it rotates around the Earth!";
// Split on spaces and tabs only, so punctuation stays attached to each word.
var a = str.Split(new char[] { ' ', '\t' });
for (int i = 0; i < a.Length; i++)
{
    Console.WriteLine(" -{0}", a[i]);
}

Results:

 -The
 -moon
 -is
 -our
 -natural
 -satellite,
 -i.e.
 -it
 -rotates
 -around
 -the
 -Earth!

You could then post-process the results, removing commas, semicolons, and other unwanted punctuation.
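
A minimal sketch of that post-processing step, assuming it is enough to trim a fixed set of characters from both ends of each token. The class name SplitThenTrim and the contents of the Strip array are illustrative choices, not part of the answer. The period is deliberately left out of the set so that "i.e." keeps its dots.

```csharp
using System;
using System.Linq;

public static class SplitThenTrim
{
    // Characters trimmed from both ends of every token (an assumed set).
    // '.' is omitted on purpose so that "i.e." is left intact.
    private static readonly char[] Strip = { ',', ';', ':', '!', '?', '"', '(', ')' };

    public static string[] Words(string sentence)
    {
        return sentence
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(token => token.Trim(Strip))
            .Where(token => token.Length > 0)   // drop tokens that were only punctuation
            .ToArray();
    }

    public static void Main()
    {
        var str = "The moon is our natural satellite, i.e. it rotates around the Earth!";
        foreach (var word in Words(str))
        {
            Console.WriteLine(" -{0}", word);
        }
    }
}
```

The trade-off of leaving the period alone is that a sentence-final word such as "Earth." would keep its period as well, so this works for the example sentence but is not a general answer.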




Answer 4:


Regex.Matches(input, @"\b\w+\b").OfType<Match>().Select(m => m.Value)
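
For reference, here is the one-liner above wrapped into a complete program, next to a variant that tries dotted abbreviations first. The variant pattern and the names WordMatcher, PlainWords, and WordsKeepingAbbreviations are assumptions added for illustration and do not come from the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordMatcher
{
    // The answer's pattern: runs of word characters only.
    public static string[] PlainWords(string input)
    {
        return Regex.Matches(input, @"\b\w+\b")
                    .OfType<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    // Assumed variant: first try a dotted abbreviation ("i.e.", "e.g"),
    // including one optional trailing period, then fall back to a plain word.
    public static string[] WordsKeepingAbbreviations(string input)
    {
        return Regex.Matches(input, @"\b\w+(?:\.\w+)+\.?|\w+")
                    .OfType<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
        Console.WriteLine(string.Join("|", PlainWords(input)));
        Console.WriteLine(string.Join("|", WordsKeepingAbbreviations(input)));
    }
}
```

With the plain `\w+` pattern, "i.e." comes out as two separate words, "i" and "e", because the period is not a word character. The variant keeps "i.e." together, but it would also join two words around a period that has no following space, as in "end.Next".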


Source: https://stackoverflow.com/questions/7311734/split-sentence-into-words-but-having-trouble-with-the-punctuations-in-c-sharp
