What is a regular expression for parsing out individual sentences?

攒了一身酷 2020-11-27 18:16

I am looking for a good .NET regular expression that I can use for parsing out individual sentences from a body of text.

It should be able to parse the following blo

6 Answers
  •  情歌与酒
    2020-11-27 19:17

    It is impossible to use regexes to parse natural language. What is the end of a sentence? A period can occur in many places (e.g. in "e.g." itself). You should use a natural language processing toolkit such as OpenNLP or NLTK. Unfortunately there are very few, if any, offerings in C#. You may therefore have to create a web service or otherwise link the toolkit into C#.

    Note that it will cause problems in the future if you rely on exact whitespace as in "I.D.". You'll soon find examples that break your regex. For example, most people put spaces after their initials.
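
To make the failure mode concrete, here is a minimal sketch in Python (not .NET, chosen only because NLTK is a Python toolkit). The pattern and the `naive_split` helper are hypothetical, written for illustration and not taken from the question. It treats every `.`, `!` or `?` followed by whitespace as a sentence boundary:

```python
import re

# Hypothetical naive rule: a sentence ends at ., ! or ? followed by whitespace.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text):
    """Split text at every whitespace run that follows ., ! or ?."""
    return [part for part in NAIVE_BOUNDARY.split(text.strip()) if part]


# Three plain sentences: the naive rule handles these.
simple = "This works. So does this! Does it?"

# Two real sentences containing a title, spaced initials and "e.g.".
tricky = "Dr. Smith showed his I. D. to the guard, e.g. at the gate. Then he left."

for piece in naive_split(simple):
    print(repr(piece))

print("---")

for piece in naive_split(tricky):
    print(repr(piece))
```

The first input comes back as three sentences, but the second, which contains only two sentences, is cut into six fragments: the title, the spaced initials and "e.g." each trigger a false boundary. Patching the pattern for each of these cases is the heuristic route that tends to keep breaking on new text.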

    There is an excellent summary of open-source and commercial offerings on Wikipedia (http://en.wikipedia.org/wiki/Natural_language_processing_toolkits). We have used several of them. It's worth the effort.

    [You use the word "train". This is normally associated with machine-learning (which is one approach to NLP and has been used for sentence-splitting). Indeed the toolkits I have mentioned include machine learning. I suspect that wasn't what you meant - rather that you would evolve your expression through heuristics. Don't!]
