How to split text into sentences when there is no space after a full stop?

Submitted by ぃ、小莉子 on 2019-12-12 01:14:32

Question


I have a text like

'A gas well near Surabaya in East Java operated by Lapindo Brantas Inc. has spewed steaming mud since May last year, submerging villages, industries and fields.A gas well near Surabaya in East Java operated by PT Lapindo Brantas has spewed steaming mud since May last year, submerging villages, factories and fields.Last week, Indonesia's coordinating minister for social welfare, Aburizal Bakrie, whose family firm controls Lapindo Brantas, said the volcano was a "natural disaster" unrelated to the drilling activities.President Susilo Bambang Yudhoyono last month ordered Lapindo to pay 3.8 trillion rupiah (420.7 million dollars) in compensation and costs'

I want to split it into sentences, but NLTK and every standard regex I have found online fail on it.


Answer 1:


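You can use a regex positive lookahead to add a space after sentence-ending periods that lack one, and then pass the result to the tool of your choice. The pattern below matches a period only when the next character is a letter or an underscore, so periods followed by a space, by punctuation such as a comma, or by a digit (as in 1.1) are left alone. Because it relies on character classes rather than an explicit range such as A-Z, and \w is Unicode-aware for str patterns in Python 3, it also works for text in other languages.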

>>> import re
>>> re.sub(r'\.(?=[^ \W\d])', '. ', 'Foo bar.Baz Inc., foobar. 1.1, and abc._')
'Foo bar. Baz Inc., foobar. 1.1, and abc. _'

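You can skip some URLs by adding a negative lookahead that rejects any period followed, before the next whitespace, by a slash:

>>> re.sub(r'\.(?=[^ \W\d])(?!\S*/)', '. ', 'Foo bar.Baz Inc., foobar. 1.1, and abc._ http://www.example.com/whatever')
'Foo bar. Baz Inc., foobar. 1.1, and abc. _ http://www.example.com/whatever'

This only protects periods that come before a slash in the same token; a bare host name such as www.example.com with no path would still be split.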
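
As a rough end-to-end sketch of the first substitution above, the snippet below repairs the missing spaces and then splits the result. The sample is shortened from the question's text, and `add_missing_spaces` and `naive_split` are hypothetical helper names; `naive_split` is only a stand-in for whatever sentence tokenizer you actually use (NLTK, for example), not a robust splitter.

```python
import re

# A period directly followed by a letter or underscore
# (not by a space, punctuation, or a digit).
MISSING_SPACE = re.compile(r'\.(?=[^ \W\d])')


def add_missing_spaces(text):
    """Insert a space after each period that runs straight into a word."""
    return MISSING_SPACE.sub('. ', text)


def naive_split(text):
    """Hypothetical placeholder for a real tokenizer.

    Splits on whitespace that follows a period and precedes an ASCII
    capital, so it will misfire on cases such as 'Mr. Smith'.
    """
    return re.split(r'(?<=\.)\s+(?=[A-Z])', text)


# Shortened from the question's text (assumption: same spacing problem).
sample = (
    'A gas well operated by Lapindo Brantas Inc. has spewed steaming mud '
    'since May last year, submerging villages, industries and fields.'
    'Last week, the minister said the volcano was unrelated to the '
    'drilling activities.President Susilo Bambang Yudhoyono ordered '
    'Lapindo to pay 3.8 trillion rupiah in compensation and costs'
)

sentences = naive_split(add_missing_spaces(sample))
for sentence in sentences:
    print(sentence)
```

With this sample, "Inc. has" and "3.8" are left untouched by the substitution, because the period there is followed by a space or a digit rather than a letter.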



Answer 2:


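You can use this regex to capture the periods that are immediately followed by the start of a new sentence:

(\.)(?=[A-Z])

The lookahead (?=[A-Z]) checks for an uppercase letter without consuming it, so the letter stays in the output. Pass the pattern to re.sub with r'\1\n' as the replacement to put each sentence on its own line:

parsed_text = re.sub(r'(\.)(?=[A-Z])', r'\1\n', your_text)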

You can also split the text directly into a list of sentences, although the periods at the split points are lost:

sentence_list = re.split(r'\.(?=[A-Z])', your_text)
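
If losing the periods is a problem, one possible variation (not part of the original answer) is to split on a zero-width position instead: a lookbehind for the period plus a lookahead for the capital. This is a minimal sketch; `split_keep_periods` is a hypothetical name, the sample string is made up for illustration, and splitting on empty matches requires Python 3.7 or later.

```python
import re


def split_keep_periods(text):
    """Split between a period and an immediately following ASCII capital.

    The pattern is zero-width (lookbehind + lookahead), so nothing is
    consumed and every sentence keeps its trailing period.
    Requires Python 3.7+, where re.split accepts empty matches.
    """
    return re.split(r'(?<=\.)(?=[A-Z])', text)


# Made-up sample with the same missing-space problem as the question.
text = ('Mud covered the fields.Last week the minister spoke.'
        'President Yudhoyono ordered 3.8 trillion rupiah')

parts = split_keep_periods(text)
for part in parts:
    print(part)
```

Like the patterns above, [A-Z] only covers ASCII capitals, and an initialism written without spaces (such as "U.S.Army") would also be split.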


Source: https://stackoverflow.com/questions/42445842/how-to-split-text-into-sentences-when-there-is-no-space-after-full-stop
