How to split text into sentences when there is no space after a full stop?

Question

I have a text like this:

'A gas well near Surabaya in East Java operated by Lapindo Brantas Inc. has spewed steaming mud since May last year, submerging villages, industries and fields.A gas well near Surabaya in East Java operated by PT Lapindo Brantas has spewed steaming mud since May last year, submerging villages, factories and fields.Last week, Indonesia's coordinating minister for social welfare, Aburizal Bakrie, whose family firm controls Lapindo Brantas, said the volcano was a "natural disaster" unrelated to the drilling activities.President Susilo Bambang Yudhoyono last month ordered Lapindo to pay 3.8 trillion rupiah (420.7 million dollars) in compensation and costs'

I want to split it into sentences. NLTK and every standard regex I have found online fail on it.

Answer 1:

You can use a regex positive lookahead to add the missing spaces at the ends of sentences and then pass the result to the tool of your choice. The pattern below adds a space after any period that is directly followed by a letter (or an underscore), so periods followed by a space, a digit, or punctuation such as a comma are left alone. Because it relies on the character classes \W and \d rather than, say, A-Z, it also works for non-English alphabets.

>>> import re
>>> re.sub(r'\.(?=[^ \W\d])', '. ', 'Foo bar.Baz Inc., foobar. 1.1, and abc._')
'Foo bar. Baz Inc., foobar. 1.1, and abc. _'
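
As a rough illustration, the same substitution can be wrapped in a small helper and tried on a shortened excerpt of the text from the question. The function name add_missing_spaces and the shortened sample are assumptions made for this sketch, not part of the original answer.

```python
import re

# A period directly followed by a letter or underscore
# (\W and \d are Unicode-aware for str patterns in Python 3).
_MISSING_SPACE = re.compile(r'\.(?=[^ \W\d])')

def add_missing_spaces(text):
    """Insert a space after each period that is glued to the next word."""
    return _MISSING_SPACE.sub('. ', text)

sample = ('submerging villages, industries and fields.A gas well near Surabaya. '
          'Lapindo was ordered to pay 3.8 trillion rupiah (420.7 million dollars).')
print(add_missing_spaces(sample))
# submerging villages, industries and fields. A gas well near Surabaya. Lapindo was ordered to pay 3.8 trillion rupiah (420.7 million dollars).
```

The decimal numbers 3.8 and 420.7 are untouched because the lookahead rejects digits. The result can then be handed to a sentence tokenizer.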

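You can leave most URLs alone by adding a negative lookahead that skips any period followed, within the same run of non-space characters, by a slash. A URL with no slash after its last period (for example one that simply ends in .com) is still affected.

>>> re.sub(r'\.(?=[^ \W\d])(?!\S*/)', '. ', 'Foo bar.Baz Inc., foobar. 1.1, and abc._ http://www.example.com/whatever')
'Foo bar. Baz Inc., foobar. 1.1, and abc. _ http://www.example.com/whatever'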
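
Another way to protect URLs, sketched here as an alternative rather than as the answer's own method, is to match whole URLs first and put them back unchanged. This assumes that every URL starts with http:// or https:// and ends at the next whitespace. The names URL_OR_GLUED_DOT and add_spaces_skip_urls are hypothetical.

```python
import re

# Alternative 1 captures a whole URL (assumed to start with http:// or https://).
# Alternative 2 is a period glued to the following letter or underscore.
URL_OR_GLUED_DOT = re.compile(r'(https?://\S+)|\.(?=[^ \W\d])')

def add_spaces_skip_urls(text):
    """Add missing spaces after periods, leaving scheme-prefixed URLs untouched."""
    def replace(match):
        url = match.group(1)
        # Put a URL back unchanged; otherwise the match is a glued period.
        return url if url is not None else '. '
    return URL_OR_GLUED_DOT.sub(replace, text)

print(add_spaces_skip_urls('Foo bar.Baz, see http://www.example.com for more.Thanks'))
# Foo bar. Baz, see http://www.example.com for more. Thanks
```

Bare domains without a scheme, such as www.example.com, are not recognised by this sketch and would still be split.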

Answer 2:

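You can use this regex to find the dots that are directly followed by the start of a new sentence:

(\.)(?=[A-Z])

The lookahead (?=[A-Z]) checks for the capital letter without consuming it, so the first letter of the next sentence is kept. Pass the pattern to re.sub with r'\1\n' as the replacement:

parsed_text = re.sub(r'(\.)(?=[A-Z])', r'\1\n', your_text)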
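
Whether the capital letter is consumed by the match matters here. The short comparison below uses a made-up sample string to show the difference between a non-capturing group and a lookahead in the replacement.

```python
import re

text = 'First sentence.Second sentence.Third one.'

# A non-capturing group still consumes the capital letter, so it is lost.
consumed = re.sub(r'(\.)(?:[A-Z])', r'\1\n', text)
print(repr(consumed))  # 'First sentence.\necond sentence.\nhird one.'

# A lookahead only checks for the capital letter and leaves it in place.
kept = re.sub(r'(\.)(?=[A-Z])', r'\1\n', text)
print(repr(kept))      # 'First sentence.\nSecond sentence.\nThird one.'
```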

You can also split the text straight into a list of sentences (but you lose the dots at the end):

sentence_list = re.split(r'\.(?=[A-Z])', your_text)
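
If the dots should be kept, one possible variation is to split on the empty position between the dot and the capital letter, using a lookbehind together with the lookahead. This is a sketch that assumes Python 3.7 or later, where re.split accepts patterns that match an empty string. The helper name split_keep_dots and the sample text are illustrative only.

```python
import re

def split_keep_dots(text):
    """Split between a period and a following capital letter, keeping the period."""
    # (?<=\.) requires a period just before the split point,
    # (?=[A-Z]) requires a capital letter just after it.
    return re.split(r'(?<=\.)(?=[A-Z])', text)

sample = ('submerging villages, industries and fields.A gas well has spewed mud.'
          'President Yudhoyono ordered Lapindo to pay 3.8 trillion rupiah')
for sentence in split_keep_dots(sample):
    print(sentence)
# submerging villages, industries and fields.
# A gas well has spewed mud.
# President Yudhoyono ordered Lapindo to pay 3.8 trillion rupiah
```

As with the other patterns in this answer, only sentences that start with an ASCII capital letter are detected.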

Source: https://stackoverflow.com/questions/42445842/how-to-split-text-into-sentences-when-there-is-no-space-after-full-stop

Tags

python

regex

nlp

nltk