open file and read sentence

流过昼夜 提交于 2021-02-18 07:34:07

问题


I want to open a file and get sentences. The sentences in the file go across lines, like this:

"He said, 'I'll pay you five pounds a week if I can have it on my own
terms.'  I'm a poor woman, sir, and Mr. Warren earns little, and the
money meant much to me.  He took out a ten-pound note, and he held it
out to me then and there. 

currently I'm using this code:

text = ' '.join(file_to_open.readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

readlines cuts through the sentences, is there a good way to solve this to get only the sentences? (without NLTK)

Thanks for you attention.

The current problem:

file_to_read = 'test.txt'

with open(file_to_read) as f:
    text = f.read()

import re
word_list = ['Mrs.', 'Mr.']     

for i in word_list:
    text = re.sub(i, i[:-1], text)

What I get back ( in the test case) is that Mrs. changed to Mr while Mr. is just Mr . I tried several other things, but don't seem to work. Answer is probably easy but I'm missing it


回答1:


Your regex works on the text above if you do this:

with open(filename) as f:
    text = f.read()

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The only problem is, the regex splits on the dot in "Mr." from your text above, so you need to fix/change that.

One solution to this, though not perfect, is you could take out all occurences of a dot after Mr:

text = re.sub(r'(M\w{1,2})\.', r'\1', text) # no for loop needed for this, like there was before

this Matches an 'M' followed by minimum 1, maximum 2 alphanumeric chars(\w{1,3}), followed by a dot. The parenthesised part of the pattern is grouped and captured, and it's referenced in the replacement as '\1'(or group 1, as you could have more parenthesised groups). So essentially, the Mr. or Mrs. is matched, but only the Mr or Mrs part is captured, and the Mr. or Mrs. is then replaced by the captured part which excludes the dot.

and then :

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

will work the way you want.




回答2:


You may want to try out the text-sentence tokenizer module.

From their example code:

>>> from text_sentence import Tokenizer
>>> t = Tokenizer()
>>> list(t.tokenize("This is first sentence. This is second one!And this is third, is it?"))
[T('this'/sent_start), T('is'), T('first'), T('sentence'), T('.'/sent_end),
 T('this'/sent_start), T('is'), T('second'), T('one'), T('!'/sent_end),
 T('and'/sent_start), T('this'), T('is'), T('third'), T(','/inner_sep),
 T('is'), T('it'), T('?'/sent_end)]

I've never actually tried it though, I'd prefer using NLTK/punkt.



来源:https://stackoverflow.com/questions/20719247/open-file-and-read-sentence

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!