Split Documents into Paragraphs

Question

I have a large stockpile of PDF documents. I use Apache Tika to convert them to text, and now I'd like to split the text into paragraphs. I can't use a simple regular expression because the text conversion makes it impossible to distinguish paragraphs consistently: some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs (using Tika's conversion to HTML instead of text does not help).

The NLTK book for Python has a way of splitting sentences using machine learning, so I thought of trying something similar with paragraphs, but I couldn't find training data for that.

Is there training data for that? Or should I try some complex regular expression that might work?

Answer 1:

Here is a simpler way to deal with your problem: check whether the file contains a double \n. If it does, split the data on the double \n; if it does not, split on the single \n.

One caveat: I am assuming the separator is the newline character \n (I could not find an ASCII value for \nl, so it is probably not a special character). If your files really use a different marker, change the string assigned to double_break. Rough example to detect which new-paragraph convention a file uses:

# Read the whole file into memory.
with open('yourfile', 'r') as f:
    text = f.read()

# Marker that separates paragraphs in the "double" convention.
double_break = '\n\n'

# uses_double is True if paragraphs are separated by a double newline,
# otherwise False.
uses_double = double_break in text

After this you can use simple string operations to look for single or double newlines and replace them as needed to distinguish a line break from a paragraph break. (Please read the file in chunks if it is very large; otherwise you may run into memory problems or slow code.)
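The chunked reading suggested above could look roughly like the sketch below. The function name uses_double_newline and the chunk_size parameter are illustrative choices, not part of the original answer, and the sketch assumes the separator is a plain \n and that the file is opened in text mode. It carries the last character of each chunk forward so that a double newline split across two chunks is still detected.

```python
import io

def uses_double_newline(stream, chunk_size=65536):
    """Return True if the text stream contains two consecutive newlines."""
    prev_tail = ''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of stream reached without finding a double newline.
            return False
        # Prepend the last character of the previous chunk so that a
        # double newline straddling a chunk boundary is still found.
        if '\n\n' in prev_tail + chunk:
            return True
        prev_tail = chunk[-1]

# Small in-memory demonstration; a real call would pass an open file object.
wrapped = io.StringIO("line one\nline two\n\nnew paragraph")
print(uses_double_newline(wrapped, chunk_size=4))
```

With a real file, the same function would be called as uses_double_newline(f) inside a with open('yourfile', 'r') as f: block.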

Answer 2:

You say

some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs

so I would preprocess all the files to detect which ones use the double newline between paragraphs. In the files that use a double \n, every single newline should be removed (replaced with a space, so that words on adjacent lines do not run together), and every double newline reduced to a single one.

You can then pass all the files to the next stage where you detect paragraphs using a single \n character.
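A minimal sketch of this preprocessing, under a few assumptions: the helper names normalize_paragraph_breaks and split_paragraphs are made up for illustration, a file counts as "double newline" style as soon as it contains one \n\n, and lone newlines are replaced with a space rather than deleted outright so that wrapped words stay separated. Documents that mix both conventions would need a smarter check.

```python
import re

def normalize_paragraph_breaks(text):
    """Return text in which paragraphs are separated by one newline."""
    if '\n\n' in text:
        # Double-newline convention: a lone newline is only a line wrap,
        # so replace each newline that has no newline neighbour with a space.
        text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
        # Collapse every run of two or more newlines into a single one.
        text = re.sub(r'\n{2,}', '\n', text)
    return text

def split_paragraphs(text):
    """Normalize the text, then split it into non-empty paragraphs."""
    normalized = normalize_paragraph_breaks(text)
    return [p.strip() for p in normalized.split('\n') if p.strip()]

wrapped = "First line\nsecond line\n\nNext paragraph\nwraps too"
plain = "One paragraph\nAnother paragraph"
print(split_paragraphs(wrapped))
print(split_paragraphs(plain))
```

Both inputs end up as two-paragraph lists, so the later stage only ever has to deal with the single \n convention.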

Answer 3:

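# Requires NLTK's Punkt tokenizer data to be downloaded beforehand.
from nltk.tokenize import sent_tokenize

text = 'para here'
sentences = sent_tokenize(text)
# output: a list of sentences (note: sentences, not paragraphs)

# that's all you need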

Source: https://stackoverflow.com/questions/41913668/split-documents-into-paragraphs

Tags

python

regex

machine-learning

apache-tika