Question
I am using Python 3.5.2 / Anaconda 4.1.1 to extract text from a PDF (http://www.mitpressjournals.org/doi/pdf/10.1162/INOV_a_00153) using PyPDF2. Many escape sequences like these show up in the middle of the printed text, and I do not need them:
\xc5 \xef \x82 \xef \xac \n.
Can you please help me get rid of these pesky characters?! Thanks for your help! This is my short piece of code below:
import PyPDF2
pdfFileObj = open('C:\\Users\\HP\\Desktop\\Datasets\\task1_rb.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num=pdfReader.numPages
for a in range(1,num):
    text=''
    pageObj = pdfReader.getPage(a)
    text=pageObj.extractText().encode('utf-8')
    print(text)
Answer 1:
You could encode text in ASCII and ignore non-ASCII characters.
Try changing:
text=pageObj.extractText().encode('utf-8')
To:
text=pageObj.extractText().encode('ascii', 'ignore')
I've skimmed the output and it seems to have done the trick.
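To make the effect of that change concrete, here is a minimal sketch that runs without the PDF. The sample string and both function names are made up for illustration. In Python 3, encode() returns bytes, and printing bytes shows every non-ASCII byte as a \x escape, which is most likely where the unwanted sequences come from. The first function applies the suggestion above and decodes the result back to str so it prints as plain text. The second is an optional variant that is not part of the original suggestion: it runs unicodedata.normalize('NFKD', ...) first, so ligatures and accented letters are folded to plain letters instead of being dropped.

```python
import unicodedata


def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    # encode() yields bytes; decode() converts them back to str so that
    # print() shows plain text instead of a b'...' literal.
    return text.encode('ascii', 'ignore').decode('ascii')


def fold_to_ascii(text):
    """Illustrative variant: decompose characters first, then drop the rest."""
    # NFKD splits the "fi" ligature into 'f' + 'i' and an accented letter into
    # a base letter plus a combining mark; the mark is then ignored.
    normalized = unicodedata.normalize('NFKD', text)
    return normalized.encode('ascii', 'ignore').decode('ascii')


# Hypothetical sample: an e-acute and the single-character "fi" ligature.
sample = 'caf\u00e9 \ufb01le'

print(strip_non_ascii(sample))
print(fold_to_ascii(sample))
```

Plain 'ignore' removes the ligature entirely, so words that contain one lose letters. Whether that matters depends on what the extracted text is used for.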
On a separate point, the range in your for loop starts at 1, so it skips the first page, which has index 0 (unless that's what was intended).
Change for a in range(1,num): to for a in range(0,num):
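The difference between the two ranges can be checked with a small stand-in. The pages list below is hypothetical and only takes the place of the PDF's pages.

```python
# Hypothetical stand-in for the pages of a three-page PDF.
pages = ['page one', 'page two', 'page three']
num = len(pages)

# range(1, num) starts at index 1, so the first page is never visited.
skipping_first = [pages[a] for a in range(1, num)]

# range(0, num), or simply range(num), visits every page.
all_pages = [pages[a] for a in range(0, num)]

print(skipping_first)
print(all_pages)
```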
Source: https://stackoverflow.com/questions/44087872/python-unwanted-unicode-characters-in-printing-extracting-text-from-pdf