Python: Unwanted Unicode characters when printing/extracting text from PDF

Submitted by 余生颓废 on 2020-03-03 22:49:13

Question


I am using Python 3.5.2 / Anaconda 4.1.1 to extract text from a PDF (http://www.mitpressjournals.org/doi/pdf/10.1162/INOV_a_00153) using PyPDF2. Many Unicode characters that I do not need appear in the middle of the printed text:

\xc5 \xef \x82 \xef \xac \n.

Can you please help me get rid of these pesky characters?! Thanks for your help! This is my short piece of code below:

import PyPDF2

pdfFileObj = open('C:\\Users\\HP\\Desktop\\Datasets\\task1_rb.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num=pdfReader.numPages

for a in range(1,num):
    text=''
    pageObj = pdfReader.getPage(a)
    text=pageObj.extractText().encode('utf-8')
    print(text)

Answer 1:


You could encode the text as ASCII and ignore any non-ASCII characters.

Try changing:

text=pageObj.extractText().encode('utf-8')

To:

text=pageObj.extractText().encode('ascii', 'ignore')

I've skimmed the output and it seems to have done the trick.
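As a minimal sketch of what that change does, the snippet below runs the same encode call on a plain string rather than a real PDF page. The sample text and the helper name strip_non_ascii are hypothetical and only stand in for the result of pageObj.extractText(). Note that in Python 3, .encode() returns bytes, so printing the result directly shows a b'...' literal with escapes such as \n; decoding back to str, as done here, is an optional extra step that is not part of the answer above.

```python
# Hypothetical sample standing in for pageObj.extractText() output.
# \uf082 is a private-use character and \ufb03 is the "ffi" ligature.
sample = "Innovations \uf082 e\ufb03cient\n"


def strip_non_ascii(text):
    """Drop every non-ASCII character and return a str instead of bytes."""
    # 'ignore' silently discards characters that ASCII cannot represent.
    as_bytes = text.encode('ascii', 'ignore')
    # Decode again so print() shows normal text rather than b'...'.
    return as_bytes.decode('ascii')


cleaned = strip_non_ascii(sample)
print(cleaned)
```

One trade-off to keep in mind: 'ignore' removes the characters outright, so a ligature such as "ffi" disappears from the word it belonged to rather than being replaced by plain letters.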

On a separate point, the range in your for loop is causing you to miss some of the output (unless that's what was intended).

Change:

for a in range(1,num):

To:

for a in range(0,num):
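The effect of the loop bounds can be checked without a PDF. In this small sketch, the list pages is a hypothetical stand-in for a three-page document, with one string per page; page indices start at 0, so starting the range at 1 skips the first page.

```python
# Hypothetical stand-in for a 3-page PDF: one string per page.
pages = ["text of page 0", "text of page 1", "text of page 2"]
num = len(pages)

# Original loop bounds: index 0 is never visited.
skipping_first = [pages[a] for a in range(1, num)]

# Corrected loop bounds: every page is visited.
all_pages = [pages[a] for a in range(0, num)]

print(len(skipping_first), "pages read with range(1, num)")
print(len(all_pages), "pages read with range(0, num)")
```

Writing range(num) is equivalent to range(0, num) and is the more common spelling.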



Source: https://stackoverflow.com/questions/44087872/python-unwanted-unicode-characters-in-printing-extracting-text-from-pdf
