Extract text from pdf converted from webpage using Pypdf2

狂风中的少年 提交于 2020-06-29 04:34:38

问题


I used chrome to convert a webpage into Pdf using save as pdf option. Now the problem is that when I extract the data from it using PyPDF2, it shows Null whereas it works on other pdf files easily. I know that I can extract data directly from the website but I want to understand why this is not working. It shows the correct number of pages but when I extracttext(), it shows nothing. Does anyone know what is the problem? The link to the page is https://en.wikipedia.org/wiki/Rapping. I converted this webpage to pdf.

import PyPDF2
pdfFileObj = open('C:/Users/System/Desktop/Rapping - Wikipedia.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
pdfFileObj.close()

回答1:


PyPDF2 is highly unreliable for extracting text from pdf . as pointed out here too. it says :

While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.

  1. You could instead install and use pdfminer using

    pip install pdfminer

  2. or you can use another open source utility named pdftotext by xpdfreader. instructions to use the utility is given on the page.

you can download the command line tools from here and could use the pdftotext.exe utility using subprocess .detailed explanation for using subprocess is given here



来源:https://stackoverflow.com/questions/60669890/extract-text-from-pdf-converted-from-webpage-using-pypdf2

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!