How do I decode text from a pdf online with Requests?

Submitted by 只谈情不闲聊 on 2020-12-06 04:17:20

Question


I am trying to create a pdf puller from the Australian Stock Exchange website which will allow me to search through all the 'Announcements' made by companies and search for key words in the pdfs of those announcements.

So far I have used the requests library. Here is my code:

import requests

url = 'http://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf'
response = requests.get(url)

print(response.content)

However what prints is the following string (I will cut this off as it will be too long):

> b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n5 0 obj\r<</E 212221/H [ 1081 145 ]/L
> 212973/Linearized 1/N 1/O 8/T 212553>>\rendobj\r                      
> \r\r42 0 obj\r<</DecodeParms <</Columns 5/Predictor 12>>/Encrypt 7 0
> R/Filter /FlateDecode/ID [(\\216\\203\\217T\\n\\f\\236\\345?%\\214t4
> E\\271) (\\216\\203\\217T\\n\\f\\236\\345?%\\214t4 E\\271)]/Index [5
> 38]/Info 3 0 R/Length 86/Prev 212554/Root 6 0 R/Size 43/Type /XRef/W
> [1 3
> 1]>>\rstream\nx\x9ccbd`\x10``b``:\x04"\x19\xab\xc1d-X\xc4\x06D2\xac\x02\xb3\x93\xc0\xe2\x1d
> \x92?\x07,\x1e\t"\xb9T\x80$\xe3\x84\xcb@\x92\xa9m"\x03\x13\xe3\xdf\x13Z`Y\x06\xc6\x01#\xff3\xb0h\xbcfb`\xb6\x12\x02\xba\xe4\xef!S\x06\x0

I have searched Stack Exchange and other websites for a few days, and have tried print(response.content.decode('utf-8')) as well as ascii, but neither of them produces anything I can read.

Apologies as I know it is obvious that I am a noobie, but any help would be greatly appreciated!

Thanks a lot.


Answer 1:


A PDF is a binary file, so it has to be parsed according to its own format, including its headers and footers. You cannot read a binary file as a raw string.
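To illustrate why decoding fails, here is a minimal sketch that uses only the first bytes shown in the question's output. The helper names `looks_like_pdf` and `try_decode` are made up for this example. The header identifies the data as a PDF, and the bytes right after it are valid in neither UTF-8 nor ASCII:

```python
# First bytes of the response, copied from the output in the question.
sample = b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n'


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header marker."""
    return data.startswith(b'%PDF-')


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


print(looks_like_pdf(sample))       # True: the header identifies the format
print(try_decode(sample, 'utf-8'))  # None: the bytes after the header are not valid UTF-8
print(try_decode(sample, 'ascii'))  # None: they are not ASCII either
```

Even where a decode happens to succeed, the page text in a PDF usually sits inside compressed streams (the /FlateDecode filter visible in the output above), so a PDF parser is still needed.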

1) If your file name contains ANY spaces, the PyPDF2 decrypt function will ultimately fail despite returning a success code. Stick to underscores when naming your PDFs before you run them through PyPDF2.

For example, rather than "my pdf.pdf", use something like "my_pdf.pdf".

2) This PDF is encrypted, but decrypting it with an empty string as the password works.

Try this:

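import requests
import PyPDF2

url = 'http://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf'
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

# Save the raw bytes to disk; 'wb' keeps them binary.
with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(response.content)

# PdfReader, is_encrypted, pages and extract_text are the current PyPDF2 names
# for the older PdfFileReader, isEncrypted, getPage and extractText.
with open("my_pdf.pdf", 'rb') as open_pdf_file:
    read_pdf = PyPDF2.PdfReader(open_pdf_file)
    if read_pdf.is_encrypted:
        read_pdf.decrypt("")  # empty string as the password
    print(read_pdf.pages[0].extract_text())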



Answer 2:


That response is the raw, encoded content of the PDF file, not its text. You need to use an extraction tool such as pdfminer. The pdfminer page has an example showing how to do a sample extraction in Python.
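As a rough sketch of that approach, assuming the maintained fork pdfminer.six is installed (`pip install pdfminer.six`): its `pdfminer.high_level.extract_text` function accepts a file-like object, so the downloaded bytes can be wrapped in `io.BytesIO` without writing a file. The helper names `pdf_bytes_to_text` and `find_keywords` and the sample keywords are hypothetical, not part of the original answer.

```python
import io
import re


def pdf_bytes_to_text(pdf_bytes: bytes) -> str:
    """Extract text from in-memory PDF bytes with pdfminer.six."""
    # Imported here so the keyword helper below works without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(io.BytesIO(pdf_bytes))


def find_keywords(text: str, keywords) -> dict:
    """Count case-insensitive whole-word occurrences of each keyword."""
    counts = {}
    for word in keywords:
        pattern = r'\b' + re.escape(word) + r'\b'
        counts[word] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# Usage (needs network access and the requests package):
#   import requests
#   url = 'http://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf'
#   text = pdf_bytes_to_text(requests.get(url).content)
#   print(find_keywords(text, ['dividend', 'acquisition']))
```

Keeping the keyword search separate from the extraction step means the same search could be reused on text produced by PyPDF2 or any other extractor.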



Source: https://stackoverflow.com/questions/47171154/how-do-i-decode-text-from-a-pdf-online-with-requests
