Extract embedded PDF

依然范特西╮ submitted on 2020-02-01 09:55:07

Question


I noted that docplayer.net embeds many PDFs. Example: http://docplayer.net/72489212-Excellence-in-prevention-descriptions-of-the-prevention-programs-and-strategies-with-the-greatest-evidence-of-success.html

How can these PDFs be extracted (i.e. downloaded) in an automated workflow?


Answer 1:


In the browser's developer tools, under the Network/XHR tab, you can see that the actual document is requested separately. In your particular case it is at http://docplayer.net/storage/75/72489212/72489212.pdf. You can then look into the page source to see whether this URL can be inferred; the XPath //iframe[@id="player_frame"]/@src looks helpful here. I haven't checked other pages, but something like this might work (as part of your parse method):

...
url_template = 'http://docplayer.net/storage/{0}/{1}/{1}.pdf'
ids = response.xpath('//iframe[@id="player_frame"]/@src').re(r'/docview/([^/]+)/([^/]+)/')
file_url = url_template.format(*ids)
yield scrapy.Request(file_url, callback=self.parse_pdf)
...
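For context, a minimal self-contained spider built around that fragment might look like the sketch below. The spider name, the start URL, and the way parse_pdf writes the file to disk are my assumptions, not part of the original answer.

import scrapy


class DocplayerSpider(scrapy.Spider):
    # Hypothetical spider name and start URL; adjust to the pages you actually crawl.
    name = 'docplayer'
    start_urls = [
        'http://docplayer.net/72489212-Excellence-in-prevention-descriptions-of-the-prevention-programs-and-strategies-with-the-greatest-evidence-of-success.html',
    ]

    def parse(self, response):
        # The embedded viewer iframe carries the two IDs that make up the storage URL.
        url_template = 'http://docplayer.net/storage/{0}/{1}/{1}.pdf'
        ids = response.xpath('//iframe[@id="player_frame"]/@src').re(r'/docview/([^/]+)/([^/]+)/')
        if len(ids) == 2:
            file_url = url_template.format(*ids)
            yield scrapy.Request(file_url, callback=self.parse_pdf)

    def parse_pdf(self, response):
        # Assumed callback: write the raw response body to a local .pdf file.
        filename = response.url.rsplit('/', 1)[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)

Saved as, say, docplayer_spider.py, this could be run with scrapy runspider docplayer_spider.py.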



Answer 2:


As you pointed out, requesting the URL alone results in a 403 Forbidden. There are two additional headers you need: "s" and "ex".

To get these in Firefox, open the Network tab in the inspector, right-click the PDF request, and select "Copy > Copy as cURL". The resulting curl command is the exact request the browser would have made to fetch the resource. In addition to the "s" and "ex" headers, you will notice a "Range" header; remove it unless you only want to download part of the file. The remaining headers are not relevant.
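As an illustration, reproducing that copied curl command in Python with the requests library might look like the sketch below. The header values are placeholders to be pasted in from your own "Copy as cURL" output, and the storage URL shape is taken from the first answer.

import requests

# Placeholder header values: copy the real ones from the browser's "Copy as cURL" output.
# The "Range" header from the copied command is deliberately left out so that the
# whole file is downloaded rather than a byte range.
pdf_url = 'http://docplayer.net/storage/75/72489212/72489212.pdf'
headers = {
    's': '<value copied from the browser request>',
    'ex': '<value copied from the browser request>',
}

response = requests.get(pdf_url, headers=headers)
response.raise_for_status()  # raises if the server still answers 403

with open('72489212.pdf', 'wb') as f:
    f.write(response.content)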

I will not post the resulting direct link to the PDF here, but I did test it and was able to download the entire file with this technique.



Source: https://stackoverflow.com/questions/49064193/extract-embedded-pdf
