Extract embedded PDF

依然范特西╮ submitted on 2020-02-01 09:55:07

Question


I noted that docplayer.net embeds many PDFs. Example: http://docplayer.net/72489212-Excellence-in-prevention-descriptions-of-the-prevention-programs-and-strategies-with-the-greatest-evidence-of-success.html

How can these PDFs be extracted (i.e. downloaded) in an automated workflow?


Answer 1:


In the browser's developer tools, under the Network/XHR tab, you can see that the actual document is requested separately. In your particular case it is at http://docplayer.net/storage/75/72489212/72489212.pdf. You can then look into the page source to see whether this URL can be inferred; the XPath //iframe[@id="player_frame"]/@src looks helpful here. I haven't checked other pages, but something like this might work (as part of your parse method):

...
url_template = 'http://docplayer.net/storage/{0}/{1}/{1}.pdf'
ids = response.xpath('//iframe[@id="player_frame"]/@src').re(r'/docview/([^/]+)/([^/]+)/')
file_url = url_template.format(*ids)
yield scrapy.Request(file_url, callback=self.parse_pdf)
...
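For context, a minimal self-contained spider built around that fragment might look like the sketch below. The spider name, the start URL, and the way parse_pdf writes the file to disk are my assumptions, not part of the original answer.

import scrapy


class DocplayerSpider(scrapy.Spider):
    # Hypothetical spider name and start URL; adjust to the pages you actually crawl.
    name = 'docplayer'
    start_urls = [
        'http://docplayer.net/72489212-Excellence-in-prevention-descriptions-of-the-prevention-programs-and-strategies-with-the-greatest-evidence-of-success.html',
    ]

    def parse(self, response):
        # The embedded viewer iframe carries the two IDs that make up the storage URL.
        url_template = 'http://docplayer.net/storage/{0}/{1}/{1}.pdf'
        ids = response.xpath('//iframe[@id="player_frame"]/@src').re(r'/docview/([^/]+)/([^/]+)/')
        if len(ids) == 2:
            file_url = url_template.format(*ids)
            yield scrapy.Request(file_url, callback=self.parse_pdf)

    def parse_pdf(self, response):
        # Assumed callback: write the raw response body to a local .pdf file.
        filename = response.url.rsplit('/', 1)[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)

Saved as, say, docplayer_spider.py, this could be run with scrapy runspider docplayer_spider.py.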



Answer 2:


As you pointed out, requesting the URL alone results in a 403 Forbidden. There are two additional headers you need: "s" and "ex".

To get these in Firefox, open the Network tab in the inspector, right-click the PDF request, and select "Copy > Copy as cURL". The resulting curl command is the exact request the browser would have made to fetch the resource. In addition to the "s" and "ex" headers, you will notice a "Range" header; remove it unless you only want to download part of the file. The remaining headers are not relevant.
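As an illustration, reproducing that copied curl command in Python with the requests library might look like the sketch below. The header values are placeholders to be pasted in from your own "Copy as cURL" output, and the storage URL shape is taken from the first answer.

import requests

# Placeholder header values: copy the real ones from the browser's "Copy as cURL" output.
# The "Range" header from the copied command is deliberately left out so that the
# whole file is downloaded rather than a byte range.
pdf_url = 'http://docplayer.net/storage/75/72489212/72489212.pdf'
headers = {
    's': '<value copied from the browser request>',
    'ex': '<value copied from the browser request>',
}

response = requests.get(pdf_url, headers=headers)
response.raise_for_status()  # raises if the server still answers 403

with open('72489212.pdf', 'wb') as f:
    f.write(response.content)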

I will not post the resulting direct link to the PDF here, but I did test it and was able to download the entire file with this technique.



Source: https://stackoverflow.com/questions/49064193/extract-embedded-pdf
