Loop URLs from a DataFrame and download PDF files in Python

£可爱£侵袭症+ submitted on 2021-02-08 10:15:21

Question


Based on the code from here, I'm able to crawl the URL for each transaction and save them into an Excel file, which can be downloaded here.

Now I would like to go further and open each URL link:

For each URL, I need to open the page and save the PDF file it contains:

How could I do that in Python? Any help would be greatly appreciated.

Code for reference:

import os
import shutil
from bs4 import BeautifulSoup
import requests

url = 'xxx'
os.makedirs('./files', exist_ok=True)  # make sure the target folder exists

for page in range(6):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    # Each card title links to a transaction detail page.
    for link in soup.select("h3[class='sv-card-title']>a"):
        r = requests.get(link.get("href"), stream=True)
        r.raw.decode_content = True
        with open('./files/' + link.text + '.pdf', 'wb') as f:
            shutil.copyfileobj(r.raw, f)
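
Since the links are already saved in the Excel file, one way to loop over them is to read that file back with pandas. A minimal sketch, assuming a file name and a column named "url" (both are hypothetical; adjust them to the actual file):

import pandas as pd

# Read the previously saved links back into a list so each one
# can be opened and its PDF saved, as the answers below show.
df = pd.read_excel('transactions.xlsx')  # hypothetical file name
urls = df['url'].tolist()                # hypothetical column name
print(f"{len(urls)} links to process")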

Answer 1:


An example of downloading one PDF file from your uploaded Excel file:

from bs4 import BeautifulSoup
import requests

# Let's assume there is only one page. If you need to download many files, save the page URLs in a list.

url = 'http://xinsanban.eastmoney.com/Article/NoticeContent?id=AN201909041348533085'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

link = soup.select_one(".lookmore")
title = soup.select_one(".newsContent").select_one("h1").text

print(title.strip() + '.pdf')
data = requests.get(link.get("href")).content
with open(title.strip().replace(":", "-") + '.pdf', "wb+") as f:  # a file name shouldn't contain ':', so replace it with '-'
    f.write(data)

And it downloads successfully.
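
Following the comment in the code above, a minimal sketch of the many-files case, reusing the same selectors and assuming the page URLs have already been collected into a list (the list below is a placeholder):

from bs4 import BeautifulSoup
import requests

# Hypothetical list of notice-page URLs, e.g. read from the Excel file.
page_urls = [
    'http://xinsanban.eastmoney.com/Article/NoticeContent?id=AN201909041348533085',
]

for url in page_urls:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    link = soup.select_one(".lookmore")                      # anchor to the PDF
    title = soup.select_one(".newsContent h1").text.strip()  # notice title
    data = requests.get(link.get("href")).content
    with open(title.replace(":", "-") + '.pdf', "wb") as f:
        f.write(data)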




Answer 2:


Here's a slightly different approach. You don't have to open each URL from the Excel file, because you can build the .pdf source URLs yourself from the notice IDs.

For example:

import requests

urls = [
    "http://data.eastmoney.com/notices/detail/871792/AN201909041348533085,JWU2JWEwJTk2JWU5JTljJTllJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/872955/AN201912101371726768,JWU0JWI4JWFkJWU5JTgzJWJkJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/832816/AN202008171399155565,JWU3JWI0JWEyJWU1JTg1JThiJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/831971/AN201505220009713696,JWU1JWJjJTgwJWU1JTg1JTgzJWU3JTg5JWE5JWU0JWI4JTlh.html",
]

for url in urls:
    # The last path segment is "<notice_id>,<encoded title>.html";
    # the part before the comma is the notice ID used in the PDF URL.
    file_id, _ = url.split('/')[-1].split(',')
    pdf_file_url = f"http://pdf.dfcfw.com/pdf/H2_{file_id}_1.pdf"
    print(f"Fetching {pdf_file_url}...")
    with open(f"{file_id}.pdf", "wb") as f:
        f.write(requests.get(pdf_file_url).content)
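
Since the H2_<id>_1.pdf pattern is an inference about the site's URL scheme rather than a documented API, it can be worth guarding each request. A minimal sketch with a timeout and a status check, using the same constructed URL as above:

import requests

def download_pdf(file_id: str) -> None:
    # Same assumed URL pattern as above.
    pdf_file_url = f"http://pdf.dfcfw.com/pdf/H2_{file_id}_1.pdf"
    r = requests.get(pdf_file_url, timeout=30)
    r.raise_for_status()  # fail loudly if the guessed URL doesn't exist
    with open(f"{file_id}.pdf", "wb") as f:
        f.write(r.content)

download_pdf("AN201909041348533085")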


Source: https://stackoverflow.com/questions/65176880/loop-url-from-dataframe-and-download-pdf-files-in-python
