How to get pdf filename with Python requests?

[亡魂溺海] 提交于 2019-11-29 03:00:48

It is specified in an http header content-disposition. So to extract the name you would do:

import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)

Name extracted from the string via regular expression (re module).

Apparently, for this particular resource it is in:

r.headers['content-disposition']

Don't know if it is always the case, though.

Building on some of the other answers, here's how I do it. If there isn't a Content-Disposition header, I parse it from the download URL.

import re
import requests
from request.exceptions import RequestException


url = 'http://www.example.com/downloads/sample.pdf'

try:
    with requests.get(url) as r:

        fname = ''
        if "Content-Disposition" in r.headers.keys():
            fname = re.findall("filename=(.+)", r.headers["Content-Disposition"])[0]
        else:
            fname = url.split("/")[-1]

        print(fname)
except RequestException as e:
    print(e)

There are arguably better ways of parsing the URL string, but for simplicity I didn't want to involve any more libraries.

You can use werkzeug for options headers https://werkzeug.palletsprojects.com/en/0.15.x/http/#werkzeug.http.parse_options_header

>>> import werkzeug


>>> werkzeug.parse_options_header('text/html; charset=utf8')
('text/html', {'charset': 'utf8'})
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!