I'm using the Python requests lib to get a PDF file from the web. This works fine, but I now also want the original filename. If I go to a PDF file in Firefox and click
Apparently, for this particular resource it is in:
r.headers['content-disposition']
I don't know if that is always the case, though.
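An easy Python 3 implementation to get the filename from Content-Disposition:
import requests
url = "<your-url>"  # replace with the URL of the file
response = requests.get(url)
# Fails with AttributeError if the header is absent (headers.get returns None).
print(response.headers.get("Content-Disposition").split("filename=")[1])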
It is specified in the HTTP header content-disposition. So to extract the name you would do:
import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)[0]
The name is extracted from the string via a regular expression (the re module).
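Note that the regex keeps any quotes around the name and anything that follows it in the header. As a rough alternative sketch, the standard library's email.message.Message can parse the header parameters for you. The helper name filename_from_content_disposition is made up for illustration, and this assumes the header uses the usual "type; name=value" layout:

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    # Reuse the stdlib MIME header parser instead of a hand-written regex.
    msg["Content-Disposition"] = header_value
    # get_filename() strips surrounding quotes and returns None if no name is found.
    return msg.get_filename()


print(filename_from_content_disposition('attachment; filename="report.pdf"'))
```

With requests you would pass it r.headers.get("Content-Disposition", "") so that a missing header simply yields None.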
Building on some of the other answers, here's how I do it. If there isn't a Content-Disposition header, I parse it from the download URL:
import re
import requests
from requests.exceptions import RequestException
url = 'http://www.example.com/downloads/sample.pdf'
try:
    with requests.get(url) as r:
        fname = ''
        if "Content-Disposition" in r.headers.keys():
            fname = re.findall("filename=(.+)", r.headers["Content-Disposition"])[0]
        else:
            fname = url.split("/")[-1]
        print(fname)
except RequestException as e:
    print(e)
There are arguably better ways of parsing the URL string, but for simplicity I didn't want to involve any more libraries.
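For what it's worth, the URL fallback can stay inside the standard library. Here is a minimal sketch using urllib.parse, so that query strings, fragments and percent-encoding don't end up in the name. The helper name filename_from_url is hypothetical:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, percent-decoded."""
    # urlsplit separates the path from the query string and fragment.
    path = urlsplit(url).path
    # basename takes the part after the final slash; unquote decodes %20 etc.
    return unquote(posixpath.basename(path))


print(filename_from_url("http://www.example.com/downloads/sample.pdf?token=abc"))
```

If the URL path ends in a slash this returns an empty string, so you may still want a default name for that case.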
You can use werkzeug to parse options headers: https://werkzeug.palletsprojects.com/en/0.15.x/http/#werkzeug.http.parse_options_header
>>> from werkzeug.http import parse_options_header
>>> parse_options_header('text/html; charset=utf8')
('text/html', {'charset': 'utf8'})