Get a header with Python and convert in JSON (requests - urllib2 - json)

浪尽此生 提交于 2019-12-08 19:27:13

问题


I’m trying to get the header from a website, encode it in JSON to write it to a file. I’ve tried two different ways without success.

FIRST with urllib2 and json

import urllib2
import json
host = ("https://www.python.org/")
header = urllib2.urlopen(host).info()
json_header = json.dumps(header)
print json_header

in this way I get the error:

TypeError: is not JSON serializable

So I try to bypass this issue by converting the object to a string -> json_header = str(header) In this way I can json_header = json.dumps(header) but the output it’s weird:

"Date: Wed, 02 Jul 2014 13:33:37 GMT\r\nServer: nginx\r\nContent-Type: text/html; charset=utf-8\r\nX-Frame-Options: SAMEORIGIN\r\nContent-Length: 45682\r\nAccept-Ranges: bytes\r\nVia: 1.1 varnish\r\nAge: 1263\r\nX-Served-By: cache-fra1220-FRA\r\nX-Cache: HIT\r\nX-Cache-Hits: 2\r\nVary: Cookie\r\nStrict-Transport-Security: max-age=63072000; includeSubDomains\r\nConnection: close\r\n"

SECOND with requests

import requests
r = requests.get(“https://www.python.org/”)
rh = r.headers
print rh

{'content-length': '45682', 'via': '1.1 varnish', 'x-cache': 'HIT', 'accept-ranges': 'bytes', 'strict-transport-security': 'max-age=63072000; includeSubDomains', 'vary': 'Cookie', 'server': 'nginx', 'x-served-by': 'cache-fra1226-FRA', 'x-cache-hits': '14', 'date': 'Wed, 02 Jul 2014 13:39:33 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'text/html; charset=utf-8', 'age': '1619'}

In this way the output is more JSON like but still not OK (see the ‘ ‘ instead of “ “ and other stuff like the = and ;). Evidently there’s something (or a lot) I’m not doing in the right way. I’ve tried to read the documentation of the modules but I can’t understand how to solve this problem. Thank you for your help.


回答1:


There are more than a couple ways to encode headers as JSON, but my first thought would be to convert the headers attribute to an actual dictionary instead of accessing it as requests.structures.CaseInsensitiveDict

import requests, json
r = requests.get("https://www.python.org/")
rh = json.dumps(r.headers.__dict__['_store'])
print rh

{'content-length': ('content-length', '45474'), 'via': ('via', '1.1 varnish'), 'x-cache': ('x-cache', 'HIT'), 'accept-ranges': ('accept-ranges', 'bytes'), 'strict-transport-security': ('strict-transport-security', 'max-age=63072000; includeSubDomains'), 'vary': ('vary', 'Cookie'), 'server': ('server', 'nginx'), 'x-served-by': ('x-served-by', 'cache-iad2132-IAD'), 'x-cache-hits': ('x-cache-hits', '1'), 'date': ('date', 'Wed, 02 Jul 2014 14:13:37 GMT'), 'x-frame-options': ('x-frame-options', 'SAMEORIGIN'), 'content-type': ('content-type', 'text/html; charset=utf-8'), 'age': ('age', '1483')}

Depending on exactly what you want on the headers you can specifically access them after this, but this will give you all the information contained in the headers, if in a slightly different format.

If you prefer a different format, you can also convert your headers to a dictionary:

import requests, json
r = requests.get("https://www.python.org/")
print json.dumps(dict(r.headers))

{"content-length": "45682", "via": "1.1 varnish", "x-cache": "HIT", "accept-ranges": "bytes", "strict-transport-security": "max-age=63072000; includeSubDomains", "vary": "Cookie", "server": "nginx", "x-served-by": "cache-at50-ATL", "x-cache-hits": "5", "date": "Wed, 02 Jul 2014 14:08:15 GMT", "x-frame-options": "SAMEORIGIN", "content-type": "text/html; charset=utf-8", "age": "951"}




回答2:


If you are only interested in the header, make a head request. convert the CaseInsensitiveDict in a dict object and then convert it to json.

import requests
import json
r = requests.head('https://www.python.org/')
rh = dict(r.headers)
json.dumps(rh)



回答3:


import requests
import json

r = requests.get('https://www.python.org/')
rh = r.headers

print json.dumps( dict(rh) ) # use dict()

result:

{"content-length": "45682", "via": "1.1 varnish", "x-cache": "HIT", "accept-ranges": "bytes", "strict-transport-security": "max-age=63072000; includeSubDomains", "vary": "Cookie", "server": "nginx", "x-served-by": "cache-fra1224-FRA", "x-cache-hits": "5", "date": "Wed, 02 Jul 2014 14:08:04 GMT", "x-frame-options": "SAMEORIGIN", "content-type": "text/html; charset=utf-8", "age": "3329"}




回答4:


I know this is an old question, but I stumbled across it when trying to put together a quick and dirty Python curl-esque URL getter. I kept getting an error:

TypeError: Object of type 'CaseInsensitiveDict' is not JSON serializable

The above solutions are good if need to output a JSON string immediately, but in my case I needed to return a python dictionary of the headers, and I wanted to normalize the capitalization to make all keys lowercase.

My solution was to use a dict comprehension:

import requests

response = requests.head('https://www.python.org/')

my_dict = {
   'body': response.text,
   'http_status_code': response.status_code,
   'headers': {k.lower(): v for (k, v) in response.headers.items()}
}


来源:https://stackoverflow.com/questions/24533018/get-a-header-with-python-and-convert-in-json-requests-urllib2-json

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!