Question
I'm trying to scrape the JSON embedded in the page at 'https://sports.bovada.lv/soccer/premier-league'.
The page source contains the following:
<script type="text/javascript">var swc_market_lists = {"items":[{"description":"Game Lines","id":"23", ... </script>
I'm trying to get the contents of the swc_market_lists variable.
The issue is that when I use the following code:
import requests
from lxml import html
url = 'https://sports.bovada.lv/soccer/premier-league'
r = requests.get(url)
tree = html.fromstring(r.content)
var = tree.xpath('//script')
print(var)
I get an empty list for var.
I have also tried saving r.text and viewing it, but I don't see the script tags in there.
What am I missing?
Answer 1:
You need to pass a User-Agent header to make it work:
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36"})
To get the desired script, re-parse the new response (tree = html.fromstring(r.content)) and select the script element whose text contains swc_market_lists:
script = tree.xpath('//script[contains(., "swc_market_lists")]/text()')[0]
print(script)
To extract the value of the swc_market_lists variable:
import re
data = re.search(r"var swc_market_lists = (.*?);$", script).group(1)
print(data)
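The regular expression above relies on the assignment ending with ";" at the end of the script text. If the script contained further statements after the assignment, a possible alternative is to let the JSON decoder find where the object ends. The following is a minimal sketch of that idea, not part of the original answer: the helper name extract_js_object is hypothetical, and the script string is a shortened stand-in containing only the fields visible in the question.

```python
import json

# Shortened, hypothetical stand-in for the real script text (the real payload is longer).
script = (
    'var swc_market_lists = {"items":[{"description":"Game Lines","id":"23"}]};\n'
    'var something_else = 1;'
)

def extract_js_object(script_text, var_name):
    """Return the JSON value assigned to `var_name` in a script body.

    Hypothetical helper: assumes the assignment is written exactly as
    `var <name> = <json>` and that the value is valid JSON.
    """
    marker = "var %s = " % var_name
    # str.index raises ValueError if the assignment is not present
    start = script_text.index(marker) + len(marker)
    # raw_decode parses one JSON value starting at `start` and ignores the rest
    value, _end = json.JSONDecoder().raw_decode(script_text, start)
    return value

market_lists = extract_js_object(script, "swc_market_lists")
print(market_lists["items"][0]["description"])
```

Note that this variant returns an already-parsed Python object, so a separate json.loads() call would not be needed for its result.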
Then, to make it easier to work with, load it with json.loads() into a Python dictionary:
import json
data = json.loads(data)
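Once loaded, the result can be navigated like any other dictionary. As a rough illustration, assuming the structure matches the fragment shown in the question (an "items" list whose entries have "description" and "id" keys; the real payload is truncated there and may contain more fields), the sample string below is a shortened stand-in and descriptions_by_id is a name chosen for this example:

```python
import json

# Shortened stand-in built only from the fields visible in the question.
data = json.loads('{"items":[{"description":"Game Lines","id":"23"}]}')

for item in data["items"]:
    # .get() returns None instead of raising KeyError if a field is missing
    print(item.get("id"), item.get("description"))

# Build a lookup table from market id to its description
descriptions_by_id = {item["id"]: item["description"] for item in data["items"]}
print(descriptions_by_id)
```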
Source: https://stackoverflow.com/questions/35306761/parsing-json-var-inside-script-tag