Question
I have an element of type bs4.element.Tag, obtained from web scraping. I usually do json.loads(soup.find('script', type='application/ld+json').text), but on this page the data only appears inside a plain <script> </script> tag, so I had to do scripts = soup.find_all('script') and go through them until I reached the one that interests me: script = scripts[18].
The variable in question is script. My problem is that I want to access its attributes, for example script['goodsInfo']. Since it is a bs4.element.Tag, I tried script.attrs, which returns {}. Then I tried to convert it to JSON with json.loads(str(script)), which raises the exception: 'JSONDecodeError: Expecting value: line 1 column 1 (char 0)'
This is my code:
import json
from bs4 import BeautifulSoup
import requests
url_aux = 'https://www.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0'
response = requests.get(url_aux)
soup = BeautifulSoup(response.content, "html.parser")
scripts = soup.find_all('script')
script = scripts[18]
print(json.loads(str(script)))
#output: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
print(type(script))
#output: bs4.element.Tag
print(str(json.loads(str(script))))
Answer 1:
You can use the json module to extract the data, but first you need to locate the right part of the page; you can use the re module for that.
For example:
import re
import json
import requests
url = 'https://eur.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0&ref=www&rep=dir&ret=eur'
txt = re.findall(r'goodsInfo\s*:\s*({.*})', requests.get(url).text)[0]
data = json.loads(txt)
# print(json.dumps(data, indent=4)) # <-- uncomment to see all data
print(data['detail']['goods_name'])
print(data['detail']['brand'])
print('Num of comments:', data['detail']['comment']['comment_num'])
Prints:
Mock-neck Brush Stroke Print Bodycon Dress
SHEIN
Num of comments: 17
Answer 2:
BS4 does not parse JavaScript; from the point of view of BS4's Tag object, the text in a <script> tag is, well, just text. I don't have any idea what this script looks like (since you didn't post it and I'm not going to bother trying to find it), but if your expectation was that script['goodsInfo'] would return the value of a JS variable named 'goodsInfo' then, bad news, it's not going to work that way.
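To make this concrete, here is a minimal sketch using a made-up page fragment (the real page's script was not posted, so the HTML and the variable name gbProductInfo below are assumptions for illustration only). It shows that the tag has no HTML attributes and that its content is available only as a string:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the actual script on the page is not shown in the question.
html = """
<html><body>
<script>var gbProductInfo = { goodsInfo: {"detail": {"goods_name": "Dress"}} };</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
script = soup.find("script")

# .attrs holds HTML attributes (type=, src=, ...), not JavaScript variables,
# so it is empty for a bare <script> tag.
print(script.attrs)

# The JavaScript source is just the tag's text content.
js_source = script.string
print("goodsInfo" in js_source)
```

Subscripting the tag, as in script['goodsInfo'], looks up an HTML attribute of that name, which is why it cannot reach anything defined inside the JavaScript.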
Also, JavaScript is not JSON, so the chances that a JS snippet will be valid JSON are rather small, to say the least. The proper way to test it would be quite simply the same as the one you used for your first use case, i.e. json.loads(script.text), but I assume that's the first thing you tried ;-)
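As a rough illustration of the difference, the sketch below feeds three strings to json.loads. The strings are invented stand-ins for what str(script), script.text, and the embedded object might look like; the helper is_valid_json is hypothetical, not part of any library:

```python
import json

# Invented examples; the real script content is unknown.
tag_markup = '<script>var info = {"goods_name": "Dress"};</script>'  # like str(script)
js_source = 'var info = {"goods_name": "Dress"};'                     # like script.text
json_part = '{"goods_name": "Dress"}'                                 # only the object literal


def is_valid_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


for label, text in [("markup", tag_markup), ("js", js_source), ("json", json_part)]:
    print(label, is_valid_json(text))
```

Only the last string parses. The first two fail at the very first character, which matches the 'line 1 column 1 (char 0)' message in the question.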
So, well, I'm afraid you'll have to manually parse this script to extract the relevant part. Depending on what the JS code looks like, it may be a matter of a few lines of basic string parsing or regexp work, or it may require a proper JavaScript parser, etc.
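For the simple case, one possible approach is sketched below. It assumes the value assigned to the name happens to be strictly valid JSON (double-quoted keys, no trailing commas, no function calls); if it is not, json raises JSONDecodeError and a real JavaScript parser would be needed. The function name extract_js_object and the sample script are made up for this example. It locates the name with a regex and then lets json.JSONDecoder.raw_decode read exactly one value starting at the opening brace, so trailing JavaScript is ignored:

```python
import json
import re


def extract_js_object(js_source, name):
    """Return the JSON-compatible value assigned to `name`, or None if not found."""
    # Match `name :` or `name =` and stop right before the opening { or [.
    match = re.search(re.escape(name) + r"\s*[:=]\s*(?=[{\[])", js_source)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # also returns where it stopped; anything after that is left alone.
    obj, _end = json.JSONDecoder().raw_decode(js_source, match.end())
    return obj


# Invented sample; note the "}" inside a string value and the trailing JS.
js = 'var gbProductInfo = { goodsInfo: {"detail": {"goods_name": "Dress", "note": "a } b"}}, other: 1 };'
data = extract_js_object(js, "goodsInfo")
print(data["detail"]["goods_name"])
```

Because the JSON decoder itself decides where the object ends, braces inside string values do not confuse it the way a hand-written brace counter or a greedy regex might.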
Source: https://stackoverflow.com/questions/59665253/how-to-convert-a-beautifulsoup-tag-to-json