How to convert a BeautifulSoup tag to JSON?

Posted by 廉价感情 on 2020-01-16 19:35:10

Question


I have an element of type bs4.element.Tag, obtained by web scraping. I usually do json.loads(soup.find('script', type='application/ld+json').text), but on this page the data only appears in a plain <script> </script> tag, so I had to do scripts = soup.find_all('script') and go through the results until I found the one that interests me: script = scripts[18].

The variable in question is script. My problem is that I want to access its attributes, for example script['goodsInfo']. Since it is a bs4.element.Tag, I tried script.attrs, which returns {}. Then I tried to convert it to JSON with json.loads(str(script)), which raises the exception: 'JSONDecodeError: Expecting value: line 1 column 1 (char 0)'

This is my code:

import json
from bs4 import BeautifulSoup
import requests
url_aux = 'https://www.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0'

response = requests.get(url_aux)
soup = BeautifulSoup(response.content, "html.parser")

scripts = soup.find_all('script')
script = scripts[18]

print(json.loads(str(script)))
#output: JSONDecodeError: Expecting value: line 1 column 1 (char 0)

print(type(script))
#output: bs4.element.Tag

print(str(json.loads(str(script))))

Answer 1:


You can use the json module to extract the data, but first you need to locate the right part of the page source; the re module works for that.

For example:

import re
import json
import requests

url = 'https://eur.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0&ref=www&rep=dir&ret=eur'

txt = re.findall(r'goodsInfo\s*:\s*({.*})', requests.get(url).text)[0]

data = json.loads(txt)

# print(json.dumps(data, indent=4)) # <-- uncomment to see all data

print(data['detail']['goods_name'])
print(data['detail']['brand'])
print('Num of comments:', data['detail']['comment']['comment_num'])

Prints:

Mock-neck Brush Stroke Print Bodycon Dress
SHEIN
Num of comments: 17
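
The greedy pattern {.*} above captures everything up to the last closing brace on the line, so it may grab too much if more JavaScript follows the object. As a hedged alternative, here is a minimal sketch that uses json.JSONDecoder.raw_decode to parse exactly one JSON value starting at the opening brace. The helper name extract_json_object and the sample string are made up for illustration; they are not the real SHEIN markup.

```python
import json
import re


def extract_json_object(source, key):
    """Return the JSON object that follows `key:` in `source`, or None."""
    # Match "key : " and stop right before the opening brace.
    match = re.search(re.escape(key) + r'\s*:\s*(?=\{)', source)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever comes after it (more JS, closing braces, ...).
    obj, _end = json.JSONDecoder().raw_decode(source, match.end())
    return obj


# Hypothetical stand-in for the page source (assumption, not real data).
sample = (
    'var productIntroData = { goodsInfo: {"detail": {"goods_name": "Dress", '
    '"tags": ["a}", "b"]}}, other: {"x": 1} };'
)

data = extract_json_object(sample, 'goodsInfo')
print(data['detail']['goods_name'])  # prints: Dress
```

This only works when the value after the key is itself valid JSON (double-quoted keys and strings); otherwise raw_decode raises json.JSONDecodeError.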



Answer 2:


BS4 does not parse JavaScript. From the point of view of BS4's Tag object, the text in a <script> tag is, well, just text. I have no idea what this script looks like (you didn't post it and I'm not going to go looking for it), but if you expected script['goodsInfo'] to return the value of a JS variable named 'goodsInfo', then, bad news, it's not going to work that way.

Also, JavaScript is not JSON, so the chances that a JS snippet is valid JSON are rather small, to say the least. The proper way to test it would simply be the same syntax you used for your first use case, i.e. json.loads(script.text), but I assume that's the first thing you tried ;-)
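
To make the point concrete, here is a small sketch with a made-up script body (an assumption, since the real one was not posted). Feeding JavaScript to json.loads fails on the very first character, which is the same error reported in the question; only the object literal on its own parses.

```python
import json

# Hypothetical contents of a <script> tag: JavaScript, not JSON.
js_text = 'var goodsInfo = {"goods_name": "Dress"};'


def try_parse(text):
    """Return (parsed, None) on success or (None, error message) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, str(exc)


parsed, error = try_parse(js_text)
print(error)  # prints: Expecting value: line 1 column 1 (char 0)

# The object literal alone happens to be valid JSON.
parsed, error = try_parse('{"goods_name": "Dress"}')
print(parsed)  # prints: {'goods_name': 'Dress'}
```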

So, well, I'm afraid you'll have to manually parse this script to extract the relevant part. Depending on what the JS code looks like, it may be a matter of a few lines of basic string parsing or regexp work, or it may require a proper JavaScript parser.
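
As a first step of that manual parsing, it is probably safer to pick the script by what it contains than by a fixed position such as scripts[18], which breaks as soon as the page adds or removes a script. A minimal sketch, assuming the script bodies are already available as plain strings (with BeautifulSoup they could come from something like [tag.get_text() for tag in soup.find_all('script')]); the helper name find_script_text and the sample bodies are hypothetical.

```python
def find_script_text(script_texts, marker):
    """Return the first script body that contains `marker`, or None."""
    for text in script_texts:
        if marker in text:
            return text
    return None


# Hypothetical script bodies (assumption, not the real page content).
scripts = [
    'window.dataLayer = [];',
    'var productIntroData = { goodsInfo: {"detail": {}} };',
    'console.log("loaded");',
]

target = find_script_text(scripts, 'goodsInfo')
print(target)
```

Returning None when nothing matches lets the caller decide how to handle a page whose layout has changed, instead of failing with an IndexError.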



Source: https://stackoverflow.com/questions/59665253/how-to-convert-a-beautifulsoup-tag-to-json
