Extracting data from script tag using BeautifulSoup in Python

Submitted by 不打扰是莪最后的温柔 on 2019-12-02 00:49:08

The script tags keep the same order in the page's HTML, so you can count them and use an index to get the right one (as long as the site doesn't change its layout).

all_scripts[6]

The script's content is an ordinary string, so you can also use standard string operations, e.g.

if '{"loved"' in script.text:

Code with both methods (I use [:100] to display only the beginning of the string):

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

all_scripts = soup.find_all('script')

print('--- first method ---')
print(all_scripts[6].text[:100])

print('--- second method ---')
for number, script in enumerate(all_scripts):
    if '{"loved"' in script.text:
        print(number, script.text[:100])

Result:

--- first method ---
window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276
--- second method ---
6 window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276

EDIT: Once you have the right script, you can use slicing to keep only the JSON string, use the json module to convert it to a Python dictionary, and then read the data from it.

import requests
from bs4 import BeautifulSoup
import json

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

all_scripts = soup.find_all('script')

data = json.loads(all_scripts[6].get_text()[27:])

print('key:', data.keys())
print('key:', data['TAB'].keys())
print('key:', data['DATA'].keys())
print('---')

for item in data['TAB']['loved']['data']:
    print('ART_NAME:', item['ART_NAME'])
    print('SNG_TITLE:', item['SNG_TITLE'])
    print('---')

Result:

key: dict_keys(['TAB', 'DATA'])
key: dict_keys(['loved'])
key: dict_keys(['USER', 'FOLLOW', 'FOLLOWING', 'HAS_BLOCKED', 'IS_BLOCKED', 'IS_PUBLIC', 'CURATOR', 'IS_PERSONNAL', 'NB_FOLLOWER', 'NB_FOLLOWING'])
---
ART_NAME: Twenty One Pilots
SNG_TITLE: Heathens
---
ART_NAME: Twenty One Pilots
SNG_TITLE: Stressed Out
---
ART_NAME: Linkin Park
SNG_TITLE: Numb
---
ART_NAME: Three Days Grace
SNG_TITLE: Animal I Have Become
---
ART_NAME: Three Days Grace
SNG_TITLE: Painkiller
---
ART_NAME: Slipknot
SNG_TITLE: Before I Forget
---
ART_NAME: Slipknot
SNG_TITLE: Duality
---
ART_NAME: Skrillex
SNG_TITLE: Make It Bun Dem
---
ART_NAME: Skrillex
SNG_TITLE: Bangarang (feat. Sirah)
---
ART_NAME: Limp Bizkit
SNG_TITLE: Break Stuff
---
ART_NAME: Three Days Grace
SNG_TITLE: I Hate Everything About You
---
ART_NAME: Three Days Grace
SNG_TITLE: Time of Dying
---
ART_NAME: Three Days Grace
SNG_TITLE: I Am Machine
---
ART_NAME: Three Days Grace
SNG_TITLE: Riot
---
ART_NAME: Three Days Grace
SNG_TITLE: So What
---
ART_NAME: Three Days Grace
SNG_TITLE: Pain
---
ART_NAME: Three Days Grace
SNG_TITLE: Tell Me Why
---
ART_NAME: Three Days Grace
SNG_TITLE: Chalk Outline
---
ART_NAME: Three Days Grace
SNG_TITLE: Gone Forever
---
ART_NAME: Slipknot
SNG_TITLE: The Devil In I
---
ART_NAME: Linkin Park
SNG_TITLE: No More Sorrow
---
ART_NAME: Linkin Park
SNG_TITLE: Bleed It Out
---
ART_NAME: The Doors
SNG_TITLE: Roadhouse Blues
---
ART_NAME: The Doors
SNG_TITLE: Riders On The Storm
---
ART_NAME: The Doors
SNG_TITLE: Break On Through (To The Other Side)
---
ART_NAME: The Doors
SNG_TITLE: Alabama Song (Whisky Bar)
---
ART_NAME: The Doors
SNG_TITLE: People Are Strange
---
ART_NAME: My Chemical Romance
SNG_TITLE: Welcome to the Black Parade
---
ART_NAME: My Chemical Romance
SNG_TITLE: Teenagers
---
ART_NAME: My Chemical Romance
SNG_TITLE: Na Na Na [Na Na Na Na Na Na Na Na Na]
---
ART_NAME: My Chemical Romance
SNG_TITLE: Famous Last Words
---
ART_NAME: The Doors
SNG_TITLE: Soul Kitchen
---
ART_NAME: The Black Keys
SNG_TITLE: Lonely Boy
---
ART_NAME: Katy Perry
SNG_TITLE: I Kissed a Girl
---
ART_NAME: Katy Perry
SNG_TITLE: Hot N Cold
---
ART_NAME: Katy Perry
SNG_TITLE: E.T.
---
ART_NAME: Linkin Park
SNG_TITLE: Given Up
---
ART_NAME: My Chemical Romance
SNG_TITLE: Dead!
---
ART_NAME: My Chemical Romance
SNG_TITLE: Mama
---
ART_NAME: My Chemical Romance
SNG_TITLE: The Sharpest Lives
---
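
The fixed offset [27:] above works because the prefix "window.__DZR_APP_STATE__ = " is exactly 27 characters long, so it breaks if the site ever changes the variable name or spacing. Below is a minimal sketch of a less brittle way to cut out the JSON, assuming the script still has the form "name = {...}". The helper name extract_state and the shortened sample string are made up for illustration; they are not part of the original answer or of the Deezer page.

```python
import json


def extract_state(script_text, marker='window.__DZR_APP_STATE__'):
    """Return the object assigned to `marker` inside a script's text.

    Hypothetical helper: raises ValueError if the marker is missing.
    """
    # Skip past the variable name, wherever it sits in the script.
    start = script_text.index(marker) + len(marker)
    # Everything after the first '=' following the marker is the payload.
    _, _, payload = script_text[start:].partition('=')
    # raw_decode parses one JSON value and ignores what follows it
    # (for example a trailing ';'), unlike json.loads.
    data, _ = json.JSONDecoder().raw_decode(payload.lstrip())
    return data


# Shortened, made-up sample in the same shape as the real script text.
sample = ('window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":'
          '[{"ART_NAME":"Linkin Park","SNG_TITLE":"Numb"}]}}};')

state = extract_state(sample)
for item in state['TAB']['loved']['data']:
    print(item['ART_NAME'], '-', item['SNG_TITLE'])
```

With the real page you would pass all_scripts[6].get_text() instead of sample.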
RussellB

If my understanding is correct, you want only the script element with "SNG_TITLE" in it.

You can use re to get only the script element that contains the fields you are interested in, as follows:

import requests
from bs4 import BeautifulSoup
import re

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

user_name = soup.find(class_='user-name')
print(user_name.text)

# string= is the current name of the older text= argument
for script in soup.find_all(string=re.compile(r'SNG_TITLE')):
    print(script.parent)

EDIT:

@furas's answer is the complete solution, using json to get 'SNG_TITLE' and 'ART_NAME'. My answer only helps you find the script that contains 'SNG_TITLE'. You can combine both to get better code.
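
One possible way to combine the two answers is sketched below: find the script by its content with a regular expression, then parse it with json. To keep it runnable without a network request, it uses a small made-up HTML string in place of r.text; the real page is assumed to have the same "window.__DZR_APP_STATE__ = {...}" shape shown in the output above.

```python
import json
import re

from bs4 import BeautifulSoup

# Made-up stand-in for r.text, shaped like the real page.
html = '''
<html><body>
<script>var unrelated = 1;</script>
<script>window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"ART_NAME":"The Doors","SNG_TITLE":"Soul Kitchen"}]}}}</script>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

# First <script> whose text matches the pattern, or None if there is none.
script = soup.find('script', string=re.compile(r'SNG_TITLE'))

titles = []
if script is not None:
    # Everything after the first '=' is assumed to be the JSON object.
    _, _, payload = script.string.partition('=')
    data = json.loads(payload.strip())
    titles = [(item['ART_NAME'], item['SNG_TITLE'])
              for item in data['TAB']['loved']['data']]

print(titles)
```

This way the code depends neither on the script's position in the page nor on the exact length of the prefix.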
