Extracting text from script tag using BeautifulSoup in Python

自闭症患者 2020-11-27 22:17

Could you please help me with this little thing? I am looking to extract the email, phone and name values from the code below, which sits in a SCRIPT tag (not in the body), using Beautiful Soup (Python).

2 Answers

清歌不尽 (original poster)

2020-11-27 22:51

You can get the script tag contents via BeautifulSoup and then apply a regex to get the desired data.

Working example (based on what you've described in the question):

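import re
from bs4 import BeautifulSoup

# Sample page. The tags and the script body are a reconstruction that is
# consistent with the output shown below.
data = """
<html>
    <head>
        <title>My Sample Page</title>
        <script>
            var data = {
                email: 'abc@g.com',
                phone: '9999999999',
                name: 'XYZ'
            };
        </script>
    </head>
    <body>
        <h1>What a wonderful world</h1>
    </body>
</html>
"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, 'html.parser')
script = soup.find('script')

# Capture every  key: 'value'  pair in the JavaScript source.
pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(re.findall(pattern, script.text))
print(fields['email'], fields['phone'], fields['name'])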

Prints:

abc@g.com 9999999999 XYZ

I don't really like this solution, since the regex approach is fragile: double-quoted values, an escaped quote inside a value, or different spacing around the colon would all break it. I still think there is a better solution and that we are missing the bigger picture here. A link to the specific site would help a lot, but it is what it is.
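To make the fragility concrete, here is a minimal sketch of a slightly more tolerant pattern. It assumes the values are plain quoted strings with no escaped quotes inside them. The `script_text` sample and the `extract_fields` helper are made up for illustration and are not taken from the question.

```python
import re

# Hypothetical script body; in practice this would be the text of the
# script tag found with BeautifulSoup.
script_text = """
$.ajax({
    data: {
        email: "abc@g.com",
        'phone' : '9999999999',
        name:'XYZ'
    }
});
"""

# The strict pattern from the answer: bare key, exactly ": ", single quotes.
strict = re.compile(r"(\w+): '(.*?)'")

# A more tolerant variant: optional quotes around the key, any whitespace
# around the colon, and a value in either quote style. The backreference \2
# requires the closing quote to be the same character as the opening one.
tolerant = re.compile(r"""['"]?(\w+)['"]?\s*:\s*(['"])(.*?)\2""")


def extract_fields(text):
    """Return a dict of key/value pairs matched by the tolerant pattern."""
    return {key: value for key, _quote, value in tolerant.findall(text)}


strict_fields = dict(strict.findall(script_text))
fields = extract_fields(script_text)

print(strict_fields)  # the strict pattern matches none of the three lines
print(fields["email"], fields["phone"], fields["name"])
```

This is still a regex over JavaScript source, so it stays brittle: nested objects, escaped quotes, or values built from variables would need a different approach.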

UPD (fixing the code OP provided):

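# `data` is the HTML from the question; the script is looked up as a sibling
# that follows the <html> element and contains a $(document).ready call.
soup = BeautifulSoup(data, 'html.parser')
script = soup.html.find_next_sibling('script', string=re.compile(r"\$\(document\)\.ready"))

pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(re.findall(pattern, script.text))
print(fields['email'], fields['phone'], fields['name'])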

Prints:

abcd@gmail.com 9999999999 Shamita Shetty
