Extracting text from script tag using BeautifulSoup in Python

自闭症患者 2020-11-27 22:17

Could you please help me with this little thing? I am looking to extract the email, phone and name values from the code below, which sits in a SCRIPT tag (not in the body), using Beautiful Soup (Python).

2 answers
  •  清歌不尽
    2020-11-27 22:51

    You can get the script tag contents via BeautifulSoup and then apply a regex to get the desired data.

    Working example (based on what you've described in the question):

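    import re
    from bs4 import BeautifulSoup

    # The sample's tags and <script> body were lost from this page; the markup below
    # is a reconstruction consistent with the surviving text and the printed output.
    data = """
    <html>
        <head>
            <title>My Sample Page</title>
            <script>
                var user = {
                    email: 'abc@g.com',
                    phone: '9999999999',
                    name: 'XYZ'
                };
            </script>
        </head>
        <body>
            <h1>What a wonderful world</h1>
        </body>
    </html>
    """

    soup = BeautifulSoup(data, 'html.parser')
    script = soup.find('script')

    pattern = re.compile(r"(\w+): '(.*?)'")
    fields = dict(re.findall(pattern, script.text))
    print(fields['email'], fields['phone'], fields['name'])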

    Prints:

    abc@g.com 9999999999 XYZ
    

    I don't really like this solution, since the regex approach is fragile: a change in quote style or spacing inside the script would silently break it. I still think there is a better solution and that we are missing the bigger picture here. A link to the specific site would help a lot, but it is what it is.
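    To make the fragility concrete, here is a minimal sketch using only the standard `re` module. The two script bodies are hypothetical, and the `tolerant` pattern and `extract` helper are illustrative names, not part of the original answer. It shows the pattern above returning nothing once the values use double quotes, and one possible variant that accepts either quote style.

```python
import re

# The pattern from the answer: bare key, colon, one space, single-quoted value.
original = re.compile(r"(\w+): '(.*?)'")

# Hypothetical script bodies; only the first has the layout the answer assumes.
single_quoted = "data: {email: 'abc@g.com', phone: '9999999999', name: 'XYZ'}"
double_quoted = 'data: {email: "abc@g.com", phone: "9999999999", name: "XYZ"}'

print(dict(original.findall(single_quoted)))
print(dict(original.findall(double_quoted)))  # nothing matches: the quote style changed

# A more tolerant variant (an assumption, not the answer's code): only the wanted
# keys, optional quotes around the key, flexible whitespace around the colon,
# and either quote style around the value.
tolerant = re.compile(r"""["']?\b(email|phone|name)["']?\s*:\s*(["'])(.*?)\2""")


def extract(script_text):
    """Return the wanted fields found in a script body as a dict."""
    return {key: value for key, _quote, value in tolerant.findall(script_text)}


print(extract(single_quoted))
print(extract(double_quoted))
```

    This is still a heuristic: values containing escaped quotes, or data assembled at runtime, would defeat it just the same.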


    UPD (fixing the code OP provided):

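    soup = BeautifulSoup(data, 'html.parser')
    # Pick the <script> that follows <html> and contains the $(document).ready call.
    # (bs4 >= 4.4.0 names this filter string=; older releases call it text=.)
    script = soup.html.find_next_sibling('script', string=re.compile(r"\$\(document\)\.ready"))

    pattern = re.compile(r"(\w+): '(.*?)'")
    fields = dict(re.findall(pattern, script.text))
    print(fields['email'], fields['phone'], fields['name'])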
    

    Prints:

    abcd@gmail.com 9999999999 Shamita Shetty
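
    The OP's page is not shown, so the following is only a sketch under an assumption: that the `<script>` sits after the closing `</html>` tag, which would explain why the lookup starts from `soup.html` and walks its siblings. The markup and the `user` variable are hypothetical; only the three values come from the output above. It also shows `soup.find` with the same filter, which does not depend on where the script sits.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: the markup is an assumption; the <script> deliberately
# sits after the closing </html> tag.
data = """<html>
<head><title>Sample</title></head>
<body><p>Hello</p></body>
</html>
<script>
$(document).ready(function () {
    var user = {email: 'abcd@gmail.com', phone: '9999999999', name: 'Shamita Shetty'};
});
</script>
"""

soup = BeautifulSoup(data, "html.parser")
marker = re.compile(r"\$\(document\)\.ready")

# Relative lookup: start at <html> and walk its later siblings.
by_sibling = soup.html.find_next_sibling("script", string=marker)

# Position-independent lookup: search the whole tree for a matching <script>.
by_search = soup.find("script", string=marker)

fields = dict(re.findall(r"(\w+): '(.*?)'", by_search.string))
print(by_sibling is by_search)
print(fields["email"], fields["phone"], fields["name"])
```

    With `html.parser` the tree is built as written, so a trailing script stays a sibling of `<html>`; other parsers may relocate it, in which case the `soup.find` form is the safer choice.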
    
