Extracting text from script tag using BeautifulSoup in Python


自闭症患者 2020-11-27 22:17

Could you please help me with this small thing? I am looking to extract the email, phone and name values from the code below, which sits in a SCRIPT tag (not in the body), using Beautiful Soup (Python).

2 Answers

隐瞒了意图╮ (original poster)

2020-11-27 22:41

As an alternative to the regex-based approach, you can parse the JavaScript code with the slimit module, which builds an abstract syntax tree and lets you collect all assignments into a dictionary:

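from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


# Sample page (the tags and the script body are an assumption,
# chosen to match the output shown below)
data = """
<html>
    <head>
        <title>My Sample Page</title>
        <script>
        $.ajax({
            type: "POST",
            url: 'http://www.example.com',
            data: {
                email: 'abc@g.com',
                phone: '9999999999',
                name: 'XYZ'
            }
        });
        </script>
    </head>
    <body>
        <h1>What a wonderful world</h1>
    </body>
</html>
"""

# get the script tag contents from the html
soup = BeautifulSoup(data, "html.parser")
script = soup.find('script')

# parse the JavaScript into an AST
parser = Parser()
tree = parser.parse(script.string)

# collect every assignment, including "key: value" pairs in object literals;
# a right-hand side with no plain value (the nested object) maps to ''
fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
          for node in nodevisitor.visit(tree)
          if isinstance(node, ast.Assign)}

print(fields)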

Prints:

{u'name': u"'XYZ'", u'url': u"'http://www.example.com'", u'type': u'"POST"', u'phone': u"'9999999999'", u'data': '', u'email': u"'abc@g.com'"}

Among other fields, there are email, name and phone that you are interested in.
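Note that the values in the printed dictionary still carry their JavaScript quotes (for example `'XYZ'`). A minimal sketch of cleaning them up, assuming every value is either empty or wrapped in one pair of matching quotes; `strip_js_quotes` and `wanted` are hypothetical names, and the `fields` dictionary is copied from the output above:

```python
def strip_js_quotes(value):
    """Remove one pair of matching surrounding quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


# Same content as the dictionary printed above
fields = {
    "name": "'XYZ'",
    "url": "'http://www.example.com'",
    "type": '"POST"',
    "phone": "'9999999999'",
    "data": "",
    "email": "'abc@g.com'",
}

# Keep only the three keys the question asks about, without the quotes
wanted = {key: strip_js_quotes(fields[key]) for key in ("email", "phone", "name")}
print(wanted)
```

This does not handle escaped quotes inside a value; for the simple literals here that should not matter.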

Hope that helps.
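For comparison, the regex-based approach mentioned at the top could look roughly like this. It is only a sketch: `script_text` is a hypothetical script body shaped like the fields above (in practice it would come from `soup.find('script').string`), and the pattern only handles simple `key: 'value'` or `key: "value"` pairs.

```python
import re

# Hypothetical script body; an assumption shaped like the printed fields
script_text = """
$.ajax({
    type: "POST",
    url: 'http://www.example.com',
    data: {
        email: 'abc@g.com',
        phone: '9999999999',
        name: 'XYZ'
    }
});
"""

# key, colon, opening quote, lazily matched value, then the same quote again
# (the \2 backreference ensures the closing quote matches the opening one)
pattern = re.compile(r"""(\w+)\s*:\s*(['"])(.*?)\2""")

pairs = {key: value for key, _quote, value in pattern.findall(script_text)}
print(pairs["email"], pairs["phone"], pairs["name"])
```

The regex is shorter and has no extra dependency, but it breaks more easily than a real parser if the JavaScript uses escaped quotes, unquoted values, or quoted keys.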
