How to extract a JSON object that was defined in an HTML page's JavaScript block using Python?

名媛妹妹 2020-12-03 01:51

I am downloading HTML pages that have data defined in them in the following way:

... 

        
3 answers
  •  庸人自扰
    2020-12-03 02:30

    BeautifulSoup is an HTML parser; you also need a JavaScript parser here. Note that some JavaScript object literals are not valid JSON (though in your example the literal is also valid JSON).
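    To make that caveat concrete, here is a small sketch that uses only the standard json module. The sample literals are made up for illustration and are not taken from the question's page; each one is a legal JavaScript object literal, but only the first is accepted by json.loads.

```python
import json

# Hypothetical samples: each is a legal JavaScript object literal.
samples = {
    "double-quoted keys": '{"activity": {"type": "read"}}',
    "unquoted keys": '{activity: {type: "read"}}',
    "single-quoted strings": "{'activity': {'type': 'read'}}",
    "trailing comma": '{"activity": {"type": "read"},}',
}


def is_valid_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


results = {name: is_valid_json(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'valid JSON' if ok else 'not valid JSON'}")
```

    If the page you are scraping uses any of the last three forms, parsing the literal with the json module alone will not be enough.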

    In simple cases you could:

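    1. extract the <script> element's text using an HTML parser
    2. assume that the window.blog.data = {...}; assignment starts on its own line and that the object literal ends at the first }; that closes a line, and extract the literal with a regex
    3. assume that the literal is valid JSON and parse it with the json module

      import json
      import re
      from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

      html = """... some other html here """  # page source (elided here)

      soup = BeautifulSoup(html, 'html.parser')
      # find the <script> whose text mentions window.blog.data
      script = soup.find('script', string=re.compile(r'window\.blog\.data'))
      # capture the {...} literal assigned to window.blog.data
      json_text = re.search(r'^\s*window\.blog\.data\s*=\s*({.*?})\s*;\s*$',
                            script.string, flags=re.DOTALL | re.MULTILINE).group(1)
      data = json.loads(json_text)
      assert data['activity']['type'] == 'read'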

      If the assumptions are incorrect then the code fails.
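      One way to depend less on where the literal ends, while still assuming it is valid JSON, is json.JSONDecoder.raw_decode, which parses a single JSON value starting at a given index and ignores whatever follows it. This is a sketch of a swapped-in technique, not part of the original answer; the helper name extract_assigned_json and the sample script text are hypothetical.

```python
import json
import re


def extract_assigned_json(script_text, name):
    """Parse the JSON value assigned to `name` inside JavaScript source.

    Assumes the right-hand side is valid JSON; raises ValueError otherwise.
    """
    # Locate "<name> =" and stop right where the value starts.
    match = re.search(re.escape(name) + r'\s*=\s*', script_text)
    if match is None:
        raise ValueError(f'no assignment to {name} found')
    # raw_decode parses one JSON value starting at the given index and
    # returns (value, end_index); trailing JavaScript is ignored.
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


# Hypothetical script text: the literal contains "};" inside a string
# and more code follows the assignment on the same line.
script_text = '''
window.blog = {};
window.blog.data = {"activity": {"type": "read", "note": "a }; inside"}}; init();
'''
data = extract_assigned_json(script_text, 'window.blog.data')
print(data['activity']['type'])
```

      The decoder tracks nesting and string quoting itself, so a }; inside a string or further statements on the same line do not matter. The third assumption still applies: a literal that is not valid JSON raises an error.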

      To relax the second assumption, a JavaScript parser could be used instead of a regex, e.g., slimit (suggested by @approximatenumber):

      import json

      from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
      from slimit import ast  # $ pip install slimit
      from slimit.parser import Parser as JavascriptParser
      from slimit.visitors import nodevisitor

      # assumes html holds the page source and its first <script> has the data
      soup = BeautifulSoup(html, 'html.parser')
      tree = JavascriptParser().parse(soup.script.string)
      # find the right-hand side of the window.blog.data assignment
      obj = next(node.right for node in nodevisitor.visit(tree)
                 if (isinstance(node, ast.Assign) and
                     node.left.to_ecma() == 'window.blog.data'))
      # HACK: easy way to parse the javascript object literal
      data = json.loads(obj.to_ecma())  # NOTE: json format may be slightly different
      assert data['activity']['type'] == 'read'
      

      There is no need to treat the object literal (obj) as a JSON object. To get the necessary information, obj can be visited recursively like any other AST node. That would make it possible to support arbitrary JavaScript code (anything that slimit can parse).
