Scrapy: extract JSON from within HTML script

问题

I'm trying to extract (what appears to be) JSON data from within an HTML script. The HTML script looks like this on the site:

<script>
  $(document).ready(function(){
    var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);
    var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});
  });
</script>

I'd like to pull out the following:

[{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]

Using the following code, I'm able to get the full script.

    def parse(self, response):
        print response.xpath('/html/body/script[2]').extract()

Is there a simple way to then extract the values for "id", "name", etc. from that script. Or, is there a more direct way by modifying the xpath? I can't seem to go any deeper on the xpath using firebug.

回答1:

You can use js2xml for this.

To illustrate, first, let's create a Scrapy selector with your sample HTML, and grab the JavaScript statements:

>>> import scrapy
>>> sample = '''<script>
...   $(document).ready(function(){
...     var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);
...     var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});
...   });
... </script>'''
>>> selector = scrapy.Selector(text=sample, type='html')
>>> selector.xpath('//script//text()').extract_first()
u'\n  $(document).ready(function(){\n    var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);\n    var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});\n  });\n'

Then we can parse the JavaScript code with js2xml. You get an lxml tree back:

>>> import js2xml
>>> jssnippet = selector.xpath('//script//text()').extract_first()
>>> jstree = js2xml.parse(jssnippet)
>>> jstree
<Element program at 0x7fc7c6bae1b8>

What does the tree look like? It's pretty verbose:

>>> print(js2xml.pretty_print(jstree))
<program>
  <functioncall>
    <function>
      <dotaccessor>
        <object>
          <functioncall>
            <function>
              <identifier name="$"/>
            </function>
            <arguments>
              <identifier name="document"/>
            </arguments>
          </functioncall>
        </object>
        <property>
          <identifier name="ready"/>
        </property>
      </dotaccessor>
    </function>
    <arguments>
      <funcexpr>
        <identifier/>
        <parameters/>
        <body>
          <var name="terms">
            <new>
              <dotaccessor>
                <object>
                  <dotaccessor>
                    <object>
                      <dotaccessor>
                        <object>
                          <identifier name="Verba"/>
                        </object>
                        <property>
                          <identifier name="Compare"/>
                        </property>
                      </dotaccessor>
                    </object>
                    <property>
                      <identifier name="Collections"/>
                    </property>
                  </dotaccessor>
                </object>
                <property>
                  <identifier name="Terms"/>
                </property>
              </dotaccessor>
              <arguments>
                <array>
                  <object>
                    <property name="id">
                      <string>6436</string>
                    </property>
                    <property name="name">
                      <string>SUMMER 16</string>
                    </property>
                    <property name="inquiry">
                      <boolean>true</boolean>
                    </property>
                    <property name="ordering">
                      <boolean>true</boolean>
                    </property>
                  </object>
                  <object>
                    <property name="id">
                      <string>6517</string>
                    </property>
                    <property name="name">
                      <string>FALL 16</string>
                    </property>
                    <property name="inquiry">
                      <boolean>true</boolean>
                    </property>
                    <property name="ordering">
                      <boolean>true</boolean>
                    </property>
                  </object>
                </array>
              </arguments>
            </new>
          </var>
          <var name="view">
            <new>
              <dotaccessor>
                <object>
                  <dotaccessor>
                    <object>
                      <dotaccessor>
                        <object>
                          <identifier name="Verba"/>
                        </object>
                        <property>
                          <identifier name="Compare"/>
                        </property>
                      </dotaccessor>
                    </object>
                    <property>
                      <identifier name="Views"/>
                    </property>
                  </dotaccessor>
                </object>
                <property>
                  <identifier name="CourseSelector"/>
                </property>
              </dotaccessor>
              <arguments>
                <object>
                  <property name="el">
                    <string>body</string>
                  </property>
                  <property name="terms">
                    <identifier name="terms"/>
                  </property>
                </object>
              </arguments>
            </new>
          </var>
        </body>
      </funcexpr>
    </arguments>
  </functioncall>
</program>

You can use your XPath skills to point to the JavaScript array (you want the 1st argument of the "dot" accessor for the new contruct assigned to var terms):

>>> jstree.xpath('//var[@name="terms"]')
[<Element var at 0x7fc7c565e638>]
>>> jstree.xpath('//var[@name="terms"]/new/arguments/*')
[<Element array at 0x7fc7c565e5a8>]
>>> jstree.xpath('//var[@name="terms"]/new/arguments/*')[0]
<Element array at 0x7fc7c565e5a8>

Finally, now that you have the <array> element, you can pass it to js2xml.jsonlike.make_dict() to get a nice Python object to work with (make_dict is kinda misnamed):

>>> js2xml.jsonlike.make_dict(jstree.xpath('//var[@name="terms"]/new/arguments/*')[0])
[{'ordering': True, 'inquiry': True, 'id': '6436', 'name': 'SUMMER 16'}, {'ordering': True, 'inquiry': True, 'id': '6517', 'name': 'FALL 16'}]
>>>

Note: you can also use the shortcut js2xml.jsonlike.getall() to fetch everything that looks like a Python dict or list (you get 2 lists, you're interested in the 1st one):

>>> js2xml.jsonlike.getall(jstree)
[[{'ordering': True, 'inquiry': True, 'id': '6436', 'name': 'SUMMER 16'}, {'ordering': True, 'inquiry': True, 'id': '6517', 'name': 'FALL 16'}], {'el': 'body', 'terms': 'terms'}]

回答2:

I would extract it using a regex, something like:

response.xpath('/html/body/script[2]').re_first('\((\[.*\])\)')

回答3:

chompjs provides an API to parse JavaScript objects into a dict.

For example, if the JavaScript code contains var data = {field: "value", secondField: "second value"}; you can extract that data as follows:

import chompjs
javascript = response.css('script::text').get()
data = chompjs.parse_js_object(javascript)

The final result is {'field': 'value', 'secondField': 'second value'} a

回答4:

You can't go "deeper" because that element's contents are just text. It's not too hard to read out the JSON from the JavaScript:

line = javascript.strip().splitlines()[1]
the_json = line.split('(', 1)[1].split(')', 1)[0]

来源：https://stackoverflow.com/questions/38470261/scrapy-extract-json-from-within-html-script

标签

python

html

json

xpath

scrapy