Scrapy: Extract commented (hidden) content

Anonymous (unverified), submitted 2018-07-18 20:10:49

Question:

How can I extract content from within commented-out tags with Scrapy?

For instance, how can I extract "Yellow" from the following example?

<div class="fruit">
    <div class="infos">
        <h2 class="Name">Banana</h2>
        <span class="edible">Edible: Yes</span>
    </div>
    <!--
    <p class="color">Yellow</p>
    -->
</div>

Answer 1:

You can use an XPath expression like //comment() to select the comment nodes, strip the comment delimiters (<!-- and -->), and then parse what is left as HTML with a second selector.

Example scrapy shell session:

paul@wheezy:~$ scrapy shell 
...
In [1]: doc = """<div class="fruit">
   ...:     <div class="infos">
   ...:         <h2 class="Name">Banana</h2>
   ...:         <span class="edible">Edible: Yes</span>
   ...:     </div>
   ...:     <!--
   ...:     <p class="color">Yellow</p>
   ...:     -->
   ...: </div>"""

In [2]: from scrapy.selector import Selector

In [4]: selector = Selector(text=doc, type="html")

In [5]: import re

In [6]: regex = re.compile(r'<!--(.*)-->', re.DOTALL)

In [7]: selector.xpath('//comment()').re(regex)
Out[7]: [u'\n    <p class="color">Yellow</p>\n    ']

In [8]: comment = selector.xpath('//comment()').re(regex)[0]

In [9]: commentsel = Selector(text=comment, type="html")

In [10]: commentsel.css('p.color')
Out[10]: [<Selector xpath=u"descendant-or-self::p[@class and contains(concat(' ', normalize-space(@class), ' '), ' color ')]" data=u'<p class="color">Yellow</p>'>]

In [11]: commentsel.css('p.color').extract()
Out[11]: [u'<p class="color">Yellow</p>']

In [12]: commentsel.css('p.color::text').extract()
Out[12]: [u'Yellow']


Answer 2:

First, use the XPath below to get all the comments from the page.

data = response.xpath('//comment()').extract()

Next, keep only the comments you care about by checking for some identifying keyword ('key' below is a placeholder).

up_data = []
for d in data:
    if 'key' in d:
        up_data.append(d)

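Then strip the comment delimiters, wrap each comment body in a minimal HTML template, and parse it with a new Selector:

from scrapy.selector import Selector

html_template = '<html><body>%s</body></html>'
for up_d in up_data:
    # remove the comment delimiters and wrap the markup in a full document
    up_d = html_template % up_d.replace('<!--', '').replace('-->', '')
    sel = Selector(text=up_d)
    tables = sel.xpath('//div[@class="table_outer_container"]')
    # do whatever you need with the result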