How to extract raw html from a Scrapy selector?

孤城傲影 2021-01-12 15:35

I'm extracting JS data using response.xpath('//*').re_first() and later converting it to Python native data. The problem is that the extract/re methods don't seem to provide a way

3 answers
  •  死守一世寂寞
    2021-01-12 15:53

    Short answer:

    • Scrapy/Parsel selectors' .re() and .re_first() methods replace HTML entities (except &lt; and &amp;)
    • instead, use .extract() or .extract_first() to get the raw HTML (or raw JavaScript instructions), then apply Python's re module to the extracted string
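As a minimal sketch of the second point, the snippet below applies Python's re module to a string standing in for what .extract_first() might return. The script content, the variable name i, and the names raw_script and pattern are illustrative assumptions, not taken from a real page.

```python
import html
import re

# Assumed stand-in for the string returned by .extract_first() on a
# <script> text node; the content is hypothetical.
raw_script = "\n    var i = {a: ['O&#39;Connor Park']}\n"

# Python's re module works on the string as-is, so the entity survives.
pattern = r"var i = (\{.*\})"
js_object = re.search(pattern, raw_script).group(1)
print(js_object)  # {a: ['O&#39;Connor Park']}

# Decode entities only after the value of interest has been isolated.
name = html.unescape(re.search(r"\['(.*?)'\]", js_object).group(1))
print(name)  # O'Connor Park
```

Deferring the decoding step keeps the quoting of the JavaScript literal intact until the value has been pulled out of it.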

    Long answer:

    Let's look at an example input and various ways of extracting JavaScript data from HTML.

    Sample HTML:

    
    
    

    Using a Scrapy Selector, which uses the parsel library underneath, you have several ways of extracting the JavaScript snippet:

    >>> import scrapy
    >>> t = """
    ... 
    ... 
    ... ... ...
    ... ... ... """
    >>> selector = scrapy.Selector(text=t, type="html")
    >>>
    >>> # extracting the '
    >>>
    >>> # only getting the text node inside the '
    >>>

    With .re() or .re_first(), the HTML entity encoding the apostrophe (for example &#39;) is replaced by a literal apostrophe. This is due to a w3lib.html.replace_entities() call in the .re/re_first implementation (see the extract_regex function in the parsel source code). That call is not made when simply calling extract() or extract_first().
