I'm extracting JS data using response.xpath('//*').re_first() and later converting it to Python native data. The problem is extract/re methods don't seem to provide a way
Short answer:
The .re() and .re_first() methods replace HTML entities (except &lt; and &amp;). Instead, use .extract() or .extract_first() to get the raw HTML (or raw JavaScript instructions), then use Python's re module on the extracted string.
Long answer:
Let's look at an example input and various ways of extracting JavaScript data from HTML.
Sample HTML:
Using Scrapy's Selector, which uses the parsel library underneath, you have several ways of extracting the JavaScript snippet:
>>> import scrapy
>>> t = """
...
...
...
...
...
...
...
... """
>>> selector = scrapy.Selector(text=t, type="html")
>>>
>>> # extracting the '
>>>
>>> # only getting the text node inside the '
>>>
The HTML entity for the apostrophe (for example &#39;) has been replaced by a literal apostrophe. This is due to a w3lib.html.replace_entities() call in the .re()/.re_first() implementation (see the parsel source code, in the extract_regex function). That call is not made when you simply use extract() or extract_first().
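The workaround from the short answer can be sketched with the standard library alone. In this minimal example, raw_script is a hypothetical stand-in for the string that .extract_first() would return (it is not the original sample HTML), and html.unescape() is used only to imitate the kind of entity replacement that .re()/.re_first() perform, not as the actual w3lib function.

```python
import html
import json
import re

# Hypothetical stand-in for the output of .extract_first() on a <script>
# element. This sample is an assumption, not the original HTML.
raw_script = '<script>var data = {"fields": ["O&#39;Connor Park"]};</script>'

# Apply the regex yourself on the raw string: entities are left untouched.
match = re.search(r"var data = (\{.*?\});", raw_script)
raw_json = match.group(1)

# Convert to native Python data; the entity is still present in the value.
data = json.loads(raw_json)

# For comparison only: html.unescape() imitates the entity replacement
# that .re()/.re_first() would have applied to the matched text.
unescaped_json = html.unescape(raw_json)

print(raw_json)
print(unescaped_json)
print(data["fields"][0])
```

Doing the regex step yourself keeps you in control of when (or whether) entities are decoded. As far as I know, recent parsel releases also accept a replace_entities=False argument on .re() and .re_first(), so check the version you have installed before falling back to this approach.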