Convert the XPath gotten from browser to usable XPath for Scrapy

前端 未结 2 1784
暖寄归人
暖寄归人 2020-12-20 04:45

This is a problem that I always have getting a specific XPath with my browser.

Assume that I want to extract all the images from some websites like Google Image Sear

2条回答
  •  不思量自难忘°
    2020-12-20 05:26

    Usually it is not safe and reliable to blindly follow browser's suggestion about how to locate an element.

    First of all, XPath expression that developer tools generate are usually absolute - starting from the the parent of all parents - html tag, which makes it being more dependant on the page structure (well, firebug can also make expressions based on id attributes).

    Also, the HTML code you see in the browser can be pretty much different from what Scrapy receives due to asynchronous nature of the website page load and javascript being dynamically executed in the browser. Scrapy is not a browser and "sees" only the initial HTML code of a page, before the "dynamic" part.

    Instead, inspect what Scrapy really has in the response: open up the Scrapy Shell, inspect the response and debug your XPath expressions and CSS selectors:

    $ scrapy shell https://google.com
    >>> response.xpath('//div[@id="myid"]')
    ...
    

    Here is what I've got for the google image search:

    $ scrapy shell "https://www.google.com/search?q=test&tbm=isch&qscrl=1"
    In [1]: response.xpath('//*[@id="ires"]//img/@src').extract()
    Out[1]: 
    [u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRO9ZkSuDqt0-CRhLrWhHAyeyt41Z5I8WhOhTkGCvjiHmRiTSvDBfHKYjx_',
     u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQpwyzbW_qsRenDw3d4wwpwwm8n99ukMtLCVaPiTJxyviyQVBQeRCglVaY',
     u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSrxtoY3-3QHwhjc5Ofx8090uDYI8VOUbi3gUrd9USxZ-Vb1D5pAbOzJLMS',
     u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTQO1A3dDJ07tIaFMHlXNOsOnpiY_srvHKJE1xOpsMZscjL3aKGxaGLOgru',
     u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ71ukeTGCPLuClWd6MetTtQ0-0mwzo3rn1ug0MUnbpXmKnwNuuBnSWXHU',
     u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRZmWrYR9A4W97jpjhtIbyUM5Lj3vRL0vgCKG_xfylc5wKFAk6UB8jiiKA',
     ...
     u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRj08jK8sBjX90Tu1RO4BfZkKe5A59U0g1TpMWPFZlNnA70SQ5i5DMJkvV0']
    

提交回复
热议问题