XPath not working for screen scraping

走远了吗. Submitted on 2019-12-10 23:36:37

Question


I am using Scrapy for a screen scraping project and am having problems with an XPath.

I am trying to get the number 94,218 shown on the page, but the XPath expressions and CSS selectors I have used are not working.

It's from this page: https://fancy.com/things/280558613/I%27m-Fine-T-Shirt

I have tried multiple XPath expressions and CSS selectors with Scrapy, but everything returns blank.

Here are some examples:

response.xpath('/html/body/div[1]/div[1]/div[1]/aside/div[1]/div/div/a[2]/text()').extract()

response.xpath('//*[@id="sidebar"]/div[1]/div/div/a[2]/text()').extract()

response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "fancyd_list", " " ))]').extract()

response.xpath(".//*[@id='sidebar']/div[1]/div/div/a[2]/text()")

I've tried Firebug, FirePath, Chrome DevTools and various plugins, but none of the XPath expressions or CSS selectors they produce seem to work. Can someone assist?

The code on the actual page is:

<a href="#" class="fancyd_list "/>
    6
</a>

Some of the XPaths do match, but the matched elements contain no text, so the result looks like this: <a href="#" class="fancyd_list "/></a>

I've also tried using BeautifulSoup, but it has the same problem:

print(soup.find_all('a', class_='fancyd_list'))
[<a class="fancyd_list " href="#"></a>, <a class="fancyd_list " href="#"></a>]

Thanks!


Answer 1:


The problem here is that the page at the provided URL returns HTML with a malformed <a> tag:

<a href="#" class="fancyd_list "/>  # Malformed HTML, <a> tag closes here
    94,218
</a>

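The first line here contains a / before the closing bracket, which in XML-style syntax marks the <a> tag as self-closing, and the parser behind Scrapy honors that. As far as Scrapy is concerned, the <a> element is already complete at that point, so the number that follows lies outside the element and text() on it returns nothing.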
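To see the effect of the stray slash without running a spider, here is a minimal sketch that feeds the same snippet to Python's standard-library html.parser and records the events it emits. This is only an illustration: Scrapy uses a different, lxml-based parser, and the EventLogger class and SNIPPET name are hypothetical helpers, not part of Scrapy or of the original answer.

```python
from html.parser import HTMLParser

# The markup as served by the page (copied from the answer above).
SNIPPET = '<a href="#" class="fancyd_list "/>\n    94,218\n</a>'


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser events in the order they occur."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for tags written as <tag ... />, i.e. opened and closed at once.
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text; keep the trimmed text otherwise.
        if data.strip():
            self.events.append(("data", data.strip()))


parser = EventLogger()
parser.feed(SNIPPET)
parser.close()
events = parser.events
print(events)
```

The "startend" event for the <a> tag arrives before the "data" event carrying 94,218, so the number is seen as text following an empty element rather than as the element's content.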

The previous recommendation of using BeautifulSoup may be a good idea here, because it handles malformed HTML much better.

Another option for this example would be to fix the HTML yourself, with something similar to:

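import re
# response.body is bytes on Python 3, so the pattern and replacement are bytes literals
new_body = re.sub(br'<a href="#" class="fancyd_list "/>', b'<a href="#" class="fancyd_list ">', response.body)
response = response.replace(body=new_body)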
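The substitution can be tried on its own, outside a spider. The sketch below assumes a hypothetical raw_body byte string standing in for response.body; the surrounding div is made up for the example, and only the <a> tag text comes from the page.

```python
import re

# Hypothetical stand-in for response.body (bytes, as on Python 3).
raw_body = (
    b'<div class="figure-button">'
    b'<a href="#" class="fancyd_list "/>\n    94,218\n</a>'
    b'</div>'
)

# Match the self-closed opening tag literally; re.escape protects the
# characters that would otherwise have a special meaning in a regex.
broken_tag = re.escape(b'<a href="#" class="fancyd_list "/>')

# Replace it with a normal opening tag so the text ends up inside <a>...</a>.
fixed_body = re.sub(broken_tag, b'<a href="#" class="fancyd_list ">', raw_body)

print(fixed_body.decode("utf-8"))
```

Since the pattern is a fixed string, bytes.replace would do the same job; the regex form simply mirrors the approach above and is easier to loosen if the attribute order or spacing varies.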

You would then be able to select from the response via

response.xpath("//div[@class='frm']/div[@class='figure-button']/a[contains(@class, 'fancyd_list')]/text()").extract()

The reason I'm using "contains" is that the class name (for me) appears with a space at the end of its name, so Scrapy's check of "a[@class='fancyd_list']" will fail, because "fancyd_list" != "fancyd_list "
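The trailing-space issue can be reproduced with the standard library alone. This is a rough sketch using xml.etree.ElementTree on an already-fixed, well-formed copy of the markup (the FIXED string is an assumption made for the example). ElementTree's limited XPath has no contains(), so a plain Python substring test stands in for it here; in Scrapy you would keep the contains(@class, ...) form shown above.

```python
import xml.etree.ElementTree as ET

# Assumed, already-repaired markup; note the trailing space in the class value.
FIXED = (
    '<div class="frm"><div class="figure-button">'
    '<a href="#" class="fancyd_list ">94,218</a>'
    '</div></div>'
)
root = ET.fromstring(FIXED)

# Exact attribute comparison: "fancyd_list" != "fancyd_list ", so nothing matches.
exact = root.findall(".//a[@class='fancyd_list']")

# Substring test, standing in for XPath contains(@class, 'fancyd_list').
loose = [a for a in root.iter("a") if "fancyd_list" in a.get("class", "")]
texts = [a.text.strip() for a in loose]

print(len(exact), texts)
```

The exact comparison finds no elements, while the substring test finds the link and its text.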



Source: https://stackoverflow.com/questions/33110734/xpath-not-working-for-screen-scraping
