Question
I am using Scrapy for a screen scraping project and am having problems with an XPath.
I am trying to get the 94,218 from the image below, but the XPaths and CSS selectors I have used are not working.
It's from this page: https://fancy.com/things/280558613/I%27m-Fine-T-Shirt
I have tried multiple XPaths and CSS with Scrapy but everything is returning blank.
Here are some examples:
response.xpath('/html/body/div[1]/div[1]/div[1]/aside/div[1]/div/div/a[2]/text()').extract()
response.xpath('//*[@id="sidebar"]/div[1]/div/div/a[2]/text()').extract()
response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "fancyd_list", " " ))]').extract()
response.xpath(".//*[@id='sidebar']/div[1]/div/div/a[2]/text()")
I've tried Firebug, Firepath, Chrome Dev Tools and different plugins, but none of the XPaths or CSS selectors seem to work. Can someone assist?
The code on the actual page is:
<a href="#" class="fancyd_list "/>
6
</a>
Some of the XPaths work, but they contain no text, so it looks like this: <a href="#" class="fancyd_list "/></a>
I've also tried using BeautifulSoup, but it has the same problem:
print soup.find_all('a',class_='fancyd_list')
[<a class="fancyd_list " href="#"></a>, <a class="fancyd_list " href="#"></a>]
Thanks!
Answer 1:
The problem here is that the provided URL is returning HTML with a malformed <a> tag in the following:
<a href="#" class="fancyd_list "/> # Malformed HTML, <a> tag closes here
94,218
</a>
The first line here contains a / just before the closing bracket. That is the XML-style self-closing syntax, so the parser treats the <a> tag as complete at that point. Since to Scrapy the <a> element is already finished, the number that follows is no longer inside it, and text() on that element returns nothing.
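The effect can be reproduced without Scrapy. The sketch below uses Python's standard-library html.parser as a stand-in (an assumption made for illustration: Scrapy itself parses through lxml, not this module), because it also treats "/>" as opening and closing the element in one step. AnchorTextCollector, collect, MALFORMED and WELL_FORMED are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser

# Hard-coded samples modelled on the snippet above.
MALFORMED = '<div><a href="#" class="fancyd_list "/>\n94,218\n</a></div>'
WELL_FORMED = '<div><a href="#" class="fancyd_list ">\n94,218\n</a></div>'


class AnchorTextCollector(HTMLParser):
    """Record each text node and whether an <a> was open when it arrived."""

    def __init__(self):
        super().__init__()
        self.open_anchors = 0
        self.seen = []  # list of (text, inside_anchor) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.open_anchors += 1

    def handle_endtag(self, tag):
        # A stray </a> with no open anchor is simply ignored.
        if tag == "a" and self.open_anchors:
            self.open_anchors -= 1

    def handle_startendtag(self, tag, attrs):
        # Called for "<a ... />": the element opens and closes at once,
        # so the open-anchor counter is deliberately left unchanged.
        pass

    def handle_data(self, data):
        if data.strip():
            self.seen.append((data.strip(), self.open_anchors > 0))


def collect(html):
    """Parse `html` and return the recorded (text, inside_anchor) tuples."""
    parser = AnchorTextCollector()
    parser.feed(html)
    parser.close()
    return parser.seen


malformed_result = collect(MALFORMED)
well_formed_result = collect(WELL_FORMED)
print(malformed_result)
print(well_formed_result)
```

With the malformed sample the number is reported as lying outside any <a>, which is why a text() query on the anchor comes back empty. With the well-formed sample it is reported as inside.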
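Using BeautifulSoup, as previously suggested, may still be worth trying, because it can be more lenient with malformed HTML. The result depends on which underlying parser it is given, though, and the output in the question shows the same empty <a> tags with the parser used there.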
Another option for this example is to fix the HTML yourself, with something like:
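import re
# response.body is bytes in Python 3, so the pattern and the replacement are bytes literals
new_body = re.sub(br'<a href="#" class="fancyd_list "/>', b'<a href="#" class="fancyd_list ">', response.body)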
response = response.replace(body=new_body)
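If the attributes on the page ever change order or spacing, the exact-match substitution above would stop matching. As a variation (not what the answer itself does), a looser pattern can be tried on a hard-coded sample first. In this sketch sample_body is a hypothetical stand-in for response.body, and SELF_CLOSED_ANCHOR is an illustrative pattern that assumes no attribute value contains a ">" character.

```python
import re

# Hypothetical stand-in for response.body (bytes in Python 3).
sample_body = (
    b'<div class="figure-button">'
    b'<a href="#" class="fancyd_list "/>\n94,218\n</a>'
    b'<a href="/other">untouched</a>'
    b'</div>'
)

# Match any <a ...> start tag that ends in "/>" and capture everything
# before the slash, so the tag can be re-emitted as a normal opening tag.
SELF_CLOSED_ANCHOR = re.compile(br'(<a\b[^>]*?)\s*/>')

# "\1>" puts the captured tag text back, followed by a plain ">".
fixed_body = SELF_CLOSED_ANCHOR.sub(br'\1>', sample_body)

print(fixed_body.decode("utf-8"))
```

The repaired bytes can then be handed to response.replace(body=...) as in the lines above. Anchors that were already well formed are left as they are, since the pattern only matches a tag that ends in "/>".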
You would then be able to select from the response via
response.xpath("//div[@class='frm']/div[@class='figure-button']/a[contains(@class, 'fancyd_list')]/text()").extract()
The reason I'm using "contains" is that the class name (for me) appears with a space at the end of its name, so an exact check such as "a[@class='fancyd_list']" will fail, because "fancyd_list" != "fancyd_list "
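The trailing-space mismatch is easy to see on the repaired markup. This is a minimal sketch using the standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors (an assumption for illustration; ElementTree supports only a subset of XPath and has no contains(), so the tolerant match is done in plain Python). has_class is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Repaired markup; note the trailing space inside the class attribute.
root = ET.fromstring('<div><a href="#" class="fancyd_list ">94,218</a></div>')

# Exact attribute comparison: "fancyd_list" is not equal to "fancyd_list ".
exact = root.findall(".//a[@class='fancyd_list']")


def has_class(element, name):
    """Return True if `name` is one of the whitespace-separated class tokens."""
    return name in element.get("class", "").split()


# Token-based comparison ignores leading and trailing whitespace.
by_token = [a for a in root.iter("a") if has_class(a, "fancyd_list")]

print(len(exact))
print([a.text for a in by_token])
```

Splitting the attribute into tokens is slightly stricter than contains(), which would also match a class such as "fancyd_list_old". For this page either approach finds the element.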
Source: https://stackoverflow.com/questions/33110734/xpath-not-working-for-screen-scraping