Scrapy doesn't crawl the page


Question


I want to crawl the page http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B with Scrapy. But there seems to be a problem: I don't get any data when crawling it.

Here is my spider code:

import scrapy
from scrapy.selector import Selector
from scrapy_Data.items import CharProt


class CPSpider(scrapy.Spider):

    name = "CharProt"
    allowed_domains = ["jcvi.org"]
    start_urls = ["http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*[@id="middle_content_template"]/table/tbody/tr')

        for site in sites:
            item = CharProt()
            item['protein_name'] = site.xpath('td[1]/a/text()').extract()
            item['pn_link'] = site.xpath('td[1]/a/@href').extract()
            item['organism'] = site.xpath('td[2]/a/text()').extract()
            item['organism_link'] = site.xpath('td[2]/a/@href').extract()
            item['status'] = site.xpath('td[3]/a/text()').extract()
            item['status_link'] = site.xpath('td[3]/a/@href').extract()
            item['references'] = site.xpath('td[4]/a').extract()
            item['source'] = "CharProt"
            # collection.update({"protein_name": item['protein_name']}, dict(item), upsert=True)
            yield item

Here is the log:

2016-05-28 17:25:06 [scrapy] INFO: Spider opened
2016-05-28 17:25:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-28 17:25:06 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-28 17:25:07 [scrapy] DEBUG: Crawled (200) <GET http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B> (referer: None)
<200 http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B>
2016-05-28 17:25:08 [scrapy] INFO: Closing spider (finished)
2016-05-28 17:25:08 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 337,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 26198,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 5, 28, 9, 25, 8, 103577),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 5, 28, 9, 25, 6, 55848)}

When I run my other spiders, they all work fine. So can anybody tell me what's wrong with my code? Or is there something wrong with this webpage?


Answer 1:


You are crawling it, but your XPath is wrong.

When you inspect the element in your browser, a <tbody> tag appears, but it isn't anywhere in the actual page source; therefore nothing matches and nothing is scraped!

sites = sel.xpath('//*[@id="middle_content_template"]/table/tr')

That should work.

Edit:

As a side note, extract() returns a list rather than the single element you want, so you should use the extract_first() method or extract()[0], for example:

item['protein_name'] = site.xpath('td[1]/a/text()').extract_first()
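
Putting both fixes together, a minimal sketch of a corrected parse method (assuming the same CharProt item fields as in the question) might look like this:

    def parse(self, response):
        sel = Selector(response)
        # No <tbody> in the actual HTML source, so select the rows directly under the table
        sites = sel.xpath('//*[@id="middle_content_template"]/table/tr')

        for site in sites:
            item = CharProt()
            # extract_first() returns a single string (or None) instead of a list
            item['protein_name'] = site.xpath('td[1]/a/text()').extract_first()
            item['pn_link'] = site.xpath('td[1]/a/@href').extract_first()
            item['organism'] = site.xpath('td[2]/a/text()').extract_first()
            item['organism_link'] = site.xpath('td[2]/a/@href').extract_first()
            item['status'] = site.xpath('td[3]/a/text()').extract_first()
            item['status_link'] = site.xpath('td[3]/a/@href').extract_first()
            item['references'] = site.xpath('td[4]/a').extract()
            item['source'] = "CharProt"
            yield item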



Answer 2:


Your XPath is wrong:

  • you don't need tbody to access the table rows
  • simply use table/tr to access them

The correct XPath would be:

sites = sel.xpath('//*[@id="middle_content_template"]//table//tr')

A better XPath would be:

sites = response.xpath('//table[@class="search_results"]/tr')

As you can see in the example above, you do not need to create a selector object with Selector(response) in order to run XPath queries.

In newer Scrapy releases, a selector attribute is already available on the response object, so you can write either of the following:

response.selector.xpath(...) or the short form response.xpath(...)
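
For instance, a minimal sketch of the loop using response.xpath() directly (assuming the same CharProt item as above, with only a couple of fields shown for brevity):

    def parse(self, response):
        # No Selector(response) needed; the response exposes .xpath() itself
        for row in response.xpath('//table[@class="search_results"]/tr'):
            item = CharProt()
            item['protein_name'] = row.xpath('td[1]/a/text()').extract_first()
            item['pn_link'] = row.xpath('td[1]/a/@href').extract_first()
            item['source'] = "CharProt"
            yield item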

See the Scrapy Selector documentation.



Source: https://stackoverflow.com/questions/37497383/scrapy-doesnt-crawl-the-page
