Question
I want to crawl the page http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B with Scrapy, but there seems to be a problem: I don't get any data when crawling it.
Here is my spider code:
import scrapy
from scrapy.selector import Selector
from scrapy_Data.items import CharProt

class CPSpider(scrapy.Spider):
    name = "CharProt"
    allowed_domains = ["jcvi.org"]
    start_urls = ["http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*[@id="middle_content_template"]/table/tbody/tr')
        for site in sites:
            item = CharProt()
            item['protein_name'] = site.xpath('td[1]/a/text()').extract()
            item['pn_link'] = site.xpath('td[1]/a/@href').extract()
            item['organism'] = site.xpath('td[2]/a/text()').extract()
            item['organism_link'] = site.xpath('td[2]/a/@href').extract()
            item['status'] = site.xpath('td[3]/a/text()').extract()
            item['status_link'] = site.xpath('td[3]/a/@href').extract()
            item['references'] = site.xpath('td[4]/a').extract()
            item['source'] = "CharProt"
            # collection.update({"protein_name": item['protein_name']}, dict(item), upsert=True)
            yield item
Here is the log:
2016-05-28 17:25:06 [scrapy] INFO: Spider opened
2016-05-28 17:25:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-28 17:25:06 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-28 17:25:07 [scrapy] DEBUG: Crawled (200) <GET http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B> (referer: None)
<200 http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B>
2016-05-28 17:25:08 [scrapy] INFO: Closing spider (finished)
2016-05-28 17:25:08 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 337,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 26198,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 28, 9, 25, 8, 103577),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 5, 28, 9, 25, 6, 55848)}
When I run my other spiders, they all work fine. So can anybody tell me what's wrong with my code? Or is there something wrong with this webpage?
Answer 1:
You are crawling the page, but your XPath is wrong.

When you inspect an element in your browser, the <tbody> tag appears, but it is nowhere in the actual page source; therefore nothing matches your expression and nothing gets scraped!

sites = sel.xpath('//*[@id="middle_content_template"]/table/tr')

That should work.
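A quick way to verify this is scrapy shell, which fetches the raw HTML exactly as the spider sees it (a sketch of a shell session; the URL is the one from the question):

$ scrapy shell "http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B"
>>> response.xpath('//*[@id="middle_content_template"]/table/tbody/tr')  # empty: tbody exists only in the browser's DOM
>>> response.xpath('//*[@id="middle_content_template"]/table/tr')        # returns the actual row selectors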
Edit

As a side note, extract() returns a list rather than the single element you want, so you should use the extract_first() method or extract()[0], e.g.:

item['protein_name'] = site.xpath('td[1]/a/text()').extract_first()
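Putting both fixes together, the parse method from the question might look like this (a sketch using the same item fields; only the XPath and the extraction calls change):

    def parse(self, response):
        # no tbody in the path: the browser inserts it, the raw HTML does not contain it
        sites = response.xpath('//*[@id="middle_content_template"]/table/tr')
        for site in sites:
            item = CharProt()
            # extract_first() returns a single string (or None) instead of a list
            item['protein_name'] = site.xpath('td[1]/a/text()').extract_first()
            item['pn_link'] = site.xpath('td[1]/a/@href').extract_first()
            item['organism'] = site.xpath('td[2]/a/text()').extract_first()
            item['organism_link'] = site.xpath('td[2]/a/@href').extract_first()
            item['status'] = site.xpath('td[3]/a/text()').extract_first()
            item['status_link'] = site.xpath('td[3]/a/@href').extract_first()
            item['references'] = site.xpath('td[4]/a').extract()  # several links per row, so keep the list
            item['source'] = "CharProt"
            yield item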
Answer 2:
Your XPath is wrong:

- you don't need tbody to access table rows
- simply use table/tr to access table rows

A correct XPath would be:

sites = sel.xpath('//*[@id="middle_content_template"]//table//tr')

A better XPath would be:

sites = response.xpath('//table[@class="search_results"]/tr')
As you can see in the example above, you do not need to create a selector object with Selector(response) to select by XPath. In newer Scrapy releases, a selector attribute is already added to the response class, and you can use it as shown below:

response.selector.xpath(...)

or the short form:

response.xpath(...)
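For illustration, all three forms below select the same rows (a sketch; the search_results class name is the one used in the XPath above):

from scrapy.selector import Selector

# explicit selector object (older style)
rows = Selector(response).xpath('//table[@class="search_results"]/tr')

# via the selector attribute attached to the response
rows = response.selector.xpath('//table[@class="search_results"]/tr')

# short form, equivalent to the line above
rows = response.xpath('//table[@class="search_results"]/tr')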
Scrapy Selector
Source: https://stackoverflow.com/questions/37497383/scrapy-doesnt-crawl-the-page