scrapy

Fixing twisted.internet.error.DNSLookupError: DNS lookup failed: address "'http:" not found: [Errno 11001] getaddrinfo failed

烈酒焚心 submitted on 2020-01-16 18:48:54
C:\Users\wuzhi_000\Desktop\tutorial>scrapy shell 'http://quotes.toscrape.com'
2016-11-02 14:59:11 [scrapy] INFO: Scrapy 1.2.1 started (bot: tutorial)
2016-11-02 14:59:11 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial', 'LOGSTATS_INTERVAL': 0}
2016-11-02 14:59:11 [scrapy] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2016-11-02 14:59:12 [scrapy] INFO: Enabled
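On Windows, cmd.exe does not strip single quotes, so the literal string 'http://quotes.toscrape.com' (quotes included) reaches Scrapy, and the leading 'http: fragment is what the resolver tries and fails to look up, hence the getaddrinfo error. The usual fix is to quote the URL with double quotes (or no quotes at all) on Windows; a minimal example, assuming the same project directory as above:

C:\Users\wuzhi_000\Desktop\tutorial>scrapy shell "http://quotes.toscrape.com"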

Using multiple start_urls in CrawlSpider

孤者浪人 submitted on 2020-01-16 18:38:07
Question: I'm using CrawlSpider to crawl a website. I have multiple start URLs, and on each URL there is a "next" link pointing to another similar page. I use rules to deal with the next page:

rules = (
    Rule(SgmlLinkExtractor(allow=('/',), restrict_xpaths=('//span[@class="next"]')),
         callback='parse_item', follow=True),
)

When there is only one URL in start_urls, everything is OK. However, when there are many URLs in start_urls, I get "Ignoring response <404 a url>: HTTP status code is not handled or
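CrawlSpider fully supports multiple start_urls; the "Ignoring response <404 ...>" message means that some of the extra start URLs (or pages the rule follows from them) return 404, and Scrapy silently drops non-200 responses by default. A minimal sketch of the pattern under those assumptions, with hypothetical URLs, and LinkExtractor instead of SgmlLinkExtractor because the latter was removed in later Scrapy releases:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MultiStartSpider(CrawlSpider):
    name = "multi_start"
    # Hypothetical stand-ins for the asker's real start pages.
    start_urls = [
        "http://example.com/category/1",
        "http://example.com/category/2",
    ]
    # Let 404 responses reach the callback instead of being silently ignored.
    handle_httpstatus_list = [404]

    rules = (
        Rule(LinkExtractor(allow=("/",), restrict_xpaths=('//span[@class="next"]',)),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        if response.status == 404:
            self.logger.warning("Got 404 for %s", response.url)
            return
        yield {"url": response.url, "title": response.css("title::text").get()}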

Scrapy add_xpath or join XPath

那年仲夏 submitted on 2020-01-16 12:37:47
Question: I hope everyone is doing well. I have this code (part of it) for a spider. This is the last part of the scraping, where it starts to scrape and then writes to the CSV file. My doubt is: is it possible to join or add XPath results to what gets printed in the file? For example:

<h5>Soundbooster</h5>
<br><br>
<p class="details">
  <b>Filtro attuale</b>
</p>
<blockquote>
  <p>
    <b>Catalogo:</b> Aliant</br>
    <b>Marca e Modello:</b> Mazda - 3 </br>
    <b>Versione:</b> (3th gen) 2013-now (Petrol)
  </p>
  <
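One common way to get several XPath results into a single CSV column is to extract all the text fragments and join them in the callback before yielding the item; the feed exporter then writes the joined string as one cell. A minimal sketch under assumed field names and selectors, not the asker's real spider:

import scrapy

class DetailsSpider(scrapy.Spider):
    name = "details"
    start_urls = ["http://example.com/product"]  # hypothetical page

    def parse(self, response):
        # Collect every text fragment inside the details block, then flatten
        # them into a single string so the CSV exporter writes one column.
        fragments = response.xpath('//blockquote//p//text()').getall()
        yield {
            "title": response.xpath('//h5/text()').get(default="").strip(),
            "details": " ".join(t.strip() for t in fragments if t.strip()),
        }

Running it with "scrapy crawl details -o output.csv" would then produce one row per page with the joined details in a single column.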

Scrapy: a weird bug where the code can't call the pipeline

主宰稳场 submitted on 2020-01-16 08:51:10
Question: I wrote a small spider, but when I run it, it can't call the pipeline. After debugging for a while, I found the buggy code area. The logic of the spider is: I crawl the first URL to fetch a cookie, then I crawl the second URL to download the captcha picture with that cookie, and I post some data I prepare to the third URL. If the text I get from the picture is wrong, I download the picture again and post to the third URL repeatedly, until I get the right text. Let me show you the code:

# -*- coding: gbk -*-
import scrapy
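Without the full spider it is hard to say which line breaks the flow, but the two most common causes are that the callback never actually yields an item (pipelines only ever see objects that a callback yields or returns) and that the pipeline is not enabled via ITEM_PIPELINES. A minimal sketch of the general cookie/captcha/retry pattern, with hypothetical URLs and a placeholder OCR helper, not the asker's code:

import scrapy

class CaptchaSpider(scrapy.Spider):
    name = "captcha_login"
    start_urls = ["http://example.com/login"]  # hypothetical first URL (sets the cookie)

    def parse(self, response):
        # Cookies from the first response are reused automatically by the
        # cookie middleware when the captcha image is requested.
        yield scrapy.Request("http://example.com/captcha.jpg",
                             callback=self.parse_captcha)

    def parse_captcha(self, response):
        code = self.solve_captcha(response.body)  # placeholder for the OCR step
        yield scrapy.FormRequest("http://example.com/check",
                                 formdata={"captcha": code},
                                 callback=self.parse_result)

    def parse_result(self, response):
        if b"wrong" in response.body:
            # Wrong text: fetch a fresh captcha; dont_filter bypasses the dupe filter.
            yield scrapy.Request("http://example.com/captcha.jpg",
                                 callback=self.parse_captcha, dont_filter=True)
        else:
            # Only items yielded here ever reach the pipeline, and only if
            # ITEM_PIPELINES is set in settings.py.
            yield {"status": "ok", "url": response.url}

    def solve_captcha(self, image_bytes):
        raise NotImplementedError  # stand-in for the asker's image-recognition step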

StaleElementReferenceException

早过忘川 submitted on 2020-01-16 06:08:32
Question: I've read about the StaleElementReferenceException in the official documentation, but I still don't understand why my code is raising this exception. Does browser.get() instantiate a new spider?

class IndiegogoSpider(CrawlSpider):
    name = 'indiegogo'
    allowed_domains = ['indiegogo.com']
    start_urls = ['https://www.indiegogo.com/explore/all?project_type=all&project_timing=all&sort=trending']

    def parse(self, response):
        if (response.status != 404):
            options = Options()
            options.add_argument('
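browser.get() does not create a new spider; it navigates the same Selenium browser to a new page, and any WebElement references obtained before that navigation (or before a JavaScript re-render of the page) become stale. The usual fix is to read or re-locate elements after every page load instead of reusing old references. A minimal sketch under assumed CSS selectors and page parameters (Selenium 4 API), not the asker's exact spider:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
browser = webdriver.Chrome(options=options)

browser.get("https://www.indiegogo.com/explore/all?project_type=all&project_timing=all&sort=trending")
cards = browser.find_elements(By.CSS_SELECTOR, "div.discoverableCard")  # assumed selector
titles = [c.text for c in cards]  # read the data while the references are still valid

# Any navigation (or a re-render triggered by the page's JavaScript)
# invalidates the old references, so they must be looked up again, not reused.
browser.get("https://www.indiegogo.com/explore/all?sort=trending&page=2")  # hypothetical next page
cards = browser.find_elements(By.CSS_SELECTOR, "div.discoverableCard")
titles += [c.text for c in cards]

browser.quit()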

Fixing scrapy-redis spiders idling: after the links in Redis are exhausted, the program keeps listening to the queue and never shuts down

时光总嘲笑我的痴心妄想 submitted on 2020-01-16 00:57:17
When I normally run a scrapy-redis master/slave crawl, there is something to fetch every day, so this problem never came up. But now I have a project where the Redis queue is generated up front and has a fixed size, so when running the scrapy slave I need to check whether the queue has already been crawled completely. After some searching on Baidu I learned that I need to override the spider_idle method and put my own rule there to decide whether to stop crawling. Thanks! http://www.mamicode.com/info-detail-2225397.html On that basis I wrote my own rule: I don't need to wait as long as that author does; as soon as Redis is empty, I stop the spider immediately. So my rule looks like this:

import redis
from scrapy import signals

class RedisSpiderClosedExensions(object):
    def __init__(self, crawler, host, port, pwd, db):
        self.crawler = crawler
        # connect to redis
        self.r = redis.Redis(host=host, port=port, db=db, password=pwd, decode_responses=True)

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls
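The snippet above is cut off after "ext = cls", so here is a fuller sketch of the same idea for reference: a Scrapy extension that connects to the spider_idle signal and closes the spider as soon as the Redis start-URLs key is empty. The setting names, the "<spider name>:start_urls" key, and the list type are assumptions based on scrapy-redis defaults, not the original author's exact code:

import redis
from scrapy import signals
from scrapy.exceptions import NotConfigured

class RedisSpiderClosedExtension(object):
    def __init__(self, crawler, host, port, pwd, db):
        self.crawler = crawler
        # Connect to Redis so the extension can inspect the request queue.
        self.r = redis.Redis(host=host, port=port, db=db,
                             password=pwd, decode_responses=True)

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        if not settings.getbool("REDIS_CLOSE_EXT_ENABLED", True):  # assumed setting name
            raise NotConfigured
        ext = cls(crawler,
                  settings.get("REDIS_HOST", "localhost"),
                  settings.getint("REDIS_PORT", 6379),
                  settings.get("REDIS_PWD"),
                  settings.getint("REDIS_DB", 0))
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def spider_idle(self, spider):
        # scrapy-redis keeps pending start URLs under "<spider name>:start_urls"
        # (a list unless REDIS_START_URLS_AS_SET is enabled).
        if self.r.llen("%s:start_urls" % spider.name) == 0:
            spider.logger.info("Redis queue is empty, closing the spider.")
            self.crawler.engine.close_spider(spider, "redis queue exhausted")

The extension is then enabled by adding its import path to the EXTENSIONS setting in settings.py.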

How to extract a list of label/value pairs with Scrapy when HTML tags are missing

…衆ロ難τιáo~ submitted on 2020-01-15 20:18:09
Question: I am currently processing a document with

<b> label1 </b> value1 <br>
<b> label2 </b> value2 <br>
....

I can't figure out a clean XPath approach with Scrapy. Here is my best implementation:

hxs = HtmlXPathSelector(response)
section = hxs.select(..............)
values = section.select("text()[preceding-sibling::b/text()]")
labels = section.select("text()/preceding-sibling::b/text()")

but I am not comfortable with this approach of matching the nodes of the two lists by index. I'd rather
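A common way to avoid pairing two separate lists by index is to iterate over the <b> elements and, for each one, take the first following text sibling as its value, so every label/value pair is built from a single anchor node. A minimal sketch using the modern Selector API (HtmlXPathSelector has long been deprecated); the container XPath and the URL are assumptions:

import scrapy

class LabelValueSpider(scrapy.Spider):
    name = "label_value"
    start_urls = ["http://example.com/page"]  # hypothetical page

    def parse(self, response):
        # Assumed container holding the "<b>label</b> value <br>" sequence.
        for b in response.xpath('//div[@id="content"]/b'):
            label = b.xpath('normalize-space(text())').get()
            # The first non-empty text node after this <b> is its value.
            value = b.xpath('normalize-space(following-sibling::text()[normalize-space()][1])').get()
            yield {"label": label, "value": value}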
