scrapy

Fixing twisted.internet.error.DNSLookupError: DNS lookup failed: address "'http:" not found: [Errno 11001] getaddrinfo failed

烈酒焚心 submitted on 2020-01-16 18:48:54
C:\Users\wuzhi_000\Desktop\tutorial>scrapy shell 'http://quotes.toscrape.com'
2016-11-02 14:59:11 [scrapy] INFO: Scrapy 1.2.1 started (bot: tutorial)
2016-11-02 14:59:11 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial', 'LOGSTATS_INTERVAL': 0}
2016-11-02 14:59:11 [scrapy] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2016-11-02 14:59:12 [scrapy] INFO: Enabled
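On Windows, cmd.exe does not strip single quotes, so the literal string 'http://quotes.toscrape.com' (quotes included) reaches Scrapy, and the leading 'http: fragment is what the resolver tries and fails to look up, hence the getaddrinfo error. The usual fix is to quote the URL with double quotes (or no quotes at all) on Windows; a minimal example, assuming the same project directory as above:

C:\Users\wuzhi_000\Desktop\tutorial>scrapy shell "http://quotes.toscrape.com"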

Using multiple start_urls in CrawlSpider

孤者浪人 submitted on 2020-01-16 18:38:07
Question: I'm using CrawlSpider to crawl a website. I have multiple start URLs, and on each URL there is a "next" link pointing to another similar page. I use rules to deal with the next page:

rules = (
    Rule(SgmlLinkExtractor(allow=('/',), restrict_xpaths=('//span[@class="next"]')),
         callback='parse_item', follow=True),
)

When there is only one URL in start_urls, everything is OK. However, when there are many URLs in start_urls, I get "Ignoring response <404 a url>: HTTP status code is not handled or
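CrawlSpider fully supports multiple start_urls; the "Ignoring response <404 ...>" message means that some of the extra start URLs (or pages the rule follows from them) return 404, and Scrapy silently drops non-200 responses by default. A minimal sketch of the pattern under those assumptions, with hypothetical URLs, and LinkExtractor instead of SgmlLinkExtractor because the latter was removed in later Scrapy releases:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MultiStartSpider(CrawlSpider):
    name = "multi_start"
    # Hypothetical stand-ins for the asker's real start pages.
    start_urls = [
        "http://example.com/category/1",
        "http://example.com/category/2",
    ]
    # Let 404 responses reach the callback instead of being silently ignored.
    handle_httpstatus_list = [404]

    rules = (
        Rule(LinkExtractor(allow=("/",), restrict_xpaths=('//span[@class="next"]',)),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        if response.status == 404:
            self.logger.warning("Got 404 for %s", response.url)
            return
        yield {"url": response.url, "title": response.css("title::text").get()}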

Scrapy add_xpath or join XPath

那年仲夏 submitted on 2020-01-16 12:37:47
Question: I hope everyone is doing well. I have this code (part of it) for a spider. This is the last part of the scraping, where it starts to scrape and then writes to the CSV file. My doubt is: is it possible to join or add XPath results to what gets printed in the file? For example:

<h5>Soundbooster</h5>
<br><br>
<p class="details">
  <b>Filtro attuale</b>
</p>
<blockquote>
  <p>
    <b>Catalogo:</b> Aliant</br>
    <b>Marca e Modello:</b> Mazda - 3 </br>
    <b>Versione:</b> (3th gen) 2013-now (Petrol)
  </p>
  <
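One common way to get several XPath results into a single CSV column is to extract all the text fragments and join them in the callback before yielding the item; the feed exporter then writes the joined string as one cell. A minimal sketch under assumed field names and selectors, not the asker's real spider:

import scrapy

class DetailsSpider(scrapy.Spider):
    name = "details"
    start_urls = ["http://example.com/product"]  # hypothetical page

    def parse(self, response):
        # Collect every text fragment inside the details block, then flatten
        # them into a single string so the CSV exporter writes one column.
        fragments = response.xpath('//blockquote//p//text()').getall()
        yield {
            "title": response.xpath('//h5/text()').get(default="").strip(),
            "details": " ".join(t.strip() for t in fragments if t.strip()),
        }

Running it with "scrapy crawl details -o output.csv" would then produce one row per page with the joined details in a single column.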

Scrapy: a weird bug where the code can't call the pipeline

主宰稳场 submitted on 2020-01-16 08:51:10
Question: I wrote a small spider, but when I run it, it can't call the pipeline. After debugging for a while, I found the buggy code area. The logic of the spider is: I crawl the first URL to fetch a cookie, then I crawl the second URL to download the captcha picture with that cookie, and I post some data I prepare to the third URL. If the text I get from the picture is wrong, I download the picture again and post to the third URL repeatedly, until I get the right text. Let me show you the code:

# -*- coding: gbk -*-
import scrapy
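Without the full spider it is hard to say which line breaks the flow, but the two most common causes are that the callback never actually yields an item (pipelines only ever see objects that a callback yields or returns) and that the pipeline is not enabled via ITEM_PIPELINES. A minimal sketch of the general cookie/captcha/retry pattern, with hypothetical URLs and a placeholder OCR helper, not the asker's code:

import scrapy

class CaptchaSpider(scrapy.Spider):
    name = "captcha_login"
    start_urls = ["http://example.com/login"]  # hypothetical first URL (sets the cookie)

    def parse(self, response):
        # Cookies from the first response are reused automatically by the
        # cookie middleware when the captcha image is requested.
        yield scrapy.Request("http://example.com/captcha.jpg",
                             callback=self.parse_captcha)

    def parse_captcha(self, response):
        code = self.solve_captcha(response.body)  # placeholder for the OCR step
        yield scrapy.FormRequest("http://example.com/check",
                                 formdata={"captcha": code},
                                 callback=self.parse_result)

    def parse_result(self, response):
        if b"wrong" in response.body:
            # Wrong text: fetch a fresh captcha; dont_filter bypasses the dupe filter.
            yield scrapy.Request("http://example.com/captcha.jpg",
                                 callback=self.parse_captcha, dont_filter=True)
        else:
            # Only items yielded here ever reach the pipeline, and only if
            # ITEM_PIPELINES is set in settings.py.
            yield {"status": "ok", "url": response.url}

    def solve_captcha(self, image_bytes):
        raise NotImplementedError  # stand-in for the asker's image-recognition step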

StaleElementReferenceException

早过忘川 submitted on 2020-01-16 06:08:32
Question: I've read about the StaleElementReferenceException in the official documentation, but I still don't understand why my code is raising this exception. Does browser.get() instantiate a new spider?

class IndiegogoSpider(CrawlSpider):
    name = 'indiegogo'
    allowed_domains = ['indiegogo.com']
    start_urls = ['https://www.indiegogo.com/explore/all?project_type=all&project_timing=all&sort=trending']

    def parse(self, response):
        if (response.status != 404):
            options = Options()
            options.add_argument('
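browser.get() does not create a new spider; it navigates the same Selenium browser to a new page, and any WebElement references obtained before that navigation (or before a JavaScript re-render of the page) become stale. The usual fix is to read or re-locate elements after every page load instead of reusing old references. A minimal sketch under assumed CSS selectors and page parameters (Selenium 4 API), not the asker's exact spider:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
browser = webdriver.Chrome(options=options)

browser.get("https://www.indiegogo.com/explore/all?project_type=all&project_timing=all&sort=trending")
cards = browser.find_elements(By.CSS_SELECTOR, "div.discoverableCard")  # assumed selector
titles = [c.text for c in cards]  # read the data while the references are still valid

# Any navigation (or a re-render triggered by the page's JavaScript)
# invalidates the old references, so they must be looked up again, not reused.
browser.get("https://www.indiegogo.com/explore/all?sort=trending&page=2")  # hypothetical next page
cards = browser.find_elements(By.CSS_SELECTOR, "div.discoverableCard")
titles += [c.text for c in cards]

browser.quit()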

Fixing scrapy-redis spiders idling: after the links in Redis are exhausted, the program keeps listening to the queue and never shuts down

时光总嘲笑我的痴心妄想 submitted on 2020-01-16 00:57:17
When I normally run a scrapy-redis master/slave crawl, there is something to fetch every day, so this problem never came up. But now I have a project where the Redis queue is generated up front and has a fixed size, so when running the scrapy slave I need to check whether the queue has already been crawled completely. After some searching on Baidu I learned that I need to override the spider_idle method and put my own rule there to decide whether to stop crawling. Thanks! http://www.mamicode.com/info-detail-2225397.html On that basis I wrote my own rule: I don't need to wait as long as that author does; as soon as Redis is empty, I stop the spider immediately. So my rule looks like this:

import redis
from scrapy import signals

class RedisSpiderClosedExensions(object):
    def __init__(self, crawler, host, port, pwd, db):
        self.crawler = crawler
        # connect to redis
        self.r = redis.Redis(host=host, port=port, db=db, password=pwd, decode_responses=True)

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls
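The snippet above is cut off after "ext = cls", so here is a fuller sketch of the same idea for reference: a Scrapy extension that connects to the spider_idle signal and closes the spider as soon as the Redis start-URLs key is empty. The setting names, the "<spider name>:start_urls" key, and the list type are assumptions based on scrapy-redis defaults, not the original author's exact code:

import redis
from scrapy import signals
from scrapy.exceptions import NotConfigured

class RedisSpiderClosedExtension(object):
    def __init__(self, crawler, host, port, pwd, db):
        self.crawler = crawler
        # Connect to Redis so the extension can inspect the request queue.
        self.r = redis.Redis(host=host, port=port, db=db,
                             password=pwd, decode_responses=True)

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        if not settings.getbool("REDIS_CLOSE_EXT_ENABLED", True):  # assumed setting name
            raise NotConfigured
        ext = cls(crawler,
                  settings.get("REDIS_HOST", "localhost"),
                  settings.getint("REDIS_PORT", 6379),
                  settings.get("REDIS_PWD"),
                  settings.getint("REDIS_DB", 0))
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def spider_idle(self, spider):
        # scrapy-redis keeps pending start URLs under "<spider name>:start_urls"
        # (a list unless REDIS_START_URLS_AS_SET is enabled).
        if self.r.llen("%s:start_urls" % spider.name) == 0:
            spider.logger.info("Redis queue is empty, closing the spider.")
            self.crawler.engine.close_spider(spider, "redis queue exhausted")

The extension is then enabled by adding its import path to the EXTENSIONS setting in settings.py.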

How to extract a list of label/value pairs with Scrapy when HTML tags are missing

…衆ロ難τιáo~ submitted on 2020-01-15 20:18:09
Question: I am currently processing a document with

<b> label1 </b> value1 <br>
<b> label2 </b> value2 <br>
....

I can't figure out a clean XPath approach with Scrapy. Here is my best implementation:

hxs = HtmlXPathSelector(response)
section = hxs.select(..............)
values = section.select("text()[preceding-sibling::b/text()]")
labels = section.select("text()/preceding-sibling::b/text()")

but I am not comfortable with this approach of matching the nodes of the two lists by index. I'd rather
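A common way to avoid pairing two separate lists by index is to iterate over the <b> elements and, for each one, take the first following text sibling as its value, so every label/value pair is built from a single anchor node. A minimal sketch using the modern Selector API (HtmlXPathSelector has long been deprecated); the container XPath and the URL are assumptions:

import scrapy

class LabelValueSpider(scrapy.Spider):
    name = "label_value"
    start_urls = ["http://example.com/page"]  # hypothetical page

    def parse(self, response):
        # Assumed container holding the "<b>label</b> value <br>" sequence.
        for b in response.xpath('//div[@id="content"]/b'):
            label = b.xpath('normalize-space(text())').get()
            # The first non-empty text node after this <b> is its value.
            value = b.xpath('normalize-space(following-sibling::text()[normalize-space()][1])').get()
            yield {"label": label, "value": value}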
