scrapy-spider

Pass argument to scrapy spider within a python script

大兔子大兔子 submitted on 2019-11-30 14:35:40

Question: I can run a crawl in a Python script with the following recipe from the wiki:

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy import log, signals
    from testspiders.spiders.followall import FollowAllSpider
    from scrapy.utils.project import get_project_settings

    spider = FollowAllSpider(domain='scrapinghub.com')
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()

As you can see, I can just pass the domain to …
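
Below is a minimal sketch of the same idea on a recent Scrapy version, where CrawlerProcess replaces the manual Crawler/reactor wiring and any extra keyword arguments passed to crawl() are forwarded to the spider's __init__; the spider name and domain are placeholders taken from the snippet above, not a confirmed solution:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    # Keyword arguments are forwarded to the spider's __init__,
    # just like -a domain=... on the command line.
    process.crawl('followall', domain='scrapinghub.com')
    process.start()  # blocks here until the crawl is finished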

How to prevent a twisted.internet.error.ConnectionLost error when using Scrapy?

孤人 submitted on 2019-11-30 08:48:29

I'm scraping some pages with Scrapy and get the following error: twisted.internet.error.ConnectionLost. My command line output:

    2015-05-04 18:40:32+0800 [cnproxy] INFO: Spider opened
    2015-05-04 18:40:32+0800 [cnproxy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-05-04 18:40:32+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2015-05-04 18:40:32+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (failed 1 times): [<twisted.python.failure.Failure …
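
Scrapy's RetryMiddleware already retries ConnectionLost failures by default, so a common first step is to tune the retry and timeout settings. A minimal sketch, with illustrative values that are assumptions rather than part of the question:

    # settings.py
    RETRY_ENABLED = True
    RETRY_TIMES = 5            # retry each failed request a few more times
    DOWNLOAD_TIMEOUT = 30      # seconds before a slow response is abandoned
    CONCURRENT_REQUESTS = 8    # fewer parallel connections can reduce resets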

How to send post data in start_urls of the scrapy spider

僤鯓⒐⒋嵵緔 submitted on 2019-11-30 03:30:51

Question: I want to crawl a website which only supports POST data, and I want to send the query parameters as POST data in every request. How can I achieve this?

Answer 1: POST requests can be made using Scrapy's Request or FormRequest classes. Also, consider using the start_requests() method instead of the start_urls property. Example:

    from scrapy.http import FormRequest

    class myspiderSpider(Spider):
        name = "myspider"
        allowed_domains = ["www.example.com"]

        def start_requests(self):
            return [ FormRequest("http://www.example …
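
The answer's snippet is cut off above, so here is a minimal self-contained sketch of the same pattern; the URL and form fields are placeholders, not taken from the original question:

    from scrapy import Spider
    from scrapy.http import FormRequest

    class MySpider(Spider):
        name = "myspider"
        allowed_domains = ["www.example.com"]

        def start_requests(self):
            # FormRequest builds an urlencoded POST body from `formdata`.
            yield FormRequest(
                "http://www.example.com/search",
                formdata={"query": "foo", "page": "1"},
                callback=self.parse,
            )

        def parse(self, response):
            self.logger.info("Got %s", response.url)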

Pass Scrapy Spider a list of URLs to crawl via .txt file

試著忘記壹切 submitted on 2019-11-29 23:24:24

I'm a little new to Python and very new to Scrapy. I've set up a spider to crawl and extract all the information I need. However, I need to pass a .txt file of URLs to the start_urls variable. For example:

    class LinkChecker(BaseSpider):
        name = 'linkchecker'
        start_urls = []  # Here I want the spider to start crawling from a list of URLs read from a text file passed via the command line.

I've done a little bit of research and keep coming up empty handed. I've seen this type of example (How to pass a user defined argument in scrapy spider), but I don't think that will work for passing a text file. Run your …
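
A minimal sketch of one way to do this, assuming the file is passed as a spider argument (scrapy crawl linkchecker -a url_file=urls.txt) and contains one URL per line; it uses scrapy.Spider with start_requests() rather than the older BaseSpider/start_urls shown above:

    from scrapy import Spider, Request

    class LinkChecker(Spider):
        name = "linkchecker"

        def __init__(self, url_file=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.url_file = url_file

        def start_requests(self):
            # Read the URL list lazily when the crawl starts.
            with open(self.url_file) as f:
                for line in f:
                    url = line.strip()
                    if url:
                        yield Request(url, callback=self.parse)

        def parse(self, response):
            self.logger.info("Checked %s (%s)", response.url, response.status)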

Run multiple spiders sequentially

萝らか妹 submitted on 2019-11-29 22:26:10

Question:

    class Myspider1:
        # do something...

    class Myspider2:
        # do something...

The above is the structure of my spider.py file. I am trying to run Myspider1 first and then run Myspider2 multiple times, depending on some conditions. How could I do that? Any tips?

    configure_logging()
    runner = CrawlerRunner()

    def crawl():
        yield runner.crawl(Myspider1, arg.....)
        yield runner.crawl(Myspider2, arg.....)

    crawl()
    reactor.run()

I am trying to use this approach but have no idea how to run it. Should I run …
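
A minimal, self-contained sketch of running spiders one after another with CrawlerRunner and inlineCallbacks; the two spider classes are trivial placeholders standing in for the question's Myspider1/Myspider2, and the loop count is an assumption:

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    class Myspider1(scrapy.Spider):
        name = "spider1"
        start_urls = ["https://example.com"]

        def parse(self, response):
            self.logger.info("spider1 finished %s", response.url)

    class Myspider2(scrapy.Spider):
        name = "spider2"
        start_urls = ["https://example.com"]

        def parse(self, response):
            self.logger.info("spider2 finished %s", response.url)

    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        # Each yield waits for that crawl to finish before the next one starts.
        yield runner.crawl(Myspider1)
        for _ in range(3):          # run the second spider several times
            yield runner.crawl(Myspider2)
        reactor.stop()

    crawl()
    reactor.run()  # the script blocks here until reactor.stop() is called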

Using arguments in scrapy pipeline on __init__

久未见 submitted on 2019-11-29 19:56:41

Question: I have a Scrapy pipelines.py and I want to get the given arguments. In my spider.py it works perfectly:

    class MySpider( CrawlSpider ):
        def __init__(self, host='', domain_id='', *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            print user_id
            ...

Now I need the "user_id" in my pipelines.py to create the SQLite database, like "domain-123.db". I have searched the whole web about my problem, but I can't find any solution. Can someone help me?

PS: Yes, I tried the super() function within my …
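
One common pattern is to read the argument off the spider object that Scrapy passes to every pipeline method. A minimal sketch, assuming the spider stores the argument on itself (e.g. self.domain_id = domain_id in its __init__); the attribute name and database path are illustrative:

    import sqlite3

    class MyPipeline:
        def open_spider(self, spider):
            # Spider arguments saved as attributes are visible here.
            db_name = "domain-%s.db" % spider.domain_id
            self.conn = sqlite3.connect(db_name)

        def process_item(self, item, spider):
            # ... write the item to self.conn here ...
            return item

        def close_spider(self, spider):
            self.conn.close()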

Scrapy upload file

痴心易碎 submitted on 2019-11-29 16:54:13

I am making a form request to a website using Scrapy. The form requires uploading a PDF file; how can we do that in Scrapy? I am trying something like this:

    FormRequest(url, callback=self.parseSearchResponse, method="POST",
                formdata={'filename': 'abc.xyz', 'file': 'path to file/abc.xyz'})

Answer (starrify): At this very moment Scrapy has no built-in support for uploading files. File uploading via forms in HTTP was specified in RFC 1867. According to the spec, an HTTP request with Content-Type: multipart/form-data is required (in your code it would be application/x-www-form-urlencoded). To achieve file uploading with …
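
Since FormRequest only produces urlencoded bodies, one workaround is to build the multipart/form-data body by hand and send it with a plain Request. A minimal sketch; the field name, file name, and upload URL are placeholders, and real forms may require extra fields or a different Content-Type for the file part:

    import uuid
    from scrapy import Request

    def multipart_request(url, field_name, filename, file_bytes, callback):
        boundary = uuid.uuid4().hex
        body = (
            ("--%s\r\n" % boundary).encode()
            + ('Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
               % (field_name, filename)).encode()
            + b"Content-Type: application/octet-stream\r\n\r\n"
            + file_bytes
            + ("\r\n--%s--\r\n" % boundary).encode()
        )
        headers = {"Content-Type": "multipart/form-data; boundary=" + boundary}
        return Request(url, method="POST", body=body,
                       headers=headers, callback=callback)

    # Usage inside a spider callback (paths are illustrative):
    #     with open("abc.pdf", "rb") as f:
    #         yield multipart_request("https://example.com/upload", "file",
    #                                 "abc.pdf", f.read(), self.parse_result)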

Scrapy crawl with next page

早过忘川 submitted on 2019-11-29 07:22:26

I have this code for the Scrapy framework:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.contrib.spiders import Rule
    from scrapy.linkextractors import LinkExtractor
    from lxml import html

    class Scrapy1Spider(scrapy.Spider):
        name = "scrapy1"
        allowed_domains = ["sfbay.craigslist.org"]
        start_urls = (
            'http://sfbay.craigslist.org/search/npo',
        )
        rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)),
                      callback="parse", follow=True),)

        def parse(self, response):
            site = html.fromstring(response.body_as_unicode())
            titles = site.xpath('//div[@class="content"]/p[@class="row" …
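
Note that rules and LinkExtractor only take effect on a CrawlSpider subclass, not on a plain scrapy.Spider. A minimal sketch of the usual alternative, following the "next" link manually inside parse(); the XPaths mirror the question's and may not match the current Craigslist markup:

    import scrapy

    class Scrapy1Spider(scrapy.Spider):
        name = "scrapy1"
        allowed_domains = ["sfbay.craigslist.org"]
        start_urls = ["http://sfbay.craigslist.org/search/npo"]

        def parse(self, response):
            # Extract one item per listing row on the current page.
            for row in response.xpath('//div[@class="content"]/p[@class="row"]'):
                yield {"title": row.xpath(".//a/text()").extract_first()}

            # Queue the next results page, if there is one.
            next_href = response.xpath('//a[@class="button next"]/@href').extract_first()
            if next_href:
                yield scrapy.Request(response.urljoin(next_href), callback=self.parse)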