scrapy

Scrapy - Crawl Multiple Pages Per Item

Submitted by 喜欢而已 on 2020-01-01 19:01:09
Question: I am trying to crawl a few extra pages per item to grab some location information. At the end of the item, before returning it, I check whether we need to crawl extra pages to grab the information; essentially these pages contain some location details and are a simple GET request, e.g. http://site.com.au/MVC/Offer/GetLocationDetails/?locationId=3761&companyId=206 The above link returns either a select with more pages to crawl, or a dd/dt with the address details. Either way I need to extract this
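
A minimal sketch of one common pattern for this: don't yield the partially filled item yet; carry it along to the extra location requests via cb_kwargs (Scrapy 1.7+; older versions would use request.meta) and only emit it once the address details are merged in. The spider name, selectors, and query-string pattern below are assumptions for illustration, not taken from the question.

```python
import scrapy


class OfferSpider(scrapy.Spider):
    name = "offers"  # hypothetical spider name
    start_urls = ["http://site.com.au/MVC/Offer/"]  # hypothetical listing page

    def parse(self, response):
        # Partially filled item; do NOT yield it yet.
        item = {"title": response.css("h1::text").get()}
        detail_url = (
            "http://site.com.au/MVC/Offer/GetLocationDetails/"
            "?locationId=3761&companyId=206"
        )
        # Carry the item to the extra GET request instead of returning it.
        yield scrapy.Request(detail_url, callback=self.parse_location,
                             cb_kwargs={"item": item})

    def parse_location(self, response, item):
        location_ids = response.css("select option::attr(value)").getall()
        if location_ids:
            # The page returned a <select>: more location pages to crawl,
            # each still carrying a copy of the item.
            for location_id in location_ids:
                url = response.urljoin(f"?locationId={location_id}&companyId=206")
                yield scrapy.Request(url, callback=self.parse_location,
                                     cb_kwargs={"item": dict(item)})
        else:
            # The page returned the <dd>/<dt> address details: merge and emit.
            item["address"] = " ".join(response.css("dd::text").getall()).strip()
            yield item
```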

Scrapy - Crawl and Scrape a website

Submitted by 让人想犯罪 __ on 2020-01-01 18:54:15
Question: As part of learning to use Scrapy, I have tried to crawl Amazon, and there is a problem while scraping data. The output of my code is as follows: 2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155> {'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13', u'http://www.amazon.com/MELT-Method-Breakthrough-Self-Treatment-Eliminate
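
The log above shows every link on the results page collected into a single item's 'link' list. A sketch of the usual fix is to loop over one result container at a time and yield one item per container; the CSS selectors below are placeholders, since the real Amazon markup is not shown in the excerpt.

```python
import scrapy


class AmazonSpider(scrapy.Spider):
    name = "scanon"
    start_urls = ["http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2"]

    def parse(self, response):
        # Iterate over each search-result block so every yielded item holds
        # a single title/link pair instead of every link on the page.
        for result in response.css("div.s-result-item"):  # hypothetical selector
            yield {
                "title": result.css("h2 ::text").get(),
                "link": result.css("a::attr(href)").get(),
            }
```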

How to create a single executable file in Windows 10 with Scrapy and PyInstaller?

Submitted by 会有一股神秘感。 on 2020-01-01 18:20:50
Question: I have created a Scrapy spider and successfully converted it to a Windows executable using PyInstaller, with a dist folder. In order to do that, I had to make some slight changes in the Scrapy site-packages and add those packages to the Windows dist folder, and it works perfectly. How can I make this into a single exe together with the Scrapy packages mentioned above from the dist folder? I have already tried the --onefile option in PyInstaller, but it shows a Scrapy error. Answer 1: Very similar issue discussed here:
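
One way this is commonly approached (a sketch, not the accepted answer): instead of editing Scrapy's site-packages, tell PyInstaller to bundle Scrapy's dynamically imported submodules and data files (e.g. VERSION, mime.types) via a spec file and build in one-file mode. The entry-script name and spec file name are hypothetical, and the spec is only runnable through PyInstaller itself, which injects Analysis/PYZ/EXE when it processes the file.

```python
# spider.spec -- hypothetical spec; build with: pyinstaller --clean spider.spec
from PyInstaller.utils.hooks import collect_data_files, collect_submodules

hiddenimports = collect_submodules("scrapy")   # Scrapy imports many modules dynamically
datas = collect_data_files("scrapy")           # bundles VERSION, mime.types, etc.

a = Analysis(
    ["run_spider.py"],         # hypothetical entry script that starts CrawlerProcess
    hiddenimports=hiddenimports,
    datas=datas,
)
pyz = PYZ(a.pure)
exe = EXE(
    pyz, a.scripts, a.binaries, a.zipfiles, a.datas,
    name="spider",
    console=True,              # keep the console so Scrapy's log is visible
)
```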

How to use threading in Scrapy/Twisted, i.e. how to do async calls to blocking code in response callbacks?

Submitted by 冷暖自知 on 2020-01-01 15:39:52
Question: I need to run some multi-threaded/multiprocessing work in Scrapy (because I have a library which uses blocking calls), and after its completion put a Request back into the Scrapy engine. I need something like this: def blocking_call(self, html): # .... # do some work in a blocking call return Request(url) def parse(self, response): return self.blocking_call(response.body) How can I do that? I think I should use the Twisted reactor and a Deferred object. But a Scrapy parse callback must return only None or
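
A sketch of the direction the question is pointing at: run the blocking work in Twisted's thread pool with deferToThread and chain the follow-up Request as a callback on the resulting Deferred. Whether a Deferred returned from a spider callback is accepted directly depends on the Scrapy version, so treat this as an illustration of the Twisted side rather than a drop-in answer; the URLs and spider name are placeholders.

```python
import scrapy
from twisted.internet import threads


class BlockingDemoSpider(scrapy.Spider):
    name = "blocking_demo"                   # hypothetical spider
    start_urls = ["http://example.com"]

    def parse(self, response):
        # Push the blocking library call onto Twisted's thread pool so the
        # reactor keeps running; deferToThread returns a Deferred immediately.
        d = threads.deferToThread(self.blocking_call, response.body)
        d.addCallback(self.after_blocking)
        return d

    def blocking_call(self, html):
        # ... long, blocking work on the HTML here ...
        return "http://example.com/next"     # hypothetical follow-up URL

    def after_blocking(self, url):
        # The value returned here becomes the Deferred's final result.
        return scrapy.Request(url, callback=self.parse_next)

    def parse_next(self, response):
        self.logger.info("Fetched follow-up page %s", response.url)
```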

Rotating proxies in Selenium

Submitted by 大城市里の小女人 on 2020-01-01 15:23:48
Question: I use the Selenium webdriver for Firefox with Scrapy, and now I need to change proxies dynamically, but all I've found so far in the docs is that I can specify a proxy in the profile when I instantiate the webdriver itself. Does that mean I can't change the proxy dynamically? Is there any way to do that? Answer 1: Selenium does not provide this kind of functionality. It is not possible to dynamically change the browser preferences/desired capabilities once you've launched the browser. You have to close/quit the
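
A sketch following the answer's advice: build a fresh Firefox profile (and a fresh browser) per proxy, then quit and start again. The preference names are Firefox's standard manual-proxy keys; the proxy list and the Selenium 3-style firefox_profile argument are assumptions (Selenium 4 passes the profile through Options instead).

```python
from selenium import webdriver


def make_driver(proxy_host, proxy_port):
    # Preferences cannot be changed on a running browser, so each proxy
    # gets its own profile and its own Firefox instance.
    profile = webdriver.FirefoxProfile()
    profile.set_preference("network.proxy.type", 1)           # 1 = manual proxy config
    profile.set_preference("network.proxy.http", proxy_host)
    profile.set_preference("network.proxy.http_port", proxy_port)
    profile.set_preference("network.proxy.ssl", proxy_host)
    profile.set_preference("network.proxy.ssl_port", proxy_port)
    return webdriver.Firefox(firefox_profile=profile)


proxies = [("10.0.0.1", 8080), ("10.0.0.2", 8080)]  # hypothetical proxy list
for host, port in proxies:
    driver = make_driver(host, port)
    driver.get("http://example.com")
    # ... scrape with this proxy ...
    driver.quit()  # close/quit before switching to the next proxy
```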

Scrape using multiple POST data from the same URL

Submitted by 若如初见. on 2020-01-01 14:34:12
Question: I have already created one spider that collects a list of company names with matching phone numbers, which is then saved to a CSV file. I now want to scrape data from another site, using the phone numbers in the CSV file as POST data. I want it to loop through the same start URL, scraping the data that each phone number produces, until there are no more numbers left in the CSV file. This is what I have got so far: from scrapy.spider import BaseSpider from scrapy.http import
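
A minimal sketch of the shape this usually takes with current Scrapy (the question's excerpt uses the older BaseSpider imports): read the phone numbers in start_requests and yield one FormRequest per number against the same URL, with dont_filter so the duplicate filter doesn't drop them. The file name, target URL, form-field name, and selector are placeholders, since the question's code is cut off.

```python
import csv

import scrapy


class PhoneLookupSpider(scrapy.Spider):
    name = "phone_lookup"                             # hypothetical name
    lookup_url = "http://example.com/search"          # hypothetical target URL

    def start_requests(self):
        # One POST per phone number from the CSV written by the first spider.
        with open("companies.csv", newline="") as f:  # hypothetical CSV file
            for row in csv.DictReader(f):
                yield scrapy.FormRequest(
                    self.lookup_url,
                    formdata={"phone": row["phone"]},  # hypothetical field name
                    callback=self.parse_result,
                    cb_kwargs={"phone": row["phone"]},
                    dont_filter=True,                  # same URL, different POST body
                )

    def parse_result(self, response, phone):
        yield {"phone": phone, "company": response.css("h1::text").get()}
```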

How to control the order of yield in Scrapy

Submitted by 牧云@^-^@ on 2020-01-01 12:01:37
Question: Help! I am reading the following Scrapy code and the crawler's output. I want to crawl some data from http://china.fathom.info/data/data.json, and only Scrapy is allowed, but I don't know how to control the order of yield. I expect all the parse_member requests in the loop to be processed first and the group_item to be returned afterwards, but it seems that yield item is always executed before yield request. start_urls = [ "http://china.fathom.info/data/data.json" ] def parse(self, response): groups = json.loads(response.body)
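
A sketch of the usual way around this ordering problem: don't yield the group item from parse at all; pass it, together with the remaining member URLs, along a chain of member requests and emit it only from the last member callback. The JSON field names are guesses, since the question's code is truncated.

```python
import json

import scrapy


class GroupSpider(scrapy.Spider):
    name = "groups"
    start_urls = ["http://china.fathom.info/data/data.json"]

    def parse(self, response):
        groups = json.loads(response.body)
        for group in groups:
            group_item = {"group": group.get("name"), "members": []}  # hypothetical fields
            member_urls = group.get("member_urls", [])                # hypothetical field
            # Chain the member requests; the item is yielded only at the end.
            yield self.next_member(group_item, member_urls)

    def next_member(self, group_item, remaining):
        if not remaining:
            return group_item            # every member page processed: emit the item
        return scrapy.Request(
            remaining[0],
            callback=self.parse_member,
            cb_kwargs={"group_item": group_item, "remaining": remaining[1:]},
        )

    def parse_member(self, response, group_item, remaining):
        group_item["members"].append(response.url)   # collect member data here
        yield self.next_member(group_item, remaining)
```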

Scrapy request+response+download time

Submitted by 六眼飞鱼酱① on 2020-01-01 08:45:27
Question: UPD: Not closing the question, because I think my approach is not as clear as it should be. Is it possible to get the current request + response + download time so it can be saved to the Item? In "plain" Python I do: start_time = time() urllib2.urlopen('http://example.com').read() time() - start_time But how can I do this with Scrapy? UPD: The solution below is enough for me, but I'm not sure about the quality of the results. If you have many connections with timeout errors, the download time may be wrong (even DOWNLOAD_TIMEOUT * 3). For settings.py
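
A sketch of one way to record this per request (not necessarily the asker's own solution): a small downloader middleware that stamps the start time in request.meta and stores the elapsed time when the response comes back. Scrapy also records its own download_latency key in request.meta, which covers just the download phase. The project path in the settings comment is hypothetical.

```python
import time


class DownloadTimerMiddleware:
    """Store wall-clock request-to-response time in request.meta."""

    def process_request(self, request, spider):
        request.meta["request_start"] = time.time()

    def process_response(self, request, response, spider):
        request.meta["request_duration"] = time.time() - request.meta["request_start"]
        return response


# settings.py (hypothetical module path and priority):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.DownloadTimerMiddleware": 543,
# }
#
# In a spider callback, the value is then available as
# response.meta["request_duration"] (alongside Scrapy's own
# response.meta.get("download_latency")) for saving into the Item.
```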