scrapy

Scrapy - Crawl Multiple Pages Per Item

Submitted by 喜欢而已 on 2020-01-01 19:01:09
Question: I am trying to crawl a few extra pages per item to grab some location information. At the end of the item, before returning it, I check whether we need to crawl extra pages to grab the information; essentially these pages contain some location details and are a simple GET request, e.g. http://site.com.au/MVC/Offer/GetLocationDetails/?locationId=3761&companyId=206 The above link returns either a select with more pages to crawl, or a dd/dt with the address details. Either way I need to extract this
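
A minimal sketch of one common pattern for this: don't yield the partially filled item yet; carry it along to the extra location requests via cb_kwargs (Scrapy 1.7+; older versions would use request.meta) and only emit it once the address details are merged in. The spider name, selectors, and query-string pattern below are assumptions for illustration, not taken from the question.

```python
import scrapy


class OfferSpider(scrapy.Spider):
    name = "offers"  # hypothetical spider name
    start_urls = ["http://site.com.au/MVC/Offer/"]  # hypothetical listing page

    def parse(self, response):
        # Partially filled item; do NOT yield it yet.
        item = {"title": response.css("h1::text").get()}
        detail_url = (
            "http://site.com.au/MVC/Offer/GetLocationDetails/"
            "?locationId=3761&companyId=206"
        )
        # Carry the item to the extra GET request instead of returning it.
        yield scrapy.Request(detail_url, callback=self.parse_location,
                             cb_kwargs={"item": item})

    def parse_location(self, response, item):
        location_ids = response.css("select option::attr(value)").getall()
        if location_ids:
            # The page returned a <select>: more location pages to crawl,
            # each still carrying a copy of the item.
            for location_id in location_ids:
                url = response.urljoin(f"?locationId={location_id}&companyId=206")
                yield scrapy.Request(url, callback=self.parse_location,
                                     cb_kwargs={"item": dict(item)})
        else:
            # The page returned the <dd>/<dt> address details: merge and emit.
            item["address"] = " ".join(response.css("dd::text").getall()).strip()
            yield item
```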

Scrapy - Crawl and Scrape a website

Submitted by 让人想犯罪 __ on 2020-01-01 18:54:15
Question: As part of learning to use Scrapy, I have tried to crawl Amazon, and there is a problem while scraping data. The output of my code is as follows: 2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155> {'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13', u'http://www.amazon.com/MELT-Method-Breakthrough-Self-Treatment-Eliminate
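
The log above shows every link on the results page collected into a single item's 'link' list. A sketch of the usual fix is to loop over one result container at a time and yield one item per container; the CSS selectors below are placeholders, since the real Amazon markup is not shown in the excerpt.

```python
import scrapy


class AmazonSpider(scrapy.Spider):
    name = "scanon"
    start_urls = ["http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2"]

    def parse(self, response):
        # Iterate over each search-result block so every yielded item holds
        # a single title/link pair instead of every link on the page.
        for result in response.css("div.s-result-item"):  # hypothetical selector
            yield {
                "title": result.css("h2 ::text").get(),
                "link": result.css("a::attr(href)").get(),
            }
```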

How to create a single executable file in Windows 10 with Scrapy and PyInstaller?

Submitted by 会有一股神秘感。 on 2020-01-01 18:20:50
Question: I have created a Scrapy spider and successfully converted it to a Windows executable using PyInstaller, with a dist folder. In order to do that, I had to make some slight changes in the Scrapy site-packages and add those packages to the Windows dist folder, and it works perfectly. How can I make this into a single exe together with the Scrapy packages mentioned above from the dist folder? I have already tried the --onefile option in PyInstaller, but it shows a Scrapy error. Answer 1: Very similar issue discussed here:
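
One way this is commonly approached (a sketch, not the accepted answer): instead of editing Scrapy's site-packages, tell PyInstaller to bundle Scrapy's dynamically imported submodules and data files (e.g. VERSION, mime.types) via a spec file and build in one-file mode. The entry-script name and spec file name are hypothetical, and the spec is only runnable through PyInstaller itself, which injects Analysis/PYZ/EXE when it processes the file.

```python
# spider.spec -- hypothetical spec; build with: pyinstaller --clean spider.spec
from PyInstaller.utils.hooks import collect_data_files, collect_submodules

hiddenimports = collect_submodules("scrapy")   # Scrapy imports many modules dynamically
datas = collect_data_files("scrapy")           # bundles VERSION, mime.types, etc.

a = Analysis(
    ["run_spider.py"],         # hypothetical entry script that starts CrawlerProcess
    hiddenimports=hiddenimports,
    datas=datas,
)
pyz = PYZ(a.pure)
exe = EXE(
    pyz, a.scripts, a.binaries, a.zipfiles, a.datas,
    name="spider",
    console=True,              # keep the console so Scrapy's log is visible
)
```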

How to use threading in Scrapy/Twisted, i.e. how to do async calls to blocking code in response callbacks?

Submitted by 冷暖自知 on 2020-01-01 15:39:52
Question: I need to run some multi-threaded/multiprocessing work in Scrapy (because I have a library which uses blocking calls), and after its completion put a Request back into the Scrapy engine. I need something like this: def blocking_call(self, html): # .... # do some work in a blocking call return Request(url) def parse(self, response): return self.blocking_call(response.body) How can I do that? I think I should use the Twisted reactor and a Deferred object. But a Scrapy parse callback must return only None or
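
A sketch of the direction the question is pointing at: run the blocking work in Twisted's thread pool with deferToThread and chain the follow-up Request as a callback on the resulting Deferred. Whether a Deferred returned from a spider callback is accepted directly depends on the Scrapy version, so treat this as an illustration of the Twisted side rather than a drop-in answer; the URLs and spider name are placeholders.

```python
import scrapy
from twisted.internet import threads


class BlockingDemoSpider(scrapy.Spider):
    name = "blocking_demo"                   # hypothetical spider
    start_urls = ["http://example.com"]

    def parse(self, response):
        # Push the blocking library call onto Twisted's thread pool so the
        # reactor keeps running; deferToThread returns a Deferred immediately.
        d = threads.deferToThread(self.blocking_call, response.body)
        d.addCallback(self.after_blocking)
        return d

    def blocking_call(self, html):
        # ... long, blocking work on the HTML here ...
        return "http://example.com/next"     # hypothetical follow-up URL

    def after_blocking(self, url):
        # The value returned here becomes the Deferred's final result.
        return scrapy.Request(url, callback=self.parse_next)

    def parse_next(self, response):
        self.logger.info("Fetched follow-up page %s", response.url)
```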

Rotating proxies in Selenium

Submitted by 大城市里の小女人 on 2020-01-01 15:23:48
Question: I use the Selenium webdriver for Firefox with Scrapy, and now I need to change proxies dynamically, but all I've found so far in the docs is that I can specify a proxy in the profile when I instantiate the webdriver itself. Does that mean I can't change the proxy dynamically? Is there any way to do that? Answer 1: Selenium does not provide this kind of functionality. It is not possible to dynamically change the browser preferences/desired capabilities once you've launched the browser. You have to close/quit the
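
A sketch following the answer's advice: build a fresh Firefox profile (and a fresh browser) per proxy, then quit and start again. The preference names are Firefox's standard manual-proxy keys; the proxy list and the Selenium 3-style firefox_profile argument are assumptions (Selenium 4 passes the profile through Options instead).

```python
from selenium import webdriver


def make_driver(proxy_host, proxy_port):
    # Preferences cannot be changed on a running browser, so each proxy
    # gets its own profile and its own Firefox instance.
    profile = webdriver.FirefoxProfile()
    profile.set_preference("network.proxy.type", 1)           # 1 = manual proxy config
    profile.set_preference("network.proxy.http", proxy_host)
    profile.set_preference("network.proxy.http_port", proxy_port)
    profile.set_preference("network.proxy.ssl", proxy_host)
    profile.set_preference("network.proxy.ssl_port", proxy_port)
    return webdriver.Firefox(firefox_profile=profile)


proxies = [("10.0.0.1", 8080), ("10.0.0.2", 8080)]  # hypothetical proxy list
for host, port in proxies:
    driver = make_driver(host, port)
    driver.get("http://example.com")
    # ... scrape with this proxy ...
    driver.quit()  # close/quit before switching to the next proxy
```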

Scrape using multiple POST data from the same URL

Submitted by 若如初见. on 2020-01-01 14:34:12
Question: I have already created one spider that collects a list of company names with matching phone numbers, which is then saved to a CSV file. I now want to scrape data from another site, using the phone numbers in the CSV file as POST data. I want it to loop through the same start URL, scraping the data that each phone number produces, until there are no more numbers left in the CSV file. This is what I have got so far: from scrapy.spider import BaseSpider from scrapy.http import
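
A minimal sketch of the shape this usually takes with current Scrapy (the question's excerpt uses the older BaseSpider imports): read the phone numbers in start_requests and yield one FormRequest per number against the same URL, with dont_filter so the duplicate filter doesn't drop them. The file name, target URL, form-field name, and selector are placeholders, since the question's code is cut off.

```python
import csv

import scrapy


class PhoneLookupSpider(scrapy.Spider):
    name = "phone_lookup"                             # hypothetical name
    lookup_url = "http://example.com/search"          # hypothetical target URL

    def start_requests(self):
        # One POST per phone number from the CSV written by the first spider.
        with open("companies.csv", newline="") as f:  # hypothetical CSV file
            for row in csv.DictReader(f):
                yield scrapy.FormRequest(
                    self.lookup_url,
                    formdata={"phone": row["phone"]},  # hypothetical field name
                    callback=self.parse_result,
                    cb_kwargs={"phone": row["phone"]},
                    dont_filter=True,                  # same URL, different POST body
                )

    def parse_result(self, response, phone):
        yield {"phone": phone, "company": response.css("h1::text").get()}
```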

How to control the order of yield in Scrapy

Submitted by 牧云@^-^@ on 2020-01-01 12:01:37
Question: Help! I am reading the following Scrapy code and the crawler's output. I want to crawl some data from http://china.fathom.info/data/data.json, and only Scrapy is allowed, but I don't know how to control the order of yield. I expect all the parse_member requests in the loop to be processed first and the group_item to be returned afterwards, but it seems that yield item is always executed before yield request. start_urls = [ "http://china.fathom.info/data/data.json" ] def parse(self, response): groups = json.loads(response.body)
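
A sketch of the usual way around this ordering problem: don't yield the group item from parse at all; pass it, together with the remaining member URLs, along a chain of member requests and emit it only from the last member callback. The JSON field names are guesses, since the question's code is truncated.

```python
import json

import scrapy


class GroupSpider(scrapy.Spider):
    name = "groups"
    start_urls = ["http://china.fathom.info/data/data.json"]

    def parse(self, response):
        groups = json.loads(response.body)
        for group in groups:
            group_item = {"group": group.get("name"), "members": []}  # hypothetical fields
            member_urls = group.get("member_urls", [])                # hypothetical field
            # Chain the member requests; the item is yielded only at the end.
            yield self.next_member(group_item, member_urls)

    def next_member(self, group_item, remaining):
        if not remaining:
            return group_item            # every member page processed: emit the item
        return scrapy.Request(
            remaining[0],
            callback=self.parse_member,
            cb_kwargs={"group_item": group_item, "remaining": remaining[1:]},
        )

    def parse_member(self, response, group_item, remaining):
        group_item["members"].append(response.url)   # collect member data here
        yield self.next_member(group_item, remaining)
```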

Scrapy request+response+download time

Submitted by 六眼飞鱼酱① on 2020-01-01 08:45:27
Question: UPD: Not closing the question, because I think my approach is not as clear as it should be. Is it possible to get the current request + response + download time so it can be saved to the Item? In "plain" Python I do: start_time = time() urllib2.urlopen('http://example.com').read() time() - start_time But how can I do this with Scrapy? UPD: The solution below is enough for me, but I'm not sure about the quality of the results. If you have many connections with timeout errors, the download time may be wrong (even DOWNLOAD_TIMEOUT * 3). For settings.py
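
A sketch of one way to record this per request (not necessarily the asker's own solution): a small downloader middleware that stamps the start time in request.meta and stores the elapsed time when the response comes back. Scrapy also records its own download_latency key in request.meta, which covers just the download phase. The project path in the settings comment is hypothetical.

```python
import time


class DownloadTimerMiddleware:
    """Store wall-clock request-to-response time in request.meta."""

    def process_request(self, request, spider):
        request.meta["request_start"] = time.time()

    def process_response(self, request, response, spider):
        request.meta["request_duration"] = time.time() - request.meta["request_start"]
        return response


# settings.py (hypothetical module path and priority):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.DownloadTimerMiddleware": 543,
# }
#
# In a spider callback, the value is then available as
# response.meta["request_duration"] (alongside Scrapy's own
# response.meta.get("download_latency")) for saving into the Item.
```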