scrapy-spider

Scrapy FormRequest: trying to send a POST request (FormRequest) with currency-change formdata

Submitted by 我只是一个虾纸丫 on 2019-12-04 17:24:40
I've been trying to scrape the following website, but with the currency changed to 'SAR' via the settings form in the upper left. I tried sending a Scrapy request like this:

    r = Request(
        url='https://www.mooda.com/en/',
        cookies=[
            {'name': 'currency', 'value': 'SAR', 'domain': '.www.mooda.com', 'path': '/'},
            {'name': 'country', 'value': 'SA', 'domain': '.www.mooda.com', 'path': '/'},
        ],
        dont_filter=True,
    )

and I still get the prices in EG£:

    In [10]: response.css('.price').xpath('text()').extract()
    Out[10]: [u'1,957 EG\xa3', u'3,736 EG\xa3', u'2,802 EG\xa3', u'10,380 EG\xa3', u'1,823 EG\xa3']

I have also tried to
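A common alternative to guessing cookie names is to submit the site's own currency-switch form so the server sets its session cookies itself. Below is a minimal sketch; the form field name 'currency' is an assumption, so inspect the real form in your browser's devtools first:

    import scrapy
    from scrapy import FormRequest

    class MoodaSpider(scrapy.Spider):
        name = 'mooda'
        start_urls = ['https://www.mooda.com/en/']

        def parse(self, response):
            # Submit the currency form found on the page. 'currency' is an
            # assumed field name; pass formid/formxpath to from_response if
            # the page contains several forms.
            yield FormRequest.from_response(
                response,
                formdata={'currency': 'SAR'},
                callback=self.after_currency,
                dont_filter=True,
            )

        def after_currency(self, response):
            # Prices should now come back in SAR if the switch succeeded
            yield {'prices': response.css('.price::text').getall()}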

How to get a person's followers, as well as the comments under their photos, on Instagram using Scrapy?

Submitted by 笑着哭i on 2019-12-04 17:12:32
As you can see, the following JSON has the number of followers as well as the number of comments, but how can I access the data within each comment, as well as the IDs of the followers, so that I can crawl into them?

    {
        "logging_page_id": "profilePage_20327023",
        "user": {
            "biography": null,
            "blocked_by_viewer": false,
            "connected_fb_page": null,
            "country_block": false,
            "external_url": null,
            "external_url_linkshimmed": null,
            "followed_by": { "count": 2585 },
            "followed_by_viewer": false,
            "follows": { "count": 561 },
            "follows_viewer": false,
            "full_name": "LeAnne Barengo",
            "has_blocked_viewer": false,
            "has_requested_viewer":
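A small sketch of navigating the counts that are present in this payload, assuming data holds the parsed JSON above. Note that the individual follower IDs and comment bodies are not in this profile payload; Instagram serves those from separate paginated endpoints, so only the aggregate counts can be read here:

    import json

    data = json.loads(response.text)  # response body containing the JSON above
    user = data['user']

    follower_count = user['followed_by']['count']   # 2585 -- a nested dict, not a plain int
    following_count = user['follows']['count']      # 561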

Python Scrapy - Login Authentication Issue

Submitted by 烈酒焚心 on 2019-12-04 15:33:31
Question: I have just started using Scrapy, and I am facing a few problems with logging in. I am trying to scrape items on the website www.instacart.com, but I am facing issues with logging in. The following is the code:

    import scrapy
    from scrapy.loader import ItemLoader
    from project.items import ProjectItem
    from scrapy.http import Request
    from scrapy import optional_features

    optional_features.remove('boto')

    class FirstSpider(scrapy.Spider):
        name = "first"
        allowed_domains = ["https://instacart.com"]
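A minimal login sketch using FormRequest.from_response, which carries over the login form's hidden fields automatically. The field names email/password and the logged-in marker are assumptions, so match them against the real form in devtools. Note also that allowed_domains takes bare domain names, not URLs with a scheme:

    import scrapy
    from scrapy.http import FormRequest

    class FirstSpider(scrapy.Spider):
        name = "first"
        allowed_domains = ["instacart.com"]   # no 'https://' prefix here
        start_urls = ["https://www.instacart.com/"]

        def parse(self, response):
            # Hypothetical field names -- check the actual login form
            yield FormRequest.from_response(
                response,
                formdata={"email": "you@example.com", "password": "secret"},
                callback=self.after_login,
            )

        def after_login(self, response):
            if b"Log out" in response.body:   # assumed logged-in marker
                self.logger.info("Login succeeded")
                # start scraping items from here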

Scrapy Return Multiple Items

Submitted by 有些话、适合烂在心里 on 2019-12-04 12:35:41
I'm new to Scrapy and I'm really just lost on how I can return multiple items in one block. Basically, I'm getting one HTML tag which has a quote that contains nested tags for the text, the author name, and some tags about that quote. The code here only returns one quote, and that's it; it doesn't use the loop to return the rest. I've been searching the web for hours and I'm just hopeless, I don't get it. Here's my code so far:

Spider.py:

    import scrapy
    from scrapy.loader import ItemLoader
    from first_spider.items import FirstSpiderItem

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = [
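The usual cause is returning (or yielding outside the loop) instead of yielding once per quote. A minimal sketch, assuming the quotes.toscrape.com markup that spiders like this typically target:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # yield inside the loop, so every quote becomes its own item
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('small.author::text').get(),
                    'tags': quote.css('div.tags a.tag::text').getall(),
                }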

Scrapy: catch responses with specific HTTP server codes

Submitted by 淺唱寂寞╮ on 2019-12-04 11:45:12
Question: We have a pretty much standard Scrapy project (Scrapy 0.24). I'd like to catch specific HTTP response codes, such as 200, 500, 502, 503, 504, etc. Something like this:

    class Spider(...):
        def parse(...):
            # processes HTTP 200

        def parse_500(...):
            # processes HTTP 500 errors

        def parse_502(...):
            # processes HTTP 502 errors
        ...

How can we do that?

Answer 1: By default, Scrapy only handles responses with status codes 200-300. Let Scrapy handle 500 and 502:

    class Spider(...):
        handle_httpstatus_list = [500,
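A sketch of how the truncated answer usually continues: list the extra statuses in handle_httpstatus_list so those responses reach the callback, then dispatch on response.status. The per-status helper methods are illustrative, not a built-in Scrapy feature:

    import scrapy

    class StatusSpider(scrapy.Spider):
        name = 'statuses'
        # these statuses are normally filtered out; let them reach parse()
        handle_httpstatus_list = [500, 502, 503, 504]
        start_urls = ['http://example.com/']

        def parse(self, response):
            if response.status == 500:
                return self.parse_500(response)
            if response.status == 502:
                return self.parse_502(response)
            # ... normal HTTP 200 handling goes here ...

        def parse_500(self, response):
            self.logger.warning('Got 500 from %s', response.url)

        def parse_502(self, response):
            self.logger.warning('Got 502 from %s', response.url)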

How to control the order of yield in Scrapy

Submitted by 蓝咒 on 2019-12-04 11:00:35
Help! I am reading the following Scrapy code and the result of the crawler. I want to crawl some data from http://china.fathom.info/data/data.json, and only Scrapy is allowed. But I don't know how to control the order of yield: I want to process all the parse_member requests in the loop first and only then return the group_item, but it seems yield item is always executed before yield request.

    start_urls = [
        "http://china.fathom.info/data/data.json"
    ]

    def parse(self, response):
        groups = json.loads(response.body)['group_members']
        for i in groups:
            group_item = GroupItem()
            group_item['name'] = groups[i]['name']
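Scrapy is asynchronous: requests yielded from parse() are only scheduled, not executed inline, so the item inevitably comes out first. The standard workaround is to pass the partially built item through meta and chain the member requests, yielding the item only from the last callback. A sketch under the assumption that each group carries a list of member URLs (the field name members is hypothetical):

    import json
    import scrapy

    class GroupSpider(scrapy.Spider):
        name = 'groups'
        start_urls = ['http://china.fathom.info/data/data.json']

        def parse(self, response):
            groups = json.loads(response.body)['group_members']
            for key, group in groups.items():
                item = {'name': group['name'], 'members': []}
                urls = group.get('members', [])   # hypothetical field
                if not urls:
                    yield item
                    continue
                # chain the first member request; the rest ride along in meta
                yield scrapy.Request(urls[0], callback=self.parse_member,
                                     meta={'item': item, 'pending': urls[1:]})

        def parse_member(self, response):
            item = response.meta['item']
            item['members'].append(response.url)   # real member parsing goes here
            pending = response.meta['pending']
            if pending:
                yield scrapy.Request(pending[0], callback=self.parse_member,
                                     meta={'item': item, 'pending': pending[1:]})
            else:
                yield item   # emitted only after the last member is parsed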

Is there any way to use a separate Scrapy pipeline for each spider?

Submitted by 廉价感情. on 2019-12-04 09:44:20
Question: I want to fetch web pages under different domains, which means I have to use a different spider under the command "scrapy crawl myspider". However, I have to use different pipeline logic to put the data into the database, since the content of the web pages differs. But every spider has to go through all of the pipelines defined in settings.py. Is there another, elegant way to use separate pipelines for each spider?

Answer 1: The ITEM_PIPELINES setting is defined globally for all
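Two common patterns, sketched below: override ITEM_PIPELINES per spider via custom_settings (available since Scrapy 1.0), or keep one shared pipeline that dispatches on spider.name. The pipeline class names here are placeholders:

    import scrapy

    # Option 1: per-spider pipeline configuration
    class SpiderA(scrapy.Spider):
        name = 'spider_a'
        custom_settings = {
            'ITEM_PIPELINES': {'myproject.pipelines.DatabaseAPipeline': 300},
        }

    # Option 2: one shared pipeline that branches on the spider
    class DispatchingPipeline:
        def process_item(self, item, spider):
            if spider.name == 'spider_a':
                pass  # spider_a-specific storage logic
            elif spider.name == 'spider_b':
                pass  # spider_b-specific storage logic
            return item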

Scrapy + Splash + ScrapyJS

Submitted by ﹥>﹥吖頭↗ on 2019-12-04 08:37:30
I am using Splash 2.0.2 + Scrapy 1.0.5 + ScrapyJS 0.1.1 and I'm still not able to render JavaScript triggered by a click. Here is an example URL: https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html#c49d3d94cf. I am still getting the page without the phone number rendered:

    class OlxSpider(scrapy.Spider):
        name = "olx"
        rotate_user_agent = True
        allowed_domains = ["olx.pt"]
        start_urls = [
            "https://olx.pt/imoveis/"
        ]

        def parse(self, response):
            script = """
            function main(splash)
                splash:go(splash.args.url)
                splash:runjs('document.getElementById("contact_methods")
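A sketch of the pattern that usually works: send the Lua script to Splash's execute endpoint via the splash request meta, and call splash:wait() after the click so the XHR that fetches the phone number has time to complete. The CSS selectors below are assumptions about the page's markup:

    import scrapy

    LUA_SCRIPT = """
    function main(splash)
        splash:go(splash.args.url)
        splash:wait(1)
        -- click the element that reveals the phone number (selector assumed)
        splash:runjs('document.querySelector("#contact_methods .link-phone").click()')
        splash:wait(2)  -- give the resulting XHR time to finish
        return splash:html()
    end
    """

    class OlxSpider(scrapy.Spider):
        name = "olx"
        allowed_domains = ["olx.pt"]
        start_urls = ["https://olx.pt/imoveis/"]

        def parse(self, response):
            for href in response.css('a.detailsLink::attr(href)').getall():  # assumed selector
                yield scrapy.Request(
                    href,
                    callback=self.parse_ad,
                    meta={'splash': {'args': {'lua_source': LUA_SCRIPT},
                                     'endpoint': 'execute'}},
                )

        def parse_ad(self, response):
            # the HTML rendered by Splash is the response body here
            yield {'phone': response.css('.contact-phone::text').get()}  # assumed selector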

Pass extra values along with URLs to a Scrapy spider

Submitted by 眉间皱痕 on 2019-12-04 06:43:08
I have a list of tuples in the form (id, url). I need to crawl a product from a list of URLs, and when those products are crawled, I need to store them in the database under their id. The problem is that I can't understand how to pass the id to the parse function so that I can store each crawled item under its id.

Initialize the start URLs in start_requests() and pass the id in meta:

    class MySpider(Spider):
        mapping = [(1, 'my_url1'), (2, 'my_url2')]
        ...

        def start_requests(self):
            for id, url in self.mapping:
                yield Request(url, callback=self.parse_page, meta={'id': id})

        def parse_page(self, response):
            id = response.meta['id']
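On newer Scrapy (1.7+), the same thing is usually done with cb_kwargs, which delivers the value as a plain callback argument; a sketch of the same spider's two methods:

    def start_requests(self):
        for id_, url in self.mapping:
            yield Request(url, callback=self.parse_page, cb_kwargs={'id_': id_})

    def parse_page(self, response, id_):
        # id_ arrives as a regular argument; store the item under it
        ...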

How to specify parameters on a Request using Scrapy

Submitted by 夙愿已清 on 2019-12-04 04:36:23
Question: How do I pass parameters to a request on a URL like this:

    site.com/search/?action=search&description=My Search here&e_author=

How do I put the arguments in the structure of a spider Request, something like this example:

    req = Request(url="site.com/", parameters={x=1, y=2, z=3})

Answer 1: Pass your GET parameters inside the URL itself:

    return Request(url="https://yoursite.com/search/?action=search&description=MySearchhere&e_author=")

You should probably define your parameters in a dictionary and
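A sketch of how that truncated advice typically continues: build the query string from a dict with urllib.parse.urlencode, which also takes care of escaping spaces and special characters:

    from urllib.parse import urlencode
    from scrapy import Request

    params = {
        'action': 'search',
        'description': 'My Search here',
        'e_author': '',
    }
    url = 'https://yoursite.com/search/?' + urlencode(params)
    req = Request(url=url)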