scrapy

Update scrapy settings based on spider property

Submitted by 大憨熊 on 2019-12-23 02:16:43
Question: Is there a way in Scrapy to dynamically set a spider's settings at runtime? I want to add an isDebug variable to my spider and, depending on its value, adjust the log level, pipelines, and various other settings. When I try to manipulate the settings as described in the manual, like this:

    def from_crawler(cls, crawler):
        settings = crawler.settings
        settings['USER_AGENT'] = 'Overwridden-UA'

I always get: TypeError: Trying to modify an immutable Settings object

Answer 1: Settings
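The answer breaks off at this point. A minimal sketch of one common approach, assuming a Scrapy version with the custom_settings class attribute (1.0+): per-spider overrides are declared on the class and merged into the crawler's settings before the Settings object is frozen, so the isDebug flag can drive them without mutating anything.

    import scrapy

    class DebugAwareSpider(scrapy.Spider):  # hypothetical spider name
        name = 'debug_aware'
        isDebug = True  # the flag from the question

        # custom_settings is applied at crawler creation, before the
        # Settings object becomes immutable, so no TypeError is raised.
        custom_settings = {
            'LOG_LEVEL': 'DEBUG' if isDebug else 'INFO',
        }

For one-off runs, individual settings can also be overridden from the command line, e.g. scrapy crawl debug_aware -s LOG_LEVEL=DEBUG.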

How to scrape coupon code of coupon site (coupon code comes on clicking button)

Submitted by 和自甴很熟 on 2019-12-23 02:16:05
Question: I want to scrape a page like - and I'm using Scrapy and Python for it. I want to scrape the button you can see in the picture below (left): http://postimg.org/image/syhauheo7/ When I click the green button saying "View Code", it does three things: it redirects to another id, opens a popup containing the code, and shows the code on the same page, as can be seen in the picture on the right. How can I scrape the code using Scrapy and the Python framework?

Answer 1: Here's your spider:

    from scrapy.http import
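The answer's code is cut off after its first import. A sketch of the general shape such a spider might take, assuming the popup fetches the coupon code from a URL carried on the button; every URL, selector, and attribute name below is hypothetical, and Selector.attrib assumes a recent Scrapy:

    from scrapy import Spider

    class CouponSpider(Spider):
        name = 'coupons'
        start_urls = ['http://example.com/coupons']  # hypothetical listing page

        def parse(self, response):
            for button in response.css('a.view-code'):  # hypothetical selector
                # The popup typically loads the code from a URL carried on
                # the button (href or a data-* attribute).
                popup_url = button.attrib.get('data-popup-url') or button.attrib.get('href')
                if popup_url:
                    yield response.follow(popup_url, callback=self.parse_code)

        def parse_code(self, response):
            yield {'code': response.css('.coupon-code::text').get()}  # hypothetical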

Scrapy and Selenium submit form that renders dynamically

Submitted by 廉价感情. on 2019-12-23 01:57:17
Question: I'm using Selenium to first load a form that is generated via AJAX. Now I'm having trouble passing the Selenium response to Scrapy's FormRequest method to send the form data values. The form has jQuery validation before the user can submit it; does that make it harder to submit using Scrapy? Any help is appreciated, thanks.

Answer 1: You need neither Scrapy nor Selenium here. Just make the underlying POST request and parse the JSON response. Example using requests:

    import json
    import requests
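A sketch completing the truncated requests example in the answer's spirit; the endpoint URL, form fields, and response shape are assumptions to be replaced with what the browser's network tab shows:

    import json
    import requests

    # Hypothetical AJAX endpoint the form posts to.
    url = 'http://example.com/ajax/form-submit'
    payload = {'name': 'John', 'email': 'john@example.com'}  # hypothetical fields

    response = requests.post(url, data=payload)
    data = response.json()  # the server replies with JSON
    print(json.dumps(data, indent=2))

As for the jQuery validation: it runs only in the browser, so it cannot block a direct POST; only server-side checks apply.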

Dynamically adding domains to scrapy crawlspider deny_domains list

Submitted by 六眼飞鱼酱① on 2019-12-23 01:41:25
Question: I am currently using Scrapy's CrawlSpider to look for specific info on a list of multiple start_urls. What I would like to do is stop scraping a specific start_url's domain once I've found the information I'm looking for, so the spider won't keep hitting that domain and will instead hit only the other start_urls. Is there a way to do this? I have tried appending to deny_domains like so:

    deniedDomains = []
    ...
    rules = [Rule(SgmlLinkExtractor(..., deny_domains=(etc), ...)]
    ...
    def parseURL(self, response)
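The answer is not shown. One workaround sketch, not the original answer: the rules are compiled once at start-up, so mutating deny_domains later has no effect, but a process_links hook can filter links against a set that the parse callback grows at runtime. The modern LinkExtractor stands in for the deprecated SgmlLinkExtractor, and the names and success check are hypothetical:

    from urllib.parse import urlparse

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class StopPerDomainSpider(CrawlSpider):
        name = 'stop_per_domain'
        start_urls = ['http://example-a.com', 'http://example-b.com']  # hypothetical

        rules = [Rule(LinkExtractor(), callback='parse_url',
                      process_links='filter_links', follow=True)]

        denied_domains = set()  # grows at runtime, unlike deny_domains

        def filter_links(self, links):
            # Drop links whose domain we have already finished with.
            return [link for link in links
                    if urlparse(link.url).netloc not in self.denied_domains]

        def parse_url(self, response):
            if self.found_target_info(response):  # hypothetical success check
                self.denied_domains.add(urlparse(response.url).netloc)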

How to enable overwriting a file every time in scrapy item export?

Submitted by 最后都变了- on 2019-12-23 01:12:15
Question: I am scraping a website, and the spider returns a list of URLs. For example:

    scrapy crawl xyz_spider -o urls.csv

It works absolutely fine, but now I want each run to create a fresh urls.csv instead of appending data to the existing file. Is there a parameter I can pass to enable that?

Answer 1: Unfortunately Scrapy can't do this at the moment. There is a proposed enhancement on GitHub, though: https://github.com/scrapy/scrapy/issues/547 However, you can easily redirect the output to stdout and redirect that to a file:
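A sketch of that redirection, assuming a Scrapy version whose feed exporter accepts - as "write to stdout"; the shell's > then truncates urls.csv on every run instead of appending:

    # --nolog keeps Scrapy's log lines out of the CSV on stdout.
    scrapy crawl xyz_spider -t csv --nolog -o - > urls.csv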

Python Scrapy override content type to be multipart/form-data on post Request

Submitted by 霸气de小男生 on 2019-12-23 00:42:30
Question: I'm trying to use Scrapy to scrape a website which, for some reason, encodes its POST requests as "multipart/form-data". Is there a way to override Scrapy's default behavior of posting with "application/x-www-form-urlencoded"? It looks like the site is not responding to the spider because it wants its requests posted as "multipart/form-data". I have tried multipart-encoding the form variables, but Wireshark shows that Scrapy still sets the header incorrectly regardless of this encoding.
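No answer is visible for this entry. A workaround sketch, not a confirmed Scrapy feature: FormRequest always urlencodes, but a plain Request accepts an arbitrary body and headers, so a multipart payload can be assembled by hand (text fields only; file parts would need extra subheaders):

    import uuid

    from scrapy import Request

    def multipart_request(url, fields, callback):
        """Build a POST Request whose body is hand-encoded multipart/form-data."""
        boundary = uuid.uuid4().hex
        parts = [
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'
            for name, value in fields.items()
        ]
        parts.append(f'--{boundary}--\r\n')
        return Request(
            url,
            method='POST',
            body=''.join(parts),
            # Setting the header explicitly overrides the urlencoded default.
            headers={'Content-Type': f'multipart/form-data; boundary={boundary}'},
            callback=callback,
        )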

Scrapy does not write data to a file

Submitted by 情到浓时终转凉″ on 2019-12-22 20:33:12
Question: I created a spider in Scrapy:

items.py:

    from scrapy.item import Item, Field

    class dns_shopItem(Item):
        # Define the fields for your item here like:
        # Name = Field()
        id = Field()
        idd = Field()

dns_shop_spider.py:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.loader.processor import TakeFirst
    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.selector import HtmlXPathSelector
    from dns_shop
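The spider code is cut off mid-import and the answer is missing. As a hedged first check, assuming the spider's name attribute is dns_shop: make sure the crawl is invoked with an explicit feed-export target, since items that are only logged never reach a file:

    scrapy crawl dns_shop -o scraped_items.csv -t csv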

Empty list with Scrapy and XPath

Submitted by 倖福魔咒の on 2019-12-22 18:36:57
Question: I'm starting to use Scrapy and XPath to scrape some pages. I'm just trying simple things in IPython, and I get results on some pages, like IMDb, but on others, like www.bbb.org, I always get an empty list. This is what I'm doing:

    scrapy shell 'http://www.bbb.org/central-western-massachusetts/business-reviews/auto-repair-and-service/toms-automotive-in-fitchburg-ma-211787'

The page text I'm trying to extract reads: "BBB Accreditation A BBB Accredited Business since 02/12/2010 BBB has determined that Tom's Automotive meets"
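A sketch of checks to run inside that shell session; the XPath is hypothetical. An empty list on a page like this usually means either the markup differs from the expression or the text is injected by JavaScript and absent from the raw response:

    # Try a deliberately loose expression first.
    response.xpath('//*[contains(text(), "BBB Accredited")]').extract()
    # Open exactly what Scrapy downloaded in a browser; if the text is
    # missing there, it is rendered client-side and XPath will always
    # return an empty list.
    view(response)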

How to recursively crawl subpages with Scrapy

Submitted by 徘徊边缘 on 2019-12-22 18:36:27
Question: So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow the sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:

    Category 1 name
        Subcategory 1 name
            data from this subcategory's page
        Subcategory n name
            data from this page
    Category n name
        Subcategory 1 name
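The question trails off in the middle of its sample output. A sketch of the recursive pattern, assuming Scrapy 1.7+ for cb_kwargs; all URLs and selectors are hypothetical:

    import scrapy

    class CategorySpider(scrapy.Spider):
        name = 'categories'
        start_urls = ['http://example.com/categories']  # hypothetical

        def parse(self, response):
            # Follow each category link, carrying its name forward.
            for cat in response.css('a.category'):  # hypothetical selector
                yield response.follow(
                    cat, self.parse_category,
                    cb_kwargs={'category': cat.css('::text').get()})

        def parse_category(self, response, category):
            # Follow each subcategory link, carrying both names forward.
            for sub in response.css('a.subcategory'):  # hypothetical selector
                yield response.follow(
                    sub, self.parse_subcategory,
                    cb_kwargs={'category': category,
                               'subcategory': sub.css('::text').get()})

        def parse_subcategory(self, response, category, subcategory):
            yield {
                'category': category,
                'subcategory': subcategory,
                'data': ' '.join(response.css('p::text').getall()),
            }

This yields one flat item per subcategory page; grouping the items into the nested category/subcategory JSON shown above would be a post-processing step, or a pipeline that assembles the tree when the spider closes.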