scrapy

Update scrapy settings based on spider property

Submitted by 大憨熊 on 2019-12-23 02:16:43
Question: Is there a way in Scrapy to dynamically set a spider's settings at runtime? I want to add an isDebug variable to my spider and, depending on its value, adjust the log level, pipelines, and various other settings. When I try to manipulate the settings as described in the manual, like this:

    def from_crawler(cls, crawler):
        settings = crawler.settings
        settings['USER_AGENT'] = 'Overwridden-UA'

I always get: TypeError: Trying to modify an immutable Settings object

Answer 1: Settings
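The answer breaks off at this point. A minimal sketch of one common approach, assuming a Scrapy version with the custom_settings class attribute (1.0+): per-spider overrides are declared on the class and merged into the crawler's settings before the Settings object is frozen, so the isDebug flag can drive them without mutating anything.

    import scrapy

    class DebugAwareSpider(scrapy.Spider):  # hypothetical spider name
        name = 'debug_aware'
        isDebug = True  # the flag from the question

        # custom_settings is applied at crawler creation, before the
        # Settings object becomes immutable, so no TypeError is raised.
        custom_settings = {
            'LOG_LEVEL': 'DEBUG' if isDebug else 'INFO',
        }

For one-off runs, individual settings can also be overridden from the command line, e.g. scrapy crawl debug_aware -s LOG_LEVEL=DEBUG.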

How to scrape coupon code of coupon site (coupon code comes on clicking button)

Submitted by 和自甴很熟 on 2019-12-23 02:16:05
Question: I want to scrape a page like - and I'm using Scrapy and Python for it. I want to scrape the button you can see in the picture below (left): http://postimg.org/image/syhauheo7/ When I click the green button saying "View Code", it does three things: it redirects to another id, opens a popup containing the code, and shows the code on the same page, as can be seen in the picture on the right. How can I scrape the code using Scrapy and the Python framework?

Answer 1: Here's your spider:

    from scrapy.http import
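The answer's code is cut off after its first import. A sketch of the general shape such a spider might take, assuming the popup fetches the coupon code from a URL carried on the button; every URL, selector, and attribute name below is hypothetical, and Selector.attrib assumes a recent Scrapy:

    from scrapy import Spider

    class CouponSpider(Spider):
        name = 'coupons'
        start_urls = ['http://example.com/coupons']  # hypothetical listing page

        def parse(self, response):
            for button in response.css('a.view-code'):  # hypothetical selector
                # The popup typically loads the code from a URL carried on
                # the button (href or a data-* attribute).
                popup_url = button.attrib.get('data-popup-url') or button.attrib.get('href')
                if popup_url:
                    yield response.follow(popup_url, callback=self.parse_code)

        def parse_code(self, response):
            yield {'code': response.css('.coupon-code::text').get()}  # hypothetical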

Scrapy and Selenium submit form that renders dynamically

Submitted by 廉价感情. on 2019-12-23 01:57:17
Question: I'm using Selenium to first load a form that is generated via AJAX. Now I'm having trouble passing the Selenium response to Scrapy's FormRequest method to send the form data values. The form has jQuery validation before the user can submit it; does that make it harder to submit using Scrapy? Any help is appreciated, thanks.

Answer 1: You need neither Scrapy nor Selenium here. Just make the underlying POST request and parse the JSON response. Example using requests:

    import json
    import requests
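A sketch completing the truncated requests example in the answer's spirit; the endpoint URL, form fields, and response shape are assumptions to be replaced with what the browser's network tab shows:

    import json
    import requests

    # Hypothetical AJAX endpoint the form posts to.
    url = 'http://example.com/ajax/form-submit'
    payload = {'name': 'John', 'email': 'john@example.com'}  # hypothetical fields

    response = requests.post(url, data=payload)
    data = response.json()  # the server replies with JSON
    print(json.dumps(data, indent=2))

As for the jQuery validation: it runs only in the browser, so it cannot block a direct POST; only server-side checks apply.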

Dynamically adding domains to scrapy crawlspider deny_domains list

Submitted by 六眼飞鱼酱① on 2019-12-23 01:41:25
Question: I am currently using Scrapy's CrawlSpider to look for specific info on a list of multiple start_urls. What I would like to do is stop scraping a specific start_url's domain once I've found the information I'm looking for, so the spider won't keep hitting that domain and will instead hit only the other start_urls. Is there a way to do this? I have tried appending to deny_domains like so:

    deniedDomains = []
    ...
    rules = [Rule(SgmlLinkExtractor(..., deny_domains=(etc), ...)]
    ...
    def parseURL(self, response)
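The answer is not shown. One workaround sketch, not the original answer: the rules are compiled once at start-up, so mutating deny_domains later has no effect, but a process_links hook can filter links against a set that the parse callback grows at runtime. The modern LinkExtractor stands in for the deprecated SgmlLinkExtractor, and the names and success check are hypothetical:

    from urllib.parse import urlparse

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class StopPerDomainSpider(CrawlSpider):
        name = 'stop_per_domain'
        start_urls = ['http://example-a.com', 'http://example-b.com']  # hypothetical

        rules = [Rule(LinkExtractor(), callback='parse_url',
                      process_links='filter_links', follow=True)]

        denied_domains = set()  # grows at runtime, unlike deny_domains

        def filter_links(self, links):
            # Drop links whose domain we have already finished with.
            return [link for link in links
                    if urlparse(link.url).netloc not in self.denied_domains]

        def parse_url(self, response):
            if self.found_target_info(response):  # hypothetical success check
                self.denied_domains.add(urlparse(response.url).netloc)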

How to enable overwriting a file every time in scrapy item export?

Submitted by 最后都变了- on 2019-12-23 01:12:15
Question: I am scraping a website, and the spider returns a list of URLs. For example:

    scrapy crawl xyz_spider -o urls.csv

It works absolutely fine, but now I want each run to create a fresh urls.csv instead of appending data to the existing file. Is there a parameter I can pass to enable that?

Answer 1: Unfortunately Scrapy can't do this at the moment. There is a proposed enhancement on GitHub, though: https://github.com/scrapy/scrapy/issues/547 However, you can easily redirect the output to stdout and redirect that to a file:
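A sketch of that redirection, assuming a Scrapy version whose feed exporter accepts - as "write to stdout"; the shell's > then truncates urls.csv on every run instead of appending:

    # --nolog keeps Scrapy's log lines out of the CSV on stdout.
    scrapy crawl xyz_spider -t csv --nolog -o - > urls.csv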

Python Scrapy override content type to be multipart/form-data on post Request

Submitted by 霸气de小男生 on 2019-12-23 00:42:30
Question: I'm trying to use Scrapy to scrape a website which, for some reason, encodes its POST requests as "multipart/form-data". Is there a way to override Scrapy's default behavior of posting with "application/x-www-form-urlencoded"? It looks like the site is not responding to the spider because it wants its requests posted as "multipart/form-data". I have tried multipart-encoding the form variables, but Wireshark shows that Scrapy still sets the header incorrectly regardless of this encoding.
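No answer is visible for this entry. A workaround sketch, not a confirmed Scrapy feature: FormRequest always urlencodes, but a plain Request accepts an arbitrary body and headers, so a multipart payload can be assembled by hand (text fields only; file parts would need extra subheaders):

    import uuid

    from scrapy import Request

    def multipart_request(url, fields, callback):
        """Build a POST Request whose body is hand-encoded multipart/form-data."""
        boundary = uuid.uuid4().hex
        parts = [
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'
            for name, value in fields.items()
        ]
        parts.append(f'--{boundary}--\r\n')
        return Request(
            url,
            method='POST',
            body=''.join(parts),
            # Setting the header explicitly overrides the urlencoded default.
            headers={'Content-Type': f'multipart/form-data; boundary={boundary}'},
            callback=callback,
        )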

Scrapy does not write data to a file

Submitted by 情到浓时终转凉″ on 2019-12-22 20:33:12
Question: I created a spider in Scrapy:

items.py:

    from scrapy.item import Item, Field

    class dns_shopItem(Item):
        # Define the fields for your item here like:
        # Name = Field()
        id = Field()
        idd = Field()

dns_shop_spider.py:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.loader.processor import TakeFirst
    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.selector import HtmlXPathSelector
    from dns_shop
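The spider code is cut off mid-import and the answer is missing. As a hedged first check, assuming the spider's name attribute is dns_shop: make sure the crawl is invoked with an explicit feed-export target, since items that are only logged never reach a file:

    scrapy crawl dns_shop -o scraped_items.csv -t csv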

Empty list with Scrapy and XPath

Submitted by 倖福魔咒の on 2019-12-22 18:36:57
Question: I'm starting to use Scrapy and XPath to scrape some pages. I'm just trying simple things in IPython, and I get results on some pages, like IMDb, but on others, like www.bbb.org, I always get an empty list. This is what I'm doing:

    scrapy shell 'http://www.bbb.org/central-western-massachusetts/business-reviews/auto-repair-and-service/toms-automotive-in-fitchburg-ma-211787'

The page text I'm trying to extract reads: "BBB Accreditation A BBB Accredited Business since 02/12/2010 BBB has determined that Tom's Automotive meets"
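A sketch of checks to run inside that shell session; the XPath is hypothetical. An empty list on a page like this usually means either the markup differs from the expression or the text is injected by JavaScript and absent from the raw response:

    # Try a deliberately loose expression first.
    response.xpath('//*[contains(text(), "BBB Accredited")]').extract()
    # Open exactly what Scrapy downloaded in a browser; if the text is
    # missing there, it is rendered client-side and XPath will always
    # return an empty list.
    view(response)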

How to recursively crawl subpages with Scrapy

Submitted by 徘徊边缘 on 2019-12-22 18:36:27
Question: So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow the sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:

    Category 1 name
        Subcategory 1 name
            data from this subcategory's page
        Subcategory n name
            data from this page
    Category n name
        Subcategory 1 name
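The question trails off in the middle of its sample output. A sketch of the recursive pattern, assuming Scrapy 1.7+ for cb_kwargs; all URLs and selectors are hypothetical:

    import scrapy

    class CategorySpider(scrapy.Spider):
        name = 'categories'
        start_urls = ['http://example.com/categories']  # hypothetical

        def parse(self, response):
            # Follow each category link, carrying its name forward.
            for cat in response.css('a.category'):  # hypothetical selector
                yield response.follow(
                    cat, self.parse_category,
                    cb_kwargs={'category': cat.css('::text').get()})

        def parse_category(self, response, category):
            # Follow each subcategory link, carrying both names forward.
            for sub in response.css('a.subcategory'):  # hypothetical selector
                yield response.follow(
                    sub, self.parse_subcategory,
                    cb_kwargs={'category': category,
                               'subcategory': sub.css('::text').get()})

        def parse_subcategory(self, response, category, subcategory):
            yield {
                'category': category,
                'subcategory': subcategory,
                'data': ' '.join(response.css('p::text').getall()),
            }

This yields one flat item per subcategory page; grouping the items into the nested category/subcategory JSON shown above would be a post-processing step, or a pipeline that assembles the tree when the spider closes.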