scrapy

How To Remove White Space in Scrapy Spider Data

百般思念 submitted on 2019-12-20 11:26:06
Question: I am writing my first spider in Scrapy and attempting to follow the documentation. I have implemented ItemLoaders. The spider extracts the data, but the data contains many line returns. I have tried many ways to remove them, but nothing seems to work. The replace_escape_chars utility is supposed to work, but I can't figure out how to use it with the ItemLoader. Also, some people use (unicode.strip), but again, I can't seem to get it to work. Some people try to use these in items.py and others...
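
A minimal sketch of one commonly suggested fix (not necessarily the asker's final solution): attach a strip/escape-char processor to the loader's input processing. The item and field names below are hypothetical, and depending on the Scrapy version the processors live in scrapy.loader.processors or in the separate itemloaders package.

    import scrapy
    from scrapy.loader import ItemLoader
    from itemloaders.processors import MapCompose, TakeFirst
    from w3lib.html import replace_escape_chars

    class ProductItem(scrapy.Item):
        # hypothetical item with a single text field
        title = scrapy.Field()

    class ProductLoader(ItemLoader):
        default_item_class = ProductItem
        default_output_processor = TakeFirst()
        # remove \n, \t, \r first, then surrounding whitespace, on every value
        title_in = MapCompose(replace_escape_chars, str.strip)

In a callback this would be used as loader = ProductLoader(selector=response), loader.add_css('title', 'h1::text'), yield loader.load_item().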

Scrapy retry or redirect middleware

折月煮酒 submitted on 2019-12-20 10:55:09
Question: While crawling through a site with scrapy, I get redirected to a user-blocked page about 1/5th of the time, and I lose the pages that I get redirected from when that happens. I don't know which middleware to use or what settings to use in that middleware, but I want this:

    DEBUG: Redirecting (302) to (GET http://domain.com/foo.aspx) from (GET http://domain.com/bar.htm)

to NOT drop bar.htm. I end up with no data from bar.htm when the scraper's done, but I'm rotating proxies, so if it tries bar.htm...
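
A sketch of one common settings-based approach, assuming the block page is only ever reached via that 302: disable redirect following and let RetryMiddleware re-queue the original request, so bar.htm goes back into the queue and, with rotating proxies, is retried through a different proxy instead of being dropped.

    # settings.py -- values here are illustrative, not from the original answer
    REDIRECT_ENABLED = False            # don't follow the 302 to the block page
    RETRY_ENABLED = True
    RETRY_TIMES = 5                     # how many times to re-try bar.htm
    RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 302]   # treat 302 as retryable

The same can be done per request with Request(url, meta={'dont_redirect': True}) if only some URLs should behave this way.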

Get scrapy spider to crawl entire site

拜拜、爱过 submitted on 2019-12-20 10:42:32
Question: I am using scrapy to crawl old sites that I own, and I am using the code below as my spider. I don't mind having files outputted for each webpage, or a database with all the content within that. But I do need the spider to crawl the whole thing without me having to put in every single URL, as I currently have to do:

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["www.example.com"]
        start_urls = ["http://www.example.com/contactus"]

        def parse...
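
The usual way to make a spider follow every internal link is a CrawlSpider with a catch-all Rule; a sketch (the spider name and the per-page handling are placeholders, only the domain comes from the question):

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class SiteSpider(CrawlSpider):
        name = "site"
        allowed_domains = ["www.example.com"]
        start_urls = ["http://www.example.com/"]

        # follow every link that stays inside allowed_domains and
        # run parse_item on each page that is reached
        rules = (
            Rule(LinkExtractor(), callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            # dump the URL and raw page body; swap in real field extraction
            yield {"url": response.url, "body": response.text}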

Set headers for scrapy shell request

半腔热情 submitted on 2019-12-20 09:37:52
Question: I know that you can run scrapy shell -s USER_AGENT='custom user agent' 'http://www.example.com' to change the USER_AGENT, but how do you add request headers?

Answer 1: There is currently no way to add headers directly on the CLI, but you can do something like this:

    $ scrapy shell
    ...
    >>> from scrapy import Request
    >>> req = Request('yoururl.com', headers={"header1": "value1"})
    >>> fetch(req)

This will update the current shell information with that new request.

Source: https://stackoverflow.com/questions

How do I improve scrapy's download speed?

可紊 submitted on 2019-12-20 09:25:27
Question: I'm using scrapy to download pages from many different domains in parallel. I have hundreds of thousands of pages to download, so performance is important. Unfortunately, having profiled scrapy's speed, I'm only getting a couple of pages per second; really, about 2 pages per second on average. I've previously written my own multithreaded spiders that do hundreds of pages per second, so I thought for sure scrapy's use of twisted, etc. would be capable of similar magic. How do I speed scrapy up? I...
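
For a broad, many-domain crawl like this, the first things worth checking are the concurrency settings; a sketch of the usual knobs (the values are illustrative and need tuning per setup):

    # settings.py
    CONCURRENT_REQUESTS = 256            # global cap on in-flight requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-site limit; many domains add up
    DOWNLOAD_DELAY = 0                   # no artificial delay between requests
    DOWNLOAD_TIMEOUT = 15                # give up on slow responses sooner
    REACTOR_THREADPOOL_MAXSIZE = 20      # more threads for DNS lookups
    LOG_LEVEL = "INFO"                   # DEBUG logging itself slows big crawls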

ImportError: No module named win32api while using Scrapy

半城伤御伤魂 submitted on 2019-12-20 09:15:53
Question: I am a new learner of Scrapy. I installed Python 2.7 and all the other components needed, then tried to build a Scrapy project following the tutorial http://doc.scrapy.org/en/latest/intro/tutorial.html. In the crawling step, after I typed scrapy crawl dmoz, it generated this error message: ImportError: No module named win32api. [twisted] CRITICAL: Unhandled error in deferred. I am using Windows.

Answer 1: Try this: pip install pypiwin32

Answer 2: If you search a bit along the...

Scrapy: non-blocking pause

隐身守侯 submitted on 2019-12-20 08:59:59
Question: I have a problem. I need to stop the execution of a function for a while, but not stop the parsing as a whole. That is, I need a non-blocking pause. It looks like this:

    class ScrapySpider(Spider):
        name = 'live_function'

        def start_requests(self):
            yield Request('some url', callback=self.non_stop_function)

        def non_stop_function(self, response):
            for url in ['url1', 'url2', 'url3', 'more urls']:
                yield Request(url, callback=self.second_parse_function)
            # Here I need some function for...
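
On Scrapy versions with coroutine callback support (2.x), one way to get a non-blocking pause is to make the callback a coroutine and await a Deferred that fires later; a sketch (URLs and names are placeholders, and this is not necessarily the accepted answer to the original question):

    import scrapy
    from twisted.internet import reactor
    from twisted.internet.task import deferLater

    class PauseSpider(scrapy.Spider):
        name = "pause_sketch"
        start_urls = ["http://www.example.com/"]

        async def parse(self, response):
            for url in ["http://www.example.com/1", "http://www.example.com/2"]:
                yield scrapy.Request(url, callback=self.second_parse_function)
                # non-blocking pause: awaiting the Deferred lets the reactor
                # keep downloading other requests while this callback waits
                await deferLater(reactor, 5, lambda: None)

        def second_parse_function(self, response):
            self.logger.info("parsed %s", response.url)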

dynamic start_urls in scrapy

天涯浪子 submitted on 2019-12-20 08:51:44
Question: I'm using scrapy to crawl multiple pages on a site. The variable start_urls is used to define the pages to be crawled. I would initially start with the 1st page, thus defining start_urls = [1st page] in the file example_spider.py. Upon getting more info from the 1st page, I would determine what the next pages to crawl are and then assign start_urls accordingly. Hence, I have to overwrite the above example_spider.py with changes to start_urls = [1st page, 2nd page, ..., Kth page], then run scrapy crawl...
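
The usual alternative to rewriting start_urls is to discover the next pages inside parse() and yield new Requests for them; a sketch with hypothetical selectors and URLs:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["http://www.example.com/page/1"]

        def parse(self, response):
            # extract whatever data this page holds
            yield {"url": response.url}

            # queue the pages discovered on this page instead of editing
            # start_urls and re-running the spider
            for href in response.css("a.next-page::attr(href)").getall():
                yield response.follow(href, callback=self.parse)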

Scrapy: how to use items in spider and how to send items to pipelines?

自作多情 submitted on 2019-12-20 08:49:37
Question: I am new to scrapy and my task is simple. For a given e-commerce website:

- crawl all website pages
- look for product pages
- if the URL points to a product page, create an Item
- process the item to store it in a database

I created the spider, but products are just printed to a simple file. My question is about the project structure: how do I use items in the spider, and how do I send items to pipelines? I can't find a simple example of a project using items and pipelines.

Answer 1: How to use items in my spider?...
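
A minimal sketch of how the pieces usually fit together (module and field names are hypothetical, not from the original project):

    # items.py
    import scrapy

    class ProductItem(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        url = scrapy.Field()

    # in the spider callback, yield the item so it flows into the pipelines:
    #     yield ProductItem(name=..., price=..., url=response.url)

    # pipelines.py -- every yielded item passes through process_item()
    class StoreProductPipeline:
        def process_item(self, item, spider):
            # replace this log call with the real database insert
            spider.logger.info("storing %r", dict(item))
            return item

    # settings.py -- enable the pipeline so Scrapy actually calls it
    # ITEM_PIPELINES = {"myproject.pipelines.StoreProductPipeline": 300}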

Speed up web scraper

穿精又带淫゛_ submitted on 2019-12-20 08:15:10
Question: I am scraping 23770 webpages with a pretty simple web scraper using scrapy. I am quite new to scrapy and even Python, but managed to write a spider that does the job. It is, however, really slow (it takes approx. 28 hours to crawl the 23770 pages). I have looked at the scrapy webpage, the mailing lists and stackoverflow, but I can't seem to find generic recommendations for writing fast crawlers that are understandable for beginners. Maybe my problem is not the spider itself, but the way I run it.
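
Since the excerpt suggests the bottleneck may be in how the crawl is launched rather than in the spider, a quick experiment is to override the throughput-related settings on the command line and compare crawl times (the spider name and values below are placeholders):

    scrapy crawl myspider \
        -s CONCURRENT_REQUESTS=64 \
        -s CONCURRENT_REQUESTS_PER_DOMAIN=16 \
        -s DOWNLOAD_DELAY=0 \
        -s LOG_LEVEL=INFO \
        -o items.jl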