scrapy

How To Remove White Space in Scrapy Spider Data

百般思念 submitted on 2019-12-20 11:26:06
Question: I am writing my first spider in Scrapy and attempting to follow the documentation. I have implemented ItemLoaders. The spider extracts the data, but the data contains many line returns. I have tried many ways to remove them, but nothing seems to work. The replace_escape_chars utility is supposed to work, but I can't figure out how to use it with the ItemLoader. Also, some people use (unicode.strip), but again, I can't seem to get it to work. Some people try to use these in items.py and others...
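
A minimal sketch of one commonly suggested fix (not necessarily the asker's final solution): attach a strip/escape-char processor to the loader's input processing. The item and field names below are hypothetical, and depending on the Scrapy version the processors live in scrapy.loader.processors or in the separate itemloaders package.

    import scrapy
    from scrapy.loader import ItemLoader
    from itemloaders.processors import MapCompose, TakeFirst
    from w3lib.html import replace_escape_chars

    class ProductItem(scrapy.Item):
        # hypothetical item with a single text field
        title = scrapy.Field()

    class ProductLoader(ItemLoader):
        default_item_class = ProductItem
        default_output_processor = TakeFirst()
        # remove \n, \t, \r first, then surrounding whitespace, on every value
        title_in = MapCompose(replace_escape_chars, str.strip)

In a callback this would be used as loader = ProductLoader(selector=response), loader.add_css('title', 'h1::text'), yield loader.load_item().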

Scrapy retry or redirect middleware

折月煮酒 submitted on 2019-12-20 10:55:09
Question: While crawling through a site with scrapy, I get redirected to a user-blocked page about 1/5th of the time, and I lose the pages that I get redirected from when that happens. I don't know which middleware to use or what settings to use in that middleware, but I want this:

    DEBUG: Redirecting (302) to (GET http://domain.com/foo.aspx) from (GET http://domain.com/bar.htm)

to NOT drop bar.htm. I end up with no data from bar.htm when the scraper's done, but I'm rotating proxies, so if it tries bar.htm...
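
A sketch of one common settings-based approach, assuming the block page is only ever reached via that 302: disable redirect following and let RetryMiddleware re-queue the original request, so bar.htm goes back into the queue and, with rotating proxies, is retried through a different proxy instead of being dropped.

    # settings.py -- values here are illustrative, not from the original answer
    REDIRECT_ENABLED = False            # don't follow the 302 to the block page
    RETRY_ENABLED = True
    RETRY_TIMES = 5                     # how many times to re-try bar.htm
    RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 302]   # treat 302 as retryable

The same can be done per request with Request(url, meta={'dont_redirect': True}) if only some URLs should behave this way.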

Get scrapy spider to crawl entire site

拜拜、爱过 submitted on 2019-12-20 10:42:32
Question: I am using scrapy to crawl old sites that I own, and I am using the code below as my spider. I don't mind having files outputted for each webpage, or a database with all the content within that. But I do need the spider to crawl the whole thing without me having to put in every single URL, as I currently have to do:

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["www.example.com"]
        start_urls = ["http://www.example.com/contactus"]

        def parse...
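
The usual way to make a spider follow every internal link is a CrawlSpider with a catch-all Rule; a sketch (the spider name and the per-page handling are placeholders, only the domain comes from the question):

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class SiteSpider(CrawlSpider):
        name = "site"
        allowed_domains = ["www.example.com"]
        start_urls = ["http://www.example.com/"]

        # follow every link that stays inside allowed_domains and
        # run parse_item on each page that is reached
        rules = (
            Rule(LinkExtractor(), callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            # dump the URL and raw page body; swap in real field extraction
            yield {"url": response.url, "body": response.text}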

Set headers for scrapy shell request

半腔热情 submitted on 2019-12-20 09:37:52
Question: I know that you can run scrapy shell -s USER_AGENT='custom user agent' 'http://www.example.com' to change the USER_AGENT, but how do you add request headers?

Answer 1: There is currently no way to add headers directly on the CLI, but you can do something like this:

    $ scrapy shell
    ...
    >>> from scrapy import Request
    >>> req = Request('yoururl.com', headers={"header1": "value1"})
    >>> fetch(req)

This will update the current shell information with that new request.

Source: https://stackoverflow.com/questions

How do I improve scrapy's download speed?

可紊 submitted on 2019-12-20 09:25:27
Question: I'm using scrapy to download pages from many different domains in parallel. I have hundreds of thousands of pages to download, so performance is important. Unfortunately, having profiled scrapy's speed, I'm only getting a couple of pages per second; really, about 2 pages per second on average. I've previously written my own multithreaded spiders that do hundreds of pages per second, so I thought for sure scrapy's use of twisted, etc. would be capable of similar magic. How do I speed scrapy up? I...
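
For a broad, many-domain crawl like this, the first things worth checking are the concurrency settings; a sketch of the usual knobs (the values are illustrative and need tuning per setup):

    # settings.py
    CONCURRENT_REQUESTS = 256            # global cap on in-flight requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-site limit; many domains add up
    DOWNLOAD_DELAY = 0                   # no artificial delay between requests
    DOWNLOAD_TIMEOUT = 15                # give up on slow responses sooner
    REACTOR_THREADPOOL_MAXSIZE = 20      # more threads for DNS lookups
    LOG_LEVEL = "INFO"                   # DEBUG logging itself slows big crawls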

ImportError: No module named win32api while using Scrapy

半城伤御伤魂 submitted on 2019-12-20 09:15:53
Question: I am a new learner of Scrapy. I installed Python 2.7 and all the other components needed, then tried to build a Scrapy project following the tutorial http://doc.scrapy.org/en/latest/intro/tutorial.html. In the crawling step, after I typed scrapy crawl dmoz, it generated this error message: ImportError: No module named win32api. [twisted] CRITICAL: Unhandled error in deferred. I am using Windows.

Answer 1: Try this: pip install pypiwin32

Answer 2: If you search a bit along the...

Scrapy: non-blocking pause

隐身守侯 submitted on 2019-12-20 08:59:59
Question: I have a problem. I need to stop the execution of a function for a while, but not stop the parsing as a whole. That is, I need a non-blocking pause. It looks like this:

    class ScrapySpider(Spider):
        name = 'live_function'

        def start_requests(self):
            yield Request('some url', callback=self.non_stop_function)

        def non_stop_function(self, response):
            for url in ['url1', 'url2', 'url3', 'more urls']:
                yield Request(url, callback=self.second_parse_function)
            # Here I need some function for...
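
On Scrapy versions with coroutine callback support (2.x), one way to get a non-blocking pause is to make the callback a coroutine and await a Deferred that fires later; a sketch (URLs and names are placeholders, and this is not necessarily the accepted answer to the original question):

    import scrapy
    from twisted.internet import reactor
    from twisted.internet.task import deferLater

    class PauseSpider(scrapy.Spider):
        name = "pause_sketch"
        start_urls = ["http://www.example.com/"]

        async def parse(self, response):
            for url in ["http://www.example.com/1", "http://www.example.com/2"]:
                yield scrapy.Request(url, callback=self.second_parse_function)
                # non-blocking pause: awaiting the Deferred lets the reactor
                # keep downloading other requests while this callback waits
                await deferLater(reactor, 5, lambda: None)

        def second_parse_function(self, response):
            self.logger.info("parsed %s", response.url)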

dynamic start_urls in scrapy

天涯浪子 submitted on 2019-12-20 08:51:44
Question: I'm using scrapy to crawl multiple pages on a site. The variable start_urls is used to define the pages to be crawled. I would initially start with the 1st page, thus defining start_urls = [1st page] in the file example_spider.py. Upon getting more info from the 1st page, I would determine what the next pages to crawl are and then assign start_urls accordingly. Hence, I have to overwrite the above example_spider.py with changes to start_urls = [1st page, 2nd page, ..., Kth page], then run scrapy crawl...
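
The usual alternative to rewriting start_urls is to discover the next pages inside parse() and yield new Requests for them; a sketch with hypothetical selectors and URLs:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["http://www.example.com/page/1"]

        def parse(self, response):
            # extract whatever data this page holds
            yield {"url": response.url}

            # queue the pages discovered on this page instead of editing
            # start_urls and re-running the spider
            for href in response.css("a.next-page::attr(href)").getall():
                yield response.follow(href, callback=self.parse)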

Scrapy: how to use items in spider and how to send items to pipelines?

自作多情 submitted on 2019-12-20 08:49:37
Question: I am new to scrapy and my task is simple. For a given e-commerce website:

- crawl all website pages
- look for product pages
- if the URL points to a product page, create an Item
- process the item to store it in a database

I created the spider, but products are just printed to a simple file. My question is about the project structure: how do I use items in the spider, and how do I send items to pipelines? I can't find a simple example of a project using items and pipelines.

Answer 1: How to use items in my spider?...
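
A minimal sketch of how the pieces usually fit together (module and field names are hypothetical, not from the original project):

    # items.py
    import scrapy

    class ProductItem(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        url = scrapy.Field()

    # in the spider callback, yield the item so it flows into the pipelines:
    #     yield ProductItem(name=..., price=..., url=response.url)

    # pipelines.py -- every yielded item passes through process_item()
    class StoreProductPipeline:
        def process_item(self, item, spider):
            # replace this log call with the real database insert
            spider.logger.info("storing %r", dict(item))
            return item

    # settings.py -- enable the pipeline so Scrapy actually calls it
    # ITEM_PIPELINES = {"myproject.pipelines.StoreProductPipeline": 300}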

Speed up web scraper

穿精又带淫゛_ submitted on 2019-12-20 08:15:10
Question: I am scraping 23770 webpages with a pretty simple web scraper using scrapy. I am quite new to scrapy and even Python, but managed to write a spider that does the job. It is, however, really slow (it takes approx. 28 hours to crawl the 23770 pages). I have looked at the scrapy webpage, the mailing lists and stackoverflow, but I can't seem to find generic recommendations for writing fast crawlers that are understandable for beginners. Maybe my problem is not the spider itself, but the way I run it.
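
Since the excerpt suggests the bottleneck may be in how the crawl is launched rather than in the spider, a quick experiment is to override the throughput-related settings on the command line and compare crawl times (the spider name and values below are placeholders):

    scrapy crawl myspider \
        -s CONCURRENT_REQUESTS=64 \
        -s CONCURRENT_REQUESTS_PER_DOMAIN=16 \
        -s DOWNLOAD_DELAY=0 \
        -s LOG_LEVEL=INFO \
        -o items.jl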