web-crawler

scrapy - how to stop Redirect (302)

余生长醉 submitted on 2019-11-27 11:41:44
Question: I'm trying to crawl a URL using Scrapy, but it redirects me to a page that doesn't exist. Redirecting (302) to <GET http://www.shop.inonit.in/mobile/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/1275197> from <GET http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx> The problem is http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor
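One common way to keep Scrapy from silently following the 302 (a sketch, not necessarily the fix the asker ended up with) is to disable the redirect middleware for that request with the dont_redirect meta key and let the 302 reach the callback via handle_httpstatus_list; the spider name and URL below are placeholders:

    import scrapy

    class NoRedirectSpider(scrapy.Spider):
        # Hypothetical spider, shown only to illustrate the meta keys.
        name = "noredirect"

        def start_requests(self):
            url = "http://example.com/some-page"  # placeholder URL
            # dont_redirect turns off RedirectMiddleware for this request;
            # handle_httpstatus_list lets the 302 response reach the callback
            # instead of being dropped as an error.
            yield scrapy.Request(
                url,
                meta={"dont_redirect": True, "handle_httpstatus_list": [302]},
                callback=self.parse,
            )

        def parse(self, response):
            # The raw 302 response arrives here; its Location header can still
            # be inspected if you want to decide what to do with it.
            self.logger.info("Got %s for %s", response.status, response.url)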

.NET Does NOT Have Reliable Asynchronous Socket Communication?

我是研究僧i submitted on 2019-11-27 11:22:46
Question: I once wrote a crawler in .NET. To improve its scalability, I tried to take advantage of the asynchronous API of .NET. System.Net.HttpWebRequest has the asynchronous API BeginGetResponse/EndGetResponse. However, this pair of APIs only gets the HTTP response headers and a Stream instance from which we can extract the HTTP response content. So, my strategy is to use BeginGetResponse/EndGetResponse to asynchronously get the response Stream, then use BeginRead/EndRead to asynchronously get

Writing items to a MySQL database in Scrapy

和自甴很熟 submitted on 2019-11-27 10:53:30
I am new to Scrapy. I have the following spider code:

    class Example_spider(BaseSpider):
        name = "example"
        allowed_domains = ["www.example.com"]

        def start_requests(self):
            yield self.make_requests_from_url("http://www.example.com/bookstore/new")

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
            for i in urls:
                yield Request(urljoin("http://www.example.com/", i[1:]), callback=self.parse_url)

        def parse_url(self, response):
            hxs = HtmlXPathSelector(response)
            main = hxs.select('//div[@id="bookshelf-bg"]')
            items = []
            for i in
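The usual shape of the answer is an item pipeline that writes each scraped item to MySQL. Below is a simplified, blocking sketch; pymysql, the connection details, and the books table are all assumptions made for illustration:

    # Minimal, blocking MySQL pipeline sketch. Assumes `pip install pymysql`
    # and an existing `books` table -- neither comes from the original question.
    import pymysql

    class MySQLStorePipeline(object):
        def open_spider(self, spider):
            # Open one connection per crawl.
            self.conn = pymysql.connect(
                host="localhost", user="root", password="", database="scrapydb"
            )
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            # Insert one row per item; field names are placeholders.
            self.cursor.execute(
                "INSERT INTO books (title, url) VALUES (%s, %s)",
                (item.get("title"), item.get("url")),
            )
            self.conn.commit()
            return item

        def close_spider(self, spider):
            self.conn.close()

The pipeline still has to be enabled under ITEM_PIPELINES in settings.py, and a production setup would typically use Twisted's adbapi so the inserts don't block the crawl.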

Should I create pipeline to save files with scrapy?

风流意气都作罢 submitted on 2019-11-27 10:52:33
Question: I need to save a file (.pdf) but I'm unsure how to do it. I need to save .pdfs and store them in directories organized much like they are stored on the site I'm scraping them from. From what I can gather I need to make a pipeline, but from what I understand pipelines save "Items", and "items" are just basic data like strings/numbers. Is saving files a proper use of pipelines, or should I save the file in the spider instead? Answer 1: Yes and no[1]. If you fetch a pdf it will be
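Worth noting: Scrapy's built-in FilesPipeline covers exactly this case. A rough sketch of wiring it up is below; the spider name, start URL, and CSS selector are placeholders:

    import scrapy

    class PdfSpider(scrapy.Spider):
        # Hypothetical spider, shown only to illustrate the built-in FilesPipeline.
        name = "pdf_files"
        start_urls = ["http://example.com/docs/"]  # placeholder URL

        custom_settings = {
            # FilesPipeline downloads every URL listed in an item's "file_urls"
            # field and records the saved paths under "files".
            "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
            "FILES_STORE": "downloads",  # local directory for the saved PDFs
        }

        def parse(self, response):
            for href in response.css('a[href$=".pdf"]::attr(href)').getall():
                yield {"file_urls": [response.urljoin(href)]}

FilesPipeline names each saved file after a hash of its URL, so reproducing the site's directory layout means subclassing it and overriding file_path().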

Following links, Scrapy web crawler framework

筅森魡賤 submitted on 2019-11-27 10:12:14
Question: After several readings of the Scrapy docs I'm still not catching the difference between using CrawlSpider rules and implementing my own link extraction mechanism in the callback method. I'm about to write a new web crawler using the latter approach, but just because I had a bad experience in a past project using rules. I'd really like to know exactly what I'm doing and why. Anyone familiar with this tool? Thanks for your help! Answer 1: CrawlSpider inherits BaseSpider. It just added rules to extract
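Side by side, the two approaches the question contrasts look roughly like this; the URLs, the allow pattern, and the selectors are illustrative, not from the asker's project:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class RulesSpider(CrawlSpider):
        # Declarative: CrawlSpider runs the LinkExtractor over every response
        # and sends matching pages to parse_item.
        name = "with_rules"
        start_urls = ["http://example.com/"]  # placeholder
        rules = (
            Rule(LinkExtractor(allow=r"/catalog/"), callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            yield {"url": response.url}

    class ManualSpider(scrapy.Spider):
        # Imperative: the callback extracts links itself and decides what to
        # follow, which is the approach the asker is leaning toward.
        name = "manual_links"
        start_urls = ["http://example.com/"]  # placeholder

        def parse(self, response):
            yield {"url": response.url}
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Note that CrawlSpider reserves the parse method for its own rule handling, so rule callbacks must use a different name (parse_item here).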

How can I use different pipelines for different spiders in a single Scrapy project

你说的曾经没有我的故事 submitted on 2019-11-27 10:07:54
I have a Scrapy project which contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all the pipelines I have defined are applicable to every spider. Thanks. Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item method of a Pipeline object so that it checks the pipeline attribute of your spider for whether or not it should be executed. For example:

    def check_spider_pipeline(process_item_method):
        @functools.wraps(process_item_method)
        def wrapper(self, item, spider):
            # message template for debugging
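Filled in, that decorator usually ends up looking something like the sketch below; the spider-side pipeline attribute comes from the answer being quoted, while the log messages and the getattr default are illustrative guesses:

    import functools

    # Sketch of the completed pattern: each spider declares a `pipeline`
    # attribute (a set of pipeline classes it wants), and the wrapper turns
    # process_item into a no-op for every other spider.
    def check_spider_pipeline(process_item_method):
        @functools.wraps(process_item_method)
        def wrapper(self, item, spider):
            # message template for debugging
            msg = "%s pipeline step" % self.__class__.__name__
            if self.__class__ in getattr(spider, "pipeline", set()):
                spider.logger.debug(msg + " executing")
                return process_item_method(self, item, spider)
            spider.logger.debug(msg + " skipped")
            return item
        return wrapper

    # Usage (illustrative): the pipeline decorates process_item with
    # @check_spider_pipeline, and each spider lists the pipelines it wants,
    # e.g.  pipeline = {MyCustomPipeline}.

On current Scrapy versions the same per-spider selection is often done without a decorator by giving each spider its own custom_settings = {'ITEM_PIPELINES': {...}}.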

How to write a crawler?

若如初见. submitted on 2019-11-27 09:59:51
I have had thoughts of trying to write a simple crawler that might crawl and produce a list of its findings for our NPO's websites and content. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it finds, etc., etc.? You'll be reinventing the wheel, to be sure. But here are the basics:
- A list of unvisited URLs - seed this with one or more starting pages
- A list of visited URLs - so you don't go around in circles
- A set of rules for URLs you're not interested in - so you don
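Put together, those basics fit in a few dozen lines. A minimal breadth-first sketch using the third-party requests and BeautifulSoup packages (an assumption; any HTTP client and HTML parser would do) looks like this:

    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed, max_pages=50):
        """Breadth-first crawl starting from `seed`, staying on the same host."""
        host = urlparse(seed).netloc
        queue = deque([seed])   # unvisited URLs, seeded with the starting page
        visited = set()         # visited URLs, so we don't go around in circles
        findings = []

        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            findings.append((url, resp.status_code))
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                # rule: skip other hosts and anything already seen
                if urlparse(link).netloc == host and link not in visited:
                    queue.append(link)
        return findings

A real crawler would also respect robots.txt, rate-limit itself, and persist its findings somewhere instead of returning a list.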

Nodejs: Async request with a list of URL

久未见 submitted on 2019-11-27 09:53:49
I am working on a crawler. I have a list of URLs that need to be requested. If I don't control it, several hundred requests will be in flight at the same time, and I am afraid that would saturate my bandwidth or produce too much network traffic to the target website. What should I do? Here is what I am doing:

    urlList.forEach((url, index) => {
        console.log('Fetching ' + url);
        request(url, function(error, response, body) {
            // do sth for body
        });
    });

I want each request to be made only after the previous one has completed. The things you need to watch for are: Whether the target site has rate limiting and you may be

What's a good Web Crawler tool [closed]

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-27 09:47:48
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 7 years ago. I need to index a whole lot of webpages; what good web crawler utilities are there? I'm preferably after something that .NET can talk to, but that's not a showstopper. What I really need is something that I can give a site URL to and it will follow every link and store the content for indexing. Answer 1: HTTrack -- http

Node.js: How to pass variables to asynchronous callbacks? [duplicate]

喜夏-厌秋 submitted on 2019-11-27 09:44:57
Question: This question already has answers here: JavaScript closure inside loops – simple practical example (44 answers). Closed 3 years ago. I'm sure my problem is based on a lack of understanding of async programming in Node.js, but here goes. For example: I have a list of links I want to crawl. When each async request returns, I want to know which URL it is for. But, presumably because of race conditions, each request returns with the URL set to the last value in the list. var links = ['http:/