web-crawler

How do I stop all spiders and the engine immediately after a condition in a pipeline is met?

走远了吗. Submitted on 2019-11-28 05:33:04
We have a system written with Scrapy to crawl a few websites. There are several spiders, and a few cascaded pipelines through which all items from all crawlers pass. One of the pipeline components queries the Google servers to geocode addresses. Google imposes a limit of 2500 requests per day per IP address, and threatens to ban an IP address if it keeps querying Google even after Google has responded with the warning message 'OVER_QUERY_LIMIT'. Hence I want to know about any mechanism which I can invoke from within the pipeline that will completely and immediately stop all further crawling.
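A minimal sketch of one way to do this (not taken from the original thread): a pipeline can reach the running engine through spider.crawler and ask it to shut the spider down. The class and helper names below are illustrative; if several spiders share one process you would need to repeat this for each of them, or stop the whole CrawlerProcess.

    # pipelines.py (sketch only)
    from scrapy.exceptions import DropItem

    class GeocodePipeline(object):
        def process_item(self, item, spider):
            status = self.geocode(item)  # hypothetical helper that calls the Google geocoder
            if status == 'OVER_QUERY_LIMIT':
                # Ask the engine to close this spider; pending requests are cancelled.
                spider.crawler.engine.close_spider(spider, 'geocoding quota exhausted')
                raise DropItem('geocoding quota exhausted')
            return item

        def geocode(self, item):
            # Placeholder for the real geocoding call.
            return 'OK'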

Does Solr do web crawling?

无人久伴 Submitted on 2019-11-28 05:29:33
I am interested in doing web crawling. I was looking at Solr. Does Solr do web crawling, or what are the steps to do web crawling? Jon Solr 5+ DOES in fact now do web crawling! http://lucene.apache.org/solr/ Older Solr versions do not do web crawling alone, as historically it is a search server that provides full-text search capabilities. It builds on top of Lucene. If you need to crawl web pages for use with Solr, then you have a number of options, including: Nutch - http://lucene.apache.org/nutch/ Websphinx - http://www.cs.cmu.edu/~rcm/websphinx/ JSpider - http://j-spider.sourceforge.net

crawler vs scraper

风流意气都作罢 Submitted on 2019-11-28 05:12:14
Can somebody distinguish between a crawler and a scraper in terms of scope and functionality? Jerry Coffin A crawler gets web pages -- i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore), it downloads whatever is linked to from the starting point(s). A scraper takes pages that have been downloaded or, in a more general sense, data that's formatted for display, and (attempts to) extract data from those pages, so that it can (for example) be stored in a database and manipulated as desired. Depending on how you

Concurrent downloads - Python

☆樱花仙子☆ Submitted on 2019-11-28 04:47:30
The plan is this: I download a webpage, collect a list of images parsed from the DOM, and then download these. After this I would iterate through the images in order to evaluate which image is best suited to represent the webpage. The problem is that the images are downloaded one by one, and this can take quite some time. It would be great if someone could point me in some direction regarding the topic. Help would be very much appreciated. Speeding up crawling is basically Eventlet's main use case. It's deeply fast -- we have an application that has to hit 2,000,000 urls in a few minutes. It makes use of the
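A rough sketch of the Eventlet approach (not the original answer's code; it assumes Python 3, the eventlet package, and that the image URLs have already been parsed out of the page):

    import eventlet
    # Green (cooperative) version of the standard urllib.request
    from eventlet.green.urllib import request

    def fetch(url):
        # Each call runs in its own green thread; network waits yield to other downloads.
        return url, request.urlopen(url).read()

    image_urls = [
        'http://example.com/a.jpg',  # placeholder URLs
        'http://example.com/b.jpg',
    ]

    pool = eventlet.GreenPool(size=20)  # at most 20 concurrent downloads
    for url, body in pool.imap(fetch, image_urls):
        print(url, len(body), 'bytes')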

How to generate the start_urls dynamically in crawling?

好久不见. Submitted on 2019-11-28 03:36:05
I am crawling a site which may contain a lot of start_urls, like: http://www.a.com/list_1_2_3.htm I want to populate start_urls like [list_\d+_\d+_\d+\.htm], and extract items from URLs like [node_\d+\.htm] during crawling. Can I use CrawlSpider to realize this function? And how can I generate the start_urls dynamically while crawling? juraseg The best way to generate URLs dynamically is to override the start_requests method of the spider:

    from scrapy.http.request import Request

    def start_requests(self):
        with open('urls.txt') as urls:
            for url in urls:
                yield Request(url.strip(), self.parse)

There
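To also cover the CrawlSpider part of the question, here is a hedged sketch (the spider name, domain, and URL range are illustrative, taken only from the question's example pattern) that combines dynamically generated start requests with a rule that extracts items from node_\d+\.htm pages:

    from scrapy.http import Request
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class ListNodeSpider(CrawlSpider):
        name = 'a_com'
        allowed_domains = ['www.a.com']

        rules = (
            # Keep following list pages...
            Rule(LinkExtractor(allow=r'list_\d+_\d+_\d+\.htm')),
            # ...and parse item pages.
            Rule(LinkExtractor(allow=r'node_\d+\.htm'), callback='parse_item'),
        )

        def start_requests(self):
            # start_urls generated at runtime instead of being hard-coded
            for i in range(1, 4):
                yield Request('http://www.a.com/list_1_2_%d.htm' % i)

        def parse_item(self, response):
            # Extract real fields here; a dict is enough for a sketch.
            yield {'url': response.url}

Because the requests from start_requests carry no explicit callback, they fall back to CrawlSpider's own parse method, so the rules above are still applied to the responses.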

What is the difference between web-crawling and web-scraping? [duplicate]

纵然是瞬间 Submitted on 2019-11-28 03:05:23
This question already has an answer here: crawler vs scraper. Is there a difference between Crawling and Web-scraping? If there's a difference, what's the best method to use in order to collect some web data to supply a database for later use in a customised search engine? Ben Crawling would be essentially what Google, Yahoo, MSN, etc. do, looking for ANY information. Scraping is generally targeted at certain websites, for specific data, e.g. for price comparison, so scrapers are coded quite differently. Usually a scraper will be bespoke to the websites it is supposed to be scraping, and would

Wildcards in robots.txt

只愿长相守 Submitted on 2019-11-28 02:14:15
If in a WordPress website I have categories in this order:

-Parent
--Child
---Subchild

I have permalinks set to: %category%/%postname%

Let's use an example. I create a post with the post name "Sport game". Its tag is sport-game. Its full URL is: domain.com/parent/child/subchild/sport-game

The reason I use this kind of permalink is precisely to make it easier to block some content in robots.txt. And now this is the part I have a question about. In robots.txt:

    User-agent: Googlebot
    Disallow: /parent/*
    Disallow: /parent/*/*
    Disallow: /parent/*/*/*
    Disallow: /parent/*

Is the meaning of these rules that they block domain.com/parent
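A general robots.txt note (not from the original thread): Disallow values are plain path-prefix matches, and Google treats a trailing * as redundant, so /parent/* matches the same URLs as /parent/. For Googlebot a single rule therefore already covers every URL under that category:

    User-agent: Googlebot
    Disallow: /parent/

This blocks domain.com/parent/child/subchild/sport-game along with every other URL whose path starts with /parent/.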

HtmlAgilityPack HtmlWeb.Load returning empty Document

北战南征 Submitted on 2019-11-28 01:43:29
I have been using HtmlAgilityPack for the last 2 months in a web crawler application with no issues loading a webpage. Now when I try to load this particular webpage, the document's OuterHtml is empty, so this test fails:

    var url = "http://www.prettygreen.com/";
    var htmlWeb = new HtmlWeb();
    var htmlDoc = htmlWeb.Load(url);
    var outerHtml = htmlDoc.DocumentNode.OuterHtml;
    Assert.AreNotEqual("", outerHtml);

I can load another page from the site with no problems, such as setting url = "http://www.prettygreen.com/news/"; In the past I once had an issue with encodings; I played around with htmlWeb

Asp.net Request.Browser.Crawler - Dynamic Crawler List?

半腔热情 Submitted on 2019-11-28 01:04:04
I learned "Why Request.Browser.Crawler is Always False in C#" ( http://www.digcode.com/default.aspx?page=ed51cde3-d979-4daf-afae-fa6192562ea9&article=bc3a7a4f-f53e-4f88-8e9c-c9337f6c05a0 ). Does anyone use some method to dynamically update the crawler list, so Request.Browser.Crawler will be really useful? I've been happy with the results supplied by Ocean's Browsercaps. It supports crawlers that Microsoft's config files have not bothered detecting. It will even parse out what version of the crawler is on your site, not that I really need that level of detail. You could check (regex) against

How do I save the original HTML files with Apache Nutch?

﹥>﹥吖頭↗ Submitted on 2019-11-28 00:22:52
I'm new to search engines and web crawlers. Now I want to store all the original pages of a particular web site as HTML files, but with Apache Nutch I can only get the binary database files. How do I get the original HTML files with Nutch? Does Nutch support it? If not, what other tools can I use to achieve my goal? (Tools that support distributed crawling are better.) Well, Nutch will write the crawled data in binary form, so if you want it saved in HTML format you will have to modify the code (this will be painful if you are new to Nutch). If you want a quick and easy solution
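The answer is cut off at this point. As one hedged illustration of a quick and easy route (plain Python and the standard library rather than Nutch, and only sensible for a small seed list rather than a distributed crawl), the raw HTML can simply be fetched and written to disk:

    import os
    from urllib.parse import urlparse
    from urllib.request import urlopen

    def save_page(url, out_dir='pages'):
        # Fetch the raw bytes and write them to an .html file named after the URL path.
        os.makedirs(out_dir, exist_ok=True)
        name = urlparse(url).path.strip('/').replace('/', '_') or 'index'
        with urlopen(url) as resp:
            body = resp.read()
        with open(os.path.join(out_dir, name + '.html'), 'wb') as f:
            f.write(body)

    for seed in ['http://example.com/']:  # placeholder seed list
        save_page(seed)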