web-crawler

Using one Scrapy spider for several websites

家住魔仙堡 submitted on 2019-11-27 01:10:39
Question: I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes -- these will instead be configurable in a GUI. How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamically configurable? E.g. I write the configuration to a file, and the spider reads it somehow. Answer 1: WARNING: This answer was for Scrapy v0.7, spider manager api
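A minimal sketch of one common approach in recent Scrapy versions (not the v0.7 spider-manager API the answer targets): read a JSON config in the spider's constructor and build the crawl rules from it. The config file name and its keys are assumptions for illustration.

```python
import json

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ConfigurableSpider(CrawlSpider):
    name = "configurable"

    def __init__(self, config_file="spider_config.json", *args, **kwargs):
        # Assumed config shape: {"start_urls": [...], "allowed_domains": [...],
        # "allow_patterns": ["some-regex", ...]}
        with open(config_file) as f:
            cfg = json.load(f)
        self.start_urls = cfg["start_urls"]
        self.allowed_domains = cfg["allowed_domains"]
        # Rules must be assigned before calling the parent constructor,
        # because CrawlSpider compiles them there.
        self.rules = (
            Rule(LinkExtractor(allow=cfg["allow_patterns"]),
                 callback="parse_item", follow=True),
        )
        super().__init__(*args, **kwargs)

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```

The GUI would only need to write the JSON file; you can also point a run at a different file with `scrapy crawl configurable -a config_file=other.json`.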

Does solr do web crawling?

青春壹個敷衍的年華 submitted on 2019-11-27 01:02:16
Question: I am interested in doing web crawling. I was looking at Solr. Does Solr do web crawling, or what are the steps to do web crawling? Answer 1: Solr 5+ DOES in fact now do web crawling! http://lucene.apache.org/solr/ Older Solr versions do not do web crawling alone, as historically it's a search server that provides full-text search capabilities. It builds on top of Lucene. If you need to crawl web pages using another Solr project then you have a number of options including: Nutch - http://lucene
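To make the usual division of labour concrete, here is a hedged sketch: you crawl and parse pages yourself (or with a crawler such as Nutch) and push documents to Solr's standard JSON update handler for indexing. The core name "pages" and the field names are assumptions.

```python
import requests
from bs4 import BeautifulSoup

SOLR_UPDATE = "http://localhost:8983/solr/pages/update?commit=true"

def index_page(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    doc = {
        "id": url,
        "title": soup.title.string if soup.title else "",
        "content": soup.get_text(" ", strip=True),
    }
    # The update handler accepts a JSON array of documents.
    requests.post(SOLR_UPDATE, json=[doc], timeout=10).raise_for_status()

index_page("https://example.com/")
```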

How do I stop all spiders and the engine immediately after a condition in a pipeline is met?

北慕城南 submitted on 2019-11-27 00:58:38
Question: We have a system written with Scrapy to crawl a few websites. There are several spiders, and a few cascaded pipelines for all items passed by all crawlers. One of the pipeline components queries the Google servers to geocode addresses. Google imposes a limit of 2500 requests per day per IP address, and threatens to ban an IP address if it keeps querying even after Google has responded with the warning message 'OVER_QUERY_LIMIT'. Hence I want to know about any mechanism which I
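A hedged sketch of one way to do this from inside a pipeline: when the geocoder reports OVER_QUERY_LIMIT, ask the engine to shut the spider down rather than silently burning the daily quota. `geocode()` stands in for whatever client the pipeline actually uses.

```python
from scrapy.exceptions import DropItem


class GeocodingPipeline:
    def process_item(self, item, spider):
        status, location = geocode(item["address"])  # hypothetical helper
        if status == "OVER_QUERY_LIMIT":
            # Stops this spider's crawl; with several spiders in one
            # process, each spider's pipeline instance can do the same.
            spider.crawler.engine.close_spider(
                spider, reason="google_quota_exceeded")
            raise DropItem("Geocoding quota exceeded")
        item["location"] = location
        return item
```

Note that `close_spider` lets in-flight requests finish, so a few extra geocoding calls may still go out after the condition trips.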

Click a Button in Scrapy

点点圈 submitted on 2019-11-27 00:56:29
I'm using Scrapy to crawl a webpage. Some of the information I need only pops up when you click on a certain button (and of course it then appears in the HTML code after clicking). I found out that Scrapy can handle forms (like logins) as shown here. But the problem is that there is no form to fill out, so that's not exactly what I need. How can I simply click a button, which then shows the information I need? Do I have to use an external library like mechanize or lxml? Scrapy cannot interpret JavaScript. If you absolutely must interact with the JavaScript on the page, you want to be using Selenium.
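A minimal Selenium sketch (the URL and button selector are assumptions): click the button, wait for the revealed content, then hand the resulting HTML to whatever parser you like, including Scrapy selectors.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
try:
    driver.get("https://example.com/page")
    driver.find_element(By.CSS_SELECTOR, "button.show-more").click()
    # Wait until the JavaScript has injected the hidden section.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "details")))
    html = driver.page_source  # parse with scrapy.Selector(text=html) if desired
finally:
    driver.quit()
```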

crawler vs scraper

倾然丶 夕夏残阳落幕 submitted on 2019-11-27 00:50:38
Question: Can somebody distinguish between a crawler and a scraper in terms of scope and functionality? Answer 1: A crawler gets web pages -- i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore), it downloads whatever is linked to from the starting point(s). A scraper takes pages that have been downloaded or, in a more general sense, data that's formatted for display, and (attempts to) extract data from those pages, so
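An illustrative sketch of the split the answer describes: `crawl()` fetches pages and follows links to a given depth, while `scrape()` only extracts fields from a page already in hand. The URL and the extracted fields are placeholders.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(start_url, depth=1, seen=None):
    """Crawler: download pages reachable from start_url, depth links deep."""
    seen = seen if seen is not None else set()
    if depth < 0 or start_url in seen:
        return
    seen.add(start_url)
    html = requests.get(start_url, timeout=10).text
    yield start_url, html
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        yield from crawl(urljoin(start_url, a["href"]), depth - 1, seen)


def scrape(html):
    """Scraper: pull structured data out of one already-downloaded page."""
    soup = BeautifulSoup(html, "html.parser")
    return {"title": soup.title.string if soup.title else None}


for url, html in crawl("https://example.com/", depth=1):
    print(url, scrape(html))
```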

Concurrent downloads - Python

大兔子大兔子 submitted on 2019-11-27 00:39:37
Question: The plan is this: I download a webpage, collect a list of images parsed from the DOM, and then download those. After this I would iterate through the images in order to evaluate which image is best suited to represent the webpage. The problem is that the images are downloaded one by one, and this can take quite some time. It would be great if someone could point me in some direction regarding the topic. Help would be very much appreciated. Answer 1: Speeding up crawling is basically Eventlet's main use case. It's
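A sketch close to Eventlet's own web-crawler example: a GreenPool fetches the image URLs concurrently instead of one by one. The URL list stands in for whatever was parsed out of the DOM.

```python
import eventlet
from eventlet.green.urllib import request  # green (non-blocking) urllib

urls = ["https://example.com/a.jpg",
        "https://example.com/b.jpg",
        "https://example.com/c.jpg"]

def fetch(url):
    return url, request.urlopen(url).read()

pool = eventlet.GreenPool(size=20)  # at most 20 downloads in flight
for url, body in pool.imap(fetch, urls):
    print(url, len(body), "bytes")
```

The green urllib yields to other greenlets while waiting on the network, so the downloads overlap without threads; the same pattern works with `concurrent.futures.ThreadPoolExecutor` if you'd rather avoid Eventlet.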

What is the difference between web-crawling and web-scraping? [duplicate]

孤街醉人 submitted on 2019-11-26 23:59:45
Question: This question already has an answer here: crawler vs scraper (4 answers). Is there a difference between crawling and web scraping? If there's a difference, what's the best method to use in order to collect some web data to supply a database for later use in a customised search engine? Answer 1: Crawling would be essentially what Google, Yahoo, MSN, etc. do, looking for ANY information. Scraping is generally targeted at certain websites, for specific data, e.g. for price comparison, so are coded quite

How to request Google to re-crawl my website? [closed]

帅比萌擦擦* submitted on 2019-11-26 23:45:27
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 4 years ago. Does someone know a way to ask Google to re-crawl a website? If possible, this shouldn't take months. My site is showing an old title in Google's search results. How can I get it to show the correct title and description? Answer 1: There are two options. The first (and better) one is using the Fetch as Google
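The other common option from that era was pinging Google with your sitemap URL, which was easy to script; a small sketch follows. Google has since retired this ping endpoint, so treat it as historical -- Search Console is the supported route today.

```python
import requests

sitemap = "https://example.com/sitemap.xml"
resp = requests.get(
    "https://www.google.com/ping", params={"sitemap": sitemap}, timeout=10)
print(resp.status_code)  # 200 meant the ping was accepted
```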

Asp.net Request.Browser.Crawler - Dynamic Crawler List?

走远了吗. submitted on 2019-11-26 23:28:17
Question: I learned Why Request.Browser.Crawler is Always False in C# (http://www.digcode.com/default.aspx?page=ed51cde3-d979-4daf-afae-fa6192562ea9&article=bc3a7a4f-f53e-4f88-8e9c-c9337f6c05a0). Does anyone use some method to dynamically update the crawler list, so Request.Browser.Crawler will be really useful? Answer 1: I've been happy with the results supplied by Ocean's Browsercaps. It supports crawlers that Microsoft's config files have not bothered to detect. It will even parse out what version of
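The idea behind a dynamically updated crawler list, sketched in Python for brevity (the question itself is about ASP.NET browsercaps files): keep the patterns in a file you can refresh without redeploying, and match the User-Agent header against them. The file name and patterns are assumptions.

```python
import json
import re


def load_crawler_patterns(path="crawler_patterns.json"):
    # File contents, e.g.: ["Googlebot", "bingbot", "Slurp", "DuckDuckBot"]
    with open(path) as f:
        return [re.compile(p, re.I) for p in json.load(f)]


def is_crawler(user_agent, patterns):
    return any(p.search(user_agent) for p in patterns)


patterns = load_crawler_patterns()
print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1)", patterns))  # True
```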

Send Post Request in Scrapy

a 夏天 submitted on 2019-11-26 22:53:41
Question: I am trying to crawl the latest reviews from the Google Play store, and to get them I need to make a POST request. With Postman it works and I get the desired response, but a POST request from the terminal gives me a server error. For example, for this page https://play.google.com/store/apps/details?id=com.supercell.boombeach : curl -H "Content-Type: application/json" -X POST -d '{"id": "com.supercell.boombeach", "reviewType": '0', "reviewSortOrder": '0', "pageNum":'0'}' https://play.google.com/store/getreviews
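A hedged Scrapy sketch: that endpoint historically expected form-encoded fields rather than a JSON body, which is exactly what `scrapy.FormRequest` sends (and one likely reason the curl command above, which posts JSON with broken quoting, fails).

```python
import scrapy


class ReviewsSpider(scrapy.Spider):
    name = "playstore_reviews"

    def start_requests(self):
        yield scrapy.FormRequest(
            "https://play.google.com/store/getreviews",
            formdata={
                "id": "com.supercell.boombeach",
                "reviewType": "0",
                "reviewSortOrder": "0",
                "pageNum": "0",
            },
            callback=self.parse_reviews,
        )

    def parse_reviews(self, response):
        # The body is not plain JSON; inspect response.text before parsing.
        self.logger.info("Got %d bytes", len(response.body))
```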