web-crawler

Python Scrapy - populate start_urls from mysql

£可爱£侵袭症+ submitted on 2019-12-18 10:57:13
Question: I am trying to populate start_urls with a SELECT from a MySQL table in spider.py. When I run "scrapy runspider spider.py" I get no output, just that it finished with no errors. I have tested the SELECT query in a Python script and start_urls gets populated with the entries from the MySQL table. spider.py: from scrapy.spider import BaseSpider from scrapy.selector import Selector import MySQLdb class ProductsSpider(BaseSpider): name = "Products" allowed_domains = ["test.com"] start_urls = [] def
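No answer excerpt is included in this listing, so here is a minimal, hedged sketch of one common way to seed start_urls from MySQL, assuming a hypothetical products table with a url column and the current scrapy.Spider base class. If start_urls is still empty when the crawl begins, Scrapy simply finishes without output, which matches the symptom described.

```python
# Minimal sketch, not the original poster's code: build the URL list before
# the crawl starts and assign it to start_urls. Connection settings and the
# "products" table / "url" column are hypothetical placeholders.
import MySQLdb
import scrapy


def urls_from_db():
    conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT url FROM products")
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()


class ProductsSpider(scrapy.Spider):
    name = "Products"
    allowed_domains = ["test.com"]
    start_urls = urls_from_db()  # evaluated once, when the class is defined

    def parse(self, response):
        # minimal callback so the spider produces visible output
        yield {"url": response.url, "title": response.css("title::text").get()}
```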

Distributed Web crawling using Apache Spark - Is it Possible?

雨燕双飞 submitted on 2019-12-18 10:44:59
Question: An interesting question was asked of me at an interview about web mining. The question was: is it possible to crawl websites using Apache Spark? I guessed that it was possible because of Spark's distributed processing capacity. After the interview I searched for this, but couldn't find any interesting answer. Is that possible with Spark? Answer 1: How about this way: your application would get a set of website URLs as input for your crawler, if you are implementing
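A rough PySpark sketch of that idea: Spark only distributes the work across partitions, while the fetching itself is ordinary Python. The seed URLs, partition count, and per-page processing below are placeholders, not part of the original answer.

```python
# Hedged sketch: distribute a batch of seed URLs across Spark partitions and
# fetch each one with a plain HTTP client. Seeds and parsing are placeholders.
import requests
from pyspark.sql import SparkSession


def fetch(url):
    try:
        resp = requests.get(url, timeout=10)
        return (url, resp.status_code, len(resp.text))
    except requests.RequestException as exc:
        return (url, None, str(exc))


if __name__ == "__main__":
    spark = SparkSession.builder.appName("spark-crawl-sketch").getOrCreate()
    seeds = ["http://example.com/", "http://example.org/"]
    results = spark.sparkContext.parallelize(seeds, numSlices=4).map(fetch).collect()
    for url, status, info in results:
        print(url, status, info)
    spark.stop()
```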

how to tell if a web request is coming from google's crawler?

断了今生、忘了曾经 submitted on 2019-12-18 08:32:03
Question: From the HTTP server's perspective. Answer 1: I have captured Google crawler requests in my ASP.NET application, and here is how the signature of the Google crawler looks. Requesting IP: 66.249.71.113 Client: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) My logs show many different IPs for the Google crawler in the 66.249.71.* range. All these IPs are geo-located in Mountain View, CA, USA. A nice solution to check whether the request is coming from the Google crawler would be to verify
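A common way to verify this without hard-coding IP ranges is a reverse DNS lookup followed by a forward confirmation. A standard-library Python sketch of that check, using the IP captured in the answer above as the example input:

```python
# Hedged sketch of reverse-DNS verification: resolve the requesting IP to a
# host name, require a googlebot.com / google.com suffix, then forward-resolve
# that name and confirm it maps back to the same IP. Standard library only.
import socket


def is_googlebot(ip):
    try:
        host, _, _ = socket.gethostbyaddr(ip)      # reverse lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return socket.gethostbyname(host) == ip    # forward confirmation
    except socket.gaierror:
        return False


print(is_googlebot("66.249.71.113"))  # the IP from the captured request above
```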

Is there a .NET equivalent of Perl's LWP / WWW::Mechanize?

本秂侑毒 submitted on 2019-12-18 08:23:20
Question: After working with .NET's HttpWebRequest / Response objects, I'd rather shoot myself than use them to crawl through web sites. I'm looking for an existing .NET library that can fetch URLs and give you the ability to follow links, extract/fill in/submit forms on the page, etc. Perl's LWP and WWW::Mechanize modules do this very well, but I'm working on a .NET project. I've come across the HTML Agility Pack, which looks awesome, but it stops short of simulating links/forms. Does such a tool

crawl dynamic web page using htmlunit

丶灬走出姿态 submitted on 2019-12-18 03:36:06
Question: I am crawling data with HtmlUnit from a dynamic web page that uses infinite scrolling to fetch data dynamically, just like Facebook's news feed. I used the following statements to simulate the scroll-down event: webclient.setJavaScriptEnabled(true); webclient.setAjaxController(new NicelyResynchronizingAjaxController()); ScriptResult sr=myHtmlPage.executeJavaScript("window.scrollBy(0,600)"); webclient.waitForBackgroundJavaScript(10000); myHtmlPage=(HtmlPage)sr.getNewPage(); But it seems

Ruby on Rails, How to determine if a request was made by a robot or search engine spider?

风流意气都作罢 submitted on 2019-12-17 22:41:34
Question: I have a Rails app that records an IP address from every request to a specific URL, but in my IP database I've found Facebook block IPs like 66.220.15.* and Google IPs (I suspect these come from bots). Is there any way to determine whether a request came from a robot or search engine spider? Thanks. Answer 1: Robots are required (by common sense / courtesy more than any kind of law) to send along a User-Agent with their request. You can check for this using request.env["HTTP_USER_AGENT"] and filter as
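The User-Agent check itself is framework-agnostic; a rough Python sketch of the same idea follows (in Rails the string would come from request.env["HTTP_USER_AGENT"]; the token list is illustrative, not an exhaustive bot registry):

```python
# Framework-agnostic sketch of the User-Agent filter the answer describes.
# The bot token list is illustrative only, not a complete registry.
BOT_TOKENS = ("googlebot", "bingbot", "slurp", "facebookexternalhit", "crawler", "spider")


def looks_like_bot(user_agent):
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_TOKENS)


print(looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(looks_like_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"))                    # False
```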

Make a JavaScript-aware Crawler

a 夏天 submitted on 2019-12-17 20:01:22
Question: I want to make a script that crawls a website and returns the locations of all the banners shown on that page. The banner locations usually come from known domains, but the banners are not in the HTML as a plain image or .swf file; most of the time JavaScript is used to show the banner. So if a .swf file or image file is loaded from a banner domain, the script should return that URL. Is that possible to do? And how could I do that, roughly? Best would be if it can also return
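No answer excerpt appears in this listing, but one rough way to do this is to let the page's JavaScript run in a headless browser and then inspect what it injected. The sketch below uses Python with Selenium as a stand-in rendering engine, not a technique from the original thread; the banner-domain list, target URL, and wait time are placeholders.

```python
# Rough sketch, not from the original thread: render the page in a headless
# browser so banner JavaScript runs, then collect img/iframe/embed sources and
# report those served from known banner domains. Domain list is a placeholder.
import time
from urllib.parse import urlparse

from selenium import webdriver
from selenium.webdriver.common.by import By

BANNER_DOMAINS = {"ads.example.com", "banners.example.net"}  # hypothetical

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
try:
    driver.get("http://www.example.com/")
    time.sleep(5)  # crude wait so banner scripts can finish injecting content
    found = []
    for tag in ("img", "iframe", "embed"):
        for element in driver.find_elements(By.TAG_NAME, tag):
            src = element.get_attribute("src") or ""
            if urlparse(src).hostname in BANNER_DOMAINS:
                found.append(src)
    print(found)
finally:
    driver.quit()
```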

How to generate the start_urls dynamically in crawling?

ε祈祈猫儿з submitted on 2019-12-17 17:33:17
Question: I am crawling a site which may contain a lot of start_urls, like: http://www.a.com/list_1_2_3.htm I want to populate start_urls matching [list_\d+_\d+_\d+\.htm] and extract items from URLs matching [node_\d+\.htm] during crawling. Can I use CrawlSpider to realize this? And how can I generate the start_urls dynamically during crawling? Answer 1: The best way to generate URLs dynamically is to override the start_requests method of the spider: from scrapy.http.request import Request def start
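A hedged completion of the start_requests override that the answer begins: the page ranges and the node-link follow-up in parse are placeholders matching the patterns named in the question, not code from the original answer.

```python
# Sketch completing the start_requests idea: generate list_\d+_\d+_\d+.htm
# start URLs on the fly, then follow node_\d+.htm links from each list page.
# The ranges and selectors are placeholders.
import re

import scrapy
from scrapy.http.request import Request


class ListSpider(scrapy.Spider):
    name = "list"
    allowed_domains = ["www.a.com"]

    def start_requests(self):
        # yield the list-page URLs instead of a static start_urls attribute
        for a in range(1, 4):
            for b in range(1, 4):
                for c in range(1, 4):
                    yield Request(f"http://www.a.com/list_{a}_{b}_{c}.htm", callback=self.parse)

    def parse(self, response):
        # follow node_\d+.htm links found on each list page
        for href in response.css("a::attr(href)").getall():
            if re.search(r"node_\d+\.htm$", href):
                yield response.follow(href, callback=self.parse_node)

    def parse_node(self, response):
        yield {"url": response.url}
```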