web-crawler

Redirect all bots using .htaccess (Apache)

Submitted by 旧时模样 on 2019-11-27 22:49:35
Question: What .htaccess RewriteRule should I use to detect known bots, for example the big ones: AltaVista, Google, Bing, Yahoo? I know I can check their IPs or hosts, but is there a better way?

Answer 1: Match on the User-Agent header instead:

    RewriteCond %{HTTP_USER_AGENT} AltaVista [OR]
    RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]
    RewriteCond %{HTTP_USER_AGENT} msnbot [OR]
    RewriteCond %{HTTP_USER_AGENT} Slurp
    RewriteRule ^.*$ IHateBots.html [L]

Source: https://stackoverflow.com/questions/2691956/redirect-all-bots-using-htaccess-apache
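If you prefer a single rule, here is a more compact sketch of the same idea (assuming the same bot list; bingbot is added for Bing's current crawler, and the [NC] flag makes the match case-insensitive):

```apache
# Hypothetical combined rule: one case-insensitive alternation
RewriteCond %{HTTP_USER_AGENT} (altavista|googlebot|bingbot|msnbot|slurp) [NC]
RewriteRule ^.*$ IHateBots.html [L]
```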

Scrapy - how to identify already scraped URLs

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-11-27 21:26:55
Question: I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping URLs that have already been scraped? Also, is there any clear documentation or examples on SgmlLinkExtractor?

Answer (Jama22): You can actually do this quite easily with the Scrapy snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/ To use it, copy the code from the link into a file in your Scrapy project, then reference it in your settings.py:

    SPIDER_MIDDLEWARES = {
        'project.middlewares.ignore.IgnoreVisitedItems': 560,
    }

The …
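In case the snippet link goes stale, here is a minimal sketch of the same idea written as a downloader middleware (the class name, module path, and seen_urls.txt file are all hypothetical):

```python
# project/middlewares/seen_urls.py -- hypothetical location
import os
from scrapy.exceptions import IgnoreRequest

class SeenUrlsMiddleware:
    """Drop requests for URLs that were already fetched on a previous run."""

    def __init__(self, path='seen_urls.txt'):
        self.path = path
        self.seen = set()
        if os.path.exists(path):
            with open(path) as f:
                self.seen = {line.strip() for line in f}

    def process_request(self, request, spider):
        # Raising IgnoreRequest silently discards the request.
        if request.url in self.seen:
            raise IgnoreRequest('already scraped: %s' % request.url)

    def process_response(self, request, response, spider):
        # Record the URL only once it has actually been fetched.
        self.seen.add(request.url)
        with open(self.path, 'a') as f:
            f.write(request.url + '\n')
        return response
```

Because it hooks the request/response cycle, this would be enabled under DOWNLOADER_MIDDLEWARES rather than SPIDER_MIDDLEWARES. Alternatively, running a spider with scrapy crawl myspider -s JOBDIR=crawls/run1 persists Scrapy's built-in duplicate filter between runs.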

How to stop the page loading in Firefox programmatically?

Submitted by 心不动则不痛 on 2019-11-27 20:59:03
Question: I am running several tests with WebDriver and Firefox. I'm running into a problem with the following command:

    WebDriver.get("http://www.google.com");

With this command, WebDriver blocks until the onload event is fired. While this normally takes seconds, it can take hours on websites that never finish loading. What I'd like to do is stop loading the page after a certain timeout, somehow simulating Firefox's stop button. I first tried executing the following JS code every time that I tried loading a …
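A sketch of one common approach, using Selenium's Python bindings: set a page-load timeout so get() gives up, then halt any remaining loading with window.stop():

```python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
driver.set_page_load_timeout(10)  # seconds before get() raises

try:
    driver.get('http://www.google.com')
except TimeoutException:
    # Simulate the stop button, then continue with whatever has loaded.
    driver.execute_script('window.stop();')
```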

Locally run all of the spiders in Scrapy

Submitted by 不羁的心 on 2019-11-27 20:45:30
Question: Is there a way to run all of the spiders in a Scrapy project without using the Scrapy daemon? There used to be a way to run multiple spiders with scrapy crawl, but that syntax was removed and Scrapy's code changed quite a bit. I tried creating my own command:

    from scrapy.command import ScrapyCommand
    from scrapy.utils.misc import load_object
    from scrapy.conf import settings

    class Command(ScrapyCommand):
        requires_project = True

        def syntax(self):
            return '[options]'

        def short_desc(self):
            return 'Runs all of the spiders'

        def run(self, args, opts):
            spman_cls = load_object(settings['SPIDER_MANAGER…
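On current Scrapy versions the usual route is CrawlerProcess rather than a custom command. A minimal sketch, assuming it is run from the project root so get_project_settings() can find settings.py:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for spider_name in process.spider_loader.list():  # every spider in the project
    process.crawl(spider_name)
process.start()  # blocks until all spiders have finished
```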

Python: maximum recursion depth exceeded while calling a Python object

Submitted by 女生的网名这么多〃 on 2019-11-27 20:26:13
Question: I've built a crawler that has to run on about 5M pages (stepping through URL IDs) and then parse the pages that contain the info I need. After running an algorithm over the first 200K URLs and saving the good and bad results, I found that I was wasting a lot of time. I could see that there are a few recurring offsets between valid IDs, which I can use to predict the next valid URL. You can spot the offsets quite quickly (a small example of the first few "good IDs"):

    510000011  # +8
    510000029  # +18
    510000037  # +8
    510000045  # +8
    510000052  # +7
    510000060  # +8
    510000078  # +18
    510000086  # +8
    510000094  # +8
    …
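A sketch of that idea: generate candidate IDs by cycling through the observed offsets instead of testing every integer. The start ID and offset cycle below are read off the sample above and may not hold over the full range; written as a plain loop, this also sidesteps the "maximum recursion depth exceeded" error from the title, which a recursively written crawl loop runs into.

```python
from itertools import cycle

START_ID = 510000011
OFFSETS = [18, 8, 8, 7, 8, 18, 8, 8]  # pattern guessed from the sample

def candidate_ids(start, offsets, count):
    """Yield `count` candidate IDs, stepping by the repeating offsets."""
    current = start
    for step in cycle(offsets):
        yield current
        count -= 1
        if count == 0:
            return
        current += step

for url_id in candidate_ids(START_ID, OFFSETS, 9):
    print('https://example.com/item/%d' % url_id)  # crawl instead of print
```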

Difference between find and filter in jQuery

Submitted by 懵懂的女人 on 2019-11-27 20:05:39
Question: I'm working on fetching data from wiki pages, using a combination of PHP and jQuery. First I use curl in PHP to fetch the page contents and echo them. The filename is content.php:

    $url = $_GET['url'];
    $url = trim($url, " ");
    $url = urldecode($url);
    $url = str_replace(" ", "%20", $url);
    echo "<a class='urlmax'>" . $_GET['title'] . "</a>";
    echo crawl($url);

Then jQuery is used to find the matched elements:

    $.get("content.php", {
        url: "http://en.wikipedia.org/w/index.php?action=render&title=" + str_replace(" ", "_", data[x]),
        title: str_replace(" ", "_", data[x])
    }, function (hdata) {
        var …
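As for find versus filter: .filter() reduces the matched set to the elements that themselves match the selector, while .find() searches inside them for matching descendants. A minimal sketch against a hypothetical fragment:

```javascript
var $divs = $('<div class="a"><p>one</p></div><div class="b"><p>two</p></div>');

$divs.filter('.a'); // the first <div> itself (tests the set's own elements)
$divs.find('p');    // both <p> elements (searches the descendants)
```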

Detect Search Crawlers via JavaScript

Submitted by 一世执手 on 2019-11-27 20:01:45
Question: I am wondering how I would go about detecting search crawlers? The reason I ask is because I want to suppress certain JavaScript calls if the user agent is a bot. I have found an example of how to detect a certain browser, but am unable to find examples of how to detect a search crawler:

    /MSIE (\d+\.\d+);/.test(navigator.userAgent); // test for MSIE x.x

Examples of search crawlers I want to block:

    Google
    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    Googlebot/2…
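The same regex-test pattern works for crawlers. A minimal sketch (the token list is illustrative, not exhaustive, and user-agent strings can be spoofed):

```javascript
var botPattern = /googlebot|bingbot|msnbot|slurp|duckduckbot|baiduspider|yandex/i;

if (botPattern.test(navigator.userAgent)) {
    // Likely a crawler: skip the JavaScript calls meant for human visitors.
}
```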

Save all image files from a website

Submitted by 半世苍凉 on 2019-11-27 18:36:57
Question: I'm creating a small app for myself where I run a Ruby script and save all of the images off of my blog. I can't figure out how to save the image files after I've identified them. Any help would be much appreciated.

    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'

    url = '[my blog url]'
    doc = Nokogiri::HTML(open(url))

    doc.css("img").each do |item|
      # something
    end

Answer 1:

    URL = '[my blog url]'

    require 'nokogiri'  # gem install nokogiri
    require 'open-uri'  # already part of your ruby install
    …
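A sketch of the missing loop body, assuming each src should be resolved against the page URL and written to the current directory (URI.open is the modern open-uri entry point; older Rubies use plain open):

```ruby
require 'nokogiri'
require 'open-uri'

url = '[my blog url]'
doc = Nokogiri::HTML(URI.open(url))

doc.css('img').each do |img|
  src = URI.join(url, img['src']).to_s          # resolve relative paths
  File.open(File.basename(src), 'wb') do |file| # name the file after the URL
    file.write(URI.open(src).read)
  end
end
```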

Simple web crawler in C#

Submitted by 自闭症网瘾萝莉.ら on 2019-11-27 18:25:56
Question: I have created a simple web crawler, but I want to add recursion so that for every page that is opened I can collect the URLs on that page. I have no idea how to do that, and I also want to include threads to make it faster. Here is my code:

    namespace Crawler
    {
        public partial class Form1 : Form
        {
            String Rstring;

            public Form1()
            {
                InitializeComponent();
            }

            private void button1_Click(object sender, EventArgs e)
            {
                WebRequest myWebRequest;
                WebResponse myWebResponse;
                String URL = textBox1.Text;
                myWebRequest = WebRequest.Create(URL);
                myWebResponse = myWebRequest.GetResponse(); // Returns a …
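A sketch of the recursion the question asks for: keep a set of visited URLs, extract links, and recurse down to a depth limit. The names, the regex, and the depth limit are illustrative; a real crawler should use an HTML parser rather than a regex, and threading could be layered on top (e.g. with Task.Run per link):

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class Crawler
{
    private readonly HashSet<string> visited = new HashSet<string>();

    public void Crawl(string url, int depth)
    {
        // Add() returns false if the URL was already visited.
        if (depth <= 0 || !visited.Add(url)) return;

        string html;
        using (var client = new WebClient())
        {
            try { html = client.DownloadString(url); }
            catch (WebException) { return; } // skip unreachable pages
        }

        // Crude link extraction; good enough for a sketch.
        foreach (Match m in Regex.Matches(html, "href=\"(http[^\"]+)\""))
        {
            Crawl(m.Groups[1].Value, depth - 1);
        }
    }
}
```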

How to programmatically fill input elements built with React?

Submitted by 帅比萌擦擦* on 2019-11-27 17:35:40
Question: I'm tasked with crawling a website built with React. I'm trying to fill in input fields and submit the form using JavaScript injected into the page (via either Selenium or a WebView on mobile). This works like a charm on every other site and technology, but React seems to be a real pain. So here is a sample:

    var email = document.getElementById('email');
    email.value = 'example@mail.com';

The value changes on the DOM input element, but React does not trigger the change event. I've been trying …
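A sketch of the usual workaround (assuming React 16 or later): set the value through the native setter, so React's instrumented setter cannot swallow the assignment, then dispatch an input event for React's synthetic event system to pick up:

```javascript
var email = document.getElementById('email');

// Bypass React's own value setter with the native one...
var nativeSetter = Object.getOwnPropertyDescriptor(
    window.HTMLInputElement.prototype, 'value'
).set;
nativeSetter.call(email, 'example@mail.com');

// ...then fire the event React actually listens for.
email.dispatchEvent(new Event('input', { bubbles: true }));
```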