web-crawler

Prevent Custom Web Crawler from being blocked

Submitted on 2019-12-04 14:35:34
Question: I am creating a new web crawler in C# to crawl some specific websites. Everything goes fine, but the problem is that some websites block my crawler's IP address after a number of requests. I tried using timestamps between my crawl requests, but that did not work. Is there any way to prevent websites from blocking my crawler? Solutions like these would help (but I need to know how to apply them): simulating Googlebot or Yahoo! Slurp, or using multiple IP addresses (even fake IP addresses) as…
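
The question is about C#, but the usual countermeasures (throttling requests, presenting a browser-like User-Agent, rotating the source address through proxies) are language-agnostic. Below is a minimal illustrative sketch in Python, not the asker's code: the User-Agent strings and the proxy endpoint are placeholders, and rotating identities only makes sense for sites you have permission to crawl.

    import random
    import time

    import requests

    # Placeholder identities: swap in real browser User-Agent strings and, if
    # available, real proxy endpoints. "proxy1.example.com" is hypothetical.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]
    PROXIES = [None, {"http": "http://proxy1.example.com:8080"}]

    def polite_get(url):
        """Fetch a URL with a jittered delay, a rotated User-Agent and proxy."""
        time.sleep(random.uniform(2.0, 6.0))                 # pause between requests
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        return requests.get(url, headers=headers,
                            proxies=random.choice(PROXIES), timeout=30)

    print(polite_get("http://example.com/").status_code)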

Creating a bot/crawler

Submitted on 2019-12-04 14:23:00
问题 I would like to make a small bot in order to automatically and periodontally surf on a few partner website. This would save several hours to a lot of employees here. The bot must be able to : connect to this website, on some of them log itself as a user, access and parse a particular information on the website. The bot must be integrated to our website and change it's settings (used user…) with data of our website. Eventually it must sum up the parse information. Preferably this operation
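
A minimal sketch of the "log in, fetch, parse" loop such a bot needs, using Python's requests and BeautifulSoup. The /login and /reports paths, the form field names, and the CSS selector are all assumptions standing in for whatever the partner site actually uses.

    import requests
    from bs4 import BeautifulSoup

    def fetch_partner_info(base_url, username, password):
        """Log in to a partner site and return the text of the cells we care about."""
        session = requests.Session()                     # keeps the login cookie
        # The /login path and the form field names are assumptions about the site.
        session.post(base_url + "/login",
                     data={"user": username, "password": password})
        page = session.get(base_url + "/reports")        # page holding the data
        soup = BeautifulSoup(page.text, "html.parser")
        # "table.report td" is a placeholder selector for the information to parse.
        return [td.get_text(strip=True) for td in soup.select("table.report td")]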

What is a good Web search and web crawling engine for Java?

Submitted on 2019-12-04 13:32:40
I am working on an application where I need to integrate a search engine, which should also do the crawling. Please suggest a good Java-based search engine. Thank you in advance. Nutch (built on Lucene) is an open-source engine which should satisfy your needs. In the past I worked with Terrier, a search engine written in Java: Terrier is a highly flexible, efficient, effective, and robust search engine, readily deployable on large-scale collections of documents. Terrier implements state-of-the-art indexing and retrieval functionalities. Terrier provides an ideal platform for the rapid development…

Ruby+Anemone Web Crawler: regex to match URLs ending in a series of digits

Submitted on 2019-12-04 13:28:28
Suppose I was trying to crawl a website and skip any page whose URL ends like so: http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117 I am currently using the Anemone gem in Ruby to build the crawler. I am using the skip_links_like method, but my pattern never seems to match. I am trying to make this as generic as possible, so it isn't dependent on subpage but just on =2105925 (the trailing digits). I have tried /=\d+$/ and /\?.*\d+$/ but neither seems to work. This is similar to "Skipping web-pages with extension pdf, zip from crawling in Anemone", but I can't make it work with digits.
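
The /=\d+$/ pattern does match a URL like the sample, which suggests the problem lies in how the pattern reaches Anemone rather than in the regex itself. A quick check of the pattern (shown in Python purely to illustrate; skip_links_like takes the equivalent Ruby Regexp /=\d+$/):

    import re

    url = ("http://HIDDENWEBSITE.com/anonimize/index.php"
           "?page=press_and_news&subpage=20060117")

    pattern = re.compile(r"=\d+$")    # an '=' followed only by digits up to the end
    print(bool(pattern.search(url)))  # True -> a URL like this would be skipped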

Web-Scraping with R

Submitted on 2019-12-04 13:12:31
I'm having some problems scraping data from a website. First, I don't have a lot of experience with web scraping... My plan is to scrape some data using R from the following website: http://spiderbook.com/company/17495/details?rel=300795 In particular, I want to extract the links to the articles on this site. My idea so far: xmltext <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795") sources <- xpathApply(xmltext, "//body//div") sourcesCharSep <- lapply(sourcesChar, function(x) unlist(strsplit(x, " "))) sourcesInd <- lapply(sourcesCharSep, function(x) grep('"(http://[^"]*)"…
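
Rather than serializing nodes to text and regex-splitting it, the links can be pulled directly from the href attributes. A sketch of that idea is below, in Python with lxml for illustration only; the same XPath ("//a/@href") should work in R with xpathSApply from the XML package the question already uses.

    import requests
    from lxml import html

    page = requests.get("http://spiderbook.com/company/17495/details?rel=300795")
    tree = html.fromstring(page.content)
    links = tree.xpath("//a/@href")                    # every href in the document
    article_links = [l for l in links if l.startswith("http")]
    print(article_links)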

how to allow known web crawlers and block spammers and harmful robots from scanning asp.net website

Submitted on 2019-12-04 13:04:10
Question: How can I configure my site to allow crawling by well-known robots like Google, Bing, Yahoo, Alexa, etc., and stop other harmful spammers and robots? Should I block particular IPs? Please discuss any pros and cons. Is there anything to be done in web.config or IIS? Can I do it server-wide if I have a VPS with root access? Thanks. Answer 1: I'd recommend that you take a look at the answer I posted to a similar question: How to identify web-crawler? Robots.txt: robots.txt is useful for polite bots, but spammers are…
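
The linked answer's core technique is verifying a claimed crawler by DNS rather than trusting its User-Agent: reverse-resolve the IP, check the hostname's domain, then forward-resolve the hostname and confirm it maps back to the same IP. A sketch of that check, assuming a small whitelist of documented crawler domains (extend or adjust as needed):

    import socket

    # Domains the major engines document for their crawlers; not exhaustive.
    TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

    def is_known_crawler(ip):
        """Reverse-resolve the IP, check the domain, then forward-confirm it."""
        try:
            host = socket.gethostbyaddr(ip)[0]              # e.g. crawl-66-249-66-1.googlebot.com
            if not host.endswith(TRUSTED_SUFFIXES):
                return False
            return ip in socket.gethostbyname_ex(host)[2]   # forward lookup must match
        except (socket.herror, socket.gaierror):
            return False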

How to understand this raw HTML of Yahoo! Finance when retrieving data using Python?

Submitted on 2019-12-04 12:49:56
Question: I've been trying to retrieve stock prices from Yahoo! Finance, for example for Apple Inc. My code looks like this (using Python 2): import requests from bs4 import BeautifulSoup as bs html='http://finance.yahoo.com/quote/AAPL/profile?p=AAPL' r = requests.get(html) soup = bs(r.text) The problem is that when I look at the raw HTML behind this web page, the class names are dynamically generated (see the figure "HTML of Yahoo! Finance page"). This makes it hard for BeautifulSoup to find the tags. How can I make sense of these classes and get the data? PS…
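
One general way around auto-generated class names is to stop selecting by class at all and match on something more stable, such as the tag name plus a predicate on its text. The sketch below illustrates the approach only; it makes no promises about Yahoo! Finance's current markup, and the price-shaped regex is an assumption.

    import re

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("http://finance.yahoo.com/quote/AAPL/profile?p=AAPL",
                        headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")

    # Rather than soup.find("span", class_="...") with a generated class name,
    # match any <span> whose text looks like a price (illustrative heuristic).
    price_like = soup.find_all("span", string=re.compile(r"^\d[\d,]*\.\d{2}$"))
    print([s.get_text() for s in price_like])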

Cannot navigate with casperjs evaluate and __doPostBack function

Submitted on 2019-12-04 12:09:04
When I try to navigate the pagination on sites where the links' href is a __doPostBack function call, I never get the page to change. I am not sure what I am missing, so after a few hours of messing around I decided to see if someone here can give me a clue. This is my code (uber-simplified to show the use case): var casper = require('casper').create({ verbose: true, logLevel: "debug" }); casper.start('http://www.gallito.com.uy/inmuebles/venta'); // here I simulate the click on a link in the pagination list casper.evaluate(function() { __doPostBack('RptPagerDos$ctl08$lnkPage2',''); }); casper…

How to limit number of followed pages per site in Python Scrapy

Submitted on 2019-12-04 11:58:16
Question: I am trying to build a spider that can efficiently scrape text information from many websites. Since I am a Python user, I was referred to Scrapy. However, in order to avoid scraping huge websites, I want to limit the spider to scrape no more than 20 pages of a certain "depth" per website. Here is my spider: class DownloadSpider(CrawlSpider): name = 'downloader' download_path = '/home/MyProjects/crawler' rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),) def __init__…
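
One way to cap pages per site, sketched below against current Scrapy (LinkExtractor replaces the long-deprecated SgmlLinkExtractor in the snippet): count parsed responses per domain and stop returning extracted links for a domain once its budget is spent. This is not the asker's code, and the cap is approximate because already-queued requests still run; if a global limit or a pure depth cap is enough, the built-in CLOSESPIDER_PAGECOUNT and DEPTH_LIMIT settings are simpler.

    from collections import defaultdict
    from urllib.parse import urlparse

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class DownloadSpider(CrawlSpider):
        name = "downloader"
        max_pages_per_site = 20
        rules = (Rule(LinkExtractor(), callback="parse_item",
                      follow=True, process_links="limit_links"),)

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.page_counts = defaultdict(int)      # pages parsed per domain

        def limit_links(self, links):
            # Drop links whose domain has already used up its page budget.
            return [link for link in links
                    if self.page_counts[urlparse(link.url).netloc]
                    < self.max_pages_per_site]

        def parse_item(self, response):
            self.page_counts[urlparse(response.url).netloc] += 1
            yield {"url": response.url}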

Sharepoint 2010 search cannot crawl mediawiki site

Submitted on 2019-12-04 11:22:09
Using SharePoint 2010 enterprise search, we are trying to crawl our internal MediaWiki-based wiki site. The search fails with the error: 'The URL was permanently moved. (URL redirected to ...)'. Since the wiki site has case-sensitive URLs, when SharePoint 2010 tries to crawl with lower-case URL names, the wiki says 'page does not exist' and redirects with a 301. Has anyone got a solution? Thanks in advance. By default, all links crawled are converted to lower case by the SharePoint search indexer. You will need to create case-sensitive crawl rules. Have a look at the following post: http://blogs.msdn…