web-crawler

Apache Nutch won't crawl a website

Submitted by 久未见 on 2020-01-05 07:14:36
Question: I have installed Apache Nutch for web crawling. I want to crawl a website whose robots.txt reads: User-Agent: * Disallow: / Is there any way to crawl this website with Apache Nutch?

Answer 1: In nutch-site.xml, set protocol.plugin.check.robots to false. Alternatively, you can comment out the code where the robots check is done: in Fetcher.java, lines 605-614 perform the check. Comment out that entire block: if (!rules.isAllowed(fit.u)) { // unblock fetchQueues.finishFetchItem(fit, true); if (LOG
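The property-based approach from the answer above would look like this in conf/nutch-site.xml (a sketch; the property applies to the older Nutch 1.x releases the answer describes, and note that ignoring robots.txt may violate a site's terms of use):

```xml
<!-- nutch-site.xml: disable robots.txt checking -->
<property>
  <name>protocol.plugin.check.robots</name>
  <value>false</value>
</property>
```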

Crawl only content from multiple different websites

Submitted by Deadly on 2020-01-05 04:37:22
Question: I am currently working on a project in which I want to analyze articles from different blogs, magazines, etc. that are published online on their websites. I have already built a web crawler in Python that fetches every new article as HTML. Now here is the problem: I want to analyze the pure content (only the article, without comments, recommendations, etc.), but I cannot access this content without defining a regular expression to extract it from the HTML response I get.
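Rather than a regular expression, an HTML parser can isolate the article body. A minimal stdlib sketch that keeps only the text inside an `<article>` element, skipping scripts and styles (the sample markup and the assumption that the sites wrap articles in `<article>` tags are illustrative; many sites need a per-site selector instead):

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Collects text inside <article>, skipping <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting level inside <article>
        self.skip = 0     # nesting level inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
        elif tag in ("script", "style") and self.depth:
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth:
            self.depth -= 1
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.depth and not self.skip:
            text = data.strip()
            if text:
                self.chunks.append(text)

html_doc = """<html><body>
<div class="comments">Great post!</div>
<article><h1>Title</h1><p>Body text.</p>
<script>track();</script></article>
</body></html>"""

parser = ArticleExtractor()
parser.feed(html_doc)
print(" ".join(parser.chunks))  # → Title Body text.
```

The comment text ("Great post!") is dropped because it sits outside the `<article>` element.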

Crawling tables from webpage

Submitted by 我怕爱的太早我们不能终老 on 2020-01-05 03:19:05
Question: I'm trying to extract CSU employee salary data from this webpage (http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento). I've tried using urllib2 and the requests library, but neither returned the actual table from the webpage. I guess the reason could be that the table is generated dynamically by JavaScript. Below is my code using requests: from lxml import html import requests page = requests.get("http://www.sacbee.com/statepay/
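When a table is rendered by JavaScript, the static HTML that requests/urllib2 fetch never contains it. The usual fix is to open the browser's DevTools Network tab, find the XHR endpoint the page calls, and request its JSON directly. A hedged sketch with a placeholder payload (the real endpoint URL and field names must be confirmed against the actual site; these are invented for illustration):

```python
import json

# Placeholder for the kind of JSON a salary-search XHR endpoint might
# return; in practice you would do: payload = requests.get(xhr_url).text
payload = '''{"employees": [
  {"name": "DOE, JANE", "department": "CSU Sacramento", "total_pay": 61234},
  {"name": "ROE, RICHARD", "department": "CSU Sacramento", "total_pay": 58000}
]}'''

data = json.loads(payload)
rows = [(e["name"], e["total_pay"]) for e in data["employees"]]
print(rows)
```

If no JSON endpoint can be found, a headless browser (e.g. Selenium) that executes the JavaScript is the fallback.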

Web-crawler for facebook in python

Submitted by 那年仲夏 on 2020-01-04 15:27:29
Question: I am trying to write a web crawler in Python that prints the number of Facebook recommends. For example, this article from Sky News (http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine) has about 60 Facebook recommends. I want to print that number from a Python web crawler. I tried the following, but it doesn't print anything: import requests from bs4 import BeautifulSoup def get_single_item_data(item_url): source_code = requests.get(item_url) plain
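Nothing prints because the Facebook widget is injected by JavaScript: the count is not in the article page's static HTML, so BeautifulSoup never sees it. One workaround is to request the Facebook social-plugin URL for the article directly and pull the number out of that response. A sketch against a sample snippet (the plugin URL, markup, and class name are assumptions that must be verified against the actual widget response):

```python
import re

# Sample of the markup a Facebook like/recommend plugin response might
# contain -- in practice: requests.get(plugin_url_for(article_url)).text
sample_widget_html = '<span class="pluginCountTextDisconnected">60</span>'

match = re.search(r'pluginCountTextDisconnected">(\d+)<', sample_widget_html)
count = int(match.group(1)) if match else 0
print(count)  # → 60
```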

Looping through DirectoryEntry or any object hierarchy - C#

Submitted by 依然范特西╮ on 2020-01-04 14:23:10
Question: I am currently developing an application that uses the System.DirectoryServices namespace to create a DirectoryEntry object and loop through the entire hierarchy to collect information. I do not know the number of child entries for each DirectoryEntry object in the hierarchy, so I cannot write N nested loops to spider through the Children property. Here is my pseudocode example: //root directory DirectoryEntry root = new DirectoryEntry(path); if(root.Children != null) {
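The standard answer to "unknown depth, so no fixed number of nested loops" is recursion: a function visits a node, then calls itself on each child. The question is C#, but the pattern is language-agnostic; here is a minimal Python sketch over a toy dictionary tree (the node shape is invented for illustration; with DirectoryEntry the recursive call would iterate entry.Children the same way):

```python
def walk(node, visit, depth=0):
    """Depth-first traversal of an arbitrarily deep hierarchy."""
    visit(node, depth)
    for child in node.get("children", []):
        walk(child, visit, depth + 1)

tree = {"name": "root", "children": [
    {"name": "ou=Users", "children": [{"name": "cn=alice"}]},
    {"name": "ou=Groups"},
]}

names = []
walk(tree, lambda n, d: names.append((d, n["name"])))
print(names)  # → [(0, 'root'), (1, 'ou=Users'), (2, 'cn=alice'), (1, 'ou=Groups')]
```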

Scrapy - doesn't crawl

Submitted by 孤者浪人 on 2020-01-04 14:07:08
Question: I'm trying to get a recursive crawl running, and since the one I wrote wasn't working, I pulled an example from the web and tried that. I really don't know where the problem is, but the crawl doesn't display any ERRORS. Can anyone help me with this? Also, is there any step-by-step debugging tool to help understand the crawl flow of a spider? Any help is greatly appreciated. MacBook:spiders hadoop$ scrapy crawl craigs -o items.csv -t csv /System/Library/Frameworks/Python.framework
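For step-by-step debugging, Scrapy ships the `scrapy shell <url>` and `scrapy parse <url>` commands, which let you run selectors and a single callback against one page interactively. As for the crawl flow itself, a recursive crawl is essentially a queue plus a seen-set; a toy stdlib sketch over an in-memory "site" (pages and links are placeholders) shows the loop a spider performs:

```python
import re
from collections import deque

# Toy in-memory "site": URL -> HTML body with links.
site = {
    "/":  '<a href="/a">a</a> <a href="/b">b</a>',
    "/a": '<a href="/b">b</a>',
    "/b": '',
}

seen, queue, order = set(), deque(["/"]), []
while queue:
    url = queue.popleft()
    if url in seen:          # never fetch the same URL twice
        continue
    seen.add(url)
    order.append(url)        # "fetch" and record the page
    for link in re.findall(r'href="([^"]+)"', site.get(url, "")):
        queue.append(link)   # schedule extracted links

print(order)  # → ['/', '/a', '/b']
```

If a real spider yields no items and no errors, the usual suspects are the `allowed_domains`/rule patterns filtering every link out, which `scrapy shell` makes easy to verify.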

Parsing HTML with VB DOTNET

Submitted by 家住魔仙堡 on 2020-01-04 06:51:11
Question: I am trying to parse some data from a website to get specific items from its tables. I know that any <tr> tag with the bgcolor attribute set to #ffffff or #f4f4ff is where I want to start, and my actual data sits in the 2nd <td> within that <tr>. Currently I have: Private Sub runForm() Dim theElementCollection As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("TR") For Each curElement As HtmlElement In theElementCollection Dim controlValue As String = curElement.GetAttribute("bgcolor")
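The question is VB.NET, but the filtering logic is easy to prototype: keep only rows whose bgcolor is #ffffff or #f4f4ff, then read the 2nd cell of each. A Python stdlib sketch (the sample markup is invented; in the VB code the same check would go inside the For Each loop shown above):

```python
from html.parser import HTMLParser

class RowFilter(HTMLParser):
    """Collects the 2nd <td> of rows with a wanted bgcolor."""
    WANTED = {"#ffffff", "#f4f4ff"}

    def __init__(self):
        super().__init__()
        self.in_row = False
        self.cell = 0
        self.capture = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "tr":
            self.in_row = a.get("bgcolor", "").lower() in self.WANTED
            self.cell = 0
        elif tag == "td" and self.in_row:
            self.cell += 1
            self.capture = (self.cell == 2)

    def handle_endtag(self, tag):
        if tag == "td":
            self.capture = False
        elif tag == "tr":
            self.in_row = False

    def handle_data(self, data):
        if self.capture and data.strip():
            self.values.append(data.strip())

html_doc = ('<tr bgcolor="#ffffff"><td>x</td><td>keep1</td></tr>'
            '<tr bgcolor="#cccccc"><td>x</td><td>skip</td></tr>'
            '<tr bgcolor="#f4f4ff"><td>x</td><td>keep2</td></tr>')
p = RowFilter()
p.feed(html_doc)
print(p.values)  # → ['keep1', 'keep2']
```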

Sending cookies in request with crawler4j?

Submitted by 懵懂的女人 on 2020-01-04 04:24:08
Question: I need to grab some links that depend on the cookies sent with a GET request. So when I want to crawl the page with crawler4j, I need to send some cookies along to get the correct page back. Is this possible (I searched the web but didn't find anything useful)? Or is there a Java crawler out there that can do this? Any help appreciated.

Answer 1: It appears that crawler4j might not support cookies: http://www.webuseragents.com/ua/427106/crawler4j-http-code-google-com-p
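Whichever crawler is used, "sending cookies" just means attaching a Cookie header to the GET request, so any HTTP client that lets you set headers can do it. A minimal Python illustration (the URL and cookie values are placeholders, and no request is actually sent here; the object is only constructed and inspected):

```python
import urllib.request

# The cookies travel in a single Cookie request header.
req = urllib.request.Request(
    "http://example.com/page",                     # placeholder URL
    headers={"Cookie": "sessionid=abc123; lang=en"},  # placeholder cookies
)
print(req.get_header("Cookie"))  # → sessionid=abc123; lang=en
```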

Scrapy rules not working when process_request and callback parameter are set

Submitted by 人走茶凉 on 2020-01-03 15:54:52
Question: I have this rule for a Scrapy CrawlSpider: rules = [ Rule(LinkExtractor( allow='/topic/\d+/organize$', restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]' ), process_request='request_tagPage', callback='parse_tagPage', follow=True) ] request_tagPage() refers to a function that adds cookies to requests, and parse_tagPage() refers to a function that parses the target pages. According to the documentation, CrawlSpider should use request_tagPage to make requests and, once responses are returned,
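A common pitfall with process_request is building a brand-new Request inside it, which drops the callback and meta the rule machinery attached, so parse_tagPage is never invoked. Modifying the request in place via replace() avoids that. A local sketch with a minimal stand-in class, so it runs without Scrapy (scrapy's real Request.replace() is used the same way, and cookie values are placeholders):

```python
class Request:
    """Minimal stand-in for scrapy.Request, for local illustration."""
    def __init__(self, url, callback=None, cookies=None):
        self.url, self.callback, self.cookies = url, callback, cookies

    def replace(self, **kw):
        # Return a copy, keeping any attribute not explicitly overridden.
        return Request(kw.get("url", self.url),
                       kw.get("callback", self.callback),
                       kw.get("cookies", self.cookies))

def parse_tagPage(response):  # placeholder callback
    pass

def request_tagPage(request):
    # Modify, don't rebuild: the rule-attached callback survives.
    return request.replace(cookies={"sessionid": "abc123"})

r = Request("http://example.com/topic/1/organize", callback=parse_tagPage)
r2 = request_tagPage(r)
print(r2.callback is parse_tagPage, r2.cookies)
```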