web-crawler

crawledPage.HttpWebResponse is null in Abot

人盡茶涼 submitted on 2019-12-24 12:34:48
Question: I'm trying to make a C# web crawler using Abot. I followed the QuickStart tutorial, but I cannot seem to make it work: there is an unhandled exception in the crawler_ProcessPageCrawlCompleted method, on exactly this line:

    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
    {
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    }

because crawledPage.HttpWebResponse is null. I'm probably missing something, but what?

Scrapy needs to crawl all the next links on the website and move on to the next page

允我心安 submitted on 2019-12-24 12:22:11
Question: I need my Scrapy spider to move on to the next page. How should the rule be written to do that?

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import Selector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from delh.items import DelhItem

    class criticspider(CrawlSpider):
        name = "delh"
        allowed_domains = ["consumercomplaints.in"]
        #start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2", "http://www.consumercomplaints.in/
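A minimal sketch of a pagination rule, kept to the same (old) scrapy.contrib API the question uses; the restrict_xpaths value is an assumption and has to be adapted to the site's actual pager markup:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class criticspider(CrawlSpider):
        name = "delh"
        allowed_domains = ["consumercomplaints.in"]
        start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

        rules = (
            # Follow every link found in the (assumed) pager container and
            # parse each page with parse_item; follow=True keeps the spider
            # moving on to the next page automatically.
            Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]',)),
                 callback="parse_item",
                 follow=True),
        )

        def parse_item(self, response):
            # per-page extraction logic goes here
            self.log("visited %s" % response.url)

Note that with a CrawlSpider the default parse method must not be overridden, which is why the callback here is named parse_item.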

Parsing HTML in CakePHP

情到浓时终转凉″ submitted on 2019-12-24 10:56:23
Question: I started building a web crawler in CakePHP 2.2. The pages the script crawls are HTML, and I need to parse them to get my values. I have tried some different solutions and looked at some open-source projects as well, but I am not sure which approach is best:

- DOMDocument::loadHTML() - looks like this is the solution, but I'm not 100% sure.
- Regular expressions - a bit hard to maintain.
- Simple HTML DOM - http://electrokami.com/coding/simple-html-dom-baked-cakephp-component (made for Cake 1.3

Using a regex in jsoup

元气小坏坏 submitted on 2019-12-24 10:49:24
Question: I'm trying my first serious project in jsoup and I've got stuck. I'm trying to get zip codes from a site that presents a list of them. Here is one of the lines that contains a zip code:

    <td align="center"><a href="http://www.zipcodestogo.com/Hialeah/FL/33011/">33011</a></td>

So the idea is to go through the page and get all the strings that consist of five digits (0-9). The regex is ^[0-9]{5}$ and the code was:

    doc.select("td:matchesOwn(^[0-9]{5,5}$)");

but nothing came out.

Scrapy returns more results than expected

∥☆過路亽.° submitted on 2019-12-24 09:48:58
Question: This is a continuation of the question Extract from dynamic JSON response with Scrapy. I have a Scrapy spider that extracts values from a JSON response. It works well and extracts the right values, but somehow it enters a loop and returns more results than expected (duplicate results). For example, for 17 values provided in the test.txt file it returns 289 results, that is, 17 times more than expected. The spider content is below:

    import scrapy
    import json
    from whois.items import WhoisItem
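A 17x multiplication usually means the handler for each response iterates over the whole input list again instead of only over the current response. A minimal sketch of a parse method that yields exactly one item per domain in the current response; the WhoisItem field names are assumptions, and the JSON layout is the one shown in full in the "Extract from dynamic JSON response with Scrapy" entry below:

    import json

    import scrapy
    from whois.items import WhoisItem  # the item class from the question

    class WhoisSpider(scrapy.Spider):
        name = "whois"

        def parse(self, response):
            data = json.loads(response.text)
            # Iterate only over the domains present in *this* response;
            # re-reading test.txt here would multiply every result by
            # the number of requests (17 x 17 = 289).
            for domain, info in data["domains"].items():
                item = WhoisItem()
                item["domain"] = domain        # assumed field name
                item["avail"] = info["avail"]  # assumed field name
                yield item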

Use crawler4j to download js files

半世苍凉 submitted on 2019-12-24 08:58:40
Question: I'm trying to use crawler4j to download some websites. The only problem I have is that even though I return true for all .js files in the shouldVisit function, they never get downloaded.

    @Override
    public boolean shouldVisit(WebURL url) {
        return true;
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
    }

The URL for .js files never gets printed out.

Answer 1: I noticed that <script> tags do not get processed by crawler4j.

How to get the immediate parent node with Scrapy in Python?

我的未来我决定 submitted on 2019-12-24 08:46:12
Question: I am new to Scrapy. I want to crawl some data from the web, and the HTML documents look like the ones below.

DOM style 1:

    <div class="user-info">
        <p class="user-name">
            something in p tag
        </p>
        text data I want
    </div>

DOM style 2:

    <div class="user-info">
        <div>
            <p class="user-img">
                something in p tag
            </p>
            something in div tag
        </div>
        <div>
            <p class="user-name">
                something in p tag
            </p>
            text data I want
        </div>
    </div>

I want to get the text "text data I want". Right now I can get it with a CSS or XPath selector.
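One hedged approach that handles both DOM styles: anchor the XPath on the <p class="user-name"> element and take the text node that follows it as a sibling; the selector below is an assumption based only on the two snippets above:

    # assumes `response` is the scrapy.http.Response for the crawled page
    text = response.xpath(
        'normalize-space(//p[@class="user-name"]/following-sibling::text())'
    ).get()

In both styles the wanted text is a sibling text node directly after the user-name paragraph, so the same expression works without knowing which style the page uses.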

What sequence of steps does crawler4j follow to fetch data?

荒凉一梦 submitted on 2019-12-24 08:39:05
Question: I'd like to learn how crawler4j works. Does it fetch a web page, then download its content and extract it? What about the .db and .csv files and their structure? Generally, what sequence of steps does it follow? Please give a descriptive answer. Thanks.

Answer 1: General crawler process. The process for a typical multi-threaded crawler is as follows: we have a queue data structure, which is called the frontier. Newly discovered URLs (or start points, so-called seeds) are added to this data structure.
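A single-threaded Python sketch of the frontier-based loop the answer describes; crawler4j itself is multi-threaded Java and persists its frontier in the .db files, so this only illustrates the control flow, not crawler4j's implementation:

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(seeds, max_pages=50):
        frontier = deque(seeds)  # the queue of URLs still to fetch
        seen = set(seeds)        # URLs already added to the frontier
        while frontier and max_pages > 0:
            url = frontier.popleft()
            max_pages -= 1
            try:
                html = urlopen(url).read().decode("utf-8", "replace")
            except OSError:
                continue  # fetch failed, move on to the next URL
            # the "visit" step: hand the downloaded content to user code here
            # link extraction (a crude regex; real crawlers parse the HTML)
            for href in re.findall(r'href="([^"]+)"', html):
                absolute = urljoin(url, href)
                if absolute not in seen:  # the shouldVisit-style filter
                    seen.add(absolute)
                    frontier.append(absolute)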

Extract from dynamic JSON response with Scrapy

谁都会走 submitted on 2019-12-24 08:15:04
Question: I want to extract the 'avail' value from JSON output that looks like this:

    {
        "result": {
            "code": 100,
            "message": "Command Successful"
        },
        "domains": {
            "yolotaxpayers.com": {
                "avail": false,
                "tld": "com",
                "price": "49.95",
                "premium": false,
                "backorder": true
            }
        }
    }

The problem is that the ["avail"] value is under ["domains"]["domain_name"], and I can't figure out how to get the domain name. You have my spider below. The first part works fine, but not the second one.

    import scrapy
    import json
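Because the domain name is a dynamic key, one way is to iterate over the "domains" dict instead of indexing it by a hardcoded name. A minimal sketch, assuming response is a Scrapy response whose body is exactly the JSON above:

    import json

    data = json.loads(response.text)
    for domain, info in data["domains"].items():
        # domain is the dynamic key, e.g. "yolotaxpayers.com";
        # info is the inner dict that holds "avail"
        print(domain, info["avail"])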

How to get all the links leading to the next page?

戏子无情 submitted on 2019-12-24 08:00:17
Question: I've written some code in VBA to get all the links leading to the next page from a webpage. The highest number of next-page links is 255. Running my script, I get those within 6906 links in total, which means the loop runs again and again and I'm overwriting stuff. Filtering out duplicate links, I can see that there are 254 unique ones. My objective is not to hardcode the highest page number into the link used for iteration. Here is what I'm trying:

    Sub YifyLink()
        Const link = "https:/
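The core of the fix is language-agnostic: keep a set of already-seen links and stop when a pass discovers nothing new, instead of hardcoding the last page number. Sketched here in Python for brevity; the href pattern is hypothetical, and the same idea maps onto a VBA Scripting.Dictionary:

    import re
    from urllib.request import urlopen

    def collect_pagination_links(start_url):
        seen = set()
        queue = [start_url]
        while queue:
            url = queue.pop()
            html = urlopen(url).read().decode("utf-8", "replace")
            # hypothetical pattern for the site's next-page anchors
            for href in re.findall(r'href="([^"]*page=\d+[^"]*)"', html):
                if href not in seen:  # only unseen links re-enter the loop
                    seen.add(href)
                    queue.append(href)
        return seen  # every unique pagination link, no page cap needed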