web-crawler

How to find all links / pages on a website

Is it possible to find all the pages and links on ANY given website? I'd like to enter a URL and produce a directory tree of all links from that site. I've looked at HTTrack, but that downloads the whole site and I simply need the directory tree.

Hank Gay: Check out linkchecker; it will crawl the site (while obeying robots.txt) and generate a report. From there, you can script up a solution for creating the directory tree. If you have the developer console (JavaScript) in your browser, you can type this code in:

```javascript
urls = document.querySelectorAll('a');
for (url in urls) console.log(urls[url].href);
```
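For a scripted, non-browser approach, here is a minimal sketch (assuming the requests and beautifulsoup4 packages are installed; the URL is a placeholder) that fetches a single page and prints every anchor href it finds:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = "https://example.com/"  # placeholder URL

# Fetch the page and parse out every <a href="..."> it contains.
html = requests.get(start_url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a", href=True):
    # Resolve relative links against the page URL before printing.
    print(urljoin(start_url, a["href"]))
```

Recursing over those links, while recording which pages have already been visited, is what turns this into the directory tree the question asks for.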

Mass Downloading of Webpages C#

My application requires that I download a large number of webpages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.

```csharp
for (int i = 1; i <= pages; i++)
{
    string page_specific_link = baseurl + "&page=" + i.ToString();
    try
    {
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page_specific_link);
        client.Dispose();
        sourcelist.Add(pagesource);
    }
    catch
    {
        // ... (excerpt truncated)
    }
}
```
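The usual remedy is to issue the requests concurrently instead of one at a time. As an illustration of that general pattern only (not a C# answer), here is a small sketch in Python using the standard library; the base URL and page count are placeholders:

```python
import concurrent.futures
import urllib.request

base_url = "https://example.com/list?page="  # placeholder
pages = 50                                   # placeholder page count

def fetch(i):
    # Download one page and return its HTML as a string.
    with urllib.request.urlopen(base_url + str(i), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

# A modest pool overlaps network latency without hammering the server.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    sources = list(pool.map(fetch, range(1, pages + 1)))

print(len(sources), "pages downloaded")
```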

Finding the layers and layer sizes for each Docker image

For research purposes I'm trying to crawl the public Docker registry ( https://registry.hub.docker.com/ ) and find out 1) how many layers an average image has and 2) the sizes of those layers, to get an idea of the distribution. However, I have studied the API and public libraries, as well as the details on GitHub, but I can't find any method to:

- retrieve all the public repositories/images (even if there are thousands, I still need a starting list to iterate through)
- find all the layers of an image
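For the second point, one possible approach (a sketch, assuming the public Docker Registry HTTP API v2 and its anonymous token endpoint behave as documented; the image name is just an example) is to pull an image manifest and read the layer sizes from it. Note that multi-architecture images may return a manifest list instead, which this sketch does not handle.

```python
import requests

image, tag = "library/ubuntu", "latest"  # example image, not from the question

# Anonymous pull token for the public registry (Registry HTTP API v2).
token = requests.get(
    "https://auth.docker.io/token",
    params={"service": "registry.docker.io", "scope": f"repository:{image}:pull"},
).json()["token"]

# Schema 2 manifests list each layer together with its compressed size in bytes.
manifest = requests.get(
    f"https://registry-1.docker.io/v2/{image}/manifests/{tag}",
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.docker.distribution.manifest.v2+json",
    },
).json()

layers = manifest.get("layers", [])
print(len(layers), "layers,", sum(l["size"] for l in layers), "bytes total")
```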

Locally run all of the spiders in Scrapy

Is there a way to run all of the spiders in a Scrapy project without using the Scrapy daemon? There used to be a way to run multiple spiders with scrapy crawl, but that syntax was removed and Scrapy's code has changed quite a bit. I tried creating my own command:

```python
from scrapy.command import ScrapyCommand
from scrapy.utils.misc import load_object
from scrapy.conf import settings

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return  # ... (excerpt truncated)
```
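In recent Scrapy versions, one common way to do this from a plain script (rather than a custom command) is CrawlerProcess. A minimal sketch, assuming it is run from the project root so get_project_settings() can find the project settings:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py so pipelines, middlewares, etc. apply.
process = CrawlerProcess(get_project_settings())

# spider_loader knows every spider registered in the project.
for spider_name in process.spider_loader.list():
    process.crawl(spider_name)

# Starts the reactor and blocks until every crawl has finished.
process.start()
```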

Designing a web crawler

I have come across an interview question, "If you were designing a web crawler, how would you avoid getting into infinite loops?", and I am trying to answer it. How does it all begin? Say Google started with some hub pages, say hundreds of them (how these hub pages were found in the first place is a different sub-question). As Google follows links from page to page, does it keep a hash table to make sure that it doesn't revisit pages it has already seen? What if the ...
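The core of most answers is exactly that: keep a set of normalized URLs you have already enqueued and never enqueue one twice. A minimal sketch of the idea (single-threaded, same-host only; the seed URL is a placeholder and requests/beautifulsoup4 are assumed to be installed):

```python
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

import requests
from bs4 import BeautifulSoup

start = "https://example.com/"  # placeholder seed page
seen = {start}                  # the "hash table" of visited/queued URLs
queue = deque([start])

while queue:
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        # Normalize: resolve relative links and drop #fragments, so trivially
        # different spellings of the same page collapse to one entry.
        link, _ = urldefrag(urljoin(url, a["href"]))
        if urlparse(link).netloc == urlparse(start).netloc and link not in seen:
            seen.add(link)
            queue.append(link)

print(len(seen), "distinct pages discovered")
```

In a real crawler the in-memory set would be replaced by something that scales (a Bloom filter or a shared key-value store), and a depth or page-count cap guards against URL spaces that are generated endlessly, such as calendar pages.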

How to follow all links in CasperJS?

I'm having trouble clicking all JavaScript-based links in a DOM and saving the output. The links have the form:

```html
<a id="html" href="javascript:void(0);" onclick="goToHtml();">HTML</a>
```

The following code works great:

```javascript
var casper = require('casper').create();
var fs = require('fs');
var firstUrl = 'http://www.testurl.com/test.html';
var css_selector = '#jan_html';

casper.start(firstUrl);
casper.thenClick(css_selector, function() {
    console.log("whoop");
});
casper.waitFor(function check() {
    return // ... (excerpt truncated)
```

Apache HTTPClient throws java.net.SocketException: Connection reset for many domains

I'm creating a (well-behaved) web spider and I notice that some servers cause Apache HttpClient to give me a SocketException, specifically:

java.net.SocketException: Connection reset

The code that causes this is:

```java
// Execute the request
HttpResponse response;
try {
    response = httpclient.execute(httpget); // httpclient is of type HttpClient
} catch (NullPointerException e) {
    return; // deep down, Apache HTTP sometimes throws a null pointer...
}
```

For most servers it's just fine. But for ...

how to filter duplicate requests based on url in scrapy

I am writing a crawler for a website using Scrapy with CrawlSpider. Scrapy provides a built-in duplicate-request filter which filters duplicate requests based on URLs. Also, I can filter requests using the rules member of CrawlSpider. What I want to do is to filter requests like:

http://www.abc.com/p/xyz.html?id=1234&refer=5678

if I have already visited

http://www.abc.com/p/xyz.html?id=1234&refer=4567

NOTE: refer is a parameter that doesn't affect the response I get, so I don't care if the value of that parameter changes. Now, if I have a set which accumulates all ids, I could ignore it in my ...
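One way to get this behaviour (a sketch of one option, not Scrapy's stock behaviour) is a custom dupe filter that strips the irrelevant parameter before fingerprinting. The refer parameter comes from the question; the module path and class name are made up for the example:

```python
# myproject/dupefilters.py (illustrative module path)
from urllib.parse import urlencode, urlparse, urlunparse, parse_qsl

from scrapy.dupefilters import RFPDupeFilter


def strip_param(url, param="refer"):
    """Return the URL with the given query parameter removed."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunparse(parts._replace(query=urlencode(query)))


class IgnoreReferDupeFilter(RFPDupeFilter):
    def request_seen(self, request):
        # Fingerprint the request as if 'refer' were not there at all.
        return super().request_seen(request.replace(url=strip_param(request.url)))
```

Pointing Scrapy at it with DUPEFILTER_CLASS = 'myproject.dupefilters.IgnoreReferDupeFilter' in settings.py (path again illustrative) makes the scheduler treat the two example URLs above as the same request.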

Spider a Website and Return URLs Only

I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through grep, I can't seem to find the right magic to make it work:

```bash
wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'
```

The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong, or is there another tool I should try that's more geared towards providing this kind of limited result set?

UPDATE: ...
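Part of the reason the pipe sees nothing is that wget writes its progress messages to stderr rather than stdout, so redirecting with 2>&1 before the pipe is usually part of the fix. If a small script is acceptable instead, here is a sketch using only the Python standard library that prints the URIs linked from the start page without saving any content (the site URL is the placeholder from the question):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

site = "http://somesite.com"  # placeholder, as in the question

class LinkCollector(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(site, value))

parser = LinkCollector()
with urlopen(site, timeout=10) as resp:
    parser.feed(resp.read().decode("utf-8", errors="replace"))

# Print the URI list only; nothing is written to disk.
for link in sorted(parser.links):
    print(link)
```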

Fetch contents(loaded through AJAX call) of a web page

I am a beginner at crawling. I have a requirement to fetch the posts and comments from a link, and I want to automate this process. I considered using a web crawler and jsoup for this, but was told that web crawlers are mostly used for websites with greater depth. Sample page: a Jive community website. For this page, when I view the source, I can see only the post and not the comments. I think this is because the comments are fetched through an AJAX call to the server. Hence, when I use jsoup, it doesn't fetch the comments. So how can I automate the process of fetching posts and comments?
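Two common routes for AJAX-loaded content are (1) finding the JSON endpoint the page calls and requesting it directly, or (2) rendering the page in a real browser and scraping the resulting DOM. A minimal sketch of the second route with Selenium (assuming a ChromeDriver installation; the URL and CSS selector are placeholders, since the real ones depend on the Jive page):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://community.example.com/thread/12345"  # placeholder thread URL

driver = webdriver.Chrome()
try:
    driver.get(url)
    # Wait until the AJAX-loaded comments actually appear in the DOM.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".comment"))  # placeholder selector
    )
    for comment in driver.find_elements(By.CSS_SELECTOR, ".comment"):
        print(comment.text)
finally:
    driver.quit()
```

If the AJAX endpoint itself is easy to spot in the browser's network tab, calling it directly is usually faster and more robust than driving a browser.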