Web Crawler Algorithm: depth?

Submitted by 空扰寡人 on 2019-12-21 05:32:06

Question


I'm working on a crawler and need to understand exactly what is meant by "link depth". Take Nutch, for example: http://wiki.apache.org/nutch/NutchTutorial

depth indicates the link depth from the root page that should be crawled.

So, say I have the domain www.domain.com and want to crawl to a depth of, say, 3 -- what do I need to do? If a site could be represented as a binary tree, it wouldn't be a problem, I think.


Answer 1:


Link depth means the number of "hops" a page is away from the root, where a "hop" means following a link on a page. The reason Nutch has this restriction is that links very "far away" from the main page are unlikely to hold much information (the main page links to the most important content, so the farther you go, the more detailed and less central the information), while there can be very many such pages, so they take up a lot of storage space, ranking computation time, and bandwidth.

Nutch thus uses a strategy known as depth-limited search to bound its running time and space usage. Without this heuristic, it would have to crawl an entire site to rank all of its pages and find the top N.

To crawl to depth 3, implement this algorithm and give it a depth bound of three. The nice thing about depth-limited search is that it's a variant of depth-first search (DFS), so it's quite space-efficient:

function depth-limited-crawl(page p, int d)
    if d == 0
        return
    /* do something with p, store it or so */
    foreach (page linked in links(p))
        depth-limited-crawl(linked, d-1)
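A runnable Python sketch of the same idea (an illustration, not Nutch's actual implementation): `links` and `visit` are stand-in callables for fetching a page's outgoing links and processing a page, and a `seen` set is added because real link graphs contain cycles that would otherwise cause infinite recursion:

```python
def depth_limited_crawl(page, d, links, visit, seen=None):
    """Depth-limited DFS. `links(page)` returns the pages linked
    from `page`; `visit(page)` processes the page. `seen` guards
    against revisiting pages reachable along multiple paths."""
    if seen is None:
        seen = set()
    if d == 0 or page in seen:
        return
    seen.add(page)
    visit(page)
    for linked in links(page):
        depth_limited_crawl(linked, d - 1, links, visit, seen)
```

Because it is depth-first, the memory used is proportional to the depth bound plus the `seen` set, not to the breadth of the site.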

And no, a site cannot in general be represented as a binary tree; it's a directed graph. If you somehow remove the backlinks, then it becomes a multiway tree. Either way, many sites are too large to store for your crawler.




Answer 2:


  1. Make a list that you use as a queue.
  2. Append (www.domain.com, depth 0) to it.
  3. Pull the first element off it.
  4. The current depth is the element's depth + 1.
  5. Crawl that page.
  6. Append each link on the page to the queue, together with the current depth, if the current depth isn't greater than the maximum depth.
  7. If the list isn't empty, go back to step 3.
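The steps above are a breadth-first crawl; a minimal Python sketch, where `links(page)` is a stand-in for fetching a page and extracting its links, might look like:

```python
from collections import deque

def bfs_crawl(root, links, max_depth):
    """Breadth-first crawl following the queue-based steps above."""
    queue = deque([(root, 0)])       # step 2: root at depth 0
    seen = {root}                    # avoid re-queueing pages (cycles)
    order = []
    while queue:                     # step 7
        page, depth = queue.popleft()    # step 3
        order.append(page)               # step 5: "crawl" the page
        if depth + 1 <= max_depth:       # steps 4 and 6
            for linked in links(page):
                if linked not in seen:
                    seen.add(linked)
                    queue.append((linked, depth + 1))
    return order
```

Using a queue (first in, first out) is what makes this breadth-first: all pages at depth d are crawled before any page at depth d+1.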



Answer 3:


I guess the "depth" is the number of times the crawler "follows a link".

Say you start from the root page. You follow each of the links on this page: this is depth 1. For each of the target pages, you follow the links: this is depth 2, etc.

Note that there may be "cycles" while following links. The structure is not a tree, but a graph.




Answer 4:


Link depth means how many links you have to follow before you reach a given link.

Example: example.com links to example.com/foo.html, which links to google.com. Therefore google.com has a link depth of 2 relative to example.com, because you can reach it by following 2 links.

To crawl example.com to a depth of 3, you would follow links to a maximum depth of 3 and then stop following links. Without this restriction, you could easily go on forever.

Example: example.com links to example.com/foo.html and example.com/bar.html. You follow those two links, the links they link to and stop there.

Note: The root page has a depth of 0.




Answer 5:


A web site's root is at depth 0. Documents you can reach by using links in the root are at depth 1. Documents you can in turn reach from links in documents at depth 1 would be at depth 2. And so on.

Depending on your crawler this might apply to only documents in the same site/domain (usual) or documents hosted elsewhere.
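A crawler that stays within the seed's site typically compares hostnames before following a link. A minimal sketch using the standard library (the strict-equality policy here is one choice; some crawlers also accept subdomains):

```python
from urllib.parse import urlparse

def same_site(seed_url, candidate_url):
    """True if both URLs have the same hostname.
    Note: this strict check treats www.example.com and
    example.com as different sites."""
    return urlparse(seed_url).netloc == urlparse(candidate_url).netloc
```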

Most web sites are not representable as a binary tree, since the root page may link to more than two other pages.




Answer 6:


Well, in the case of Nutch, the depth argument is quite a misnomer: it just means the number of iterations the crawl loop goes through. So you will reach pages which are 3 links away from your seed URLs; on a given site that might mean depth 3 -- that is, if those pages end up within the top-N limits.




Answer 7:


Depth is the number of slashes in the URL path.

Example: http://www.google.com/foo/bar/baz has a depth of 3.

    from urllib.parse import urlparse

    def depth(url):
        # "/foo/bar/baz".split("/") -> ["", "foo", "bar", "baz"]
        return len(urlparse(url).path.split("/")) - 1


Source: https://stackoverflow.com/questions/4356524/web-cralwer-algorithm-depth
