Designing a web crawler


I have come across an interview question "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it.

How

10 Answers
  • 2020-12-04 05:20

    The problem here is not crawling duplicated URLs, which is solved with an index of hashes computed from the URLs. The problem is crawling DUPLICATED CONTENT. Each URL of a "crawler trap" is different (year, day, SessionID...).

    There is no "perfect" solution... but you can use some of these strategies:

    • Keep a field recording how deep the URL sits inside the website. For each cycle of extracting URLs from a page, increase the level; it works like a tree. You can stop crawling at a certain level, like 10 (I think Google uses this).

    • You can try to create a kind of HASH that can be compared to find similar documents, since you can't compare each new page against every document in your database. There is SimHash from Google, but I could not find any implementation to use, so I created my own. My hash counts low- and high-frequency characters inside the HTML code and generates a 20-byte hash, which is compared against a small cache of the last crawled pages in an AVL tree using a near-neighbours search with some tolerance (about 2). You can't use any reference to character locations in this hash. After "recognizing" the trap, you can record the URL pattern of the duplicate content and start to ignore pages with that pattern too (a rough sketch follows this list).

    • Like Google, you can create a ranking for each website and "trust" some more than others.
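
    As a rough illustration of the second strategy, here is a minimal SimHash-style sketch (not the answerer's actual 20-byte implementation): it builds a 64-bit fingerprint from character n-grams and flags a page as a likely trap when the fingerprint is within a small Hamming distance of a recently crawled page. The function names, n-gram size, tolerance, and cache size are illustrative assumptions.

    import hashlib

    def simhash(text, ngram=4, bits=64):
        # Weighted bit-voting over hashed character n-grams of the page text.
        weights = [0] * bits
        for i in range(max(len(text) - ngram + 1, 1)):
            h = int.from_bytes(hashlib.md5(text[i:i + ngram].encode()).digest()[:8], "big")
            for b in range(bits):
                weights[b] += 1 if (h >> b) & 1 else -1
        return sum(1 << b for b in range(bits) if weights[b] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    recent = []                           # small cache of fingerprints of the last crawled pages

    def looks_like_trap(html, tolerance=3, cache_size=100):
        fp = simhash(html)
        if any(hamming(fp, seen) <= tolerance for seen in recent):
            return True                   # near-duplicate content: likely a crawler trap
        recent.append(fp)
        del recent[:-cache_size]          # keep only the most recent fingerprints
        return False

    A real crawler would then, as the answer suggests, record the URL pattern of pages flagged this way and stop following similar URLs.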

  • 2020-12-04 05:25

    Depends on how deep their question was intended to be. If they were just trying to avoid following the same links back and forth, then hashing the URLs would be sufficient.

    What about content that has literally thousands of URLs that lead to the same page? Like a query-string parameter that doesn't affect anything but can have an infinite number of variations. I suppose you could hash the contents of the page as well and compare URLs to see if they are similar, to catch content that is identified by multiple URLs. See, for example, the bot traps mentioned in @Lirik's post.
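
    One hedged sketch of how such URLs could be collapsed before hashing is to canonicalize them first: drop query parameters that are known not to affect the content, sort the rest, and strip fragments. The parameter blacklist below is purely an assumption for illustration.

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "utm_campaign", "ref"}  # assumed

    def canonical_url(url):
        parts = urlsplit(url)
        # Keep only meaningful query parameters, in a stable order.
        query = sorted((k, v) for k, v in parse_qsl(parts.query)
                       if k.lower() not in IGNORED_PARAMS)
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", urlencode(query), ""))  # fragment dropped

    seen = set()

    def should_crawl(url):
        # Hash/compare the canonical form, not the raw URL string.
        key = canonical_url(url)
        if key in seen:
            return False
        seen.add(key)
        return True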

  • 2020-12-04 05:25

    Well, the web is basically a directed graph, so you can construct a graph out of the URLs and then do a BFS or DFS traversal while marking the visited nodes, so you don't visit the same page twice.
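
    For example, a minimal BFS sketch of that idea, assuming hypothetical fetch() and extract_links() helpers:

    from collections import deque

    def crawl(seed):
        visited = {seed}              # marked nodes: each page is enqueued at most once
        queue = deque([seed])
        while queue:
            url = queue.popleft()
            html = fetch(url)                  # hypothetical page fetcher
            for link in extract_links(html):   # hypothetical link extractor
                if link not in visited:        # skip already-marked nodes
                    visited.add(link)
                    queue.append(link)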

  • 2020-12-04 05:28

    This is a web crawler example, which can be used to collect MAC addresses for MAC spoofing.

    #!/usr/bin/env python3

    import os
    from urllib.parse import urljoin
    from urllib.request import urlopen

    from bs4 import BeautifulSoup


    def mac_addr_str(f_data):
        # Scan page text for strings shaped like a MAC address (xx:xx:xx:xx:xx:xx).
        for word in f_data.split(" "):
            if len(word) == 17 and all(word[i] == ':' for i in (2, 5, 8, 11, 14)):
                if word not in mac_list:
                    mac_list.append(word)
                    fptr.write(word + "\n")
                    print(word)


    url = "http://stackoverflow.com/questions/tagged/mac-address"

    url_list = [url]      # frontier of pages still to crawl
    visited = [url]       # every URL ever enqueued -- this is what prevents infinite loops
    pwd = os.path.join(os.getcwd(), "internet_mac.txt")

    fptr = open(pwd, "a")
    mac_list = []

    while len(url_list) > 0:
        current = url_list.pop(0)
        try:
            htmltext = urlopen(current).read().decode("utf-8", errors="ignore")
        except Exception:
            continue      # skip pages that fail to load instead of reusing stale HTML
        mac_addr_str(htmltext)
        soup = BeautifulSoup(htmltext, "html.parser")
        for tag in soup.findAll('a', href=True):
            tag['href'] = urljoin(url, tag['href'])
            # only follow links on the same site that have not been seen before
            if url in tag['href'] and tag['href'] not in visited:
                url_list.append(tag['href'])
                visited.append(tag['href'])

    fptr.close()


    Change the url to crawl other sites... good luck.
