Designing a web crawler


I have come across an interview question "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it.

How

10 Answers
  • 2020-12-04 05:20

    The problem here is not crawling duplicated URLs, which is solved with an index of hashes computed from the URLs. The problem is crawling DUPLICATED CONTENT. Each URL of a "crawler trap" is different (year, day, SessionID...).

    There is no "perfect" solution... but you can use some of these strategies:

    • Keep a field recording how deep the URL sits inside the website. For each cycle of extracting URLs from a page, increase the level; it works like a tree. You can stop crawling at a certain level, like 10 (I think Google uses this).

    • You can try to create a kind of HASH that can be compared to find similar documents, since you can't compare each new page against every document in your database. There is SimHash from Google, but I could not find any implementation to use, so I created my own. My hash counts low- and high-frequency characters inside the HTML code and generates a 20-byte hash, which is compared against a small cache of the last crawled pages in an AVL tree using a near-neighbours search with some tolerance (about 2). You can't use any reference to character locations in this hash. After "recognizing" the trap, you can record the URL pattern of the duplicate content and start to ignore pages with that pattern too (a rough sketch follows this list).

    • Like Google, you can create a ranking for each website and "trust" some more than others.
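
    As a rough illustration of the second strategy, here is a minimal SimHash-style sketch (not the answerer's actual 20-byte implementation): it builds a 64-bit fingerprint from character n-grams and flags a page as a likely trap when the fingerprint is within a small Hamming distance of a recently crawled page. The function names, n-gram size, tolerance, and cache size are illustrative assumptions.

    import hashlib

    def simhash(text, ngram=4, bits=64):
        # Weighted bit-voting over hashed character n-grams of the page text.
        weights = [0] * bits
        for i in range(max(len(text) - ngram + 1, 1)):
            h = int.from_bytes(hashlib.md5(text[i:i + ngram].encode()).digest()[:8], "big")
            for b in range(bits):
                weights[b] += 1 if (h >> b) & 1 else -1
        return sum(1 << b for b in range(bits) if weights[b] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    recent = []                           # small cache of fingerprints of the last crawled pages

    def looks_like_trap(html, tolerance=3, cache_size=100):
        fp = simhash(html)
        if any(hamming(fp, seen) <= tolerance for seen in recent):
            return True                   # near-duplicate content: likely a crawler trap
        recent.append(fp)
        del recent[:-cache_size]          # keep only the most recent fingerprints
        return False

    A real crawler would then, as the answer suggests, record the URL pattern of pages flagged this way and stop following similar URLs.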

  • 2020-12-04 05:25

    Depends on how deep their question was intended to be. If they were just trying to avoid following the same links back and forth, then hashing the URLs would be sufficient.

    What about content that has literally thousands of URLs that lead to the same page? Like a query-string parameter that doesn't affect anything but can have an infinite number of variations. I suppose you could hash the contents of the page as well and compare URLs to see if they are similar, to catch content that is identified by multiple URLs. See, for example, the bot traps mentioned in @Lirik's post.
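
    One hedged sketch of how such URLs could be collapsed before hashing is to canonicalize them first: drop query parameters that are known not to affect the content, sort the rest, and strip fragments. The parameter blacklist below is purely an assumption for illustration.

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "utm_campaign", "ref"}  # assumed

    def canonical_url(url):
        parts = urlsplit(url)
        # Keep only meaningful query parameters, in a stable order.
        query = sorted((k, v) for k, v in parse_qsl(parts.query)
                       if k.lower() not in IGNORED_PARAMS)
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", urlencode(query), ""))  # fragment dropped

    seen = set()

    def should_crawl(url):
        # Hash/compare the canonical form, not the raw URL string.
        key = canonical_url(url)
        if key in seen:
            return False
        seen.add(key)
        return True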

  • 2020-12-04 05:25

    Well, the web is basically a directed graph, so you can construct a graph out of the URLs and then do a BFS or DFS traversal while marking the visited nodes, so you don't visit the same page twice.
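
    For example, a minimal BFS sketch of that idea, assuming hypothetical fetch() and extract_links() helpers:

    from collections import deque

    def crawl(seed):
        visited = {seed}              # marked nodes: each page is enqueued at most once
        queue = deque([seed])
        while queue:
            url = queue.popleft()
            html = fetch(url)                  # hypothetical page fetcher
            for link in extract_links(html):   # hypothetical link extractor
                if link not in visited:        # skip already-marked nodes
                    visited.add(link)
                    queue.append(link)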

  • 2020-12-04 05:28

    This is a web crawler example, which can be used to collect MAC addresses for MAC spoofing.

    #!/usr/bin/env python3

    import os
    from urllib.parse import urljoin
    from urllib.request import urlopen

    from bs4 import BeautifulSoup


    def mac_addr_str(f_data):
        # Scan page text for strings shaped like a MAC address (xx:xx:xx:xx:xx:xx).
        for word in f_data.split(" "):
            if len(word) == 17 and all(word[i] == ':' for i in (2, 5, 8, 11, 14)):
                if word not in mac_list:
                    mac_list.append(word)
                    fptr.write(word + "\n")
                    print(word)


    url = "http://stackoverflow.com/questions/tagged/mac-address"

    url_list = [url]      # frontier of pages still to crawl
    visited = [url]       # every URL ever enqueued -- this is what prevents infinite loops
    pwd = os.path.join(os.getcwd(), "internet_mac.txt")

    fptr = open(pwd, "a")
    mac_list = []

    while len(url_list) > 0:
        current = url_list.pop(0)
        try:
            htmltext = urlopen(current).read().decode("utf-8", errors="ignore")
        except Exception:
            continue      # skip pages that fail to load instead of reusing stale HTML
        mac_addr_str(htmltext)
        soup = BeautifulSoup(htmltext, "html.parser")
        for tag in soup.findAll('a', href=True):
            tag['href'] = urljoin(url, tag['href'])
            # only follow links on the same site that have not been seen before
            if url in tag['href'] and tag['href'] not in visited:
                url_list.append(tag['href'])
                visited.append(tag['href'])

    fptr.close()


    Change the url to crawl other sites... good luck.
