How to avoid duplicate download URLs in my Python spider program?


Question


I wrote a spider program in Python that can recursively crawl web pages. I want to avoid downloading the same page twice, so I store the URLs in a list as follows:

urls = []
def download(mainPage):  # mainPage is a link
    global urls
    links = getHrefLinks(mainPage)
    for l in links:
        if l not in urls:
            urls.append(l)
            downPage(l)

But there is a problem: when there are many links, the urls list grows very large and the check if l not in urls becomes slow. How can I avoid downloading duplicate URLs without using too much memory, and make the check more efficient?


Answer 1:


You can make urls a set:

urls = set()
def download(mainPage):  # mainPage is a link
    global urls
    links = getHrefLinks(mainPage)
    for l in links:
        if l not in urls:
            urls.add(l)  # instead of append
            downPage(l)

Membership tests on a set, i.e., x in s, take O(1) time on average, which is better than the O(n) average for the same test on a list.
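
To get a rough sense of the difference, here is a minimal, hypothetical micro-benchmark (the URLs and counts are made up, not from the question); the set lookup should be dramatically faster:

import timeit

# Build a list and a set with the same made-up URLs
n = 100_000
urls_list = [f"http://example.com/page/{i}" for i in range(n)]
urls_set = set(urls_list)
missing = "http://example.com/not-there"  # worst case for the list: scan everything

# Time 100 membership tests against each container
list_time = timeit.timeit(lambda: missing in urls_list, number=100)
set_time = timeit.timeit(lambda: missing in urls_set, number=100)

print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.6f}s")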




Answer 2:


In general, as you iterate over your URL results you can store them in a dictionary keyed by URL, with a boolean value recording whether you have seen that URL before. At the end, the keys of this dict are exactly the unique URLs.

A dict lookup also takes O(1) time on average when checking whether a URL has been seen.

# Store mapping of {URL: bool}
url_map = {}

# Iterate over URL results
for url in URLs:
    if not url_map.get(url, False):
        url_map[url] = True

# Keys of the dict will hold all unique URLs
print(url_map.keys())
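
If you only need the unique URLs and not the boolean values, a plain set (as in Answer 1) works just as well, as does dict.fromkeys(), which also keeps the first-seen order because dicts preserve insertion order in Python 3.7+. A small sketch (the crawled list below is just a made-up example standing in for your URL results):

# Deduplicate while keeping first-seen order
crawled = ["http://a.com", "http://b.com", "http://a.com"]
unique_urls = list(dict.fromkeys(crawled))  # dicts preserve insertion order (Python 3.7+)
print(unique_urls)  # ['http://a.com', 'http://b.com']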


Source: https://stackoverflow.com/questions/26771396/how-to-avoid-duplicate-download-urls-in-my-python-spider-program
