Python Package For Multi-Threaded Spider w/ Proxy Support?

执笔经年 2020-12-09 07:25

Instead of just using urllib, does anyone know of the most efficient package for fast, multithreaded downloading of URLs that can operate through HTTP proxies? I know of a fe…

2 Answers
  • 2020-12-09 07:38

    It's simple to implement this in Python.

    The urlopen() function works transparently with proxies that do not require authentication. In a Unix or Windows environment, set the http_proxy, ftp_proxy or gopher_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter.
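
    That means a crawler like the one below needs no proxy-specific code at all. As a minimal sketch of the two ways Python 2's urllib picks up a proxy (the proxy address and target URL here are placeholders, not real servers):

    # Minimal sketch (Python 2); proxy address and URL are placeholders.
    import os
    from urllib import urlopen

    # 1) Via the environment: urlopen() reads http_proxy automatically.
    os.environ['http_proxy'] = 'http://10.1.1.1:3128'
    print urlopen('http://www.python.org/').read()[:100]

    # 2) Via an explicit proxies mapping passed to urlopen().
    print urlopen('http://www.python.org/',
                  proxies={'http': 'http://10.1.1.1:3128'}).read()[:100]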

    # -*- coding: utf-8 -*-
    # Python 2 code: uses urllib.urlopen, BeautifulSoup 3 and the Queue module.
    
    import sys
    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup
    from Queue import Queue, Empty
    from threading import Thread
    
    visited = set()   # URLs already discovered, so each is crawled only once
    queue = Queue()   # URLs waiting to be fetched
    
    def get_parser(host, root, charset):
    
        def parse():
            try:
                while True:
                    # get_nowait() raises Empty when the queue is drained,
                    # which ends this worker thread.
                    url = queue.get_nowait()
                    try:
                        content = urlopen(url).read().decode(charset)
                    except UnicodeDecodeError:
                        continue
                    # Collect every anchor, normalise relative links, and
                    # enqueue unseen URLs that stay under http://<host><root>.
                    for link in BeautifulSoup(content).findAll('a'):
                        try:
                            href = link['href']
                        except KeyError:
                            continue
                        if not href.startswith('http://'):
                            href = 'http://%s%s' % (host, href)
                        if not href.startswith('http://%s%s' % (host, root)):
                            continue
                        if href not in visited:
                            visited.add(href)
                            queue.put(href)
                            print href
            except Empty:
                pass
    
        return parse
    
    if __name__ == '__main__':
        host, root, charset = sys.argv[1:]
        parser = get_parser(host, root, charset)
        start = 'http://%s%s' % (host, root)
        visited.add(start)   # mark the seed so it is not re-crawled later
        queue.put(start)
        workers = []
        # Start five worker threads, all running the same parse() closure.
        # Note: a worker exits as soon as it sees an empty queue, so threads
        # started before any links have been discovered may do no work.
        for i in range(5):
            worker = Thread(target=parser)
            worker.start()
            workers.append(worker)
        for worker in workers:
            worker.join()
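
    Assuming the script above is saved as spider.py (a name chosen here for illustration), it is started with the host, root path and page charset as command-line arguments, e.g. python spider.py www.python.org / utf-8.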
    
  • 2020-12-09 07:38

    Proxies usually filter websites by category, based on the kind of content the site serves, and it is difficult to get data through a proxy that blocks that category. For example, YouTube is classified as audio/video streaming and is therefore blocked in some places, especially schools. One way around this is to pull the data off the blocked website and republish it on your own registered domain (e.g. a .com site); when you create and register that site, you can categorise it however you want.
