Python Package For Multi-Threaded Spider w/ Proxy Support?

执笔经年 2020-12-09 07:25

Instead of just using urllib, does anyone know of the most efficient package for fast, multithreaded downloading of URLs that can operate through HTTP proxies? I know of a fe…

2 Answers
  • 2020-12-09 07:38

    It's simple to implement this in Python.

    The urlopen() function works transparently with proxies that do not require authentication. In a Unix or Windows environment, set the http_proxy, ftp_proxy or gopher_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter.
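
    That means a crawler like the one below needs no proxy-specific code at all. As a minimal sketch of the two ways Python 2's urllib picks up a proxy (the proxy address and target URL here are placeholders, not real servers):

    # Minimal sketch (Python 2); proxy address and URL are placeholders.
    import os
    from urllib import urlopen

    # 1) Via the environment: urlopen() reads http_proxy automatically.
    os.environ['http_proxy'] = 'http://10.1.1.1:3128'
    print urlopen('http://www.python.org/').read()[:100]

    # 2) Via an explicit proxies mapping passed to urlopen().
    print urlopen('http://www.python.org/',
                  proxies={'http': 'http://10.1.1.1:3128'}).read()[:100]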

    # -*- coding: utf-8 -*-
    # Python 2 code: uses urllib.urlopen, BeautifulSoup 3 and the Queue module.
    
    import sys
    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup
    from Queue import Queue, Empty
    from threading import Thread
    
    visited = set()   # URLs already discovered, so each is crawled only once
    queue = Queue()   # URLs waiting to be fetched
    
    def get_parser(host, root, charset):
    
        def parse():
            try:
                while True:
                    # get_nowait() raises Empty when the queue is drained,
                    # which ends this worker thread.
                    url = queue.get_nowait()
                    try:
                        content = urlopen(url).read().decode(charset)
                    except UnicodeDecodeError:
                        continue
                    # Collect every anchor, normalise relative links, and
                    # enqueue unseen URLs that stay under http://<host><root>.
                    for link in BeautifulSoup(content).findAll('a'):
                        try:
                            href = link['href']
                        except KeyError:
                            continue
                        if not href.startswith('http://'):
                            href = 'http://%s%s' % (host, href)
                        if not href.startswith('http://%s%s' % (host, root)):
                            continue
                        if href not in visited:
                            visited.add(href)
                            queue.put(href)
                            print href
            except Empty:
                pass
    
        return parse
    
    if __name__ == '__main__':
        host, root, charset = sys.argv[1:]
        parser = get_parser(host, root, charset)
        start = 'http://%s%s' % (host, root)
        visited.add(start)   # mark the seed so it is not re-crawled later
        queue.put(start)
        workers = []
        # Start five worker threads, all running the same parse() closure.
        # Note: a worker exits as soon as it sees an empty queue, so threads
        # started before any links have been discovered may do no work.
        for i in range(5):
            worker = Thread(target=parser)
            worker.start()
            workers.append(worker)
        for worker in workers:
            worker.join()
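
    Assuming the script above is saved as spider.py (a name chosen here for illustration), it is started with the host, root path and page charset as command-line arguments, e.g. python spider.py www.python.org / utf-8.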
    
  • 2020-12-09 07:38

    Proxies usually filter websites by category, based on the kind of content the site serves, and it is difficult to get data through a proxy that blocks that category. For example, YouTube is classified as audio/video streaming and is therefore blocked in some places, especially schools. One way around this is to pull the data off the blocked website and republish it on your own registered domain (e.g. a .com site); when you create and register that site, you can categorise it however you want.
