Gevent pool with nested web requests

Submitted by 一个人想着一个人 on 2019-12-03 16:32:26
Bryan Mayes

I think the following should get you what you want. I'm using BeautifulSoup in my example instead of the link-stripping stuff you had.

from gevent import monkey, pool
monkey.patch_all()  # patch the standard library before importing requests

import gevent
import requests
from bs4 import BeautifulSoup

jobs = []
links = []
p = pool.Pool(10)  # at most 10 greenlets run at once

urls = [
    'http://www.google.com', 
    # ... another 100 urls
]

def get_links(url):
    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'html.parser')
        links.extend(soup.find_all('a'))

for url in urls:
    jobs.append(p.spawn(get_links, url))
gevent.joinall(jobs)
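
If you'd rather not mutate a shared global list, each greenlet can return its own links and you can collect them from the finished jobs after joinall. A minimal sketch, assuming the same urls list and pool p as above:

def get_links(url):
    # return this page's links instead of appending to a global list
    r = requests.get(url)
    if r.status_code != 200:
        return []
    return BeautifulSoup(r.text, 'html.parser').find_all('a')

jobs = [p.spawn(get_links, url) for url in urls]
gevent.joinall(jobs)
# a finished greenlet exposes its return value as .value
links = [a for job in jobs for a in (job.value or [])]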

gevent.pool will limit the concurrent greenlets, not the connections.

You should use a Session with an HTTPAdapter:

import requests

connection_limit = 10
adapter = requests.adapters.HTTPAdapter(pool_connections=connection_limit, 
                                        pool_maxsize=connection_limit)
session = requests.Session()
session.mount('http://', adapter)
session.get('some url')

# or do your work with gevent
from gevent.pool import Pool
# the pool size should be bigger than the connection limit if processing
# the data takes longer than downloading it, to give the processing a
# chance to run
pool_size = 15 
pool = Pool(pool_size)
for url in urls:
    pool.spawn(session.get, url)
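
Note that pool.spawn only schedules the requests; call pool.join() afterwards if the script needs to wait for all of them to finish.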

You should use gevent.queue to do it the right way.
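
A minimal sketch of the queue approach, assuming the same urls list as above; the worker function and tasks queue are names I've chosen for illustration:

from gevent import monkey
monkey.patch_all()

import gevent
import requests
from gevent.queue import Queue

tasks = Queue()
for url in urls:
    tasks.put(url)

def worker():
    # each worker pulls URLs until the queue is drained
    while not tasks.empty():
        url = tasks.get()
        r = requests.get(url)
        # process r.text here

workers = [gevent.spawn(worker) for _ in range(10)]
gevent.joinall(workers)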

Also, this (the eventlet examples) will be helpful for understanding the basic idea.

The gevent solution is similar to the eventlet one.

Keep in mind that you will need somewhere to store the URLs you have already visited so the crawl does not cycle, and to avoid running out of memory you need to introduce some limits.
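
A minimal sketch of such a guard, assuming the queue-based worker above; the seen set and max_pages cap are my own names:

seen = set()        # URLs already scheduled, to avoid cycles
max_pages = 1000    # hard cap so the crawl cannot grow without bound

def should_fetch(url):
    # call this before putting a newly discovered URL on the queue
    if url in seen or len(seen) >= max_pages:
        return False
    seen.add(url)
    return True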
