How to delay execution within ThreadPool?

雨燕双飞 提交于 2019-12-11 10:45:13

问题


I've written a script in python using multiprocessing.pool.ThreadPool to handle multiple requests concurrently and do the scraping process robust. The parser is doing it's job perfectly.

As I have noticed in several scripts that there should be a delay within the scraping process when it is created using multiprocessing, I would like to put a delay within my below script as well.

However, This is where I'm stuck and can't find out the right position to put that delay.

This is my script so far:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool

url = "http://srar.com/roster/index.php?agent_search=a"

def get_links(link):
    completelinks = []
    res = requests.get(link)  
    soup = BeautifulSoup(res.text,'lxml')
    for items in soup.select("table.border tr"):
        if not items.select("td a[href^='index.php?agent']"):continue
        data = [urljoin(link,item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
        completelinks.extend(data)
    return completelinks

def get_info(nlink):
    req = requests.get(nlink)
    sauce = BeautifulSoup(req.text,"lxml")
    for tr in sauce.select("table[style$='1px;'] tr")[1:]:
        table = [td.get_text(strip=True) for td in tr.select("td")]
        print(table)

if __name__ == '__main__':
    ThreadPool(20).map(get_info, get_links(url))

Once again: all i need to know is the right position within my script to put a delay.


回答1:


For putting a delay in between your multiple requests.get() calls, located within get_info, you would have to expand get_info with a delay-argument, which it can take as input to a time.sleep() call. Since all your worker-threads start at once, your delays have to be cumulative for every call. Meaning, when you want your delay between the requests.get() calls to be 0.5 seconds, your list of delays you are passing along into the pool-method would look like [0.0, 0.5, 1.0, 1.5, 2.0, 2.5 ...].

For not having to alter get_info itself, I'm using a decorator in the example below to extend get_info with a delay parameter and a time.sleep(delay) call. Note that I'm passing the delays along the other argument for get_info in the pool.starmap call.

import logging
from multiprocessing.pool import ThreadPool
from functools import wraps

def delayed(func):
    @wraps(func)
    def wrapper(delay, *args, **kwargs):
        time.sleep(delay)  # <--
        return func(*args, **kwargs)
    return wrapper

@delayed
def get_info(nlink):
    info = nlink + '_info'
    logger.info(msg=info)
    return info


def get_links(n):
    return [f'link{i}' for i in range(n)]


def init_logging(level=logging.DEBUG):
    fmt = '[%(asctime)s %(levelname)-8s %(threadName)s' \
          ' %(funcName)s()] --- %(message)s'
    logging.basicConfig(format=fmt, level=level)


if __name__ == '__main__':

    DELAY = 0.5

    init_logging()
    logger = logging.getLogger(__name__)

    links = get_links(10) # ['link0', 'link1', 'link2', ...]
    delays = (x * DELAY for x in range(0, len(links)))
    arguments = zip(delays, links) # (0.0, 'link0'), (0.5, 'link1'), ...

    with ThreadPool(10) as pool:
        result = pool.starmap(get_info, arguments)
        print(result)

Example Output:

[2018-10-03 22:04:14,221 INFO     Thread-8 get_info()] --- link0_info
[2018-10-03 22:04:14,721 INFO     Thread-5 get_info()] --- link1_info
[2018-10-03 22:04:15,221 INFO     Thread-3 get_info()] --- link2_info
[2018-10-03 22:04:15,722 INFO     Thread-4 get_info()] --- link3_info
[2018-10-03 22:04:16,223 INFO     Thread-1 get_info()] --- link4_info
[2018-10-03 22:04:16,723 INFO     Thread-6 get_info()] --- link5_info
[2018-10-03 22:04:17,224 INFO     Thread-7 get_info()] --- link6_info
[2018-10-03 22:04:17,723 INFO     Thread-10 get_info()] --- link7_info
[2018-10-03 22:04:18,225 INFO     Thread-9 get_info()] --- link8_info
[2018-10-03 22:04:18,722 INFO     Thread-2 get_info()] --- link9_info
['link0_info', 'link1_info', 'link2_info', 'link3_info', 'link4_info', 
'link5_info', 'link6_info', 'link7_info', 'link8_info', 'link9_info']


来源:https://stackoverflow.com/questions/52623178/how-to-delay-execution-within-threadpool

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!