Question
I'm writing some automation software using selenium==3.141.0, Python 3.6.7, and chromedriver 2.44.
Most of the logic is fine to execute in a single browser instance, but for some parts I have to launch 10-20 instances to get decent execution speed.
Once execution reaches the part handled by ThreadPoolExecutor, browser interactions start throwing this error:
WARNING|05/Dec/2018 17:33:11|connectionpool|_put_conn|274|Connection pool is full, discarding connection: 127.0.0.1
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))': /session/119df5b95710793a0421c13ec3a83847/url
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcee7ada048>: Failed to establish a new connection: [Errno 111] Connection refused',)': /session/119df5b95710793a0421c13ec3a83847/url
browser setup:
def init_chromedriver(cls):
    try:
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument(f"user-agent={Utils.get_random_browser_agent()}")
        prefs = {"profile.managed_default_content_settings.images": 2}
        chrome_options.add_experimental_option("prefs", prefs)
        driver = webdriver.Chrome(driver_paths['chrome'],
                                  chrome_options=chrome_options,
                                  service_args=['--verbose', f'--log-path={bundle_dir}/selenium/chromedriver.log'])
        driver.implicitly_wait(10)
        return driver
    except Exception as e:
        logger.error(e)
relevant code:
ProfileParser instantiates a webdriver and executes a few page interactions. I suppose the interactions themselves are not relevant, because everything works without ThreadPoolExecutor. However, in short:
class ProfileParser(object):
    def __init__(self, acc):
        self.driver = Utils.init_chromedriver()

    def __enter__(self):  # elided in the original excerpt; required for the `with` usage below
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        Utils.shutdown_chromedriver(self.driver)
        self.driver = None

    def collect_user_info(self, post_url):
        self.driver.get(post_url)
        profile_url = self.driver.find_element_by_xpath('xpath_here').get_attribute('href')
While running in ThreadPoolExecutor, the error above appears at self.driver.find_element_by_xpath or at self.driver.get.
this is working:
with ProfileParser(acc) as pparser:
    pparser.collect_user_info(posts[0])
these options are not working (connectionpool errors):
futures = []

# one worker, one future
with ThreadPoolExecutor(max_workers=1) as executor:
    with ProfileParser(acc) as pparser:
        futures.append(executor.submit(pparser.collect_user_info, posts[0]))

# 10 workers, multiple futures
with ThreadPoolExecutor(max_workers=10) as executor:
    for p in posts:
        with ProfileParser(acc) as pparser:
            futures.append(executor.submit(pparser.collect_user_info, p))
UPDATE:
I found a temporary workaround (which does not invalidate the initial question): instantiate the webdriver outside of the ProfileParser class. I don't know why this works while the initial version does not; I suppose the cause lies in some language specifics?
Thanks for the answers; however, the problem doesn't seem to be the ThreadPoolExecutor max_workers limit: as you can see in one of the options above, I tried submitting a single instance and it still didn't work.
current workaround:
futures = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for p in posts:
        driver = Utils.init_chromedriver()
        futures.append({
            'future': executor.submit(collect_user_info, driver, acc, p),
            'driver': driver
        })

for f in futures:
    f['future'].result()  # the with-block has already joined the pool; this just surfaces any exception
    Utils.shutdown_chromedriver(f['driver'])
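The workaround above could be taken one step further: instead of creating all drivers up front in the submitting thread, each worker can create and quit its own driver inside the task, so the driver's lifetime exactly matches the task's. A minimal sketch under that assumption (make_driver is an injected stand-in for Utils.init_chromedriver; any zero-argument factory returning an object with a quit() method fits):

```python
from concurrent.futures import ThreadPoolExecutor

def collect_user_info_task(make_driver, post_url):
    # The driver is born inside the worker thread...
    driver = make_driver()
    try:
        # ...real interactions would go here, e.g. driver.get(post_url)
        return post_url
    finally:
        # ...and dies only after this task finishes, never mid-flight.
        driver.quit()

def run_all(make_driver, posts, max_workers=10):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(collect_user_info_task, make_driver, p)
                   for p in posts]
    return [f.result() for f in futures]
```

This keeps driver creation, use, and shutdown in one thread, which also sidesteps the `with ProfileParser(...)` blocks in the failing variants exiting (and shutting the driver down) before the submitted future has run.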
Answer 1:
This error message...
WARNING|05/Dec/2018 17:33:11|connectionpool|_put_conn|274|Connection pool is full, discarding connection: 127.0.0.1
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))': /session/119df5b95710793a0421c13ec3a83847/url
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcee7ada048>: Failed to establish a new connection: [Errno 111] Connection refused',)': /session/119df5b95710793a0421c13ec3a83847/url
...seems to be an issue in urllib3's connection pooling, which raises these WARNINGs while executing the def _put_conn(self, conn) method in connectionpool.py:
def _put_conn(self, conn):
    """
    Put a connection back into the pool.

    :param conn:
        Connection object for the current host and port as returned by
        :meth:`._new_conn` or :meth:`._get_conn`.

    If the pool is already full, the connection is closed and discarded
    because we exceeded maxsize. If connections are discarded frequently,
    then maxsize should be increased.

    If the pool is closed, then the connection will be closed and discarded.
    """
    try:
        self.pool.put(conn, block=False)
        return  # Everything is dandy, done.
    except AttributeError:
        # self.pool is None.
        pass
    except queue.Full:
        # This should never happen if self.block == True
        log.warning(
            "Connection pool is full, discarding connection: %s",
            self.host)

    # Connection never got put back into the pool, close it.
    if conn:
        conn.close()
ThreadPoolExecutor
ThreadPoolExecutor is an Executor subclass that uses a pool of threads to execute calls asynchronously. Deadlocks can occur when the callable associated with a Future waits on the results of another Future.
class concurrent.futures.ThreadPoolExecutor(max_workers=None, thread_name_prefix='', initializer=None, initargs=())
- An Executor subclass that uses a pool of at most max_workers threads to execute calls asynchronously.
- initializer is an optional callable that is called at the start of each worker thread; initargs is a tuple of arguments passed to the initializer. Should initializer raise an exception, all currently pending jobs will raise a BrokenThreadPool, as well as any attempt to submit more jobs to the pool.
- From version 3.5 onwards: If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor.
- From version 3.6 onwards: The thread_name_prefix argument was added to allow users to control the threading.Thread names for worker threads created by the pool for easier debugging.
- From version 3.7: Added the initializer and initargs arguments.
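One way to pair ThreadPoolExecutor with exactly one long-lived driver per worker thread is thread-local storage; the sketch below creates the driver lazily on first use, which also works on Python 3.6 (on 3.7+, the initializer/initargs arguments described above could create it eagerly instead). The make_driver factory is a stand-in for Utils.init_chromedriver, not an existing API:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()

def get_driver(make_driver):
    """Return this thread's driver, creating it on first use."""
    if getattr(_local, "driver", None) is None:
        _local.driver = make_driver()
    return _local.driver

def worker(make_driver, post_url):
    driver = get_driver(make_driver)
    # ...driver.get(post_url) and other interactions would go here...
    return (threading.get_ident(), post_url)
```

With max_workers=N, at most N drivers are ever created, regardless of how many tasks are submitted, because each worker thread reuses its cached instance.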
As per your question, since you are trying to launch 10-20 instances, the default connection pool size of 10, which is hardcoded in adapters.py, seems not to be enough in your case.
Moreover, @EdLeafe in the discussion Getting error: Connection pool is full, discarding connection mentions:
It looks like within the requests code, None objects are normal. If _get_conn() gets None from the pool, it simply creates a new connection. It seems odd, though, that it should start with all those None objects, and that _put_conn() isn't smart enough to replace None with the connection.
However, the merge Add pool size parameter to client constructor has fixed this issue.
Solution
Increasing the default connection pool size of 10, which was earlier hardcoded in adapters.py and is now configurable, will solve your issue.
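As an illustration of what a larger, configurable pool looks like at the urllib3 level (this is the layer whose _put_conn is quoted above; selenium 3.141.0 itself does not expose this knob directly, so treat it as a sketch of the mechanism rather than a selenium fix):

```python
import urllib3

# maxsize raises the number of connections the pool will keep per host
# (default is 1 for a bare pool, 10 via requests' HTTPAdapter).
# block=True makes callers wait for a free connection instead of opening
# extra ones that _put_conn later discards with the warning seen above.
pool = urllib3.HTTPConnectionPool("127.0.0.1", maxsize=20, block=True)
```

With block=True the "Connection pool is full, discarding connection" path in _put_conn should never be hit, as its own comment notes.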
Update
Regarding your comment update ...submit a single instance and the outcome is the same..., @meferguson84 mentions within the discussion Getting error: Connection pool is full, discarding connection:
I stepped into the code to the point where it mounts the adapter just to play with the pool size and see if it made a difference. What I found was that the queue is full of NoneType objects with the actual upload connection being the last item in the list. The list is 10 items long (which makes sense). What doesn't make sense is that the unfinished_tasks parameter for the pool is 11. How can this be when the queue itself is only 11 items? Also, is it normal for the queue to be full of NoneType objects with the connection we are using being the last item on the list?
That sounds like a possible cause in your usecase as well. It may sound redundant but you may still perform a couple of ad-hoc steps as follows:
- Clean your Project Workspace through your IDE and Rebuild your project with required dependencies only.
- (WindowsOS only) Use CCleaner tool to wipe off all the OS chores before and after the execution of your Test Suite.
- (LinuxOS only) Free Up and Release the Unused/Cached Memory in Ubuntu/Linux Mint before and after the execution of your Test Suite.
Answer 2:
Please look at your error:
ProtocolError('Connection aborted.',
RemoteDisconnected('Remote end closed connection without response',))
'NewConnectionError('<urllib3.connection.HTTPConnection object at >:
Failed to establish a new connection: [Errno 111] Connection refused',)':
The error occurs because you are opening multiple connections too fast; either the server is down or the server is blocking your requests.
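If the server really is rejecting rapid connections, capping how many requests run at once can help. A sketch of one way to do it with a semaphore (make_throttled is a hypothetical helper written for this example, not part of selenium or urllib3):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def make_throttled(func, max_concurrent=3, delay=0.1):
    """Wrap func so at most max_concurrent calls run at once,
    with a short pause before each call to spread out requests."""
    gate = threading.Semaphore(max_concurrent)

    def wrapped(*args, **kwargs):
        with gate:                 # blocks while max_concurrent calls are in flight
            time.sleep(delay)      # small spacing between requests
            return func(*args, **kwargs)

    return wrapped
```

The pool can still have many workers; the semaphore simply keeps the server-facing concurrency lower than the worker count.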
Source: https://stackoverflow.com/questions/53641068/connection-pool-is-full-discarding-connection-with-threadpoolexecutor-and-multi