uWSGI and joblib Semaphore: Joblib will operate in serial mode

Submitted by 两盒软妹~ on 2020-02-01 05:51:25

Question


I'm running joblib in a Flask application living inside a Docker container together with uWSGI (started with threads enabled) which is started by supervisord.

The startup of the webserver shows the following error:

unable to load configuration from from multiprocessing.semaphore_tracker import main;main(15)
/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/_multiprocessing_helpers.py:38: UserWarning:

[Errno 32] Broken pipe.  joblib will operate in serial mode

Any idea how to fix this and make joblib run in parallel? Thanks!
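For context, any joblib parallel call triggers this code path; the actual application code isn't shown, but a minimal sketch of the kind of invocation involved looks like this:

```python
from math import sqrt

from joblib import Parallel, delayed

# With the default process-based backend, joblib needs working POSIX
# semaphores; under uWSGI the import-time probe fails and joblib
# silently falls back to running these tasks serially.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(5))
print(results)  # [0.0, 1.0, 2.0, 3.0, 4.0]
```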


The following packages are installed in the docker container:

pytest==4.0.1
pytest-cov==2.6.0
flake8==3.6.0
Cython==0.29.3
numpy==1.16.1
pandas==0.24.0
scikit-learn==0.20.2
fancyimpute==0.4.2
scikit-garden==0.1.3
category_encoders==1.3.0
boto3==1.9.86
joblib==0.13.1
dash==0.37.0
dash-renderer==0.18.0
dash-core-components==0.43.1
dash-table==3.4.0
dash-html-components==0.13.5
dash-auth==1.3.2
Flask-Caching==1.4.0
plotly==3.6.1
APScheduler==3.5.3

EDIT

The problem is due to uWSGI, nginx, or supervisord. Missing permissions on /dev/shm are not the issue, as semaphores can be created when I run the Flask server directly. Find below the config files of the three services. Disclaimer: I'm a webserver noob, and the configs were born by copying and pasting from different blogs just to make it work :-D

So here's my uwsgi config:

[uwsgi]
module = prism_dash_frontend.__main__
callable = server

uid = nginx
gid = nginx

plugins = python3

socket = /tmp/uwsgi.sock
chown-socket = nginx:nginx
chmod-socket = 664

# set cheaper algorithm to use, if not set default will be used
cheaper-algo = spare

# minimum number of workers to keep at all times
cheaper = 3

# number of workers to spawn at startup
cheaper-initial = 5

# maximum number of workers that can be spawned
workers = 5

# how many workers should be spawned at a time
cheaper-step = 1
processes = 5

die-on-term = true
enable-threads = true

The nginx config:

# based on default config of nginx 1.12.1
# Define the user that will own and run the Nginx server
user nginx;
# Define the number of worker processes; recommended value is the number of
# cores that are being used by your server
# auto will default to number of vcpus/cores
worker_processes auto;

# altering default pid file location
pid /tmp/nginx.pid;

# turn off daemon mode to be watched by supervisord
daemon off;

# Enables the use of JIT for regular expressions to speed-up their processing.
pcre_jit on;

# Define the location on the file system of the error log, plus the minimum
# severity to log messages for
error_log /var/log/nginx/error.log warn;

# events block defines the parameters that affect connection processing.
events {
    # Define the maximum number of simultaneous connections that can be opened by a worker process
    worker_connections  1024;
}


# http block defines the parameters for how NGINX should handle HTTP web traffic
http {
    # Include the file defining the list of file types that are supported by NGINX
    include /etc/nginx/mime.types;
    # Define the default file type that is returned to the user
    default_type text/html;

    # Don't tell nginx version to clients.
    server_tokens off;

    # Specifies the maximum accepted body size of a client request, as
    # indicated by the request header Content-Length. If the stated content
    # length is greater than this size, then the client receives the HTTP
    # error code 413. Set to 0 to disable.
    client_max_body_size 0;

    # Define the format of log messages.
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                        '$status $body_bytes_sent "$http_referer" '
                        '"$http_user_agent" "$http_x_forwarded_for"';

    # Define the location of the log of access attempts to NGINX
    access_log /var/log/nginx/access.log  main;

    # Define the parameters to optimize the delivery of static content
    sendfile       on;
    tcp_nopush     on;
    tcp_nodelay    on;

    # Define the timeout value for keep-alive connections with the client
    keepalive_timeout  65;

    # Define the usage of the gzip compression algorithm to reduce the amount of data to transmit
    #gzip  on;

    # Include additional parameters for virtual host(s)/server(s)
    include /etc/nginx/conf.d/*.conf;
}

The supervisord config:

[supervisord]
nodaemon=true

[program:uwsgi]
command=/usr/bin/uwsgi --ini /etc/uwsgi/uwsgi.ini
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0

[program:nginx]
command=/usr/sbin/nginx
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0

2nd EDIT

After moving from Python 3.5 to 3.7.2, the nature of the error has slightly changed:

unable to load configuration from from multiprocessing.semaphore_tracker import main;main(15)
/usr/local/lib/python3.7/multiprocessing/semaphore_tracker.py:55: UserWarning:

semaphore_tracker: process died unexpectedly, relaunching.  Some semaphores might leak.

unable to load configuration from from multiprocessing.semaphore_tracker import main;main(15)

Help really appreciated, this is currently a big blocker for me :-/


3rd EDIT

HERE on my GitHub account is a minimal, complete, and verifiable example.

You can run it easily via make build followed by make run.

It will display the following log message:

unable to load configuration from from multiprocessing.semaphore_tracker import main;main(14)

and crash once you visit http://127.0.0.1:8080/ with the following error:

exception calling callback for <Future at 0x7fbc520c7eb8 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 309, in __call__
    self.parallel.dispatch_next()
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 731, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/usr/local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 510, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "/usr/local/lib/python3.7/site-packages/joblib/externals/loky/reusable_executor.py", line 151, in submit
    fn, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 1022, in submit
    raise self._flags.broken
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {EXIT(1), EXIT(1), EXIT(1), EXIT(1)}


Answer 1:


This was quite a rabbit hole.

The joblib issue tracker on GitHub has similar reports of joblib failing under uWSGI, but most concern the older multiprocessing backend. The newer loky backend was supposed to solve these issues.

There was a PR for the multiprocessing backend that solved this issue for uWSGI:

from math import sqrt
import joblib

joblib.Parallel(n_jobs=4, backend="multiprocessing")(joblib.delayed(sqrt)(i ** 2) for i in range(10))

But it sometimes failed randomly and fell back to the very issue the PR was meant to solve.

Further digging revealed that the current default backend, loky, parallelizes over processes (docs). These processes don't share memory, so they communicate over serialized, queued channels. This is probably why uWSGI fails while gunicorn works.

So I tried switching to threads instead of processes:

from math import sqrt
import joblib

joblib.Parallel(n_jobs=4, prefer="threads")(joblib.delayed(sqrt)(i ** 2) for i in range(10))

And it works :)
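To show how this fix fits into a Flask app like the asker's, here is a minimal sketch (the route name and payload are illustrative, not taken from the actual application):

```python
from math import sqrt

import joblib
from flask import Flask, jsonify

app = Flask(__name__)
server = app  # matches the `callable = server` line in the uWSGI config

@app.route("/compute")
def compute():
    # Thread-based workers need no semaphores and no forking,
    # so they work under uWSGI where process-based backends fail.
    results = joblib.Parallel(n_jobs=4, prefer="threads")(
        joblib.delayed(sqrt)(i ** 2) for i in range(10)
    )
    return jsonify(results)
```

Note that threads share the GIL, so this only gives real speedups when the parallelized work releases it (NumPy/scikit-learn native code typically does).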




Answer 2:


Well, I did find an answer to my problem. It solves the issue in the sense of being able to run a joblib-dependent library with supervisord and nginx in Docker. However, it is not very satisfying, so I won't accept my own answer; I'm posting it here in case other people have the same problem and need an okay-ish fix.

The solution is replacing uWSGI with gunicorn. Well, at least I now know whose fault it is. I would still appreciate an answer that solves the issue using uWSGI instead of gunicorn.
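For reference, the swap amounts to replacing the `[program:uwsgi]` block in the supervisord config with something like the following (the binary path, socket path, and worker count are illustrative, carried over from the uWSGI config above; nginx's upstream must be pointed at the new socket as well):

```ini
[program:gunicorn]
command=/usr/local/bin/gunicorn prism_dash_frontend.__main__:server --bind unix:/tmp/gunicorn.sock --workers 5
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0
```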




Answer 3:


It seems that semaphores are not available in your image: joblib checks for multiprocessing.Semaphore(), and only root has read/write permission on the shared memory in /dev/shm. Have a look at this question and this answer.

This is the output in one of my containers:

$ ls -ld /dev/shm
drwxrwxrwt 2 root root 40 Feb 19 15:23 /dev/shm

If you are running as non-root, you should change the permissions on /dev/shm. To set the correct permissions, you need to modify /etc/fstab in your Docker image:

none /dev/shm tmpfs rw,nosuid,nodev,noexec 0 0
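A quick way to verify from inside the container whether semaphores can actually be created is a small stdlib check mirroring the probe joblib performs at import time (this snippet is a sketch, not part of the original answer):

```python
import multiprocessing

def semaphores_usable():
    """Mirror joblib's import-time probe for working POSIX semaphores."""
    try:
        multiprocessing.Semaphore()
        return True
    except (ImportError, OSError):
        return False

if semaphores_usable():
    print("semaphores usable; joblib can run in parallel")
else:
    print("semaphore creation failed; joblib will fall back to serial mode")
```

Run it both directly and under uWSGI: if it succeeds in the first case but fails in the second, the problem is the server environment rather than the image's /dev/shm permissions.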


Source: https://stackoverflow.com/questions/54751297/uwsgi-and-joblib-semaphore-joblib-will-operate-in-serial-mode
