Persistent DB Connection in Django/WSGI application


Question


I want to keep a persistent connection open to a third party legacy database in a django powered web application.

I want to keep a connection between the web app and the legacy database open, since creating a new connection is very slow for this special DB.

It is not like usual connection pooling since I need to store the connection per web user. User "Foo" needs its own connection between web server and legacy DB.

Up to now I use Apache and WSGI, but I could change if another solution fits better.

Up to now I use Django. Here I could change, too, but the pain would be bigger, since there is already a lot of code which would need to be integrated again.

Up to now I use Python. I guess Node.js would fit better here, but the pain to change is too high.

Of course some kind of timeout would be needed: if there is no HTTP request from user "Foo" for N minutes, the persistent connection would need to get shut down.

How could this be solved?

Update

I call it a DB, but it is not a DB that is configured via settings.DATABASES. It is a strange, legacy, not widespread DB-like system I need to integrate.

If there are 50 people using the web app at this moment, then I need 50 persistent connections: one for each user.

Code to connect to DB

I could execute this line in every request:

strangedb_connection = strangedb.connect(request.user.username)

But this operation is slow. Using the connection is fast.

Of course the strangedb_connection can't be serialized and can't be stored in a session :-)


Answer 1:


worker daemon managing the connection

Your picture currently looks like:

user  ----------->  webserver  <--------[1]-->  3rd party DB

connection [1] is expensive.

You could solve this with:

user ---->  webserver  <--->  task queue[1]  <--->  worker daemon  <--[2]-> 3rd party DB

[1] the task queue can be e.g. Celery, backed by Redis or RabbitMQ.
[2] the worker daemon keeps the connection open.

A worker daemon would make the connection to the 3rd party database and keep it open, so each request would no longer have to pay the connection cost. The task queue provides the inter-process communication: it dispatches work to the daemon, which runs the queries against the 3rd party DB. The webserver should stay as light as possible in terms of processing and let the workers do the expensive tasks.
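
For illustration, a minimal sketch of the worker-daemon side, using Celery with a Redis broker. The broker URL, the run_query task and the single-process worker setup are assumptions made for the sketch; strangedb is the hypothetical legacy client from the question.

# tasks.py

import strangedb  # the hypothetical legacy client
from celery import Celery

app = Celery('legacy', broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')

# One open connection per user, kept for the life of the worker process
CONNECTIONS = {}

@app.task
def run_query(username, query):
    conn = CONNECTIONS.get(username)
    if conn is None:
        conn = strangedb.connect(username)  # slow, but paid only once
        CONNECTIONS[username] = conn
    return conn.query(query)  # fast

Run the worker as a single process (e.g. celery -A tasks worker --pool=solo) so every task for a given user sees the same CONNECTIONS cache; the Django view then calls run_query.delay(request.user.username, query).get() instead of connecting itself.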

preloading with apache + modwsgi

You can actually preload and have the expensive connection made before the first request. This is done with the WSGIImportScript configuration directive. I don't remember off the top of my head whether a preload + forking configuration means each request will already have the connection opened and share it; but since you have most of the code, this could be an easy experiment.
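
A sketch of what that experiment could look like; the paths, the process group name and the preload module are illustrative. Note that preloading can only warm up module-level state (imports, shared initialisation) at process startup; the per-user connections themselves can only be made once each user shows up.

# httpd.conf

WSGIDaemonProcess myapp processes=1 threads=50
WSGIImportScript /path/to/django/root/preload.py \
    process-group=myapp application-group=%{GLOBAL}

# preload.py -- runs in the daemon process at startup, so expensive
# imports and initialisation happen before request #1
import strangedb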

preloading with uwsgi

uwsgi supports preloading too. This is done with the import directive.
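
A sketch of the equivalent uwsgi configuration; the module names are illustrative.

# uwsgi.ini

[uwsgi]
module = myapp.wsgi:application
processes = 1
enable-threads = true
threads = 50
# imported once per worker at startup, before any request arrives
import = myapp.preload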




Answer 2:


As far as I can tell, you have ruled out most (all?) of the common solutions to this type of problem:

  • Store connection in a dictionary ... need N workers and can't guarantee which request goes to which worker
  • Store data in cache ... too much data
  • Store connection info in cache ... connection is not serialisable

As far as I can see there is really only one 'meta' solution to this: use @Gahbu's suggestion of a dictionary and guarantee that the requests for a given user go to the same worker. I.e. figure out a way to map from the User object to a given worker in the same way every time (maybe hash their name and MOD by the number of workers? a sketch follows below).

This solution would not make the most of your N workers if the currently active users all mapped to the same worker, but if all users are equally likely to be active at the same time, the work should be spread equally. (If they are not all equally likely, the mapping may be able to account for that.)
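
A sketch of that "hash their name and MOD" mapping. It uses a stable hash, because Python's built-in hash() has been randomised per process since 3.3 and would route the same user differently after every restart.

# routing.py

import hashlib

def worker_index(username, n_workers):
    # Same username -> same index, in every process, on every host
    digest = hashlib.md5(username.encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'big') % n_workers

# worker_index('Foo', 4) is constant, so user "Foo" always lands on
# the same worker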

The two possible ways I can think of doing this would be:

1. Write a custom request allocator

I'm not really familiar with the Apache/WSGI interfacing internals, but it might be possible to replace the component within your Apache server that dispatches the HTTP requests to the workers with some custom logic that always dispatches requests for the same user to the same process.

2. Run a load-balancer/proxy in front of N single threaded workers

I'm not sure if you can use a ready-to-go package here or not, but the concept would be:

  • Run a proxy that implements this 'bind the User to an index' logic
  • Have the proxy then forward the requests to one of N copies of your Apache/WSGI webserver, each of which has a single worker.

NB: I came across this second idea here: https://github.com/benoitc/gunicorn/issues/183
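
For what it's worth, nginx's upstream hash directive can express this "bind the User to an index" logic out of the box. A sketch, assuming the Django session cookie identifies the user and the single-worker backends listen on ports 8001-8003:

# nginx.conf

upstream single_worker_backends {
    # same cookie value -> same backend, consistently
    hash $cookie_sessionid consistent;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    server 127.0.0.1:8003;
}

server {
    listen 80;
    location / {
        proxy_pass http://single_worker_backends;
    }
}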

Summary

For both options the implementation in your existing application is pretty simple. Your application just changes to use a dictionary for storing the persistent connection (creating one if there isn't one already). Testing a single instance is the same in dev as in production. In production, the instances themselves are none the wiser that they are always asked about the same users.

I like Option 2 here for the following reasons:

  • Maybe there is an existing server package that allows you to define this proxy trickiness
  • If not, creating a custom proxy application to sit in front of your current application might not be too hard, especially considering the restrictions you are already under once the request reaches the strangedb service



Answer 3:


Rather than having multiple worker processes, you can use the WSGIDaemonProcess directive to have multiple worker threads which all run in a single process. That way, all the threads can share the same DB connection mapping.

With something like this in your apache config...

# mydomain.com.conf

<VirtualHost *:80>

    ServerName mydomain.com
    ServerAdmin webmaster@mydomain.com

    <Directory />
        Require all granted
    </Directory>

    WSGIDaemonProcess myapp processes=1 threads=50 python-path=/path/to/django/root display-name=%{GROUP}
    WSGIProcessGroup myapp
    WSGIScriptAlias / /path/to/django/root/myapp/wsgi.py

</VirtualHost>

...you can then use something as simple as this in your Django app...

# views.py

import threading
from django.http import HttpResponse

# A global variable to hold the connection mappings, shared by all
# threads in this single daemon process
DB_CONNECTIONS = {}
DB_CONNECTIONS_LOCK = threading.Lock()

# Fake up this "strangedb" module
class strangedb(object):

    class connection(object):
        def query(self, *args):
            return 'Query results for %r' % args

    @classmethod
    def connect(cls, *args):
        return cls.connection()


# View for homepage
def home(request, username='bob'):

    # Remember thread ID
    thread_info = 'Thread ID = %r' % threading.get_ident()

    # Connect only if we're not already connected; the lock prevents two
    # concurrent requests for the same user from both paying the slow
    # connection cost
    with DB_CONNECTIONS_LOCK:
        if username in DB_CONNECTIONS:
            strangedb_connection = DB_CONNECTIONS[username]
            db_info = 'We reused an existing connection for %r' % username
        else:
            strangedb_connection = strangedb.connect(username)
            DB_CONNECTIONS[username] = strangedb_connection
            db_info = 'We made a connection for %r' % username

    # Fake up some query
    results = strangedb_connection.query('SELECT * FROM my_table')

    # Fake up an HTTP response
    text = '%s\n%s\n%s\n' % (thread_info, db_info, results)
    return HttpResponse(text, content_type='text/plain')

...which, on the first hit, produces...

Thread ID = 140597557241600
We made a connection for 'bob'
Query results for 'SELECT * FROM my_table'

...and, on the second...

Thread ID = 140597145999104
We reused an existing connection for 'bob'
Query results for 'SELECT * FROM my_table'

Obviously, you'll need to add something to tear down the DB connections when they're no longer required, but it's tough to know the best way to do that without more info about how your app is supposed to work.
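
One hedged possibility for the teardown, reusing the DB_CONNECTIONS mapping and lock from the example above: record a last-used timestamp per user in the view, and let a background thread reap idle connections. This assumes the strangedb connection objects have some close() equivalent.

# reaper.py

import threading
import time

from views import DB_CONNECTIONS, DB_CONNECTIONS_LOCK

IDLE_TIMEOUT = 10 * 60  # seconds; the "N minutes" from the question
LAST_USED = {}          # username -> time.time() of their last request

def reap_idle_connections():
    while True:
        time.sleep(60)
        cutoff = time.time() - IDLE_TIMEOUT
        with DB_CONNECTIONS_LOCK:
            for username in list(DB_CONNECTIONS):
                if LAST_USED.get(username, 0) < cutoff:
                    DB_CONNECTIONS.pop(username).close()  # assumed API
                    LAST_USED.pop(username, None)

# started once at import time, alongside the worker threads
threading.Thread(target=reap_idle_connections, daemon=True).start()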

Update #1: Regarding I/O multiplexing vs multithreading

I worked with threads twice in my life and each time it was a nightmare. A lot of time was wasted on debugging non-reproducible problems. I think an event-driven, non-blocking I/O architecture might be more solid.

A solution using I/O multiplexing might be better, but would be more complex, and would also require your "strangedb" library to support it, i.e. it would have to be able to handle EAGAIN/EWOULDBLOCK and have the capacity to retry the system call when necessary.

Multithreading in Python is far less dangerous than in most other languages, due to Python's GIL, which, in essence, makes the execution of each individual bytecode instruction atomic.

In practice, threads only run in parallel when the underlying C code uses the Py_BEGIN_ALLOW_THREADS macro, which, together with its counterpart Py_END_ALLOW_THREADS, is typically wrapped around system calls and CPU-intensive operations.

The upside of this is that it's almost impossible to have a thread collision in Python code, although the downside is that it won't always make optimal use of multiple CPU cores on a single machine.

The reason I suggest the above solution is that it's relatively simple, and would require minimal code changes, but there may be a better option if you could elaborate more on your "strangedb" library. It seems rather odd to have a DB which requires a separate network connection per concurrent user.

Update #2: Regarding multiprocessing vs multithreading

...the GIL limitations around threading seem to be a bit of an issue. Isn't this one of the reasons why the trend is to use separate processes instead?

That's quite possibly the main reason why Python's multiprocessing module exists, i.e. to provide concurrent execution of Python bytecode across multiple CPU cores, although there is an undocumented ThreadPool class in that module, which uses threads rather than processes.
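
For example, the undocumented class mirrors the multiprocessing.Pool API, just with threads:

from multiprocessing.pool import ThreadPool

pool = ThreadPool(4)                      # 4 threads, not 4 processes
print(pool.map(len, ['a', 'bb', 'ccc']))  # prints [1, 2, 3]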

The "GIL limitations" would certainly be problematic in cases where you really need to exploit every single CPU cycle on every CPU core, e.g. if you were writing a computer game which had to render 60 frames per second in high-definition.

Most web-based services, however, are likely to spend most of their time waiting for something to happen, e.g. network I/O or disk I/O, which Python threads will allow to occur concurrently.
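
A quick demonstration of that point: ten one-second "waits" finish in about one second of wall-clock time, because time.sleep(), like a blocking socket read, releases the GIL.

# gil_io_demo.py

import threading
import time

def fake_io():
    time.sleep(1)  # stands in for network or disk I/O

start = time.time()
threads = [threading.Thread(target=fake_io) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('elapsed: %.1fs' % (time.time() - start))  # prints ~1.0s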

Ultimately, it's a trade-off between performance and maintainability, and given that hardware is usually much cheaper than a developer's time, favoring maintainability over performance is usually more cost-effective.

Frankly, the moment you decide to use a virtual machine language, such as Python, instead of a language which compiles into real machine code, such as C, you're already saying that you're prepared to sacrifice some performance in exchange for convenience.

See also The C10K problem for a comparison of techniques for scaling web-based services.




Answer 4:


One simple way to do this would be to have another Python process manage the pool of persistent connections (one for each user, with a timeout when needed). Django and this other Python process can then communicate over something fast like ZeroMQ; see interprocess communication in Python.
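
A rough sketch of that manager process using pyzmq; the port and the JSON wire format are invented for illustration, and strangedb is the hypothetical legacy client from the question.

# connection_manager.py

import zmq
import strangedb

CONNECTIONS = {}  # username -> open connection, kept across requests

def main():
    socket = zmq.Context().socket(zmq.REP)
    socket.bind('tcp://127.0.0.1:5555')
    while True:
        username, query = socket.recv_json()
        conn = CONNECTIONS.get(username)
        if conn is None:
            conn = CONNECTIONS[username] = strangedb.connect(username)
        socket.send_json(conn.query(query))

if __name__ == '__main__':
    main()

Each Django worker would then hold a REQ socket connected to tcp://127.0.0.1:5555 and do socket.send_json([username, query]) followed by result = socket.recv_json().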



Source: https://stackoverflow.com/questions/34172964/persistent-db-connection-in-django-wsgi-application
