Database Backed Work Queue

问题

My situation ...

I have a set of workers that are scheduled to run periodically, each at different intervals, and would like to find a good implementation to manage their execution.

Example: Let's say I have a worker that goes to the store and buys me milk once a week. I would like to store this job and it's configuration in a mysql table. But, it seems like a really bad idea to poll the table (every second?) and see which jobs are ready to be put into the execution pipeline.

All of my workers are written in javascript, so I'm using node.js for execution and beanstalkd as a pipeline.

If new jobs (ie. scheduling a worker to run at a given time) are being created asynchronously and I need to store the job result and configuration persistently, how do I avoid polling a table?

Thanks!

回答1:

I agree that it seems inelegant, but given the way that computers work something *somewhere* is going to have to do polling of some kind in order to figure out which jobs to execute when. So, let's go over some of your options:

Poll the database table. This isn't a bad idea at all - it's probably the simplest option if you're storing the jobs in MySQL anyway. A rate of one query per second is nothing - give it a try and you'll notice that your system doesn't even feel it.

Some ideas to help you scale this to possibly hundreds of queries per second, or just keep system resource requirements down:
- Create a second table, 'job_pending', where you put the jobs that need to be executed within the next X seconds/minutes/hours.
- Run queries on your big table of all jobs only once in a longer while, then populate the small table which you query every shorter while.
- Remove jobs that were executed from the small table in order to keep it small.
- Use an index on your 'execute_time' (or whatever you call it) column.
If you have to scale even further, keep the main jobs table in the database, and use the second, smaller table I suggest, just put that table in RAM: either as a memory table in the DB engine, or in a Queue of some kind in your program. Query the queue at extremely short intervals if you have too - it'll take some extreme use cases to cause any performance issues here.

The main issue with this option is that you'll have to keep track of jobs that were in memory but didn't execute, e.g. due to a system crash - more coding for you...
Create a thread for each of a bunch of jobs (say, all jobs that need to execute in the next minute), and call thread.sleep(millis_until_execution_time) (or whatever, I'm not that familiar with node.js).

This option has the same problem as no. 2 - where you have to keep track job execution for crash recovery. It's also the most wasteful imo - every sleeping job thread still takes system resources.

There may be additional options of course - I hope that others answer with more ideas.

Just realize that polling the DB every second isn't a bad idea at all. It's the most straightforward way imo (remember KISS), and at this rate you shouldn't have performance issues so avoid premature optimizations.

回答2:

Why not have a Job object in node.js that's saved to the database.

var Job = {
   id: long,
   task: String,
   configuration: JSON,
   dueDate: Date,
   finished: bit
};

I would suggest you only store the id in RAM and leave all the other Job data in the database. When your timeout function finally runs it only needs to know the .id to get the other data.

var job = createJob(...); // create from async data somewhere.
job.save(); // save the job.
var id = job.id // only store the id in RAM
// ask the job to be run in the future.
setTimeout(Date.now - job.dueDate, function() {
    // load the job when you want to run it
    db.load(id, function(job) {
        // run it.
        run(job);
        // mark as finished
        job.finished = true;
        // save your finished = true state
        job.save();
    });
});
// remove job from RAM now.
job = null;

If the server ever crashes all you have to is query all jobs that have [finished=false], load them into RAM and start the setTimeouts again.

If anything goes wrong you should be able to restart cleanly like such:

db.find("job", { finished: false }, function(jobs) {
    each(jobs, function(job) {
         var id = job.id;
         setTimeout(Date.now - job.dueDate, function() {
             // load the job when you want to run it
             db.load(id, function(job) {
                 // run it.
                 run(job);
                 // mark as finished
                 job.finished = true;
                 // save your finished = true state
                 job.save();
             });
         });
         job = null;
    });
});

来源：https://stackoverflow.com/questions/5603017/database-backed-work-queue

标签

javascript

architecture

node.js

scale

beanstalkd