Does MongoDB journaling guarantee durability?

刺人心 2020-12-10 04:48

Even if journaling is on, is there still a chance to lose writes in MongoDB?

\"By default, the greatest extent of lost writes, i.e., those not made to the journal,

5 Answers
  •  清歌不尽
    2020-12-10 05:10

    Maybe. Yes, it waits for the data to be written, but according to the docs 'there is a window between journal commits when the write operation is not fully durable', whatever that is. I couldn't find out what they refer to.

    I'm leaving the edited answer here, but I reversed myself back-and-forth, so it's a bit irritating:


    This is a bit tricky, because there are a lot of levers you can pull:

    Your MongoDB setup

    Assuming that journaling is activated (the default on 64-bit builds), the journal is committed at regular intervals. The default journalCommitInterval is 100ms if the journal and the data files are on the same block device, or 30ms if they aren't (so it's preferable to keep the journal on a separate disk).

    You can also change the journalCommitInterval to as little as 2ms, but it will increase the number of write operations and reduce overall write performance.
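
    For illustration only (this sketch is not from the original answer): in that era the commit interval could also be changed at runtime via the setParameter command. The snippet uses the modern mongocxx driver and assumes an MMAPv1-era mongod that still accepts the journalCommitInterval parameter:

    // Hypothetical sketch: lower journalCommitInterval to 2ms at runtime.
    #include <bsoncxx/builder/basic/document.hpp>
    #include <mongocxx/client.hpp>
    #include <mongocxx/instance.hpp>
    #include <mongocxx/uri.hpp>

    int main() {
        using bsoncxx::builder::basic::kvp;
        using bsoncxx::builder::basic::make_document;

        mongocxx::instance inst{};  // driver state, one instance per process
        mongocxx::client client{mongocxx::uri{"mongodb://localhost:27017"}};

        // Equivalent to: db.adminCommand({ setParameter: 1, journalCommitInterval: 2 })
        // Lowers the commit interval to 2ms, trading write throughput for durability.
        client["admin"].run_command(
            make_document(kvp("setParameter", 1), kvp("journalCommitInterval", 2)));
    }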

    The Write Concern

    You need to specify a write concern that tells the driver and the database to wait until the data is written to disk (the j option). However, this won't wait until the data has actually been written to the disk, because with the default setup that could take 100ms in a bad-case scenario.

    So, at the very best (with journalCommitInterval lowered to 2ms), there's still a 2ms window in which data can get lost. For a number of applications, however, even that isn't good enough.
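
    As a hedged illustration of the write concern itself (not part of the original answer), this is roughly how a journal-acknowledged write could be issued with the modern mongocxx driver; the answer was written against the older getlasterror interface, so take the exact calls as assumptions:

    // Sketch: ask the server to acknowledge the write only after the journal commit (j: true).
    #include <bsoncxx/builder/basic/document.hpp>
    #include <mongocxx/client.hpp>
    #include <mongocxx/instance.hpp>
    #include <mongocxx/uri.hpp>
    #include <mongocxx/write_concern.hpp>

    int main() {
        using bsoncxx::builder::basic::kvp;
        using bsoncxx::builder::basic::make_document;

        mongocxx::instance inst{};
        mongocxx::client client{mongocxx::uri{"mongodb://localhost:27017"}};

        mongocxx::write_concern wc{};
        wc.journal(true);  // j: true -- acknowledge only after the journal commit

        auto coll = client["test"]["events"];
        coll.write_concern(wc);

        // The call blocks until the write has reached the on-disk journal.
        coll.insert_one(make_document(kvp("msg", "hello")));
    }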

    The fsync command forces a disk flush of all data files, but that's unnecessary if you use journaling, and it's inefficient.

    Real-Life Durability

    Even if you were to journal every write, what is it good for if the datacenter administrator has a bad day and uses a chainsaw on your hardware, or the hardware simply disintegrates itself?

    Redundant storage, not at the block-device level like RAID but at a much higher level, is a better option for many scenarios: keep the data in different locations, or at least on different machines, using a replica set, and use the w:majority write concern with journaling enabled (though journaling will only apply on the primary). Use RAID on the individual machines to increase your luck.

    This offers the best tradeoff of performance, durability and consistency. Also, it allows you to adjust the write concern for every write and has good availability. If the data is queued for the next fsync on three different machines, it might still be 30ms to the next journal commit on any of the machines (worst case), but the chance of three machines going down within the 30ms interval is probably a millionfold lower than the chainsaw-massacre-admin scenario.
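
    A sketch of that recommendation, again with the modern mongocxx driver and a hypothetical three-member replica set named rs0 (the exact calls are assumptions relative to the drivers of the time):

    // Sketch: w: "majority" plus j: true against a replica set.
    #include <bsoncxx/builder/basic/document.hpp>
    #include <mongocxx/client.hpp>
    #include <mongocxx/instance.hpp>
    #include <mongocxx/uri.hpp>
    #include <mongocxx/write_concern.hpp>

    int main() {
        using bsoncxx::builder::basic::kvp;
        using bsoncxx::builder::basic::make_document;

        mongocxx::instance inst{};
        // Hypothetical three-member replica set "rs0".
        mongocxx::client client{mongocxx::uri{
            "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0"}};

        mongocxx::write_concern wc{};
        wc.acknowledge_level(mongocxx::write_concern::level::k_majority);  // w: "majority"
        wc.journal(true);  // journal the write on the primary

        auto coll = client["test"]["events"];
        coll.write_concern(wc);
        coll.insert_one(make_document(kvp("msg", "replicated and journaled")));
    }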

    Evidence

    TL;DR: I think my answer above is correct.

    The documentation can be a little irritating, especially with regard to wtimeout, so I checked the source. I'm not an expert on the mongo source, so take this with a grain of salt:

    In write_concern.cpp, we find (edited for brevity):

    if ( cmdObj["j"].trueValue() ) {
        if( !getDur().awaitCommit() ) {
            // --journal is off
            result->append("jnote", "journaling not enabled on this server");
        } // ...
    }
    else if ( cmdObj["fsync"].trueValue() ) {
        if( !getDur().awaitCommit() ) {
            // if get here, not running with --journal
            log() << "fsync from getlasterror" << endl;
            result->append( "fsyncFiles" , MemoryMappedFile::flushAll( true ) );
        }
    } // ...

    Note the call to MemoryMappedFile::flushAll( true ) if fsync is set. This call is clearly not in the first branch. Otherwise, durability is handled on a separate thread (relevant files are prefixed dur_).

    That explains what wtimeout is for: it refers to the time waiting for slaves, and has nothing to do with I/O or fsync on the server.
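
    To make that reading concrete, here is a minimal, assumed sketch (modern mongocxx driver) of a write concern where wtimeout bounds only the wait for replica acknowledgement:

    #include <chrono>
    #include <mongocxx/client.hpp>
    #include <mongocxx/instance.hpp>
    #include <mongocxx/uri.hpp>
    #include <mongocxx/write_concern.hpp>

    int main() {
        mongocxx::instance inst{};
        mongocxx::client client{mongocxx::uri{"mongodb://localhost:27017/?replicaSet=rs0"}};

        mongocxx::write_concern wc{};
        wc.majority(std::chrono::milliseconds{5000});  // w: "majority", wtimeout: 5000ms

        // The 5000ms bound applies only to waiting for replica acknowledgement;
        // it does not trigger or time-limit any fsync/journal I/O on the server.
        client["test"]["events"].write_concern(wc);
    }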
