Redis AOF fsync (ALWAYS) vs. LSM tree

Question

My understanding of log-structured merge trees (LSM trees) is that they take advantage of the fact that appending to disk is very fast (since it requires no seeks) by just appending each update to a write-ahead log and returning to the client. My understanding was that this still provides immediate persistence while being extremely fast.

Redis, which I don't think uses LSM trees, seems to have a mode where you can use AOF and fsync on every write (see https://redis.io/topics/latency). The documentation says:

AOF + fsync always: this is very slow, you should use it only if you know what you are doing.
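For reference, a minimal sketch of the redis.conf lines that select between these modes. The directive names are the standard ones; the comments paraphrase the trade-off rather than quote the documentation.

```
# Turn on the append-only file.
appendonly yes

# Pick exactly one fsync policy.
# always:   fsync after every write command before replying (the slow mode quoted above)
# everysec: fsync about once per second in the background
# no:       never call fsync explicitly, let the OS flush when it wants
appendfsync always
```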

I'm confused about why this would be very slow, since in principle you are still only appending to a file on every update, just as LSM-tree databases like Cassandra do.
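To make the distinction concrete: appending to a file and forcing that append onto the device are two separate steps. A plain write usually lands in the OS page cache and returns immediately; only the explicit sync waits for the disk. Below is a minimal Java sketch of that difference. The class and method names (AppendVsFsync, appendRecords, run) are made up for illustration, and FileChannel.force(true) is used here as a rough stand-in for fsync, not as a copy of what Redis does internally.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; not taken from Redis or Cassandra.
class AppendVsFsync {
    static final byte[] RECORD =
            "Wake up, Neo. We have updated our privacy policy.\n".getBytes(StandardCharsets.UTF_8);

    /** Appends count records, optionally syncing after each one; returns elapsed nanoseconds. */
    static long appendRecords(Path file, int count, boolean syncEach) {
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < count; i++) {
                ByteBuffer buf = ByteBuffer.wrap(RECORD);
                while (buf.hasRemaining()) {
                    ch.write(buf); // typically only reaches the OS page cache
                }
                if (syncEach) {
                    ch.force(true); // wait until data and metadata are on the device
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return System.nanoTime() - start;
    }

    /** Runs both modes on temp files; returns {bufferedNanos, syncedNanos, bufferedBytes, syncedBytes}. */
    static long[] run(int count) {
        try {
            Path buffered = Files.createTempFile("aof-buffered", ".log");
            Path synced = Files.createTempFile("aof-synced", ".log");
            long tBuffered = appendRecords(buffered, count, false);
            long tSynced = appendRecords(synced, count, true);
            long[] result = {tBuffered, tSynced, Files.size(buffered), Files.size(synced)};
            Files.delete(buffered);
            Files.delete(synced);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int count = 200;
        long[] r = run(count);
        System.out.printf("append only:    %d us per record%n", r[0] / count / 1000);
        System.out.printf("append + force: %d us per record%n", r[1] / count / 1000);
    }
}
```

Both files end up with identical contents; the only difference is how long each append takes to return. The actual timings depend heavily on the device and file system, so none are shown here.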

Answer 1:

An LSM tree is an AOF that you actually want to read sometimes. You do some extra work up front so that you can read it faster later. Redis is designed so that you never read the AOF, or only in a special case. Cassandra, on the other hand, often reads it to serve requests.

And what Redis calls slow is actually very, very fast for a database like Cassandra.

============================ UPDATE

It turns out I jumped to conclusions too early. From a design standpoint everything above is true, but the implementations differ a great deal. Although Cassandra claims absolute durability, it does not fsync on each transaction, and there is no way to force it to do so (although each transaction could be fsynced). The best I could do was 'fsync in batch mode, at least 1 ms after the previous fsync'. This means that in the 4-thread benchmark I was using, it was doing 4 writes per fsync while the threads waited for the fsync to complete. Redis, on the other hand, did an fsync on every write, so 4 times as often. With more threads and more partitions of the table, Cassandra could win by an even bigger margin. But note that the use case you described is not typical, and the other architectural differences (Cassandra is good at partitioning; Redis is good at counters, Lua scripting, and so on) still apply.
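The "several writes per fsync" behaviour is usually called group commit. Here is a minimal Java sketch of the idea, under my own assumptions: writer threads queue their records and block, and a single background thread writes whatever has piled up and issues one sync for the whole batch. GroupCommitLog, Pending, and demo are hypothetical names; this is not Cassandra's commit log code, and it does not model the 1 ms minimum wait between fsyncs.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of group commit; not Cassandra's actual commit log code.
class GroupCommitLog implements AutoCloseable {
    private static final class Pending {
        final byte[] data;
        final CompletableFuture<Void> synced = new CompletableFuture<>();
        Pending(byte[] data) { this.data = data; }
    }

    private final FileChannel channel;
    private final BlockingQueue<Pending> queue = new LinkedBlockingQueue<>();
    private final AtomicLong fsyncs = new AtomicLong();
    private final Thread syncer = new Thread(this::syncLoop, "group-commit-syncer");
    private volatile boolean running = true;

    GroupCommitLog(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        syncer.start();
    }

    /** Queues one record and blocks until a sync has covered it. */
    void append(String record) {
        Pending p = new Pending(record.getBytes(StandardCharsets.UTF_8));
        queue.add(p);
        p.synced.join();
    }

    long fsyncCount() { return fsyncs.get(); }

    private void syncLoop() {
        List<Pending> batch = new ArrayList<>();
        try {
            while (running || !queue.isEmpty()) {
                Pending first = queue.poll(10, TimeUnit.MILLISECONDS);
                if (first == null) continue;
                batch.add(first);
                queue.drainTo(batch); // everything that queued up during the previous sync
                for (Pending p : batch) {
                    ByteBuffer buf = ByteBuffer.wrap(p.data);
                    while (buf.hasRemaining()) channel.write(buf);
                }
                channel.force(true); // one sync for the whole batch
                fsyncs.incrementAndGet();
                for (Pending p : batch) p.synced.complete(null);
                batch.clear();
            }
        } catch (IOException | InterruptedException e) {
            queue.drainTo(batch); // fail every waiter instead of leaving it blocked
            for (Pending p : batch) p.synced.completeExceptionally(e);
        }
    }

    @Override
    public void close() throws IOException, InterruptedException {
        running = false;
        syncer.join();
        channel.close();
    }

    /** Runs the given number of writer threads; returns {records written, syncs issued}. */
    static long[] demo(int threads, int perThread) {
        try {
            Path file = Files.createTempFile("group-commit", ".log");
            file.toFile().deleteOnExit();
            try (GroupCommitLog log = new GroupCommitLog(file)) {
                List<Thread> writers = new ArrayList<>();
                for (int t = 0; t < threads; t++) {
                    Thread w = new Thread(() -> {
                        for (int i = 0; i < perThread; i++) log.append("record\n");
                    });
                    writers.add(w);
                    w.start();
                }
                for (Thread w : writers) w.join();
                return new long[] {(long) threads * perThread, log.fsyncCount()};
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = demo(4, 50);
        System.out.println(r[0] + " records, " + r[1] + " fsyncs");
    }
}
```

With 4 blocking writers the number of syncs ends up somewhere between one per record and one per four records, depending on timing. That is consistent with the roughly 4x gap between the two "fsync on write" HDD results below (about 195 vs. 49 ops/s), though those come from two different systems, so treat the comparison loosely.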

Numbers:

Redis command: set(KEY + (tstate.i++), TEXT);

Cassandra command: execute("insert into test.test (id,data) values (?,?)", state.i++, TEXT)

Where TEXT = "Wake up, Neo. We have updated our privacy policy."

Redis, fsync every sec, HDD

Benchmark              (address)   Mode  Cnt      Score      Error  Units
LettuceThreads.shared  localhost  thrpt   15  97535.900 ± 2188.862  ops/s

  97535.900 ±(99.9%) 2188.862 ops/s [Average]
  (min, avg, max) = (94460.868, 97535.900, 100983.563), stdev = 2047.463
  CI (99.9%): [95347.038, 99724.761] (assumes normal distribution)

Redis, fsync every write, HDD

Benchmark              (address)   Mode  Cnt   Score   Error  Units
LettuceThreads.shared  localhost  thrpt   15  48.862 ± 2.237  ops/s

  48.862 ±(99.9%) 2.237 ops/s [Average]
  (min, avg, max) = (47.912, 48.862, 56.351), stdev = 2.092
  CI (99.9%): [46.625, 51.098] (assumes normal distribution)

Redis, fsync every write, NVMe (Samsung 960 PRO 1 TB)

Benchmark              (address)   Mode  Cnt    Score   Error  Units
LettuceThreads.shared     remote  thrpt   15  449.248 ± 6.475  ops/s

  449.248 ±(99.9%) 6.475 ops/s [Average]
  (min, avg, max) = (441.206, 449.248, 462.817), stdev = 6.057
  CI (99.9%): [442.773, 455.724] (assumes normal distribution)

Cassandra, fsync every sec, HDD

Benchmark                  Mode  Cnt      Score     Error  Units
CassandraBenchMain.write  thrpt   15  12016.250 ± 601.811  ops/s

  12016.250 ±(99.9%) 601.811 ops/s [Average]
  (min, avg, max) = (10237.077, 12016.250, 12496.275), stdev = 562.935
  CI (99.9%): [11414.439, 12618.062] (assumes normal distribution)

Cassandra, fsync every batch, but wait at least 1ms, HDD

Benchmark                  Mode  Cnt    Score   Error  Units
CassandraBenchMain.write  thrpt   15  195.331 ± 3.695  ops/s

  195.331 ±(99.9%) 3.695 ops/s [Average]
  (min, avg, max) = (186.963, 195.331, 199.312), stdev = 3.456
  CI (99.9%): [191.637, 199.026] (assumes normal distribution)

Answer 2:

This is kind of comparing apples and oranges; they solve different problems.

Redis data should fit in memory, and Redis is very fast. If the AOF is configured to fsync every second, it is still going to be very fast, and with some clever client-side replication it would give you essentially the same durability as Cassandra.

Cassandra is designed more for many terabytes or petabytes spread across multiple nodes and data centers. You can expect to have terabytes per node, where you can't just fsync the entire data set, and fitting the entire set in a commit log would make replays too long. So with LSM trees you sync only the changes and push the cost off to reads and to compactions that run outside the read/write path.

"My understanding was that this still provides immediate persistence, while still being extremely fast."

The commit log by default uses periodic fsync, not per-request fsync, so a write is just a memory append, which is why it takes ~10 microseconds. You need to use batch mode (or group mode, for better latency/throughput) to get guaranteed durability at the per-replica level, which brings the write time up to ~10 milliseconds (very hand-wavy). In practice this is handled with a higher replication factor, including cross-DC replication, but you can still lose data if an entire replica set goes down in an instant (it is not meteor-proof). So a single-host cluster with default settings is not durable. Clusters of fewer than three nodes, or RF<3, are very strongly recommended against, as they are simply not safe. I would say high availability is Cassandra's selling point, not its performance, and that availability is partly what provides some of its durability.
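For reference, a sketch of the cassandra.yaml settings behind these modes. The option names follow the older (pre-4.1) spelling as far as I know; newer releases rename some of them, and group mode is not available in older versions, so check the cassandra.yaml shipped with your version. The values shown are illustrative.

```yaml
# Default: acknowledge writes right away, fsync the commit log in the background.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Alternative: do not acknowledge a write until the commit log has been fsynced.
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 2

# Alternative: like batch, but wait for a window so that more writes share one fsync.
# commitlog_sync: group
# commitlog_sync_group_window_in_ms: 1000
```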

Source: https://stackoverflow.com/questions/50478674/redis-aof-fsync-always-vs-lsm-tree

Tags

cassandra

Redis

wal