Some things that should also be mentioned about each approach:
B-trees
- The read/write operations are supposed to be logarithmic
O(logn). However, a single database write can lead to multiple writes in the storage system. For example, when a node is full, it has to be split and that means that there will be 2 writes for the 2 new nodes and 1 additional write for updating the parent node. You can see how that could increase if the parent node was also full.
- Usually, B-trees are stores in such a way that each node has the size of a page. This creates a phenomenon called write amplification, where even if a single byte needs to be updated, a whole page is written.
- Writes are usually random (not sequential), thus slower especially for magnetic disks.
SSTables
- SSTables are usually used in the following approach. There is an in-memory structure, called memtable, as you described. Every once in a while, this structure is flushed to disk to an SSTable. As a result, all the writes go to the memtable, but the reads might not be in the current memtable, in which case they are searched in the persisted SSTables.
- As a result, writes are
O(logn). However, always bear in mind that they are done in memory, so they should be orders of magnitude faster than the logarithmic operations in disk of B-trees. For the sake of completeness, we should mention that writes are also written to a write-ahead log for crash recovery. But, given that these are all sequential writes, they are expected to be much more efficient than the random writes of B-trees.
- When served from memory (from the memtable), reads are expected to be much faster as well. But, when there's need to look in the older, disk-based SSTables, reads can potentially become quite slower than B-trees. There are several optimisations around that, such as use of bloom filters, to check whether an SSTable contains a value without performing disk reads.
- As you mentioned, there's also a background process, called compaction, used to merge SSTables. This helps remove deleted values and prevent fragmentation, but it can cause significant write load, affecting the write throughput of the incoming operations.
As it becomes evident, a comparison between these 2 approaches is much more complicated. In an extremely simplified attempt to provide a concrete comparison, I think we could say that:
- SSTables provide a much better write throughput than B-trees. However, they are expected to have less stable behaviour, because of ongoing compactions. An example of this can be seen in this benchmark comparison.
- B-trees are usually preferred for use-cases, where transaction semantics are needed. This is because, each key can be found only in a single place (in contrast to the SSTable, where it could exist in multiple SSTables with obsolete values in some of them) and also because one could represent a range of values as part of the tree. This means that it's easier to perform key-level and range-level locking mechanisms.
References
[1] A Performance Comparison of LevelDB and MySQL
[2] Designing Data-intensive Applications