Log Storage and Log Structured merge trees

LSM trees are designed to achieve higher throughput and are used as the storage engine of various DB such as HBase, Cassandra, LevelDB, SQLite. As the name suggests, writes are made to log files in append-only mode. These logs files are merged and compacted in the background and indexed for efficient search.

To discuss how LSTM work's let's understand few basics:

  1. Log storage
  2. Segment
  3. Segment Merge
  4. Memtable
  5. SSTable
  6. WAL

Log storage - As the name suggests it is an append-only file where any new data is appended to the file. The file contains key-value pairs. Where key is the record key and value is the data. Since the file is append only, the log file can contain multiple records for the same key as an update to the existing the key.

Segment - Continuous appending to the log file, can make file size big and eventually running out of disk space. One good solution is to break log into segments of a certain size.

Segment Merge -  Since the files size are small these files can be merged at a regular time for compaction. The merging and compaction can be done as background thread while regular operations are performed. After the merging process is completed switch can be done to read from the new segments, replacing the old segments(which can be deleted). During the merge process multiple records with the same key are converged into a single record.

Log structure merge trees

Source : Wikipedia

SSTable - Sorted String Table is a concept borrowed from Google BigTable which stores a set of immutable row fragments in sorted order based on row keys. It is a log file but with keys sorted.

Write Ahead Log - Write ahead log is a technique where the request received for write is first written into a separate file residing on the disk. The helps from crash recovery where the data from the file can be replayed and inserted.

Write Amplification - Log structured indexes also rewrite data multiple times due to repeated compaction and merging of SSTables. This effect of one write to the database resulting in multiple writes to the disk over course of the database’s lifetime - is known as write amplification.

How does read/write works for LGSTM?

  1. When a write request comes, it is added to an in-memory balanced tree data structure (eg red-black tree). This in memory tree, is sometimes referred as memtable.
  2. Along with request added to in-memory balanced tree the request is also logged in a WAL (on disk). The log helps to recover in case system or the process crashes.
  3. When the memtable gets bigger than some threshold it is flushed to disk as a SSTable file. This is done efficiently because tree already maintains a sorted key-value pair. This becomes the most recent segment of the database. During the operation of flushing to the disk, write operation continues as normal. Even with compaction reads will still need to visit many files. Most implementations void this through the use of a Bloom filter. Bloom filters are a memory efficient way of working out whether a file contains a key.
  4. To server the read request, first key is searched in memtable then the most recent on-disk segment and so on in the next older segment.
  5. The merging and compaction of the segment happens in the background.

Advantages of LSM

  1. LSM trees are typically able to sustain higher write throughput than B-trees, partly because it has lower write amplification and partly because they sequentially write in the form of compact SST tables files rather than having to overwrite several pages in the tree.
  2. LSM can be compressed better and thus often produce smaller files on disk than B-trees.
  3. B-tree storage engines leave some disk space unused due to fragmentation: when a page is split or when a row cannot fit into an existing page. Since LSM are not page oriented and writes are sequential the storage overhead is low.

Disadvantage of LSM

  1. The compaction process of LSM can sometime interfere with the performance of ongoing reads and writes. When writing to an empty database, the full disk bandwidth can be used for initial write but bigger the database gets, the more disk bandwidth is required for compaction.
  2. An advantage of B-trees is that each key exists in exactly one place in the index whereas a log-structured storage engine may have multiple copies of the same key in different segments.