<Database Internals> Notes


Database systems take care of data integrity, consistency, and redundancy. Databases are modular systems and consist of multiple parts: a transport layer accepting requests, a query processor determining the most efficient way to run queries, an execution engine carrying out the operations, and a storage engine.

Back in 2000 you only had a few options of databases and most of them would be relational databases. Around 2010, a new class of eventually consistent databases started appearing, and terms such as NoSQL, and later, big data grew in popularity. The Dynamo paper published by the team at Amazon in 2007 had so much impact on the database community that within a short period it inspired many variants and implementations. The most prominent of them were Apache Cassandra, created at Facebook.

Terms & Concept

Horizontal scaling (scaling out): improving performance and increasing capacity by running multiple database instances acting as a single logical unit (Gamma Database Machine Project, Teradata, Greenplum, Parallel DB2)

Vertically scaling (scaling up): moving the database to a larger, more powerful machine.


1. Overview of Storage Engines

The storage engine (or database engine) is a software component of a database management system responsible for storing, retrieving, and managing data in memory and on disk, designed to capture a persistent, long-term memory of each node.

Designing a storage engine is complicated: there are many details and edge cases that are hard to get right from the start. We need to design the physical data layout and organize pointers, decide on the serialization format, understand how data is going to be garbage-collected, how the storage engine fits into the semantics of the database system as a whole, figure out how to make it work in a concurrent environment, and, finally, make sure we never lose any data, under any circumstances. And most of these decisions involve trade-offs.

Online transaction processing (OLTP) databases: handle a large number of user-facing requests and transactions. Queries are often predefined and short-lived.

Online analytical processing (OLAP) databases: handle complex aggregations. OLAP databases are often used for analytics and data warehousing, and are capable of handling complex, long-running ad hoc queries.

Hybrid transactional and analytical processing (HTAP): combine properties of both OLTP and OLAP stores.

Queries are executed by the storage engine. The storage engine has several components with dedicated responsibilities:

  • Transaction manager: schedules transactions and ensures they cannot leave the database in a logically inconsistent state.

  • Lock manager: locks on the database objects for the running transactions, ensuring that concurrent operations do not violate physical data integrity.

  • Access methods (storage structures): manage access and organizing data on disk. Access methods include heap files and storage structures such as B-Trees.

  • Buffer manager: caches data pages in memory.

  • Recovery manager: maintains the operation log and restoring the system state in case of a failure.

Together, transaction and lock managers are responsible for concurrency control.

In-memory database management systems (sometimes called main memory DBMS) store data primarily in memory and use the disk for recovery and logging. Disk-based DBMS hold most of the data on disk and use memory for caching disk contents or as a temporary storage.

Row-oriented database management systems store data in records or rows, every row has the same set of fields, useful in accessing data by row, storing entire rows together improves spatial locality.

Column-oriented database management systems partition data vertically, values for the same column are stored contiguously on disk, good fit for analytical workloads that compute aggregates.

Wide column stores, such as BigTable or HBase, where data is represented as a multidimensional map, columns are grouped into column families (usually storing data of the same type), and inside each column family, data is stored row-wise. This layout is best for storing data retrieved by a key or a sequence of keys.

Reasons to use specialized file organization:

  • Storage efficiency: files are organized in a way that minimizes storage overhead per stored data record.

  • Access efficiency: records can be located in the smallest possible number of steps.

  • Update efficiency: record updates are performed in a way that minimizes the number of changes on disk.

Data files (sometimes called primary files) can be implemented as index-organized tables (IOT), heap-organized tables (heap files), or hash-organized tables (hashed files).

This defines whether or not the storage structure chooses to collect a certain amount of data in memory before putting it on disk.

Mutability (or immutability)
This defines whether or not the storage structure reads parts of the file, updates them, and writes the updated results at the same location in the file.

Ordering This is defined as whether or not the data records are stored in the key order in the pages on disk.

2. B-Tree Basics

3. File Formats

Keys and values have a type, such as integer, date, or string, and can be represented (serialized to and deserialized from) in their raw binary forms.
All primitive numeric types have a fixed size.

Strings and other variable-size data types (such as arrays of fixed-size data) can be serialized as a number, representing the length of the array or string, followed by size bytes (the actual data).

Booleans can be represented either by using a single byte, or encoding true and false as 1 and 0 values.

Enumerated types can be represented as integers and are often used in binary formats and communication protocols.

4. Implementing B-Trees

5. Transaction Processing and Recovery

6. B-Tree Variants

7. Log-Structured Storage

8. Overview of distributed systems

The inherent difficulties and complications caused by the unreliability of the system components: links may fail to deliver messages, processes may crash, or the network may get partitioned.

9. Failure Detection

10. Leader Election

11. Replication and Consistency

Fault tolerance is a property of a system that can continue operating correctly in the presence of failures of its components.

Data replication is a way of introducing redundancy by maintaining multiple copies of data in the system. When talking about replication, we care most about three events: write, replica update, and read.

CAP conjecture discusses trade-offs between Consistency, Availability, and Partition tolerance.

Linearizability is the strongest single-object, single-operation consistency model.
Sequential consistency allows ordering operations as if they were executed in some sequential order, while requiring operations of each individual process to be executed in the same order they were performed by the process.
Under the causal consistency model, all processes have to see causally related operations in the same order.
PRAM/FIFO consistency: Operation effects become visible in the same order they were executed by individual processes. Writes from different processes can be observed in different orders.

12. Anti-Entropy and Dissemination

Three approaches to propogate messages between nodes:

  • Broadcast: notification broadcast from one process to all others
  • Anti-entropy: periodic peer-to-peer information exchange, peers connect pairwise and exchange messages
  • Gossip: message recipients become broadcasters and help to spread the information quicker and more reliably

In a distributed system, entropy represents a degree of state divergence between the nodes. Anti-entropy is used to lower the convergence time bounds in eventually consistent systems.

Read repair: detect divergence between the replicas during the read, the coordinator node makes a request to replicas, waits for their responses, and compares them, sends updates back to the replicas if inconsistency detected. Read repair can be implemented as a blocking or asynchronous operation. Read repair assumes that replicas are mostly in sync and we do not expect every request to fall back to a blocking repair.

Digest reads: the coordinator can issue only one full read request and send only digest requests to the other replicas. A digest request reads the replica-local data and, instead of returning a full snapshot of the requested data, it computes a hash of this response.

Hinted handoff: a write-side repair mechanism. If the target node fails to acknowledge the write, the write coordinator or one of the replicas stores a special record, called a hint, which is replayed to the target node as soon as it comes back up.

Finding exactly which rows have diverged between the replicas requires exchanging and comparing the data records pairwise. This is highly impractical and expensive. Merkle trees compose a compact hashed representation of the local data, building a tree of hashes. The lowest level of this hash tree is built by scanning an entire table holding data records, and computing hashes of record ranges. Higher tree levels contain hashes of the lower-level hashes, building a hierarchical representation that allows us to quickly detect inconsistencies by comparing the hashes, following the hash tree nodes recursively to narrow down inconsistent ranges. This can be done by exchanging and comparing subtrees level-wise, or by exchanging and comparing entire trees.

Bitmap version vectors: detect missing replica writes by maintaining compact records containing information about the most recent writes.

13. Distributed Transactions

A transaction is a set of operations, an atomic unit of execution. Transaction atomicity implies that all its results become visible or none of them do. To ensure atomicity, transactions should be recoverable.

Atomic commitment doesn’t allow disagreements between the participants: a transaction will not commit if even one of the participants votes against it.

In databases, distributed transactions are executed by the component commonly known as a transaction manager. Transactions commit in all partitions, and for all replicas.

Two-Phase Commit

  • Prepare
    The coordinator notifies cohorts about the new transaction by sending a Propose message. Cohorts make a decision on whether or not they can commit the part of the transaction that applies to them. If a cohort decides that it can commit, it notifies the coordinator about the positive vote. Otherwise, it responds to the coordinator, asking it to abort the transaction. All decisions taken by cohorts are persisted in the coordinator log, and each cohort keeps a copy of its decision locally.

  • Commit/abort
    Operations within a transaction can change state across different partitions (each represented by a cohort). If even one of the cohorts votes to abort the transaction, the coordinator sends the Abort message to all of them. Only if all cohorts have voted positively does the coordinator send them a final Commit message.

Failure of a single node can prevent transactions from happening.
Inability of the coordinator to proceed with a commit or abort leaves the cluster in an undecided state.

Three-Phase Commit

  • Propose
    The coordinator sends out a proposed value and collects the votes.

  • Prepare
    The coordinator notifies cohorts about the vote results. If the vote has passed and all cohorts have decided to commit, the coordinator sends a Prepare message, instructing them to prepare to commit. Otherwise, an Abort message is sent and the round completes.

  • Commit
    Cohorts are notified by the coordinator to commit the transaction.

Split brain: some nodes proceed with a commit and some abort.

Calvin establishes the global transaction execution order by reaching consensus on sequencers, and a typical Calvin implementation colocates sequencer, scheduler, worker, and storage subsystems.
Spanner uses Paxos for consistent transaction log replication, two-phase commit for cross-shard transactions, and TrueTime for deterministic transaction ordering.

Database Partitioning

The most straightforward way to partition data is by splitting it into ranges and allowing replica sets to manage only specific ranges. When executing queries, clients (or query coordinators) have to route requests based on the routing key to the correct replica set for both reads and writes. This partitioning scheme is typically called sharding: every replica set acts as a single source for a subset of data.

To effecitvely use patitions, requently accessed, read/write heavy ranges can be split into smaller partitions to spread the load between them, and split those value ranges are more dense than other ones into smaller partitions.

When nodes are added to or removed from the cluster, the database has to re-partition the data to maintain the balance.

To find a target node from the routing key, some database systems compute a hash of the key, and use some form of mapping from the hash value to the node ID.

Consistent hashing: reduce the number of relocations required for maintaining balance, a change in the ring affects only the immediate neighbors of the leaving or joining node, and not an entire cluster.

Snapshot isolation guarantees that all reads made within the transaction are consistent with a snapshot of the database. Values are consistent, as they’re read from the snapshot at a specific timestamp. Conflicting writes are aborted and retried to prevent inconsistencies.

14. Consensus

A broadcast is a communication abstraction often used in distributed systems and is often used for database replication when a single coordinator node has to distribute the data to all other participants.

For a broadcast to be reliable, it needs to guarantee that all correct processes receive the same messages, even if the sender crashes during transmission. Naively we allow every process that received the message to forward it to every other process it’s aware of, and it uses N2 messages.

Paxos algorithm can be generally split into voting (or propose phase) and replication, participants in Paxos can take one of three roles: proposers, acceptors, or learners.

Raft simplifies the terms in which consensus is described, and makes leadership a first-class citizen in the algorithm. Raft separates log replication, leader election, and safety.

Chuanrong Li

Read more posts by this author.