Paper Notes: [FTNDB'07] Architecture of a Database System

Database systems are extremely important and complex systems, but knowledge about their architecture is not as widely known as that of other important systems, such as operating systems and compilers. Traditional textbooks usually focus on database-related algorithms and theory, and rarely cover system development and architecture. The paper [1] uses popular commercial and open-source database systems as examples, focusing on the architecture of (relational) database systems. Although some details have changed over the years, the overall structure and ideas have not diverged much.

This paper covers a lot of material. For now, I will not introduce many of my own thoughts; I will simply excerpt some key points so that I do not completely forget the paper later.

Overall Structure

The overall structure of a database system is shown in Figure 1.

Figure 1. Main components of a database

Taking a single request to the database as an example:

First, a persistent connection must be established with the database. This part is handled by the top component in Figure 1, the Client Communications Manager. Databases usually need to support different protocols, such as ODBC and JDBC, TCP, and local pipes.
After the user connection is established, thread resources must be allocated to it. This part is handled by the component on the left in Figure 1. Admission Control is usually performed at this stage as well.
Next, the user’s request enters the core part of the database and is processed by the component in the middle of Figure 1, the Relational Query Processor.
1. The user’s SQL query is first parsed into an internal representation, usually an expression tree corresponding to relational algebra.
2. Next, the SQL query is optimized. Before that, there is usually a Rewrite step that preprocesses the query to simplify the optimizer’s logic.
3. The optimized SQL query may contain multiple operators, and the results of these operators must be combined and chained together. This work is performed by the Plan Executor.
Operator execution requires support from the database’s lower layers. This functionality is handled by the bottom component in Figure 1, the Transactional Storage Manager.

Process Model

A database is a multi-user service and must be able to serve multiple users at the same time. This requires some basic parallel execution components. Databases generally have their own process abstraction. This is because most databases need to support different runtime environments, and some early operating systems did not support threads well enough, so databases had to adapt on their own. The term process is used here because a database system may span multiple compute nodes.

Database protocols generally use a long-lived connection model. After a connection is established, associated session information is maintained, preserving the context for commands executed on the current connection, such as whether a transaction is being processed. A database system usually assigns a corresponding DBMS worker to manage each client connection. A DBMS worker consumes some compute resources, and its process model can take several forms:

Process per DBMS worker
Thread per DBMS worker
1. OS thread per DBMS worker
2. DBMS thread per DBMS worker
  1. DBMS threads scheduled on OS process
  2. DBMS threads scheduled on OS threads
Process/thread pool
1. DBMS workers multiplexed over a process pool
2. DBMS workers multiplexed over a thread pool

In my opinion, most of the reason so many models exist is historical baggage. If we consider only modern operating systems, especially Linux, which can support a large number of threads reasonably well, the following models are all reasonable choices:

DBMS threads scheduled on OS threads
DBMS workers multiplexed over a thread pool

Sometimes a database also uses its own thread library, which further wraps user-space threads, also called fibers. The advantage of doing so is that it can further reduce the overhead caused by thread context switches. The drawback is higher maintenance cost and fewer debugging tools and less debugging information.

Admission Control

Admission control is performed for the following two reasons:

Prevent one user from occupying too many resources and affecting other users of the system
Deny users access to content for which they do not have permission

Admission control in database systems generally happens at two points:

When a user request arrives, admission control is performed to avoid allocating resources to invalid requests
During query plan execution, because only then is it convenient to aggregate all necessary information, such as the load on the nodes that will execute the physical query plan

Parallel Processing Architecture: Coordination Between Processes and Memory

For performance reasons, database systems need to pay more attention to coordination between processes and memory. Several common models are as follows:

Shared Memory
Shared-Nothing
Shared-Disk
NUMA

Today, mainstream single-node database systems all support the Shared Memory model, which makes it easier to achieve higher performance. Distributed database systems generally use the Shared-Nothing model. This basically matches current hardware capabilities: hardware does not provide the ability to access data (memory/disk) on remote nodes as reliably as it accesses data on the local node. Some new hardware technologies may break this assumption, and this is currently a hot area for new experimentation.

There are also many distributed data systems that use the Shared-Disk model. Here, the assumption is that all processes can access the shared disk at similar cost. Some facilities can indeed do this, such as SANs (Storage Area Networks). Another approach is to treat lower-level software systems as shared disks and build a Shared-Disk system on top of them, such as BigTable[2].

Although NUMA was originally proposed as a distributed model, it has not been widely used in that field. Instead, it has played a major role in single-machine multicore architectures. In my opinion, although this architecture’s assumptions about access speed are barely valid, its assumptions about reliability do not hold, so it is difficult to apply well in distributed systems. Mainstream database systems generally consider the impact of NUMA architecture on a single machine.

Relational Query Processor

Generally speaking, relational query processing can be viewed as a single-user, single-threaded task. Concurrency control is handled at a lower layer and exposes an almost transparent interface to the upper layer. Queries are mainly divided into two categories:

DML (Data Manipulation Language), such as SELECT/INSERT/UPDATE/DELETE
DDL (Data Definition Language), such as CREATE TABLE/CREATE INDEX

Query Parsing and Authorization

For an SQL statement, the main work of parsing is to:

Check whether the query statement is correct
Obtain names and reference information. For example, in SELECT c1 FROM t1 JOIN t2 ON t1.id = t2.t1id:
1. t1 is a table name, but it needs to be normalized into a four-part name, server.database.schema.table, to locate this table precisely
2. c1 is a column name; whether this column exists in t1 or t2 also needs to be handled and normalized at this stage
Convert the query into an internal representation, usually the internal representation corresponding to relational algebra
Check whether the user has permission to execute this query

The work above generally needs to collaborate with the Catalog Manager to obtain table-related metadata. In addition, some operators also need to determine types from this metadata. For example, whether the comparison operator in (EMP.salary * 1.15) < 75000 is an integer comparison, floating-point comparison, or decimal comparison depends on the type of EMP.salary.

Some constraints can also be checked at this point. For example, if SET EMP.salary = -1 has the constraint EMP.salary > 0, it can be detected and rejected at this stage.

Query Rewrite

Although some database systems merge query rewrite into the query parsing module above or the query optimization module below, logically it is still a relatively independent function. Its main functions are:

Expanding views
Evaluating constants, such as R.x < 10 + 2 + R.y ⇒ R.x < 12 + R.y
Rewriting predicates, for example:
1. NOT Emp.salary > 1000000 ⇒ Emp.salary ⇐ 1000000 (saves one operator)
2. Emp.salary < 75000 AND Emp.salary > 1000000 ⇒ False (this situation usually appears after view expansion)
3. R.x < 10 AND R.x = S.y ⇒ R.x < 10 AND S.y < 10 AND R.x = S.y (logical transitivity; the index on S.y may be usable)
Semantic optimization, for example (common after view expansion):
```
SELECT Emp.name, Emp.salary
  FROM Dept
  INNER JOIN Emp ON Emp.deptno = Dept.dno
```
Because the Dept table is not used at all, this can be transformed into SELECT Emp.name, Emp.salary FROM Emp.
Subquery flattening and other heuristic rewrite rules
Because query optimization is a relatively complex problem (NP-hard[3]), to guarantee an upper bound on complexity, optimization generally does not cross query units. Therefore, before that stage, rewriting nested subqueries into a single query, if possible, helps the subsequent query optimization. Generally speaking, at this stage all equivalent queries are rewritten into a canonical form.
In addition, there are optimizations based on heuristic rules or cost estimation.

Query Optimizer

The main work of query optimization is to generate a query plan. The mainstream approach today continues to use the method Selinger and others used when implementing System R[4]. The generated query plan can be represented in multiple ways. Early database systems generally generated machine code directly in pursuit of performance; later, to ensure a certain degree of portability, they generally generated intermediate results and then interpreted them.

Although everyone has followed Selinger’s method, many improvements have also been made:

Plan space. For performance reasons, Selinger reduced the plan space and computation by two means: optimizing only left-deep trees and delaying Cartesian products. However, most modern database systems consider both cases: they optimize both left-deep trees and bushy trees, and consider Cartesian products earlier.
Selectivity estimation. Selinger estimated table size simply from index size. Modern systems generally use sampling to obtain histograms and other statistics for estimation.
Search Algorithms. Some commercial database systems, especially Microsoft and Tandem, do not use Selinger’s method, but instead use a top-down search strategy[5]. This search strategy can sometimes reduce the number of plans the optimizer considers[6], but at the cost of increased optimizer memory usage. Some systems fall back to heuristic search strategies when searching over a large number of tables.
Parallelism. Today’s mainstream commercial database systems all support parallel processing to some extent, and most support intra-query parallelism. The query optimizer needs to know how to schedule these parallelized operators across CPUs and even across machines. A simple idea is to use a two-level scheduling strategy: one layer considers only scheduling across machines, while the other, like a traditional optimizer, considers only scheduling across CPUs within a machine. Some commercial database systems use this strategy, while others do not use two-level scheduling and instead consider network topology and data distribution together for global scheduling.
Auto-Tuning. When data characteristics change, optimization strategies also need to change accordingly. Some companies are trying to perform automatic tuning of database systems through methods such as machine learning.

Database systems usually cache some query plans to reduce the cost of recomputation. However, it is worth noting that these caches should be invalidated at appropriate times, such as when the assumptions used to make optimization decisions no longer hold.

Query Executor

A query plan is usually represented as a dataflow in the form of a directed graph that connects tables and operators. Most modern query execution engines currently use the iterator model:

class iterator {
  iterator &inputs[];
  void init();
  tuple get_next();
  void close();
}

Each operator inherits from iterator. In this way, an operator only needs to focus on its own logic and does not need to care about its upstream or downstream operators.

One characteristic of the iterator model is the tight coupling between data flow and control flow. This model simplifies handling in many situations. Each time the get_next() method is called, when it returns, it indicates that data has arrived and that the call has ended; this is the case where data flow and control flow are tightly coupled. This allows a single thread to drive the execution of the entire query plan, and there is no need to consider performance matching between operators. Because there is no need for scheduling or blocking waits, this approach can also achieve very high system utilization relatively easily. Parallel execution and network communication can be handled by wrapping an exchange iterator[7].

There are two feasible methods for accessing data. One is when the data is in the buffer pool: the buffer page must be pinned, then a reference to the tuple is obtained and used, and after use the buffer page is unpinned. This is called BP-tuples. The other is to copy the tuple out for use, called M-tuples. Although M-tuples are easier to manage than BP-tuples, they are much less efficient and are generally used only in special cases, such as long-running queries.

In most cases, the query plans corresponding to write requests such as INSERT/DELETE/UPDATE are very simple and direct. However, some special cases require great care, especially those involving read-after-write. For example, the Halloween problem: "give every employee with a salary below $20K a 10% raise." First, candidate tuples are found through an index, and then the update is executed. But the data update causes the index to be updated, so if the tuple still satisfies the filter condition after the update statement is executed, it may be selected again for another update. One solution is to use techniques such as a temporary table to completely separate the read and write processes. Another is to use techniques such as multi-versioning to avoid reading updated results.

Access Methods

Access Methods are in the bottom module of Figure 1. Personally, I feel they are similar to the intersection between the Plan Executor and the Storage Engine, and are also the lowest-level operators for accessing data.

Several issues are discussed here. One is the need to pass scan parameters to the lower layer, for the following reasons:

Indexes need to be used
Qualifying tuples can be pinned/unpinned or copied/deleted in batches to improve performance

The paper also mentions the choice of Row ID. Although using the physical disk address can provide higher performance, things become more complicated when a B+-tree splits or a tuple needs to move. Another approach is to use the primary key as the Row ID in secondary indexes. Oracle simply allows tuples to span pages to avoid moving tuples.

Data Warehouses

Data warehouse scenarios, also called OLAP scenarios, are quite different from OLTP scenarios. The optimization process and execution engine discussed above need some extensions and modifications to achieve better performance in OLAP scenarios.

Bitmap Indexes. Some columns, such as gender, have only a fixed set of possible values. Bitmap indexes can provide better performance for them.
Fast Load. Although OLAP systems used to be able to import once per day, today people generally want them to be faster. Bulk load can skip steps such as SQL parsing and operate directly on the underlying storage engine, achieving better performance.
Materialized Views. Physical views can be created; when data is updated, both the base table and the materialized view are updated, trading space for query time.
OLAP and Ad-hoc Query Support. Some data warehouses have fixed query scenarios, so the statements that need to be executed can be predicted, such as periodically executed queries. These scenarios can be supported to some extent by data cubes, similar to precomputation. However, support for ad-hoc queries has always been a difficult problem.
Optimization of Snowflake Schema Queries. Table structures in OLAP databases are generally designed as snowflake schemas, and some optimizations can be performed for this kind of schema.

Column Storage in particular can play a huge role in OLAP scenarios.

Database Extensibility

UDT/UDF
JSON/XML
Full-Text Search

Storage Management

Spatial Control

As everyone knows, sequential access is much faster than random access. Database systems should have control over how data is laid out, and because database systems know more information than the underlying operating system, they can do a better job. Early database systems achieved this by directly operating on raw disks. However, this approach monopolizes the disk, makes data management and recovery difficult, and cannot painlessly benefit from other technologies such as SAN and RAID, so it was gradually replaced by other approaches. The current mainstream method is to allocate a large file and then operate on it using technologies such as the mmap API, Direct I/O, or Concurrent I/O, simulating a raw disk.

Temporal Control: Buffering

In addition to needing to control data placement, database systems also need to control when data should be written to persistent storage. This is due to two considerations:

Some data must be written to persistent storage immediately to guarantee correctness, such as write-ahead logging for transactions
The database performs a large amount of cache management based on application data, and it does not need the underlying operating system to do similar work and introduce additional overhead

Buffer Management

A database generally uses a region of memory, called a frame, to map one-to-one to content on disk. This mapping does not involve transformation of the data content, avoiding extra CPU overhead. This mapping relationship, metadata, and other information are managed, and one important piece of information is the dirty flag. If this frame is selected for eviction from memory, the dirty flag determines whether the frame must be written back to disk. The pin/unpin mentioned above also determines whether this frame can be evicted to disk.

Frames are loaded and evicted according to a replacement policy. Algorithms in this area were a major focus of past research, such as LRU, CLOCK, and LRU-2.

Transactions: Concurrency Control and Recovery

The truly large part of a database system, and the part that is not very cleanly decomposed, is transactional storage management. It is usually composed of the following four intertwined modules:

The Lock Manager for concurrency control
The Log Manager for failure recovery
The Buffer Pool for separating I/O
Access Methods for managing data on the underlying disks

A Note on ACID

ACID refers to transactions in database systems:

Atomicity	The changes caused by a transaction either all take effect or none take effect
Consistency	SQL-defined constraints must not be violated
Isolation	Two concurrently running transactions must not affect each other
Durability	Once a transaction succeeds, the changes it causes—unless overwritten by another process—must not be lost

A Brief Review of Serializability

There are three main categories of methods for implementing concurrency control:

Strict two-phase locking (2PL)
Multi-Version Concurrency Control (MVCC)
Optimistic Concurrency Control (OCC)

Locking and Latching

A database implements its own lock control system, called Locking. The database system maintains a Lock Table that records the relationships among Transaction, Lock, and Object, so that when a Transaction is aborted, all Locks associated with it can be released. In addition, because the order of acquiring locks is driven by the order in which users access data, deadlock detection is also essential. This kind of locking system is mainly aimed at Transactions.

Database systems also use finer-grained locks when accessing data structures to protect data and data structures. This kind of lock is called Latching. Generally speaking, Latching is infrastructure provided by the operating system or hardware instructions, so the use of Latching should avoid deadlocks.

ANSI SQL specifies several isolation levels:

READ UNCOMMITTED
READ COMMITTED
REPEATABLE READ
SERIALIZABLE

These isolation levels were specified early on for isolation control based on locks. Today, because of the use of MVCC and OCC, some database systems also provide other isolation levels:

CURSOR STABILITY
SNAPSHOT ISOLATION
READ CONSISTENCY

Log Manager

I recommend reading the ARIES paper[8], which not only describes methods that use Logging but also discusses other implementation approaches.

Database systems generally use Write-Ahead Logging (WAL) to implement transaction durability. The key points are the following three rules:

Every modification to a data page should generate a log record, and this log record must be flushed before the in-memory page is flushed
Logs must be flushed in order
A transaction can be reported as successful only after its log has been flushed

Although the basic rules are very simple, actual implementations are usually far more complex than these rules in order to achieve better performance. The key is to maintain high performance on the transaction commit fast path while also providing reasonable performance for Rollback/Abort. After optimizations for specific scenarios are considered, the logging system becomes even more complex.

Databases generally follow the DIRECT, STEAL/NOT-FORCE principles[9]:

Data objects are updated in place
Even if a data page contains data written by uncommitted transactions, an unpinned frame can still be replaced. If it contains dirty data at that point, the changes are written back to disk. This is possible because undo logs can be used to undo changes made by aborted transactions.
When a transaction commits, there is no need to flush data pages to disk, because redo logs have been written

On the other hand, reducing log size can also effectively improve the performance of the logging system. Physical operations usually contain more data related to the current structure; this data can improve performance when executing the operation, but when writing logs, only logical operations can be recorded and the structure-related data can be discarded, further reducing log size and improving log-write performance. For example, in the ARIES system, physical operations are recorded in the UNDO log, and logical operations are recorded in the REDO log.

After an abnormal system shutdown, the system state must be recovered from the logs on restart. The recovery log sequence number (recovery LSN) is used to record the order of logs; a checkpoint periodically records the recovery LSN to avoid starting recovery from too early a point in time. The simplest approach is to flush all data pages to disk and then record a checkpoint, but this performs too poorly. ARIES uses a smarter method to avoid waiting for all data pages to be flushed. (Details omitted.)

A special point to note is that transaction Rollback also needs to write WAL. If disk space is insufficient at that time, Rollback can get stuck. A common approach is to reserve some space to avoid this situation.

Locking and Logging in Indexes

Locking and Logging on index structures may use strategies different from those for Transactions to optimize performance.

Latching in B+-Trees

If strict two-phase locking is used to guarantee the SERIALIZABLE isolation level, one option is to lock the entire B+ tree by locking the root node. But in that case, two transactions with no overlap at all cannot run in parallel. There are three common strategies for solving this problem:

Conservative schemes. Concurrent access to the same data page is allowed only when it is certain that the accesses have no effect on each other. For example, if one operation is traversing this data page and another operation wants to insert into the data page, these two operations are not allowed to execute concurrently, because the insertion may cause the data page to split. Compared with the other two newer strategies, this strategy is somewhat too conservative.
Latch-coupling schemes. When traversing data, a node is latched before it is accessed. The latch on the current node is released only after the latch on the next node to be accessed has been acquired.
Right-link schemes. Nodes in the B+ tree store a right-link pointer. During traversal, the Latch-coupling strategy mentioned above is not used; the latch is released after a node is accessed. Because such a right-link pointer exists, traversal can detect whether the tree has split during this period and can correctly find the required next node.

In particular, Latch-coupling applies only to B+ trees, while the Right-link strategy is more general[10].

Logging for Physical Structures

Splitting a B+ tree does not require UNDO, so it can be marked REDO only when logging. This idea can also be used for transformations of other structures, such as file growth.

Next-Key Locking: Physical Surrogates for Logical Properties

B+ trees have a special optimization method when implementing the Serializability transaction isolation level. This optimization is aimed at solving the "Phantom" inconsistency phenomenon. The Phantom phenomenon is as follows: suppose one query has a range predicate, such as Name BETWEEN 'Bob' AND 'Bobby', and another query inserts a new record within this range, such as Name = 'Bobbie'. Then the first query may see inconsistent results during execution. This is because, when locking records, the range of the query itself is not locked, causing two query requests that actually overlap to execute in parallel and produce anomalous results, violating the Serializability isolation level.

One way to solve this problem is to use predicate locks, but doing so is expensive. On the one hand, it is difficult to determine whether any two predicates intersect; on the other hand, a hash-based Lock Table also has difficulty supporting such operations.

In B+ trees, the following method can be used for optimization. Each time a query is performed, in addition to locking the range it needs, it also locks one extra element that is just beyond the upper bound of the range. Obviously this method works, and because of the B+ tree structure, finding this next element is relatively easy.

The idea behind this method is to use a real, physical object to replace a logical property that is difficult to implement. In this example, the next element replaces the abstract concept of expanding the predicate range to determine predicate intersection. This technique should be known so it can be used when appropriate.

Interdependencies of Transactional Storage

The beginning of this chapter mentioned that Transactional Storage is a huge subsystem in database systems whose submodules are deeply intertwined. This section discusses the dependencies among the submodules.

If we consider only concurrency control and failure recovery mechanisms, we find that the common failure recovery mechanism WAL depends on concurrency control being implemented with strict two-phase locking. If non-strict two-phase locking is used, then if a lock has already been dropped, it may not be possible to acquire the lock again when performing undo during rollback.

If Access Methods are also considered, things become even more complicated. A high-performance implementation of Access Methods is already complex in itself. When concurrency control and failure recovery mechanisms are considered, they must be tightly integrated with their specific data structures. This is why mainstream database systems generally implement only B+ trees and heap files, and only PostgreSQL implements GiST. Moreover, each data structure has its own optimization techniques for concurrency control or failure recovery. Even the simplest heap file has such special techniques, and these techniques cannot be applied to other data structures.

Concurrency control implemented on Access Methods is generally done well only in lock-based implementations. Other concurrency control methods, such as MVCC and OCC, actually do not consider the characteristics of Access Methods. Therefore, it is difficult to mix multiple different concurrency control methods on the same Access Methods.

The logic for implementing failure recovery on Access Methods is highly related to the entire subsystem. Decisions such as changes to data structures and whether to use physical change logs or logical change logs cannot be separated from the details of the whole subsystem. For example, in B+ trees, recovery and concurrency logic are interwoven. On the one hand, if failure recovery is needed, the system must know what inconsistent states a B+ tree may enter before it can use logs to guarantee Atomicity. On the other hand, for example, a B+ tree node split does not actually need an UNDO log, and there is no need to merge the split nodes back during Rollback. This requires the logging system to support REDO-only capability.

Although the Buffer Manager is relatively independent and appears to be loosely connected to the other modules, this is because we implemented properties such as DIRECT, STEAL/NOT-FORCE. Supporting these properties actually depends on support from the other modules, so the Buffer Manager is also coupled with the other submodules.

Common Components

Catalog Manager

The Catalog Manager stores metadata for the entire database system, such as users, schemas, tables, columns, indexes, and so on. One important lesson from past practice is that the same method used to access ordinary data, namely SQL, should be used to access and modify this metadata. This foundational data requires special treatment, usually for performance reasons. It is usually stored directly in memory in denormalized form to serve requests. The related SQL statements are also directly precompiled and cached, and even some transaction-related operations receive special optimizations.

Memory Allocator

Traditional textbooks explain the Memory Allocator mainly in terms of memory pool management. In practice, however, database systems also apply the Memory Allocator in many other areas. For example, Selinger’s query optimization[4] uses a large amount of memory during dynamic programming, not to mention memory-hungry operators such as hash join and sort.

On the other hand, the Memory Allocator in database systems also uses a context-based technique to improve both performance and debugging capability. It provides the following basic APIs:

Create a Context with a given name or type. This Context can provide the Allocator with additional information about what the allocated memory will be used for, so the Allocator can choose a more suitable memory allocation strategy. For example, memory used by the Query Optimizer only needs to grow a little at a time, while memory used by HashJoin is allocated in large chunks.
Allocate a block of memory from a specified Context. This is similar to the malloc() function, but allocation comes from an internal memory pool.
Free a block of memory that was allocated by a specified Context. This is similar to the free() function. This usage is actually relatively uncommon; generally, the entire Context is reclaimed at once.
Reclaim all memory for a specified Context.
Reset a Context. This also reclaims all memory allocated from the Context, but it preserves the Context’s metadata, so the Context can continue to be used for memory allocation afterward.

This differs somewhat from traditional memory allocators such as jemalloc, whose goal is to transparently replace malloc calls. In some scenarios, for example, the optimizer may need to build a plan tree and therefore allocate small memory objects many times. After this phase ends, the entire context can be reclaimed at once instead of traversing the data structure again to carefully perform free operations.

This usage is somewhat similar to a Garbage Collector (GC), but it can provide more control than a GC. For example, it still retains the free operation, and it also isolates memory allocation and reclamation well through Contexts, which can provide better locality.

Because the data flow of database systems is naturally divided into multiple phases, this Memory Allocator design is especially suitable for database systems. In particular, in some scenarios, if it is known in advance that no free operation is needed and that the entire Context will be reclaimed at the end, there is no need to track the usage of every allocated block of memory, saving a lot of overhead, especially in scenarios that frequently allocate small objects.

Disk Management Subsystems

This mainly solves two kinds of problems. One is how to map database system data to multiple files distributed across different disks, where each file may also have a size limit. The other is how to handle optimization problems caused by the characteristics of different disk devices.

File size and access performance may be limited by the operating system, file system, and physical devices. Database system data and file sizes also cannot match perfectly. Some large tables may not fit into a single file on one device, such as the 2 GiB limit of early file systems, while some small tables may need to be merged into one file.

The performance and other characteristics of SCSI disks may differ greatly from those of other devices that look like disks but are not actually disks, such as RAID and SAN. Even within RAID, the characteristics of RAID-0 and RAID-5 differ greatly.

Replication Services

There are mainly three ways to perform replication, but only the last one has satisfactory performance:

Physical Replication, namely periodic full-disk copying
Trigger-Based Replication, using the database’s Trigger functionality
Log-Based Replication, writing database logs, generally Bin-Log, to another location in near real time and then recovering there. There are two approaches here: one is to reconstruct SQL from the log and replay it; the other is to replay data changes directly. The former has better generality and can replicate data across different database vendors; the latter has higher performance.

Administration, Monitoring, and Utilities

Optimizer Statistics Gathering
Physical Reorganization and Index Construction
Backup/Export (Fuzzy dump + logging)
Bulk Load (Manipulate underlying access methods)
Monitoring, Tuning, and Resource Governors

Standard Practices

This section summarizes all implementation approaches mentioned in the paper that popular database systems use for the topics discussed in the sections above.

Standard Practices for Process Models

Process per DBMS worker

IBM DB2 uses the Process per DBMS worker model by default on systems that do not support high-quality OS Threads, and uses Thread per DBMS worker by default on systems that support high-quality OS Threads.

Oracle makes the same choice as DB2, but additionally supports a Process Pool.

PostgreSQL uses the Process per DBMS worker model on all systems.

Thread per DBMS worker

There are two variants here: using OS threads or using DBMS threads.

IBM DB2 uses OS threads by default when the operating system supports high-performance Threads. MySQL also uses OS threads.

DBMS threads are a task-scheduling abstraction implemented by the database system itself in user space, sometimes also called fibers. There are two cases here: one is implemented based on OS Processes, and the other is implemented based on OS Threads. Sybase and Informix support DBMS threads based on OS Processes. Most database systems use OS Processes to implement DBMS threads, but not all systems support migrating DBMS threads between OS Processes. MS SQL Server supports DBMS Threads implemented with OS Threads, but the application scenarios are relatively limited.

Process/thread pool

Using a Process pool saves more memory than the Process per DBMS worker model and is also easier to port to operating systems with poor Thread support. Oracle also supports this model as an option and recommends using it when there are many concurrent user connections. Oracle’s default model is Process per DBMS worker, and both models can be easily extended to many kinds of operating systems.

MS SQL Server uses a Thread pool by default.

Standard Practices for Parallel Processing Architectures

Shared-Memory: All mainstream commercial database systems support this model, including IBM DB2, Oracle, and MS SQL Server.
Shared-Nothing: IBM DB2, Informix, Tandem, and NCR Teradata all support this model; Greenplum provides a customized version of PostgreSQL that supports the Shared-Nothing model.
Shared-Disk: Systems that support this model include Oracle RAC, Oracle RDB, and IBM DB2 for zSeries.

Standard Practices for Relational Query Processing

From a coarse-grained architectural perspective, the query engines of almost all relational databases are similar to the System R prototype[11]. Progress over the years has mainly focused on how to accelerate more kinds of queries and schemas within this framework. The main improvements include the following aspects:

Query optimization search strategy (top-down vs. bottom-up)
Query execution control-flow model (iterators + exchange operator vs. asynchronous producer/consumer)

At a finer granularity, different vendors have many different approaches involving integrated optimization of the optimizer, executor, and access methods to achieve better performance, especially when different workload types are involved, such as:

OLTP
decision-support for warehousing
OLAP

The specific approaches are each vendor’s "secret sauce." The only thing known is that everyone does a pretty good job.

In the open-source world, PostgreSQL uses a relatively mature, traditional cost-based optimizer, with a series of execution algorithms and many extension features not found in many commercial products. MySQL’s query engine is much simpler; it is basically nested-loop joins over indices. MySQL’s query optimization focuses on analytical queries, ensuring that the whole process is lightweight and efficient, especially for key/foreign-key joins, outer-join-to-join rewrite, and scenarios that query only the first several rows of the result.

Standard Practices for Storage Management

Today, database systems primarily use underlying storage as follows: create a large file directly on a specified disk and operate on this file directly through underlying system calls, such as mmap. Basically, the database system treats this file as a large contiguous array of database data pages.

Standard Practices for Transactions

Today, all industrial-grade database systems support ACID transactions, and they basically all use WAL to implement Durability and 2PL to implement concurrency control. PostgreSQL is an exception; it uses only MVCC to implement concurrency control. Oracle was a pioneer in providing MVCC in addition to 2PL to support other weaker consistency models. Using B+ trees for indexes is also basically standard for all database systems. Some database systems, either directly or through plugins, also provide multidimensional indexing capabilities. However, only PostgreSQL provides highly concurrent multidimensional indexes and full-text indexes through its unique GiST [10].

MySQL is distinctive in that it supports multiple implementations as its underlying storage and allows the DBA to specify different storage management implementations for different tables within the same database. MyISAM supports only table-level locks, but it performs best for read-heavy requests. InnoDB provides row-level locks and is suitable for scenarios with balanced reads and writes. However, neither storage engine implements System R’s well-known support for multilevel lock granularity[12], so both perform poorly in some scenarios, such as workloads that mix scans with high-selectivity index access.

References

[1] HELLERSTEIN J M, STONEBRAKER M, HAMILTON J. Architecture of a Database System[J]. Foundations and Trends® in Databases, 2007, 1(2): 141–259.
[2] CHANG F, DEAN J, GHEMAWAT S, et al. Bigtable: A Distributed Storage System for Structured Data[J]. 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006: 205–218.
[3] IBARAKI T, KAMEDA T. On the Optimal Nesting Order for Computing N-relational Joins[J]. ACM Trans. Database Syst., New York, NY, USA: ACM, 1984, 9(3): 482–502.
[4] SELINGER P G, ASTRAHAN M M, CHAMBERLIN D D, et al. Access Path Selection in a Relational Database Management System[C]//Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 1979: 23–34.
[5] GRAEFE G. The Cascades Framework for Query Optimization[J]. IEEE Data Eng. Bull., 1995, 18(3): 19–29.
[6] SHAPIRO L D, MAIER D, BENNINGHOFF P, et al. Exploiting Upper and Lower Bounds In Top-Down Query Optimization[C]//Proceedings of the International Database Engineering &Amp; Applications Symposium. Washington, DC, USA: IEEE Computer Society, 2001: 20–33.
[7] GRAEFE G. Encapsulation of Parallelism in the Volcano Query Processing System[C]//Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 1990: 102–111.
[8] MOHAN C, HADERLE D, LINDSAY B, et al. ARIES: A Transaction Recovery Method Supporting Fine-granularity Locking and Partial Rollbacks Using Write-ahead Logging[J]. ACM Transactions on Database Systems, New York, NY, USA: ACM, 1992, 17(1): 94–162.
[9] HAERDER T, REUTER A. Principles of Transaction-oriented Database Recovery[J]. ACM Comput. Surv., New York, NY, USA: ACM, 1983, 15(4): 287–317.
[10] KORNACKER M, MOHAN C, HELLERSTEIN J M. Concurrency and Recovery in Generalized Search Trees[C]//Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 1997: 62–72.
[11] ASTRAHAN M M, BLASGEN M W, CHAMBERLIN D D, et al. System R: Relational Approach to Database Management[J]. ACM Trans. Database Syst., New York, NY, USA: ACM, 1976, 1(2): 97–137.
[12] GRAY J, LORIE R A, PUTZOLU G R, et al. Granularity of Locks and Degrees of Consistency in a Shared Data Base[C]//NIJSSEN G M. Proceeding of the {IFIP} Working Conference on Modelling in Data Base Management Systems. North-Holland, 1976: 365–394.