Redis is the default choice for sub-millisecond, high-throughput key-value operations. Serving up to millions of operations per second from a single main execution thread, rather than a pool of worker threads, is an unusual design choice.
This post analyzes the internal mechanics of Redis: the event-loop execution engine, memory layouts, progressive rehashing, persistence, and system-level syscalls.
Table of contents
- Problem Statement
- High-Level Mental Model
- Core Architecture
- Execution Flow (Step-by-Step)
- Key Internal Mechanisms
- OS & System-Level Interactions
- Failure Modes & Trade-offs
- Minimal Event Loop Pseudocode
- Cross-System Comparisons
Problem Statement
Traditional relational and document-oriented databases (like MySQL or MongoDB) are designed as disk-first storage engines. They organize data in structures optimized for block-based disk layouts, such as B-Trees or Log-Structured Merge (LSM) Trees. While memory buffers (like the InnoDB Buffer Pool) cache frequent records, these systems must ultimately guarantee durability by performing disk transactions.
Even with modern NVMe SSDs, a disk write (requiring bus transit, controller queues, and NAND flash programming) takes anywhere from tens to hundreds of microseconds, whereas a RAM access occurs in nanoseconds.
Furthermore, traditional database engines scale throughput by spawning concurrent threads. While multi-threading leverages multi-core CPUs, it introduces severe bottlenecks at scale:
- Thread Context Switching Overhead: The OS kernel must repeatedly save and load CPU registers, program counters, and stack pointers.
- Lock Contention & Race Conditions: Protecting shared database states, memory allocators, and cache structures requires locks, mutexes, and semaphores, turning concurrent workloads into serialized queues under high contention.
- Memory Footprint: Each thread requires its own stack space, bloating memory utilization.
To solve this, early developers turned to memory caches like Memcached. Memcached solved the speed problem by storing everything in RAM, but it suffered from two key limitations:
- No Native Rich Data Types: It treats all values as opaque byte arrays. Changing a field inside a serialized JSON payload requires sending the whole object back to the client, mutating it, and uploading it again.
- Volatile-Only Memory: If a machine crashes or restarts, all data disappears instantly.
Redis addresses these limitations. It combines in-memory speed with rich server-side data structures, asynchronous persistence, and a single-threaded execution core that avoids lock contention.
High-Level Mental Model
At its core, Redis is a single-threaded, event-driven, memory-first key-value engine. The entire database keyspace resides in RAM, structured as a global hash table.
Instead of spawning a thread for each client, Redis runs a single loop (known as the Event Loop). Network sockets are configured as non-blocking. Instead of waiting for data to arrive on a socket, the single main thread registers socket descriptors with the operating system’s non-blocking I/O multiplexer (e.g., epoll on Linux).
When a socket is ready (e.g., a query has arrived in the OS kernel buffer), the operating system wakes the event loop. The loop processes the incoming command, mutates the in-memory hash table, formats the reply, writes it to an outgoing socket buffer, and immediately moves to the next task.
%%{init: {
"theme": "base",
"themeVariables": {
"primaryColor": "#E3F2FD",
"primaryBorderColor": "#1E88E5",
"secondaryColor": "#FFF3E0",
"secondaryBorderColor": "#FB8C00",
"lineColor": "#546E7A",
"fontSize": "14px"
}
}}%%
graph TD
subgraph Clients ["Clients (Control Layer)"]
C1[Client 1]
C2[Client 2]
C3[Client 3]
end
subgraph OS ["Operating System Sockets (Control Layer)"]
S1[TCP Buffer 1]
S2[TCP Buffer 2]
S3[TCP Buffer 3]
Multiplexer[OS epoll / kqueue Multiplexer]
end
subgraph Engine ["Redis Core Engine (Single Main Thread)"]
EventLoop[ae.c Event Loop]
Dispatcher[Command Parser & Dispatcher]
end
subgraph Storage ["Memory Store (Data Layer)"]
Dict[Global dict.c Keyspace]
end
C1 -->|RESP Command| S1
C2 -->|RESP Command| S2
C3 -->|RESP Command| S3
S1 --> Multiplexer
S2 --> Multiplexer
S3 --> Multiplexer
Multiplexer -->|Ready Events| EventLoop
EventLoop -->|Fetch & Parse| Dispatcher
Dispatcher -->|Read/Write Operations| Dict
classDef control fill:#E3F2FD,stroke:#1E88E5,stroke-width:2px;
classDef data fill:#FFF3E0,stroke:#FB8C00,stroke-width:2px;
class C1,C2,C3,S1,S2,S3,Multiplexer,EventLoop,Dispatcher control;
class Dict data;
By scheduling all command execution on a single thread, Redis makes every keyspace mutation safe and predictable: each command runs atomically, without a single mutex or transaction isolation lock.
Core Architecture
To understand how Redis components work, we can slice the system into 5 distinct architectural layers:
- Networking Layer: Manages client TCP connections. It parses the raw byte streams arriving on sockets into commands using the REdis Serialization Protocol (RESP).
- Event Loop (ae.c): Coordinates execution and schedules events. It abstracts the platform-specific I/O multiplexing APIs (epoll on Linux, kqueue on macOS, evport on Solaris, or select as a generic fallback).
- Command Parser & Dispatcher (server.c): Evaluates incoming buffers, matches command strings against a static routing lookup table (redisCommandTable), validates argument counts, and passes them to the correct execution functions (like setGenericCommand).
- Keyspace Dictionary (db.c/dict.c): The core in-memory store. It maps a Simple Dynamic String key to a redisObject pointer, which wraps the actual underlying storage layouts.
- Persistence Engine (aof.c/rdb.c): Handles non-blocking background serialization to disk via binary snapshot files (RDB) and append-only transaction logging (AOF).
%%{init: {
"theme": "base",
"themeVariables": {
"primaryColor": "#E3F2FD",
"primaryBorderColor": "#1E88E5",
"secondaryColor": "#FFF3E0",
"secondaryBorderColor": "#FB8C00",
"lineColor": "#546E7A",
"fontSize": "14px"
}
}}%%
graph TD
subgraph NetworkIO ["I/O Layer"]
TCP[TCP Socket Connection]
IO_Multiplexer["OS Event Multiplexer (epoll)"]
end
subgraph LogicIO ["Logic & Routing Core"]
Event_Loop["aeMain() Event Loop"]
Parser["RESP Parser & Validator"]
Dispatcher["Command Dispatcher (redisCommandTable)"]
end
subgraph MemoryStore ["Data Storage Layer"]
GlobalDict["db.c Keyspace (dict)"]
ObjectWrapper["redisObject Metadata"]
PhysicalLayout["Physical Representation (SDS, Skiplist, Listpack)"]
end
subgraph IOStorage ["Persistence Subsystem"]
RDBEngine["RDB Snapshotting Engine"]
AOFEngine["AOF Append Log"]
end
TCP <--> IO_Multiplexer
IO_Multiplexer <--> Event_Loop
Event_Loop --> Parser
Parser --> Dispatcher
Dispatcher --> GlobalDict
GlobalDict --> ObjectWrapper
ObjectWrapper --> PhysicalLayout
Dispatcher -.->|Log Transaction| AOFEngine
GlobalDict -.->|fork Copy-on-Write Dump| RDBEngine
classDef control fill:#E3F2FD,stroke:#1E88E5,stroke-width:2px;
classDef data fill:#FFF3E0,stroke:#FB8C00,stroke-width:2px;
class TCP,IO_Multiplexer,Event_Loop,Parser,Dispatcher control;
class GlobalDict,ObjectWrapper,PhysicalLayout,RDBEngine,AOFEngine data;
Execution Flow (Step-by-Step)
Tracing a standard operation, such as a SET user:100 "Anant" command, shows how the internal layers interact.
%%{init: {
"theme": "base",
"themeVariables": {
"primaryColor": "#E3F2FD",
"primaryBorderColor": "#1E88E5",
"secondaryColor": "#FFF3E0",
"secondaryBorderColor": "#FB8C00",
"lineColor": "#546E7A",
"fontSize": "14px"
}
}}%%
sequenceDiagram
autonumber
participant Client as Client Application
participant Kernel as OS TCP / Socket Buffer
participant Multiplexer as OS epoll Instance
participant EventLoop as aeMain Event Loop (Main Thread)
participant Memory as global dict (RAM)
participant AOF as AOF Log (Disk)
Client->>Kernel: Send TCP segments (*3\r\n$3\r\nSET\r\n$8\r\nuser:100\r\n$5\r\nAnant\r\n)
Kernel->>Multiplexer: Transition state to READABLE
EventLoop->>Multiplexer: Call epoll_wait() (Blocks or ticks)
Multiplexer-->>EventLoop: Return socket file descriptor (fd) ready
EventLoop->>EventLoop: Execute readQueryFromClient()
EventLoop->>Kernel: read() bytes from socket buffer to RAM
EventLoop->>EventLoop: Parse RESP syntax & identify command: SET
EventLoop->>EventLoop: Resolve setCommand in command table
EventLoop->>Memory: Mutate keyspace: dictAddOrFind("user:100", "Anant")
EventLoop->>AOF: Append raw write string to server.aof_buf
EventLoop->>EventLoop: Queue reply "+OK\r\n" in Client Output Buffer
EventLoop->>Multiplexer: Register socket fd for WRITEABLE events
EventLoop->>Multiplexer: Woken up when socket is writeable
EventLoop->>EventLoop: Execute sendReplyToClient()
EventLoop->>Kernel: write() reply from buffer to socket
Kernel-->>Client: Receive "+OK\r\n"
1. Packet Dispatch & Socket Ingress
The client serializes the command into RESP (RESP2 or RESP3 format). The serialized payload for SET user:100 "Anant" is:
*3\r\n$3\r\nSET\r\n$8\r\nuser:100\r\n$5\r\nAnant\r\n
The client transmits this payload over TCP. The network interface card (NIC) receives the frames, the kernel’s TCP stack reassembles them into a byte stream, and the result lands in the socket’s kernel read buffer.
2. OS Event Notification
Because the socket is configured in non-blocking mode, the OS does not wake a blocked reader thread. Instead, it transitions the socket state. The Redis main thread, constantly listening via epoll_wait(), receives an array of events detailing which file descriptors (fds) have data waiting.
3. Read & Parsing (readQueryFromClient)
Redis reads the raw bytes out of the kernel socket buffer using a standard read() syscall, placing it into a local client query buffer. The RESP parser processes the query buffer, converting it into a representation of Redis objects:
- argc: Number of arguments (3)
- argv: Array of robj objects containing SET, user:100, and Anant
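To make the wire format and the resulting argc/argv concrete, here is a minimal sketch of a RESP encoder and decoder for arrays of bulk strings. The class RespSketch and its methods are hypothetical names chosen for illustration; the sketch assumes one complete, well-formed command per buffer, which the real parser cannot assume.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Redis or any client library.
final class RespSketch {
    // Encode a command as a RESP array of bulk strings:
    // *<argc>\r\n followed by $<len>\r\n<bytes>\r\n for each argument.
    static byte[] encode(String... args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeAscii(out, "*" + args.length + "\r\n");
        for (String arg : args) {
            byte[] payload = arg.getBytes(StandardCharsets.UTF_8);
            writeAscii(out, "$" + payload.length + "\r\n");
            out.write(payload, 0, payload.length);
            writeAscii(out, "\r\n");
        }
        return out.toByteArray();
    }

    // Decode one complete RESP array of bulk strings into argv; argc is the list size.
    static List<String> decode(byte[] buf) {
        int[] pos = {0};
        if (buf[pos[0]++] != '*') throw new IllegalArgumentException("expected array header");
        int argc = readInt(buf, pos);
        List<String> argv = new ArrayList<>(argc);
        for (int i = 0; i < argc; i++) {
            if (buf[pos[0]++] != '$') throw new IllegalArgumentException("expected bulk string");
            int len = readInt(buf, pos);
            argv.add(new String(buf, pos[0], len, StandardCharsets.UTF_8));
            pos[0] += len + 2; // skip the payload and its trailing \r\n
        }
        return argv;
    }

    // Read a decimal integer terminated by \r\n and advance past the terminator.
    private static int readInt(byte[] buf, int[] pos) {
        int value = 0;
        while (buf[pos[0]] != '\r') {
            value = value * 10 + (buf[pos[0]++] - '0');
        }
        pos[0] += 2;
        return value;
    }

    private static void writeAscii(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        byte[] wire = encode("SET", "user:100", "Anant");
        System.out.println(decode(wire)); // prints [SET, user:100, Anant]
    }
}
```

Because every argument carries an explicit length prefix, the decoder never scans for delimiters inside the payload, which is what keeps the protocol binary-safe.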
4. Command Dispatching
Redis verifies the instruction. It looks up the command name in a hash table of commands that is populated from redisCommandTable at startup. Once setCommand is matched, Redis runs validation routines:
- Ensuring the command matches permissions (ACL check).
- Ensuring the command doesn’t exceed database capacity or write limits (if maximum memory limit is configured).
- Validating the arguments are well-formed.
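The lookup and arity validation can be sketched as a map from command name to handler. This is an illustrative model only: CommandDispatchSketch, Handler, and Command are hypothetical names, ACL and memory-limit checks are omitted, and the keyspace is a plain HashMap. The arity convention (a negative value -N meaning "at least N arguments") follows the one Redis uses in its command table.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of command lookup and validation; not Redis source code.
final class CommandDispatchSketch {
    interface Handler {
        String run(Map<String, String> db, List<String> argv);
    }

    static final class Command {
        final String name;
        final int arity; // positive: exact argument count; negative -N: at least N
        final Handler handler;

        Command(String name, int arity, Handler handler) {
            this.name = name;
            this.arity = arity;
            this.handler = handler;
        }
    }

    private final Map<String, Command> commands = new HashMap<>();
    private final Map<String, String> keyspace = new HashMap<>();

    CommandDispatchSketch() {
        register(new Command("set", -3, (db, argv) -> {
            db.put(argv.get(1), argv.get(2));
            return "+OK\r\n";
        }));
        register(new Command("get", 2, (db, argv) -> {
            String value = db.get(argv.get(1));
            if (value == null) return "$-1\r\n"; // RESP2 null bulk string
            int byteLen = value.getBytes(StandardCharsets.UTF_8).length;
            return "$" + byteLen + "\r\n" + value + "\r\n";
        }));
    }

    private void register(Command cmd) {
        commands.put(cmd.name, cmd);
    }

    // Look up the command by name, validate the argument count, then execute it.
    String dispatch(List<String> argv) {
        Command cmd = commands.get(argv.get(0).toLowerCase(Locale.ROOT));
        if (cmd == null) {
            return "-ERR unknown command '" + argv.get(0) + "'\r\n";
        }
        int argc = argv.size();
        boolean arityOk = cmd.arity >= 0 ? argc == cmd.arity : argc >= -cmd.arity;
        if (!arityOk) {
            return "-ERR wrong number of arguments for '" + cmd.name + "' command\r\n";
        }
        return cmd.handler.run(keyspace, argv);
    }
}
```

Since dispatch runs on one thread in this model, the handlers can mutate the keyspace map without any locking.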
5. Memory Keyspace Mutation
The executor accesses the global keyspace hash table (dict.c). It hashes "user:100", locates the hash bucket, allocates a new redisObject containing "Anant" in memory, and maps the key to this object. If the key already existed, it frees the old memory value and updates the pointer.
6. Transaction Journaling & Output Buffering
- AOF Logging: Redis appends the mutation command to its internal memory append-buffer (server.aof_buf) so that it can be flushed to the AOF log file on disk.
- Client Output: Redis calls addReply(). It serializes the confirmation response +OK\r\n and places it directly into the client’s output buffer in memory. It then registers a write event for this socket descriptor with the OS multiplexer (epoll).
7. Socket Egress
During the next pass of the event loop, the OS multiplexer signals that the client’s socket is ready to receive data. Redis invokes writeToClient(), using the write() syscall to flush the memory buffer to the TCP socket buffer in the kernel. The network stack sends the payload to the physical line, and the client receives the +OK reply.
Key Internal Mechanisms
To maximize memory utilization and speed, Redis bypasses standard high-level abstractions, relying on custom-written low-level data structures optimized for layout compactness and temporal locality.
1. Simple Dynamic Strings (SDS)
Standard C strings are null-terminated (\0) character arrays. This leads to severe architectural flaws:
- Calculating length with strlen() is an $O(N)$ operation because it must iterate over the entire array to locate \0.
- They are not binary-safe; any embedded null character (like in images or serialized binaries) incorrectly signals the end of the string.
- Appending strings triggers frequent memory allocations (realloc), leading to heap fragmentation.
To solve this, Redis implements Simple Dynamic Strings (SDS):
%%{init: {
"theme": "base",
"themeVariables": {
"primaryColor": "#E3F2FD",
"primaryBorderColor": "#1E88E5",
"secondaryColor": "#FFF3E0",
"secondaryBorderColor": "#FB8C00",
"lineColor": "#546E7A",
"fontSize": "14px"
}
}}%%
graph LR
subgraph SDSHeader ["SDS Header (Control / Metadata)"]
Len["len<br/>(uint8_t)"]
Alloc["alloc<br/>(uint8_t)"]
Flags["flags<br/>(uint8_t)"]
end
subgraph SDSBuffer ["SDS Data (Storage)"]
Buf["buf<br/>(char array ending with \0)"]
end
Len --- Alloc --- Flags --- Buf
classDef control fill:#E3F2FD,stroke:#1E88E5,stroke-width:2px;
classDef data fill:#FFF3E0,stroke:#FB8C00,stroke-width:2px;
class Len,Alloc,Flags control;
class Buf data;
// Equivalent memory structure represented as a Java class
public class SDSHeader8 {
    public byte len;    // 1 byte: Length of active data (uint8_t in C)
    public byte alloc;  // 1 byte: Total allocated buffer size excluding header (uint8_t in C)
    public byte flags;  // 1 byte: SDS type headers (unsigned char in C)
    public byte[] buf;  // Contiguous payload byte array (char buf[] in C)
}
- $O(1)$ Length Check: The explicit len field allows instantaneous size lookups.
- Binary Safety: The length dictates the boundary, allowing embedded nulls.
- Header packing (__packed__): Eliminates compiler padding, keeping the structure tightly localized in the CPU cache.
- Pre-allocation: When a string grows, Redis allocates more space than currently requested (e.g., doubling the allocation up to 1MB) to prevent repeated resizing overhead.
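The length tracking and pre-allocation policy can be sketched as follows. SdsSketch is a hypothetical class for illustration: buf.length stands in for the alloc field, and the growth rule (double below 1MB, otherwise add 1MB) is an assumption based on the description above rather than a copy of the C implementation.

```java
import java.util.Arrays;

// Hypothetical model of SDS growth; the real implementation is C in sds.c.
final class SdsSketch {
    private static final int MAX_PREALLOC = 1024 * 1024; // 1MB cap on doubling

    private int len;    // bytes in use, readable in O(1)
    private byte[] buf; // buf.length plays the role of the alloc field

    SdsSketch(byte[] initial) {
        this.buf = Arrays.copyOf(initial, initial.length);
        this.len = initial.length;
    }

    int length() {
        return len;
    }

    int allocated() {
        return buf.length;
    }

    // Append raw bytes; embedded zero bytes are fine because len marks the boundary.
    void append(byte[] data) {
        int needed = len + data.length;
        if (needed > buf.length) {
            // Over-allocate so that the next few appends need no reallocation.
            int newAlloc = needed < MAX_PREALLOC ? needed * 2 : needed + MAX_PREALLOC;
            buf = Arrays.copyOf(buf, newAlloc);
        }
        System.arraycopy(data, 0, buf, len, data.length);
        len = needed;
    }

    byte[] toBytes() {
        return Arrays.copyOf(buf, len);
    }
}
```

Starting from a 2-byte string, appending 2 more bytes grows the allocation to 8, so a following 1-byte append fits without touching the allocator.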
2. Global Dictionary & Progressive Rehashing
All keys in a Redis database are stored inside a global hash table defined in dict.c.
A hash table runs at $O(1)$ lookup speed until collisions accumulate, requiring table expansion.
In traditional systems, resizing involves pausing execution to allocate a larger table and copy all key-value entries. In Redis, doing this on a single thread with a 50-million-key database would cause the entire service to lock up for seconds.
To avoid this, Redis uses Progressive Rehashing:
- The dictionary contains two internal hash tables: ht[0] (active) and ht[1] (rehashing target).
- When a resize is triggered, Redis allocates ht[1] but does not copy data immediately.
- During client queries (read or write operations), or during background cron ticks (databasesCron), Redis moves a small subset of buckets from ht[0] to ht[1]: typically a single bucket per client operation, and batches of 100 buckets within a budget of about a millisecond per cron tick.
- During this window, key lookups check ht[0] first; if the key is not found, they check ht[1]. New keys are inserted only into ht[1], so ht[0] can only shrink.
- Once all keys are migrated, ht[1] becomes ht[0], and the memory of the original table is reclaimed.
This spreads the cost of a resize across many operations, so latency stays nearly flat even during massive database resizes.
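The steps above can be sketched with two bucket arrays and a migration cursor. ProgressiveDictSketch is a hypothetical, heavily simplified model: it uses String keys, Java's hashCode, a load factor of 1 as the resize trigger, and omits deletion and shrinking.

```java
// Hypothetical model of incremental rehashing; the real implementation is dict.c.
final class ProgressiveDictSketch {
    private static final class Entry {
        final String key;
        String value;
        Entry next;

        Entry(String key, String value, Entry next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private Entry[] ht0 = new Entry[4]; // active table
    private Entry[] ht1 = null;         // rehash target, non-null only while rehashing
    private int rehashIdx = -1;         // next ht0 bucket to migrate, -1 when idle
    private int size = 0;

    boolean isRehashing() {
        return rehashIdx != -1;
    }

    int size() {
        return size;
    }

    private static int bucket(String key, int capacity) {
        return (key.hashCode() & 0x7fffffff) % capacity;
    }

    // Migrate up to n buckets from ht0 to ht1, then swap tables once ht0 is drained.
    void rehashStep(int n) {
        if (!isRehashing()) return;
        while (n-- > 0 && rehashIdx < ht0.length) {
            Entry e = ht0[rehashIdx];
            while (e != null) {
                Entry next = e.next;
                int b = bucket(e.key, ht1.length);
                e.next = ht1[b];
                ht1[b] = e;
                e = next;
            }
            ht0[rehashIdx++] = null;
        }
        if (rehashIdx == ht0.length) {
            ht0 = ht1;
            ht1 = null;
            rehashIdx = -1;
        }
    }

    // Lookups check ht0 first, then ht1 while a rehash is in progress.
    private Entry find(String key) {
        for (Entry e = ht0[bucket(key, ht0.length)]; e != null; e = e.next) {
            if (e.key.equals(key)) return e;
        }
        if (isRehashing()) {
            for (Entry e = ht1[bucket(key, ht1.length)]; e != null; e = e.next) {
                if (e.key.equals(key)) return e;
            }
        }
        return null;
    }

    String get(String key) {
        rehashStep(1); // every operation pays for one bucket of migration
        Entry e = find(key);
        return e == null ? null : e.value;
    }

    void put(String key, String value) {
        rehashStep(1);
        Entry e = find(key);
        if (e != null) {
            e.value = value;
            return;
        }
        if (!isRehashing() && size >= ht0.length) {
            ht1 = new Entry[ht0.length * 2]; // allocate the target, copy nothing yet
            rehashIdx = 0;
        }
        Entry[] target = isRehashing() ? ht1 : ht0; // new keys go to ht1 during a rehash
        int b = bucket(key, target.length);
        target[b] = new Entry(key, value, target[b]);
        size++;
    }
}
```

No single call ever walks the whole table: the worst case per operation is one bucket migration plus two bucket lookups.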
3. Contiguous Structures: Ziplists and Listpacks
Traditional dynamic structures (like linked lists or trees) require heap allocating separate nodes and chaining them via pointers. In a 64-bit operating system, a pointer costs 8 bytes. For small items (e.g., lists of IDs like [12, 15, 22]), the pointer overhead can easily consume 4-5 times more memory than the actual payload data. Additionally, scattered pointers cause high CPU cache misses.
Redis solves this with Ziplists and Listpacks:
- A Listpack is a contiguous block of memory storing a sequence of elements without pointers.
- Elements are encoded using variable-length integer representations. An integer that fits in a single byte is stored as 1 byte, while larger numbers use more bytes.
- Each entry has an encoded header specifying the string or integer length and a trailer indicating the entry size (which allows traversing the list backward).
- This structure reduces memory overhead to near-zero and aligns elements directly in consecutive CPU cache lines.
- Trade-off: Mutating the middle of a listpack requires shifting memory via memmove(), which becomes slow if the listpack grows too large. Therefore, Redis only uses listpacks for small collections (e.g., up to 128 elements or elements smaller than 64 bytes) before transparently converting them to standard dynamic layouts.
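The idea of a pointer-free layout with a per-entry header and trailer can be sketched with a single byte array. This is a deliberately simplified encoding invented for illustration (a 1-byte length header, the payload, and a 1-byte total-size trailer); it is not the real listpack format, which uses variable-length headers and integer encodings. ListpackSketch is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simplified layout: [len:1][payload:len][entrySize:1] repeated.
final class ListpackSketch {
    private byte[] data = new byte[0];

    void append(String element) {
        byte[] payload = element.getBytes(StandardCharsets.UTF_8);
        if (payload.length > 253) {
            throw new IllegalArgumentException("sketch supports payloads up to 253 bytes");
        }
        int entrySize = payload.length + 2; // header byte + payload + trailer byte
        int start = data.length;
        byte[] grown = Arrays.copyOf(data, start + entrySize);
        grown[start] = (byte) payload.length;
        System.arraycopy(payload, 0, grown, start + 1, payload.length);
        grown[start + 1 + payload.length] = (byte) entrySize;
        data = grown;
    }

    // Walk front to back using each entry's length header.
    List<String> forward() {
        List<String> out = new ArrayList<>();
        int p = 0;
        while (p < data.length) {
            int len = data[p] & 0xFF;
            out.add(new String(data, p + 1, len, StandardCharsets.UTF_8));
            p += len + 2;
        }
        return out;
    }

    // Walk back to front using each entry's size trailer; no pointers are needed.
    List<String> backward() {
        List<String> out = new ArrayList<>();
        int end = data.length;
        while (end > 0) {
            int entrySize = data[end - 1] & 0xFF;
            int start = end - entrySize;
            int len = data[start] & 0xFF;
            out.add(new String(data, start + 1, len, StandardCharsets.UTF_8));
            end = start;
        }
        return out;
    }

    int sizeInBytes() {
        return data.length;
    }
}
```

Three 2-byte elements occupy 12 bytes in total here, compared with at least 24 bytes of pointers alone in a singly linked list on a 64-bit system.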
4. Hybrid Layouts: Quicklists and Skiplists
For large data sets, Redis switches to highly optimized dynamic hybrid layouts:
Quicklist (Used for LISTs)
A quicklist is a doubly linked list where each individual node is not a raw element, but a compressed, contiguous listpack block. This leverages the best of both worlds: fast insertion/deletion at list boundaries without causing high CPU cache misses or massive pointer overhead.
%%{init: {
"theme": "base",
"themeVariables": {
"primaryColor": "#E3F2FD",
"primaryBorderColor": "#1E88E5",
"secondaryColor": "#FFF3E0",
"secondaryBorderColor": "#FB8C00",
"lineColor": "#546E7A",
"fontSize": "14px"
}
}}%%
graph LR
subgraph Nodes ["Quicklist Nodes (Control Chaining)"]
QA[Quicklist Node A]
QB[Quicklist Node B]
QC[Quicklist Node C]
end
subgraph Lists ["Listpack Elements (Contiguous Data)"]
L1["Listpack A<br/>[1, 2, 3]"]
L2["Listpack B<br/>[4, 5, 6]"]
L3["Listpack C<br/>[7, 8, 9]"]
end
QA <--> QB <--> QC
QA -.-> L1
QB -.-> L2
QC -.-> L3
classDef control fill:#E3F2FD,stroke:#1E88E5,stroke-width:2px;
classDef data fill:#FFF3E0,stroke:#FB8C00,stroke-width:2px;
class QA,QB,QC control;
class L1,L2,L3 data;
Skiplist (Used for ZSETs)
Sorted Sets (ZSET) require fast lookup, insertion, and range scanning. Redis structures large ZSETs using a Skiplist teamed up with a Hash Table:
- The hash table maps each member to its score in $O(1)$ time.
- The skiplist tracks sorting order. A skiplist is a probabilistic, multi-level singly-linked list structure that emulates binary search trees without needing rebalancing algorithms. It achieves $O(\log N)$ search, insertion, and deletion complexity.
%%{init: {
"theme": "base",
"themeVariables": {
"primaryColor": "#E3F2FD",
"primaryBorderColor": "#1E88E5",
"secondaryColor": "#FFF3E0",
"secondaryBorderColor": "#FB8C00",
"lineColor": "#546E7A",
"fontSize": "14px"
}
}}%%
graph TD
subgraph Level3 ["Level 3 (Express Lane)"]
L3_1["Node 1 (Score: 10)"]
L3_9["Node 9 (Score: 90)"]
L3_Null["NULL"]
end
subgraph Level2 ["Level 2 (Mid Lane)"]
L2_1["Node 1 (Score: 10)"]
L2_5["Node 5 (Score: 50)"]
L2_9["Node 9 (Score: 90)"]
L2_Null["NULL"]
end
subgraph Level1 ["Level 1 (Base Lane)"]
L1_1["Node 1 (Score: 10)"]
L1_3["Node 3 (Score: 30)"]
L1_4["Node 4 (Score: 40)"]
L1_5["Node 5 (Score: 50)"]
L1_7["Node 7 (Score: 70)"]
L1_9["Node 9 (Score: 90)"]
L1_Null["NULL"]
end
L3_1 --> L3_9 --> L3_Null
L2_1 --> L2_5 --> L2_9 --> L2_Null
L1_1 --> L1_3 --> L1_4 --> L1_5 --> L1_7 --> L1_9 --> L1_Null
L3_1 -.-> L2_1
L2_1 -.-> L1_1
L2_5 -.-> L1_5
L3_9 -.-> L2_9
L2_9 -.-> L1_9
classDef control fill:#E3F2FD,stroke:#1E88E5,stroke-width:2px;
classDef data fill:#FFF3E0,stroke:#FB8C00,stroke-width:2px;
class L3_1,L3_9,L2_1,L2_5,L2_9,L1_1,L1_3,L1_4,L1_5,L1_7,L1_9 control;
class L3_Null,L2_Null,L1_Null data;
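A minimal sketch of the hash table plus skiplist pairing is shown below. ZsetSketch is a hypothetical class: it supports only insertion of new members, score lookup, and range-by-score, uses forward pointers only, and its promotion probability of 1/4 and maximum of 16 levels are illustrative choices rather than confirmed Redis constants.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sorted-set model; updates and deletions are omitted for brevity.
final class ZsetSketch {
    private static final int MAX_LEVEL = 16;

    private static final class Node {
        final String member;
        final double score;
        final Node[] forward; // forward[i] is the next node on level i

        Node(String member, double score, int level) {
            this.member = member;
            this.score = score;
            this.forward = new Node[level];
        }
    }

    private final Node head = new Node(null, 0, MAX_LEVEL);
    private final Map<String, Double> scores = new HashMap<>(); // member -> score, O(1)
    private final Random random = new Random(42);
    private int level = 1;

    // Each extra level is granted with probability 1/4, so tall nodes are rare.
    private int randomLevel() {
        int lvl = 1;
        while (lvl < MAX_LEVEL && random.nextInt(4) == 0) lvl++;
        return lvl;
    }

    // Order by score first, then by member name to break ties.
    private static boolean less(Node n, double score, String member) {
        return n.score < score || (n.score == score && n.member.compareTo(member) < 0);
    }

    void add(String member, double score) {
        if (scores.containsKey(member)) {
            throw new IllegalArgumentException("score updates are omitted in this sketch");
        }
        Node[] update = new Node[MAX_LEVEL]; // last node before the insert point, per level
        Node x = head;
        for (int i = level - 1; i >= 0; i--) {
            while (x.forward[i] != null && less(x.forward[i], score, member)) {
                x = x.forward[i];
            }
            update[i] = x;
        }
        int lvl = randomLevel();
        for (int i = level; i < lvl; i++) update[i] = head;
        if (lvl > level) level = lvl;
        Node node = new Node(member, score, lvl);
        for (int i = 0; i < lvl; i++) {
            node.forward[i] = update[i].forward[i];
            update[i].forward[i] = node;
        }
        scores.put(member, score);
    }

    Double score(String member) {
        return scores.get(member);
    }

    // Descend the express lanes to the first score >= min, then scan the base lane.
    List<String> rangeByScore(double min, double max) {
        Node x = head;
        for (int i = level - 1; i >= 0; i--) {
            while (x.forward[i] != null && x.forward[i].score < min) {
                x = x.forward[i];
            }
        }
        List<String> out = new ArrayList<>();
        for (x = x.forward[0]; x != null && x.score <= max; x = x.forward[0]) {
            out.add(x.member);
        }
        return out;
    }
}
```

The hash map answers "what is this member's score" directly, while the skiplist answers "which members fall in this score range" without sorting on every query.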
5. Multi-Threading and Concurrency
“If Redis is single-threaded, how does it scale, and why does task manager show multiple threads?”
Redis is single-threaded only for its core database execution logic (the main event loop). To prevent blocking this main thread, Redis spawns auxiliary background threads for heavy physical tasks (known as Background I/O or BIO):
- BIO_CLOSE_FILE: Closes old, heavy file descriptors asynchronously.
- BIO_AOF_FSYNC: Periodically calls the heavy kernel disk sync syscall fsync().
- BIO_LAZY_FREE: When you run UNLINK key (instead of DEL), a thread handles walking and freeing the memory pages of massive, nested structures (e.g., hashes containing 10 million elements) without blocking client operations.
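The UNLINK hand-off can be sketched as "detach on the main thread, reclaim on a background thread". LazyFreeSketch is a hypothetical model: Java is garbage-collected, so clearing the detached map merely stands in for walking and freeing every element, and a single-thread executor stands in for the BIO worker.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical model of lazy freeing; not how Redis manages memory internally.
final class LazyFreeSketch {
    // The keyspace is only ever touched by the caller's ("main") thread.
    private final Map<String, Map<String, String>> keyspace = new HashMap<>();

    // A single daemon worker plays the role of the lazy-free background thread.
    private final ExecutorService lazyFreeWorker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "bio-lazy-free");
        t.setDaemon(true);
        return t;
    });

    void putHash(String key, Map<String, String> value) {
        keyspace.put(key, value);
    }

    boolean exists(String key) {
        return keyspace.containsKey(key);
    }

    // UNLINK-style delete: removing the key is O(1) on the main thread, while the
    // expensive teardown of the value runs elsewhere. The future reports how many
    // elements were released.
    CompletableFuture<Integer> unlink(String key) {
        Map<String, String> detached = keyspace.remove(key);
        if (detached == null) {
            return CompletableFuture.completedFuture(0);
        }
        return CompletableFuture.supplyAsync(() -> {
            int freed = detached.size();
            detached.clear(); // stands in for walking and freeing each element
            return freed;
        }, lazyFreeWorker);
    }
}
```

The key point is that the value becomes unreachable from the keyspace before the background work starts, so no other command can observe it and no lock is required.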
Redis 6+ Threaded I/O
In modern cloud networking, parsing millions of incoming TCP commands and serializing massive responses can saturate a single CPU core’s capacity. Redis 6 introduced Threaded I/O:
- Core database engine command execution remains strictly single-threaded on the main thread.
- Auxiliary I/O threads handle reading bytes off client sockets, parsing the RESP payload, and writing formatted replies back to outgoing sockets.
This balances modern network processing workloads across multiple cores while keeping the database execution core simple, lock-free, and safe.
6. Memory Allocation with jemalloc
Redis does not manage low-level page allocations directly; it delegates this to jemalloc (or libc malloc as a fallback).
Standard system allocators can suffer from severe memory fragmentation when frequent small allocations are made and destroyed, which prevents memory from being returned to the host OS.
To combat this, Redis monitors fragmentation metrics and runs an Active Defragmentation routine:
- It scans memory page indexes at regular intervals.
- When it detects highly fragmented keys, it allocates a new, contiguous memory buffer for them, copies the payload, updates the keyspace pointers, and frees the fragmented buffers.
This allows the system to clean up memory without needing to restart the database process.
OS & System-Level Interactions
Redis relies heavily on direct operating system behaviors and system calls to achieve its speed and safety guarantees.
1. I/O Multiplexing with epoll
Redis configures TCP sockets with the O_NONBLOCK flag. To orchestrate connections, it calls:
- epoll_create(): Instantiates an epoll descriptor to monitor socket status changes.
- epoll_ctl(): Registers client socket descriptors with flags (EPOLLIN to check if data is readable, EPOLLOUT to check if write buffers are clear).
- epoll_wait(): Passes control to the OS kernel. The kernel pauses the main thread until at least one socket becomes active, then returns the active fds.
Unlike the older select and poll APIs, where every call costs $O(N)$ because each registered socket must be scanned, epoll_wait() returns only the descriptors that are ready. Its cost scales with the number of active sockets rather than the number of registered ones, because the kernel maintains a ready list as events arrive.
2. Snapshots via fork() and Copy-on-Write (COW)
To execute disk snapshotting (BGSAVE) without freezing client requests, Redis uses the POSIX fork() syscall:
%%{init: {
"theme": "base",
"themeVariables": {
"primaryColor": "#E3F2FD",
"primaryBorderColor": "#1E88E5",
"secondaryColor": "#FFF3E0",
"secondaryBorderColor": "#FB8C00",
"lineColor": "#546E7A",
"fontSize": "14px"
}
}}%%
graph TD
subgraph Parent ["Parent Process (Main Event Loop)"]
MainThread["Main Thread (CPU & Keyspace Mutator)"]
VirtualPageParent["Virtual Memory Page Table (Parent)"]
end
subgraph OS ["Operating System Virtual Memory"]
COWPage1["RAM Page 1 (Read-Only Copy-on-Write)"]
COWPage2["RAM Page 2 (Read-Only Copy-on-Write)"]
ClonedPage["RAM Page 2 Clone (Written on Parent Write)"]
end
subgraph Child ["Child Process (BGSAVE / Background worker)"]
BackgroundThread["Background Thread (Disk Writer)"]
VirtualPageChild["Virtual Memory Page Table (Child)"]
end
MainThread -->|Read / Write| VirtualPageParent
BackgroundThread -->|Read Only| VirtualPageChild
VirtualPageParent --> COWPage1
VirtualPageParent --> COWPage2
VirtualPageChild --> COWPage1
VirtualPageChild --> COWPage2
MainThread -.->|Triggers fork clone| BackgroundThread
MainThread -.->|Write triggers clone| ClonedPage
classDef control fill:#E3F2FD,stroke:#1E88E5,stroke-width:2px;
classDef data fill:#FFF3E0,stroke:#FB8C00,stroke-width:2px;
class MainThread,VirtualPageParent,BackgroundThread,VirtualPageChild control;
class COWPage1,COWPage2,ClonedPage data;
- When fork() is called, the OS spawns a lightweight child process. The child process does not duplicate the physical memory pages; instead, it duplicates only the virtual memory page tables pointing to the parent’s memory addresses.
- The child process iterates through this shared memory mapping and serializes the keyspace into a binary dump.rdb file.
- While the child process is writing to disk, the main thread continues modifying keys in memory.
- To prevent the child from writing inconsistent data, the OS marks all memory pages as Copy-on-Write (COW). If the main thread attempts to modify data on a memory page, the OS kernel intercepts the write, clones that specific memory page to a new physical address, applies the write to the cloned page, and updates the parent’s page table. The child’s page table remains pointing to the untouched original page.
- Trade-off: This approach gives lock-free background serialization with little steady-state overhead, although the fork() call itself still pauses the main thread while the page tables are copied. If your database has a high write rate during a backup window, the OS may end up duplicating a massive number of pages. In the worst case, this can double your RAM utilization, potentially triggering the Linux Out-Of-Memory (OOM) Killer.
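Java cannot call fork(), but the page-level semantics can be modeled in user space. In this sketch, CowSketch is a hypothetical class in which an array of byte arrays plays the page table: snapshot() copies only the references, and the first write to a shared page clones it. Releasing a snapshot is not modeled.

```java
import java.util.Arrays;

// Hypothetical user-space model of copy-on-write; the real mechanism lives in the kernel.
final class CowSketch {
    static final int PAGE_SIZE = 4096;

    private final byte[][] pageTable; // the "parent" page table: one reference per page
    private final boolean[] shared;   // true while a snapshot may still reference the page
    private int pagesCopied = 0;

    CowSketch(int pages) {
        this.pageTable = new byte[pages][PAGE_SIZE];
        this.shared = new boolean[pages];
    }

    // fork()-like step: duplicate the page table only, never the page contents.
    byte[][] snapshot() {
        Arrays.fill(shared, true);
        return pageTable.clone(); // shallow copy: both tables point at the same pages
    }

    // Parent write: clone the page first if the snapshot still shares it.
    void write(int address, byte value) {
        int page = address / PAGE_SIZE;
        if (shared[page]) {
            pageTable[page] = pageTable[page].clone();
            shared[page] = false;
            pagesCopied++;
        }
        pageTable[page][address % PAGE_SIZE] = value;
    }

    byte read(int address) {
        return pageTable[address / PAGE_SIZE][address % PAGE_SIZE];
    }

    int pagesCopied() {
        return pagesCopied;
    }
}
```

Two writes to the same page after a snapshot cost one page copy, not two, which is why extra memory use during a snapshot tracks the number of distinct pages touched rather than the number of writes.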
3. Durability with fsync()
When a program calls a standard write() syscall, the operating system does not write the data directly to disk. Instead, it places the data into a kernel buffer to optimize disk performance. If the server suddenly loses power during this window, that buffered data is lost.
To guarantee durability, Redis must force the OS to flush these buffers using the fsync() syscall. In the AOF subsystem, this behavior is controlled by the appendfsync configuration:
| Setting | Mechanism | Trade-off |
|---|---|---|
| always | The main thread calls fsync() after every single command before returning a reply. | Extremely safe, but reduces throughput to disk speeds (a few thousand writes/sec). |
| everysec | An auxiliary background BIO thread executes fsync() once per second. | Ideal balance. If the server crashes, you risk losing at most 1 second of write data, but execution throughput remains incredibly fast. |
| no | Redis does not call fsync(). It relies on the operating system kernel to decide when to flush the buffers (typically every 30 seconds). | Maximum speed, but offers very weak durability guarantees. |
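The difference between write() and fsync() can be sketched with Java's FileChannel, where write() hands bytes to the OS and force(true) asks it to flush them to the storage device. AofSketch and FsyncPolicy are hypothetical names mirroring the three settings; unlike Redis, this sketch performs the everysec sync inline on the calling thread instead of on a background thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical append-only log writer for illustration; not the Redis AOF code.
final class AofSketch implements AutoCloseable {
    enum FsyncPolicy { ALWAYS, EVERYSEC, NO }

    private final FileChannel channel;
    private final FsyncPolicy policy;
    private long lastSyncNanos = System.nanoTime();
    private int fsyncCalls = 0;

    AofSketch(Path file, FsyncPolicy policy) {
        try {
            this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.policy = policy;
    }

    static Path tempFile() {
        try {
            Path p = Files.createTempFile("aof-sketch", ".aof");
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void append(String respCommand) {
        try {
            ByteBuffer buf = ByteBuffer.wrap(respCommand.getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                channel.write(buf); // data reaches the OS, not necessarily the disk
            }
            long now = System.nanoTime();
            boolean due = policy == FsyncPolicy.ALWAYS
                    || (policy == FsyncPolicy.EVERYSEC && now - lastSyncNanos >= 1_000_000_000L);
            if (due) {
                channel.force(true); // ask the OS to flush file data and metadata
                lastSyncNanos = now;
                fsyncCalls++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int fsyncCalls() {
        return fsyncCalls;
    }

    long size() {
        try {
            return channel.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With ALWAYS, every append pays for a flush; with NO, the same bytes are written but the flush count stays at zero and durability depends entirely on the kernel.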
Failure Modes & Trade-offs
The single-threaded, in-memory model of Redis introduces specific trade-offs and operational failure modes:
1. Head-of-Line Blocking
Because Redis executes all database queries sequentially on a single thread, any single slow command blocks all subsequent queries.
- Running queries like KEYS * or SMEMBERS on a large set, or executing a massive FLUSHALL, will lock the main event loop.
- While Redis is busy processing that single command, all incoming connections will back up in the OS TCP socket queues. Once the queues fill up, new client connection attempts will time out.
- Remedy: Avoid using $O(N)$ commands on large datasets in production. Instead, use incremental, cursor-based commands like SCAN and SSCAN, and use UNLINK instead of DEL.
2. Copy-on-Write Memory Spikes
Under intense write workloads (e.g., bulk loading data) while a background BGSAVE or AOF rewrite is active, Copy-On-Write can trigger massive page duplications.
- If your system memory footprint exceeds 50-60% of total host RAM, the physical memory can become saturated, triggering a kernel panic or causing the Linux OOM killer to terminate the Redis process.
- Remedy: Always configure the host operating system’s virtual memory settings to allow overcommitting (vm.overcommit_memory = 1) and keep at least 30-40% of system RAM free to handle writes during backup windows.
3. Split-Brain & Data Loss in Sentinel/Cluster
Redis replication is asynchronous:
- The master node applies a write command, returns a reply to the client, and then replicates that command to its replica nodes.
- If the master node crashes or experiences a network partition before its replicas receive the replication stream, a failover will trigger, and the replica will be promoted to the new master. Any writes that were processed by the old master but not yet replicated are permanently lost.
- In split-brain scenarios where a network partition isolates the old master, it might continue accepting writes from local clients while the rest of the cluster promotes a new master. Once the partition heals, the old master is demoted to a replica, and its local writes are overwritten, resulting in data loss.
- Remedy: Configure min-replicas-to-write and min-replicas-max-lag to ensure the master rejects write commands if it loses connection to too many replica nodes.
Minimal Event Loop Pseudocode
Below is a simplified, structured Java NIO equivalent of the core Redis event-loop architecture (ae.c and server.c), illustrating how file events and time events are processed in a single thread without blocking by leveraging the JVM’s Selector API.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

/**
 * A simplified Java NIO (Non-blocking I/O) equivalent of the core Redis
 * event-loop architecture (ae.c and server.c).
 * In Java, Selector delegates to OS-level epoll (Linux) or kqueue (macOS).
 */
public class RedisEventLoop {
    // Event flags: AE_ALL_EVENTS must include the time-event bit,
    // otherwise the flag check in aeProcessEvents never fires.
    private static final int AE_FILE_EVENTS = 1;
    private static final int AE_TIME_EVENTS = 2;
    private static final int AE_ALL_EVENTS = AE_FILE_EVENTS | AE_TIME_EVENTS;

    private final Selector selector;
    private volatile boolean stop = false;

    public RedisEventLoop() throws IOException {
        this.selector = Selector.open();
    }

    // The main Event Loop execution block (aeMain in C)
    public void aeMain() throws IOException {
        stop = false;
        while (!stop) {
            // Process any active File I/O events or Time-based Cron events
            aeProcessEvents(AE_ALL_EVENTS);
        }
    }

    // Process active I/O descriptor readiness and timed events (aeProcessEvents in C)
    public int aeProcessEvents(int flags) throws IOException {
        // Calculate timeout based on nearest scheduled time event
        long timeoutMs = calculateNearestTimeEventTimeout();
        // select() blocks until a channel is ready or the timeout expires
        int readyChannels = selector.select(timeoutMs);

        // 1. Process File Events (Socket Read / Write)
        Iterator<SelectionKey> keyIterator = selector.selectedKeys().iterator();
        while (keyIterator.hasNext()) {
            SelectionKey key = keyIterator.next();
            keyIterator.remove(); // Clear from selected set
            if (key.isValid() && key.isReadable()) {
                // Read incoming RESP data from client socket
                readQueryFromClient(key);
            }
            // The read handler may have closed the channel, so re-check validity
            if (key.isValid() && key.isWritable()) {
                // Flush buffered response bytes back to client socket
                writeToClient(key);
            }
        }

        // 2. Process Time Events (Cron ticks, Eviction loops, TTL checks)
        if ((flags & AE_TIME_EVENTS) != 0) {
            processTimeEvents();
        }
        return readyChannels;
    }

    // Handler executed when a socket is ready to be read
    private void readQueryFromClient(SelectionKey key) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        ByteBuffer buffer = ByteBuffer.allocate(4096);
        int bytesRead = channel.read(buffer);
        if (bytesRead == -1) {
            // Peer closed the connection: release the key and the socket
            key.cancel();
            channel.close();
            return;
        }
        if (bytesRead > 0) {
            buffer.flip();
            byte[] rawBytes = new byte[buffer.remaining()];
            buffer.get(rawBytes);
            // 1. Parse RESP bytes
            RedisCommand cmd = parseCommand(rawBytes);
            // 2. Execute command handler and mutate database dictionary in memory
            executeCommand(cmd);
            // 3. Buffer reply ("+OK\r\n") and register the key for write readiness
            queueReplyForClient(key, "+OK\r\n");
        }
    }

    // Mock helper methods representing internal system behaviors
    private long calculateNearestTimeEventTimeout() { return 100; }
    private void processTimeEvents() {}
    private RedisCommand parseCommand(byte[] rawBytes) { return new RedisCommand(); }
    private void executeCommand(RedisCommand cmd) {}

    // Keeps one pending reply per client for brevity; adds OP_WRITE without dropping OP_READ
    private void queueReplyForClient(SelectionKey key, String reply) {
        key.attach(ByteBuffer.wrap(reply.getBytes(StandardCharsets.US_ASCII)));
        key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
    }

    // Flush the pending reply; stop watching for writability once it is fully sent
    private void writeToClient(SelectionKey key) throws IOException {
        ByteBuffer pending = (ByteBuffer) key.attachment();
        if (pending != null) {
            ((SocketChannel) key.channel()).write(pending);
        }
        if (pending == null || !pending.hasRemaining()) {
            key.attach(null);
            key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
        }
    }

    private static class RedisCommand {}
}
Cross-System Comparisons
The design choices in Redis reflect fundamental patterns in systems engineering:
- Node.js & Nginx: Both share the same architectural pattern: a single-threaded event loop combined with non-blocking I/O multiplexing. This approach scales network connections without the heavy overhead of multi-threaded setups.
- Append-Only Logs: Redis’s AOF mechanism (sequentially writing mutation operations to a file and periodically compacting it) follows the same append-then-compact pattern found in the write-ahead logs of LSM-based engines like RocksDB and Cassandra, and in Kafka’s log segments.
- Multi-Threaded Successors: As multi-core processors continue to scale, newer in-memory stores spread work across cores. KeyDB (a multi-threaded fork of Redis) runs several event-loop threads against a shared keyspace guarded by a lock, while Dragonfly uses a shared-nothing, thread-per-core architecture in which each thread runs its own event loop and owns a partition of the keyspace, so no global mutex is needed.
These architectural patterns form the foundation of high-performance networking and storage systems across modern infrastructure.