Crash Recovery#
Your server currently saves data on clean shutdown but loses everything if it crashes. In this stage, you’ll add durability so data survives unexpected failures.
Write-Ahead Logging#
Implement a Write-Ahead Log (WAL) that records operations before they’re applied to memory. Each write operation must be written to the log file before updating your in-memory store.
Log Format#
Your log should record operations in append-only fashion. The format is up to you - JSONL (one JSON object per line), binary serialization, or plain text all work.
Each log entry needs enough information to replay the operation:
- Operation type (e.g., “set”, “delete”, “clear”)
- Key
- Value
- Any other metadata you need for replay
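For example, if you choose JSONL, each entry could be a single line like the ones below (the field names are just one possibility; the tests don't check them):

```
{"op":"set","key":"user:1","value":"alice"}
{"op":"delete","key":"user:1"}
{"op":"clear"}
```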
Durability#
After appending an operation to the log, ensure it’s physically written to disk before responding to the client. Use your language’s file sync mechanism (fsync, flush, etc.) to force the operating system to persist the write.
Without syncing, the OS may buffer writes in memory, and you'll lose that data on a crash.
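As a rough sketch of the append path in Go, assuming the JSONL format above; the `Entry` struct and function names are illustrative, not required:

```go
package wal

import (
	"encoding/json"
	"os"
)

// Entry is one replayable operation (field names are illustrative).
type Entry struct {
	Op    string `json:"op"`
	Key   string `json:"key,omitempty"`
	Value string `json:"value,omitempty"`
}

// Append serializes one entry, appends it to the log file, and forces it to
// disk before returning, so the caller only acknowledges the client once the
// write is durable. Open the file with O_APPEND|O_CREATE|O_WRONLY.
func Append(f *os.File, e Entry) error {
	line, err := json.Marshal(e)
	if err != nil {
		return err
	}
	if _, err := f.Write(append(line, '\n')); err != nil {
		return err
	}
	// Sync flushes OS buffers to stable storage (fsync under the hood).
	return f.Sync()
}
```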
Syncing on every operation is slow since you’re forcing a disk write and blocking the response. Additionally, holding locks during synchronous disk I/O creates severe contention under concurrent load: multiple writers queue up waiting for the disk, serializing operations that could otherwise proceed in parallel.
This is the correct trade-off for durability in a simple implementation, but it limits both throughput and concurrency. Production databases use techniques like batching to amortize the fsync cost across multiple operations and reduce lock hold times.
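Batching goes beyond what this stage requires, but one way to do it is group commit: funnel all log writes through a single goroutine that drains whatever is queued, writes it in one pass, and calls Sync once. A sketch continuing the code above; the channel-based design is just one option, not the technique the tests expect:

```go
// pending is a queued write waiting for a shared fsync.
type pending struct {
	line []byte
	done chan error
}

// groupCommit serializes all log writes through one goroutine. Each batch is
// written and synced once, and every waiter gets the same result, so N
// concurrent writers share one disk flush instead of paying for N.
func groupCommit(f *os.File, ch <-chan pending) {
	for p := range ch {
		batch := []pending{p}
	drain:
		for { // grab anything else already queued, without blocking
			select {
			case next, ok := <-ch:
				if !ok {
					break drain
				}
				batch = append(batch, next)
			default:
				break drain
			}
		}
		var err error
		for _, b := range batch {
			if _, werr := f.Write(b.line); werr != nil {
				err = werr
				break
			}
		}
		if err == nil {
			err = f.Sync() // one flush covers the whole batch
		}
		for _, b := range batch {
			b.done <- err
		}
	}
}
```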
Recovery Procedure#
When your server starts:
- Load the most recent snapshot (from the persistence stage) if one exists
- Replay all operations from the WAL that occurred after the snapshot
- Resume serving requests
If no snapshot exists, replay the entire log from the beginning.
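A recovery sketch in Go, continuing the JSONL format and `Entry` type from the append sketch above (it also needs `bufio`); `loadSnapshot` stands in for whatever snapshot loader you built in the previous stage:

```go
// Recover rebuilds the in-memory store at startup: load the latest snapshot
// if one exists, then replay every WAL entry written after it.
func Recover(snapshotPath, walPath string, store map[string]string) error {
	if _, err := os.Stat(snapshotPath); err == nil {
		if err := loadSnapshot(snapshotPath, store); err != nil {
			return err
		}
	}
	f, err := os.Open(walPath)
	if os.IsNotExist(err) {
		return nil // no WAL yet, nothing to replay
	} else if err != nil {
		return err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var e Entry
		if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
			// A torn final line can appear if the crash hit mid-append;
			// stopping at the first unparsable entry is a reasonable policy.
			break
		}
		apply(store, e)
	}
	return scanner.Err()
}

// apply re-runs one logged operation against the in-memory map.
func apply(store map[string]string, e Entry) {
	switch e.Op {
	case "set":
		store[e.Key] = e.Value
	case "delete":
		delete(store, e.Key)
	case "clear":
		for k := range store {
			delete(store, k)
		}
	}
}
```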
Checkpointing#
As your log grows, replaying from the beginning becomes slow. Periodically create snapshots of your in-memory state and truncate the log.
When to checkpoint is up to you - after N operations, every M seconds, when the log reaches a certain size, etc. The test doesn’t care about your checkpoint strategy, only that recovery works correctly.
After creating a snapshot:
- Write the snapshot to a new file
- Truncate or create a new WAL file
- Continue logging operations
On recovery, load the latest snapshot and replay only the operations logged after that snapshot.
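One safe ordering, sketched in Go as a continuation of the code above: write the new snapshot to a temporary file, rename it into place, and only then reset the WAL. `writeSnapshot` is a placeholder for your own snapshot writer, the file names are illustrative, and `path/filepath` is assumed to be imported:

```go
// Checkpoint persists the current state and resets the log. Writing to a temp
// file and renaming it means a crash mid-checkpoint never leaves a
// half-written snapshot where recovery expects a complete one.
func Checkpoint(dir string, store map[string]string, wal *os.File) error {
	tmp := filepath.Join(dir, "snapshot.tmp")
	if err := writeSnapshot(tmp, store); err != nil {
		return err
	}
	if err := os.Rename(tmp, filepath.Join(dir, "snapshot.json")); err != nil {
		return err
	}
	// Truncate only after the snapshot is safely in place. With the WAL
	// opened in append mode, the next write lands at the new end of file.
	return wal.Truncate(0)
}
```

For full durability, `writeSnapshot` should sync the temporary file before the rename, so the snapshot itself can't be lost to buffered writes.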
Storage Layout#
You now have two types of files:
- Snapshot: Full state at a point in time (from previous stage)
- WAL: Operations logged since the last snapshot
Organize these in the working directory in whatever way makes sense - separate files, subdirectories, naming conventions, etc. The test only cares that recovery works, not how you structure the files.
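For example, a run directory might end up looking like this (the names are purely illustrative):

```
.lsfr/run-20251226-210357/
├── snapshot.json   # full state as of the last checkpoint
└── wal.jsonl       # operations appended since that checkpoint
```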
Testing#
Your server will be started with the working directory:
$ ./run.sh --port 8001 --working-dir .lsfr/run-20251226-210357
Your server will be tested with unexpected crashes:
$ lsfr test crash-recovery
Testing crash-recovery: Data Survives SIGKILL
✓ Basic WAL Durability
✓ Multiple Crash Recovery Cycles
✓ Rapid Write Burst Before Crash
✓ Test Recovery When Under Concurrent Load
PASSED ✓
Run 'lsfr next' to advance to the next stage.
The tests will:
- Store data in your server
- Kill the server process (SIGKILL) without warning
- Restart your server
- Verify all data that was acknowledged before the crash is still present