Testing Distributed Systems Like FoundationDB: Deterministic Simulation That Actually Finds Bugs

Integration tests pass. Load tests pass. You deploy. Forty-eight hours later your database corrupts two records because a network partition happened while a leader election was mid-flight. You spend a week reproducing it. You never quite do.

That’s the default state of distributed systems testing, and it’s broken by design. You’re trying to test inherently non-deterministic systems — multiple processes, real clocks, real networks — with tools built for deterministic sequential code. The feedback loop is brutal: flaky CI, heisenbugs, and the ever-popular "it only happens in prod."

FoundationDB’s engineering team decided this was unacceptable in 2009 and built something that the rest of the industry is still catching up to: a simulation framework where the entire distributed system runs inside a single-threaded process, every source of non-determinism is replaced by a deterministic pseudo-random number generator, and a failing test gives you a seed you can replay bit-for-bit. Every. Single. Time.

This article breaks down exactly how that works, why it’s so effective, and how you can apply the same technique today — whether you’re writing Rust, Go, or Java.

Why Normal Approaches Fall Short

Before getting to the solution, it’s worth being precise about the problem.

Jepsen is fantastic. Kyle Kingsbury has done more for distributed systems correctness than almost anyone. But Jepsen tests a running system and injects failures from the outside. When a test finds a bug, reproducing it requires re-running the whole scenario and hoping the scheduler cooperates. It finds bugs in production-ready systems, not during development.

Property-based testing (QuickCheck, Hypothesis) generates random inputs and shrinks failing cases. It’s great for pure functions, decent for stateful machines, and almost useless for testing network-level behavior across multiple nodes without significant scaffolding.

Docker Compose test environments are just prod with fewer machines. You test one interleaving of events — the one the OS scheduler chose today.

The core issue: distributed systems bugs live in the ordering of events across nodes. Network messages arrive in different orders. Clocks drift. Disks stall for 200ms. The bug might only manifest when a specific sequence of these events occurs. That sequence has a probability of roughly 1/(n!) where n is the number of concurrent operations. In a real system with real time, you’ll never enumerate that space.

The Simulation Insight

The FoundationDB team’s key realization was this: if you control every source of non-determinism in your system, the system becomes a pure function of a random seed.

Sources of non-determinism in a distributed system:

  • Wall clock time
  • Thread scheduling (context switches happen whenever the OS wants)
  • Network: message delivery order, latency, packet loss
  • Disk I/O: latency, failure modes
  • Random number generation

If you replace all of these with a single PRNG seeded from one integer, your entire cluster simulation becomes reproducible. Seed 42 always produces the exact same sequence of network partitions, message orderings, and disk stalls. A failing test case reduces to: "run with seed 9381472."

This sounds simple. The implementation is not.

How FoundationDB Actually Did It

FoundationDB built a custom C++ framework called Flow — essentially a cooperative multitasking / actor runtime built on coroutines. Actors communicate via futures and message passing. Critically, the runtime decides when each actor runs, not the OS scheduler.

The simulation layer replaces:

  • The network with an in-process message bus that delivers messages in PRNG-controlled order, with PRNG-controlled latency
  • The disk with an in-process simulation that can inject latency, partial writes, and corruption
  • clock() with a simulated clock the framework advances deterministically
  • RNG calls everywhere with a seeded PRNG

The entire FoundationDB cluster — multiple storage servers, multiple log servers, coordinators, clients — runs in a single OS process. One thread. Zero real I/O. The simulation runs at thousands of simulated hours per minute of wall clock time.

Their test suite runs a random workload (from a PRNG seed) against this simulated cluster, injects faults (also seeded), and checks safety invariants. They’ve reportedly found bugs this way that existed in multiple simultaneous code paths requiring very specific interleavings — bugs that would be essentially impossible to find with real-world testing.

The official talk by Will Wilson from 2014, Testing Distributed Systems w/ Deterministic Simulation, is required watching. FoundationDB’s source is on GitHub at apple/foundationdb.

The Mental Model Shift

Here’s what changes when you adopt simulation testing:

Traditional testing: run the real system, observe external behavior, inject failures from outside.

Simulation testing: build a model of the system where you own the scheduler, the network, and the clock. Test the model extensively. Argue that the model is faithful enough to reality that bugs in the model correspond to bugs in the real system.

That last step — the faithfulness argument — is where people get nervous. The response to that nervousness is: you’re already making that argument. Every unit test you write is a claim that the behavior you’re testing corresponds to production behavior. Simulation just makes the scope of that claim explicit and tractable.

Building a Simulation Layer: The Mechanics

Let me walk through how you’d structure this for a new distributed system. I’ll use pseudocode close to Python to keep it readable, then show a real Rust example.

Step 1: Abstract All I/O Behind Interfaces

This is non-negotiable. If any code anywhere calls time.time() or socket.send() directly, you cannot simulate it.

# Bad — untestable
def process_request(req):
    now = time.time()  # Real clock. Can't control.
    sock.send(serialize(req))  # Real network. Can't control.

# Good — injectable
def process_request(req, clock, network):
    now = clock.now()
    network.send(req)

Every node in your system must receive its clock, network, and storage as injected dependencies. This isn’t just good for simulation — it’s just good design.

Step 2: Replace the Scheduler

Threads are the enemy of determinism. Replace thread-based concurrency with cooperative coroutines or an event loop you control.

class SimulatedNetwork:
    def __init__(self, rng):
        self.rng = rng
        self.pending = []  # (deliver_at, from_node, to_node, message)

    def send(self, from_node, to_node, message):
        # Simulate network latency with controlled randomness
        latency = self.rng.uniform(1, 50)  # ms, but simulated
        drop = self.rng.random() < 0.01    # 1% packet loss
        if not drop:
            self.pending.append((self.clock.now() + latency,
                                 from_node, to_node, message))

    def tick(self, current_time):
        ready = [(t, f, to, m) for t, f, to, m in self.pending
                 if t <= current_time]
        self.pending = [(t, f, to, m) for t, f, to, m in self.pending
                        if t > current_time]
        # Deliver in random order (also seeded)
        self.rng.shuffle(ready)
        return ready

Your main simulation loop looks like:

def run_simulation(seed, duration_ms, workload, nodes):
    rng = random.Random(seed)
    clock = SimulatedClock(start=0)
    network = SimulatedNetwork(rng, clock)

    # Wire up nodes with the simulated dependencies
    cluster = [Node(id=i, network=network, clock=clock, rng=rng)
               for i in range(len(nodes))]

    while clock.now() < duration_ms:
        # Advance clock
        clock.advance(rng.uniform(0.1, 5))  # Simulated ms

        # Deliver pending network messages
        for t, src, dst, msg in network.tick(clock.now()):
            cluster[dst].handle_message(src, msg)

        # Maybe inject a fault
        if rng.random() < 0.001:
            victim = rng.choice(cluster)
            victim.crash()

        # Run scheduled callbacks in each node
        for node in cluster:
            node.run_pending(clock.now())

    # Check invariants
    assert_linearizability(cluster)

Same seed, same rng.uniform() calls, same rng.random() calls, same crash victim chosen, same message ordering. Deterministic.

Step 3: Fault Injection as First-Class Configuration

Don’t add fault injection as an afterthought. Make it a parameter:

@dataclass
class FaultConfig:
    network_partition_probability: float = 0.001
    node_crash_probability: float = 0.0005
    disk_stall_probability: float = 0.002
    message_reorder_probability: float = 0.05
    clock_skew_max_ms: float = 500.0

# Low chaos for smoke testing, high chaos for adversarial testing
SMOKE = FaultConfig(network_partition_probability=0.0001)
ADVERSARIAL = FaultConfig(network_partition_probability=0.01,
                          node_crash_probability=0.005)

Your CI runs hundreds of seeds with ADVERSARIAL config. You’ll find bugs on seed 72934 that took your Raft implementation 47 simulated minutes to trigger.

Real-World Tool: Turmoil for Rust

If you’re writing Rust, you don’t have to build this from scratch. The Tokio team maintains Turmoil — a network simulation framework for async Rust code.

Turmoil gives you:

  • A deterministic simulated network (control latency, partitions, node crashes)
  • DNS simulation
  • Works with standard tokio async code — minimal changes to production code
use turmoil::Builder;
use std::net::SocketAddr;

#[test]
fn test_raft_leader_election_under_partition() {
    let mut sim = Builder::new()
        .simulation_duration(std::time::Duration::from_secs(30))
        .build();

    // Start three Raft nodes
    for i in 0..3 {
        sim.host(format!("raft-{i}"), move || async move {
            let addr: SocketAddr = "0.0.0.0:8080".parse().unwrap();
            run_raft_node(i, addr).await
        });
    }

    sim.client("client", async {
        // Wait for initial leader election
        tokio::time::sleep(Duration::from_secs(2)).await;

        // Partition raft-0 from the rest
        turmoil::partition("raft-0", "raft-1");
        turmoil::partition("raft-0", "raft-2");

        // Wait for re-election
        tokio::time::sleep(Duration::from_secs(5)).await;

        // Verify a new leader was elected among raft-1 and raft-2
        let leader = query_leader("raft-1:8080").await?;
        assert!(leader == "raft-1" || leader == "raft-2");

        Ok(())
    });

    sim.run().unwrap();
}

The simulation runs in a single thread, controlled time, deterministic. If the test fails, you’re running it against a controlled scenario you defined. Add randomized fault injection and you get the full FoundationDB-style approach.

TigerBeetle (a financial database built in Zig) also has an excellent simulation framework they call VOPR (Viewstamped Operation Replicator). Their blog post on simulation testing is worth reading — they test every pull request with simulated faults before it merges.

Gotchas

Gotcha #1: Abstraction leakage. One direct call to time.Now() buried in a third-party library deep in your dependency tree and the whole thing falls apart. Audit your dependencies. Some libraries make this configurable (injectable clock), many don’t. For the ones that don’t, you may need wrappers or to reconsider the dependency.

Gotcha #2: The faithfulness trap. You can build a simulation so simplified it finds no real bugs — or worse, it passes while the real system has a bug the simulation doesn’t model. The simulation must faithfully model the failure modes you care about: network partitions (not just latency), partial writes (not just full success or full failure), asymmetric connectivity (node A can reach B but not vice versa). These are the cases that destroy distributed protocols.

Gotcha #3: PRNG quality. Use a proper seeded PRNG (Mersenne Twister, PCG, xoshiro). Don’t use your language’s default rand() if it’s seeded from /dev/urandom at startup. The seed must be recorded and logged — every simulation run should print [simulation] seed=9381472 so that when CI fails, you have what you need.

Gotcha #4: Shrinking. Unlike property-based testing frameworks, simulation doesn’t automatically shrink failing cases. If seed 9381472 triggers a bug after 8000 simulated events, debugging it is still hard. The mitigation is to build scenario recording — log every event the simulator fires, and write a replay mode that replays a specific sequence of events step-by-step with your debugger attached. This is harder to build but invaluable.

Gotcha #5: The real system still needs integration tests. Simulation tests your protocol logic. They don’t test your serialization format, your actual TCP stack behavior, OS signals, or deployment configuration. Run simulation tests to find protocol bugs early. Run real integration tests (ideally with Jepsen or equivalent) before major releases. They’re complementary, not alternatives.

Gotcha #6: Clock synchronization in simulation. If different nodes in your simulation have different views of time (realistic!), make sure your simulated clock supports per-node skew. A common mistake is giving all nodes the same clock — this makes your simulation less adversarial than reality and will miss bugs in clock-dependent logic like lease expiration.

Production-Ready Approach

Here’s how to structure this at scale:

Separate your protocol layer from your I/O layer. All consensus logic, replication logic, and state machine transitions should live in pure functions or pure-ish objects that take (state, event) -> (new_state, effects). Effects are descriptions of I/O to perform — "send message X to node Y", "write these bytes to disk at offset Z". The actual I/O is done by an outer loop that can be swapped between real and simulated.

This is the functional core, imperative shell pattern, and it’s the single best architectural decision you can make for testability.

┌──────────────────────────────────────┐
│         Simulation Shell             │  ← Fake network, fake disk, fake clock
│  ┌────────────────────────────────┐  │
│  │   Protocol Core (pure-ish)     │  │  ← Raft, Paxos, your replication logic
│  │   (state, event) → effects     │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘

┌──────────────────────────────────────┐
│         Production Shell             │  ← Real TCP, real disk, real clock
│  ┌────────────────────────────────┐  │
│  │   Protocol Core (same code)    │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘

Run simulation tests in CI against every PR. Not just on main. The point is catching protocol bugs before they’re buried under five more commits. Run 100-1000 random seeds per PR, depending on your test budget. This sounds expensive — a well-optimized simulation running 1000x faster than real time makes it cheap.

Log seeds prominently. Every CI run, every test invocation, print the seed. When something fails locally or in CI, the seed is the artifact you preserve. Build a --replay-seed flag into your test harness from day one.

Vary the fault config. Smoke tests use near-zero fault probability. Adversarial tests crank it up. Occasionally run "gray failure" scenarios: nodes that are slow but not crashed, disks that acknowledge writes but don’t persist them, networks that reorder but don’t drop. These are the failure modes that fool protocols that only test crash-stop failures.

Is This Worth the Investment?

The honest answer: it depends on what you’re building.

If you’re building a distributed database, a consensus system, a replication log, a distributed transaction coordinator — yes, unambiguously. The failure modes are subtle, the bugs are catastrophic, and the investment in simulation pays back within months.

If you’re building a web API that happens to be deployed on multiple machines — probably not. Standard testing, Kubernetes health checks, and some Chaos Engineering in staging gets you most of the way there at much lower cost.

The tell is this: if your system has correctness invariants that must hold across node failures (linearizability, durability guarantees, exactly-once semantics), simulation testing is the only practical way to verify them with confidence. Everything else is hoping the scheduler doesn’t bite you.

FoundationDB’s team famously said that they have more confidence in their simulation results than in their real-world test results. That’s a remarkable claim. It’s also defensible, once you understand the technique. Bugs that take months to manifest in production take minutes to find under adversarial simulation.

The tools are available. The technique is documented. The only thing stopping most teams is the architectural discipline required to build a clean I/O boundary — which is, frankly, something you should have anyway.

Leave a comment

👁 Views: 2,289 · Unique visitors: 1,646