Two-Phase Commit vs Sagas in 2026: When to Use Each and Why It Actually Matters

Every engineer who has graduated from a monolith to microservices eventually hits the same wall: you need to atomically update data that lives in two different databases, and suddenly "BEGIN TRANSACTION" means nothing anymore.

The naive approach — call service A, then call service B, cross your fingers — works fine until service B crashes at 2 AM and you have an order that’s been charged but never fulfilled. Your on-call rotation will hate you.

Two main patterns have emerged to solve this: Two-Phase Commit (2PC) and Sagas. Both are mature, both are documented to death in papers and conference talks, and yet teams keep picking the wrong one for their situation. This article cuts through the noise and gives you a practical decision framework with real implementation examples.


The Problem Space, Precisely

Before comparing solutions, let’s be precise about what we’re solving.

In a monolith with a single database, ACID transactions give you atomicity for free. You write to orders, inventory, and payments tables in one transaction. Either all three commit or none do. Simple.

In a distributed system, those three writes go to three different services, possibly running different databases on different machines. You need distributed atomicity — or, more accurately, you need to reason carefully about whether you actually need atomicity or just eventual consistency.

That distinction is the crux of the entire 2PC vs Sagas debate.


Two-Phase Commit: The Nuclear Option

2PC has been around since the 1970s. It’s the "let’s coordinate all participants to agree before anyone commits" approach.

How It Works

There’s a coordinator (usually your application or a transaction manager) and multiple participants (your databases or services).

Phase 1 — Prepare: The coordinator sends a PREPARE message to every participant. Each participant locks the resources it needs, writes the changes to a durable log, and replies YES (ready to commit) or NO (something failed).

Phase 2 — Commit or Abort: If all participants replied YES, the coordinator sends COMMIT. If anyone said NO (or timed out), it sends ABORT. Participants then execute accordingly and release their locks.

Coordinator
    │
    ├──► Service A: "Can you commit?" ──► "YES" (locks held)
    ├──► Service B: "Can you commit?" ──► "YES" (locks held)
    ├──► Service C: "Can you commit?" ──► "YES" (locks held)
    │
    └──► All said YES → send COMMIT to all three

This gives you genuine atomicity. Either everything commits or nothing does.
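
To make the two phases concrete, here is a minimal in-memory sketch of the coordinator logic. The Participant class and its prepare/commit/abort methods are hypothetical stand-ins for real resource managers, and the sketch deliberately ignores durability, timeouts, and coordinator crashes.

```python
class Participant:
    """Hypothetical in-memory participant; a real one would be a database or service."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: a real participant would lock resources and write a durable log here.
        if self.can_commit:
            self.state = "prepared"
            return True
        self.state = "aborted"
        return False

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if every participant committed, False if the transaction was aborted."""
    prepared = []
    for p in participants:
        if p.prepare():
            prepared.append(p)
        else:
            # One NO vote is enough: abort everyone that already voted YES.
            for q in prepared:
                q.abort()
            return False
    # Phase 2: all voted YES, so the decision is COMMIT.
    for p in participants:
        p.commit()
    return True
```

Calling two_phase_commit with three willing participants leaves all of them in the "committed" state; flipping can_commit to False on any one of them leaves nobody committed.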

Where 2PC Actually Works

2PC is a reasonable choice in a narrow set of scenarios:

  • Same database engine across services — PostgreSQL supports 2PC natively via PREPARE TRANSACTION / COMMIT PREPARED. If your services all share one Postgres cluster (different schemas or databases), this works well.
  • XA transactions — Java EE / Jakarta EE stacks with JTA, or systems using XA-compliant drivers, handle 2PC at the middleware level. It’s battle-tested there.
  • Low-latency, high-reliability internal networks — when all participants are in the same datacenter with fast, reliable connections and you can afford the extra round trips.
  • Small participant counts — 2PC with 2-3 participants is manageable. With 10+, the blast radius of failures becomes unpleasant.

Here’s a minimal PostgreSQL 2PC example with two databases:

import psycopg2

# Two databases on the same Postgres instance.
# The server must set max_prepared_transactions > 0 (the default is 0).
conn_orders   = psycopg2.connect("dbname=orders user=app")
conn_payments = psycopg2.connect("dbname=payments user=app")

txn_id = "order-1234-payment-5678"  # must be globally unique

# tpc_begin() must run before any statement on the connection
conn_orders.tpc_begin(conn_orders.xid(0, txn_id, "orders"))
conn_payments.tpc_begin(conn_payments.xid(0, txn_id, "payments"))

try:
    cur_o = conn_orders.cursor()
    cur_p = conn_payments.cursor()

    # Phase 1: do the work, then PREPARE (don't commit yet)
    cur_o.execute("INSERT INTO orders (id, status) VALUES (1234, 'pending')")
    cur_p.execute("UPDATE wallets SET balance = balance - 50 WHERE user_id = 42")

    conn_orders.tpc_prepare()
    conn_payments.tpc_prepare()

except Exception:
    # Nothing has committed yet, so both sides can still roll back
    conn_orders.tpc_rollback()
    conn_payments.tpc_rollback()
    raise

# Phase 2: commit both. A failure between these two calls leaves an
# in-doubt transaction that must be finished with COMMIT PREPARED.
conn_orders.tpc_commit()
conn_payments.tpc_commit()

Gotchas With 2PC

The blocking problem. If the coordinator crashes after Phase 1 but before Phase 2, participants are stuck holding locks and waiting indefinitely. This is the "in-doubt transaction" problem. Postgres will keep those rows locked until you manually resolve them with COMMIT PREPARED or ROLLBACK PREPARED. Under load, this cascades into timeouts and service unavailability fast.
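
One way to catch in-doubt transactions early is a small watchdog that lists prepared transactions older than some threshold. The sketch below assumes a connection that behaves like psycopg2's, where tpc_recover() returns xid objects carrying a timezone-aware prepared timestamp; the function name and the five-minute threshold are illustrative choices, not a recommendation.

```python
from datetime import datetime, timedelta, timezone


def find_stale_prepared(conn, max_age=timedelta(minutes=5), now=None):
    """Return the recovered xids that have been prepared for longer than max_age.

    `conn` is expected to behave like a psycopg2 connection: tpc_recover()
    returns xid objects whose `prepared` attribute is a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    return [xid for xid in conn.tpc_recover() if now - xid.prepared > max_age]
```

With psycopg2, each returned xid can then be passed to conn.tpc_commit(xid) or conn.tpc_rollback(xid) once you know what the coordinator decided. The watchdog only finds the stuck transactions; deciding which way to resolve them still needs the coordinator's log or a human.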

It doesn’t work across heterogeneous systems. Trying to 2PC between Postgres and MongoDB, or Postgres and a Kafka topic, is a nightmare. You’ll spend weeks building a custom XA adapter that breaks in ways you won’t catch until production.

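Latency adds up. Every transaction pays at least two network round trips (prepare and commit) plus a durable log write on each participant. If the coordinator prepares participants one after another, as the example above does, each one adds its own round trip: at 2-5ms per participant across AZs, 5 participants cost 10-25ms before you even commit. Sending PREPARE in parallel cuts that to the slowest participant’s round trip, but every participant still holds its locks for the whole exchange.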
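
The figures above assume the coordinator contacts participants one after another. A quick calculation, using hypothetical round-trip times and ignoring log writes and the commit phase, shows how much that assumption matters:

```python
def prepare_latency_ms(round_trips_ms, parallel=False):
    """Time spent waiting for votes, given one round-trip time per participant."""
    # Sequential prepares add up; parallel prepares wait for the slowest one.
    return max(round_trips_ms) if parallel else sum(round_trips_ms)


rtts = [5, 5, 5, 5, 5]  # five participants at the upper end of the 2-5ms range
sequential = prepare_latency_ms(rtts)               # 25
parallel = prepare_latency_ms(rtts, parallel=True)  # 5
```

Either way, locks are held from the first PREPARE until the last COMMIT arrives, so the lock-hold time is what usually hurts under load, not the raw latency number.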

Most modern services don’t expose 2PC. Stripe doesn’t. Twilio doesn’t. Any third-party API you call during a transaction is outside 2PC’s reach — and that’s often where the real business risk lives.


Sagas: Embrace Eventual Consistency

The Saga pattern, originally described by Hector Garcia-Molina and Kenneth Salem in 1987 (yes, it’s older than most developers realize), takes a completely different philosophy: don’t try to lock everything — instead, define how to undo things when they go wrong.

A saga is a sequence of local transactions. Each step has a corresponding compensating transaction — a business-level undo. If step 3 fails, you run compensation for step 2, then step 1.

Step 1: Create Order (compensation: Cancel Order)
    ↓ success
Step 2: Reserve Inventory (compensation: Release Inventory)
    ↓ success
Step 3: Charge Payment (compensation: Refund Payment)
    ↓ FAIL
← Run compensation for Step 2: Release Inventory
← Run compensation for Step 1: Cancel Order

The data is not atomic at a point in time — other services can see the intermediate state (e.g., an order exists but payment hasn’t been confirmed yet). You handle this with appropriate state machines and UI patterns.
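
The forward-then-compensate flow above can be sketched as a tiny runner. This is a minimal in-process illustration, assuming each step is a plain callable; run_saga and the step tuples are hypothetical names, and a real saga would persist its progress and call remote services.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    Returns (True, []) when every step succeeds. When a step raises, the
    compensations of the already completed steps run in reverse order and
    the function returns (False, [names of compensated steps]).
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            compensated = []
            for done_name, undo in reversed(completed):
                undo()
                compensated.append(done_name)
            return False, compensated
        completed.append((name, compensation))
    return True, []


log = []

def charge_payment():
    raise RuntimeError("card declined")

steps = [
    ("create_order", lambda: log.append("order created"),
     lambda: log.append("order cancelled")),
    ("reserve_inventory", lambda: log.append("inventory reserved"),
     lambda: log.append("inventory released")),
    ("charge_payment", charge_payment,
     lambda: log.append("payment refunded")),
]
ok, compensated = run_saga(steps)
```

The failed step itself is not compensated, only the steps that completed before it, which is why "payment refunded" never shows up in the log.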

Two Flavors: Choreography vs Orchestration

Choreography — services react to events. Each service listens to a topic, does its work, and emits an event for the next service. No central brain.

# docker-compose.yml excerpt for a choreography-based saga setup
version: "3.9"
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  order-service:
    build: ./order-service
    environment:
      KAFKA_BROKERS: kafka:9092
      DB_URL: postgres://orders_db/orders

  inventory-service:
    build: ./inventory-service
    environment:
      KAFKA_BROKERS: kafka:9092
      DB_URL: postgres://inventory_db/inventory

  payment-service:
    build: ./payment-service
    environment:
      KAFKA_BROKERS: kafka:9092
      DB_URL: postgres://payments_db/payments

# order-service: publishes OrderCreated, listens for PaymentFailed to compensate
from kafka import KafkaProducer, KafkaConsumer
import json

producer = KafkaProducer(bootstrap_servers='kafka:9092',
                         value_serializer=lambda v: json.dumps(v).encode())

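def create_order(order_data):
    # `db` stands for this service's data-access layer (not shown).
    # Local transaction: this DB commit is the "lock-in" point
    order_id = db.insert_order(order_data, status="PENDING")
    # Emit event for the inventory service. If the process dies between the
    # commit above and this send, the event is lost (see the Outbox pattern).
    producer.send('order.created', {'order_id': order_id, **order_data})
    return order_id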

# Compensation listener
consumer = KafkaConsumer('payment.failed',
                         bootstrap_servers='kafka:9092',
                         group_id='order-service-compensations')

for msg in consumer:
    event = json.loads(msg.value)
    # Compensating transaction: cancel the order
    db.update_order_status(event['order_id'], status="CANCELLED")
    producer.send('order.cancelled', {'order_id': event['order_id']})

Orchestration — a central Saga Orchestrator drives the flow explicitly. It knows the full sequence and sends commands to services.

# saga_orchestrator.py — a simple state machine approach
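from enum import Enum, auto
import time

# Placeholder exception types; in practice these come from the client libraries
class InventoryError(Exception):
    """Raised by the inventory client when a reservation fails."""

class PaymentError(Exception):
    """Raised by the payment client when a charge fails."""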

class OrderSagaState(Enum):
    STARTED          = auto()
    INVENTORY_RESERVED = auto()
    PAYMENT_CHARGED  = auto()
    COMPLETED        = auto()
    COMPENSATING     = auto()
    FAILED           = auto()

class OrderSagaOrchestrator:
    def __init__(self, order_id, db, inventory_client, payment_client):
        self.order_id = order_id
        self.db = db
        self.inventory = inventory_client
        self.payment = payment_client
        self.state = OrderSagaState.STARTED

    def execute(self):
        try:
            # Record the saga before the first external call
            self._persist_state(OrderSagaState.STARTED)

            # Step 1
            self.inventory.reserve(self.order_id)
            self._persist_state(OrderSagaState.INVENTORY_RESERVED)

            # Step 2
            self.payment.charge(self.order_id)
            self._persist_state(OrderSagaState.PAYMENT_CHARGED)

            self._persist_state(OrderSagaState.COMPLETED)

        except InventoryError:
            # Nothing to compensate — inventory failed, order just stays PENDING
            self._persist_state(OrderSagaState.FAILED)

        except PaymentError:
            # Must compensate the inventory reservation
            self._persist_state(OrderSagaState.COMPENSATING)
            self.inventory.release(self.order_id)
            self._persist_state(OrderSagaState.FAILED)

    def _persist_state(self, new_state):
        self.state = new_state
        # Critical: persist saga state to DB before calling next service
        # This lets you resume from a crash
        self.db.upsert_saga(self.order_id, new_state.name, time.time())

Notice that _persist_state runs after each step completes and before the next external call. That’s not optional: it’s what lets you resume a saga after a crash. If the orchestrator dies between reserving inventory and charging payment, you need to know where to pick up.
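
What "pick up" means can be sketched as a lookup from the last persisted state to the next thing a recovery job should do. The state names mirror the OrderSagaState enum above; the action names and the next_action helper are hypothetical.

```python
# Maps the last persisted state name to the step a recovery job should run next.
# None means the saga is finished and needs no further work.
RESUME_ACTIONS = {
    "STARTED": "reserve_inventory",
    "INVENTORY_RESERVED": "charge_payment",
    "PAYMENT_CHARGED": "mark_completed",
    "COMPENSATING": "release_inventory",
    "COMPLETED": None,
    "FAILED": None,
}


def next_action(persisted_state):
    """Return the step to (re)run for a saga found in `persisted_state`."""
    if persisted_state not in RESUME_ACTIONS:
        raise ValueError(f"unknown saga state: {persisted_state!r}")
    return RESUME_ACTIONS[persisted_state]
```

A saga found in INVENTORY_RESERVED may have crashed before or after the charge call reached the payment service, so the resumed step has to be safe to run twice. That is one reason the idempotency requirement discussed below is not negotiable.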

Gotchas With Sagas

Writing good compensating transactions is hard. "Refund the payment" sounds easy. It isn’t when the payment was a partial use of a coupon that’s now expired, or when the inventory you released was the last unit and someone else snapped it up. Compensations are business logic, not just database rollbacks.

Intermediate states are observable. Between step 1 and step 3, other services and users can see inconsistent data. Your UI needs to handle "PENDING" states gracefully. If you show users a confirmed order before payment clears, you’ll have angry customers and support tickets.

Idempotency is not optional. Services will receive duplicate messages. Network retries happen. You need idempotency keys on every operation. A payment service that charges twice because it got the Kafka message twice is a serious bug.

Debugging distributed sagas is genuinely painful. A saga that’s stuck mid-way through is harder to diagnose than a hung 2PC transaction. You need solid observability — structured logging with saga IDs, distributed tracing (OpenTelemetry), and a saga state dashboard if you have any volume.

Choreography produces implicit coupling. It looks decoupled — services just emit events. But in practice, the event schema becomes a hidden contract. One team renames a field and three other services break silently. Choreography works well for simple flows; orchestration is usually cleaner for anything with more than 3 steps.


The Decision Framework

Here’s the honest breakdown:

Criterion                   | 2PC               | Sagas
Strong consistency          | ✅ Guaranteed      | ❌ Eventual only
Works across different DBs  | ❌ Painful         | ✅ Natural
Works with external APIs    | ❌ Impossible      | ✅ Yes, with compensation
Performance under load      | ❌ Lock contention | ✅ No cross-service locks
Failure recovery complexity | ⚠️ In-doubt txns   | ⚠️ Compensation logic
Debugging                   | ✅ Clearer         | ❌ Distributed chaos
Works in cloud-native / k8s | ❌ Awkward         | ✅ Natural fit

Use 2PC when:

  • All participants support XA / PREPARE TRANSACTION natively
  • You’re running a small number of participants (2-3 max)
  • Your network is fast and reliable (same datacenter, low latency)
  • You need true atomicity and eventual consistency is genuinely not acceptable
  • You’re in a Java EE / Spring ecosystem with JTA already in place

Use Sagas when:

  • Your services own separate databases (this is the default in real microservices)
  • Any step involves an external API (Stripe, SendGrid, AWS S3 — anything outside your control)
  • You’re building on top of a message broker (Kafka, RabbitMQ, AWS SQS)
  • You can tolerate eventual consistency and model your business flow as a state machine
  • You need horizontal scalability — sagas don’t hold cross-service locks

The honest default for greenfield microservices in 2026: Sagas. Cloud-native infrastructure doesn’t play nicely with 2PC. Managed databases (RDS, Cloud SQL, Cosmos DB) have varying or no XA support. Kubernetes pod restarts make coordinator crashes a regular occurrence. Sagas fit the grain of this environment.


Production-Ready Patterns Worth Knowing

Outbox pattern — the most important Saga reliability tool. Instead of writing to your DB and then publishing to Kafka (two operations that can get out of sync), write your event to an outbox table in the same local transaction. A separate relay process reads the outbox and publishes to Kafka. Your local commit becomes the source of truth.

-- In the same transaction as your business logic:
INSERT INTO orders (id, status) VALUES (1234, 'PENDING');
INSERT INTO outbox (aggregate_id, event_type, payload, created_at)
  VALUES (1234, 'OrderCreated', '{"order_id": 1234, "amount": 50}', NOW());
-- Commit. The relay handles Kafka publishing asynchronously.
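
The relay side can be sketched as a polling loop body. This version assumes the outbox table also has an auto-incrementing id and a nullable published_at column (neither appears in the INSERT above), uses SQLite only to stay self-contained, and takes the publish function as a parameter so any broker client can be plugged in.

```python
import sqlite3


def relay_outbox(conn, publish, batch_size=100):
    """Publish unpublished outbox rows in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for row_id, event_type, payload in rows:
        publish(event_type, payload)  # may raise; the row then stays unpublished
        with conn:
            # Mark the row only after the broker accepted the event.
            conn.execute(
                "UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE id = ?",
                (row_id,),
            )
        sent += 1
    return sent
```

If the relay crashes after publishing but before marking the row, the event goes out again on the next run. That makes delivery at-least-once, so consumers still need to be idempotent.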

Saga log / state table — always persist saga state to a durable store before making the next call. If your orchestrator crashes, you can reconstruct the saga state and resume or compensate correctly.

Dead-letter queues — when a compensation fails (yes, compensations can fail too), you need a human-readable DLQ with full context. Automated retries help, but some sagas will need manual resolution. Build that dashboard before you go to production.
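
A minimal sketch of that retry-then-dead-letter flow, assuming the compensation is a plain callable and using a Python list as a stand-in for a real DLQ; the function name, retry count, and context fields are illustrative.

```python
import time


def compensate_with_retry(compensation, context, dead_letters, attempts=3, delay_s=0.0):
    """Try a compensation a few times; on final failure, record it for manual resolution.

    Returns True if the compensation succeeded, False if it was dead-lettered.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            compensation()
            return True
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_s * attempt)  # simple linear backoff
    # Keep enough context for a human to understand what failed and why.
    dead_letters.append({**context, "error": repr(last_error), "attempts": attempts})
    return False
```

The context dictionary is where the saga ID, the failed step, and the original payload belong, so whoever opens the DLQ entry does not have to reconstruct the story from logs.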

Timeouts and TTLs — a saga that’s been stuck in COMPENSATING state for 48 hours is a bug. Put explicit TTLs on saga instances and alert on them. Long-running sagas (days, weeks) are a different beast entirely — look at the Process Manager pattern for those.
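
The alerting half of that advice can be sketched as a periodic check over the saga state table. The row shape (saga_id, state_name, updated_at), the one-hour TTL, and the use of timezone-aware datetimes are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

TERMINAL_STATES = {"COMPLETED", "FAILED"}


def find_stuck_sagas(sagas, ttl=timedelta(hours=1), now=None):
    """Return ids of unfinished sagas whose state has not changed within `ttl`.

    `sagas` is an iterable of (saga_id, state_name, updated_at) rows, as a
    saga state table might return them.
    """
    now = now or datetime.now(timezone.utc)
    return [
        saga_id
        for saga_id, state, updated_at in sagas
        if state not in TERMINAL_STATES and now - updated_at > ttl
    ]
```

Run on a schedule, the returned IDs feed an alert or a recovery job. The right TTL depends on how long a healthy saga takes, so it is worth setting per saga type instead of globally.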


Where the Industry Is in 2026

A few things have matured significantly:

Temporal.io has made orchestrated sagas dramatically easier. Instead of building your own state machine, you write regular code and Temporal handles durability, retries, and timeouts. If you’re starting fresh and can tolerate the operational overhead of running Temporal, it’s worth evaluating seriously.

PostgreSQL logical replication + CDC has made the Outbox pattern easier to implement without a polling relay. Tools like Debezium capture row-level changes and publish them to Kafka with sub-second latency.

Service meshes (Istio, Linkerd) don’t solve distributed transactions, but they help significantly with the retry and timeout logic that sagas depend on. If you’re on k8s, get your mesh configured correctly before worrying about saga frameworks.

2PC in cloud databases is not getting better. Google Spanner has its own distributed transaction mechanism (which is excellent but proprietary). Most managed relational databases still treat XA as second-class. Don’t bet your architecture on 2PC unless you control your entire database stack.


The Bottom Line

2PC and Sagas aren’t competing solutions to the same problem — they encode fundamentally different consistency models. 2PC is synchronous and strong; Sagas are async and eventual.

If you’re building anything that touches external systems, runs across separate databases, or needs to scale beyond a single datacenter, Sagas aren’t just an option — they’re the architecture that matches how distributed systems actually behave under failure. Build your compensations carefully, use the Outbox pattern, and invest in observability. The complexity is real, but it’s manageable complexity that lives in your code rather than in unpredictable network timing.

If you have a very tight consistency requirement, homogeneous infrastructure, and can live with the operational burden of managing in-doubt transactions, 2PC is a legitimate choice — just go in with eyes open about its failure modes.

Choose based on your consistency requirements and your infrastructure, not based on which one sounds more "modern."
