Go’s Garbage Collector Internals: Tri-Color Marking, Write Barriers, and the Pacer

If your Go service occasionally spikes in latency for no obvious reason, the GC is probably the culprit — and you’re operating it as a black box. You tweak GOGC, cross your fingers, and hope the p99 drops. That’s not engineering; that’s superstition.

Understanding what’s actually happening inside the Go runtime GC gives you real leverage: you’ll know why a workload causes long pause times, which tuning knob actually helps, and what your code is doing to stress the collector in the first place.

The Go GC has three interconnected mechanisms you need to understand: tri-color marking, write barriers, and the pacer. Pull any one of them out of context and you’ll misunderstand all three.


The 30-Second Overview Before the Deep Dive

Go uses a concurrent, tri-color, mark-and-sweep garbage collector. Its two stop-the-world (STW) phases are intentionally kept tiny: typically under 1ms, often under 100µs on modern hardware.

The lifecycle of a single GC cycle looks like this:

  1. STW pause #1 — sweep termination, enable write barriers
  2. Concurrent mark — goroutines mark the heap while your application runs
  3. Mark termination (STW pause #2) — drain the work queue, disable write barriers
  4. Concurrent sweep — reclaim white (dead) objects while the next cycle’s allocation starts

The whole thing is designed so that your application goroutines ("the mutator" in GC literature) keep running during the expensive marking phase. That design choice is what creates the complexity around write barriers.


Tri-Color Marking

The algorithm comes from Dijkstra’s 1978 paper, and Go’s implementation is a faithful descendant of it.

Every object on the heap is assigned one of three logical colors at any point in a GC cycle:

  • White — not yet reached. At the end of the cycle, white objects are dead and get swept.
  • Grey — reachable, but the GC hasn’t finished scanning its outgoing pointers.
  • Black — reachable, all outgoing pointers have been scanned. A black object will not be revisited.

The algorithm:

  1. All objects start white.
  2. Root objects (globals, goroutine stacks, finalizer queue) are painted grey and pushed onto the work queue.
  3. The GC repeatedly picks a grey object, scans all pointers it contains, paints those referents grey if they’re still white, then paints the object itself black.
  4. When the work queue is empty, everything still white is unreachable — collect it.
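
The four steps above can be sketched as a toy, single-threaded marker. This is a minimal illustration, not the runtime's implementation: the object type, the colors as an int field, and the slice used as a work queue are all assumptions made for the example (the real collector tracks mark state in bitmaps and per-P work buffers).

```go
package main

import "fmt"

// color is the logical mark state of an object in this toy model.
type color int

const (
	white color = iota // not yet reached
	grey               // reached, outgoing pointers not yet scanned
	black              // reached and fully scanned
)

// object is a hypothetical heap object: a name plus outgoing pointers.
type object struct {
	name string
	refs []*object
	col  color
}

// mark runs sequential tri-color marking starting from the given roots.
func mark(roots []*object) {
	var work []*object // the grey work queue
	shade := func(o *object) {
		if o != nil && o.col == white {
			o.col = grey
			work = append(work, o)
		}
	}
	for _, r := range roots {
		shade(r)
	}
	for len(work) > 0 {
		o := work[len(work)-1]
		work = work[:len(work)-1]
		for _, ref := range o.refs {
			shade(ref) // referents turn grey if they were white
		}
		o.col = black // all outgoing pointers scanned
	}
}

// sweep returns the names of objects that are still white after marking.
func sweep(heap []*object) []string {
	var dead []string
	for _, o := range heap {
		if o.col == white {
			dead = append(dead, o.name)
		}
	}
	return dead
}

// demo builds a four-object heap in which d is unreachable from the root a.
func demo() []string {
	a := &object{name: "a"}
	b := &object{name: "b"}
	c := &object{name: "c"}
	d := &object{name: "d"}
	a.refs = []*object{b}
	b.refs = []*object{c, a} // cycle back to a
	d.refs = []*object{c}    // d points at live data but nothing points at d
	mark([]*object{a})
	return sweep([]*object{a, b, c, d})
}

func main() {
	fmt.Println(demo()) // prints [d]
}
```

Note that the cycle between a and b terminates cleanly: shading only acts on white objects, so nothing is queued twice.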

The critical (strong) tri-color invariant is: no black object may point directly to a white object. If this invariant holds throughout the concurrent phase, the algorithm is correct. If it’s violated and nothing compensates, the GC can free a live object, and you get memory corruption.

In a sequential GC, the invariant holds trivially — nothing is changing while you mark. In a concurrent GC running alongside application goroutines that are mutating pointers, maintaining this invariant is the whole problem.


The Mutator Problem

Imagine the GC has just turned object A black. A grey object C holds the only pointer to B, which is still white. Your application goroutine now runs and does two things in sequence:

  1. Copies the pointer to B from C into A (which is black).
  2. Overwrites C’s pointer to B with something else.

Now B is reachable only through A, which is black and will not be scanned again. When the GC gets around to scanning C, the pointer to B is gone, so B stays white. B is still reachable, yet the GC is going to collect it. That’s a use-after-free bug.

This specific failure mode — a black object acquiring a pointer to a white object, and the original grey path being severed — is the lost object problem. The write barrier exists to prevent exactly this.


Write Barriers

A write barrier in this context isn’t a memory ordering fence (like in CPU architecture). It’s a snippet of code the compiler inserts around pointer stores into heap memory and globals. Stores to local variables on the goroutine’s own stack are not instrumented.

Go’s write barrier is only active during the concurrent mark phase. Outside of that window, the compiled code pays only for checking a runtime flag, so pointer writes are nearly free.

Dijkstra’s Insertion Barrier

The simplest approach: when you write a new pointer value into a slot, shade (grey) the new referent.

write_barrier(slot, new_ptr):
    shade(new_ptr)   // paint new_ptr grey if it's white
    *slot = new_ptr  // do the actual write

This prevents the insertion half of the failure scenario: a black object can’t acquire a pointer to a white object unnoticed. The problem: to be complete on its own, it would have to cover stack writes too, which is expensive. Go’s earlier concurrent collector left stack writes uncovered and re-scanned goroutine stacks during STW mark termination to compensate, which meant STW time scaled with the number and depth of goroutine stacks.

Yuasa’s Deletion Barrier

Instead of shading on insertion, shade on deletion: when you overwrite a pointer, preserve the old referent.

write_barrier(slot, new_ptr):
    shade(*slot)     // paint old value grey (snapshot-at-beginning)
    *slot = new_ptr

This maintains a "snapshot at the beginning" invariant — anything that was live at the start of the cycle stays live through the cycle. Cost: floating garbage. Objects that die during the mark phase survive until the next cycle.

Go’s Hybrid Barrier (since Go 1.8)

Go’s current implementation combines both:

write_barrier(slot, new_ptr):
    shade(*slot)         // shade the old referent (Yuasa)
    shade(new_ptr)       // shade the new referent (Dijkstra)
    *slot = new_ptr

The hybrid barrier is correct without requiring stack writes to be covered by the barrier. Each goroutine stack is scanned once during concurrent marking (the goroutine is paused briefly while its own stack is scanned), and from then on that stack is treated as black. Because the deletion half of the barrier preserves anything a stack could otherwise hide, stacks never need to be re-scanned, and the STW stack re-scan that used to dominate mark termination is gone.
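
To see why the barrier matters, here is a toy model of the lost object problem: a black object gains a pointer to a white object while the grey path to it is severed. It is a sketch under stated assumptions, not runtime code. The collector type, writePointer, and the single-slot objects are hypothetical, and the barrier shades both values unconditionally, as in the pseudocode above.

```go
package main

import "fmt"

type color int

const (
	white color = iota
	grey
	black
)

// object is a toy heap object with a single pointer slot.
type object struct {
	ref *object
	col color
}

// collector is a hypothetical stand-in for the GC's mark state.
type collector struct {
	work    []*object // grey objects waiting to be scanned
	barrier bool      // whether the write barrier is enabled
}

// shade turns a white object grey and queues it for scanning.
func (gc *collector) shade(o *object) {
	if o != nil && o.col == white {
		o.col = grey
		gc.work = append(gc.work, o)
	}
}

// writePointer models a mutator's pointer store into a heap slot.
func (gc *collector) writePointer(slot **object, ptr *object) {
	if gc.barrier {
		gc.shade(*slot) // deletion half: preserve the old referent
		gc.shade(ptr)   // insertion half: preserve the new referent
	}
	*slot = ptr
}

// drain scans grey objects until the work queue is empty.
func (gc *collector) drain() {
	for len(gc.work) > 0 {
		o := gc.work[len(gc.work)-1]
		gc.work = gc.work[:len(gc.work)-1]
		gc.shade(o.ref)
		o.col = black
	}
}

// scenario replays the two mutator writes mid-mark and returns B's final color.
func scenario(barrier bool) color {
	gc := &collector{barrier: barrier}
	b := &object{}                 // white
	a := &object{col: black}       // already fully scanned
	c := &object{ref: b, col: grey} // holds the only pointer to B
	gc.work = append(gc.work, c)

	gc.writePointer(&a.ref, b)   // black A now points at white B
	gc.writePointer(&c.ref, nil) // the grey path to B is severed

	gc.drain()
	return b.col
}

func main() {
	fmt.Println("no barrier, B left white:", scenario(false) == white)
	fmt.Println("hybrid barrier, B marked:", scenario(true) == black)
}
```

Both lines print true. Without the barrier B ends the cycle white while A still points at it; with the barrier the first store shades B, so it is scanned before the queue drains.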

The implementation in the Go runtime is in runtime/mbarrier.go, and compiled code calls into an assembly stub (runtime.gcWriteBarrier; newer releases use numbered variants of that name). You can see the generated barrier with go build -gcflags="-S" and searching the output for gcWriteBarrier.

Seeing the Write Barrier in Practice

package main

import (
    "fmt"
    "runtime"
)

type Node struct {
    val  int
    next *Node
}

func buildList(n int) *Node {
    var head *Node
    for i := n; i > 0; i-- {
        // head is a local (stack) variable, so this assignment itself needs
        // no write barrier. Barriers guard pointer stores into heap memory.
        head = &Node{val: i, next: head}
    }
    return head
}

func main() {
    var stats runtime.MemStats
    runtime.ReadMemStats(&stats)
    fmt.Printf("NumGC before: %d\n", stats.NumGC)

    _ = buildList(1_000_000)
    runtime.GC()

    runtime.ReadMemStats(&stats)
    fmt.Printf("NumGC after: %d\n", stats.NumGC)
    fmt.Printf("PauseNs (last): %d ns\n", stats.PauseNs[(stats.NumGC+255)%256])
}

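Run with GODEBUG=gccheckmark=1 to enable a second, stop-the-world marking pass that verifies the concurrent mark did not miss any reachable object. GODEBUG is a runtime environment variable, not a build flag. It is useful when you’re writing unsafe pointer manipulation and want to verify correctness.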


The Pacer

The pacer is the GC scheduler. It decides when to trigger a new GC cycle and how many CPU resources to allocate to marking work, so that the cycle finishes before the heap grows beyond its target size.

This is a control theory problem: the pacer is a feedback controller trying to keep heap size at a target, adjusting GC CPU usage as the application’s allocation rate changes.

GOGC and the Heap Target

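GOGC (default: 100) controls the heap growth ratio. After a cycle, the runtime sets the heap goal for the next cycle at roughly the value below (since Go 1.18, goroutine stacks and globals are also counted in the growth term). The cycle itself is triggered somewhat before the heap reaches the goal, so that concurrent marking can finish in time:

heap_goal = live_heap_after_last_GC * (1 + GOGC/100)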

With GOGC=100, the heap is allowed to double before the next GC. With GOGC=50, it can grow by 50%. With GOGC=200, it can triple.
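
The arithmetic is easy to check by hand. The helper below is a hypothetical function that applies only the simplified formula above; the real runtime also enforces a minimum heap size and accounts for GC roots, which this sketch deliberately ignores.

```go
package main

import "fmt"

// heapGoal applies the simplified formula live * (1 + GOGC/100).
// It is an illustrative helper, not a runtime API.
func heapGoal(liveBytes uint64, gogc int) uint64 {
	return liveBytes + liveBytes*uint64(gogc)/100
}

func main() {
	const live = 64 << 20 // assume 64 MiB survived the last cycle
	for _, gogc := range []int{50, 100, 200} {
		fmt.Printf("GOGC=%d -> goal %d MiB\n", gogc, heapGoal(live, gogc)>>20)
	}
	// Prints:
	// GOGC=50 -> goal 96 MiB
	// GOGC=100 -> goal 128 MiB
	// GOGC=200 -> goal 192 MiB
}
```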

Lower GOGC → more frequent GC cycles → less memory pressure, more CPU overhead.
Higher GOGC → fewer cycles → more memory pressure, less CPU overhead.

Setting GOGC=off disables the GC entirely, unless a memory limit is also set (see GOMEMLIMIT below). Don’t do this in production unless you have a very specific reason (like a batch job that allocates heavily and then exits).

How the Pacer Schedules Work

The pacer estimates how much marking work needs to be done (proportional to live heap size) and how fast the application is allocating. It then starts the GC cycle early enough, and throttles the mutator goroutines enough, that marking finishes before the heap reaches its goal.

The throttling mechanism is mutator assist: goroutines that are allocating heavily are forced to do marking work proportional to the bytes they’re allocating. This is the main mechanism for keeping the GC cycle on schedule. When you see unexpected latency spikes in a high-allocation workload, mutator assist is usually involved — your goroutine is doing GC work in the middle of your business logic.

You can observe this with runtime/trace:

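import "os"
import "runtime/trace"

f, err := os.Create("trace.out") // inside main
if err != nil {
    panic(err)
}
defer f.Close()
trace.Start(f) // returns an error if tracing is already enabled
defer trace.Stop()
// ... your workload ...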

Then run go tool trace trace.out and look for mark assist events (GC work performed by your own goroutines) in the goroutine timelines.
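
For a cheaper, always-on view of what the pacer has decided, you can sample the runtime/metrics package instead of collecting a full trace. The sketch below reads the current heap goal and the number of completed cycles. The readUint64 helper is hypothetical; the two metric names are ones I believe have been exported since Go 1.16, and the code falls back gracefully if a name is not supported by your Go version.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// readUint64 samples one runtime metric and reports whether it exists
// and holds a uint64 value in this Go version.
func readUint64(name string) (uint64, bool) {
	s := []metrics.Sample{{Name: name}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindUint64 {
		return 0, false // unknown metric or unexpected kind
	}
	return s[0].Value.Uint64(), true
}

func main() {
	for _, name := range []string{
		"/gc/heap/goal:bytes",        // heap size the pacer is aiming for
		"/gc/cycles/total:gc-cycles", // completed GC cycles so far
	} {
		if v, ok := readUint64(name); ok {
			fmt.Printf("%s = %d\n", name, v)
		} else {
			fmt.Printf("%s is not available in this Go version\n", name)
		}
	}
}
```

Sampling the goal periodically and comparing it against your container limit is a quick way to spot a heap that is drifting toward trouble before gctrace or a trace is needed.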

The Pacer Redesign (Go 1.18+)

Before Go 1.18, the pacer had well-documented weaknesses: if the allocation rate changed significantly mid-cycle, it could misjudge when to start a cycle, then either compensate with heavy mutator assist or let the heap overshoot its goal. The new pacer (described in the GC pacer redesign proposal) was designed around a proportional-integral controller and is much more stable under variable allocation rates.

If you’re on Go 1.17 or earlier and you see erratic GC behavior under bursty load — upgrade first before trying other fixes.

GOMEMLIMIT (Go 1.19+)

GOMEMLIMIT is a soft memory limit for the entire Go runtime. When set, the pacer will run GC more aggressively to keep total memory usage under the limit, regardless of the GOGC ratio.

GOMEMLIMIT=512MiB ./myservice

This is a better primitive than fiddling with GOGC for containerized workloads where OOM kills are a real concern. With GOMEMLIMIT, you give the runtime a ceiling and it figures out the right GC frequency itself.

The combination GOGC=off GOMEMLIMIT=<something> is valid and increasingly common for latency-sensitive services — you’re disabling the ratio-based trigger and letting only the memory limit trigger GC. This reduces unnecessary GC cycles when the heap is small but gives you protection against runaway allocation.


Gotchas

Finalizers extend object lifetime by at least one GC cycle. An object with a finalizer registered via runtime.SetFinalizer cannot be collected the cycle it first becomes unreachable — it gets queued for the finalizer goroutine and collected the cycle after. If you’re registering finalizers heavily, you’re adding a GC cycle of floating garbage. Use this sparingly.

runtime.GC() is synchronous. Calling runtime.GC() from your application forces a full GC cycle and blocks the caller until the whole collection, including sweeping, is complete; per its documentation it may also block the entire program. In a low-latency service, this is a foot-gun. The main legitimate use is in tests or benchmarks where you need predictable heap state.

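Stack growth can interact with GC. Go uses contiguous stacks that are copied to a larger allocation when they grow (segmented stacks were dropped long ago), and the runtime adjusts pointers into the stack during the copy. If this happens during the mark phase, the GC has to account for the move. The runtime handles this correctly, but a high goroutine count with deep stacks still means more stack-scanning work per cycle. Since the hybrid barrier removed STW stack re-scanning, that cost lands mostly in the concurrent mark phase rather than in mark termination.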

Large object allocations bypass mcache. Objects larger than ~32KB are allocated directly from the heap, bypassing the per-P mcache. That path is slower than a small allocation, and the runtime checks whether a new GC cycle should start on every such allocation. Avoid allocating large objects in tight hot paths.

unsafe.Pointer and the GC. If you’re playing with unsafe.Pointer, you’re responsible for maintaining GC visibility. The rules are in unsafe.Pointer’s documentation. The most common mistake is storing a uintptr (not a pointer) derived from an unsafe.Pointer: the GC doesn’t see a uintptr as a reference, so the underlying object can be collected (or, if it lives on a stack, moved) while you still hold the integer. Keep the object referenced through a real pointer, and do any uintptr arithmetic and the conversion back to unsafe.Pointer in a single expression.
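
Here is what the safe pattern looks like in practice, as a minimal sketch. The function names are made up for the example; unsafe.Add assumes Go 1.17 or later. The point is that no bare uintptr is ever stored in a variable between the conversion and its use.

```go
package main

import (
	"fmt"
	"unsafe"
)

// second reads the second element of arr with pointer arithmetic.
// The conversion to uintptr, the addition, and the conversion back all
// happen in one expression, which is the pattern the unsafe docs allow.
func second(arr *[4]int32) int32 {
	p := unsafe.Pointer(uintptr(unsafe.Pointer(arr)) + unsafe.Sizeof(arr[0]))
	return *(*int32)(p)
}

// secondAdd does the same with unsafe.Add, which avoids the uintptr
// round trip entirely.
func secondAdd(arr *[4]int32) int32 {
	return *(*int32)(unsafe.Add(unsafe.Pointer(arr), unsafe.Sizeof(arr[0])))
}

func main() {
	arr := [4]int32{10, 20, 30, 40}
	fmt.Println(second(&arr), secondAdd(&arr)) // prints 20 20

	// The anti-pattern, shown only as a comment:
	//   addr := uintptr(unsafe.Pointer(&arr)) // arr is no longer protected by addr
	//   ... any GC or stack growth may happen here ...
	//   p := unsafe.Pointer(addr)             // may point at freed or moved memory
}
```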


Production Tuning

Start with GOMEMLIMIT if you’re in a container. Set it to ~90% of your container’s memory limit. This alone often eliminates OOM kills without any other tuning.

Raise GOGC for throughput-first services. Background jobs, ETL pipelines, data transformations — these don’t care about p99 latency. GOGC=200 or GOGC=400 halves or quarters GC frequency at the cost of peak memory.

Lower GOGC or keep default for latency-sensitive services. But understand that lower GOGC means more frequent STW events, not smaller ones. If your pause times are already fine, don’t touch it.

Use escape analysis to reduce heap allocation. go build -gcflags="-m" shows what escapes to the heap. Values converted to interfaces, variables captured by closures that outlive the call, and pointers to locals returned from functions that don’t get inlined are common escape sources. Keeping objects on the stack means the GC never has to mark or sweep them as heap objects.

go build -gcflags="-m=2" ./... 2>&1 | grep "escapes to heap"
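
You can also confirm the effect at run time. The sketch below uses testing.AllocsPerRun to compare a function that returns a pointer to a local with one that keeps its value in the frame. The type and function names are hypothetical, and the //go:noinline directives are there so that inlining does not change the escape decision. On the compilers I would expect this to run on, the first reports one allocation per call and the second zero, but treat the exact numbers as compiler-dependent.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// sink is package-level so the returned pointer is genuinely used.
var sink *point

// newPointHeap returns a pointer to a local, so p outlives the call
// and is expected to be moved to the heap.
//
//go:noinline
func newPointHeap(x, y int) *point {
	p := point{x, y}
	return &p
}

// sumValue keeps p inside its own frame, so p can stay on the stack.
//
//go:noinline
func sumValue(x, y int) int {
	p := point{x, y}
	return p.x + p.y
}

// allocsPerCall measures average heap allocations per call for both.
func allocsPerCall() (heap, stack float64) {
	heap = testing.AllocsPerRun(1000, func() { sink = newPointHeap(1, 2) })
	stack = testing.AllocsPerRun(1000, func() { _ = sumValue(1, 2) })
	return heap, stack
}

func main() {
	h, s := allocsPerCall()
	fmt.Printf("pointer-returning: %.0f allocs/call, by-value: %.0f allocs/call\n", h, s)
}
```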

Profile with pprof before tuning. Specifically, the allocs profile (not heap) shows where allocations are happening. Fix the allocation hot spots first — fewer live objects means less marking work, which means faster cycles and less mutator assist.

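import _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
// With an HTTP server for the default mux listening on port 6060:
// GET http://localhost:6060/debug/pprof/allocs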

Use sync.Pool for short-lived, frequently allocated objects. Reusing a pooled object avoids a fresh allocation, which is effective for byte buffers, request/response objects in servers, and decoders/encoders. The pool is not a cache with guarantees: the runtime may drop pooled objects at any time, and in practice idle entries are released over the course of a couple of GC cycles, so the pool doesn’t cause unbounded memory growth.

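import (
    "net/http"
    "sync"
)

// bufPool hands out reusable byte slices. Storing *[]byte rather than []byte
// avoids an extra allocation when the value is boxed into an interface.
var bufPool = sync.Pool{
    New: func() any {
        b := make([]byte, 0, 4096)
        return &b
    },
}

func handler(w http.ResponseWriter, r *http.Request) {
    buf := bufPool.Get().(*[]byte)
    *buf = (*buf)[:0] // reset length, keep capacity
    defer bufPool.Put(buf)
    // use buf...
}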

Read GODEBUG=gctrace=1 output. Each GC cycle logs a line with heap sizes, pause times, and CPU utilization. This is the fastest way to understand what the GC is doing in production without instrumenting the application.

gc 14 @3.741s 0%: 0.019+1.4+0.058 ms clock, 0.076+0.76/1.3/0+0.23 ms cpu, 4->4->2 MB, 5 MB goal, 0 MB stacks, 0 MB globals, 8 P

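Reading this line: cycle 14, at 3.7s into the process, 0% of CPU time spent in GC since the program started, STW pause #1 was 0.019ms, concurrent mark was 1.4ms, STW pause #2 was 0.058ms. The heap was 4MB when the cycle started, 4MB when marking ended, and 2MB of that was live. The goal for this cycle was 5MB.

Note that the 5MB goal belongs to cycle 14 itself: it was computed from the live heap left by the previous cycle, not from this cycle’s 2MB. With GOGC=100 and a 2MB live heap, the simple formula gives a next goal of about 4MB, which is also around the runtime’s minimum total heap size at the default GOGC.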


Putting It Together

The write barrier protects the tri-color invariant during concurrent marking. The pacer keeps the concurrent marking phase on schedule so that STW pauses stay short. The tri-color algorithm lets the GC do almost all its work while your application goroutines are running.

None of this is magic — it’s a set of engineering trade-offs that the Go team has tuned over a decade. The key trade-off is: accept some CPU overhead (write barriers + mutator assist) to eliminate long STW pauses. For most workloads, this is the right call.

When it’s not the right call — when your latency requirements are extreme, or your allocation patterns are genuinely pathological — you have the tools to understand what’s happening and fix it at the source rather than blindly twisting knobs.

Measure first with gctrace and pprof. Reduce allocations in hot paths with escape analysis. Apply sync.Pool where it makes sense. Set GOMEMLIMIT in containers. Only touch GOGC after you understand what the pacing is actually doing.

The runtime source is worth reading if you want to go deeper: runtime/mgc.go (cycle orchestration), runtime/mbarrier.go (write barriers), runtime/mgcpacer.go (the pacer), and runtime/mheap.go (heap management). The comments in those files are unusually good — the Go team clearly expects people to read them.
