Cgo When You Must: Managing the Boundary Cost

There’s a running joke in Go communities: "The best Cgo is no Cgo." It’s funny because it’s true. The moment you reach for import "C", you’re signing a performance contract with some ugly clauses buried in the fine print. Cross-compilation breaks. The race detector goes blind. Build times balloon. And every call across the boundary carries overhead that can turn a tight hot path into a bottleneck you can’t profile away.

But sometimes you have no choice. The hardware driver is C. The crypto library you’re legally required to use is C. The audio codec, the GPU compute kernel, the legacy billing engine that nobody dares rewrite — all C. Pretending Cgo doesn’t exist isn’t engineering, it’s avoidance.

So let’s talk about what the boundary actually costs, when you genuinely have to pay it, and how to pay it as few times as possible.

Official documentation and source live at https://github.com/golang/go. The cmd/cgo package docs at pkg.go.dev/cmd/cgo are the canonical reference — read them once, even if you think you know Cgo.


What Actually Happens at the Boundary

Before optimizing anything, understand the machine. When your Go code calls a C function through Cgo, this is roughly what happens:

  1. The goroutine detects it’s about to leave the Go runtime.
  2. It gets pinned to the current OS thread — no scheduler can move it.
  3. The runtime switches from the goroutine’s small, growable stack to a separate, fixed-size C stack (typically 8MB, allocated per-thread).
  4. Arguments are copied or pointer-checked (more on this shortly).
  5. C code runs with no preemption, no GC cooperation, no scheduler visibility.
  6. On return, the goroutine is unpinned and the Go stack is restored.

This is not a function call. It’s a context switch between two runtime models. The cost is roughly 50–200 nanoseconds per crossing, depending on your Go version, your hardware, and what the CPU’s branch predictor has been doing. Compare that to a normal Go function call at around 1ns (or nothing at all once the compiler inlines it), or an interface method call at a couple of nanoseconds. Cgo crossings are one to two orders of magnitude more expensive.

Here’s a simple benchmark to see this for yourself:

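// add.go (the go tool rejects import "C" in _test.go files, so the C code lives here)
package bench

/*
#include <stdint.h>

int64_t add(int64_t a, int64_t b) { return a + b; }
*/
import "C"

// cgoAdd crosses into C once per call.
func cgoAdd(a, b int64) int64 { return int64(C.add(C.int64_t(a), C.int64_t(b))) }

// goAdd is the pure-Go baseline; noinline keeps the compiler from erasing the call.
//
//go:noinline
func goAdd(a, b int64) int64 { return a + b }

// bench_test.go
package bench

import "testing"

var sink int64 // storing results here stops the loops from being optimized away

func BenchmarkGoAdd(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = goAdd(1, 2)
    }
}

func BenchmarkCgoAdd(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = cgoAdd(1, 2)
    }
}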

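Run it: go test -bench=. -benchtime=5s. On a typical x86-64 machine, expect the Cgo call to come out tens of times slower than the Go call, often more; the exact ratio depends on your Go version and CPU. That’s for a trivial add. Picture a hot loop that crosses on every iteration.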


When You Actually Have No Alternative

Stop right here and be honest. Go’s standard library covers a lot: crypto/tls, database/sql, net, os, syscall wrappers for both Linux and Windows. The ecosystem has pure-Go SQLite (modernc.org/sqlite), pure-Go image codecs, and pure-Go TLS. Before writing a single line of Cgo, grep GitHub for a pure-Go implementation.

Legitimate cases for Cgo:

Hardware and kernel interfaces. You’re talking to a custom PCIe card, an FPGA, or a proprietary SDK that only ships a .so and a header file. The vendor isn’t writing Go bindings. You are.

Legally mandated C libraries. Some regulated industries (finance, government, medical) require specific validated implementations of algorithms — FIPS 140-2 validated OpenSSL, for instance. You can’t swap in a pure-Go replacement and pass the audit.

Performance-critical C libraries with massive algorithmic investment. Things like libyuv, SIMDed codec libraries, libvpx. The performance delta isn’t the boundary crossing — it’s hand-tuned SIMD that would take months to reproduce in Go assembly.

Existing C codebases you need to integrate, not rewrite. The rewrite is a separate project. Today you need to ship.


The Pointer Rules: The Rule That Bites Everyone

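Go’s garbage collector does not move heap objects today, but it has to know about every pointer into Go memory, goroutine stacks do move, and the rules are written to leave room for a moving collector. C knows none of this. So the runtime enforces a hard rule:

Go code may pass a Go pointer to C only if the Go memory to which it points does not contain any Go pointers (since Go 1.21: any Go pointers to unpinned memory). C also may not keep a copy of that pointer after the call returns unless the memory is pinned.

This is not a guideline. Violate it and the default runtime checks panic when they catch you (or you get silent corruption if you turn them off with GODEBUG=cgocheck=0). The most common mistake is passing a Go struct containing a slice or string to C: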

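// This panics at runtime once Name points into Go-managed memory
// (any string built at run time; a constant literal may slip past the check)
type Config struct {
    Name    string // contains a Go pointer (backing array)
    Timeout int32
}

cfg := Config{Name: dbName, Timeout: 30} // dbName: a string assembled at run time (illustrative)
C.init_something((*C.Config)(unsafe.Pointer(&cfg))) // panic: cgo argument has Go pointer to unpinned Go pointer (older releases omit "unpinned")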

The fix: allocate on the C side, or use C.CString / C.CBytes and manage the lifetime yourself.

name := C.CString("db") // copies the string into malloc'd memory
defer C.free(unsafe.Pointer(name)) // C.free needs #include <stdlib.h> in the cgo preamble

C.init_something(name, 30)

C.CString allocates with malloc. You own that memory. defer C.free(...) is not optional — skip it and you’ve got a leak in a language with a garbage collector. The irony never gets old.


Gotcha: Cross-Compilation Dies Here

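Pure Go cross-compiles trivially: GOOS=linux GOARCH=arm64 go build ./.... The moment you have Cgo, you need a cross-compiler toolchain: aarch64-linux-gnu-gcc, a proper sysroot, matching C library headers. You also have to opt in, because the go tool disables Cgo by default when cross-compiling: CGO_ENABLED=1 CC=aarch64-linux-gnu-gcc GOOS=linux GOARCH=arm64 go build ./.... Most CI pipelines aren’t set up for this.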

Production practice: If you need Cgo, isolate it in a separate package with a build tag, and provide a pure-Go fallback for platforms where the C library doesn’t exist. Use CGO_ENABLED=0 in your CI matrix for any architecture that doesn’t need the C integration, and test that path explicitly.

// sqlite_cgo.go (illustrative file name)
//go:build cgo

package sqlite

// cgo-backed implementation

// sqlite_nocgo.go (illustrative file name)
//go:build !cgo

package sqlite

// returns ErrNotSupported, or compiles a different backend

This is extra work, but it keeps your cross-compilation story sane.


Gotcha: The Race Detector Can’t See Into C

go test -race is one of Go’s killer features. It instruments memory accesses and catches data races. But the instrumentation stops at the Cgo boundary. If your C code shares memory with Go goroutines without proper synchronization, the race detector won’t catch it. You’re back to 1990s-era concurrent C debugging.

Production practice: Treat the C side as a black box that owns its own data. Pass data in, get data out. Don’t let Go goroutines and C threads touch the same memory without explicit locking, and do that locking inside the C code itself (or via a single-goroutine serialization pattern, described below).


The Core Optimization: Batch Your Crossings

If the crossing costs 100ns and your C function does 10ns of useful work, you’re paying a 10x tax on every call. The fix isn’t to make the crossing cheaper (you can’t) — it’s to make each crossing do more work.

Bad: crossing per element

for _, item := range items {
    C.process_item(item.ptr, item.size)
}

Good: pass the whole batch

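// process_batch.h
void process_batch(const Item* items, size_t count);

// Go caller. items must be a []C.Item (or a layout-identical Go type)
// whose fields hold no Go pointers; otherwise the pointer rules apply.
if len(items) == 0 {
    return // &items[0] would panic on an empty slice
}
C.process_batch((*C.Item)(unsafe.Pointer(&items[0])), C.size_t(len(items)))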

One crossing, regardless of batch size. This pattern — accumulate in Go, flush to C in bulk — is the highest-leverage optimization available.

The same logic applies to I/O buffers, message queues, and render commands. Any time you’re iterating in Go and calling C per iteration, you’re doing it wrong.
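
To put numbers on the amortization, here is a small sketch using the figures from this section (100ns per crossing, 10ns of useful work). perItemCost is a hypothetical helper, and the model deliberately ignores everything except the fixed crossing cost.

```go
package main

import "fmt"

// perItemCost returns the average cost in nanoseconds of handling one item
// when a single crossing (crossingNs) is shared by batchSize items that each
// need workNs of real work on the C side.
func perItemCost(crossingNs, workNs float64, batchSize int) float64 {
	if batchSize < 1 {
		batchSize = 1
	}
	return crossingNs/float64(batchSize) + workNs
}

func main() {
	// Figures from the text: 100ns per crossing, 10ns of useful work.
	for _, n := range []int{1, 10, 100, 1000} {
		fmt.Printf("batch=%4d  per-item=%.1fns\n", n, perItemCost(100, 10, n))
	}
}
```

At a batch size of 1 each item costs 110ns; at 100 it costs 11ns, and the crossing has all but vanished into the noise. Past that point, bigger batches buy very little.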


The Serialization Pattern: One Goroutine Owns C

Some C libraries aren’t thread-safe. Others use thread-local storage in ways that break if you call them from different OS threads (which Go’s scheduler happily does unless you lock). The reliable fix is to give C a single, dedicated goroutine.

type cgoRequest struct {
    fn     func()
    done   chan struct{}
}

var cgoQueue = make(chan cgoRequest, 64)

func init() {
    go func() {
        runtime.LockOSThread() // this goroutine is now pinned to one OS thread
        for req := range cgoQueue {
            req.fn()
            close(req.done)
        }
    }()
}

func callC(fn func()) {
    done := make(chan struct{})
    cgoQueue <- cgoRequest{fn: fn, done: done}
    <-done
}

Usage:

var result C.int
callC(func() {
    result = C.some_non_reentrant_thing(42)
})

runtime.LockOSThread() inside a goroutine pins it to one OS thread until it calls runtime.UnlockOSThread or exits. This worker does neither, so the thread’s identity is stable for the C library’s lifetime. Calls from other goroutines are serialized through the channel. The latency per call goes up slightly (channel overhead), but the library only ever sees one thread, and you avoid the nightmare of debugging thread-local state corruption.

For libraries that are thread-safe but where you want to limit parallelism (e.g., a GPU that saturates at 4 concurrent operations), use a worker pool variant: a fixed number of goroutines, each calling runtime.LockOSThread, all reading from one shared channel.
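
A rough sketch of that pool variant, under the assumption that the work arrives as plain closures. cPool, newCPool, Do, and Close are names made up for this example, and the summing closure in main stands in for a real call into C.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// cPool runs submitted functions on a fixed number of goroutines,
// each locked to its own OS thread.
type cPool struct {
	jobs chan func()
	wg   sync.WaitGroup
}

func newCPool(workers int) *cPool {
	p := &cPool{jobs: make(chan func())}
	p.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer p.wg.Done()
			runtime.LockOSThread() // never unlocked: the thread ends with the goroutine
			for fn := range p.jobs {
				fn()
			}
		}()
	}
	return p
}

// Do blocks until fn has run on one of the pool's threads.
func (p *cPool) Do(fn func()) {
	done := make(chan struct{})
	p.jobs <- func() {
		defer close(done)
		fn()
	}
	<-done
}

// Close stops the workers once the jobs already handed over have finished.
func (p *cPool) Close() {
	close(p.jobs)
	p.wg.Wait()
}

func main() {
	pool := newCPool(4) // at most 4 calls run concurrently
	defer pool.Close()

	var (
		mu    sync.Mutex
		total int
		wg    sync.WaitGroup
	)
	for i := 1; i <= 100; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			pool.Do(func() { // a real program would call into C here
				mu.Lock()
				total += n
				mu.Unlock()
			})
		}(i)
	}
	wg.Wait()
	fmt.Println(total)
}
```

The unbuffered jobs channel is a deliberate choice: callers block until a worker is free, so the pool size is also the concurrency limit the C library sees.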


Memory: C Owns C, Go Owns Go

The golden rule for sane Cgo lifetime management: don’t mix allocators.

If C allocates memory, C must free it. Don’t try to put a Go finalizer on a *C.char and hope it calls C.free. Finalizers run at GC time, which may be never for short-lived programs. Call C.free explicitly, paired with defer, at the same scope as the allocation.

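If Go allocates memory and passes a pointer to it directly as a call argument, the runtime keeps that memory pinned for the duration of the call and you need nothing extra. You need runtime.Pinner when the Go pointer is stored inside another structure that you hand to C, or when C holds on to it after the call returns:

// Go 1.21+
var p runtime.Pinner
defer p.Unpin()

data := make([]byte, 1024)
p.Pin(&data[0]) // keeps the buffer alive and in place until Unpin

// As a bare argument &data[0] would be pinned by the call itself; the explicit pin
// matters once the pointer is nested in a struct or retained on the C side.
C.process_data((*C.uchar)(unsafe.Pointer(&data[0])), C.size_t(len(data)))

Before runtime.Pinner (Go 1.21), the usual workarounds were to allocate the buffer on the C side entirely (C.malloc) or to get non-GC memory some other way, such as syscall.Mmap. For Go memory that C has to see beyond a single bare argument, runtime.Pinner is the modern, correct answer.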


Callback Hell: C Calling Back Into Go

The reverse crossing — C calling a Go function via a function pointer — is where things get genuinely painful. You can export Go functions with //export, but you cannot pass a Go closure to C. C function pointers have no concept of captured state.

The standard workaround: an integer handle registry.

// handles.go
var (
    handlesMu sync.Mutex
    handles   = map[int32]SomeGoType{}
    nextID    int32
)

func registerHandle(v SomeGoType) int32 {
    handlesMu.Lock()
    defer handlesMu.Unlock()
    nextID++ // handlesMu is already held, so a plain increment is enough
    id := nextID
    handles[id] = v
    return id
}

func releaseHandle(id int32) {
    handlesMu.Lock()
    defer handlesMu.Unlock()
    delete(handles, id)
}

//export goCallback
func goCallback(id C.int32_t, data *C.uchar, size C.size_t) {
    handlesMu.Lock()
    v, ok := handles[int32(id)]
    handlesMu.Unlock()
    if !ok {
        return
    }
    // use v to handle the callback
}

Then in C, store the integer ID in your context struct and pass it back when invoking the callback. It’s ugly but it works and it’s safe.


Gotcha: Build Cache Invalidation

Cgo makes the build cache fragile in ways pure Go doesn’t. The go tool tracks the C sources and headers inside the package directory, but a header pulled in from somewhere else (a system include path, a -I directory) or a prebuilt library named in LDFLAGS can change without the cache noticing. CGO_CFLAGS is an environment variable, so changing it for one package’s sake changes it for every Cgo package in the build. If you’re seeing inexplicable build behavior, go clean -cache is your first diagnostic step, not your last resort.

Also: do not put Cgo packages under internal/ if you expect other modules to import them. The internal/ visibility rule blocks the import outright, and when that failure is layered on top of Cgo build constraints the errors get confusing fast.


Production Checklist

Before you ship anything with Cgo:

  • CGO_ENABLED=0 build tested and either working or explicitly documented as unsupported
  • All C.CString / C.CBytes allocations have paired C.free in defer
  • No Go pointers containing Go pointers passed to C (build tests with GOEXPERIMENT=cgocheck2 on Go 1.21+, or run them with GODEBUG=cgocheck=2 on older releases)
  • C library is either thread-safe, or you’re using the serialization pattern
  • Benchmark confirms the crossing cost is acceptable at your expected call rate
  • Cross-compilation targets documented with required toolchain
  • Liveness: if your C library can block indefinitely, you have a timeout or a kill mechanism — a blocked C call holds an OS thread forever
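
For the liveness item, the simplest guard is a timeout around the call. This is a sketch with invented names (callWithTimeout, errTimeout, fastCall, slowCall), and slowCall merely sleeps to imitate a C function that blocks. Be clear about what it buys you: the caller gets unstuck, but the blocked call and its OS thread stay occupied until C returns on its own.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("call timed out")

const shortWait = 50 * time.Millisecond

// callWithTimeout runs fn on its own goroutine and waits at most d for the result.
// On timeout the caller is released, but fn keeps running (and keeps its OS thread,
// if it is blocked inside C) until it returns by itself.
func callWithTimeout(fn func() int, d time.Duration) (int, error) {
	res := make(chan int, 1) // buffered so a late result never blocks the sender
	go func() { res <- fn() }()

	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case v := <-res:
		return v, nil
	case <-timer.C:
		return 0, errTimeout
	}
}

// fastCall and slowCall stand in for C functions in this sketch.
func fastCall() int { return 42 }

func slowCall() int {
	time.Sleep(4 * shortWait)
	return 1
}

func main() {
	fmt.Println(callWithTimeout(fastCall, shortWait))
	fmt.Println(callWithTimeout(slowCall, shortWait))
}
```

If timeouts fire often, abandoned threads pile up. At that point the real fix is a cancellation hook in the C library or moving it into a process you can kill.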

When to Quit and Wrap Instead

One underused option: move the C code out of your Go binary entirely and talk to it over a Unix socket or shared memory. Your Go binary stays pure, cross-compiles cleanly, and you can restart the C component independently. The overhead is higher than Cgo but it’s bounded and predictable, and you get process isolation as a bonus.

This is how Neovim handles many of its integrations. It’s how some databases handle extension modules. If you’re building something that needs to run on multiple architectures and the Cgo dependency is truly optional at runtime, a sidecar process is worth serious consideration.
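
The Go side of a sidecar can be very small. What follows is a sketch, not a protocol recommendation: the 4-byte length prefix, the 1 MiB frame cap, and every function name are assumptions made for the example, and fakeSidecar is an in-process echo loop standing in for the C program so the code runs on its own.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
)

const maxFrame = 1 << 20 // arbitrary cap so a corrupt header cannot force a huge allocation

// writeFrame sends a 4-byte big-endian length followed by the payload.
func writeFrame(w io.Writer, payload []byte) error {
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	if len(payload) == 0 {
		return nil // nothing more to send for an empty message
	}
	_, err := w.Write(payload)
	return err
}

// readFrame reads one length-prefixed message.
func readFrame(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(hdr[:])
	if n > maxFrame {
		return nil, fmt.Errorf("frame too large: %d bytes", n)
	}
	buf := make([]byte, n)
	_, err := io.ReadFull(r, buf)
	return buf, err
}

// roundTrip sends one request to the sidecar and waits for its reply.
func roundTrip(conn net.Conn, req []byte) ([]byte, error) {
	if err := writeFrame(conn, req); err != nil {
		return nil, err
	}
	return readFrame(conn)
}

// fakeSidecar echoes every frame back; it stands in for the C process.
func fakeSidecar(conn net.Conn) {
	defer conn.Close()
	for {
		msg, err := readFrame(conn)
		if err != nil {
			return
		}
		if err := writeFrame(conn, msg); err != nil {
			return
		}
	}
}

// dialFake returns a connection to an in-process fake sidecar.
// A real deployment would use net.Dial("unix", socketPath) instead.
func dialFake() net.Conn {
	client, server := net.Pipe()
	go fakeSidecar(server)
	return client
}

func main() {
	conn := dialFake()
	defer conn.Close()

	reply, err := roundTrip(conn, []byte("ping"))
	fmt.Println(string(reply), err)
}
```

The batching advice applies here too, only more so: a socket round trip costs far more than a Cgo crossing, so put as much work as you can into each frame.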


Cgo is a sharp tool in a drawer that should mostly stay closed. When you open it, know what you’re holding: a boundary that costs real nanoseconds, breaks cross-compilation, blinds the race detector, and complicates memory ownership. None of that means don’t use it. It means use it deliberately, isolate it aggressively, and batch every crossing you can. The code that treats Cgo as an escape hatch — called freely, everywhere — is the code that shows up in profiler flamegraphs looking inexplicable. The code that treats it as a scarce resource — one package, one worker, one big batch — usually disappears from the flamegraph entirely.
