Go Benchmark Hygiene: Stop Lying to Yourself with -benchmem, -count, and Real Statistics

You write a benchmark. You run it once. You see 3.2 ns/op vs 2.9 ns/op and declare victory, merge the PR, and tell yourself you made things faster.

You didn’t. You measured noise.

This is the single most common mistake in Go performance work. A benchmark that runs once, ignores allocation counts, and ships without any statistical analysis isn’t a benchmark — it’s a guess with extra steps. In production systems, these guesses compound into rewrites of "slow" code that was never actually the bottleneck, or optimizations that evaporate the moment you change compiler flags.

This article covers the three pillars of honest Go benchmarking: -benchmem to surface hidden allocations, -count to collect enough samples to say something real, and benchstat to stop eyeballing numbers and start doing actual statistics. By the end you’ll have a workflow you can trust, plus a solid list of gotchas I’ve seen bite teams in the wild.


Why testing.B Alone Isn’t Enough

The standard go test -bench=. output looks like this:

BenchmarkJsonMarshal-8   1234567   987 ns/op

What it doesn’t tell you:

  • How many heap allocations happened per operation
  • How many bytes were allocated
  • Whether that 987 is stable or bounced between 750 and 1200 across runs
  • Whether your "optimization" is actually faster or just got lucky with CPU scheduling

The default output is intentionally minimal. The tooling to make it useful is already in your $PATH — you just have to use it.


-benchmem: Allocations Are the Whole Story

Most latency problems in Go are allocation problems. The GC isn’t free, and every escaped value on the heap is a tax paid in pause time, cache pressure, and GC work. -benchmem adds two columns to your output:

go test -bench=. -benchmem ./...
BenchmarkJsonMarshal-8   1234567   987 ns/op   256 B/op   3 allocs/op

Now you have real information. 256 B/op is the average heap bytes allocated per iteration. 3 allocs/op is the number of distinct heap allocations per iteration.

A zero-allocation hot path is almost always worth more than a 50 ns improvement that still allocates. Here’s a classic example:

// Naive: fmt.Sprintf takes its arguments as interfaces and allocates the result on every call
func BuildKeyNaive(prefix, id string) string {
    return fmt.Sprintf("%s:%s", prefix, id)
}

// Better: strings.Builder with a pre-sized buffer, no formatting machinery or interface boxing
func BuildKeyFast(prefix, id string) string {
    var b strings.Builder
    b.Grow(len(prefix) + 1 + len(id))
    b.WriteString(prefix)
    b.WriteByte(':')
    b.WriteString(id)
    return b.String()
}

The benchmark:

func BenchmarkBuildKeyNaive(b *testing.B) {
    for b.Loop() {
        BuildKeyNaive("user", "42")
    }
}

func BenchmarkBuildKeyFast(b *testing.B) {
    for b.Loop() {
        BuildKeyFast("user", "42")
    }
}

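Note on b.Loop(): Go 1.24 introduced b.Loop() as the preferred loop form. It resets the timer on its first call and stops it when the loop ends, so setup before the loop and cleanup after it are not measured, and it keeps the arguments and results of calls in the loop body alive so the compiler cannot optimize the work away. If you’re on an older version, use the classic for i := 0; i < b.N; i++ form. Both work, but b.Loop() is cleaner and avoids a class of subtle bugs.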

Run it:

BenchmarkBuildKeyNaive-8   7523041   159 ns/op   24 B/op   2 allocs/op
BenchmarkBuildKeyFast-8    19842310   60 ns/op    0 B/op    0 allocs/op

Zero allocations, 2.6x faster. Without -benchmem, you’d have seen the 60 ns/op vs 159 ns/op and assumed it was just CPU work. The allocation story explains why it’s faster and tells you what to watch for in future regressions.
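To watch for those regressions outside of a benchmark run, you can measure allocations directly. The sketch below uses testing.AllocsPerRun from the standard library; allocsFor, sink, and the lowercase function names are illustrative, not from the code above. Because it stores each result in a package-level variable, the returned string has to live on the heap, so the counts it prints can differ from the benchmark output, where the result is thrown away.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink is package-level so the compiler cannot discard the results.
var sink string

func buildKeyNaive(prefix, id string) string {
	return fmt.Sprintf("%s:%s", prefix, id)
}

func buildKeyFast(prefix, id string) string {
	var b strings.Builder
	b.Grow(len(prefix) + 1 + len(id))
	b.WriteString(prefix)
	b.WriteByte(':')
	b.WriteString(id)
	return b.String()
}

// allocsFor reports the average number of heap allocations per call to f.
func allocsFor(f func()) float64 {
	return testing.AllocsPerRun(1000, f)
}

func main() {
	naive := allocsFor(func() { sink = buildKeyNaive("user", "42") })
	fast := allocsFor(func() { sink = buildKeyFast("user", "42") })
	fmt.Printf("naive: %.0f allocs/call, fast: %.0f allocs/call\n", naive, fast)
}
```

In a real test file you would compare the measured value against a budget and call t.Errorf when it is exceeded, which turns an allocation regression into a failing test.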

Gotcha: allocs/op Can Lie About GC Pressure

A function that makes 1 allocation per call is not always twice as good as one that makes 2. Size matters. A single 1 MB allocation per call puts far more pressure on the GC than a hundred 8-byte ones, and values that stay on the stack never show up in allocs/op at all. Use -benchmem to get the big picture, but don’t stop there: if you care about GC pause times, you need pprof heap profiles.


-count: One Run Is Not Data

CPU benchmarks are noisy. Background processes, CPU frequency scaling, scheduler jitter, memory bus contention — all of it bleeds into your numbers. A single run gives you one sample from a distribution you haven’t characterized.

-count=N tells the test binary to run each benchmark N times:

go test -bench=. -benchmem -count=10 ./...
BenchmarkBuildKeyFast-8   19842310   60.1 ns/op   0 B/op   0 allocs/op
BenchmarkBuildKeyFast-8   19911022   59.8 ns/op   0 B/op   0 allocs/op
BenchmarkBuildKeyFast-8   19678341   61.2 ns/op   0 B/op   0 allocs/op
BenchmarkBuildKeyFast-8   19823901   60.4 ns/op   0 B/op   0 allocs/op
BenchmarkBuildKeyFast-8   20001234   59.6 ns/op   0 B/op   0 allocs/op
...

Now you can see the variance. That function is stable — 59-61 ns across runs is tight. If you saw 59, 74, 61, 92, 60, that’s a different story: something external is interfering.

What count to use? For most work, -count=10 is enough to detect instability and gives benchstat something reasonable to work with. For anything where you’re trying to confirm a sub-5% improvement, push to -count=20 or higher. More samples = smaller confidence intervals.

Gotcha: Warmup Is Real

The first iteration or two of a benchmark on a cold process will be slower: caches are cold, the runtime isn’t in a steady state, the OS hasn’t paged in the working set. The benchmark runner amortizes most of this by running the loop for about a second by default (tunable with -benchtime) rather than timing a single call, but with very fast functions (single-digit nanoseconds) you can still see first-run effects. -count helps average this out. If you’re being precise, throw away the first result when analysing by hand.

Gotcha: Don’t Benchmark on a Laptop Under Load

This sounds obvious but teams skip it constantly. A laptop with Slack, a browser, and a Docker daemon in the background can give you results that are 20-30% noisier than a quiet, dedicated host. For anything important (pre-merge performance gates, capacity planning), run benchmarks on a stripped-down VM or bare metal with no competing load. On Linux, also consider fixing the CPU governor and pinning the run to one core:

# Select the performance governor so the clock stays at its highest frequency
sudo cpupower frequency-set -g performance

# Pin the benchmark to a specific core to reduce scheduler noise
taskset -c 2 go test -bench=. -benchmem -count=10 ./...

benchstat: Stop Eyeballing, Start Analysing

This is the tool most Go developers have heard of but never actually use. benchstat is the official Go statistical analysis tool for benchmark output. For each benchmark it reports the median with a 95% confidence interval, and it runs a hypothesis test (Mann-Whitney U by default) to tell you whether the difference between two sets of results is real or just noise.

Install it:

go install golang.org/x/perf/cmd/benchstat@latest

The workflow is simple. Save your before and after results to files, then compare:

# Baseline (on the main branch)
go test -bench=. -benchmem -count=10 ./... > before.txt

# Make your change, then:
go test -bench=. -benchmem -count=10 ./... > after.txt

# Compare
benchstat before.txt after.txt

Output:

goos: linux
goarch: amd64
pkg: example.com/myapp

           │  before.txt  │             after.txt              │
           │    sec/op    │   sec/op     vs base                │
BuildKey-8   159.2n ± 2%   60.4n ± 1%  -62.06% (p=0.000 n=10)

           │  before.txt  │             after.txt              │
           │     B/op     │    B/op     vs base                 │
BuildKey-8     24.00 ± 0%   0.00 ± 0%  -100.00% (p=0.000 n=10)

           │  before.txt  │              after.txt              │
           │  allocs/op   │  allocs/op   vs base                │
BuildKey-8     2.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)

That p=0.000 means that, if the change had no real effect, a difference this large would essentially never show up by chance. That’s a real improvement.

Now compare this to a scenario where the change barely moves the needle:

           │  before.txt  │            after.txt              │
           │    sec/op    │   sec/op     vs base               │
FooBar-8     98.3n ± 8%   95.1n ± 9%   ~ (p=0.280 n=10)

The ~ means "no statistically significant difference" at benchstat’s default threshold of 0.05. That p=0.280 tells you that a difference at least this large would show up about 28% of the time even if the two versions performed identically. Do not merge that PR claiming a performance win. Recheck your approach.

Reading the ± Column

The ± 8% next to a result is the 95% confidence interval around the median, expressed as a percentage of the median. This is your noise signal:

  • < 3%: clean benchmark, stable environment, trustworthy result
  • 3-8%: acceptable, common on a dev machine
  • > 8%: noisy benchmark, something is wrong; fix it before interpreting results

High variance could mean: your benchmark has non-deterministic inputs, you’re running on a loaded machine, you’re benchmarking something that depends on network/disk I/O, or your benchmark loop is too short and the timer overhead is significant.

Gotcha: benchstat Old vs New API

Older benchstat builds, from before the tool was rewritten, print a different layout: a header row of name, old time/op, new time/op, and delta, with each result summarized as a mean rather than a median. If you’re on a team and see different output formats, compare the installed builds with go version -m "$(command -v benchstat)", which prints the golang.org/x/perf module version baked into the binary. The current format is the one shown above. Both take the result files as positional arguments; the current table is just easier to read at a glance.


A Complete Benchmark Workflow

Here’s what a disciplined benchmark session looks like end-to-end:

#!/bin/bash
# bench.sh — run before/after benchmarks and compare

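set -euo pipefail

BENCH_PATTERN="${1:-BenchmarkFoo}"
COUNT="${2:-10}"
PKG="${3:-./...}"

# With a clean tree, git stash saves nothing and git stash pop would
# restore some older, unrelated stash, so bail out early.
if git diff --quiet && git diff --cached --quiet; then
    echo "no uncommitted changes to benchmark against HEAD" >&2
    exit 1
fi

# Save baseline from the last commit
git stash
trap 'git stash pop' EXIT   # restore your changes even if the baseline run fails
go test -bench="$BENCH_PATTERN" -benchmem -count="$COUNT" "$PKG" > /tmp/bench_before.txt
trap - EXIT
git stash pop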

# Run after
go test -bench="$BENCH_PATTERN" -benchmem -count="$COUNT" "$PKG" > /tmp/bench_after.txt

echo "=== Before ==="
cat /tmp/bench_before.txt

echo ""
echo "=== After ==="
cat /tmp/bench_after.txt

echo ""
echo "=== Delta ==="
benchstat /tmp/bench_before.txt /tmp/bench_after.txt

Usage:

./bench.sh BenchmarkBuildKey 10 ./internal/keys/...

This script stashes your uncommitted changes, measures the baseline, pops them back, measures the new version, and gives you a clean benchstat diff. No manual file management, no forgetting which file is which.


Writing Benchmarks That Don’t Lie

Beyond the flags, the benchmark code itself has failure modes.

Always use b.ReportAllocs() if you can’t use -benchmem globally (e.g., in a CI environment where someone forgot the flag):

func BenchmarkFoo(b *testing.B) {
    b.ReportAllocs()
    for b.Loop() {
        // ...
    }
}

Prevent compiler optimizations from eliminating your work. The compiler is smart. If it can prove your computation has no side effects, it’ll delete it. Use a package-level sink:

var globalSink string

func BenchmarkBuildKeyFast(b *testing.B) {
    b.ReportAllocs()
    for b.Loop() {
        globalSink = BuildKeyFast("user", "42")
    }
}

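Assigning the result to _ is not a substitute: the blank identifier gives the compiler nothing it has to keep, so an inlined call can still be optimized away. The package-level variable is the reliable option for classic b.N loops. With b.Loop() the sink is belt and braces, because the arguments and results of calls inside the loop body are already kept alive.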

Reset the timer after expensive setup:

func BenchmarkProcessData(b *testing.B) {
    // This setup cost should not be counted
    data := loadLargeTestFixture()
    
    b.ResetTimer()
    b.ReportAllocs()
    
    for b.Loop() {
        ProcessData(data)
    }
}

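With a classic b.N loop, leaving out b.ResetTimer() folds the fixture loading time into your results, and you’d be measuring the wrong thing entirely. b.Loop() resets the timer itself on its first call, so in the example above the explicit call is redundant but harmless.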

Make inputs realistic. A benchmark that always processes the same 8-byte string gives the CPU’s branch predictor and caches a far easier job than variable-length real-world data does, so it can look faster than production ever will. When the variance in your input data matters, vary it:

func BenchmarkHashKey(b *testing.B) {
    // Use a fixed seed for reproducibility across runs
    keys := generateKeys(1000, 42)
    b.ResetTimer()
    b.ReportAllocs()
    
    for i := 0; b.Loop(); i++ {
        hashKey(keys[i%len(keys)])
    }
}
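generateKeys is not defined above. A minimal sketch of what such a helper could look like, assuming keys are lowercase alphanumeric strings of 4 to 64 bytes (both the alphabet and the length range are arbitrary choices for illustration):

```go
package main

import (
	"fmt"
	"math/rand"
)

// generateKeys is a hypothetical helper: it returns n pseudo-random keys of
// varying length. The same seed always yields the same keys, so runs stay
// comparable with each other.
func generateKeys(n int, seed int64) []string {
	const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
	// A private generator keeps the sequence independent of the global one.
	rng := rand.New(rand.NewSource(seed))

	keys := make([]string, n)
	for i := range keys {
		length := 4 + rng.Intn(61) // 4 to 64 bytes
		buf := make([]byte, length)
		for j := range buf {
			buf[j] = alphabet[rng.Intn(len(alphabet))]
		}
		keys[i] = string(buf)
	}
	return keys
}

func main() {
	keys := generateKeys(1000, 42)
	fmt.Println(len(keys), "keys; first key has length", len(keys[0]))
}
```

Generating the keys before the loop keeps the random number generator out of the measurement. If your production keys have a known shape (UUIDs, tenant-prefixed IDs), generate that shape instead of random text.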

Gotcha: Sub-benchmarks Skew Aggregate Results

If you use b.Run() for table-driven benchmarks, each sub-benchmark gets its own *testing.B and its own timer. That’s fine. But if you’re comparing aggregated totals, note that benchstat works at the named benchmark level: BenchmarkFoo/case1 and BenchmarkFoo/case2 are tracked as separate rows. The geomean row that current benchstat prints summarizes the rows but is not a total, so a combined number still needs a custom script.


CI Integration: Catching Regressions Before Merge

The real payoff is using this in your pipeline. Here’s a GitHub Actions snippet that catches regressions on PRs:

# .github/workflows/bench.yml
name: Benchmarks

on:
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: actions/setup-go@v5
        with:
          go-version: stable

      - name: Install benchstat
        run: go install golang.org/x/perf/cmd/benchstat@latest

      - name: Benchmark base branch
        run: |
          git checkout "origin/$GITHUB_BASE_REF"
          go test -bench=. -benchmem -count=10 ./... > /tmp/before.txt

      - name: Benchmark PR branch
        run: |
          # GITHUB_SHA is the commit this run was triggered for; unlike the
          # head branch name, it also resolves when the PR comes from a fork.
          git checkout "$GITHUB_SHA"
          go test -bench=. -benchmem -count=10 ./... > /tmp/after.txt

      - name: Compare results
        run: |
          benchstat /tmp/before.txt /tmp/after.txt | tee bench_delta.txt
          # Fail if any benchmark regressed more than 10%
          # (benchstat exits 0 even on regressions, so grep for the pattern)
          if grep -E '\+[0-9]{2,}\.' bench_delta.txt | grep -v '~'; then
            echo "Significant benchmark regression detected"
            exit 1
          fi

      - name: Upload benchmark results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: |
            /tmp/before.txt
            /tmp/after.txt
            bench_delta.txt

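This is rough but workable. The grep fails the job on any statistically significant increase of 10% or more in any column, which assumes lower is better for every metric you report. Shared CI runners are noisy neighbours for benchmarks, so expect the occasional false alarm. For finer-grained regression detection, look at github.com/benchmark-action/github-action-benchmark, which can track historical trends and draw graphs.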


The Mental Model

Every time you run a Go benchmark, you’re asking one of two questions:

  1. Is this fast enough? One careful run with -benchmem is fine. You’re checking absolute numbers against a threshold you already know.
  2. Is change A faster than change B? You need -count, you need benchstat, and you need to respect the p-value. Anything with p > 0.05 can’t be told apart from noise with the samples you have.

Most developers are answering question 2 but treating it like question 1. That’s how bad optimizations make it into production.

The tooling is right there. -benchmem costs you nothing except slightly more verbose output. -count=10 costs you ten times the benchmark runtime, which for micro-benchmarks is still under a minute. benchstat is a single go install command. There’s no excuse for shipping performance claims that aren’t backed by real statistics.

If your benchmark result won’t survive benchstat, it doesn’t survive.
