You write a benchmark. You run it once. You see 3.2 ns/op vs 2.9 ns/op and declare victory, merge the PR, and tell yourself you made things faster.
You didn’t. You measured noise.
This is the single most common mistake in Go performance work. A benchmark that runs once, ignores allocation counts, and ships without any statistical analysis isn’t a benchmark — it’s a guess with extra steps. In production systems, these guesses compound into rewrites of "slow" code that was never actually the bottleneck, or optimizations that evaporate the moment you change compiler flags.
This article covers the three pillars of honest Go benchmarking: -benchmem to surface hidden allocations, -count to collect enough samples to say something real, and benchstat to stop eyeballing numbers and start doing actual statistics. By the end you’ll have a workflow you can trust, plus a solid list of gotchas I’ve seen bite teams in the wild.
Why testing.B Alone Isn’t Enough
The standard go test -bench=. output looks like this:
BenchmarkJsonMarshal-8 1234567 987 ns/op
What it doesn’t tell you:
- How many heap allocations happened per operation
- How many bytes were allocated
- Whether that 987 is stable or bounced between 750 and 1200 across runs
- Whether your "optimization" is actually faster or just got lucky with CPU scheduling
The default output is intentionally minimal. The tooling to make it useful is already in your $PATH — you just have to use it.
-benchmem: Allocations Are the Whole Story
Most latency problems in Go are allocation problems. The GC isn’t free, and every escaped value on the heap is a tax paid in pause time, cache pressure, and GC work. -benchmem adds two columns to your output:
go test -bench=. -benchmem ./...
BenchmarkJsonMarshal-8 1234567 987 ns/op 256 B/op 3 allocs/op
Now you have real information. 256 B/op is the average heap bytes allocated per iteration. 3 allocs/op is the number of distinct heap allocations per iteration.
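If you ever need those columns in a script rather than on screen, the line format is simple enough to split by hand. The following is a minimal sketch, not an official parser: BenchLine and parseBenchLine are hypothetical names, and it assumes the whitespace-separated "value unit" layout shown above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// BenchLine holds the metrics from one line of `go test -bench -benchmem` output.
type BenchLine struct {
	Name        string
	Iterations  int64
	NsPerOp     float64
	BytesPerOp  float64
	AllocsPerOp float64
}

// parseBenchLine splits a result line into the benchmark name, the iteration
// count, and the "<value> <unit>" pairs that follow.
func parseBenchLine(line string) (BenchLine, error) {
	fields := strings.Fields(line)
	if len(fields) < 4 || !strings.HasPrefix(fields[0], "Benchmark") {
		return BenchLine{}, fmt.Errorf("not a benchmark line: %q", line)
	}
	iters, err := strconv.ParseInt(fields[1], 10, 64)
	if err != nil {
		return BenchLine{}, fmt.Errorf("bad iteration count: %w", err)
	}
	out := BenchLine{Name: fields[0], Iterations: iters}
	for i := 2; i+1 < len(fields); i += 2 {
		v, err := strconv.ParseFloat(fields[i], 64)
		if err != nil {
			return BenchLine{}, fmt.Errorf("bad value %q: %w", fields[i], err)
		}
		switch fields[i+1] {
		case "ns/op":
			out.NsPerOp = v
		case "B/op":
			out.BytesPerOp = v
		case "allocs/op":
			out.AllocsPerOp = v
		}
	}
	return out, nil
}

func main() {
	r, err := parseBenchLine("BenchmarkJsonMarshal-8 1234567 987 ns/op 256 B/op 3 allocs/op")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %.0f ns/op, %.0f B/op, %.0f allocs/op\n",
		r.Name, r.NsPerOp, r.BytesPerOp, r.AllocsPerOp)
}
```

Unknown units are skipped on purpose, so custom metrics reported with b.ReportMetric don't break the parse.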
A zero-allocation path on the hot loop is almost always better than a 50 ns improvement that still allocs. Here’s a classic example:
// Naive: fmt.Sprintf boxes its arguments into interfaces and allocates the result string
func BuildKeyNaive(prefix, id string) string {
	return fmt.Sprintf("%s:%s", prefix, id)
}
// Better: strings.Builder with a single up-front Grow; at most one buffer
// allocation, which the compiler may elide when the result does not escape
func BuildKeyFast(prefix, id string) string {
	var b strings.Builder
	b.Grow(len(prefix) + 1 + len(id))
	b.WriteString(prefix)
	b.WriteByte(':')
	b.WriteString(id)
	return b.String()
}
The benchmark:
func BenchmarkBuildKeyNaive(b *testing.B) {
	for b.Loop() {
		BuildKeyNaive("user", "42")
	}
}
func BenchmarkBuildKeyFast(b *testing.B) {
	for b.Loop() {
		BuildKeyFast("user", "42")
	}
}
Note on b.Loop(): Go 1.24 introduced b.Loop() as the preferred loop form. It keeps setup and cleanup code out of the timing automatically, and it keeps the arguments and results of calls in the loop body alive so the compiler can’t optimize the work away. If you’re on an older version, use the classic for i := 0; i < b.N; i++ form. Both work, but b.Loop() is cleaner and avoids a class of subtle bugs.
Run it:
BenchmarkBuildKeyNaive-8 7523041 159 ns/op 24 B/op 2 allocs/op
BenchmarkBuildKeyFast-8 19842310 60 ns/op 0 B/op 0 allocs/op
Zero allocations, 2.6x faster. Without -benchmem, you’d have seen the 60 ns/op vs 159 ns/op and assumed it was just CPU work. The allocation story explains why it’s faster and tells you what to watch for in future regressions.
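One way to watch for those regressions outside a benchmark run is testing.AllocsPerRun, which reports the average number of heap allocations per call. The sketch below is an illustration under assumptions: the lowercase function names and allocsPerCall are hypothetical, and the results are stored in a package-level variable, so they escape and the counts will not necessarily match the benchmark table above. For that reason it compares the two paths instead of asserting fixed numbers.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps results reachable so the calls cannot be optimized away.
var sink string

func buildKeyNaive(prefix, id string) string {
	return fmt.Sprintf("%s:%s", prefix, id)
}

func buildKeyFast(prefix, id string) string {
	var b strings.Builder
	b.Grow(len(prefix) + 1 + len(id))
	b.WriteString(prefix)
	b.WriteByte(':')
	b.WriteString(id)
	return b.String()
}

// allocsPerCall reports the average number of heap allocations per call of f.
func allocsPerCall(f func()) float64 {
	return testing.AllocsPerRun(1000, f)
}

func main() {
	naive := allocsPerCall(func() { sink = buildKeyNaive("user", "42") })
	fast := allocsPerCall(func() { sink = buildKeyFast("user", "42") })
	fmt.Printf("naive: %.0f allocs/call, fast: %.0f allocs/call\n", naive, fast)
	if fast > naive {
		fmt.Println("regression: the builder path now allocates more than the Sprintf path")
	}
}
```

In a real project the same comparison would sit in a _test.go file and call t.Errorf instead of printing.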
Gotcha: allocs/op Can Lie About GC Pressure
A function that makes 1 allocation per call is not automatically twice as good as one that makes 2. Size matters. A single 1 MB allocation puts far more pressure on the GC than a hundred 8-byte ones, and values that stay on the stack never show up in allocs/op at all. Use -benchmem to get the big picture, but don’t stop there: if you care about GC pause times, you need pprof heap profiles.
-count: One Run Is Not Data
CPU benchmarks are noisy. Background processes, CPU frequency scaling, scheduler jitter, memory bus contention — all of it bleeds into your numbers. A single run gives you one sample from a distribution you haven’t characterized.
-count=N tells the test binary to run each benchmark N times:
go test -bench=. -benchmem -count=10 ./...
BenchmarkBuildKeyFast-8 19842310 60.1 ns/op 0 B/op 0 allocs/op
BenchmarkBuildKeyFast-8 19911022 59.8 ns/op 0 B/op 0 allocs/op
BenchmarkBuildKeyFast-8 19678341 61.2 ns/op 0 B/op 0 allocs/op
BenchmarkBuildKeyFast-8 19823901 60.4 ns/op 0 B/op 0 allocs/op
BenchmarkBuildKeyFast-8 20001234 59.6 ns/op 0 B/op 0 allocs/op
...
Now you can see the variance. That function is stable — 59-61 ns across runs is tight. If you saw 59, 74, 61, 92, 60, that’s a different story: something external is interfering.
What count to use? For most work, -count=10 is enough to detect instability and gives benchstat something reasonable to work with. For anything where you’re trying to confirm a sub-5% improvement, push to -count=20 or higher. More samples = smaller confidence intervals.
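To put a number on "tight" versus "something external is interfering", you can compute the spread of the samples relative to their mean. This is a minimal sketch using the two sets of ns/op values quoted above; meanStddev and relSpread are hypothetical helper names, and this is plain descriptive statistics, not what benchstat computes.

```go
package main

import (
	"fmt"
	"math"
)

// meanStddev returns the sample mean and the sample standard deviation
// (n-1 denominator). With fewer than two samples the deviation is 0.
func meanStddev(xs []float64) (mean, sd float64) {
	if len(xs) == 0 {
		return 0, 0
	}
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	if len(xs) < 2 {
		return mean, 0
	}
	var ss float64
	for _, x := range xs {
		d := x - mean
		ss += d * d
	}
	return mean, math.Sqrt(ss / float64(len(xs)-1))
}

// relSpread is the standard deviation as a fraction of the mean.
func relSpread(xs []float64) float64 {
	mean, sd := meanStddev(xs)
	if mean == 0 {
		return 0
	}
	return sd / mean
}

func main() {
	stable := []float64{60.1, 59.8, 61.2, 60.4, 59.6}
	noisy := []float64{59, 74, 61, 92, 60}
	fmt.Printf("stable: %.1f%% relative spread\n", 100*relSpread(stable))
	fmt.Printf("noisy:  %.1f%% relative spread\n", 100*relSpread(noisy))
}
```

The stable set comes out at roughly 1% and the noisy set at roughly 20%, which matches the eyeball verdict.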
Gotcha: Warmup Is Real
The first iteration or two of a benchmark on a cold process will be slower: caches are cold, the runtime isn’t in a steady state, the OS hasn’t paged in the working set. The Go benchmark runner ramps up the iteration count before it reports a result, which acts as a warmup, but with very fast functions (single-digit nanoseconds) you can still see first-run effects. -count helps average this out. If you’re being precise, throw away the first result when analysing by hand.
Gotcha: Don’t Benchmark on a Laptop Under Load
This sounds obvious but teams skip it constantly. A laptop with Slack, a browser, and a Docker daemon in the background will give you results that are 20-30% noisier than a quiet, dedicated host. For anything important — pre-merge performance gates, capacity planning — run benchmarks on a stripped-down VM or bare metal with cpupower frequency-set -g performance and no competing load. On Linux, also consider:
# Select the performance governor so the CPU stays at its highest frequency
sudo cpupower frequency-set -g performance
# Pin the benchmark to a specific core to reduce scheduler noise
taskset -c 2 go test -bench=. -benchmem -count=10 ./...
benchstat: Stop Eyeballing, Start Analysing
This is the tool most Go developers have heard of but never actually use. benchstat is the official Go statistical analysis tool for benchmark output. It computes the median and a confidence interval for each benchmark, and runs a hypothesis test (the Mann-Whitney U-test) to tell you whether the difference between two sets of results is real or just noise.
Install it:
go install golang.org/x/perf/cmd/benchstat@latest
The workflow is simple. Save your before and after results to files, then compare:
# Baseline (on the main branch)
go test -bench=. -benchmem -count=10 ./... > before.txt
# Make your change, then:
go test -bench=. -benchmem -count=10 ./... > after.txt
# Compare
benchstat before.txt after.txt
Output:
goos: linux
goarch: amd64
pkg: example.com/myapp
           │ before.txt  │              after.txt              │
           │   sec/op    │   sec/op     vs base                │
BuildKey-8   159.2n ± 2%   60.4n ± 1%  -62.06% (p=0.000 n=10)

           │ before.txt │              after.txt               │
           │    B/op    │    B/op     vs base                  │
BuildKey-8   24.00 ± 0%   0.00 ± 0%  -100.00% (p=0.000 n=10)

           │ before.txt │              after.txt               │
           │ allocs/op  │ allocs/op   vs base                  │
BuildKey-8   2.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)
That p=0.000 means that if the change had no real effect, a difference this large would almost never show up by chance. That’s a real improvement.
Now compare this to a scenario where the change barely moves the needle:
         │ before.txt  │           after.txt           │
         │   sec/op    │   sec/op     vs base          │
FooBar-8   98.3n ± 8%    95.1n ± 9%  ~ (p=0.280 n=10)
The ~ means "no statistically significant difference." That p=0.280 tells you that even with no real change, run-to-run variance alone would produce a gap this large about 28% of the time. Do not merge that PR claiming a performance win. Recheck your approach.
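If the p-value still feels abstract, it helps to compute one by brute force. The sketch below is not benchstat's method (benchstat uses the Mann-Whitney U-test); it swaps in a simpler permutation test on the difference of means, which answers the same question: how often would random relabeling of the samples produce a gap at least this large? The sample values and the name permutationP are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

func mean(xs []float64) float64 {
	var s float64
	for _, x := range xs {
		s += x
	}
	return s / float64(len(xs))
}

// permutationP estimates a two-sided p-value: the fraction of random
// relabelings of the pooled samples whose difference in means is at least
// as large as the observed difference.
func permutationP(before, after []float64, rounds int, seed int64) float64 {
	observed := math.Abs(mean(before) - mean(after))
	pooled := append(append([]float64{}, before...), after...)
	rng := rand.New(rand.NewSource(seed)) // fixed seed keeps runs reproducible
	extreme := 0
	for i := 0; i < rounds; i++ {
		rng.Shuffle(len(pooled), func(a, b int) {
			pooled[a], pooled[b] = pooled[b], pooled[a]
		})
		diff := math.Abs(mean(pooled[:len(before)]) - mean(pooled[len(before):]))
		if diff >= observed {
			extreme++
		}
	}
	return float64(extreme) / float64(rounds)
}

func main() {
	// Hypothetical ns/op samples: a clear win and a marginal change.
	clearBefore := []float64{159, 161, 158, 160, 157, 162, 159, 160, 158, 161}
	clearAfter := []float64{60, 61, 59, 60, 62, 60, 59, 61, 60, 60}
	fuzzyBefore := []float64{98, 105, 91, 99, 108, 94, 97, 102, 90, 100}
	fuzzyAfter := []float64{95, 103, 88, 97, 104, 92, 99, 86, 101, 93}

	fmt.Printf("clear win: p=%.3f\n", permutationP(clearBefore, clearAfter, 10000, 1))
	fmt.Printf("marginal:  p=%.3f\n", permutationP(fuzzyBefore, fuzzyAfter, 10000, 1))
}
```

A tiny p means almost no relabeling reproduces the gap, so the labels "before" and "after" carry real information. A large p means the labels barely matter, which is exactly what ~ is telling you.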
Reading the ± Column
The ± 8% next to a result is the confidence interval around the median (95% by default), shown as a percentage of that median. This is your noise signal:
- < 3%: clean benchmark, stable environment, trustworthy result
- 3-8%: acceptable, common on a dev machine
- > 10%: noisy benchmark, something is wrong, fix before interpreting results
High variance could mean: your benchmark has non-deterministic inputs, you’re running on a loaded machine, you’re benchmarking something that depends on network/disk I/O, or your benchmark loop is too short and the timer overhead is significant.
Gotcha: benchstat Old vs New API
Before the golang.org/x/perf v0.7.0 release, benchstat printed a different column format. If you’re on a team and see different output formats, check which build each person has with go version -m "$(command -v benchstat)", which prints the module version the binary was built from. The current format is the one shown above. The old one used a name / old time/op / new time/op / delta layout, with a separate section per unit. Both work, the current table format is just easier to read at a glance.
A Complete Benchmark Workflow
Here’s what a disciplined benchmark session looks like end-to-end:
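#!/bin/bash
# bench.sh: run before/after benchmarks and compare
set -euo pipefail
BENCH_PATTERN="${1:-BenchmarkFoo}"
COUNT="${2:-10}"
PKG="${3:-./...}"
# Bail out on a clean tree: "git stash" would save nothing and
# "git stash pop" would then restore an unrelated, older stash.
if git diff --quiet && git diff --cached --quiet; then
    echo "no uncommitted changes to benchmark" >&2
    exit 1
fi
# Save baseline from the last commit; restore the changes even if the run fails
git stash
go test -bench="$BENCH_PATTERN" -benchmem -count="$COUNT" "$PKG" > /tmp/bench_before.txt || { git stash pop; exit 1; }
git stash pop
# Run after
go test -bench="$BENCH_PATTERN" -benchmem -count="$COUNT" "$PKG" > /tmp/bench_after.txt
echo "=== Before ==="
cat /tmp/bench_before.txt
echo ""
echo "=== After ==="
cat /tmp/bench_after.txt
echo ""
echo "=== Delta ==="
benchstat /tmp/bench_before.txt /tmp/bench_after.txt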
Usage:
./bench.sh BenchmarkBuildKey 10 ./internal/keys/...
This script stashes your uncommitted changes, measures the baseline, pops them back, measures the new version, and gives you a clean benchstat diff. No manual file management, no forgetting which file is which.
Writing Benchmarks That Don’t Lie
Beyond the flags, the benchmark code itself has failure modes.
Always use b.ReportAllocs() if you can’t use -benchmem globally (e.g., in a CI environment where someone forgot the flag):
func BenchmarkFoo(b *testing.B) {
	b.ReportAllocs()
	for b.Loop() {
		// ...
	}
}
Prevent compiler optimizations from eliminating your work. The compiler is smart. If it can prove your computation has no side effects, it’ll delete it. Use a package-level sink:
var globalSink string
func BenchmarkBuildKeyFast(b *testing.B) {
	b.ReportAllocs()
	for b.Loop() {
		globalSink = BuildKeyFast("user", "42")
	}
}
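Assigning the result to _ is not a substitute: the blank identifier discards the value, so the compiler remains free to remove the call. On Go 1.24+, b.Loop() already keeps the arguments and results of calls in the loop body alive, so the package-level sink matters most for b.N-style loops.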
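If you’re stuck on a toolchain older than Go 1.24, the sink pattern looks like this with the classic b.N loop. This is a minimal sketch: benchBuildKeyClassic is a hypothetical name, and the benchmark is driven through testing.Benchmark so the example runs as a standalone program instead of under go test.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// globalSink receives every result, so the compiler must keep the calls.
var globalSink string

func buildKeyFast(prefix, id string) string {
	var b strings.Builder
	b.Grow(len(prefix) + 1 + len(id))
	b.WriteString(prefix)
	b.WriteByte(':')
	b.WriteString(id)
	return b.String()
}

// benchBuildKeyClassic uses the b.N loop form, which works on every Go release.
func benchBuildKeyClassic(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		globalSink = buildKeyFast("user", "42")
	}
}

func main() {
	// testing.Benchmark runs the function outside of `go test`.
	r := testing.Benchmark(benchBuildKeyClassic)
	fmt.Println(r.String(), r.MemString())
}
```

Because the result is stored in a global, it escapes to the heap, so expect this variant to report allocations that a discard-the-result benchmark might not.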
Reset the timer after expensive setup:
func BenchmarkProcessData(b *testing.B) {
	// This setup cost should not be counted
	data := loadLargeTestFixture()
	b.ResetTimer()
	b.ReportAllocs()
	for b.Loop() {
		ProcessData(data)
	}
}
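With a b.N-style loop, leaving out b.ResetTimer() folds the fixture loading time into your results, and you’d be measuring the wrong thing entirely. b.Loop() resets the timer on its first call, so the explicit call above is redundant there, but harmless.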
Make inputs realistic. A benchmark that always processes the same 8-byte string keeps the caches warm and the branch predictor perfectly trained, which flatters the code compared with variable-length real-world data. When the variance in your input data matters, vary it:
func BenchmarkHashKey(b *testing.B) {
	// Use a fixed seed for reproducibility across runs
	keys := generateKeys(1000, 42)
	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; b.Loop(); i++ {
		hashKey(keys[i%len(keys)])
	}
}
Gotcha: Sub-benchmarks Skew Aggregate Results
If you use b.Run() for table-driven benchmarks, each sub-benchmark is timed and reported independently. That’s fine. But if you’re comparing aggregated totals, note that benchstat works at the named benchmark level: BenchmarkFoo/case1 and BenchmarkFoo/case2 are tracked separately. Apart from the geomean row benchstat prints across benchmarks, you can’t meaningfully aggregate them without a custom script.
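For reference, a table-driven benchmark with b.Run() might look like the sketch below. The hashKey function here is a hypothetical stand-in built on hash/fnv, and the sizes are arbitrary. In a _test.go file the function would be named BenchmarkHashKey, and go test would print one line per case, such as BenchmarkHashKey/len=8-8, which is the name benchstat keys on.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
	"testing"
)

var hashSink uint64

// hashKey is a hypothetical stand-in for the function under test.
func hashKey(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// benchHashKey runs one sub-benchmark per input size. Each case gets its own
// name, its own timer, and its own line of output under `go test`.
func benchHashKey(b *testing.B) {
	for _, n := range []int{8, 64, 1024} {
		key := strings.Repeat("k", n)
		b.Run(fmt.Sprintf("len=%d", n), func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				hashSink = hashKey(key)
			}
		})
	}
}

func main() {
	// Outside `go test`, testing.Benchmark folds the sub-benchmarks into one
	// aggregate result, so the per-case names are not printed here.
	fmt.Println(testing.Benchmark(benchHashKey))
}
```

Putting the varying parameter in the sub-benchmark name (len=8, len=64) keeps before and after files aligned case by case.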
CI Integration: Catching Regressions Before Merge
The real payoff is using this in your pipeline. Here’s a GitHub Actions snippet that catches regressions on PRs:
# .github/workflows/bench.yml
name: Benchmarks
on:
  pull_request:
    branches: [main]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-go@v5
        with:
          go-version: stable
      - name: Install benchstat
        run: go install golang.org/x/perf/cmd/benchstat@latest
      - name: Benchmark base branch
        run: |
          git checkout ${{ github.base_ref }}
          go test -bench=. -benchmem -count=10 ./... > /tmp/before.txt
      - name: Benchmark PR branch
        run: |
          # github.sha is the PR merge commit that actions/checkout started on.
          # head_ref is not a local branch for fork PRs, and it is
          # user-controlled text that should not be pasted into a shell.
          git checkout ${{ github.sha }}
          go test -bench=. -benchmem -count=10 ./... > /tmp/after.txt
      - name: Compare results
        run: |
          benchstat /tmp/before.txt /tmp/after.txt | tee bench_delta.txt
          # Fail if any benchmark regressed by 10% or more
          # (benchstat exits 0 even on regressions, so grep for the pattern)
          if grep -E '\+[0-9]{2,}\.' bench_delta.txt | grep -v '~'; then
            echo "Significant benchmark regression detected"
            exit 1
          fi
      - name: Upload benchmark results
        # Upload even when the comparison step fails
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: |
            /tmp/before.txt
            /tmp/after.txt
            bench_delta.txt
This is rough but works. For finer-grained regression detection, look at github.com/benchmark-action/github-action-benchmark, which can track historical trends and draw graphs.
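If the grep starts to feel fragile, the same check is a few lines of Go. This is a sketch under assumptions, not a supported interface: it relies on the textual benchstat layout shown earlier, where a signed percentage appears only for statistically significant changes and ~ appears otherwise. The name regressions and the 10% threshold are illustrative. It also treats every positive delta as bad, which is right for sec/op, B/op, and allocs/op but wrong for throughput metrics.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"regexp"
	"strconv"
	"strings"
)

// deltaRE matches a signed percentage such as "+12.50%" or "-62.06%".
var deltaRE = regexp.MustCompile(`^([+-]\d+(?:\.\d+)?)%$`)

// regressions returns the first column of every row whose signed delta
// is greater than threshold percent.
func regressions(report string, threshold float64) []string {
	var bad []string
	sc := bufio.NewScanner(strings.NewReader(report))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		for _, f := range fields {
			m := deltaRE.FindStringSubmatch(f)
			if m == nil {
				continue
			}
			if delta, err := strconv.ParseFloat(m[1], 64); err == nil && delta > threshold {
				bad = append(bad, fields[0])
			}
			break // one delta per row
		}
	}
	return bad
}

func main() {
	// Reads benchstat output from stdin, e.g. `benchstat a.txt b.txt | thisprog`.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	if bad := regressions(string(data), 10); len(bad) > 0 {
		fmt.Println("significant regressions:", strings.Join(bad, ", "))
		os.Exit(1)
	}
}
```

A parsed threshold is easier to tune per benchmark than a regular expression that counts digits.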
The Mental Model
Every time you run a Go benchmark, you’re asking one of two questions:
- Is this fast enough? One careful run with -benchmem is fine. You’re checking absolute numbers against a threshold you already know.
- Is change A faster than change B? You need -count, you need benchstat, and you need to respect the p-value. Anything with p > 0.05 is indistinguishable from noise.
Most developers are answering question 2 but treating it like question 1. That’s how bad optimizations make it into production.
The tooling is right there. -benchmem costs you nothing except slightly more verbose output. -count=10 costs you ten times the benchmark runtime, which for micro-benchmarks is still under a minute. benchstat is a single go install command. There’s no excuse for shipping performance claims that aren’t backed by real statistics.
If your benchmark result won’t survive benchstat, it doesn’t survive.