Someone files a ticket: "The app is slow." You check dashboards — CPU is pegged at 90%. Great, something is burning cycles. But where? top shows the process. htop confirms it. And then you’re stuck staring at a process name, no closer to the root cause than you were five minutes ago.
Flame graphs fix this. One visualization that tells you exactly which functions are eating your CPU, how they got called, and how much time each one holds. The problem is most people look at a flame graph for the first time and see noise — a wall of colored rectangles with no obvious entry point.
This article is a field guide. You’ll learn the mechanics, how to generate your first flame graph without fighting your environment for an hour, and — most importantly — the exact places to look when you need to find a CPU bottleneck fast.
What a Flame Graph Actually Is
Brendan Gregg invented flame graphs in 2011 while debugging a MySQL performance problem at Joyent, and kept developing them after joining Netflix. The canonical tooling lives at https://github.com/brendangregg/FlameGraph.
The concept is simple: take thousands of stack traces captured over time, merge duplicates, and draw each unique call stack as a column of stacked rectangles. The result looks like fire — hence the name.
X axis: not time. This is the most common misconception. The X axis shows how often a function appears in the samples, with frames sorted alphabetically within each level (not by call order) so that identical frames merge. A wide frame means that function showed up in a lot of stack traces: it, or something it called, was on-CPU a lot.
Y axis: call depth. Bottom is the root (usually main or the thread entry point), top is the leaf — the function that was actually executing when the sample was taken.
Width = time on CPU. If a rectangle takes up 30% of the width, that function (and everything it called) consumed roughly 30% of your CPU time.
Colors are mostly decorative by default. Red/orange shades are traditional but arbitrary. Some tools use color to differentiate frame types: flamegraph.pl’s java palette, for example, draws Java frames green, C++ yellow, kernel orange, and other native user code red. Check the legend for the specific tool you’re using.
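To make the merge step concrete, here is a minimal Python sketch of collapsing sampled stacks and computing each frame’s width. The sample stacks and function names are invented for illustration; real collapsers such as stackcollapse-perf.pl do the same job on actual profiler output.

```python
from collections import Counter

# Hypothetical samples: each tuple is one captured stack, root first, leaf last.
samples = [
    ("main", "handle", "parse"),
    ("main", "handle", "parse"),
    ("main", "handle", "render"),
    ("main", "idle_loop"),
]

def collapse(stacks):
    """Merge identical stacks and count how often each one was sampled."""
    return Counter(";".join(stack) for stack in stacks)

def frame_widths(folded):
    """Return each frame's width as a fraction of all samples.

    A frame is identified by its full path from the root, so the same
    function reached through two call paths yields two separate frames.
    """
    total = sum(folded.values())
    widths = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        for depth in range(1, len(frames) + 1):
            widths[";".join(frames[:depth])] += count
    return {path: count / total for path, count in widths.items()}

folded = collapse(samples)
widths = frame_widths(folded)
print(widths["main;handle"])  # 0.75
```

The "a;b;c count" strings produced by collapse are the folded format that flamegraph.pl consumes, which is why the X axis carries no time ordering: only the counts survive the merge.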
Generating a Flame Graph
Different languages have different profilers. Here’s what actually works in production:
Linux C/C++ or any native process: perf
# Record CPU samples for PID 12345 for 30 seconds
sudo perf record -F 99 -p 12345 -g -- sleep 30
# Generate perf script output
sudo perf script > out.perf
# Clone Brendan's tools and generate the graph
git clone https://github.com/brendangregg/FlameGraph
./FlameGraph/stackcollapse-perf.pl out.perf | ./FlameGraph/flamegraph.pl > cpu.svg
Open cpu.svg in a browser. It’s interactive — click frames to zoom in.
Java: async-profiler
async-profiler is the right tool here. JVM stack sampling from the JVM Tools Interface is notoriously inaccurate (safe-point bias). async-profiler bypasses that.
# Download from https://github.com/async-profiler/async-profiler/releases
tar -xzf async-profiler-*.tar.gz
cd async-profiler-*
# Profile PID 12345 for 60 seconds, output a flame graph directly
./bin/asprof -d 60 -f /tmp/cpu.html 12345
The -f /tmp/cpu.html flag generates a self-contained interactive flame graph. No postprocessing needed.
Python: py-spy
pip install py-spy
# Record to a flamegraph SVG
sudo py-spy record -o profile.svg --pid 12345 --duration 30
py-spy samples the Python interpreter without modifying the target process. Works on running production instances.
Go: pprof
Go has first-class profiling built in. Add this to your HTTP server:
import _ "net/http/pprof"
// Registers the /debug/pprof/ handlers on http.DefaultServeMux
// (a server that uses its own mux has to register them explicitly)
Then:
# Capture 30-second CPU profile
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"
The built-in pprof UI renders flame graphs natively. Hit the "Flame Graph" view in the dropdown.
How to Read It: The Rules
Before you look for the bottleneck, internalize these rules. They’ll save you from chasing the wrong thing.
Rule 1: Start at the widest frames at the top of a plateau
The top of the flame is where CPU was actually spending time. Look for the widest frames near the top of any column. A wide frame at the top means that function was on-CPU without calling anything deeper — or it called things, but kept coming back to itself.
A wide frame with narrow children is doing its own work. A wide frame whose children are together just as wide is a pass-through: the children are the real cost.
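The difference between a function doing its own work and a pass-through is the difference between self time and total time. A rough Python sketch, assuming folded-stack input in the "a;b;c count" format and made-up function names:

```python
from collections import Counter

def parse_folded(text):
    """Yield (frames, count) for folded-stack lines like 'a;b;c 42'."""
    for line in text.strip().splitlines():
        stack, _, count = line.rpartition(" ")
        yield stack.split(";"), int(count)

def self_and_total(text):
    """Self = samples where the function is the leaf; total = samples where it appears at all."""
    self_time, total_time = Counter(), Counter()
    for frames, count in parse_folded(text):
        self_time[frames[-1]] += count
        for name in set(frames):  # set() avoids double counting recursion
            total_time[name] += count
    return self_time, total_time

# Hypothetical profile
folded = """\
main;handle;parse;memcpy 40
main;handle;parse 10
main;handle;render 30
main;gc 20
"""
self_time, total_time = self_and_total(folded)
for name, count in self_time.most_common(3):
    print(name, "self:", count, "total:", total_time[name])
```

Here parse has a total of 50 samples but only 10 of its own, so it is mostly a pass-through to memcpy, while handle and main have no self time at all.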
Rule 2: Flat tops are gold
A "plateau" — a wide frame with nothing stacked on top — is the clearest signal. That function is the leaf. It’s what the CPU is actually executing. Start there.
Rule 3: Tall, narrow spikes mean deep recursion or call chains — not hotspots
A tall spike that’s also narrow means a deep call stack that doesn’t show up in many samples. It’s interesting architecturally, but it’s not your CPU bottleneck. Don’t get distracted by spikes.
Rule 4: Multiple wide towers? Pick the widest first
If you have several broad columns, rank them by width. Address the widest column first. Always. Don’t jump around — the widest column is the highest-ROI fix.
Rule 5: Recognize idle patterns
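You’ll often see epoll_wait, futex, pthread_cond_wait, select, nanosleep showing up wide. These are blocking/idle system calls: the process is waiting, not burning CPU. An on-CPU sampler only catches a thread while it is running, so blocked time should barely register there. If these frames dominate, you are most likely looking at a wall-clock (idle-inclusive) profile, or at a process that is not CPU-bound at all. Either way, this is a latency or I/O problem, not a CPU problem. Re-diagnose.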
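A quick sanity check along these lines can be scripted. This sketch assumes folded-stack input and exact leaf names; real symbol names vary by libc and kernel version (prefixes, suffixes, kernel annotations), so treat the set below as a starting point rather than a complete list.

```python
# Leaf frames treated as "waiting", taken from the list above.
IDLE_LEAVES = {"epoll_wait", "futex", "pthread_cond_wait", "select", "nanosleep"}

def idle_share(folded_text, idle=IDLE_LEAVES):
    """Fraction of samples whose leaf frame is a known blocking call."""
    total = idle_count = 0
    for line in folded_text.strip().splitlines():
        stack, _, count = line.rpartition(" ")
        n = int(count)
        total += n
        if stack.split(";")[-1] in idle:
            idle_count += n
    return idle_count / total if total else 0.0

# Hypothetical profile
profile = """\
main;event_loop;epoll_wait 70
main;worker;hash_block 30
"""
print(idle_share(profile))  # 0.7
```

A share this high suggests the capture is dominated by waiting, so the next step is off-CPU or latency analysis rather than hunting for hot functions.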
Where to Look First: A Practical Checklist
When CPU is the bottleneck and you’ve got a flame graph open, run through this in order:
1. Find the widest plateau
Scan the top layer of frames. Find the widest flat top. That’s your first suspect. What is it doing? Is it parsing, serializing, copying memory, doing math, iterating a loop?
2. Trace the call path down
Click the frame (in interactive SVGs) to zoom in. Read the call chain from bottom to top: main → http_handler → process_request → json_decode → [your wide frame]. This tells you which code path is triggering the bottleneck. Sometimes the fix isn’t in the hot function — it’s in reducing how often the hot function gets called.
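The same question, "which paths lead to this frame?", can be answered from folded stacks without the UI. A small sketch, with hypothetical function names and assuming "a;b;c count" input:

```python
from collections import Counter

def paths_to(folded_text, hot):
    """Rank the call paths (root through `hot`) by the samples that pass through them."""
    paths = Counter()
    for line in folded_text.strip().splitlines():
        stack, _, count = line.rpartition(" ")
        frames = stack.split(";")
        if hot in frames:
            prefix = frames[: frames.index(hot) + 1]
            paths[" -> ".join(prefix)] += int(count)
    return paths.most_common()

# Hypothetical profile
profile = """\
main;http_handler;process_request;json_decode 60
main;http_handler;process_request;json_decode;utf8_check 15
main;http_handler;health_check;json_decode 5
main;worker;compress 20
"""
for path, count in paths_to(profile, "json_decode"):
    print(count, path)
```

If one path accounts for nearly all the samples, as process_request does here, reducing how often that caller invokes the hot function may pay off more than optimizing the function itself.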
3. Look for recognizable anti-patterns
- malloc/free/GC_collect eating a significant slice → you’re allocating too much. Look at the callers below it.
- memcpy/memmove wide at the top → large buffer copies in a hot path. Check for unnecessary copies.
- regex or PCRE functions → someone’s compiling a regex per request instead of pre-compiling it.
- JSON decode/encode in a tight loop → maybe it should be pre-parsed or cached.
- Lock contention (pthread_mutex_lock, __lll_lock_wait) → concurrency problem, not a CPU problem per se. The threads are spinning or queuing.
- String operations (strlen, sprintf, strcat) in hot loops → classic C-level waste.
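As an illustration of the regex item, here is what hoisting the compile step looks like in Python. The log format and function names are invented for the example. Python’s re module keeps an internal cache of compiled patterns, so the gap is smaller here than in languages that recompile on every call, but the per-call lookup still shows up in hot paths.

```python
import re

# Hypothetical log line format: "<date> <level> <message>"
LOG_LINE = r"^(\d{4}-\d{2}-\d{2}) (\w+) (.*)$"

def parse_per_call(line):
    # Passes the pattern string on every call; the pattern has to be
    # looked up (or compiled on a cache miss) each time.
    return re.match(LOG_LINE, line)

# Compile once at import time and reuse the compiled object.
_LOG_RE = re.compile(LOG_LINE)

def parse_precompiled(line):
    return _LOG_RE.match(line)

match = parse_precompiled("2024-01-02 INFO service started")
print(match.groups())
```

Both functions return the same result; only the amount of work done per call differs.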
4. Check the kernel/user boundary
In perf flame graphs with kernel frames included, watch the proportion of kernel vs. userspace. If a large share of your flame (say, more than 30%) is kernel frames (sys_read, copy_to_user, do_page_fault), you have a syscall-heavy or page-fault-heavy workload. The optimization is in reducing syscalls or batching I/O (or, for page faults, allocating and touching less new memory), not in tuning the userspace code itself.
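The kernel share can be estimated from folded stacks. This sketch assumes the stacks were collapsed with stackcollapse-perf.pl --kernel, which marks kernel frames with a _[k] suffix; if your collapser uses a different marker, adjust the suffix parameter. The sample profile is made up.

```python
def kernel_share(folded_text, suffix="_[k]"):
    """Fraction of samples whose leaf frame carries the kernel marker."""
    total = kernel = 0
    for line in folded_text.strip().splitlines():
        stack, _, count = line.rpartition(" ")
        n = int(count)
        total += n
        if stack.split(";")[-1].endswith(suffix):
            kernel += n
    return kernel / total if total else 0.0

# Hypothetical profile
profile = """\
app;read_file;read;sys_read_[k];copy_to_user_[k] 45
app;parse_record 55
"""
print(kernel_share(profile))  # 0.45
```

Only the leaf is checked because it identifies where the CPU was when the sample fired; a userspace leaf with kernel frames below it would not occur in a normal stack.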
5. Look for duplicate subtrees
Sometimes the same expensive call subtree appears in multiple places. This means the same expensive operation is being triggered from multiple call paths. Caching or de-duplication at a higher level is usually the fix.
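Duplicate subtrees are easy to miss by eye because the copies sit in different columns. One way to surface them, sketched here with hypothetical function names and folded-stack input, is to group each function’s samples by its immediate caller:

```python
from collections import Counter, defaultdict

def callers_of(folded_text):
    """Map each function to a Counter of {immediate caller: samples}."""
    callers = defaultdict(Counter)
    for line in folded_text.strip().splitlines():
        stack, _, count = line.rpartition(" ")
        frames = stack.split(";")
        for parent, child in zip(frames, frames[1:]):
            callers[child][parent] += int(count)
    return callers

def shared_hotspots(folded_text, min_callers=2):
    """Functions reached from several distinct callers, largest combined cost first."""
    result = []
    for func, parents in callers_of(folded_text).items():
        if len(parents) >= min_callers:
            result.append((func, sum(parents.values()), dict(parents)))
    return sorted(result, key=lambda item: item[1], reverse=True)

# Hypothetical profile
profile = """\
main;list_users;load_config;parse_yaml 25
main;get_user;load_config;parse_yaml 20
main;get_user;query_db 30
main;health 5
"""
for func, total, parents in shared_hotspots(profile):
    print(func, total, parents)
```

In this made-up profile load_config costs 45 samples in total, split across two callers, which is more than either copy suggests on its own. The sketch ignores recursion, which would inflate the counts for self-calling functions.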
Gotchas
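Inlining and missing frame pointers distort stacks. Compilers inline small functions, so with aggressive optimization (-O2, -O3) you’ll see frames disappear: the inlined function’s time is attributed to its caller, and the hot frame can look like a callee of something that logically shouldn’t be calling it. Separately, optimized builds often omit frame pointers, and without them perf can’t walk the stack correctly, so you get broken or truncated traces. Add -fno-omit-frame-pointer to your compile flags when profiling, or record with perf record --call-graph dwarf to unwind from debug info instead.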
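For the JVM: if you profile with perf, start the JVM with -XX:+PreserveFramePointer so JIT-compiled frames can be walked. async-profiler does not need that flag; its documentation instead recommends -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints for more accurate attribution of inlined code. -Djdk.attach.allowAttachSelf=true only matters when an application attaches the profiler to its own JVM.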
Java safe-point bias poisons JVM stack traces. The built-in jstack and many JVM profilers only sample at safe points — when the JVM decides it’s safe to stop threads. This means certain code paths are systematically never sampled. async-profiler uses signals to interrupt threads at arbitrary points, giving you an honest picture. If you’re on JVM and not using async-profiler, your data is suspect.
Short-lived processes. If the CPU spike is in a process that lives for two seconds, your 30-second perf record catches it only during those two seconds. Low sample count = noisy graph. Either increase frequency (-F 999) or trigger the workload repeatedly to accumulate samples.
Kernel version and perf events. Some CPU events require PMU hardware support. On VMs and cloud instances, the hypervisor may not expose hardware counters. If you’re getting zero results or "not supported" errors, fall back to the software clock event with perf record -e cpu-clock. Note that -e cycles:u (userspace cycles only) is still a hardware event: it helps with permission errors on kernel sampling, but not with a missing PMU.
Interpreting percentage in the frame label. The percentage shown in a frame is its share of total samples, not wall clock time. If you’re sampling one fully busy thread at 99 Hz for 30 seconds, you have ~2970 samples (more if several threads are on-CPU at once). A 10% frame appeared in ~297 of those. Keep this in mind when comparing across different duration profiles.
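The arithmetic is simple enough to keep as a helper when comparing profiles of different lengths. A minimal sketch; the busy_threads parameter is an assumption added here to account for multi-threaded targets:

```python
def expected_samples(hz, seconds, busy_threads=1):
    """Upper bound on samples: one per tick per thread that stays on-CPU."""
    return hz * seconds * busy_threads

def samples_in_frame(percent, total_samples):
    """Convert a frame's percentage label back into a sample count."""
    return round(total_samples * percent / 100)

total = expected_samples(99, 30)
print(total, samples_in_frame(10, total))  # 2970 297

# The same 10% label in a 60-second profile represents twice as many samples.
print(samples_in_frame(10, expected_samples(99, 60)))  # 594
```

A frame backed by a few hundred samples is reasonably stable; a frame backed by five or ten is mostly noise, whatever its percentage says.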
Mixed-mode JVM frames. When a Java method calls native code, you get a discontinuity in the stack trace. async-profiler handles this well; other tools often show a broken chain. If your call stacks look truncated at native boundaries, async-profiler is probably your fix.
Production-Ready Practices
Profile the right workload. Capturing a flame graph while the server is idle gives you garbage data. Trigger representative load first — a load test, replayed production traffic, or a synthetic benchmark that mirrors real usage patterns. Then capture.
Don’t profile in debug mode. Debug builds skip optimizations and add overhead. The flame graph of a debug build is fiction. Always profile with the same build flags used in production.
Use a sampling frequency of 97-99 Hz, not higher. Higher sampling frequency doesn’t necessarily give more accurate data: it introduces more overhead and can disturb cache behavior. 99 Hz is the convention for a reason: it sits just off 100 Hz, which avoids sampling in lockstep with periodic application behaviors at common intervals like 100 Hz timers. Raise the rate only for special cases such as very short-lived processes.
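The lockstep problem is easy to reproduce in a toy simulation. The workload below is hypothetical: a task that is busy for the first 2 ms of every 10 ms period, so its true CPU share is 20%. Exact fractions are used to keep floating-point drift out of the picture.

```python
from fractions import Fraction

def observed_busy_fraction(sample_hz, n_samples,
                           period=Fraction(1, 100),   # 10 ms, i.e. a 100 Hz timer
                           busy=Fraction(2, 1000)):   # busy for the first 2 ms
    """Share of samples that land inside the busy window of a periodic task."""
    hits = 0
    for k in range(n_samples):
        t = Fraction(k, sample_hz)   # time of the k-th sample in seconds
        if t % period < busy:        # position within the current period
            hits += 1
    return hits / n_samples

print(observed_busy_fraction(100, 1000))           # 1.0
print(round(observed_busy_fraction(99, 990), 3))   # 0.202
```

At exactly 100 Hz every sample lands on the same phase of the period, so the profiler reports 100% busy (or 0%, with a different starting offset). At 99 Hz the sample point drifts through the period and the estimate converges on the true 20%.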
Keep raw data. Save your out.perf or .jfr files. Flame graphs are lossy — once you’ve collapsed and rendered, you lose the raw stacks. The raw data lets you re-analyze later, filter to specific threads, or compare two profiles programmatically.
Compare before/after, not just one snapshot. A single flame graph tells you where time is going now. It doesn’t tell you if that’s better or worse than before your change. Use tools like difffolded.pl from the FlameGraph repo to generate differential flame graphs — red frames got slower, blue frames got faster.
./FlameGraph/difffolded.pl out.before.folded out.after.folded | ./FlameGraph/flamegraph.pl > diff.svg
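For a programmatic comparison, the core idea can be sketched in a few lines of Python. This is an illustration of the concept, not a reimplementation of difffolded.pl: it normalizes each profile by its own sample total (so captures of different lengths stay comparable) and assumes both inputs are non-empty folded-stack text with made-up stacks.

```python
from collections import Counter

def load_folded(text):
    """Parse folded-stack lines like 'a;b;c 42' into a Counter keyed by stack."""
    counts = Counter()
    for line in text.strip().splitlines():
        stack, _, count = line.rpartition(" ")
        counts[stack] += int(count)
    return counts

def diff_folded(before_text, after_text):
    """Per-stack change in share of samples (after minus before), largest change first."""
    before, after = load_folded(before_text), load_folded(after_text)
    before_total, after_total = sum(before.values()), sum(after.values())
    deltas = {
        stack: after[stack] / after_total - before[stack] / before_total
        for stack in before.keys() | after.keys()
    }
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical before/after profiles
before = "main;handle;json_marshal 60\nmain;handle;query 40\n"
after = "main;handle;json_marshal 20\nmain;handle;query 80\n"
for stack, delta in diff_folded(before, after):
    print(f"{delta:+.0%} {stack}")
```

A negative delta means the stack takes a smaller share of samples after the change. Shares are relative, so a stack can show a positive delta simply because something else shrank; check absolute sample counts before drawing conclusions.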
Continuous profiling in production. Tools like Pyroscope (https://github.com/grafana/pyroscope) or Parca run as sidecars and continuously profile processes with low overhead (~1-2% CPU). You stop having to reproduce performance issues — they’re already captured the moment the incident happens. This is worth setting up if CPU profiling is a recurring task.
Know your profiling overhead. async-profiler at 100 Hz adds roughly 1-3% overhead. perf record -F 99 adds 1-5% depending on workload. py-spy adds around 1% at 100 Hz. None of these should stop you from profiling production, but be aware: in extremely latency-sensitive paths, even this overhead can change what the profiler sees.
A Real Example: Tracking Down Unexpected Serialization Cost
Here’s how this looks in practice. A Go HTTP service is pegged at high CPU under moderate load. The team assumes it’s the database query. They increase connection pool sizes and add read replicas. No change.
They run go tool pprof and generate a flame graph. The widest plateau isn’t database code. It’s encoding/json.Marshal being called with a struct that contains a 400-field configuration object on every single HTTP response. The full config was being serialized into every API response as debug metadata — something someone added during development and never removed.
Two lines changed. CPU dropped 40%.
The flame graph didn’t solve the problem — but it pointed at exactly the right function in 30 seconds, in a codebase with hundreds of thousands of lines.
Quick Reference
| Situation | What to look for |
|---|---|
| CPU high, not sure where | Widest plateau frame |
| Multiple hotspots | Address widest column first |
| GC/allocator frames wide | Allocation in hot path: check the callers below it |
| epoll_wait / futex dominate | Wrong problem type: check I/O/latency |
| Kernel frames > 30% | Syscall-heavy — look at I/O patterns |
| Narrow tall spikes | Deep call chains — not the CPU bottleneck |
| Same subtree in multiple places | Cache the result higher up |
Flame graphs are not magic. They’re a translation of raw stack sample data into something human eyes can process. Once you know the three reading rules (width = frequency, height = depth, flat top = leaf), you stop being impressed by the visualization and start using it as a debugging tool.
The first time you spot a 30%-wide sprintf inside a loop that runs 50,000 times per second, you’ll understand why engineers who profile regularly get a lot done.