Memory Profiling Rust with heaptrack and dhat: Stop Guessing, Start Measuring

Rust gives you memory safety. What it doesn’t give you for free is memory efficiency. Your program won’t use-after-free or double-free, sure — but it can absolutely allocate gigabytes of heap it doesn’t need, hold allocations alive ten times longer than necessary, or hammer the allocator with millions of tiny short-lived objects that murder your cache and latency.

The ownership model prevents a class of bugs. It doesn’t prevent you from designing a bad data pipeline that clones every string on the hot path.

This article is about finding those problems before your ops team finds them at 3 AM. We’ll cover two tools that complement each other well: heaptrack, an external Linux profiler that requires zero code changes, and dhat, a Rust-native allocator wrapper that gives you precise per-callsite attribution. Both are free, both are battle-tested, and together they cover almost every memory profiling scenario you’ll encounter.


The Rust Memory Problem Nobody Talks About

Most Rust tutorials skip memory profiling. The implicit assumption is "you’ve got the borrow checker, you’re fine." That’s half true.

Heap fragmentation, excessive cloning, unbounded caches, Arc refcount churn, over-allocation in Vec::with_capacity — none of these are safety issues, but all of them will tank your production service. I’ve debugged a Rust service that was holding 4x its expected working set in heap because a middleware layer was collect()-ing full request bodies into Vec<u8> on every request and the allocator wasn’t returning the pages fast enough under load.

Valgrind’s massif works but is painfully slow (100–200x overhead). Instruments on macOS is decent but Mac-only. perf tells you about CPU, not allocations. For Linux-first Rust services — which is most production Rust — heaptrack and dhat are the right tools.


Tool 1: heaptrack

GitHub: https://github.com/KDE/heaptrack

heaptrack intercepts every malloc, free, realloc, and friends via LD_PRELOAD. It records the full call stack at each allocation point and dumps everything to a compressed trace file. No recompilation, no code changes. You point it at any binary and it just works.

Installing heaptrack

On Debian/Ubuntu:

sudo apt install heaptrack heaptrack-gui

On Arch:

sudo pacman -S heaptrack

If your distro’s package is ancient, build from source — it’s a straightforward CMake project. The GUI (heaptrack_gui) is a Qt application and optional; heaptrack_print gives you terminal-friendly output.

Building your Rust binary with debug symbols

heaptrack reads symbol tables. A release build with stripped symbols gives you useless output like 0x7f3a...+0x12. You want symbols but still want optimization, so build with:

# Cargo.toml
[profile.release]
debug = 1          # limited debug info (no type/variable info), modest size impact
opt-level = 3

Or use a dedicated profiling profile:

[profile.profiling]
inherits = "release"
debug = 2          # full debug info

Build with cargo build --profile profiling and point heaptrack at the result.

Running heaptrack

# Basic usage — wraps your binary, outputs a .zst trace file
heaptrack ./target/profiling/my_service --some-args

# Redirect stdin/stdout normally
heaptrack ./target/profiling/my_service < input.json > /dev/null

# Attach to a running process by PID (useful for services)
heaptrack --pid 12345

When the process exits (or you Ctrl-C), heaptrack writes something like heaptrack.my_service.12345.zst.

Analyzing the trace

# Terminal summary — heap peaks, total allocations, top callsites
heaptrack_print heaptrack.my_service.12345.zst

# GUI — flamegraphs, timeline, allocation maps
heaptrack_gui heaptrack.my_service.12345.zst

The GUI is genuinely useful here. The "Allocations" flamegraph shows which call chains allocated the most total bytes. The "Peak" view shows what was alive simultaneously at the worst moment. Those are different questions with different answers — make sure you look at both.

A typical heaptrack terminal output looks like:

total runtime: 4.23s
calls to allocation functions: 1,847,392 (436,500/s)
temporary allocations: 1,203,100 (65.1%)
peak heap memory consumption: 142.3 MiB

That "temporary allocations" number is the one that hurts performance. heaptrack counts an allocation as temporary when it is freed before any other allocation happens, so here 65% of your allocations were created and thrown away almost immediately. That’s allocator pressure; look at those callsites.
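To make the temporary-allocation pattern concrete, here is a small sketch. The function names and the "user:" key format are illustrative assumptions, not taken from any real service. The first version allocates a fresh String per item and drops it straight away; the second produces the same result from one reused buffer.

```rust
use std::fmt::Write;

/// Allocates a fresh String per item and drops it at the end of each
/// iteration: the shape that inflates a temporary-allocation count.
fn total_key_len_temporary(ids: &[u32]) -> usize {
    let mut total = 0;
    for id in ids {
        let key = format!("user:{id}"); // new heap allocation every iteration
        total += key.len();
    }
    total
}

/// Same result, but one buffer is reused across iterations.
fn total_key_len_reused(ids: &[u32]) -> usize {
    let mut total = 0;
    let mut key = String::with_capacity(16); // single allocation up front
    for id in ids {
        key.clear(); // keeps the capacity, frees nothing
        write!(key, "user:{id}").expect("writing to a String cannot fail");
        total += key.len();
    }
    total
}

fn main() {
    let ids: Vec<u32> = (0..1000).collect();
    assert_eq!(total_key_len_temporary(&ids), total_key_len_reused(&ids));
    println!("total key length: {}", total_key_len_reused(&ids));
}
```

Whether the rewrite is worth it depends on how hot the loop is; the profiler output is what tells you that.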

Gotcha: LD_PRELOAD and statically linked binaries

heaptrack relies on the dynamic linker to load its interception library ahead of libc. A fully static binary (RUSTFLAGS="-C target-feature=+crt-static") never goes through the dynamic linker, so the LD_PRELOAD hook never fires and heaptrack cannot see the allocations. You’ll get a mostly empty trace.

The fix: profile against the dynamically linked build (the default on Linux), or use dhat (covered next) which operates at the Rust allocator level and doesn’t have this limitation.

Gotcha: jemalloc vs system allocator

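If your service uses tikv-jemallocator or mimalloc as its global allocator, Rust allocations typically go straight to that allocator instead of through the libc malloc symbols that heaptrack hooks. heaptrack then sees only what still passes through libc (C dependencies, for example), so you get partial data. For accurate profiling with jemalloc, use jemalloc’s own profiling facility. dhat can still give you per-callsite attribution, but only by swapping in its own wrapper around the system allocator for the profiling build.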


Tool 2: dhat

GitHub: https://github.com/nnethercote/dhat-rs
Docs: https://docs.rs/dhat

dhat is a Rust crate that wraps the global allocator and instruments every heap operation at the language level. It requires minor code changes but gives you precise per-callsite attribution. Note that dhat::Alloc wraps the system allocator, so a profiling build replaces jemalloc or mimalloc while the feature is enabled. Its output is read with the DHAT viewer, a standalone HTML page you open in any browser, no installation needed.

The crate was written by Nicholas Nethercote, who also overhauled Valgrind’s DHAT and wrote its viewer. It’s the right tool when you need to know exactly which line of Rust code is responsible for an allocation pattern.

Adding dhat to your project

# Cargo.toml
[dependencies]
dhat = { version = "0.3", optional = true }

[features]
dhat-heap = ["dhat"]

Using a feature flag is the correct pattern here. You don’t want dhat overhead in production binaries: it captures backtraces for allocations and can slow an allocation-heavy program down several times over. The feature flag lets you compile it in only when profiling.

Wiring up the allocator

// src/main.rs

// Only compile in dhat instrumentation when the feature is enabled
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    // Create the profiler first thing in main: allocations made before it
    // exists are not recorded. dhat::Profiler is RAII: it writes the report when dropped
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();

    // Your real main logic here
    run_application();
}

That’s the minimal setup. When _profiler drops at the end of main, dhat writes dhat-heap.json to the current directory.
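The report-on-drop behaviour is plain RAII. The sketch below uses a hypothetical ReportGuard type (it is not part of dhat) to show why the report appears only when the guard goes out of scope normally, and why anything that skips destructors, such as std::process::exit or a kill signal, leaves you with no output. std::mem::forget stands in for those cases here.

```rust
use std::cell::Cell;

/// Hypothetical guard (not part of dhat) that does its work in `Drop`.
struct ReportGuard<'a> {
    written: &'a Cell<bool>,
}

impl Drop for ReportGuard<'_> {
    fn drop(&mut self) {
        // A real profiler would serialize its report here.
        self.written.set(true);
    }
}

/// Returns (written while the guard was alive, written after scope exit).
fn report_on_scope_exit() -> (bool, bool) {
    let written = Cell::new(false);
    let during;
    {
        // Bind to a named variable; `let _ = ...` would drop immediately.
        let _guard = ReportGuard { written: &written };
        during = written.get();
    } // `_guard` is dropped here
    (during, written.get())
}

/// Simulates a skipped destructor: the "report" is never written.
fn report_when_drop_skipped() -> bool {
    let written = Cell::new(false);
    let guard = ReportGuard { written: &written };
    std::mem::forget(guard); // destructor never runs
    written.get()
}

fn main() {
    assert_eq!(report_on_scope_exit(), (false, true));
    assert!(!report_when_drop_skipped());
    println!("guard behaved as expected");
}
```

The practical consequence is the naming: keep the profiler in a variable such as _profiler, not in a bare underscore pattern.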

Running with dhat enabled

cargo run --features dhat-heap -- --your-args

Or for a release-optimized profile with full debug symbols:

cargo run --profile profiling --features dhat-heap -- --your-args

Reading the output

Open the DHAT viewer. It is the same dh_view.html that ships with Valgrind, so if Valgrind is installed you can look for a local copy:

# Locate a dh_view.html installed alongside Valgrind (path varies by distro)
find /usr -name "dh_view.html" 2>/dev/null | head -1

Or just use the hosted copy at https://nnethercote.github.io/dh_view/dh_view.html. Load your dhat-heap.json into it.

The viewer shows a tree of allocation call stacks with metrics for each node:

  • Total bytes — all bytes ever allocated at this site
  • Total blocks — number of individual allocations
  • At t-gmax — bytes alive at peak heap moment
  • At t-end — bytes still alive when the program exited (leaks)

That last column, "at t-end," is your memory leak detector. Long-lived caches and statics will legitimately appear there, but if something shows up that shouldn’t, you’ve found your leak’s origin.

Gotcha: dhat skews allocation patterns

dhat records a backtrace and bookkeeping data for every allocation, and it routes everything through the system allocator. That changes timing and allocator behavior relative to production, and the byte counts it reports are the sizes your code requested, not what the allocator actually reserved. The relative ranking of callsites is reliable; the absolute numbers are not. Don’t compare dhat output numbers directly against production RSS metrics. Use it for ranking and attribution, not absolute measurement.

Gotcha: short-lived programs and async runtimes

Tokio and async-std spawn background threads that perform allocations outside your main() scope. If your program exits too quickly, dhat may not capture allocations from tasks that were still running. For async services, keep the profiler alive for a meaningful workload window, then initiate a graceful shutdown and let main() return normally. SIGKILL, an unhandled SIGINT or SIGTERM, and std::process::exit all skip the Drop impl, and you’ll get no output.

For async services you often want to profile under load, not during startup. Wire up a signal handler or HTTP endpoint that triggers graceful shutdown after N requests or N seconds:

use std::time::Duration;

#[tokio::main]
async fn main() {
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();

    // Stand-in for serving traffic: run for 60 seconds, then return
    tokio::time::sleep(Duration::from_secs(60)).await;
    // _profiler drops when main returns, report written
}

A Real Example: Finding String Clone Bloat

Here’s a minimal reproduction of a common pattern I’ve seen in production Rust services: a per-request struct that clones configuration strings it could borrow instead.

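// Hypothetical stand-ins so the snippet compiles on its own
struct AppConfig {
    service_name: String,
    region: String,
    tags: Vec<String>,
}
struct RawRequest;
struct Response;

#[derive(Clone)]
struct RequestContext {
    // These come from a long-lived config: cloning is wasteful
    service_name: String,
    region: String,
    tags: Vec<String>,
}

// Placeholder handler; real logic would read ctx and req
fn handle_request(_ctx: RequestContext, _req: &RawRequest) -> Response {
    Response
}

fn process_requests(config: &AppConfig, requests: &[RawRequest]) -> Vec<Response> {
    requests
        .iter()
        .map(|req| {
            // Cloning config fields into every RequestContext
            let ctx = RequestContext {
                service_name: config.service_name.clone(),
                region: config.region.clone(),
                tags: config.tags.clone(),
            };
            handle_request(ctx, req)
        })
        .collect()
}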

Run this under dhat processing 100k requests and you’ll see String::clone and Vec::clone near the top of your total-bytes callstack. The fix — using Arc<str> or &str with appropriate lifetimes — eliminates those allocations entirely. The dhat output makes the problem obvious; without it you’re reading code guessing where the pressure is.
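A minimal sketch of the Arc-based variant, assuming the config can store its fields as Arc<str> and Arc<[String]> from the start. The field values and the from_config helper are illustrative. Cloning an Arc bumps a reference count instead of copying bytes, so each context shares the config’s buffers.

```rust
use std::sync::Arc;

// Hypothetical long-lived config; fields are shared, immutable data.
struct AppConfig {
    service_name: Arc<str>,
    region: Arc<str>,
    tags: Arc<[String]>,
}

struct RequestContext {
    service_name: Arc<str>,
    region: Arc<str>,
    tags: Arc<[String]>,
}

impl RequestContext {
    /// Builds a context without copying any string data.
    fn from_config(config: &AppConfig) -> Self {
        RequestContext {
            service_name: Arc::clone(&config.service_name),
            region: Arc::clone(&config.region),
            tags: Arc::clone(&config.tags),
        }
    }
}

fn main() {
    let config = AppConfig {
        service_name: Arc::from("checkout"),
        region: Arc::from("eu-west-1"),
        tags: Arc::from(vec!["prod".to_string(), "blue".to_string()]),
    };
    let contexts: Vec<RequestContext> =
        (0..3).map(|_| RequestContext::from_config(&config)).collect();

    // Every context points at the same heap buffer as the config.
    assert!(contexts
        .iter()
        .all(|c| Arc::ptr_eq(&c.service_name, &config.service_name)));
    for c in &contexts {
        println!("{} in {}: {} tags", c.service_name, c.region, c.tags.len());
    }
}
```

Arc clones still cost an atomic increment each, so where lifetimes allow it, borrowing &str from the config avoids even that.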


Workflow: How to Use Both Tools Together

These tools answer different questions at different stages.

Start with heaptrack. Zero code changes, fast iteration. Run your binary under load for 30–60 seconds, look at the flamegraph. You’re looking for unexpected peaks, call chains you didn’t know were hot, and the ratio of temporary to total allocations. This gives you the 10,000-foot view.

Drill into dhat when you have a suspect. Once heaptrack points you at a module or subsystem, add dhat and run a targeted workload. The per-callsite attribution in the DHAT viewer will tell you the exact line responsible. Fix it, re-run, compare the "total bytes" numbers. Iterate.

Confirm with heaptrack again. After your fix, run heaptrack once more and compare peak heap. Did the peak drop? Did temporary allocations decrease? heaptrack can diff two traces directly: both heaptrack_print and heaptrack_gui accept a --diff option naming the baseline trace, which makes before/after comparison easy.


Production-Ready Practices

Never profile debug builds. Debug builds skip most inlining and optimization, so both the number of allocations and the call stacks that lead to them can differ from production. Hot functions that get inlined in release show up as separate call frames in debug. Profile what you ship.

Profile under realistic load. A server that handles 1 request during profiling will look nothing like one handling 10,000 concurrent requests. Use a load generator (wrk, k6, hey) and let the profiler collect data until the service has reached steady state. heaptrack’s timeline view will show you if you’re still in a transient startup phase.

Watch for allocator amplification. If you’re seeing high RSS but low heap bytes in profiling, your allocator is holding onto pages it’s not using. Allocators differ a lot in how eagerly they return memory to the OS and how they fragment under a given workload. If you’re on the system allocator and seeing this, it is worth testing tikv-jemallocator or mimalloc; a swap can sometimes solve more than call-site fixes will.
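One rough way to put a number on that gap is to read the process RSS on Linux and compare it with the live heap bytes your profiler reports. The sketch below is an assumption-laden helper, not part of heaptrack or dhat: parse_vm_rss_kib and rss_to_heap_ratio are hypothetical names, and RSS also includes stacks, code, and mapped files, so treat the ratio as a hint only.

```rust
use std::fs;

/// Extracts the `VmRSS` value (in KiB) from the text of /proc/<pid>/status.
fn parse_vm_rss_kib(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|kib| kib.parse().ok())
}

/// Ratio of resident memory to live heap bytes reported by a profiler.
/// A value well above 1.0 suggests memory held outside live allocations.
fn rss_to_heap_ratio(rss_kib: u64, live_heap_bytes: u64) -> f64 {
    (rss_kib * 1024) as f64 / live_heap_bytes as f64
}

fn main() {
    // Linux-only: other platforms have no /proc/self/status.
    let status = fs::read_to_string("/proc/self/status").unwrap_or_default();
    match parse_vm_rss_kib(&status) {
        Some(rss) => println!("current RSS: {rss} KiB"),
        None => println!("VmRSS not available on this platform"),
    }
}
```

Sampling this periodically during a load test and plotting it next to the profiler’s heap numbers shows whether the gap grows over time or stays constant.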

Split profiles in complex systems. dhat allows only one Profiler at a time, but dhat::Profiler::builder().file_name(...) lets you choose the output file. In a service with multiple components, run a targeted workload per component, each writing its own file, and analyze them independently. Mixing all allocation sites into one tree for a large service makes the output hard to interpret.

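Automate regression detection. Instrument a key workload in CI with dhat and assert that its allocation totals don’t regress past a threshold. dhat has a testing mode for this: build the profiler with dhat::Profiler::builder().testing().build(), read dhat::HeapStats::get(), and check fields such as total_blocks and total_bytes with dhat’s assert macros. If you need per-callsite numbers, the dhat JSON output is machine-readable and can be parsed in a test with serde_json. It’s a one-time setup that catches allocation regressions before code review.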
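To show the shape of such a check without any dependencies, here is a sketch built on a hand-rolled counting allocator. It is a stand-in for illustration, not how dhat is implemented: CountingAlloc, measure, and the budget numbers are all assumptions, and the counters are process-wide, so a test like this should run on its own.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and counts what passes through it.
struct CountingAlloc;

static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);
static ALLOC_BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `work` and returns its result plus (allocation calls, bytes requested)
/// observed meanwhile. Other threads allocating will add noise.
fn measure<T>(work: impl FnOnce() -> T) -> (T, usize, usize) {
    let calls_before = ALLOC_CALLS.load(Ordering::Relaxed);
    let bytes_before = ALLOC_BYTES.load(Ordering::Relaxed);
    // black_box keeps the optimizer from eliding the allocation under test.
    let out = std::hint::black_box(work());
    let calls = ALLOC_CALLS.load(Ordering::Relaxed) - calls_before;
    let bytes = ALLOC_BYTES.load(Ordering::Relaxed) - bytes_before;
    (out, calls, bytes)
}

fn main() {
    // Hypothetical workload and budget; take real limits from a known-good run.
    let (buf, calls, bytes) = measure(|| Vec::<u64>::with_capacity(1024));
    assert!(buf.capacity() >= 1024);
    assert!(calls <= 8, "allocation count regressed: {calls}");
    assert!(bytes <= 64 * 1024, "allocated bytes regressed: {bytes}");
    println!("{calls} allocation call(s), {bytes} bytes requested");
}
```

A global counter gives you totals only; for per-callsite budgets you still need the profiler’s own data.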


Gotcha: Stack Depth and Symbol Demangling

Both heaptrack and dhat rely on unwinding the call stack at allocation time. On Linux, Rust binaries use DWARF unwind info, which heaptrack reads correctly. But if your binary was compiled with panic = "abort" and stripped DWARF, or if you’re using a custom panic handler that messes with the stack, you’ll get truncated backtraces.

Make sure Cargo.toml has either debug = 1 or debug = 2 in your profiling profile and that you’re not stripping symbols before profiling. Also, Rust mangles symbol names — both tools handle this, but double-check that your heaptrack output shows std::collections::HashMap::insert and not _ZN3std11collections7hashmap6insert.... If you see mangled names, install rustfilt and pipe through it: heaptrack_print trace.zst | rustfilt.


Quick Reference

Question | Tool
"What’s my peak heap usage and who caused it?" | heaptrack
"Which exact line allocates the most?" | dhat
"Do I have memory leaks?" | dhat (At t-end column)
"Is allocator pressure hurting my throughput?" | heaptrack (temporary allocations %)
"Did my fix actually reduce allocations?" | heaptrack before/after comparison
"I can’t change the binary" | heaptrack

Both tools are available on any modern Linux system, both are production-grade, and the information they surface is genuinely different from what perf, strace, or any CPU profiler will show you. If you’ve been writing Rust services and skipping memory profiling because "the borrow checker handles it" — it’s time to add these to your toolbox. One run of heaptrack on a service you thought was clean will change your mind.
