Stop Guessing Where Your CPU Goes: Continuous Profiling with Parca and eBPF

You have metrics. You have traces. You have logs. You still have no idea why your service eats 80% CPU under moderate load. That gap — between knowing something is slow and knowing where the time actually goes — is what profiling fills.

The problem is that most teams only profile when something is already on fire, using perf top or a one-shot py-spy dump that captures maybe 30 seconds of a problem that took an hour to reproduce. Continuous profiling flips this: you record stack traces all the time, at low overhead, and query them later when you notice the spike. Parca does exactly that, and with its eBPF-based agent, it works on every process on the host without touching a single line of application code.

The official repo is at github.com/parca-dev/parca. The agent lives at github.com/parca-dev/parca-agent.

Why eBPF Changes the Profiling Game

Traditional profilers fall into two buckets. The first kind requires you to instrument the application — add a library, annotate hot paths, compile with frame pointers, rebuild. The second kind attaches to a running process with ptrace or similar, which either slows it to a crawl or requires root and an awkward one-shot workflow.

eBPF is a third option. It lets you run sandboxed programs inside the Linux kernel that react to events (syscalls, perf events, kprobes) without touching userspace code at all. The Parca Agent attaches a BPF program to the kernel’s perf subsystem to collect CPU stack traces across every process at a configurable frequency (default: 19 Hz, a prime number chosen to avoid sampling in lockstep with common timer frequencies). Stacks are unwound using frame pointers or unwind tables derived from DWARF information. The resulting addresses are then symbolized into function names using symbol tables and debug info, and the profiles are shipped to the Parca server.

Overhead is real but small: typically 1–3% CPU on a busy host. That’s the price of always-on visibility.
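To make the sampling model concrete, here is a minimal Python sketch of what a sampling profiler conceptually does with the stacks it captures: identical stacks are counted, and those counts later become flame graph widths. The function name and the sample data are hypothetical, and this is an illustration of the idea rather than the agent's actual code.

```python
from collections import Counter

def aggregate_samples(samples):
    """Count identical stacks; each stack is a tuple ordered root -> leaf."""
    return Counter(tuple(stack) for stack in samples)

# Hypothetical samples, as if captured on successive sampling ticks.
samples = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "render"),
    ("main", "gc"),
]

counts = aggregate_samples(samples)
total = sum(counts.values())
for stack, n in counts.most_common():
    # One line per unique stack, in the common "folded" style.
    print(f"{';'.join(stack)} {n} ({n / total:.0%} of samples)")
```

Nothing here measures time directly. A function that shows up in half the samples is assumed to have used about half the CPU, which is why a low, steady sampling rate is enough.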

Architecture at a Glance

┌─────────────────────────────────────────────────────────┐
│  Linux Host                                             │
│                                                         │
│  ┌──────────────┐      gRPC/OTLP     ┌───────────────┐ │
│  │ Parca Agent  │ ─────────────────► │  Parca Server │ │
│  │ (eBPF)       │                    │  (storage +   │ │
│  └──────────────┘                    │   query UI)   │ │
│        ▲                             └───────────────┘ │
│        │ reads /proc, debug symbols                     │
│   Every process on host                                 │
└─────────────────────────────────────────────────────────┘

Parca Server stores profiles in its own built-in column store (based on FrostDB — think embedded Parquet for profiling data). No external database required for a single-node setup. The UI renders flame graphs directly from stored profiles, and you can query them with a Prometheus-like label selector.

Prerequisites

  • Linux kernel 5.10+ (eBPF CO-RE support; 5.15+ recommended for best symbol resolution)
  • docker and docker compose v2
  • The host must allow BPF: CONFIG_BPF=y, CONFIG_BPF_SYSCALL=y (check with zcat /proc/config.gz | grep CONFIG_BPF)
  • CAP_SYS_ADMIN or unprivileged BPF enabled — the agent container will need --privileged or a specific capability set
  • At least 2 GB RAM for the server (more if you store weeks of data)

Check your kernel quickly:

uname -r
# Should be 5.10.x or higher

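# Falls back to /boot/config-* on distros that do not expose /proc/config.gz
{ zcat /proc/config.gz 2>/dev/null || cat /boot/config-$(uname -r); } | grep -E "^CONFIG_(BPF|BPF_SYSCALL|PERF_EVENTS)="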
# Look for CONFIG_BPF=y, CONFIG_BPF_SYSCALL=y, CONFIG_PERF_EVENTS=y
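
If you want to run the same checks across a fleet, they can be scripted. The following is a minimal sketch, assuming you feed it the output of uname -r and the text of the kernel config; the function names are hypothetical and the minimum version simply mirrors the 5.10 requirement listed above.

```python
import re

MIN_KERNEL = (5, 10)  # from the prerequisites above
REQUIRED = ("CONFIG_BPF", "CONFIG_BPF_SYSCALL", "CONFIG_PERF_EVENTS")

def parse_kernel_release(release):
    """Extract (major, minor) from a string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))

def kernel_is_supported(release, minimum=MIN_KERNEL):
    """Tuple comparison handles 5.4 vs 5.10 correctly, unlike string comparison."""
    return parse_kernel_release(release) >= minimum

def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to 'y' in the config text."""
    enabled = {
        line.split("=", 1)[0]
        for line in config_text.splitlines()
        if line.endswith("=y")
    }
    return [opt for opt in required if opt not in enabled]

print(kernel_is_supported("5.15.0-91-generic"))
print(missing_options("CONFIG_BPF=y\nCONFIG_BPF_SYSCALL=y\n"))
```

On a real host you could pass platform.release() from the standard library as the release string.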

Step 1: Directory Layout

mkdir -p ~/parca/{config,data}
cd ~/parca

Everything lives here. The data/ directory is where Parca Server persists its column store. Mount it as a named volume or a bind mount — either works.

Step 2: Parca Server Configuration

Create config/parca.yaml:

# config/parca.yaml

object_storage:
  bucket:
    type: "FILESYSTEM"
    config:
      directory: "/var/lib/parca"  # maps to bind-mounted ./data inside container

# In-memory working set for the column store, in bytes. This bounds memory,
# not time. Be realistic: profiles are dense. 7 days on a busy host can hit 10+ GB.
storage_active_memory: 536870912   # 512 MiB

# gRPC endpoint that the agent pushes profiles to
grpc_address: "0.0.0.0:7070"

# HTTP endpoint for the web UI and query API
http_address: "0.0.0.0:7070"

Parca uses a single port for both gRPC and HTTP (multiplexed via h2c). That’s not a typo.

Step 3: Docker Compose

# docker-compose.yml

services:

  parca:
    image: ghcr.io/parca-dev/parca:v0.21.0
    container_name: parca
    restart: unless-stopped
    ports:
      - "7070:7070"       # Web UI + gRPC ingestion
    volumes:
      - ./config/parca.yaml:/etc/parca/parca.yaml:ro
      - ./data:/var/lib/parca
    command:
      - /parca
      - --config-path=/etc/parca/parca.yaml
      - --log-level=info
      - --cors-allowed-origins=*   # Tighten this if you put Nginx in front
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:7070/healthy"]
      interval: 15s
      timeout: 5s
      retries: 5

  parca-agent:
    image: ghcr.io/parca-dev/parca-agent:v0.31.0
    container_name: parca-agent
    restart: unless-stopped
    # The agent needs to see the host's processes, kernel, and BPF subsystem.
    # privileged is the simplest path; see Gotchas for a tighter capability set.
    privileged: true
    pid: host           # Required: agent walks /proc on the real host PID namespace
    network_mode: host  # Simplest: agent can reach Parca server on localhost:7070
    volumes:
      - /sys/kernel/debug:/sys/kernel/debug:ro   # debugfs (its standard mount point)
      - /sys/fs/bpf:/sys/fs/bpf              # BPF map pinning (rw)
      - /proc:/proc:ro
      - /usr/lib:/usr/lib:ro                 # Host debug symbols / split DWARF
      - /usr/lib/debug:/usr/lib/debug:ro     # dbgsym packages land here
      - /lib:/lib:ro
      - /tmp:/tmp                            # Agent writes temp BPF objects here
    command:
      - /bin/parca-agent
      - --log-level=info
      - --node=my-host                       # Label added to every profile from this agent
      - --remote-store-address=localhost:7070  # Parca server gRPC address
      - --remote-store-insecure                # Plaintext gRPC (no TLS) for local-only setup
      - --remote-store-insecure-skip-verify    # Only matters once TLS is enabled
      # Every process is profiled by default. If the host runs hundreds of busy
      # processes, lower the sampling frequency below to reduce overhead.
      - --profiling-cpu-sampling-frequency=19
    depends_on:
      parca:
        condition: service_healthy

Start it:

docker compose up -d
docker compose logs -f

Give it 60–90 seconds. The agent needs to load the BPF program, walk the process list, and start collecting. Then open http://your-host:7070.

Step 4: First Flame Graph

In the Parca UI, the query bar at the top accepts label matchers. Start broad:

{}

This returns all profiles across all processes. You’ll see a process selector on the left. Pick your application (it shows up by binary name, e.g., process_name="nginx" or process_name="python3").

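The flame graph renders immediately. Frames at the root of the graph are the entry points of call stacks (main, goroutine schedulers, etc.), and each level further from the root is one call deeper. Long chains mean deep call stacks, which may or may not involve recursion. Your actual hotspots are the wide frames at the leaf end, farthest from the root: those are the functions that were on-CPU in the most samples. Parca draws the root at the top (an icicle layout), so the hotspots sit toward the bottom.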
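
The difference between a wide frame near the root and a wide frame at the leaf end is the difference between total and self time. Here is a small sketch of that calculation over aggregated stacks; the function name and the sample counts are made up for illustration.

```python
from collections import Counter

def self_and_total(stack_counts):
    """Return (self_samples, total_samples) per function.

    stack_counts maps a non-empty root -> leaf tuple of function names
    to the number of samples in which that exact stack was seen.
    """
    self_samples = Counter()
    total_samples = Counter()
    for stack, count in stack_counts.items():
        self_samples[stack[-1]] += count   # the leaf frame was the one on-CPU
        for func in set(stack):            # set() avoids double-counting recursion
            total_samples[func] += count
    return self_samples, total_samples

stack_counts = {
    ("main", "serve", "parse_json"): 60,
    ("main", "serve", "render"): 25,
    ("main", "serve"): 5,
    ("main", "gc"): 10,
}

self_s, total_s = self_and_total(stack_counts)
for func, n in self_s.most_common():
    print(f"{func}: self={n} total={total_s[func]}")
```

In this made-up data, main appears in every sample but never as the leaf, so it is wide and yet not a hotspot. parse_json is where the CPU actually went.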

To compare a time range before vs after a deploy:

  1. Set the time range picker to the 30 minutes around your deploy
  2. Use the "comparison" mode (two flame graphs side by side)
  3. Red = got slower, blue = got faster

That workflow — without touching a single line of application code, without a restart, using data that was already being collected — is the whole point.
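
Conceptually, a before/after comparison boils down to comparing each function's share of samples in the two windows. The sketch below shows that idea on hypothetical numbers. It is not how Parca computes its diff view, just a way to see what the colors represent.

```python
def diff_profiles(before, after):
    """Compare per-function share of samples between two windows.

    before/after map function name -> sample count. Returns a list of
    (function, delta_in_percentage_points), largest regression first.
    """
    total_before = sum(before.values()) or 1
    total_after = sum(after.values()) or 1
    deltas = []
    for func in set(before) | set(after):
        share_before = before.get(func, 0) / total_before
        share_after = after.get(func, 0) / total_after
        deltas.append((func, round((share_after - share_before) * 100, 1)))
    return sorted(deltas, key=lambda item: item[1], reverse=True)

# Hypothetical sample counts from the windows before and after a deploy.
before = {"parse_json": 300, "render": 500, "gc": 200}
after = {"parse_json": 600, "render": 500, "gc": 150, "validate": 250}

result = diff_profiles(before, after)
for func, delta in result:
    print(f"{func}: {delta:+.1f} pp")
```

Shares are used instead of raw counts because the two windows rarely contain the same number of samples. Note that render did not change in absolute terms but its share shrank, which is a reminder to read relative diffs alongside total CPU usage.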

Gotchas

No debug symbols = useless flame graphs. The most common disappointment: you run the agent, you open the UI, and you see stacks full of hex addresses (0x7f3a8c...) instead of function names. This means the binary was stripped. Solutions ranked by pain:

  1. For Go: rebuild without -ldflags="-w -s". Go embeds DWARF by default; -w drops the DWARF data and -s drops the symbol table. Setting CGO_ENABLED=0 is not needed for this. Check with objdump -h binary | grep debug.
  2. For system packages: install *-dbgsym or *-dbg packages from your distro’s debug symbol repo.
  3. For your own compiled C/C++/Rust: don’t strip in production, or use split DWARF (-gsplit-dwarf) and ship the .dwo files alongside the binary.
  4. For containers: the agent reads symbols from the host’s /proc/<pid>/root, which maps into the container filesystem. If your Docker image is scratch or distroless with stripped binaries, symbols won’t resolve without an external debuginfo server.

Kernel version matters more than you think. CO-RE (Compile Once, Run Everywhere) BPF programs work on 5.10+, but frame-pointer unwinding for interpreted runtimes (Python, JVM, Ruby) needs 5.15+ for reliable behavior. If you’re on an older LTS kernel, expect some symbolization gaps.

privileged vs. capability set. Running the agent as privileged: true is convenient but broad. For production, replace it with a tighter capability set:

cap_add:
  - SYS_ADMIN     # BPF program loading
  - SYS_PTRACE    # /proc symbol reading
  - NET_ADMIN     # Some BPF map types
  - PERFMON       # perf_event_open (kernel 5.8+)
  - BPF           # explicit BPF capability (kernel 5.8+)
security_opt:
  - apparmor:unconfined  # AppArmor blocks BPF by default on Ubuntu

Test this carefully — what capabilities you actually need can vary by kernel version.
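
One way to test it is to look at the effective capability mask the container actually received. The sketch below decodes the CapEff line of /proc/<pid>/status against the capabilities listed above. The bit numbers come from the kernel's linux/capability.h, while the helper names are hypothetical.

```python
# Capability bit numbers as defined in linux/capability.h.
CAPS = {
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_ADMIN": 21,
    "CAP_PERFMON": 38,
    "CAP_BPF": 39,
}

def effective_caps(status_text):
    """Parse the CapEff line from /proc/<pid>/status into a set of names."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return {name for name, bit in CAPS.items() if mask & (1 << bit)}
    raise ValueError("no CapEff line found")

def missing_caps(status_text, wanted=tuple(CAPS)):
    """Return the wanted capabilities that are absent, sorted by name."""
    return sorted(set(wanted) - effective_caps(status_text))

# Example mask with only SYS_ADMIN (bit 21) and SYS_PTRACE (bit 19) set.
example = "Name:\tparca-agent\nCapEff:\t0000000000280000\n"
print(missing_caps(example))
```

Inside the agent container you would pass the contents of /proc/self/status instead of the example string. An empty list means all five capabilities are present, though it does not prove that the kernel or an LSM will allow every BPF operation.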

Sampling frequency and CPU contention. At 19 Hz, the agent takes roughly 19 stack samples per second per CPU core. On a 32-core host that is up to about 600 samples per second, and with 200 processes the agent also has more per-process state to track. The overhead is usually fine, but if you’re running at 95% CPU already, you’ll feel it. Drop to 9 Hz with --profiling-cpu-sampling-frequency=9 on hot hosts.
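
The arithmetic behind that trade-off is simple enough to sketch. This assumes the per-core sampling model described above; the function name is hypothetical.

```python
def samples_per_second(frequency_hz, cpu_cores):
    """Upper bound on stack samples per second across the whole host.

    Sampling is per CPU core, so the bound scales with core count and
    frequency; cores that are idle may contribute fewer samples.
    """
    return frequency_hz * cpu_cores

for hz in (19, 9):
    print(f"{hz} Hz on 32 cores: up to {samples_per_second(hz, 32)} samples/s")
```

Going from 19 Hz to 9 Hz roughly halves the number of samples to capture, unwind, and ship, at the cost of coarser data for short-lived spikes.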

Data retention. Parca’s column store is efficient but not magic. A busy 8-core host generating profiles for 50 processes for 7 days can accumulate 15–20 GB. Set retention explicitly:

# In parca.yaml — no built-in TTL yet in OSS version.
# Workaround: use a cron to rm old files or set the active_memory limit
# aggressively and let the store evict old data from its ring buffer.
storage_active_memory: 2147483648  # 2 GiB cap

The OSS server’s retention is primarily memory-bound. Polarsignals (the commercial version) adds time-based retention and object storage backends like S3. For self-hosted, plan your disk accordingly.
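
To plan disk, you can turn the rough figure above into a growth rate. The sketch below uses the 15–20 GB per 7 days estimate for a busy 8-core host purely as an input; the 100 GB of free disk is a hypothetical number, and your own growth rate should come from watching the data directory for a few days.

```python
def daily_growth_gb(total_gb, days):
    """Average growth per day for an observed total over a period."""
    return total_gb / days

def days_until_full(free_gb, growth_gb_per_day):
    """How long the free space lasts at a constant growth rate."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth must be positive")
    return free_gb / growth_gb_per_day

low = daily_growth_gb(15, 7)    # optimistic end of the estimate
high = daily_growth_gb(20, 7)   # pessimistic end of the estimate

free_gb = 100  # hypothetical free space on the data volume
print(f"growth: {low:.2f} to {high:.2f} GB/day")
print(f"disk lasts: {days_until_full(free_gb, high):.0f} to "
      f"{days_until_full(free_gb, low):.0f} days")
```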

SELinux. On RHEL/Rocky/Fedora, SELinux will block BPF syscalls from the container. Either set the agent to --security-opt label=disable or write a proper SELinux policy. Labeling it spc_t (Super Privileged Container type) is the pragmatic solution:

security_opt:
  - label:type:spc_t

Production-Ready Additions

Reverse proxy with authentication. Parca has no built-in auth. Don’t expose port 7070 directly. Put Nginx in front:

server {
    listen 443 ssl http2;   # grpc_pass needs HTTP/2 on the listener
    server_name parca.internal.example.com;

    ssl_certificate     /etc/ssl/certs/internal.crt;
    ssl_certificate_key /etc/ssl/private/internal.key;

    auth_basic "Profiling UI";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        grpc_pass grpc://localhost:7070;  # for gRPC (agent)
    }

    location /ui {
        proxy_pass http://localhost:7070;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Or use Traefik with BasicAuth middleware if you’re already running it.

Labeling by environment. When you run the agent on multiple hosts, add a meaningful --node label and optionally inject environment labels via the agent’s --metadata-external-labels flag:

--node=$(hostname)
--metadata-external-labels=env=production,region=eu-west-1,team=backend

These labels propagate to every profile and become filterable in the UI. Without them, queries across a fleet become guesswork.
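
If you generate agent configuration from a script, it helps to build the query-side matcher from the same label string so the two never drift apart. This is a minimal sketch: the helper names are hypothetical, and it assumes the comma-separated key=value form shown above and the Prometheus-style matcher syntax used in the query bar.

```python
def parse_external_labels(flag_value):
    """Parse 'k1=v1,k2=v2' into a dict, rejecting malformed pairs."""
    labels = {}
    for pair in flag_value.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed label pair: {pair!r}")
        labels[key.strip()] = value.strip()
    return labels

def to_selector(labels):
    """Render a dict as a Prometheus-style label matcher, keys sorted."""
    body = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return "{" + body + "}"

labels = parse_external_labels("env=production,region=eu-west-1,team=backend")
print(to_selector(labels))
print(to_selector({"env": labels["env"], "node": "my-host"}))
```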

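Alerting on profile regressions. Parca exposes a Prometheus-compatible /metrics endpoint on the same port, and the agent serves its own /metrics on its HTTP port (7071 by default). Scrape both with Prometheus and alert on parca_agent_profiling_unwind_table_build_errors_total spiking (unwinding failures) or parca_profilestore_ingested_samples_total dropping to zero (agent stopped sending). Metric names change between releases, so confirm them against the /metrics output of the versions you actually run.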
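
Before writing alert rules, it is worth checking by hand that the counter you plan to alert on exists and moves. The sketch below reads one counter out of Prometheus text exposition output and flags a stall between two scrapes. The metric name is the one mentioned above and is not verified against any particular release; the helper names and the sample payload are hypothetical.

```python
def read_counter(metrics_text, name):
    """Sum all series of one metric in Prometheus text exposition format.

    Assumes lines look like 'name{labels} value' with no trailing
    timestamp. Returns None if the metric is absent.
    """
    total, found = 0.0, False
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] == name:
            total += float(value)
            found = True
    return total if found else None

def ingestion_stalled(previous, current):
    """True if the counter is missing or did not increase between scrapes."""
    return current is None or previous is None or current <= previous

sample = """\
# TYPE parca_profilestore_ingested_samples_total counter
parca_profilestore_ingested_samples_total{job="a"} 120
parca_profilestore_ingested_samples_total{job="b"} 30
go_goroutines 42
"""

value = read_counter(sample, "parca_profilestore_ingested_samples_total")
print(value, ingestion_stalled(100.0, value))
```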

Kubernetes deployment. If you’re on K8s, the agent ships as a DaemonSet. The Parca team maintains a Helm chart at github.com/parca-dev/helm-charts. The DaemonSet tolerates all nodes by default, so it lands on control plane nodes too — filter them out with a nodeSelector if you don’t want profiling data from etcd and the scheduler mixed into your app queries.

helm repo add parca https://parca-dev.github.io/helm-charts
helm repo update

# Install server
helm install parca parca/parca --namespace parca --create-namespace \
  --set "parca.config.storage_active_memory=2147483648"

# Install agent DaemonSet
helm install parca-agent parca/parca-agent --namespace parca \
  --set "parca-agent.config.store_address=parca.parca.svc.cluster.local:7070" \
  --set "parca-agent.config.insecure=true"

Reading the Data: What to Actually Look For

A flame graph is only useful if you know what questions to ask.

CPU regression after deploy: compare the 15 minutes before and after your rollout. Any function that gained width is consuming more CPU. If it’s in your code, it’s probably the regression. If it’s in a library, check if you bumped a dependency.

Idle CPU mystery: sometimes a process reports 40% CPU in top but your hot paths look fine in the flame graph. Look at the leaf end of the stacks: if you see wide bars in kernel frames (sys_futex, schedule, io_wait), the CPU time is being spent in the kernel on locking, scheduling, or I/O syscalls rather than in your own code. Your profiling data is right; your problem is I/O or lock contention, not algorithm efficiency.

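Memory allocation hotspots: Parca can also store allocation (heap) profiles, but these come from the server scraping pprof endpoints that the application exposes (for example Go’s net/http/pprof), not from the eBPF agent. Once they are ingested, the workflow is the same; just switch the profile type selector in the UI from cpu to memory:alloc_space:bytes.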

Finding the 1% that matters: the 80/20 rule is real in profiling. Usually 1–3 functions account for 60–80% of CPU time. Parca’s flame graph lets you click into any frame to zoom in. Use it. Don’t optimize the second-biggest bar until the first one is dealt with.
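
You can check how concentrated your own profile is with a few lines of code. This sketch takes per-function self-sample counts (hypothetical numbers here) and returns the smallest set of top functions that covers a target share of CPU samples.

```python
def functions_covering(self_samples, target_share=0.8):
    """Return the top functions, largest first, that together reach target_share."""
    total = sum(self_samples.values())
    covered, picked = 0, []
    ranked = sorted(self_samples.items(), key=lambda kv: kv[1], reverse=True)
    for func, n in ranked:
        if covered >= target_share * total:
            break
        picked.append(func)
        covered += n
    return picked

# Hypothetical self-sample counts per function.
self_samples = {"parse_json": 60, "render": 25, "gc": 10, "serve": 5}
print(functions_covering(self_samples))
print(functions_covering(self_samples, target_share=0.5))
```

If the returned list is short, the advice above applies directly: work through it in order. If it is long, the profile is flat and a single optimization is unlikely to move the needle much.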

Wrapping Up

Parca with the eBPF agent is the closest thing to free profiling visibility in production. You deploy two containers, you mount a few host paths, and every process on the machine is profiled — Go services, Python workers, Nginx, Postgres, your shell scripts. No recompile, no restart, no SDK onboarding meeting.

The gaps are real: stripped binaries kill symbolization, old kernels limit unwinding quality, and the OSS server’s retention is memory-bounded rather than time-bounded. But none of these are blockers — they’re known tradeoffs with known workarounds.

The alternative is continuing to guess. Or getting paged at 3 AM and running perf record -g -p $(pgrep myservice) sleep 10 while the incident is already half over. Neither is good. Set this up on a staging host first, spend an afternoon reading the flame graphs, and you’ll never want to go back.
