Container Image Scanning at Scale: Harbor, Quay, and ECR Compared

Most teams discover their container security posture the hard way: during an audit, after a breach, or when a pen-tester casually mentions that half your prod images ship a userspace library with a 9.8 CVSS CVE from 2023. Scanning images once in a pipeline and calling it done isn’t a strategy. It’s theater.

The real problem is drift: an image that was clean when it was built can accumulate critical CVEs by the time it’s actually running. New vulnerabilities get published daily. Your base images age. And unless something in your stack is continuously watching your registry — not just your pipeline — you’re flying blind.

This article is for the engineer who needs to solve scanning at scale: multiple teams, multiple registries, hundreds or thousands of image tags, and a requirement to actually act on what the scanners find. We’ll cover three serious options — self-hosted Harbor, Red Hat Quay, and AWS ECR with Inspector v2 — with real configs, honest tradeoffs, and the gotchas that’ll bite you.

The Two Modes of Scanning (and Why You Need Both)

Before touching any tool, get this mental model straight:

Shift-left scanning happens in CI/CD, before an image lands in the registry. Tools like Trivy, Grype, or Snyk run in the pipeline and can block a push. This is your first gate.

Registry-level scanning happens at rest — on push, on schedule, or both. The registry rescans images when new CVE data is published, flagging things that were clean yesterday but are critical today.

You need both. Shift-left is cheap and fast. Registry scanning is the safety net that catches everything else over time.
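
To make the drift problem concrete, here is a minimal sketch of why the same image can pass a pipeline scan and fail a later registry rescan. The package names, versions, and CVE IDs are invented for illustration, and the exact-version matching is a simplification of what real scanners do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    cve_id: str
    package: str
    affected_version: str
    severity: str


def scan(image_packages: dict[str, str], advisories: list[Advisory]) -> list[Advisory]:
    """Return the advisories that match a package/version pair in the image."""
    return [a for a in advisories if image_packages.get(a.package) == a.affected_version]


# Hypothetical image contents, frozen at build time.
IMAGE = {"openssl": "3.0.2", "zlib": "1.2.11"}

# The advisory feed as known on build day, and the same feed some weeks later.
FEED_AT_BUILD = [Advisory("CVE-0000-0001", "curl", "7.81.0", "HIGH")]
FEED_LATER = FEED_AT_BUILD + [Advisory("CVE-0000-0002", "openssl", "3.0.2", "CRITICAL")]

if __name__ == "__main__":
    print("pipeline scan at build:", [a.cve_id for a in scan(IMAGE, FEED_AT_BUILD)])
    print("registry rescan later: ", [a.cve_id for a in scan(IMAGE, FEED_LATER)])
```

The image never changed between the two scans; only the advisory feed did. That is the gap registry-level rescanning exists to cover.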

Every tool in this article focuses on the second mode, though some integrate with the first.

Harbor: The Self-Hosted Workhorse

GitHub: goharbor/harbor

Harbor is the go-to open source registry for teams that want full control. It’s CNCF-graduated, battle-tested, and ships with Trivy as the default scanner since v2.0. You can also plug in Anchore or Clair if you have a reason to.

Deploying Harbor with Helm

For anything beyond a laptop demo, Helm is the right approach:

helm repo add harbor https://helm.goharbor.io
helm repo update

# Pull the default values and edit before deploying
helm show values harbor/harbor > harbor-values.yaml

A production-grade harbor-values.yaml excerpt — the parts that actually matter:

expose:
  type: ingress
  tls:
    enabled: true
    certSource: secret
    secret:
      secretName: harbor-tls
  ingress:
    hosts:
      core: registry.yourdomain.com
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "0"  # Critical — unlimited for large image pushes
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"

externalURL: https://registry.yourdomain.com

persistence:
  enabled: true
  resourcePolicy: "keep"  # Don't delete PVCs on helm uninstall
  persistentVolumeClaim:
    registry:
      storageClass: "fast-ssd"
      size: 500Gi    # Size this generously — images pile up fast
    database:
      storageClass: "fast-ssd"
      size: 20Gi
    trivy:
      storageClass: "fast-ssd"
      size: 10Gi     # Trivy's vuln DB is ~1GB, needs room to update

database:
  type: external   # Use an external managed Postgres in production
  external:
    host: "postgres.internal"
    port: "5432"
    username: "harbor"
    password: "your-db-password"
    coreDatabase: "registry"

redis:
  type: external   # Same — external Redis for HA setups
  external:
    addr: "redis.internal:6379"

trivy:
  enabled: true
  ignoreUnfixed: false     # Don't hide unfixed vulns — you still need to know
  insecure: false
  gitHubToken: ""          # Set this — GitHub API rate limits will throttle DB updates otherwise
  skipUpdate: false

jobservice:
  jobLoggers:
    - database
  maxJobWorkers: 10   # Tune based on concurrent scan volume

notary:
  enabled: false   # Notary (Docker Content Trust) only; removed from Harbor as of v2.9. Cosign signatures don't need it.

Configuring Scan Policies

The UI is fine for exploration but policy automation requires the Harbor API or project-level configs. Set up automatic scanning on push per project:

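# Enable auto-scan via API; repeat for every project (replace {project_id} with the real ID)
# Note: "severity" only takes effect together with "prevent_vul": "true", which blocks pulls at or above that level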
curl -u admin:$HARBOR_PASSWORD \
  -X PUT "https://registry.yourdomain.com/api/v2.0/projects/{project_id}" \
  -H "Content-Type: application/json" \
  -d '{
    "metadata": {
      "auto_scan": "true",
      "severity": "high"
    }
  }'
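
Since the call has to be repeated per project, a small script helps. The following is a minimal sketch, assuming basic auth and a list of project IDs you already have; the function names and the `send` parameter are hypothetical, and only the endpoint path and the `auto_scan` metadata key come from the curl example above.

```python
import base64
import json
import urllib.request
from typing import Callable, Iterable


def build_auto_scan_request(base_url: str, project_id: int, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the PUT request that turns on auto-scan for one project."""
    body = json.dumps({"metadata": {"auto_scan": "true"}}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v2.0/projects/{project_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
    )


def enable_auto_scan(
    base_url: str,
    project_ids: Iterable[int],
    user: str,
    password: str,
    send: Callable = urllib.request.urlopen,  # injectable so the loop can be dry-run
) -> None:
    """Send the auto-scan request for every project ID and print the HTTP status."""
    for project_id in project_ids:
        request = build_auto_scan_request(base_url, project_id, user, password)
        with send(request) as response:
            print(f"project {project_id}: HTTP {response.status}")
```

Calling `enable_auto_scan("https://registry.yourdomain.com", [1, 2, 3], "admin", password)` would issue one PUT per project. Keeping request construction separate from sending makes it easy to inspect what will go over the wire first.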

Harbor also supports CVE allowlists per project — useful when a vulnerability has no fix yet and you’ve accepted the risk formally:

{
  "project_id": 1,
  "expires_at": 1767225600,
  "items": [
    {"cve_id": "CVE-2024-XXXXX"}
  ]
}
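
The `expires_at` field is a Unix timestamp in seconds, which is easy to get wrong by hand. As a rough sketch, the helpers below (hypothetical names, not part of any Harbor client) build the same structure and check whether an accepted risk has lapsed.

```python
import time
from datetime import datetime, timezone
from typing import Optional


def expiry_epoch(year: int, month: int, day: int) -> int:
    """Midnight UTC on the given date, as Unix seconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def build_allowlist(project_id: int, cve_ids: list[str], expires_at: int) -> dict:
    """Assemble an allowlist document in the shape shown above."""
    return {
        "project_id": project_id,
        "expires_at": expires_at,
        "items": [{"cve_id": cve_id} for cve_id in cve_ids],
    }


def is_expired(allowlist: dict, now: Optional[float] = None) -> bool:
    """True once the current time has reached the allowlist's expiry."""
    current = time.time() if now is None else now
    return current >= allowlist["expires_at"]


if __name__ == "__main__":
    allowlist = build_allowlist(1, ["CVE-2024-XXXXX"], expiry_epoch(2026, 1, 1))
    print(allowlist["expires_at"], "expired:", is_expired(allowlist))
```

The value 1767225600 in the example corresponds to 2026-01-01 00:00:00 UTC. Always set an expiry: an allowlist entry with no end date is a risk acceptance nobody will revisit.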

Harbor Gotchas

Trivy’s DB update rate matters more than you think. Older Trivy releases fetched the vulnerability DB from GitHub releases; current ones pull it as an OCI artifact from GHCR. Either way, a fresh deploy with many concurrent scan jobs can hit upstream rate limits almost immediately (GitHub’s unauthenticated API limit is 60 requests/hour). Set the token. If you’re air-gapped, you need a private Trivy DB mirror; Harbor’s docs cover this but it’s painful.

The ignoreUnfixed flag is a trap. Defaulting it to true makes dashboards look clean. It also hides dozens of real CVEs where the upstream vendor decided the fix isn’t worth a patch release. Leave it false and build your triage workflow around severity, not fixability.

Harbor’s replication + scanning = queue saturation. If you replicate images from another registry into Harbor, each replicated image triggers a scan job. On a bulk import, your jobservice queue will back up for hours. Pre-scale maxJobWorkers before any large migration.
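
A back-of-the-envelope estimate is usually enough to decide how far to raise maxJobWorkers before a migration. The sketch below assumes every scan takes roughly the same time and that workers do nothing but scan, which is optimistic; the image count and per-scan duration are placeholders to replace with your own measurements.

```python
import math


def drain_time_minutes(images: int, avg_scan_seconds: float, workers: int) -> float:
    """Estimate how long a scan backlog takes to clear with a fixed worker pool."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    batches = math.ceil(images / workers)  # each worker handles one scan at a time
    return batches * avg_scan_seconds / 60


if __name__ == "__main__":
    for workers in (10, 40):
        minutes = drain_time_minutes(images=5000, avg_scan_seconds=45, workers=workers)
        print(f"{workers} workers: ~{minutes:.0f} minutes")
```

With these placeholder numbers, 10 workers need about 375 minutes and 40 workers about 94. Real throughput will be lower once the database and the Trivy pods become the bottleneck, so treat the result as a floor.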

Red Hat Quay: Enterprise-Grade with Clair Under the Hood

Red Hat Quay (quay.io for the hosted version, or self-hosted via the Quay Operator) uses Clair v4 as its scanner. Clair’s architecture is fundamentally different from Trivy: it’s a microservice that indexes image layers and matches them against a multi-source vulnerability DB (Red Hat, Ubuntu, Debian, Alpine, and language ecosystems such as Python, with the exact list depending on the Clair release).

Self-Hosted Quay with the Operator

On OpenShift, the Quay Operator is the supported path (the route and monitoring components below assume OpenShift):

apiVersion: quay.redhat.com/v1
kind: QuayRegistry
metadata:
  name: central-registry
  namespace: quay
spec:
  configBundleSecret: quay-config-bundle
  components:
    - kind: clair
      managed: true       # Let the operator manage Clair
    - kind: postgres
      managed: false      # External Postgres — always do this in prod
    - kind: objectstorage
      managed: false      # External S3 or compatible; the managed option depends on in-cluster object storage (NooBaa/ODF)
    - kind: redis
      managed: true
    - kind: horizontalpodautoscaler
      managed: true       # Let it scale on load
    - kind: route
      managed: true
    - kind: monitoring
      managed: true       # Prometheus metrics out of the box
    - kind: tls
      managed: true

The quay-config-bundle secret holds the main config.yaml. Key scanning-related entries:

# config.yaml (inside the secret)
FEATURE_SECURITY_SCANNER: true
FEATURE_SECURITY_NOTIFICATIONS: true
SECURITY_SCANNER_V4_ENDPOINT: http://clair-v4:6060
SECURITY_SCANNER_V4_PSK: "your-pre-shared-key-here"

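# Send notifications for vulnerabilities found on newly indexed (just-pushed) images.
# Scanning itself happens automatically once the scanner feature is enabled.
FEATURE_SECURITY_SCANNER_NOTIFY_ON_NEW_INDEX: true

# Action-log archiving location. This is unrelated to vulnerability alerts;
# those are set up as repository notifications (webhook, email, and so on).
ACTION_LOG_ARCHIVE_LOCATION: default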

Clair v4 Configuration

If you’re managing Clair separately (outside the operator), its config is a YAML file mounted into the container:

# clair-config.yaml
http_listen_addr: ":6060"
introspection_addr: ":8089"

log_level: info

indexer:
  connstring: "host=postgres-clair user=clair dbname=clair sslmode=disable"
  scanlock_retry: 10
  layer_scan_concurrency: 10    # How many image layers to scan in parallel
  migrations: true

matcher:
  connstring: "host=postgres-clair user=clair dbname=clair sslmode=disable"
  max_conn_pool: 100
  migrations: true
  period: "6h"    # How often the updaters fetch fresh vulnerability data
  
  indexer_addr: "clair-indexer:6060"

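updaters:
  sets:             # Valid set names depend on your Clair release; check its config reference
    - "rhel"
    - "ubuntu"
    - "debian"
    - "alpine"
    - "pyupio"      # Python packages
    - "osv"         # Assumption: newer releases cover language ecosystems (npm, Go, and others) via OSV; verify for your version

  config:
    rhel:
      ignore_unpatched: false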

notifier:
  connstring: "host=postgres-clair user=clair dbname=clair sslmode=disable"
  migrations: true
  
  webhook:
    target: "https://your-webhook-receiver/clair-notifications"
    callback: "https://clair.internal/notifier/api/v1/notifications"
    headers:
      Authorization: "Bearer your-webhook-token"
  
  poll_interval: "5m"
  delivery_interval: "1m"

Quay Gotchas

Clair indexes layers, not full images. This is clever — if two images share a base layer, that layer is only scanned once. But it also means your Postgres DB grows proportionally to your unique layer count, not your image count. With thousands of images sharing a common base, the DB stays manageable. With teams building from scratch every time, it balloons. Watch the DB size.
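
The difference between the two cases is easy to see by counting layer digests. This is a minimal sketch with made-up image names and shortened digests; it only illustrates the counting, not how Clair stores its index.

```python
def layer_stats(images: dict[str, list[str]]) -> tuple[int, int]:
    """Return (total layer references, unique layer digests) across all images."""
    total = sum(len(layers) for layers in images.values())
    unique = len({digest for layers in images.values() for digest in layers})
    return total, unique


# Three services built on one shared base: two base layers plus one app layer each.
SHARED_BASE = {
    "svc-a": ["sha256:base1", "sha256:base2", "sha256:app-a"],
    "svc-b": ["sha256:base1", "sha256:base2", "sha256:app-b"],
    "svc-c": ["sha256:base1", "sha256:base2", "sha256:app-c"],
}

# Three services that each build everything themselves: no layers in common.
FROM_SCRATCH = {
    "svc-a": ["sha256:a1", "sha256:a2", "sha256:a3"],
    "svc-b": ["sha256:b1", "sha256:b2", "sha256:b3"],
    "svc-c": ["sha256:c1", "sha256:c2", "sha256:c3"],
}

if __name__ == "__main__":
    print("shared base  (total, unique):", layer_stats(SHARED_BASE))
    print("from scratch (total, unique):", layer_stats(FROM_SCRATCH))
```

Both fleets reference nine layers, but the shared-base fleet only has five unique ones to index. The unique count is the number to track when sizing Postgres.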

Vulnerability data refreshes on a schedule, not just on push. This is the feature you actually want: images get re-evaluated against new CVE data without a new push. But the updater runs generate load spikes every 6 hours (or whatever you set period to). Size your Clair pods and Postgres accordingly, and stagger the schedule if you run multiple Quay instances.

Quay’s notification system is good but needs a receiver. Setting FEATURE_SECURITY_NOTIFICATIONS: true doesn’t do anything useful unless you have a webhook endpoint or email config to route alerts. This is where most teams drop the ball — they enable scanning and then have no automated path from "new critical CVE found" to "engineer gets paged."

AWS ECR with Amazon Inspector v2

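ECR offers two scanning modes. Basic Scanning started life as a thin wrapper around an old Clair version; AWS has since deprecated that Clair engine in favor of its own scanner. Enhanced Scanning hands the job to Amazon Inspector (the current generation, called Inspector v2 here), a separate AWS service that continuously rescans ECR images as new vulnerability data arrives. Its vulnerability intelligence draws on several sources, including vendor advisories and Snyk.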

Enabling Inspector v2 for ECR

# Enable Inspector v2 at the account level
aws inspector2 enable \
  --resource-types ECR \
  --region us-east-1

# Verify it's active for this account
aws inspector2 batch-get-account-status \
  --region us-east-1

# For multi-account orgs — delegate Inspector admin to a security account
aws inspector2 enable-delegated-admin-account \
  --delegated-admin-account-id 123456789012

Terraform for teams who version their AWS config (you should be):

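data "aws_caller_identity" "current" {}

resource "aws_inspector2_enabler" "ecr" {
  account_ids    = [data.aws_caller_identity.current.account_id]
  resource_types = ["ECR"]
}

# ECR repo with scan-on-push enabled
# (repo-level setting used by Basic Scanning; with Inspector, scan rules live in
# the registry-level scanning configuration)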
resource "aws_ecr_repository" "app" {
  name                 = "myapp"
  image_tag_mutability = "IMMUTABLE"   # Critical — never use MUTABLE in prod

  image_scanning_configuration {
    scan_on_push = true
  }

  encryption_configuration {
    encryption_type = "KMS"
    kms_key = aws_kms_key.ecr.arn
  }
}

# Lifecycle policy — control image accumulation
resource "aws_ecr_lifecycle_policy" "app" {
  repository = aws_ecr_repository.app.name

  policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Keep last 30 tagged images"
        selection = {
          tagStatus     = "tagged"
          tagPrefixList = ["v"]
          countType     = "imageCountMoreThan"
          countNumber   = 30
        }
        action = { type = "expire" }
      },
      {
        rulePriority = 2
        description  = "Expire untagged images after 7 days"
        selection = {
          tagStatus   = "untagged"
          countType   = "sinceImagePushed"
          countUnit   = "days"
          countNumber = 7
        }
        action = { type = "expire" }
      }
    ]
  })
}

Querying Findings at Scale

Inspector v2 integrates with Security Hub and EventBridge. The EventBridge route is how you build automated responses:

# EventBridge rule: fire when Inspector reports CRITICAL or HIGH findings on ECR images
resource "aws_cloudwatch_event_rule" "critical_cve" {
  name        = "ecr-critical-cve-detected"
  description = "Trigger on critical and high ECR vulnerability findings"

  event_pattern = jsonencode({
    source      = ["aws.inspector2"]
    detail-type = ["Inspector2 Finding"]
    detail = {
      severity = ["CRITICAL", "HIGH"]   # Plain string in Inspector events; the nested label form is Security Hub's
      status   = ["ACTIVE"]
      resources = {
        type = ["AWS_ECR_CONTAINER_IMAGE"]
      }
    }
  })
}

resource "aws_cloudwatch_event_target" "notify_slack" {
  rule      = aws_cloudwatch_event_rule.critical_cve.name
  target_id = "SendToSNS"
  arn       = aws_sns_topic.security_alerts.arn
}

For bulk querying across a large ECR fleet:

# Get all CRITICAL findings for a specific repo
aws inspector2 list-findings \
  --filter-criteria '{
    "ecrImageRepositoryName": [{"comparison": "EQUALS", "value": "myapp"}],
    "severity": [{"comparison": "EQUALS", "value": "CRITICAL"}],
    "findingStatus": [{"comparison": "EQUALS", "value": "ACTIVE"}]
  }' \
  --query 'findings[*].{CVE:packageVulnerabilityDetails.vulnerabilityId, Package:packageVulnerabilityDetails.vulnerablePackages[0].name, Score:inspectorScore}' \
  --output table
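
The table output is fine for one repo, but across a fleet you usually want counts. The sketch below summarizes the JSON produced by the same command when run with `--output json` and without `--query`; the function name is hypothetical, and the field names are the ones used in the query above plus the top-level `severity` field.

```python
import json
from collections import Counter


def summarize(findings_json: str) -> Counter:
    """Count findings per (severity, package name) from list-findings JSON output."""
    findings = json.loads(findings_json).get("findings", [])
    counts: Counter = Counter()
    for finding in findings:
        severity = finding.get("severity", "UNKNOWN")
        details = finding.get("packageVulnerabilityDetails", {})
        for package in details.get("vulnerablePackages", []):
            counts[(severity, package.get("name", "unknown"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: aws inspector2 list-findings ... --output json | python summarize.py
    for (severity, package), count in summarize(sys.stdin.read()).most_common():
        print(f"{severity:10} {package:30} {count}")
```

Grouping by package rather than by CVE tends to be more actionable: one base image bump often clears many findings at once.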

ECR Gotchas

Inspector v2 and Basic Scanning cannot coexist. The scan type is a registry-level setting, so turning on enhanced scanning switches the whole registry in that account and region away from Basic Scanning. The findings format changes, the console view changes, and any tooling that parses describe-image-scan-findings output needs to be checked against the new response shape. Audit your tooling before the migration.

Inspector v2 is regional. If you push images to ECR in eu-west-1 but only enabled Inspector in us-east-1, those images don’t get scanned. Enable it in every region where you have ECR repos. In a multi-account org setup, the delegated admin approach handles member accounts more gracefully, but the delegation is also per region, so script the enablement across every region you use.

The pricing model will surprise you. ECR Basic Scanning costs nothing extra. Inspector v2 bills per image for the initial scan on push and again for each automated rescan, with separate charges if you also scan EC2 instances or Lambda functions. With a large image fleet and continuous rescanning, the bill can be non-trivial. Pull your current ECR metrics before enabling and estimate costs first.
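
A rough estimate can be done in a few lines. This sketch assumes a bill made up of a per-initial-scan charge and a per-rescan charge; both the billing dimensions and the rates used here are placeholders, so check the current Inspector pricing page for your region before relying on the result.

```python
def estimate_monthly_cost(
    new_images_per_month: int,
    rescans_per_month: int,
    price_initial_scan: float,
    price_rescan: float,
) -> float:
    """Estimated monthly scanning bill under a two-rate pricing assumption."""
    return new_images_per_month * price_initial_scan + rescans_per_month * price_rescan


if __name__ == "__main__":
    # Placeholder volumes and rates, not actual AWS prices.
    cost = estimate_monthly_cost(
        new_images_per_month=2000,
        rescans_per_month=50000,
        price_initial_scan=0.10,
        price_rescan=0.01,
    )
    print(f"estimated monthly cost: ${cost:,.2f}")
```

Rescans usually dominate, which is why lifecycle policies that expire old images matter for cost as well as hygiene.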

IMMUTABLE image tags aren’t optional. If tags are mutable, a new push overwrites the existing tag and Inspector scans the new content, but anything that references the tag now silently resolves to different, potentially vulnerable content, and the scan results you reviewed no longer describe what gets pulled. Always use immutable tags in prod, and always reference images by digest in Kubernetes manifests.
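
Digest pinning is simple to check mechanically, for example in a pre-merge lint step. The following is a minimal sketch that only looks at the shape of the reference string (a trailing `@sha256:` plus 64 hex characters); the function name and sample references are illustrative.

```python
import re

# A digest-pinned reference ends with @sha256:<64 lowercase hex characters>.
DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True if the image reference is pinned to a sha256 digest."""
    return bool(DIGEST_SUFFIX.search(image_ref.strip()))


if __name__ == "__main__":
    refs = [
        "registry.yourdomain.com/team/myapp:v1.4.2",
        "registry.yourdomain.com/team/myapp@sha256:" + "a" * 64,
    ]
    for ref in refs:
        print("pinned " if is_digest_pinned(ref) else "MUTABLE", ref)
```

A check like this can run over the `image:` fields of rendered manifests and fail the build when a tag-only reference slips through.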

CI/CD Integration: The Shift-Left Layer

All three registries benefit from a pre-push scan in your pipeline. Here’s a Trivy step that works with any of them:

# .github/workflows/build-and-push.yml (excerpt)
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write   # Required for the SARIF upload step
    steps:
      - uses: actions/checkout@v4
      
      - name: Build image
        run: docker build -t $IMAGE_TAG .
      
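      - name: Scan with Trivy (fail on CRITICAL or HIGH)
        uses: aquasecurity/trivy-action@master   # Pin to a release tag or commit SHA in real pipelines
        with:
          image-ref: ${{ env.IMAGE_TAG }}
          format: "sarif"
          output: "trivy-results.sarif"
          severity: "CRITICAL,HIGH"
          limit-severities-for-sarif: true   # SARIF output otherwise ignores the severity filter
          exit-code: "1"        # Fail the pipeline when any finding matches the severities above
          ignore-unfixed: false
          vuln-type: "os,library"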
      
      - name: Upload Trivy results to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        if: always()   # Upload even on failure — you want to see what failed
        with:
          sarif_file: "trivy-results.sarif"
      
      - name: Push to registry
        if: success()   # Only push if scan passed
        run: docker push $IMAGE_TAG

Harbor can also enforce policy on the registry side: with a project’s vulnerability prevention setting enabled, it refuses pulls of images whose findings meet or exceed the configured severity. Pair that with an admission controller in the cluster that rejects unscanned or unsigned images. This is more advanced but closes the gap between "pipeline bypassed" and "vulnerable image runs in prod."

Choosing Between the Three

Here’s the honest breakdown:

Harbor is the right answer if you’re self-hosting, need multi-tenant project isolation, want fine-grained policy per team, and don’t want to pay per-scan fees. The operational burden is real (you’re running Postgres, Redis, object storage, and Trivy), but the control is complete. It’s the choice for platform teams managing internal developer platforms.

Quay makes sense in Red Hat/OpenShift shops or anywhere that Red Hat’s enterprise support matters. Clair v4’s architecture is more sophisticated than Trivy for large-scale layer deduplication. The Quay Operator integration with OpenShift is genuinely good. Outside of that ecosystem, Harbor is usually easier to operate.

ECR with Inspector v2 is the answer when you’re already deep in AWS, your team doesn’t want to operate registry infrastructure, and the per-image pricing is acceptable. The EventBridge integration with the rest of the AWS security stack (Security Hub, GuardDuty, AWS Config) is hard to replicate elsewhere. For pure-AWS shops it’s a no-brainer. For hybrid or multi-cloud, less so.

Production-Ready Practices Worth Keeping

Close the loop with a findings dashboard. A scanner that fires alerts nobody reads is useless. Pipe your findings into whatever your team actually uses — PagerDuty for critical CVEs, Jira/Linear for high, weekly digest email for medium and below. The specific tool matters less than the workflow being real.
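
The routing itself can be a very small piece of code. This sketch maps severities to destinations along the lines described above; the route names are placeholders for whatever integrations you actually run, and the finding dictionaries are a simplified, hypothetical shape.

```python
ROUTES = {
    "CRITICAL": "page",    # e.g. PagerDuty
    "HIGH": "ticket",      # e.g. Jira or Linear
}
DEFAULT_ROUTE = "weekly-digest"  # medium and below


def route(severity: str) -> str:
    """Pick the destination for a finding based on its severity label."""
    return ROUTES.get(severity.strip().upper(), DEFAULT_ROUTE)


def group_by_route(findings: list[dict]) -> dict[str, list[str]]:
    """Bucket finding IDs by destination so each channel gets a single batch."""
    buckets: dict[str, list[str]] = {}
    for finding in findings:
        buckets.setdefault(route(finding["severity"]), []).append(finding["id"])
    return buckets


if __name__ == "__main__":
    sample = [
        {"id": "CVE-0000-0001", "severity": "CRITICAL"},
        {"id": "CVE-0000-0002", "severity": "high"},
        {"id": "CVE-0000-0003", "severity": "MEDIUM"},
    ]
    print(group_by_route(sample))
```

Batching per destination keeps the pager for the things that deserve it and stops the ticket queue from filling with one issue per CVE.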

Define your severity policy explicitly. "Block on CRITICAL" is the baseline. But CRITICAL CVSS scores can be misleading — a CVSS 9.8 in a library your app never calls is different from a CVSS 7.5 in your HTTP parsing stack. Document your policy, build it into your pipeline, and revisit it quarterly.
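
Writing the policy as code forces the ambiguities into the open. The sketch below is one possible refinement of "block on CRITICAL", not a recommendation: it assumes you have some way to know whether the vulnerable code is reachable (the `reachable` flag is hypothetical), and the thresholds are examples to adjust.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    cve_id: str
    cvss: float
    reachable: bool  # hypothetical: does the application actually exercise this code?


def decide(finding: Finding) -> str:
    """Return 'block', 'fix-within-sla', or 'track' for a single finding."""
    if finding.reachable and finding.cvss >= 7.0:
        return "block"            # exposed and serious: stop the release
    if finding.cvss >= 9.0:
        return "fix-within-sla"   # critical score but not reachable: fix on a deadline
    return "track"                # everything else goes to the backlog


if __name__ == "__main__":
    for f in (
        Finding("CVE-0000-0003", 9.8, reachable=False),
        Finding("CVE-0000-0004", 7.5, reachable=True),
    ):
        print(f.cve_id, "->", decide(f))
```

Under this policy the reachable 7.5 blocks the release while the unreachable 9.8 gets a deadline instead. Whether that trade is acceptable is exactly the decision the quarterly review should revisit.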

Separate base image updates from application updates. Most CVEs live in base OS packages. Automate base image refreshes on a schedule (weekly is reasonable for most teams): pin FROM lines to a tag plus digest, let Renovate or Dependabot open a PR when the digest moves, and rebuild on merge, rather than waiting for someone to notice that ubuntu:22.04 has been republished with patched packages.

Track your mean time to patch (MTTP). This is the metric your CISO actually cares about. Scanning creates findings. Your MTTP tells you whether your team actually acts on them. If MTTP for critical CVEs is measured in weeks, your scanning setup is working fine but your remediation workflow is broken.
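
MTTP is just the average gap between when a finding was first detected and when the fix shipped. A minimal sketch, assuming you can export those two timestamps per remediated finding; the sample dates are invented.

```python
from datetime import datetime, timedelta


def mean_time_to_patch(records: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (fixed - detected) over all remediated findings."""
    if not records:
        raise ValueError("need at least one remediated finding")
    total = sum((fixed - detected for detected, fixed in records), timedelta())
    return total / len(records)


if __name__ == "__main__":
    sample = [
        (datetime(2025, 3, 1), datetime(2025, 3, 3)),   # patched in 2 days
        (datetime(2025, 3, 10), datetime(2025, 3, 14)), # patched in 4 days
    ]
    print("MTTP:", mean_time_to_patch(sample))
```

Compute it per severity rather than as one global number: a healthy average can hide a long tail of critical findings that sat open for weeks.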