AWS SQS vs EventBridge vs Kinesis: How to Pick the Right Messaging Service

Every AWS project hits this wall eventually. You need to decouple two services, and suddenly you’re staring at the console with three options that all vaguely promise "reliable messaging." SQS, EventBridge, Kinesis — pick one.

If you pick wrong, you’ll feel it months later: events disappearing, costs exploding, consumers fighting over messages, or a stream that can’t handle your data volume. This article cuts through the AWS marketing speak and gives you the actual decision criteria engineers use in production.

The Mental Model First

Before touching any config, understand what category each service falls into:

  • SQS is a queue. Something produces a message, something else consumes it. One message, one consumer. Work distribution.
  • EventBridge is a pub/sub event bus. One event can fan out to many targets. It’s a router, not a queue.
  • Kinesis is a log. Messages are ordered, retained, and can be replayed. Multiple consumers read independently from the same stream.

These are fundamentally different primitives. Choosing between them isn’t about features — it’s about which data model fits your problem.
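To make the distinction concrete, here is a toy sketch of the three data models in plain Python. Nothing here talks to AWS; the names (queue, rules, log, positions) are made up for illustration and only mimic the delivery semantics described above.

```python
from collections import deque

# Queue (SQS-like): each message goes to one consumer and is then removed.
queue = deque(["job-1", "job-2"])
worker_a = queue.popleft()
worker_b = queue.popleft()

# Bus (EventBridge-like): each event is copied to every target whose rule matches.
rules = {
    "inventory": lambda e: e["type"] == "OrderPlaced",
    "email": lambda e: e["type"] == "OrderPlaced",
    "billing": lambda e: e["type"] == "InvoiceDue",
}
event = {"type": "OrderPlaced", "orderId": "o-1"}
delivered_to = [name for name, matches in rules.items() if matches(event)]

# Log (Kinesis-like): records stay in place; each consumer tracks its own position.
log = ["r0", "r1", "r2"]
positions = {"dashboard": 0, "fraud": 2}
dashboard_reads = log[positions["dashboard"]:]
fraud_reads = log[positions["fraud"]:]
```

The queue ends up empty, the bus delivers the same event to two targets, and the log is unchanged no matter how many consumers read it.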

AWS SQS: The Workhorse

SQS is the oldest of the three and remains the most commonly misused. It does one thing well: task distribution. You have work, you have workers, and SQS keeps handing each unit of work to a worker until one of them finishes it. That means at least once on Standard queues, and exactly once only on FIFO queues, with caveats (more on that below).

When SQS is the right call:

  • Offloading background jobs from a web server (image resizing, email sending, PDF generation)
  • Distributing tasks across a fleet of Lambda functions or EC2 workers
  • Buffering writes before hitting a rate-limited downstream API
  • Decoupling microservices where one service triggers a specific action in another

Standard vs FIFO — and why you should care:

Standard queues give you at-least-once delivery and best-effort ordering. FIFO gives you exactly-once processing and strict ordering within a message group, at a much lower default throughput ceiling: 300 API calls/second per action, or 3,000 messages/second with batches of 10, vs. nearly unlimited for Standard. High-throughput mode for FIFO raises that ceiling considerably.

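Most teams default to Standard and then panic when they see duplicate processing. If your consumer isn’t idempotent, you will process messages more than once. Full stop. Build idempotent consumers. FIFO narrows the window (it deduplicates sends within a 5-minute interval), but a message whose visibility timeout expires is still redelivered, so idempotency is worth having there too.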
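A minimal sketch of what "idempotent" means in practice. The in-memory set stands in for a durable store (a database table with a unique key, for example), and handle_once is a hypothetical helper, not an SQS API. Prefer a business key such as an order ID over the SQS message ID, since a producer that sends twice gets two different message IDs.

```python
processed_keys = set()  # stand-in for a durable store with a unique constraint

def handle_once(idempotency_key: str, body: str, side_effects: list) -> bool:
    """Apply the side effect only the first time a given key is seen."""
    if idempotency_key in processed_keys:
        return False  # duplicate delivery: skip the work
    side_effects.append(body)  # the real work would happen here
    processed_keys.add(idempotency_key)
    return True
```

In a real system the check and the write should be one atomic operation (a conditional insert), otherwise two workers can race past the check at the same time.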

Dead Letter Queues are not optional:

Set a DLQ on every SQS queue. Set maxReceiveCount to 3–5, not 1 and not 10. When a message hits the DLQ, you need an alert and a runbook. Messages that sit in a DLQ silently are bugs you don’t know about.

{
  "QueueName": "order-processing",
  "Attributes": {
    "VisibilityTimeout": "300",
    "MessageRetentionPeriod": "86400",
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:order-processing-dlq\",\"maxReceiveCount\":\"5\"}"
  }
}

Gotcha — Visibility Timeout:

This one burns everyone. When a consumer picks up a message, SQS hides it from other consumers for the duration of the visibility timeout. If the consumer times out or crashes before the message is deleted (by your own DeleteMessage call, or by Lambda on your behalf after a successful invocation), the message becomes visible again and gets re-processed.

Set the visibility timeout to at least 6× your Lambda function timeout, plus any batching window you configure. This matches AWS’s own guidance for SQS event sources. If your Lambda has a 30-second timeout, your queue visibility timeout should be at least 3 minutes.
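The rule of thumb is simple enough to encode, which helps when you generate queue configs from code. This is a small sketch; recommended_visibility_timeout is a hypothetical helper, and the batching-window term assumes you set MaximumBatchingWindowInSeconds on the event source mapping.

```python
def recommended_visibility_timeout(function_timeout_s: int, batch_window_s: int = 0) -> int:
    """Six times the function timeout, plus any batching window, in seconds."""
    return 6 * function_timeout_s + batch_window_s
```

For the 30-second Lambda above this gives 180 seconds; a 60-second function with a 20-second batching window would need 380.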

Gotcha — Long polling is not default:

By default, SQS uses short polling. Your consumer hits the endpoint, gets an empty response, and you pay for that API call. Enable long polling on every queue you operate, either per call (WaitTimeSeconds: 20 on ReceiveMessage) or as the queue default (ReceiveMessageWaitTimeSeconds: 20). It reduces cost and reduces unnecessary API spam. There is no good reason not to do this.
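If you run your own consumers instead of Lambda, a long-polling receive loop looks roughly like this. It is a sketch: drain_once and handler are hypothetical names, and the sqs argument is expected to be a boto3 SQS client (receive_message and delete_message are the real boto3 calls).

```python
def drain_once(sqs, queue_url: str, handler) -> int:
    """Long-poll once, process each message, delete the ones that succeed."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # API maximum per call
        WaitTimeSeconds=20,      # long polling: wait up to 20 s for messages
    )
    handled = 0
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            continue  # leave it alone; it reappears after the visibility timeout
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        handled += 1
    return handled

# Usage with a real client (needs AWS credentials; the queue URL is a placeholder):
#   import boto3
#   drain_once(boto3.client("sqs"), "https://sqs.us-east-1.amazonaws.com/123456789012/order-processing", print)
```

Messages whose handler raises are deliberately not deleted, so the redrive policy can count the failed receive and eventually move them to the DLQ.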

Production pattern — Lambda + SQS:

# CloudFormation / SAM resources (function Role, Code, Runtime, and Handler omitted for brevity)
OrderProcessor:
  Type: AWS::Lambda::Function
  Properties:
    FunctionName: order-processor
    Timeout: 60
    ReservedConcurrentExecutions: 50  # Throttle against downstream DB

OrderDLQ:
  Type: AWS::SQS::Queue
  Properties:
    MessageRetentionPeriod: 1209600  # 14 days (the maximum), so failures stay around for triage

OrderQueue:
  Type: AWS::SQS::Queue
  Properties:
    VisibilityTimeout: 360   # 6x Lambda timeout
    RedrivePolicy:
      deadLetterTargetArn: !GetAtt OrderDLQ.Arn
      maxReceiveCount: 5
    ReceiveMessageWaitTimeSeconds: 20  # Long polling by default

EventSourceMapping:
  Type: AWS::Lambda::EventSourceMapping
  Properties:
    EventSourceArn: !GetAtt OrderQueue.Arn
    FunctionName: !GetAtt OrderProcessor.Arn
    BatchSize: 10
    FunctionResponseTypes:
      - ReportBatchItemFailures  # Only delete successfully processed messages

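ReportBatchItemFailures is the single most important SQS+Lambda setting most tutorials skip. Without it, if one message in a batch of 10 fails, all 10 get requeued and 9 get reprocessed unnecessarily. The flag only helps if your handler returns the IDs of the failed messages in a batchItemFailures list; an unhandled exception still fails the whole batch.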
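On the function side, a partial-batch handler can be sketched like this. The response shape (batchItemFailures with itemIdentifier set to the SQS messageId) is what Lambda expects; process_order is a hypothetical stand-in for your business logic.

```python
import json

def process_order(order: dict) -> None:
    """Hypothetical business logic; raises on failure."""
    if "orderId" not in order:
        raise ValueError("missing orderId")

def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process_order(json.loads(record["body"]))
        except Exception:
            # Report only this message; Lambda deletes the rest of the batch.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty list tells Lambda the whole batch succeeded, so make sure every failure path lands in the except branch rather than being swallowed.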

AWS EventBridge: The Event Router

EventBridge is what you reach for when an event needs to go to multiple places, or when you want loose coupling between services without any consumer knowing about any producer.

It’s an event bus with a rule engine. An event lands on the bus, your rules evaluate the event’s JSON payload, and matching events get routed to targets: Lambda, SQS, another EventBridge bus, an API destination (HTTP endpoint), Step Functions, and a dozen others.

When EventBridge is the right call:

  • Cross-account or cross-region event routing
  • One domain event that multiple downstream services care about (order placed → update inventory, send email, trigger analytics)
  • Replacing point-to-point integrations between microservices
  • Reacting to AWS service events (EC2 state changes, S3 uploads, CodePipeline results) via the default event bus
  • Scheduled tasks as a cron replacement (EventBridge Scheduler or scheduled rules)

The schema registry is underrated:

EventBridge has a schema registry that can auto-discover event schemas from your bus and generate code bindings. Turn it on from day one. By month six, when you have 40 event types flying around, you’ll thank yourself.

Content-based routing is the killer feature:

Unlike SNS subscription filter policies, which match on message attributes unless you opt in to payload-based filtering, EventBridge rules always filter on the content of the event, not just metadata:

{
  "source": ["com.myapp.orders"],
  "detail-type": ["OrderPlaced"],
  "detail": {
    "orderValue": [{ "numeric": [">=", 1000] }],
    "customerTier": ["premium", "enterprise"]
  }
}

This routes only high-value orders from premium customers to a specific target. No application code. No additional Lambda to filter. The bus does it.
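To build intuition for how such a pattern is evaluated, here is a deliberately simplified local matcher. It is not EventBridge’s engine: it only handles literal values and the numeric operator, and ignores prefix, exists, anything-but, and array-valued event fields. The function names are made up for this sketch. To check a pattern against the real engine, EventBridge exposes a TestEventPattern API.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       ">": operator.gt, ">=": operator.ge}

def leaf_matches(value, alternatives) -> bool:
    """True if the value equals any literal or satisfies any numeric rule."""
    for alt in alternatives:
        if isinstance(alt, dict) and "numeric" in alt:
            spec = alt["numeric"]  # e.g. [">=", 1000] or [">", 0, "<=", 5]
            pairs = zip(spec[0::2], spec[1::2])
            if isinstance(value, (int, float)) and all(OPS[op](value, bound) for op, bound in pairs):
                return True
        elif value == alt:
            return True
    return False

def matches(pattern: dict, event: dict) -> bool:
    """Every key in the pattern must match; extra event fields are ignored."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif not leaf_matches(event[key], expected):
            return False
    return True
```

The two ideas worth remembering: values inside one array are ORed, and separate fields are ANDed.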

Gotcha — EventBridge is NOT a queue:

EventBridge does not store events for later consumption. Failed deliveries are retried for up to 24 hours (with exponential backoff), but there’s no replay of missed events unless you explicitly enable an archive. If your target Lambda is throttled and all retries are exhausted, the event is dropped unless you attached a dead-letter queue to that target.

The fix: for critical events, route to SQS as the target rather than Lambda directly. SQS becomes your buffer; Lambda reads from SQS. You get EventBridge’s routing power with SQS’s durability.

EventBridge Rule → SQS Queue → Lambda

This pattern shows up in a large share of serious EventBridge deployments. It’s not a hack; it’s the standard approach for anything that can’t afford to drop events.
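One practical consequence of this chain: the Lambda no longer receives the EventBridge event directly. Assuming no input transformer on the rule, each SQS record body is the full EventBridge envelope as a JSON string, so the consumer has to unwrap it. A small sketch, with extract_details as a hypothetical helper name:

```python
import json

def extract_details(sqs_event: dict) -> list:
    """Return the `detail` payload of each EventBridge event in an SQS batch."""
    details = []
    for record in sqs_event["Records"]:
        envelope = json.loads(record["body"])  # the full EventBridge event
        details.append(envelope["detail"])
    return details
```

The envelope also carries source and detail-type, which is handy when one queue receives more than one kind of event.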

Gotcha — 5 targets per rule limit:

Each EventBridge rule supports up to 5 targets. If you need to fan out to more than 5 consumers, create additional rules with the same event pattern, or add SNS as one of your targets and have your consumers subscribe to the SNS topic (each through its own SQS queue if they need buffering).

Gotcha — Custom event buses and the default bus:

The default event bus receives AWS service events. Do not route your application events through the default bus. Create a custom event bus per domain or environment. Mixing your OrderPlaced events with EC2 termination notices on the default bus is a mess waiting to happen.

Production pattern — cross-service event routing:

import boto3
import json
from datetime import datetime, timezone

client = boto3.client('events')

def publish_order_event(order_id: str, order_value: float, customer_tier: str):
    response = client.put_events(
        Entries=[
            {
                # Timezone-aware timestamp; datetime.utcnow() is deprecated since Python 3.12
                'Time': datetime.now(timezone.utc),
                'Source': 'com.myapp.orders',
                'DetailType': 'OrderPlaced',
                'EventBusName': 'my-app-production',
                'Detail': json.dumps({
                    'orderId': order_id,
                    'orderValue': order_value,
                    'customerTier': customer_tier,
                    'version': '1.0'
                })
            }
        ]
    )

    # Per-entry failures are reported in the response body, not raised as exceptions
    failed = response.get('FailedEntryCount', 0)
    if failed > 0:
        errors = [entry for entry in response['Entries'] if 'ErrorCode' in entry]
        raise RuntimeError(f"EventBridge rejected {failed} entries: {errors}")

    return response

Always check FailedEntryCount on the response. put_events returns HTTP 200 even when individual entries fail. This is a well-known gotcha.
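The same check matters more once you publish in bulk, because put_events accepts at most 10 entries per call. The sketch below batches entries and collects the rejected ones so the caller can retry them; chunked and put_all are hypothetical helper names, and client is expected to be a boto3 EventBridge client.

```python
def chunked(entries: list, size: int = 10):
    """Yield successive slices of at most `size` entries."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]

def put_all(client, entries: list) -> list:
    """Send entries in batches of 10 and return the ones that were rejected."""
    rejected = []
    for batch in chunked(entries):
        response = client.put_events(Entries=batch)
        # Response entries line up with request entries by position.
        for entry, result in zip(batch, response["Entries"]):
            if "ErrorCode" in result:
                rejected.append(entry)
    return rejected
```

Retrying only the rejected entries, with backoff, avoids publishing duplicates of the ones that already succeeded.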

AWS Kinesis Data Streams: The Log

Kinesis is different in kind, not just degree. Like Apache Kafka, it is an immutable, ordered, partitioned log. Events are written to a shard, retained for 24 hours by default (up to 365 days with extended retention), and multiple consumers can read from the same stream independently without affecting each other.

When Kinesis is the right call:

  • High-throughput event ingestion: clickstream data, IoT sensor readings, application telemetry
  • Multiple independent consumers need to process the same events (analytics pipeline + real-time alerts + ML feature store, all reading the same stream)
  • Order matters and you need per-partition ordering guarantees at scale
  • You need event replay (debugging, backfilling a new consumer, disaster recovery)
  • Building a CDC (change data capture) pipeline from a database

Provisioned vs On-demand:

Provisioned capacity means you specify shard count upfront. Each shard handles up to 1 MB/s or 1,000 records/second of writes, and 2 MB/s of reads. On-demand scales automatically, by default up to 200 MB/s of writes per stream.

On-demand can cost several times more than well-sized provisioned capacity at sustained load, but it removes capacity planning headaches. Use on-demand for variable or unpredictable workloads, and provisioned when you have consistent throughput and want to control costs.
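For provisioned mode, the per-shard write limits turn sizing into simple arithmetic: you need enough shards to satisfy both the byte limit and the record limit. A rough sketch (shards_needed is a hypothetical helper, and it assumes traffic is spread evenly across shards):

```python
import math

def shards_needed(write_mb_per_s: float, records_per_s: float) -> int:
    """Minimum provisioned shards that satisfy both per-shard write limits."""
    by_bytes = math.ceil(write_mb_per_s / 1.0)      # 1 MB/s per shard
    by_records = math.ceil(records_per_s / 1000.0)  # 1,000 records/s per shard
    return max(1, by_bytes, by_records)
```

Many small records hit the record limit first; a few large records hit the byte limit first. Add headroom on top of the result for spikes.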

The Enhanced Fan-Out feature:

By default, all consumers on a stream share the 2 MB/s read limit per shard. With Enhanced Fan-Out, each registered consumer gets its own dedicated 2 MB/s per shard via push-based delivery (HTTP/2). This is what you need when you have 3+ consumers on a high-throughput stream.

Gotcha — Shard hot partitioning:

Kinesis distributes data across shards using a partition key you provide. If all your messages use the same partition key (e.g., a single tenant ID, or a constant string), all writes go to one shard and you immediately saturate it.

Distribute partition keys across your shard count. Use high-cardinality keys like userId, deviceId, or orderId. If you have no natural partition key, use a random value such as a UUID (accepting that you lose per-key ordering) or a composite key. Hot shards are a silent killer: Kinesis throttles writes with ProvisionedThroughputExceededException and your data pipeline backs up.
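You can see the effect locally. Kinesis takes the MD5 hash of the partition key as a 128-bit integer and routes the record to the shard whose hash key range contains it. The sketch below assumes the shards split the hash space evenly, which holds for a freshly created stream but not necessarily after resharding; shard_for is a hypothetical helper.

```python
import hashlib
from collections import Counter

def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a key to a shard index, assuming an evenly split hash space."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // 2**128

# One constant key: every record lands on the same shard.
hot = Counter(shard_for("tenant-1", 4) for _ in range(1000))

# High-cardinality keys: records spread across all shards.
spread = Counter(shard_for(f"user-{i}", 4) for i in range(1000))
```

Printing hot shows a single shard holding all 1,000 records, while spread shows every shard receiving a share.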

Gotcha — Iterator age is your SLA:

Monitor GetRecords.IteratorAgeMilliseconds. This metric is the age of the last record returned by a GetRecords call, in other words how far your consumers are behind the tip of the stream. If this number climbs, your consumers are falling behind, and once it approaches the retention period you start losing data. Alarm on this metric, not just on producer-side errors.

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='kinesis-consumer-falling-behind',
    MetricName='GetRecords.IteratorAgeMilliseconds',
    Namespace='AWS/Kinesis',
    Dimensions=[{'Name': 'StreamName', 'Value': 'my-event-stream'}],
    Period=60,
    EvaluationPeriods=3,
    Threshold=60000,  # 1 minute behind
    ComparisonOperator='GreaterThanThreshold',
    Statistic='Maximum',
    TreatMissingData='notBreaching'
)

Gotcha — Lambda + Kinesis parallelization:

Lambda can process records from a Kinesis stream, but by default it processes one batch per shard sequentially. For a 10-shard stream, you have at most 10 concurrent Lambda invocations. If your Lambda takes 5 seconds per batch, you process at most 2 batches per shard per 10 seconds.

Set ParallelizationFactor (1–10) to process multiple batches per shard concurrently. Lambda still keeps records with the same partition key in order. For high-throughput streams, this is almost always necessary.

KinesisEventSourceMapping:
  Type: AWS::Lambda::EventSourceMapping
  Properties:
    EventSourceArn: !GetAtt EventStream.Arn
    FunctionName: !GetAtt StreamProcessor.Arn
    StartingPosition: LATEST
    BatchSize: 100
    ParallelizationFactor: 5
    BisectBatchOnFunctionError: true  # Split failing batches to isolate bad records
    MaximumRetryAttempts: 3           # Default retries until the record expires, which blocks the shard
    DestinationConfig:
      OnFailure:
        Destination: !GetAtt StreamDLQ.Arn  # SQS queue or SNS topic; receives batch metadata, not the record payloads
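The concurrency ceiling is just shards times parallelization factor, and it is worth computing before you pick either number. A small sketch with hypothetical helper names:

```python
def max_concurrent_batches(shard_count: int, parallelization_factor: int = 1) -> int:
    """Upper bound on concurrent Lambda invocations for one stream consumer."""
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("ParallelizationFactor must be between 1 and 10")
    return shard_count * parallelization_factor

def batches_per_second(shard_count: int, parallelization_factor: int,
                       seconds_per_batch: float) -> float:
    """Best-case batch throughput if every invocation takes the same time."""
    return max_concurrent_batches(shard_count, parallelization_factor) / seconds_per_batch
```

For the 10-shard stream with 5-second batches, the default gives 2 batches per second overall; a factor of 5 raises that to 10, at the price of 50 concurrent invocations hitting whatever sits downstream.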

The Decision Matrix

Stop overthinking it. Run through these questions:

1. Do multiple independent consumers need to read the same events?

  • Yes → Kinesis, or fan-out (EventBridge or SNS) into one SQS queue per consumer
  • No → SQS or EventBridge

2. Is this high-throughput (>1,000 events/second sustained)?

  • Yes → Kinesis
  • No → SQS or EventBridge

3. Do you need event replay or time-travel reads?

  • Yes → Kinesis
  • No → SQS or EventBridge

4. Is this content-based routing between services with no ordering requirement?

  • Yes → EventBridge
  • No → SQS

5. Is this background task processing (one producer, one consumer pool)?

  • Yes → SQS
  • No → EventBridge or Kinesis

6. Are you integrating with AWS native events (CloudWatch, S3, EC2)?

  • Yes → EventBridge (default bus already has them)
  • No → your choice

Most CRUD-driven microservices land on EventBridge (domain events) + SQS (task queues). Data pipelines and analytics land on Kinesis. Simple job queues are SQS only.
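If it helps to see the questions as a single flow, here is one possible way to order them in code. The priority order is an assumption of this sketch (replay and throughput first, routing second, fan-out third), not a rule; pick_service and its parameters are made-up names.

```python
def pick_service(*, replay: bool = False, high_throughput: bool = False,
                 content_routing: bool = False, aws_native_events: bool = False,
                 multiple_consumers: bool = False) -> str:
    """Walk the decision questions and return a first-guess service."""
    if replay or high_throughput:
        return "Kinesis"
    if aws_native_events or content_routing:
        return "EventBridge"
    if multiple_consumers:
        return "Kinesis or fan-out to per-consumer SQS queues"
    return "SQS"
```

A plain background job queue answers "no" to everything and lands on SQS; anything that needs replay lands on Kinesis regardless of the other answers.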

What About SNS?

SNS didn’t make the headline because it’s rarely the right final answer on its own. SNS is a fan-out mechanism: it pushes to SQS queues, HTTP endpoints, Lambda, email. Use SNS when you need to push the same message to multiple SQS queues simultaneously. Use EventBridge when you need richer content-based routing rules or want to react to AWS service events. For new architectures, EventBridge largely supersedes SNS for internal event routing unless you specifically need SMS/email delivery.

Cost Realities

SQS: $0.40/million requests (Standard), $0.50/million (FIFO). Extremely cheap. Background job processing for most startups costs pennies.

EventBridge: $1.00/million custom events. Events you publish yourself cost money on any bus; AWS service events delivered to the default bus are free. At scale, the cost adds up: 100 million events/month is $100.

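Kinesis: Provisioned streams are charged by shard-hour ($0.015/shard/hour) plus PUT payload units. A 10-shard stream costs ~$108/month in shard hours before any data (10 × $0.015 × 720 hours). On-demand swaps the shard-hour charge for a per-stream hourly charge plus per-GB fees, which tends to be cheaper for spiky or uncertain traffic and more expensive at sustained high volume.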

For most applications: SQS is cheapest, EventBridge is mid-tier, Kinesis provisioned has fixed costs that only make sense at data-pipeline scale.
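The figures above are easy to sanity-check with a few lines of arithmetic. This sketch hard-codes the list prices quoted in this section and assumes a 30-day month; it ignores free tiers, data transfer, and Kinesis PUT payload units.

```python
HOURS_PER_MONTH = 24 * 30  # 720, assuming a 30-day month

def sqs_cost(requests: int, fifo: bool = False) -> float:
    """Monthly SQS request cost in dollars."""
    return requests / 1_000_000 * (0.50 if fifo else 0.40)

def eventbridge_cost(custom_events: int) -> float:
    """Monthly EventBridge cost for custom events in dollars."""
    return custom_events / 1_000_000 * 1.00

def kinesis_shard_cost(shards: int, price_per_shard_hour: float = 0.015) -> float:
    """Monthly provisioned shard-hour cost in dollars, before any data charges."""
    return shards * price_per_shard_hour * HOURS_PER_MONTH
```

At 100 million messages a month that is about $40 on SQS Standard and $100 on EventBridge, while a 10-shard Kinesis stream costs about $108 whether or not any data flows through it.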

Putting It Together: A Real Architecture

An e-commerce platform might look like this:

API Gateway
    │
    ▼
Order Service (Lambda)
    │
    ├──► SQS: payment-jobs        ──► Payment Lambda (background processing)
    │
    ├──► EventBridge: order-bus   ──► Inventory Service (update stock)
    │         │                   ──► Email Service (confirmation)
    │         │                   ──► Analytics SQS → Analytics Lambda
    │
    └──► Kinesis: order-stream    ──► Real-time dashboard consumer
                                  ──► ML feature pipeline consumer
                                  ──► Fraud detection consumer (Enhanced Fan-Out)

Payment processing goes to SQS because it’s a task queue — one payment job, one processor, with retry and DLQ semantics.

Domain events go to EventBridge because inventory, email, and analytics are independent concerns that shouldn’t know about each other.

The same order data goes to Kinesis because three different systems need to read it independently, in order, potentially replaying history.

This isn’t over-engineering. It’s using each tool where its data model matches the requirement.

The Bottom Line

SQS, EventBridge, and Kinesis aren’t competing for the same job — they model different problems. Use SQS for task distribution, EventBridge for domain event routing, and Kinesis when you need a durable log with multiple independent consumers. The biggest mistake architects make is picking one and trying to make it do everything.

Get the DLQs configured on day one. Enable long polling on SQS. Route EventBridge critical events through SQS for durability. Monitor Kinesis iterator age like it’s a health check. These aren’t advanced tips — they’re the baseline for running any of these services in production.
