Most teams design their on-call rotation the way they design their first Kubernetes cluster — copy something from the internet, deploy it, and spend the next six months putting out fires caused by the decisions they didn’t think through. The difference is that a bad K8s config breaks your app. A bad on-call schedule breaks your engineers.
This article is about getting the schedule right before it breaks people. We’ll go through the three dominant patterns — 12-hour, 24-hour, and follow-the-sun — with honest tradeoffs, configuration examples for real tools, and a set of hard-won opinions about what works at different team sizes and incident volumes.
No fluff. If your team is currently doing 24-hour rotations with six engineers and wondering why everyone’s miserable, you’ll know why by the end of this.
Why Rotation Design Actually Matters
On-call fatigue is one of the leading causes of senior engineer attrition. Not "burnout" in the abstract — actual, measurable churn caused by being woken up at 3am too often, carrying a pager for too many consecutive days, and never feeling truly off.
The cost of a bad rotation isn’t just morale. It’s degraded incident response. A tired engineer debugging a production outage at 4am makes worse decisions than one who slept. They miss things, take longer, and sometimes make the incident worse. Mean Time To Resolution (MTTR) tends to track how rested the responder is.
The goal of rotation design is simple: maximum system coverage, minimum human suffering. Every decision you make should be evaluated against that axis.
The Three Patterns
12-Hour Rotations
Two shifts per day: day shift and night shift. Each engineer owns a 12-hour window, hands off to the next person, and is completely off the hook for the other 12 hours.
This is the most common pattern for teams that have enough headcount to make it work (roughly 4+ engineers per rotation), operate in a single timezone, and deal with a non-trivial incident volume.
How it looks in practice:
- Day shift: 08:00–20:00 local time
- Night shift: 20:00–08:00 local time
- Rotation: each engineer takes one or two consecutive shifts, then rotates out
The appeal is obvious. Engineers actually sleep. When your night-shift person wakes up at 6am, they have two hours before handoff — enough time to review what happened overnight, write up any lingering issues, and hand over a clean state. The day shift person comes in rested and informed.
The math on sustainability: Two 12-hour shifts a day means 14 shifts a week to fill. With 8 engineers, each one carries about 1.75 shifts (21 hours) a week: one 12-hour shift every four days if you run separate day and night pools of four, or the equivalent of one 24-hour period every eight days. That’s manageable. With 4 engineers, the same coverage means 3.5 shifts (42 hours) per engineer per week, the equivalent of one full week out of every four. That is tolerable if incidents are rare, brutal if they’re not.
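The load per engineer is plain arithmetic, and it is worth running for your own headcount before committing to a pattern. A minimal sketch, assuming round-the-clock coverage with one engineer on call at a time; the function name and return keys are illustrative, not part of any tool:

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours of coverage to fill every week


def weekly_load(engineers: int, shift_hours: int = 12) -> dict:
    """Average on-call load per engineer for 24/7 single-person coverage."""
    if engineers <= 0:
        raise ValueError("need at least one engineer")
    shifts_to_fill = HOURS_PER_WEEK / shift_hours
    return {
        "shifts_per_week": shifts_to_fill / engineers,
        "hours_per_week": HOURS_PER_WEEK / engineers,
        "on_call_fraction": 1 / engineers,
    }


print(weekly_load(8))  # 1.75 shifts, 21 hours, 12.5% of the week
print(weekly_load(4))  # 3.5 shifts, 42 hours, 25% of the week
```

These are averages. Vacations, sick days, and shift swaps push the real number up for whoever is left, so treat the result as a floor.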
PagerDuty schedule (Terraform HCL, pagerduty provider):
# 12-hour rotation schedule
resource "pagerduty_schedule" "primary_day" {
  name      = "Primary On-Call - Day Shift"
  time_zone = "Europe/Berlin"

  layer {
    name                         = "Day Shift 08:00-20:00"
    start                        = "2026-05-19T08:00:00+02:00"
    rotation_virtual_start       = "2026-05-19T08:00:00+02:00"
    rotation_turn_length_seconds = 86400 # 24h per person before rotating

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "08:00:00"
      duration_seconds  = 43200 # 12 hours
    }

    users = [
      pagerduty_user.alice.id,
      pagerduty_user.bob.id,
      pagerduty_user.carol.id,
      pagerduty_user.dan.id,
    ]
  }
}

resource "pagerduty_schedule" "primary_night" {
  name      = "Primary On-Call - Night Shift"
  time_zone = "Europe/Berlin"

  layer {
    name                         = "Night Shift 20:00-08:00"
    start                        = "2026-05-19T20:00:00+02:00"
    rotation_virtual_start       = "2026-05-19T20:00:00+02:00"
    rotation_turn_length_seconds = 86400

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "20:00:00"
      duration_seconds  = 43200
    }

    users = [
      pagerduty_user.eve.id,
      pagerduty_user.frank.id,
      pagerduty_user.grace.id,
      pagerduty_user.hank.id,
    ]
  }
}
Gotcha: Don’t assume day and night can be the same pool of engineers rotating through. If Alice does the 20:00–08:00 night shift on Monday and you then rotate her onto the 08:00–20:00 day shift on Tuesday, she’s effectively on call for 24 hours straight. Add the following night and it becomes 36. Use separate pools or enforce minimum rest gaps.
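A rest-gap rule is easy to state and easy to violate by accident, so it helps to check the schedule mechanically. A minimal sketch, assuming you can export assignments as (engineer, start, end) tuples; the `Shift` type and the 12-hour default are assumptions for illustration, not a feature of PagerDuty or OpsGenie:

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Shift(NamedTuple):
    engineer: str
    start: datetime
    end: datetime


def rest_gap_violations(shifts, min_rest=timedelta(hours=12)):
    """Return (engineer, previous_end, next_start) for every too-short rest gap."""
    violations = []
    last_shift = {}
    for shift in sorted(shifts, key=lambda s: s.start):
        previous = last_shift.get(shift.engineer)
        if previous is not None and shift.start - previous.end < min_rest:
            violations.append((shift.engineer, previous.end, shift.start))
        last_shift[shift.engineer] = shift
    return violations


schedule = [
    Shift("alice", datetime(2026, 5, 18, 20), datetime(2026, 5, 19, 8)),  # Monday night
    Shift("alice", datetime(2026, 5, 19, 8), datetime(2026, 5, 19, 20)),  # Tuesday day
    Shift("bob", datetime(2026, 5, 19, 20), datetime(2026, 5, 20, 8)),    # Tuesday night
]

for engineer, ended, starts in rest_gap_violations(schedule):
    print(f"{engineer}: shift ending {ended} is followed by one starting {starts}")
```

Run it against the exported schedule before every publish. Alice’s back-to-back night and day shifts are flagged; Bob’s single shift is not.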
24-Hour Rotations
One engineer, one day. They’re on from midnight to midnight, responsible for everything that fires.
This is the default pattern for smaller teams (2–4 engineers), early-stage startups, and any situation where you simply don’t have enough bodies to split shifts. It’s also frequently used by teams that have low incident volume — if you’re woken up twice a month, a 24-hour window doesn’t feel that different from a 12-hour one.
The weekly variant is more extreme: one engineer owns an entire week. This is common in very small teams or when rotation complexity is too high to manage. It’s also where engineers start updating their LinkedIn profiles.
When 24-hour works:
- P0/P1 rate is genuinely low (fewer than 2–3 per week hitting on-call)
- Average incident resolution is under 30 minutes
- Engineers are well-compensated for on-call hours
- Clear escalation path exists so the on-call engineer isn’t actually alone
When 24-hour destroys teams:
- High-traffic services with frequent alerting
- Alert fatigue setting in — engineers start silencing pages
- No escalation, meaning one engineer absorbs everything
- Consecutive shifts (one engineer does Monday, then does Tuesday)
The sleep problem is real. A person woken at 2am, 3:30am, and 5am is not going to perform at full capacity by 9am. Sleep-deprivation research, including work on medical residents and other shift workers, consistently finds that cognitive performance degrades sharply after roughly 17–19 hours of wakefulness. Your on-call engineer debugging a cascade failure on hour 23 of their shift is operating with significantly impaired judgment.
OpsGenie rotation config (API payload):
{
  "name": "Primary 24hr Rotation",
  "startDate": "2026-05-19T00:00:00Z",
  "type": "daily",
  "length": 1,
  "participants": [
    { "type": "user", "username": "[email protected]" },
    { "type": "user", "username": "[email protected]" },
    { "type": "user", "username": "[email protected]" }
  ]
}
Gotcha: A 24-hour rotation without a defined escalation policy is just one engineer holding a grenade. Configure automatic escalation if a page isn’t acknowledged within 5 minutes. The on-call person should not be the last line of defense — they should be the first.
Follow-the-Sun
This is where things get interesting — and complicated. Follow-the-sun (FTS) is a pattern for geographically distributed teams where on-call coverage literally follows daylight around the globe. The Asia-Pacific team handles incidents during APAC business hours, hands off to EMEA at their start of day, who hands off to Americas, who hands off back to APAC.
Done well, it means nobody does true night-shift work. Every region covers their own daytime hours. The handoff happens at natural shift boundaries.
The minimum viable FTS setup requires teams in at least two significantly separated timezones — ideally 6+ hours apart. Common splits:
- EMEA (UTC+1 to UTC+3) + Americas (UTC-5 to UTC-8): 6–11 hour gap
- APAC (UTC+8 to UTC+10) + EMEA + Americas: full 24hr coverage across three teams
Each team needs at least 2–3 engineers, because a "team" of one engineer per region gives you zero redundancy and vacation becomes a scheduling nightmare.
The handoff is the critical failure point. Unlike a 12-hour rotation, where both people might be in the same city and the same timezone, FTS handoffs are asynchronous by nature. The outgoing team is wrapping up its day while the incoming team is just starting theirs, and a live call often isn’t going to happen. Everything depends on written handoff quality.
Handoff runbook template (Markdown, stored in your incident management system):
## On-Call Handoff — EMEA → Americas
**Date:** 2026-05-24
**Outgoing:** @emea-oncall
**Incoming:** @amer-oncall
### Active Incidents
| ID | Service | Status | Last Action |
|----------|----------------|-------------|--------------------------|
| INC-4521 | payment-svc | Monitoring | Rollback deployed 14:32 |
### Degraded Services (not incident-level)
- `cache-cluster-eu`: elevated eviction rate (~40% above baseline), watching
### Alerts Silenced / Suppressed
- `disk-usage-warn` on `batch-worker-03`: filling during nightly job, resolves by 06:00 UTC. Silence expires automatically.
### Things That Need Attention Next 8 Hours
- Deploy of `auth-service` v2.3.1 is scheduled for 19:00 UTC. Rollback plan: [link to runbook]
- EU maintenance window ends 21:00 UTC — re-enable health checks on `lb-eu-02`
### Context / Notes
- Product team is monitoring the payment rollback closely. Ping @pm-payment if you make any changes.
PagerDuty FTS configuration (three regions):
# APAC schedule: covers 00:00–08:00 UTC (08:00–16:00 SGT)
resource "pagerduty_schedule" "fts_apac" {
  name      = "Follow-the-Sun - APAC"
  time_zone = "Etc/UTC" # restrictions are defined in UTC to avoid DST drift

  layer {
    name                         = "APAC Coverage"
    start                        = "2026-05-19T00:00:00Z"
    rotation_virtual_start       = "2026-05-19T00:00:00Z"
    rotation_turn_length_seconds = 604800 # weekly rotation within region

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "00:00:00"
      duration_seconds  = 28800 # 8 hours: 00:00-08:00 UTC
    }

    users = [
      pagerduty_user.apac_eng_1.id,
      pagerduty_user.apac_eng_2.id,
    ]
  }
}

# EMEA schedule: covers 08:00–16:00 UTC
resource "pagerduty_schedule" "fts_emea" {
  name      = "Follow-the-Sun - EMEA"
  time_zone = "Etc/UTC" # not Europe/Berlin: local time would shift the window at DST changes

  layer {
    name                         = "EMEA Coverage"
    start                        = "2026-05-19T08:00:00Z"
    rotation_virtual_start       = "2026-05-19T08:00:00Z"
    rotation_turn_length_seconds = 604800

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "08:00:00"
      duration_seconds  = 28800 # 08:00-16:00 UTC
    }

    users = [
      pagerduty_user.emea_eng_1.id,
      pagerduty_user.emea_eng_2.id,
    ]
  }
}

# Americas schedule: covers 16:00–00:00 UTC
resource "pagerduty_schedule" "fts_americas" {
  name      = "Follow-the-Sun - Americas"
  time_zone = "Etc/UTC" # not America/New_York, for the same reason

  layer {
    name                         = "Americas Coverage"
    start                        = "2026-05-19T16:00:00Z"
    rotation_virtual_start       = "2026-05-19T16:00:00Z"
    rotation_turn_length_seconds = 604800

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "16:00:00"
      duration_seconds  = 28800 # 16:00-00:00 UTC
    }

    users = [
      pagerduty_user.amer_eng_1.id,
      pagerduty_user.amer_eng_2.id,
    ]
  }
}
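Three schedules that each look right can still leave a hole between them. A minimal sketch of a coverage check over the UTC windows named in the comments above (00:00–08:00, 08:00–16:00, 16:00–00:00), assuming whole-hour boundaries; the function and the dictionary layout are illustrative:

```python
def coverage_report(windows):
    """windows maps region -> (start_hour_utc, duration_hours).

    Returns (gaps, overlaps): the UTC hours covered by nobody and by more
    than one region.
    """
    counts = [0] * 24
    for start, duration in windows.values():
        for offset in range(duration):
            counts[(start + offset) % 24] += 1
    gaps = [hour for hour, count in enumerate(counts) if count == 0]
    overlaps = [hour for hour, count in enumerate(counts) if count > 1]
    return gaps, overlaps


fts = {"apac": (0, 8), "emea": (8, 8), "americas": (16, 8)}
print(coverage_report(fts))  # no gaps, no overlaps

# The same setup with EMEA accidentally starting an hour early:
drifted = {"apac": (0, 8), "emea": (7, 8), "americas": (16, 8)}
print(coverage_report(drifted))  # hour 15 uncovered, hour 7 double-covered
```

A deliberate overlap of an hour at each boundary is a reasonable design choice for handoffs. A gap never is.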
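Gotcha (the timezone math trap): When configuring FTS schedules in PagerDuty or OpsGenie, set your restrictions in UTC. Local time is seductive but will burn you during DST transitions. If EMEA handles 08:00–16:00 UTC and you configure it as 09:00–17:00 Central European local time, the window lines up in winter (CET, UTC+1) but slides to 07:00–15:00 UTC in summer (CEST, UTC+2). Every clock change opens a one-hour gap on one boundary and a one-hour overlap on the other, and it stays that way until someone notices. Use UTC. Always.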
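The drift is easy to see with two dates. A minimal sketch using fixed UTC offsets for Central European winter and summer time, so it runs without a timezone database; the helper name is illustrative:

```python
from datetime import datetime, timedelta, timezone

CET = timezone(timedelta(hours=1), "CET")    # Central European winter time
CEST = timezone(timedelta(hours=2), "CEST")  # Central European summer time


def local_start_as_utc_hour(year, month, day, local_hour, tz):
    """Convert a local shift start into the UTC hour it actually begins."""
    local = datetime(year, month, day, local_hour, tzinfo=tz)
    return local.astimezone(timezone.utc).hour


winter = local_start_as_utc_hour(2026, 1, 15, 9, CET)
summer = local_start_as_utc_hour(2026, 7, 15, 9, CEST)
print(winter, summer)  # 8 7
```

Same 09:00 on the wall clock, two different UTC start times. A neighbouring region pinned to UTC, or to a timezone without DST, will not move with it.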
Gotcha — the phantom handoff: FTS creates an illusion of coverage. The incoming engineer needs context they don’t have. If your handoff documentation is "no active incidents," but there’s a memory leak slowly building on a service that the outgoing engineer noticed but didn’t write down, the next team is flying blind. Make the handoff document a mandatory part of rotation exit, not optional.
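Making the handoff mandatory works better when something checks it. A minimal sketch of a gate that refuses a handoff document with missing or empty sections, assuming the document uses the `###` section headings from the template earlier in this article; the function name is illustrative and the check is deliberately simple:

```python
REQUIRED_SECTIONS = [
    "Active Incidents",
    "Degraded Services",
    "Alerts Silenced / Suppressed",
    "Things That Need Attention Next 8 Hours",
    "Context / Notes",
]


def missing_sections(handoff_markdown: str) -> list:
    """Return required sections that are absent or have no content."""
    bodies = {}
    current = None
    for line in handoff_markdown.splitlines():
        if line.startswith("### "):
            current = line[4:].strip()
            bodies[current] = []
        elif current is not None and line.strip():
            bodies[current].append(line.strip())
    missing = []
    for required in REQUIRED_SECTIONS:
        # Prefix match so "Degraded Services (not incident-level)" still counts.
        matches = [body for title, body in bodies.items() if title.startswith(required)]
        if not any(matches):
            missing.append(required)
    return missing


draft = """### Active Incidents
None
### Degraded Services (not incident-level)
- cache-cluster-eu: elevated eviction rate
### Alerts Silenced / Suppressed

### Context / Notes
- Payment rollback is being watched
"""
print(missing_sections(draft))
```

An explicit "None" passes the check on purpose: the outgoing engineer has to write it down rather than leave a blank, which is the whole point. It cannot catch the memory leak nobody mentioned, but it does force the question to be asked.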
Hybrid Patterns Worth Knowing
Primary + Secondary: A single on-call rotation (any pattern) with a secondary escalation layer. If the primary doesn’t ack within N minutes, it pages the secondary. Good for 12-hour and 24-hour setups where a single engineer shouldn’t be the only line of defense. Cheap to implement, high safety margin.
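The escalation logic is simple enough to model before you configure it. A minimal sketch, assuming each layer either acknowledges within the timeout or the page moves on; the 5-minute default matches the threshold suggested earlier, and the function and layer names are illustrative:

```python
def paged_layers(ack_delay_minutes, layers, escalate_after=5):
    """Return the layers that get paged before someone acknowledges.

    ack_delay_minutes maps a layer name to how long that layer takes to ack
    after being paged; None (or a missing entry) means they never ack.
    """
    paged = []
    for layer in layers:
        paged.append(layer)
        delay = ack_delay_minutes.get(layer)
        if delay is not None and delay < escalate_after:
            break  # acknowledged in time, stop escalating
    return paged


chain = ["primary", "secondary"]
print(paged_layers({"primary": 2}, chain))                     # ['primary']
print(paged_layers({"primary": None, "secondary": 1}, chain))  # ['primary', 'secondary']
```

If the second call returns the full chain and nobody acked, that is the scenario to worry about: it means the chain needs a third layer or a manager at the end of it.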
Tiered severity routing: Not all alerts need to wake someone up. Route P3/P4 alerts to a ticket queue with a 4-hour SLA, P2 to a Slack channel, and only P0/P1 to the on-call pager. For a noisy team this alone can plausibly cut nighttime pages by 60–70%, though the real number depends on how your alerts are distributed across severities. The rotation pattern matters less if you’ve already killed alert noise.
# OpsGenie escalation policy
escalationPolicies:
  - name: "Severity-Based Escalation"
    rules:
      - condition:
          field: priority
          operation: equals
          expectedValue: P1
        notifyType: default
        delay: 0
        recipients:
          - type: schedule
            name: "Primary On-Call"
      - condition:
          field: priority
          operation: equals
          expectedValue: P1
        notifyType: default
        delay: 5 # minutes, if not acked
        recipients:
          - type: schedule
            name: "Secondary On-Call"
      - condition:
          field: priority
          operation: equals
          expectedValue: P2
        notifyType: default
        delay: 0
        recipients:
          - type: team
            name: "Engineering"
            notificationMethod: email
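The routing rule itself fits in a lookup table, which makes it easy to replay last month’s alerts and see how many would still have paged someone. A minimal sketch, assuming priorities are labelled P0 to P4; the destination names and the choice to page on unknown priorities are assumptions for illustration:

```python
ROUTES = {
    "P0": "pager",
    "P1": "pager",
    "P2": "slack",
    "P3": "ticket",
    "P4": "ticket",
}


def route_alert(priority: str) -> str:
    """Pick a destination for an alert based on its priority label."""
    # Unknown priorities page a human rather than vanish silently.
    return ROUTES.get(priority.upper(), "pager")


def pages_after_routing(priorities) -> int:
    """Count how many alerts would still reach the pager."""
    return sum(1 for priority in priorities if route_alert(priority) == "pager")


last_night = ["P1", "P2", "P3", "P3", "P4"]
print(pages_after_routing(last_night), "of", len(last_night), "alerts page on-call")
```

Failing open to the pager is the conservative default. If mislabelled alerts are common, fix the labels rather than flipping that default.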
How to Actually Choose
Here’s the decision tree I’d use:
Team < 4 engineers: 24-hour rotation. You don’t have the headcount for anything else. Invest in reducing alert volume and write good runbooks so the on-call engineer isn’t reinventing the wheel at 2am.
Team 4–10 engineers, single timezone: 12-hour rotation. Day/night split with defined shift boundaries and a hard rule about rest gaps between shifts. Compensate night shifts appropriately, with money or comp time.
Team 10+ engineers, single timezone: 12-hour rotation with a secondary layer. Rotate the secondary separately so engineers aren’t simultaneously primary and secondary.
Team distributed across 2+ timezones, 6+ hour gap: FTS is viable. Start with two regions before adding a third — each region boundary you add is another handoff to get wrong. Build the handoff template before you launch the rotation.
Team distributed globally, 24/7 coverage required: Full three-region FTS. Accept that you need at least 2 engineers per region to sustain vacations and sick days. Build escalation into each region’s schedule, not globally.
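The same decision tree, written down as a function so it can sit next to the team’s on-call docs. A rough sketch only: the parameter names are hypothetical, a team of exactly 10 is treated as still in the 4–10 band, and real decisions should also weigh incident volume, which this ignores:

```python
def choose_rotation(engineers: int, timezone_gap_hours: int = 0,
                    needs_global_24x7: bool = False) -> str:
    """Map team shape to a starting rotation pattern."""
    if timezone_gap_hours >= 6:
        if needs_global_24x7:
            return "three-region follow-the-sun"
        return "two-region follow-the-sun"
    if engineers < 4:
        return "24-hour"
    if engineers <= 10:
        return "12-hour"
    return "12-hour with secondary layer"


print(choose_rotation(3))                         # 24-hour
print(choose_rotation(6))                         # 12-hour
print(choose_rotation(14))                        # 12-hour with secondary layer
print(choose_rotation(8, timezone_gap_hours=7))   # two-region follow-the-sun
```

Treat the output as a starting point for the conversation, not the answer.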
Production-Ready Practices (Not Optional)
Compensation policy must exist before you go live. Engineers on-call at 3am are providing a real service. Whether it’s flat on-call pay, per-incident pay, or comp time, this needs to be defined and enforced. Teams without compensation policies tend to see higher turnover in on-call roles. That’s not conjecture; it’s a pattern that shows up again and again.
Define what "off-call" actually means. In a 12-hour rotation, the off-call engineer should have exactly zero obligation to the pager. No "can you just take a quick look" messages. This has to be cultural and enforced by management.
Incident review loop is non-negotiable. After every P0/P1, write a post-mortem. Not a blame log — a factual timeline with contributing factors and action items. The on-call rotation is part of the response infrastructure, and if the rotation design is creating problems (slow response due to handoff confusion, exhausted engineers missing things), that should surface in reviews.
Keep your rotation roster current. Stale schedules are more common than you’d think. Engineer leaves, schedule isn’t updated, PagerDuty pages a ghost. Audit rotation membership monthly. Automate it if you can — pull roster from your HRIS or directory and diff against the schedule.
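The diff itself is two set operations. A minimal sketch, assuming you have already pulled one list of identifiers from your directory and one from the schedule; how you fetch them depends on your HRIS and paging tool and is left out, and the function name is illustrative:

```python
def audit_roster(directory_users, schedule_users):
    """Compare the people who should be on call with the people who are."""
    directory = set(directory_users)
    scheduled = set(schedule_users)
    return {
        # On the schedule but no longer in the directory: pages go nowhere.
        "ghosts": sorted(scheduled - directory),
        # In the team but on no rotation: someone else is carrying their share.
        "unscheduled": sorted(directory - scheduled),
    }


report = audit_roster(
    directory_users=["alice", "bob", "carol"],
    schedule_users=["alice", "bob", "dan"],
)
print(report)  # {'ghosts': ['dan'], 'unscheduled': ['carol']}
```

Run it on a schedule and alert on any non-empty "ghosts" list. That one is the outage waiting to happen.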
Test your escalation chain quarterly. Fire a test P1 outside business hours. Does it page the right person? Does escalation kick in at the right threshold? Does your secondary actually receive the page? Find out in a drill, not during an actual outage.
The Honest Verdict
24-hour rotations work fine for small teams with low incident volume. The moment your P0/P1 rate climbs above roughly two or three per week hitting the pager at night, you need to either switch to 12-hour shifts or aggressively reduce alert noise. Probably both.
Follow-the-sun is genuinely good for distributed teams, but most implementations underinvest in handoff quality. The technical schedule configuration is the easy part. Getting engineers in different timezones and cultures to write and read thorough handoff documents consistently is the hard part.
12-hour rotations are the sweet spot for most medium-sized engineering teams — enough coverage structure to protect sleep, clear enough to manage without a scheduling PhD. The night shift is still annoying, but it’s bounded and predictable.
The worst rotation I’ve seen in practice: week-long 24/7 shifts with three engineers, no escalation policy, no compensation, and a service firing 15 P2 alerts a week. Two of the three engineers quit within six months. Don’t do that.