On-Call Difficulty Ratings: A Simple Metric That Transforms Shift Management

Every SRE team tracks MTTR. Nobody tracks how the shift felt.

Your dashboards show alert volume, mean time to resolve, error budgets, and SLO burn rates. All system metrics. All important. All incomplete. They tell you everything about the infrastructure and nothing about the person who spent eight hours keeping it alive.

Missing from every on-call dashboard is a signal from the human who actually worked the shift. How hard was it? Was it manageable or overwhelming? Is this person approaching burnout — or have they already crossed that line? This article introduces a metric that fills that gap: the On-Call Difficulty Rating.

70% of SREs say on-call stress impacts burnout and attrition (Catchpoint SRE Report, 2025)

30% increase in operational toil in 2025 despite AI adoption (RunFrame State of Incident Management, 2026)

SRE dashboards that include a subjective shift-quality signal (based on review of standard SRE tooling)

The metrics everyone tracks

Before critiquing the status quo, it is worth acknowledging what teams already measure. The standard on-call metrics toolkit is well-established:

MTTR (Mean Time to Resolve): how quickly incidents get fixed once detected.
MTTD (Mean Time to Detect): how long before an issue is noticed, whether by monitoring or a human.
Alert volume per shift: the raw count of pages, grouped by severity.
After-hours interruption count: how many times an engineer is paged outside business hours.
SLO burn rate and error budget: how fast you are consuming your reliability budget.
Toil percentage: the share of an engineer's time spent on manual, repetitive operational work. Google's SRE book recommends keeping it below 50%.

These are all valuable. They measure system health, operational efficiency, and service reliability. They are also, without exception, system metrics. They measure what happened to the infrastructure.

They do not measure what happened to the engineer.

System metrics tell you the database was down for 47 minutes. They do not tell you:

The engineer was dealing with three concurrent issues, not just the database.
It was 3 AM and they had already been paged twice that night.
The runbook was outdated and they had to improvise a fix from memory.
They are now running on four hours of sleep with a full day of meetings ahead.

End-of-shift report templates from tools like Shiftbase, Smartsheet, and SafetyCulture capture what happened during a shift. None of them capture how it felt. They log incidents, tasks, and handover notes. They do not ask the engineer whether the shift was manageable or whether they are running on fumes.

Google's toil measurement framework uses quarterly surveys to gauge operational burden. Quarterly is too slow. Burnout does not build quarter by quarter — it builds shift by shift. By the time a quarterly survey detects a problem, the engineer is already updating their LinkedIn profile.

Introducing the On-Call Difficulty Rating (ODR)

The On-Call Difficulty Rating is a simple 1–5 scale captured at the end of every shift. One number. Five seconds to submit. A longitudinal signal that no other metric provides.

Rating	Label	Description
1	Quiet	No incidents. Routine monitoring.
2	Light	Minor issues, handled quickly. No stress.
3	Moderate	Required focused attention. Manageable but not trivial.
4	Heavy	Multiple incidents or complex troubleshooting. Stressful.
5	Critical	P1 incident, sleep disrupted, high cognitive load.

Over weeks and months, those ratings build a dataset that reveals patterns invisible to every other metric in your SRE toolkit.
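As a rough illustration, the scale can be modelled as a small enum plus a record type. This is a minimal sketch, not a prescribed schema: the names `Difficulty`, `ShiftRating`, and `record_rating` are hypothetical, and the sample engineer and date are made up.

```python
from dataclasses import dataclass
from datetime import date
from enum import IntEnum


class Difficulty(IntEnum):
    """The 1-5 scale from the rating table."""
    QUIET = 1
    LIGHT = 2
    MODERATE = 3
    HEAVY = 4
    CRITICAL = 5


@dataclass(frozen=True)
class ShiftRating:
    """One sign-off: who worked, when, for how long, and how hard it felt."""
    engineer: str
    shift_date: date
    hours: float
    difficulty: Difficulty


def record_rating(engineer: str, shift_date: date, hours: float, raw: int) -> ShiftRating:
    """Build a rating from a submitted number.

    Difficulty(raw) raises ValueError for anything outside 1-5,
    so invalid submissions never reach storage.
    """
    return ShiftRating(engineer, shift_date, hours, Difficulty(raw))


# Hypothetical sign-off: an eight-hour shift rated 4.
rating = record_rating("alice", date(2025, 3, 1), 8.0, 4)
print(rating.difficulty.name, int(rating.difficulty))  # HEAVY 4
```

Using an integer enum keeps the stored value a plain number while still rejecting out-of-range input at the boundary.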

The difficulty rating is not a performance review. It is an environmental signal. A shift rated 5/5 is not the engineer's fault — it is information about the system, the tooling, or the workload that needs attention. Treating it as a judgment of the engineer will destroy honest reporting immediately.

What the data enables

A single number per shift seems modest. But aggregated across engineers and weeks, it unlocks five capabilities that no combination of system metrics can provide.

Burnout early warning

If one engineer's average difficulty trends upward over four to six weeks, that is a leading indicator of burnout — before they hand in their notice. Alert volume alone does not catch this. An engineer might have low alert volume but consistently high difficulty because the alerts they receive are complex, the documentation is poor, or they are covering services they were not trained on.

A rolling four-week average of difficulty ratings per engineer can surface burnout risk that system metrics miss. When that average crosses 3.5, it is time for a conversation, not a quarterly survey.
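The rolling average and the 3.5 trigger could be computed along these lines. This is a sketch under assumptions: ratings are held as `(date, rating)` pairs, the window is exactly 28 days ending on the review date, and the function names and sample history are hypothetical.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Optional

THRESHOLD = 3.5              # conversation trigger named in the text
WINDOW = timedelta(weeks=4)  # the text allows four to six weeks


def rolling_average(ratings, as_of: date) -> Optional[float]:
    """Mean rating for shifts in the four weeks ending on `as_of` (inclusive)."""
    start = as_of - WINDOW
    recent = [score for day, score in ratings if start < day <= as_of]
    return float(mean(recent)) if recent else None


def needs_conversation(ratings, as_of: date) -> bool:
    """True when the rolling average is above the threshold."""
    average = rolling_average(ratings, as_of)
    return average is not None and average > THRESHOLD


# Hypothetical sign-off history for one engineer: (shift date, rating).
alice = [
    (date(2025, 1, 6), 2),   # outside the window on 2 March
    (date(2025, 2, 3), 3),
    (date(2025, 2, 10), 4),
    (date(2025, 2, 17), 4),
    (date(2025, 2, 24), 5),
]
print(rolling_average(alice, date(2025, 3, 2)))     # 4.0
print(needs_conversation(alice, date(2025, 3, 2)))  # True
```

Returning `None` when there are no shifts in the window avoids treating an engineer who was off rotation as a zero-difficulty data point.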

Fair burden distribution

Raw hours are a poor measure of on-call burden. Eight hours on a quiet Saturday is not the same as eight hours during a P1 on a Friday night. Multiply shift hours by difficulty rating and you get Difficulty-Weighted Hours (DWH) — a composite metric that captures both duration and intensity.

If one engineer's DWH is twice the team average, the schedule needs rebalancing — even if everyone worked the same number of hours. This is the foundation of fair on-call scheduling.
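A worked example of Difficulty-Weighted Hours may help. In this sketch the shift data is invented, the function names are hypothetical, and the team average is assumed to include the engineer being compared; a team could reasonably exclude them instead.

```python
from collections import defaultdict


def difficulty_weighted_hours(shifts):
    """Sum hours x difficulty per engineer.

    `shifts` is an iterable of (engineer, hours, difficulty) tuples.
    """
    totals = defaultdict(float)
    for engineer, hours, difficulty in shifts:
        totals[engineer] += hours * difficulty
    return dict(totals)


def overloaded(dwh, factor=2.0):
    """Engineers whose DWH is at least `factor` times the team average."""
    team_average = sum(dwh.values()) / len(dwh)
    return [name for name, value in dwh.items() if value >= factor * team_average]


# Everyone worked 16 hours, but the load was not equal.
shifts = [
    ("alice", 8, 5), ("alice", 8, 5),
    ("bob", 8, 1), ("bob", 8, 1),
    ("carol", 8, 1), ("carol", 8, 2),
    ("dave", 8, 2), ("dave", 8, 1),
]
dwh = difficulty_weighted_hours(shifts)
print(dwh)              # {'alice': 80.0, 'bob': 16.0, 'carol': 24.0, 'dave': 24.0}
print(overloaded(dwh))  # ['alice']
```

The team average here is 36 DWH, so alice's 80 is more than double it even though her raw hours match everyone else's.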

Staffing decisions

If average difficulty across the team trends from 2.5 to 3.5 over a quarter, that is a staffing signal — not an automation signal. It means your engineers are consistently dealing with harder shifts, and no amount of runbook automation will fix the fact that you need more people in the rotation.

Difficulty ratings give engineering managers concrete data for headcount requests. Instead of "the team feels stretched," you can say "average shift difficulty has increased 40% over the last quarter, and two engineers are consistently above 4.0."
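The figures in that headcount request can be derived directly from the ratings. The sketch below assumes "consistently above 4.0" means a quarterly mean above 4.0, which is one possible reading; the function names and the per-engineer data are hypothetical.

```python
from statistics import mean


def percent_change(previous: float, current: float) -> float:
    """Relative change between two averages, as a percentage."""
    return (current - previous) / previous * 100


def consistently_above(ratings_by_engineer, threshold=4.0):
    """Engineers whose mean rating over the period exceeds `threshold`."""
    return sorted(
        name
        for name, ratings in ratings_by_engineer.items()
        if ratings and mean(ratings) > threshold
    )


# Team average moving from 2.5 to 3.5, as in the example in the text.
print(round(percent_change(2.5, 3.5), 1))  # 40.0

# Hypothetical ratings for one quarter.
quarter = {
    "alice": [4, 5, 5, 4],
    "bob": [3, 3, 4, 3],
    "carol": [5, 4, 5, 5],
}
print(consistently_above(quarter))  # ['alice', 'carol']
```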

Compensation justification

On-call compensation is one of the most contentious topics in SRE. Some companies pay flat stipends. Others pay per-page. Many pay nothing. Difficulty ratings provide a data-driven argument for on-call stipends or comp time: "Our engineers averaged difficulty 3.8 last quarter. This is not a quiet rotation — it is a demanding second job that disrupts sleep, weekends, and personal time."

That argument is far harder to dismiss when it comes with three months of shift-level data.

New hire readiness

Track a new engineer's difficulty ratings during their shadow and reverse-shadow periods. If they consistently rate shifts 4–5 when peers covering the same shift rate 2–3, that is a training signal — not a performance signal. They need more context on the systems, better runbooks, or a longer shadow period. Not more pressure.
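One way to quantify that comparison is the average gap between the new hire's rating and the peer mean for the same shift slot. This is a sketch only: the slot labels, the `training_gap` name, and the 1.5-point flag level are assumptions chosen for illustration, not values from the text.

```python
from statistics import mean
from typing import Optional

GAP_FLAG = 1.5  # hypothetical level at which to extend shadowing


def training_gap(new_hire: dict, peers: dict) -> Optional[float]:
    """Mean of (new hire rating - peer mean) over slots both have rated.

    `new_hire` maps slot -> rating; `peers` maps slot -> list of peer ratings.
    Returns None when there is no overlap to compare.
    """
    gaps = [
        rating - mean(peers[slot])
        for slot, rating in new_hire.items()
        if peers.get(slot)
    ]
    return float(mean(gaps)) if gaps else None


# Hypothetical ratings during a shadow period.
new_hire = {"sat-night": 5, "mon-day": 4, "wed-day": 4}
peers = {"sat-night": [3, 3], "mon-day": [2, 2], "wed-day": [2, 3, 1]}

gap = training_gap(new_hire, peers)
print(gap)              # 2.0
print(gap >= GAP_FLAG)  # True
```

Comparing within the same slot controls for the shift itself being hard, so a large gap points at missing context or runbooks instead of at the rotation.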

For more on structured on-call onboarding, see our on-call onboarding guide.

How to implement difficulty ratings

Implementation is straightforward. The hard part is not the technology — it is the discipline to collect the data consistently and the culture to act on it without weaponising it.

Add a required 1–5 difficulty field to your end-of-shift sign-off. Make it mandatory — optional fields get skipped within a week.
Make it frictionless: one click, not a survey. If it takes more than ten seconds, adoption will collapse. A row of five buttons is ideal.
Store ratings alongside shift metadata — date, engineer, hours worked, incident count, and any handover notes. Context makes the data useful.
Review monthly: team average, individual trends, and high-difficulty patterns. Look for shifts, time slots, or services that consistently drive high ratings.
Act on the data: if average difficulty exceeds 3.5 for a month, something systemic needs to change — staffing, tooling, runbooks, or service ownership.
Never use individual ratings punitively. This is the single most important rule. The moment engineers believe their ratings will be used against them, they will game the system and the data becomes worthless.
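The storage and monthly review steps above can be sketched with Python's built-in `sqlite3` module. Everything named here is an assumption for illustration: the `shift_signoff` table, its columns, the `monthly_review` helper, and the sample rows. The mandatory 1-5 field is enforced with `NOT NULL` and a `CHECK` constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE shift_signoff (
        shift_date     TEXT    NOT NULL,  -- ISO date, e.g. 2025-03-01
        engineer       TEXT    NOT NULL,
        hours_worked   REAL    NOT NULL,
        incident_count INTEGER NOT NULL DEFAULT 0,
        handover_notes TEXT,
        difficulty     INTEGER NOT NULL CHECK (difficulty BETWEEN 1 AND 5)
    )
""")

# Hypothetical sign-offs.
rows = [
    ("2025-03-01", "alice", 8, 3, "DB failover", 5),
    ("2025-03-08", "alice", 8, 1, None, 4),
    ("2025-03-02", "bob", 8, 0, None, 1),
    ("2025-03-09", "bob", 8, 1, None, 2),
]
conn.executemany("INSERT INTO shift_signoff VALUES (?, ?, ?, ?, ?, ?)", rows)


def monthly_review(conn, month: str, threshold: float = 3.5) -> dict:
    """Team average, per-engineer averages, and an action flag for `month` (YYYY-MM)."""
    team_average = conn.execute(
        "SELECT AVG(difficulty) FROM shift_signoff WHERE substr(shift_date, 1, 7) = ?",
        (month,),
    ).fetchone()[0]
    per_engineer = dict(conn.execute(
        "SELECT engineer, AVG(difficulty) FROM shift_signoff "
        "WHERE substr(shift_date, 1, 7) = ? GROUP BY engineer",
        (month,),
    ).fetchall())
    return {
        "team_average": team_average,
        "per_engineer": per_engineer,
        "needs_action": team_average is not None and team_average > threshold,
    }


review = monthly_review(conn, "2025-03")
print(review["team_average"])  # 3.0
print(review["needs_action"])  # False
```

With these constraints, a sign-off with a missing or out-of-range rating is rejected with `sqlite3.IntegrityError`, which keeps the field mandatory at the storage layer as well as in the form.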

The analogy — lessons from healthcare shift scoring

This concept is not new. Nurses have used shift intensity scoring for decades. Patient acuity scores, numerical ratings of how sick each patient is, inform staffing ratios in many hospitals. Higher acuity means more nurses per patient. The math is simple and the logic is obvious: harder patients require more staff.

IT has no equivalent. A two-person on-call team covers the same services whether it is a quiet Tuesday or Black Friday. The staffing model does not flex with demand because there is no signal telling managers that demand has changed. Alert volume is a partial signal at best — it measures quantity, not cognitive load.

Difficulty ratings are the IT equivalent of patient acuity. They turn staffing from gut feel into data. A hospital that staffed every shift identically regardless of patient acuity would be considered negligent. Yet this is exactly how most engineering organisations run their on-call rotations.

You would not manage SLOs without SLIs. Do not manage on-call burden without a human signal.

Track the metric nobody else tracks

Shiftctl captures difficulty ratings at every shift sign-off, surfaces trends in analytics, and flags engineers approaching burnout — automatically. Free for 2 users. No credit card required.

Frequently asked questions

What on-call metrics should you track?

At minimum, track MTTR (Mean Time to Resolve), MTTD (Mean Time to Detect), alert volume per shift, after-hours interruption count, SLO burn rate, and toil percentage. These cover system health. To cover engineer health, add a subjective difficulty rating at every shift sign-off and track Difficulty-Weighted Hours (shift hours multiplied by difficulty rating) for fair burden distribution.

How do you measure on-call quality?

On-call quality has two dimensions: system quality (measured by MTTR, SLO adherence, and incident recurrence) and human quality (measured by difficulty ratings, sleep disruption frequency, and burden distribution across the team). Most teams only measure the first dimension. Adding a 1–5 difficulty rating at shift end captures the second.

What is an On-Call Difficulty Rating?

An On-Call Difficulty Rating (ODR) is a 1–5 scale submitted by the on-call engineer at the end of every shift. It captures a subjective assessment of how demanding the shift was — from 1 (quiet, routine monitoring) to 5 (critical incident, sleep disrupted, high cognitive load). It takes under ten seconds to submit and creates a longitudinal dataset for burnout detection, fair scheduling, and staffing decisions.

How do difficulty ratings prevent burnout?

A rolling average of an engineer's difficulty ratings over four to six weeks acts as an early warning system. If their average trends upward — especially past 3.5 — it indicates increasing strain before the engineer reaches the point of resignation. Alert volume alone misses this because low-volume shifts can still be highly demanding if the incidents are complex or the tooling is poor.

Should difficulty ratings affect compensation?

Difficulty ratings provide a data-driven foundation for on-call compensation discussions. Teams with consistently high average ratings (above 3.5) have a strong case for stipends, comp time, or reduced non-on-call workload. The data makes the argument concrete: instead of "on-call is hard," you can show that engineers averaged a specific difficulty level over a measurable period.

How often should you review on-call metrics?

Review system metrics (MTTR, alert volume) weekly. Review difficulty ratings and burden distribution monthly. Quarterly reviews are too slow — burnout builds shift by shift, not quarter by quarter. Monthly reviews let you catch trends early and adjust staffing, rotations, or tooling before problems compound.

Can difficulty ratings replace alert-volume tracking?

No. Difficulty ratings complement alert-volume tracking — they do not replace it. Alert volume is an objective system metric. Difficulty is a subjective human metric. Together they reveal patterns that neither captures alone: a shift with low alert volume but high difficulty exposes complex incidents or poor tooling. A shift with high alert volume but low difficulty exposes noisy but trivial alerts that could be suppressed.
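The two patterns described in that answer can be sketched as a simple classification over both metrics. The cut-offs (10 alerts, rating 4) and the function name are hypothetical; each team would pick its own.

```python
def classify_shift(alert_count: int, difficulty: int,
                   high_alerts: int = 10, high_difficulty: int = 4) -> str:
    """Combine alert volume and difficulty rating into one of four patterns."""
    noisy = alert_count >= high_alerts
    hard = difficulty >= high_difficulty
    if hard and not noisy:
        return "complex incidents or poor tooling"
    if noisy and not hard:
        return "noisy but trivial alerts"
    if noisy and hard:
        return "heavy shift"
    return "quiet shift"


print(classify_shift(alert_count=2, difficulty=5))   # complex incidents or poor tooling
print(classify_shift(alert_count=25, difficulty=2))  # noisy but trivial alerts
```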