SRE Starts With One Question, What Does Reliable Mean

Most teams cannot answer a simple question about their own service: how reliable is it supposed to be? They have a vague sense that it should "be up," and a much sharper sense of panic when it is not. Site Reliability Engineering, usually shortened to SRE, starts by replacing that vague sense with a number everyone agrees on. You do not need a dedicated team or a 24/7 rota to begin. You need one good SLO and the discipline to manage it. Here is how to get there.

SLI, SLO And Error Budget In Plain Terms

Three terms do most of the work in SRE, and they are simpler than they sound.

SLI, Service Level Indicator. A thing you measure about the service from the user's point of view. Common ones: the percentage of requests that succeed, or the percentage served faster than some threshold.
SLO, Service Level Objective. The target you hold that indicator to, over a window. For example, 99.9 percent of requests succeed over a rolling 30 days.
Error budget. The amount of failure the SLO permits. If your SLO is 99.9 percent, your error budget is 0.1 percent of requests. Expressed as time, 0.1 percent of 30 days is about 43 minutes of complete unavailability. Spend it how you like.

That last point is the one that changes behaviour. The error budget is not a failure to be ashamed of. It is a resource you spend deliberately, on bold deploys when the budget is healthy and on stability when it is running low.
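The arithmetic behind the 43-minute figure is short enough to sketch in Python. This is a minimal illustration rather than a standard tool: the function name error_budget is made up for this example, and it expresses the budget as time, which assumes traffic is spread roughly evenly across the window.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the unreliability an SLO permits over a window, as time.

    Hypothetical helper for illustration only.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be greater than 0 and at most 100")
    # The budget is whatever the SLO leaves over: 100 - 99.9 = 0.1 percent.
    budget_fraction = (100 - slo_percent) / 100
    return window * budget_fraction


budget = error_budget(99.9, timedelta(days=30))
minutes = round(budget.total_seconds() / 60, 1)
print(minutes)  # about 43.2 minutes for 99.9 percent over 30 days
```

Running the same function with 99.99 percent gives roughly four minutes a month, which is why each extra nine is so much harder to hold.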

Picking Your First SLO

Do not try to put an SLO on everything. Pick the one user-facing journey that matters most, usually your core request path, and start there. A reasonable first availability SLO looks like this:

Field	Example
Service	Checkout API
SLI	Percentage of requests with a non-5xx response
SLO	99.9 percent over a rolling 30 days
Error budget	0.1 percent of requests, equivalent to about 43 minutes of full downtime per 30 days
Measured at	The load balancer, per request

Two rules keep you out of trouble. First, set the SLO at a level you can actually meet most months, not an aspirational 99.99 percent you will breach constantly, because an SLO you always miss teaches the team to ignore it. Second, measure it as close to the user as you reasonably can, so the number reflects what people actually experience.
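As a rough sketch of how the SLI in the table could be computed, the Python below counts non-5xx responses from a list of HTTP status codes. The function names and the in-memory list are assumptions for illustration; in practice the counts would come from your load balancer logs or metrics rather than a Python list.

```python
from typing import Iterable


def availability_sli(status_codes: Iterable[int]) -> float:
    """Percentage of requests that returned a non-5xx response."""
    total = 0
    good = 0
    for code in status_codes:
        total += 1
        # Only server errors (500-599) count against availability here.
        if not 500 <= code <= 599:
            good += 1
    if total == 0:
        raise ValueError("no requests in the window, SLI is undefined")
    return 100 * good / total


def budget_consumed(sli_percent: float, slo_percent: float) -> float:
    """Fraction of the error budget used; above 1.0 means the SLO is breached."""
    allowed = 100 - slo_percent
    if allowed <= 0:
        raise ValueError("an SLO of 100 percent leaves no error budget")
    return (100 - sli_percent) / allowed


# Hypothetical window: 10,000 requests, 4 of them server errors.
codes = [200] * 9990 + [404] * 6 + [503] * 4
sli = availability_sli(codes)
print(round(sli, 2))                          # 99.96
print(round(budget_consumed(sli, 99.9), 2))   # 0.4
```

Note that the 404 responses count as good here, because the SLI in the table is defined as non-5xx. Whether client errors should count is a design choice to make explicitly for your own service.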

The Error Budget Policy That Stops The Arguments

The SLO is only useful if it changes decisions. That is what an error budget policy does, in two short rules agreed in advance:

Budget healthy: ship. Push features, take reasonable risks, deploy often. The numbers say you have room.
Budget exhausted: stop feature work on that service and spend the next cycle on reliability, until the budget recovers.

Written down before the next incident, this turns the eternal "should we ship or stabilise" argument into a lookup. Nobody has to win the meeting. The budget already decided.
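To show what "a lookup" can mean in practice, here is a minimal Python sketch of the two-rule policy. The names Decision and error_budget_policy are invented for this example, and the inputs are assumed to be request counts for the current SLO window.

```python
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    STABILISE = "stabilise"


def error_budget_policy(
    bad_requests: int, total_requests: int, slo_percent: float
) -> Decision:
    """Apply the two-rule policy to one SLO window (illustrative only)."""
    # Number of failed requests the SLO tolerates in this window.
    allowed_bad = total_requests * (100 - slo_percent) / 100
    if bad_requests < allowed_bad:
        return Decision.SHIP       # budget healthy: keep shipping
    return Decision.STABILISE      # budget exhausted: reliability work first


# Hypothetical numbers: 1,000,000 requests against a 99.9 percent SLO
# allows about 1,000 failures.
print(error_budget_policy(400, 1_000_000, 99.9).value)    # ship
print(error_budget_policy(1_200, 1_000_000, 99.9).value)  # stabilise
```

The function is deliberately boring. The value is in agreeing on the inputs and the rule beforehand, not in the code.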

What To Do When The Budget Runs Out

When you blow the budget, resist the urge to just loosen the SLO so the problem disappears on paper. Instead, run a proper blameless postmortem on what spent it, fix the contributing factors with tracked, owned actions, and only revisit the target if the data genuinely shows it was set wrong. The point of the budget is to force that investment at the right time.

Production-Readiness And Incidents, The Other Half

SLOs tell you when reliability slips. Two practices stop it slipping in the first place. The first is a production-readiness review: a short checklist a service passes before launch, covering health checks and probes, resource limits, timeouts and failure modes, observability, a rollback path, and capacity headroom. It catches the predictable ways a launch falls over before customers do. The second is a basic incident process (clear severities, a named owner, and an escalation path), which means that when something does break, the response is calm and fast instead of a scramble.
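A production-readiness review can start as nothing more than a list you check a service against. The Python below is one possible sketch: the checklist items come from the paragraph above, while the function names and the idea of passing in a set of completed items are assumptions made for illustration.

```python
from typing import Iterable, List

# Items taken from the review described above.
READINESS_CHECKLIST = (
    "health checks and probes",
    "resource limits",
    "timeouts and failure modes",
    "observability",
    "rollback path",
    "capacity headroom",
)


def missing_items(completed: Iterable[str]) -> List[str]:
    """Return checklist items not yet completed, in checklist order."""
    done = set(completed)
    return [item for item in READINESS_CHECKLIST if item not in done]


def ready_for_launch(completed: Iterable[str]) -> bool:
    """A service passes the review only when nothing is missing."""
    return not missing_items(completed)


# Hypothetical service that has not yet covered rollback or capacity.
completed = {
    "health checks and probes",
    "resource limits",
    "timeouts and failure modes",
    "observability",
}
print(missing_items(completed))     # ['rollback path', 'capacity headroom']
print(ready_for_launch(completed))  # False
```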

Getting Started Without Hiring An SRE Team

You can put the whole first iteration in place in weeks, not quarters:

Pick one critical journey and define a single, achievable SLO for it.
Instrument the SLI close to the user and put it on a dashboard.
Write the two-rule error budget policy and get product and engineering to agree to it.
Set up minimal incident severities, ownership, and escalation.
Run a real, blameless postmortem the next time something breaks.

That is a working SRE practice in miniature, and it will already change how your team makes decisions. The headcount and the on-call rota come later, when scale demands them.

If you would rather not stand this up alone, our SRE service does exactly this: defines your first SLOs, sets up the incident and postmortem process, and builds on the monitoring you already have, on a business-hours plus incident retainer. For why this is a distinct discipline and not just DevOps with a new name, see SRE vs DevOps and why the difference decides your uptime.


