SRE Starts With One Question, What Does Reliable Mean
Most teams cannot answer a simple question about their own service: how reliable is it supposed to be? They have a vague sense that it should "be up," and a much sharper sense of panic when it is not. Site Reliability Engineering, usually shortened to SRE, starts by replacing that vague sense with a number everyone agrees on. You do not need a dedicated team or a 24/7 rota to begin. You need one good service level objective (SLO) and the discipline to manage it. Here is how to get there.
SLI, SLO And Error Budget In Plain Terms
Three terms do most of the work in SRE, and they are simpler than they sound.
- SLI, Service Level Indicator. A thing you measure about the service from the user's point of view. Common ones: the percentage of requests that succeed, or the percentage served faster than some threshold.
- SLO, Service Level Objective. The target you hold that indicator to, over a window. For example, 99.9 percent of requests succeed over a rolling 30 days.
- Error budget. The amount of failure the SLO permits. If your SLO is 99.9 percent, your error budget is 0.1 percent. For a request-based SLI that is one failed request in every thousand; expressed as time, it is about 43 minutes of full outage over 30 days. Spend it how you like.
That last point is the one that changes behaviour. The error budget is not a failure to be ashamed of. It is a resource you spend deliberately, on bold deploys when the budget is healthy and on stability when it is running low.
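The 43-minute figure is plain arithmetic: 30 days is 43,200 minutes, and 0.1 percent of that is 43.2 minutes. A minimal sketch of the calculation, assuming a time-based reading of the SLO (the function name `error_budget_minutes` is illustrative, not from any library):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full outage that a time-based SLO permits over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60
    budget_fraction = 1 - slo_percent / 100
    return window_minutes * budget_fraction


if __name__ == "__main__":
    # Compare how quickly the budget shrinks as the target tightens.
    for slo in (99.0, 99.9, 99.99):
        minutes = error_budget_minutes(slo)
        print(f"{slo}% over 30 days allows about {minutes:.1f} minutes")
```

Each extra nine cuts the budget by a factor of ten, which is why a 99.99 percent target leaves only a few minutes a month.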
Picking Your First SLO
Do not try to put an SLO on everything. Pick the one user-facing journey that matters most, usually your core request path, and start there. A reasonable first availability SLO looks like this:
| Field | Example |
|---|---|
| Service | Checkout API |
| SLI | Percentage of requests with a non-5xx response |
| SLO | 99.9 percent over a rolling 30 days |
| Error budget | 0.1 percent of requests, equivalent to about 43 minutes of full outage per 30 days |
| Measured at | The load balancer, per request |
Two rules keep you out of trouble. First, set the SLO at a level you can actually meet most months, not an aspirational 99.99 percent you will breach constantly, because an SLO you always miss teaches the team to ignore it. Second, measure it as close to the user as you reasonably can, so the number reflects what people actually experience.
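As a rough sketch, the availability SLI from the table can be computed from two counters taken at the load balancer: total requests and 5xx responses in the window. The class and function names below are hypothetical, and the counts in the usage section are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counters for one SLO window (hypothetical structure)."""
    total_requests: int
    server_errors: int  # responses with a 5xx status


def availability_sli(counts: WindowCounts) -> float:
    """Percentage of requests that did not return a 5xx response."""
    if counts.total_requests == 0:
        return 100.0  # no traffic means nothing failed
    good = counts.total_requests - counts.server_errors
    return 100 * good / counts.total_requests


def budget_remaining(counts: WindowCounts, slo_percent: float) -> float:
    """Fraction of the error budget still unspent; negative when overspent."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    if counts.total_requests == 0:
        return 1.0
    allowed_errors = counts.total_requests * (1 - slo_percent / 100)
    return 1 - counts.server_errors / allowed_errors


if __name__ == "__main__":
    window = WindowCounts(total_requests=2_000_000, server_errors=1_200)
    print(f"SLI: {availability_sli(window):.2f}%")
    print(f"Budget remaining: {budget_remaining(window, 99.9):.0%}")
```

With these example counts the service sits at 99.94 percent against a 99.9 percent target, so roughly 40 percent of the budget is left.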
The Error Budget Policy That Stops The Arguments
The SLO is only useful if it changes decisions. That is what an error budget policy does, in two short rules agreed in advance:
- Budget healthy: ship. Push features, take reasonable risks, deploy often. The numbers say you have room.
- Budget exhausted: stop feature work on that service and spend the next cycle on reliability, until the budget recovers.
Written down before the next incident, this turns the eternal "should we ship or stabilise" argument into a lookup. Nobody has to win the meeting. The budget already decided.
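The "lookup" can be taken quite literally. A minimal sketch of the two-rule policy as a function, assuming you already have the fraction of budget remaining from your monitoring (the names `Decision` and `release_decision` are illustrative):

```python
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    STABILISE = "stabilise"


def release_decision(budget_remaining: float) -> Decision:
    """Apply the two-rule policy.

    budget_remaining is the unspent fraction of the error budget:
    1.0 means untouched, 0.0 or below means exhausted.
    """
    if budget_remaining > 0:
        return Decision.SHIP
    return Decision.STABILISE


if __name__ == "__main__":
    for remaining in (0.4, 0.0, -0.2):
        print(f"{remaining:+.1f} -> {release_decision(remaining).value}")
```

A real policy might add a warning band before the budget hits zero, but the two-rule version is enough to settle the argument.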
What To Do When The Budget Runs Out
When you blow the budget, resist the urge to just lower the SLO so the problem disappears on paper. Instead, run a proper blameless postmortem on what spent it, fix the contributing factors with tracked, owned actions, and only revisit the target if the data genuinely shows it was set wrong. The point of the budget is to force that investment at the right time.
Production-Readiness And Incidents, The Other Half
SLOs tell you when reliability slips. Two practices stop it slipping in the first place. A production-readiness review is a short checklist a service passes before launch: health checks and probes, resource limits, timeouts and failure modes, observability, a rollback path, and capacity headroom. It catches the predictable ways a launch falls over before customers run into them. A basic incident process (clear severities, a named owner, and an escalation path) means that when something does break, the response is calm and fast instead of a scramble.
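The readiness review can start as nothing more than a list you check a service against. A small sketch using the six items named above; the function names are hypothetical, and in practice the "completed" set would come from whatever review form or ticket you use:

```python
READINESS_CHECKS = (
    "health checks and probes",
    "resource limits",
    "timeouts and failure modes",
    "observability",
    "rollback path",
    "capacity headroom",
)


def missing_checks(completed: set) -> list:
    """Return the checklist items not yet signed off, in checklist order."""
    unknown = completed - set(READINESS_CHECKS)
    if unknown:
        raise ValueError(f"not on the checklist: {sorted(unknown)}")
    return [check for check in READINESS_CHECKS if check not in completed]


def is_launch_ready(completed: set) -> bool:
    """A service is ready only when every item has been signed off."""
    return not missing_checks(completed)


if __name__ == "__main__":
    done = {"health checks and probes", "resource limits", "observability"}
    print("Ready:", is_launch_ready(done))
    print("Still missing:", missing_checks(done))
```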
Getting Started Without Hiring An SRE Team
You can put the whole first iteration in place in weeks, not quarters:
- Pick one critical journey and define a single, achievable SLO for it.
- Instrument the SLI close to the user and put it on a dashboard.
- Write the two-rule error budget policy and get product and engineering to agree to it.
- Set up minimal incident severities, ownership, and escalation.
- Run a real, blameless postmortem the next time something breaks.
That is a working SRE practice in miniature, and it will already change how your team makes decisions. The headcount and the on-call rota come later, when scale demands them.
If you would rather not stand this up alone, our SRE service does exactly this: defines your first SLOs, sets up the incident and postmortem process, and builds on the monitoring you already have, on a business-hours plus incident retainer. For why this is a distinct discipline and not just DevOps with a new name, see SRE vs DevOps and why the difference decides your uptime.