Defining Meaningful SLAs
Most SLA documents are written to protect the vendor, not the customer. At ANSOL, we flip this: our SLAs are written around customer outcomes, not just system uptime percentages.
The Four SLA Tiers We Use
| Tier | Availability | Response Time | Use Case |
|---|---|---|---|
| Gold | 99.99% | < 15 min | Critical metering infrastructure |
| Silver | 99.9% | < 2 hrs | Billing and reporting systems |
| Bronze | 99.5% | < 8 hrs | Analytics and dashboards |
| Dev | 95% | next business day | Staging environments |
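To make the availability column concrete, it helps to convert each percentage into a downtime budget. The sketch below assumes a 30-day measurement window, which the table does not specify; the function name and the `TIERS` mapping are illustrative only.

```python
# Availability targets taken from the tier table above.
TIERS = {"Gold": 99.99, "Silver": 99.9, "Bronze": 99.5, "Dev": 95.0}


def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window at a given availability.

    The 30-day default window is an assumption, not part of the SLA table.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - availability_pct) / 100.0


for tier, pct in TIERS.items():
    print(f"{tier}: {downtime_budget_minutes(pct):.1f} min per 30 days")
```

Under that assumption, Gold allows roughly 4.3 minutes of downtime per 30 days, Silver about 43 minutes, Bronze 3.6 hours, and Dev 36 hours.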
Incident Response Playbook
1. Detection (< 2 min): Automated alerts via PagerDuty
2. Acknowledgement (< 5 min): On-call engineer confirms
3. Triage (< 15 min): Severity classification and customer notification
4. Mitigation (< 30 min for Gold): Rollback or hotfix deployment
5. Resolution: Root cause analysis within 24 hours
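The playbook's time limits can be checked mechanically against an incident timeline. This is a minimal sketch, not our production tooling: it assumes every limit is measured from the start of the incident (the playbook does not say whether the clocks are cumulative), and the stage names and `missed_deadlines` helper are hypothetical.

```python
from datetime import datetime, timedelta

# Limits from the playbook, assumed to run from incident start.
DEADLINES = {
    "detected": timedelta(minutes=2),
    "acknowledged": timedelta(minutes=5),
    "triaged": timedelta(minutes=15),
    "mitigated": timedelta(minutes=30),  # Gold tier limit
}


def missed_deadlines(started: datetime, events: dict[str, datetime]) -> list[str]:
    """Return the playbook stages that were late or never recorded."""
    missed = []
    for stage, limit in DEADLINES.items():
        reached = events.get(stage)
        # Targets are strict ("<"), so hitting the limit exactly counts as a miss.
        if reached is None or reached - started >= limit:
            missed.append(stage)
    return missed


# Example timeline: triage took 20 minutes, everything else was on time.
start = datetime(2024, 1, 1, 12, 0)
events = {
    "detected": start + timedelta(minutes=1),
    "acknowledged": start + timedelta(minutes=4),
    "triaged": start + timedelta(minutes=20),
    "mitigated": start + timedelta(minutes=28),
}
print(missed_deadlines(start, events))  # ['triaged']
```

Treating a missing timestamp as a miss is a deliberate choice here: an unrecorded stage cannot be shown to have met its target.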
Measuring What Matters
Uptime percentages are meaningless without context. We track:
- MTTR (Mean Time to Recover): target < 25 min for Gold
- MTBF (Mean Time Between Failures): tracked per component
- Customer-impacting incidents: the only metric that truly matters
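As a rough illustration of how the first two metrics can be derived from incident records, the sketch below uses one common definition of each: MTTR as the average outage duration, and MTBF as uptime in the observation window divided by the number of failures. The `(started, recovered)` tuple format and the function names are assumptions made for the example.

```python
from datetime import datetime, timedelta

Incident = tuple[datetime, datetime]  # (started, recovered); hypothetical record shape


def total_downtime(incidents: list[Incident]) -> timedelta:
    """Sum of outage durations across all incidents."""
    return sum((end - start for start, end in incidents), timedelta())


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recover: average outage duration."""
    if not incidents:
        raise ValueError("MTTR is undefined without incidents")
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents: list[Incident], window: timedelta) -> timedelta:
    """Mean time between failures: uptime in the window per failure."""
    if not incidents:
        raise ValueError("MTBF is undefined without incidents")
    return (window - total_downtime(incidents)) / len(incidents)


# Two outages (10 min and 30 min) in a 30-day window.
t0 = datetime(2024, 1, 1)
incidents = [
    (t0 + timedelta(days=3), t0 + timedelta(days=3, minutes=10)),
    (t0 + timedelta(days=17), t0 + timedelta(days=17, minutes=30)),
]
print(mttr(incidents))                      # 0:20:00
print(mtbf(incidents, timedelta(days=30)))  # 14 days, 23:40:00
```

In this example the MTTR of 20 minutes would meet the Gold target of under 25 minutes. Computing MTBF per component, as the list above describes, would mean running the same calculation over each component's incidents separately.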
Key Takeaway
An SLA is a commitment, not a contract. Build your operations culture around the commitment, and the contract will take care of itself.