Date: 2021-05-13

If you're like me, you've spent most of your career working with IT operations teams. You've watched them invest lots of hard work trying to meet the expectations of the business, but they've come away with limited success. The business continually bashes IT for providing poor service, while IT struggles to meet seemingly nebulous expectations with limited resources. The major problem here is the fundamental disconnect over how IT and the business each measure success.

IT is responsible for sharing limited resources (such as CPU, memory, and disk) between business functions, so it measures consumption. IT then uses those metrics to recognize when a resource is close to exhaustion, both to avoid problems and to keep costs low. The business, on the other hand, needs responsive and error-free services, so it measures success using speed and quality. The result is two teams with drastically different definitions of success.

Practically, this means that there's lots of tension between IT and the business. Here's a real-world example: A customer of ours was continually being bashed by the business because 'the system is always slow.' Over time, they had added tools to collect thousands of consumption metrics and tried to create correlation rules that would somehow show when the system was slow. What they ended up with was a mess: a huge collection infrastructure gathering metrics at sub-second intervals, alerts that triggered 24x7, and no easy way to understand what was truly going on.

They weren't getting anywhere because they weren't measuring the right things. Their resource-oriented monitoring strategy gave them an incomplete picture.

If you want a simpler and more responsive observability practice, tighter alignment with the business, and faster paths to improvement, you should focus on service-level metrics instead.

Here I'm going to introduce you to service-level indicators (SLIs) and service-level objectives (SLOs), and then I'll show you how to set your SLOs.

The textbook definition of an SLI is: 'A carefully defined quantitative indicator of some aspect of the level of service that is provided.' In other words, an SLI is a metric measuring one thing that shows how well your IT service is performing. Extending this definition a bit, I'd say that it must be relevant to the delivered service and should be simple and easy to understand. That is, when an SLI goes wrong, there must be some business impact, such as an outage or poor user experience.

Remember, the business expects speed and quality, so you need to choose SLIs (metrics) that measure those things, such as:

Latency/response time

Error rate/quality

Availability

Uptime

Note: Yes, there is a distinction between uptime and availability. For now, a quick web search will turn up plenty of explanations.
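To make the idea of an SLI a bit more concrete, here's a minimal sketch of how two of the metrics above (error rate and latency) might be computed from a batch of request records. Everything in it is an assumption for illustration: the `Request` record, the function names, and the sample numbers are hypothetical, and the percentile uses the simple nearest-rank method rather than whatever your monitoring tool does.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def error_rate(requests):
    """Fraction of requests that failed, from 0.0 to 1.0."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of latency in milliseconds."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up sample data: four fast successes and one slow failure.
sample = [
    Request(120, True),
    Request(95, True),
    Request(310, True),
    Request(2050, False),
    Request(140, True),
]

print("error rate:", error_rate(sample))
print("median latency (ms):", latency_percentile(sample, 50))
print("p95 latency (ms):", latency_percentile(sample, 95))
```

Notice that both numbers describe what a user of the service actually experienced, which is what makes them candidates for an SLI.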

And here are some potential SLI choices that you shouldn't use because they don't directly correlate to business impact:

CPU, disk, memory consumption

Cache hit rate

Garbage collection time

Again, the main difference between a good and bad SLI is the metric's relevance to service delivery. A high error rate or slow response time affects service delivery. High CPU utilization might impact service delivery, but the relationship between CPU and service performance is harder to establish. This is why IT teams that measure resource consumption struggle.

The key here is to pick a metric for your SLI that is clearly and unambiguously related to service delivery and is simple and easy to communicate to non-technical people. That will resolve the disconnect, making things easier for everyone involved.

An SLO is simply a goal that you set for your SLIs. First, you identify your SLIs. Then, by setting thresholds for each SLI, you create your SLOs.
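As a rough sketch of that relationship, an SLO can be modeled as an SLI name plus a threshold and a direction. The `SLO` class, the target values, and the measured numbers below are all hypothetical; your own thresholds should come from what the business actually needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target for one SLI (hypothetical structure)."""
    sli_name: str
    threshold: float
    higher_is_better: bool

    def is_met(self, measured: float) -> bool:
        # Availability should stay at or above its target;
        # latency and error rate should stay at or below theirs.
        if self.higher_is_better:
            return measured >= self.threshold
        return measured <= self.threshold


# Illustrative targets only, not recommendations.
slos = [
    SLO("availability (%)", 99.9, higher_is_better=True),
    SLO("p95 latency (ms)", 500.0, higher_is_better=False),
    SLO("error rate (%)", 1.0, higher_is_better=False),
]

# Made-up measurements for one reporting period.
measured = {
    "availability (%)": 99.95,
    "p95 latency (ms)": 620.0,
    "error rate (%)": 0.4,
}


def evaluate(slos, measured):
    """Map each SLI name to True if its SLO was met, else False."""
    return {s.sli_name: s.is_met(measured[s.sli_name]) for s in slos}


results = evaluate(slos, measured)
for name, met in results.items():
    print(f"{name}: {'met' if met else 'missed'}")
```

In this made-up period, availability and error rate meet their targets while p95 latency misses, which tells you where to look first.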

SLOs should be easy for even non-technical stakeholders to understand. Stand-alone resource consumption metrics, such as CPU utilization, don't tell you if something is performing well or not; they require interpretation by a subject matter expert (SME). Identifying business-impacting SLIs, setting SLOs, and properly presenting them means that the consumers of those SLOs don't have to ask if the number is good or bad. Interpretation is intuitive: the answer is 'good' or 'not good.' As a bonus, it's easy to use SLOs to measure improvement.
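One way to measure improvement, sketched here under assumptions, is to track how often an SLO was met in each period and compare periods. The 500 ms target, the `attainment` helper, and the daily p95 latency values are hypothetical.

```python
def attainment(daily_values, threshold):
    """Share of days on which the SLI stayed at or below the threshold."""
    if not daily_values:
        raise ValueError("no measurements")
    met = sum(1 for v in daily_values if v <= threshold)
    return met / len(daily_values)


P95_TARGET_MS = 500  # hypothetical SLO threshold

# Made-up daily p95 latency readings (ms) for two periods.
last_period = [480, 530, 610, 450, 700, 490, 520, 470]
this_period = [460, 480, 510, 430, 495, 470, 455, 440]

before = attainment(last_period, P95_TARGET_MS)
after = attainment(this_period, P95_TARGET_MS)

print(f"SLO met on {before * 100:.1f}% of days last period")
print(f"SLO met on {after * 100:.1f}% of days this period")
print("improved" if after > before else "not improved")
```

The comparison needs no interpretation by an SME: the share of 'good' days either went up or it didn't.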