Define Service Level Objectives

YAML manifest

Service Level Objectives in Reliably are defined in a reliably.yaml manifest file. Below is an example manifest defining two objectives for the same service, api-availability and api-latency.

apiVersion: reliably.com/v1
kind: Objective
metadata:
  labels:
    name: api-availability
    service: reliably-api
spec:
  indicatorSelector:
    category: availability
    gcp_loadbalancer_name: example-lb
    gcp_project_id: example-id
  objectivePercent: 99
  window: 1h0m0s
---
apiVersion: reliably.com/v1
kind: Objective
metadata:
  labels:
    name: api-latency
    service: reliably-api
spec:
  indicatorSelector:
    category: latency
    gcp_loadbalancer_name: example-lb
    gcp_project_id: example-id
    latency_target: 300ms
    percentile: "99"
  objectivePercent: 99
  window: 24h0m0s

Observation window

SLOs are defined by the percentage of "good" events over a time window. You can use 1 hour, 1 day, 1 week, or 1 month preset observation windows, or define a custom one.

To define a custom observation window, construct a string representing a duration in the format 24h30m0.5s.
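
For illustration, reusing the availability selector from the manifest above, a custom two-week window could be declared as follows (the target and window values here are only examples):

spec:
  indicatorSelector:
    category: availability
    gcp_loadbalancer_name: example-lb
    gcp_project_id: example-id
  objectivePercent: 99.5
  window: 336h0m0s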

AWS resources

Resources on AWS are identified by their Amazon Resource Name (ARN). Learn more about ARNs in the AWS documentation.

You can provide this information in the following format:

spec:
  indicatorSelector:
    aws_arn: arn:partition:service:region:account-id:resource-id

Note the arn: prefix.
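
For example, an API Gateway HTTP API like the simple-http-api used in the interactive-mode walkthrough below might be targeted with a selector along these lines (the ARN shown is illustrative; the exact ARN format depends on the AWS service):

spec:
  indicatorSelector:
    aws_arn: arn:aws:apigateway:eu-west-3::/apis/3tf7ct9s4y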

Google Cloud Platform resources

GCP resources are identified by a project ID and a resource name.

spec:
  indicatorSelector:
    category: latency
    gcp_loadbalancer_name: example-lb
    gcp_project_id: example-id
note

The Reliably CLI can currently fetch Service Level Indicators for services that are attached to a Google Cloud Load Balancer. You will thus need a load balancer set up to define and report SLOs for GCP.

You can get the project ID and resource name from the gcloud CLI or the Google Cloud Console.

#### gcloud CLI

The project ID can be found by running:

gcloud config get-value project

If you want to use a different project than the current one configured on your machine, list all of them with:

gcloud projects list

The resource name can be found with:

gcloud compute url-maps list

#### Google Cloud Console

The project ID can be found in the "Project info" card of your Google Cloud Console Dashboard.

Screenshot of Project info card in the Google Cloud Console

The resource name can be found in the Network services / Load balancing section of the Google Cloud Console, where all your services attached to a load balancer are listed.

Datadog resources

A Service Level Indicator can be computed from Datadog by providing the numerator and denominator queries.

The numerator query indicates the number of "good events", while the denominator query represents the number of "total" events.

The indicator percentage is computed as an average of "good events" / "total events" per 2-hour samples over the full time window.
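
For example, if one 2-hour sample contains 980 good events out of 1,000 total (98%) and the next contains 995 out of 1,000 (99.5%), the indicator over those two samples averages out to 98.75%.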

You can provide this information in the following format:

spec:
  indicatorSelector:
    datadog_numerator_query: sum:gcp.loadbalancing.https.backend_request_count{response_code_class:200}.as_count()
    datadog_denominator_query: sum:gcp.loadbalancing.https.backend_request_count{}.as_count()
note

The example above computes the SLO as the percentage of successful requests (those with a 2xx status code).

important

The CLI will require some environment variables to be able to make authenticated calls to the Datadog API:

export DD_SITE="datadoghq.eu" DD_API_KEY="123..." DD_APP_KEY="123..."
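
Putting it together, a complete Datadog-backed objective might look like the following sketch, which reuses the queries above; the metadata labels, target, and window are only illustrative:

apiVersion: reliably.com/v1
kind: Objective
metadata:
  labels:
    name: api-availability
    service: reliably-api
spec:
  indicatorSelector:
    datadog_numerator_query: sum:gcp.loadbalancing.https.backend_request_count{response_code_class:200}.as_count()
    datadog_denominator_query: sum:gcp.loadbalancing.https.backend_request_count{}.as_count()
  objectivePercent: 99
  window: 24h0m0s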

Prometheus resources

A Service Level Indicator can be computed from Prometheus by providing a scalar or instant vector query. It should compute the ratio of "good" events to the total number of events for a particular metric.

Here is an example of such a scalar query:

spec:
  indicatorSelector:
    prometheus_query: scalar(
        sum by (uri) (http_server_requests_count{status="200",uri="/"})
        / sum by (uri) (http_server_requests_count{uri="/"}))

Here is the same query as an instant vector:

spec:
  indicatorSelector:
    prometheus_query: sum by (uri) (http_server_requests_count{status="200",uri="/"})
        / sum by (uri) (http_server_requests_count{uri="/"})

Make sure to also indicate the address of your Prometheus endpoint as follows:

spec:
  indicatorSelector:
    prometheus_url: http://localhost:9090

The endpoint must be reachable from the CLI's context and allow unauthenticated calls to the Prometheus API.
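
Combining the two, a Prometheus-backed indicator selector might look like the following sketch, reusing the instant vector query and local endpoint from the examples above:

spec:
  indicatorSelector:
    prometheus_query: sum by (uri) (http_server_requests_count{status="200",uri="/"})
        / sum by (uri) (http_server_requests_count{uri="/"})
    prometheus_url: http://localhost:9090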

With the CLI

The reliably slo init command can guide you through the creation of this file.

Running reliably slo init will prompt you with questions to help you define an SLO.

Service

? What is the name of the service you want to declare SLOs for? http-service

With Reliably, SLOs are attached to a service. You will first be asked to define a name for a service before you can define its SLOs.

Service Resource

If you want to measure your SLO and generate SLO reports, you will need to add a service resource. Service resources are resources from your cloud provider which Reliably uses to get your service level data.

? On which cloud provider? [Use arrows to move, type to filter]
> Amazon Web Services
  Datadog
  Google Cloud Platform

Once you've selected a cloud provider, you will be asked to paste a resource identifier, or you can type i to enter an interactive mode which will help you identify the service you want to get data from.

important

You will need to be authenticated with Google Cloud or AWS for interactive mode to work.

Here is what interactive mode looks like for AWS:

? On which cloud provider? Amazon Web Services
| Paste an AWS ARN, or type "i" for interactive mode. [? for help]
| Select an AWS partition. [Use arrows to move, type to filter]
> aws
  aws-cn
  aws-us-gov
  aws-iso
  aws-iso-b
? On which cloud provider? Amazon Web Services
| Paste an AWS ARN, or type "i" for interactive mode. [? for help]
| Select an AWS partition. aws
| Select an AWS region. eu-west-3
| Select an AWS service. API Gateway
| Select a Resource. simple-http-api (3tf7ct9s4y)

Here is what interactive mode looks like for Google Cloud:

? On which cloud provider? [Use arrows to move, type to filter]
  Amazon Web Services
> Google Cloud Platform
? On which cloud provider? Google Cloud Platform
| Select an Organization.  [Use arrows to move, type to filter]
> organization-name
? On which cloud provider? Google Cloud Platform
| Select an Organization. chaosiq.io
| Select an Project. Project One (project-one)
| What is the 'type' of the resource? Google Cloud Load Balancers
| Select a resource. staging-lb

For Datadog, you will be asked for "good events" (numerator) and "total events" (denominator) queries. Here is what interactive mode looks like:

? On which cloud provider? Datadog
| Paste your 'numerator' (good events) datadog query: sum:...
| Paste your 'denominator' (total events) datadog query: sum:...

SLO Target

? What is your target for this SLO (in %)? 99.9

You must specify a target for your SLO. This is what good looks like for your SLO and is expressed as a percentage. For example, for an Availability SLO, a target of 98% would mean 98% of the events were successful.

SLO Type

For some providers (AWS and GCP), you'll be asked to choose the type of SLO to measure:

? What type of SLO do you want to declare?  [Use arrows to move, type to filter]
> Availability
  Latency

You can choose either availability or latency SLOs. Availability SLOs are based on the success of a particular request. Latency SLOs are based on the completion of a request under a time threshold.

Latency threshold

If you select the latency SLO type, you will also be prompted to provide a threshold in milliseconds. All responses within this threshold contribute to your target.

? What type of SLO do you want to declare? Latency
? What is your target for this SLO (in %)? 99.9
? What is your latency threshold (in milliseconds)? 300
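
As a rough sketch, and assuming the same fields as the GCP latency example at the top of this page, these answers would translate into an objective spec along these lines (the window and the provider-specific selector labels are only placeholders):

spec:
  indicatorSelector:
    category: latency
    latency_target: 300ms
    gcp_loadbalancer_name: example-lb
    gcp_project_id: example-id
  objectivePercent: 99.9
  window: 24h0m0s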

Observation window

? What is your observation window for this SLO?  [Use arrows to move, type to filter]
> 1 hour
  1 day
  1 week
  1 month
  custom
? What is your observation window for this SLO? custom
? Define your custom observation window [? for help]

SLO Name

Once your SLO is defined, you will be asked to name it. The Reliably CLI will suggest a name based on how you described your SLO. Press Enter to keep the suggested name, or type one that suits you better.

? What is the name of this SLO? (99.9% of requests faster than 300ms over last 1 day)

Additional SLOs and Services

You will then be asked if you want to add more SLOs to this service, then if you want to add more services to your manifest.

important

When you're done, the CLI will confirm your manifest has been successfully created in your working directory.

 Your manifest has been saved to ./reliably.yaml

The manifest file will also be uploaded to the Reliably SaaS. The local reliably.yaml file will be used to generate your SLO report.

Measure and Report

Now that your SLOs are defined, the Reliably CLI will be able to query the resources for SLIs and generate SLO reports.

Reference

Read the Reliably CLI SLO Init command reference for a complete list of options.