Reliably's core features tutorial

Let’s get you started with Reliably. The goal of this tutorial is to familiarize you with the concepts of objectives, verifications and score cards, and to begin your journey towards better insight into your efforts to build a healthier team.

Outcomes

This tutorial aims at introducing you to the core concepts you will manipulate with Reliably. By the end of this tutorial you will know how Reliably can bootstrap you with creating good objectives, setting alerts on them and verifying them proactively.

Prerequisites

  1. The reliably CLI installed and authenticated against Reliably
  2. One of the supported built-in providers.

Objectives

Reliable systems start with healthy teams

At the very core of Reliably are objectives. What are they? We provide a somewhat open definition, but one that we feel captures their nature:

An objective is an aspect of your system that surfaces the right signal to make decisions.

In other words, defining objectives is all about you and your team deciding what informs you about the health and sustainability of the individual, the team, the organisation or the system. These discussions about the appropriate conditions for the team to deliver in a healthy fashion are what we consider to be reliability.

Most systems, and their dependencies, are complex. Monitoring and observing them is an essential activity for any team or organisation. But at Reliably, we believe that through objectives, and the discussions they enable, you define the parameters of the system’s health, no matter how the system changes.

The properties of an objective are loosely as follows:

  1. A performance goal for a certain aspect you care about
  2. A capacity to measure this goal

Measuring a goal is at the core of how Reliably determines what decisions to make on your behalf.
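To make this concrete, here is a purely illustrative sketch of what an objective captures; the field names are hypothetical and not Reliably’s actual format, but the target and indicator mirror the Kubernetes objectives populated later in this tutorial:

{
  "name": "Pod 'default/producer' cannot remain in state 'CrashLoopBackOff'",
  "target-percent": 95,
  "indicator": "Prometheus query measuring how much of the time the pod stays out of 'CrashLoopBackOff'"
}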

Populate objectives the easy way

Creating objectives is easy, but deciding which ones are the right ones might not be. Reliably takes the approach of creating default objectives for you, so you can get started quickly with a common baseline.

To do this, run the reliably populate command, selecting one of the providers that Reliably offers. The purpose of the populate command is to bootstrap you with sound objectives across a variety of targets.

For instance, let’s assume you are running a Kubernetes cluster, with Prometheus monitoring it:

$ reliably populate kube --prometheus-url=http://HOST:9090

This will ask you to select the Kubernetes resources you wish to create objectives for, for example pods, services or deployments. In this example, we need a Prometheus endpoint to read metrics about these resources; those metrics are used to compute the results of each objective created.
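If you want to confirm the Prometheus endpoint is reachable before populating, a quick call against its standard HTTP query API (plain Prometheus here, not a Reliably command) is enough. A healthy endpoint answers with a JSON document whose status field is "success":

$ curl -s "http://HOST:9090/api/v1/query?query=up"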

On the Reliably dashboard, the objectives will now appear in the Explore tab.

objectives-populate

You can do the same with AWS:

$ reliably populate aws
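The AWS provider relies on your existing AWS credentials. If you want to make sure they are picked up, the standard AWS CLI call below (again, not a Reliably command) is a quick sanity check:

$ aws sts get-caller-identity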

Now that you have your first objectives, let’s put them into action.

Agent

Bring objectives to life

Objectives enable the discussion about “what good looks like”. Now you need to start feeding data to Reliably so they can be computed.

Reliably is not a monitoring tool, so you don’t send raw metrics to it directly. Instead, it reads indicators from the system that was used to create the objectives and computes a variety of results and scores, which we will come back to later.

Reliably provides the agent command to start fetching data and send it back to Reliably:

$ reliably agent

This will take all your objectives in the current organisation and fetch the necessary metrics once, sending them back to Reliably. The objective results will then be computed.

agent-once

Note that Reliably does not expect you to open any port in your infrastructure. The communication back to Reliably is performed over HTTPS; as long as you can make these outbound calls, things will flow nicely.

You may want to run this command with a certain frequency, for instance every minute:

$ reliably agent -i 60
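If you want the agent to keep running after you close your terminal, one simple option, assuming a Linux or macOS host, is to run it in the background and capture its output in a log file:

$ nohup reliably agent -i 60 > reliably-agent.log 2>&1 &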

agent-many

Alerts

Get alerted when objectives are impacted

Putting objectives into action with alerts is how you confirm they support the team, by surfacing the right signal where it needs to go.

Reliably has the concept of an Alert Policy which, as the name implies, describes the rules by which a team wishes to be notified when an objective falls below its target (or, soon, when it trends negatively even while still above its target).

The CLI provides a way to create alerts alongside your objectives. Run the same command as before, but add the --include-alerts flag to it:

$ reliably populate kube --include-alerts \
    --prometheus-url=http://HOST:9090

This will add an alert policy matching the objectives that were initially created. It will set an alert threshold, defaulting to 95%, which is the point at which the alert policy triggers and alert events are emitted.
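Conceptually, an alert policy ties a threshold to the objectives it watches and to the channels it notifies. The sketch below is purely illustrative; the field names are hypothetical, not Reliably’s actual format:

{
  "objective": "Pod 'default/producer' cannot remain in state 'CrashLoopBackOff'",
  "threshold-percent": 95,
  "channels": []
}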

alert-created

At this stage, while a policy exists, it has no notification channels to publish alert events to. You can create these on the dashboard in the Explore > Channels section.

alert-channel-create

Next, go back to the policy and attach the newly created channel to it:

alert-channel-add

alert-channel-set

When an alert policy is triggered, alert events are emitted:

alert-event

Verifications / Chaos Engineering

Cultivate a proactive mindset

Objectives communicate what matters to the health and sustainability of teams and systems. In recent years, a new practice has gained traction: proactively exploring how certain conditions can impact your objectives, and therefore the team’s capability to keep functioning healthily.

Reliably calls this Verification. A verification is a chaos engineering experiment, usually directly connected to an objective.

Chaos Toolkit relationship

Reliably uses the Chaos Toolkit as the engine that propels verifications. In a nutshell, Reliably verifications are Chaos Toolkit experiments. Reliably does not change the Chaos Toolkit in any way, in keeping with its commitment to Open Source.

The Reliably objective is used as the steady-state of the experiment.

This also means that your existing Chaos Toolkit experiments are Reliably verifications in their own right; they are simply not linked to a particular objective.

Running reliably verification run verification.json is the same as running chaos run verification.json. In the latter case, you only need to make sure you also installed the chaostoolkit-reliably extension.
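For instance, assuming a Python environment is available, you can install the Chaos Toolkit and the extension from PyPI and run the experiment directly:

$ pip install chaostoolkit chaostoolkit-reliably
$ chaos run verification.json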

Start verifying a single objective

The CLI provides a way to create verifications alongside your objectives. Run the same command as before, but add the --include-verifications flag to it:

$ reliably populate kube --include-alerts --include-verifications \
    --prometheus-url=http://HOST:9090

This will add a set of verifications matching the objectives that were initially created.

You can also create a verification as follows:

verification-create

A verification is tied to one or more objectives. Its purpose is to proactively explore a particular condition of your choosing and understand its impact on said objective(s).

You can run and download a verification as follows:

verification-run

Chaos Toolkit requirements

To run a verification, the Reliably CLI requires the chaos command from the Chaos Toolkit to be found in your PATH. The easiest way is to download the right version from here.
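You can quickly check that the command resolves on your PATH:

$ chaos --version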

Once you have run a verification, you can see its events:

verification-view

You can also see it in context on the objective’s chart directly:

verification-objective

The verification was successful. The reason is that, for now, the generated verification does not actually set a condition to run against your system; in essence, the verification is a no-op. But at least we know we are properly set up. Let’s now see what we can do to truly verify the system.

Let’s shake the system a bit

As said before, a verification is a regular experiment. It is therefore a file you can edit and change at will.

First download the verification with:

$ reliably verification download <URL>

The verification will be stored as a file named verification.json. Be mindful that if the file already exists, it will be overwritten.
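If you just want a quick look at what was generated, and assuming jq is installed, you can list the steady-state probes without opening the file:

$ jq '."steady-state-hypothesis".probes[].name' verification.json
"Pod 'default/producer' cannot remain in state 'CrashLoopBackOff'"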

Open the file in your favourite editor:

{
  "title": "Producer does not suffer from heavy traffic",
  "description": "n/a",
  "controls": [
    {
      "name": "chaosreliably",
      "provider": {
        "type": "python",
        "module": "chaosreliably.controls.experiment",
        "arguments": {
          "experiment_ref": "456ckgcdoo8lz8hw"
        }
      }
    }
  ],
  "steady-state-hypothesis": {
    "title": "We have enough pods to handle incoming traffic",
    "probes": [
      {
        "name": "Pod 'default/producer' cannot remain in state 'CrashLoopBackOff'",
        "type": "probe",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "chaosreliably.slo.probes",
          "func": "slo_is_met",
          "arguments": {
            "labels": {
              "app": "producer",
              "app.kubernetes.io/name": "producer",
              "context": "demo",
              "name": "Pod 'default/producer' cannot remain in state 'CrashLoopBackOff'",
              "namespace": "default",
              "pod-template-hash": "6fc6df8b8",
              "provider": "kubernetes",
              "service": "producer",
              "entity-type": "objective-kubernetes-pod",
              "resource-selector": "app.kubernetes.io/name=producer,app=producer,context=demo,pod-template-hash=6fc6df8b8,service=producer"
            }
          }
        }
      }
    ]
  },
  "method": [],
  "rollbacks": []
}

As you can see, the file is populated with the objective we picked when creating the verification. For now, let’s ignore the labels; we’ll come back to them later.

The steady-state probe uses the Chaos Toolkit Reliably extension, which exposes a way to call Reliably and fetch the objective’s latest status. If it is above its target, then we are all good. Otherwise, the verification fails.

But, as noted, the verification does not come with any action to run against the system. For now, it is up to you to add one. For instance, in this example we could try to induce some fairly heavy traffic into the system and see if our objective holds.

We will use a tool called ddosify to inject traffic into our service. Let’s now change the method as follows:

{
  "method": [
    {
        "name": "induce-high-load-into-application",
        "type": "action",
        "provider": {
            "type": "process",
            "path": "ddosify",
            "arguments": "-n 10000 -d 60 -l waved -o stdout-json -t https://...."
        }
    }
  ]
}

Here, the URL is the endpoint of the producer service monitored by the objective. ddosify will inject 10,000 requests into that endpoint over 60 seconds.
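If you prefer to try the load generator before wiring it into the verification, you can invoke it once from your shell with the same arguments the action uses (the target URL is elided here, as in the experiment):

$ ddosify -n 10000 -d 60 -l waved -o stdout-json -t https://....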

Running a verification

In a different terminal, run a command to monitor the pods. For instance, use the watch command as follows:

$ watch -c kubectl get pod -l app=producer
Every 2,0s: kubectl get po -l app=producer

NAME                       READY   STATUS    RESTARTS        AGE
producer-6fc6df8b8-pb9tl   1/1     Running   0               4d2h

Now run the verification:

$ reliably verification run verification.json
[2022-07-25 16:07:37 INFO] Validating the experiment's syntax
[2022-07-25 16:07:37 INFO] Experiment looks valid
[2022-07-25 16:07:37 INFO] Running experiment: Producer does not suffer from heavy traffic
[2022-07-25 16:07:37 INFO] Steady-state strategy: default
[2022-07-25 16:07:37 INFO] Rollbacks strategy: always
[2022-07-25 16:07:38 INFO] Steady state hypothesis: We have enough pods to handle incoming traffic
[2022-07-25 16:07:39 INFO] Probe: Pod 'default/producer' cannot remain in state 'CrashLoopBackOff'
[2022-07-25 16:07:39 INFO] Steady state hypothesis is met!
[2022-07-25 16:07:40 INFO] Playing your experiment's method now...
[2022-07-25 16:07:40 INFO] Action: induce-high-load-into-application
[2022-07-25 16:09:11 INFO] Steady state hypothesis: We have enough pods to handle incoming traffic
[2022-07-25 16:09:12 INFO] Probe: Pod 'default/producer' cannot remain in state 'CrashLoopBackOff'
[2022-07-25 16:09:12 CRITICAL] The following Objective Results were not OK:

    | Date                           |   Objective % |   Actual % |   Remaining % | Indicator Selector                                                                                                                                                                                                                                                                                                                                        |
    |--------------------------------|---------------|------------|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
    | 2022-07-25T14:08:55.661978398Z |            95 |          0 |           -95 | {'category': 'kubernetes', 'prometheus_query': '1 - (sum(max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace="default", pod=~"producer-[0-9a-zA-Z]{5,10}-[0-9a-zA-Z]{5}$"}[5m]) or vector(0)) / sum(last_over_time(kube_pod_info{namespace="default", pod=~"producer-[0-9a-zA-Z]{5,10}-[0-9a-zA-Z]{5}$"}[5m])))'} |
[2022-07-25 16:09:12 CRITICAL] Steady state probe 'Pod 'default/producer' cannot remain in state 'CrashLoopBackOff'' is not in the given tolerance so failing this experiment
[2022-07-25 16:09:12 WARNING] Rollbacks were explicitly requested to be played
[2022-07-25 16:09:12 INFO] Let's rollback...
[2022-07-25 16:09:13 INFO] No declared rollbacks, let's move on.
[2022-07-25 16:09:13 INFO] Experiment ended with status: deviated
[2022-07-25 16:09:13 INFO] The steady-state has deviated, a weakness may have been discovered

Let’s look at what the pod status tells us:

$ watch -c kubectl get pod -l app=producer
Every 2,0s: kubectl get po -l app=producer

NAME                       READY   STATUS             RESTARTS        AGE
producer-6fc6df8b8-pb9tl   1/1     CrashLoopBackOff   1 (19s ago)     4d2h

At this stage, our verification has shown that with the number of replicas we run and a certain amount of traffic, our service cannot remain available.

verification-view-failed

Watch the impact of the verification on the objective

You can also view this run in-situ in the objective view:

verification-view-failed-in-objective

We can clearly see that when the objective fell below its target, a verification was running. Of course, while the correlation is sound, the team would have to explore a bit more to figure out whether anything else was happening at the time of the event and rule out other causes.
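One way to dig in is to run the indicator query directly against Prometheus. The call below uses the standard query API with a slightly simplified variant of the query shown in the verification output above:

$ curl -sG "http://HOST:9090/api/v1/query" \
    --data-urlencode 'query=sum(max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace="default"}[5m]))'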

Remediate and verify

Let’s now scale out the number of replicas and see if this helps with our lack of availability under heavy load:

$ kubectl scale --replicas=3 deployment/producer
$ watch -c kubectl get pod -l app=producer
Every 2,0s: kubectl get po -l app=producer

NAME                       READY   STATUS    RESTARTS       AGE
producer-6fc6df8b8-p8wdb   1/1     Running   0              15s
producer-6fc6df8b8-pb9tl   1/1     Running   13 (13m ago)   4d4h
producer-6fc6df8b8-pwfqr   1/1     Running   0              15s

Let’s run the verification once more:

$  reliably verification run verification.json 
[2022-07-25 16:58:01 INFO] Validating the experiment's syntax
[2022-07-25 16:58:01 INFO] Experiment looks valid
[2022-07-25 16:58:01 INFO] Running experiment: Producer does not suffer from heavy traffic
[2022-07-25 16:58:02 INFO] Steady-state strategy: default
[2022-07-25 16:58:02 INFO] Rollbacks strategy: always
[2022-07-25 16:58:02 INFO] Steady state hypothesis: We have enough pods to handle incoming traffic
[2022-07-25 16:58:03 INFO] Probe: Pod 'default/producer' cannot remain in state 'CrashLoopBackOff'
[2022-07-25 16:58:04 INFO] Steady state hypothesis is met!
[2022-07-25 16:58:04 INFO] Playing your experiment's method now...
[2022-07-25 16:58:05 INFO] Action: induce-high-load-into-application
[2022-07-25 16:59:35 INFO] Steady state hypothesis: We have enough pods to handle incoming traffic
[2022-07-25 16:59:36 INFO] Probe: Pod 'default/producer' cannot remain in state 'CrashLoopBackOff'
[2022-07-25 16:59:37 INFO] Steady state hypothesis is met!
[2022-07-25 16:59:37 WARNING] Rollbacks were explicitly requested to be played
[2022-07-25 16:59:37 INFO] Let's rollback...
[2022-07-25 16:59:37 INFO] No declared rollbacks, let's move on.
[2022-07-25 16:59:37 INFO] Experiment ended with status: completed

Yes! With enough replicas, our objective is not impacted under the given conditions of the verification.

verification-view-remediated

verification-view-remediated-in-objective

Wrapping up

Reliably’s platform is dedicated to supporting teams facing the complexity ahead of us, whether it lies in their system and/or in their organisation. We believe teams have what it takes to deliver great product value, and we appreciate that they have a direct impact on people’s lives through their daily activities.

Objectives, Verifications and Alerts are the core concepts of the Reliably platform. They form the elements by which teams can define and monitor their health and sustainability, to navigate what is and endure what’s to come.