Engineer your Resilience

Reliably has a mission: help teams build a healthier relationship with their operations and reduce the anxiety related to on-call duty and incidents. By being proactive, teams learn how they react and get room to think about how to handle future system failures better.

We care about your teams, and we believe practicing with Reliably is one of the ways teams can look after themselves.

Reliably is packed with features that support you and your teams in engineering your resilience!

Designing and Building Experiments

At the core of your journey into Reliably is the concept of an experiment. The idea originates in the Chaos Engineering principles of experimenting to verify how your system copes with a sudden change.

To help you build your own experiment, you can use the Reliably builder by clicking on the Builder menu.

A screenshot of the Reliably main menu with the Builder item selected.

Reliably then shows you a list of starters to initialize your experiment with.

A screenshot of the Reliably starters list.

When you select a starter from this list, Reliably takes you into the Builder view.

A screenshot of the Reliably full builder.

The builder allows you to design your experiment by adding more activities from the starters. These can either impact the system further or collect information about its state as the execution takes place. Once saved, you have a new experiment under your belt. Congratulations!

A screenshot of the Reliably experiment.

In this particular experiment, we want to verify that our operations are correctly wired up for a potential failure in one of our services. As the team operating the service, we have put in place an alarm in AWS that, when triggered, automatically creates an incident in the AWS OpsCenter. That incident must have a "High" impact and be in an "OPEN" state. Along the way, our experiment collects the logs of the pods running our application, as well as the alarm states, so we can review both.
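
To make the scenario more concrete, here is a minimal sketch of how part of such an experiment might look as a Chaos Toolkit document, built as a plain Python dictionary and dumped to JSON. It is not the experiment the Reliably builder produces. The title, activity names, label selector and alarm name are hypothetical, and the activity functions (`terminate_pods` and `read_pod_logs` from chaostoolkit-kubernetes, `get_alarm_state_value` from chaostoolkit-aws) are assumed to accept these argument names in the versions you have installed. The check on the incident itself is left out, because the text above does not say which probe performs it.

```python
import json


def build_experiment(label_selector: str, alarm_name: str) -> dict:
    """Build a Chaos Toolkit-style experiment as a plain dictionary."""

    def activity(name: str, kind: str, module: str, func: str, arguments: dict) -> dict:
        # Each activity is declared as a call to a Python provider function.
        return {
            "name": name,
            "type": kind,
            "provider": {
                "type": "python",
                "module": module,
                "func": func,
                "arguments": arguments,
            },
        }

    return {
        "version": "1.0.0",
        "title": "Does losing a pod raise our alarm?",  # hypothetical title
        "description": "Terminate a pod, then collect pod logs and the alarm state.",
        "method": [
            # Assumed: terminate_pods accepts a label_selector argument.
            activity("terminate-pod", "action",
                     "chaosk8s.pod.actions", "terminate_pods",
                     {"label_selector": label_selector}),
            # Assumed: read_pod_logs accepts a label_selector argument.
            activity("collect-pod-logs", "probe",
                     "chaosk8s.pod.probes", "read_pod_logs",
                     {"label_selector": label_selector}),
            # Assumed: get_alarm_state_value accepts an alarm_name argument.
            activity("collect-alarm-state", "probe",
                     "chaosaws.cloudwatch.probes", "get_alarm_state_value",
                     {"alarm_name": alarm_name}),
        ],
    }


if __name__ == "__main__":
    # Hypothetical values; replace them with your own selector and alarm name.
    print(json.dumps(build_experiment("app=frontend", "frontend-errors"), indent=2))
```

Building the document in code only assembles data, so it runs without either extension installed; the functions are looked up when the experiment itself is executed.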

Planning and Running Experiments

Reliably orchestrates your experiments. You can plan and schedule their executions where and when you need.

To quickly plan the execution of an experiment, simply click on the Run experiment button.

A screenshot of the Reliably experiment buttons.

This will lead you to the plan creator that allows you to specify where and when to run the experiment you selected.

A screenshot of the Reliably new plan form.

In a nutshell, the Deployment indicates where to run the experiment from. Here we select Reliably Cloud, so the execution is carried out by Reliably itself.

The Environment provides all the context needed to run the experiment, specifically the environment variables and secrets that the experiment requires.

The Schedule defines when to run the experiment: once or repeatedly.

Finally, integrations let you send data back to your world. In our example, we have enabled sending notifications to Slack as the experiment runs. We also let the Reliably Assistant converse with OpenAI to ask questions that are relevant to the experiment (without sending any sensitive data).

Once you have planned an experiment for execution, you can navigate to its execution.

A screenshot of the Reliably plan page when it is running.

Analyzing Experiment Executions

Reliably aims to help you go through the experiment execution timeline with a fresh and intuitive user experience.

A screenshot of Reliably showing all executions.

Selecting the failed execution, we can see that the expected incident was not created on AWS.

A screenshot of Reliably showing deviation.

Let’s dig a little deeper. We can see a pod was indeed deleted and therefore restarted by Kubernetes.

A screenshot of the Reliably execution page showing what pod was targeted.

We can also see that traffic was impacted.

A screenshot of the Reliably execution page showing traffic loaded into application.

Yet, no alarm was kicked off.

A screenshot of the Reliably execution page showing no alarms were raised.

As no alarms were triggered, no incidents were opened.

At this stage, we could add more probes into the experiment to narrow down the issue.

Additionally, as you enabled integrations such as Slack, your team will have received notifications of the execution as it took place.

A screenshot of the Slack messages of an execution.

Scoring Your Resilience Engineering Efforts

By running a variety of experiments, your teams make visible the effort they pour into engineering for resilience across the organization.

Reliably recommends focusing on team efforts rather than on particular individuals. Resilience engineering is a team and organization effort.

A screenshot of the Reliably score board.

Reliably offers two scores. The first, from A to D, indicates the trend of execution states over the past ten executions. The second is the freshness of that trend, from 0 to 100. The longer it has been since you last ran the experiment, the lower the freshness, and therefore the less relevant your knowledge of that experiment.

That information is aggregated in two other places in Reliably. First, on the dashboard, where the freshness/score is plotted for each experiment.
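
The formulas behind these scores are not given here, so the following is only a sketch of how two such scores could be computed. The grade thresholds, the 30-day window and the linear decay are assumptions made for illustration; they are not Reliably's actual rules.

```python
from datetime import datetime, timedelta
from typing import Sequence


def trend_grade(deviated: Sequence[bool]) -> str:
    """Grade the last ten executions from A (best) to D (worst).

    `deviated` holds one flag per execution, oldest first. True means the
    execution deviated from the expected outcome.
    """
    recent = list(deviated)[-10:]
    if not recent:
        return "D"  # assumption: no executions yet counts as the lowest grade
    success_ratio = recent.count(False) / len(recent)
    # Assumed thresholds, chosen only for this example.
    if success_ratio >= 0.9:
        return "A"
    if success_ratio >= 0.7:
        return "B"
    if success_ratio >= 0.5:
        return "C"
    return "D"


def freshness(last_run: datetime, now: datetime, window_days: int = 30) -> int:
    """Return 100 right after a run, decaying linearly to 0 at `window_days`."""
    age_days = (now - last_run) / timedelta(days=1)
    remaining = 1.0 - age_days / window_days
    return round(100 * min(1.0, max(0.0, remaining)))


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    history = [False] * 8 + [True] * 2  # eight clean runs, two deviations
    print(trend_grade(history), freshness(datetime(2024, 1, 16), now))
```

Whatever the exact formula, the point of pairing the two numbers is the same: a good grade backed by stale executions tells you less than the same grade earned recently.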

A screenshot of the Reliably dashboard score board.

Second, on the experiment list page.

A screenshot of the Reliably experiment list.

Reviewing Your Efforts At A Glance

Reliably brings together on its dashboard all the data that lets you review your efforts at a glance.

A screenshot of the Reliably dashboard.

Digging Deeper With the Assistant

Reliably brings you the power of its Assistant so that you can explore new facets of your resilience.

When building a new experiment, Reliably may suggest specific additional activities to rapidly prototype the right scenario.

A screenshot of the Reliably builder assistant.

Additionally, when you enable the OpenAI integration, the Reliably Assistant will issue questions to either GPT-3.5 or GPT-4 and will integrate the answers right back into the execution for greater context.

A screenshot of the Reliably execution assistant.

You can add your own questions directly when building the experiment:

A screenshot of the Reliably builder assistant form.

No sensitive information about your execution or organization is ever sent to OpenAI.

Bring Your Own Experiments

The Reliably Builder is a powerful feature. However, you may sometimes need to add your own specific experiments for your team to run. Reliably supports importing your very own Chaos Toolkit experiments.

A screenshot of the Reliably import form.

You may also turn these experiments into templates for easier reuse.

A screenshot of the Reliably template form.

Templates allow you to turn this:

---
version: 1.0.0
title: Impact of the service process terminating
description: What is the impact of our service process being terminated? Do we get
  any traces anywhere?
configuration:
  target_service:
    type: env
    key: RELIABLY_PARAM_TARGET_SVC
  container_name:
    type: env
    key: RELIABLY_PARAM_CONTAINER_NAME
method:
- name: exec-in-pod
  type: action
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: exec_in_pods
    arguments:
      label_selector: "${target_service}"
      container_name: "${container_name}"
      cmd: kill -TERM 1

Into this:

A screenshot of the Reliably template usage.
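
The experiment above declares two configuration entries of type `env` and refers to them in the action arguments as `${target_service}` and `${container_name}`. As a rough illustration of how such values can flow from environment variables into the arguments, the sketch below resolves the entries and substitutes the placeholders using only the Python standard library. It mimics the behavior in a few lines; it is not Chaos Toolkit's or Reliably's own code, and the example values (`app=frontend`, `web`) are hypothetical.

```python
import os
from string import Template
from typing import Any, Dict, Mapping, Optional

# Mirrors the `configuration` block of the experiment above.
CONFIGURATION = {
    "target_service": {"type": "env", "key": "RELIABLY_PARAM_TARGET_SVC"},
    "container_name": {"type": "env", "key": "RELIABLY_PARAM_CONTAINER_NAME"},
}

# Mirrors the `arguments` of the exec-in-pod action.
ARGUMENTS = {
    "label_selector": "${target_service}",
    "container_name": "${container_name}",
    "cmd": "kill -TERM 1",
}


def resolve_configuration(
    configuration: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Replace every `type: env` entry with the value of its variable."""
    environ = os.environ if environ is None else environ
    resolved: Dict[str, Any] = {}
    for name, spec in configuration.items():
        if isinstance(spec, Mapping) and spec.get("type") == "env":
            key = spec["key"]
            if key not in environ:
                raise KeyError(f"environment variable {key} is not set")
            resolved[name] = environ[key]
        else:
            resolved[name] = spec  # inline values are kept as they are
    return resolved


def substitute_arguments(
    arguments: Mapping[str, Any], configuration: Mapping[str, Any]
) -> Dict[str, Any]:
    """Fill `${name}` placeholders in string arguments from the configuration."""
    return {
        name: Template(value).safe_substitute(configuration)
        if isinstance(value, str)
        else value
        for name, value in arguments.items()
    }


if __name__ == "__main__":
    # Hypothetical values, standing in for what a template form would collect.
    env = {
        "RELIABLY_PARAM_TARGET_SVC": "app=frontend",
        "RELIABLY_PARAM_CONTAINER_NAME": "web",
    }
    print(substitute_arguments(ARGUMENTS, resolve_configuration(CONFIGURATION, env)))
```

`safe_substitute` is used so that arguments containing other `$` characters, such as shell commands, are left untouched rather than raising an error.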

This was a quick tour of the main features of Reliably. Enjoy making your operations less stressful and more data-driven!