Engineer your Resilience
Reliably has a mission: to help teams build a healthier relationship with their operations and reduce the anxiety related to on-call duty and incidents. By being proactive, teams learn how they react and get the chance to think about how to handle future failures in the system better.
We care about your teams, and we believe practicing with Reliably is one of the ways teams can look after themselves as well.
Reliably is packed with features that support you and your teams in engineering your resilience!
Designing and building Experiments
At the core of your journey into Reliably is the concept of an experiment. The idea originates in the Chaos Engineering principles of experimenting to verify how your system copes with a sudden change.
To help you build your own experiment, you can use the Reliably Builder by clicking on the Builder menu.
Reliably then shows you a list of starters to initialize your experiment with.
When you select one starter among this list, Reliably takes you into the Builder view.
The Builder allows you to design your experiment by adding more activities from the starters. These can either impact the system further or collect information about its state as the execution takes place. Once saved, you have a new experiment under your belt. Congratulations!
In this particular experiment, we want to verify that our operations are correctly wired up for a potential failure in one of our services. As the team operating the service, we have put in place an alarm in AWS that, when triggered, automatically creates an incident in the AWS OpsCenter. That incident must have the "High" impact and be in an "OPEN" state. Along the way, our experiment collects the logs of the pods running our application, as well as alarm states, so we can review both.
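The verification described above boils down to a simple check on the incidents found after the failure is injected. A minimal sketch of that check follows; the record shape (`impact` and `state` keys) and the sample data are assumptions made for illustration, not the format returned by AWS or by Reliably.

```python
from typing import Iterable, Mapping


def has_expected_incident(
    incidents: Iterable[Mapping[str, str]],
    impact: str = "High",
    state: str = "OPEN",
) -> bool:
    """Return True if at least one incident has the expected impact and state."""
    return any(
        incident.get("impact") == impact and incident.get("state") == state
        for incident in incidents
    )


# Hypothetical incident records, shaped as a probe might return them.
incidents = [
    {"title": "checkout latency", "impact": "Low", "state": "RESOLVED"},
    {"title": "service down", "impact": "High", "state": "OPEN"},
]

print(has_expected_incident(incidents))
```

An empty list, or a list holding only resolved or low-impact incidents, makes the check fail. That is the situation the failed execution analyzed later in this tour runs into.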
Planning and running Experiments
Reliably orchestrates your experiments. You can plan and schedule their executions where and when you need.
To quickly plan the execution of an experiment, simply click on the Run experiment button.
This will lead you to the plan creator that allows you to specify where and when to run the experiment you selected.
In a nutshell, the Deployment indicates where to run the experiment from. Here we select Reliably Cloud so the execution is carried out by Reliably itself.
The Environment provides all the context needed to run the experiment, namely the environment variables and secrets that the experiment requires.
The schedule defines when to run the experiment: once or in a repeated fashion.
Finally, integrations let you send data back to your world. In our example, we have enabled sending notifications to Slack as the experiment runs. We also let the Reliably Assistant converse with OpenAI to ask questions that are relevant to the experiment (without sending any sensitive data).
Once you have planned an experiment for execution, you will be able to navigate to its execution.
Analyzing Experiment Executions
Reliably aims to help you go through the experiment execution timeline with a fresh and intuitive user experience.
Selecting the failed execution, we can see that the expected incident was not created on AWS.
Let’s dig a little deeper. We can see a pod was indeed deleted and therefore restarted by Kubernetes.
We can also see that traffic was impacted.
Yet, no alarm was kicked off.
As no alarms were triggered, no incidents were opened.
At this stage, we could add more probes into the experiment to narrow down the issue.
Additionally, as you enabled integrations such as Slack, your team will have received notifications of the execution as it took place.
Scoring Your Resilience Engineering Efforts
By running a variety of experiments, your teams will start surfacing the effort poured into engineering for resilience across the organization.
Reliably recommends focusing on team efforts, not on particular individuals. Resilience engineering is a team and organization effort.
Reliably offers two scores. The first, from A to D, indicates the trend of execution states over the past ten executions. The second is the freshness of that trend, from 0 to 100. The longer it has been since you last ran the experiment, the lower the freshness and therefore the less relevant your knowledge of that experiment.
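The formulas behind these two scores are not described here, so the following is only an illustrative sketch of the idea. It assumes the grade reflects the share of successful runs among the last ten executions, and that freshness decays linearly with the time elapsed since the last run. The thresholds, the 30-day horizon and the `"success"` state name are all assumptions, not Reliably's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def trend_grade(states: list[str], window: int = 10) -> str:
    """Grade the share of successful runs among the most recent executions."""
    recent = states[-window:]
    if not recent:
        return "D"
    ratio = sum(state == "success" for state in recent) / len(recent)
    # Thresholds below are arbitrary, chosen only for illustration.
    if ratio >= 0.9:
        return "A"
    if ratio >= 0.7:
        return "B"
    if ratio >= 0.5:
        return "C"
    return "D"


def freshness(last_run: datetime, now: datetime, horizon_days: int = 30) -> int:
    """Decay linearly from 100 (just ran) to 0 (horizon_days ago or more)."""
    age_days = (now - last_run) / timedelta(days=1)
    remaining = 1 - age_days / horizon_days
    return round(100 * min(1.0, max(0.0, remaining)))


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
states = ["success"] * 8 + ["failure"] * 2

print(trend_grade(states))
print(freshness(now - timedelta(days=15), now))
```

Keeping the two values separate means a good grade earned long ago still shows up as stale, which is the point of pairing the scores.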
That information is aggregated in two other places in Reliably. First, on the dashboard, where the freshness/score is plotted for each experiment.
Second, on the experiment list page.
Reviewing Your Efforts At A Glance
Reliably brings together on its dashboard all the data you need to review your efforts at a glance.
Digging Deeper With the Assistant
Reliably brings you the power of its Assistant so that you can explore new facets of your resilience.
When building a new experiment, Reliably may suggest specific additional activities to rapidly prototype the right scenario.
Additionally, when you enable the OpenAI integration, the Reliably Assistant will issue questions to either GPT-3.5 or GPT-4 and will integrate the answers right back into the execution for greater context.
You can add your own questions directly when building the experiment:
No sensitive information about your execution or organization is ever sent to OpenAI.
Bring Your Own Experiments
The Reliably Builder is a powerful feature. Sometimes, however, you may need to add your own specific experiments for your team to run. Reliably supports importing your very own Chaos Toolkit experiments.
You may also turn these experiments into templates for easier reuse.
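What makes an experiment reusable is parameterization: Chaos Toolkit experiments can declare configuration entries that are read from environment variables, then reference them with `${name}` placeholders in activity arguments. The sketch below mimics that mechanism with the Python standard library only. It is not Chaos Toolkit's own code, and the `app=frontend` value is an assumed example.

```python
import os
from string import Template


def resolve_configuration(configuration: dict, environ=os.environ) -> dict:
    """Resolve entries shaped like {"type": "env", "key": NAME} from environ."""
    resolved = {}
    for name, spec in configuration.items():
        if isinstance(spec, dict) and spec.get("type") == "env":
            resolved[name] = environ[spec["key"]]
        else:
            resolved[name] = spec
    return resolved


def substitute(arguments: dict, values: dict) -> dict:
    """Replace ${name} placeholders found in string arguments."""
    return {
        key: Template(value).substitute(values) if isinstance(value, str) else value
        for key, value in arguments.items()
    }


configuration = {
    "target_service": {"type": "env", "key": "RELIABLY_PARAM_TARGET_SVC"},
}
arguments = {"label_selector": "${target_service}", "cmd": "kill -TERM 1"}

# Example environment; the value is an assumption made for this sketch.
env = {"RELIABLY_PARAM_TARGET_SVC": "app=frontend"}

values = resolve_configuration(configuration, env)
print(substitute(arguments, values))
```

Because the experiment body only refers to `${target_service}`, the same definition can be pointed at another service by changing a single environment variable.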
Templates allow you to turn this:
---
version: 1.0.0
title: Impact of the service process terminating
description: What is the impact of our service process being terminated? Do we get
  any traces anywhere?
configuration:
  target_service:
    type: env
    key: RELIABLY_PARAM_TARGET_SVC
  container_name:
    type: env
    key: RELIABLY_PARAM_CONTAINER_NAME
method:
- name: exec-in-pod
  type: action
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: exec_in_pods
    arguments:
      label_selector: "${target_service}"
      container_name: "${container_name}"
      cmd: kill -TERM 1
Into this:
This is a quick tour of the main features of Reliably. Enjoy making your operations less stressful and more data-driven!