Engineering Reliability for Developers

Reliability as Code

Surface reliability issues and make your users happy with our powerful, open-core, developer-centric tools.

Get Started

reliably slo report
                                       Current  Objective  / Time Window     Type             Trend    
  Service #1: http-api                                   
   99% availability over 1 hour      100.00%        99%  / 1h0m0s          Availability     ✓ ✓ ✓ ✓  
   99.5% availability over 1 day     100.00%      99.5%  / 1d              Availability     ✓ ✓ ✓ ✓  
   99% of requests under 300ms        73.91%        99%  / 1d              Latency          ✕ ✕ ✕ ✕  
   99.9% of requests under 1s         98.55%      99.9%  / 1d              Latency          ✕ ✕ ✕ ✕  
                                                         
  Service #2: products-api                               
   99% availability over 1 day       100.00%        99%  / 1d              Availability     ✓ ✓ ✓ ✓  
   99.5% of requests under 200ms     100.00%      99.5%  / 1d              Latency          ✓ ✓ ✓ ✓

Fast SLO monitoring

Declare your SLOs as Code and get indicators for what you care about, right from your terminal. Export them as JSON, YAML, or Markdown to share with other tools or to start discussing next steps with your team.

Works with AWS and Google Cloud, with more to come!

reliably slo report
                                       Current  Objective  / Time Window     Type             Trend    
  Service #1: http-api                                   
   99% availability over 1 hour      100.00%        99%  / 1h0m0s          Availability     ✓ ✓ ✓ ✓   
   99% of requests under 300ms        73.91%        99%  / 1d              Latency          ✕ ✕ ✕ ✕   
                                                         
  Service #2: products-api                               
   99% availability over 1 day       100.00%        99%  / 1d              Availability     ✓ ✓ ✓ ✓
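
The report above is produced from an SLO declaration like the one below, mixing Google Cloud and AWS indicators.
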
services:
- name: http-api
  service-levels:
  - name: 99% availability over 1 hour
    type: availability
    slo: 99
    sli:
    # the sli id points at the monitored resource in your cloud provider
    - id: project-id/google-cloud-load-balancers/resource-id
      provider: gcp
    window: PT1H    # windows are ISO 8601 durations: PT1H is one hour
  - name: 99.5% availability over 1 day
    type: availability
    slo: 99.5
    sli:
    - id: project-id/google-cloud-load-balancers/resource-id
      provider: gcp
    window: PT24H
- name: products-api
  service-levels:
  - name: 99.9% of requests faster than 300ms
    type: latency
    criteria:
      threshold: 300ms    # latency objectives measure requests against this threshold
    slo: 99.9
    sli:
    - id: arn:partition:service:region:account-id:resource-id
      provider: aws
    window: P1D    # P1D is one day

Secure Kubernetes deployments

The Reliably CLI scans your Kubernetes manifests and clusters to surface potential issues and gives you actionable hints on how to fix them.

Out-of-the-box integrations with GitHub Actions, GitLab CI, and many more. Works wherever you want.
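
For example, a workflow along these lines can scan your manifests on every pull request. This is a minimal sketch rather than the official integration: the workflow layout is an assumption and the install step is a placeholder for however you distribute the CLI.

# Hypothetical GitHub Actions workflow: scan Kubernetes manifests on pull requests.
name: reliably-scan
on: [pull_request]
jobs:
  scan-manifests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install the Reliably CLI
        # Placeholder step; replace with your preferred installation method.
        run: curl -sSL https://example.com/install-reliably.sh | bash
      - name: Scan the manifests in this repository
        run: reliably scan kubernetes --format table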

reliably scan kubernetes --format table
Results:
  manifests/deployment.yaml  Kubernetes:Deployment         K8S-DPL-0007  Setting a high cpu request may render pod scheduling difficult or starve other pods 
  manifests/deployment.yaml  Kubernetes:Deployment         K8S-DPL-0009  Not setting a cpu request means the pod will be allowed to consume the entire available CPU (unless the cluster has set a global limit)
  manifests/deployment.yaml  Kubernetes:Deployment         K8S-DPL-0013  A rollout strategy can reduce the risk of downtime 
  manifests/deployment.yaml  Kubernetes:Deployment         K8S-DPL-0014  Without the 'minReadySeconds' property set, pods are considered available from the first time the readiness probe is valid. Setting this value indicates how long the pod should be ready before being considered available.
  manifests/deployment.yaml  Kubernetes:Deployment         K8S-DPL-0001  You should specify a number of replicas 
  manifests/pod.yaml         Kubernetes:Pod                K8S-POD-0001  You should not use the default 'latest' image tag. It causes ambiguity and leads to the cluster not pulling the new image
  manifests/pod.yaml         Kubernetes:Pod                K8S-POD-0003  Only images from an approved registry can be run 
  manifests/deployment.yaml  Kubernetes:Deployment         K8S-DPL-0012  Image pull policy should usually not be set to 'Always' 
  test-manifest.yaml:92:1    Kubernetes:PodSecurityPolicy  K8S-PSP-0001  Enabling privileged can lead to unwanted escalation from the container's process 
  test-manifest.yaml:92:1    Kubernetes:PodSecurityPolicy  K8S-PSP-0007  To reduce the risk of accessing files outside of allowed paths, it's best to make them read-only 
Summary:
  10 suggestions found
   3 info -  5 warning -  2 error
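
As a hypothetical illustration, a bare-bones Deployment like the one below would trigger several of the suggestions above: it declares no replica count or rollout strategy, sets no cpu request, and uses the 'latest' image tag.

# Hypothetical manifest, for illustration only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: http-api
spec:
  # no 'replicas' and no rollout strategy specified
  selector:
    matchLabels:
      app: http-api
  template:
    metadata:
      labels:
        app: http-api
    spec:
      containers:
      - name: http-api
        image: nginx:latest   # unpinned tag: ambiguous and may not be re-pulled
        # no resources.requests.cpu: the pod may consume all available CPU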

Chaos Engineering Validation

Run chaos engineering experiments with the open source Chaos Toolkit to surface otherwise undiscoverable issues and validate your decisions.

The Chaos Toolkit is sponsored by Reliably, maintained with the help of a great community, and used by teams at CapitalOne, Wix, Daimler, Lego, Oracle, and many more.

chaos run experiments/db-connection-loss-does-not-harm-availability/terminate-db-instance.json
[2018-02-22 15:04:20 INFO] Validating the experiment's syntax
[2018-02-22 15:04:20 INFO] Experiment looks valid
[2018-02-22 15:04:20 INFO] Running experiment: Terminate the database master should not prevent application from running
[2018-02-22 15:04:21 INFO] Steady state hypothesis: Services are all available and healthy
[2018-02-22 15:04:21 INFO] Probe: application-should-be-alive-and-healthy
[2018-02-22 15:04:21 INFO] Probe: application-must-respond
[2018-02-22 15:04:21 INFO] Steady state hypothesis is met!
[2018-02-22 15:04:21 INFO] Action: terminate-db-master
[2018-02-22 15:04:21 INFO] Pausing after activity for 2s...
[2018-02-22 15:04:21 INFO] Probe: application-must-respond
[2018-02-22 15:04:23 INFO] Pausing before next activity for 5s...
[2018-02-22 15:04:23 INFO] Probe: fetch-application-logs
[2018-02-22 15:04:28 INFO] Pausing before next activity for 20s...
[2018-02-22 15:04:28 INFO] Probe: fetch-db-logs
[2018-02-22 15:04:48 INFO] Probe: fetch-patroni-operator-logs
[2018-02-22 15:04:48 INFO] Steady state hypothesis: Services are all available and healthy
[2018-02-22 15:04:49 INFO] Probe: application-should-be-alive-and-healthy
[2018-02-22 15:04:49 INFO] Probe: application-must-respond
[2018-02-22 15:04:49 CRITICAL] Steady state probe 'application-must-respond' is not in the given tolerance so failing this experiment
[2018-02-22 15:04:49 INFO] Let's rollback...
[2018-02-22 15:04:49 INFO] No declared rollbacks, let's move on.
[2018-02-22 15:04:49 INFO] Experiment ended with status: failed
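
The run above is driven by a declarative experiment file. Here is a minimal sketch of one in YAML (the Chaos Toolkit accepts both JSON and YAML); the endpoint, pod names, and tolerance are assumptions, not the contents of the experiment used above.

# Hypothetical experiment sketch; endpoint, pod names, and tolerance are assumptions.
title: Terminate the database master should not prevent application from running
description: The application should keep serving traffic while the database fails over.
steady-state-hypothesis:
  title: Services are all available and healthy
  probes:
  - type: probe
    name: application-must-respond
    tolerance: 200                            # expect an HTTP 200 from the health endpoint
    provider:
      type: http
      url: https://my-app.example.com/health  # assumed endpoint
      timeout: 3
method:
- type: action
  name: terminate-db-master
  provider:
    type: process
    path: kubectl
    arguments: delete pod db-master-0 -n databases   # assumed pod and namespace
  pauses:
    after: 2
- type: probe
  name: fetch-application-logs
  provider:
    type: process
    path: kubectl
    arguments: logs deployment/http-api --tail=100   # assumed workload name
  pauses:
    before: 5
rollbacks: []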

SLOs are discussion enablers

Thoughts and ideas to get the discussion started on making your systems more reliable.