Cloud Reliability and Resilience


Reliability is a measure of the percentage uptime of cloud services to customers, considering the downtime due to faults. Many cloud providers are setting a reliability level of 99.95% (download uptime cheat sheet). This means that if you provision a VM it will be available 99.95% of the time, with a possible downtime of 21.6 minutes per month. Reliability is an important characteristic which enables platforms to adapt and recover under stress and remain functional from a customer perspective. You can find additional information from a Meetup meeting on Cloud Reliability and Resilience.

Every year big companies made the headlines for the wrong reason: reliability problems. In Q1 2016, Microsoft (9 days), Twitter (8h), Apple (7h), are PayPal (7h) are the “lucky” winners:

According to section of IEEE Society working on Reliability, “reliability engineering is a design engineering discipline which applies scientific knowledge to assure that a system will perform its intended function for the required duration within a given environment, including the ability to test and support the system through its total lifecycle (…) it is the probability of failure-free software operation for a specified period of time in a specified environment.”

Cloud resiliency is the ability of a cloud platform or service to recover and continue operating when failures occur. Automated means for recovery are the most desirable solutions.

Open Positions

OpenStack Cloud OS

OpenStack is a cloud operating system (Cloud OS) for building public and private clouds. It can control pools of compute, storage, and networking recourses located in large data centres. It is supported by major IT players in the world which include IBM, HP, Intel, Huawei, Red Hat, AT&T, and Ericsson. At Huawei Research we are currently developing the next generation of reliable cloud platforms for Deutsche Telekom. The Open Telekom Cloud engineered by Huawei and operated by T-Systems was launched at CeBIT 2016 and delivers flexible and convenient cloud services.

Major players are building competences in the field of cloud reliability. Microsoft Trustworthy Computing has a division dedicated to Reliability and IBM offers specialized Resiliency Services to assure continuous business operations and improve overall reliability.

Cloud reliability and resilience of OpenStack can be analyzed and improved at 3 levels:

  • Level 1. OpenStack paltform and services
  • Level 2. Hypervisor and virtual machines (VM) managed
  • Level 3. Applications running inside VMs

We concentrate our efforts on Level 1.

General Problems with Building Large-scale Distributed Systems

Reliable large-scale distributed systems are hard to build since their validation is time consuming, complex, and often non-deterministic. OpenStack is not an exception. Research from Microsoft with MODIST (Junfeng Yang, et al., MODIST: Transparent Model Checking of Unmodified Distributed Systems Proceedings of the 6th Symposium on Networked Systems Design and Implementation (NSDI ‘09), Pages 213-228) exemplifies well the problems associated with general distributed systems. Experiments found a total of 35 bugs in Berkeley DB, a Paxos implementation, and a primary-backup replication protocol implementation. Thus, validation, testing, and benchmarking frameworks are needed, specifically, when OpenStack is used to support mission critical applications.

Building large-scale distributed systems requires the consideration of several theories, technologies, and methodologies, such as:

Available Approaches (Industry solutions, Patents, Research Papers)

Fault Injection

  • Fault-injection technologies or FIT provides approaches to demonstrate that software is robustness and fault tolerance by injecting faults to damage internal components to test its fault tolerance.
  • Domenico Cotroneo and Henrique Madeira. Introduction to software fault injection. In Domenico Cotroneo, editor, Innovative Technologies for Dependable OTS-Based Critical Systems, pages 1–15. Springer Milan, 2013.
  • Haissam Ziade, Rafic A Ayoubi, Raoul Velazco, et al. A survey on fault injection techniques. Int. Arab J. Inf. Technol., 1(2):171–186, 2004.
  • (Graph-based) In Towards a Fault-Resilient Cloud Management Stack, the authors use execution graphs to monitor and observe the processing of external requests. Intrumentation is done between openStack and the hypervisor, the database, REST, HTTP, and AMQP. Server-crash faults are injected by killing relevant service processes via systemd.
  • (Graph-based) In HANSEL: Diagnosing Faults in OpenStack, the auhtors intercept AMQP and REST messages to reconstruct an execution graph. The approach requires network monitoring agents at each node in the OpenStack deployment. One of the challenges is the so-called transaction stitching to reconstruct full transactions to recreate the execution graph.
  • (String-based) In Toward achieving operational excellence in a cloud and US20150161025 A1: Injecting Faults at Select Execution Points of Distributed Applications , the authors rely on the operating system level information to build message traces by observing system events such as SEND or RECV system calls (or LIBC calls). These events are monitored per thread since with higher granularities (i.e., process-level or system-level), the job of separating events is difficult. Message sequences are converted into string of symbols and strings are comapred using an edit distance function. High distances indicate possible anomalies between executions.
  • DICE Fault Injection: A tool to generate faults within Virtual Machine. Under development.
  • Lineage-driven Fault Injection by Peter Alvaro, Joshua Rosen, Joseph M. Hellerstein UC Berkeley, Proceeding SIGMOD ‘15.
  • New Functional Testing in etcd. CoreOS uses a fault-injection framework to simulate the most common cases of failures that the system etcd may meet in real life.
  • To guarantee HA, LinkedIn simulates data center failures and measure the effects. To improve response time and lower the cost of operations, they have built the Nurse system, a workflow engine which enables to define tasks to recover automatically from failures.

  • The book Resilience and Reliability on AWS provides a motivation and a few examples (for beginners) on the importance of reliability. The author shares their experience to achieve resilience and reliability with code examples to monitor Redis or MongoDB. The use of simple techniques to solve the complex problem of reliability of clouds clearly indicates that current solutions are limited and further systmathic approaches are needed.

  • Microsoft proposed the Resilience Modeling and Analysis (RMA) methodology. It is an approach for improving resilience adapted from the industry-standard technique known as Failure Mode and Effects Analysis (FMEA).

  • Fault Injection at Cloudera uses fault-injection tools and elastic-partitioning techniques for the continuous improvement and verification of their Hadoop ecosystem (CDH) via an extensive QA process during the software-development life cycle.
  • OpenStack Reliability Testing describes an abstract methodology for OpenStack cluster high-availability testing and analysis.

Anomaly Detection


Traditional monitoring solutions for cloud platforms and applications, such as Cloudwatch from Amazon AWS, Ceilometer from Openstack, and Nagios, place emphasis on component-based monitoring. Existing solutions collect detailed information on system statistics about virtual machines, CPU, disk IO, hosts, RPC, etc.

Component-based monitoring tools provide not information into the relationship between the components of a distributed service. Since debugging a distributed system is a danting task using these tools, cross-component monitoring (tracing) solutions were explored to aliviate exisitng limitation by tracing the path of events and method calls that are generated at runtime.

The study from Sambasivan, Raja R., et al. titled So, you want to trace your distributed system? Key design insights from years of practical experience (Vol. 102. Technical Report, CMU-PDL-14, 2014) provides a very good overview of tracing systems developed up to 2014, and include the analysis of X-Trace, Magpie, Dapper, etc. The thesis from Nathan Scott of the Monash University, Australia, titled [Unifying Event Trace and Sampled Performance Data] ( also give a fairly good overview of these main tracing systems.

  • X-Trace from Berkeley outputs a set of task graphs according to Lamport’s happens before relation to trace the execution path of a distributed system. It does not rely on physical clocks and uses low level primitives for instrumentation (e.g., xtr::logEvent(string), xtr::logEvent(“end”), pushdown(), and pushnext()).
  • Magpie from Microsoft infers traces by combining event logs generated by low level black-box instrumentation. It automatically extracting individual requests from a running system and constructs a probabilistic workload model. Magpie relies on experts with deep knowledge about the system to construct a schema of how to correlate events in different components. In contrast to other approach (from Google, Twitter, and Cloudera), it infers causal relations from the events generated by the operating system and application instrumentation.
  • Dapper from Google focuses on library and middleware modifications and provides a special context to track execution across async callbacks and RPCs.
  • HTrace from Cloudera is a tracing framework intended for use with distributed systems written in java. It is similar to Dapper and performs end-to-end tracing to capture detailed paths of causally between events generated by the components which make a distributed system.
  • Zipkin from Twitter is also a distributed tracing system. It helps gather timing data needed to troubleshoot latency problems in microservice architectures. Zipkin’s design is based on the Google Dapper.
  • Distributed Tracing at Netflix with Salt Dependency and tracing from Netflix.
  • Jaeger and Tracing HTTP request latency. The approach from Uber on tracing (similar to Dapper and Zipkin).
  • Openstracing OpenTracing is an open distributed tracing standard for applications. Since distributed tracing is important for large-scale distributed systems management and complex services architectures, OpenTracing was created in 2016 to address the standardization problems from instrumentation.

Netflix uses Salp

Facebook uses mystery machine

  • Methods and Systems of Distributed Tracing and US 20140215443 A1 Application and US 9135145 B2 Grant by Rackspace Us, Inc. (Sept 2015). A system and methods are provided for distributed tracing in a distributed application by observing messages sent and received among components of the distributed application, generating a probabilistic model of a call flow, and constructing a call flow graph based on the probabilistic model for the distributed application.
  • Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, and Eric Brewer. 2004. Path-based faliure and evolution management. In Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1 (NSDI’04), Vol. 1. USENIX Association, Berkeley, CA, USA, 23-23.


  • Eliot A Python library for Logging for complex distributed systems
  • Monitoring without Infrastructure @ Airbnb Airbnb uses logstash, statsd, NewRelic, Datadog, and their own open-sourced configuration-as-code alerting framework for monitoring.
  • RefStack. RefStack provides users in the OpenStack community with a Tempest wrapper, refstack-client, that helps to verify interoperability of their cloud with other OpenStack clouds. It does so by validating any cloud implementation against the OpenStack Tempest API tests

Other tools from the field of APM (Application Performance Management), such as NewRelic limite their span to monitor the performance of transactions across web application stacks.

Repair and Recovery

Automation is having an important role in making the management of cloud data centers automated and ‘automatic’. The principals underlying Google SRE and DevOps all call for automation. For cloud reliability and resilience, the procedures to repair the IT infrastructure need to be encoded into executable processes. Large-scale distributed monitoring (dapper), stream and events (spark and flink), and time-based databases (prometheus) provide the base to identify important event that when flagged as alarms need to trigger encoded procedures to repair faulty systems. An important question is ‘How to encode procedures to repair IT infrastructure automatically?’

Huawei’s Approach

Ensuring the reliability of large-scale, complex distributed cloud platform requires new innovative approaches. While NetFlix’s ChaosMonkey proposed a new tool (and concept) for site reliability engineers, it only enables the analysis of cloud native applications. Since at Huawei we are developing highly reliable cloud platforms (e.g., Openstack), the site reliability engineering team developed a new approach framework, called Butterfly Effect, to automatically inject faults into cloud infrastructures.

  • Efficient execution trace processing using stream processing
  • Dynamic time-based fingerprinting to detect timeouts
  • Position and negative fingerprints for automated diagnosis and localization of user commands
  • Rely as much possible on open source and Python (see Python frameworks, libraries, software and resources)

Research Groups and Initiatives