AIOps

Introduction

In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms can no longer be done manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally exploiting AI to provide tools for autonomous cloud operations. We rely on deep learning, distributed traces, and time-series analysis (sequence analysis) to effectively detect and fix anomalous cloud infrastructure behaviors during operations and reduce the workload of human operators.

The iForesight system is being used to evaluate new O&M approaches. iForesight 3.0 is the result of two years of research with the goal of providing an intelligent new tool for SRE cloud maintenance teams. Using artificial intelligence, it enables them to quickly detect and predict anomalies when cloud services are slow or unresponsive.

Problem

Existing tools for monitoring IT infrastructures, networks, and applications focus on collecting logs, metrics, events, and traces from distributed systems mainly for visualization. Nonetheless, the final goal of monitoring is to reach a level of technological development where we have tools that conduct root cause analysis with high accuracy and enable systems to recover autonomously. To achieve this goal, we still need to shift from a data collection stage to an insight- and action-driven paradigm. One promising path to monitor planet- and large-scale platforms is to rely on advanced analytics and explore techniques from statistics, time-series analysis, data mining, natural language processing, graph processing, machine learning, and deep learning to extract insights from large volumes of monitoring data to support and drive recovery actions.

Approach

The mission of the Intelligent Cloud Operations SRE team (based in Munich, Germany) is to develop new systems and tools to analyze observability data from Huawei Cloud to detect impact to customers, identify the root cause within seconds, and fix the problem using the 1/5/10 rule (detection: 1 min, RCA: 5 min, recovery: 10 min).

The following figure from Gartner provides a high-level architecture of the system we are building, highlighting the main areas of concern:

The use of ML for production engineering can support the development of new approaches for:

  1. Monitoring and alerting
  2. Anomaly detection and Root Cause Analysis
  3. Capacity planning and prediction
  4. Canary validation
  5. Service Scaling
  6. Operational performance

Our work focuses on points 1) and 2).

In 2017 we introduced AI, in the form of Data Science and Machine Learning approaches for anomaly detection, root-cause analysis, fault prediction, and automated recovery, into our suite.

These techniques, including statistical learning, time-series analysis, deep learning, big data, streaming, and data visualization, enabled us to develop new production-ready services for troubleshooting Huawei Cloud and detecting issues that were previously undetectable.

Challenges

The challenges of operationalising AI are not limited to understanding deep learning or machine learning algorithms. Major challenges relate to software engineering, accessing and processing large amounts of distributed data, managing, updating, deleting, and training models on specialized GPUs and hardware, composing workflows to orchestrate parallel jobs, and visually managing models, workflows, and results.

Huawei Cloud

Our cloud has planet-scale technical requirements, with a microservices architecture composed of hundreds of services. These services are distributed over thousands of hosts across many geographical regions and operate with an availability higher than five nines.

Our system monitors Huawei Cloud, which is built on top of OpenStack, an open-source cloud operating system. OpenStack controls large pools of compute, storage, and networking resources across tens of datacenters. The base services are predominantly written in Python and Java and run on Linux.

Huawei Cloud is one of the largest and fastest growing platforms in the world. It has a strong global presence, with over 40 availability zones located across 23 geographical regions, ranging from Germany, France, South/Central America, Hong Kong, and Russia to Thailand and South Africa.

Data Sources

Generally, four types of data sources are desirable to monitor and collect:

  1. Logs. Services, microservices, and applications generate logs, composed of timestamped records with structured and free-form text, which are stored in system files.
  2. Metrics. Examples of metrics include CPU load, available memory, and the response time of an HTTP request.
  3. Traces. Traces record the workflow and tasks executed in response to an HTTP request.
  4. Events. Major milestones which occur within a data center can be exposed as events. Examples include alarms, service upgrades, and software releases.

Once the data is collected, we apply reasoning algorithms and techniques to find suspicious patterns, correlating multiple data sources to detect anomalies, conduct root-cause analysis, and remediate systems.
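
A minimal sketch of how these four record types could be represented before correlation. The field names below are illustrative assumptions, not the actual iForesight data model:

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class LogRecord:                     # free-form text plus structured context
        timestamp: float                 # seconds since epoch
        service: str
        level: str                       # e.g. "INFO", "ERROR"
        message: str

    @dataclass
    class MetricSample:                  # one point of a numeric time series
        timestamp: float
        service: str
        name: str                        # e.g. "cpu_load", "response_time_ms"
        value: float

    @dataclass
    class TraceSpan:                     # one unit of work within a request
        timestamp: float
        trace_id: str
        span_id: str
        parent_id: Optional[str]
        service: str
        operation: str
        duration_ms: float

    @dataclass
    class Event:                         # major milestone, e.g. alarm or upgrade
        timestamp: float
        kind: str                        # e.g. "alarm", "upgrade", "release"
        attributes: Dict[str, str] = field(default_factory=dict)

Sharing a timestamp and a service identifier across all four record types is what makes the later correlation of metrics, traces, logs, and events tractable.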

What to Analyse

The Google SRE team proposed four Golden Signals which provide key insights into how a distributed system is running; a toy computation over a window of requests is sketched after the list.

  • Latency. Time to handle a request (aka response time)
  • Traffic. How much demand is being placed on a system
  • Errors. Rate of requests that fail
  • Saturation. Constraints placed on service resources
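
A minimal sketch of how these four signals could be computed over a one-minute window of request records; the record fields and the capacity figure are assumptions for illustration:

    from statistics import quantiles

    def golden_signals(requests, window_seconds=60, capacity_rps=1000):
        """Compute toy Golden Signals for one window of requests.

        Each request is assumed to be a dict such as
        {"latency_ms": 120.0, "status": 200}.
        """
        latencies = sorted(r["latency_ms"] for r in requests)
        if len(latencies) >= 2:
            p99 = quantiles(latencies, n=100)[-1]        # 99th percentile latency
        else:
            p99 = latencies[0] if latencies else 0.0
        errors = sum(1 for r in requests if r["status"] >= 500)
        traffic_rps = len(requests) / window_seconds

        return {
            "latency_p99_ms": p99,                                        # Latency
            "traffic_rps": traffic_rps,                                   # Traffic
            "error_rate": errors / len(requests) if requests else 0.0,    # Errors
            "saturation": traffic_rps / capacity_rps,                     # Saturation vs. assumed capacity
        }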

Anomaly detection

Services can be analyzed from a latency perspective to identify unreachable endpoints, permanent changes in latency, and latency spikes.

For example, we can autonomously identify anomalous microservice latencies by dynamically choosing temporal features, predict memory leaks before they impact systems, or find rare message entries in service logs with billions of records. We apply all these techniques to real-time data streams.
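
As a highly simplified illustration of the latency case, a rolling z-score flags samples that deviate strongly from a recent baseline; the production detectors are more sophisticated (they choose temporal features dynamically), so this sketch only conveys the basic idea:

    from collections import deque
    import math

    def rolling_zscore_detector(latency_stream, window=120, threshold=4.0):
        """Yield (latency, is_anomaly) pairs for a stream of latency samples."""
        history = deque(maxlen=window)
        for latency in latency_stream:
            if len(history) >= window // 2:          # wait until a baseline exists
                mean = sum(history) / len(history)
                var = sum((x - mean) ** 2 for x in history) / len(history)
                std = math.sqrt(var)
                is_anomaly = std > 0 and abs(latency - mean) / std > threshold
            else:
                is_anomaly = False
            yield latency, is_anomaly
            history.append(latency)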

As another example, although distributed logging is a solved problem and many solutions already exist, what still needs to be mastered is the extraction of meaningful and actionable information from massive logs. While many argue that “the more [data] the merrier”, in reality, the more log statements you have, the less you can find due to noise and non-determinism.
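
One simple way to surface such rare entries is to mask the variable parts of each message (numbers, hex identifiers) and count the resulting templates; lines whose template hardly ever occurs become candidates for inspection. A minimal sketch, with illustrative masking rules:

    import re
    from collections import Counter

    def to_template(message: str) -> str:
        """Mask variable parts of a log message to obtain a coarse template."""
        masked = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", message)    # hex identifiers
        masked = re.sub(r"\b\d+(\.\d+)*\b", "<NUM>", masked)    # numbers, IPs, versions
        return masked

    def rare_messages(log_lines, max_count=5):
        """Return log lines whose template occurs at most max_count times."""
        counts = Counter(to_template(line) for line in log_lines)
        return [line for line in log_lines if counts[to_template(line)] <= max_count]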

Building on the success of anomaly detection in 2017–2018, in 2019 we are planning the next phase of our next-gen monitoring and troubleshooting suite. We will extend anomaly detection by implementing two new detector services for distributed traces and service logs. All the anomaly detectors contribute results to a central knowledge repository of metric, trace, and log observations, alarms, and relevant external events (e.g., platform upgrades).

Root-cause analysis

A semi-supervised machine learning system will analyze the repository to automatically identify complex incidents associated with failures and explain the possible underlying root cause to SREs and operators. This analysis will learn associations between anomalies, alerts, and external events, which will be formalized as rules and stored in a knowledge-based system. On top of this, a smart assistant will help operators make associations and decisions about the relationships between alerts and anomalies for root-cause analysis.
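
One way to picture the rule learning step: count which anomaly kinds tend to follow which external events within a short time window, and keep frequent, confident co-occurrences as candidate rules. The sketch below is a naive illustration, not the actual semi-supervised system; the window and thresholds are assumptions:

    from collections import Counter

    def candidate_rules(anomalies, events, window_s=600, min_support=5, min_conf=0.8):
        """Derive naive "event kind -> anomaly kind" rules from co-occurrence.

        anomalies -- list of (timestamp, anomaly_kind)
        events    -- list of (timestamp, event_kind), e.g. (1546300800.0, "upgrade")
        """
        event_counts = Counter(kind for _, kind in events)
        pair_counts = Counter()
        for e_ts, e_kind in events:
            followed_by = {a_kind for a_ts, a_kind in anomalies
                           if 0 <= a_ts - e_ts <= window_s}     # anomalies after the event
            for a_kind in followed_by:
                pair_counts[(e_kind, a_kind)] += 1

        rules = []
        for (e_kind, a_kind), count in pair_counts.items():
            confidence = count / event_counts[e_kind]           # P(anomaly | event kind)
            if count >= min_support and confidence >= min_conf:
                rules.append({"if_event": e_kind, "then_anomaly": a_kind,
                              "confidence": confidence})
        return rules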

Several techniques can be used for root cause analysis, e.g. (a trace-analysis sketch follows the list):

  • Physical Host Analysis: Resource saturation, e.g., high CPU utilization, >90% memory utilization, high network packet drop rates, low I/O utilization, and memory leaks
  • Traffic Analysis: Correlation between a sudden increase in requests (e.g., the Slashdot effect) and increased request latency.
  • Trace Analysis: Component or dependency failure, structural trace analysis, response time span analysis.
  • Event Analysis: Causality between upgrades, reconfigurations, or forklift replacements and subsequent failures.
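
As an illustration of the trace analysis case, a slow trace can be localized by comparing each span's duration with a per-operation baseline learned from healthy traces and reporting the most deviant span. A minimal sketch, assuming spans expose service, operation, and duration fields:

    def suspicious_span(trace_spans, baselines):
        """Return the span deviating most from its baseline, or None.

        trace_spans -- list of dicts: {"service", "operation", "duration_ms"}
        baselines   -- {(service, operation): (mean_ms, std_ms)} from healthy traces
        """
        worst, worst_score = None, 0.0
        for span in trace_spans:
            key = (span["service"], span["operation"])
            if key not in baselines:
                continue
            mean_ms, std_ms = baselines[key]
            if std_ms <= 0:
                continue
            score = (span["duration_ms"] - mean_ms) / std_ms    # std deviations too slow
            if score > worst_score:
                worst, worst_score = span, score
        return worst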

Auto remediation

Once methods for anomaly detection and root cause analysis are mastered, the next step is to look into auto remediation. The first approach consists of running automated diagnostic scripts (runbooks) to troubleshoot and gain insight into the current state of components, services, or systems, followed by manual remediation. As knowledge of failure modes is gained, failure patterns are identified and recovery is encoded into automated remediation scripts. Often, only simple failure cases can be handled, but this constitutes a very good starting point for more complex scenarios. Examples include rebooting a host, restarting a microservice or hung process, freeing disk space, and removing cached data. As knowledge of running systems accumulates, auto-remediation becomes pervasive, with service owners defining their own recovery actions.
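
Conceptually, such remediation scripts boil down to a mapping from an identified failure pattern to a safe, well-understood action. The sketch below is purely illustrative; the patterns, hosts, and commands are hypothetical placeholders, not actual production runbooks:

    import subprocess

    def restart_service(ctx):
        # Hypothetical action: restart the affected unit on the target host.
        subprocess.run(["ssh", ctx["host"], "sudo", "systemctl", "restart", ctx["service"]],
                       check=True)

    def free_disk_space(ctx):
        # Hypothetical action: delete cached data older than 7 days.
        subprocess.run(["ssh", ctx["host"], "find", "/var/cache/app", "-mtime", "+7", "-delete"],
                       check=True)

    # Map identified failure patterns to remediation actions (runbooks).
    RUNBOOKS = {
        "hung_process": restart_service,
        "disk_full": free_disk_space,
    }

    def remediate(failure_pattern, ctx):
        action = RUNBOOKS.get(failure_pattern)
        if action is None:
            return False          # unknown pattern: escalate to a human operator
        action(ctx)
        return True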

The workflow

In practice, these three tasks – anomaly detection, RCA, and remediation – are linked together to provide an end-to-end solution for O&M. For example, when anomaly detection identifies an HTTP endpoint with high latency by analysing metrics, distributed traces are immediately analysed to reveal exactly which microservice or component is causing the problem. Its logs and context metrics are accessed to quickly diagnose the issue. Afterwards, when sufficient evidence characterizing the problem is collected, a remediation action can be executed.
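
A toy version of this end-to-end flow, chaining the hypothetical sketches from the previous sections (none of the function names correspond to actual iForesight APIs):

    def handle_endpoint(endpoint, metrics, traces, logs, baselines):
        """Toy O&M flow: detect, localize, diagnose, then remediate."""
        # 1. Anomaly detection on latency metrics (rolling z-score sketch above).
        flags = [flag for _, flag in rolling_zscore_detector(metrics[endpoint])]
        if not any(flags):
            return "healthy"

        # 2. Root cause localization via distributed traces.
        culprit = suspicious_span(traces[endpoint], baselines)
        if culprit is None:
            return "anomaly detected, escalate to SRE"

        # 3. Collect evidence from the culprit's logs.
        evidence = rare_messages(logs[culprit["service"]])

        # 4. Remediate once sufficient evidence characterizes the problem.
        ctx = {"host": culprit.get("host", "unknown"), "service": culprit["service"]}
        if evidence and remediate("hung_process", ctx):
            return "remediated"
        return "diagnosed, manual remediation required"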

Evaluation

We evaluate the techniques and algorithms we built using a 3-level approach:

  • Synthetic data. We built models simulating microservice applications which are able to generate data under very specific conditions. The simulated scenarios are usually difficult to obtain using testbeds and production systems. The controlled data enables a fine-grained understanding of how new algorithms behave and is an effective basis for improvement and redesign (see the generator sketch after this list).
  • Testbed data. Once an algorithm passes the evaluation using synthetic data, we perform a second evaluation using testbed data. We run an OpenStack cloud platform under normal utilization. Faults are injected into the platform and we expect algorithms to detect anomalies, find their root cause, predict errors, and remediate failures.
  • Production data. In the last step of the evaluation, we deploy algorithms in planet-scale production systems. This is the final evaluation, in a noisy environment which generally causes algorithms to generate many false positives. Accuracy, performance, and resource consumption are recorded.
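
A minimal example of the synthetic-data step: generate a latency series with known, labeled anomalies so that a detector (such as the rolling z-score sketch above) can be scored exactly; all parameters are illustrative:

    import random

    def synthetic_latency(n=10_000, base_ms=50.0, noise_ms=5.0,
                          anomaly_rate=0.01, spike_factor=8.0, seed=42):
        """Return (series, labels): latency samples plus ground-truth anomaly flags."""
        rng = random.Random(seed)
        series, labels = [], []
        for _ in range(n):
            is_anomaly = rng.random() < anomaly_rate
            value = rng.gauss(base_ms, noise_ms)
            if is_anomaly:
                value *= spike_factor            # inject a known latency spike
            series.append(max(value, 0.0))
            labels.append(is_anomaly)
        return series, labels

    # Score a detector against the ground truth.
    series, labels = synthetic_latency()
    flags = [flag for _, flag in rolling_zscore_detector(series)]
    tp = sum(1 for f, l in zip(flags, labels) if f and l)
    precision = tp / max(sum(flags), 1)
    recall = tp / max(sum(labels), 1)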

Many public datasets are also available for conducting comparative studies.

Exploring SRE Pain Points

After identifying a pain point, we identify the following elements to develop a solution:

  • Existing manual troubleshooting workflows that can be automated
  • Key golden metrics which can enable effective anomaly detection
  • Data sources for root cause analysis
  • Manual recovery actions
  • Critical components which require special monitoring infrastructure

Tech Stack

AIOps not only requires new methods and techniques from the fields of statistics and ML, but also needs online and offline big data infrastructure (such as Hadoop, HBase, Spark, Gobblin, and Presto) to ingest and process monitoring data at scale, which can reach several PB/day. For example, Facebook uses Presto for interactive queries over their 300PB data stores.
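
As an illustration of this kind of pipeline, the sketch below reads JSON metric samples from a Kafka topic with Spark Structured Streaming and aggregates per-service latency over one-minute windows. The broker address, topic name, and payload schema are assumptions, not our actual deployment:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg, col, from_json, window
    from pyspark.sql.types import (DoubleType, StringType, StructField,
                                   StructType, TimestampType)

    spark = SparkSession.builder.appName("metric-aggregation").getOrCreate()

    # Assumed JSON payload: {"ts": "2019-01-01T00:00:00", "service": "nova", "latency_ms": 12.3}
    schema = StructType([
        StructField("ts", TimestampType()),
        StructField("service", StringType()),
        StructField("latency_ms", DoubleType()),
    ])

    metrics = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
               .option("subscribe", "metrics")                     # hypothetical topic
               .load()
               .select(from_json(col("value").cast("string"), schema).alias("m"))
               .select("m.*"))

    aggregates = (metrics
                  .withWatermark("ts", "2 minutes")
                  .groupBy(window(col("ts"), "1 minute"), col("service"))
                  .agg(avg("latency_ms").alias("avg_latency_ms")))

    query = (aggregates.writeStream
             .outputMode("update")
             .format("console")             # a real sink (e.g. OpenTSDB) would differ
             .start())
    query.awaitTermination()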

iForesight is built using the following software stack and applications.

  • Frontend: Grafana, Jupyter, Node.js
  • AI: Tensorflow, Keras, PyTorch, Pandas/NumPy, Scikit-learn, Huawei Model Arts
  • Backend: Microservices, Docker, MySQL
  • Big Data: OpenTSDB, Hive, ArangoDB, HBase, Elasticsearch, Spark Streaming
  • Transport: Kafka
  • Data sources: metrics, app logs, tracing, alarms, topologies, and change events
  • Language: Python

In 2019, we will closely follow the progress made in the following five fields to extend our stack and suite:

Systems from Academia and Industry

Team and Culture

Several researchers have contributed to iForesight, namely, Ilya Shakhat, Paul Staab, Wei Guangsheng, Jinxunmi, Sasho Nedelkoski, Alexander Wieder, Yi Feng, Florian Richter, Francesco del Buono, Phani Pawan, and Ankur Bhatia, among others.

Our skill set encompasses expertise in the fields of AI/Data Science (Analytics), Software Engineering (Analysis, Design, Development, Testing), and Operation (Deployment, Infrastructure).

Our culture of innovation and R&D is based on 4 main guiding principles:
