Software Systems

ML for Systems

For hyperscalers such as Huawei Cloud, the Operation and Maintenance (O&M) of cloud infrastructures and platforms can no longer be done manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally exploiting AI to provide tools for autonomous cloud operations.

Huawei Cloud (HC) has a microservices architecture composed of hundreds of services, distributed over thousands of hosts in many geographical regions and operating with an availability higher than five nines. Huawei Cloud is one of the largest and fastest-growing cloud platforms in the world, with a strong presence across more than 40 availability zones in 23 geographical regions, ranging from Germany, France, South/Central America, Hong Kong, and Russia to Thailand and South Africa.

The objective of the AIOps / Reliability Team (based in Munich, Germany) was to develop new systems and tools to analyze observability data from Huawei Cloud to detect problems which impact customers, identify the root cause within seconds, and fix failures using the 1/5/10 rule (detection: 1 min, RCA: 5 min, recovery: 10 min). We generally build tools for anomaly detection, root-cause analysis, performance analysis, predictive maintenance, security operations, and operations automation for Cloud and intelligent management (1):

  • Security Operations: SecOps integrates monitoring, tools, processes, and technology to keep IT secure while reducing risk.
  • Intelligent Log Analysis: Explore the use of structured logging to facilitate the application of AI/ML methods for root-cause analysis (see the sketch after this list).
  • Hypervisor Reliability: Identifying health issues of hypervisors correlated with latent failures.
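
As an illustration of the intelligent log analysis item above, the following is a minimal sketch of how structured, JSON-encoded logs can be grouped by message template and service so that error bursts can be ranked as a first hint for root-cause analysis. The field names and sample records are hypothetical, not Huawei Cloud formats.

```python
import json
from collections import Counter

# Hypothetical structured log records (JSON lines); field names are illustrative only.
raw_logs = [
    '{"ts": "2023-05-01T10:00:01Z", "service": "compute", "level": "ERROR", "template": "failed to attach volume {id}"}',
    '{"ts": "2023-05-01T10:00:02Z", "service": "compute", "level": "ERROR", "template": "failed to attach volume {id}"}',
    '{"ts": "2023-05-01T10:00:03Z", "service": "network", "level": "INFO",  "template": "port {id} created"}',
    '{"ts": "2023-05-01T10:00:04Z", "service": "storage", "level": "ERROR", "template": "timeout contacting backend {host}"}',
]

# Because the logs are structured, grouping by (service, template) is a simple
# dictionary operation instead of free-text parsing.
error_counts = Counter(
    (rec["service"], rec["template"])
    for rec in map(json.loads, raw_logs)
    if rec["level"] == "ERROR"
)

# Rank error templates by frequency as a starting point for root-cause analysis.
for (service, template), count in error_counts.most_common():
    print(f"{count:3d}  {service:8s}  {template}")
```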

From 2015 to 2024, we applied AI from the fields of Data Science, Machine Learning, and Deep Learning, including statistical learning, time-series analysis, big data, streaming, and data visualization, which enabled us to develop new production-ready services for troubleshooting Huawei Cloud and to detect issues that were previously undetectable.

(1) Technical University of Berlin (TUB)

AI for Operations

We developed several cutting-edge tools and solutions focused on failure prediction, failure prevention, and anomaly detection to enhance the operation of cloud infrastructures. By leveraging advanced machine learning algorithms and data analytics, we enabled HUAWEI CLOUD operators to anticipate potential issues, optimize system performance, and ensure the reliability and resilience of the cloud infrastructure.

Failure Prevention: We enhanced the global, decentralized, and scalable HUAWEI CLOUD Cloud Log Service to collect, analyze, and manage petabytes of logs and event data generated by the cloud infrastructure and on-premises systems.

Failure Prediction: We developed new systems for HUAWEI CLOUD datacenters to predict the failure of HDDs, SSDs, RAM, and optical network transceivers using Machine Learning.
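
As a hedged sketch of the failure-prediction idea (not the production system): a classifier trained on device health telemetry, illustrated here with hypothetical SMART attributes for HDDs, can emit a failure probability per device. The CSV file, column names, and label are assumptions for illustration, and scikit-learn stands in for the actual ML stack.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical daily SMART telemetry per drive; "failed_within_30d" is the label.
df = pd.read_csv("smart_telemetry.csv")  # assumed file, one row per drive per day
features = ["reallocated_sectors", "pending_sectors", "seek_error_rate",
            "power_on_hours", "temperature"]          # illustrative attributes
X, y = df[features], df["failed_within_30d"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Class weighting helps with the heavy imbalance typical of failure data.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
# Failure probabilities can feed a maintenance queue, highest risk first.
risk = clf.predict_proba(X_test)[:, 1]
```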

Anomaly Detection: We built a distributed Cloud Trace Service for HUAWEI CLOUD to follow and profile the execution of public cloud service requests as they travel across multiple infrastructure services, components, middleware, and systems in public and private clouds.
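
The sketch below illustrates, under simplified assumptions, what analysis on top of such traces can look like: spans (operation name, parent, duration) grouped per operation, with a span flagged when its latency is far above that operation's typical value. The span fields, operation names, and threshold are illustrative, not the Cloud Trace Service schema.

```python
from collections import defaultdict
from statistics import median

# Minimal span model: (trace_id, span_id, parent_id, operation, duration_ms).
spans = [
    ("t1", "s1", None, "api.create_server",  120.0),
    ("t1", "s2", "s1", "scheduler.select",    15.0),
    ("t1", "s3", "s1", "compute.boot",       880.0),
    ("t2", "s4", None, "api.create_server",  130.0),
    ("t2", "s5", "s4", "scheduler.select",    14.0),
    ("t2", "s6", "s4", "compute.boot",       910.0),
    ("t3", "s7", None, "api.create_server",  125.0),
    ("t3", "s8", "s7", "scheduler.select",    16.0),
    ("t3", "s9", "s7", "compute.boot",      5400.0),  # unusually slow span
]

# Latency baseline per operation across all observed traces.
by_op = defaultdict(list)
for _, _, _, op, dur in spans:
    by_op[op].append(dur)

def is_anomalous(op, dur, factor=3.0):
    """Flag a span whose duration exceeds `factor` times the operation's median latency."""
    return dur > factor * median(by_op[op])

for trace_id, span_id, _, op, dur in spans:
    if is_anomalous(op, dur):
        print(f"trace {trace_id}: span {span_id} ({op}) anomalous at {dur} ms")
```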

Observability

Using machine learning, we enhanced the HUAWEI CLOUD Cloud Monitoring Service, which is used to monitor and manage the performance, health, and security of global cloud infrastructures.
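
A hedged sketch of one building block behind such ML-based monitoring: an exponentially weighted moving average (EWMA) with a deviation band over a metric stream, flagging points that drift far from the smoothed baseline. The metric values and parameters are made up for illustration.

```python
def ewma_anomalies(values, alpha=0.3, k=3.0):
    """Yield (index, value) pairs that deviate more than k times the
    smoothed absolute deviation from an exponentially weighted mean."""
    mean_est, dev_est = values[0], 0.0
    for i, x in enumerate(values[1:], start=1):
        if dev_est > 0 and abs(x - mean_est) > k * dev_est:
            yield i, x
        # Update the running estimates after the check so the anomaly
        # does not immediately inflate the baseline.
        dev_est = alpha * abs(x - mean_est) + (1 - alpha) * dev_est
        mean_est = alpha * x + (1 - alpha) * mean_est

# Hypothetical CPU utilisation samples (%) for one host, one per minute.
cpu = [41, 43, 40, 42, 44, 43, 41, 95, 42, 40]
print(list(ewma_anomalies(cpu)))  # flags the spike at index 7
```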

Cloud Reliability

From 2015 to 2020, we worked on improving the Reliability and Resilience of Huawei Cloud (HC) and Open Telekom Cloud (OTC), since in its early days HC had a strong dependence on OpenStack.

We developed several new tools and systems based on OpenStack and distributed tracing.

The following presentations/lectures (1,2) provide an overview of this work:

  • Introduction: Hyperscalers, cloud monitoring, AI and O&M, monitoring formats, ML for O&M.
  • OpenStack Cloud OS: Virtualization, public clouds, OpenStack system design, OpenStack services (IMS, compute, Nova, scheduler, network, storage).
  • OpenStack Hands-on: Setup infrastructure, install OpenStack, CLI, launch instances, attach volumes, create networks, distributed tracing.
  • Distributed Tracing Technologies: Workflow for VM creation, tracing concepts, tracing systems, Zipkin, Jaeger, OpenTracing, OSProfiler.
  • Distributed Trace Analysis: Monitoring data sources, troubleshooting with tracing, feature selection, trace abstraction, time series analysis, sequence analysis, LSTM (see the sketch after this list).
  • Distributed Trace Analysis (Hands-on): Jupyter notebook with running code for distributed trace analysis for OpenStack.
  • Cloud Benchmarking: Benchmarking public cloud platforms, ECS and RDS benchmarking.
  • Cloud Computing: Overview, concepts, web APIs, platforms, applications, and BPM.
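
To make the trace abstraction and sequence analysis topics above concrete, here is a minimal sketch: traces abstracted into sequences of operation names and checked against the operation transitions seen in normal traces. The operation names are hypothetical, and a simple bigram model stands in for the LSTM used in the lectures.

```python
from collections import defaultdict

# Traces abstracted into sequences of operation names (hypothetical examples).
normal_traces = [
    ["api.create_server", "scheduler.select", "compute.boot", "network.attach"],
    ["api.create_server", "scheduler.select", "compute.boot", "network.attach"],
    ["api.create_server", "scheduler.select", "compute.boot", "storage.attach", "network.attach"],
]

# Learn which operation transitions occur in normal traces (a bigram model;
# an LSTM plays the same role on much larger data).
transitions = defaultdict(set)
for trace in normal_traces:
    for prev_op, next_op in zip(trace, trace[1:]):
        transitions[prev_op].add(next_op)

def unseen_transitions(trace):
    """Return the transitions in `trace` that never appear in normal traces."""
    return [(a, b) for a, b in zip(trace, trace[1:]) if b not in transitions[a]]

suspect = ["api.create_server", "compute.boot", "network.attach"]  # skips the scheduler
print(unseen_transitions(suspect))  # [('api.create_server', 'compute.boot')]
```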

(1) Technical University of Berlin (TUB), (2) Technical University of Munich (TUM)

Service Systems

Our contributions on service systems placed emphasis on four fields: service description languages (with the USDL family), service system modeling (with the LSS-USDL language), service analytics (using process mining), and service networks (using principles from social networks).

  • Service Analytics. We analyse large logs from IT service provisioning (e.g., application logs, transactions, ITIL) to find behaviour patterns using process mining (see the sketch after this list).

  • Service Descriptions. We developed the Linked USDL language (Unified Service Description Language) to describe services using computer-understandable specifications, formal ontologies (RDFS), and AI for inference.

  • Service Systems. We developed the Linked Service System model for the Unified Service Description Language (LSS-USDL) using lightweight semantic models to capture service systems.
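
As a hedged illustration of the service analytics direction: the event log and activity names below are invented, and the snippet only shows the directly-follows counting step that many process-mining discovery algorithms start from.

```python
from collections import Counter, defaultdict

# Hypothetical IT service provisioning event log: (case_id, activity), ordered by time.
event_log = [
    ("c1", "Ticket opened"), ("c1", "Assigned"), ("c1", "Resolved"), ("c1", "Closed"),
    ("c2", "Ticket opened"), ("c2", "Assigned"), ("c2", "Escalated"),
    ("c2", "Resolved"), ("c2", "Closed"),
    ("c3", "Ticket opened"), ("c3", "Assigned"), ("c3", "Resolved"), ("c3", "Closed"),
]

# Group events by case, preserving order.
cases = defaultdict(list)
for case_id, activity in event_log:
    cases[case_id].append(activity)

# Count directly-follows relations: how often activity A is immediately followed by B.
directly_follows = Counter(
    (a, b) for trace in cases.values() for a, b in zip(trace, trace[1:])
)

for (a, b), n in directly_follows.most_common():
    print(f"{a} -> {b}: {n}")
```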

We also explored the concept of service networks. The observation that the power of service-based economies is no longer restricted to individual organizations, but spans across networks, was the main driver for conducting service network research.

See GitHub: LSS-USDL and Linked-USDL.

We also explored the concept of Process Analytics. Our intentions are twofold. On the one hand, we think it is fundamental to survey findings from neighboring disciplines on how Business Process Quality Metrics can be developed. In particular, we believe that we can gather additional insights from software engineering, cognitive science, and graph theory and relate them to business process modeling. A further empirical investigation might ultimately lead to establishing a complexity theory of business process models. On the other hand, to demonstrate that these metrics serve their purpose, we plan to carry out several empirical validations by means of controlled experiments.
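
As a small, hedged example of the kind of graph-theoretic measure referred to above: the process model below is invented, and the coefficient of network connectivity is only one of several candidate complexity metrics.

```python
# A process model seen as a directed graph: nodes are tasks/gateways,
# edges are sequence flows (the example model is invented for illustration).
nodes = {"start", "check order", "gateway", "approve", "reject", "end"}
edges = {
    ("start", "check order"),
    ("check order", "gateway"),
    ("gateway", "approve"),
    ("gateway", "reject"),
    ("approve", "end"),
    ("reject", "end"),
}

# Two simple structural metrics borrowed from graph theory:
# size (number of nodes) and the coefficient of network connectivity (edges per node).
size = len(nodes)
cnc = len(edges) / len(nodes)

print(f"size = {size}, coefficient of network connectivity = {cnc:.2f}")
```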