Software Systems

ML for Systems

For hyperscalers such as Huawei Cloud, the Operation and Maintenance (O&M) of cloud infrastructures and platforms can no longer be done manually or with off-the-shelf solutions alone. It requires self-developed automated systems, ideally using AI to provide tools for autonomous cloud operations.

Huawei Cloud (HC) has a microservices architecture composed of hundreds of services, distributed over thousands of hosts in many geographical regions and operating with an availability higher than five nines. It is one of the largest and fastest-growing platforms in the world, with over 40 availability zones located across 23 geographical regions, including Germany, France, South/Central America, Hong Kong, Russia, Thailand, and South Africa.

The mission of the AIOps / Reliability Team (based in Munich, Germany) was to develop new systems and tools that analyze observability data from Huawei Cloud to detect problems that impact customers, identify the root cause, and fix failures following the 1/5/10 rule (detection: 1 min, RCA: 5 min, recovery: 10 min). We built tools for anomaly detection, root-cause analysis, performance analysis, predictive maintenance, security operations, and operations automation.
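As a minimal sketch of how the 1/5/10 rule can be checked against an incident timeline, the snippet below compares three timestamps with the budgets stated above. The `Incident` class, its field names, and the assumption that every budget is measured from the moment the failure began are illustrative only and are not taken from any Huawei Cloud tool.

```python
from dataclasses import dataclass

# Budgets of the 1/5/10 rule in seconds (1 min, 5 min, 10 min).
BUDGETS = {"detection": 60, "rca": 300, "recovery": 600}


@dataclass
class Incident:
    # Hypothetical fields: seconds elapsed since the failure began.
    detected_at: float
    root_cause_at: float
    recovered_at: float


def check_1_5_10(incident: Incident) -> dict:
    """Return, per phase, whether the incident stayed within its budget.

    Assumption: each budget is counted from the start of the failure,
    not from the end of the previous phase.
    """
    elapsed = {
        "detection": incident.detected_at,
        "rca": incident.root_cause_at,
        "recovery": incident.recovered_at,
    }
    return {phase: elapsed[phase] <= limit for phase, limit in BUDGETS.items()}
```

For example, an incident detected after 45 s, diagnosed after 280 s, and recovered after 700 s meets the detection and RCA budgets but misses the recovery budget.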

From 2015 to 2024, we used AI techniques from the fields of Data Science, Machine Learning, and Deep Learning, including statistical learning, time-series analysis, deep learning, big data, streaming, and data visualization. These techniques enabled us to develop new production-ready services for troubleshooting Huawei Cloud and to detect issues that were previously undetectable.

AI for Operations

We developed several cutting-edge tools and solutions focused on failure prediction, failure prevention, and anomaly detection to enhance the operation of cloud infrastructures. By leveraging advanced machine learning algorithms and data analytics, we enabled HUAWEI CLOUD operators to anticipate potential issues, optimize system performance, and ensure the reliability and resilience of the cloud infrastructure.

Failure Prevention: We enhanced the global, decentralized, and scalable HUAWEI CLOUD Cloud Log Service to collect, analyze, and manage petabytes of logs and event data generated by the cloud infrastructure and on-premises systems.

Failure Prediction: We developed new systems for HUAWEI CLOUD datacenters to predict the failure of HDDs, SSDs, RAM, and optical network transceivers using Machine Learning.

Anomaly Detection: We built a distributed Cloud Trace Service for HUAWEI CLOUD to follow and profile the execution of public cloud services’ requests as they travel across multiple infrastructure services, components, middleware, and systems in public and private clouds.
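To illustrate what profiling a request across services can look like, here is a minimal sketch that takes the spans of one trace and computes how much time each service spent in its own work, excluding time spent waiting on child calls. The `Span` structure and its field names are assumptions made for this example and do not describe the Cloud Trace Service data model; the sketch also assumes that the children of a span run one after another without overlapping.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical span record; field names are illustrative.
    span_id: str
    parent_id: Optional[str]  # None for the root span of the trace
    service: str
    start_ms: int
    end_ms: int


def self_time_by_service(spans):
    """Sum, per service, span duration minus the duration of direct children."""
    child_time = defaultdict(int)
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] += span.end_ms - span.start_ms

    totals = defaultdict(int)
    for span in spans:
        duration = span.end_ms - span.start_ms
        totals[span.service] += duration - child_time[span.span_id]
    return dict(totals)
```

A service whose self time dominates the trace is a natural first candidate when looking for the source of a slow request.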

Observability

We used machine learning to enhance the HUAWEI CLOUD Cloud Monitoring Service, which monitors and manages the performance, health, and security of global cloud infrastructures.

Cloud Reliability

From 2015 to 2020, we worked on improving the reliability and resilience of Huawei Cloud (HC) and Open Telekom Cloud (OTC), since in its early days HC had a strong dependence on OpenStack.

We developed several new tools and systems based on:

Service Systems

Our contributions to service systems focused on four fields: service description languages (with the USDL family), service system modeling (with the LSS-USDL language), service analytics (using process mining), and service networks (using principles from social networks).

  • Service Analytics. We analyse large logs from IT service provisioning (e.g., application logs, transactions, ITIL) to find behaviour patterns.

  • Service Descriptions. We developed the Linked USDL language (Unified Service Description Language) to describe services using computer-understandable specifications, formal ontologies (RDFS), and AI for inference.

  • Service Systems. We developed the Linked Service System model for the Unified Service Description Language (LSS-USDL) using lightweight semantic models to capture service systems.
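As a minimal sketch of the kind of log analysis mentioned under Service Analytics, the snippet below builds a directly-follows count, a basic building block of process mining, from an event log. The tuple layout `(case_id, timestamp, activity)` and the activity names in the example are assumptions for illustration, not the format of any log we analysed.

```python
from collections import Counter, defaultdict


def directly_follows(events):
    """Count how often activity A is immediately followed by activity B.

    events: iterable of (case_id, timestamp, activity) tuples (hypothetical layout).
    Returns a Counter keyed by (A, B) pairs.
    """
    # Group activities by case, ordered by timestamp within each case.
    traces = defaultdict(list)
    for case_id, _timestamp, activity in sorted(events, key=lambda e: (e[0], e[1])):
        traces[case_id].append(activity)

    counts = Counter()
    for activities in traces.values():
        # Pair each activity with its successor in the same case.
        counts.update(zip(activities, activities[1:]))
    return counts
```

Frequent pairs outline the common behaviour of a service process, while rare pairs point to deviations worth inspecting.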

We also explored the concept of service networks. The observation that the power of service-based economies is no longer restricted to individual organizations, but spans across networks, was the main driver for conducting service network research.

See GitHub LSS-USDL, GitHub Linked-USDL

Semantic DNS

Enterprises have the need to communicate. In business to business applications, usually XML is used to automatically exchange information. But sometimes more semantics is needed. Enterprises also need to share concepts, terms, definitions and relationships (between concepts) relevant to their business activities.

We are developing the Semantic Domain System (SDS), a system that follows concepts similar to those of the DNS. The Domain Name System (DNS) is a service that stores the relationships between domain names and IP addresses. When you ask your browser, email client, FTP client, or any other application to reach a specific domain, it automatically queries a DNS server and finds the IP address of the machine that offers the required service. This makes it possible to use names instead of IP addresses. Each company is responsible for maintaining the mappings for its own domains.

In SDS, as in DNS, each company is responsible for managing its own concepts and can browse other companies’ concept definitions. The system will allow a clear representation of concepts and the relationships between them.
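The ownership model described above can be sketched as a small in-memory registry. This is only an illustration of the idea, not the SDS design: the `ConceptRegistry` class, its methods, and the DNS-like naming scheme `concept.company-domain` are all hypothetical.

```python
class ConceptRegistry:
    """Toy registry: each company owns a zone of concept definitions."""

    def __init__(self):
        self._zones = {}  # company domain -> {concept name: definition}

    def define(self, actor, company, concept, definition):
        """Add or update a concept; only the owning company may do so."""
        if actor != company:
            raise PermissionError(f"{actor} cannot modify concepts of {company}")
        self._zones.setdefault(company, {})[concept] = definition

    def resolve(self, qualified_name):
        """Look up 'concept.company-domain'; any company may browse."""
        concept, _, company = qualified_name.partition(".")
        return self._zones[company][concept]
```

For example, after `acme.example` defines `invoice`, any other company can resolve `invoice.acme.example`, but only `acme.example` can change that definition.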

Enterprise Integration

Semantic B2B Integration. B2B integration, also known as external IS integration and e-business integration, has promised to automate and integrate business processes and interactions between companies by considerably renovating the way business was conducted with partners, suppliers, and customers. B2B integration is fundamentally about data and information exchange among businesses and their information systems. The ability to interact and exchange information both internally and with external organizations (partners, suppliers, customers) is a fundamental issue in the enterprise sector.

One simple solution that organizations have adopted to reach a higher level of integration relies on the use of XML as the language to represent data. XML has become a de facto standard for B2B because of its simplicity, extensibility, and ease of processing. Today it is estimated that most organizations use XML to store and transfer data. This is the reason why we created B2BISS (Business-to-Business Integration using Syntactic-to-Semantic Mapping).

Model Transformation. Today’s enterprises face a critical need to integrate disparate information spread over several data sources inside and even outside the organization. Semantic web technologies, such as ontologies, play an important role in the semantic integration of data. The purpose of JXMLOWL is to provide a framework that assists the semantic data integration process. The framework supports the mapping and transformation of instances from syntactic data sources in XML format to a common global model defined by an ontology, using semantic web technologies such as OWL.
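The general idea of a syntactic-to-semantic mapping can be sketched with the Python standard library alone. The example below is not the JXMLOWL API: the `MAPPING` table, the `ex:` prefix, and the XML vocabulary are invented for illustration. It maps XML elements to ontology classes and their child elements to properties, and emits the resulting instances as subject-predicate-object triples.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from XML tags to ontology terms.
MAPPING = {
    "class": {"order": "ex:PurchaseOrder"},
    "property": {"buyer": "ex:hasBuyer", "total": "ex:hasTotal"},
}


def xml_to_triples(xml_text, mapping=MAPPING):
    """Transform mapped XML elements into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    triples = []
    for elem in root.iter():
        ontology_class = mapping["class"].get(elem.tag)
        if ontology_class is None:
            continue  # element has no class mapping, skip it
        subject = f"ex:{elem.tag}_{elem.get('id')}"
        triples.append((subject, "rdf:type", ontology_class))
        for child in elem:
            prop = mapping["property"].get(child.tag)
            if prop is not None:  # unmapped child elements are ignored
                triples.append((subject, prop, (child.text or "").strip()))
    return triples
```

Keeping the mapping in a separate table means the same transformation code can serve different XML sources that target the same global ontology.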

Process Analytics

Our intentions are twofold. On the one hand, we think it is fundamental to survey findings from neighboring disciplines on how Business Process Quality Metrics can be developed. In particular, we believe that we can gather additional insights from software engineering, cognitive science, and graph theory and relate them to business process modeling. A further empirical investigation might ultimately lead to establishing a complexity theory of business process models. On the other hand, to demonstrate that these metrics serve their purpose, we plan to carry out several empirical validations by means of controlled experiments.
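To make the link to software engineering and graph theory concrete, the sketch below computes two simple graph-based measures on a process model given as a list of directed arcs. These are candidate metrics chosen for illustration (McCabe's cyclomatic number and the arcs-per-node ratio), not necessarily the ones we will validate; the function names are hypothetical, and the cyclomatic formula assumes the model forms a single connected graph.

```python
def cyclomatic_number(arcs):
    """McCabe's cyclomatic number E - N + 2 for a single connected graph."""
    nodes = {node for arc in arcs for node in arc}
    return len(arcs) - len(nodes) + 2


def connectivity_coefficient(arcs):
    """Ratio of arcs to nodes; higher values suggest a denser model."""
    nodes = {node for arc in arcs for node in arc}
    return len(arcs) / len(nodes)
```

For a model with one split and one join, such as start → A, A → B, A → C, B → end, C → end, the cyclomatic number is 2 and the connectivity coefficient is 1.0.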