I currently lead the Ultra-scale AIOps Lab. In this role, I wear several hats (Director, Chief Architect, Principal Engineer, and Tech Lead), balancing strategic technical vision with hands-on technical leadership. I am part of HUAWEI CLOUD and am based in Munich, Germany, and Dublin, Ireland. You can find more information about our work here: AIOps for Cloud Operations (2023).
Our current work involves developing the next generation of AI-driven IT Operations tools and platforms. We apply machine learning and deep learning techniques to various areas related to HUAWEI CLOUD, such as anomaly detection, root cause analysis, failure prediction, reliability and availability, risk estimation and security, network verification, and low-latency object tracking. Our work fits under the AI Engineering umbrella, as discussed in IEEE Software, Nov.-Dec. 2022. This field is generally called AIOps (artificial intelligence for IT operations) or ML for Systems.
In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms can no longer be done manually or with off-the-shelf solutions alone. It requires self-developed automated systems, ideally using AI to provide tools for autonomous cloud operations. Our work looks into how deep learning, machine learning, distributed traces, graph analysis, time-series analysis (sequence analysis), and log analysis can be used to effectively detect and localize anomalous cloud infrastructure behaviours during operations, reducing the workload of human operators. These techniques are typically applied to Big Data from microservice observability sources:
- A Survey of AIOps Methods for Failure Management, ACM TIST, 2021.
We create innovative systems for:
- Service health analysis: resource utilization (e.g., memory leaks), anomaly detection using KPIs and logs
- Predictive analytics: fault prevention, SW/HW failure prediction
- Automated recovery: fault localization and recovery
- Operational risk analysis: CLI command analysis
We are currently developing the iForesight system, which is being used to evaluate this new O&M approach. iForesight 7.0 is the result of 7+ years of development with the goal of providing an intelligent new tool aimed at SRE cloud maintenance teams. When cloud services are slow or unresponsive, it uses artificial intelligence to help these teams quickly detect, localize, and predict anomalies. Much of our innovation and system development is done as part of the Huawei-TUB Innovation Lab for AI-driven Autonomous Operations.
Typically, the systems we develop in Munich are deployed in 93 availability zones across 33 regions in Asia Pacific, Latin America, Africa, Europe, and the Middle East. Monitoring, observability, operational risk analysis, anomaly detection, and predictive maintenance systems support over 220 cloud services from Huawei Cloud. If you are interested, you can look at how the various hyperscale providers (e.g., AWS, Azure, Google, Huawei, Alibaba) compare with respect to the location of their data centers here.
Technologies
Over the years, I have designed and implemented various types of systems, including service systems, workflow systems, and distributed systems. As my expertise in each field grew, I authored a book for each area to solidify my understanding. Currently, I am working on a book about Kubernetes Networking.
Contact
- Jorge Cardoso, PhD
- Huawei Munich Research Center, Germany
- Departamento de Engenharia Informatica, University of Coimbra, Portugal
jcardoso [*.A._.T$] dei | uc | pt