I currently lead the Ultra-scale AIOps Lab. I hold a dual role as Chief Architect and Engineering Manager at HUAWEI CLOUD in Munich, Germany and Dublin, Ireland.

We apply machine learning and deep learning techniques to areas related to HUAWEI CLOUD such as anomaly detection, root cause analysis, failure prediction, reliability and availability, risk estimation and security, network verification, and low-latency object tracking.

Our work fits under the AI Engineering umbrella as discussed in IEEE Software, Nov.-Dec. 2022. You can find more information about our work here:

Our current work involves the development of the next generation of AI-driven IT Operations tools and platforms. This field is generally called AIOps (artificial intelligence for IT operations). In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms can no longer be done manually or simply with off-the-shelf solutions.

It requires self-developed automated systems, ideally exploiting AI to provide tools for autonomous cloud operations. Our work looks into how deep learning, machine learning, distributed traces, graph analysis, time-series analysis (sequence analysis), and log analysis can be used to effectively detect and localize anomalous cloud infrastructure behaviours during operations, reducing the workload of human operators. These techniques are typically applied to Big Data from microservice observability sources.

We create innovative systems for:

  • Service health analysis: resource utilization (e.g., memory leaks), anomaly detection using KPIs and logs
  • Predictive analytics: fault prevention, SW/HW failure prediction
  • Automated recovery: fault localization and recovery
  • Operational risk analysis: CLI command analysis

We are currently developing the iForesight system, which is being used to evaluate this new O&M approach. iForesight 7.0 is the result of 7+ years of R&D with the goal of providing an intelligent new tool aimed at SRE cloud maintenance teams. When cloud services are slow or unresponsive, it uses artificial intelligence to help them quickly detect, localize and predict anomalies. Much of our innovation and system development is done in collaboration with the Technical University of Berlin and the Huawei-TUB Innovation Lab for AI-driven Autonomous Operations.
