I lead the Ultra-scale AIOps Lab. In this role, I wear several hats, including Director, Chief Architect, Engineering Manager, and Team Lead, balancing strategic technical vision with hands-on technical leadership. I am part of HUAWEI CLOUD and based in Munich, Germany and Dublin, Ireland. You can find more information about our work here: AIOps for Cloud Operations (2023).

Our current work involves developing the next generation of AI-driven IT Operations tools and platforms. We apply machine learning and deep learning techniques to various areas related to HUAWEI CLOUD, such as anomaly detection, root cause analysis, failure prediction, reliability and availability, risk estimation and security, network verification, and low-latency object tracking. Our work fits under the AI Engineering umbrella as discussed in IEEE Software, Nov.-Dec. 2022. This field is generally called AIOps (artificial intelligence for IT operations) or ML for Systems.

In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms can no longer be done manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally using AI to provide tools for autonomous cloud operations. Our work looks into how deep learning, machine learning, distributed traces, graph analysis, time-series analysis (sequence analysis), and log analysis can be used to effectively detect and localize anomalous cloud infrastructure behaviours during operations, reducing the workload of human operators. These techniques are typically applied to Big Data coming from microservice observability.

We create innovative systems for:

  • Service health analysis: resource utilization (e.g., memory leaks), anomaly detection using KPIs and logs
  • Predictive analytics: fault prevention, SW/HW failure prediction
  • Automated recovery: fault localization and recovery
  • Operational risk analysis: CLI command analysis

We are currently developing the iForesight system, which is being used to evaluate this new O&M approach. iForesight 7.0 is the result of 7+ years of development with the goal of providing an intelligent new tool aimed at SRE cloud maintenance teams. When cloud services are slow or unresponsive, it uses artificial intelligence to help these teams quickly detect, localize, and predict anomalies. Much of our innovation and system development is done as part of the Huawei-TUB Innovation Lab for AI-driven Autonomous Operations.

Contact