Home

I am Chief Architect for Intelligent Cloud Operations (aka AIOps) at Huawei Munich Research Center in Munich, Germany and Huawei Ireland Research Center in Dublin, Ireland. I am also an Associate Professor at the University of Coimbra (Portugal), affiliated with the Information Systems Group.

My current research involves the development of the next generation of AI-driven IT Operations tools and platforms. This field is now generally called AIOps (artificial intelligence for IT operations). In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms can no longer be done manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally using AI to provide tools for autonomous cloud operations. My research looks into how deep learning, machine learning, distributed traces, graph analysis, time-series analysis (sequence analysis), and log analysis can be used to effectively detect and localize anomalous cloud infrastructure behaviors during operations, reducing the workload of human operators. These techniques are typically applied to Big Data produced by microservice observability.

My group is currently developing the iForesight system, which is being used to evaluate this new O&M approach. iForesight 3.0 is the result of more than 2 years of research with the goal of providing an intelligent new tool for SRE cloud maintenance teams. When cloud services are slow or unresponsive, it uses artificial intelligence to help these teams quickly detect, localize, and predict anomalies.

The basic research areas we touch and integrate include:

  • Cloud Computing, Cloud Operations and Cloud Monitoring
  • Machine Learning and Deep Learning
  • Distributed Systems Reliability and Availability
  • Anomaly Detection and Root-cause Analysis

They are applied to create new and innovative systems for:

  • AI-driven Cloud Operations
  • Fault prevention, prediction, detection, localization, and recovery
  • Planet-scale monitoring of distributed systems
  • Applied machine learning for predictive software maintenance
  • Natural Language Processing for systems’ behaviour analysis

Previously I also looked into Cloud Computing, BPM, Semantic Web, Web Services, and Enterprise Systems. See Google Scholar, DBLP, and LinkedIn.

Research Topics

  1. Anomaly Detection techniques using AI/ML methods
    • Background. Traditionally, anomaly detection research has aimed to identify individual point anomalies in time series. However, for planet-scale, complex systems such as Huawei Public Cloud, where noise and entropy are a constant, detecting collections of anomalous temporal events is far more relevant.
    • Objectives and Benefits. This project seeks to use and evaluate recent neural network developments from the field of AI and Machine Learning to detect collective, unusual, anomalous, temporal, machine-generated events in Huawei Public Cloud.
  2. Intelligent predictive maintenance of Huawei Public Cloud
    • Background. Predictive maintenance attempts to anticipate failures so that corrective activities can be scheduled in advance, preventing downtime and improving service quality for customers.
    • Objectives and Benefits. This research project seeks to develop new algorithms and approaches based on AI/ML for predicting Huawei Cloud failures by mining billions of events which, while not designed for predicting failures, contain rich monitoring and operational information.
  3. Pattern Mining using a Data Science approach
    • Background. Temporal pattern mining has been used effectively in time series both for anomaly detection and for finding patterns that anticipate anomalies. Patterns capture periodic, burst, sequential, frequent, rare, and correlated events which can be associated with known or unknown symptoms.
    • Objectives and Benefits. This research project seeks to explore new approaches for mining patterns to understand the anomalies and critical events generated by Huawei Public Cloud, a complex, large-scale distributed system. The anticipated results should demonstrate the benefits of the approach in terms of accurately learning event models, which are a cornerstone for developing a new generation of intelligent cloud operations and maintenance systems.
  4. AIOps for the root-cause analysis of planet-scale cloud platforms
    • Background. Traditional root-cause analysis techniques are not well suited to planet-scale microservice applications due to their dynamicity, high noise-to-signal ratio, and large scale.
    • Objectives and Benefits. Use 1) advanced service management data, such as distributed traces and datacenter topology graphs; 2) reasoning constructs, such as correlation and causality; and 3) Machine Learning to identify multi-failure root causes of planet-scale cloud platforms.
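
The distinction drawn in Research Topic 1 between point anomalies and collective anomalies can be illustrated with a minimal Python sketch. It is not part of iForesight and does not use neural networks; the function names, the z-score rule, the window size, and the synthetic latency values are all assumptions chosen only for illustration.

```python
from statistics import mean, stdev


def point_anomalies(series, threshold=3.0):
    """Return indices of values more than `threshold` standard
    deviations away from the mean of the series (simple z-score rule)."""
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold]


def collective_anomalies(series, baseline, window=5, threshold=3.0):
    """Return start indices of windows whose mean deviates from the
    baseline mean by more than `threshold` standard errors.

    A run of mildly unusual values can be flagged as a group even when
    no single value is extreme on its own.
    """
    mu, sigma = mean(baseline), stdev(baseline)
    std_err = sigma / window ** 0.5
    if std_err == 0:
        return []
    flagged = []
    for start in range(len(series) - window + 1):
        window_mean = mean(series[start:start + window])
        if abs(window_mean - mu) / std_err > threshold:
            flagged.append(start)
    return flagged


# Hypothetical latency samples (ms): a stable pattern, then a short run
# of slightly elevated values, then the stable pattern again.
baseline = [100, 102, 98, 101, 99] * 4
series = baseline + [103] * 5 + baseline[:5]

print("point anomalies:", point_anomalies(series))
print("collective anomalies (window starts):",
      collective_anomalies(series, baseline))
```

In this synthetic example no single sample is far enough from the mean to count as a point anomaly, while the windows covering the elevated run stand out against the baseline. A production system would need a far more robust model of normal behaviour than a fixed baseline mean.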

Open Positions

I currently have a few open positions for PhD students, postdocs, or professionals who would like to work with us to improve and extend our system with new ideas for the release of iForesight 3.0.

News

About me

Jorge Cardoso is currently Chief Architect for Intelligent Cloud Operations at Huawei Munich Research Center in Munich, Germany. He is also an Associate Professor at the University of Coimbra (Portugal).

Previously, he worked for several major organizations, including SAP Research (Germany) on the Internet of Services, The Boeing Company in Seattle (USA) on Enterprise Application Integration, and CCG/Zentrum für Graphische Datenverarbeitung on Computer Supported Cooperative Work.

He has authored and co-authored more than 180 scientific publications and has been part of more than 120 program committees and organization bodies (journals and conferences). He is the author/editor of 9 books. He holds 6 US and EU patents on process management and reliability engineering. Google Scholar shows more than 8000 citations for his research work, with an h-index of 43. His latest book, Fundamentals of Service Systems (Springer), compiles results from research in his areas of interest: cloud computing, business process management, the Semantic Web, the Internet of Services, and service engineering.

He has participated in European, German, US, and national research projects financed by the European Commission (FP7, EACEA), the German Ministry for Education and Research (BMBF), SAP Research (SAP), and the Portuguese NSF (FCT). He is a founding member of the IFIP Working Group 12.7 on Social Semantics.

He created and led until 2009 the development of the W3C Unified Service Description Language (USDL).

He has a Ph.D. from the University of Georgia (US, 2002) and an MSc and a BSc in Informatics Engineering from the University of Coimbra (1995 and 1998, Portugal).

Contact

A good researcher says, "Let's find out", others say "Nobody knows". When a good researcher makes a mistake, he says, "I was wrong", others say "It wasn't my fault". A good researcher works harder than others and has more time. Others are always "too busy" to do what is necessary. [Unknown source]