I am Chief Architect for Intelligent Cloud Operations (aka AIOps) at Huawei Munich Research Center in Munich, Germany and Huawei Ireland Research Center in Dublin, Ireland. I am also Associate Professor at the University of Coimbra (Portugal), where I am affiliated with the Information Systems Group.
My current research involves the development of the next generation of AI-driven IT Operations tools and platforms. This field is nowadays generally called AIOps (artificial intelligence for IT operations). In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms can no longer be done manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally exploiting AI to provide tools for autonomous cloud operations. My research looks into how deep learning, machine learning, distributed traces, graph analysis, time-series analysis (sequence analysis), and log analysis can be used to effectively detect and localize anomalous cloud infrastructure behaviors during operations to reduce the workload of human operators. These techniques are typically applied to Big Data from microservice observability.
My group is currently developing the iForesight system, which is being used to evaluate this new O&M approach. iForesight 3.0 is the result of more than 2 years of research with the goal of providing an intelligent new tool for SRE cloud maintenance teams. It uses artificial intelligence to help them quickly detect, localize, and predict anomalies when cloud services are slow or unresponsive.
The basic research areas we touch and integrate include:
- Cloud Computing, Cloud Operations and Cloud Monitoring
- Machine Learning and Deep Learning.
- Distributed Systems Reliability and Availability.
- Anomaly Detection and Root-cause Analysis.
They are applied to create new and innovative systems for:
- AI-driven Cloud Operations
- Fault prevention, prediction, detection, localization, and recovery.
- Planet-scale monitoring of distributed systems
- Applied machine learning for predictive software maintenance
- Natural Language Processing for systems’ behaviour analysis.
- Anomaly Detection techniques using AI/ML methods
- Background. Traditionally, anomaly detection research aimed to identify individual point anomalies in time series. Nonetheless, for planet-scale, complex systems such as Huawei Public Cloud, where noise and entropy are a constant, detecting collections of anomalous temporal events is far more relevant.
- Objectives and Benefits. This project seeks to use and evaluate recent neural network developments from the field of AI and Machine Learning to detect collective, unusual, anomalous, temporal, machine-generated events in Huawei Public Cloud.
- Intelligent predictive maintenance of Huawei Public Cloud
- Background. Predictive maintenance attempts to anticipate failures so that corrective activities can be scheduled in advance, preventing downtime and improving service quality for customers.
- Objectives and Benefits. This research project seeks to develop new algorithms and approaches based on AI/ML for predicting Huawei Cloud failures by mining billions of events which, while not designed for predicting failures, contain rich monitoring and operational information.
- Pattern Mining using a Data Science approach
- Background. Temporal pattern mining has been used effectively for anomaly detection in time series and for finding patterns that anticipate anomalies. Patterns capture periodic, burst, sequential, frequent, rare, and correlated events which can be associated with known or unknown symptoms.
- Objectives and Benefits. This research project seeks to explore new approaches for mining patterns to understand the anomalies and critical events generated by Huawei Public Cloud, a complex, large-scale distributed system. The anticipated results should demonstrate the benefits of the approach in terms of accurately learning event models, which are a cornerstone for developing a new generation of intelligent cloud operations and maintenance systems.
- AIOps for the root-cause analysis of planet-scale cloud platforms
- Background. Traditional root-cause analysis techniques are not suited to planet-scale microservice applications due to their dynamicity, high noise-to-signal ratio, and large scale.
- Objectives and Benefits. Use 1) advanced service management data, such as distributed traces and datacenter topology graphs; 2) reasoning constructs, such as correlation and causality; and 3) Machine Learning to identify multi-failure root causes of planet-scale cloud platforms.
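The distinction drawn in the anomaly detection project above, between individual point anomalies and collections of anomalous events, can be illustrated with a small sketch. This is not the neural-network approach the project investigates, nor part of iForesight. It is a plain statistical stand-in that assumes a clean baseline sample is available and that its values are roughly independent. The function names, the latency values, and the thresholds are all hypothetical.

```python
from statistics import mean, stdev


def point_anomalies(series, baseline, threshold=3.0):
    """Indices of single values more than `threshold` baseline std devs from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold]


def collective_anomalies(series, baseline, window=5, threshold=3.0):
    """Start indices of windows whose mean is more than `threshold` standard errors from the baseline mean.

    Assumption: baseline samples are independent, so the standard error of a
    window mean is sigma / sqrt(window).
    """
    mu, sigma = mean(baseline), stdev(baseline)
    stderr = sigma / window ** 0.5
    if stderr == 0:
        return []
    hits = []
    for start in range(len(series) - window + 1):
        chunk = series[start:start + window]
        if abs(mean(chunk) - mu) / stderr > threshold:
            hits.append(start)
    return hits


# Hypothetical response times (ms): normal noise around 100 ms.
baseline = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
# The same normal period, a mild but sustained shift, then recovery.
series = baseline + [103, 104, 103, 104, 103] + [100, 99, 101]

print(point_anomalies(series, baseline))       # [] : no single value stands out
print(collective_anomalies(series, baseline))  # [9, 10, 11] : windows covering the shift
```

No individual measurement in the shifted period is unusual on its own, yet the group of measurements is. This is the kind of case where a point-wise detector stays silent and a collective view is needed.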
I currently have a few open positions for PhD students, postdocs, and professionals who would like to work with us to improve our system and extend it with new ideas for the release of iForesight 3.0.
- Permanent position (Munich or Dublin): AI / Machine Learning
- Permanent position (Munich or Dublin): SRE / AIOps Engineer – Planet-scale Clouds
- Permanent position (Munich or Dublin): Openstack SRE Engineer – Planet-scale Clouds
- PhD Position/Postdoc: AI-Driven Cloud Operations
- Permanent position: Cloud Reliability Engineer
- Permanent position: Junior/Senior Researcher Large-scale Distributed Systems
- Our work on Self-Attentive Classification-Based Anomaly Detection in Unstructured Logs was accepted at the ICDM 2020 conference (Conference Rank: A+) (thanks to Sasho Nedelkoski, Jasmin Bogatinovski, Alexander Acker, and Odej Kao).
- Our work on Self-Supervised Log Parsing was accepted at the ECML PKDD 2020 conference (Conference Rank: A) (thanks to Sasho Nedelkoski, Jasmin Bogatinovski, Alexander Acker, and Odej Kao).
- Our AIOps article titled Multi-source Distributed System Data for AI-Powered Analytics was accepted to Service-Oriented and Cloud Computing (ESOCC 2020), 28-30 September, 2020, Crete.
- My Lecture on AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning presented at Prof. Joeran Beel’s Chair (Intelligent Systems, Trinity College Dublin) on 02.10.2019 is now available.
- Our article Towards Occupation Inference in Non-instrumented Services was accepted to IEEE Network Computing and Applications. Boston, MA, USA, September 2019.
- Andre Pascoal Bento successfully defended his thesis Observing and Controlling Performance in Microservices.
- Our article Anomaly Detection from System Tracing Data using Multimodal Deep Learning was accepted to IEEE Cloud 2019, July 3-8, 2019, Milan, Italy. (Acceptance Rate: 21%)
- Our article Assessing Software Development Teams Efficiency using Process Mining was accepted to International Conference on Process Mining, June 24-26, 2019, Aachen, Germany
- Our article Anomaly Detection and Classification using Distributed Tracing and Deep Learning was accepted to CCGrid 2019, 14-17.05, 2019, Cyprus. (Conference Rank: A)
- Our article On Black-Box Monitoring Techniques for Multi-Component Services was accepted to 17th IEEE International Symposium on Network Computing and Applications (NCA), 1-3.10, 2018, Cambridge, US. (Conference Rank: A)
- Our article Efficient Failure Diagnosis of OpenStack using Tempest was accepted for publication at IEEE Internet Computing (Impact Factor 2018: 1.923).
- This year we are part of the Program Committee of SREcon 2019, 2–4 October, 2019, Dublin, Ireland.
- Jorge Cardoso, Mastering AIOps with Deep Learning, presentation at SREcon18, 29–31 August 2018, Dusseldorf, Germany.
- Georgia Kapitsaki, Josef Ioannou, Jorge Cardoso, Carlos Pedrinaci, “Linked USDL Privacy: Describing Privacy Policies for Service”, was published at the IEEE Inter. Conf. on Web Services (ICWS) (Conference Rank: A), 2-7 July 2018, San Francisco, USA, 2018.
- International Industry-Academia Workshop on Cloud Reliability and Resilience, 7-8 November 2016, Berlin, Germany.
- José María García, Pablo Fernández, Carlos Pedrinaci, Manuel Resinas, Jorge Cardoso, Antonio Ruiz-Cortés, “Modeling Service Level Agreements with Linked USDL Agreement”, IEEE Transactions on Services Computing (Impact Factor 2016: 3.049), pp. 52-65, Volume: 10, Issue: 1, Jan.-Feb. 1 2017.
- José María García, Carlos Pedrinaci, Manuel Resinas, Jorge Cardoso, Pablo Fernández, Antonio Ruiz-Cortés. Linked USDL Agreement: Effectively Sharing Semantic Service Level Agreements on the Web, The IEEE International Conference on Web Services (ICWS), June 27 - July 2, 2015, New York, USA. (Acceptance Rate: 17.4%)
- Jorge Cardoso and Carlos Pedrinaci, Evolution and Overview of Linked USDL. 6th International Conference Exploring Services Science, IESS 2015, Porto, Portugal, February 4-6, 2015, LNBIP, Vol. 201, Novoa, Henriqueta, Dragoicea, Monica (Eds.), 2015.
- Cardoso, J., R Mans, PR da Cunha, W van der Aalst, H Berthold, A framework for next generation e-health systems and services Proc. Amer. Conf. Inf. Syst. (AMCIS), pp. 1-11. 2015. (Conference Rank: A)
- Pedrinaci, C.; Cardoso, J. and Leidig, T. Linked USDL: A Vocabulary for Web-scale Service Trading. In 11th Extended Semantic Web Conference (ESWC), Crete, Greece, 2014. (Acceptance Rate: 25%)
- Cardoso, J.; Binz, T.; Breitenbucher, Uwe; Kopp, O. and Leymann, F. Cloud Computing Automation: Integrating USDL and TOSCA. In 25th Conference on Advanced Information Systems Engineering (CAiSE 2013), pages 1-16, Springer, LNCS, Vol. 7908, 2013. (Conference Rank: A; Acceptance Rate: 16.6%)
- Francesco Guerra (Chair) and Jorge Cardoso (Vice-Chair). COST Action IC1302: semantic KEYword-based Search on sTructured data sOurcEs, 2013-2017.
Jorge Cardoso is currently Chief Architect for Intelligent Cloud Operations at Huawei Munich Research Center in Munich, Germany. He is also Associate Professor at the University of Coimbra (Portugal).
Previously, he worked for several major companies such as SAP Research (Germany) on the Internet of Services, The Boeing Company in Seattle (USA) on Enterprise Application Integration and CCG/Zentrum fur Graphische Datenverarbeitung on Computer Supported Cooperative Work.
He has authored and co-authored more than 180 scientific publications and has been part of more than 120 program committees and organization bodies (journals and conferences). He is the author/editor of 9 books. He holds 6 US and EU patents on process management and reliability engineering. Google Scholar shows more than 8000 citations for his research work with an h-index of 43. His latest book, titled Fundamentals of Service Systems from Springer, compiles results from the research work in his areas of interest: cloud computing, business process management, semantic Web, the Internet of Services, and service engineering.
He participated in European, German, US, and National research projects financed by the European Commission (FP7, EACEA), the German Ministry for Education and Research (BMBF), SAP Research (SAP) and Portuguese NSF (FCT). He is a founding member of the IFIP Working Group 12.7 on Social Semantics.
He created and led until 2009 the development of the W3C Unified Service Description Language (USDL).
- Prof. Jorge Cardoso
- Huawei Munich Research Center, Germany
- Departamento de Engenharia Informatica, University of Coimbra, Portugal
jcardoso [*.A._.T$] dei | uc | pt
A good researcher says, "Let's find out", others say "Nobody knows". When a good researcher makes a mistake, he says, "I was wrong", others say "It wasn't my fault". A good researcher works harder than others and has more time. Others are always "too busy" to do what is necessary. [Unknown source]