Predictive IT Fault Tolerance

Analyzing metric data and logs to predict upcoming infrastructure failures and migrate containers and VMs to safe resources to guarantee 24/7 availability

Starting Point

Observability data from OpenStack monitoring and logging components, collected from a large production cloud infrastructure

Objective

Analyze the incoming data to detect upcoming failures, determine the root cause, and autonomously initiate mitigation operations to maintain the required QoS level

Added Value

Reducing hardware and licensing cost for availability and stability in large IT infrastructures using commodity, non-dedicated components

From challenges to solutions

Changing Infrastructures

Virtualisation and containers enable data centers to use non-dedicated hardware and achieve higher utilization and flexibility at lower cost.

Impact on Availability

Non-dedicated hardware is much cheaper but also more prone to errors and outages than specialized, dedicated IT components. Nevertheless, very high availability is non-negotiable.

Approach

Errors and failures are likely to happen in infrastructures built from non-dedicated hardware. Our solution is based on quick detection of upcoming failures and migration of the affected components before QoS degrades or downtime occurs.
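The migration step can be illustrated with a small planning function. This is only a sketch under stated assumptions, not the project's actual scheduler: the function name plan_migrations, the input dictionaries, and the "most free slots first" rule are all hypothetical choices made for illustration.

```python
def plan_migrations(placement, failing_hosts, free_slots):
    """Return a list of (workload, source_host, target_host) moves.

    All names and the capacity model are hypothetical:
    placement:     dict workload -> host currently running it
    failing_hosts: set of hosts for which a failure is predicted
    free_slots:    dict host -> number of additional workloads it can take
    """
    # Only hosts that are not predicted to fail may receive workloads.
    slots = {h: n for h, n in free_slots.items() if h not in failing_hosts}
    moves = []
    for workload, host in sorted(placement.items()):
        if host not in failing_hosts:
            continue  # workload already runs on a healthy host
        # Pick the healthy host with the most spare capacity
        # (ties are broken by host name to keep the plan deterministic).
        target = max(slots, key=lambda h: (slots[h], h), default=None)
        if target is None or slots[target] <= 0:
            raise RuntimeError(f"no healthy capacity left for {workload}")
        slots[target] -= 1
        moves.append((workload, host, target))
    return moves
```

A real deployment would hand each planned move to the platform's own live-migration mechanism and would weigh CPU, memory, and network topology rather than a single slot count.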

Methods

Methods based on supervised machine learning reliably detect errors from a known, given catalog and can therefore identify the root cause automatically as well. Unsupervised methods need more online time to learn the system's behavior, but perform better in the presence of noise.
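To make the unsupervised idea concrete, the following is a minimal sketch of an online detector that learns normal behavior from the metric stream itself. It is a generic illustration, not the implementation used in this project: the class name, the exponential-smoothing forecast, and the parameter values alpha, k, and warmup are all assumptions.

```python
import math


class OnlineAnomalyDetector:
    """Flags samples whose forecast error exceeds an adaptive threshold."""

    def __init__(self, alpha=0.3, k=3.0, warmup=10):
        self.alpha = alpha    # smoothing factor of the forecast (assumed value)
        self.k = k            # threshold width in standard deviations (assumed)
        self.warmup = warmup  # errors to observe before raising alarms (assumed)
        self.forecast = None  # one-step-ahead forecast of the metric
        self.n = 0            # number of forecast errors learned so far
        self.mean = 0.0       # running mean of the errors (Welford's method)
        self.m2 = 0.0         # running sum of squared deviations

    def update(self, value):
        """Consume one sample and return True if it looks anomalous."""
        if self.forecast is None:
            self.forecast = value
            return False
        error = abs(value - self.forecast)
        # The threshold is derived only from previously seen errors.
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0
        threshold = self.mean + self.k * std
        is_anomaly = self.n >= self.warmup and error > threshold
        if not is_anomaly:
            # Learn only from normal samples so that outliers do not
            # widen the threshold or drag the forecast along.
            self.n += 1
            delta = error - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (error - self.mean)
            self.forecast = self.alpha * value + (1 - self.alpha) * self.forecast
        return is_anomaly
```

The warm-up period reflects the trade-off described above: the detector needs some online time before its threshold is meaningful, but it requires no labeled error catalog.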

Impact on Availability

The solutions based on supervised and unsupervised methods were deployed and tested in production data centers using only 2% of the available computing power for real-time analysis. More than 95% of the upcoming failures were successfully detected and mitigated.

Access to solution

The developed software is published as open source (see references below). It is integrated with a topology detector and multiple techniques for data collection and analysis. The remaining activities include connecting the data sources and adjusting the thresholds.

Technical deep dive

Dive deep into our work on anomaly detection using metric data

IFTM unsupervised anomaly detection for virtualized network function services

Dive deep into our work on the architecture of the overall system

A system architecture for real-time anomaly detection in large-scale NFV systems

Download and deploy the open-sourced software on GitHub

Bitflow - Tool for collecting and analyzing time series for IT availability

Interested in this topic?

Reach out to discuss predictive IT observability and the mitigation of IT errors.