Predictive IT Fault Tolerance

Analyzing metric data and logs to predict upcoming infrastructure failures and migrate containers and VMs to safe resources to guarantee 24/7 availability

Starting Point

Observability data collected from the OpenStack monitoring and logging components of a large production cloud infrastructure

Objective

Analyze the incoming data to detect upcoming failures, determine the root cause, and autonomously initiate mitigation operations to keep the required QoS level

Added Value

Reducing the hardware and licensing costs of availability and stability in large IT infrastructures by using commodity, non-dedicated components

From challenges to solutions

Changing Infrastructures

Virtualisation and containers enable data centers to use non-dedicated hardware and achieve higher utilization and flexibility at lower cost.

Impact on Availability

Non-dedicated hardware is much cheaper, but also more prone to errors and outages than specialized, dedicated IT components. Nevertheless, very high availability is not negotiable.

Approach

Errors and failures are likely to happen in infrastructures built on non-dedicated hardware. Our solution is based on quickly detecting upcoming failures and migrating the affected components to avoid QoS degradation or downtime.
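The detect-then-migrate idea can be illustrated with a minimal sketch. It is not the published implementation: the `Host` class, the `at_risk` flag, and the `plan_migrations` function are hypothetical names, and capacity is simplified to a count of workload slots. The sketch only shows how workloads on a host flagged as failing could be assigned to the safe host with the most free capacity.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical host model: capacity is a simple count of workload slots."""
    name: str
    capacity: int
    at_risk: bool = False  # assumed to be set by the failure predictor
    workloads: list = field(default_factory=list)

    @property
    def free(self):
        return self.capacity - len(self.workloads)


def plan_migrations(hosts):
    """Return (workload, source, target) moves that drain every at-risk host."""
    plan = []
    # Free slots on safe hosts, updated as moves are planned.
    free = {h.name: h.free for h in hosts if not h.at_risk}
    for host in hosts:
        if not host.at_risk:
            continue
        for workload in host.workloads:
            candidates = [name for name, slots in free.items() if slots > 0]
            if not candidates:
                break  # no safe capacity left; remaining workloads stay put
            # Prefer the safe host with the most free slots.
            target = max(candidates, key=lambda name: free[name])
            free[target] -= 1
            plan.append((workload, host.name, target))
    return plan


hosts = [
    Host("a", capacity=4, at_risk=True, workloads=["vm1", "vm2"]),
    Host("b", capacity=4, workloads=["vm3"]),
    Host("c", capacity=4),
]
plan = plan_migrations(hosts)
```

In a real deployment the resulting plan would be handed to the platform's own migration mechanism; that step is deliberately left out here.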

Methods

Methods based on supervised machine learning reliably detect errors from a known, given catalog and thus automatically identify the root cause as well. Unsupervised methods need more online time to learn the system, but perform better in the presence of noise.
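To make the unsupervised case concrete, the following is a minimal sketch of an online detector that learns normal behaviour from the metric stream itself. It is an illustrative assumption, not the method used in the project: it tracks an exponentially weighted mean and variance and flags samples that deviate by more than `k` standard deviations. The class name `OnlineAnomalyDetector` and the parameters `alpha`, `k`, and `warmup` are chosen for this example only.

```python
import math


class OnlineAnomalyDetector:
    """Streaming detector: exponentially weighted mean/variance with a k-sigma threshold."""

    def __init__(self, alpha=0.1, k=3.0, warmup=10):
        self.alpha = alpha    # weight of the newest sample
        self.k = k            # threshold in standard deviations
        self.warmup = warmup  # samples to observe before flagging anything
        self.mean = 0.0
        self.var = 0.0
        self.seen = 0

    def update(self, value):
        """Return True if value is anomalous relative to the history seen so far."""
        self.seen += 1
        if self.seen == 1:
            self.mean = value
            return False
        deviation = value - self.mean
        threshold = self.k * math.sqrt(self.var)
        is_anomaly = self.seen > self.warmup and abs(deviation) > threshold
        if not is_anomaly:
            # Learn only from samples considered normal, so that an ongoing
            # fault does not get absorbed into the model of normal behaviour.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return is_anomaly


detector = OnlineAnomalyDetector()
normal_load = [50.0, 52.0] * 15  # e.g. CPU utilisation in percent
normal_flags = [detector.update(v) for v in normal_load]
spike_flag = detector.update(95.0)
```

The warm-up period reflects the point made above: an unsupervised model needs some online time before its notion of "normal" is trustworthy, and `k` is the kind of threshold that has to be tuned per deployment.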

Impact on Availability

The solutions based on supervised and unsupervised methods were deployed and tested in production data centers using only 2% of the available computing power for real-time analysis. More than 95% of the upcoming failures were successfully detected and mitigated.

Access to solution

The developed software is published as open source (see references below). It is integrated with a topology detector and multiple techniques for data collection and analysis. The remaining activities include connecting the data sources and adjusting the thresholds.

Technical deep dive

Dive deep into our work on anomaly detection using metric data

IFTM - Unsupervised anomaly detection for virtualized network function services

Link to Paper

Dive deep into our work on the architecture of the overall system

A system architecture for real-time anomaly detection in large-scale NFV systems

Link to Paper

Download and deploy the open-sourced software on GitHub

Bitflow - Tool for collecting and analyzing time series for IT availability

GitHub repo

Interested in this topic?

Reach out to discuss predictive IT observability and the mitigation of IT errors.
