Observability data from OpenStack monitoring and logging components, collected from a large production cloud infrastructure
Analyze the incoming data to detect upcoming failures, determine the root cause, and autonomously initiate mitigation operations to maintain the required QoS level
Reducing the hardware and licensing costs of availability and stability in large IT infrastructures by using commodity, non-dedicated components
Virtualisation and containers enable data centers to use non-dedicated hardware and achieve higher utilization and flexibility at lower cost.
Non-dedicated hardware is much cheaper, but also more prone to errors and outages than specialized, dedicated IT components. Nevertheless, very high availability is non-negotiable.
Errors and failures are likely to happen in infrastructures built from non-dedicated hardware. Our solution is based on quick detection of upcoming failures and migration of the affected components to avoid QoS degradation or downtime.
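As an illustration of detecting an upcoming failure early enough to migrate, the following minimal sketch fits a straight line to recent metric samples and estimates when a threshold would be reached. It is not the published software's method; the function names, the linear-trend assumption, and the example metric values are all hypothetical.

```python
from statistics import mean


def seconds_until_breach(samples, threshold):
    """Estimate seconds from the last sample until a fitted line reaches threshold.

    samples: list of (timestamp_seconds, value) pairs (hypothetical input format).
    Returns None when there is no upward trend to extrapolate.
    """
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean, v_mean = mean(ts), mean(vs)
    var_t = sum((t - t_mean) ** 2 for t in ts)
    if var_t == 0:
        return None  # all samples share one timestamp, no slope can be fitted
    # Ordinary least-squares slope and intercept.
    slope = sum((t - t_mean) * (v - v_mean) for t, v in samples) / var_t
    if slope <= 0:
        return None  # metric is flat or improving
    intercept = v_mean - slope * t_mean
    t_cross = (threshold - intercept) / slope
    return max(0.0, t_cross - ts[-1])


def should_migrate(samples, threshold, horizon_seconds):
    """Assumed policy: migrate if the predicted breach falls within the horizon."""
    eta = seconds_until_breach(samples, threshold)
    return eta is not None and eta <= horizon_seconds


if __name__ == "__main__":
    # Hypothetical disk-usage percentages sampled once per minute.
    usage = [(0, 50), (60, 60), (120, 70)]
    print(seconds_until_breach(usage, threshold=90))
    print(should_migrate(usage, threshold=90, horizon_seconds=300))
```

A real deployment would replace the linear fit with whatever predictor suits the metric; the point is only that mitigation is triggered by the forecast, before the threshold is actually crossed.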
Methods based on supervised machine learning reliably detect errors from a known, given catalog and thus automatically identify the root cause as well. Unsupervised methods need more online time to learn the system, but perform better in the presence of noise.
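To make the unsupervised case concrete, here is a minimal sketch of an online detector that learns a baseline from the stream itself and flags values that deviate strongly from it. A rolling z-score is used purely as a stand-in; the class name, window size, warm-up length, and threshold are assumptions and do not describe the deployed models.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flags values far from the mean of a sliding window of recent normal values."""

    def __init__(self, window=30, z_threshold=3.0, warmup=10):
        self.values = deque(maxlen=window)  # recent values considered normal
        self.z_threshold = z_threshold
        self.warmup = warmup  # samples needed before any value is judged

    def observe(self, value):
        """Return True if value is anomalous relative to the learned baseline."""
        anomalous = False
        if len(self.values) >= self.warmup:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.z_threshold
            else:
                anomalous = value != mu  # constant baseline: any change stands out
        if not anomalous:
            # Only normal values update the baseline, so outliers do not skew it.
            self.values.append(value)
        return anomalous


if __name__ == "__main__":
    detector = RollingZScoreDetector(window=30, z_threshold=3.0, warmup=5)
    for latency_ms in [10, 11, 10, 9, 10, 11, 10, 50]:
        print(latency_ms, detector.observe(latency_ms))
```

The warm-up period mirrors the online learning time mentioned above: until enough samples have been seen, the detector reports nothing rather than guessing.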
The solutions based on supervised and unsupervised methods were deployed and tested in production data centers using only 2% of the available computing power for real-time analysis. More than 95% of the upcoming failures were successfully detected and mitigated.
The developed software is published as open source (see references below). It is integrated with a topology detector and multiple techniques for data collection and analysis. The remaining activities include connecting the data sources and adjusting the thresholds.
Reach out to discuss predictive IT observability and the mitigation of IT errors.