Predictive Observability for DIMMs and SSDs

We deliver early warnings so that system administrators can migrate data and running system components away from soon-to-fail SSDs or computing nodes.

Starting Point

Telemetry data (SMART format) from around 1,500 SSD and DIMM devices running in a large-scale cloud data center.

Objective

Predict individual device failures ahead of time in order to improve the overall availability of the cloud services.

Added Value

Enabled predictive maintenance of SSD and DIMM devices by predicting failures 2 to 12 hours before they occur.
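One way to picture the 2 to 12 hour lead time is as a labeling rule for training samples. The sketch below is an assumption for illustration, not the project's documented procedure: the names `LEAD_MIN_H`, `LEAD_MAX_H`, and `label_sample` are hypothetical, and times are simplified to whole hours.

```python
# Hypothetical lead-time window, taken from the 2 to 12 hour range stated above.
LEAD_MIN_H = 2
LEAD_MAX_H = 12

def label_sample(sample_hour, failure_hour):
    """Return 1 if the device fails 2 to 12 hours after the sample, else 0."""
    if failure_hour is None:  # device never failed during observation
        return 0
    lead = failure_hour - sample_hour
    return 1 if LEAD_MIN_H <= lead <= LEAD_MAX_H else 0

# Toy example: a device that fails at hour 24, sampled at four points in time.
labels = [label_sample(t, 24) for t in (5, 12, 20, 23)]
```

Samples taken too early (more than 12 hours ahead) or too late (less than 2 hours ahead) are treated as negatives here; a real pipeline might instead drop them from training.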

From challenges to solutions

Frequent outages

Modern data centers run thousands of SSDs and computing nodes, leading on average to 3 DIMM failures and/or 1 SSD outage per day.

System Crashes

DIMM and SSD outages are among the most frequent causes of system crashes. They are hard to handle autonomously because suitable data and analytics methods are lacking.

Methods

The solution builds on a training data set from a major cloud provider, combined with the machine learning methods XGBoost and Random Forest and with survival analysis methods.
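To illustrate what a survival analysis method computes, here is a minimal sketch of the Kaplan-Meier estimator in plain Python. It is a generic textbook technique, not necessarily the one used in this project; the function name `kaplan_meier` and the toy data are assumptions made for the example.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct failure time.

    durations: hours each device was observed
    observed:  True if the device failed at that time, False if censored
    """
    failures = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # failed and censored devices leaving the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk devices that survive time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical toy data: six devices, two of them censored (still healthy).
durations = [5, 8, 8, 12, 15, 20]
observed = [True, True, False, True, False, True]
curve = kaplan_meier(durations, observed)
```

The resulting curve estimates the probability that a device is still healthy after a given number of hours, while correctly accounting for devices that had not failed by the end of observation.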

Approach

We developed advanced algorithms for calculating general features as well as features at the device, bank, and bit level, 270 features in total. We aggregated the data, applied smart data sampling, and trained models on the resulting tabular data.
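As a rough sketch of how raw error events could be aggregated into tabular features at device, bank, and bit level: the event layout, the feature names, and the function `aggregate_features` below are all hypothetical and stand in for the much larger feature set described above.

```python
from collections import defaultdict

# Hypothetical correctable-error (CE) events: (device_id, bank, bit, hour)
events = [
    ("dimm-0", 0, 3, 1),
    ("dimm-0", 0, 3, 2),
    ("dimm-0", 1, 7, 2),
    ("dimm-1", 2, 5, 4),
]

def aggregate_features(events):
    """Build one tabular feature row per device from raw error events."""
    per_device = defaultdict(list)
    for device_id, bank, bit, hour in events:
        per_device[device_id].append((bank, bit, hour))

    rows = {}
    for device_id, recs in per_device.items():
        bank_counts = defaultdict(int)
        for bank, _, _ in recs:
            bank_counts[bank] += 1
        hours = [hour for _, _, hour in recs]
        rows[device_id] = {
            "ce_count": len(recs),                            # device level
            "banks_affected": len(bank_counts),               # bank level
            "max_ce_per_bank": max(bank_counts.values()),     # bank level
            "distinct_bits": len({(b, i) for b, i, _ in recs}),  # bit level
            "ce_time_span_hours": max(hours) - min(hours),    # general
        }
    return rows

features = aggregate_features(events)
```

Each resulting row is a flat dictionary per device, which is the kind of tabular input that tree-based models such as XGBoost or Random Forest expect.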

Impact on SSD Availability

A single model classifies raw log lines as normal or abnormal and detects abnormal log lines in a real SSD dataset with a success rate of more than 99.5%.

Impact on DIMM Failure Prediction

Evaluated on production data from a cloud provider, the solution improves DIMM failure prediction by 10% to 20% compared to currently applied solutions.

Technical deep dive

Dive deep into our work on SSD failure prediction

LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak Supervision

Dive deep into our work on DIMM failure prediction

First CE Matters: On the Importance of Long Term Properties on Memory Failure Prediction

Interested in this topic?

Reach out to discuss the predictive observability of SSD and DIMM devices with us.