Telemetry data (SMART format) from around 1,500 SSD and DIMM devices running in a large-scale cloud data center.
Predict individual device failures ahead of time in order to improve the overall availability of the cloud services.
Enabled predictive maintenance of SSD and DIMM devices by predicting failures 2 to 12 hours before they occur.
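One common way to frame such a prediction horizon is to label each telemetry sample as positive when the device fails between 2 and 12 hours later. The sketch below illustrates that labeling only; the function name, timestamps, and the exact window handling are assumptions for illustration, not the method used in the project.

```python
from datetime import datetime, timedelta

# Prediction horizon taken from the text: 2 to 12 hours before failure.
MIN_LEAD = timedelta(hours=2)
MAX_LEAD = timedelta(hours=12)


def label_sample(sample_time, failure_time):
    """Return 1 if the device fails 2 to 12 hours after sample_time, else 0.

    failure_time is None for devices that never failed while observed.
    (Hypothetical helper, for illustration only.)
    """
    if failure_time is None:
        return 0
    lead = failure_time - sample_time
    return 1 if MIN_LEAD <= lead <= MAX_LEAD else 0


# Made-up timestamps for a single device.
failure = datetime(2024, 1, 2, 12, 0)
samples = [
    datetime(2024, 1, 1, 12, 0),  # 24 h before failure: outside the window
    datetime(2024, 1, 2, 3, 0),   # 9 h before failure: inside the window
    datetime(2024, 1, 2, 11, 0),  # 1 h before failure: too late to act on
]
labels = [label_sample(t, failure) for t in samples]
```

Samples closer than 2 hours to the failure are treated as negatives here so that the model is not rewarded for warnings that arrive too late for maintenance; other treatments (for example dropping those samples) are equally plausible.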
Modern data centers have thousands of SSDs and computing nodes, leading on average to 3 DIMM failures and/or 1 SSD outage per day.
DIMM and SSD outages are among the most frequent causes of system crashes. They are hard to handle autonomously because of a lack of data and analytics methods.
The solution is based on a training data set from a major cloud provider, combined with the machine learning methods XGBoost and Random Forest and with survival analysis methods.
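As an illustration of what a survival analysis method computes, the sketch below implements the Kaplan-Meier estimator in plain Python. The text does not say which survival method was used, so Kaplan-Meier is an assumption chosen for illustration, and the durations are made-up values.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] for each time with a failure.

    durations: hours each device was observed
    observed:  True if the device failed at that time, False if it was
               still healthy when observation ended (censored)
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        failures = sum(1 for d, o in zip(durations, observed) if d == t and o)
        leaving = sum(1 for d in durations if d == t)
        if failures:
            # Multiply by the share of at-risk devices that survive time t.
            survival *= 1.0 - failures / at_risk
            curve.append((t, survival))
        # Failed and censored devices both leave the at-risk set.
        at_risk -= leaving
    return curve


# Made-up observation data for five devices.
durations = [100, 200, 200, 300, 400]
observed = [True, False, True, True, False]
curve = kaplan_meier(durations, observed)
```

For this toy data the estimated survival probability drops to about 0.8, 0.6, and 0.3 at 100, 200, and 300 hours. Censored devices lower the number at risk without counting as failures, which is what separates survival analysis from a plain failure-rate calculation.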
We developed advanced algorithms for calculating general features as well as features at the device, bank, and bit level, 270 features in total. We aggregated the data, applied smart data sampling, and trained the model on tabular data.
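A minimal sketch of how raw DIMM error records might be aggregated into one tabular row per device, with one example feature at each of the device, bank, and bit levels. The record layout and the three feature names are hypothetical; they are not taken from the 270 features mentioned above.

```python
from collections import Counter

# Hypothetical correctable-error records: (device_id, bank, bit).
errors = [
    ("dimm-01", 0, 17),
    ("dimm-01", 0, 17),
    ("dimm-01", 2, 5),
    ("dimm-02", 1, 40),
]


def device_features(records):
    """Aggregate raw error records into one feature row per device."""
    by_device = {}
    for device, bank, bit in records:
        by_device.setdefault(device, []).append((bank, bit))

    rows = {}
    for device, hits in by_device.items():
        repeats = Counter(hits)  # how often each (bank, bit) cell recurs
        rows[device] = {
            "error_count": len(hits),                          # device level
            "banks_affected": len({bank for bank, _ in hits}), # bank level
            "max_repeats_same_bit": max(repeats.values()),     # bit level
        }
    return rows


features = device_features(errors)
```

Each resulting row can then be joined with a failure label and passed to a tabular learner such as a gradient-boosted tree or random forest model.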
Classifying raw log lines as normal or abnormal with a single model detects abnormal log lines in a real SSD dataset with a success rate of more than 99.5%.
Evaluated on production data from a cloud provider, the solution shows a 10% to 20% improvement in DIMM failure prediction compared to currently applied solutions.
Reach out to discuss the predictive observability of SSD and DIMM devices with us.