Flapping SSD Storage Server

Minor incident · Region RMA (Rümlang, ZH, Switzerland) · Linux Cloud Servers (RMA1)
2019-01-19 23:02 CEST · 42 minutes

Updates

Post-mortem

The flapping Ceph Mon service was caused by the failure of one of the system SSDs in a hardware RAID-1 array. The hardware RAID controller apparently did not remove the failing SSD from the array quickly enough, which in turn made the Ceph Mon service unstable. Now that the SSD has been removed, all services on this storage node are fully operational again, and we have added the server back to the cluster.
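
For readers unfamiliar with the term, "flapping" means a service that switches repeatedly between up and down within a short period. The sketch below shows one generic way such behaviour could be detected from a chronological list of health-check results. It is an illustration only and not the monitoring used on this platform: the function name `is_flapping`, the 300-second window, and the threshold of four state changes are all hypothetical.

```python
from collections import deque


def is_flapping(events, window_s=300.0, max_transitions=4):
    """Return True if the service changed state too often in a short time.

    events: iterable of (timestamp_seconds, is_up) pairs in chronological
    order. window_s and max_transitions are hypothetical defaults chosen
    for illustration only.
    """
    recent = deque()   # timestamps of state changes inside the window
    prev_state = None
    for ts, is_up in events:
        if prev_state is not None and is_up != prev_state:
            recent.append(ts)
            # Forget state changes that fall outside the sliding window.
            while recent and ts - recent[0] > window_s:
                recent.popleft()
            if len(recent) >= max_transitions:
                return True
        prev_state = is_up
    return False


# A service that toggles every 30 seconds, and one that changes rarely.
unstable = [(0, True), (30, False), (60, True), (90, False), (120, True)]
stable = [(0, True), (1000, False), (2000, True)]
```

With these sample inputs, `is_flapping(unstable)` reports flapping, while `is_flapping(stable)` does not, because its state changes are too far apart to fall within the same window.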

We will replace the faulty SSD soon in order to fully restore redundancy of the system RAID.

Please accept our apologies for the inconvenience this incident may have caused.

January 20, 2019 · 00:28 CEST
Investigating

The problem seems to have been triggered by a flapping Ceph Mon service running on that storage server. We are going through the logs to find the root cause. Meanwhile, the Ceph Mon service on that server has been stopped and no negative impact is expected. We will follow up with an analysis at a later stage.

January 19, 2019 · 23:44 CEST
Monitoring

We have isolated the flapping SSD storage server and are investigating the issue further. Performance should be back to normal.

January 19, 2019 · 23:23 CEST
Issue

We are currently investigating a flapping SSD storage server. Ceph recovery is in progress. Minor performance impact is to be expected. We will keep you posted.

January 19, 2019 · 23:02 CEST
