Minor incident: Flapping SSD Storage Server

Post-mortem

The flapping Ceph Mon service was caused by the failure of one of the system SSDs in a hardware RAID-1. It seems that the hardware RAID controller did not remove the failing SSD from the array quickly enough, which in turn destabilized the Ceph Mon service. Now that the SSD has been removed, all services on this storage node are fully operational again, and we have added the server back to the cluster.

We will replace the faulty SSD soon in order to fully restore redundancy of the system RAID.

Please accept our apologies for the inconvenience this incident may have caused.

January 20, 2019 · 00:28 CET

Investigating

The problem seems to have been triggered by a flapping Ceph Mon service running on that storage server. We are going through the logs to find the root cause. Meanwhile, the Ceph Mon service on that server has been stopped and no negative impact is expected. We will follow up with an analysis at a later stage.

January 19, 2019 · 23:44 CET

Monitoring

We have isolated the flapping SSD storage server and are further investigating the issue at hand. Performance should be back to normal.

January 19, 2019 · 23:23 CET

Issue

We are currently investigating a flapping SSD storage server. Ceph recovery is in progress. A minor performance impact is to be expected. We will keep you posted.

January 19, 2019 · 23:02 CET
