Major Storage Outage in RMA1
After replacing a defective network interface card (NIC) in one of our storage servers on Thursday, 2023-11-09, a chain reaction led to an outage of the entire storage cluster in our RMA1 zone between 13:35 and 13:51 CET.
Technical Background Information
The storage clusters at cloudscale.ch are powered by Ceph. On the one hand, Ceph enables distributed storage: data is split into small chunks and distributed across multiple physical disks and storage servers, allowing for higher total data throughput. On the other hand, Ceph provides storage replication: in our case, three copies of each chunk of data are kept in three different racks, ensuring that the failure of individual hardware components does not affect the availability of the data itself. In addition to a set of monitoring daemons, the cluster consists of several storage servers, each running a number of object storage daemons (OSDs). Each OSD is a process that manages a local disk/partition and contributes its storage space to the overall cluster.
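The rack-aware placement described above can be illustrated with a small sketch. This is not Ceph's actual CRUSH algorithm, and all names here (`place_replicas`, `osds_by_rack`) are our own for illustration; it only shows the core idea that each chunk's replicas deterministically land on OSDs in three distinct racks:

```python
# Illustrative sketch (not Ceph's CRUSH algorithm): place three replicas
# of a chunk on OSDs in three *different* racks, so the loss of any one
# rack never removes all copies of that chunk.

def place_replicas(chunk_id: int, osds_by_rack: dict, replicas: int = 3):
    """Pick one OSD from each of `replicas` distinct racks.

    `osds_by_rack` maps rack name -> list of OSD ids. The choice is a
    deterministic function of `chunk_id`, loosely mimicking how CRUSH
    computes placement instead of looking it up in a central table.
    """
    racks = sorted(osds_by_rack)
    if len(racks) < replicas:
        raise ValueError("not enough racks for the requested replica count")
    start = chunk_id % len(racks)
    chosen_racks = [racks[(start + i) % len(racks)] for i in range(replicas)]
    placement = []
    for rack in chosen_racks:
        osds = osds_by_rack[rack]
        placement.append((rack, osds[chunk_id % len(osds)]))
    return placement
```

Because the placement is computed rather than stored, every client and daemon can independently derive where a chunk's replicas live.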
On Wednesday, 2023-11-08, one of our storage nodes in the RMA1 zone experienced hardware issues. It turned out that one of the two NICs in the server had failed and taken down the NVMe host bus adapter (HBA) in the same host along with it. While the loss of network redundancy did not have a direct impact on our customers, the disruption of the HBA led to decreased performance and, in some cases and for a very short time, hanging I/O requests when accessing our object storage. We isolated the server from the cluster and planned to go on-site the next day to replace the defective NIC with a spare part.
On Thursday morning, 2023-11-09, we replaced the defective NIC, which also restored proper operation of the HBA. We then added the storage server back to the cluster so that it could catch up on the data written while it had been out of the cluster. During this phase, no adverse impact was noticed, but network redundancy had yet to be restored.
In the early afternoon, 3-fold replication was fully restored and the storage cluster was no longer degraded. In order to restore network redundancy, we deployed a necessary update of the network configuration, which took into account the changed MAC addresses resulting from the NIC replacement. We then rebooted the storage server for the new network configuration to take effect.
At 13:35 CET, once the server had rebooted and applied the updated network configuration, a mismatch between the configuration on the server and the switches caused some connections to stay down, while other connections came up with an incorrect configuration, leaving the server partly unreachable.
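A minimal sketch of the kind of consistency check involved: after a NIC swap, the server-side and switch-side configurations must agree on the new MAC addresses. The function and data below are hypothetical (this is not our actual tooling); they model how an interface can either stay down (no matching entry) or come up misconfigured (stale MAC), the two failure modes seen here:

```python
# Hypothetical sketch, not our production tooling: compare the MAC
# addresses a network configuration expects per interface against the
# MACs actually present on the host after the NIC replacement.

def audit_macs(expected: dict, actual: dict) -> dict:
    """Return interfaces that would stay down or come up misconfigured.

    `expected` and `actual` map interface name -> MAC address string.
    "missing" interfaces have no actual counterpart (link stays down);
    "mismatched" interfaces exist but carry a stale MAC (link comes up
    with an incorrect configuration).
    """
    missing = sorted(set(expected) - set(actual))
    mismatched = sorted(
        iface
        for iface in set(expected) & set(actual)
        if expected[iface].lower() != actual[iface].lower()
    )
    return {"missing": missing, "mismatched": mismatched}
```

Running such a check before rebooting into a new network configuration would surface a server/switch mismatch while the server is still reachable.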
As a subsequent fault, a flood of unreachability notifications within the storage cluster triggered a safety shutdown of OSDs that were actually healthy. Once the number of available OSDs holding a particular chunk of data drops to 1, that data becomes read-only as a protective measure; with no available OSDs, the data becomes unavailable altogether. As more and more OSDs within the storage cluster shut down, the outage spread to ever larger parts of the overall storage cluster.
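The protective behaviour described above can be summarized as a simple state function. The thresholds mirror a typical replicated Ceph pool with three copies; the function name and states are our own shorthand, not a Ceph API:

```python
# Illustrative model of the protective thresholds described above,
# assuming 3-fold replication: two or more reachable copies keep the
# data fully read-write, exactly one copy degrades it to read-only,
# and zero copies make it unavailable.

def availability(available_replicas: int) -> str:
    """Map the number of reachable replicas of a chunk to its state."""
    if available_replicas >= 2:
        return "read-write"
    if available_replicas == 1:
        return "read-only"
    return "unavailable"
```

As OSDs shut themselves down, more and more chunks crossed these thresholds, which is why the outage grew rather than staying confined to the rebooted server.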
We quickly suspected the rebooted storage server of causing this chain reaction. At 13:39 CET, we decided to shut this server down again, followed by a manual command at 13:48 CET to start all the OSDs on the healthy servers. By 13:51 CET, the storage cluster was fully available again.
Based on further analysis, we were able to re-establish full network connectivity on the repaired storage server and add it back to the cluster without further impact.
We sincerely apologize for the inconvenience this outage has caused you and your customers.
While the direct cause and effect of this outage became clear very quickly, we are continuing our analysis to better understand all factors involved. We will investigate in detail what led to the mismatch between the network configurations and also look into the Ceph safety mechanisms that kicked in and how they react to partially failing network connections.
Based on our findings, we will take appropriate measures, e.g. to avoid potential mistakes, detect problems earlier, or find configurations that make the involved components more resilient.
We found the root cause and are taking steps to fully resolve the issue. The cluster has been fully operational for over an hour, and we expect it to remain so.
The storage cluster is back online, but we are still monitoring the situation and investigating the root cause. Until we have a better understanding of what happened, we consider this an ongoing incident.
We are currently experiencing a major incident affecting our storage cluster in RMA1.