Degraded Storage Performance

Major incident · Region RMA (Rümlang, ZH, Switzerland) · Linux Cloud Servers (RMA1)
2026-02-15 17:38 CET · 4 hours, 40 minutes

Post Mortem
On Sunday, 2026-02-15, between 17:42 and 17:55 CET, a defective network interface card (NIC) in one of our storage hosts caused a major drop in the performance of I/O operations on our NVMe-only storage cluster in zone RMA1. Bulk and Object Storage in RMA1 as well as the entire LPG1 zone were not affected by this hardware issue.

Timeline
Starting at 17:34 CET, the first log entries from the affected storage host indicated the onset of a network issue. Individual I/O operations to virtual servers’ NVMe-only SSD volumes may have been delayed or stuck if they involved the affected storage host.

At 17:37 CET, the first NVMe SSD disks in the affected storage host were detected as “unreachable” from the storage cluster’s perspective. When a disk is detected as unreachable, our Ceph storage clusters automatically ensure that (potentially failing) I/O requests are no longer sent to the affected disk but are diverted to the two remaining data replicas, and that a third data replica is rebuilt on healthy hardware shortly afterwards.
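
As background, Ceph handles a failed disk (OSD) in two stages: a disk that misses heartbeats for a grace period is marked “down” and no longer receives I/O; if it stays down beyond a further interval, it is marked “out” and its data is re-replicated onto healthy hardware. The following minimal sketch illustrates that logic; the two timing constants mirror Ceph’s default settings (osd_heartbeat_grace and mon_osd_down_out_interval), and the code is purely illustrative, not part of our tooling.

    # Illustrative sketch of Ceph's two-stage failure handling, using
    # Ceph's default timings (not our actual tooling).
    OSD_HEARTBEAT_GRACE = 20         # seconds until an unresponsive OSD is marked "down"
    MON_OSD_DOWN_OUT_INTERVAL = 600  # further seconds until it is marked "out"

    def osd_state(seconds_since_last_heartbeat: float) -> str:
        """Return the cluster's view of an OSD given its heartbeat age."""
        if seconds_since_last_heartbeat < OSD_HEARTBEAT_GRACE:
            return "up"    # I/O continues to be sent to this OSD
        if seconds_since_last_heartbeat < OSD_HEARTBEAT_GRACE + MON_OSD_DOWN_OUT_INTERVAL:
            return "down"  # I/O is diverted to the two remaining replicas
        return "out"       # a third replica is rebuilt on healthy hardware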

At 17:39 CET, our automated monitoring system alerted the on-call engineer.

From 17:42 CET, the intermittently working/failing network connectivity of the affected storage host caused its NVMe SSD disks to be alternately removed from and re-added to the storage cluster. As a consequence, a substantial number of I/O operations kept being sent to this host but failed to complete, causing continued I/O delays or blocks for customer workloads wherever one of the three data replicas was located on this host. Creating new virtual servers in the RMA1 zone may also have failed.
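
This flapping pattern is what made the failure hard to handle automatically: every time the host’s network briefly recovered, its disks were marked “up” again and the countdown towards being marked “out” restarted, so the cluster kept routing I/O to them. Continuing the illustrative sketch above (same constants, still not our actual tooling), this shows why a series of short outages never triggers re-replication even though a single long outage would:

    # Why intermittent failure is worse than a clean one: each returning
    # heartbeat resets the "down" timer, so the OSD is never marked "out".
    OSD_HEARTBEAT_GRACE = 20         # as in the sketch above
    MON_OSD_DOWN_OUT_INTERVAL = 600

    def ever_marked_out(heartbeat_gaps: list[float]) -> bool:
        """heartbeat_gaps: seconds of silence between successive heartbeats."""
        return any(gap >= OSD_HEARTBEAT_GRACE + MON_OSD_DOWN_OUT_INTERVAL
                   for gap in heartbeat_gaps)

    print(ever_marked_out([900.0]))      # True: one clean 15-minute failure self-heals
    print(ever_marked_out([60.0] * 15))  # False: 15 minutes of flapping never does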

At 17:55 CET, after analyzing the situation, our on-call engineer manually stopped all storage cluster activity on the affected host, effectively diverting I/O operations to working hardware and preventing further delays for customer workloads.
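
The report does not spell out the individual commands used; a common way to fence a host like this in a Ceph cluster is to prevent its OSDs from being marked “up” again and to stop their daemons, roughly as in the following hedged sketch (the OSD IDs are placeholders, not taken from this incident):

    # Hedged sketch of fencing a misbehaving storage host; the exact steps
    # taken during this incident are not detailed in the report.
    import subprocess

    def run(cmd: list[str]) -> None:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["ceph", "osd", "set", "noup"])  # cluster-wide flag: OSDs may not be marked "up"
    for osd_id in (12, 13, 14, 15):      # placeholder IDs for the affected host's OSDs
        run(["systemctl", "stop", f"ceph-osd@{osd_id}.service"])  # run on that host
        run(["ceph", "osd", "out", str(osd_id)])  # rebuild the third replica elsewhere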

Subsequently (until 22:34 CET), our engineers disabled and replaced the defective NIC, and the now correctly working storage host was rejoined to the storage cluster. During this period, minor performance degradation may have been noticed.
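
Once the NIC had been swapped, rejoining is essentially the reverse procedure; again as a hedged sketch with placeholder IDs rather than the literal steps from this incident:

    # Hedged sketch of rejoining the repaired host (placeholder IDs).
    import subprocess

    def run(cmd: list[str]) -> None:
        subprocess.run(cmd, check=True)

    run(["ceph", "osd", "unset", "noup"])  # allow OSDs to be marked "up" again
    for osd_id in (12, 13, 14, 15):
        run(["systemctl", "start", f"ceph-osd@{osd_id}.service"])
        run(["ceph", "osd", "in", str(osd_id)])  # let it hold data replicas again
    run(["ceph", "-s"])  # watch recovery/backfill progress while data rebalances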

Next Steps
While our Ceph storage clusters have built-in redundancy and the ability to recover from hardware failures automatically, the intermittent nature of this malfunction prevented these mechanisms from remediating the situation quickly, prolonging the impact on customer workloads. We will analyze this incident and assess possible options to further optimize our response (automated or manual) in situations like this.

We sincerely apologize for the inconvenience this incident may have caused you and your customers.

February 17, 2026 · 16:35 CET
Update

While further investigating this incident and working on the root cause analysis, we noticed that the impact on our customers was greater than initially assessed. We have therefore decided to classify this incident as major.

We will follow up with a detailed incident report.

February 16, 2026 · 14:48 CET
Resolved

The performance of our NVMe-only storage cluster is back to normal.

We will keep monitoring the situation and update this incident ticket if necessary.

Please accept our apologies for the inconvenience this issue may have caused you and your customers.

February 15, 2026 · 22:12 CET
Monitoring

The faulty NIC has been replaced and the storage host is rejoining the cluster.

February 15, 2026 · 21:23 CET
Monitoring

The root cause of the degradation was a misbehaving NIC in one of our storage hosts in zone RMA1. The situation is stable again.

We will keep monitoring the situation and update this incident ticket.

February 15, 2026 · 18:23 CET
Issue

The performance of our NVMe-only storage cluster in zone RMA1 is currently degraded.

Our engineers are investigating the issue and are working to fully restore our services. We will keep you posted.

February 15, 2026 · 17:55 CET
