Bulk and Object Storage Unavailable

Minor incident · Region RMA (Rümlang, ZH, Switzerland) · Linux Cloud Servers (RMA1) · Object Storage Service (RMA)
2019-07-01 10:15 CEST · 8 minutes

Updates

Post-mortem

Management Summary

On 2019-07-01 between 10:15 and 10:23 CEST, requests to our object storage were answered with HTTP error code 503. During the same period, I/O operations to bulk volumes were partially blocked. After 10:23 CEST, access to our bulk and object storage was fully restored.

Detailed Report

On 2019-07-01 at 09:07 CEST, we began working on the scheduled maintenance of our bulk and object storage nodes as previously announced in https://cloudscale-status.net/incident/48.

At 10:16 CEST, our monitoring system reported an issue with our object storage.

An immediate analysis showed that the Ceph cluster had blocked access to some PGs (placement groups) belonging to the bulk and object storage pools because Ceph’s hard limit for the PG per OSD (object storage device) ratio had been exceeded. This triggered the bug described in https://tracker.ceph.com/issues/23117.
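
As a rough illustration of the hard limit mentioned above: to our understanding, Ceph derives it as the product of the two configuration options mon_max_pg_per_osd and osd_max_pg_per_osd_hard_ratio, and an OSD that reaches it stops accepting further PGs. The sketch below models only that arithmetic. The function names, the exact comparison, and all numbers are illustrative assumptions; they are not the values of the affected cluster.

```python
# Minimal model of Ceph's per-OSD placement group hard limit.
# Assumption: hard limit = mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio.
# Function names and all numeric values are hypothetical.


def pg_hard_limit(mon_max_pg_per_osd: int, hard_ratio: float) -> float:
    """Return the number of PGs at which an OSD refuses further PGs."""
    return mon_max_pg_per_osd * hard_ratio


def osd_refuses_new_pgs(pgs_on_osd: int,
                        mon_max_pg_per_osd: int,
                        hard_ratio: float) -> bool:
    """True if the OSD has reached the hard limit (comparison assumed to be >=)."""
    return pgs_on_osd >= pg_hard_limit(mon_max_pg_per_osd, hard_ratio)


if __name__ == "__main__":
    # Hypothetical OSD that temporarily holds 450 PGs during a node upgrade.
    pgs = 450
    for ratio in (2.0, 3.0):
        limit = pg_hard_limit(200, ratio)
        blocked = osd_refuses_new_pgs(pgs, 200, ratio)
        print(f"ratio={ratio}: limit={limit:.0f}, blocked={blocked}")
```

With these made-up numbers, a ratio of 2.0 gives a limit of 400 PGs and the OSD is blocked, while a ratio of 3.0 raises the limit to 600 PGs and the same OSD can accept PGs again.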

At 10:23 CEST, we decided to stop all OSDs on the recently upgraded storage node. This allowed Ceph to recover and permit full access again, which resolved the issue for our users.

After a thorough analysis, we decided to increase the PG per OSD ratio and then restarted all OSDs on the upgraded node. After the increase, all OSDs started to backfill as expected.

We will keep this increased PG per OSD ratio for the remainder of the scheduled maintenance while upgrading the rest of the storage nodes.

Please accept our apologies for the inconvenience this service disruption may have caused you and your customers.

July 2, 2019 · 16:26 CEST
Resolved

The root cause has been identified and a workaround is in place. Recovery is in progress. We will follow up with a detailed root cause analysis.

July 1, 2019 · 11:40 CEST
Issue

The bulk and object storage are operational again, but in a degraded redundancy state. We will keep investigating the root cause of this outage.

July 1, 2019 · 10:27 CEST
Issue

We were notified by our monitoring system that the bulk and object storage are currently not available. We are investigating the issue.

July 1, 2019 · 10:18 CEST
