Core Network: Emergency Software Upgrade from 2026-05-11 20:00 CEST to 2026-05-12 03:00 CEST
Updates
On Friday, 2026-05-08, between 00:20 and 04:37 CEST, one of the redundant dark fiber connections between our cloud regions LPG and RMA suffered an outage caused by the provider during a maintenance window that had not been announced to cloudscale. During the automatic re-routing, both when the connection was lost and when it came back up, the new routing table was not applied correctly. This routing issue subsequently caused internet connectivity failures for specific source and destination pairings (especially for IPv6), while the majority of the traffic was not affected.
In emergency maintenance windows on 2026-05-08 and 2026-05-11, we tried upgrading our border routers to newer software versions as advised by the vendor. However, the connectivity issues persisted to varying degrees and in varying patterns, which is why we reverted to the previous version and are continuing to debug this issue together with the vendor.
Timeline
cloudscale uses redundant dark fibers for direct connectivity between the LPG and RMA cloud regions. The provider had scheduled maintenance on one of the dark fibers for 2026-05-08 between 00:05 and 06:00 CEST, but did not inform us about it.
At 2026-05-08 00:20 CEST, when the affected dark fiber lost its connection (and again at 04:37 CEST, when it came back up), our network infrastructure, including the border routers, automatically adjusted its routing tables, as is the standard reaction to changes in link availability in our network and on the internet in general. Our internal monitoring system immediately detected the fiber outage and alerted the on-call engineer. However, apart from the outage of the dark fiber in question and some packet loss during the routing adjustment, no apparent problems were noted.
During the day on 2026-05-08, our external monitoring systems detected sporadic connectivity issues, and a small number of customers contacted us about failing connections between specific endpoints. Further assessment showed routing issues on our border routers for a small fraction of the traffic, apparently depending on the parameters of the connection in question. We suspected the software on our border routers to be the cause and scheduled emergency maintenance to update to a newer minor version.
Starting 2026-05-08 at 16:30 CEST, we updated the border routers one by one, and kept a close eye on external connectivity monitoring. While short-term failures were expected during routing adjustments, it turned out that issues affecting a small fraction of IPv6 connections persisted even after the maintenance work was completed.
In a further emergency maintenance window starting 2026-05-11 20:00 CEST, we updated the border routers to even newer versions in two stages as advised by the vendor, while again watching our monitoring closely. Unfortunately, we again saw different patterns of connectivity issues, which this time also impacted significant amounts of internet traffic. We therefore decided to immediately downgrade the border routers to the initial version, with which the fewest issues had been observed.
On 2026-05-12, we held a high-priority escalation call with the vendor to assess the issues and discuss possible ways forward. In the process, we were able to establish that part of the traffic loss we still detect is not caused by our own equipment, but occurs within the network of one of our IP transit providers. However, to avoid the risk of further disruptions for our customers, we decided against re-routing around this external issue. If you experience issues with actual traffic, please contact us so we can assess possible mitigations.
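If you want to check whether one of your own endpoints is affected before contacting us, a small probe along the following lines may help. This is only a minimal sketch, not an official cloudscale tool: the function names `try_connect` and `probe`, the default of five attempts, and the three-second timeout are assumptions chosen for illustration. It opens several TCP connections per address family, so that each attempt uses a fresh source port chosen by the operating system, and counts how many succeed over IPv4 and IPv6.

```python
import socket

# Address families to compare; the labels are only used for reporting.
FAMILIES = {"IPv4": socket.AF_INET, "IPv6": socket.AF_INET6}


def try_connect(family, sockaddr, timeout=3.0):
    """Attempt a single TCP connection and return True if it succeeds."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
        return True
    except OSError:
        # Covers timeouts, refused connections and unsupported families.
        return False


def probe(host, port, attempts=5, resolve=socket.getaddrinfo, connect=try_connect):
    """Return {"IPv4": (ok, attempts) or None, "IPv6": (ok, attempts) or None}.

    None means the host has no address of that family. The resolve and
    connect parameters exist only so the logic can be exercised offline.
    """
    results = {}
    for name, family in FAMILIES.items():
        try:
            infos = resolve(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror:
            results[name] = None
            continue
        sockaddr = infos[0][4]  # first resolved address of this family
        ok = sum(1 for _ in range(attempts) if connect(family, sockaddr))
        results[name] = (ok, attempts)
    return results
```

Calling, for example, `probe("www.example.com", 443)` from the affected server returns the number of successful connections per address family. A result where IPv4 succeeds every time while IPv6 fails on some or all attempts, together with the exact source and destination addresses, is useful information to include in a support request.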
Next Steps
For the time being, we are avoiding routing and version changes in order to minimize the risk of new unforeseen side effects for our customers.
Together with the vendor, we are now testing the latest stable release of the routing software with our existing configuration in an isolated, non-production environment, with the intention of discovering potential issues early on. While an isolated environment is not a replacement for an internet-scale backbone, we intend to replicate and fully understand the issues we faced in the past maintenance windows. Once we are confident that these issues have been addressed, we will announce another maintenance window, during which the vendor will be available live in case the upgrade does not go as planned.
We sincerely apologize for the inconvenience this issue may have caused you and your customers.
We had to roll back the upgrade completely, since we ran into unexpected issues. The core network is operational again and we will continue the investigation in the next few days, together with the vendor.
Please accept our apologies for the inconvenience this issue may have caused you and your customers.
Unfortunately, the new major version shows similar issues with IPv6 connectivity. We have decided to extend the maintenance window to continue troubleshooting.
We will keep you posted.
Emergency Maintenance Work
Since we are still facing connectivity issues for certain IPv6 targets after the emergency maintenance last Friday, we have re-escalated the case with the vendor. After further consultation with the vendor, we have decided to upgrade the firmware of our core network to a new major version in another emergency maintenance window. During this maintenance work, you may experience short periods of packet loss (up to 1-2 minutes) or higher RTTs for connections to and from the Internet. Connections between virtual servers at cloudscale will not be affected by this maintenance work.
Date / Time
From Monday, 2026-05-11, 20:00 CEST to 23:00 CEST
Expected Impact
Short periods of packet loss (up to 1-2 minutes) or higher RTTs for Internet-facing connections. Thanks to our redundant setup we do not expect any further impact on already running virtual servers.
We apologize for any inconvenience this may cause and thank you for your understanding.