Connectivity Issues for Small Subset of Virtual Servers

Minor incident · Region RMA (Rümlang, ZH, Switzerland) · Linux Cloud Servers (RMA1)
2019-10-08 08:00 CEST · 9 hours, 20 minutes

Updates

Post-mortem

Context

On 2019-09-30, during a scheduled maintenance, we migrated the default gateways of your virtual servers at cloudscale.ch to new network equipment. In order to keep the downtime of the default gateways as short as possible, we used one of our Ansible playbooks to stop the advertisements of the old default gateways just seconds before deploying the new ones.
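
For illustration only, the sketch below outlines the sequencing idea in Python: withdraw the old gateway advertisements immediately before deploying the new gateways, so the window without a reachable gateway stays as short as possible. The playbook names and the wrapper are placeholders and not our actual internal tooling.

    #!/usr/bin/env python3
    """Illustrative sketch of the gateway cutover sequencing.

    Assumptions: ansible-playbook is installed; the playbook file names below
    are placeholders, not cloudscale.ch's actual internal playbooks.
    """
    import subprocess
    import time

    WITHDRAW_PLAYBOOK = "withdraw-old-gateway-advertisements.yml"  # hypothetical
    DEPLOY_PLAYBOOK = "deploy-new-gateways.yml"                    # hypothetical

    def run_playbook(path: str) -> None:
        """Run an Ansible playbook and abort on failure."""
        subprocess.run(["ansible-playbook", path], check=True)

    if __name__ == "__main__":
        # Stop advertising the old default gateways only once the deployment
        # of the new gateways is ready to start, keeping the gap short.
        run_playbook(WITHDRAW_PLAYBOOK)
        start = time.monotonic()       # downtime roughly begins here
        run_playbook(DEPLOY_PLAYBOOK)  # ...and ends once the new gateways are up
        print(f"approximate gateway downtime: {time.monotonic() - start:.1f}s")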

Timeline

On 2019-10-07, during another scheduled maintenance, we live-migrated a subset of the virtual servers at cloudscale.ch to compute nodes that were physically attached to new top-of-rack switches.

On the morning of 2019-10-08, the day after the migration, we received several reports from customers regarding connectivity issues. All of the affected virtual servers had a few things in common: they were running on compute nodes attached to the new switches, they had a public network interface, and they lost connectivity minutes to hours after completion of the scheduled maintenance the night before. Virtual servers that only had a private network interface did not seem to be affected.

We immediately gathered support data and escalated the case to the vendor of our new network equipment. Shortly after, we started a first debugging session with one of their escalation engineers.

At the same time, we decided to live-migrate the affected servers away from the compute nodes that were attached to the new top-of-rack switches. This immediately resolved the connectivity issues for the virtual servers in question. According to our observations, connectivity remained stable even after live-migrating these servers to compute nodes attached to the new network equipment once more.

Having skimmed through the support data in the meantime, the vendor suspected hardware misprogramming to be the root cause of the connectivity issues. As we had reason to believe that the issue had been remediated by the live-migrations, we then focused on root cause analysis in collaboration with the vendor.

On 2019-10-08 at 14:05 CEST we received the first report that the connectivity issues recurred for a virtual server that had already been live-migrated back and forth. This immediately led to the decision to completely roll back the announced migration of the night before. Minutes later, two of our engineers set off for our data center.

On 2019-10-08 at 17:20 CEST we successfully completed the rollback process: All virtual servers had been live-migrated back to compute nodes that had been re-attached to the old network equipment. Within the following hours, several customers confirmed that they had not experienced any further issues after the rollback.

Root Cause

Further investigation in collaboration with the vendor revealed that a feature called “ARP and ND suppression”, in combination with stale neighbor entries for the old default gateways, was the root cause of this incident. In fact, the new top-of-rack switches were still replying to ARP and ND requests with the MAC addresses of the old default gateways (which had gone out of service on 2019-09-30). This only affected servers that were physically attached to the new top-of-rack switches.
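
To illustrate the symptom, the minimal sketch below broadcasts an ARP request for the default gateway from a guest's point of view and flags a reply that carries an unexpected, i.e. stale, MAC address. It assumes the scapy library, root privileges, and placeholder addresses rather than actual cloudscale.ch values.

    #!/usr/bin/env python3
    """Minimal sketch: spot a stale gateway MAC from within a guest.

    Assumptions: scapy is installed (pip install scapy), the script runs with
    root privileges, and the addresses below are placeholders.
    """
    from scapy.all import ARP, Ether, srp

    GATEWAY_IP = "192.0.2.1"            # hypothetical default gateway address
    EXPECTED_MAC = "02:00:00:aa:bb:cc"  # hypothetical MAC of the *new* gateway

    def resolve_gateway_mac(ip: str, timeout: float = 2.0):
        """Broadcast an ARP who-has for `ip` and return the answering MAC."""
        request = Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=ip)
        answered, _ = srp(request, timeout=timeout, verbose=False)
        for _, reply in answered:
            return reply[ARP].hwsrc
        return None

    if __name__ == "__main__":
        mac = resolve_gateway_mac(GATEWAY_IP)
        if mac is None:
            print(f"no ARP reply for {GATEWAY_IP}")
        elif mac.lower() != EXPECTED_MAC:
            # A switch answering on behalf of the gateway (ARP suppression)
            # may hand out a stale MAC, as described above.
            print(f"stale entry: {GATEWAY_IP} resolves to {mac}, expected {EXPECTED_MAC}")
        else:
            print(f"{GATEWAY_IP} resolves to the expected MAC {mac}")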

The fact that those stale entries still existed more than a week after the migration of the default gateways was confirmed to be a bug in the firmware of our new network equipment.

After clearing those stale entries we were able to confirm that connectivity was fully restored for all virtual servers. In agreement with the respective customers, we live-migrated several of their virtual servers back to compute nodes that were still attached to the new network equipment, along with internal servers (specifically our monitoring servers). We have been experiencing stable connectivity without any further issues ever since.

Next Steps

As confirmed by the vendor, the newly discovered bug of advertising stale entries through the “ARP and ND suppression” feature could only have this effect because the MAC addresses of the default gateways changed in the course of the initial migration. As the gateway migration was completed on 2019-09-30 and the stale entries were cleared manually while researching and resolving this case, it is safe to assume that this issue cannot recur. This conclusion is also backed by the stable operation of the various virtual servers mentioned above.

After thorough analysis, we have therefore decided to resume the migration and again plan to live-migrate a subset of the virtual servers at cloudscale.ch to compute nodes attached to the new network equipment next Monday, 2019-10-21 (see upcoming maintenance announcements).

Please accept our apologies for the inconvenience this issue may have caused you and your customers.

October 15, 2019 · 17:00 CEST
Update

Rollback Completed
We have successfully completed the rollback process: All virtual servers have been live-migrated back to compute nodes that have been re-attached to the old network equipment. We do not expect any further issues and will continue the investigation in collaboration with the vendor. Once we have received further information from the vendor, we will publish a post-mortem.

Please accept our apologies for the inconvenience this incident may have caused.

October 8, 2019 · 17:20 CEST
Update

Reverting Migration
We have researched the current issue together with the vendor and executed their recommended steps. However, the connectivity issues of the affected servers could not be fully remediated. Therefore, we have decided to roll back last night’s migration, starting immediately. This will restore the previous state, which had been in place until yesterday and had proven stable.

Date / Time
Tuesday, 2019-10-08, 14:50 CEST to Wednesday, 2019-10-09, 08:00 CEST

Expected Impact
Short periods of packet loss (up to 1-2 minutes) or higher RTTs for connections towards the Internet as well as between virtual servers.

We apologize for any inconvenience this may cause and thank you for your understanding.

October 8, 2019 · 14:50 CEST
Update

We have collected support data and escalated the issue to the vendor as a high-priority case. We are awaiting their feedback and will provide you with a root cause analysis once available. Please get in touch with our support team in case of ongoing connectivity issues with your virtual servers.

October 8, 2019 · 10:15 CEST
Update

We have live-migrated the virtual servers that were reported as unreachable and received confirmation that the situation is stable again. We will further investigate the root cause together with the vendor.

October 8, 2019 · 09:00 CEST
Issue

On the evening of Monday, 2019-10-07, we live-migrated a subset of the virtual servers at cloudscale.ch to compute nodes that are attached to new network equipment. Some of these virtual servers are now facing connectivity issues. We are investigating and will keep you updated.

October 8, 2019 · 08:00 CEST
