Connectivity Issues
Updates
We have received a potential workaround from the vendor that has been tested extensively in their lab and confirmed to reduce the CPU utilization on the top-of-rack switches. However, we want to verify the workaround with our own tests in our lab before scheduling the rollout to production. We will provide an update and announce an emergency maintenance window once all tests have passed in our lab as well.
Since last night’s emergency maintenance, during which we upgraded the firmware back to the version we have been using right from the start, the errors have not appeared in the log files anymore. We do not believe that the root cause has been found or fixed yet, but we are confident that the situation will remain stable for now. We will continue to search for the root cause in close collaboration with the vendor and provide an update once further information is available.
The upgrade back to the firmware release we were using initially was successful. Together with the vendor, we were able to collect performance data that will be analyzed over the next few hours. Based on that data, the vendor will try to set up a reproduction in their lab. We do not expect an update before Saturday evening (CET).
After a long and in-depth discussion with the vendor, and because the issues still persist, we have decided to go back to the firmware release we have been using right from the start and to continue debugging that specific release. Unfortunately, the emergency upgrade will once more lead to short periods of packet loss. We are in constant contact with the vendor and are monitoring the situation closely.
We are sorry for the inconvenience this may cause.
The downgrade of the firmware on all switches in our RMA1 location was successful. However, CPU utilization remains significantly elevated. We are in touch with the vendor's engineering and development teams in order to reproduce the issue in their lab or ours. We will follow up as soon as an update is available.
The vendor suggested downgrading the firmware on all switches to try to mitigate the issue while they continue working on a local reproduction. Having validated the downgrade in the lab, we will start rolling it out now. We will keep you posted.
We were able to get the vendor on an emergency call and are now discussing options. Unfortunately, further packet loss is expected until a workaround is in place.
We are currently facing partial packet loss again and are trying to get the vendor on an emergency call. We will keep you posted.
After performing a controlled reboot of all switches, we were able to stabilize the situation. We have raised a high-priority case with our vendor, and it has already received management attention on their side.
The issue we are facing seems to be triggered by high CPU load on the switches themselves, which leads to instabilities in control plane traffic. We have relaxed several control plane timers in the hope of mitigating the issue for the time being; however, the resulting decrease in CPU utilization was only minimal.
We will continue investigating this issue together with the vendor and keep you up to date on this status page.
Please accept our sincere apologies and rest assured that we treat this case with the utmost priority.
After collecting support data together with the vendor, we are currently experiencing partial packet loss between virtual servers and to/from the Internet. We are investigating and will keep you updated.
We are sorry for the inconvenience this may cause.