Detailed Incident Report
On 2019-01-07 at 15:35 CET, we were notified by our network monitoring system that we were facing partial packet loss to various destinations on the Internet. After a short investigation we discovered that the BGP sessions with all of our upstream providers had been reset at the same time and were re-established after a few seconds.
Between 15:33 and 15:50 CET, the BGP sessions were reset and re-established repeatedly, causing further partial packet loss for Internet-facing connections. After that, the situation was stable again.
However, as a result of the numerous BGP resets in such a short period of time, our prefixes were penalized by BGP route flap damping, and therefore our services may not have been reachable via certain Internet providers for an extended period of time.
At 16:25 CET, we received confirmation from various customers and our network monitoring system that our services were reachable from everywhere again.
Root Cause Analysis
Shortly after 16:00 CET, we were already in touch with the vendor to discuss potential workarounds. It quickly turned out that we were far from the only network affected and that these BGP session resets were caused by the DISCO experiment, which triggered a bug in our routing stack.
The bug was triggered by the use of a BGP attribute reserved for development in the virtual network control (VNC) code of our routing stack. VNC was using "255" as a development value for features that were never standardized. The intent was to disable this usage outside of development, but this did not happen.
Since "255" was a known attribute, the software tried to parse the attribute generated as part of the experiment and failed, as the value was in an unknown format. This failure in turn triggered the common attribute parsing error behavior: RFC 4271 mandates a session reset in this case.
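To make the failure mode concrete, the following is a minimal, purely illustrative sketch (not the vendor's actual code, and the function and class names are our own) of how a BGP speaker can react to a path attribute it recognizes but cannot parse, contrasting the RFC 4271 default (session reset) with the RFC 7606 "treat-as-withdraw" behavior:

```python
class SessionReset(Exception):
    """RFC 4271 behavior: send a NOTIFICATION and tear down the session."""


def parse_vnc_attribute(value: bytes) -> None:
    # Stand-in for the VNC parser registered for attribute type 255:
    # the experimental payload did not match the format it expected.
    raise ValueError("unknown format")


def handle_path_attribute(attr_type: int, attr_value: bytes,
                          rfc7606: bool = False) -> str:
    # Parsers the speaker has registered; VNC claimed development type 255.
    known_parsers = {255: parse_vnc_attribute}

    parser = known_parsers.get(attr_type)
    if parser is None:
        # Unrecognized optional attribute: passed along unmodified,
        # no error is raised.
        return "route accepted"
    try:
        parser(attr_value)
    except ValueError:
        if rfc7606:
            # RFC 7606 "treat-as-withdraw": drop only the routes carried
            # in the malformed UPDATE and keep the session up.
            return "route withdrawn"
        # RFC 4271 default: the entire session is reset.
        raise SessionReset("UPDATE message error")
    return "route accepted"
```

With `rfc7606=True`, the malformed experimental attribute would only have withdrawn the affected routes instead of resetting the sessions with all upstream providers at once.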
Between 18:00 and 19:00 CET, in an emergency maintenance window, we installed a patch containing a workaround for the VNC issue mentioned above.
The vendor is now working on a final solution to prevent this from happening again. We are considering sponsoring the implementation of RFC 7606 to contribute to BGP stability in our routing stack in the future.
Please accept our apologies for the inconvenience this incident may have caused you and your customers. We will keep doing our best to prevent such situations from happening.
After successful tests in our lab, we will now roll out a patch containing a workaround for the issues seen earlier today. During this maintenance work you may experience short periods of packet loss (up to 1-2 minutes) or higher RTTs for connections from and towards the Internet. Connections between virtual servers at cloudscale.ch will not be affected by this maintenance work.
We apologize for any inconvenience this may cause and thank you for your understanding.
The situation is stable again. The incident was caused by the DISCO experiment. We will follow up with a detailed incident report later.
We are currently experiencing network issues with our upstream providers. We are investigating.