Detailed Incident Report
On 2018-04-23, several customers contacted us because they were having issues connecting to various external APIs over SSL/TLS. At first, we suspected an update to one of the SSL libraries that had been released the week before, and we started our investigation in that direction.
On 2018-04-24, in a joint effort with the engineers of our partner company VSHN, we discovered that the issue was caused by an MTU mismatch in the return path of one of our IP transit providers (Init7). It turned out that only connections to the US were affected, and only those with packets exceeding 1476 bytes.
Using traceroute -A -I -w1 --mtu <target> from an AWS EC2 instance, we were able to pinpoint the exact Init7 router in the US that was causing the issue and opened a support ticket at 13:15 CEST. At the same time, we were evaluating our options to implement a workaround.
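A size threshold like this can also be narrowed down with a simple search. The following is a minimal sketch, not the procedure used during the incident. It assumes a hypothetical probe(size) callable that reports whether a packet of the given size, sent with the don't-fragment bit set, gets through, and it assumes the 1476-byte threshold behaves like a path MTU limit. The simulated probe stands in for a real network test.

```python
from typing import Callable


def find_path_mtu(probe: Callable[[int], bool], low: int = 576, high: int = 1500) -> int:
    """Return the largest packet size in [low, high] for which probe(size) succeeds.

    Assumes the probe is monotonic: if one size passes, every smaller size passes too.
    """
    if not probe(low):
        raise ValueError(f"even {low}-byte packets do not get through")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid        # mid fits, so the limit is mid or larger
        else:
            high = mid - 1   # mid is dropped, so the limit is smaller
    return low


def simulated_probe(size: int) -> bool:
    """Hypothetical stand-in for a real test: the path drops anything above 1476 bytes."""
    return size <= 1476


print(find_path_mtu(simulated_probe))  # prints 1476
```

In a real test, the probe would have to send actual packets with fragmentation disabled, which is what the --mtu option of traceroute does hop by hop.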
At 13:21 CEST, we decided to prefer our second IP transit provider (Liberty Global) over Init7. This procedure (prepending our AS on the Init7 BGP session and increasing the localpref on the Liberty Global session) is documented in our Wiki and has been used multiple times before without causing any issues. It was therefore considered a standard routine.
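The two knobs in this procedure act in different directions: local preference steers our own outbound choice, while AS-path prepending only makes our announcement look less attractive to other networks. The sketch below is a heavily simplified model of BGP best-path selection (real routers evaluate more criteria), and every AS number and route in it is a hypothetical value from the documentation range, not our actual configuration. It illustrates in general terms why prepending is a hint that remote networks may override; it makes no claim about what happened in this incident.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    via: str                  # neighbor the route was learned from
    local_pref: int           # higher wins, evaluated first
    as_path: Tuple[int, ...]  # shorter wins, only used to break local_pref ties


def best_route(routes: List[Route]) -> Route:
    """Simplified selection: highest local preference, then shortest AS path."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))


def prepend(as_path: Tuple[int, ...], own_as: int, times: int) -> Tuple[int, ...]:
    """Return the AS path with own_as added `times` more times at the front."""
    return (own_as,) * times + tuple(as_path)


# Hypothetical AS numbers (reserved for documentation).
OWN_AS, TRANSIT_A, TRANSIT_B, REMOTE_AS = 64500, 64501, 64502, 64510

# Outbound: our router picks transit B once its local preference is raised,
# even though the AS path through B is longer.
outbound = [
    Route("transit A", 100, (TRANSIT_A, REMOTE_AS)),
    Route("transit B", 200, (TRANSIT_B, 64503, REMOTE_AS)),
]
print(best_route(outbound).via)

# Inbound: a remote network sees our prefix through both transits.
path_via_a = (TRANSIT_A,) + prepend((OWN_AS,), OWN_AS, 2)  # prepended announcement
path_via_b = (TRANSIT_B, OWN_AS)

# With equal local preference, the remote network follows the shorter path via B.
neutral = [Route("transit A", 100, path_via_a), Route("transit B", 100, path_via_b)]
print(best_route(neutral).via)

# If the remote network prefers transit A by policy, prepending changes nothing.
pinned = [Route("transit A", 200, path_via_a), Route("transit B", 100, path_via_b)]
print(best_route(pinned).via)
```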
At 13:23 CEST, our monitoring system reported that some external targets were no longer reachable. At 13:27 CEST, we reset the BGP session to Init7, which immediately resolved the connectivity issues. However, most US traffic was still being routed via Init7 and was therefore still passing through their faulty router in the US. Prepending did not seem to have the desired effect.
At 13:34 CEST, we received an answer from Init7 stating that the MTU issue was already known and being tracked on their status page. Furthermore, they suggested setting a specific BGP community (65009:2, which means “not announce” in USA/Canada) as a temporary workaround to circumvent the router at fault.
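For readers unfamiliar with the notation: a standard BGP community such as 65009:2 is a 32-bit value written as two 16-bit halves. What a given value means is defined by the provider that acts on it. The helper names below are illustrative only and do not belong to any router software; this is a minimal sketch of the encoding.

```python
def encode_community(text: str) -> int:
    """Convert 'high:low' notation into the 32-bit community value."""
    high_part, low_part = text.split(":")
    high, low = int(high_part), int(low_part)
    if not (0 <= high <= 0xFFFF and 0 <= low <= 0xFFFF):
        raise ValueError(f"both halves of {text!r} must fit into 16 bits")
    return (high << 16) | low


def decode_community(raw: int) -> str:
    """Convert a 32-bit community value back into 'high:low' notation."""
    return f"{raw >> 16}:{raw & 0xFFFF}"


raw = encode_community("65009:2")
print(hex(raw), decode_community(raw))
```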
At 13:44 CEST, we reverted the changes made at 13:21 CEST and implemented the workaround suggested by Init7. Unfortunately, the desired rerouting to Liberty Global still did not occur, so we informed Init7 at 13:48 CEST that their suggested workaround had no effect.
At 14:15 CEST, after evaluating our options, we decided to completely shut down the BGP session towards Init7 to avoid further negative impact for our customers.
At 14:19 CEST, our monitoring reported that connectivity towards the Internet was flapping. We immediately started investigating and tried to mitigate the problem by clearing the BGP sessions towards our IP transit providers.
At 14:27 CEST, connectivity over IPv4 seemed stable again, but IPv6 was still flapping.
At 14:50 CEST, we decided to bring the session towards Init7 back up, because they had successfully implemented a workaround in the meantime and we still could not pinpoint the root cause of our flapping Internet connectivity.
At 14:55 CEST, once we were convinced that we had hit a software issue, we decided to completely reboot the router facing Init7 and the SwissIX Internet Exchange. The flapping stopped immediately. After the reboot, we were able to bring up the BGP sessions in a controlled manner, as expected. The situation remained stable.
At 15:05 CEST, as a precautionary measure, we also rebooted our router facing Liberty Global.
We identified several possibilities for improvement after reviewing this incident in detail and took the following measures:
- We have extended our monitoring to get a more detailed overview of our global reachability, especially for IPv6.
- We have improved our incident assessment process to make sure a small disruption somewhere in the Internet does not lead to a bigger local impact.
- We are evaluating adding another transit provider to increase the level of redundancy, especially in case we decide to take down one of the links completely.
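As an illustration of the first measure, a reachability check can test each target separately over IPv4 and IPv6, so that an IPv6-only problem is not hidden by working IPv4. This is a minimal sketch and not our actual monitoring setup. The function names are hypothetical, and the TCP connect probe is only one possible way to test reachability.

```python
import socket
from typing import Callable, Dict, Iterable, Tuple

Target = Tuple[str, int]  # (hostname, TCP port)


def tcp_probe(host: str, port: int, family: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection succeeds over the given address family."""
    try:
        candidates = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no address of this family could be resolved
    for fam, socktype, proto, _canonname, sockaddr in candidates:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next resolved address
    return False


def check_reachability(
    targets: Iterable[Target],
    probe: Callable[[str, int, int], bool] = tcp_probe,
) -> Dict[Target, Dict[str, bool]]:
    """Probe every target once per address family and report both results."""
    results: Dict[Target, Dict[str, bool]] = {}
    for host, port in targets:
        results[(host, port)] = {
            "ipv4": probe(host, port, socket.AF_INET),
            "ipv6": probe(host, port, socket.AF_INET6),
        }
    return results
```

Calling check_reachability([("example.com", 443)]) from several vantage points would return a separate ipv4 and ipv6 result for each target. The probe argument can be swapped for a different test, for example an ICMP check.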
We are well aware that this incident may have caused you and your customers a considerable amount of trouble. Please accept our apologies and rest assured that we will keep doing our best to prevent such a situation from happening in the future.
Please do not hesitate to contact us if you have any follow-up questions.
The situation is stable again. We will follow up with an incident report.
We are currently investigating routing issues.