Network Issues

Major incident · Region RMA (Rümlang, ZH, Switzerland) · Network Infrastructure (RMA)
2019-11-22 07:50 CET · 10 hours

Updates

Post-mortem

Incident Report Regarding Outages on 2019-11-22 (Part 2)

Root Cause

As it turned out, the root cause was a high rate of inbound IPv6 traffic directed at many different, unused IPv6 addresses. While data traffic is usually forwarded directly by the ASIC, packets to unknown destinations in connected networks are punted to the device’s CPU so that it can perform neighbor resolution (ARP/ND). In this case, the volume of IPv6 neighbor solicitation messages overwhelmed the CPU, competing with the control plane traffic necessary to maintain existing BFD, BGP and LACP sessions. Failure to send or process control plane traffic in time results in flapping of the respective links, causing a period of packet loss for the connected subset of switches and servers.
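
For illustration, here is a minimal Python sketch of the forwarding decision described above; the data structures, addresses and return values are made-up examples, not the actual behaviour or configuration of our switches.

```python
from ipaddress import ip_address, ip_network

# Simplified model of the punting behaviour: traffic to already-resolved
# neighbors stays on the hardware forwarding path, while traffic to unknown
# addresses inside a connected network is punted to the CPU, which then has
# to perform neighbor resolution. (Illustrative only.)
CONNECTED_NETWORK = ip_network("2001:db8:1::/64")   # example prefix
NEIGHBOR_CACHE = {ip_address("2001:db8:1::10")}     # already-resolved hosts

def handle_packet(dst: str) -> str:
    addr = ip_address(dst)
    if addr not in CONNECTED_NETWORK:
        return "ASIC: route towards next hop"
    if addr in NEIGHBOR_CACHE:
        return "ASIC: forward, neighbor already resolved"
    # Unknown address in a connected network: punt to CPU and send a
    # neighbor solicitation -- this is the path a scan hits over and over.
    return "CPU: punt and send neighbor solicitation"

print(handle_packet("2001:db8:1::10"))    # hardware path
print(handle_packet("2001:db8:1::dead"))  # CPU path
```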

While on 2019-11-22 the basic problem was the same as in the weeks before, the CPU utilization peaks were slightly higher, triggering, in several instances, a watchdog process that killed other processes running on our network devices. This, in turn, led to complete crashes of the affected devices and also interfered with our problem analysis and debugging attempts.

Given the immense number of IPv6 addresses within an allocation or a subnet, scanning through an IPv6 address range can tie up significant resources for processing the necessary neighbor solicitations. This is a weakness inherent in the IPv6 protocol and a potential attack vector if exploited in a targeted manner. While our new network equipment rate-limits traffic that requires neighbor resolution, the respective vendor defaults turned out not to be strict enough.
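
To give a sense of scale, here is a rough back-of-the-envelope calculation; the scan rate and host count below are assumed example figures, not measured values.

```python
# A single /64 contains 2^64 addresses, so essentially every probe of a
# random address misses the neighbor cache and ends up on the CPU.
addresses_per_64 = 2 ** 64
existing_hosts = 1_000      # assumed number of real hosts in the subnet
scan_rate_pps = 10_000      # assumed scan rate in packets per second

hit_probability = existing_hosts / addresses_per_64
punted_pps = scan_rate_pps * (1 - hit_probability)

print(f"{addresses_per_64:.2e} addresses in one /64")            # ~1.84e+19
print(f"~{punted_pps:.0f} packets/s punted for neighbor resolution")
```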

As a mid-term fix, we currently block inbound IPv6 traffic that is not directed to an existing IPv6 address using a whitelist ACL, effectively avoiding unnecessary neighbor solicitations and the CPU load they cause. Now that the flaw has been identified and mitigated, we continue to work closely with the vendor to replace the mid-term fix with a more dynamic and scalable long-term solution.
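
Conceptually, the whitelist approach permits inbound IPv6 traffic only for addresses that are actually in use. The sketch below shows one way such a list could be kept compact by collapsing in-use prefixes; the prefixes and the rule syntax in the output are hypothetical examples, not our actual configuration.

```python
from ipaddress import ip_network, collapse_addresses

# Collapse the list of in-use addresses/prefixes into as few ACL entries
# as possible (purely illustrative).
in_use = [
    ip_network("2001:db8:1::10/128"),
    ip_network("2001:db8:1::11/128"),
    ip_network("2001:db8:2::/64"),
]

for entry in collapse_addresses(in_use):
    # Emit one permit rule per collapsed prefix; traffic to anything else
    # gets dropped before it can trigger a neighbor solicitation.
    print(f"permit ipv6 any {entry}")
```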

While scanning address ranges can be part of a malicious attack (including DoS/DDoS), this is not always the case. Scanning address ranges has also been part of scientific projects or security research in the past and would need to be evaluated on a case-by-case basis. We currently have no reason to believe that the IPv6 address scans hitting our network and causing the described issues were performed with malicious intent.

SLA Considerations

The total duration of the outages of services in region RMA exceeds the downtime allowed for by our 99.99% availability SLA stated in our terms and conditions.
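
For reference, a quick calculation of what 99.99% availability translates to in absolute downtime (assuming a monthly or yearly evaluation window; the exact window is defined in the terms and conditions):

```python
# 99.99% availability expressed as allowed downtime.
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12
unavailability = 1 - 0.9999

print(f"~{MINUTES_PER_YEAR * unavailability:.1f} minutes per year")    # ~52.6
print(f"~{MINUTES_PER_MONTH * unavailability:.1f} minutes per month")  # ~4.4
```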

Currently, we can neither classify the address scans as a DoS attack (which would not constitute a breach of the SLA as defined) nor exclude the possibility of them being part of such an attack.

Assuming the outages were not the result of an attack, all affected customers would be eligible for “a pro-rata credit note for the duration of said failures for the services you use that have been affected”. However, as a symbolic gesture, we decided to refund the full daily cost of the services charged on 2019-11-22 to all of our customers, regardless of the SLA evaluation.

You can find the refund of your total daily charges of 2019-11-22 in a combined credit transaction in your billing overview in our cloud control panel.

Next Steps

As outlined above, we consider the current mid-term fix fully effective, protecting our infrastructure against further issues stemming from IPv6 address scans.

We are currently working with the vendor on multiple possible long-term solutions to the fundamental problem that facilitated this incident in the first place. Once available, we will test any changes or new firmware versions in our lab first and then deploy them to production during previously announced maintenance windows following our standard procedures.

While the outages also affected our cloud control panel and therefore the creation and modification of virtual servers, servers already running at our second location LPG1 were not affected by the incident on 2019-11-22. We encourage our users to look into geo-redundancy where it makes sense for their individual use case. It goes without saying that we have applied the same mid-term fix to our network equipment at LPG1 as well, effectively preventing this issue from emerging in either cloud location in the future.

November 27, 2019 · 11:43 CET
Post-mortem

Incident Report Regarding Outages on 2019-11-22 (Part 1)

Context

Over a period of more than a year, cloudscale.ch has been evaluating and testing new network equipment. Our recently opened cloud location in Lupfig has been built on this new equipment right from the start. The existing network equipment at our location in Rümlang, on the other hand, has been replaced gradually with the new devices during several announced maintenance windows. Replacing the existing network devices allowed us to further increase the speed of key links (e.g. of our NVMe-only storage cluster) to 100 Gbps and to optimally interconnect the existing and the new cloud regions, Rümlang (RMA) and Lupfig (LPG). Furthermore, the new setup provides more network ports overall to accommodate our continued growth.

During the evaluation and testing period, we worked closely with the vendor’s engineering team to test and verify the new setup for our specific environment. After extensive testing of different configurations and over the course of multiple new firmware releases, we gained confidence that the new setup had reached the required maturity to take over our productive workload.

Timeline

On 2019-09-30, we took the first step by switching over the default gateways of all servers at our then-only location RMA1 in a previously announced maintenance window. Between 2019-10-07 and 2019-11-09, we gradually live-migrated our customers’ virtual servers to compute nodes physically attached to the new network devices, in further announced maintenance windows.

During productive operation, we noticed short periods of link flaps leading to partial packet loss that correlated with high CPU utilization on the new network devices. We analyzed this behavior in close collaboration with the vendor’s engineering team and applied several changes and fixes in multiple steps as proposed by the vendor, mitigating some aspects of the overall issue, but not resolving it completely.

On 2019-11-22 at 07:48 CET, we received alerts from both our internal and external monitoring, indicating connectivity issues within our networking infrastructure at RMA1. Within minutes, one of our engineers started investigating. However, the issues grew into a complete outage at RMA1 by 08:07 CET. With the exception of creating new virtual servers, normal operation could be restored starting at 08:24 CET after we had disabled BFD to take some load off the CPUs of our network devices.

Between 10:16 CET and 12:59 CET, further periods of partial or full network outages at RMA1 occurred. After most of our infrastructure was operational again by 12:20 CET, we focused on fully restoring Floating IPs and IPv6 connectivity. From 13:15 CET, all services and features at RMA1 were back to full availability.

Having found the root cause (see “Root Cause” in Part 2) while handling these outages, we implemented a blacklist ACL as a first mitigation at 13:59 CET. It proved to be very effective and reduced CPU utilization on the switches by almost 50%. Right after implementing this first mitigation, we added specific checks and triggers to our internal monitoring system in order to immediately detect any new issues triggered by the same root cause.

However, as the traffic pattern changed over time, the mitigation had to be adapted promptly to prevent the switch CPUs from spiking again. So, while this mitigation did help us get back into a stable state by massively reducing the CPU utilization on the network equipment, it did not provide us with a sustainable mid-term fix.

Both the vendor’s and our own engineering teams worked very closely together to find a mid-term fix the same day. We scheduled synchronization calls every 90 minutes and focused on various options for mid-term fixes in a joint effort. At around 22:45 CET, we started implementing the mid-term fix, based on a whitelist ACL approach, which we all believed to be the most robust option while not introducing further variables. Around midnight, we deployed the fix to the production network in region RMA and confirmed its effectiveness shortly after.
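
The difference between the initial blacklist mitigation and the whitelist-based mid-term fix can be illustrated with a small sketch (the addresses and prefixes are made-up examples): a blacklist only drops traffic to ranges already observed being scanned and therefore has to chase the scan, while the whitelist drops everything that does not correspond to an existing address.

```python
from ipaddress import ip_address, ip_network

blacklist = [ip_network("2001:db8:bad::/48")]   # ranges observed being scanned
whitelist = {ip_address("2001:db8:1::10")}      # addresses that actually exist

def blacklist_drops(dst: str) -> bool:
    # Misses any range the scan moves to until the list is updated.
    return any(ip_address(dst) in net for net in blacklist)

def whitelist_drops(dst: str) -> bool:
    # Drops everything that is not an existing address, wherever the scan goes.
    return ip_address(dst) not in whitelist

probe = "2001:db8:2::1"                         # scan moved to a new range
print(blacklist_drops(probe))   # False -> still punted to the CPU
print(whitelist_drops(probe))   # True  -> dropped at the ACL
```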

The vendor will continue working on code changes that will replace the mid-term fix in the future. We will test future firmware releases in our lab and deploy them on a regular basis during previously announced maintenance windows following our standard procedures.

November 27, 2019 · 11:42 CET
Update

Network Issues: Fix in Place

The network in region RMA has been stable for more than 24 hours now and we are convinced that it will remain that way. Here is why:

Reproduction
After collecting more data during yesterday’s outage, the vendor was able to re-create the problem in their lab. This allowed both the vendor’s and our own engineering team to work on a short-term mitigation as well as on mid-term fixes, and it will also provide a way to validate long-term solutions that will be introduced in subsequent firmware releases.

Short-Term Mitigation
After identifying the potential root cause yesterday afternoon, we implemented a workaround to prevent the suspected weakness in the firmware from being triggered by unusual traffic patterns. However, as the traffic pattern changed over time, the mitigation had to be adapted promptly to prevent the switch CPUs from spiking again. So, while this did help us get back into a stable state by massively reducing the CPU utilization on the network equipment, it did not provide us with a sustainable mid-term fix.

Mid-Term Fix
Both engineering teams worked very closely together to find a mid-term fix the same day. We scheduled synchronization calls every 90 minutes and focused on various options for mid-term fixes in a joint effort. At around 22:45 CET, in a final conference call, we started implementing the mid-term fix that we all believed to be the most robust while not introducing further variables. Around midnight, we deployed the fix to the production network in region RMA and confirmed its effectiveness shortly after.

Monitoring
Right after implementing the first mitigation, we added specific checks and triggers to our internal monitoring system in order to receive an alert should the situation change suddenly. We also added very aggressive triggers for CPU utilization and expected to receive some “false positive” alerts during the night. However, we did not receive any alerts, as the CPU utilization has been very steady for more than 24 hours now.
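
As an illustration of the kind of aggressive trigger described above, the following sketch checks control-plane CPU utilization against deliberately low thresholds; the threshold values and the sample data are assumptions for this example and differ from our actual monitoring setup.

```python
# Deliberately low thresholds so that even moderate spikes raise an alert.
WARN_THRESHOLD = 40.0   # percent (example value)
CRIT_THRESHOLD = 60.0   # percent (example value)

def evaluate_cpu(sample: float) -> str:
    if sample >= CRIT_THRESHOLD:
        return "CRITICAL: control-plane CPU high, alert on-call engineer"
    if sample >= WARN_THRESHOLD:
        return "WARNING: control-plane CPU elevated"
    return "OK"

for sample in (12.0, 45.0, 73.0):
    print(f"{sample:5.1f}% -> {evaluate_cpu(sample)}")
```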

Long-Term Solution
The vendor will continue working on code changes that will replace the mid-term fix in the future. We will test future firmware releases in our lab and deploy them on a regular basis during previously announced maintenance windows following our standard procedures.

Incident Report
We will follow up with a detailed incident report by mid next week.

Please accept our sincere apologies for all the inconvenience this may have caused you and your customers.

November 23, 2019 · 13:55 CET
Resolved

We have a mitigation in place that has sustainably reduced CPU utilization on the switches. Ever since, the situation has been stable and we are back to normal operations in region RMA (region LPG was not affected). We are still working with the vendor in order to get a final fix in place and will provide you with an update once new information is available.

November 22, 2019 · 17:50 CET
Monitoring

We have most probably found the root cause of the outages we have been facing today. A first mitigation has proved to be very effective and reduced CPU utilization on the switches by almost 50%. We are working together with the vendor in order to get a workaround in place that will keep it at this level. Unfortunately, we cannot share more details right now because it remains unclear whether this is a targeted attack or not. The situation has been stable for the last two hours and we are doing everything we can to keep it that way. We will keep you posted.

November 22, 2019 · 15:00 CET
Update

Our networking infrastructure is currently working normally. However, it is still possible for performance or availability issues to recur. We continue investigating the issue and will take further measures as needed.

November 22, 2019 · 13:55 CET
Update

While our infrastructure is currently working, we are still investigating the issue together with the vendor. Further outages may still occur. For the time being, we cannot make any statement as to how long it will take to re-establish stable operations.

November 22, 2019 · 11:53 CET
Update

Statement from Manuel Schweizer, CEO, on 2019-11-25 at 12:35 CET

While the information published in the original statement held true at the time of writing, the situation was brought under control shortly afterwards and we were able to fully restore services within the next couple of minutes. Because the total duration of the incident could not be foreseen at the time of writing, our priority was to keep the downtime for the end users of our customers as short as possible and to prevent even greater damage. At that moment, it was the best course of action and we were convinced that this advice was in the best interest of our customers.

Original Statement

This is probably the toughest decision I have ever had to make as the CEO of this company. But I am hereby advising all of our customers to start executing their disaster recovery plans now. The situation is not under control. We see both default gateways crash on a regular basis due to high CPU load, with the kernel watchdog then killing random (important) processes. We have escalated to the vendor, but they have not been able to help so far. I know this probably does not mean a thing right now, but - coming from the bottom of my heart - I am really sorry that we have put you in this situation.

November 22, 2019 · 11:46 CET
Update

We are facing another outage.

November 22, 2019 · 10:18 CET
Monitoring

All services are healthy again. We will keep monitoring the situation and will follow up with next steps.

November 22, 2019 · 08:44 CET
Update

The situation is stabilizing. Several services need to be taken care of due to split-brain situations. We will keep you posted.

November 22, 2019 · 08:34 CET
Update

In order to ease the load on the CPU, we have completely disabled a control protocol (BFD), hoping to stabilize the situation again. Investigating.

November 22, 2019 · 08:25 CET
Update

We are facing a complete outage of the network in region RMA. Investigating.

November 22, 2019 · 08:07 CET
Issue

We are currently facing partial packet loss again and are investigating the situation. We will keep you posted.

November 22, 2019 · 07:50 CET
