Tunnels experiencing unexpected performance degradation #30
Summary
In rare cases, aggregated connections see high CPU usage and reduced throughput. This is not always caused by the underlying ISP links that make up the connection (although link faults can still occur), but rather by how the tunnels are processed and aggregated by the overall platform. Our team is investigating the underlying conditions and testing both mitigations and alternative methods to ensure stable long-term performance. We remain committed to providing high-bandwidth, high-performance, multi-link connectivity for internet, WAN and cloud, and this issue has the highest priority within the development team.
Technical detail
We have identified cases where OpenVPN TAP (Layer 2) tunnels are not performing as expected. Tunnels that previously performed reliably for years are now, in some rare circumstances, showing very high CPU usage and reduced bandwidth, even though the underlying links and hardware are capable of more.
Symptoms
Scope
Background
How OpenVPN TAP consumes CPU
OpenVPN handles TAP traffic in userspace, which means every Ethernet frame has to be copied from the kernel into the OpenVPN process, encrypted, and then copied back again. The repeated copying and context switching adds significant per-packet overhead, especially for high packet-per-second workloads. CPU usage scales directly with the packet rate, so throughput can be significantly reduced even when the total available bandwidth is much higher.
Under normal circumstances our networking stack prevents CPU saturation with various queuing and calibration mechanisms so that higher latency, packet loss and lower throughput are avoided.
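To illustrate the packet-rate scaling described in this section, the following sketch computes how many frames per second a tunnel must process to carry a given throughput at different frame sizes. The 500 Mbit/s figure and the frame sizes are illustrative assumptions, not measurements from this investigation, and the function name is hypothetical.

```python
def packets_per_second(throughput_mbps: float, frame_bytes: int) -> float:
    """Frames per second needed to carry throughput_mbps at a given frame size."""
    if frame_bytes <= 0:
        raise ValueError("frame_bytes must be positive")
    bits_per_frame = frame_bytes * 8
    return throughput_mbps * 1_000_000 / bits_per_frame


if __name__ == "__main__":
    # Assumed example load: 500 Mbit/s carried in large, medium and small frames.
    for size in (1500, 512, 64):
        pps = packets_per_second(500, size)
        print(f"{size:>5} byte frames at 500 Mbit/s: {pps:,.0f} frames/s")
```

At the same bandwidth, 64 byte frames require more than twenty times as many frames per second as 1500 byte frames, and each of those frames incurs the per-packet copy and context-switch cost in a userspace tunnel.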
How GRETAP consumes CPU
GRETAP tunnels operate differently: they are implemented inside the kernel networking stack, so frames are encapsulated in GRE and forwarded without userspace involvement. This avoids the copy overhead of OpenVPN, but places more emphasis on the kernel’s packet processing pipeline. Performance is therefore more sensitive to:
The result is that GRETAP may handle large packets more efficiently than OpenVPN TAP, but can still hit CPU limits with high rates of small packets or when interrupt moderation is not optimal.
In practice this means that while both TAP and GRETAP can exhibit high CPU usage, the nature of the load is different.
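One way to see the difference in the nature of the load is to compare where CPU time is spent. On Linux, userspace tunnel work generally appears as user and system time, while in-kernel packet processing generally appears as softirq time. This is a general expectation, not a result from this investigation. The sketch below takes two samples of the aggregate `cpu` line from `/proc/stat` and reports the share of time in each category. The helper names are hypothetical, and it assumes a kernel that reports at least the first eight fields.

```python
import time

# First eight counters on the aggregate "cpu" line of /proc/stat.
# The guest counters are skipped because guest time is already included in user/nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line into a dict of tick counters."""
    parts = line.split()
    if len(parts) < 1 + len(FIELDS) or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_shares(before: str, after: str) -> dict:
    """Fraction of CPU time spent in each category between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {k: b[k] - a[k] for k in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {k: delta[k] / total for k in FIELDS}


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(1)
    second = read_cpu_line()
    for name, share in cpu_shares(first, second).items():
        print(f"{name:>8}: {share:6.1%}")
```

A high `softirq` share under load points toward kernel packet processing, while a high `user` plus `system` share points toward a userspace process. Per-process tools are still needed to confirm which process is responsible.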
Why TAP/GRETAP are sensitive here
Although we are not carrying a full Ethernet broadcast domain across sites in most circumstances, TAP and GRETAP still encapsulate and process traffic as raw Ethernet frames. This means:
Possible contributing factors
While the root cause is still under investigation, factors that may play a role include:
These factors may combine to create a situation where TAP and GRETAP tunnels require significant CPU for modest throughput, even when the same designs previously performed well.
Current status
Next steps