Performance issues affecting multiple customers with little in common #21

Open
opened 2024-11-13 10:49:47 +00:00 by Brandon Currell · 5 comments

We've noticed a pattern of performance queries from multiple customers. In these instances, untunnelled link performance is as expected for the circuit that has been ordered; however, tunnelled link performance, and therefore overall available throughput, is reduced.
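As a rough illustration of the pattern described above, the check below flags a site where the untunnelled result is close to the ordered circuit speed but the tunnelled result falls well short of it. This is only a sketch: the `LinkTest` fields, the function names, and both threshold values are assumptions made for the example, not values from our testing process.

```python
from dataclasses import dataclass


@dataclass
class LinkTest:
    """One site's test results (hypothetical structure, speeds in Mbps)."""
    site: str
    ordered_mbps: float
    untunnelled_mbps: float
    tunnelled_mbps: float


def tunnel_ratio(t: LinkTest) -> float:
    """Fraction of the untunnelled throughput achieved through the tunnel."""
    if t.untunnelled_mbps <= 0:
        raise ValueError("untunnelled throughput must be positive")
    return t.tunnelled_mbps / t.untunnelled_mbps


def matches_pattern(t: LinkTest,
                    circuit_tolerance: float = 0.9,   # assumed threshold
                    tunnel_threshold: float = 0.8) -> bool:  # assumed threshold
    """True when the raw link performs as ordered but the tunnel does not."""
    raw_ok = t.untunnelled_mbps >= circuit_tolerance * t.ordered_mbps
    tunnel_low = tunnel_ratio(t) < tunnel_threshold
    return raw_ok and tunnel_low
```

A site where the untunnelled result is itself well below the ordered speed would not match, since that points at the circuit rather than the tunnel.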

At the time of opening this issue, the affected customers have little in common: they're spread across different servers, datacentres, circuit providers, and hardware deployments. For now, we're treating this as a problem with one of the following:

  • Possible issue with the testing process
  • Possible issue with the testing endpoints
  • Possible issue with the tunnel config
  • Possible issue with our network software

There is already a parent ticket on our internal ticket system, and customers known to have performance issues have already been bundled into it. Any additional performance queries will initially be checked independently, and then added to this parent ticket if we think it is necessary.

Brandon Currell added the Network software label 2024-11-13 10:49:47 +00:00
Author
Owner

We have deployed a new test head in our lab, connected to a 10Gbps leased line. We have confirmed that we can push ~7.5Gbps upload and download through this test head without any issues, which is more than enough for running down this issue.

Additional tests have been done to this endpoint from some of the sites that have open tickets. The results are mostly the same, allowing for fluctuation due to live customer traffic still passing on the connection. This rules out the testing endpoints as the point of issue.

Since we've proven that we can push 7.5Gbps with our testing methods on a raw IP connection, no tunnelling or aggregation involved, we're not currently focusing on the testing methods themselves. However, we will be trying out additional testing methods during this investigation, and will evaluate if we want to add them to our standard testing process going forward.

That leaves the tunnel config and the network software. These are very tightly coupled, so we will break them down into individual components, run each down independently, and then start building them back together to see whether it's a single-component issue or related to some combination of them.
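The "individual components first, then build back up" approach can be sketched as an ordered list of component subsets: every component alone, then every pair, and so on up to the full stack. The component names in the usage note are placeholders for illustration only; the actual breakdown of the tunnel config and network software is not listed here.

```python
from itertools import combinations


def isolation_plan(components):
    """Return component subsets to test, smallest subsets first.

    Singles come first (each component in isolation), followed by
    pairs, triples, and finally the full set of components together.
    """
    plan = []
    for size in range(1, len(components) + 1):
        plan.extend(combinations(components, size))
    return plan
```

For example, `isolation_plan(["tunnelling", "aggregation", "component_c"])` gives seven test cases: three singles, three pairs, and the full combination. The number of subsets doubles with each added component, so in practice the list would likely need pruning.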

A further update should be expected by the end of tomorrow.

Author
Owner

Update from the end of Friday:

We have started building out an automation environment to run multiple long-form (tedious) tests, collecting as much data as possible with minor iterations to config values and system settings. The next step is to write the control scripts around this automation environment, and to write a plan of action for what we're actually going to test with each iteration.
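One way to plan "minor iterations" is to start from a baseline and change a single setting per run, so that any change in throughput can be attributed to that one setting. The sketch below generates such a run list; the setting names and values in the usage note are hypothetical, and this is not the actual control script.

```python
def one_at_a_time(baseline, variations):
    """Build a list of configs: the baseline, then one change per run.

    baseline:   dict of setting name -> baseline value
    variations: dict of setting name -> list of alternative values
    """
    runs = [dict(baseline)]  # the first run is always the unmodified baseline
    for key, values in variations.items():
        if key not in baseline:
            raise KeyError(f"unknown setting: {key}")
        for value in values:
            if value == baseline[key]:
                continue  # skip values that would just repeat the baseline
            config = dict(baseline)
            config[key] = value
            runs.append(config)
    return runs
```

With a baseline of `{"mtu": 1500, "tx_queue_len": 1000}` and variations of `{"mtu": [1400, 1500], "tx_queue_len": [2000]}`, this produces three runs: the baseline, one with only the MTU changed, and one with only the queue length changed.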

A further update should be expected before the end of Wednesday.

Author
Owner

Due to other issues that cropped up this week, progress has been slow, but it is heading in the right direction.

The automation infrastructure has proven to be stable, so we will start using it in anger shortly. Development of the automated testing tools is ongoing. Once that is complete, an initial set of tests will be carried out on each affected connection to test the automation process in different environments, and then the full rundown of the issue can commence.

Next update will be provided before the end of next week with initial findings and next steps.


Progress on this investigation continues to move in the right direction. While we haven’t yet reached a final fix, the preparatory work has been completed, and we continue the diagnostic process that will pinpoint the root cause.

The automation infrastructure build phase is complete and the test environment is now operational and stable. We’ve begun testing it across various environments. These initial tests are focused on ensuring the robustness of our tools and collecting key data points. Once this stage is complete, we’ll shift to running the full suite of tests required to isolate and resolve the issue.

We appreciate your patience as we continue to work through this methodically. Another update will be provided before the end of next week with further findings and the next steps.


We have a proven workaround that has been implemented at a handful of sites. If you are being negatively impacted, please contact the service desk and we can arrange for the workaround to be implemented.

Reference: ev/issues#21