If top-tier cloud providers use similar network hardware in their data centers and connect to the same transit and peering bandwidth providers, how can SoftLayer claim to provide the best network performance in the cloud computing industry?
Over the years, I've heard variations of that question asked dozens of times, and it's fairly easy to answer with impressive facts and figures. All SoftLayer data centers and network points of presence (PoPs) are connected to our unique global network backbone, which carries public, private, and management traffic to and from servers. Using our network connectivity table, some back-of-the-envelope calculations reveal that we have more than 2,500Gbps of bandwidth connectivity with some of the largest transit and peering bandwidth providers in the world (and that total doesn't even include the private peering relationships we have with other providers in various regional markets). Additionally, customers may order servers with up to 10Gbps network ports in our data centers.
For the most part, those stats explain our differentiation, but part of the bigger network performance story is still missing, and to a certain extent it has been untold—until today.
The 2,500+Gbps of bandwidth connectivity we break out in the network connectivity table only accounts for the on-ramps and off-ramps of our network. Our global network backbone is actually made up of an additional 2,600+Gbps of bandwidth connectivity ... and all of that backbone connectivity transports SoftLayer-related traffic.
This robust network architecture streamlines the access to and delivery of data on SoftLayer servers. When you access a SoftLayer server, the network is designed to bring you onto our global backbone as quickly as possible at one of our network PoPs, and when you're on our global backbone, you'll experience fewer hops (and a more direct route that we control). When one of your users requests data from your SoftLayer server, that data travels across the global backbone to the nearest network PoP, where it is handed off to another provider to carry the data the "last mile."
With this controlled environment, I decided to undertake an impromptu science experiment to demonstrate how location and physical distance affect network performance in the cloud.
Speed Testing on the SoftLayer Global Network Backbone
I work in the SoftLayer office in downtown Houston, Texas. In network-speak, this location is HOU04. You won't find that location on any data center or network tables because it's just an office, but it's connected to the same global backbone as our data centers and network points of presence. From my office, the "last mile" doesn't exist; when I access a SoftLayer server, my bits and bytes only travel across the SoftLayer network, so we're effectively cutting out a number of uncontrollable variables in the process of running network speed tests.
For better or worse, I didn't tell any network engineers that I planned to run speed tests to every available data center and share the results I found, so you're seeing exactly what I saw with no tomfoolery. I just fired up my browser, headed to our Data Centers page, and made my way down the list using the SpeedTest option for each facility. Customers often go through this process when trying to determine the latency, speeds, and network path that they can expect from servers in each data center, but if we look at the results collectively, we can learn a lot more about network performance in general.
With the results, we'll discuss how network speed tests work, what the results mean, and why some might be surprising. If you're feeling scientific and want to run the tests yourself, you're more than welcome to do so.
The Ookla SpeedTests we link to from the data centers table measure the latency (ping time), jitter (variation in latency), download speeds, and upload speeds between the user's computer and the data center's test server. To run this experiment, I connected my MacBook Pro via Ethernet to a 100Mbps wired connection. At the end of each speed test, I took a screenshot of the performance stats.
To save you the trouble of reading the stats for each data center from the screenshots, I also put them into a table:
| Data Center | Latency (ms) | Download Speed (Mbps) | Upload Speed (Mbps) | Jitter (ms) |
By performing these speed tests on the SoftLayer network, we can actually learn a lot about how speed tests work and how physical location affects network performance. But before we get into that, let's take note of a few interesting results from the table above:
- The lowest latency from my office is to the HOU02 (Houston, Texas) data center. That data center is about 14.2 miles away as the crow flies.
- The highest latency results from my office are to the SYD01 (Sydney, Australia) and SNG01 (Singapore) data centers. Those data centers are at least 8,600 and 10,000 miles away, respectively.
- The fastest download speed observed is 93.16Mbps, and that number was seen from two data centers: DAL01 and DAL05.
- The slowest download speed observed is 40.35Mbps from SNG01.
- The fastest upload speed observed is 87.43Mbps to DAL01.
- The slowest upload speed observed is 72.35Mbps to SNG01.
- The upload speeds observed are faster than the download speeds from every data center outside of North America.
Are you surprised that we didn't see any results closer to 100Mbps? Is our server in Singapore underperforming? Are servers outside of North America greedy about receiving data and stingy about giving it back?
Those are great questions, and they actually jumpstart an explanation of how the network tests work and what they're telling us.
Maximum Download Speed on 100Mbps Connection
If my office is 2 milliseconds from the test server in HOU02, why is my download speed only 93.12Mbps? To answer this question, we need to understand that to perform these tests, a connection is made using Transmission Control Protocol (TCP) to move the data, and TCP does a lot of work in the background. The download is broken into a number of tiny chunks called packets and sent from the sender to the receiver. TCP wants to ensure that each packet that is sent is received, so the receiver sends an acknowledgement back to the sender to confirm that the packet arrived. If the sender is unable to verify that a given packet was successfully delivered to the receiver, the sender will resend the packet.
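To make the acknowledge-and-resend idea concrete, here is a toy sketch in Python. It is a deliberate simplification: real TCP keeps many packets in flight at once and relies on sequence numbers, timers, and cumulative acknowledgements, none of which are modeled here. The function name `send_reliably` and the `lost_attempts` parameter are made up for illustration.

```python
def send_reliably(chunks, lost_attempts):
    """Deliver chunks in order, resending any chunk whose attempt is lost.

    lost_attempts is a set of (chunk_index, attempt_number) pairs that the
    simulated network drops. Every other attempt is delivered and acknowledged.
    """
    received = []
    transmissions = 0
    for index, chunk in enumerate(chunks):
        attempt = 0
        while True:
            transmissions += 1
            if (index, attempt) in lost_attempts:
                # No acknowledgement came back, so the sender tries again.
                attempt += 1
                continue
            # The receiver got the chunk and acknowledged it.
            received.append(chunk)
            break
    return received, transmissions


chunks = [b"first", b"second", b"third"]
received, transmissions = send_reliably(chunks, lost_attempts={(1, 0), (1, 1)})
print(received == chunks, transmissions)
```

Every chunk still arrives, but the two dropped attempts cost two extra transmissions, which is time the connection spends not moving new data.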
This system sounds pretty simple, but in practice, it's very dynamic. TCP wants to be as efficient as possible ... to keep as much data moving as it can without overwhelming the receiver or the network. To accomplish this, TCP adjusts how much data the sender may have in flight before it has to stop and wait for acknowledgements. The receiver caps that amount by advertising a receive window, and it adjusts that window over the life of the connection to allow as much data as possible without becoming unstable. Some operating systems are better than others when it comes to tweaking and optimizing TCP transfer rates, but the work TCP does to ensure that packets are sent and received without error adds overhead, and that overhead limits the maximum speed we can achieve.
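For a rough sense of how much of a 100Mbps link is left for the actual download, we can count the header bytes wrapped around every full-size packet. The sketch below assumes standard Ethernet framing with a 1,500-byte MTU, IPv4, and no VLAN tag. Those are my assumptions, not details of the test setup, and the function name is made up for illustration.

```python
# Per-frame Ethernet cost on the wire: preamble and start delimiter (8 bytes),
# Ethernet header (14), frame check sequence (4), interframe gap (12).
ETHERNET_OVERHEAD_BYTES = 8 + 14 + 4 + 12
MTU_BYTES = 1500  # assumed standard Ethernet MTU


def max_goodput_mbps(link_mbps, ip_header=20, tcp_header=20, tcp_options=0):
    """Upper bound on TCP payload throughput for full-size packets."""
    payload = MTU_BYTES - ip_header - tcp_header - tcp_options
    wire_bytes = MTU_BYTES + ETHERNET_OVERHEAD_BYTES
    return link_mbps * payload / wire_bytes


print(round(max_goodput_mbps(100), 1))                  # no TCP options
print(round(max_goodput_mbps(100, tcp_options=12), 1))  # with TCP timestamps
```

Under those assumptions, the ceiling lands in the 94 to 95Mbps range before a single acknowledgement or retransmission is counted, so results in the low 90s on a 100Mbps port are about as good as it gets.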
Understanding the SNG01 Results
Why did my SNG01 speed test max out at a meager 40.35Mbps on my 100Mbps connection? Well, now that we understand how TCP is working behind the scenes, we can see why our download speeds from Singapore are lower than we'd expect. Latency between the sending and successful receipt of a packet plays into TCP’s considerations of a stable connection. Because the sender can only have one window's worth of unacknowledged data in flight at a time, higher ping times mean it spends more time waiting for acknowledgements, so the same window delivers less throughput than it would at lower ping times (and any lost packet takes longer to detect and resend).
Even with our global backbone optimizing the network path of the packets between Houston and Singapore, the more than 10,000-mile journey, the nature of TCP, and my computer's TCP receive window adjustments all factor into the download speeds recorded from SNG01. Considering the distance the data has to travel, our results are actually well within the expected performance.
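The relationship between window size, latency, and throughput can be sketched with one line of arithmetic: a sender can deliver at most one window of data per round trip. The 512KiB window below is a hypothetical value picked for illustration (I didn't capture my laptop's actual window during the tests), and the round-trip times are the ones mentioned in this post.

```python
def window_limited_mbps(window_bytes, rtt_ms, link_mbps):
    """Throughput ceiling when at most one window is delivered per round trip."""
    window_bits = window_bytes * 8
    ceiling_mbps = window_bits / (rtt_ms / 1000) / 1_000_000
    # The link itself is still a hard limit, however large the window is.
    return min(link_mbps, ceiling_mbps)


HYPOTHETICAL_WINDOW_BYTES = 512 * 1024  # assumed, not measured

for rtt_ms in (2, 50, 114):
    ceiling = window_limited_mbps(HYPOTHETICAL_WINDOW_BYTES, rtt_ms, link_mbps=100)
    print(f"{rtt_ms} ms round trip: at most {ceiling:.1f} Mbps")
```

With a fixed window, the ceiling falls as the round trip gets longer, which is the same pattern the speed tests show as the data centers get farther from Houston.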
Because the default behavior of TCP is partially to blame for the results, we could tweak the test and tune our configurations to deliver faster speeds. To confirm that improvements can be made relatively easily, we can just look at the answer to our third question...
Upload > Download?
Why are the upload speeds faster than the download speeds after latency jumps from 50ms to 114ms? Every location in North America is within 2,000 miles of Houston, while the closest location outside of North America is about 5,000 miles away. With what we've learned about how TCP and physical distance play into download speeds, that jump in distance explains why the download speeds drop from 90.33Mbps to 77.41Mbps as soon as we cross an ocean, but how can the upload speeds to Europe (and even APAC) stay on par with their North American counterparts? The only difference between our download path and upload path is which side is sending and which side is receiving. And if the receiver determines the size of the TCP receive window, the most likely culprit in the discrepancy between download and upload speeds is TCP windowing.
A Linux server is built and optimized to be a server, whereas my Mac OS X laptop has a lot of other responsibilities, so it shouldn't come as a surprise that the default TCP receive window handling is better on the server side. With changes to the way my laptop handles TCP, download speeds would likely improve significantly. Additionally, if we wanted to push the envelope even further, we might consider using a different transfer protocol to take advantage of the consistent, controlled network environment.
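How big would the receive window need to be? A common rule of thumb is the bandwidth-delay product: the link speed multiplied by the round-trip time. The sketch below computes it for the 50ms and 114ms latencies mentioned above, and shows one way an application can ask for a larger receive buffer using Python's standard `socket` module. The helper names are made up for illustration, and the operating system is free to clamp or adjust whatever size is requested.

```python
import math
import socket


def required_window_bytes(link_mbps, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    bits_in_flight = link_mbps * 1_000_000 * rtt_ms / 1000
    return math.ceil(bits_in_flight / 8)


def request_receive_buffer(sock, size_bytes):
    """Ask the OS for a receive buffer and return the size it reports back."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


for rtt_ms in (50, 114):
    needed = required_window_bytes(100, rtt_ms)
    print(f"{rtt_ms} ms round trip: about {needed:,} bytes to fill 100 Mbps")
```

On some systems, setting the buffer size explicitly turns off the operating system's automatic tuning for that socket, so system-wide settings are often the better place to experiment.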
The Importance of Physical Location in Cloud Computing
These real-world test results under controlled conditions demonstrate how much data's geographic proximity to its user affects the user's perceived network performance. We know that the network latency in a 14-mile trip will be lower than the latency in a 10,000-mile trip, but we often don't think about the ripple effect latency has on other network performance indicators. And this experiment actually controls a lot of other variables that can exacerbate the performance impact of geographic distance. The tests were run on a 100Mbps connection because that's a pretty common maximum port speed, but if we ran the same tests on a GigE line, the difference would be even more dramatic. Proof: HOU02 @ 1Gbps v. SNG01 @ 1Gbps
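Physics sets a floor under those latency numbers. As a back-of-the-envelope sketch, the snippet below estimates the minimum round-trip time over a straight run of fiber, assuming light travels through fiber at roughly 200,000 km per second (about two-thirds of its speed in a vacuum). That speed is a common approximation rather than a measured property of any particular cable, and real fiber routes are longer than the as-the-crow-flies distances used here.

```python
KM_PER_MILE = 1.609344
FIBER_KM_PER_SECOND = 200_000  # assumed approximation for light in fiber


def min_round_trip_ms(distance_miles):
    """Lower bound on round-trip time over a straight fiber path."""
    one_way_seconds = distance_miles * KM_PER_MILE / FIBER_KM_PER_SECOND
    return 2 * one_way_seconds * 1000


for name, miles in (("HOU02", 14.2), ("SYD01", 8_600), ("SNG01", 10_000)):
    print(f"{name}: at least {min_round_trip_ms(miles):.2f} ms round trip")
```

No amount of network engineering gets a packet to Singapore and back faster than that floor, which is why the only way to give distant users low latency is to put the data closer to them.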
Let's apply our experiment to a real-world example: Half of our site's user base is in Paris and the other half is in Singapore. If we chose to host our cloud infrastructure exclusively from Paris, our users would see dramatically different results. Users in Paris would have sub-10ms latency while users in Singapore would have about 300ms of latency. Obviously, operating cloud servers in both markets would be the best way to ensure peak performance in both locations, but what if we can only afford to provision our cloud infrastructure in one location? Where would we choose to provision that infrastructure to provide a consistent user experience for our audience in both markets?
Given what we've learned, we should probably choose a location with roughly the same latency to both markets. We can use the SoftLayer Looking Glass to see that San Jose, California (SJC01) would be a logical midpoint ... As I write this, the latency between SJC and PAR on the SoftLayer backbone is 149ms, and the latency between SJC and SNG is 162ms, so both would experience very similar performance (all else being equal). Our users in the two markets won't experience mind-blowing speeds, but neither group will experience mind-numbingly slow speeds.
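That decision can be written down as a simple rule: pick the location whose worst-off market has the lowest latency. The sketch below applies it to the numbers from this example, using 10ms as a stand-in for the "sub-10ms" Paris figure. The function name and the shape of the data are made up for illustration.

```python
def fairest_location(latency_by_site):
    """Return the site whose highest latency to any market is the lowest."""
    return min(
        latency_by_site,
        key=lambda site: max(latency_by_site[site].values()),
    )


# Round-trip latency in milliseconds from each hosting option to each market.
latency_ms = {
    "Paris": {"Paris users": 10, "Singapore users": 300},
    "SJC01": {"Paris users": 149, "Singapore users": 162},
}

print(fairest_location(latency_ms))
```

Minimizing the worst case is only one way to define "consistent." If most of the revenue came from one market, weighting the latencies by audience size might be the better call.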
The network performance implications of physical distance apply to all cloud providers, but because of the SoftLayer global network backbone, we're able to control many of the variables that lead to higher (or inconsistent) latency to and from a given data center. The longer traffic stays on a single provider's network, the more efficiently that traffic will move. You might see the same latency to another provider's cloud infrastructure from a given location at a given time across the public Internet, but you certainly won't see the same consistency from all locations at all times. SoftLayer has spent millions of dollars to build, maintain, and grow our global network backbone to transport public and private network traffic, and as a result, we feel pretty good about claiming to provide the best network performance in cloud computing.