How TCP's congestion control saved the internet
With the annual SIGCOMM conference taking place this month, it's worth noting that congestion control remains an active research topic a full 35 years after the first paper on TCP congestion control was published -- a reminder of how central congestion management has been to the success of the internet.
Recently, in my talk and article on the history of networking spanning 60 years, I received several comments about various competing networking technologies that emerged during the same period. These included the OSI stack, the Coloured Book protocols, and ATM (Asynchronous Transfer Mode), which was the first networking protocol I worked on extensively. At the time, many believed that ATM would become the dominant packet-switching technology worldwide.
In retrospect, it's evident that congestion control played a key role in propelling the internet from a moderate-scale network to a global one. Those in favor of ATM used to refer to Ethernet and TCP/IP as "legacy" protocols that could potentially be carried over the global ATM network once established. I vividly recall Steve Deering, a pioneer of IP networking, boldly asserting that ATM would never gain enough success to even be considered a legacy protocol.
When I recounted the history of networking earlier, I omitted these other protocols to keep things concise. My colleague Larry Peterson and I strive for brevity, particularly after receiving a one-star review on Amazon criticizing our book as a "wall of text." However, my focus was on highlighting how TCP/IP emerged as the dominant protocol suite, achieving global (or near-global) penetration.
There are various theories as to why TCP/IP prevailed over its contemporaries, but they are hard to test definitively. Most likely several factors contributed to the success of the internet protocols, with congestion control ranking as a pivotal element that enabled the transition from moderate to global scale.
Moreover, it's intriguing to study how the architectural choices made in the 1970s have proven their worth over the subsequent decades.
One of the design goals stated in David Clark's paper, "The Design Philosophy of the DARPA Internet Protocols," is the distributed management of internet resources. Jacobson and Karels' implementation of congestion control in TCP exemplifies adherence to this principle. Their approach also aligns with another internet design goal: accommodating many different types of networks. Consequently, this philosophy rules out relying on network-based admission control, in sharp contrast to protocols like ATM, which required end-systems to request resources from the network before data could flow.
An essential aspect of the philosophy to accommodate diverse networks is the acknowledgment that not all networks have admission control. When combined with distributed resource management, it necessitates end-systems taking responsibility for congestion control, precisely what Jacobson and Karels achieved with their initial TCP changes.
The history of TCP congestion control is extensive enough to fill a book -- which it has. However, the work carried out at Berkeley from 1986 to 1988, particularly Jacobson's 1988 SIGCOMM paper, had a profound influence, and that paper became one of the most cited networking papers of all time. It introduced concepts such as slow-start, AIMD (additive increase, multiplicative decrease), RTT estimation, and the use of packet loss as a congestion signal, establishing the groundwork for decades of subsequent congestion control research. The paper's enduring influence stems from its solid foundation and the ample room it left for future improvement, as evidenced by ongoing efforts to enhance congestion control today.
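To make those ideas concrete, here is a minimal sketch of slow-start and AIMD window adjustment. The class, method names, and constants are my own simplifications for exposition, not the actual BSD code of Jacobson and Karels.

```python
# Illustrative sketch of slow-start and AIMD (additive increase,
# multiplicative decrease) in the spirit of Jacobson's 1988 paper.
# Names and constants are simplified; this is not the BSD implementation.

MSS = 1  # measure the congestion window in segments for simplicity

class CongestionWindow:
    def __init__(self):
        self.cwnd = 1 * MSS        # start with one segment (slow start)
        self.ssthresh = 64 * MSS   # threshold between slow start and AIMD

    def on_ack(self):
        """Called once per ACK received."""
        if self.cwnd < self.ssthresh:
            # Slow start: one extra segment per ACK, which roughly
            # doubles the window every round-trip time.
            self.cwnd += MSS
        else:
            # Congestion avoidance: additive increase of about one
            # segment per round-trip time.
            self.cwnd += MSS * MSS / self.cwnd

    def on_loss(self):
        """Called when a packet loss is taken as a congestion signal."""
        # Multiplicative decrease: remember half the current window as
        # the new threshold, then restart slow start from one segment.
        self.ssthresh = max(self.cwnd / 2, 2 * MSS)
        self.cwnd = 1 * MSS
```

The key point is that every adjustment is made by the sender alone, using nothing more than ACKs and losses -- no help from the network is assumed.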
The underlying problem is inherently hard: millions of end-systems with no direct contact with each other must cooperate to share the bandwidth of bottleneck links fairly. And they must do so using only the information that can be gleaned by sending packets into the network and observing whether, and when, they reach their intended destinations.
Arguably one of the most significant advancements after 1988 was the realization by Brakmo and Peterson (yes, that guy) that congestion was not only signaled by packet loss, but also by increasing delay. This insight formed the basis of the TCP Vegas paper in 1994, which proposed using delay as an indicator of congestion, a controversial idea at the time.
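As a rough illustration of the delay-based idea (a sketch in the spirit of Vegas, not a faithful reproduction of the 1994 paper, whose thresholds and details differ), a sender can compare the throughput it would expect at the minimum observed RTT with the throughput it actually achieves, and back off when the gap grows:

```python
# Sketch of delay-based congestion detection in the spirit of TCP Vegas.
# The thresholds alpha and beta are illustrative values, not those
# specified in the original paper.

def vegas_adjust(cwnd, base_rtt, current_rtt, alpha=2, beta=4):
    """Adjust the congestion window once per RTT using the gap between
    expected and actual throughput (measured in segments)."""
    expected = cwnd / base_rtt          # rate if no queuing were occurring
    actual = cwnd / current_rtt         # rate actually being achieved
    queued = (expected - actual) * base_rtt  # estimate of segments queued

    if queued < alpha:
        return cwnd + 1   # little queuing delay: increase linearly
    elif queued > beta:
        return cwnd - 1   # delay rising, queues building: back off
    else:
        return cwnd       # within the target range: hold steady
```

The attraction of this approach is that rising delay warns of congestion before any packet is actually dropped.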
Vegas sparked a new trend in congestion control research, inspiring many other efforts to consider delay as an early warning sign before loss occurs. Examples of such efforts include Data center TCP (DCTCP) and Google's BBR.
One reason I credit congestion control algorithms with the success of the internet is that in 1986, the pathway to failure was evident. Jacobson recounts early episodes of congestion collapse, in which throughput plummeted by three orders of magnitude.
Even in 1995, when I joined Cisco, we still heard tales of catastrophic congestion episodes from customers. That same year, Bob Metcalfe, the inventor of Ethernet and a recent Turing Award recipient, famously predicted the collapse of the internet due to the rapid growth in traffic driven by consumer internet access and the rise of the web. But it didn't happen.
Congestion control has continued to evolve, with protocols like QUIC offering improved mechanisms for congestion detection and allowing experimentation with multiple congestion control algorithms. Additionally, congestion control has extended to the application layer, as seen in Dynamic Adaptive Streaming over HTTP (DASH).
One interesting consequence of the congestion episodes of the 1980s and 1990s was the realization that buffers that were too small could sometimes cause congestion collapse. An influential paper by Villamizar and Song showed that TCP performance suffered when the amount of buffering at the bottleneck was less than the bandwidth-delay product of the flows -- the average round-trip delay multiplied by the bottleneck bandwidth.
Unfortunately, this finding held only for small numbers of flows, as the paper itself acknowledged. Nevertheless, it was widely interpreted as an inviolable rule and influenced router designs for years afterward. Appenzeller et al.'s buffer sizing work in 2004 finally debunked the misconception, but not before the unfortunate phenomenon of Bufferbloat -- excessively large buffers leading to significant queuing delays -- had made its way into millions of low-end routers. It's worth checking your home network for Bufferbloat.
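To get a sense of the numbers, here is a back-of-the-envelope calculation contrasting the classic bandwidth-delay-product rule with the Appenzeller et al. result that, with many desynchronized flows, a buffer of roughly the BDP divided by the square root of the number of flows suffices. The link speed, RTT, and flow count are made-up values chosen purely for illustration.

```python
import math

# Hypothetical numbers for illustration only.
link_bandwidth_bps = 10e9   # 10 Gb/s bottleneck link
avg_rtt_s = 0.1             # 100 ms average round-trip time
num_flows = 10_000          # long-lived flows sharing the link

# Classic rule of thumb (Villamizar and Song): buffer = bandwidth x delay.
bdp_bits = link_bandwidth_bps * avg_rtt_s
print(f"BDP rule: {bdp_bits / 8 / 1e6:.0f} MB of buffering")      # 125 MB

# Appenzeller et al. (2004): with many flows, BDP / sqrt(n) suffices.
small_buffer_bits = bdp_bits / math.sqrt(num_flows)
print(f"BDP/sqrt(n) rule: {small_buffer_bits / 8 / 1e6:.2f} MB")  # ~1.25 MB
```

The two-orders-of-magnitude difference in this hypothetical example shows why the original rule, applied where it did not hold, led to routers with far more buffering than they needed.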
While we cannot conduct controlled experiments to precisely understand how the internet succeeded while other protocol suites failed, we can observe that timely implementation of congestion control played a pivotal role in preventing potential failure. In 1986, it was relatively easy to experiment with new ideas by modifying the code in a few end systems and effectively deploying the solution across a wide range of systems. No changes were necessary within the network itself. The relatively small set of operating systems requiring modification, along with the community capable of making those changes, facilitated the widespread deployment of the initial BSD-based algorithms developed by Jacobson and Karels.
It is evident that there is no such thing as a perfect congestion control approach, which explains why we continue to see new papers on the topic 35 years after Jacobson's pioneering work. Nonetheless, the architecture of the internet has fostered an environment in which effective solutions for the distributed management of shared resources can be developed, tested, and deployed.
In my opinion, this is a remarkable testament to the quality of the internet's architecture.