5G and the End-to-End Argument
One of the most surprising things I learned while writing the 5G Book with Oguz Sunay is how the cellular network’s history, starting 40+ years ago, parallels that of the Internet. And while from an Internet-centric perspective the cellular network is just one of many possible access network technologies, the cellular network in fact shares many of the “global connectivity” design goals of the Internet. That is, the cellular network makes it possible for a cell phone user in New York to call a cell phone in Tokyo, then fly to Paris and do the same thing again. In short, the 3GPP standard federates independently operated and locally deployed Radio Access Networks (RAN) into a single logical RAN with global reach, much as the Internet federated existing packet-switched networks into a single global network.
For many years, the dominant use case for the cellular network has been access to cloud services. With 5G expected to connect everything from home appliances to industrial robots to self-driving cars, the cellular network will be less-and-less about humans making voice calls and increasingly about interconnecting swarms of autonomous devices working on behalf of those humans to the cloud. This raises the question: Are there artifacts or design decisions in the 3GPP-defined 5G architecture working at cross-purposes with the Internet architecture?
Another way to frame this question is: How might we use the end-to-end argument—which is foundational to the Internet’s architecture—to drive the evolution of the cellular network? In answering this question, two issues jump out at me, identity management and session management, both of which are related to how devices connect to (and move throughout) the RAN.
The 5G architecture leverages the fact that each device has an operator-provided SIM card, which uniquely identifies the subscriber with a 15-digit International Mobile Subscriber Identity (IMSI). The SIM card also specifies the radio parameters (e.g., frequency band) needed to communicate with that operator’s Base Stations, and includes a secret key that the device uses to authenticate itself to the network. The IMSI is a globally unique id and plays a central role in devices being mobile across the RAN, so in that sense it plays the same role as an IP address in the Internet architecture. But if you instead equate the 5G network with a layer 2 network technology, then the IMSI is effectively the device’s “ethernet address.”
Ethernet addresses are also globally unique, but the Internet architecture makes no attempt to track them with a global registry or treat them as a globally routable address. The 5G architecture, on the other hand, does, and it is a major source of complexity in the 3GPP Mobile Core. Doing so is necessary for making a voice call between two cell phones anywhere in the world, but is of limited value for cloud-connected devices deployed on a manufacturing floor, with no aspiration for global travel. Setting aside (for the moment) the question of how to also support traditional voice calls without tracking IMSI locations, the end-to-end argument suggests we leave global connectivity to IP, and not try to also provide it at the link layer.
Let’s turn from identity management to session management. Whenever a mobile device becomes active, the nearest Base Station initiates the establishment of a sequence of secure tunnels connecting the device back to the Mobile Core, which in turn bridges the RAN to the Internet. (You can find more details on this process here.) Support for mobility can then be understood as the process of re-establishing the tunnel(s) as the device moves throughout the RAN, where the Mobile Core’s user plane buffers in-flight data during the handover transition. This avoids dropped packets and subsequent end-to-end retransmissions, which may make sense for a voice call, but not necessarily for a TCP connection to a cloud service. As before, it may be time to apply the end-to-end argument to the cellular network’s architecture in light of today’s (and tomorrow’s) dominant use cases.
To complicate matters, sessions are of limited value. The 5G network maintains the session only when the same Mobile Core serves the device and only the Base Station changes. This is often the case for a device moving within some limited geographic region, but moving between regions—and hence, between Mobile Cores—is indistinguishable from power cycling the device. The device is assigned a new IP address and no attempt is made to buffer and subsequently deliver in-flight data. This is important because any time a device becomes inactive for a period of time, it also loses its session. A new session is established and a new IP address assigned when the device becomes active. Again, this makes sense for a voice call, but not necessarily for a typical broadband connection, or worse yet, for an IoT device that powers down as a normal course of events. It is also worth noting that cloud services are really good at accommodating clients who’s IP addresses change periodically (which is to say, when the relevant identity is at the application layer).
This is all to say that the cellular network’s approach, which can be traced to its roots as a connection-oriented voice network, is probably not how one would design the system today. Instead, we can use IP addresses as the globally routable identifier, lease IP addresses to often-sleeping and seldom-moving IoT devices, and depend on end-to-end protocols like TCP to retransmit packets dropped during handovers. Standardization and interoperability will still be needed to support global phone calls, but with the ability to implement voice calls entirely on top of IP, it’s not clear the Mobile Core is the right place to solve that problem. And even if it is, this could potentially be implemented as little more than legacy APIs supported for backward compatibility. In the long term, it will be interesting to see if 3GPP-defined sessions hold up well as the foundation for an architecture that fully incorporates cellular radio technology into the cloud.
We conclude by noting that while we have framed this discussion as a thought experiment, it illustrates the potential power of the software-defined architecture being embraced by 5G. With the Mobile Core in particular implemented as a set of micro-services, an incremental evolution that addresses the issues outlined here is not only feasible, but actually quite likely. This is because history teaches us that once a system is open and programmable, the dominant use cases will eventually correct for redundant mechanisms and sub-optimal design decisions.
Why 5G Matters
Over the last month I undertook a detailed review of a new book in the Systems Approach series, 5G Mobile Networks: A Systems Approach by Larry Peterson and Oguz Sunay. Talking to people outside the technology world about my work, I soon found myself trying to explain "why does 5G matter" to all sorts of folks without a technical background. At this point in 2020, we can generally assume people know two things about 5G: the Telcos are marketing it as the greatest innovation ever (here's a sample); and conspiracy theorists are having a field day telling us all the things that 5G causing or covering up (which has in turn led to more telco ads like this one). By the end of reviewing the new book from Larry and Oguz, I felt I had finally grasped why 5G matters. Spoiler alert: I'm not going to bother debunking conspiracy theories, but I do think there is something quite important going on with 5G. And frankly, there is plenty of hype around 5G, but behind that hype are some significant innovations.
What is clear about 5G, technically, is that there will be a whole lot of new radio technologies and new spectrum allocation, which will enable yet another upgrade in speeds and feeds. If you are a radio person that's quite interesting–there is plenty of innovation in squeezing more bandwidth out of wireless channels. It's a bit harder to explain why more bandwidth will make a big difference to users, simply because 4G generally works pretty well. Once you can stream video at decent resolution to your phone or tablet, it's a bit hard to make a case for the value of more bandwidth alone. A more subtle issue is bandwidth density–the aggregate bandwidth that can be delivered to many devices in a certain area. Think of a sporting event as a good example (leaving aside the question of whether people need to watch videos on their phones at sporting events).
Lowering the latency of communication starts to make the discussion more interesting–although not so much to human users, but as an enabler of machine-to-machine or Internet-of-things applications. If we imagine a world where cars might communicate with each other, for example, to better manage road congestion, you can see a need for very low latency coupled with very high reliability–which is another dimension that 5G aims to address. And once we start to get to these scenarios, we begin to see why 5G isn't just about new radio technology, but actually entails a whole new mobile network architecture. Lowering latency and improving availability aren't just radio issues, they are system architecture issues. For example, low latency requires that a certain set of functions move closer to the edge–an approach sometimes called edge computing or edge clouds.
The Importance of Architecture
The high points of the new cellular architecture for 5G are all about leveraging trends from the broader networking and computing ecosystems. Three trends stand out in particular:
If you want to know more about the architecture of 5G, the application requirements that are driving it, and how it will enable innovation, you should go read the book as I did!
Photo by Sander Weeteling on Unsplash
The transition to 5G is happening, and unless you’ve been actively trying to ignore it, you’ve undoubtedly heard the hype. But if you are like 99% of the CS-trained, systems-oriented, cloud-savvy people in the world, the cellular network is largely a mystery. You know it’s an important technology used in the last mile to connect people to the Internet, but you’ve otherwise abstracted it out of your scope-of-concerns.
The important thing to understand about 5G is that it implies much more than a generational upgrade in bandwidth. It involves transformative changes that blur the line between the access network and the cloud. And it will encompass enough value that it has the potential to turn the “Access-as-frontent-to-Internet” perspective on its head. We will just as likely be talking about “Internet-as-backend-to-Access” ten years from now. (Remember, you read it here first.)
The challenge for someone that understands the Internet is penetrating the myriad of acronyms that dominate cellular networking. In fairness, the Internet has its share acronyms, but it also comes with a sufficient set of abstractions to help manage the complexity. It’s hard to say the same for the cellular network, where pulling on one thread seemingly unravels the entire space. It has also been the case that the cellular network had been largely hidden inside proprietary devices, which has made it impossible to figure it out for yourself.
In retrospect, it's strange that we find ourselves in this situation, considering that mobile networks have a 40-year history that parallels the Internet’s. But unlike the Internet, which has evolved around some relatively stable "fixed points," the cellular network has reinvented itself multiple times over, transitioning from from voice-only to data-centric, and from circuit-oriented to IP-based. 5G brings another such transformation, this time heavily influenced the cloud. In the same way 3G defined the transition from voice to broadband, 5G’s promise is mostly about the transition from a single access service (broadband connectivity) to a richer collection of edge services and devices, including support for immersive user interfaces (e.g., AR/VR), mission-critical applications (e.g., public safety, autonomous vehicles), and the Internet-of-Things (IoT). Because these use cases will include everything from home appliances to industrial robots to self-driving cars, 5G won’t just support humans accessing the Internet from their smartphones, but also swarms of autonomous devices working together on their behalf. All of this requires a fundamentally different architecture that will both borrow from and impact the Internet and Cloud.
We have attempted to document this emerging architecture in a book that is accessible to people with a general understanding of the Internet and Cloud. The book (5G Mobile Networks: A Systems Approach) is the result of a mobile networking expert teaching a systems person about 5G as we’ve collaborated on an open source 5G implementation. The material has been used to train other software developers, and we are hopeful it will be useful to anyone that wants a deeper understanding of 5G and the opportunity for innovation it provides. Readers that want hands-on experience can also access the open source software introduced in the book.
Democratizing the Network Edge
Two industry trends with significant momentum are on a collision course. One is the cloud, which in pursuit of low-latency/high-bandwidth applications is moving out of the datacenter and towards the edge. The promise and potential of applications ranging from Internet-of-Things (IoT) to Immersive UIs, Public Safety, Autonomous Vehicles, and Automated Factories, has triggered a gold rush to build edge platforms and services. The other is the access network that connects homes, businesses, and mobile devices to the Internet. Network operators (Telcos and CableCos) are transitioning from a reliance on closed and proprietary hardware to open architectures leveraging disaggregated and virtualized software running on white-box servers, switches, and access devices.
The confluence of cloud and access technologies raises the possibility of convergence. For the cloud, access networks provide low-latency connectivity to end users and their devices, with 5G in particular providing native support for the mobility of those devices. For the access network, cloud technology enables network operators to enjoy the CAPEX & OPEX savings that come from replacing purpose-built appliances with commodity hardware, as well as accelerating the pace of innovation through the softwartization of the access network.
It is clear that the confluence of cloud and access technologies at the access-edge is rich with opportunities to innovate, and this is what motivates the CORD-related platforms we are building at ONF. But it is impossible to say how this will all play out over time, with different perspectives on whether the edge is on-premise, on-vehicle, in the cell tower, in the Central Office, distributed across a metro area, or all of the above. With multiple incumbent players—e.g., network operators, cloud providers, cell tower providers—and countless startups jockeying for position, it’s impossible to predict how the dust will settle.
On the one hand, cloud providers believe that by saturating metro areas with edge clusters and abstracting away the access network, they can build an edge presence with low enough latency and high enough bandwidth to serve the next generation of edge applications. In this scenario, the access network remains a dumb bit-pipe, allowing cloud providers to excel at what they do best: run scalable cloud services on commodity hardware. On the other hand, network operators believe that by building the next generation access network using cloud technology, they will be able to co-locate edge applications in the access network. This scenario comes with built-in advantages: an existing and widely distributed physical footprint, existing operational support, and native support for both mobility and guaranteed service.
While acknowledging both of these possibilities, there is a third outcome that not only merits consideration, but is also worth actively working towards: the democratization of the network edge. The idea is to make the access-edge accessible to anyone, and not strictly the domain of incumbent cloud providers or network operators. There are three reasons to be optimistic about this possibility:
For almost as long as there have been packet-switched networks, there have been ideas about how to virtualize them. For example, there were early debates in the networking community about the merits of "virtual circuits" versus connectionless networks. But the concept of network virtualization has become more widespread in recent years, helped along by the rise of SDN as an enabling technology.
Virtualization has a robust history in computer science, but there remains some confusion about precisely what the term means. Arguably this is due in part to confusion caused by colloquial usage of "virtual" as a synonym for "almost", among many other uses.
Virtual memory provides an easy example to help understand what virtualization means in computing. Virtual memory creates an abstraction of a large and private pool of memory resources, even though the underlying physical memory may be shared by many applications and users and considerably smaller than the apparent pool of virtual memory. This abstraction enables programmers to operate under the illusion that there is plenty of memory and that no-one else is using it, while under the covers the memory management system takes care of things like mapping the virtual memory to physical resources and avoiding conflict between users.
Similarly, server virtualization presents the abstraction of a virtual machine (VM), which has all the features of a physical machine. Again, there may be many VMs supported on a single physical server, and the operating system and users on the virtual machine are happily unaware that the VM is being mapped onto physical resources.
A key point here is that virtualization of computing resources preserves the abstractions that existed before they were virtualized. This is important because it means that users of those abstractions don't need to change - they see a faithful reproduction of the thing being virtualized.
So what happens when we try to virtualize networks? We are able to present familiar abstractions to users of the virtual network, while mapping those abstractions onto the physical network in a way that insulates the user from the complexity of this mapping.
An early success for virtual networking came with the introduction of virtual private networks (VPNs), which allowed carriers to present corporate customers with the illusion that they had their own private network, even though in reality they were sharing underlying resources with many other users. One instance of this was the flavor of VPN known as MPLS VPNs, which gave each customer their own private address space and routing tables, along with control over the topology of their network, all implemented on top of a single IP network.
VPNs, however, only virtualize a few resources, notably addressing and routing tables. Network virtualization as commonly understood today goes further, virtualizing every aspect of networking. That means that a virtual network today supports all the basic abstractions of a physical network - switching, routing, firewalling, load balancing - virtualizing the entire network stack from layers two through seven. In this sense, they are analogous to the virtual machine, with its support of all the abstractions of a server: CPU, storage, I/O, etc.
Like virtual machines, virtual networks are also allowing a whole set of operational advances. They can be created rapidly under programmatic control; snapshots can be taken; networks can be cloned and migrated to entirely new locations, e.g., for disaster recovery.
There's still lots of room for growth in the virtual networking space. Modern cloud operators increasingly depend on virtual networks to automate their provisioning of services. Operators of emerging 5G networks are looking at options for virtualizing their networks.
For a more in depth discussion of this topic, we refer you to this blog post, co-authored with Martin Casado, one of the pioneers of both SDN and network virtualization.
SDN and Disaggregation
Earlier posts talked about the softwarization of the network in fairly general terms, but the idea got rolling ten years ago with the introduction of Software Defined Networks (SDN).
The fundamental idea of SDN is to decouple the network control plane (i.e., where routing algorithms like RIP, OSPF, and BGP run) from the network data plane (i.e., where packet forwarding decisions get made), with the former moved into software running on commodity servers, and the latter implemented by white-box switches like the ones described in Section 3.4 of the book. The original enabling idea of SDN was to define a standard interface between the control plane and the data plane so that any implementation of the control plane could talk to any implementation of the data plane; this breaks the dependency on any one vendor’s bundled solution. The original interface is called OpenFlow, and this idea of decoupling the control and data planes came to be known as disaggregation.
OpenFlow was a great first step, but a decade of experience has revealed that it is not sufficient as the interface for controlling the data plane. This is for the same reason any API layered on top of hardware falls short: it does not expose the full range of features that switch vendors put into their hardware. To address this shortcoming, the SDN community is now working on a language-based approach to specifying how the control and data planes interact. The language is called P4, and it provides a richer model of the switch's packet forwarding pipeline.
Another important aspect of disaggregation is that a logically centralized control plane can be used to control a distributed network data plane. We say logically centralized because while the state collected by the control plane is maintained in a global data structure (e.g., a Network Map), the implementation of this data structure could still be distributed over multiple servers (i.e., it could run in a cloud). This is important for both scalability and availability, where the two planes are configured and scaled independent of each other. This idea took off quickly in the cloud, with today’s cloud providers running SDN-based solutions both within their datacenters and across the backbone networks that interconnect their datacenters.
A consequence of this design that isn’t immediately obvious is that a logically centralized control plane doesn’t just manage a network of physical (hardware) switches that interconnects physical servers, but it also manages a network of virtual (software) switches that interconnect virtual servers (e.g., Virtual Machines and containers). If you’re counting “switch ports” (a good measure of all the devices connected to your network) then the number of virtual ports in the Internet shot past the number of physical ports in 2012.
One of other key enablers for SDN’s success, as depicted in the Figure, is the Network Operating System (NOS). Like a server operating system (e.g., Linux, IOS, Android, Windows) that provides a set of high-level abstractions that make it easier to implement applications (e.g., you can read and write files instead of directly accessing disk drives), a NOS makes it easier to implement network control functionality, otherwise known as Control Apps. A good NOS abstracts the details of the network switches and provides a “network map” abstraction to the application developer. The NOS detects changes in the underlying network (e.g., switches, ports, and links going up-and-down) and the control application simply implements the behavior it wants on this abstract graph. What that means is that the NOS takes on the burden of collecting network state (the hard part of distributed algorithms like Link-State and Distance-Vector algorithms) and the control app is free to simply implement the shortest path algorithm and load the computed forwarding rules into the underlying switches. By centralizing this logic, SDN is able to produce a globally optimized solution. The published evidence confirms this advantage (e.g., Google's private wide-area network B4).
As much of an advantage as the cloud providers have been able to get out of SDN, its adoption in enterprises and Telcos has much much slower. This is partly about the ability of different markets to manage their networks. The Googles, Microsofts, and Amazons of the world have the engineers and DevOps skills needed to take advantage of this technology, whereas others still prefer pre-packaged and integrated solutions that support the management and command line interfaces they are familiar with. As is often the case, business culture changes more slowly than technology.
Feature Velocity vs Stability
It is important to recognize the various perspectives on computer networks (e.g., that of network architects, application developers, end users, and network operators) to understand the technical requirements that shape how networks are designed and built. But this presumes all design decisions are purely technical, which is certainly not the case. Many other factors, from economic forces, to government policy, to societal influences, to ethical considerations, influence how networks are designed and built.
Of these, the marketplace is often the most influential, and corresponds to the interplay between network operators that sell access and connectivity (e.g., AT&T, Comcast, Verizon, DT, NTT, China Mobile), network equipment venders that sell hardware to network operators (e.g., Cisco, Juniper, Ericsson, Nokia, Huawei, NEC), cloud providers that host content and scalable applications in their datacenters (e.g., Google, Amazon, Microsoft), service providers that deliver content and cloud apps to end-users (e.g., Facebook, Apple, Netflix, Spotify), and of course, subscribers and customers that download content and run cloud applications (i.e., individuals, but also enterprises and businesses). Not surprisingly, the lines between all these players are not crisp, with many companies playing multiple roles. For example, service providers like Facebook run their own clouds and network operators like Comcast and AT&T own their own content.
The most notable example of this cross-over are the large cloud providers, who (a) build their own networking equipment, (b) deploy and operate their own networks, and (c) provide end-user services and applications on top of their networks. It's notable because it challenges the implicit assumptions of the simple "textbook" version of the technical design process. One such assumption is that designing a network is a one-time activity. Build it once and use it forever (modulo hardware upgrades so users can enjoy the benefits of the latest performance improvements). A second is that the job of designing and implementing the network is completely divorce from the job of operating the network. Neither of these assumptions is quite right.
On the first point, the network’s design is clearly evolving. The only question is how fast. Historically, the feature upgrade cycle involved an interaction between network operators and their vender partners (often collaborating through the standardization process), with timelines measured in years. But anyone that has downloaded and used the latest cloud app knows how glacially slow anything measured in years is by today's standards.
On the second point, the companies that build networks are almost always the same ones that operate them. The only question is whether they develop their own features or outsource that process to their venders. If we once again look to the cloud for inspiration, we see that develop-and-operate isn’t just true at the corporate level, but it is also how the fastest moving cloud companies organize their engineering teams: around the DevOps model. (If you are unfamiliar with DevOps, we recommend you read "Site Reliability Engineering: How Google Runs Production Systems" to see how Google practices it.)
What this all means is that computer networks are now in the midst of a major transformation, due largely to market pressure being applied by agile cloud providers. Network operators are trying to simultaneously accelerate the pace of innovation (sometimes known as feature velocity) and yet continue to offer a reliable service (preserve stability). And they are increasingly doing this by adopting the best practices of cloud providers, which can be summarized as having two major themes: (1) take advantage of commodity hardware and move all intelligence into software, and (2) adopt agile engineering processes that break down barriers between development and operations.
This transformation is sometimes called the “cloudification” or “softwarization” of the network, but by another name, it’s known as Software Defined Networks (SDN). Whatever you call it, this new perspective will (eventually) be a game changer, not so much in terms of how we address the fundamental technical challenges of framing, routing, fragmentation/reassembly, packet scheduling, congestion control, security, and so on, but in terms of how rapidly the network evolves to support new features and to accommodate the latest advances in technology.
This general theme is important and we plan to return to it in future posts. Understanding networks is partly about understanding the technical underpinnings, but also partly about how market forces (and other factors) drive change. That you are able to make informed design decisions about technical approach A versus technical approach B is a necessary first step, but that you are able to deploy that solution and bring it to market more rapidly and for less cost than the competition is just as important, if not more so.
From Research to Practice
Having not cracked open Computer Networks: A Systems Approach for several years, the thing that most struck me as I started to update the material is how much of the Internet has its origins in the research community. Everyone knows that the ARPANET and later TCP/IP came out of DARPA-funded university research, but even as the Web burst onto the scene in the 1990s, it was still the research community that that led the way in the Internet's coming-of-age. There's a direct line connecting papers published on congestion control, quality-of-service, multicast, real-time multimedia, security protocols, overlay networks, content distribution, and network telemetry to today's practice. And in many cases, the technology has become so routine (think Skype, Netflix, Spotify), that it's easy to forget the history of how we got to where we are today. This makes updating the textbook feel strangely like writing an historical record.
From the perspective of writing a relevant textbook (or just making sense of the Internet), certainly it's important to understand the historical context. It is even more important to appreciate the thought process of designing systems and solving problems, for which the Internet is clearly the best use case to study. But there are some interesting challenges in providing perspective on the Internet to a generation that has never known a world without the Internet.
One is how to factor commercial reality into the discussion. Take video conferencing as an example. Once there was a single experimental prototype (vic/vat) used to gain experience and drive progress. Today there is Skype, GoToMeeting, WebEx, Google Hangouts, Zoom, UberConference, and many other commercial services. It's important to connect-the-dots between these familiar services and the underlying network capabilities and design principles. For example, while today's video conferencing services leverage the foundational work on both multicast and real-time protocols, they are closed-source systems implemented on top of the network, at the application level. They are able to do this by taking advantage of widely distributed points-of-presence made possible by the cloud. Teasing apart the roles of cloud providers, cloud services, and network operators is key to understanding how and where innovation happens today.
A second is to identify open platforms and specifications that serve as good exemplars for the core ideas. Open source has become an important part of today's Internet ecosystem, surpassing the role of the IETF and other standards bodies. In the video conferencing realm, for example, projects like Jitsi, WebRTC, and Opus are important examples of the state-of-the-art. But one look at the projects list on the Apache Foundation or Linux Foundation web sites makes it clear that separating the signal from the noise is no trivial matter. Knowing how to navigate this unbelievably rich ecosystem is the new challenge.
A third is to anticipate what cutting edge activity happening today is going to be routine tomorrow. On this point, the answer seems obvious. It will be how network providers improve feature velocity through the softwarization and virtualization of the network. By another name, this is Software Defined Networking (SDN), but more broadly, this represents a shift from building the network using closed/proprietary appliances to using open software platforms running on commodity hardware. This shift is both pervasive and transformative. It impacts everything from high-performance switch design, to architecting access networks (5G, Fiber-to-the-Home), to how network operators deal with lifecycle management, to the blurring of the line between the Internet and the Cloud. Recognizing that this transformation is underway is essential to understanding where the Internet is headed next.