We recently sat down for a conversation with Martin Casado of Andreessen Horowitz. The transcript below has been lightly edited for clarity.
Bruce Davie (BD) - Okay, so I'm here today with Martin Casado. We're doing another episode
of Coffee with Bruce. Martin, welcome.
Martin Casado (MC) - Happy to be here with
my chrysanthemum tea.
BD - Alright, yeah. Dealing with time zones and all, it's kind of late for coffee there, I guess. So, Martin Casado really needs very little introduction but in case you've been living under a rock, Martin was one of the founders of Nicira, was the person who successfully recruited me to go and work at Nicira, starting off an eight year journey into network virtualisation for me. And these days he's a general partner at Andreessen Horowitz. And so Martin, I've got a few things I wanted to talk about today. First of all, I want to talk a little bit about some of the things we experienced at Nicira in the early days of network virtualisation.
I think when I joined, you guys had already established that you had this plan to disrupt networking and change it in a pretty significant way. So I guess first thing I wanted to say looking back now, I think it was what 2007 that you founded Nicira. It's a lot of years to look back on now. Are you happy with where things ended up?
Martin Casado (MC) - So yes. Yes, I'm quite happy.
It's interesting though, when we started the company, if you had looked at what our aspirations were, it was "start with the data center and then change all of networking". And every new bit of networking is another feature on the same technical platform.
And it turns out, every new bit of networking is actually a separate industry, right?
So I think originally we had pretty broad ambitions, which it's fun to see the industry play out totally independent of what we're doing. I mean, that's been fantastic.
But I was actually looking back at our Series A deck, when we said, okay, we're going to start, and we're going to focus on the data center and network virtualisation. If you look at the timelines we laid out and what actually happened, it was pretty much in line.
And so I think in the area that we decided to focus on, you know, it's been great to see the realisation.
BD - Yeah, and honestly, SD-WAN for me is just the continuation of the Nicira vision. It's just that it got done by other companies. And the first time somebody explained SD-WAN to me, it was like, oh, you're doing what Nicira did, but you're gonna do it for the WAN.
And so to that extent, the vision has actually played out quite broadly. It's just, as you say, it wasn't on a single technical platform.
MC - I know that's exactly, yeah. I think that's exactly right. The lower you go in the stack, the more general the technology, isn't it? So at some level a compiler company is basically every software company on the planet, right? Because you can build anything with a compiler.
So I think that the industry - and we were a very important part of the discourse - the industry at the time was kind of grappling with what does software-defined networking and network virtualisation mean? I think that what we distilled in that discourse was the right way to build networks, all aspects of networks.
But when it comes to the brass tacks of building a company, who you sell to dictates the kind of company you are, and the specific problems you're tackling dictate the type of engineering organisation that you build.
And what I didn't appreciate then, that I appreciate now, is how much effort it takes to execute in these separate verticals, even though SD-WAN is absolutely very much part of the original vision that we had.
Which is why it's actually so exciting to see companies pull this off that have both data center and WAN solutions, because they have so much potential synergy. But for a startup that's a hard thing to execute on.
BD - Yeah, you're sort of getting onto another thing I wanted to talk about here, which is what kind of company you wanted to build. I thought the culture at Nicira was pretty amazing. My biggest regret is I didn't join earlier, just because it was so much fun.
But what was in your mind about how to build the culture? What did you do to create the culture at Nicira? And what are your thoughts on that more broadly?
MC - In some ways, one of our greatest challenges was also our greatest gift. And that was, we started the company in 2007 and then the world ended about a year later, right? If you'll remember with the great recession. And it was like the nuclear winter had set in.
And you probably had two options, which is you band together or you die. And I think that in many ways our culture was galvanised in this tough time. And what it forced in particular is for the company to have a very clear vision that you can march towards. And if the vision wasn't clear, people wouldn't have done it. I mean, they just wouldn't have, just because it was such a scary time.
And so I think 2008, 2009, we just really, really sharpened a vision that everybody believed in. And if someone didn't believe in it they'd speak up because they didn't have anything to lose.
And by the time things were starting to improve, we had iterated enough that the vision was very clear. And then I think what happened is, once you have a vision that you believe in, the work is really realising that vision and building the product, which is something we'd been desperate to do anyway. And so I do think a lot of this was galvanised by having a tough time.
And I think that being mission focused is something people talk about a lot, and it's like, these are the three things on my badge or whatever. And I think that kind of misses the point, which is if you don't have a very high-definition view of the mountain top that you're tackling, people start just meandering in the woods. And I think that was very, very core to the cultural success of the company.
BD - Yeah and to be honest, I've talked about that in terms of how you led the NSBU when we were at VMware together was, I think every staff meeting you would tell us, you know, these are the three things that we're shooting for this year. And you would repeat it every week so that nobody could have any doubt what direction the organisation was moving.
MC - It's always been amazing to me how easy it is to lose focus on even very simple things. We were a room of very professional adults at the peak of our careers, and we were talking about very simple things like KPIs.
But I do think in many ways being an operator is managing your own psychology and the hardest part of managing your own psychology is just staying focused on the things that are really important and taking a long view.
And so one thing that I loved as a group that we did is, everybody had the same expectation that we're going to make sure that we're staying on track for a broader goal, and everything is subservient to that.
And I mean, listen, I sit on 17 boards now, and I can't tell you how rare that actually is, that people will actually distill all of the trouble and all of the pain and all of the excitement down into fairly simple-to-articulate goals.
BD - What about now that we're moving to this world of much more distributed teams, given the COVID impact on remote working, does that affect the way you think about building a team culture in a startup?
MC - This is one of the big questions that we're trying to answer right now. I think maintaining culture and building culture are actually two different things. In my experience, maintaining culture is something where you can kind of keep doing the same motions, but you may have to do more of them and you may have to adapt them online. But it's not that interesting of a conversation. Does that make sense?
Which is like, yes, you should do it online. Yes, you should have more touch points. Yes, you should be sensitive to people and their personal issues. But these are fairly straightforward conversations to have, and you can maintain a strong culture. And quite frankly, we've seen cultures getting even stronger because there's a notion of solidarity through all of that.
What I don't know the answer to and I wish I did have an answer because I think everybody's trying to figure it out is, can you create a new culture, right? Can you change a culture?
And I think about - a lot of times - Andreessen Horowitz is a venture firm for example. I mean, it's pretty remarkable to create a tier one venture firm in 10 years, right? It just doesn't happen very often.
And then I ask the question, could you do that in the time of COVID, right? Like for sure you can stay being a firm. And so I think it's a great question. I wish I had a better answer, but I would encourage anybody thinking about this to really decouple the two issues of maintaining culture and creating culture.
BD - Yeah, it's going to be fascinating to see how this plays out in the next few years.
So, one of the areas that I've been really interested in for the last few years, and I know you've written about it quite a bit is, is AI and machine learning. And also, I love the fact that when you talk about it, you often talk about it in terms of economics. Because I go all the way back to listening to David Clark at MIT talk about the idea that we should all be economists because you can't really understand networking, unless you think about the economic impact of your design decisions.
What's your thinking about the economics of AI businesses?
MC - Let me actually start with the economics of cloud businesses because it's very interesting. And then we'll back into the economics of AI.
So let's talk about the life cycle of a company that's built in the cloud. So, you know, Bruce Davie creates Bruce Davie inc, and comes to Martin Casado and says, "Hey, Martin, you know, give me $2 million to do my company."
BD - The scenario could totally happen by the way.
MC - So, we're like here, Bruce here's 2 million bucks and you're off looking for product market fit. Now you come back and like, there's some interest here, give me $10 million. So I'll give you $10 million and I join your board.
And so now what happens is, at every board meeting, the questions I ask are: what does growth look like? What features are we shipping?
And so you're telling your R&D organisation, we need to grow, we need new features. And that's basically your entire focus. Never once do we talk about things like gross margin or COGS right? And COGS are cost of goods sold, like how much it costs to run this stuff. So you're writing horribly suboptimal code on AWS or wherever.
So, you're doing this, and then you've got a real business, say you're at 50 million in ARR. You're feeling good, and now you've got a real financial investor. I'm an investor that invests in ideas and people. Like, Bruce is the smartest architect I've ever worked with, this is a great space, whatever, I'll invest just on that.
But financial investors actually care about financial metrics, like unit economics, and things like margins matter because they directly impact profitability. So then they look at your company and they're like, hey, listen, this is all great. You've got great growth but basically for every dollar you're making 50 cents goes to Amazon.
You've got this huge issue now, which is you spent four years building an R&D team and a business practice around growth, because that's what all of the economic incentives, including the board, have told you to do. And you've got a company now that has low multiples because you've got low margins, right? And then you've got this paradox: what do you do?
So more and more, what we're seeing as soon as companies slow down, they do, what's called repatriation. They're like, listen, like you can't fix that much code. I mean, it took Dropbox years to do it. So we're going to have to find some way to move it onto our own infrastructure or something else to improve those margins.
And I would say there's probably, in SaaS companies as they slowed down, billions of dollars trapped in this margin issue for these companies that are going to the cloud.
Okay, so this is the life cycle: you're focusing on growth, you write poor code, you end up with these relatively thin margins, and now you have to do something about it because it's impacting your valuation. Instead of being valued at $20 billion, you're valued at $10 billion. We're talking about a lot of money that's being impacted.
Okay, so to bring it back to AI/ML, the question is whether that fix is even available. In the case of the cloud, you could say, well, I'm paying somebody else; if I do it myself, I can do it cheaper. And maybe if I re-architect it, I can do it cheaper.
There's a very open question for AI/ML, which is: is there a way to build the company where you, no matter what, can have good margins, or software-level margins?
So I feel like we're going through this escalation of margin crises. You had traditional software, say in the Microsoft era, where you ship software at 80% margins, because you write it and you ship it, and copying bits is basically free. Now we have the cloud era, which I just walked through, where you have a margin crisis because it's so easy to end up running on low margins.
But it's something that you have to tackle later on, which is why people think about re-architecting and repatriation.
Now we've got the AI/ML era, where there's a fundamental question that's interesting to talk about: is it even possible, no matter what you do, to do this on good margins? And the reason that question comes up is, if you look at what it takes to create one of these data companies (and I'm going to shut up in a second), you have to create a model per customer.
BD - Yeah.
MC - And that's incredibly compute intensive to do. And then to improve the accuracy of any single model, for example, you require just a ton more data. So you basically have a diseconomy of scale, which is, in order to stay ahead, you've got to do more and it costs a lot more.
And so I feel like we're having a margin crisis on top of a margin crisis: all the cloud margin costs, and now the actual structural algorithmic costs of AI/ML.
BD - So Martin there's one thing I wanted to drill into. There was this thing about repatriation, because I think that's a bit of a controversial topic. The idea that somehow you can run your private infrastructure more cost-effectively than public-cloud infrastructure.
So maybe you could just give us a little bit of insight on why that's true or back up that assertion.
MC - Yeah, so we've now seen a number of cases in the industry, I think enough to really understand what's going on here. And it's an economic argument; it's strictly an economic argument. I've kind of penciled it out before, but I want to be a little bit clearer, which is: if you look at dollars out the door in any given month, it never makes sense to build a data center, right? Because it takes a lot of dollars out the door to do that. However, if you look at multiples of revenue to the valuation of a company, all of a sudden it makes sense, right?
So if you've got 80% gross margins and you're growing at 2X, it may be 10X revenue for the value of the company. And if your margins are, say, 40%, it could be 5X. Now that could be the difference between a $10 billion and a $20 billion company. Now, can you build a data center for $10 billion? The answer is yes, you can buy a whole bunch of them, right?
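Martin's back-of-the-envelope argument can be penciled out in a few lines. The figures below (a hypothetical $2B in revenue, an illustrative 60% margin threshold, and the 10X/5X multiples from the conversation) are for illustration only, not a real valuation model.

```python
def valuation(revenue, gross_margin, high_multiple=10, low_multiple=5):
    """Apply a revenue multiple that depends on gross margin.

    The 10X/5X multiples and the 60% margin cutoff are illustrative
    numbers from the conversation, not an actual pricing formula.
    """
    multiple = high_multiple if gross_margin >= 0.6 else low_multiple
    return revenue * multiple

revenue = 2_000_000_000  # hypothetical $2B in annual revenue

# At 80% gross margin, the market might pay ~10X revenue.
high = valuation(revenue, 0.80)  # $20B company

# At 40% gross margin (half of every dollar going to the cloud bill),
# the multiple might compress to ~5X.
low = valuation(revenue, 0.40)   # $10B company

# The valuation gap dwarfs the cost of building data centers.
print(high - low)  # 10_000_000_000
```

The point of the sketch is that the margin penalty shows up multiplied by the revenue multiple, which is why repatriation can make sense for shareholders even when the monthly cash-out looks worse.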
And then if you purpose-build a data center for your app, can you make it more efficient than the cloud? Absolutely. The cloud is built for the long tail of small apps, because that's what it caters to. This is nobody's fault; it's just the system.
So if you have a sufficiently large workload, and you've been focusing on growth, and the company slows down (take any online SaaS service), and that's impacting your margins and therefore the valuation of the company, it's actually quite easy to make the argument that, from a shareholder standpoint, it's far, far cheaper to build a data center than not.
BD - Wow, that's brilliant analysis.
We will give people a chance to go and find other ways to dig deeper into some of these topics, but a super interesting discussion Martin.
And I want to say thanks for your time.
MC - Thank you so much.
BD - Alright, cheers. And we'll see you soon.
Earlier this month, Bruce and Larry sat down over Zoom to discuss their shared history researching and developing networking, how "Computer Networks: A Systems Approach" came to be written, and their thoughts on the future of networking. The interview is on YouTube and the transcript follows.
Bruce Davie: So I'm here today with Larry Peterson. I'm drinking coffee, as is my normal habit. I'm guessing it's a bit late in the day for you for coffee.
Larry Peterson: Well it isn't too late for a brew of some kind.
Bruce Davie: All right, well, if you want to go and pop up and get a beer, feel free. So, for those of you who don't know, this is Larry Peterson, my co-author on "Computer Networks: A Systems Approach" and Larry and I have known each other since I think early 1990s, when we collaborated on a networking project.
I guess before I go into giving more background about you, I want to hear about your cat, since the cat theme has been showing up a lot.
Larry Peterson: He's going to be interrupting here at some point. This is Toby, if you can see him. Yeah, so we live in the desert, which is full of coyotes. Toby's used up a couple of his lives right outside in our backyard. There's a wall, and on the other side of that is open desert. When he was a few months old, a pack of coyotes came into the yard and got him and took him over the wall. My wife went out and started yelling at the coyotes. We think he has some coyote DNA now.
Bruce Davie: So I have a lot of literally warm memories of visiting you in Arizona. Also remember working with you on a joint research paper sometime in the early 90s and I was locked in my house during an ice storm in New Jersey, and it was literally unsafe to go outside, because there was so much ice on the roads. You were probably soaking up the sun in Arizona. But it was one of those early experiences, similar to what we're going through now where we actually could leverage networking to collaborate from a pretty remote distance.
So I think you've actually been in networking longer than I have. And you're ageing very well, by the way. Can you maybe tell me a little bit about your early experiences with networking?
Larry Peterson: Yeah, let's see. Well, I was a grad student at Purdue. I remember the transition from the ARPANET to the Internet, and my advisor actually handed me a nine-track tape for our VAX that had this thing called TCP/IP on it. So, yeah, that was my first exposure to it.
We are officially shutting down PlanetLab at the end of May, with our last major user community (MeasurementLab) having now migrated to new infrastructure. It was 18 years ago this month (March 2002) that 30 systems researchers got together at the Intel Lab in Berkeley to talk about how we could cooperate to build out the distributed testbed to support our research. There were no funding agencies in the room, no study group, and no platinum sponsors. Just a group of systems people that wanted to get their research done. We left the meeting with an offer from David Tennenhouse, then Director of Research at Intel, to buy 100 servers to bootstrap the effort. In August, the day before SIGCOMM, a second underground meeting happened in Pittsburgh, this time drawing 80 people. The first machines came online at Princeton and Berkeley in July, and by October, we had the 100 seed machines up and running at 42 sites. The rest, as they say, is history.
In retrospect, it was a unique moment in time. The distributed systems community, having spent the previous 15 years focused on the LAN, was moving on to wide-area networking challenges. The networking community, having architected the Internet, was ruminating about how it had become ossified. Both lacked a realistic platform to work on. My own epiphany came during an Internet End-to-End Research Group meeting in 2001, when I found myself in a room full of the Internet’s best-and-brightest, trying to figure out how we could possibly convince Cisco to interpret one bit in the IP header differently. I realized we needed to try a different approach.
PlanetLab enabled a lot of good research, much of which has been documented in the website’s bibliography. Those research results are certainly important, but from my point of view, PlanetLab has had impact in other, more lasting ways. One was a model for how computer scientists can share research infrastructure. Many of the early difficulties we faced deploying PlanetLab had to do with convincing University CIOs that hosting PlanetLab servers had an acceptable risk/reward tradeoff. A happy mistake we made early on was asking the VP for Research (not the University CIO) for permission to install servers on their campus. By the time the security-minded folks figured out what was going on, it was too late. They had no choice but to invent Network DMZs as a workaround.
A second was to expose computer scientists to real-world operational issues that are inevitable when you’re running Internet services. Researchers that had been safely working in their labs were suddenly exposed to all sorts of unexpected user behavior, both benign and malicious, not to mention the challenges of keeping a service running under varied network conditions. There were a lot of lessons learned under fire, with unexpected traffic bursts (immediately followed by email from upset University system admins) a common rite of passage for both grad students and their advisors. I’m not surprised when I visit Google and catch up with former faculty colleagues to hear that they now spend all their time worrying about operational challenges. Suddenly, network management is cool.
Then there were the non-technical, policy-related issues, forcing us to deal with everything from DMCA take-down notices to FBI subpoenas to irate web-surfers threatening to call the local Sheriff on us. These and similar episodes were among the most eye-opening aspects of the entire experience. They were certainly the best source of war stories, and an opportunity to get to know Princeton’s General Counsel quite well. Setting policy and making judgements about content is really hard… who knew.
Last, but certainly not least, is the people. In addition to the fantastic and dedicated group of people that helped build and operate PlanetLab, the most gratifying thing that happens to me (even still today) is running into people, usually working for an Internet company of one sort or another, who tell me that PlanetLab was an important part of their graduate student experience. If you are one of those people and I haven’t run into you recently (or even if I have) please leave a comment and let me know what you’re up to. It will be good to hear from you.
The transition to 5G is happening, and unless you’ve been actively trying to ignore it, you’ve undoubtedly heard the hype. But if you are like 99% of the CS-trained, systems-oriented, cloud-savvy people in the world, the cellular network is largely a mystery. You know it’s an important technology used in the last mile to connect people to the Internet, but you’ve otherwise abstracted it out of your scope-of-concerns.
The important thing to understand about 5G is that it implies much more than a generational upgrade in bandwidth. It involves transformative changes that blur the line between the access network and the cloud. And it will encompass enough value that it has the potential to turn the “Access-as-frontend-to-Internet” perspective on its head. We will just as likely be talking about “Internet-as-backend-to-Access” ten years from now. (Remember, you read it here first.)
The challenge for someone that understands the Internet is penetrating the myriad of acronyms that dominate cellular networking. In fairness, the Internet has its share of acronyms, but it also comes with a sufficient set of abstractions to help manage the complexity. It’s hard to say the same for the cellular network, where pulling on one thread seemingly unravels the entire space. It has also been the case that the cellular network has been largely hidden inside proprietary devices, which has made it impossible to figure it out for yourself.
In retrospect, it's strange that we find ourselves in this situation, considering that mobile networks have a 40-year history that parallels the Internet’s. But unlike the Internet, which has evolved around some relatively stable "fixed points," the cellular network has reinvented itself multiple times over, transitioning from voice-only to data-centric, and from circuit-oriented to IP-based. 5G brings another such transformation, this time heavily influenced by the cloud. In the same way 3G defined the transition from voice to broadband, 5G’s promise is mostly about the transition from a single access service (broadband connectivity) to a richer collection of edge services and devices, including support for immersive user interfaces (e.g., AR/VR), mission-critical applications (e.g., public safety, autonomous vehicles), and the Internet-of-Things (IoT). Because these use cases will include everything from home appliances to industrial robots to self-driving cars, 5G won’t just support humans accessing the Internet from their smartphones, but also swarms of autonomous devices working together on their behalf. All of this requires a fundamentally different architecture that will both borrow from and impact the Internet and Cloud.
We have attempted to document this emerging architecture in a book that is accessible to people with a general understanding of the Internet and Cloud. The book (5G Mobile Networks: A Systems Approach) is the result of a mobile networking expert teaching a systems person about 5G as we’ve collaborated on an open source 5G implementation. The material has been used to train other software developers, and we are hopeful it will be useful to anyone that wants a deeper understanding of 5G and the opportunity for innovation it provides. Readers that want hands-on experience can also access the open source software introduced in the book.
Two industry trends with significant momentum are on a collision course. One is the cloud, which in pursuit of low-latency/high-bandwidth applications is moving out of the datacenter and towards the edge. The promise and potential of applications ranging from Internet-of-Things (IoT) to Immersive UIs, Public Safety, Autonomous Vehicles, and Automated Factories, has triggered a gold rush to build edge platforms and services. The other is the access network that connects homes, businesses, and mobile devices to the Internet. Network operators (Telcos and CableCos) are transitioning from a reliance on closed and proprietary hardware to open architectures leveraging disaggregated and virtualized software running on white-box servers, switches, and access devices.
The confluence of cloud and access technologies raises the possibility of convergence. For the cloud, access networks provide low-latency connectivity to end users and their devices, with 5G in particular providing native support for the mobility of those devices. For the access network, cloud technology enables network operators to enjoy the CAPEX & OPEX savings that come from replacing purpose-built appliances with commodity hardware, as well as accelerating the pace of innovation through the softwarization of the access network.
It is clear that the confluence of cloud and access technologies at the access-edge is rich with opportunities to innovate, and this is what motivates the CORD-related platforms we are building at ONF. But it is impossible to say how this will all play out over time, with different perspectives on whether the edge is on-premise, on-vehicle, in the cell tower, in the Central Office, distributed across a metro area, or all of the above. With multiple incumbent players—e.g., network operators, cloud providers, cell tower providers—and countless startups jockeying for position, it’s impossible to predict how the dust will settle.
On the one hand, cloud providers believe that by saturating metro areas with edge clusters and abstracting away the access network, they can build an edge presence with low enough latency and high enough bandwidth to serve the next generation of edge applications. In this scenario, the access network remains a dumb bit-pipe, allowing cloud providers to excel at what they do best: run scalable cloud services on commodity hardware. On the other hand, network operators believe that by building the next generation access network using cloud technology, they will be able to co-locate edge applications in the access network. This scenario comes with built-in advantages: an existing and widely distributed physical footprint, existing operational support, and native support for both mobility and guaranteed service.
While acknowledging both of these possibilities, there is a third outcome that not only merits consideration, but is also worth actively working towards: the democratization of the network edge. The idea is to make the access-edge accessible to anyone, and not strictly the domain of incumbent cloud providers or network operators. There are three reasons to be optimistic about this possibility:
The Internet has been described as having a narrow waist architecture, with one universal protocol in the middle (IP), widening to support many transport and application protocols above it (e.g., TCP, UDP, RTP, SunRPC, DCE-RPC, gRPC, SMTP, HTTP, SNMP) and able to run on top of many network technologies below (e.g., Ethernet, PPP, WiFi, SONET, ATM). This general structure has been a key to the Internet becoming ubiquitous: by keeping the IP layer that everyone has to agree to minimal, a thousand flowers were allowed to bloom both above and below. This is now a widely understood strategy for any platform trying to achieve universal adoption.
But something else has happened over the last 30 years. By not addressing all the issues the Internet would eventually face as it grew (e.g., security, congestion, mobility, real-time responsiveness, and so on) it became necessary to introduce a series of additional features into the Internet architecture. Having IP’s universal addresses and best-effort service model was a necessary condition for adoption, but not a sufficient foundation for all the applications people wanted to build.
It is informative to reconcile the value of a universal narrow waist with the evolution that inevitably happens in any long-lived system: the “fixed point” around which the rest of the architecture evolves has moved to a new spot in the software stack. In short, HTTP has become the new narrow waist; the one shared/assumed piece of the global infrastructure that makes everything else possible. This didn’t happen overnight or by proclamation, although some did anticipate it would happen. The narrow waist drifted slowly up the protocol stack as a consequence of evolution (to mix geoscience and biological metaphors).
Putting the narrow waist label purely on HTTP is an oversimplification. It’s actually a team effort, with the HTTP/TLS/TCP/IP combination now serving as the Internet’s common platform.
Somewhat less obviously, HTTP also provides a good foundation for dealing with mobility. If the resource you want to access has moved, you can have HTTP return a redirect response that points the client to a new location. Similarly, HTTP enables injecting caching proxies between the client and server, making it possible to replicate popular content in multiple locations and save clients the delay of going all the way across the Internet to retrieve some piece of information. (See how in Section 9.4.) Finally, HTTP has been used to deliver real-time multi-media, in an approach known as adaptive streaming. (See how in Section 7.2.)
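The redirect mechanism described above is easy to see in action. The sketch below spins up a toy HTTP server (using Python's standard library) where a resource has "moved": the old URL answers with a 301 pointing at the new location, and the client follows it transparently, just as a browser would. The paths and response body are made up for the example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class MovedHandler(BaseHTTPRequestHandler):
    """Toy server: /old-location redirects to /new-location."""

    def do_GET(self):
        if self.path == "/old-location":
            # The resource has moved: answer with a redirect.
            self.send_response(301)
            self.send_header("Location", "/new-location")
            self.end_headers()
        elif self.path == "/new-location":
            body = b"resource found at its new home"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the example's output quiet

server = HTTPServer(("127.0.0.1", 0), MovedHandler)  # port 0 = auto-assign
threading.Thread(target=server.serve_forever, daemon=True).start()

# urllib follows the 301 automatically; the client never needs to know
# the resource moved.
url = f"http://127.0.0.1:{server.server_port}/old-location"
with urllib.request.urlopen(url) as resp:
    status, content = resp.status, resp.read().decode()
server.shutdown()

print(status, content)  # 200 resource found at its new home
```

The same indirection point is what caching proxies exploit: because the client addresses a resource rather than a server, an intermediary can answer on the origin's behalf.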
For almost as long as there have been packet-switched networks, there have been ideas about how to virtualize them. For example, there were early debates in the networking community about the merits of "virtual circuits" versus connectionless networks. But the concept of network virtualization has become more widespread in recent years, helped along by the rise of SDN as an enabling technology.
Virtualization has a robust history in computer science, but there remains some confusion about precisely what the term means. Arguably this is due in part to the colloquial usage of "virtual" as a synonym for "almost", among many other uses.
Virtual memory provides an easy example to help understand what virtualization means in computing. Virtual memory creates an abstraction of a large and private pool of memory resources, even though the underlying physical memory may be shared by many applications and users and considerably smaller than the apparent pool of virtual memory. This abstraction enables programmers to operate under the illusion that there is plenty of memory and that no-one else is using it, while under the covers the memory management system takes care of things like mapping the virtual memory to physical resources and avoiding conflict between users.
Similarly, server virtualization presents the abstraction of a virtual machine (VM), which has all the features of a physical machine. Again, there may be many VMs supported on a single physical server, and the operating system and users on the virtual machine are happily unaware that the VM is being mapped onto physical resources.
A key point here is that virtualization of computing resources preserves the abstractions that existed before they were virtualized. This is important because it means that users of those abstractions don't need to change - they see a faithful reproduction of the thing being virtualized.
So what happens when we try to virtualize networks? We are able to present familiar abstractions to users of the virtual network, while mapping those abstractions onto the physical network in a way that insulates the user from the complexity of this mapping.
An early success for virtual networking came with the introduction of virtual private networks (VPNs), which allowed carriers to present corporate customers with the illusion that they had their own private network, even though in reality they were sharing underlying resources with many other users. One instance of this was the flavor of VPN known as MPLS VPNs, which gave each customer their own private address space and routing tables, along with control over the topology of their network, all implemented on top of a single IP network.
VPNs, however, only virtualize a few resources, notably addressing and routing tables. Network virtualization as commonly understood today goes further, virtualizing every aspect of networking. That means that a virtual network today supports all the basic abstractions of a physical network - switching, routing, firewalling, load balancing - virtualizing the entire network stack from layers two through seven. In this sense, a virtual network is analogous to the virtual machine, with its support of all the abstractions of a server: CPU, storage, I/O, etc.
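The per-customer routing state that makes MPLS VPNs work can be sketched in a few lines (the class, customer names, and next-hop labels here are hypothetical, purely for illustration). The key property: because each customer gets its own table, two customers can reuse the very same private prefix without conflict.

```python
# Sketch of per-customer routing tables, as in an MPLS VPN: one shared
# router, but each customer's routes live in a private table.
import ipaddress

class VirtualRouter:
    def __init__(self):
        self.tables = {}  # customer name -> list of (prefix, next_hop)

    def add_route(self, customer, prefix, next_hop):
        self.tables.setdefault(customer, []).append(
            (ipaddress.ip_network(prefix), next_hop))

    def lookup(self, customer, dest):
        addr = ipaddress.ip_address(dest)
        # Longest-prefix match within this customer's table only.
        matches = [(net, hop) for net, hop in self.tables.get(customer, [])
                   if addr in net]
        if not matches:
            return None
        return max(matches, key=lambda m: m[0].prefixlen)[1]

vr = VirtualRouter()
vr.add_route("acme", "10.0.0.0/24", "pe-router-1")
vr.add_route("globex", "10.0.0.0/24", "pe-router-7")  # same prefix, no clash
print(vr.lookup("acme", "10.0.0.5"))
print(vr.lookup("globex", "10.0.0.5"))
```

The same destination address resolves differently per customer, which is exactly the illusion of a private network.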
Like virtual machines, virtual networks also enable a whole set of operational advances. They can be created rapidly under programmatic control; snapshots can be taken; networks can be cloned and migrated to entirely new locations, e.g., for disaster recovery.
There's still lots of room for growth in the virtual networking space. Modern cloud operators increasingly depend on virtual networks to automate their provisioning of services. Operators of emerging 5G networks are looking at options for virtualizing their networks.
For a more in-depth discussion of this topic, we refer you to this blog post, co-authored with Martin Casado, one of the pioneers of both SDN and network virtualization.
Earlier posts talked about the softwarization of the network in fairly general terms, but the idea got rolling ten years ago with the introduction of Software Defined Networks (SDN).
The fundamental idea of SDN is to decouple the network control plane (i.e., where routing algorithms like RIP, OSPF, and BGP run) from the network data plane (i.e., where packet forwarding decisions get made), with the former moved into software running on commodity servers, and the latter implemented by white-box switches like the ones described in Section 3.4 of the book. The original enabling idea of SDN was to define a standard interface between the control plane and the data plane so that any implementation of the control plane could talk to any implementation of the data plane; this breaks the dependency on any one vendor’s bundled solution. The original interface is called OpenFlow, and this idea of decoupling the control and data planes came to be known as disaggregation.
OpenFlow was a great first step, but a decade of experience has revealed that it is not sufficient as the interface for controlling the data plane. This is for the same reason any API layered on top of hardware falls short: it does not expose the full range of features that switch vendors put into their hardware. To address this shortcoming, the SDN community is now working on a language-based approach to specifying how the control and data planes interact. The language is called P4, and it provides a richer model of the switch's packet forwarding pipeline.
Another important aspect of disaggregation is that a logically centralized control plane can be used to control a distributed network data plane. We say logically centralized because while the state collected by the control plane is maintained in a global data structure (e.g., a Network Map), the implementation of this data structure could still be distributed over multiple servers (i.e., it could run in a cloud). This is important for both scalability and availability, with the two planes configured and scaled independently of each other. This idea took off quickly in the cloud, with today’s cloud providers running SDN-based solutions both within their datacenters and across the backbone networks that interconnect their datacenters.
A consequence of this design that isn’t immediately obvious is that a logically centralized control plane doesn’t just manage a network of physical (hardware) switches that interconnects physical servers, but it also manages a network of virtual (software) switches that interconnect virtual servers (e.g., Virtual Machines and containers). If you’re counting “switch ports” (a good measure of all the devices connected to your network) then the number of virtual ports in the Internet shot past the number of physical ports in 2012.
One of the other key enablers for SDN’s success, as depicted in the Figure, is the Network Operating System (NOS). Like a server operating system (e.g., Linux, iOS, Android, Windows) that provides a set of high-level abstractions that make it easier to implement applications (e.g., you can read and write files instead of directly accessing disk drives), a NOS makes it easier to implement network control functionality, otherwise known as Control Apps. A good NOS abstracts the details of the network switches and provides a “network map” abstraction to the application developer. The NOS detects changes in the underlying network (e.g., switches, ports, and links going up-and-down) and the control application simply implements the behavior it wants on this abstract graph. What that means is that the NOS takes on the burden of collecting network state (the hard part of distributed algorithms like Link-State and Distance-Vector) and the control app is free to simply implement the shortest path algorithm and load the computed forwarding rules into the underlying switches. By centralizing this logic, SDN is able to produce a globally optimized solution. The published evidence confirms this advantage (e.g., Google's private wide-area network B4).
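The division of labor just described can be made concrete with a short sketch. The network map format, switch names, and link costs below are invented for illustration; the point is that once the NOS hands the control app an abstract weighted graph, the app only has to run shortest-path (Dijkstra, here) and emit a next-hop rule per switch.

```python
# Sketch: a control app computing forwarding rules over the NOS's
# "network map" abstraction (switch -> {neighbor: link cost}).
import heapq

def shortest_path_rules(graph, dst):
    """For every switch, compute the next hop toward dst (Dijkstra from dst)."""
    dist = {dst: 0}
    next_hop = {}
    pq = [(0, dst)]
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                next_hop[neighbor] = node  # forward toward dst via `node`
                heapq.heappush(pq, (nd, neighbor))
    return next_hop

# Network map as reported by the NOS.
network_map = {
    "s1": {"s2": 1, "s3": 4},
    "s2": {"s1": 1, "s3": 1},
    "s3": {"s1": 4, "s2": 1},
}
rules = shortest_path_rules(network_map, dst="s3")
print(rules["s1"])  # s2: the path s1->s2->s3 costs 2, versus 4 for the direct link
```

The NOS did the hard distributed part (discovering the graph and keeping it current); the control app is just forty lines of graph algorithm.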
As much of an advantage as the cloud providers have been able to get out of SDN, its adoption in enterprises and Telcos has been much slower. This is partly a function of how well-equipped different markets are to manage their own networks. The Googles, Microsofts, and Amazons of the world have the engineers and DevOps skills needed to take advantage of this technology, whereas others still prefer pre-packaged and integrated solutions that support the management and command line interfaces they are familiar with. As is often the case, business culture changes more slowly than technology.
It is important to recognize the various perspectives on computer networks (e.g., that of network architects, application developers, end users, and network operators) to understand the technical requirements that shape how networks are designed and built. But this presumes all design decisions are purely technical, which is certainly not the case. Many other factors, from economic forces, to government policy, to societal influences, to ethical considerations, influence how networks are designed and built.
Of these, the marketplace is often the most influential, and corresponds to the interplay between network operators that sell access and connectivity (e.g., AT&T, Comcast, Verizon, DT, NTT, China Mobile), network equipment vendors that sell hardware to network operators (e.g., Cisco, Juniper, Ericsson, Nokia, Huawei, NEC), cloud providers that host content and scalable applications in their datacenters (e.g., Google, Amazon, Microsoft), service providers that deliver content and cloud apps to end-users (e.g., Facebook, Apple, Netflix, Spotify), and of course, subscribers and customers that download content and run cloud applications (i.e., individuals, but also enterprises and businesses). Not surprisingly, the lines between all these players are not crisp, with many companies playing multiple roles. For example, service providers like Facebook run their own clouds and network operators like Comcast and AT&T own their own content.
The most notable example of this cross-over is the large cloud providers, who (a) build their own networking equipment, (b) deploy and operate their own networks, and (c) provide end-user services and applications on top of their networks. It's notable because it challenges the implicit assumptions of the simple "textbook" version of the technical design process. One such assumption is that designing a network is a one-time activity. Build it once and use it forever (modulo hardware upgrades so users can enjoy the benefits of the latest performance improvements). A second is that the job of designing and implementing the network is completely divorced from the job of operating the network. Neither of these assumptions is quite right.
On the first point, the network’s design is clearly evolving. The only question is how fast. Historically, the feature upgrade cycle involved an interaction between network operators and their vendor partners (often collaborating through the standardization process), with timelines measured in years. But anyone that has downloaded and used the latest cloud app knows how glacially slow anything measured in years is by today's standards.
On the second point, the companies that build networks are almost always the same ones that operate them. The only question is whether they develop their own features or outsource that process to their vendors. If we once again look to the cloud for inspiration, we see that develop-and-operate isn’t just true at the corporate level, but it is also how the fastest moving cloud companies organize their engineering teams: around the DevOps model. (If you are unfamiliar with DevOps, we recommend you read "Site Reliability Engineering: How Google Runs Production Systems" to see how Google practices it.)
What this all means is that computer networks are now in the midst of a major transformation, due largely to market pressure being applied by agile cloud providers. Network operators are trying to simultaneously accelerate the pace of innovation (sometimes known as feature velocity) and yet continue to offer a reliable service (preserve stability). And they are increasingly doing this by adopting the best practices of cloud providers, which can be summarized as having two major themes: (1) take advantage of commodity hardware and move all intelligence into software, and (2) adopt agile engineering processes that break down barriers between development and operations.
This transformation is sometimes called the “cloudification” or “softwarization” of the network, but by another name, it’s known as Software Defined Networks (SDN). Whatever you call it, this new perspective will (eventually) be a game changer, not so much in terms of how we address the fundamental technical challenges of framing, routing, fragmentation/reassembly, packet scheduling, congestion control, security, and so on, but in terms of how rapidly the network evolves to support new features and to accommodate the latest advances in technology.
This general theme is important and we plan to return to it in future posts. Understanding networks is partly about understanding the technical underpinnings, but also partly about how market forces (and other factors) drive change. Being able to make informed design decisions about technical approach A versus technical approach B is a necessary first step, but being able to deploy that solution and bring it to market more rapidly and for less cost than the competition is just as important, if not more so.
Having not cracked open Computer Networks: A Systems Approach for several years, the thing that most struck me as I started to update the material is how much of the Internet has its origins in the research community. Everyone knows that the ARPANET and later TCP/IP came out of DARPA-funded university research, but even as the Web burst onto the scene in the 1990s, it was still the research community that led the way in the Internet's coming-of-age. There's a direct line connecting papers published on congestion control, quality-of-service, multicast, real-time multimedia, security protocols, overlay networks, content distribution, and network telemetry to today's practice. And in many cases, the technology has become so routine (think Skype, Netflix, Spotify), that it's easy to forget the history of how we got to where we are today. This makes updating the textbook feel strangely like writing an historical record.
From the perspective of writing a relevant textbook (or just making sense of the Internet), certainly it's important to understand the historical context. It is even more important to appreciate the thought process of designing systems and solving problems, for which the Internet is clearly the best use case to study. But there are some interesting challenges in providing perspective on the Internet to a generation that has never known a world without the Internet.
One is how to factor commercial reality into the discussion. Take video conferencing as an example. Once there was a single experimental prototype (vic/vat) used to gain experience and drive progress. Today there is Skype, GoToMeeting, WebEx, Google Hangouts, Zoom, UberConference, and many other commercial services. It's important to connect-the-dots between these familiar services and the underlying network capabilities and design principles. For example, while today's video conferencing services leverage the foundational work on both multicast and real-time protocols, they are closed-source systems implemented on top of the network, at the application level. They are able to do this by taking advantage of widely distributed points-of-presence made possible by the cloud. Teasing apart the roles of cloud providers, cloud services, and network operators is key to understanding how and where innovation happens today.
A second is to identify open platforms and specifications that serve as good exemplars for the core ideas. Open source has become an important part of today's Internet ecosystem, surpassing the role of the IETF and other standards bodies. In the video conferencing realm, for example, projects like Jitsi, WebRTC, and Opus are important examples of the state-of-the-art. But one look at the projects list on the Apache Foundation or Linux Foundation web sites makes it clear that separating the signal from the noise is no trivial matter. Knowing how to navigate this unbelievably rich ecosystem is the new challenge.
A third is to anticipate what cutting edge activity happening today is going to be routine tomorrow. On this point, the answer seems obvious. It will be how network providers improve feature velocity through the softwarization and virtualization of the network. By another name, this is Software Defined Networking (SDN), but more broadly, this represents a shift from building the network using closed/proprietary appliances to using open software platforms running on commodity hardware. This shift is both pervasive and transformative. It impacts everything from high-performance switch design, to architecting access networks (5G, Fiber-to-the-Home), to how network operators deal with lifecycle management, to the blurring of the line between the Internet and the Cloud. Recognizing that this transformation is underway is essential to understanding where the Internet is headed next.