Building a Scalable OpenFlow Network with MAC Based Routing

Fabrics have been the talk of the industry for the last few years, but they are being overshadowed by the buzz around software-defined networking (SDN) and OpenFlow. It is easy to see why, based on what I saw at last week's Open Networking Summit (ONS). There were dozens of vendors showcasing OpenFlow-based solutions, including six (Arista, Dell, Extreme, HP, IBM and vArmour) that were showing solutions based on the controller technology from Big Switch Networks.

One of the highlights of the show was the talk by Urs Hölzle from Google, where he announced that they are 100% OpenFlow across their inter-datacenter network. This is in stark contrast to the detractors who claim there are no OpenFlow networks in production today. Working on OpenFlow and SDN, I certainly have information to the contrary, but it’s difficult to announce it publicly when you work at a startup, especially when deployments are under NDA.

Now that there is a large-scale, publicly known OpenFlow network, I think it’s time to start putting this technology in the context of new and interesting architectures. You may take what I say with a grain of salt, or downright dismiss it as network heresy, and that’s OK. What I propose goes against everything we think we know about networks, which is exactly why I find it so compelling.

Building a (bigger) Better Bridge

Long ago we settled a debate in the network world: Layer 3 networks were more stable, easier to manage, and far less complex to troubleshoot than Layer 2. Layer 3 was being pushed all the way down to the access layer, and in some cases right into the host. Then virtualization came along and Layer 2 became (begrudgingly) required again. The industry responded with the concept of a fabric, which is essentially a scalable Layer 2 architecture. Unfortunately, the ideal Layer 2 architecture from a server administrator’s point of view is one that spans the entire data center, or even multiple data centers. This is well beyond the implementations of the most popular fabric offerings, such as Cisco’s FabricPath or Juniper’s QFabric, especially for web-scale data centers. With the advent of OpenFlow, building a massive bridged network without the limitations of today’s Layer 2 designs is not only possible but achievable with the technology on the table today.

The inner workings of OpenFlow are fairly simple. OpenFlow centers around the ability to program a flow table in a switch, matching patterns in packets and performing actions on the packets that match. The most basic example would be: for any packet with a source MAC of aa:aa:aa:aa:aa:aa and a destination MAC of bb:bb:bb:bb:bb:bb, forward it to port 20 on the switch.
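To make the match/action idea concrete, here is a minimal sketch in Python. It is only a toy model of a flow table, not the OpenFlow wire protocol or any controller's API; the names FlowEntry and FlowTable are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FlowEntry:
    """One row in a (hypothetical) flow table: match fields plus an action."""
    src_mac: str
    dst_mac: str
    out_port: int


class FlowTable:
    """Toy flow table: the first entry whose match fields equal the packet wins."""

    def __init__(self) -> None:
        self.entries: List[FlowEntry] = []

    def add(self, entry: FlowEntry) -> None:
        self.entries.append(entry)

    def lookup(self, src_mac: str, dst_mac: str) -> Optional[int]:
        for entry in self.entries:
            if entry.src_mac == src_mac and entry.dst_mac == dst_mac:
                return entry.out_port
        # Table miss: in a reactive design the switch would ask the controller.
        return None


table = FlowTable()
table.add(FlowEntry("aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb", 20))
port = table.lookup("aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb")
```

Here port comes back as 20, while a lookup for any other source/destination pair returns None, which stands in for a table miss.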

As a point of clarity, I have simplified the flow tables in all of the diagrams by referring to only two octets of each MAC address.

A loop free logical topology in a physically looped network.

The single-switch example explains the basic concept, but interesting possibilities open up when you start building a network topology. An SDN controller (a control plane decoupled from the switches) can take a network-wide view and program all of the switches along the path required to get packets from Host-A to Host-B.

Think about that for a few minutes, because it has some serious implications. If an SDN controller can program a path from end to end, there is no more need for explicitly managing the physical topology. All links in the network become potential bandwidth, and everything from STP and TRILL to EIGRP and OSPF can be eliminated by an intelligent SDN controller. Even with a physically looped topology like the example below, source-destination paths are loop free, rendering physical loops irrelevant. Traffic (unicast, multicast and broadcast) can always take the path desired; whether that is the shortest path or a traffic-engineered path becomes a question of network policy.

A loop free logical topology in a physically looped network.
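As a rough illustration of how a controller with a network-wide view could derive a loop-free path on a looped topology, here is a small Python sketch. The three-switch triangle, the port numbers, and the function names are all assumptions made up for this example, and breadth-first search is just one simple way to pick a shortest path; a real controller could apply any policy it likes.

```python
from collections import deque

# Hypothetical physically looped topology (a triangle).
# LINKS[switch][neighbor] = local port facing that neighbor.
LINKS = {
    "s1": {"s2": 1, "s3": 2},
    "s2": {"s1": 1, "s3": 2},
    "s3": {"s1": 1, "s2": 2},
}

# Hypothetical host attachment points: host MAC -> (switch, port).
HOSTS = {
    "aa:aa": ("s1", 10),
    "bb:bb": ("s3", 15),
}


def shortest_path(links, src, dst):
    """Breadth-first search; each switch is visited once, so the path has no loops."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for neighbor in links[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    if dst not in parents:
        raise ValueError(f"no path from {src} to {dst}")
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1]


def flows_for(links, hosts, src_mac, dst_mac):
    """Return (switch, src_mac, dst_mac, out_port) tuples along the chosen path."""
    src_switch, _ = hosts[src_mac]
    dst_switch, dst_port = hosts[dst_mac]
    path = shortest_path(links, src_switch, dst_switch)
    flows = [
        (here, src_mac, dst_mac, links[here][nxt])
        for here, nxt in zip(path, path[1:])
    ]
    # The last switch delivers to the host-facing port.
    flows.append((dst_switch, src_mac, dst_mac, dst_port))
    return flows


flows = flows_for(LINKS, HOSTS, "aa:aa", "bb:bb")
```

Even though the physical links form a loop, the flows the sketch produces touch each switch at most once, so nothing like STP has to block a link to keep traffic from circling.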

Your network begins to look like one big bridged network without the traditional limitations of large-scale Layer 2. Crazy, I know, but it is very possible, and networks across the globe are already in production using it.

Anyone with the slightest insight into switch internals will immediately say this is not scalable, because you will very quickly overrun the MAC address tables in even the most robust switches. Considering that a rack of virtualized servers can contain thousands or even tens of thousands of MACs, the concern appears to be well founded on the surface. In reality, two features of the OpenFlow architecture ease that concern fairly quickly, at least at the access layer of your network.

First, flows expire in a relatively short amount of time (usually 5-30 seconds) if they are not in use. This means that if Host-A moves or doesn’t need to communicate with Host-B anymore, the flows will expire from the switches automatically. The net result is that the current set of flows in a switch represents an almost real-time picture of what’s connected to the switch. Second, wildcard flows can be employed for the purposes of matching. For example, if multiple hosts on Switch-1 wanted to talk to Host-B on Switch-3, you may summarize them in a single flow such as: if src.MAC=* and dst.MAC=bb:bb:bb:bb:bb:bb then forward to port 15.
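Both features, idle expiry and wildcard matching, can be sketched in a few lines of Python. This is a toy model with made-up names (Flow, ExpiringFlowTable), and the clock is passed in explicitly so the behavior is easy to follow; a real switch tracks idle time itself.

```python
from dataclasses import dataclass
from typing import List, Optional

WILDCARD = "*"


@dataclass
class Flow:
    src_mac: str          # a MAC address or WILDCARD
    dst_mac: str          # a MAC address or WILDCARD
    out_port: int
    idle_timeout: float   # seconds of inactivity before the flow is removed
    last_used: float      # timestamp of the most recent matching packet


class ExpiringFlowTable:
    """Toy flow table with wildcard matching and idle timeouts."""

    def __init__(self) -> None:
        self.flows: List[Flow] = []

    def add(self, src_mac, dst_mac, out_port, idle_timeout, now) -> None:
        self.flows.append(Flow(src_mac, dst_mac, out_port, idle_timeout, now))

    def expire(self, now: float) -> None:
        # Drop every flow that has been idle for its full timeout.
        self.flows = [
            f for f in self.flows if now - f.last_used < f.idle_timeout
        ]

    def lookup(self, src_mac: str, dst_mac: str, now: float) -> Optional[int]:
        self.expire(now)
        for flow in self.flows:
            src_ok = flow.src_mac in (WILDCARD, src_mac)
            dst_ok = flow.dst_mac in (WILDCARD, dst_mac)
            if src_ok and dst_ok:
                flow.last_used = now  # a hit refreshes the idle timer
                return flow.out_port
        return None


table = ExpiringFlowTable()
# One wildcard flow covers every source on this switch talking to Host-B.
table.add(WILDCARD, "bb:bb:bb:bb:bb:bb", 15, idle_timeout=10, now=0)

first = table.lookup("aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb", now=5)
second = table.lookup("cc:cc:cc:cc:cc:cc", "bb:bb:bb:bb:bb:bb", now=12)
after_idle = table.lookup("aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb", now=30)
```

Two different sources hit the same single entry, and once the flow has sat idle for longer than its timeout the table is empty again, which is why the flow count tracks what is actually active.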

Access layer (or virtual access layer) switches can handle the number of flows needed at any given moment, but clearly aggregating those up to a distribution layer creates a challenge; one that is actually simple to overcome.

MAC Based Routing

One of the key requirements of any fabric solution is that the source and destination are presented with accurate L2 and L3 information about each other. The easiest way to do this today is to carry the L2/L3 headers all the way across the network using some form of tunneling. A number of solutions use this method, including LISP, VXLAN, and NVGRE, to build an ‘overlay network’ on top of your physical network. With OpenFlow, L2 information could be ‘recreated’ at the penultimate destination, thereby sparing the rest of the network from learning every MAC address and eliminating the need for tunneling within a data center.

In the example below, the SDN controller assigns each switch a MAC address. Because the controller knows the location of the source, the destination, and every switch in between, it can program simple rules that get the packet from Host-A to Host-B using MAC rewrite (which can be done in hardware at line rate). The aggregation switch (Switch-2 in this example) never has to learn any host MAC addresses.

MAC based routing can scale to tens of thousands of switches in a data center.
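The rewrite scheme can be walked through in a short Python sketch. Everything here is an illustrative assumption: the switch MACs, the port numbers, the Packet fields, and in particular the choice to have the last switch restore the host MAC by matching on the destination IP. Only the destination MAC is rewritten, to keep the example short.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_mac: str
    dst_mac: str
    dst_ip: str


# Hypothetical MACs handed out to the switches by the controller.
SWITCH_MAC = {"s1": "00:01", "s2": "00:02", "s3": "00:03"}

# Hypothetical record the controller keeps for Host-B.
HOST_B = {"mac": "bb:bb", "ip": "10.0.0.2", "switch": "s3", "port": 15}

# Aggregation switch table: keyed only on switch MACs, never on host MACs.
AGG_TABLE = {SWITCH_MAC["s1"]: 1, SWITCH_MAC["s3"]: 2}


def ingress_s1(pkt):
    """Edge switch: swap Host-B's MAC for the MAC of its switch, send uplink."""
    if pkt.dst_mac == HOST_B["mac"]:
        return replace(pkt, dst_mac=SWITCH_MAC[HOST_B["switch"]]), 1
    return pkt, None


def aggregate_s2(pkt):
    """Aggregation switch: forward purely on the destination switch MAC."""
    return pkt, AGG_TABLE.get(pkt.dst_mac)


def egress_s3(pkt):
    """Last switch: recreate the original L2 destination (assumed keyed on IP)."""
    if pkt.dst_mac == SWITCH_MAC["s3"] and pkt.dst_ip == HOST_B["ip"]:
        return replace(pkt, dst_mac=HOST_B["mac"]), HOST_B["port"]
    return pkt, None


def deliver(pkt):
    """Run the packet through all three switches and record each hop."""
    hops = []
    stages = (("s1", ingress_s1), ("s2", aggregate_s2), ("s3", egress_s3))
    for name, stage in stages:
        pkt, port = stage(pkt)
        if port is None:
            raise LookupError(f"no matching flow on {name}")
        hops.append((name, pkt.dst_mac, port))
    return pkt, hops


delivered, hops = deliver(Packet("aa:aa", "bb:bb", "10.0.0.2"))
```

The packet arrives at Host-B with its original destination MAC, yet the aggregation table only ever holds one entry per switch, which is the whole point of the scheme.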

Larger networks with tens of thousands of possible switch destinations still risk overrunning the aggregation or core switch MAC address tables. Building a multi-tier rewrite approach can solve the problem with ease, requiring little but a more intelligent SDN controller.  From here it doesn’t take too much imagination to express a possible switch destination with multiple MAC addresses that represent different QoS requirements, or any other network policy.

Taking this concept to a logical conclusion, we can use both the source and destination MAC fields as one big playground for policy and routing decisions. Claiming some of the bit space for a network ID (or tenant ID) would be a simple way to eliminate VLANs, QinQ, VXLAN, VRFs and MPLS. Both MAC fields can easily be rewritten at the edge, and devices in the middle of the network can use a combination of IP address and the network ID portion of the MAC addresses to provide isolation, route traffic, and apply other policy.
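One possible way to carve up the 48 bits is sketched below in Python. The layout is purely an assumption for illustration: the first octet is fixed at 0x02 (which marks the address as locally administered and unicast), the next 16 bits carry a tenant ID, and the last 24 bits identify a switch or endpoint. Other splits are just as valid.

```python
TENANT_BITS = 16
ENDPOINT_BITS = 24
LOCAL_PREFIX = 0x02  # locally administered, unicast first octet


def encode_mac(tenant_id: int, endpoint_id: int) -> str:
    """Pack a tenant ID and an endpoint ID into a 48-bit MAC address string."""
    if not 0 <= tenant_id < 2 ** TENANT_BITS:
        raise ValueError("tenant_id does not fit in 16 bits")
    if not 0 <= endpoint_id < 2 ** ENDPOINT_BITS:
        raise ValueError("endpoint_id does not fit in 24 bits")
    value = (LOCAL_PREFIX << 40) | (tenant_id << ENDPOINT_BITS) | endpoint_id
    # Emit the six octets from most significant to least significant.
    return ":".join(f"{(value >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


def decode_mac(mac: str):
    """Recover (tenant_id, endpoint_id) from a MAC built by encode_mac."""
    value = int(mac.replace(":", ""), 16)
    tenant_id = (value >> ENDPOINT_BITS) & (2 ** TENANT_BITS - 1)
    endpoint_id = value & (2 ** ENDPOINT_BITS - 1)
    return tenant_id, endpoint_id


def same_tenant(mac_a: str, mac_b: str) -> bool:
    """Isolation check a mid-network device could apply using only the MACs."""
    return decode_mac(mac_a)[0] == decode_mac(mac_b)[0]


mac = encode_mac(tenant_id=42, endpoint_id=3)
```

With this layout, tenant 42 and endpoint 3 encode to 02:00:2a:00:00:03, and a device in the middle of the network can enforce isolation by comparing the tenant bits of the source and destination MACs.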

Networks have refused to evolve at the same pace as server virtualization technologies. SDN and OpenFlow have the potential to revolutionize networking and enable it to meet the challenges of today’s business requirements. I can’t wait to see what the rest of the industry cooks up over the next few years.

Categories: Data Center, Networking, SDN & OpenFlow, Virtual Networking


14 replies

  1. A couple of questions:

    1. How does HostA get the MAC addr of HostB? An ARP is going to broadcast through the loop. Can the SDN offer a proxy-arp at the first hop and then when it sees a packet come in for his MAC, rewrite the MAC on the packet to that of HostB?

    Also, how effectively can other broadcasts be controlled (like if some moron insists on a Windows server plugged into the network)?

    2. How well can this scale? to 65k or 128k end hosts?

    3. When a flow times out and then that conversation starts up again, what is the delay for the switch to contact the controller to get new flow info?

    • All great questions.

      1) If you are trying to build ‘virtual networks’ then more than likely the SDN controller will need to build broadcast trees per network. There are a couple ways this could be enforced. The controller can insert flows for the broadcast tree or the controller could handle broadcast directly.

      2) Scale will really depend on the controller but 128k hosts seems very reasonable to me.

      3) This is very dependent on the controller, switch hardware, and network topology, but 5-50ms would be a good estimate.

      • Thanks for the quick response.

        1) I’m thinking what if someone really configured a /8 network (ok, so that would be 16M, not 128k which assumes a /15)? I’m assuming max flexibility and anything could be anywhere (that’s why you go single subnet in the first place, right?) so I’m guessing no virtual networks.

        Or are you saying that we would still need a loop-free tree to handle broadcasts and multicasts, just not necessarily STP?

        3) Might that 5 – 50ms penalty be imposed at each physical hop? Have you had any issues from application admins who are used to network latency closer to the 10us mark?

      • Actually with this scheme any device may be anywhere even without a /8. I think /8 network designs solve many problems but you still want to provide some level of broadcast isolation. This is one of the beauties of OpenFlow, we can now do things that just aren’t possible (or really hard to do) with existing network architecture.

        As for flow setup latency the controller may do this in parallel so 5-50ms end to end is very reasonable. In fact, with this scheme there would be minimal flow setup in the middle of the infrastructure and you may only need a flow at the access layer. All of the switch to switch flows with MAC rewrite could be proactively inserted when a switch comes online.

        If an App really can’t take a 50ms hit on the first packet of a flow you could always make the access layer flows proactive as well.

  2. Hi Dan,

    Thanks for the wonderful article! I have one question for you. Are you aware of any open source implementation of MAC based routing in SDN?

  3. Hi Dan
    I am interested in working on “MAC Based Routing”. I have gone through all of your labs except this one. If you have a step-by-step lab for “MAC Based Routing”, please forward it. I am a new user of all this, so I need something that goes from the basic level to advanced.

    • I don’t have a step-by-step guide, but if I find some spare time I’ll see what I can pull together. Realistically this requires an intelligent application, but I suspect it could be pulled off in a hackish way with static flows.

  4. Thank you for the great article!
    I have quick questions:
    1) Is the MAC routing you described a general (common) one in the industry?
    2) If the controller assigns the destination switch’s MAC address, we don’t need ARP in this structure?

    • I would not call MAC routing common, but there are implementations of it in the market. It is an interesting approach to solving a problem, one that is easier to implement because of the advent of OpenFlow. The reality is we focus far too much on the how instead of the end result.

      Relative to ARP, the important thing is not to change the implementation of ARP on the host side, which this example does not. It will certainly require support from an OpenFlow controller/app.

      To be clear there are other considerations to take into account. This is simply a framework to start from.

  5. Good evening. You mentioned that there are two ways to handle ARP broadcasts, but how is the ARP reply generated?
    1. The controller can insert flows for the broadcast tree.
    2. The controller could handle broadcast directly.

