Unification Part 2: FCoE Demystified

As promised, here is the second part of the Unified Fabric post, where we get under the covers of FCoE.

The first and most important thing to clarify is that, as its name suggests, Fibre Channel over Ethernet (FCoE) still uses the Fibre Channel protocol, and as such all the higher-level processes that need to happen in a native Fibre Channel environment (FLOGI, PLOGI, etc.) still need to happen in an FCoE environment.

Having a good understanding of native Fibre Channel operations is therefore key, so let's start with a quick native Fibre Channel recap.

For the IP networker I have added, in parentheses, the corresponding IP services that can be very loosely mapped to each Fibre Channel process, to aid understanding.

Native Fibre Channel

Initiators/targets contain Host Bus Adapters (HBAs), which in Fibre Channel terms are referred to as Node Ports (N Ports).

These N Ports are connected to Fabric Ports (F ports) on the Fibre Channel switches.

Fibre Channel switches are then in turn connected together via Expansion Ports (E Ports), or if both switches are Cisco you also have the option of trunking multiple Virtual SANs (VSANs) over the E Ports, in which case they become Trunking Expansion Ports (TE Ports).

First the initiator (server) sends out a Fabric Login (FLOGI) to the well-known address FFFFFE; this FLOGI presents the unique 64-bit World Wide Port Name (WWPN) of the HBA (think MAC address) to the fabric login server on the switch.

Within the fabric a "principal switch" is elected; by default the election is decided by switch priority, with the lowest switch WWN used as the tie-breaker.
The principal switch is in charge of issuing the Domain IDs to all the other switches in the fabric.

The switch then sends the initiator back a unique, routable 24-bit Fibre Channel Identifier (FC_ID), also referred to as an N_Port_ID (think IP address), in its FLOGI accept; the 24-bit FC_ID is expressed as 6 hexadecimal digits.

So the basic FLOGI conversation goes something like: "Here's my unique burned-in address, send me my routable address" (think DHCP).

The 24bit FC_ID is made up of 3 parts:

• The Domain ID, which is assigned by the principal switch to the Fibre Channel switch to which the host connects.
• The Area ID, which in practice identifies the switch port the HBA is connected to.
• The Port ID, which refers to the individual port address on the host HBA itself.

The above format ensures FC_ID uniqueness within the fabric.

FCID
Figure 1 Fibre Channel Identifier

Once the initiator receives its FC_ID, it then sends a Port Login (PLOGI) to the well-known address FFFFFC, which registers its WWPN and assigned FC_ID with the Fibre Channel Name Server (FCNS). (Think of the FCNS like DNS.) The FCNS then returns the FC_IDs of all the targets the initiator has been allowed to access via the zoning policy.

Once the PLOGI is completed, the initiator starts a discovery process to find the Logical Unit Numbers (LUNs) it has access to.
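For reference, zoning on a Cisco MDS or Nexus fabric switch looks something like the sketch below; the zone and zoneset names, VSAN number and pWWNs are made-up examples, so substitute your own:

! Zone containing one initiator and one target pWWN
zone name Z_ESX01_HBA1_ARRAY1_SPA vsan 10
  member pwwn 20:00:00:25:b5:01:0a:01
  member pwwn 50:06:01:60:3b:a0:01:23

! The zoneset groups the zones and is activated per VSAN (per fabric)
zoneset name ZS_FABRIC_A vsan 10
  member Z_ESX01_HBA1_ARRAY1_SPA

zoneset activate name ZS_FABRIC_A vsan 10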

The FLOGI database is locally significant to the switch and only shows the WWPNs and FC_IDs of directly attached initiators/targets; the FCNS database, on the other hand, is distributed across all switches in the fabric and shows all reachable WWPNs and FC_IDs within the fabric.
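On a Cisco MDS or Nexus switch the two databases can be compared with the following show commands (VSAN 10 again being just an example value):

! N Ports that performed FLOGI on this particular switch
show flogi database vsan 10

! Name server view of every WWPN/FC_ID reachable in the VSAN
show fcns database vsan 10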

Native Fibre Channel Topology
Figure 2 Native Fibre Channel Topology.

OK History lesson over.

The Fibre Channel protocol has long proven to be the best choice for block-based storage (storage that appears as locally connected), and FCoE simply takes all that tried and tested Fibre Channel performance and stability and offers an alternative physical transport, in this case Ethernet.

But replacing the Fibre Channel transport did come with its challenges. The Fibre Channel physical layer creates a "lossless" medium by using buffer credits; think of a line of people passing boxes down the line: if the next person does not have empty hands (an available buffer), they cannot receive the next box, so the flow is "paused" until the box can again be passed.

Ethernet, on the other hand, expects drops and relies on windowing by upper-layer protocols to avoid overwhelming the receiver. Instead of a line of people passing a box from hand to hand, think of a conveyor belt with someone loading boxes onto it at an ever increasing speed, until they hear shouts from the other end that boxes are falling off, at which point they slow their loading rate and then gradually speed up again.

So the million dollar question was how to send a “lossless” payload over a “lossy” transport.

The answer was a series of enhancements to the Ethernet standard, generally and collectively referred to as Data Centre Bridging (DCB).

Fibre Channel over Ethernet

OK so now we have had a quick refresher on Native Fibre Channel, let’s walk through the same process, in the converged world.

First of all, let's get some terminology out of the way:

End Node (ENode): the end host in an FCoE network containing the Converged Network Adapter (CNA); this could be a server or an FCoE-attached storage array.

Fibre Channel Forwarder (FCF): a switch that understands both the Ethernet and Fibre Channel protocol stacks.

NB: An FCF is required whenever FC encapsulation/de-encapsulation is required, but as an FCoE frame is a legal tagged Ethernet frame it could be transparently forwarded over standard Ethernet switches.

The next thing to keep in mind is that Fibre Channel and Ethernet work very differently. Ethernet is an open multi-access medium, meaning that multiple devices can exist on the same segment and can all talk to each other without any additional configuration.

Fibre Channel, on the other hand, is a closed point-to-point medium, meaning that there should only ever be point-to-point links, and hosts by default cannot communicate with each other without additional configuration called zoning (think Access Control List).

So if you keep in mind that in an FCoE environment we are creating two separate logical point-to-point Fibre Channel fabrics (A and B), just like you have in a native Fibre Channel environment, you should be in pretty good shape to understand what configuration is required.

So, as explained in the native Fibre Channel refresher above, an N Port in a host connects to an F Port in a switch, and that switch connects to another switch via an E Port. Similarly, in the FCoE world we have a Virtual N Port (VN_Port) which connects to a Virtual F Port (VF_Port) in the FCF, and if two FCFs need to be connected together this is done with Virtual E Ports (VE_Ports).

As can also be seen in the below figure, because the FCF is fully conversant in both Ethernet and Fibre Channel, as long as it has native FC ports configured it can quite happily have native FC initiators and targets connected to it.

FCoE
Figure 3: Multi-Hop Fibre Channel over Ethernet Topology

So, as can be seen above, an FCoE network is a collection of virtual Fibre Channel links, carried over and mapped onto an Ethernet transport. But what makes the logical links between the VN_Ports, VF_Ports and VE_Ports? Well, a few control protocols are required, collectively known as the FCoE Initialisation Protocol (FIP), and it is FIP which enables the discovery and correct formation of these virtual FC links.

Under each physical FCoE Ethernet port of the FCF a virtual Fibre Channel Port (vfc) is created, and it is the responsibility of FIP to identify and create the virtual FC link.

Each virtual FC link is identified by three values: the MAC addresses at either end of the virtual circuit and the FCoE VLAN ID which carries the encapsulated traffic.

Every FC encapsulated packet must use a VLAN ID dedicated and mapped to that particular VSAN. No IP data traffic can co-exist on a VLAN designated on the Nexus switch as an FCoE VLAN. If multiple VSANs are in use, a separate FCoE VLAN is required for each VSAN.
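To make that concrete, on a Nexus 5000/5500 acting as the FCF the VSAN, its dedicated FCoE VLAN and the VF_Port towards the ENode are configured roughly as below; a hedged sketch, with VSAN 100, VLAN 100 and the interface numbers chosen purely as examples:

feature fcoe

vsan database
  vsan 100

! Dedicate VLAN 100 to carry VSAN 100 (no IP data traffic on this VLAN)
vlan 100
  fcoe vsan 100

! The vfc is the VF_Port; it is bound to the physical port the CNA connects to
interface vfc110
  bind interface Ethernet1/10
  switchport trunk allowed vsan 100
  no shutdown

! Place the VF_Port into the VSAN
vsan database
  vsan 100 interface vfc110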

As we know, Ethernet has no inherent loss prevention mechanism, so an additional protocol was required to prevent any loss of Fibre Channel frames traversing the Ethernet links in the event of congestion. A sub-protocol of the Data Centre Bridging standard called Priority Flow Control (PFC), IEEE 802.1Qbb, ensures zero packet loss by providing a link-level flow control mechanism that can be controlled independently for each frame priority. Alongside it, Enhanced Transmission Selection (ETS), IEEE 802.1Qaz, enables the consistent management of QoS at the network level by providing consistent bandwidth scheduling.
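By way of example, on a Nexus 5500 the pre-defined FCoE policies (which create the no-drop class-fcoe and give it a guaranteed bandwidth share via ETS) can be applied under system qos along these lines; treat this as a sketch, as the defaults and policy names can vary by platform and NX-OS release:

system qos
  service-policy type qos input fcoe-default-in-policy
  service-policy type queuing input fcoe-default-in-policy
  service-policy type queuing output fcoe-default-out-policy
  service-policy type network-qos fcoe-default-nq-policy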

Fibre Channel encapsulated frames are marked with an Ethertype of 0x8906 by the CNA and can thus be correctly identified, queued and prioritised by PFC, which places them in a no-drop queue with a Class of Service (CoS) value of 3, the default for encapsulated FC traffic. FIP is identified by the Ethertype 0x8914.

Before the FIP negotiation can start, the physical Ethernet link needs to come up and be correctly configured; this is a job for the Data Centre Bridging Capability Exchange (DCBX) protocol, which makes use of the Link Layer Discovery Protocol (LLDP) to configure the CNA with the settings (PFC and ETS) specified on the switch to which the CNA is connected.

FIP can then establish the virtual FC links between VN_Ports and VF_Ports (ENode to FCF), as well as between pairs of VE_Ports (FCF to FCF), since these are the only legal combinations supported by native Fibre Channel fabrics.
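In a multi-hop FCoE design, the VE_Port between two FCFs is also just a vfc, this time bound to the FCF-to-FCF Ethernet link and set to mode E; another hedged sketch (interface and VSAN numbers are examples, and exact support depends on platform and NX-OS release):

! Virtual ISL between two FCFs
interface vfc201
  bind interface Ethernet2/1
  switchport mode E
  switchport trunk allowed vsan 100
  no shutdown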

Once FIP has established the virtual FC circuit, it identifies the FCoE VLAN in use by the FCF and then prompts the initialisation of FLOGI and fabric discovery.

The diagram below shows the FIP initialisation process; the green section is FIP, which is identified by the Ethertype 0x8914, and the yellow section is FCoE, identified by the Ethertype 0x8906.

FIP

It is also worth noting that the ENode uses different source MAC addresses for FIP and FCoE traffic: FIP traffic is sourced using the burned-in address (BIA) of the CNA, whereas FCoE traffic is sourced using a Fabric Provided MAC Address (FPMA).

FPMAs are made up from the 24-bit Fibre Channel ID (FC_ID) assigned to the CNA during the FIP FLOGI process. This 24-bit value is appended to another 24-bit value called the FCoE MAC address prefix (FC-MAP), of which there are 256 predefined values, but as the FC_ID is already unique within the fabric, Cisco applies a default FC-MAP of 0E-FC-00. So, for example, an FC_ID of 61:01:01 combined with the default FC-MAP gives an FPMA of 0E:FC:00:61:01:01.

FPMA
Figure 4 Make-up of the Fabric Provided MAC Address (FPMA)

The fact that FIP and FCoE make use of a tagged FCoE VLAN means that each Ethernet port configured on the FCF must be configured as a trunk port, carrying the FCoE VLAN along with any required Ethernet VLANs. If the server only requires a single data VLAN, then this VLAN should be configured as the native VLAN on the physical Ethernet port to which the ENode connects, as shown below.
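Continuing the earlier example values (data VLAN 10 as the native VLAN, FCoE VLAN 100), the ENode-facing Ethernet port on the FCF would look roughly like this:

interface Ethernet1/10
  switchport mode trunk
  ! The server's data VLAN arrives untagged
  switchport trunk native vlan 10
  ! Data VLAN plus the FCoE VLAN
  switchport trunk allowed vlan 10,100
  ! Host-facing port, so edge trunk
  spanning-tree port type edge trunk
  no shutdown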

OK, it would only be right for me to include a bit on how Cisco UCS fits into all this.

Well as we know the Cisco UCS Fabric Interconnect by default is in End Host Mode for the Ethernet side of things and in N-Port Virtualisation (NPV) mode for the storage side of things.

This basically means the Fabric Interconnect appears to the servers as a LAN and SAN switch, but appears to the upstream LAN and SAN switches as just a big server with lots of HBAs and NICs inside.

There are many reasons why these are the default modes, but the main ones are scale, simplicity and safety. On the LAN side, having the FI in End Host Mode prevents the possibility of bridging loops forming between the FI and the upstream LAN switch, and on the SAN side, as each FI is pretending to be a host, the FI does not need a Fibre Channel Domain ID, nor does it need to participate in all the Fibre Channel domain services.

As can be seen from the below Figure in the default NPV mode the Cisco UCS Fabric Interconnect is basically just a proxy. Its server facing ports are Proxy F ports and its Fabric facing (uplink) ports are Proxy N ports.

Again note no FC Domain ID is required on the Fabric Interconnects.

Also note that, as we are using Unified Uplinks from the FI to the Nexus (FCF), we cannot use Virtual Port-Channels (vPCs) to carry the FCoE VLAN, as the FCoE VLAN and its corresponding VSAN should only exist on a single fabric. We could of course create an Ethernet-only vPC and then have a separate Unified Uplink carrying the FCoE VLAN to the local upstream Nexus, but if you're going to do that, you may as well have just stuck with a vPC and native Fibre Channel combo.

As would be the case with any multi-VSAN host, the Cisco Nexus ports which are connected to the UCS FI are configured as Trunking F (TF) ports.
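On the upstream Nexus FCF this means enabling NPIV and presenting F ports towards the FI; a rough sketch with example numbering (vfc interfaces trunk their allowed VSAN list by default, which is what makes this a TF port):

feature npiv

! Unified uplink from the UCS Fabric Interconnect (FI in NPV/end-host mode)
interface vfc150
  bind interface Ethernet1/5
  switchport mode F
  switchport trunk allowed vsan 100
  no shutdown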

FCoE NPV
Figure 5 FCoE with Cisco UCS Unified Uplinks.

Well, I hope you found this post useful. I'll certainly be back referencing it myself during the storage elements of my CCIE Data Center studies, as it is certainly useful having all the elements of a multi-hop FCoE environment, along with the native Fibre Channel processes, in a single post.

Until next time.

Colin


Unification Part 1: The Rise of the Data Centre Admin.

This is the first of a 2-Part Post: Part one is a non-technical primer. Then in part two we have some fun sorting out your LLDP from your DCBX with sprinkles of ETS, covered in a PFC sauce topped off with a nice FIP cherry.

Part-1

In this new world of convergence and unification, I seem to spend a lot of my time either teaching "traditional networkers" SAN principles and configuration, or on the other side of the coin teaching "traditional storage" people networking principles and configuration.

These historically siloed teams are increasingly having to work together in order to create a holistic unified/converged network.

It is still quite common for me to get requests from clients to create separate "SAN Admin" and "LAN Admin" accounts on the same Cisco Nexus switch and enforce the privileges of each account via Role Based Access Control (RBAC), and there is, by the way, absolutely nothing wrong with that, especially if both the LAN and SAN are complex environments.

However, there is an ever-increasing overlap and grey area between the roles of the LAN and SAN administrator, and in a world ever more focused on efficiency, simplicity and reduced support costs, the role of "Data Centre Administrator" is on the rise.

I'm glad to say that I very rarely get dragged into debates about the validity of FCoE these days, as it has undoubtedly proven to be a "no brainer" at the edge of the network, with its significant efficiencies: reduced costs, fewer HBAs and switch ports, and all the associated power and cooling reductions that go along with them.

And once the transition to FCoE at the edge is complete, you have to ask yourself: is there any real benefit in maintaining native FC links within the network core, or would it be simpler to just bring everything under the Ethernet umbrella?

While the efficiencies and savings of a multi-hop FCoE network are not as much of a "no brainer" as they are at the edge, in my book there's a lot to be said for having the same flavour of SFPs throughout the entire network, along with no need to allocate native FC ports in your Nexus switches or Cisco UCS Fabric Interconnects (unless you have FC-only hosts/arrays somewhere in the network).

In all my years in IT, this topic may well be the one with the most abbreviations: DCB, DCBX, LLDP, PFC, ETS and FIP to name just a few, which I think has led to a perception of complexity. However, while there is certainly a lot of clever tech going on "under the hood", the actual configuration and business-as-usual tasks are quite simple.

So with all of the above in mind, Part 2 of this post will cover much of the information you need to know as the “Data Centre Admin” in order to survive in a unified Cisco Nexus Environment.


OTV doesn’t kill people, people kill people.

I was designing a data centre migration for one of our clients; they had two DCs 10 km apart, connected with some dark fibre.

Both DCs were in the south of England, but the client needed to vacate the current buildings and move both data centres up north (circa 300 miles / 500 km away). As ever, this migration had to be done with minimal disruption, and at no point could the client be without DR, meaning we couldn't simply turn one off, load it in a truck, drive it up north and then repeat for the other.

The client also had a requirement to maintain 75% of service in the event of a single DC going offline. Their current DCs were active/active, and each could take on 50% of the other DC's load if required, meeting this 75% service availability SLA.

Anyway, cutting a long story short, I proposed that we locate a pair of Cisco ASR 1000s in one of the southern DCs and a pair in each of the northern DCs, and use Cisco's Overlay Transport Virtualisation (OTV) to extend the necessary VLANs between all four locations for the period of the migration.

OTV

As would be expected at this distance, the latency across the MPLS cloud connecting the southern and northern data centres (circa 20 ms) was too great to vMotion the workloads, but the VMs could be powered off, cold migrated and powered back up again in the north, and by doing this intelligently DR could be maintained.

The major challenge was that there were dozens of applications and services within these DCs, some of which were latency-sensitive financial applications, along with all the internal firewalling and load balancing that comes along with them.

The client, still pretty much a Cisco Catalyst house, was unfamiliar with newer concepts like Nexus and OTV, but they immediately saw the benefit of this approach, as it allowed a staged migration and great flexibility while protecting them from a lot of the issues they were historically vulnerable to, having traditionally extended Layer 2 natively across the dark fibre between their two southern data centres.

Being a new technology to them, the client understandably had concerns about OTV, in particular around the potential for suboptimal traffic flows, which could send their latency-sensitive traffic on unnecessary "field trips" up and down the country during the period of the migration when the north and south DCs were connected.

I was repeatedly asked to reassure the client about the maturity of OTV, and I lost count of how many times I had to whiteboard the intricacies of how it works, covering topics like First Hop Redundancy Protocol isolation and how broadcast and multicast traffic is handled over OTV.

My main message, though, was: "Forget about OTV, it's a means to an end. It does what it does, and it does it very effectively; however, it does not replace your brain. There are lots of other considerations to take into account, and all your concerns would be just as valid, if not more so, if I just ran a 500 km length of fibre up the country and extended L2 natively", which is what the client was already doing, was already comfortable with, and had accepted the risks of doing.

This got the client thinking along the right lines: while OTV certainly facilitated the migration approach, careful consideration of what, when, how and in which order workloads and services were migrated would be the crucial factor, and that actually had nothing to do with OTV at all.

The point being that an intelligent and responsible use of the technology was the critical factor, and not the technology itself.

So just remember OTV doesn’t kill people, people kill people.

Stay safe out there.
Colin


Cisco UCS has had a baby (Mother and Daughterboard doing well)

As many of you know I am now in full CCIE Datacenter study mode, and as such, I never seem to have as much time to blog and answer posted questions as I would like. However I felt compelled to take a break from my studies to write a post on the new Cisco UCS generation 3 Fabric Interconnect.

I noticed the other day that Cisco have released the data sheet on the latest member of the Cisco UCS family, the Cisco 6324 Fabric Interconnect, which is great because I can now finally blog about it.

http://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-6300-series-fabric-interconnects/datasheet-c78-732207.html

Having been waiting for this new FI for a long time, I immediately contacted our purchasing team to get a quote, with a view to getting one in for our lab so I can have a good play with it, and I was again pleased to see that the 6324 is listed on Cisco Commerce Workspace (CCW), albeit still on New Product Hold.

The main reason I have been waiting for this product is that it meets a few use cases which UCS historically never really addressed at the level I wanted with a "full-fat" B-Series deployment, but which my customers needed. These use cases were generally smaller requirements like DMZs or remote/branch offices.

Sure, I could use some standalone C-Series rack mounts, but I really want the power of UCS Manager, and to consolidate all these UCS domains under UCS Central and integrate them with UCS Director.

And that is where the new Cisco 6324 Fabric Interconnect IO Module comes in: it brings all the power and features of a full-scale UCS solution, but at the scale and price point that suits these smaller use cases. The best of both worlds, if you like.

So what does this new solution look like?

Well as can be seen from the above data sheet and the below figure, the Fabric Interconnects occupy the IO Module slots in the Chassis.

5108 v2 Chassis with 6324 FI IOM

If we look at the new Fabric Interconnect a little closer, we see there are 4 x 10G unified ports and 1 x 40G QSFP+ port, and as can be seen from the below image there are a number of connectivity options available, including direct-attached storage and up to 7 directly attached C-Series rack mount servers, allowing a total of 15 servers within the system.

6324 Fabric Interconnect

Internally the 6324 Fabric Interconnect provides 2 x 10Gb Traces (KR Ports) to each half width blade slot (think 2204XP)

But I’m sure you are wondering what happened to the L1 and L2 cluster ports, which would allow two Fabric Interconnects to cluster and form an HA pair.

Well, that explains why there is also a new chassis being released. This updated 5108 chassis is fully backwards compatible and has hardware support for all past, present and foreseen Fabric Interconnects, IO Modules, power supplies and servers, although remember it is actually the version of UCS Manager which determines the supported hardware.

This new chassis not only supports a new dual-voltage power supply but also comes with a new backplane, and part of that new backplane, yes you guessed it, are the required traces to support the 1 Gbit cluster interconnect and primary heartbeat between the 6324 Fabric Interconnects (the 2104/2204/2208 IOMs, if used, are unaffected).

The secondary heartbeat still runs over the Chassis SEEPROM as per the traditional UCS method (See my previous post on Cisco UCS HA)

So a new 6324 based solution could look like the following, which I’m sure you’ll agree is more than suitable for all the use cases I mentioned above.

Fully Deployed 6324 FI IOM

At First Customer Ship (FCS) the servers supported for use with the 6324 FI are the B200M3, C220M3 and C240M3.

Anyway I for one can’t wait to get my hands on this for a good play, and am really excited about all the possibilities for future updates that this platform allows.

Watch this space carefully, I feel Cisco have some big plans for this new arrival.

Regards
Colin


The King is Dead, Long live the King!

Huge congratulations to Cisco for achieving number 1 in the x86 blade server market in only 5 Years since launch.

Cisco No.1

According to the latest IDC Worldwide Quarterly Server Tracker (2014 Q1), Cisco UCS, which turned 5 years old this year, has hit the number one spot for x86 blade server market share in the Americas and the No. 2 spot worldwide.

To go from zero to No.1 in only 5 years from a standing start is an awesome achievement, and a real credit to all those involved.

In the 5 years that I have been an SME for Cisco UCS, I have seen this traction first hand, and I still get a great buzz from seeing the lights switch on when people "get it".

This latest news only gets me more excited about Cisco Application Centric Infrastructure (ACI), as many of the same great minds that brought us Cisco UCS developed Cisco ACI.

Congrats!

Regards
Colin


#EngineersUnplugged ACI Edition with Colin Lynch and Hal Rottenburg


Colin Lynch and Joe Onisick Talk Cisco ACI

Listen to Cisco Champion radio with Joe Onisick @jonisick and Colin Lynch @UCSguru on Cisco ACI and Nexus 9000 hosted by Amy Lewis @CommsNinja


Behind The Cisco Service Request

Ever wondered who's on the other end of your Cisco Service Request? Well, wonder no more, as I put on my Cisco Champion hat and play journalist for the week at Cisco Live Milan.


The SDN Meteor is coming

When you next look up at the night sky, you may see a bright speck in the distance, and that bright speck is set to get a lot brighter.

The speck of which I speak is Software Defined Networking (SDN), and it is set to change the network as we know it forever, perhaps a lot sooner than first thought.

With the "commoditisation" of pure SDN solutions and hybrid SDN solutions which also harness custom ASICs, things will change! Maybe not today, maybe not tomorrow, but they will change.

We have plenty of warning about this meteor strike, not so that we can try to divert it, as impact is inevitable, but fair warning to prepare for it and to evolve our traditional networking skill set in time.

I do not see the results of this strike being an immediate extinction-level event for traditional networkers, but more like a huge lake gradually drying up.

At the moment the lake is huge and teeming with life, but gradually, as businesses move towards SDN solutions, the traditional networking lake will slowly start to dry up, until a few who are unwilling to adapt are flapping in a pool of mud awaiting their imminent fate.

This is not by any means meant to be a doom and gloom "the end of the traditional networking world is nigh" type post, but a positive one: the networking world is about to get really interesting, brought kicking and screaming into the modern world of flexibility, agility and fast provisioning. And I for one am not close enough to retirement age to ignore it, and am actually quite looking forward to the new challenge.

Having attended Cisco Live Europe and VMware PEX this month, I’ve spoken at length to the relevant business units, and I am very much encouraged by the commitments and training road maps being put in place to bring us “Traditional Networkers” on this new and exciting journey ahead.

Colin


Cols Guide to… VXLAN

Your indispensable guides to making your IT life simpler.

So what is VXLAN and why do we need it?

Well, put simply, it's VLAN with an X in the middle🙂 the X standing for eXtensible. VXLAN was a joint project between Cisco, VMware, Red Hat and Citrix, which is why it has been so widely adopted and underpins the majority of SDN offerings.
And as to why we need it, well, that's mainly to address two limitations of regular VLANs: scale and flexibility.

Scale:
As we all know, standard 802.1Q VLANs scale to just over 4000 VLAN IDs, and while that sounds a lot and is fine in most cases, large service providers, enterprises and multi-tenant environments would certainly need more.

VXLAN encapsulates the standard Ethernet frame and adds a header to it, including a 24-bit VXLAN ID field, which increases the number of logical segments from around 4096 VLANs to 16 million, while only adding approximately 50 bytes of overhead to the frame (outer MAC, outer IP, UDP and VXLAN headers).

Flexibility:
In this world of ever increasing workload flexibility and agility, we need a way of quickly and safely providing connectivity between virtual machines anywhere in the network where we have capacity.
Historically this was done by extending VLANs everywhere a virtual machine might be required, which as we all know comes with a raft of potential issues around scale, complexity and resiliency.
Because the Layer 2 frame is encapsulated into an IP packet, it can now cross Layer 3 boundaries! This opens up a whole raft of use cases.

These use cases include, but are certainly not limited to:
• Running Layer 3 all the way to the edge of your network and then mapping your VXLANs over the top (overlay), giving you the best of both worlds: an L3 transport, but Layer 2 adjacency/reachability wherever you need it.
• Extending your Layer 2 into any public/hosted cloud, allowing you to move VMs in and out of a hosted service as and when you need to (cloud burst).
• Extending a VLAN over a Layer 3 Data Centre Interconnect (DCI) for Disaster Recovery (DR) to allow VM mobility between Data Centres.

Also, IP packets make much better use of port-channelled links, unlike other encapsulation technologies such as MAC-in-MAC.
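One practical point worth adding: the roughly 50 bytes of encapsulation mentioned above means the transport network between VTEPs needs a slightly larger MTU so encapsulated frames are not dropped. On a Nexus 5000, for example, that is done with a network-qos policy along these lines (a sketch only; if FCoE is also running you would typically add the larger MTU to your existing network-qos policy, as only one can be active at a time):

policy-map type network-qos JUMBO
  class type network-qos class-default
    mtu 9216

system qos
  service-policy type network-qos JUMBO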

So how does VXLAN work?

The VXLAN-enabled switch (the Nexus 1000v VEM in my example below) learns the VM's MAC address and the assigned VXLAN ID; it then encapsulates the frame according to the port profile the VM is assigned to.
When the VM first comes online, the VEM assigns it to a defined multicast group, which carries all broadcast, unknown unicast and multicast (B/U/M) traffic. Known unicasts are sent directly to the correct destination VEM/port.
Although all VMs/tenants are assigned to the same multicast group, the VXLAN segment IDs are used to deliver traffic only within the same VXLAN, thus maintaining and ensuring tenant separation.
The resulting VXLAN "tunnels" terminate at either end on the VXLAN-enabled switches the VMs/servers are connected to. These switches are referred to as Virtual Tunnel End Points (VTEPs).

Figure 1 below shows the VXLAN encapsulation (Wrapper) put around the original Ethernet frame.

Figure 1 VXLAN Encapsulation

VXLAN Packet

The outer IP addresses added by the VEM are those of the VTEPs. A VTEP can be a virtual switch residing in a hypervisor, like the Nexus 1000v, or a logical switch residing in a physical switch.
If you want to "break out" of the VXLAN and have your VM talk to a bare metal device or a gateway for routing, then a VXLAN gateway is required. A VXLAN gateway has an interface in the VXLAN and an interface in the classical Ethernet VLAN and bridges between the two.
Examples of VXLAN gateways are the Cisco ASR1000v/CSR1000v or the VXLAN Gateway Services Module for the Nexus 1110/1010 Virtual Services Appliance. Some VXLAN-enabled physical switches are also capable of providing VXLAN gateway functionality.
As mentioned above, VXLAN relies on having an IP multicast enabled network between VTEPs.
There are two Cisco (non-IETF) enhancements which negate the need for an IP multicast enabled network.
1) Head-end software replication.
The VTEP (Nexus 1000v in my example) sends a copy of the B/U/M traffic via unicast to all possible VTEPs on which the destination MAC could be located (this works well for smaller deployments).

2) The second solution relies on the control plane of the Nexus 1000V virtual switch, the Virtual Supervisor Module (VSM), to distribute the MAC locations of the VMs to the Nexus 1000V Virtual Ethernet Module (VEM, or the data plane), so that all packets can be sent in unicast mode. While this solution seemingly conflicts with the VXLAN design objective of not relying on a control plane, it provides an optimal solution within Nexus 1000V-based virtual network environments. Compatibility with other VXLAN implementations is maintained through IP Multicast, where required.

VXLAN Configuration example:

Physical Topology

Logical Topology

VXLAN Logical Topology

First, ensure IP multicast is enabled on the switch and the SVI interfaces:

ip pim sparse-dense-mode (on the L3 interfaces)
ip pim bidir-enable (recommended, as any endpoint could be a sender or a receiver)
ip pim send-rp-announce Loopback0 scope 16 bidir (sets the switch up as an RP)
ip pim send-rp-discovery Loopback0 scope 16

Verify with "show ip pim interface" and "show ip pim rp mapping"

On Cisco Nexus 1000v VSM

feature segmentation (enables the VXLAN feature; requires the Advanced licence)

bridge-domain VXLAN5000_TENANT1
  group 239.1.2.3
  segment id 5000

Create the Layer 3 control and VXLAN VTEP port-profiles for the VEM VMkernel interfaces:

port-profile type vethernet Control_Uplink_1001
  capability l3control
  capability vxlan
  vmware port-group
  switchport mode access
  switchport access vlan 1001
  no shutdown
  system vlan 1001
  state enabled

port-profile type vethernet Control_Uplink_1002
  capability l3control
  capability vxlan
  vmware port-group
  switchport mode access
  switchport access vlan 1002
  no shutdown
  system vlan 1002
  state enabled

Create the Port-Profile the VMs will connect to:

port-profile type vethernet VXLAN_5000_Tenant1
  switchport mode access
  switchport access bridge-domain VXLAN5000_TENANT1
  vmware port-group
  no shutdown
  state enabled

Verify on the VSM with:
show bridge-domain

Verify on the upstream switch with:
show ip mroute 239.1.2.3

First test with both VMs on the same host/port-group, then vMotion VM2 to ESX02.

VXLAN Packet Walk

Let’s take the above example and do a PING from VM1 (MAC1) on ESX01 to VM2 (MAC2) on ESX02

1. Virtual machine VM1 on ESX01 sends an ARP request with the destination MAC address FF:FF:FF:FF:FF:FF (broadcast).

2. The VTEP (VEM) on ESX01 encapsulates the Ethernet broadcast packet into a UDP/VXLAN packet with the multicast address "239.1.2.3" as the destination IP address and the VTEP address "10.200.1.50" as the source IP address.

3. The physical network delivers the multicast packet to the hosts that joined the multicast group address “239.1.2.3”.

4. The VTEP on ESX02 receives the encapsulated packet. Based on the outer and inner header, it makes an entry in the forwarding table that shows the mapping of the virtual machine MAC address and the VTEP. In this example, the virtual machine MAC1 running on ESX01 is associated with VTEP IP “10.200.1.50”.

5. The VTEP also checks the segment ID or VXLAN logical network ID (5000) in the external header to decide if the packet has to be delivered on the host or not.

6. The packet is de-encapsulated and delivered to the virtual machines connected on that logical network VXLAN 5000.

7. Virtual Machine MAC2 on ESX02 responds to the ARP request by sending a unicast packet with Destination Ethernet MAC address as MAC1.

8. After receiving the unicast packet, the VTEP on Host 2 performs a lookup in the forwarding table and gets a match for the destination MAC address “MAC1”.

9. The VTEP now knows that to deliver the packet to virtual machine MAC1 it has to send it to VTEP with IP address “10.200.1.50”.

10. The VTEP creates a unicast packet with destination IP address "10.200.1.50" and sends it out.

11. The packet is delivered to ESX01.

12. The VTEP on Host 1 receives the encapsulated packet. Based on the outer and inner header, it makes an entry in the forwarding table that shows the mapping of the virtual machine MAC address and the VTEP. In this example, the virtual machine MAC2 running on ESX02 is associated with VTEP IP “10.200.2.50”.

13. The VTEP also checks segment ID or VXLAN logical network ID (5000) in the external header to decide if the packet has to be delivered on the host or not.

14. The packet is de-encapsulated and delivered to the virtual machine connected on that logical network VXLAN 5000.

I will do a video walkthrough on how to set up VXLAN using my Cisco UCS, Nexus 1000v and Nexus 5000 lab and post it here when done.

Thanks for stopping by and look after that Datacenter of yours🙂
