I was recently asked to investigate an issue which on the face of it sounded very odd.
We had installed a fairly large FlexPod environment running VMware NSX across a couple of datacentres.
During pre-handover testing the environment suffered a complete loss of service to the vSphere and NSX management cluster.
When I asked if anything had been done immediately prior to the outage, the only thing they could think of was that a UCS admin (to protect his identity let’s call him “Donald”) had renamed an unused VLAN, which had no VMs in it, so was almost certainly not the cause and just a coincidence.
Hmmmm I’ve never really been one to believe in coincidences, and armed with this information, I had a pretty good hunch where to start looking.
As I suspected, both production vNICs (eth0 & eth1) of all 3 hosts in the management cluster were now showing as down/unpinned in UCS Manager.
This was obviously why the complete vSphere production and VMware NSX management environment was unreachable. All the management VMs, including vCenter, NSX Manager and an NSX Edge Services Gateway (ESG) that protected the management VLAN, resided on these hosts, all of which had effectively just had their networking cables pulled out.
So what had happened?
As you may know, you cannot rename a VLAN in Cisco UCS, so Donald had deleted VLAN 126 and recreated it with the same VLAN ID but a different name (“spare” in this case). This wasn’t perceived as anything important, as there were not yet any VMs in the port-group for VLAN 126.
Donald then went into the updating vNIC template to which the 3 vSphere management hosts were bound and added in the recreated VLAN 126.
And that is when all management connectivity was lost.
The issue stemmed from the uplink design. As per best practice when using vPCs on Cisco UCS with NSX, there were two port-channels northbound from each Fabric Interconnect. One carried all the production VLANs and connected to a pair of Nexus switches running virtual port-channels (vPC). The other was a point-to-point port-channel carrying the VLAN used for the Layer 3 OSPF adjacency between the Nexus switches and the virtual NSX Edge Services Gateways (ESGs), as running a dynamic routing adjacency over a vPC is not supported.
VLAN groups therefore have to be used to tell UCS which uplinks carry which VLANs (just like a disjoint Layer 2 setup).
Cisco UCS then compares the VLANs on each vNIC to those tagged on the uplinks and thus knows which uplink to pin the vNIC to.
Unfortunately, as Donald found out, this is an “All or nothing” deal: unless ALL of the VLANs on a vNIC exist on a single uplink, that entire vNIC and ALL its associated VLANs will not come up. Or, as in this case, the vNIC will just shut down.
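To make the “All or nothing” rule concrete, here is a minimal Python sketch of it. This is only a simplified model for illustration, not how UCS actually implements pinning, and the function name, uplink names and VLAN IDs (other than 126) are all made up.

```python
from typing import Dict, Optional, Set


def pin_target(vnic_vlans: Set[int], uplink_vlans: Dict[str, Set[int]]) -> Optional[str]:
    """Return the first uplink that carries EVERY VLAN on the vNIC, or None.

    Simplified model of the "All or nothing" rule: a vNIC can only pin
    to an uplink whose VLAN set is a superset of the vNIC's VLAN set.
    """
    for uplink, carried in uplink_vlans.items():
        if vnic_vlans <= carried:  # subset test: all vNIC VLANs are carried
            return uplink
    return None  # no single uplink carries them all, so the vNIC stays down


# Hypothetical uplinks and VLAN IDs, for illustration only.
uplinks = {
    "vpc-production": {100, 101, 102},
    "p2p-ospf": {200},
}

print(pin_target({100, 101}, uplinks))       # pins to "vpc-production"
print(pin_target({100, 101, 126}, uplinks))  # None: VLAN 126 is on no uplink
```

Note that in the second call VLANs 100 and 101 are still carried by an uplink, yet the vNIC as a whole has nowhere to pin. One stray VLAN takes everything else on that vNIC down with it.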
So when VLAN 126 was deleted and recreated with a new name, the recreated VLAN was no longer a member of the VLAN group for the main vPC UCS uplinks (105 & 106). As soon as it was added to the updating vNIC template, all hosts bound to that template immediately shut down their production vNICs (eth0 & eth1), as there was no longer an uplink carrying ALL their required VLANs to pin to. (Cisco UCS 101, really.)
As soon as I added the recreated VLAN to the vPC uplink VLAN group, all vNICs re-pinned and came up, and connectivity was restored. (I could also have just removed the new VLAN from the vNIC template.) Either way, the “All or nothing” rule was now satisfied.
As per best practice, all the client’s user workloads and supporting vSphere and NSX infrastructure were located on different vSphere clusters and thus were unaffected by this outage.
There are numerous ways to avoid the above issue. For example, you could take out the vPC element and just have a single-homed port-channel carrying all VLANs from FI A to Nexus A, and the same from FI B to Nexus B.
Or, as was done in this case, the run book was updated and everyone was informed that VLAN groups are in use in this environment, so a newly created VLAN must be added to the relevant VLAN group before it is added to the vNIC template.
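The run-book rule can also be expressed as a simple pre-change check. The sketch below is hypothetical: it works on plain Python sets that you would have to populate yourself from your own UCS inventory, it does not talk to UCS Manager, and the group names and VLAN IDs (other than 126) are invented for the example.

```python
from typing import Dict, List, Set


def groups_carrying_all(template_vlans: Set[int], new_vlan: int,
                        vlan_groups: Dict[str, Set[int]]) -> List[str]:
    """List the VLAN groups that would still carry every VLAN on the
    vNIC template once new_vlan has been added to it."""
    wanted = set(template_vlans) | {new_vlan}
    return [name for name, members in vlan_groups.items() if wanted <= members]


def safe_to_add(template_vlans: Set[int], new_vlan: int,
                vlan_groups: Dict[str, Set[int]]) -> bool:
    """True if at least one VLAN group would still carry all template VLANs."""
    return bool(groups_carrying_all(template_vlans, new_vlan, vlan_groups))


# Hypothetical inventory, for illustration only.
vlan_groups = {
    "vpc-uplinks": {100, 101, 102},
    "ospf-uplinks": {200},
}
template = {100, 101, 102}

# Step 1 skipped: VLAN 126 is in no VLAN group yet, so the change is unsafe.
print(safe_to_add(template, 126, vlan_groups))  # False

# Step 1 done: add the VLAN to the relevant VLAN group first.
vlan_groups["vpc-uplinks"].add(126)
print(safe_to_add(template, 126, vlan_groups))  # True
```

The order of the two steps is the whole point: VLAN group first, vNIC template second.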
I would like to see a feature added to UCS that changes this behaviour, perhaps isolating only the individual VLAN(s) rather than the whole vNIC, although I can think of a few technical reasons why it likely is how it is. Failing that, a warning when an action will result in a vNIC unpinning would be welcome.
In a previous post, UCS Fear the Power?, I quoted Spider-Man: “with great power comes great responsibility”.
That was certainly true in this case: seemingly minor changes can have major effects if the full ramifications of those changes are not completely understood.
And before anyone comments: no, I am not Donald :-), but I have done this myself in the past, so I knew of this potential “Gotcha”.
But if this post saves just one UCS admin from an RGE (Résumé Generating Event) then it was worthwhile!
Don’t be a Donald, and look after that datacentre of yours!