This latest post in my “UCS for…” series attempts to explain the key Cisco UCS concept of the blade, and its role in the UCS system, for people already familiar with storage concepts.
The role of the blade has certainly changed in a Cisco UCS environment. No longer is it “The Server”; it is now just the physical memory, CPU and I/O that the server makes use of. “The Server” in the case of UCS is now the Service Profile, basically an XML file with all of that server’s identity, addresses, BIOS settings and firmware defined.
Abstracting the logical server from the physical tin opens up a huge raft of efficiencies and dramatically increases flexibility, as I’m sure all hypervisor admins fully appreciate.
So for the purposes of this post think of the blade as a disk in an array.
Now, in a disk array do you generally care which physical disk your data is currently on?
Generally the answer is no.
In the same way, you don’t necessarily need to care which bit of tin your Service Profile is currently making use of.
Pause…….. for that key concept to sink in.
OK, I’m not going to say it is wrong for customers to want to be able to say “that blade x is server y” and put host name stickers on them, and so on. That’s fine, and many customers want just that. There is a certain amount of comfort in knowing and controlling exactly which blades are associated to which service profiles. It’s just a human thing, and a concept which is deeply ingrained in most server admins.
However, in the era of the cloud and increased adoption of automation / orchestration tools, this “legacy” thought process is gradually softening.
I must admit I get a great feeling when customers fully embrace the statelessness of UCS and allow it to “stretch its legs” by making full use of server pools and qualifications. When you associate a Service Profile to a server pool, the system just picks a blade out of the specified pool and away it goes. If that blade ever fails and there is a spare blade in the pool, UCS will dynamically grab that spare, regardless of which chassis it is in, and the server is back up in a few minutes.
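To make the pool behaviour concrete, here is a minimal sketch in Python. It is a toy model only, not the UCS Manager API: the `Blade` and `ServerPool` classes and the `associate` and `handle_failure` functions are all hypothetical names I have made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Blade:
    chassis: int
    slot: int
    healthy: bool = True
    profile: Optional[str] = None  # name of the associated service profile, if any


@dataclass
class ServerPool:
    name: str
    blades: List[Blade] = field(default_factory=list)

    def pick_spare(self) -> Optional[Blade]:
        # Any healthy, unassociated blade will do; the chassis is irrelevant.
        for blade in self.blades:
            if blade.healthy and blade.profile is None:
                return blade
        return None


def associate(profile: str, pool: ServerPool) -> Blade:
    """Associate a service profile with whichever spare blade the pool offers."""
    blade = pool.pick_spare()
    if blade is None:
        raise RuntimeError(f"no spare blade available in pool {pool.name!r}")
    blade.profile = profile
    return blade


def handle_failure(failed: Blade, pool: ServerPool) -> Optional[Blade]:
    """Mark a blade as failed and re-home its profile onto a spare from the pool."""
    failed.healthy = False
    profile, failed.profile = failed.profile, None
    if profile is None:
        return None  # nothing was running on the failed blade
    return associate(profile, pool)


# Two blades in different chassis, one profile, one simulated failure.
pool = ServerPool("ESXi Servers", [Blade(chassis=1, slot=1), Blade(chassis=2, slot=3)])
first = associate("esxi-host-01", pool)
replacement = handle_failure(first, pool)
print(replacement.chassis, replacement.slot)
```

The point of the sketch is that the profile name is the only thing that persists; which blade carries it is an implementation detail of the pool.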
Now when I said “do you care about which disk your data sits on” you may well have said, “no, but I do kind of care what TYPE of disk my data sits on”, i.e. whether your data is on larger but relatively slow SATA or NL-SAS drives or on super fast Enterprise Flash Drives (EFD).
Enter server pool qualifications: you can set up server pools based on most physical attributes of a blade. As a couple of examples: if a blade has 40 cores, dynamically put it in my pool called “High Performance”; if it has 512GB of RAM, dynamically put it in my pool called “ESXi Servers”.
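The qualification idea can be sketched as a set of rules evaluated against each blade's hardware inventory. This is an illustration in Python, not how UCS Manager defines qualification policies: `BladeSpec`, `QUALIFICATIONS` and `qualify` are hypothetical names, and treating 40 cores and 512GB as minimum thresholds is my assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class BladeSpec:
    name: str
    cores: int
    ram_gb: int


# Hypothetical rules: pool name -> test applied to a blade's inventory.
# The thresholds are treated as minimums here, which is an assumption.
QUALIFICATIONS: Dict[str, Callable[[BladeSpec], bool]] = {
    "High Performance": lambda b: b.cores >= 40,
    "ESXi Servers": lambda b: b.ram_gb >= 512,
}


def qualify(blade: BladeSpec) -> List[str]:
    """Return every pool whose qualification this blade satisfies."""
    return [pool for pool, rule in QUALIFICATIONS.items() if rule(blade)]


print(qualify(BladeSpec("blade-1", cores=40, ram_gb=256)))
print(qualify(BladeSpec("blade-2", cores=16, ram_gb=512)))
```

A blade can satisfy more than one rule, in which case this sketch places it in every matching pool.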
This separation of the Service Profile from the physical blade gives the UCS admin the flexibility to move service profiles between blades of different spec as the need arises. For example, if there is greater demand on the payroll system at month end, they can associate that service profile to a “High Performance” blade for the duration of the peak, and then associate it back to an “Efficient Performance” blade when demand reduces. This moving of service profiles is disruptive, however, as the server needs to be shut down first. But hey, it is still awesome to be able to do, and a huge advancement from where the compute industry was.
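The month-end move can be sketched as an ordered sequence of steps, which makes the disruptive part explicit. Again this is a toy model in Python rather than anything from UCS Manager: `ServiceProfile`, `move_profile` and the blade names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServiceProfile:
    name: str
    blade: Optional[str] = None  # label of the blade currently in use
    powered_on: bool = True


def move_profile(profile: ServiceProfile, target_blade: str) -> List[str]:
    """Move a profile to another blade and return the steps taken, in order."""
    steps = []
    if profile.powered_on:
        # The disruptive part: the server must be shut down before it moves.
        profile.powered_on = False
        steps.append("shutdown")
    steps.append(f"disassociate from {profile.blade}")
    profile.blade = target_blade
    steps.append(f"associate to {target_blade}")
    profile.powered_on = True
    steps.append("power on")
    return steps


payroll = ServiceProfile("payroll", blade="efficient-1")
for step in move_profile(payroll, "high-performance-1"):  # month-end peak
    print(step)
```

Moving back after the peak is the same call with the original blade as the target.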
In any case, a lot of customers have several hosts in an ESXi cluster, and cluster their bare metal servers, so critical workloads are protected from single blade failures. So with a bit of planning you could move Service Profiles between different blades without impacting a clustered application.
Similarly, upgrades are now just a case of upgrading or buying a spare blade and soak testing it for as long as you need (I use a Soak Test Service Profile with several diag utils on). Then, in your outage window and with a couple of clicks of the mouse, you move your service profile to the upgraded blade, with all your addresses, BIOS settings and firmware revisions maintained.
I must stress that I have not seen any of the below on any roadmaps; it’s just where my thinking goes.
So what could be in the future if we carry on this thought process and the analogy of UCS as the “Compute Array”? It would seem logical to me that the next stage of evolution would be non-disruptive service profile moves (akin to a bare metal vMotion). That would then open up the possibility of moving service profiles dynamically and seamlessly between blades of differing performance as demands on the workload increase or decrease: a cross between VMware’s Distributed Resource Scheduler (DRS) and EMC’s Fully Automated Storage Tiering (FAST), but for compute. Wow, what a place the world will be then!
So I hope this post helps all you storage bods develop a better understanding of Cisco UCS. As ever, please feel free to comment on this post. I enjoy getting feedback and answering your Cisco UCS questions.
I think the idea of dynamic service profile migration is unlikely. Any dynamic server move depends on moment by moment memory coherence, which is hard to do: Sun/Oracle took years to make it work with LDOMs (Oracle VM for SPARC). UCSM, to my understanding, communicates with blades as a command and control agent, not as a “hypervisor layer”, so to dynamically transfer service profiles between blades would need another software layer. This can’t run on the blade, as that’s being managed for the incoming service profile, and it would keep the FIs rather busy.
I’ve not seen any roadmap either, so could be completely wrong.
Thanks Dunstan, can always rely on you to curb my enthusiasm 🙂
With Egenera’s PAN Manager, if a blade fails we move the server profile over to another blade automatically, with no manual intervention, no matter what OS you are running. I know that is not dynamic, but it provides the next best thing, which is automation. If you have mission critical apps you surely don’t want to take the time to manually do something when it can be automated. We even self heal VMWare clusters using the same methodology. If you have a 3 node VMWare cluster and one of the nodes fails, VMWare of course will re-distribute the VMs across the other 2 nodes based on your settings, but you are down one node and your performance is degraded until you physically replace the blade. PAN Manager detects the blade failure and moves the profile over to a designated failover blade, or we have the ability to designate a lower priority server (a development or test blade) to be used. After the server is moved, VMWare will then see that the cluster is 3 nodes and move the appropriate VMs back in place. All of that process is automatic, with no user intervention. To further enhance your VM environment we provide the ability to expand and retract it based on usage. If your VM environment is running and suddenly has an extended peak time that requires more resources, you can configure PAN Manager to add another node to the cluster to handle the demand, based on a pre-determined threshold. The same can be done the other way if you have a slow period where you don’t need as many nodes in your cluster. Since Egenera can manage multiple vendors’ blade chassis (HP, IBM, Dell, NEC, Fujitsu), we just announced the ability to fail over profiles between different hardware vendors, meaning we can fail over a server running on an HP blade to a blade in an IBM chassis. If you want to learn more check out our website at http://www.egenera.com or you can contact me via email at email@example.com
Great post and thanks a lot. I’m deploying UCS pretty soon.
I do not quite get the point of running HA in VMWare and running HA in UCS at the same time. Seems twice the work for the same goal unless you’re !really! strapped for cash in which case you probably will not have UCS anyway.
You’re right, there is usually little point enabling both VMware adapter HA and UCS hardware fabric failover, as this can sometimes hide issues, i.e. if one of the uplinks to my vSwitch has lost connectivity I kinda want to know. (The UCS admin will get notification that a failover has occurred, but the VMware admin will be none the wiser.)
My advice: do not use fabric failover in circumstances where HA is already catered for, e.g. VMware and Nexus 1000v.
My one caveat to the above is when you want to influence traffic direction to ensure that traffic is locally switched at Layer 2 within the Fabric Interconnect (vMotion, for example). In such cases I just use a separate vSwitch with a single uplink and Fabric Failover enabled.
For operating systems that do not support HA for uplinks (historically Hyper-V, although I think it may do now), or bare metal blades that do not need to load balance across both fabrics, a single vNIC with fabric failover enabled is a no-brainer.
I’m sure you’ll have a great experience deploying UCS. Have a good read of all the Q&As on this site and older posts, so you’re fully up to speed by the time your kit arrives.
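The rule of thumb above boils down to a small decision function. This is just my advice written as Python, with hypothetical parameter names; it is not a UCS setting or API.

```python
def use_fabric_failover(os_handles_uplink_ha: bool,
                        pin_traffic_to_one_fabric: bool = False) -> bool:
    """Decide whether to enable fabric failover on a vNIC (rule of thumb only).

    os_handles_uplink_ha: the OS or hypervisor already provides uplink HA
        (for example VMware teaming or Nexus 1000v).
    pin_traffic_to_one_fabric: you want the traffic locally switched within
        one Fabric Interconnect (for example a dedicated vMotion vSwitch
        with a single uplink).
    """
    if not os_handles_uplink_ha:
        # No uplink HA in the OS: let the fabric provide it.
        return True
    # HA already catered for: only enable it for the single-uplink caveat.
    return pin_traffic_to_one_fabric


print(use_fabric_failover(os_handles_uplink_ha=True))
print(use_fabric_failover(os_handles_uplink_ha=True, pin_traffic_to_one_fabric=True))
print(use_fabric_failover(os_handles_uplink_ha=False))
```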
I have a query regarding the UCS FI. In the current production setup, the UCS FIs and UCS chassis are all on IOS 1.3.
One of the Cisco Fabric Interconnects has failed and is out of order. We received a replacement, and the new FI is on IOS 2.0.
What is the best practice for installing the new FI with the latest IOS so that it becomes part of the current system?
Is there any possibility we can downgrade the new FI’s IOS or firmware version to the current production one, i.e. downgrade to version 1.3?
Is there any other solution for replacing the faulty FI with the new one? Right now there is no secondary FI in the setup, as one has already failed.
Thanks & Regards,
You have two options. Upgrade your remaining FI to version 2.0 and then add the new FI to the cluster; this option will disrupt your traffic flows, as your single FI will need to reboot.
Or, as you say, downgrade the new FI to 1.3 and then join it to the cluster. (This obviously does not cause you any downtime.)
1.3 is pretty old now, so once you get your cluster back up it is probably time to think about an upgrade anyway.