In many of the workshops I host or skills-transfer sessions I conduct with clients, the general response is "Wow, this stuff is incredibly powerful": being able to update the firmware on 160 servers simultaneously with a couple of mouse clicks, or add a network card to every ESXi or bare-metal host in the UCS infrastructure simultaneously using updating templates. (Both of which actions obviously require the blades to be rebooted.)
This inevitably leads to opinions like "User errors can now potentially take out my whole enterprise", to which the answer must be: well, yes, they could. And it is not up for debate that user errors account for the majority of unplanned outages.
I am not of the opinion, however, that the only reason server operators have not regularly caused massive global server outages in the past is that it wasn't particularly easy for them to do so, which is more or less what the above opinion implies.
I'm sure there were those, when passenger planes began crossing the Atlantic, who said it was crazy: if the plane crashed it would likely kill hundreds, so far better to put eight people in a rowing boat and row them across, regardless of the multitude of inefficiencies in that theory. OK, perhaps I'm pushing the analogy a bit far, but you get the point. The fact is this power now exists, and the genie is well and truly out of the bottle.
VMware administrators have had global power over the entire data centre for several years; do our platform admins deserve any less? In my experience it is common for the VMware admins to also manage the UCS environment in any case.
Utilizing role-based access control (RBAC) or automation utilities like EMC Ionix / Unified Infrastructure Manager (UIM), BladeLogic or Cisco's Intelligent Automation for Cloud (CIAC), to name but a few, can further greatly reduce the margin for human error. But many errors can be eliminated by standard best practice: for day-to-day monitoring, log in with read-only privileges, and only log in with an escalated, privileged account when conducting agreed changes.
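To make the read-only-by-default practice concrete, here is a minimal Python sketch of the idea. This is purely illustrative: the role names, actions and `Session` class are hypothetical and are not part of UCS Manager or any real RBAC API.

```python
# Illustrative sketch of the "log in read-only by default" practice.
# All names here are hypothetical, not a real UCS Manager API.

DISRUPTIVE_ACTIONS = {"update_firmware", "modify_service_profile"}

class Session:
    def __init__(self, user, role="read-only"):
        self.user = user
        self.role = role  # day-to-day monitoring uses the default "read-only"

    def perform(self, action):
        # Disruptive changes are refused unless the admin has deliberately
        # logged in with an escalated account for an agreed change window.
        if action in DISRUPTIVE_ACTIONS and self.role != "admin":
            raise PermissionError(
                f"{self.user} ({self.role}) may not {action}: "
                "re-login with an escalated account for agreed changes")
        return f"{action} OK"

monitor = Session("ucsguru")                # default read-only login
print(monitor.perform("view_faults"))       # monitoring is fine

try:
    monitor.perform("update_firmware")      # blocked at the role level
except PermissionError as err:
    print(err)

change_window = Session("ucsguru", role="admin")  # escalated, for agreed changes
print(change_window.perform("update_firmware"))
```

The point of the pattern is simply that a fat-fingered click during routine monitoring cannot become a disruptive change; escalation has to be a deliberate, separate step.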
In my experience, customers who have experienced "unexpected" server reboots would have either (a) expected them or (b) not experienced them at all, had they done one of two things: 1) actually read the big dialogue boxes that pop up explaining that the action will reboot servers x, y and z, or 2) had a properly configured maintenance policy in place. The default Cisco UCS maintenance policy is to reboot the blades immediately (if a reboot is required). The system does of course advise the admin that the task will reboot the servers and requires the admin to acknowledge this by clicking OK.
I would recommend changing the default maintenance policy to "User-Ack", so that even when the system tells you it will reboot the servers and the admin clicks OK, the servers still will not reboot. The admin will get a flashing icon saying user action required, and then has to go through and click a radio button next to each blade that has been flagged for reboot. A belt-and-braces approach, if you will.
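The difference between the two policies can be sketched in a few lines of Python. To be clear, this is not Cisco code and the class and method names are hypothetical; it just models the behaviour described above: under "immediate" the blades reboot as soon as the change is applied, while under "user-ack" nothing reboots until each flagged blade is individually acknowledged.

```python
# Illustrative model of UCS "immediate" vs "user-ack" maintenance
# policies. Class and method names are hypothetical, not a Cisco SDK.

class Blade:
    def __init__(self, name):
        self.name = name
        self.pending_reboot = False   # the flashing "user action required" state
        self.rebooted = False

class MaintenancePolicy:
    def __init__(self, mode="immediate"):  # "immediate" mirrors the UCS default
        assert mode in ("immediate", "user-ack")
        self.mode = mode

    def apply_disruptive_change(self, blades):
        """Push a change that requires a reboot (e.g. a firmware update)."""
        for blade in blades:
            if self.mode == "immediate":
                blade.rebooted = True        # reboots as soon as the admin clicks OK
            else:
                blade.pending_reboot = True  # waits for per-blade acknowledgment

    def acknowledge(self, blade):
        """The admin ticks the radio button next to one flagged blade."""
        if blade.pending_reboot:
            blade.pending_reboot = False
            blade.rebooted = True

blades = [Blade(f"blade-{i}") for i in range(1, 4)]
policy = MaintenancePolicy(mode="user-ack")
policy.apply_disruptive_change(blades)
print([b.rebooted for b in blades])   # no blade has rebooted yet
policy.acknowledge(blades[0])
print([b.rebooted for b in blades])   # only the acknowledged blade reboots
```

Under "user-ack" the disruptive change and the reboot become two separate, deliberate decisions per blade, which is exactly the safety net the default policy lacks.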
But again, to state the obvious, the role of admin or server operator should only be given to someone trained in the use of that role.
I spend a lot of time with clients helping them work out how to go from managing a siloed environment to a unified one, and how this affects their organization, procedures, change systems and so on, and once they understand the new mindset, they certainly see the benefits.
As with anything, if proper safeguards and protections are in place, this power can be harnessed to awesome effect, allowing the UCS admin to manipulate the environment as easily and as skillfully as an artist manipulates a brush. It's not quite bare-metal vMotion, but it's not far off!
Spider-Man once said, "With great power comes great responsibility." This is UCS's gift, its curse. Who am I? I'm UCSguru!