UCS for HP People

Following on from the popular “UCS for my Wife”, I have been inspired to write a quick architecture comparison for people familiar with the HP C7000 chassis who want to get up to speed on Cisco UCS.

I’m not going into the “which is better” debate in this post; I’m simply covering the architectural differences.

This post came about after I received a tweet from an HP-literate engineer asking whether I would recommend striping ESXi hosts across several UCS chassis to reduce the impact to an ESXi cluster in the event of a chassis failure, as this was his best practice in his HP C7K environment.

So the short answer to the question is yes, but probably not for the reasons you may think.

First off, you need to embrace the concept that UCS is not a chassis-centric architecture. “How can you say that, Colin? Cisco UCS has chassis everywhere,” I hear you cry.
Well, I’ll tell you. You need to think of the entire Cisco UCS infrastructure as the “virtual chassis”; the fact that Cisco provides nice convenient bits of tin that house 8 blades is merely to give you power, cooling and nice modular building blocks for expansion.

There is no intelligence or hardware switching that goes on inside a UCS Chassis.

For conceptual purposes I have drawn out the “Cisco UCS Virtual Chassis” and have listed where the HP C7000 “equivalents” are located to aid in getting the concept across.

Now, the Cisco and HP technologies are not like for like, but for the purposes of the “virtual chassis” concept, the comparison of where each element sits from a functionality point of view is totally valid.

UCS for HP People

So first off, as you can see, the HP modules with which I’m sure you are very familiar have all been taken out of the chassis.

The Onboard Administrator (OA) modules have their equivalents within the two Fabric Interconnects (FIs), the things that look like switches, generally at the tops of the UCS racks. The two FIs house an active/standby software management module (UCS Manager) and are clustered together, so you only ever need to reference the single cluster address, regardless of the number of chassis in your UCS domain.

The Virtual Connect, Flex-10, Fibre Channel and FlexFabric “equivalents” again sit within the Fabric Interconnects, but unlike the active/standby relationship of the management element, the data elements run active/active, i.e. both Fabric Interconnects forward traffic, providing load balancing as well as fault tolerance.

So, imagine dissecting your C7000 chassis, consolidating all your OAs, Virtual Connect modules, FC modules and switches at the top of the rack, and expanding your 16-slot chassis to a 160-slot chassis, and you’re pretty much there.
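If it helps to see that mapping written down rather than drawn, here is the same idea as a tiny Python snippet. It is purely conceptual: the strings simply restate the placements described above, not a like-for-like feature comparison.

```python
# Conceptual placement of familiar HP C7000 elements in a Cisco UCS domain.
# The right-hand side restates the "virtual chassis" mapping from the article.
hp_to_ucs = {
    "Onboard Administrator (OA)":
        "UCS Manager on the two Fabric Interconnects (active/standby, clustered)",
    "Virtual Connect / Flex-10 / VC-FC / FlexFabric":
        "Fabric Interconnect data plane (active/active across both FIs)",
    "Chassis-resident LAN/SAN switching":
        "Fabric Interconnects at the top of the rack (no switching in the chassis)",
    "C7000 enclosure":
        "UCS chassis: power, cooling and 8 blade slots, a modular building block",
}

for hp_item, ucs_home in hp_to_ucs.items():
    print(f"{hp_item:45s} -> {ucs_home}")
```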

So let’s go back to the original question of whether I would recommend splitting clusters across chassis.

Well, as you hopefully now know, there is no single point of failure within a Cisco UCS chassis, so unless you insert your blades with a hydraulic ram and manage to crack the mid-plane, you should be OK.

Infrastructure firmware updates can be done without disruption to the hosts, so no issues there, although I would always recommend these be done in a potential outage window because, as we all know, sh*t does happen sometimes. Blade/adapter firmware upgrades do require the blades to be rebooted, but this can easily be planned and managed, allowing you to vacate the VMs off a blade before you apply the host firmware policy and reboot it.
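To make that “vacate, apply the host firmware policy, reboot” sequence concrete, here is a minimal Python sketch of a rolling, one-blade-at-a-time upgrade. Every helper in it (enter_maintenance_mode, apply_host_firmware_policy and so on) is a hypothetical stub standing in for whatever vCenter and UCS Manager tooling or scripting you actually use, and the host and policy names are made up; none of this is a real API.

```python
# Hypothetical sketch: drain each ESXi blade, apply the UCS host firmware policy
# (which reboots the blade), wait for it to return, then move on to the next one.
import time

def enter_maintenance_mode(host: str) -> None:
    print(f"vMotion the VMs off {host} and put it into maintenance mode")  # stub

def apply_host_firmware_policy(host: str, policy: str) -> None:
    print(f"apply host firmware policy '{policy}' to {host} and reboot the blade")  # stub

def wait_until_host_is_back(host: str) -> None:
    time.sleep(1)  # stand-in for polling until the blade has rebooted and rejoined the cluster
    print(f"{host} is back online")

def exit_maintenance_mode(host: str) -> None:
    print(f"take {host} out of maintenance mode")  # stub

def rolling_upgrade(hosts: list[str], policy: str) -> None:
    for host in hosts:  # one blade at a time keeps the rest of the cluster serving VMs
        enter_maintenance_mode(host)
        apply_host_firmware_policy(host, policy)
        wait_until_host_is_back(host)
        exit_maintenance_mode(host)

rolling_upgrade(["esxi01", "esxi02", "esxi03"], policy="blade-fw-2.0")
```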

Bandwidth should not be a consideration, as each chassis can have up to 160 Gbps of bandwidth (80 Gbps per fabric, active/active).

So the reason I would recommend splitting ESXi clusters across chassis is really to minimize the chance of human error causing major disruption to a cluster.

Imagine you have 8 hosts in a cluster and all 8 hosts are in the same chassis. An engineer gets told to go and turn off Chassis 1. He gets into the DC, does not notice the blue flashing locator light on Chassis 1, does not notice the “Chassis 1” label on the front either, and rather than counting from the bottom he counts from the top, pulls out all of the grid-redundant power supplies and brings down your cluster. This would never happen though, right?

To be fair, I have never seen the above happen either, but what I have seen is a UCS admin right-click and re-acknowledge a chassis, which caused 30 seconds of outage to ALL blades in that chassis. (You re-ack a chassis if you ever change the number of chassis-to-FI cables.) Obviously you should never re-ack a chassis at a time when 30 seconds of disruption to all blades in it will cause you issues.

So while UCS itself may not care whether all hosts are in the same chassis or distributed across chassis, best practice based on my experience is definitely to distribute the cluster hosts across chassis.
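As a trivial illustration, here is a short Python sketch that stripes an 8-host ESXi cluster round-robin across chassis and shows the worst case if any single chassis goes away. The host and chassis names are made up for the example.

```python
from collections import defaultdict

def stripe_hosts(hosts: list[str], chassis: list[str]) -> dict[str, list[str]]:
    """Place cluster hosts round-robin across the available chassis."""
    placement: dict[str, list[str]] = defaultdict(list)
    for i, host in enumerate(hosts):
        placement[chassis[i % len(chassis)]].append(host)
    return placement

hosts = [f"esxi{n:02d}" for n in range(1, 9)]            # an 8-host ESXi cluster
chassis = ["chassis-1", "chassis-2", "chassis-3", "chassis-4"]

placement = stripe_hosts(hosts, chassis)
for name, members in placement.items():
    print(f"{name}: {members}")

worst_case = max(len(members) for members in placement.values())
print(f"Losing any one chassis takes out at most {worst_case} of {len(hosts)} cluster hosts")
```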

Hope this clarifies things.

About ucsguru

Technical Architect and Data Center Subject Matter Expert. I do not work or speak for Cisco or any other vendor.

8 Responses to UCS for HP People

  1. Alex Haddock (HP UK) says:

    Hi Colin, worth pointing out that in Virtual Connect land there are two methods of creating cross-chassis mobility. The first is chassis stacking, creating a single Virtual Connect domain of up to 4 chassis (this is equivalent to 8 x UCS chassis in capacity). The second is Virtual Connect Enterprise Manager, which allows for up to 250 (500 UCS chassis) Virtual Connect domains to be managed. Combining the two would allow for 1000 HP chassis, though to be fair I’m sure both we and Cisco would be happy to have a customer with that many :-).

    Whilst we have out-of-band control points in the chassis (OA), I tend to see them as granular additions (alongside the iLO Management Engine) in a larger deployment to an Insight/VCEM install, where all control (and a heck of a lot more) is centralised.

    A c7000 has no SPOF, but again, to mitigate human error (as you state) or the potential of a bug being introduced, you do tend to find larger deployments split across chassis, as per human nature.

    In kind return for your article, and in the spirit it was written in (no bun-throwing), I’d point to two useful Virtual Connect documents for the Cisco reader:

    Virtual Connect For Dummies (expect an update very soon…)
    http://h18000.www1.hp.com/products/blades/virtualconnect/connectfordummies/regForm.html

    Virtual Connect for the Cisco Administrator (c01386629.pdf)

    Cheers, Alex

    • ucsguru says:

      Thanks for providing the HP view Alex, and the open disclosure. One of the great aspects of being an independent is we can take in all the vendors’ information, bat it around internally amongst our own SMEs and then provide the best solution for the client.
      Appreciate the info.
      Regards
      Colin

    • Full Disclosure: I work for Cisco on the UCS Product
      While the two points above are true (VCEM managing thousands of servers and stacking up to 4 chassis together), there are some caveats that need highlighting around those solutions.
      For multi-enclosure (ME) stacking to work with the C7000, you must be using Virtual Connect, and if any one of the 4 enclosures has VC-FC (or Flex Fabric) for Fibre Channel storage, then all 4 chassis in the stack must also have VC-FC installed (in the same bays in each enclosure). This is because only the Ethernet modules are stacked – not the FC. In addition, ME stacking designates the first chassis deployed to be the “master”, and it handles all of the management functions of the entire stack. If that chassis goes offline, you cannot manage the stack. You should keep backups of your local VC domain config.
      With UCS, FC is a core service and available everywhere. The user can add FC to any blade in any chassis without adding additional hardware to each blade and/or chassis.
      As far as VCEM managing thousands of servers, this product uses the concept of “domain groups”, where enclosures with matching hardware are grouped together for easier management. In essence, if you want to manage 20 HP enclosures in your datacenter, all of the chassis have to be configured identically. If one chassis has two ethernet uplink ports, ALL 20 enclosures must have the same 2 uplinks (in the same port numbers). If one enclosure has two VC ethernet modules in Bays 5/6 – ALL enclosures must have the same. Want FC in just one or two chassis? Not gonna happen. One particularly painful one is that all VC modules must be the identical model. If you have any mix of VC 1/10, 1/10-F, Flex-10, Flex10D, or Flex Fabric, you cannot put any of these in the same domain group (because they are not identical – seeing the pattern?). In addition, if you enable certain features in VC (like expanded VLAN capacity, which is just one of many examples), you must enable it on ALL the enclosures. This means you have to upgrade the firmware of all VC modules in all the enclosures to support this feature (VCEM does not manage firmware upgrades) and you must enable it on all the local VC domains.
      Assuming you are able to get 250 identical enclosures all in a VC domain group and you can now manage thousands of servers with it, did I mention it’s not redundant? There is no cluster or other cold/hot standby backup for VCEM (outside of rolling some non-HP solution yourself).

      • Lionel Jullien (HP) says:

        [LJ] Let me clarify a few points
        ——————————————————————————————————-
        Full Disclosure: I work for Cisco on the UCS Product. While the two points above are true (VCEM managing thousands of servers and stacking up to 4 chassis together), there are some caveats that need highlighting around those solutions. For multi-enclosure (ME) stacking to work with the C7000, you must be using Virtual Connect, and if any one of the 4 enclosures has VC-FC (or Flex Fabric) for Fibre Channel storage, then all 4 chassis in the stack must also have VC-FC installed (in the same bays in each enclosure). This is because only the Ethernet modules are stacked – not the FC.
        ——————————————————————————————————-
        [LJ] This is correct, and this prerequisite will vanish when VC supports DCB traffic on uplinks.
        ——————————————————————————————————-
        In addition, ME stacking designates the first chassis deployed to be the “master” and it handles all of the management functions of the entire stack. If that chassis goes offline, you cannot manage the stack. You should keep backups of your local VC domain config.
        ——————————————————————————————————-
        [LJ] This is true, but:
        1/ a c-Class enclosure and VC modules are fully redundant.
        2/ if that chassis goes offline (you must be very unlucky or very heavy-handed), only the VC Manager is lost (i.e. you cannot create a new profile, edit a vNet, etc.), but the server traffic in the other chassis keeps running just fine.
        3/ VCEM can be used to recover the offline chassis by migrating all server profiles to spare servers available somewhere in the datacenter (requires VCEM 7.1).
        ——————————————————————————————————-
        With UCS, FC is a core service and available everywhere. The user can add FC to any blade in any chassis without adding additional hardware to each blade and/or chassis.
        As far as VCEM managing thousands of servers, this product uses the concept of “domain groups” where enclosures with matching hardware are grouped together for easier management. In essence, if you want to manage 20 HP enclosures in your datacenter, all of the chassis have to be configured identically. If one chassis has two ethernet uplink ports, ALL 20 enclosures must have the same 2 uplinks (in the same port numbers). If one enclosure has two VC ethernet modules in Bays 5/6 – ALL enclosures must have the same.
        ——————————————————————————————————-
        [LJ] It is true that enclosures must be the same, but you forget to mention that this condition only exists if the chassis are part of the same VCEM domain group. Nothing forces you to put all enclosures in the same domain group. You can create as many domain groups as you like, each with a different configuration, and you can even migrate VC profiles between groups since VCEM 7.0.
        ——————————————————————————————————-
        Want FC in just one or two chassis? Not gonna happen.
        ——————————————————————————————————-
        [LJ] This is incorrect; you can create a new group for these 2 enclosures. The whole purpose of domain groups is to simplify management of like resources, so naturally this necessitates common configurations within those groups.
        ——————————————————————————————————-
        One particularly painful one is that all VC modules must be the identical model. If you have any mix of VC 1/10, 1/10-F, Flex-10, Flex10D, or Flex Fabric, you cannot put any of these in the same domain group (because they are not identical – seeing the pattern?).
        ——————————————————————————————————-
        [LJ] This is incorrect. For non-identical models, you create different domain groups because the configuration is not the same; it is essential to separate 1Gb from 10Gb modules. The bandwidth gap is huge and requires a different configuration more appropriate to 1Gb technology (i.e. many more uplinks will be used in a configuration with 1Gb modules vs. 10Gb modules).
        ——————————————————————————————————-
        In addition, if you enable certain features in VC (like expanded VLAN capacity which is just one of many examples), you must enable it on ALL the enclosures.
        ——————————————————————————————————-
        [LJ] This is incorrect. One of the main features of VCEM is automatic replication. If you enable the expanded VLAN capacity in a domain group of 100 enclosures, the configuration is replicated automatically to all enclosures. This avoids profile sprawl and operational micromanagement overheads.
        ——————————————————————————————————-
        This means you have to upgrade the firmware of all VC modules in all the enclosures to support this feature
        ——————————————————————————————————-
        [LJ] For all vendors, new features are always available through new firmware.
        ——————————————————————————————————-
        (VCEM does not manage firmware upgrades) and you must enable it on all the local VC domains.
        ——————————————————————————————————-
        [LJ] This is true, VCEM does not manage firmware upgrades, because HP’s strategy in terms of firmware deployment is to use HP Smart Update Manager (HP SUM). HP SUM can provide deployment of firmware for single or one-to-many targets such as VC, iLO, servers and OA.
        Only HP SUM can make sure that all firmware and drivers are installed in the correct order and that all dependencies are met before deploying an update.
        ——————————————————————————————————-
        Assuming you are able to get 250 identical enclosures all in a VC domain group and you can now manage thousands of servers with it,
        ——————————————————————————————————-
        [LJ] Again, 250 is the maximum number of Virtual Connect domains that VCEM can manage today (16,000 servers max), but nobody forces you to put all enclosures in one group. The majority of customer scenarios have multiple groups to reflect the needs of the business (test/dev, finance, web, etc.). The key thing with VCEM is that these groups can be created, expanded and re-defined in granular 16-server units.
        ——————————————————————————————————-
        did I mention it’s not redundant? There is no cluster or other cold/hot standby backup for VCEM (outside of rolling some non-hp solution yourself).
        ——————————————————————————————————-
        [LJ] This is wrong; VCEM, like many other Windows applications, fully supports Microsoft clustering, see http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03345475/c03345475.pdf
        It is also important to note that loss of VCEM, even catastrophic, does not result in loss of services across any of the blade chassis. VCEM can be restored, or the VCEM lock removed if local operations are required. Rolling out firmware by chassis also ensures that any error can be localized to a 16-server unit. This contrasts with other vendor scenarios where potentially hundreds of servers could be affected by an injected firmware error.
        ——————————————————————————————————-

      • ucsguru says:

        Hi Lionel
        Thanks for the disclosure and detailed retort.
        Always glad to have constructive “right of reply”
        Regards
        Colin

  2. Lee Morris says:

    Hi Col,

    Great article; can you clarify something for me please?

    “So, imagine dissecting your C7000 chassis, consolidating all your OAs, Virtual Connect modules, FC modules and switches at the top of the rack, and expanding your 16-slot chassis to a 160-slot chassis, and you’re pretty much there.”

    “Bandwidth should not be a consideration, as each chassis can have up to 160 Gbps of bandwidth (80 Gbps per fabric, active/active).”

    Surely there is a trade-off between the 160 Gbps of bandwidth per chassis and the 20-chassis (160-slot) option – I can’t imagine the FIs can give that much bandwidth across 20 chassis.

    Cheers

    Lee

    • ucsguru says:

      Hi Lee
      Great comment, and you are absolutely right: there is a trade-off of performance vs. scale.
      Currently the largest Fabric Interconnect has 96 ports, so 20 chassis, each with the maximum of 80 Gbps per fabric, would require at least 160 ports per fabric, which obviously you can’t have.

      However, in my experience deployments of 20 chassis per domain are very rare, and chassis that actually require 160 Gbps of I/O rarer still.
      In the real world most full chassis are quite happy with 2 x 10 Gbps per fabric (40 Gbps total), but what I tend to do in most deployments is provide 4 x 10 Gbps links per fabric, which provides 80 Gbps of I/O per chassis. Whilst certainly overkill for most environments, this provides what I call the “wire once” approach: you just wire 4 cables to each fabric and then forget it; however that chassis expands or contracts, the wiring does not need to change. And with 4 links per fabric, even if you did have the maximum number of chassis (20 currently supported), this equates to 80 ports, so with a 96-port FI you still have 16 ports for LAN and SAN connectivity.

      But as mentioned, in the real world 20-chassis single pods are rare, so you may be best suited to a mixed-bandwidth deployment with 2 or 3 types of chassis: a default chassis with 2 ports per fabric, a high-I/O chassis with 4 ports per fabric and, if you really want to go to town, an ultra-high-I/O chassis with 8 x 10 Gbps ports per fabric. Bear in mind that if you want the benefits of the 8-port FEX (2208XP) you will also need the VIC 1280 mezzanine cards.
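      If it helps to see the port arithmetic laid out, here is a rough Python sketch of the per-FI uplink budget for that kind of mixed-bandwidth design. The 96-port FI and 10 Gbps links match the figures above; the chassis counts in the mix are purely illustrative, not a recommendation.

```python
# Per-fabric uplink budget on a 96-port Fabric Interconnect for a mixed chassis estate.
FI_PORTS = 96      # largest Fabric Interconnect mentioned above
LINK_GBPS = 10     # each chassis-to-FI link runs at 10 Gbps
FABRICS = 2        # fabrics A and B, active/active

# chassis class -> (number of chassis, links per fabric); counts are illustrative only
chassis_mix = {
    "default (2 links per fabric)": (12, 2),
    "high I/O (4 links per fabric)": (4, 4),
    "ultra high I/O (8 links per fabric, 2208XP + VIC 1280)": (2, 8),
}

ports_used = 0
for name, (count, links) in chassis_mix.items():
    per_chassis_gbps = links * LINK_GBPS * FABRICS
    ports_used += count * links
    print(f"{count:2d} x {name}: {per_chassis_gbps} Gbps of I/O per chassis")

print(f"Chassis uplink ports used per FI: {ports_used} of {FI_PORTS}")
print(f"Ports left per FI for LAN and SAN uplinks: {FI_PORTS - ports_used}")
```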

      Thanks again for the comment
      Regards
      Colin

    • My co-worker Sean McGee published a blog article that covers this whole performance vs scale argument. You can find it here: http://www.mseanmcgee.com/2012/05/introduction-to-the-new-cisco-ucs-6296up-fabric-interconnect-and-2204xp-io-module/
