AI Thought of the Day
Let’s talk about a myth I still hear far too often:
“The network cost is generally about 10–15% of an AI build, and the difference in cost between an InfiniBand and Ethernet back-end fabric isn’t compelling enough for me to ‘risk’ going Ethernet.”
Here’s the reality, and let’s keep the maths simple.
If you’re buying a £100M AI platform, the network might ‘only’ be £10M of that. And the cost delta between an InfiniBand back‑end and an Ethernet back‑end might only be a few percent of that £10M.
So will a serious AI customer really ‘risk’ the performance of their entire platform just to save a few quid on the scale-out / back-end fabric?
I’ve never heard a large-scale AI buyer say:
“Let’s risk GPU idle time on a £500M supercomputer to save £5M on the network.”
Because the real financial equation isn’t about the network CAPEX at all.
The real cost driver is this:
Even low single-digit drops in GPU utilisation cost more (often far more) than the entire network savings.
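To make that concrete, here is a rough back-of-envelope sketch. All of the numbers (the £500M GPU spend, the £5M network saving, a four-year amortisation period and a 2% utilisation drop) are illustrative assumptions of mine, not figures from any real deal:

```python
# Back-of-envelope: GPU idle time vs. network savings (illustrative assumptions only)

gpu_capex = 500_000_000        # £: GPU spend on the platform (assumed)
network_savings = 5_000_000    # £: one-off saving from the 'cheaper' fabric (assumed)
useful_life_years = 4          # straight-line amortisation period (assumed)
utilisation_drop = 0.02        # extra GPU idle time caused by the fabric (assumed)

# Capital effectively stranded each year by the extra idle time
wasted_capex_per_year = gpu_capex / useful_life_years * utilisation_drop

# Over the life of the platform
wasted_capex_total = wasted_capex_per_year * useful_life_years

print(f"Stranded GPU capital per year:      £{wasted_capex_per_year:,.0f}")
print(f"Stranded GPU capital over {useful_life_years} years:   £{wasted_capex_total:,.0f}")
print(f"One-off network saving:             £{network_savings:,.0f}")
# With these assumptions, a 2% utilisation drop strands £10M of GPU capital
# over the platform's life: double the £5M saved on the network, before you
# even count lost revenue, energy or delayed training runs.
```

Change the assumptions however you like; the shape of the result rarely changes.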
If your GPUs are waiting around because your network is congested, mis‑tuned, or simply not engineered for scale‑out AI behaviour, that’s not just a technical issue.
That’s a material financial loss, showing up as:
- Lost revenue (GPUaaS)
- Delayed training results (slow product cycles)
- Energy + cooling (real OPEX burn)
- Total cost per token
- Stranded capital lowering ROI
All of these dwarf any saving made by choosing a ‘cheaper network’.
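The cost-per-token point in particular follows from the same arithmetic: the fully loaded cost of the platform is roughly fixed, while the number of tokens it produces scales with utilisation, so every point of idle time makes every token more expensive. A minimal sketch, again with purely illustrative numbers and a hypothetical helper function of my own:

```python
# Cost per token vs. GPU utilisation (illustrative assumptions only)

def cost_per_million_tokens(annual_platform_cost, peak_tokens_per_year, utilisation):
    """Fully loaded cost per 1M tokens at a given GPU utilisation."""
    tokens_produced = peak_tokens_per_year * utilisation
    return annual_platform_cost / tokens_produced * 1_000_000

annual_cost = 150_000_000            # £/year: amortised CAPEX + power + cooling + staff (assumed)
peak_tokens = 2_000_000_000_000_000  # tokens/year the platform could serve at 100% (assumed)

for util in (0.90, 0.88, 0.85):
    cost = cost_per_million_tokens(annual_cost, peak_tokens, util)
    print(f"Utilisation {util:.0%}: £{cost:.3f} per 1M tokens")
# Dropping from 90% to 85% utilisation raises the cost of every token you
# ever serve by roughly 6%, for the entire life of the platform.
```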
It doesn’t matter how many of the world’s fastest GPUs you buy,
if the network can’t get data to them fast enough, you may as well switch them off.
Don’t get me wrong, I’m a techie and I like nothing more than debating the intricacies of PFC and SHARP, ECMP vs packet spraying, and so on.
But the fact is, at board level the network selection has quietly shifted from a “technical architecture choice” to a core business and AI ROI decision.
And as AI environments grow into the tens of thousands of GPUs, the stakes only get bigger.

For now at least, InfiniBand still generally delivers the lowest GPU idle time. Ethernet with RoCEv2 can get close, and NVIDIA’s Spectrum-X closer still, but as my old dad used to say, ‘close only counts in horseshoes and hand grenades’. So if your main consideration is GPU idle time, InfiniBand is still generally the best choice. On top of that, in my experience and depending on the vendor, RoCEv2 can add configuration complexity and can lead to performance issues if not optimally configured, which increases operational risk.
And with supply and demand being what they are right now, you increasingly also need to take lead times and device availability into account, i.e. do you want Ethernet TODAY, or can you wait six months for an InfiniBand back-end? Or vice versa.
And in my experience, depending on the type of client, the best choice may be different anyway. A Formula One team (think neoscaler / neocloud) wants that perfectly tuned and optimised race car with zero compromises, and can afford, and justify, a team of technical specialists for every component of the car.
But as you drop down into the lower leagues, Formula 2, 3, 4 and so on (think Enterprise), this no longer holds true. Suddenly other factors come into play: operational simplicity, cost vs return, familiarity of architecture, abundance of skills in the market, ease of integration into existing operating models, and so on. In that case, ‘good may be good enough’, the phrase that has so often led to Ethernet displacing other, more proprietary technologies.
The reality, of course, is that this is rarely a cut-and-dried situation but more of a sliding scale, and what is best for you deserves careful consideration.
In a future post, I’ll break down the technical and commercial trade-offs between InfiniBand and Ethernet on the AI back-end, and how Ultra Ethernet could reshape the landscape entirely by turning that old, reliable and trusty ‘Ethernet Honda Civic’ into something more akin to that Formula One car!

My own perspective and experience — AI simply performed the CRC check.