How to build a server in “100 easy steps”: The growing pains of modern data centers

Configuring GPU data centers is proving to be a headache

The big picture: “If you completely upend the way data centers have been built for the past 10 years, you will inevitably face some growing pains.” While the headlines are all about the rise of AI, the reality on the ground involves plenty of headaches.

When speaking to systems integrators and others standing up large computing systems, we hear a constant stream of complaints about the difficulty of getting large GPU clusters operational.

The main issue is liquid cooling. GPU systems run hot, with racks drawing tens of thousands of watts of power. Standard air cooling is insufficient, which has led to the widespread adoption of liquid cooling systems. This shift has driven up the stock prices of companies like Vertiv, which supply these systems.
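To put rough numbers on that, here is a back-of-the-envelope sketch in Python; the per-GPU, per-server, and per-rack figures are illustrative assumptions rather than vendor specifications.

```python
# Back-of-the-envelope rack power. All figures below are illustrative
# assumptions, not vendor specifications.
GPUS_PER_SERVER = 8
WATTS_PER_GPU = 700          # rough rated draw of a high-end training GPU
HOST_OVERHEAD_WATTS = 2_000  # CPUs, memory, NICs, fans per server (rough guess)
SERVERS_PER_RACK = 4

server_watts = GPUS_PER_SERVER * WATTS_PER_GPU + HOST_OVERHEAD_WATTS
rack_watts = server_watts * SERVERS_PER_RACK

# Prints roughly 7.6 kW per server and 30 kW per rack -- several times what a
# typical air-cooled rack was designed to dissipate.
print(f"~{server_watts / 1000:.1f} kW per server, ~{rack_watts / 1000:.0f} kW per rack")
```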

Editor’s Note:
Guest author Jonathan Goldberg is the founder of D2D Advisory, a multi-functional consulting firm. Jonathan has developed growth strategies and alliances for companies in the mobile, networking, gaming, and software industries.

However, liquid cooling is still relatively new to data centers, and there aren’t enough people familiar with installing these systems. As a result, liquid cooling has emerged as the leading cause of failures in data centers. There are all kinds of explanations for this, but they all boil down to the fact that water and electronics don’t mix well. The industry will sort this out eventually, but it’s a prime example of the growing pains data centers are experiencing.

There are also many challenges in configuring GPUs. This isn’t surprising – most data center professionals have a wealth of experience configuring CPUs, but for many of them, GPUs are unfamiliar territory.

On top of that, Nvidia prefers to sell complete systems, which presents a whole new set of difficulties. For example, Nvidia’s firmware and BIOS systems aren’t entirely new, but they are just different enough, and immature enough, to cause delays and an unusually high number of bugs. Add Nvidia’s networking layer into the mix, and it’s easy to see how frustrating the process has become. There’s simply a lot of new technology for experts to master in a very short timeframe.
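To give a flavor of the busywork this creates, below is a minimal sketch of the kind of sanity check integrators end up scripting for themselves: it asks nvidia-smi for each GPU’s VBIOS and driver version on a node and flags mismatches. The specific checks any given operator needs will differ; this is an illustration, not Nvidia’s own tooling.

```python
import subprocess
from collections import Counter

# Ask nvidia-smi (assumed to be installed on the node) for each GPU's
# VBIOS and driver version.
output = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,vbios_version,driver_version",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

versions = Counter()
for line in output.strip().splitlines():
    index, name, vbios, driver = (field.strip() for field in line.split(","))
    versions[(vbios, driver)] += 1
    print(f"GPU {index}: {name} | VBIOS {vbios} | driver {driver}")

# Mixed firmware or driver versions on one node are a common source of
# hard-to-diagnose failures, so flag them loudly.
if len(versions) > 1:
    print(f"WARNING: {len(versions)} different VBIOS/driver combinations found")
else:
    print("All GPUs on this node report matching VBIOS and driver versions.")
```

Multiply that kind of check across every node, NIC, switch, and BMC in a cluster, and the scale of the configuration problem becomes clear.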

In the grand scheme of things, these are just speed bumps. None of these issues is severe enough to halt AI adoption, but in the near term, they will likely become more prominent and more high-profile. We expect hyperscalers to delay or slow down their GPU rollouts to address these challenges. To be more precise, we are likely to hear more about these delays because they have already begun.

AMD’s recent $5 billion bet on the data center:

“Recently, people have asked us about the logic behind AMD’s acquisition of ZT Systems. Since this deal and the growing complexity of installing AI clusters are closely related, we can use ZT as a lens to understand broader issues in the industry.”

Let’s say Acme Semiconductor wants to enter the data center market. They spend a few hundred million dollars to develop a processor. Then they try to sell it to their hyperscaler customers, but the hyperscalers don’t want just a chip – they want a working system on which to test their software.

So, Acme goes to an ODM (Original Design Manufacturer) and spends a few hundred thousand dollars to design a working server, complete with storage, power, cooling, networking, and everything else. Acme builds a few dozen of these servers and hands them out to their top sales prospects. At this point, Acme is out almost $1 million, and they notice that their chip accounts for only 20% of the system’s cost.
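For illustration, here is how those rough numbers can hang together; every figure below is a placeholder chosen to be consistent with the ranges in the story, not a real price.

```python
# Placeholder figures matching the rough ranges in the story above;
# none of these are real prices.
design_cost = 300_000   # "a few hundred thousand dollars" for the ODM design
demo_units = 30         # "a few dozen" demo servers
system_cost = 20_000    # assumed bill of materials per demo server
chip_cost = 4_000       # Acme's own silicon inside each server

total_outlay = design_cost + demo_units * system_cost
chip_share = chip_cost / system_cost

print(f"Demo program outlay: ~${total_outlay:,}")                   # ~$900,000
print(f"Acme's chip as a share of system cost: {chip_share:.0%}")   # 20%
```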

The hyperscalers then spend a few months testing the system. One of them likes Acme’s performance enough to put it through a more rigorous evaluation, but they don’t want a standard server; they want one designed specifically for their data center operations. This means a new server design with a completely different configuration of storage, networking, cooling, and more. The hyperscaler also wants Acme to build these test systems with their preferred ODM.

Eager to close the deal, Acme foots the bill for this new design, though at least the hyperscaler pays for the test systems – Acme finally has some revenue, maybe $100,000. While the first hyperscaler is running its multi-month evaluation, a second customer expresses interest. Of course, they want their server designed with their own preferred ODM. Acme, needing the business, covers the cost of this design as well.

Acme goes to all the OEMs to see if any will design a catalog system to streamline the process. The OEMs are all very friendly and curious about what Acme is doing – great work, guys – but they will only commit to building something once Acme secures more business.

Finally, a customer wants to buy in volume – a big win for Acme. This time, because there is real volume involved, the ODM agrees to do the design. However, the hyperscaler will use its internally designed networking and security chips in the new server, keeping them secret. Acme has never seen these chips and knows little about the new server, which the customer and the ODM designed directly. The ODM builds a batch of servers, then wires them up inside the hyperscaler’s data center, flips the power switch, and things immediately start to break.

This is expected; bugs are everywhere. But soon everyone starts blaming Acme for the problems, ignoring the fact that Acme was largely excluded from the design process. Their chip is the element least familiar to the ODM and the customer. Acme worked with the customer to iron out bugs during the evaluation cycle, but this is a different system.

Much of the design is new, and the stakes are much higher, so everyone is under stress. Acme sends its field engineers to the very remote data center to get hands-on with the system. The three teams work through the bugs, finding more along the way. Acme’s processor enters an obscure error mode when interacting with the hyperscaler’s security chip. The networking functions are flaky and perform well below specification. Every chip is running a different firmware version that is inconsistent with the others.

To top it off, liquid cooling – something no one on the debugging team has worked with before – probably causes 50% of the problems. The deployment drags on as the teams work through the issues. At some point, something important needs to be completely replaced, adding more delays and costs. After months of work, Acme’s system finally enters production. But then Acme’s second customer decides they want to do a deeper evaluation. And so the whole process starts again.

And if all of that doesn’t sound painful enough, just wait until the lawyers get involved. Trust me, it’s not pretty.

Acme had to spend nine months negotiating onerous terms with the hyperscaler from a very weak position. And that was just to start the project.

When it came to designing the custom server, the three companies (Acme, the ODM, and the customer) spent six weeks negotiating the non-disclosure agreement (NDA). This NDA was necessary to protect the intellectual property of all three companies.

This is how servers have been built for years. Then Nvidia entered the market, bringing their own server designs. Not only that, but they brought designs for entire racks. Nvidia has been designing systems for 25 years, dating back to their work on graphics cards. They also build their own data centers, so they have an in-house team experienced in handling all of these issues.

To compete with Nvidia, AMD can either spend five years replicating Nvidia’s team or buy ZT. In theory, ZT can help AMD eliminate almost all of the friction outlined above. It’s too soon to tell how well this will work in practice, but AMD has gotten pretty good at merger integration. Frankly, we would gladly pay $5 billion to avoid negotiating a three-way NDA and Master Service Agreement ever again.
