The extraordinary growth in the use of Artificial Intelligence (AI) across many sectors is posing challenges and forcing changes in the design and operation of datacenters so that they can meet ever-increasing demand. An estimate by Schneider Electric, a company specializing in energy management and automation, puts current AI power demand at 4.3 GW, a figure expected to grow at a compound annual rate of 26% to 36%, reaching a total of between 13.5 GW and 20 GW by 2028.

Against this backdrop, Schneider Electric has prepared a document explaining how AI attributes and trends create challenges for every element of datacenter physical infrastructure, including power systems, cooling, racks and management software. It also offers guidance on how to tackle these challenges and presents a vision of what lies ahead in datacenter design.

“With the advance of AI, specific requirements are emerging in datacenter design and management. To meet the challenges, it is important to consider various attributes and trends in AI workloads that impact both new and existing datacenters,” says Pankaj Sharma, executive vice president of Schneider Electric's Secure Power and datacenter business. “AI applications, especially training clusters, are highly demanding in terms of the processing power provided by GPUs or specialized AI accelerators. This puts significant pressure on the power and cooling infrastructure of datacenters. And with rising energy costs and environmental concerns, datacenters need to adopt energy-efficient hardware, such as high-efficiency power and cooling systems, and renewable energy sources to help reduce operating expenses and their carbon footprint,” he adds.
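The 2028 projection can be sanity-checked by compounding the 4.3 GW baseline. A minimal sketch, assuming a five-year horizon (2023 to 2028, which the figures imply):

```python
# Compound the current AI power demand estimate at the low and high CAGR bounds.
base_gw = 4.3    # Schneider Electric's estimate of current AI power demand
years = 5        # assumed horizon: 2023 -> 2028

low = base_gw * 1.26 ** years    # 26% compound annual growth
high = base_gw * 1.36 ** years   # 36% compound annual growth

print(f"Projected 2028 AI power demand: {low:.1f} to {high:.1f} GW")
```

The result, roughly 13.7 to 20.0 GW, lines up with the 13.5 GW to 20 GW range cited in the estimate.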
According to the guide, four AI attributes and trends underlie the physical infrastructure challenges: AI workloads (training and inference), GPU thermal design power (TDP), network latency, and the size of AI clusters.

On the power side, the challenges posed by AI workloads are: (1) 120/208 V distribution that is impractical to deploy; (2) small power distribution blocks that waste space; (3) standard 60/63 A rack PDUs (power distribution units) that are impractical to deploy; (4) increased arcing risks that complicate working practices; (5) lack of load diversity that increases the risk of upstream circuit breakers tripping; and (6) high rack temperatures that increase the risk of faults and other hazards.

In terms of cooling, the densification of server clusters for AI training is forcing the migration from air cooling to liquid cooling. Although less dense clusters and inference servers still use more conventional datacenter cooling methods, the guide lists the main cooling challenges to be addressed: (1) air cooling that is inadequate for AI clusters above 20 kW/rack; (2) lack of standardized designs and site constraints that complicate the adoption of liquid cooling; (3) unknown future TDPs that increase the risk of cooling design obsolescence; (4) inexperience that complicates installation, operation, and maintenance; (5) liquid cooling that increases the risk of leaks within racks; and (6) limited options for sustainable liquid-cooling fluids.

In the case of racks, AI workloads generate four main challenges: (1) standard-width racks lack room for power and cooling equipment; (2) standard-depth racks lack room for deep AI servers and cabling; (3) standard-height racks lack room for the required number of servers; and (4) standard racks cannot support the weight of AI equipment.
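The arithmetic behind the power challenges above can be sketched with a simple per-phase current calculation. Assuming a hypothetical 40 kW AI training rack (the rack size and unity power factor are illustrative, not from the guide), the sketch compares 120/208 V distribution with the 415/240 V distribution mentioned later, and checks the rack against the roughly 20 kW/rack air-cooling limit the guide cites:

```python
import math

def phase_current_a(power_w: float, line_voltage_v: float, pf: float = 1.0) -> float:
    """Per-phase current of a balanced three-phase load: I = P / (sqrt(3) * V_LL * PF)."""
    return power_w / (math.sqrt(3) * line_voltage_v * pf)

RACK_KW = 40.0              # hypothetical dense AI training rack
AIR_COOLING_LIMIT_KW = 20.0  # air cooling inadequate above ~20 kW/rack, per the guide

i_208 = phase_current_a(RACK_KW * 1000, 208)  # 120/208 V distribution
i_415 = phase_current_a(RACK_KW * 1000, 415)  # 415/240 V distribution

print(f"Per-phase current at 208 V: {i_208:.0f} A")  # far beyond a standard 60/63 A rPDU
print(f"Per-phase current at 415 V: {i_415:.0f} A")  # fits within a 60/63 A rPDU
print("Exceeds air-cooling limit:", RACK_KW > AIR_COOLING_LIMIT_KW)
```

At 208 V the rack draws about 111 A per phase, which is why the guide calls 120/208 V distribution and standard 60/63 A rack PDUs impractical for AI clusters; at 415 V the same load needs only about 56 A, and the rack still exceeds the air-cooling threshold.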
Maintaining high-density liquid-cooled clusters alongside traditional air-cooled IT systems makes certain management software tools more critical. Even though some AI training workloads may not require high availability, inadequate design and monitoring can cause downtime on adjacent business-critical racks. The two main management software challenges in the context of high-density AI training workloads are: (1) high power densities and the demands of AI clusters, which generate design uncertainties; and (2) smaller error margins, which increase operational risk in a dynamic environment.

The guide presents guidance to help address each of these challenges. In addition, several future technologies and design approaches should also help: (1) rack PDUs (rPDUs) optimized for AI; (2) medium-voltage distribution with 415/240 V transformers; (3) solid-state transformers; (4) solid-state circuit breakers; (5) sustainable dielectric fluids; (6) ultra-deep IT racks; and (7) greater interaction with, and optimization for, power grids.
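The smaller error margins mentioned above are, in practice, a monitoring problem: in a mixed air- and liquid-cooled environment, management software has to flag racks running close to their power or thermal limits before a trip or a leak takes down adjacent equipment. A minimal sketch of such a headroom check, with entirely hypothetical rack telemetry and thresholds (none of these figures come from the guide):

```python
# Hypothetical rack telemetry: (name, measured kW, breaker-rated kW, inlet temp C)
racks = [
    ("ai-train-01", 38.5, 40.0, 30.0),  # liquid-cooled AI training rack
    ("web-tier-07",  6.2, 12.0, 24.0),  # adjacent business-critical air-cooled rack
]

POWER_HEADROOM = 0.10  # assumed policy: alert within 10% of the breaker rating
MAX_INLET_C = 27.0     # assumed inlet-temperature limit

def alerts(racks):
    """Return alert messages for racks with too little power or thermal headroom."""
    out = []
    for name, kw, rated_kw, inlet_c in racks:
        if kw >= rated_kw * (1 - POWER_HEADROOM):
            out.append(f"{name}: load {kw} kW within 10% of {rated_kw} kW rating")
        if inlet_c > MAX_INLET_C:
            out.append(f"{name}: inlet {inlet_c} C above {MAX_INLET_C} C limit")
    return out

for message in alerts(racks):
    print("ALERT:", message)
```

Here the dense AI rack trips both checks while its air-cooled neighbour is fine, illustrating why the guide treats monitoring tools as more critical in mixed high-density environments.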