Meta evaluates new design for datacenters

With an eye on the new workloads emerging in artificial intelligence, Meta may be evaluating a redesign for its next generation of datacenters. Mark Zuckerberg’s company has already halted construction of several new facilities, perhaps to study what future datacenters should look like if they are to handle higher levels of artificial intelligence (AI) processing and experiments in the metaverse.

The first signs came in mid-December, with the news that Meta would pause or cancel construction of two new buildings on its campus in Odense, Denmark, according to the Data Center Dynamics (DCD) website. Subsequently, a building in Temple, Texas, was also affected.

At the time, Peter Münster, Meta’s communications manager for the Nordic region, said that supporting AI workloads at scale requires a different type of datacenter from those built to run the company’s regular online services, and that Meta was therefore focusing on building a new generation of datacenters.

According to DCD sources, some of Meta’s 11 projects under development are under re-examination. Odense is the only place with no plans for new facilities, but it is believed that the datacenters in the final stages of construction with the old design will be completed as is. The remaining projects are being “redeveloped”, which will likely affect timelines and require new contracts. At the Open Compute Summit last October, Meta said these new facilities will be liquid-cooled.

It is unclear how these recent changes will affect Meta’s investment plans. The company’s chief strategy officer, Dave Wehner, said in October that new datacenters equipped specifically with hardware for next-generation AI are being built, and that capital investment is rising somewhat because of the change in infrastructure.

The question that remains

Curiosity killed the cat, the saying goes, but in the business world being curious is a best practice. In this case, we could ask what exactly Meta has in mind when it considers changing its datacenter design.

Recently, during the Open Compute Project (OCP) 2022 conference, Meta said it was working on innovations to help overcome obstacles and propel AI into the future. These range from new platforms for training and running AI models to a rack and power solution called Open Rack v3 (ORV3).

The ORV3 system is designed to accommodate several different forms of liquid cooling, including one model using air-assisted liquid cooling (AALC) and another using facility water. It also includes an optional design with drip-free connections between IT equipment and the liquid manifold, making maintenance and installation tasks easier.

Asked why so much effort is focused on these areas, Meta replies that rising power consumption and the demand for advances in liquid cooling are forcing the company to think differently about every element of the platform, from the rack and power systems to the design of the datacenters themselves.

And where will this growing need come from, in Meta’s view? It answers in its blog: “as we move towards the next computing platform, the metaverse, the need for new open innovations to power AI becomes even clearer.”

The chart below, provided by Meta, shows projections of power-consumption growth for high-bandwidth memory (HBM) and training modules, how these trends will require different cooling technologies over the years, and the limits associated with each technology.

Projections of energy consumption growth
Source: Meta

AI and metaverse in Meta’s sights

In early 2022, Meta unveiled its AI Research SuperCluster (RSC), which it ranked among the fastest AI supercomputers in the world. The machine was already being used to train natural language processing (NLP) and computer vision models for research, with the goal of one day training models with trillions of parameters. Ultimately, the work done by the RSC will pave the way for the development of Meta’s next big platform, the metaverse, in which AI applications will play an important role.

In this scenario, high-performance supercomputers are critical for training complex models. The first generation of this type of infrastructure, designed in 2017, used 22,000 NVIDIA V100 Tensor Core GPUs in a single cluster to run 35,000 training tasks daily. This was Meta’s research standard in terms of performance, reliability, and throughput.

Today, the RSC brings together 760 NVIDIA DGX A100 systems as compute nodes, for a total of 6,080 GPUs, with each A100 GPU being far more powerful than the V100 used previously. Each DGX communicates via an NVIDIA Quantum 1600 Gb/s InfiniBand two-tier Clos fabric with no oversubscription. The storage layer has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache on Penguin Computing Altus systems and 10 petabytes of Pure Storage FlashBlade.

Meta plans to increase the number of GPUs in the RSC from 6,080 to 16,000, which is expected to boost AI training performance by more than 2.5 times. The InfiniBand fabric is expected to support up to 16,000 ports in a two-tier topology without oversubscription. The storage system, in turn, will have a bandwidth of 16 TB/s and capacity in the exabytes to meet the growing demand.
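
As a rough sanity check on the RSC figures quoted above, the short Python sketch below reproduces the arithmetic. It assumes eight GPUs per DGX A100 system, which the article does not state explicitly but which matches 760 systems adding up to 6,080 GPUs. The function name and dictionary keys are purely illustrative, and the ratio of GPU counts is only a proxy for the performance gain Meta projects.

```python
# Figures quoted in the article
DGX_NODES = 760               # NVIDIA DGX A100 systems in the RSC today
CURRENT_GPUS = 6_080          # total GPUs today
PLANNED_GPUS = 16_000         # total GPUs after the planned expansion
NODE_BANDWIDTH_GBPS = 1_600   # InfiniBand bandwidth per DGX system, in Gb/s

# Assumption (not stated in the article): each DGX A100 system holds 8 GPUs
GPUS_PER_NODE = 8


def rsc_summary():
    """Return a few derived figures for the RSC cluster (illustrative only)."""
    return {
        # GPU total implied by the node count and the 8-GPU assumption
        "gpus_from_nodes": DGX_NODES * GPUS_PER_NODE,
        # Share of the node's InfiniBand bandwidth available per GPU
        "per_gpu_gbps": NODE_BANDWIDTH_GBPS / GPUS_PER_NODE,
        # Growth in GPU count, a rough proxy for the performance gain
        "gpu_growth": PLANNED_GPUS / CURRENT_GPUS,
        # DGX systems needed for the planned GPU count under the same assumption
        "planned_nodes": PLANNED_GPUS // GPUS_PER_NODE,
    }


if __name__ == "__main__":
    for name, value in rsc_summary().items():
        print(f"{name}: {value}")
```

Under that assumption, the node count matches the 6,080 GPUs quoted, and the GPU count grows by a factor of roughly 2.6, which is consistent with the "more than 2.5 times" performance gain Meta expects.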