As part of a plan to build a new generation of infrastructure for Artificial Intelligence (AI), Meta recently revealed details of advances in this area, including an optimised new data centre design, the company's first chip built specifically for running AI models, and the second phase of a 16,000-GPU supercomputer for AI research. According to Meta, these efforts will enable it to develop larger and more sophisticated AI models and then deploy them efficiently at scale.

Since opening its first data centre in 2010, Meta has been building a global infrastructure for its family of applications. According to the company, AI has been an important part of these systems for many years, through elements such as the Big Sur hardware, the development of the PyTorch machine learning framework and the supercomputer for AI research.

The new data centre design will support future generations of hardware focused on AI training and inference, accommodating liquid-cooled hardware and a high-performance AI network that connects thousands of chips in AI training clusters. It will also be faster and more cost-effective to build, and will complement other new pieces of hardware such as the Meta Scalable Video Processor (MSVP), Meta's first internally developed ASIC solution for powering video workloads, an area that is growing constantly at the company.

The Meta Training and Inference Accelerator (MTIA), meanwhile, is Meta's first family of in-house accelerator chips targeted at AI inference workloads. It offers greater computing power and efficiency than CPUs and is customised for Meta's internal workloads. By combining MTIA chips with GPUs, Meta says it can deliver better performance, lower latency and greater efficiency. The first generation of MTIA was designed in 2020, and the inference accelerator is now part of a full-stack solution that includes the chip, PyTorch and recommendation models.
It is manufactured on TSMC's 7 nm process and runs at 800 MHz, delivering 102.4 TOPS (tera-operations per second) at INT8 precision and 51.2 TFLOPS (tera floating-point operations per second) at FP16 precision, with a thermal design power (TDP) of 25 W.

On the supercomputer front, the news unveiled by Meta concerns the second phase of the Research SuperCluster (RSC), which the company believes is one of the fastest AI supercomputers in the world. It is built to train the next generation of large AI models and to power new augmented reality tools, content understanding systems, translation technology and more. Its 16,000 GPUs are all reachable over a Clos network fabric with the bandwidth to serve each of the 2,000 training systems. The RSC can achieve almost 5 exaflops of computing power; an exaflop is one quintillion, or one billion billion, calculations per second. This level of performance comes from using 2,000 NVIDIA DGX A100 systems as the RSC's compute nodes, a total of 16,000 NVIDIA A100 Tensor Core GPUs, connected through an NVIDIA Quantum InfiniBand fabric with 16 Tb/s of bandwidth.

According to Meta, projects already running on the RSC are helping it accelerate research in areas such as large language models (LLMs), universal speech translation and theorem proving. Meta is observing the performance of these first projects to understand how to better manage GPU allocation and to extract other lessons for the supercomputer's future development. It has already learned, for example, that capacity allocation can adopt a dynamic QoS model to reduce resource contention across the 16,000 GPUs. Working in partnership with Penguin Computing, it has also improved overall cluster management and kept availability consistently above 95%.
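The headline numbers above are internally consistent, and it can be useful to see why. The sketch below checks them with simple arithmetic; the 312 TFLOPS dense FP16 peak per A100 is NVIDIA's published figure, while the ops-per-cycle value is merely derived from the quoted clock and throughput, not something Meta has stated about MTIA's internal layout.

```python
# Sanity checks on the figures quoted above (peak theoretical
# throughput, not sustained performance).

MTIA_CLOCK_HZ = 800e6        # 800 MHz
MTIA_INT8_TOPS = 102.4       # tera-operations/s at INT8
MTIA_FP16_TFLOPS = 51.2      # tera-FLOP/s at FP16

# INT8 throughput is exactly twice FP16, a common ratio for
# 8-bit integer datapaths.
assert MTIA_INT8_TOPS / MTIA_FP16_TFLOPS == 2.0

# Operations retired per clock cycle at the peak INT8 rate
# (a derived figure, not an official MTIA spec).
ops_per_cycle = MTIA_INT8_TOPS * 1e12 / MTIA_CLOCK_HZ
print(f"{ops_per_cycle:,.0f} INT8 ops per cycle")   # 128,000

# RSC: 16,000 A100 GPUs, each peaking at 312 TFLOPS dense FP16.
rsc_peak_flops = 16_000 * 312e12
print(f"{rsc_peak_flops / 1e18:.3f} exaflops")      # 4.992 — "almost 5"
```

The 4.992-exaflop result shows where Meta's "almost 5 exaflops" claim comes from: it is simply the aggregate peak FP16 throughput of the 16,000 A100s.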
During the International Supercomputing Conference (ISC) held in Germany, Intel gave more details about an AI chip planned for release in 2025. According to Jeff McVeigh, vice president of Intel's supercomputing group, the Falcon Shores platform will no longer bring CPU and GPU together in an "XPU"; the market, he argued, has changed so much that the integration no longer makes sense. Falcon Shores will now be a GPU-only design with 288 gigabytes of memory and support for 8-bit floating-point processing.

"When workloads are fixed, when there's a lot of clarity that they're not going to change dramatically, integration is great," McVeigh explains, adding that current AI and HPC workloads are too dynamic for integration to make sense. The decision not to pursue a combined CPU-GPU architecture reflects Intel's changing strategy for addressing Nvidia's lead in the AI chip market, as well as AMD's upcoming MI300 chip.

Data centre and AI market

A direct consequence of the growth of the generative AI market, which is expected to reach $126.5 billion by 2031 at a compound annual growth rate of 32%, is increased demand for data centre resources. As a result, the power densities required by the IT systems that support AI applications are also growing, which poses challenges for existing data centres, especially older facilities.

"This situation makes the move to cloud services imperative for many organisations, although they must also decide how to manage their current infrastructure and facilities," said Chris Street, director of data centres at JLL, in an interview with Tech Wire Asia. Beyond the risk of excluding companies with little capacity to invest in AI workloads, another concern is sustainability. According to Street, the data centre industry, other technology companies, government agencies, regulators and communities need to collaborate to drive sustainability efforts.
"These efforts start with assessing how data centre strategies are aligned with corporate goals and objectives as well as operational strategies and audits of third-party service providers," Street explains.
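As a quick sanity check on the market forecast quoted earlier ($126.5 billion by 2031 at a 32% CAGR), compound growth lets us back out the market size the forecast implies today. The 2023 base year is an assumption for illustration; the article states only the 2031 target and the growth rate.

```python
# Back-of-the-envelope check of the generative AI market forecast.
# Base year of 2023 is an assumption; the forecast only gives the
# 2031 target and the 32% compound annual growth rate (CAGR).

cagr = 0.32
target_value = 126.5            # billions of USD, 2031
base_year, target_year = 2023, 2031

implied_base = target_value / (1 + cagr) ** (target_year - base_year)
print(f"Implied {base_year} market size: ${implied_base:.1f}B")  # ~$13.7B
```

In other words, a 32% CAGR over eight years multiplies the market roughly ninefold, so the forecast implies a present-day market in the low tens of billions of dollars.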