A communication standard for AI systems in datacenters is born

Google Data Center
Sheila Zabeu

June 14, 2024

In response to NVIDIA’s dominance, a group of major technology companies has come together to develop a new high-speed, low-latency communication standard for linking Artificial Intelligence (AI) systems in datacenters. The consortium will create the Ultra Accelerator Link (UALink) standard and includes AMD, Broadcom, Cisco, Google, Hewlett Packard Enterprise (HPE), Intel, Meta and Microsoft.

The aim of the UALink Consortium is to define and establish an open standard so that AI accelerators can communicate more effectively. UALink will give OEMs, IT professionals and system integrators an easier path to integration, as well as greater flexibility and scalability in datacenters running AI solutions.

More specifically, the group will develop a specification to define high-speed, low-latency interconnections for scalable communication between accelerators and switches in AI computing clusters.

According to Tom’s Hardware, the goal of UALink’s 1.0 specification, which should be available in the third quarter of 2024, is to interconnect up to 1,024 accelerators in an AI computing pod. It will compete with NVIDIA’s NVLink, which is most likely why NVIDIA is not part of the new association.

The first major difference between UALink and NVLink is the open-standard approach of the former, which contrasts with the proprietary nature of the latter. Because it is open, UALink is intended to stimulate collaboration and joint development across the industry, accelerating advances in multi-vendor AI hardware.

NVIDIA versus the others

The publication HPCwire, which has been covering the high-performance computing sector since 1987, explains that there are three ways to connect GPUs:

1. Via the PCIe bus, in servers that can generally support 4 to 8 GPUs on the same bus. This number can grow to 32 using technologies such as the GigaIO FabreX memory fabric.

2. By interconnecting servers that host GPUs, using Ethernet or InfiniBand networks. Ethernet has long been the standard of choice for computer networks and has recently been given a performance push through the creation of the Ultra Ethernet Consortium, while NVIDIA effectively has sole ownership of the InfiniBand market. According to HPCwire, the Ultra Ethernet Consortium was created to be the “InfiniBand” of everyone else.

3. By interconnecting GPUs directly: Recognising the need for faster and more scalable connections between GPUs, NVIDIA developed NVLink, capable of transferring data at 1.8 terabytes per second between GPUs. NVLink switches at rack level can support up to 576 fully connected GPUs in a computing fabric. A group of GPUs connected via NVLink is called a pod. This is the level at which the new UALink will operate (see the sketch after this list).
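To make the distinction between the first and third options concrete, here is a minimal sketch that inspects how the GPUs in a single server are attached. It relies on NVIDIA's NVML management library through the community pynvml Python bindings, which do expose the NVLink queries used below; treat it as illustrative rather than production tooling, and note that no equivalent public tooling exists yet for UALink, whose 1.0 specification has not been published.

```python
# Minimal sketch: report whether each GPU in this server sits on an NVLink fabric
# (option 3 above) or is reachable only over the PCIe bus (option 1).
# Assumes an NVIDIA system with the pynvml bindings installed (pip install nvidia-ml-py).

import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)

        active_links = 0
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
            except pynvml.NVMLError:
                break  # link index not present, or NVLink unsupported on this GPU
            if state == pynvml.NVML_FEATURE_ENABLED:
                active_links += 1

        attachment = "NVLink fabric" if active_links else "PCIe only"
        print(f"GPU {i} ({name}): {active_links} active NVLink links -> {attachment}")
finally:
    pynvml.nvmlShutdown()
```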

“In a very short period of time, the technology industry has embraced the challenges that AI and HPC (high performance computing) have revealed. Interconnecting accelerators such as GPUs requires a holistic perspective to improve efficiency and performance. At the Ultra Ethernet Consortium, we believe that UALink’s approach to solving pod cluster problems complements our own expansion protocol, and we look forward to collaborating on creating an open, ecosystem-friendly, industry-wide solution that addresses both types of needs in the future,” says J. Metz, chairman of the Ultra Ethernet Consortium.

You could say that in terms of scalability and performance, UALink and NVLink are on a par. UALink is being developed to connect up to 1,024 accelerators within an AI computing pod.

Source: UALink Promoter Group

The fifth generation of NVLink significantly improves scalability for larger multi-GPU systems, according to NVIDIA. A single NVIDIA Blackwell Tensor Core GPU supports up to 18 NVLink connections at 100 gigabytes per second (GB/s) each, for a total bandwidth of 1.8 terabytes per second (TB/s), twice the bandwidth of the previous generation and more than 14 times the bandwidth of PCIe Gen5.
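The headline figure is straightforward multiplication; the short check below shows where the 1.8 TB/s and the roughly 14x numbers come from, using an assumed value of about 128 GB/s bidirectional for a PCIe Gen5 x16 slot, which is not stated in the article itself.

```python
# Back-of-the-envelope check of the NVLink 5 figures quoted above (illustrative only).
links_per_gpu = 18            # NVLink connections per Blackwell GPU
gbps_per_link = 100           # GB/s per link
total = links_per_gpu * gbps_per_link
print(total)                  # 1800 GB/s, i.e. 1.8 TB/s per GPU

pcie_gen5_x16 = 128           # assumed GB/s (bidirectional) for a PCIe Gen5 x16 slot
print(total / pcie_gen5_x16)  # ~14, matching the "more than 14 times PCIe Gen5" claim
```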

Given that NVIDIA’s NVLink is already well established in the AI interconnect segment and that the first UALink products are only expected to reach the market at the beginning of 2025, UALink is unlikely to see a significant volume of deployments before 2026, speculates the STH website.