InfiniBand innovation goes beyond bandwidth and latency


Sponsored Moving more bits over copper wire or optical cable at a lower cost per bit has been the dominant driver of data center networking since distributed systems emerged more than three decades ago.

For most of that time, InfiniBand networking has also been concerned with driving latency far below what Ethernet or other protocols could deliver, using Remote Direct Memory Access (RDMA) technology, and with providing additional capabilities such as full transport offload, adaptive routing, congestion control, and quality of service for workloads running across a network of compute engines.

This relentless pursuit of higher bandwidth and lower latency has served InfiniBand well and must continue into the future, but its continued evolution depends on many other technologies working in concert to deliver more scale, the lowest possible application latency, and more flexible topologies. This contrasts with the first 15 years or so of InfiniBand technology, when enough innovation was achieved simply by reducing port-to-port hop latency in the switch or the latency across the network interface card that links a server to the network.

As bandwidth rates increase, forward error correction must compensate for higher bit error rates during transmission, which means that latency across the switch will stay flat at best and will likely increase with each generation of technology. This is true for any InfiniBand variant as well as for Ethernet and any proprietary interconnect. So latency improvements will have to be found elsewhere in the stack.

Additionally, work that has traditionally been done on the host servers in a distributed computing system has to be moved off the very expensive general-purpose CPU cores where application code runs (or which manage the code offloaded to GPU accelerators) and onto network devices. Those devices can be the switch ASIC itself, the network interface ASIC, or a full data processing unit (DPU). The DPU is a new part of the InfiniBand stack, and it is important because it can virtualize networking and storage for the host and run security software without putting that load on the host CPUs.

The combination of all of these technologies and techniques will keep InfiniBand at the forefront of interconnects for HPC, AI, and other clustered systems.

“If you look at InfiniBand, it gives us two kinds of services,” said Gilad Shainer, senior vice president of networks for NVIDIA’s Networking division, during an InfiniBand Birds of a Feather session at the recent International Supercomputing conference. “It gives us the highest-throughput networking services on the market, running at 200 Gb/sec for more than two years now and moving to 400 Gb/sec later this year. In addition, it provides computing services through preconfigured and programmable in-network computing engines.”

Mellanox, which NVIDIA acquired in April 2020, was the first major commercial player to bring InfiniBand to the high performance computing sector. The company’s InfiniBand roadmap dates back to early 2001, when Mellanox began supplying 10 Gb/sec SDR InfiniBand silicon and boards for switches and network interface cards.

This was followed by DDR InfiniBand, running at 20 Gb/sec, in 2004, which also marked the first time that Mellanox sold systems as well as silicon and boards. In 2008, speeds doubled again to 40 Gb/sec with QDR InfiniBand, which was also when Mellanox entered the cable business with its LinkX line.

In 2011, speeds got a smaller bump with 56 Gb/sec FDR InfiniBand, and Mellanox expanded into fabric software. In 2015, 100 Gb/sec EDR InfiniBand debuted, followed in 2018 by 200 Gb/sec HDR InfiniBand, which for the first time included integrated HPC acceleration technologies. (The ConnectX family of adapters had been offloading network processing from host servers for a long time.)

Looking ahead, the speed increases in the InfiniBand specification, and therefore at InfiniBand vendors like NVIDIA that implement each version of the specification after its release, appear to be more regular than they were in the mid-2000s.

With HDR InfiniBand, customers can either boost the bandwidth on ports to 200 Gb/sec by ganging up four lanes running at an effective 50 Gb/sec each, or double the radix of the switch ASICs by running ports with only two lanes, which keeps each port at the same 100 Gb/sec speed as EDR InfiniBand. (This is implemented by NVIDIA in the Quantum InfiniBand ASICs, which were unveiled in late 2016 and started shipping about a year later.) With this radix extension, customers who do not need the higher bandwidth can flatten their networks, eliminating hops in their topologies along with some of the switches. (And it is interesting to think about how a 400 Gb/sec ASIC could double its radix again to create an even higher radix switch, as well as ports running at 200 Gb/sec and 400 Gb/sec.)
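To make the flattening argument concrete, here is a minimal sketch using the textbook capacity formula for a non-blocking fat tree. The radix values of 40 and 80 and the 1,600-host cluster are illustrative assumptions, not figures from this article, and the function names are hypothetical.

```python
def fat_tree_endpoints(radix: int, tiers: int) -> int:
    """Endpoints supported by a non-blocking fat tree with `tiers` switch levels."""
    if radix < 4 or radix % 2 != 0 or tiers < 1:
        raise ValueError("radix must be an even number >= 4 and tiers >= 1")
    # Below the top tier, each switch uses half its ports down and half up,
    # which gives radix for 1 tier, radix^2 / 2 for 2 tiers, radix^3 / 4 for 3.
    return 2 * (radix // 2) ** tiers


def tiers_needed(endpoints: int, radix: int) -> int:
    """Smallest number of switch tiers that can connect `endpoints` hosts."""
    tiers = 1
    while fat_tree_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


def worst_case_switch_hops(tiers: int) -> int:
    """Switches crossed when a packet climbs to the top tier and back down."""
    return 2 * tiers - 1


if __name__ == "__main__":
    hosts = 1600  # hypothetical cluster size
    for radix in (40, 80):  # assumed port counts: four-lane vs two-lane ports
        t = tiers_needed(hosts, radix)
        print(f"radix {radix}: {t} tiers, up to {worst_case_switch_hops(t)} switch hops")
```

Under these assumptions, doubling the radix drops the same cluster from three switch tiers to two, which removes two switch hops from the longest path and a whole layer of switches from the bill of materials.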

The next stop on the InfiniBand train is NDR, with 400 Gb/sec ports using four lanes, which NVIDIA is implementing with its Quantum-2 ASICs. These Quantum-2 ASICs have 256 SerDes running at 51.6 GHz with PAM-4 encoding and have an aggregate bandwidth of 25.6 Tb/sec unidirectional or 51.2 Tb/sec bidirectional, can process 66.5 billion packets per second, and can deliver 64 ports at 400 Gb/sec.
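The Quantum-2 figures quoted above are consistent with one another, which a few lines of arithmetic can confirm. The helper name `switch_budget` is hypothetical, and the two-lane configuration at the end is only the "what if" split the article speculates about, not a published product specification.

```python
def switch_budget(serdes_lanes: int, lanes_per_port: int, port_gbps: int) -> dict:
    """Derive port count and aggregate bandwidth from lane-level figures."""
    if serdes_lanes % lanes_per_port != 0:
        raise ValueError("lanes must divide evenly into ports")
    ports = serdes_lanes // lanes_per_port
    one_way_gbps = ports * port_gbps
    return {
        "ports": ports,
        "lane_gbps": port_gbps / lanes_per_port,       # effective rate per lane
        "unidirectional_tbps": one_way_gbps / 1000,
        "bidirectional_tbps": 2 * one_way_gbps / 1000,
    }


if __name__ == "__main__":
    # Figures from the article: 256 SerDes, four lanes per 400 Gb/sec port.
    print(switch_budget(256, 4, 400))
    # Hypothetical radix doubling: two lanes per 200 Gb/sec port.
    print(switch_budget(256, 2, 200))
```

With four lanes per port, the 256 SerDes yield 64 ports at an effective 100 Gb/sec per lane and 25.6 Tb/sec in each direction. The hypothetical two-lane split keeps the same aggregate bandwidth but spreads it across 128 ports.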

Following this on the InfiniBand roadmap is XDR, delivering 800 Gb/sec per port, and the final stop projected so far (there will undoubtedly be more) is 1.6 Tb/sec using four lanes per port. Latency will likely creep up a bit with each jump on the InfiniBand roadmap because of forward error correction, but NVIDIA is adding other technologies that mitigate this increase. Equally important, Ethernet will be subject to the same forward error correction and will see its port latencies rise as well, so the latency gap between InfiniBand and Ethernet should remain more or less the same.

As far as Shainer is concerned, the right side of the block diagram above – the side that deals with network services – is more important than the left side which deals with the growing capacities and capabilities of the raw InfiniBand transport and protocol.

“The most interesting thing is putting compute engines in the silicon of the InfiniBand network, either in the adapter or in the switch, to allow the application to execute as the data moves through the network,” he says. “There are engines that are preconfigured to do something very specific, like data reductions, which would normally be done on the host CPUs. Data reductions take longer and longer as you add more and more nodes: more data, more data in motion, and a lot of overhead. We can migrate all of the data reduction operations into the network, which gives flat latency in the single-digit microseconds regardless of system size, reduces data movement, reduces overhead, and delivers much better performance. We also have MPI tag matching engines and all-to-all engines running at wire speed for small messages at 200 Gb/sec and 400 Gb/sec.”

These in-network accelerations started in the ConnectX adapter cards, which speed up parts of the MPI stack. With the Switch-IB 2 ASICs for 100 Gb/sec EDR InfiniBand, a second-generation chip running at that speed that was announced in November 2015, Mellanox added support for what it calls the Scalable Hierarchical Aggregation and Reduction Protocol, or SHARP for short, to perform data reductions inside the network.
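The idea behind hierarchical aggregation can be sketched with a toy model in which each switch combines the vectors arriving from its children and forwards a single partial result upward. This is not NVIDIA's implementation or the SHARP API. The names `combine` and `tree_reduce` are hypothetical, and the model ignores the broadcast of the final result back down the tree.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def combine(vectors: Sequence[Vector]) -> Vector:
    """Element-wise sum, standing in for the reduction a switch would apply."""
    return [sum(column) for column in zip(*vectors)]


def tree_reduce(host_vectors: Sequence[Vector], radix: int) -> Tuple[Vector, int]:
    """Reduce host vectors up a tree; return the result and the tree depth."""
    if not host_vectors:
        raise ValueError("need at least one host vector")
    if radix < 2:
        raise ValueError("each switch must aggregate at least two children")
    level = [list(v) for v in host_vectors]
    depth = 0
    while len(level) > 1:
        # Each group of `radix` children is merged by one switch at this level.
        level = [combine(level[i:i + radix]) for i in range(0, len(level), radix)]
        depth += 1
    return level[0], depth


if __name__ == "__main__":
    hosts = [[float(i), 2.0 * i] for i in range(8)]
    result, depth = tree_reduce(hosts, radix=4)
    print(result, depth)
```

In this model each host sends its data once, and the number of aggregation levels grows only with the logarithm of the host count, which is the intuition behind latency staying nearly flat as systems scale.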

This capability was developed specifically for the “Summit” supercomputer at Oak Ridge National Laboratory and the “Sierra” supercomputer at Lawrence Livermore National Laboratory, which started going into the field in late 2017 and were fully operational in late 2018. Since then, Mellanox (and now NVIDIA) has extended the in-network functions, adding all-to-all and MPI tag matching to data reductions. Here’s how MPI tag matching works:
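As a rough software illustration of the matching rules that such engines take over from the host, the sketch below models the two queues an MPI library conventionally keeps: posted receives and unexpected messages. It is a simplified model under stated assumptions and not the hardware design. The class and constant names are hypothetical stand-ins (for example `ANY_SOURCE` for `MPI_ANY_SOURCE`), and the communicator, which is also part of a real match, is omitted.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional

ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE
ANY_TAG = -1     # stand-in for MPI_ANY_TAG


@dataclass
class Message:
    source: int
    tag: int
    payload: bytes


@dataclass
class PostedRecv:
    source: int
    tag: int
    label: str


def matches(recv: PostedRecv, msg: Message) -> bool:
    """A receive matches when source and tag agree or are wildcards."""
    return recv.source in (ANY_SOURCE, msg.source) and recv.tag in (ANY_TAG, msg.tag)


class TagMatcher:
    def __init__(self) -> None:
        self.posted: Deque[PostedRecv] = deque()
        self.unexpected: Deque[Message] = deque()

    def post_recv(self, recv: PostedRecv) -> Optional[Message]:
        """Search already-arrived messages first, oldest first."""
        for i, msg in enumerate(self.unexpected):
            if matches(recv, msg):
                del self.unexpected[i]
                return msg
        self.posted.append(recv)
        return None

    def on_arrival(self, msg: Message) -> Optional[PostedRecv]:
        """Match an incoming message against receives in posting order."""
        for i, recv in enumerate(self.posted):
            if matches(recv, msg):
                del self.posted[i]
                return recv
        self.unexpected.append(msg)  # no receive posted yet; hold the message
        return None
```

Walking both queues is linear work per message when done on a host CPU, which is why moving the search into the adapter is attractive for applications that post many receives.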

More and more functions are being accelerated in the InfiniBand switches, and with the Quantum-2 NDR 400 Gb/sec switches, ConnectX-7 adapters, and the NVIDIA SHARP 3.0 stack, even more processing is done in the network. And with the advent of the NVIDIA BlueField-3 DPU, which will have five times the compute capacity of the BlueField-2 DPU it replaces, NVIDIA will be able to accelerate collective operations and offload active message processing, smart MPI progression, data compression, and user-defined algorithms to the Arm cores on the BlueField-3 DPU, further unburdening the host systems.

What matters is how all of this comes together to improve overall application performance, and DK Panda of Ohio State University is, as usual, pushing the performance envelope with his hybrid MPI and PGAS MVAPICH2 libraries and running performance tests on the “Frontera” all-CPU system at the Texas Advanced Computing Center at the University of Texas. For those who are not familiar with it, here is the MVAPICH2 stack:

The performance gains from running NVIDIA SHARP in conjunction with the MVAPICH2 stack are significant. With SHARP and MVAPICH2 running the MPI_Barrier workload on the full Frontera system with one process per node (1ppn in the chart below), scaling is nearly flat (meaning latency does not increase as the number of nodes increases) until the 4,096-node barrier is crossed, and even at the full system configuration of 7,861 nodes, latency is still a factor of 9X lower than with SHARP turned off. With SHARP turned off on the same machine, latency increases exponentially as the number of nodes rises. With 16 processes per node (16ppn), latencies are lower and the difference between SHARP turned on and turned off is only a factor of 3.5X at 1,024 nodes. Take a look:

On MPI_Reduce operations, the performance benefits of SHARP range from a minimum of 2.2X to a maximum of 6.4X, depending on the number of nodes and processes per node. MPI_Allreduce operations range from 2.5X to 7.1X, again depending on the number of nodes and processes per node.

In a separate set of benchmarks, adding BlueField-2 DPUs to the nodes of a 32-node system and having them assist with MPI offload, with either 16 or 32 processes per node, yielded a performance benefit ranging from 17 percent to 23 percent across message sizes from 64K to 512K, like this:

These performance gains – in adapters and switches and in the DPUs where they are present – are cumulative and are expected to significantly improve the performance of HPC and even AI applications that depend on MPI operations.

And that’s why, when you add all that up, it’s safe to assume that InfiniBand will have a long and successful life in HPC and AI systems.

This article is sponsored by NVIDIA.
