The evolution of AI system performance is increasingly defined by networking technology rather than raw chip power alone. MLCommons’ latest MLPerf Training benchmark (round 5.0) shows how connectivity between chips has become a critical factor as AI systems scale to unprecedented sizes, with network configuration and communication algorithms playing a decisive role in training speed and efficiency.
The big picture: As AI systems scale to thousands of interconnected chips, network configuration has become just as crucial as the chips themselves for achieving peak performance.
- The latest MLPerf Training benchmark saw systems with up to 8,192 GPU chips, dramatically up from just 32 chips in the first test six years ago.
- This scaling trend has transformed AI computers into massive distributed systems where inter-chip communication significantly impacts overall training speed.
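A back-of-the-envelope model shows why inter-chip communication comes to dominate at scale. The sketch below uses hypothetical numbers (the `compute_total` and `comm_per_step` values are illustrative assumptions, not benchmark data): compute time per training step shrinks as work is split across more chips, while per-step communication cost stays roughly fixed, so scaling efficiency collapses.

```python
# Illustrative arithmetic only -- the constants below are made-up
# assumptions, not figures from any MLPerf submission.

def step_time(chips, compute_total=100.0, comm_per_step=2.0):
    """Seconds per training step: perfectly parallel compute
    plus a fixed communication cost per step."""
    return compute_total / chips + comm_per_step

def efficiency(chips):
    """Speedup over one chip, divided by chip count
    (1.0 would be perfect linear scaling)."""
    return step_time(1) / step_time(chips) / chips

for n in (32, 1024, 8192):
    print(f"{n:>5} chips: efficiency {efficiency(n):.3f}")
```

Under these toy numbers, efficiency falls from roughly 0.62 at 32 chips to well under 0.01 at 8,192 chips, which is why shaving communication time via better networks and algorithms pays off so heavily at scale.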
By the numbers: The benchmark drew record participation and showcased systems of unprecedented scale.
- A total of 201 performance submissions came from 20 different organizations in this round.
- Nvidia submitted the largest system, featuring 8,192 GPU chips working in concert.
- The fastest system completed training Meta’s Llama 3.1 405B model in just under 21 minutes.
Why this matters: The benchmark results show that AI training is no longer just a contest among chip manufacturers; it increasingly hinges on networking technologies and topologies as well.
- As systems grow larger, the algorithms that manage communication between chips become increasingly influential in determining overall performance.
- Different network topologies require specialized communication algorithms, creating a complex technical ecosystem beyond just raw GPU power.
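One of the workhorse communication algorithms the article alludes to is ring all-reduce, widely used to sum gradients across chips. Below is a minimal pure-Python simulation of the ring pattern (a sketch for intuition, not code from any vendor's library): `n - 1` reduce-scatter steps followed by `n - 1` all-gather steps, so each node transfers about `2*(n-1)/n` of the buffer instead of the whole buffer `n` times.

```python
# Minimal simulation of ring all-reduce over n "nodes" (plain lists
# stand in for per-chip gradient buffers). Illustrative sketch only.

def ring_allreduce(bufs):
    """Sum equal-length buffers across all nodes; every node ends
    with the element-wise total."""
    n = len(bufs)
    size = len(bufs[0])
    assert size % n == 0, "buffer must split evenly into n chunks"
    chunk = size // n

    def seg(c):  # slice covering chunk c
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: after n-1 steps, node i holds the fully
    # summed chunk (i + 1) % n.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i - s) % n, (i + 1) % n
            bufs[dst][seg(c)] = [a + b for a, b in
                                 zip(bufs[dst][seg(c)], bufs[i][seg(c)])]

    # All-gather: n-1 more steps; each node forwards the completed
    # chunk it holds around the ring.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i + 1 - s) % n, (i + 1) % n
            bufs[dst][seg(c)] = list(bufs[i][seg(c)])

    return bufs
```

The ring is bandwidth-optimal but latency grows with ring length, which is one reason different cluster topologies (fat trees, tori, rail-optimized fabrics) favor different variants such as tree or hierarchical all-reduce.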
What’s new: This round expanded the benchmark suite to include Meta’s massive language model.
- For the first time, MLPerf included a test measuring training speed for Meta’s Llama 3.1 405B large language model.
- This addition reflects the industry’s focus on training ever-larger language models efficiently.