Nvidia, Intel claim new LLM training speed records in new MLPerf 3.1 benchmark


Training AI models is a whole lot faster in 2023, according to the results from the MLPerf Training 3.1 benchmark released today.

The pace of innovation in the generative AI space is breathtaking to say the least. A key part of the speed of innovation is the ability to rapidly train models, which is something that the MLCommons MLPerf training benchmark tracks and measures. MLCommons is an open engineering consortium focused on ML benchmarks, datasets and best practices to accelerate the development of AI.

The MLPerf Training 3.1 benchmark included submissions from 19 vendors that generated over 200 performance results. Among the tests were benchmarks for large language model (LLM) training with GPT-3 and a new benchmark for training the open source Stable Diffusion text-to-image generation model.

“We’ve got over 200 performance results and the improvements in performance are fairly substantial, somewhere between 50% to almost up to 3x better,” MLCommons executive director David Kanter said during a press briefing.


LLM training gets an oversized boost that is beating Moore’s Law

Of particular note among all the results in the MLPerf Training 3.1 benchmark are the numbers on large language model (LLM) training. It was only in June that MLCommons included data on LLM training for the first time. Now, just a few months later, the MLPerf Training 3.1 benchmarks show a nearly 3x gain in LLM training performance.

“It’s about 2.8x faster comparing the fastest LLM training benchmark in the first round [in June], to the fastest in this round,” Kanter said. “I don’t know if that’s going to keep up in the next round and the round after that, but that’s a pretty impressive improvement in performance and represents tremendous capabilities.”

In Kanter’s view, the performance gains in AI training over the last five months are outpacing what Moore’s Law would predict. Moore’s Law is commonly read as forecasting a doubling of compute performance every couple of years. Kanter said that the AI industry is scaling hardware architecture and software faster than that.
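To put that comparison in numbers, here is a rough back-of-the-envelope sketch. It assumes "every couple of years" means a 24-month doubling period and takes the five-month window and the 2.8x figure from Kanter's remarks; the function names are illustrative only and are not part of any MLPerf tooling.

```python
import math


def moores_law_gain(months: float, doubling_period_months: float = 24.0) -> float:
    """Speedup expected after `months` if performance doubles every
    `doubling_period_months` (assumed here to be 24 months)."""
    return 2.0 ** (months / doubling_period_months)


def implied_doubling_months(observed_gain: float, months: float) -> float:
    """Doubling period that would be needed to produce `observed_gain`
    over `months`, assuming steady exponential growth."""
    return months * math.log(2.0) / math.log(observed_gain)


if __name__ == "__main__":
    window_months = 5      # June round to this round, per the article
    observed_gain = 2.8    # fastest LLM result vs. fastest in June

    expected = moores_law_gain(window_months)
    doubling = implied_doubling_months(observed_gain, window_months)

    print(f"Moore's Law pace over {window_months} months: {expected:.2f}x")
    print(f"Reported LLM training gain: {observed_gain:.1f}x")
    print(f"Implied doubling period: {doubling:.1f} months")
```

Under those assumptions, a Moore's Law pace would yield a gain of roughly 1.16x over five months, while a 2.8x gain corresponds to performance doubling about every 3.4 months. The comparison is loose: the benchmark gain reflects larger systems and better software as well as faster chips.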

“MLPerf is to some extent a barometer on progress for the whole industry,” Kanter said.

Nvidia, Intel and Google boast big AI training gains

Intel, Nvidia and Google have made significant strides in recent months that enable faster LLM training results in the MLPerf Training 3.1 benchmarks.

Intel claims that its Habana Gaudi 2 accelerator delivered a 103% boost in training speed over its June MLPerf training results, using a combination of techniques including 8-bit floating point (FP8) data types.

“We enabled FP8 using the same software stack and we managed to improve our results on the same hardware,” Itay Hubara, senior researcher at Intel, commented during the MLCommons briefing. “We promised to do that in the last submission and we delivered.”

Google is also claiming training gains with its Cloud TPU v5e, which only became generally available on Aug. 29. Much like Intel, Google is using FP8 to get the best possible training performance. Vaibhav Singh, product manager for cloud accelerators at Google, also highlighted the scaling capabilities that Google has developed, including its Cloud TPU multislice technology.

“What Cloud TPU multislice does is it has the ability to scale over the data center network,” Singh explained during the MLCommons briefing.

“With the multislice scaling technology, we were able to get a really good scaling performance up to 1,024 nodes using 4,096 TPU v5e chips,” Singh said.

Nvidia used its EOS supercomputer to supercharge training

Not to be outdone on scale, Nvidia has its own supercomputer known as EOS, which it used to conduct its MLPerf Training 3.1 benchmarks. Nvidia first spoke about its initial plans to build EOS back in 2022.

Nvidia reported that its MLPerf LLM training results were 2.8 times faster than they were in June for training a model based on GPT-3. In an Nvidia briefing on the MLCommons results, Dave Salvator, director of accelerated computing products at Nvidia, said that EOS has 10,752 GPUs connected via Nvidia Quantum-2 InfiniBand running at 400 gigabits per second. The system has 860 terabytes of HBM3 memory. Salvator noted that Nvidia has also worked on improving software to get the best possible outcome for training.

“Some of the speeds and feed numbers here are kind of mind blowing,” Salvator said. “In terms of AI compute, it’s over 40 exaflops of AI compute, right, which is just extraordinary.”
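For a sense of scale, the system-wide figures Salvator quoted can be divided down to a per-GPU level. The sketch below is simple arithmetic on the numbers in this article, assuming decimal units (1 TB = 1,000 GB) and treating "over 40 exaflops" as a floor of exactly 40; the function name is illustrative.

```python
def per_gpu_figures(gpus: int, hbm_total_tb: float, ai_exaflops: float) -> tuple[float, float]:
    """Return (HBM per GPU in GB, AI compute per GPU in petaflops),
    assuming decimal units: 1 TB = 1,000 GB and 1 exaflop = 1,000 petaflops."""
    hbm_per_gpu_gb = hbm_total_tb * 1_000 / gpus
    pflops_per_gpu = ai_exaflops * 1_000 / gpus
    return hbm_per_gpu_gb, pflops_per_gpu


if __name__ == "__main__":
    # Figures as quoted in the Nvidia briefing; 40 exaflops is a lower bound.
    hbm_gb, pflops = per_gpu_figures(gpus=10_752, hbm_total_tb=860, ai_exaflops=40)
    print(f"HBM3 per GPU: about {hbm_gb:.0f} GB")
    print(f"AI compute per GPU: at least {pflops:.2f} petaflops")
```

That works out to about 80 GB of HBM3 and at least 3.7 petaflops of AI compute per GPU. The article does not say which numeric precision the 40-exaflop figure refers to, so the per-GPU compute number should be read as a rough indicator only.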
