Azure sets a scale record in large language model training | Microsoft Azure Blog

ByContributor November 8, 2023

Azure empowers intelligent services like Microsoft Copilot, Bing, and Azure OpenAI Service that have captured our imagination in recent days. These services, facilitating various applications like Microsoft Office 365, chatbots, and search engines with generative AI, owe their magic to large language models (LLMs). While the latest LLMs are transcendental, bringing a generational change in how we apply artificial intelligence in our daily lives and reason about its evolution, we have merely scratched the surface. Creating more capable, fair, foundational LLMs that consume and present information more accurately is necessary.

How Microsoft maximizes the power of LLMs

However, creating new LLMs or improving the accuracy of existing ones is no easy feat. To create and train improved versions of LLMs, supercomputers with massive computational capabilities are required. It is paramount that both the hardware and software in these supercomputers are utilized efficiently at scale, not leaving performance on the table. This is where the sheer scale of the supercomputing infrastructure in Azure cloud shines and setting a new scale record in LLM training matters.

*Figure 1: Scale records on the model GPT-3 (175 billion parameters) from MLPerf Training v3.0 in June 2023 (3.0-2003) and Azure on MLPerf Training v3.1 in November 2023 (3.1-2002).*

Customers need reliable and performant infrastructure to bring the most sophisticated AI use cases to market in record time. Our objective is to build state-of-the-art infrastructure and meet these demands. The latest MLPerf™ 3.1 Training results¹ are a testament to our unwavering commitment to building high-quality and high-performance systems in the cloud to achieve unparalleled efficiency in training LLMs at scale. The idea here is to use massive workloads to stress every component of the system and accelerate our build process to achieve high quality.

The GPT-3 LLM model and its 175 billion parameters were trained to completion in four minutes on 1,344 ND H100 v5 virtual machines (VMs), which represent 10,752 NVIDIA H100 Tensor Core GPUs, connected by the NVIDIA Quantum-2 InfiniBand networking platform (as shown in Figure 1). This training workload uses close to real-world datasets and restarts from 2.4 terabytes of checkpoints acting closely a production LLM training scenario. The workload stresses the H100 GPUs Tensor Cores, direct-attached Non-Volatile Memory Express disks, and the NVLink interconnect that provides fast communication to the high-bandwidth memory in the GPUs and cross-node 400Gb/s InfiniBand fabric.

“Azure’s submission, the largest in the history of MLPerf Training, demonstrates the extraordinary progress we have made in optimizing the scale of training. MLCommons’ benchmarks showcase the prowess of modern AI infrastructure and software, underlining the continuous advancements that have been achieved, ultimately propelling us toward even more powerful and efficient AI systems.”—David Kanter, Executive Director of MLCommons

Microsoft’s commitment to performance

In March 2023, Microsoft introduced the ND H100 v5-series which completed training a 350 million parameter Bidirectional Encoder Representations from Transformers (BERT) language model in 5.4 minutes, beating our existing record. This resulted in a four times improvement in time to train BERT within just 18 months, highlighting our continuous endeavor to bring the best performance to our users.

*Figure 2: Relative size of the models BERT (350 million parameters) and GPT-3 (175 billion parameters) from MLPerf Training v3.1.*

Today’s results are with GPT-3, a large language model in the MLPerf Training benchmarking suite, featuring 175 billion parameters, a remarkable 500 times larger than the previously benchmarked BERT model (figure 2). The latest training time from Azure reached a 2.7x improvement compared to the previous record from MLPerf Training v3.0. The v3.1 submission underscores the ability to decrease training time and cost by optimizing a model that accurately represents current AI workloads.

The power of virtualization

NVIDIA’s submission to the MLPerf Training v3.1 LLM benchmark on 10,752 NVIDIA H100 Tensor Core GPUs achieved a training time of 3.92 minutes. This amounts to just a 2 percent increase in the training time in Azure VMs compared to the NVIDIA bare-metal submission, which has the best-in-class performance of virtual machines across all offerings of HPC instances in the cloud (figure 3).

Figure 3: Relative training times on the model GPT-3 (175 billion parameters) from MLPerf Training v3.1 between the NVIDIA submission on the bare-metal platform (3.1-2007) and Azure on virtual machines (3.1-2002).

The latest results in AI Inferencing on Azure ND H100 v5 VMs show leadership results as well, as shown in MLPerf Inference v3.1. The ND H100 v5-series delivered 0.99x-1.05x relative performance compared to the bare-metal submissions on the same NVIDIA H100 Tensor Core GPUs (figure 4), echoing the efficiency of virtual machines.

Figure 4: Performance of the ND H100 v5-series (3.1-0003) compared to on-premises and bare metal offerings of the same NVIDIA H100 Tensor Core GPUs (3.1-0107 and 3.1-0121). All the results were obtained with the GPT-J benchmark from MLPerf Inference v3.1, scenarios: Offline and Server, accuracy: 99 percent.

In conclusion, created for performance, scalability, and adaptability, the Azure ND H100 v5-series offers exceptional throughput and minimal latency for both training and inferencing tasks in the cloud and offers the highest quality infrastructure for AI.

Learn more about Azure AI Infrastructure

References

MLCommons® is an open engineering consortium of AI leaders from academia, research labs, and industry. They build fair and useful benchmarks that provide unbiased evaluations of training and inference performance for hardware, software, and services—all conducted under prescribed conditions. MLPerf™ Training benchmarks consist of real-world compute-intensive AI workloads to best simulate customer’s needs. Tests are transparent and objective, so technology decision-makers can rely on the results to make informed buying decisions.

Maximizing Your Forms with RSForm! Pro – The Ultimate Guide

Joomla plugins are vital tools that enhance the functionality of a website, with RSForm! Pro distinguished as a robust form-building solution. This overview aims to outline the key features, benefits, and straightforward installation process of RSForm! Pro. It will explore various customization options that enable users to tailor forms to their specific requirements, review common… […]

Boost Your Website Speed with Our Performance Optimization Plugin

In the rapidly evolving digital landscape, the performance of a Joomla website significantly influences user engagement. Performance optimization plugins serve as essential tools aimed at enhancing the speed and efficiency of a website by combining, minifying, and compressing assets such as CSS, JavaScript, and images. This article examines the advantages these plugins provide, ranging from… […]

How to Upgrade from Ubuntu 22.04 LTS to Ubuntu 24.04 LTS

The stable version of Ubuntu 24.04 LTS (code-named Noble Numbat) is released on April 25th 2024, if you are curious to know what is in it, you can now upgrade to the version of it… The post How to Upgrade from Ubuntu 22.04 LTS to Ubuntu 24.04 LTS appeared first on FAST DOMAINS.

16 Best Linux Distributions for Older Computers

Do you have an old laptop that has gathered layers of dust over time and you don’t exactly what to do with it? A good place to start would be to install a Linux distribution… The post 16 Best Linux Distributions for Older Computers appeared first on FAST DOMAINS.

Azure sets a scale record in large language model training | Microsoft Azure Blog

How Microsoft maximizes the power of LLMs

Microsoft’s commitment to performance

The power of virtualization

Learn more about Azure AI Infrastructure

Why the open source driver release from NVIDIA is so important for Linux?

Linux System Monitoring with Prometheus, Grafana, and collectd

Reserve quantum computers, get guidance and cutting-edge capabilities with Amazon Braket Direct | Amazon Web Services

Add storage to your Fedora system with LVM

Elevate Azure expertise with new AI and optimization video episodes | Microsoft Azure Blog

A WebKit Update for Ubuntu

© Copyright

VMware ESXi Power Optimization Overview

WiredGorilla

How Microsoft maximizes the power of LLMs

Microsoft’s commitment to performance

The power of virtualization

Learn more about Azure AI Infrastructure

Similar Posts

© Copyright