NVIDIA Triton Inference Server Powers Microsoft Bing’s Breakthrough

Microsoft Bing’s advertising team, led by principal software engineering manager Chen, now delivers personalized ads to Bing users with 7x the throughput at reduced cost. The gains come from running NVIDIA Triton Inference Server on NVIDIA A100 Tensor Core GPUs, combined with the team’s own model optimizations.

Bang and EL-Attention Innovations

Bang and EL-Attention were instrumental in accelerating the AI models within Bing’s ad service. Bang is an optimization technique that reduces model inference time. It combines quantization with changes to the network design so that models execute faster and more efficiently.

EL-Attention is an attention mechanism that improves the accuracy and speed of natural language processing tasks by selectively attending to relevant information. Together, the two techniques substantially improved the performance of Bing’s ad service.

Tuning a Complex System

Tuning Bing’s ad service is difficult. The service runs hundreds of constantly evolving models, and each must respond to a user request within 10 milliseconds. To meet that budget, the team relied on two innovations, Bang and EL-Attention.

The combination of Bang and EL-Attention allowed Chen’s team to significantly improve the performance of Bing’s ad service. The models respond to user requests within the 10-millisecond budget, while the service as a whole delivers 7x its previous throughput. As a result, Bing can deliver personalized ads to users more quickly and accurately, which improves the user experience.

Flying With NVIDIA A100 MIG

The NVIDIA A100 GPU’s Multi-Instance GPU (MIG) feature played a crucial role in raising the throughput of Bing’s ad service. With MIG, Chen’s team divided a single GPU into multiple isolated instances and ran workloads on them in parallel.

Because each instance serves requests independently, the system can process multiple ad requests simultaneously, which significantly raises its overall throughput.
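The effect of splitting one GPU into parallel instances can be illustrated with a small simulation. The sketch below is an idealized model, not a description of Bing’s configuration: it assumes every request takes the same fixed service time on any instance, and the function name `simulate` and all numbers are hypothetical. Seven is used only because it is the maximum number of MIG instances an A100 supports. On real hardware each MIG slice has a fraction of the GPU’s compute and memory, so per-request latency on a slice generally differs from latency on the full GPU.

```python
import heapq


def simulate(num_requests, num_instances, service_ms):
    """Drain a backlog of identical requests across parallel instances.

    Returns (makespan_ms, throughput_requests_per_second).
    Assumption: every request costs `service_ms` on any instance.
    """
    # Min-heap of the times at which each instance next becomes free.
    free_at = [0.0] * num_instances
    heapq.heapify(free_at)

    finish = 0.0
    for _ in range(num_requests):
        start = heapq.heappop(free_at)   # earliest available instance
        end = start + service_ms
        heapq.heappush(free_at, end)
        finish = max(finish, end)

    throughput = num_requests / (finish / 1000.0)
    return finish, throughput


# Hypothetical workload: 700 queued requests, 10 ms each.
single_makespan, single_rps = simulate(700, 1, 10.0)
sliced_makespan, sliced_rps = simulate(700, 7, 10.0)

print(f"1 instance : {single_makespan:.0f} ms total, {single_rps:.0f} req/s")
print(f"7 instances: {sliced_makespan:.0f} ms total, {sliced_rps:.0f} req/s")
```

Under these assumptions, per-request latency stays at 10 ms in both runs while throughput scales with the number of instances, which is the distinction between meeting a latency budget and raising throughput.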

Flexible, Easy, Open Software

NVIDIA Triton Inference Server was central to the team’s effort to optimize Bing’s ad service. Triton can simultaneously run different runtimes, frameworks, and AI models on isolated instances of a single GPU, which gave Chen’s team the flexibility to allocate resources efficiently and tune performance for a diverse set of models. Deploying Triton in a software container also streamlined deployment, making the system easier to manage and scale.
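From the client side, every model served by Triton is reached through the same request format regardless of the framework behind it. The sketch below builds the URL and JSON body for an inference call using the KServe v2 HTTP/REST protocol that Triton implements. It only constructs the request and does not send it. The host, the model name `ad_ranker`, the tensor name `INPUT__0`, and the helper `build_infer_request` are hypothetical placeholders, not details from Bing’s deployment.

```python
import json


def build_infer_request(host, model_name, input_name, values):
    """Build (url, json_body) for a v2 inference request.

    Assumption: the model takes one FP32 input of shape [1, N].
    """
    url = f"http://{host}/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": [1, len(values)],
                "datatype": "FP32",
                # Tensor contents as a flat, row-major list.
                "data": [float(v) for v in values],
            }
        ]
    }
    return url, json.dumps(body)


# Port 8000 is Triton's default HTTP port; everything else is made up.
url, payload = build_infer_request(
    "localhost:8000", "ad_ranker", "INPUT__0", [0.1, 0.2, 0.3]
)
print(url)
print(payload)
```

Because the request shape is the same for every backend, a client written this way does not need to change when a model is moved to a different framework or runtime on the server.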

Triton is also open source and backed by an active community that continually contributes improvements. That collaboration strengthens Triton’s reliability, scalability, and adaptability over time, which makes it a valuable asset for Bing’s ad service.

Conclusion

Led by Chen, the principal software engineering manager, the team behind Bing’s ad service raised throughput 7x at reduced cost by running NVIDIA Triton Inference Server on NVIDIA A100 Tensor Core GPUs. The Bang and EL-Attention optimizations, together with the A100’s MIG feature, contributed to that gain.

With Triton, Bing’s ad service can run a diverse array of models efficiently, and Triton’s open-source development, supported by an active community, means the software continues to improve. Microsoft Bing’s results in delivering personalized ads at scale show what cutting-edge hardware and software can achieve when combined with dedicated engineering.