NVIDIA Accelerates Meta's Llama 4 Models


Meta's latest Llama 4 models, Llama 4 Scout and Llama 4 Maverick, are now accelerated by NVIDIA's open-source software stack. Llama 4 Scout achieves more than 40,000 output tokens per second on NVIDIA Blackwell B200 GPUs, and both models are available as NVIDIA NIM microservices for easy access and deployment.

Capabilities

Multimodal and Multilingual Capabilities

Llama 4 models are natively multimodal and multilingual, and they use a mixture-of-experts (MoE) architecture that activates only a fraction of their parameters for each token. This lets them handle both text and images across many languages while keeping inference efficient, supporting a wide range of applications.
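To make the MoE idea concrete, here is a toy sketch of top-k expert routing in plain Python. It is an illustration of the general technique, not Llama 4's actual implementation; the logits, expert count, and k value are all made up for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_route(router_logits, k=2):
    """Select the top-k experts for one token and renormalize their weights.

    In an MoE layer, only these k experts run for this token, which is why a
    model with a very large total parameter count can have a much smaller
    *active* parameter count per token.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Four hypothetical experts; the router sends this token to the two
# highest-scoring ones, with weights that sum to 1.
print(moe_route([0.1, 2.0, -1.0, 0.5], k=2))
```

The token's output is then the weighted sum of the selected experts' outputs, so compute scales with k rather than with the total number of experts.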

Performance and Technical Details

  • Llama 4 Scout: A 109-billion-parameter model (17 billion active per token) that runs efficiently on NVIDIA H100 GPUs, suited to tasks such as multi-document summarization and personalization based on extensive user activity.
  • Llama 4 Maverick: A larger 400-billion-parameter model (also 17 billion active per token) designed for high-quality image and text understanding, optimized with TensorRT-LLM for faster inference.

TensorRT-LLM Optimization for Faster Performance

Both Llama 4 models are optimized with NVIDIA's TensorRT-LLM, an open-source library that accelerates LLM inference on NVIDIA GPUs. With these optimizations, Llama 4 Scout reaches more than 40,000 output tokens per second and Llama 4 Maverick more than 30,000, delivering faster and more efficient AI processing.
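Throughput figures like these are conventionally computed as generated tokens divided by wall-clock time. A minimal helper for sanity-checking numbers from your own benchmark runs (the function name and the example values are illustrative):

```python
def tokens_per_second(tokens_generated: int, elapsed_seconds: float) -> float:
    """Throughput metric: total generated tokens over wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return tokens_generated / elapsed_seconds

# 40,000 tokens generated in one second matches the Scout figure above.
print(tokens_per_second(40_000, 1.0))  # → 40000.0
```

When benchmarking, measure over many requests and count only output tokens, since prompt processing and generation have very different costs.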

Customizing and Deploying Llama Models for Enterprises

NVIDIA’s NeMo framework allows businesses to fine-tune the Llama models with their own data for higher accuracy. Additionally, the Llama 4 models are packaged as NVIDIA NIM microservices, simplifying deployment across different infrastructure, ensuring strong security, and enabling seamless scaling in various environments.
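NIM microservices expose an OpenAI-compatible HTTP API, so existing chat-completions client code can typically be pointed at a NIM endpoint. A minimal sketch using only the standard library; the endpoint URL and the model name `meta/llama-4-scout-17b-16e-instruct` are assumptions for illustration:

```python
import json
from urllib import request

# Hypothetical locally hosted NIM endpoint; the /v1/chat/completions path
# follows the OpenAI-compatible convention that NIM microservices expose.
NIM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> str:
    """Serialize an OpenAI-style chat-completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

def ask_nim(prompt: str) -> str:
    """Send a prompt to the NIM endpoint and return the generated text."""
    body = build_chat_request("meta/llama-4-scout-17b-16e-instruct", prompt)
    req = request.Request(NIM_URL, data=body.encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because the request and response shapes match the OpenAI API, swapping between a hosted model and a self-hosted NIM deployment usually requires changing only the base URL and model name.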
