Meta's latest Llama 4 models, Llama 4 Scout and Llama 4 Maverick, have been released and are accelerated by NVIDIA's open-source software. Llama 4 Scout achieves over 40,000 tokens per second on NVIDIA Blackwell B200 GPUs, and both models are available as NVIDIA NIM microservices for easy access and deployment.
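For a sense of what NIM access looks like in practice, the sketch below queries a hosted Llama 4 endpoint through its OpenAI-compatible API. The base URL and model ID are assumptions drawn from NVIDIA's hosted catalog; substitute your own API key.

```python
# Minimal sketch: querying a hosted Llama 4 NIM microservice through
# its OpenAI-compatible API. The base URL and model ID are assumptions
# drawn from NVIDIA's hosted catalog; substitute your own API key.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed hosted NIM endpoint
    api_key="YOUR_NVIDIA_API_KEY",                   # placeholder credential
)

response = client.chat.completions.create(
    model="meta/llama-4-scout-17b-16e-instruct",     # assumed catalog model ID
    messages=[
        {"role": "user", "content": "Summarize the Llama 4 release in one sentence."}
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```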
## Multimodal and Multilingual Capabilities
Llama 4 models are natively multimodal and multilingual and are built on a mixture-of-experts (MoE) architecture, which activates only a fraction of the model's parameters for each token. They handle both text and images across many languages, making them versatile enough to support a wide range of applications.
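Because the same OpenAI-style request schema covers images, a multimodal call only adds an image content part. A minimal sketch, assuming the hosted endpoint above accepts `image_url` parts; the image URL here is a hypothetical placeholder:

```python
# Sketch of a multimodal (text + image) request, assuming the NIM
# endpoint follows the OpenAI content-part schema for images.
# The image URL is a hypothetical placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed hosted NIM endpoint
    api_key="YOUR_NVIDIA_API_KEY",
)

response = client.chat.completions.create(
    model="meta/llama-4-scout-17b-16e-instruct",     # assumed catalog model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this chart in French."},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

The same request exercises both capabilities at once: image understanding in the input and multilingual generation in the output.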
## Performance and Technical Details
- Llama 4 Scout: A 109-billion-parameter model (17 billion active parameters across 16 experts) sized to run on a single NVIDIA H100 GPU, suited to tasks such as multi-document summarization and parsing user activity for personalized tasks.
- Llama 4 Maverick: A larger 400-billion-parameter model (17 billion active parameters across 128 experts) designed for high-quality image and text understanding, optimized with TensorRT-LLM for faster inference.
## TensorRT-LLM Optimization for Faster Performance
Both Llama 4 models are optimized with NVIDIA TensorRT-LLM, an open-source library that accelerates LLM inference on NVIDIA GPUs. On Blackwell B200 GPUs, this optimization delivers more than 40,000 tokens per second for Llama 4 Scout and over 30,000 tokens per second for Llama 4 Maverick.
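For local experimentation, TensorRT-LLM also exposes a high-level Python `LLM` API. The sketch below assumes the Hugging Face model ID `meta-llama/Llama-4-Scout-17B-16E-Instruct` and hardware with enough GPU memory to hold the model; it illustrates the API shape rather than a benchmark setup:

```python
# Sketch of TensorRT-LLM's high-level Python LLM API. The Hugging Face
# model ID is an assumption, and a 109B-parameter model needs multi-GPU
# or high-memory hardware; this only illustrates the API shape.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")  # assumed HF model ID

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Why are mixture-of-experts models efficient?"], sampling)

for output in outputs:
    print(output.outputs[0].text)
```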
## Customizing and Deploying Llama Models for Enterprises
NVIDIA's NeMo framework lets businesses fine-tune the Llama models on their own data for higher accuracy on domain-specific tasks. The Llama 4 models are also packaged as NVIDIA NIM microservices, which simplify deployment across different infrastructures, come with enterprise-grade security, and scale seamlessly across environments.
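One practical consequence of the NIM packaging is that a self-hosted container serves the same OpenAI-compatible API as the hosted catalog, so client code barely changes. A sketch, assuming the container's default port 8000 and a Maverick model ID, both of which should be verified against the container documentation:

```python
# Sketch: querying a self-hosted Llama 4 NIM container. NIM serves the
# same OpenAI-compatible API locally; the default port 8000 and the
# model ID are assumptions to verify against the container docs.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",              # assumed local NIM endpoint
    api_key="not-needed-for-local-deployments",       # placeholder
)

response = client.chat.completions.create(
    model="meta/llama-4-maverick-17b-128e-instruct",  # assumed model ID
    messages=[{"role": "user", "content": "Draft a short FAQ entry about returns."}],
)
print(response.choices[0].message.content)
```

Because the client code is identical to the hosted case, applications can move between NVIDIA's cloud endpoints and on-premises deployments without rewrites.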