Key Features

What you can do

🔲

Model Parallelism

Splits large transformer models across multiple GPUs, enabling training of models with billions of parameters that would not otherwise fit in a single GPU's memory.
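
The idea can be sketched in a few lines of pure Python: a linear layer's weight matrix is split column-wise across "devices", each device computes a partial output, and the shards are concatenated. This is only an illustration of the splitting scheme, not this project's actual API; lists stand in for GPU tensors, and all names here are hypothetical.

```python
def matmul(x, w):
    """Plain matrix multiply on nested lists (stand-in for a GPU matmul)."""
    return [[sum(xi * w[k][j] for k, xi in enumerate(row))
             for j in range(len(w[0]))]
            for row in x]

def split_columns(w, shards):
    """Split weight matrix w column-wise into `shards` equal pieces."""
    step = len(w[0]) // shards
    return [[row[s * step:(s + 1) * step] for row in w]
            for s in range(shards)]

def tensor_parallel_linear(x, w, shards=2):
    # Each simulated "GPU" multiplies x by its own column shard...
    partials = [matmul(x, ws) for ws in split_columns(w, shards)]
    # ...then the partial outputs are concatenated along the column
    # dimension (in a real system this is an all-gather across devices).
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
assert tensor_parallel_linear(x, w) == matmul(x, w)
```

Because no single device ever materializes the full weight matrix, per-GPU memory scales down with the number of shards.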

🔄

Pipeline Parallelism

Divides the model into stages that run concurrently on different GPUs, improving training efficiency and throughput.
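
A toy simulation of that schedule, with plain functions standing in for model partitions on separate GPUs (the stage bodies and names are illustrative, not this project's API): the batch is split into micro-batches, and in steady state stage `s` works on micro-batch `m` at tick `m + s`, so different stages are busy with different micro-batches at the same time.

```python
def stage1(x):          # e.g. embedding + first half of the layers
    return x * 2

def stage2(x):          # e.g. remaining layers + output head
    return x + 1

def pipeline_schedule(batch, stages, n_micro=4):
    """Return (tick, stage, microbatch) events plus per-micro-batch outputs."""
    micro = [batch[i::n_micro] for i in range(n_micro)]  # split the batch
    # In steady state, stage s processes micro-batch m at tick m + s:
    events = sorted((m + s, s, m)
                    for m in range(n_micro) for s in range(len(stages)))
    outputs = []
    for mb in micro:
        for fn in stages:            # each micro-batch flows through all stages
            mb = [fn(v) for v in mb]
        outputs.append(mb)
    return events, outputs

events, outs = pipeline_schedule(list(range(8)), [stage1, stage2])
# Ticks with two simultaneous events are where both stages are busy at once.
```

With more micro-batches per batch, the fraction of time stages sit idle (the "pipeline bubble") shrinks, which is where the throughput gain comes from.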

⚡

Mixed Precision Training

Uses FP16 precision to reduce memory usage and speed up training while maintaining model accuracy.
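
The catch with FP16 is its narrow dynamic range: small gradient values underflow to zero, which is why mixed-precision training is typically paired with loss scaling. A minimal pure-Python sketch of that effect, using `struct`'s `'e'` format as a stand-in for GPU FP16 storage (the gradient value and scale factor are illustrative):

```python
import struct

def to_fp16(x):
    """Round-trip a float through IEEE 754 half precision (simulated FP16)."""
    return struct.unpack('e', struct.pack('e', x))[0]

tiny_grad = 1e-8                      # below FP16's smallest subnormal
assert to_fp16(tiny_grad) == 0.0      # underflows to zero in FP16

scale = 1024.0                        # loss scaling factor (hypothetical)
scaled = to_fp16(tiny_grad * scale)   # the scaled gradient survives in FP16
recovered = scaled / scale            # unscale in FP32 before the update
assert recovered != 0.0
```

Multiplying the loss by a constant before backpropagation shifts gradients into FP16's representable range; dividing the gradients by the same constant in FP32 before the optimizer step recovers the original values.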

☁️

Highly Scalable

Designed to scale from a few GPUs to thousands, supporting multi-node distributed training seamlessly.
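
Scaling to thousands of GPUs hinges on communication patterns whose per-worker cost stays roughly flat as workers are added. One such pattern is ring all-reduce, sketched below in pure Python: lists stand in for per-GPU gradient buffers, and the two phases (reduce-scatter, then all-gather) each move only one buffer's worth of data per worker, regardless of worker count. A real system would run this over NCCL; this is a simulation of the algorithm, not the library.

```python
def ring_allreduce(data):
    """Element-wise sum across workers; every worker ends with the total."""
    n = len(data)                 # number of simulated workers (GPUs)
    c = len(data[0]) // n         # chunk size; assumes length divisible by n

    def chunk(w, j):
        return data[w][j * c:(j + 1) * c]

    # Reduce-scatter: after n-1 steps, worker w owns the full sum of
    # chunk (w + 1) % n. Payloads are snapshotted before any is applied.
    for step in range(n - 1):
        sends = [(w, (w - step) % n, chunk(w, (w - step) % n))
                 for w in range(n)]
        for w, j, payload in sends:
            dst = (w + 1) % n
            for k, v in enumerate(payload):
                data[dst][j * c + k] += v

    # All-gather: circulate the reduced chunks until every worker has all.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n, chunk(w, (w + 1 - step) % n))
                 for w in range(n)]
        for w, j, payload in sends:
            data[(w + 1) % n][j * c:(j + 1) * c] = payload
    return data

grads = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0]]
reduced = ring_allreduce(grads)
assert reduced == [[6.0, 8.0, 10.0, 12.0], [6.0, 8.0, 10.0, 12.0]]
```

Each worker sends and receives about `2 * (n - 1) / n` times its buffer size in total, which approaches a constant as `n` grows; that bandwidth-optimality is what makes the pattern viable across nodes.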

💻

Customizable Model Architectures

Supports various transformer architectures like GPT, BERT, and their variants, allowing flexible experimentation.
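
To illustrate why one framework can cover both families: GPT-style and BERT-style models share most of the transformer stack and differ mainly in a few configuration choices, such as whether attention is causal. The sketch below is hypothetical; the field names are not this project's actual configuration API.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    """Illustrative config covering GPT-like and BERT-like transformers."""
    num_layers: int
    hidden_size: int
    num_attention_heads: int
    causal: bool          # True => GPT-style decoder, False => BERT-style

    @property
    def head_dim(self) -> int:
        # Per-head dimension; hidden size must divide evenly across heads.
        return self.hidden_size // self.num_attention_heads

gpt_like = TransformerConfig(num_layers=24, hidden_size=1024,
                             num_attention_heads=16, causal=True)
bert_like = TransformerConfig(num_layers=12, hidden_size=768,
                              num_attention_heads=12, causal=False)
```

Varying these few knobs, rather than rewriting model code, is what makes experimenting across architecture variants cheap.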

🚀

Integration with NVIDIA Hardware

Optimized for NVIDIA GPUs and software stacks such as CUDA and NCCL for maximum performance.