## Overview

YOLO12 introduces an attention-centric architecture that departs from the traditional CNN-based approaches of previous YOLO models while retaining the real-time inference speed essential for many applications. It achieves state-of-the-art object detection accuracy through novel innovations in attention mechanisms and overall network design.

## Supported Tasks and Modes
YOLO12 supports a variety of computer vision tasks. The table below shows task support and the operational modes (Inference, Validation, Training, and Export) enabled for each:
| Model Type | Task | Inference | Validation | Training | Export |
|---|---|---|---|---|---|
| YOLO12 | Detection | ✅ | ✅ | ✅ | ✅ |
| YOLO12-seg | Segmentation | ✅ | ✅ | ✅ | ✅ |
| YOLO12-pose | Pose | ✅ | ✅ | ✅ | ✅ |
| YOLO12-cls | Classification | ✅ | ✅ | ✅ | ✅ |
| YOLO12-obb | OBB | ✅ | ✅ | ✅ | ✅ |
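The task variants in the table follow a consistent checkpoint naming pattern. The helper below is a hypothetical sketch (not part of the Ultralytics API) that builds a pretrained checkpoint filename from a model scale and task, assuming the `yolo12{scale}{-task}.pt` convention shown above:

```python
# Hypothetical helper (not part of the Ultralytics API): build the pretrained
# checkpoint filename for a YOLO12 variant, following the naming pattern in
# the table above ("yolo12" + scale + optional task suffix + ".pt").
TASK_SUFFIXES = {
    "detect": "",        # YOLO12
    "segment": "-seg",   # YOLO12-seg
    "pose": "-pose",     # YOLO12-pose
    "classify": "-cls",  # YOLO12-cls
    "obb": "-obb",       # YOLO12-obb
}

def checkpoint_name(scale: str = "n", task: str = "detect") -> str:
    """Return e.g. 'yolo12n-seg.pt' for scale='n', task='segment'."""
    return f"yolo12{scale}{TASK_SUFFIXES[task]}.pt"

print(checkpoint_name("s", "pose"))  # yolo12s-pose.pt
```

The resulting name can then be passed directly to `YOLO()` as shown in the Usage Examples section.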
## Performance Metrics
YOLO12 demonstrates significant accuracy improvements across all model scales, with some trade-offs in speed compared to the fastest prior YOLO models. Below are quantitative results for object detection on the COCO validation dataset:
### Detection Performance (COCO val2017)
| Model | size<br>(pixels) | mAP<sup>val</sup><br>50-95 | Speed<br>CPU ONNX<br>(ms) | Speed<br>T4 TensorRT<br>(ms) | params<br>(M) | FLOPs<br>(B) | Comparison<br>(mAP / Speed) |
|---|---|---|---|---|---|---|---|
| YOLO12n | 640 | 40.6 | - | 1.64 | 2.6 | 6.5 | +2.1%/-9% (vs. YOLOv10n) |
| YOLO12s | 640 | 48.0 | - | 2.61 | 9.3 | 21.4 | +0.1%/+42% (vs. RT-DETRv2) |
| YOLO12m | 640 | 52.5 | - | 4.86 | 20.2 | 67.5 | +1.0%/-3% (vs. YOLO11m) |
| YOLO12l | 640 | 53.7 | - | 6.77 | 26.4 | 88.9 | +0.4%/-8% (vs. YOLO11l) |
| YOLO12x | 640 | 55.2 | - | 11.79 | 59.1 | 199.0 | +0.6%/-4% (vs. YOLO11x) |
- Inference speed measured on an NVIDIA T4 GPU with TensorRT FP16 precision.
- Comparisons show the relative improvement in mAP and the percentage change in speed (positive indicates faster; negative indicates slower). Comparisons are made against published results for YOLOv10, YOLO11, and RT-DETR where available.
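The arithmetic behind the Comparison column can be sketched as follows. Note that the baseline figures used here for YOLO11m (51.5 mAP, 4.7 ms on T4 TensorRT) are taken from published YOLO11 benchmarks and are an assumption of this sketch, not values stated in this document:

```python
def compare(map_new, map_old, ms_new, ms_old):
    """Comparison-column arithmetic: mAP delta in percentage points, and
    relative speed change (positive = faster, negative = slower)."""
    map_delta = map_new - map_old
    speed_pct = (ms_old - ms_new) / ms_new * 100
    return round(map_delta, 1), round(speed_pct)

# YOLO12m vs. YOLO11m; the YOLO11m baseline (51.5 mAP, 4.7 ms) is assumed
# from published benchmarks for illustration.
print(compare(52.5, 51.5, 4.86, 4.7))  # (1.0, -3)  ->  "+1.0%/-3%"
```

This reproduces the `+1.0%/-3%` entry for YOLO12m: one percentage point more mAP at roughly 3% slower T4 inference.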
## Usage Examples
This section provides examples for training and inference with YOLO12. For more comprehensive documentation on these and other modes (including Validation and Export), consult the dedicated Predict and Train pages.
The examples below focus on YOLO12 Detect models (for object detection). For other supported tasks (segmentation, classification, oriented object detection, and pose estimation), refer to the respective task-specific documentation: Segment, Classify, OBB, and Pose.
### Example

Pretrained `*.pt` models (using PyTorch) and configuration `*.yaml` files can be passed to the `YOLO()` class to create a model instance in Python:
```python
from ultralytics import YOLO

# Load a COCO-pretrained YOLO12n model
model = YOLO("yolo12n.pt")

# Train the model on the COCO8 example dataset for 100 epochs
results = model.train(data="coco8.yaml", epochs=100, imgsz=640)

# Run inference with the YOLO12n model on the 'bus.jpg' image
results = model("path/to/bus.jpg")
```
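To give a sense of working with the predictions, the sketch below filters detections by confidence. The `(x1, y1, x2, y2, confidence, class_id)` row layout mirrors what Ultralytics exposes via `results[0].boxes.data`, but the array here is mock data standing in for real model output:

```python
import numpy as np

# Mock detections; each row is (x1, y1, x2, y2, confidence, class_id),
# the same layout as results[0].boxes.data in Ultralytics.
detections = np.array([
    [ 10.0,  20.0, 110.0, 220.0, 0.92, 0],  # high confidence
    [300.0,  40.0, 380.0, 200.0, 0.35, 5],  # low confidence
    [ 50.0,  60.0, 150.0, 260.0, 0.81, 0],  # high confidence
])

def filter_by_conf(dets: np.ndarray, conf_thres: float = 0.5) -> np.ndarray:
    """Keep only detections whose confidence meets the threshold."""
    return dets[dets[:, 4] >= conf_thres]

kept = filter_by_conf(detections)
print(len(kept))  # 2
```

In practice the same thresholding is usually done at inference time via the `conf` argument to the predict call.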
## Key Improvements

- **Enhanced Feature Extraction:**
    - **Area Attention:** Efficiently handles large receptive fields, reducing computational cost.
    - **Optimized Balance:** Improved balance between attention and feed-forward network computations.
    - **R-ELAN (Residual Efficient Layer Aggregation Networks):** Strengthens feature aggregation relative to the earlier ELAN design.
- **Optimization Innovations:**
    - **Residual Connections:** Introduces residual connections with scaling to stabilize training, especially in larger models.
    - **Refined Feature Integration:** Implements an improved method for feature integration within R-ELAN.
    - **FlashAttention:** Incorporates FlashAttention to reduce memory access overhead.
- **Architectural Efficiency:**
    - **Reduced Parameters:** Achieves a lower parameter count while maintaining or improving accuracy compared to many previous models.
    - **Streamlined Attention:** Uses a simplified attention implementation, avoiding positional encoding.
    - **Optimized MLP Ratios:** Adjusts MLP ratios to allocate computational resources more effectively.
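Of these, the idea behind Area Attention can be illustrated with a toy NumPy sketch: split the token sequence into equal areas and attend only within each area, shrinking the attention score matrix. This is a simplifying illustration, not the actual YOLO12 implementation; the function name `area_attention` and the equal-split strategy are assumptions of the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def area_attention(x, num_areas=4):
    """Toy area attention: split L tokens into num_areas equal segments and
    run scaled dot-product self-attention within each segment only, cutting
    the score matrix from L x L down to num_areas blocks of (L/num_areas)^2."""
    L, d = x.shape
    assert L % num_areas == 0, "sequence length must divide evenly into areas"
    size = L // num_areas
    out = np.empty_like(x)
    for a in range(num_areas):
        seg = x[a * size:(a + 1) * size]      # tokens belonging to area a
        scores = seg @ seg.T / np.sqrt(d)     # Q = K = V = seg for brevity
        out[a * size:(a + 1) * size] = softmax(scores) @ seg
    return out

feats = np.random.default_rng(0).normal(size=(16, 8))
print(area_attention(feats, num_areas=4).shape)  # (16, 8)
```

With four areas, each attention block scores only 4x4 token pairs instead of 16x16, which is the source of the cost reduction on large receptive fields.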
## Requirements

The Ultralytics YOLO12 implementation does not require FlashAttention by default. However, FlashAttention can be optionally compiled and used with YOLO12. Compiling FlashAttention requires one of the following NVIDIA GPUs:

- Turing GPUs (e.g., T4, Quadro RTX series)
- Ampere GPUs (e.g., RTX 30 series, A30/40/100)
- Ada Lovelace GPUs (e.g., RTX 40 series)
- Hopper GPUs (e.g., H100/H200)
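The GPU families above correspond to CUDA compute capabilities 7.5 (Turing), 8.x (Ampere/Ada Lovelace), and 9.0 (Hopper). The hypothetical helper below (not part of the Ultralytics API) sketches a support check, assuming the requirement reduces to compute capability 7.5 or newer:

```python
# Hypothetical helper (not part of the Ultralytics API): decide whether a GPU
# can run a compiled FlashAttention build, assuming the requirement is CUDA
# compute capability 7.5 (Turing) or newer, matching the GPU list above.
EXAMPLE_GPUS = {
    "T4": (7, 5),        # Turing
    "A100": (8, 0),      # Ampere
    "RTX 4090": (8, 9),  # Ada Lovelace
    "H100": (9, 0),      # Hopper
    "GTX 1080": (6, 1),  # Pascal -- predates the supported families
}

def supports_flash_attention(compute_capability: tuple) -> bool:
    """True if the (major, minor) compute capability is Turing or newer."""
    return compute_capability >= (7, 5)

for name, cc in EXAMPLE_GPUS.items():
    print(f"{name}: {supports_flash_attention(cc)}")
```

In a real environment the compute capability can be queried at runtime, e.g. with `torch.cuda.get_device_capability()`.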
