
YOLO12: Attention-Centric Object Detection

By Prabakaran | August 22, 2025

Category: Object Detection

Overview

YOLO12 introduces an attention-centric architecture that departs from the CNN-based designs of previous YOLO models while retaining the real-time inference speed essential for many applications. It achieves state-of-the-art object detection accuracy through innovations in its attention mechanisms and overall network architecture.


Supported Tasks and Modes

YOLO12 supports a variety of computer vision tasks. The table below shows task support and the operational modes (Inference, Validation, Training, and Export) enabled for each:

| Model Type | Task | Inference | Validation | Training | Export |
| --- | --- | --- | --- | --- | --- |
| YOLO12 | Detection | ✅ | ✅ | ✅ | ✅ |
| YOLO12-seg | Segmentation | ✅ | ✅ | ✅ | ✅ |
| YOLO12-pose | Pose | ✅ | ✅ | ✅ | ✅ |
| YOLO12-cls | Classification | ✅ | ✅ | ✅ | ✅ |
| YOLO12-obb | OBB | ✅ | ✅ | ✅ | ✅ |

Performance Metrics

YOLO12 demonstrates significant accuracy improvements across all model scales, with some trade-offs in speed compared to the fastest prior YOLO models. Below are quantitative results for object detection on the COCO validation dataset:

Detection Performance (COCO val2017)

| Model | size (pixels) | mAP val 50-95 | Speed CPU ONNX (ms) | Speed T4 TensorRT (ms) | params (M) | FLOPs (B) | Comparison (mAP/Speed) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLO12n | 640 | 40.6 | - | 1.64 | 2.6 | 6.5 | +2.1%/-9% (vs. YOLOv10n) |
| YOLO12s | 640 | 48.0 | - | 2.61 | 9.3 | 21.4 | +0.1%/+42% (vs. RT-DETRv2) |
| YOLO12m | 640 | 52.5 | - | 4.86 | 20.2 | 67.5 | +1.0%/-3% (vs. YOLO11m) |
| YOLO12l | 640 | 53.7 | - | 6.77 | 26.4 | 88.9 | +0.4%/-8% (vs. YOLO11l) |
| YOLO12x | 640 | 55.2 | - | 11.79 | 59.1 | 199.0 | +0.6%/-4% (vs. YOLO11x) |
  • Inference speed measured on an NVIDIA T4 GPU with TensorRT FP16 precision.

  • Comparisons show the relative improvement in mAP and the percentage change in speed (positive indicates faster; negative indicates slower). Comparisons are made against published results for YOLOv10, YOLO11, and RT-DETR where available.
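To make the comparison notation concrete: the mAP figure is an absolute difference in mAP points, while the speed figure is a relative latency change. A minimal sketch of the arithmetic (the baseline values of 38.5 mAP and 1.50 ms below are illustrative assumptions, not published benchmark figures):

```python
# Illustrative arithmetic behind the "Comparison (mAP/Speed)" column.
# Baseline numbers are assumed for illustration only.

def map_delta(new_map: float, base_map: float) -> float:
    """Absolute difference in mAP points."""
    return new_map - base_map

def speed_change_pct(new_ms: float, base_ms: float) -> float:
    """Relative speed change: positive means the new model is faster
    (lower latency), negative means it is slower."""
    return (base_ms - new_ms) / base_ms * 100

# YOLO12n (40.6 mAP, 1.64 ms) vs. a hypothetical baseline
print(round(map_delta(40.6, 38.5), 1))         # -> 2.1 (mAP points gained)
print(round(speed_change_pct(1.64, 1.50), 1))  # -> -9.3 (negative: slower)
```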

Usage Examples

This section provides examples for training and inference with YOLO12. For more comprehensive documentation on these and other modes (including Validation and Export), consult the dedicated Predict and Train pages.

The examples below focus on YOLO12 Detect models (for object detection). For other supported tasks (segmentation, classification, oriented object detection, and pose estimation), refer to the respective task-specific documentation: Segment, Classify, OBB, and Pose.

  Example

Python

Pretrained *.pt models (PyTorch format) and configuration *.yaml files can be passed to the YOLO() class to create a model instance in Python:

from ultralytics import YOLO

# Load a COCO-pretrained YOLO12n model
model = YOLO("yolo12n.pt")

# Train the model on the COCO8 example dataset for 100 epochs
results = model.train(data="coco8.yaml", epochs=100, imgsz=640)

# Run inference with the YOLO12n model on the 'bus.jpg' image
results = model("path/to/bus.jpg")
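The same workflow is also available from the command line. A sketch of the equivalent CLI commands, assuming the `yolo` entry point installed with the `ultralytics` package and the standard `model=`/`data=` argument syntax:

```shell
# Train a COCO-pretrained YOLO12n model on the COCO8 example dataset for 100 epochs
yolo train model=yolo12n.pt data=coco8.yaml epochs=100 imgsz=640

# Run inference with the YOLO12n model on the 'bus.jpg' image
yolo predict model=yolo12n.pt source=path/to/bus.jpg
```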

Key Improvements

  1. Enhanced Feature Extraction:

    • Area Attention: Efficiently handles large receptive fields, reducing computational cost.

    • Optimized Balance: Improved balance between attention and feed-forward network computations.

    • R-ELAN: Enhances feature aggregation using the R-ELAN architecture.

  2. Optimization Innovations:

    • Residual Connections: Introduces residual connections with scaling to stabilize training, especially in larger models.

    • Refined Feature Integration: Implements an improved method for feature integration within R-ELAN.

    • FlashAttention: Incorporates FlashAttention to reduce memory access overhead.

  3. Architectural Efficiency:

    • Reduced Parameters: Achieves a lower parameter count while maintaining or improving accuracy compared to many previous models.

    • Streamlined Attention: Uses a simplified attention implementation, avoiding positional encoding.

    • Optimized MLP Ratios: Adjusts MLP ratios to more effectively allocate computational resources.
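The area-attention idea above can be sketched in a few lines: instead of attending over all tokens of the feature map at once, tokens are split into equal areas and attention is computed independently within each, shrinking the quadratic cost. A minimal NumPy sketch (single head, no learned projections; `num_areas` and the shapes are illustrative assumptions, not the actual YOLO12 implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def area_attention(feats: np.ndarray, num_areas: int) -> np.ndarray:
    """Self-attention restricted to `num_areas` equal slices of the token
    sequence: cost is O(num_areas * (N/num_areas)^2 * d) rather than
    O(N^2 * d) for full attention over all N tokens."""
    n, d = feats.shape
    assert n % num_areas == 0, "tokens must split evenly into areas"
    areas = feats.reshape(num_areas, n // num_areas, d)
    # scaled dot-product attention within each area only
    scores = areas @ areas.transpose(0, 2, 1) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    out = weights @ areas
    return out.reshape(n, d)

# 64 tokens (e.g. a flattened 8x8 feature map), 16-dim features, 4 areas
x = np.random.default_rng(0).normal(size=(64, 16))
y = area_attention(x, num_areas=4)
print(y.shape)  # -> (64, 16)
```

With `num_areas=1` the function reduces to ordinary full self-attention, which makes the cost saving of larger `num_areas` easy to see.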

Requirements

The Ultralytics YOLO12 implementation does not require FlashAttention by default. However, FlashAttention can be optionally compiled and used with YOLO12; compiling it requires a supported NVIDIA GPU.
