
YOLO12: Attention-Centric Object Detection

By Prabakaran | August 22, 2025

Category: Object Detection

Overview

YOLO12 introduces an attention-centric architecture that departs from the CNN-based designs of previous YOLO models while retaining the real-time inference speed essential for many applications. It achieves state-of-the-art object detection accuracy through innovations in its attention mechanisms and overall network architecture.


Supported Tasks and Modes

YOLO12 supports a variety of computer vision tasks. The table below shows task support and the operational modes (Inference, Validation, Training, and Export) enabled for each:

| Model Type | Task | Inference | Validation | Training | Export |
| --- | --- | --- | --- | --- | --- |
| YOLO12 | Detection | ✅ | ✅ | ✅ | ✅ |
| YOLO12-seg | Segmentation | ✅ | ✅ | ✅ | ✅ |
| YOLO12-pose | Pose | ✅ | ✅ | ✅ | ✅ |
| YOLO12-cls | Classification | ✅ | ✅ | ✅ | ✅ |
| YOLO12-obb | OBB | ✅ | ✅ | ✅ | ✅ |

Performance Metrics

YOLO12 demonstrates significant accuracy improvements across all model scales, with some trade-offs in speed compared to the fastest prior YOLO models. Below are quantitative results for object detection on the COCO validation dataset:

Detection Performance (COCO val2017)

| Model | size (pixels) | mAP val 50-95 | Speed CPU ONNX (ms) | Speed T4 TensorRT (ms) | params (M) | FLOPs (B) | Comparison (mAP/Speed) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLO12n | 640 | 40.6 | - | 1.64 | 2.6 | 6.5 | +2.1%/-9% (vs. YOLOv10n) |
| YOLO12s | 640 | 48.0 | - | 2.61 | 9.3 | 21.4 | +0.1%/+42% (vs. RT-DETRv2) |
| YOLO12m | 640 | 52.5 | - | 4.86 | 20.2 | 67.5 | +1.0%/-3% (vs. YOLO11m) |
| YOLO12l | 640 | 53.7 | - | 6.77 | 26.4 | 88.9 | +0.4%/-8% (vs. YOLO11l) |
| YOLO12x | 640 | 55.2 | - | 11.79 | 59.1 | 199.0 | +0.6%/-4% (vs. YOLO11x) |
  • Inference speed measured on an NVIDIA T4 GPU with TensorRT FP16 precision.

  • Comparisons show the relative improvement in mAP and the percentage change in speed (positive indicates faster; negative indicates slower). Comparisons are made against published results for YOLOv10, YOLO11, and RT-DETR where available.
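To make the comparison notation concrete: the mAP figure is an absolute difference in mAP points, while the speed figure is a relative latency change. A minimal sketch of the arithmetic (the baseline values of 38.5 mAP and 1.50 ms below are illustrative assumptions, not published benchmark figures):

```python
# Illustrative arithmetic behind the "Comparison (mAP/Speed)" column.
# Baseline numbers are assumed for illustration only.

def map_delta(new_map: float, base_map: float) -> float:
    """Absolute difference in mAP points."""
    return new_map - base_map

def speed_change_pct(new_ms: float, base_ms: float) -> float:
    """Relative speed change: positive means the new model is faster
    (lower latency), negative means it is slower."""
    return (base_ms - new_ms) / base_ms * 100

# YOLO12n (40.6 mAP, 1.64 ms) vs. a hypothetical baseline
print(round(map_delta(40.6, 38.5), 1))         # -> 2.1 (mAP points gained)
print(round(speed_change_pct(1.64, 1.50), 1))  # -> -9.3 (negative: slower)
```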

Usage Examples

This section provides examples for training and inference with YOLO12. For more comprehensive documentation on these and other modes (including Validation and Export), consult the dedicated Predict and Train pages.

The examples below focus on YOLO12 Detect models (for object detection). For other supported tasks (segmentation, classification, oriented object detection, and pose estimation), refer to the respective task-specific documentation: Segment, Classify, OBB, and Pose.

  Example

Python

Pretrained *.pt models (PyTorch format) and configuration *.yaml files can be passed to the YOLO() class to create a model instance in Python:

from ultralytics import YOLO

# Load a COCO-pretrained YOLO12n model
model = YOLO("yolo12n.pt")

# Train the model on the COCO8 example dataset for 100 epochs
results = model.train(data="coco8.yaml", epochs=100, imgsz=640)

# Run inference with the YOLO12n model on the 'bus.jpg' image
results = model("path/to/bus.jpg")
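The same workflow is also available from the command line. A sketch of the equivalent CLI commands, assuming the `yolo` entry point installed with the `ultralytics` package and the standard `model=`/`data=` argument syntax:

```shell
# Train a COCO-pretrained YOLO12n model on the COCO8 example dataset for 100 epochs
yolo train model=yolo12n.pt data=coco8.yaml epochs=100 imgsz=640

# Run inference with the YOLO12n model on the 'bus.jpg' image
yolo predict model=yolo12n.pt source=path/to/bus.jpg
```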

Key Improvements

  1. Enhanced Feature Extraction:

    • Area Attention: Efficiently handles large receptive fields, reducing computational cost.

    • Optimized Balance: Improved balance between attention and feed-forward network computations.

    • R-ELAN: Enhances feature aggregation using the R-ELAN architecture.

  2. Optimization Innovations:

    • Residual Connections: Introduces residual connections with scaling to stabilize training, especially in larger models.

    • Refined Feature Integration: Implements an improved method for feature integration within R-ELAN.

    • FlashAttention: Incorporates FlashAttention to reduce memory access overhead.

  3. Architectural Efficiency:

    • Reduced Parameters: Achieves a lower parameter count while maintaining or improving accuracy compared to many previous models.

    • Streamlined Attention: Uses a simplified attention implementation, avoiding positional encoding.

    • Optimized MLP Ratios: Adjusts MLP ratios to more effectively allocate computational resources.
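The area-attention idea above can be sketched in a few lines: instead of attending over all tokens of the feature map at once, tokens are split into equal areas and attention is computed independently within each, shrinking the quadratic cost. A minimal NumPy sketch (single head, no learned projections; `num_areas` and the shapes are illustrative assumptions, not the actual YOLO12 implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def area_attention(feats: np.ndarray, num_areas: int) -> np.ndarray:
    """Self-attention restricted to `num_areas` equal slices of the token
    sequence: cost is O(num_areas * (N/num_areas)^2 * d) rather than
    O(N^2 * d) for full attention over all N tokens."""
    n, d = feats.shape
    assert n % num_areas == 0, "tokens must split evenly into areas"
    areas = feats.reshape(num_areas, n // num_areas, d)
    # scaled dot-product attention within each area only
    scores = areas @ areas.transpose(0, 2, 1) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    out = weights @ areas
    return out.reshape(n, d)

# 64 tokens (e.g. a flattened 8x8 feature map), 16-dim features, 4 areas
x = np.random.default_rng(0).normal(size=(64, 16))
y = area_attention(x, num_areas=4)
print(y.shape)  # -> (64, 16)
```

With `num_areas=1` the function reduces to ordinary full self-attention, which makes the cost saving of larger `num_areas` easy to see.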

Requirements

The Ultralytics YOLO12 implementation does not require FlashAttention by default. However, FlashAttention can be optionally compiled and used with YOLO12; compiling it requires a supported NVIDIA GPU.
