[ICLR 2025] YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary

Institute of Information Science, Academia Sinica, Taiwan

YOLO-RD: a module that provides vision models with dataset-level information through a Retriever-Dictionary.

🧾 Abstract

Identifying and localizing objects within images is a fundamental challenge, and numerous efforts have been made to enhance model accuracy by experimenting with diverse architectures and refining training strategies. Nevertheless, a prevalent limitation of existing models is that they overemphasize the current input while ignoring information from the rest of the dataset.

We introduce an innovative Retriever-Dictionary (RD) module to address this issue. This architecture enables YOLO-based models to efficiently retrieve features from a Dictionary that distills insights from the entire dataset, built with knowledge from Vision Models (VM), Large Language Models (LLM), or Vision Language Models (VLM). The flexible RD enables the model to incorporate such explicit knowledge, improving performance on multiple tasks, specifically segmentation, detection, and classification, from the pixel to the image level.

Experiments show that using the RD significantly improves model performance, achieving more than a 3% increase in mean Average Precision for object detection with less than a 1% increase in model parameters. Beyond one-stage object detectors, the RD module also improves two-stage models and DETR-based architectures, such as Faster R-CNN and DETR.

🔬 Method

Initialize

We initialize the dictionary by mapping the dataset's images or captions into a high-dimensional space using YOLO's backbone (Vision Model), CLIP's vision encoder (Vision Language Model), or GPT (Large Language Model). Representative vectors—called atoms—are then selected via K-means or convex hull algorithms.
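As a rough illustration of the K-means path described above, the sketch below selects atoms from a matrix of dataset embeddings. It is a minimal NumPy implementation under stated assumptions: the encoder features are stood in by random vectors, and `kmeans_atoms` and its hyperparameters are hypothetical names, not the paper's actual code.

```python
import numpy as np

def kmeans_atoms(features: np.ndarray, num_atoms: int,
                 iters: int = 20, seed: int = 0) -> np.ndarray:
    """Pick representative 'atoms' from (N, D) dataset embeddings via K-means.

    In the paper's setting, `features` would come from a VM/VLM/LLM encoder
    (e.g. YOLO's backbone, CLIP's vision encoder, or GPT embeddings).
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen feature vectors.
    atoms = features[rng.choice(len(features), num_atoms, replace=False)]
    for _ in range(iters):
        # Assign every feature to its nearest atom (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - atoms[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each atom to the mean of its assigned features.
        for k in range(num_atoms):
            mask = labels == k
            if mask.any():
                atoms[k] = features[mask].mean(axis=0)
    # L2-normalize so retrieval coefficients act on a common scale.
    return atoms / np.linalg.norm(atoms, axis=1, keepdims=True)

# Toy usage: 500 stand-in embeddings of dimension 32, reduced to 8 atoms.
feats = np.random.default_rng(1).normal(size=(500, 32))
atoms = kmeans_atoms(feats, num_atoms=8)
print(atoms.shape)  # (8, 32)
```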


Training

Our module consists of a Retriever and a Dictionary. The Dictionary stores atoms, while the Retriever computes coefficients to combine them based on the input features. To reduce computation and memory costs, especially when the number of atoms is large, we adopt a two-step strategy inspired by grouped convolutions: a Coefficient Generator produces initial coefficients, and a Global Information Exchanger refines them by aggregating neighboring pixel information.
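The two-step coefficient computation can be sketched roughly as follows: a pointwise convolution generates initial per-pixel coefficients, a grouped 3x3 convolution exchanges information between neighboring pixels, and the refined coefficients linearly combine the atoms. This is a hypothetical PyTorch sketch; the class and layer names mirror the text, but the kernel size, group count, and softmax normalization are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class RetrieverDictionary(nn.Module):
    def __init__(self, in_ch: int, num_atoms: int, atom_dim: int, groups: int = 4):
        super().__init__()
        # Dictionary: atoms (randomly initialized here; in practice selected
        # from dataset embeddings, e.g. via K-means).
        self.atoms = nn.Parameter(torch.randn(num_atoms, atom_dim))
        # Step 1 (Coefficient Generator): pointwise conv gives per-pixel
        # initial coefficients over the atoms.
        self.coeff_gen = nn.Conv2d(in_ch, num_atoms, kernel_size=1)
        # Step 2 (Global Information Exchanger): grouped 3x3 conv refines
        # coefficients by aggregating neighboring-pixel information cheaply.
        self.exchanger = nn.Conv2d(num_atoms, num_atoms, kernel_size=3,
                                   padding=1, groups=groups)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        coeffs = self.exchanger(self.coeff_gen(x))   # (B, K, H, W)
        coeffs = coeffs.softmax(dim=1)               # normalize over atoms
        # Weighted combination of atoms at every pixel -> (B, D, H, W).
        return torch.einsum("bkhw,kd->bdhw", coeffs, self.atoms)

rd = RetrieverDictionary(in_ch=64, num_atoms=16, atom_dim=64)
out = rd(torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```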


Compress

After training, we further compress the dictionary. We sample a smaller Retriever-Dictionary and align its output with the original one via contrastive learning in a teacher-student setup. This enables efficient inference with minimal performance loss.
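The teacher-student alignment above can be sketched with an InfoNCE-style contrastive loss: each student output should match its own teacher output (the positive) against the other batch entries (negatives). The function name, temperature, and pooled-vector setup below are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_align_loss(student_out: torch.Tensor,
                           teacher_out: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style alignment between (B, D) student and teacher outputs.

    The i-th student vector is pulled toward the i-th teacher vector and
    pushed away from the other B-1 teacher vectors in the batch.
    """
    s = F.normalize(student_out, dim=1)
    t = F.normalize(teacher_out, dim=1)
    logits = s @ t.T / temperature        # (B, B) cosine-similarity logits
    targets = torch.arange(len(s))        # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: a student that already matches its teacher gets a small loss.
torch.manual_seed(0)
teacher = torch.randn(8, 64)
loss = contrastive_align_loss(teacher, teacher)
```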


📺 Demo

Real-Time Demo

🧪 Experiments

Comparison of performance and improvements across different RD-module modalities and backbone structures on MS COCO.
| Initializer | Backbone | Params | Latency | mAPval.5:.95 (%) | mAPval.5 (%) |
|---|---|---|---|---|---|
| Baseline | YOLOv7 | 37.2M | 3.59 | 50.04 | 68.65 |
| | YOLOv9 | 25.3M | 4.00 | 52.64 | 69.56 |
| | Faster R-CNN | 43.1M | 41.00 | 38.40 | 59.00 |
| | Deformable DETR | 40.1M | 41.10 | 43.80 | 62.60 |
| With VM | YOLOv7 | 37.4M | 3.70 | 51.37 (↑ 2.66%) | 69.42 (↑ 1.13%) |
| | YOLOv9 | 25.5M | 4.16 | 53.41 (↑ 1.46%) | 70.57 (↑ 1.46%) |
| | Faster R-CNN | 44.1M | 41.00 | 40.50 (↑ 5.47%) | 60.30 (↑ 2.20%) |
| | Deformable DETR | 41.2M | 41.10 | 44.10 (↑ 0.68%) | 63.30 (↑ 1.12%) |
| With VLM | YOLOv7 | 37.4M | 3.70 | 51.75 (↑ 3.42%) | 70.12 (↑ 2.15%) |
| | YOLOv9 | 25.5M | 4.16 | 53.36 (↑ 1.37%) | 70.55 (↑ 1.43%) |
| | Faster R-CNN | 44.1M | 41.02 | 40.50 (↑ 5.47%) | 60.40 (↑ 2.37%) |
| | Deformable DETR | 41.2M | 41.28 | 44.40 (↑ 1.37%) | 63.30 (↑ 1.12%) |
| With LLM | YOLOv7 | 38.2M | 3.79 | 51.36 (↑ 2.64%) | 69.40 (↑ 1.17%) |
| | YOLOv9 | 25.8M | 4.20 | 53.28 (↑ 1.22%) | 70.48 (↑ 1.33%) |
| | Faster R-CNN | 44.6M | 41.03 | 40.70 (↑ 5.99%) | 60.80 (↑ 3.05%) |
| | Deformable DETR | 41.7M | 41.35 | 44.16 (↑ 0.91%) | 63.10 (↑ 1.12%) |

Different knowledge-based methods

Overall, RD not only offers the lightest solution, adding only 0.2M parameters, but also delivers the best performance, making it a highly efficient and effective approach.

| Method | mAP | mAP.5 | +Params |
|---|---|---|---|
| Baseline | 52.64 | 69.56 | - |
| KD | 52.52 | 69.14 | 57.3M |
| YOLO-World | 51.00 | 67.70 | 66.1M |
| RALF | 51.40 | 68.07 | 37.8M |
| RD (ours) | 53.36 | 70.55 | 0.2M |

Different tasks

The Retriever-Dictionary (RD) module enhances pixel-level features, and its benefits extend beyond detection to other vision tasks, such as segmentation and classification.

| Task | Metric | Baseline | w/ RD | Improvement |
|---|---|---|---|---|
| Classification | Top-1 | 74.86 | 75.70 | ↑ 1.12% |
| | Top-5 | 93.72 | 94.28 | ↑ 0.60% |
| Detection | mAP (Box) | 50.04 | 51.75 | ↑ 3.42% |
| | mAP.5 (Box) | 68.65 | 69.51 | ↑ 2.15% |
| Segmentation | mAP (Seg) | 40.53 | 41.56 | ↑ 2.54% |
| | mAP.5 (Seg) | 64.00 | 64.64 | ↑ 1.00% |

🎨 Visualization

Figure: qualitative visualization. Panels (c) and (d) show positive samples; panels (e) and (f) show negative samples.

BibTeX

@inproceedings{tsui2024yolord,
    author={Tsui, Hao-Tang and Wang, Chien-Yao and Liao, Hong-Yuan Mark},
    title={YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary},
    booktitle={Proceedings of the International Conference on Learning Representations (ICLR)},
    year={2025},
}