[ICLR 2025] YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary

Institute of Information Science, Academia Sinica, Taiwan

YOLO-RD: a module that provides vision models with dataset-level information through a Retriever-Dictionary.

🧾 Abstract

Identifying and localizing objects within images is a fundamental challenge, and numerous efforts have been made to enhance model accuracy by experimenting with diverse architectures and refining training strategies. Nevertheless, a prevalent limitation of existing models is that they overemphasize the current input while ignoring information from the rest of the dataset.

We introduce an innovative Retriever-Dictionary (RD) module to address this issue. This architecture enables YOLO-based models to efficiently retrieve features from a Dictionary that distills insights from the entire dataset, built with knowledge from Vision Models (VM), Large Language Models (LLM), or Vision Language Models (VLM). The flexible RD enables the model to incorporate such explicit knowledge, improving performance on multiple tasks, specifically segmentation, detection, and classification, from the pixel to the image level.

Experiments show that using the RD significantly improves model performance, achieving more than a 3% increase in mean Average Precision for object detection with less than a 1% increase in model parameters. Beyond one-stage object detectors, the RD module also improves two-stage models and DETR-based architectures, such as Faster R-CNN and DETR.

🔬 Method

Initialize

We initialize the dictionary by mapping the dataset's images or captions into a high-dimensional space using YOLO's backbone (Vision Model), CLIP's vision encoder (Vision Language Model), or GPT (Large Language Model). Representative vectors—called atoms—are then selected via K-means or convex hull algorithms.
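As a rough illustration of the K-means path described above, the sketch below selects atoms from a matrix of dataset embeddings. It is a minimal NumPy implementation under stated assumptions: the encoder features are stood in by random vectors, and `kmeans_atoms` and its hyperparameters are hypothetical names, not the paper's actual code.

```python
import numpy as np

def kmeans_atoms(features: np.ndarray, num_atoms: int,
                 iters: int = 20, seed: int = 0) -> np.ndarray:
    """Pick representative 'atoms' from (N, D) dataset embeddings via K-means.

    In the paper's setting, `features` would come from a VM/VLM/LLM encoder
    (e.g. YOLO's backbone, CLIP's vision encoder, or GPT embeddings).
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen feature vectors.
    atoms = features[rng.choice(len(features), num_atoms, replace=False)]
    for _ in range(iters):
        # Assign every feature to its nearest atom (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - atoms[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each atom to the mean of its assigned features.
        for k in range(num_atoms):
            mask = labels == k
            if mask.any():
                atoms[k] = features[mask].mean(axis=0)
    # L2-normalize so retrieval coefficients act on a common scale.
    return atoms / np.linalg.norm(atoms, axis=1, keepdims=True)

# Toy usage: 500 stand-in embeddings of dimension 32, reduced to 8 atoms.
feats = np.random.default_rng(1).normal(size=(500, 32))
atoms = kmeans_atoms(feats, num_atoms=8)
print(atoms.shape)  # (8, 32)
```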


Training

Our module consists of a Retriever and a Dictionary. The Dictionary stores atoms, while the Retriever computes coefficients to combine them based on the input features. To reduce computation and memory costs, especially when the number of atoms is large, we adopt a two-step strategy inspired by grouped convolutions: a Coefficient Generator produces initial coefficients, and a Global Information Exchanger refines them by aggregating neighboring pixel information.
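The two-step coefficient computation can be sketched roughly as follows: a pointwise convolution generates initial per-pixel coefficients, a grouped 3x3 convolution exchanges information between neighboring pixels, and the refined coefficients linearly combine the atoms. This is a hypothetical PyTorch sketch; the class and layer names mirror the text, but the kernel size, group count, and softmax normalization are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class RetrieverDictionary(nn.Module):
    def __init__(self, in_ch: int, num_atoms: int, atom_dim: int, groups: int = 4):
        super().__init__()
        # Dictionary: atoms (randomly initialized here; in practice selected
        # from dataset embeddings, e.g. via K-means).
        self.atoms = nn.Parameter(torch.randn(num_atoms, atom_dim))
        # Step 1 (Coefficient Generator): pointwise conv gives per-pixel
        # initial coefficients over the atoms.
        self.coeff_gen = nn.Conv2d(in_ch, num_atoms, kernel_size=1)
        # Step 2 (Global Information Exchanger): grouped 3x3 conv refines
        # coefficients by aggregating neighboring-pixel information cheaply.
        self.exchanger = nn.Conv2d(num_atoms, num_atoms, kernel_size=3,
                                   padding=1, groups=groups)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        coeffs = self.exchanger(self.coeff_gen(x))   # (B, K, H, W)
        coeffs = coeffs.softmax(dim=1)               # normalize over atoms
        # Weighted combination of atoms at every pixel -> (B, D, H, W).
        return torch.einsum("bkhw,kd->bdhw", coeffs, self.atoms)

rd = RetrieverDictionary(in_ch=64, num_atoms=16, atom_dim=64)
out = rd(torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```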


Compress

After training, we further compress the dictionary. We sample a smaller Retriever-Dictionary and align its output with the original one via contrastive learning in a teacher-student setup. This enables efficient inference with minimal performance loss.
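The teacher-student alignment above can be sketched with an InfoNCE-style contrastive loss: each student output should match its own teacher output (the positive) against the other batch entries (negatives). The function name, temperature, and pooled-vector setup below are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_align_loss(student_out: torch.Tensor,
                           teacher_out: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style alignment between (B, D) student and teacher outputs.

    The i-th student vector is pulled toward the i-th teacher vector and
    pushed away from the other B-1 teacher vectors in the batch.
    """
    s = F.normalize(student_out, dim=1)
    t = F.normalize(teacher_out, dim=1)
    logits = s @ t.T / temperature        # (B, B) cosine-similarity logits
    targets = torch.arange(len(s))        # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: a student that already matches its teacher gets a small loss.
torch.manual_seed(0)
teacher = torch.randn(8, 64)
loss = contrastive_align_loss(teacher, teacher)
```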


📺 Demo

Real-Time Demo

🧪 Experiments

Comparison of performance and improvements across different RD-module modalities and backbone structures on MS COCO.
| Initializer | Backbone | Params | Latency | mAPval.5:.95 (%) | mAPval.5 (%) |
|---|---|---|---|---|---|
| Baseline | YOLOv7 | 37.2M | 3.59 | 50.04 | 68.65 |
| | YOLOv9 | 25.3M | 4.00 | 52.64 | 69.56 |
| | Faster R-CNN | 43.1M | 41.00 | 38.40 | 59.00 |
| | Deformable DETR | 40.1M | 41.10 | 43.80 | 62.60 |
| With VM | YOLOv7 | 37.4M | 3.70 | 51.37 (↑ 2.66%) | 69.42 (↑ 1.13%) |
| | YOLOv9 | 25.5M | 4.16 | 53.41 (↑ 1.46%) | 70.57 (↑ 1.46%) |
| | Faster R-CNN | 44.1M | 41.00 | 40.50 (↑ 5.47%) | 60.30 (↑ 2.20%) |
| | Deformable DETR | 41.2M | 41.10 | 44.10 (↑ 0.68%) | 63.30 (↑ 1.12%) |
| With VLM | YOLOv7 | 37.4M | 3.70 | 51.75 (↑ 3.42%) | 70.12 (↑ 2.15%) |
| | YOLOv9 | 25.5M | 4.16 | 53.36 (↑ 1.37%) | 70.55 (↑ 1.43%) |
| | Faster R-CNN | 44.1M | 41.02 | 40.50 (↑ 5.47%) | 60.40 (↑ 2.37%) |
| | Deformable DETR | 41.2M | 41.28 | 44.40 (↑ 1.37%) | 63.30 (↑ 1.12%) |
| With LLM | YOLOv7 | 38.2M | 3.79 | 51.36 (↑ 2.64%) | 69.40 (↑ 1.17%) |
| | YOLOv9 | 25.8M | 4.20 | 53.28 (↑ 1.22%) | 70.48 (↑ 1.33%) |
| | Faster R-CNN | 44.6M | 41.03 | 40.70 (↑ 5.99%) | 60.80 (↑ 3.05%) |
| | Deformable DETR | 41.7M | 41.35 | 44.16 (↑ 0.91%) | 63.10 (↑ 1.12%) |

Different knowledge-based methods

Overall, RD not only offers the lightest solution, adding only 0.2M parameters, but also delivers the best performance, making it a highly efficient and effective approach.

| Method | mAP | mAP.5 | +Params |
|---|---|---|---|
| Baseline | 52.64 | 69.56 | - |
| KD | 52.52 | 69.14 | 57.3M |
| YOLO-World | 51.00 | 67.70 | 66.1M |
| RALF | 51.40 | 68.07 | 37.8M |
| RD (ours) | 53.36 | 70.55 | 0.2M |

Different tasks

The Retriever-Dictionary (RD) module enhances pixel-level features, and its benefits extend beyond detection to other vision tasks, such as segmentation and classification.

| Task | Metric | Baseline | w/ RD | Improvement |
|---|---|---|---|---|
| Classification | Top-1 | 74.86 | 75.70 | ↑ 1.12% |
| | Top-5 | 93.72 | 94.28 | ↑ 0.60% |
| Detection | mAP (Box) | 50.04 | 51.75 | ↑ 3.42% |
| | mAP.5 (Box) | 68.65 | 69.51 | ↑ 2.15% |
| Segmentation | mAP (Seg) | 40.53 | 41.56 | ↑ 2.54% |
| | mAP.5 (Seg) | 64.00 | 64.64 | ↑ 1.00% |

🎨 Visualization

Figure: qualitative visualization. Panels (c) and (d) show positive samples; panels (e) and (f) show negative samples.

BibTeX

@inproceedings{tsui2024yolord,
    author={Tsui, Hao-Tang and Wang, Chien-Yao and Liao, Hong-Yuan Mark},
    title={YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary},
    booktitle={Proceedings of the International Conference on Learning Representations (ICLR)},
    year={2025},
}