[ECCV 2024] YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

Institute of Information Science, Academia Sinica, Taiwan

YOLOv9 is the current state-of-the-art real-time object detection model.

🧾 Abstract

Today’s deep learning methods focus on how to design the objective functions to make the prediction as close as possible to the target. Meanwhile, an appropriate neural network architecture has to be designed. Existing methods ignore the fact that when input data undergoes layer-by-layer feature transformation, a large amount of information is lost.
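
The loss of information under layer-by-layer transformation can be illustrated with a small sketch. It is a toy example in plain Python, not taken from the paper: `layer1`, `layer2`, and `reversible_layer` are hypothetical stand-ins for network layers. For a deterministic map, the entropy of the output equals the information it retains about the input, so a drop in entropy means information has been lost.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy "layers": deterministic maps applied one after another.
def layer1(x):
    return x // 2            # many-to-one: merges neighbouring values


def layer2(x):
    return max(x - 2, 0)     # ReLU-like clipping collapses small values


def reversible_layer(x):
    return 2 * x + 1         # one-to-one, so the input can be recovered


inputs = list(range(16))     # 16 equally likely inputs, i.e. 4 bits
h0 = entropy_bits(inputs)
h1 = entropy_bits([layer1(x) for x in inputs])
h2 = entropy_bits([layer2(layer1(x)) for x in inputs])
h_rev = entropy_bits([reversible_layer(x) for x in inputs])

print(h0, h1, round(h2, 2), h_rev)   # entropy shrinks after each lossy layer
```

In this sketch the two lossy layers reduce the entropy from 4 bits to 3 bits and then to about 2.41 bits, while the one-to-one layer keeps all 4 bits.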

This paper delves into the important issues of information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. PGI can provide complete input information for the target task to calculate the objective function, so that reliable gradient information can be obtained to update network parameters. In addition, a lightweight network architecture, the Generalized Efficient Layer Aggregation Network (GELAN), is designed. GELAN confirms that PGI achieves superior results on lightweight models. We verified the proposed GELAN and PGI on the MS COCO object detection dataset.
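
For reference, the two notions named above are commonly stated in terms of mutual information $I$. The notation here is an assumption for illustration and is not defined on this page: $f_\theta$ and $g_\phi$ are two consecutive transformations, and $r_\psi$ is a reversible function with inverse $v_\zeta$.

```latex
% Information bottleneck: each transformation can only lose information about X
I(X, X) \ge I(X, f_\theta(X)) \ge I(X, g_\phi(f_\theta(X)))

% Reversible function: r_\psi has an inverse v_\zeta, so no information is lost
X = v_\zeta(r_\psi(X)), \qquad
I(X, X) = I(X, r_\psi(X)) = I(X, v_\zeta(r_\psi(X)))
```

The first line is the data processing inequality applied to stacked layers. The second line says that a transformation that can be inverted preserves all the information about its input.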

The results show that GELAN uses only conventional convolution operators to achieve better parameter utilization than the state-of-the-art methods developed based on depth-wise convolution. PGI can be used for a variety of models, from lightweight to large. It can be used to obtain complete information, so that train-from-scratch models can achieve better results than state-of-the-art models pre-trained using large datasets. The comparison results are shown in the figure above.

🎥 Video

🔬 Method

PGI

PGI combines Reversible Branches and Deep Supervision to provide accurate gradients. It introduces auxiliary branches that shorten paths or reuse inputs, improving information retention and gradient precision. Unlike traditional Deep Supervision, PGI works with small models and allows removable branches during inference, enhancing speed without sacrificing performance.
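
The idea of an auxiliary branch that supplies extra supervision during training and is removed for inference can be sketched as follows. This is a toy illustration in plain Python, not the authors' implementation: `ToyDetector`, `training_loss`, and `aux_weight` are hypothetical names, the branches are simple scalar functions, and gradient flow is not modelled.

```python
class ToyDetector:
    """Toy stand-in for a network with a main branch and a removable auxiliary branch."""

    def __init__(self, main_branch, aux_branch=None):
        self.main_branch = main_branch
        self.aux_branch = aux_branch      # only needed while training

    def forward(self, x, training=False):
        main_out = self.main_branch(x)
        if training and self.aux_branch is not None:
            # The auxiliary branch reuses the input to give a second prediction.
            return main_out, self.aux_branch(x)
        return main_out

    def strip_auxiliary(self):
        """Drop the auxiliary branch before deployment."""
        self.aux_branch = None


def training_loss(model, x, target, aux_weight=0.25):
    """Main loss plus a weighted auxiliary loss (squared error, chosen for simplicity)."""
    main_out, aux_out = model.forward(x, training=True)
    return (main_out - target) ** 2 + aux_weight * (aux_out - target) ** 2


model = ToyDetector(main_branch=lambda x: 2.0 * x,
                    aux_branch=lambda x: 2.0 * x + 0.5)

loss = training_loss(model, x=1.0, target=2.0)   # both branches contribute
before = model.forward(3.0)                      # inference with aux still attached
model.strip_auxiliary()
after = model.forward(3.0)                       # inference after removing aux

print(loss, before, after)
```

Because the inference path only ever calls the main branch, removing the auxiliary branch leaves the prediction unchanged while saving its computation.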

GELAN

GELAN, the Generalized Efficient Layer Aggregation Network, reduces data loss by combining CSPNet and ELAN. CSPNet splits and merges features, while ELAN aggregates information across layers. We generalize ELAN to use any network block, and here we use CSPNet. Unlike ResNet, GELAN maintains data quality even beyond 100 layers, easing the model's learning effort.
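
The split-and-aggregate pattern described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the released architecture: `gelan_like_block` and `toy_block` are hypothetical names, feature channels are represented as a flat list of floats, and the transition layer that would normally fuse the concatenated features is omitted.

```python
def gelan_like_block(features, block, depth=2):
    """Split the channels, run one half through `depth` stacked blocks,
    and concatenate every intermediate result (ELAN-style aggregation)."""
    half = len(features) // 2
    parts = [features[:half], features[half:]]   # CSP-style channel split
    for _ in range(depth):
        parts.append(block(parts[-1]))           # each block feeds the next one
    # Aggregate all branches so that earlier features are carried forward.
    return [value for part in parts for value in part]


def toy_block(channels):
    """Placeholder computational block; any callable with this shape works."""
    return [0.5 * c + 1.0 for c in channels]


out = gelan_like_block([1.0, 2.0, 3.0, 4.0], toy_block, depth=2)
print(out)   # [1.0, 2.0, 3.0, 4.0, 2.5, 3.0, 2.25, 2.5]
```

Passing the block as an argument mirrors the generalization step: the aggregation logic stays the same while the computational block can be swapped.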

📺 Demo

Real-Time Demo

Multi-Tasking

The backbone built with GELAN and PGI provides overall improvements across real-time tasks: object detection, instance segmentation, and panoptic segmentation.

🧪 Experiments

Comparison of SOTA Detectors

| Model | Param. (M) | FLOPs (G) | AP^val (%) | AP_50^val (%) | AP_75^val (%) | AP_S^val (%) | AP_M^val (%) | AP_L^val (%) |
|---|---|---|---|---|---|---|---|---|
| YOLOv7-N AF | 3.1 | 8.7 | 37.6 | 53.3 | 40.6 | 18.7 | 41.7 | 52.8 |
| YOLOv7-S AF | 11 | 28.1 | 45.1 | 61.8 | 48.9 | 25.7 | 50.2 | 61.2 |
| YOLOv7 | 36.9 | 104.7 | 51.2 | 69.7 | 55.9 | 31.8 | 55.5 | 65.0 |
| YOLOv7 AF | 43.6 | 130.5 | 53.0 | 70.2 | 57.5 | 35.8 | 58.7 | 68.9 |
| YOLOv7-X | 71.3 | 189.9 | 52.9 | 71.1 | 51.4 | 36.9 | 57.7 | 68.6 |
| YOLOv8-N | 3.2 | 8.7 | 37.3 | 52.6 | - | - | - | - |
| YOLOv8-S | 11.2 | 28.6 | 44.9 | 61.8 | - | - | - | - |
| YOLOv8-M | 25.9 | 78.9 | 50.2 | 67.2 | - | - | - | - |
| YOLOv8-L | 43.7 | 165.2 | 52.9 | 69.8 | 57.5 | 35.3 | 58.3 | 69.8 |
| YOLOv8-X | 68.2 | 257.8 | 53.9 | 71.0 | 58.7 | 35.7 | 59.3 | 70.7 |
| YOLO MS-N | 4.5 | 17.4 | 43.4 | 60.4 | 47.6 | 23.7 | 48.3 | 60.3 |
| YOLO MS-S | 8.1 | 31.2 | 46.2 | 63.7 | 50.5 | 26.9 | 50.5 | 63.0 |
| YOLO MS | 22.2 | 80.2 | 51.0 | 68.6 | 55.7 | 33.1 | 56.1 | 66.5 |
| RT-DETR R50 | 42 | 136 | 53.1 | 71.3 | 57.7 | 34.8 | 58.0 | 70.0 |
| RT-DETR R101 | 76 | 259 | 54.3 | 72.7 | 58.6 | 36.0 | 58.8 | 72.1 |
| YOLOv9-S (Ours) | 7.2 | 26.7 | 46.8 | 63.4 | 50.7 | 26.6 | 56.0 | 64.5 |
| YOLOv9-M (Ours) | 20.1 | 76.8 | 51.4 | 68.1 | 56.1 | 33.6 | 57.0 | 68.0 |
| YOLOv9-TR (Ours) | 14.1 | 67.5 | 53.1 | - | - | - | - | - |
| YOLOv9-C (Ours) | 25.5 | 102.8 | 53.0 | 70.2 | 57.8 | 36.2 | 58.5 | 69.3 |
| YOLOv9-E (Ours) | 58.1 | 192.5 | 55.6 | 72.8 | 60.6 | 40.2 | 61.0 | 71.4 |
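
To make the parameter-utilization comparison concrete, the relative savings can be computed directly from the table above. The snippet is a small arithmetic sketch: the numbers are copied from the table, while the helper name `compare` and the choice of model pairs are assumptions made for illustration.

```python
# (parameters in millions, FLOPs in billions, AP in %) copied from the table above
models = {
    "YOLOv7 AF": (43.6, 130.5, 53.0),
    "YOLOv8-X": (68.2, 257.8, 53.9),
    "YOLOv9-C": (25.5, 102.8, 53.0),
    "YOLOv9-E": (58.1, 192.5, 55.6),
}


def compare(ours, baseline):
    """Relative parameter/FLOPs reduction and absolute AP gain of `ours` vs `baseline`."""
    p1, f1, ap1 = models[ours]
    p0, f0, ap0 = models[baseline]
    return {
        "param_reduction_pct": round(100 * (p0 - p1) / p0, 1),
        "flops_reduction_pct": round(100 * (f0 - f1) / f0, 1),
        "ap_gain": round(ap1 - ap0, 1),
    }


c_vs_v7 = compare("YOLOv9-C", "YOLOv7 AF")
e_vs_v8 = compare("YOLOv9-E", "YOLOv8-X")
print(c_vs_v7)
print(e_vs_v8)
```

With these figures, YOLOv9-C matches the AP of YOLOv7 AF with about 41.5% fewer parameters and 21.2% fewer FLOPs, and YOLOv9-E gains 1.7 AP over YOLOv8-X with about 14.8% fewer parameters and 25.3% fewer FLOPs.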

PGI in various tasks

We extend PGI to different tasks, including instance segmentation, panoptic segmentation, and image captioning. The results in the table below show that PGI can indeed be used in various tasks.

| | Detect | Segment | Panoptic | Caption |
|---|---|---|---|---|
| metric | AP^box | AP^mask | PQ^pan | BLEU4^cap |
| no PGI | 52.5 | 42.4 | 39.4 | 38.8 |
| PGI | 53.0 | 43.2 | 40.5 | 39.1 |
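
The per-task effect of PGI can be read off as a simple difference between the two rows. The snippet below is only an arithmetic sketch with the values copied from the table above; the task labels are written out for readability.

```python
tasks = ["Detect (AP box)", "Segment (AP mask)", "Panoptic (PQ)", "Caption (BLEU-4)"]
no_pgi = [52.5, 42.4, 39.4, 38.8]
with_pgi = [53.0, 43.2, 40.5, 39.1]

# Absolute gain per task, rounded to the table's one-decimal precision.
gains = {task: round(w - n, 1) for task, n, w in zip(tasks, no_pgi, with_pgi)}

for task, gain in gains.items():
    print(f"{task}: +{gain}")
```

The gains are +0.5, +0.8, +1.1, and +0.3 points respectively, so every task in the table improves when PGI is used.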

PGI on small datasets (VOC)

PGI effectively captures data-target relations, enhancing learning from smaller datasets and improving YOLOv9's transfer learning. The table below highlights PGI's strong performance in transfer learning on small datasets.

| | GELAN-S | YOLOv9-S | YOLOv5-S | YOLOv8-S |
|---|---|---|---|---|
| pretrain | COCO | COCO | COCO | COCO |
| AP^box | 73.5% | 74.4% | 62.4% | 67.1% |
| AP^box_50 | 89.8% | 90.4% | 86.6% | 85.8% |

BibTeX

@article{wang2024yolov9,
  author    = {Chien-Yao Wang and I-Hau Yeh and Hong-Yuan Mark Liao},
  title     = {YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information},
  journal   = {ECCV},
  year      = {2024},
}