YOLOv9 promises to be the new state-of-the-art real-time object detector
YOLOv9 is a real-time object detector whose performance is competitive enough to make it the newest state-of-the-art method. It introduces two improvements: Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN).
YOLOv9 was developed to address the information bottleneck, a phenomenon in which input data undergoes non-negligible information loss as it passes through successive layers during the feedforward process. Currently, three strategies are commonly used to mitigate this loss: reversible architectures, masked modeling, and deep supervision.
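The information loss described above can be illustrated with a small, self-contained sketch. It is a toy model, not code from YOLOv9: the two "layers" are hypothetical integer maps chosen so that each one merges neighbouring input values, and mutual information is estimated directly from the samples. Because each layer is a function of the previous one, the information it retains about the input can only stay the same or shrink.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Toy input: 8 equally likely values, so the input carries 3 bits.
xs = list(range(8))

# Hypothetical "layers" (assumptions for illustration): each merges
# neighbouring values, so it cannot be inverted and discards information.
layer1 = [x // 2 for x in xs]
layer2 = [h // 2 for h in layer1]

info_input = mutual_information(xs, xs)
info_layer1 = mutual_information(xs, layer1)
info_layer2 = mutual_information(xs, layer2)

print(info_input, info_layer1, info_layer2)
```

In this toy setting the retained information drops at every step, which is the behaviour the three mitigation strategies try to counteract in real networks.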
Each strategy has drawbacks. Reversible architectures rely on repeated input data to preserve information explicitly, so they require extra layers to process the repeated inputs, which increases the inference cost. Masked modeling uses a reconstruction loss to implicitly maximize the features extracted from the data, but that reconstruction loss can conflict with the target loss. Deep supervision can cause error accumulation: if information is lost at a shallow level, subsequent layers cannot retrieve it. Moreover, deep supervision is only suitable for less complicated tasks and larger models, since error accumulation is more noticeable in complex tasks and small models.
Thus, YOLOv9 integrates PGI and GELAN to address both the information bottleneck and the shortcomings of the existing mitigation mechanisms. PGI is an auxiliary supervision mechanism with three main components: a main branch, an auxiliary reversible branch, and multi-level auxiliary information. Since inference is restricted to the main branch, PGI does not increase the inference cost. The auxiliary reversible branch mitigates the information loss that the bottleneck causes in deep features by feeding the main branch with reliable gradient information. It is also designed to be suitable for shallower networks.
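The claim that auxiliary supervision adds no inference cost can be sketched with a deliberately tiny model. Everything here is an assumption made for illustration: the class `ToyTwoBranchModel`, its scalar weights, and the squared-error losses are hypothetical, and the auxiliary head is a plain linear head rather than the reversible branch used in PGI. The sketch shows only the general pattern: the auxiliary branch contributes gradients to shared weights during training and is never touched at inference.

```python
class ToyTwoBranchModel:
    """Hypothetical model: a main branch plus a training-only auxiliary branch."""

    def __init__(self):
        self.shared = 0.5     # shared "backbone" weight
        self.main_head = 0.5  # main-branch head weight
        self.aux_head = 0.5   # auxiliary-branch head weight (training only)

    def predict(self, x):
        # Inference uses only the main branch; aux_head is never read here.
        return self.main_head * self.shared * x

    def train_step(self, x, target, lr=0.01):
        feat = self.shared * x
        main_err = self.main_head * feat - target
        aux_err = self.aux_head * feat - target
        # Gradients of 0.5 * main_err**2 + 0.5 * aux_err**2.
        # The shared weight receives gradient from BOTH branches.
        grad_shared = (main_err * self.main_head + aux_err * self.aux_head) * x
        grad_main = main_err * feat
        grad_aux = aux_err * feat
        self.shared -= lr * grad_shared
        self.main_head -= lr * grad_main
        self.aux_head -= lr * grad_aux
        return 0.5 * main_err ** 2  # main-branch loss before the update


model = ToyTwoBranchModel()
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy targets: y = 2x

epoch_losses = []
for _ in range(200):
    total = 0.0
    for x, y in data:
        total += model.train_step(x, y)
    epoch_losses.append(total / len(data))

before = model.predict(3.0)
model.aux_head = 123.0  # tamper with the auxiliary branch after training
after = model.predict(3.0)

print(epoch_losses[0], epoch_losses[-1], before == after)
```

Changing the auxiliary head after training leaves the prediction untouched, which is why a branch of this kind can be discarded at deployment.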
The multi-level auxiliary information component consists of an integration network that combines the gradients returned by the feature pyramid's prediction heads, so that the aggregated gradient information covers all the target objects. The component feeds this information to the main branch, which then updates its parameters. This keeps the main branch's feature pyramid from being dominated by information about a specific target object. As a result, PGI can mitigate some of the issues caused by traditional deep supervision and can guide the learning of architectures of different sizes.
Finally, GELAN combines two neural network architectures, CSPNet and ELAN. Unlike ELAN, the resulting architecture is not limited to stacking convolutional layers, as it can use any computational block.
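The idea that GELAN accepts any computational block can be sketched in a few lines. This is a loose illustration, not the YOLOv9 implementation: features are plain Python lists standing in for channels, and the function `gelan_like_block`, the `scale` and `shift` blocks, and the identity transition are all hypothetical names introduced here. The sketch keeps two ingredients: a CSP-style split that lets part of the input bypass the computation, and an ELAN-style aggregation that concatenates every intermediate output.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def gelan_like_block(x: Vector, blocks: List[Block], transition: Block) -> Vector:
    """CSP-style split plus ELAN-style aggregation over arbitrary blocks."""
    half = len(x) // 2
    bypass, branch = x[:half], x[half:]  # CSP-style channel split
    collected = [bypass, branch]
    current = branch
    for block in blocks:  # any callable works, not only convolutions
        current = block(current)
        collected.append(current)  # ELAN-style: keep every intermediate output
    merged = [v for part in collected for v in part]  # concatenate "channels"
    return transition(merged)


def scale(v: Vector) -> Vector:
    """Hypothetical computational block: multiply each element by 2."""
    return [2.0 * e for e in v]


def shift(v: Vector) -> Vector:
    """Hypothetical computational block: add 1 to each element."""
    return [e + 1.0 for e in v]


def identity(v: Vector) -> Vector:
    """Stand-in for the final transition layer."""
    return v


output = gelan_like_block([1.0, 2.0, 3.0, 4.0], [scale, shift], identity)
print(output)
```

Swapping `scale` or `shift` for a different callable changes nothing in `gelan_like_block` itself, which is the flexibility the paragraph above attributes to GELAN.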
As a result of these two improvements, YOLOv9 performs significantly better than most of its predecessors and competing methods, including RT-DETR, RTMDet, and PP-YOLOE. Overall, YOLOv9 displays competitive performance across experiments. More details on YOLOv9's features, performance, implementation, and testing can be found in the preprint and the project's GitHub.
YOLOv9 reflects what seems to be one of the most prevalent dynamics in AI research: the rush to attain and preserve state-of-the-art status. Its release follows OpenAI's recent announcement of Sora, the company's new competitor for the best video generation model, and Google's announcement of Gemini 1.5, the next generation of capable and efficient natively multimodal models.