Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design
Abstract
Real-time object detection is crucial for real-world applications, which require high accuracy at low latency. While Detection Transformers (DETR) have demonstrated significant performance improvements, current real-time DETR models are hard to reproduce from scratch because of the excessive pre-training overhead of their backbones. This constrains research by hindering the exploration of novel backbone architectures. In this paper, we show that sound general design makes it possible to reach high performance at low pre-training cost. After a thorough study of the backbone architecture, we propose EfficientNAT at various scales, which incorporates modern efficient convolution and local attention mechanisms. Moreover, we redesign the hybrid encoder with local attention, significantly improving both performance and inference speed. Based on these advancements, we present Le-DETR (Low-cost and Efficient DEtection TRansformer), which achieves a new SOTA in real-time detection using only the ImageNet1K and COCO2017 training datasets, with about 80% fewer images in the pre-training stage than previous methods. We demonstrate that well-designed real-time DETR models can achieve strong performance without complex and computationally expensive pre-training. Extensive experiments show that Le-DETR-M/L/X achieve 52.9/54.3/55.1 mAP on COCO Val2017 at 4.45/5.01/6.68 ms latency on an RTX4090. Compared with YOLOv12-L/X, it achieves +0.6/-0.1 mAP, at similar speed and with a 20% speedup, respectively. Compared with DEIM-D-FINE, Le-DETR-M achieves +0.2 mAP with slightly faster inference, and surpasses DEIM-D-FINE-L by +0.4 mAP with only 0.4 ms additional latency. Code and weights will be open-sourced.
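The abstract names local attention as a building block but does not spell out its form. As a rough illustration only, the sketch below shows a generic 1D windowed attention in which each position attends to its nearest neighbors rather than to the whole sequence. It is not EfficientNAT or the paper's hybrid encoder; the function name `local_attention_1d` and the `window` parameter are hypothetical and chosen for this example.

```python
import math


def local_attention_1d(queries, keys, values, window):
    """Generic windowed attention sketch (hypothetical helper, not the paper's code).

    Position i attends only to positions j with |i - j| <= window,
    so the cost grows as O(n * window) instead of O(n^2).
    """
    n = len(queries)
    scale = 1.0 / math.sqrt(len(queries[0]))
    value_dim = len(values[0])
    outputs = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Scaled dot-product scores against the local neighborhood only.
        scores = [
            scale * sum(q * k for q, k in zip(queries[i], keys[j]))
            for j in range(lo, hi)
        ]
        # Numerically stable softmax over the neighborhood.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the neighborhood's value vectors.
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, range(lo, hi)))
            for d in range(value_dim)
        ])
    return outputs
```

With `window=0` every position attends only to itself and the output equals `values`. With `window` at least as large as the sequence length, the function reduces to ordinary global attention.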