P2HCT: Plug-and-Play Hierarchical C2F Transformer for Multi-Scale Feature Fusion

Abstract

Feature fusion plays a pivotal role in achieving high performance in vision models, yet existing attention-based fusion techniques often suffer from substantial computational overhead and implementation complexity, particularly in resource-constrained settings. To address these limitations, we introduce the Plug-and-Play Hierarchical C2F Transformer (P2HCT), a lightweight module that combines coarse-to-fine token selection with shared attention parameters to preserve spatial details while reducing inference cost. P2HCT is trainable using coarse attention alone and can be seamlessly activated at inference to enhance accuracy without retraining. Integrated into real-time detectors such as YOLOv11-N/S/M, P2HCT achieves mAP gains of 0.9%, 0.5%, and 0.4% on MS COCO with minimal latency increase. Similarly, embedding P2HCT into ResNet-18/50/101 backbones improves ImageNet top-1 accuracy by 6.5%, 1.7%, and 1.0%, respectively. These results underscore P2HCT's effectiveness as a hardware-friendly and general-purpose enhancement for both detection and classification tasks.
