Paper · Nature ML · 06-12

4DO-DETR for otitis media detection

Introduction

Otitis media (OM) is a common ear disease worldwide that can lead to hearing loss, ear pain, and, in severe cases, serious complications such as facial paralysis and intracranial infection[1](https://www.nature.com/articles/s41598-026-44468-7#ref-CR1 "Monasta, L. et al. Burden of disease caused by otitis media: systematic review and global estimates[J]. PloS one. 7 (4), e36226 (2012)."). Accurate and timely diagnosis is therefore crucial for effective treatment and prevention of complications. Computed tomography (CT) imaging plays an important role in OM detection, providing high-resolution three-dimensional views of the ear structures[2](https://www.nature.com/articles/s41598-026-44468-7#ref-CR2 "Wang, Y. M. et al. Deep learning in automated region proposal and diagnosis of chronic otitis media based on computed tomography. Ear Hear. 41 (3), 669–677 (2020).").

At present, CT interpretation relies primarily on manual assessment by specialists[3](https://www.nature.com/articles/s41598-026-44468-7#ref-CR3 "Duan, B., Guo, Z., Pan, L., Xu, Z. & Chen, W. Temporal bone CT-based deep learning models for differential diagnosis of primary ciliary dyskinesia related otitis media and simple otitis media with effusion. Am. J. Otolaryngol. 43 (6), 153–162 (2022)."). However, the complexity of anatomical structures, variability of lesions, and increasing clinical workload make manual evaluation both time-consuming and error-prone. Automated object detection offers a promising means to enhance diagnostic efficiency and reliability[4](https://www.nature.com/articles/s41598-026-44468-7#ref-CR4 "Pham, V. T., Tran, T. T., Wang, P. C. & Chen, P. Y. EAR-UNet: A deep learning-based approach for segmentation of tympanic membranes from otoscopic images. Artif. Intell. Med., 112, (2021). Article 102015.").

Classical object detection methods such as Histogram of Oriented Gradients (HOG)[5](https://www.nature.com/articles/s41598-026-44468-7#ref-CR5 "Dalal, N. & Triggs, B. Histograms of oriented gradients for human detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 2005, 886–893, 1, (2005). https://doi.org/10.1109/CVPR.2005.177") and Deformable Part Models (DPM)[6](https://www.nature.com/articles/s41598-026-44468-7#ref-CR6 "Felzenszwalb PF, et al. Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell. 2010 Sep;32(9):1627-45.") relied on handcrafted features and sliding-window search. Although effective in certain scenarios, their generalization was limited by dataset-specific feature design and the heterogeneity of medical images. The introduction of deep learning–based methods brought a major shift: region-based CNN models (R-CNN and Faster R-CNN)[7](https://www.nature.com/articles/s41598-026-44468-7#ref-CR7 "Girshick, R. et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. : 580–587. (2014)."),[8](https://www.nature.com/articles/s41598-026-44468-7#ref-CR8 "Ren S, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans Pattern Anal Mach Intell. 2017 Jun;39(6):1137-1149.") leveraged convolutional feature extractors and region proposal networks to achieve state-of-the-art performance. In parallel, one-stage approaches such as YOLO[9](https://www.nature.com/articles/s41598-026-44468-7#ref-CR9 "Redmon, J. et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 779–788. (2016).") framed detection as direct regression, achieving higher speed but often sacrificing accuracy. While these CNN-based frameworks greatly advanced general-purpose detection, their reliance on anchors, post-processing, and dataset-specific tuning limited their robustness and clinical applicability in medical imaging tasks.

More recently, transformer-based architectures have redefined object detection. DETR[10](https://www.nature.com/articles/s41598-026-44468-7#ref-CR10 "Carion, N. et al. End-to-End Object Detection with Transformers. In European Conference on Computer Vision (ECCV). (2020).") formulates detection as a set prediction problem, eliminating handcrafted anchors and non-maximum suppression (NMS) by directly learning object-query correspondences. Despite its elegance, DETR suffers from slow convergence, high sensitivity to hyperparameters such as decoder depth, and unstable training, which is particularly problematic for medical images where robustness and reliability are paramount. Extensions such as DN-DETR[11](https://www.nature.com/articles/s41598-026-44468-7#ref-CR11 "Li, F. et al. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). (2022)."), DINO[12](https://www.nature.com/articles/s41598-026-44468-7#ref-CR12 "Zhang H , Li F , Liu S , et al. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection[J]. arXiv e-prints, 2022. DOI:10.48550/arXiv.2203.03605."), and Co-DETR[13](https://www.nature.com/articles/s41598-026-44468-7#ref-CR13 "Zong, Z., Song, G. & Liu, Y. Detrs with collaborative hybrid assignments training[C]//Proceedings of the IEEE/CVF international conference on computer vision. 6748–6758. (2023).") address aspects of convergence and query optimization, yet they leave unresolved challenges in feature propagation stability and over-decoding across layers. To this end, this study proposes 4DO-DETR, a novel method based on DN-DETR for object detection in CT images of OM. Specifically, by incorporating deformable attention into DN-DETR (used with the DAB module, which makes it equivalent to the subsequent DN-DAB-DETR)[11](https://www.nature.com/articles/s41598-026-44468-7#ref-CR11 "Li, F. et al. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). (2022)."),[14](https://www.nature.com/articles/s41598-026-44468-7#ref-CR14 "Zhu X , Su W , Lu L, et al. Deformable detr: Deformable transformers for end-to-end object detection[J]. arXiv preprint arXiv:2010.04159, 2020. DOI:10.48550/arXiv.2010.04159."),[15](https://www.nature.com/articles/s41598-026-44468-7#ref-CR15 "Liu S , Li F , Zhang H ,et al. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR[J]. arXiv preprint arXiv:2201.12329, 2022. DOI:10.48550/arXiv.2201.12329."), we designed denser connections without increasing the time or space complexity[14](https://www.nature.com/articles/s41598-026-44468-7#ref-CR14 "Zhu X , Su W , Lu L, et al. Deformable detr: Deformable transformers for end-to-end object detection[J]. arXiv preprint arXiv:2010.04159, 2020. DOI:10.48550/arXiv.2010.04159."),[16](https://www.nature.com/articles/s41598-026-44468-7#ref-CR16 "Lin, T. Y. et al. Feature Pyramid Networks for Object Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2117–2125. (2017)."). These denser connections enhance the modeling of information interaction between adjacent encoder and decoder layers.

To prevent excessive entropy increase in the model, we employ a novel high-stability loss function that incorporates an entropy-balance term with a weight of 0.05. Together, these components form a new method for object detection in OM CT imaging that aims to address the challenges described above. The contributions of this study are as follows:

A method showcasing denser residuals without increasing memory consumption is proposed. Specifically, the original skip connections within decoder layers are retained, and each skip connection now extends across one additional decoder layer, effectively avoiding excessive decoding.

Entropy balancing is introduced, and a novel loss function that maintains the focal loss’s capability to handle class imbalance issues while enhancing its stability is designed. This significantly improves the model’s performance in detecting OM lesions.

For validation, we constructed and annotated a dataset of CT images for OM, providing a solid foundation for training and evaluating the model. With only 41.412 M parameters and 62.953 GFLOPs, 4DO-DETR achieves an mAP of 0.568, mAP50 of 0.975, and mAP75 of 0.600 on the Otitis1415 dataset. In addition, on the brain-tumor dataset, 4DO-DETR achieves 0.314 mAP and 0.468 mAP50. The model also demonstrates strong robustness: owing to the denser skip connections, the model is explicitly capable of automatically selecting the appropriate number of encoder and decoder layers, preventing over-encoding and over-decoding when processing grayscale images.

Related work

In object detection, classical research methods can be largely classified into three categories based on their detection processes and structural features: methods based on traditional manual features, one-stage detectors, and two-stage detectors. As an end-to-end detection model, DETR is based on the Transformer architecture and has recently demonstrated unique advantages in object detection.

Before the prevalence of deep learning, object detection mainly relied on handcrafted-feature methods. Among these, DPM and HOG are the most well-known[5](https://www.nature.com/articles/s41598-026-44468-7#ref-CR5 "Dalal, N. & Triggs, B. Histograms of oriented gradients for human detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 2005, 886–893, 1, (2005). https://doi.org/10.1109/CVPR.2005.177"),[6](https://www.nature.com/articles/s41598-026-44468-7#ref-CR6 "Felzenszwalb PF, et al. Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell. 2010 Sep;32(9):1627-45.").

Methods based on traditional handcrafted features

Proposed by Felzenszwalb et al., DPM is an object detection method that conceptualizes a complete object as consisting of multiple components with relatively stable spatial layouts[6](https://www.nature.com/articles/s41598-026-44468-7#ref-CR6 "Felzenszwalb PF, et al. Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell. 2010 Sep;32(9):1627-45."). Detection of the entire object is achieved by detecting these individual components. During the training phase, DPM learns an explicit model that includes multiple components, representing the relative spatial layout and appearance of each component. During the detection phase, DPM searches for regions in the test image that resemble the model, with each component corresponding to a specific search area. When the sum of scores of all components exceeds a certain threshold, the presence of an object is confirmed. Due to its flexible modeling of components, DPM exhibits high robustness and can detect objects in various poses and viewpoints.

HOG characterizes objects by statistically analyzing histograms of local gradient or edge orientations in an image[5](https://www.nature.com/articles/s41598-026-44468-7#ref-CR5 "Dalal, N. & Triggs, B. Histograms of oriented gradients for human detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 2005, 886–893, 1, (2005). https://doi.org/10.1109/CVPR.2005.177"). This technique captures the local shape of an image and demonstrates robustness to changes in illumination. The main concept involves calculating the pixel gradients within a sliding window across the image and then accumulating these gradients into a set number of orientation bins to create a gradient orientation histogram. This histogram records the distribution of gradient orientations in each window, thereby forming a feature descriptor. HOG features are typically combined with SVM classifiers to select the bounding boxes that best represent the content and shape information of the original image.

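The histogram step described above can be illustrated with a minimal pure-Python sketch. It is a simplification under stated assumptions: hard assignment of each gradient to a single bin, unsigned orientations in [0°, 180°), 9 bins, and no block normalization. The function name `orientation_histogram` is introduced here for illustration only.

```python
import math

def orientation_histogram(patch, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations
    for one cell, given as a list of rows of grey values."""
    height, width = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    # Central differences on interior pixels only (borders are skipped).
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

# A vertical edge: intensity rises from left to right in every row.
patch = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
hist = orientation_histogram(patch)
# All four interior pixels have a purely horizontal gradient, so they vote into bin 0.
```

A full HOG descriptor would concatenate such histograms over many cells and normalize them over overlapping blocks before passing them to the classifier.
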
Single-stage detectors

As the application of deep learning technology has progressed, one-stage detectors have emerged, where typical examples include YOLO and SSD[9](https://www.nature.com/articles/s41598-026-44468-7#ref-CR9 "Redmon, J. et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 779–788. (2016)."),[17](https://www.nature.com/articles/s41598-026-44468-7#ref-CR17 "Liu, W. et al. SSD: Single shot multibox detector[C]//Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer International Publishing, 21–37. (2016).").

YOLO transforms object detection into a regression problem and handles the entire process in a single network. The model takes an image as input and divides it into a fixed grid. Each grid cell generates a set number of bounding boxes, each defined by its position, size, and confidence score[9](https://www.nature.com/articles/s41598-026-44468-7#ref-CR9 "Redmon, J. et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 779–788. (2016)."). This confidence score reflects whether the bounding box contains an object and the accuracy of the detection. Additionally, each grid cell predicts the categories of potential objects it might contain.
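
As a small illustration of the grid formulation, the sketch below computes which grid cell contains an object's center and how many values the network outputs per image. The defaults (a 7 × 7 grid, 2 boxes per cell, 20 classes) follow the original YOLO configuration; the function names are hypothetical and box centers are assumed to be normalized to [0, 1].

```python
def responsible_cell(cx, cy, grid_size=7):
    """Return (row, col) of the grid cell that contains a normalized box center."""
    col = min(int(cx * grid_size), grid_size - 1)  # clamp centers lying exactly on 1.0
    row = min(int(cy * grid_size), grid_size - 1)
    return row, col

def output_size(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Each cell predicts boxes_per_cell * (x, y, w, h, confidence) plus class scores."""
    return grid_size * grid_size * (boxes_per_cell * 5 + num_classes)

cell = responsible_cell(0.5, 0.5)   # center of the image
n_outputs = output_size()
```
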

SSD (Single Shot MultiBox Detector) builds on the strengths of YOLO, enabling the detection of multiple objects in images by simultaneously predicting multiple bounding boxes of different positions and scales, along with their corresponding category probabilities, in a single forward propagation. It uses a series of feature maps of different sizes for prediction, allowing the algorithm to detect objects at different scales. This capability ensures the effective detection of all target objects within an image[17](https://www.nature.com/articles/s41598-026-44468-7#ref-CR17 "Liu, W. et al. SSD: Single shot multibox detector[C]//Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer International Publishing, 21–37. (2016).").

Two-stage detectors

Compared to the relatively superficial detection of one-stage detectors, two-stage detectors offer a more refined approach and yield better performance. Faster R-CNN and Sparse R-CNN are representative examples of this approach.

Faster R-CNN is a popular object detection network that begins by using a CNN to extract features from the entire image[8](https://www.nature.com/articles/s41598-026-44468-7#ref-CR8 "Ren S, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans Pattern Anal Mach Intell. 2017 Jun;39(6):1137-1149."). It then employs a Region Proposal Network (RPN) to scan these shared feature maps. The RPN slides windows across the feature map and uses an anchor mechanism to predict the location and probability of potential targets. The candidate regions are then standardized to uniform sizes through a Region of Interest Pooling layer (RoI Pooling) for further processing. Finally, these fixed-size features are sent through a series of fully connected layers for classification and bounding-box regression, yielding the category and location of each candidate region. This well-defined, step-by-step process gives two-stage detectors a significant advantage in accuracy, making them suitable for applications that require high precision.
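
The anchor mechanism can be sketched as follows. At each sliding-window position, the RPN considers a fixed set of reference boxes; the defaults below (three scales and three aspect ratios, giving nine anchors) follow the Faster R-CNN paper, while the function name and the corner-format output are assumptions made for this example.

```python
import math

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchor boxes (x1, y1, x2, y2) centered at (cx, cy).
    Each anchor has area scale**2 and aspect ratio height / width = ratio."""
    boxes = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            boxes.append((cx - width / 2, cy - height / 2,
                          cx + width / 2, cy + height / 2))
    return boxes

anchors = anchors_at(100.0, 100.0)  # nine reference boxes around one location
```

The RPN then predicts, for every anchor, an objectness score and four offsets that refine the anchor into a proposal.
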

Sparse R-CNN addresses the challenge posed by sparse targets and the model’s dense predicted boxes through an efficient strategy: it starts with a sparse set of predefined proposals and optimizes these proposals directly to predict the final detection results. This method significantly cuts down on unnecessary computations while improving the efficiency and accuracy of detection[18](https://www.nature.com/articles/s41598-026-44468-7#ref-CR18 "Sun, P. et al. Sparse r-cnn: End-to-end object detection with learnable proposals[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 14454–14463. (2021).").

End-to-end detection model — DETR

DETR (Detection Transformer) introduces a novel object detection framework by integrating the Transformer model into the field. It departs from traditional methods by eliminating anchor boxes and Non-Maximum Suppression (NMS), opting instead for an end-to-end approach[10](https://www.nature.com/articles/s41598-026-44468-7#ref-CR10 "Carion, N. et al. End-to-End Object Detection with Transformers. In European Conference on Computer Vision (ECCV). (2020)."). First, image features are extracted using a CNN and combined with positional encodings before being fed into the Transformer. The Transformer’s encoder learns global dependencies among features using a self-attention mechanism, while the decoder takes a fixed number of learned object queries and predicts a category and bounding box for each. Each decoder output corresponds directly to one predicted target, and predictions are matched one-to-one to ground-truth objects with the Hungarian algorithm to compute a bipartite matching loss at the output layer. This direct approach enables identification and localization of targets in images without relying on complex post-processing steps.
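
The one-to-one assignment between predictions and ground-truth objects can be illustrated with a toy example. The sketch below finds the minimum-cost assignment by exhaustive search, which is only a stand-in for the Hungarian algorithm and is practical only for a handful of boxes; real implementations typically call an efficient solver such as SciPy's `linear_sum_assignment`. The cost values and the function name `match_predictions` are made up for illustration.

```python
from itertools import permutations

def match_predictions(cost):
    """cost[i][j] is the cost of assigning prediction i to ground-truth j
    (at least as many predictions as ground-truth objects).
    Returns the (prediction, ground-truth) pairs with minimum total cost."""
    n_pred, n_gt = len(cost), len(cost[0])
    best_total, best_pairs = float("inf"), []
    # Try every way of picking one distinct prediction per ground-truth object.
    for pred_ids in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(pred_ids))
        if total < best_total:
            best_total = total
            best_pairs = [(p, g) for g, p in enumerate(pred_ids)]
    return sorted(best_pairs), best_total

# Three predictions, two ground-truth objects (hypothetical matching costs).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
pairs, total = match_predictions(cost)
# Prediction 2 is left unmatched and would be supervised as "no object".
```
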

DN-DETR uses focal loss to address class imbalance and improves the prediction of output box positions by learning relative offsets through denoising. The denoising queries are built from noised ground-truth boxes and therefore bypass Hungarian matching, which stabilizes the matching process and accelerates model convergence[11](https://www.nature.com/articles/s41598-026-44468-7#ref-CR11 "Li, F. et al. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). (2022)."). Deformable-DETR enhances detection precision across different target sizes with a multi-scale deformable attention mechanism[14](https://www.nature.com/articles/s41598-026-44468-7#ref-CR14 "Zhu X , Su W , Lu L, et al. Deformable detr: Deformable transformers for end-to-end object detection[J]. arXiv preprint arXiv:2010.04159, 2020. DOI:10.48550/arXiv.2010.04159."). DAB-DETR improves the query strategy with dynamic anchor boxes that adapt to specific input images, creating a more flexible and effective framework to guide the training and inference of the DETR model. This enhancement boosts convergence speed and detection performance[15](https://www.nature.com/articles/s41598-026-44468-7#ref-CR15 "Liu S , Li F , Zhang H ,et al. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR[J]. arXiv preprint arXiv:2201.12329, 2022. DOI:10.48550/arXiv.2201.12329."). Subsequent advancements such as DINO[12](https://www.nature.com/articles/s41598-026-44468-7#ref-CR12 "Zhang H , Li F , Liu S , et al. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection[J]. arXiv e-prints, 2022. DOI:10.48550/arXiv.2203.03605."), Co-DETR[13](https://www.nature.com/articles/s41598-026-44468-7#ref-CR13 "Zong, Z., Song, G. & Liu, Y. Detrs with collaborative hybrid assignments training[C]//Proceedings of the IEEE/CVF international conference on computer vision. 6748–6758. (2023)."), and RT-DETR[19](https://www.nature.com/articles/s41598-026-44468-7#ref-CR19 "Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Chen, J. (2024). Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16965–16974).") have further improved the accuracy of the DETR series.

While these methods have made significant progress in object detection, they encounter several common challenges when applied to the clinical detection of OM.

Firstly, clinical settings demand high model accuracy for precise identification of affected areas and their specific pathological conditions[20](https://www.nature.com/articles/s41598-026-44468-7#ref-CR20 "Cai, Y. et al. Investigating the use of a two-stage attention-aware convolutional neural network for the automated diagnosis of otitis media from tympanic membrane images: a prediction model development and validation study[J]. BMJ open. 11 (1), e041139 (2021).").

Secondly, despite the strong end-to-end capabilities of DINO and Co-DETR, their slow convergence, lengthy training times, and sensitivity to hyperparameters hinder their practical use in clinical contexts.

Thirdly, OM lesions often appear as small and complex formations in images, such as inflammation, exudation, or tympanic membrane redness, which makes them difficult to identify[21](https://www.nature.com/articles/s41598-026-44468-7#ref-CR21 "Schilder, A. G. M. et al. Otitis media[J]. Nat. reviews Disease primers. 2 (1), 1–18 (2016)."). These manifestations are typically subtle and varied, unlike the clear and singular objects in standard object detection. Therefore, more advanced image processing techniques and deep learning models are required to identify these pathological features[20](https://www.nature.com/articles/s41598-026-44468-7#ref-CR20 "Cai, Y. et al. Investigating the use of a two-stage attention-aware convolutional neural network for the automated diagnosis of otitis media from tympanic membrane images: a prediction model development and validation study[J]. BMJ open. 11 (1), e041139 (2021).").

Method

This study introduces a novel half-dense transformer architecture and a new loss function, building on the advantages of DAB-DETR and Deformable-DETR. Section 3.1 outlines the model’s framework, while Section 3.2 discusses the balance between performance and space and time cost, covering concepts such as large residual connections and the half-dense transformer architecture. Section 3.3 details the design of the entropy-balanced loss function.

Our model

Figure 1 shows the overall architecture of our model, a DETR with Denoising-task, Deformable-attention, Dynamic-anchor-boxes and Denser-connection. As a member of the DETR series, the model consists of three main components: a backbone network, a transformer module, and a set of prediction heads[10](https://www.nature.com/articles/s41598-026-44468-7#ref-CR10 "Carion, N. et al. End-to-End Object Detection with Transformers. In European Conference on Computer Vision (ECCV). (2020)."),[11](https://www.nature.com/articles/s41598-026-44468-7#ref-CR11 "Li, F. et al. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). (2022)."),[14](https://www.nature.com/articles/s41598-026-44468-7#ref-CR14 "Zhu X , Su W , Lu L, et al. Deformable detr: Deformable transformers for end-to-end object detection[J]. arXiv preprint arXiv:2010.04159, 2020. DOI:10.48550/arXiv.2010.04159."),[15](https://www.nature.com/articles/s41598-026-44468-7#ref-CR15 "Liu S , Li F , Zhang H ,et al. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR[J]. arXiv preprint arXiv:2201.12329, 2022. DOI:10.48550/arXiv.2201.12329."),[22](https://www.nature.com/articles/s41598-026-44468-7#ref-CR22 "Huang, G. et al. Densely connected convolutional networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. : 4700–4708. (2017)."). The backbone employs a convolutional neural network (CNN) to extract multi-scale visual features, which are augmented with positional encodings and fed into the transformer. The transformer comprises deformable encoder and decoder layers, while the prediction head group consists of multiple feed-forward networks (FFNs), each corresponding to an output head[23](https://www.nature.com/articles/s41598-026-44468-7#ref-CR23 "Ngombu, S. et al. Advances in artificial intelligence to diagnose otitis media: State of the art review[J]. Otolaryngology–Head Neck Surg. 168 (4), 635–642 (2023)."),[24](https://www.nature.com/articles/s41598-026-44468-7#ref-CR24 "Rosenblatt, F. The perceptron: a probabilistic model for information storage and organization in the brain[J]. Psychol. Rev. 65 (6), 386 (1958).").

Beyond depicting the architectural components, Fig. 1 is designed to provide an intuitive visualization of model behavior during training. Specifically, the figure highlights the prediction refinement process of the deformable transformer decoder. Before training, decoder queries generate a large number of coarse and spatially scattered candidate bounding boxes with low confidence, reflecting the model’s initial uncertainty and lack of semantic alignment. After training, through successive decoder layers, these queries are progressively refined and converge to a small number of high-confidence predictions that align well with the ground-truth lesion regions, resulting in accurate localization.

To further improve interpretability, the functional roles of different modules are explicitly annotated in Fig. 1: yellow blocks denote deformable transformer encoder layers, peach-colored blocks represent deformable transformer decoder layers, green blocks indicate normalization layers, and blue blocks correspond to feed-forward networks (FFNs). These annotations help clarify how information flows through the network and how iterative decoder refinement contributes to precise lesion detection.

Fig. 1. The proposed overall framework for the object detection of otitis media.

Backbone network

The officially pretrained ResNet50 is used as the feature extractor to obtain the feature maps of the target images. Following the fine-tuning strategy of DN-DETR, our feature extractor is fine-tuned during training with a learning rate of 1 × 10⁻⁵, allowing it to better adapt to our dataset and enhance model compatibility. For positional encoding, a fixed-length encoding is employed to represent spatial positions, with each position corresponding to an encoding result. Specifically, for a given position _pos_ and dimension _i_, the positional encoding is defined as follows:

$$PE\left(pos,2i\right)=\sin\left(pos/{10000}^{2i/{d}_{model}}\right)$$

(1)

$$PE\left(pos,2i+1\right)=\cos\left(pos/{10000}^{2i/{d}_{model}}\right)$$

(2)
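
Equations (1) and (2) can be evaluated directly. The following sketch computes the encoding vector for a single position in plain Python; the function name `positional_encoding` is introduced here for illustration, and the loop variable `k` plays the role of the even index 2i.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position: sine on even indices (Eq. 1),
    cosine on odd indices (Eq. 2)."""
    pe = [0.0] * d_model
    for k in range(0, d_model, 2):              # k = 2i
        angle = pos / (10000 ** (k / d_model))
        pe[k] = math.sin(angle)
        if k + 1 < d_model:
            pe[k + 1] = math.cos(angle)
    return pe

pe_origin = positional_encoding(0, 8)   # alternating 0.0 (sine) and 1.0 (cosine)
pe_next = positional_encoding(1, 8)
```

Low indices oscillate quickly with position and high indices slowly, so each position receives a distinct pattern across the d_model dimensions.
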

The transformer architecture in our model

The original transformer is composed of an encoder and a decoder, each comprising six layers. Each encoder layer includes a multi-head self-attention layer, an FFN, and two normalization layers (referred to below as regularization layers). After positional encoding, the data first pass through the multi-head self-attention layer; the result is summed with the layer input and passed through regularization layer 1. That output then passes through the FFN, is summed with the output of regularization layer 1 (the value before the FFN), and is normalized again. In other words, each encoder layer has four sub-layers but only two skip connections, so it is impossible to skip over the two regularization layers. However, excessive regularization can weaken the model’s learning ability rather than enhance it. Therefore, an excessive number of encoder layers can degrade performance. Similarly, the mismatch between the number of skip connections and the number of sub-layers in each decoder layer can also reduce model performance when there are too many decoder layers.
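
As a compact restatement of the encoder layer described above, the two residual steps can be written as follows. The symbols are introduced here for illustration and are not taken from the paper: $x$ is the layer input, $\mathrm{MHSA}$ the multi-head self-attention sub-layer, and $\mathrm{LN}_1$, $\mathrm{LN}_2$ the two normalization layers.

$$x' = \mathrm{LN}_1\left(x + \mathrm{MHSA}(x)\right), \qquad y = \mathrm{LN}_2\left(x' + \mathrm{FFN}(x')\right)$$

Both skip connections end just before a normalization layer, so every path from $x$ to $y$ still passes through $\mathrm{LN}_1$ and $\mathrm{LN}_2$; stacking $N$ such layers therefore applies $2N$ normalizations that no shortcut can bypass.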

DAB-DETR and Deformable-DETR emphasize using anchor boxes as queries and applying deformable attention to help DETR detect small objects. Building on this, this study tackles the performance degradation caused by over-regularization when too many decoder layers are stacked by introducing large residual connections that skip across entire encoder and decoder layers. Specifically, residual connections between adjacent encoder and decoder layers are added, thereby avoiding issues related to excessive encoding and decoding.

Output layer

Each decoder output (head) corresponds to an FFN as the prediction head, which is divided into two components[23](https://www.nature.com/articles/s41598-026-44468-7#ref-CR23 "Ngombu, S. et al. Advances in artificial intelligence to diagnose otitis media: State of the art review[J]. Otolaryngology–Head Neck Surg. 168 (4), 635–642 (2023)."),[24](https://www.nature.com/articles/s41598-026-44468-7#ref-CR24 "Rosenblatt, F. The perceptron: a probabilistic model for information storage and organization in the brain[J]. Psychol. Rev. 65 (6), 386 (1958)."):

Classification Head: This component predicts the category of the target. For each decoder output, the classification head produces a probability distribution over all categories in the dataset. This is typically achieved through a fully connected (linear) layer, which maps the decoder’s output features to the number of categories, followed by a softmax layer to generate a normalized probability distribution.

Bounding Box Regression Head: This component predicts the bounding box of the target. For each decoder output, the bounding box regression head generates four values, typically representing the center coordinates, width, and height of the target’s bounding box.
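
The post-processing implied by the two heads can be sketched in a few lines. The example below follows the description above: a softmax over class logits, and conversion of a predicted (center x, center y, width, height) box into corner coordinates. The function names and the sample values are hypothetical, and the linear layers that produce the logits and box values are omitted.

```python
import math

def softmax(logits):
    """Normalize class logits into a probability distribution."""
    peak = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

class_probs = softmax([2.0, 0.5, -1.0])          # hypothetical logits for three classes
corners = cxcywh_to_xyxy((0.5, 0.5, 0.2, 0.4))   # normalized box from the regression head
```
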

Large residual transformer architecture

The core idea of the large residual transformer architecture, which is similar to ResNet, is the implementation of “shortcut connections,” also known as skip connections[25](https://www.nature.com/articles/s41598-026-44468-7#ref-CR25 "He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778. (2016)."). These connections establish direct links between multiple network layers, allowing inputs to bypass certain layers (typically non-linear ones) and pass directly to subsequent layers. This structure mitigates the vanishing-gradient and accuracy-degradation problems that often arise when training deep networks, thereby making it possible to train deeper networks.

DenseNet (Densely Connected Convolutional Networks) connects the output of each layer to the input of every subsequent layer[22](https://www.nature.com/articles/s41598-026-44468-7#ref-CR22 "Huang, G. et al. Densely connected convolutional networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. : 4700–4708. (2017)."). This means that each layer receives feature maps from all preceding layers as input. This architectural approach offers several benefits, including enhanced feature preservation and propagation, mitigation of the vanishing-gradient problem, and improved efficiency in parameter utilization.

While the dense connections used in DenseNet can improve model performance to a certain extent, they suffer from poor stability and high computational cost. Umirzakova et al. proposed a combined approach, DRFDCAN, in which an entire set of modules with residual connections is further connected from the first to the last by the longest possible skip connection, similar to the longest skip connection in DenseNet[26](https://www.nature.com/articles/s41598-026-44468-7#ref-CR26 "Umirzakova, S., Mardieva, S., Muksimova, S., Ahmad, S. & Whangbo, T. Enhancing the Super-Resolution of Medical Images: Introducing the Deep Residual Feature Distillation Channel Attention Network for Optimized Performance and Efficiency. Bioengineering 10 (11), 1332. https://doi.org/10.3390/bioengineering10111332 (2023)."). However, DenseNet generally incurs higher memory costs than ResNet, and in some cases its performance may even be inferior. Motivated by these observations, we adopt a decoder design that preserves the efficiency of residual connections while enhancing information flow across decoder stages. Specifically, skip connections are maintained between adjacent decoder layers, and additional cross-decoder residual connections are introduced to facilitate stable feature propagation and iterative refinement. This design maintains the same spatial resolution and introduces minimal computational overhead, while improving both optimization stability and localization performance.
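
The cross-decoder residual described above can be read in more than one way. The toy sketch below shows one hypothetical reading, in which the output of each layer additionally receives the input of the preceding layer, so that the skip spans one extra layer. The names `run_decoder` and `extra_skip`, the scalar features, and the omission of attention and normalization are all simplifying assumptions and do not reproduce the paper's implementation.

```python
def run_decoder(x, layers, extra_skip=True):
    """Apply layers in sequence. Each layer keeps its usual residual;
    with extra_skip, the input of the previous layer is added as well."""
    previous_input, current = None, x
    for layer in layers:
        out = current + layer(current)          # original within-layer residual
        if extra_skip and previous_input is not None:
            out = out + previous_input          # hypothetical skip across one more layer
        previous_input, current = current, out
    return current

# Three identical toy "layers" acting on a scalar feature.
layers = [lambda h: 0.5 * h] * 3
plain = run_decoder(1.0, layers, extra_skip=False)
dense = run_decoder(1.0, layers)
```

Only one extra tensor (the previous layer's input) is kept alive and no parameters are added, which is consistent with the claim that the denser connections come at minimal additional cost.
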

Importantly, Fig. 2 goes beyond illustrating this architectural modification and provides a mechanism-level interpretation of the proposed decoder. The figure visualizes how information is progressively refined from input perception to final prediction. Starting from the input CT image, which does not explicitly highlight lesion regions, deformable cross-attention gradually shifts the model’s spatial focus toward otitis media areas. This attention-guided feature aggregation directly supports the subsequent anchor refinement process, leading to increasingly accurate localization results. By explicitly connecting decoder structure, attention behavior, and prediction refinement, Fig. 2 helps explain not only how the decoder is organized, but also why the proposed design leads to more stable and accurate detection.

Fig. 2

The deformable decoder layer structure. Rounded rectangles represent modules, while rectangles represent data or parameters.

Loss function

Our loss function consists of two components: the reconstruction loss and the traditional Hungarian loss. Specifically, the reconstruction loss includes the entropy-balanced focal loss, the L1 loss, and the GIoU loss.

Entropy-balanced focal loss

For a given pair of matches, we design an entropy-balanced focal loss to compute the loss for target categories. The focal loss calculates the difference between the predicted category probabilities from the model and the actual categories[27](https://www.nature.com/articles/s41598-026-44468-7#ref-CR27 "Leng Z , Tan M , Liu C ,et al.PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions[J]. arXiv preprint arXiv:2204.12511, 2022. https://doi.org/10.48550/arXiv.2204.12511."), offering greater robustness compared to cross-entropy loss. It enhances the model’s ability to handle class imbalance. Additionally, the entropy-balance term further enhances the stability of the loss function, avoiding the adverse effects on model stability introduced by large residual connections. In the loss function, $\sigma$ denotes the sigmoid function, Eq. (4) gives the cross-entropy computation, Eq. (9) adds the entropy-balance term, and $(1-{w}_{c})$ is the weight of the entropy-balance term. This loss is used to compute the model’s classification loss, encouraging the model to classify targets correctly while imposing higher penalties on less frequent classes. This addresses the difficulty of learning categories with few instances under class imbalance. The relevant formulas are as follows, where α = 0.25, γ = 2, and ${w}_{c}$ = 0.95:

$$p=\sigma\left(inputs\right)$$

(3)

$$CE\left(p,{y}_{t}\right)=-\left({y}_{t}\text{*}log\left(p\right)+\left(1-{y}_{t}\right)\text{*}log\left(1-p\right)\right)$$

(4)

$${p}_{t}=p\text{*}{y}_{t}+\left(1-p\right)\text{*}\left(1-{y}_{t}\right)$$

(5)

$${L}_{mod}=CE\left(p,{y}_{t}\right)\text{*}\left({\left(1-{p}_{t}\right)}^{\gamma}\right)$$

(6)

$${\alpha}_{t}=\alpha\text{*}{y}_{t}+\left(1-\alpha\right)\text{*}\left(1-{y}_{t}\right)$$

(7)

$${L}_{\alpha}={\alpha}_{t}\text{*}{L}_{mod}$$

(8)

$$L={w}_{c}\text{*}{L}_{\alpha}+\left(1-{w}_{c}\right)\text{*}{\left(1-{p}_{t}\right)}^{\left(\gamma+1\right)}$$

(9)

$${L}_{final}=\frac{1}{num\_boxes}{\sum}_{i}{L}_{i}$$

(10)
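As an illustration of how Eqs. (3) to (10) fit together, the following is a minimal pure-Python sketch for a single binary logit. It is not the authors' implementation: the function names, the `eps` clamp inside the logarithms, and the guard against `num_boxes = 0` are assumptions added for numerical safety.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def entropy_balanced_focal_loss(logit, y, alpha=0.25, gamma=2.0, w_c=0.95, eps=1e-12):
    """Per-element loss following Eqs. (3)-(9); y is 0 or 1."""
    p = sigmoid(logit)                                               # Eq. (3)
    ce = -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))  # Eq. (4), eps is an assumption
    p_t = p * y + (1 - p) * (1 - y)                                  # Eq. (5)
    l_mod = ce * (1 - p_t) ** gamma                                  # Eq. (6)
    alpha_t = alpha * y + (1 - alpha) * (1 - y)                      # Eq. (7)
    l_alpha = alpha_t * l_mod                                        # Eq. (8)
    # Eq. (9): focal term weighted by w_c, entropy-balance term weighted by (1 - w_c)
    return w_c * l_alpha + (1 - w_c) * (1 - p_t) ** (gamma + 1)

def final_loss(logits, targets, num_boxes, **kwargs):
    """Eq. (10): sum of per-element losses normalized by the number of boxes."""
    total = sum(entropy_balanced_focal_loss(z, y, **kwargs) for z, y in zip(logits, targets))
    return total / max(num_boxes, 1)  # guard against zero boxes (assumption)
```

Setting `w_c = 1.0` removes the second term of Eq. (9) and recovers the plain α-weighted focal loss, which makes the contribution of the entropy-balance term easy to isolate.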

Bounding box regression loss

For each target bounding box in the matching pairs, the model calculates a regression loss, which combines an L1 loss, measuring the difference between the predicted and ground-truth bounding boxes, with a scale-invariant IoU loss, such as the Generalized Intersection over Union (GIoU) loss, to improve the accuracy of bounding box regression[28](https://www.nature.com/articles/s41598-026-44468-7#ref-CR28 "Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression[J]. arXiv preprint arXiv:2205.12740, (2022)."),[29](https://www.nature.com/articles/s41598-026-44468-7#ref-CR29 "Zheng, Z. et al. Distance-IoU loss: Faster and better learning for bounding box regression[C]//Proceedings of the AAAI conference on artificial intelligence. 34(07): 12993–13000. (2020)."),[30](https://www.nature.com/articles/s41598-026-44468-7#ref-CR30 "Lin, T. Y. et al. Focal loss for dense object detection[C]//Proceedings of the IEEE international conference on computer vision. : 2980–2988. (2017)."). GIoU can be understood as the Intersection over Union (IoU) plus the ratio of the union area to the area of the enclosing box, minus 1; equivalently, it is the IoU minus the fraction of the enclosing box that is not covered by the union. Here, the enclosing box is the smallest axis-aligned rectangle that contains both boxes, and the uncovered part is the complement of the union of the two boxes when the enclosing box is taken as the universal set. GIoU equals 1 for identical boxes, and the corresponding loss is conventionally taken as $1-GIoU$. The L1 loss is the Manhattan distance, i.e., the sum of the absolute differences between each predicted box coordinate and its true value. The box regression loss and the GIoU are formulated as follows:

$${L}_{box}\left({b}_{i},{\widehat{b}}_{\sigma\left(i\right)}\right)={\lambda}_{iou}{L}_{iou}\left({b}_{i},{\widehat{b}}_{\sigma\left(i\right)}\right)+{\lambda}_{L1}{\left|\left|{b}_{i}-{\widehat{b}}_{\sigma\left(i\right)}\right|\right|}_{1}$$

(11)

$$GIoU\left(p,g\right)=IoU\left(p,g\right)+\frac{\left|p\cup g\right|}{\left(max\left({p}_{x2},{g}_{x2}\right)-min\left({p}_{x1},{g}_{x1}\right)\right)\text{*}\left(max\left({p}_{y2},{g}_{y2}\right)-min\left({p}_{y1},{g}_{y1}\right)\right)}-1,\quad {L}_{GIoU}\left(p,g\right)=1-GIoU\left(p,g\right)$$

(12)
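The box regression term can be sketched in a few lines of Python, using the standard GIoU definition (IoU plus union area over enclosing-box area, minus 1) and the conventional loss 1 − GIoU. Boxes are assumed to be `(x1, y1, x2, y2)` tuples. The default weights `lambda_iou=2.0` and `lambda_l1=5.0` follow a common DETR convention and are placeholders, since this section does not state the values used.

```python
def box_area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def giou(p, g):
    """Generalized IoU of two (x1, y1, x2, y2) boxes, in [-1, 1]."""
    inter = box_area((max(p[0], g[0]), max(p[1], g[1]),
                      min(p[2], g[2]), min(p[3], g[3])))
    union = box_area(p) + box_area(g) - inter
    # Smallest axis-aligned box enclosing both p and g
    enclose = box_area((min(p[0], g[0]), min(p[1], g[1]),
                        max(p[2], g[2]), max(p[3], g[3])))
    if union <= 0.0 or enclose <= 0.0:
        return 0.0  # degenerate boxes (assumption: treat as no overlap)
    return inter / union + union / enclose - 1.0

def box_loss(pred, gt, lambda_iou=2.0, lambda_l1=5.0):
    """Weighted sum of the GIoU loss (1 - GIoU) and the L1 distance, as in Eq. (11)."""
    l1 = sum(abs(a - b) for a, b in zip(pred, gt))
    return lambda_iou * (1.0 - giou(pred, gt)) + lambda_l1 * l1
```

For two disjoint unit boxes one unit apart, IoU is 0 but GIoU is −1/3, so the loss still carries a signal that depends on how far apart the boxes are; this is the motivation for using GIoU rather than plain IoU.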

Hungarian loss

The algorithm starts by marking zero elements within the cost matrix and searching for an augmenting path for each uncovered zero element. This path is a sequence alternating between unmatched and matched edges, with both the starting and ending points being unmatched elements. Once an augmenting path is found, the algorithm gradually constructs the optimal matching by updating the cost along the path. In our model, the Hungarian algorithm determines the optimal match between predicted objects and ground-truth objects by constructing the cost matrix and finding the matching scheme that minimizes the total matching cost. This process guides the loss calculation during the model training. The formula for its computation is as follows:

$${L}_{box}\left({b}_{i},{\widehat{b}}_{\sigma\left(i\right)}\right)={\lambda}_{giou}{L}_{giou}\left({b}_{i},{\widehat{b}}_{\sigma\left(i\right)}\right)+{\lambda}_{L1}{\left|\left|{b}_{i}-{\widehat{b}}_{\sigma\left(i\right)}\right|\right|}_{1}$$

(13)

$${L}_{match}\left({y}_{i},{\widehat{y}}_{\widehat{\sigma}\left(i\right)}\right)=-log{p}_{\widehat{\sigma}\left(i\right)}\left({c}_{i}\right)+{L}_{box}\left({b}_{i},{\widehat{b}}_{\widehat{\sigma}\left(i\right)}\right)$$

(14)

$${L}_{Hungarian}\left(y,\widehat{y}\right)={\sum}_{i=1}^{N}{L}_{match}\left({y}_{i},{\widehat{y}}_{\widehat{\sigma}\left(i\right)}\right)$$

(15)

In other words, the Hungarian loss function considers the loss of both category matching and bounding box annotation to comprehensively assess the loss of optimal matching.
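The matching step can be illustrated with a small brute-force sketch that enumerates all assignments and keeps the one with the lowest total cost of Eq. (15). This is a stand-in for the Hungarian algorithm, practical only for a handful of objects; DETR-style implementations typically call `scipy.optimize.linear_sum_assignment` instead. The box term is reduced here to a weighted L1 distance, and the data layout (class index plus box tuple, class-probability list plus box tuple) is an assumption made for illustration.

```python
import math
from itertools import permutations

def match_cost(gt, pred, lambda_l1=1.0):
    """Eq. (14) with the box term simplified to a weighted L1 distance (assumption)."""
    gt_class, gt_box = gt
    class_probs, pred_box = pred
    l1 = sum(abs(a - b) for a, b in zip(gt_box, pred_box))
    # Assumes class_probs[gt_class] > 0
    return -math.log(class_probs[gt_class]) + lambda_l1 * l1

def hungarian_loss(gts, preds):
    """Return (assignment, total cost); assignment[i] is the prediction matched to gts[i]."""
    best_cost, best_assignment = math.inf, None
    # Enumerate every one-to-one mapping from ground truths to predictions
    for assignment in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(gts[i], preds[j]) for i, j in enumerate(assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_assignment, best_cost
```

With two ground-truth objects and three predictions, the function returns the pairing that minimizes the summed classification and box costs; the unmatched prediction would be treated as background.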

Experiments and analysis

Experimental dataset and pre-processing

In this study, we annotated and released a new dataset, the Otitis1415 dataset (Footnote 1), which consists of CT images collected from 2014 to 2022 at the Xiangnan University Affiliated Hospital. The dataset includes 4,216 images of OM cases. These research activities, including data collection, annotation, and use, were reviewed by the ethics committee of the Xiangnan University Affiliated Hospital (Ethics Approval Number K2022-015-01), and all data have been de-identified. All images in the Otitis1415 dataset have undergone privacy removal, so that no private information is revealed and no harm is caused to anyone.

We randomly divided the dataset into two parts: 3,280 images for the training set, used to train the model, and the remaining 936 images for the validation set, used to evaluate the model’s performance. Each image in the dataset is annotated with a ground truth box indicating the position of the middle ear. The labels indicate whether the middle ear is diseased (label 1) or not diseased (label 0). In the training set, approximately 66.7% of the targets are labeled as 1, while in the validation set, about 65.8% are labeled as 1. No modifications were made to the dataset during the experiment.
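A random split of this kind can be reproduced with a few lines of Python. The sketch below uses the image counts stated above; the fixed seed and the use of integer image IDs are assumptions for illustration and do not reflect the split actually used.

```python
import random

def split_dataset(image_ids, n_train, seed=0):
    """Shuffle the IDs with a fixed seed and cut them into training and validation sets."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seed is a hypothetical choice for reproducibility
    return ids[:n_train], ids[n_train:]

# 4,216 images in total: 3,280 for training, the remaining 936 for validation
train_ids, val_ids = split_dataset(range(4216), n_train=3280)
```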

Experimental setup and evaluation

For the experimental environment, a platform equipped with a Tesla V100 GPU was used. The software environment included the PyTorch framework (version 2.1.1), CUDA 12.1, Python 3.9.18, torch 2.1.0 + cu121, torchvision 0.16.0 + cu121, and OpenCV 4.8.1.78. For the comparative experiments, the environment was configured according to the requirements specified in the open-source code of each method. During training, conventional random data augmentation techniques, such as scaling, flipping, and translation, were applied to the training data. A ResNet50 backbone and DN-DAB-DETR were chosen as the baseline, with the ResNet50 backbone initialized from the official PyTorch pre-trained weights. Single-scale training and testing were conducted with an input size of (3, 800, 800). The experimental parameters included a batch size of 2 and 40 training epochs.

During the evaluation phase, mean average precision (mAP) is used as the evaluation metric. Specifically, the intersection over the union (IoU) represents the ratio of the area of intersection to the area of the union of two rectangular frames. mAP50 refers to the measurement of mAP when the IoU threshold is 50%. Similarly, mAP75 refers to the measurement of mAP when the IoU threshold is 75%. In addition, mAP(medium) refers to the measurement of mAP when considering only targets with a pixel area between 32 × 32 and 96 × 96, and mAP(large) refers to the measurement of mAP when considering only targets with a pixel area greater than 96 × 96.
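The IoU thresholds and size ranges used by these metrics can be made concrete with a short sketch. Boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels; the "small" bucket and the handling of areas exactly on a boundary are assumptions, since only the medium and large ranges are defined above.

```python
def iou(a, b):
    """Ratio of intersection area to union area for two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, iou_threshold):
    """A prediction counts as a hit when its IoU reaches the threshold (0.50 for mAP50, 0.75 for mAP75)."""
    return iou(pred_box, gt_box) >= iou_threshold

def size_bucket(box):
    """Assign a target to the size range used by mAP(medium) and mAP(large)."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 * 32:
        return "small"
    if area <= 96 * 96:
        return "medium"
    return "large"
```

A prediction with IoU 0.6 against its ground truth therefore counts toward mAP50 but not toward mAP75, which is why mAP75 is the stricter localization measure.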

Comparison with the latest methods

To comprehensively evaluate the effectiveness of the proposed method, we compare it with representative detectors covering major architectural paradigms in modern object detection. Specifically, the selected baselines include the CNN-based two-stage detector Faster R-CNN; DETR-family transformer detectors such as DN-DETR, DINO, and Co-DETR; the sparse proposal–based two-stage query refinement transformer Sparse R-CNN; and efficiency-oriented detectors including YOLOv8 and RT-DETR.

For fair comparison, all baseline detectors are implemented strictly following the official configurations reported in their respective original papers. No additional hyperparameter re-tuning is performed on our medical dataset to avoid per-method optimization bias. The backbone architecture, input resolution, data augmentation strategy, optimizer, learning rate schedule, and pretraining settings are kept consistent with the original implementations. All models are initialized with the official pretrained backbone weights provided in their respective releases. Transformer-based detectors (DN-DETR, DINO, Co-DETR, Sparse R-CNN, and RT-DETR) follow their official optimizer settings and learning rate schedules. CNN-based detectors (Faster R-CNN and YOLOv8) are implemented according to their standard configurations. To ensure comparable training duration and convergence conditions, all models are trained for the same number of epochs under identical hardware environments[8](https://www.nature.com/articles/s41598-026-44468-7#ref-CR8 "Ren S, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans Pattern Anal Mach Intell. 2017 Jun;39(6):1137-1149. "),[10](https://www.nature.com/articles/s41598-026-44468-7#ref-CR10 "Carion, N. et al. End-to-End Object Detection with Transformers. In European Conference on Computer Vision (ECCV). (2020)."),[12](https://www.nature.com/articles/s41598-026-44468-7#ref-CR12 "[1] Zhang H , Li F , Liu S ,et al.DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection[J].arXiv e-prints, 2022.DOI:10.48550/arXiv.2203.03605."),[13](https://www.nature.com/articles/s41598-026-44468-7#ref-CR13 "Zong, Z., Song, G. & Liu, Y. Detrs with collaborative hybrid assignments training[C]//Proceedings of the IEEE/CVF international conference on computer vision. 6748–6758. (2023)."),[14](https://www.nature.com/articles/s41598-026-44468-7#ref-CR14 "Zhu X , Su W , Lu L, et al. Deformable detr: Deformable transformers for end-to-end object detection[J]. arXiv preprint arXiv:2010.04159, 2020. DOI:10.48550/arXiv.2010.04159."),[15](https://www.nature.com/articles/s41598-026-44468-7#ref-CR15 "Liu S , Li F , Zhang H ,et al. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR[J]. arXiv preprint arXiv:2201.12329, 2022. DOI:10.48550/arXiv.2201.12329."),[18](https://www.nature.com/articles/s41598-026-44468-7#ref-CR18 "Sun, P. et al. Sparse r-cnn: End-to-end object detection with learnable proposals[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 14454–14463. (2021)."),[19](https://www.nature.com/articles/s41598-026-44468-7#ref-CR19 "Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Chen, J. (2024). Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16965–16974).").

Fig. 3

Detection examples of the compared methods. Each row corresponds to one object detection algorithm: (a) ground truth, (b) baseline, (c) Deformable DETR, (d) Faster R-CNN, (e) DINO, (f) Co-DETR, (g) YOLOv8, (h) RT-DETR, (i) Ours.

Table 1 Experimental results for the compared state-of-the-art methods on the Otitis1415 dataset. The optimal results are marked in bold.

Figure 3 presents representative anomaly detection examples in which the proposed model consistently outperforms competing methods. As the figure shows, our model achieves superior performance in both category prediction accuracy and lesion localization accuracy. Table 1 further summarizes the quantitative advantages of our model over existing approaches. This superior performance stems from the use of denser residual connections, which prevent over-decoding and allow the positions of detection boxes to be predicted more accurately. Additionally, the entropy-balanced loss function effectively balances the model’s entropy, mitigating the entropy increase that accompanies denser connections.

Computational complexity analysis

Table 2 shows that our method achieves moderate computational complexity in terms of GFLOPs and parameter memory. As the table shows, our GFLOPs are equivalent to those of the baseline and lower than those of the other models. The number of parameters in our model is the same as in the baseline, slightly higher than that of Faster R-CNN, Deformable-DETR and RT-DETR, and lower than that of the other models. This indicates that our model requires less computation than the compared models while keeping a relatively small parameter count, thus exhibiting lower space-time costs.

Table 2 Computational complexity comparison with state-of-the-art methods on the Otitis1415 dataset. The optimal results are marked in bold.

Fig. 4

Detection capability vs. computational complexity requirements of our method compared with state-of-the-art algorithms. The subfigures represent: (a) mAP50 and mAP75 vs. GFLOPs; (b) mAP50 and mAP75 vs. parameter memory (M).

Figure 4 illustrates a comparison of our mAP indicator against the number of parameters and the computation load. Based on previous analysis, it’s evident that our model significantly improves performance without increasing the number of parameters or floating-point operations compared to the baseline. Additionally, when compared with models like DINO and Co-DETR, our model has fewer parameters, requires fewer floating-point operations, and delivers better performance. This indicates that we have achieved a balance and breakthrough between performance and computational load.

Ablation study

To validate our improved method, we conducted ablation experiments. In these experiments, “A” refers to adding denser residual connections between adjacent encoder layers and between adjacent decoder layers, while “B” refers to replacing the original loss during training with the new entropy-balanced loss, in which the entropy-balance term added to the focal loss is weighted by $1-{w}_{c}=0.05$. The results show that adding denser residual connections significantly improves the model’s mAP, mAP75, mAP(large), and mAP(medium) compared to the baseline. Furthermore, modifying the loss function led to a substantial increase in mAP50 compared to the intermediate model that only had denser residual connections added. The combination of “A” and “B” resulted in improvements across all performance metrics. The results are shown in Table 3.

Table 3 Ablation experiment with the proposed method. A: denser residual connection B: entropy-balanced loss.

Specifically, the baseline model has an mAP of 0.457, which serves as a starting point for comparison. Introducing denser residual connections (A) significantly improves the mAP to 0.502, indicating that the optimization of residual connections has a positive impact on model performance. Furthermore, the mAP50, which measures average precision at an IoU threshold of 50%, increases from the baseline’s 0.950 to 0.960, showing that the optimized model is more accurate in recognizing objects at lower IoU thresholds.

Moreover, when both denser residual connections (A) and the entropy-balanced loss function (B) are introduced together, the model’s mAP further increases to 0.568. This improvement suggests that these two optimization strategies complement each other and jointly improve model performance. Meanwhile, mAP50 rises to 0.975, indicating the model’s strong recognition capability at the 50% IoU threshold.

The optimized model also exhibits improvements in recognizing medium-sized (mAP(medium)) and large-sized (mAP(large)) objects. The baseline model achieves an mAP of 0.367 for medium-sized objects and 0.455 for large-sized objects. After introducing denser residual connections (A), mAP for medium-sized objects increases to 0.410, and for large-sized objects increases to 0.505. When the entropy-balanced loss function (B) is further added, mAP(medium) increases to 0.497 and mAP(large) increases to 0.571, indicating the effectiveness of combining these strategies, particularly in handling large-sized objects.

In summary, the introduction of denser residual connections and the entropy-balanced loss function leads to performance improvements across multiple evaluation metrics, demonstrating that these optimization strategies effectively enhance the model’s recognition capabilities.

Fig. 5

Comparison of experimental results and heat maps of our model with Baseline. GT bounding boxes are overlaid on the heatmaps for visual comparison of spatial focus and localization behavior. (a) Ground truth, (b) Predicted Box of Baseline, (c) Heat map of Baseline, (d) Predicted Box of Baseline + A, (e) Heat map of Baseline + A, (f) Predicted Box of Ours, (g) Heat map of Ours.

As shown in Fig. 5, introducing denser residual connections mitigates the risk of over-decoding, with the decoder playing a role in refining bounding boxes. Consequently, these denser residual connections facilitate more accurate positioning of predicted boxes, and the entropy-balanced loss function reduces the model’s entropy, ensuring the stability of predicted box positions and minimizing significant deviations.

To provide more intuitive evidence of this behavior, ground-truth (GT) bounding boxes are overlaid on the corresponding attention heatmaps in Fig. 5. From these visualizations, clear differences can be observed among the compared models. For the baseline model, the heatmap responses are spatially diffuse and scattered, indicating unstable and unfocused attention that frequently drifts away from the lesion region. After introducing denser residual connections (baseline + A), the heatmap responses become noticeably more concentrated and are largely centered around the GT bounding boxes, suggesting improved spatial focus. The proposed model goes further, producing highly concentrated and well-aligned heatmaps that closely match the GT lesion regions and exhibit minimal dispersion.

These qualitative results demonstrate that the proposed design enables more stable and lesion-aware feature aggregation, which directly contributes to improved localization accuracy. Combined with the entropy-balanced loss function, the model achieves both refined attention behavior and robust bounding box prediction, providing strong evidence of its effectiveness.

Table 4 Ablation experiment with the proposed method. A: denser residual connection B: entropy-balanced loss.

To further analyze our model’s performance, the general toolbox TIDE for the error assessment of object detection was used[31](https://www.nature.com/articles/s41598-026-44468-7#ref-CR31 "Bolya, D. et al. Tide: A general toolbox for identifying object detection errors[C]//Computer Vision–ECCV. : 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Springer International Publishing, 2020: 558–573. (2020)."). Table 4 presents results comparing classification errors (_Cls_), localization errors (_Loc_), combined errors (_Both_), duplicate detection errors (_Dupe_), background errors (_Bkg_), missed ground truths (_Miss_), and specific errors for the baseline and our method.

Our model exhibits a slightly higher _Cls_ of 2.28 compared to the baseline model’s 2.20, indicating that the baseline model performs marginally better in correctly identifying object categories. However, our model shows a lower _Loc_ of 2.28 compared to the baseline model’s 3.62, demonstrating improved accuracy in determining object positions. In terms of _Both_, our model’s error value is 0.02, lower than the baseline model’s 0.04, indicating that our model performs better than the baseline when classification and localization errors are considered simultaneously. Neither model exhibits _Dupe_ or _Bkg_ errors, indicating that both models avoid duplicate detections and background errors. Notably, our model achieves a _Miss_ value of 0.00, indicating that it did not miss any true objects, whereas the baseline model has an error value of 0.48, indicating some shortcomings in this respect. In terms of specific errors, our model records an error value of 1.04 for false positives (_FalsePos_), lower than the baseline model’s 1.42, indicating that our model is better at reducing false-positive detections. Similarly, our model has an error value of 3.41 for false negatives (_FalseNeg_), lower than the baseline model’s 5.36, indicating an improvement in reducing false negatives as well.

Significance test

To further validate the robustness of the proposed 4DO-DETR and to ensure that the observed performance improvements are not due to random fluctuations, we performed additional experiments focusing on statistical consistency and hyperparameter sensitivity. Specifically, we repeated the experiments multiple times under the same settings to verify the stability of the results. Furthermore, since batch size is one of the most influential hyperparameters in object detection training, we varied the batch size while keeping all other hyperparameters fixed. The experimental results are summarized in Table 5.

Table 5 Results of repeated experiments and experiments with different batch sizes for significance validation of 4DO-DETR.

As can be seen, 4DO-DETR consistently maintains high accuracy across different batch size configurations, outperforming existing state-of-the-art methods. These results demonstrate that the performance gain is not an artifact of hyperparameter tuning, but rather a robust and meaningful improvement.

Effect of entropy balance on training stability

Training stability is a critical factor for reliable optimization and reproducible performance, especially in medical image analysis where noisy labels and class imbalance are common. To investigate the role of the proposed entropy-balance mechanism in stabilizing the training process, we conduct a comparative analysis against standard Cross Entropy loss and Focal Loss, both of which are widely used in detection and classification tasks.

Unlike label smoothing or other label-level regularization techniques, the entropy-balance mechanism operates directly at the loss and gradient level. Label smoothing relaxes hard targets into soft labels, thereby reducing model overconfidence and improving generalization. However, it does not explicitly address instability during optimization and may weaken the supervision signal. In contrast, entropy balance does not alter the target distribution. Instead, it constrains excessive entropy induced by higher-order unstable terms in the loss landscape, resulting in smoother gradient updates and more stable training dynamics.

Quantitative results are summarized in Table 6. As shown, the proposed method achieves the lowest standard deviation of training loss (0.7488), which is significantly lower than that of Focal Loss (1.2347) and Cross Entropy (7.6188). This indicates that entropy balance effectively suppresses large loss fluctuations and leads to a more stable convergence trajectory throughout training. Such stability is particularly important for reproducibility and robustness in practical deployment scenarios. In addition to stability, the proposed method also yields a lower mean loss value during the convergence stage. Specifically, the average loss of our method is 4.1327, outperforming Focal Loss (4.5716) and Cross Entropy (8.7296). This demonstrates that entropy balance not only stabilizes training but also maintains strong optimization performance in the later stages of learning.
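Stability statistics of this kind can be computed from a logged loss history with the standard library. The sketch below is illustrative only: treating the last half of training as the "convergence stage" and using the population standard deviation over the whole run are assumptions, since the exact definitions are not stated here.

```python
from statistics import mean, pstdev

def loss_stability(losses, convergence_fraction=0.5):
    """Summarize a training-loss history: final loss, mean over the tail, and overall std."""
    start = int(len(losses) * (1 - convergence_fraction))
    tail = losses[start:]  # hypothetical definition of the convergence stage
    return {
        "final": losses[-1],
        "mean_convergence": mean(tail),
        "std": pstdev(losses),  # population standard deviation over the full run
    }
```

Applying the same function to the loss logs of each loss function gives directly comparable numbers, with a lower `std` corresponding to a smoother training curve.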

Figure 6 further illustrates the evolution of training loss curves under different loss functions. Cross Entropy exhibits sharp spikes and large oscillations, reflecting unstable gradient behavior during optimization. Focal Loss partially mitigates this issue but still shows noticeable fluctuations. In contrast, the proposed entropy-balance mechanism produces a smoother and more consistent loss curve, confirming its effectiveness in regulating training dynamics.

Overall, these results demonstrate that entropy balance improves training stability through direct regulation of loss entropy and gradient behavior, rather than by weakening label supervision. This distinction makes it fundamentally different from label smoothing and particularly suitable for tasks requiring robust and stable optimization, such as medical image detection.

Table 6 Comparison of training stability under different loss functions. The table reports the final loss, mean loss during the convergence stage, and standard deviation of the training loss for Cross Entropy, Focal Loss, and the proposed entropy-balance method. Lower standard deviation indicates more stable training dynamics.

Fig. 6

Comparison of training loss curves with different loss functions.

The proposed entropy-balance method produces a smoother and more stable loss trajectory compared with Cross Entropy and Focal Loss. Cross Entropy exhibits large fluctuations and sharp spikes during training, while Focal Loss partially alleviates this issue. The reduced oscillation of our method indicates improved stability of the optimization process.

Experiments on the brain tumor dataset

To further evaluate the generalization ability of the model, we conducted experiments on the brain-tumor dataset. Co-DETR and Deformable DETR performed poorly in the same environment, with none of their mAP values exceeding 0.1; they were therefore excluded from the comparison. The performance comparison results for Faster R-CNN, DINO, the baseline model, and our model are as follows:

Table 7 Experimental results for the compared state-of-the-art methods on the brain-tumor dataset. The optimal results are marked in bold.

Fig. 7

Detection examples of the compared methods. Each row represents one algorithm: (a) baseline, (b) Faster R-CNN, (c) DINO, (d) Ours.

Our model outperforms the others in anomaly detection, as demonstrated in Fig. 7, which shows examples with more accurate predicted categories and locations. Table 7 demonstrates the stronger generalization ability of our model, particularly on the mAP(large) metric, where it significantly outperforms the other models. For small-sized objects, our model’s performance is slightly below that of DINO. However, on the Otitis1415 dataset, our model surpasses DINO, suggesting a minor limitation in our model’s generalization capability for medium-sized objects.

Limitation analysis

Although 4DO-DETR achieves notable improvements in lightweight design and data efficiency, and demonstrates competitive performance under limited training data conditions, several limitations remain and warrant further investigation.

First, small-lesion detection remains a challenging scenario for the proposed model. One potential limitation lies in its sensitivity to scale imbalance in the training data. In the otitis media dataset, lesions of different sizes are unevenly distributed, with medium and large lesions constituting the majority of annotated samples, while small lesions are relatively underrepresented. Under such imbalanced conditions, the regression patterns learned by the model may become biased toward dominant object scales. Consequently, localization performance on small targets may be suboptimal, and predicted bounding boxes for small lesions may tend to be larger than the corresponding ground-truth annotations. This phenomenon is more consistent with a distribution-induced scale bias than with a structural deficiency of the detection framework itself.

In addition, based on qualitative assessments conducted with four clinical experts from the Affiliated Hospital of Xiangnan University, the proposed model shows insufficient performance in certain clinically challenging scenarios, particularly in cases involving blurred texture boundaries or ambiguous lesion morphology. Such cases are also frequently encountered in real-world clinical practice and often require additional contextual cues or multi-slice confirmation for reliable diagnosis.

Second, as a Transformer-based architecture, the proposed model still suffers from limited transparency in its decision-making process. Although attention-based mechanisms provide a degree of interpretability, it remains difficult to precisely trace or explain specific detection errors at the instance level. This limitation may affect clinical trust and highlights the need for more fine-grained interpretability mechanisms tailored to medical imaging tasks.

Finally, the current implementation relies on CUDA-based acceleration and therefore cannot be directly deployed on CPU-only edge devices. This constraint limits its applicability in resource-constrained clinical environments and underscores the necessity of further optimization toward more flexible and lightweight deployment strategies.

Addressing these limitations, particularly through deeper collaboration with clinical experts, improved interpretability mechanisms, and more efficient deployment strategies, will be important directions for future work.

Conclusions

In this paper, we studied otitis media detection in CT images and proposed a method called 4DO-DETR. The method builds on the DN-DAB-DETR framework by incorporating deformable attention modules, introducing a denser residual connection structure, and adopting a loss function with improved stability. The denser residual connections mitigate the over-decoding that arises when too many decoders are stacked, which reduces the model’s sensitivity to the number of decoders and improves both its performance and its stability. Replacing the focal loss with an entropy-balanced focal loss avoids the adverse effects that the denser residual connections might otherwise have on training, further improving detection performance. Experiments on the Otitis1415 dataset show that the proposed model outperforms DINO, RT-DETR, YOLOv8, and the baseline models across multiple evaluation metrics. Ablation studies show that the components are interdependent and contribute jointly to this performance.
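
For context, the standard focal loss (Lin et al., 2017) that the entropy-balanced variant replaces can be sketched as follows. This is only the baseline; the entropy-balanced formulation is not reproduced here. The function name and the defaults `alpha=0.25` and `gamma=2.0` follow common practice and are assumptions, not settings reported for 4DO-DETR.

```python
import math


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Standard binary focal loss for one predicted probability p and label y in {0, 1}.

    FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p_t is the probability assigned to the true class.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0) for fully confident wrong predictions.
    p_t = min(max(p_t, eps), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

The `(1 - p_t)^gamma` factor down-weights well-classified examples, so confident correct predictions contribute little to the total loss; with `gamma=0` the expression reduces to an alpha-weighted cross-entropy.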

Data availability

The datasets used and/or analysed during the current study are available at https://github.com/promisedong/Four-DO-DETR. In addition, the brain-tumor dataset is a public dataset released by Ultralytics at https://github.com/ultralytics/assets/releases/download/v0.0.0/brain-tumor.zip. The code that supports the findings of this study is available from the corresponding author upon reasonable request. For any additional information needed to analyze the data presented in this paper, please contact the corresponding author.

References

Monasta, L. et al. Burden of disease caused by otitis media: systematic review and global estimates. _PLoS One_ 7 (4), e36226 (2012).

Wang, Y. M. et al. Deep learning in automated region proposal and diagnosis of chronic otitis media based on computed tomography. _Ear Hear._ 41 (3), 669–677 (2020).

Duan, B., Guo, Z., Pan, L., Xu, Z. & Chen, W. Temporal bone CT-based deep learning models for differential diagnosis of primary ciliary dyskinesia related otitis media and simple otitis media with effusion. _Am. J. Otolaryngol._ 43 (6), 153–162 (2022).

Pham, V. T., Tran, T. T., Wang, P. C. & Chen, P. Y. EAR-UNet: A deep learning-based approach for segmentation of tympanic membranes from otoscopic images. _Artif. Intell. Med._ 112, 102015 (2021).

Dalal, N. & Triggs, B. Histograms of oriented gradients for human detection. In _IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05)_, San Diego, CA, USA, vol. 1, 886–893 (2005). https://doi.org/10.1109/CVPR.2005.177

Felzenszwalb, P. F. et al. Object detection with discriminatively trained part-based models. _IEEE Trans. Pattern Anal. Mach. Intell._ 32 (9), 1627–1645 (2010).

Girshick, R. et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587 (2014).

Ren, S. et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. _IEEE Trans. Pattern Anal. Mach. Intell._ 39 (6), 1137–1149 (2017).

Redmon, J. et al. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788 (2016).

Carion, N. et al. End-to-End Object Detection with Transformers. In European Conference on Computer Vision (ECCV) (2020).

Li, F. et al. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022).

Zhang, H. et al. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv preprint arXiv:2203.03605 (2022). https://doi.org/10.48550/arXiv.2203.03605

Zong, Z., Song, G. & Liu, Y. DETRs with collaborative hybrid assignments training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6748–6758 (2023).

Zhu, X. et al. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020). https://doi.org/10.48550/arXiv.2010.04159

Liu, S. et al. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. arXiv preprint arXiv:2201.12329 (2022). https://doi.org/10.48550/arXiv.2201.12329

Lin, T. Y. et al. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2117–2125 (2017).

Liu, W. et al. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I, 21–37 (Springer International Publishing, 2016).

Sun, P. et al. Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14454–14463 (2021).

Zhao, Y. et al. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16965–16974 (2024).

Cai, Y. et al. Investigating the use of a two-stage attention-aware convolutional neural network for the automated diagnosis of otitis media from tympanic membrane images: a prediction model development and validation study. _BMJ Open_ 11 (1), e041139 (2021).

Schilder, A. G. M. et al. Otitis media. _Nat. Rev. Dis. Primers_ 2 (1), 1–18 (2016).

Huang, G. et al. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708 (2017).

Ngombu, S. et al. Advances in artificial intelligence to diagnose otitis media: State of the art review. _Otolaryngol. Head Neck Surg._ 168 (4), 635–642 (2023).

Rosenblatt, F. The perceptron: a probabilistic model for information storage and organization in the brain. _Psychol. Rev._ 65 (6), 386 (1958).

He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778 (2016).

Umirzakova, S., Mardieva, S., Muksimova, S., Ahmad, S. & Whangbo, T. Enhancing the Super-Resolution of Medical Images: Introducing the Deep Residual Feature Distillation Channel Attention Network for Optimized Performance and Efficiency. _Bioengineering_ 10 (11), 1332. https://doi.org/10.3390/bioengineering10111332 (2023).

Leng, Z. et al. PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions. arXiv preprint arXiv:2204.12511 (2022). https://doi.org/10.48550/arXiv.2204.12511

Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv preprint arXiv:2205.12740 (2022).

Zheng, Z. et al. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence 34 (07), 12993–13000 (2020).

Lin, T. Y. et al. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2980–2988 (2017).

Bolya, D. et al. TIDE: A general toolbox for identifying object detection errors. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III, 558–573 (Springer International Publishing, 2020).

Acknowledgements

This research was funded by the Natural Science Foundation of Hunan Province (No. 2023JJ50392), the Scientific Research Fund of Hunan Provincial Education Department (Nos. 23A0588 and 22A0587), the Scientific Research Project of Hunan Provincial Health Commission (No. 202207012466), the Guidance Fund of Hunan Province in Clinical Medical Technology Innovation (No. 2021SK52202), and the Program for Science and Technology Innovative Research Team in Higher Educational Institutions of Hunan Province.

Author information

Authors and Affiliations

Hunan Engineering Research Center of Advanced Embedded Computing and Intelligent Medical Systems, Xiangnan University, Chenzhou, 423300, China

Xinyue Zhao, Haowen Zhang, Dong Liu & Guanxiong Lei

School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, 230026, China

Xinyue Zhao

The University of Hong Kong, Hong Kong, China

Haowen Zhang

School of Computer and Artificial Intelligence, Xiangnan University, Chenzhou, 423300, China

Dong Liu

Key Laboratory of Medical Imaging and Artificial Intelligence of Hunan Province, Xiangnan University, Chenzhou, 423300, China

Dong Liu & Guanxiong Lei

Clinical College, Xiangnan University, Chenzhou, 423300, China

Guanxiong Lei

Contributions

X.Z. and H.Z. wrote the main manuscript text. D.L. revised the manuscript. X.Z. and H.Z. performed the experiments. X.Z., D.L. and G.L. analyzed the experimental results. G.L. provided the experimental data. D.L. and G.L. provided suggestions on the experimental setup as well as revised the manuscript critically. All authors reviewed the manuscript.

Corresponding author

Correspondence to Guanxiong Lei.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhao, X., Zhang, H., Liu, D. _et al._ 4DO-DETR for otitis media detection. _Sci Rep_ 16, 18264 (2026). https://doi.org/10.1038/s41598-026-44468-7

Received: 24 September 2024

Accepted: 11 March 2026

Published: 12 June 2026

Version of record: 12 June 2026

DOI: https://doi.org/10.1038/s41598-026-44468-7
