<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>https://www.scipedia.com/wd/index.php?action=history&amp;feed=atom&amp;title=Zhou_Yuan_2024a</id>
		<title>Zhou Yuan 2024a - Revision history</title>
		<link rel="self" type="application/atom+xml" href="https://www.scipedia.com/wd/index.php?action=history&amp;feed=atom&amp;title=Zhou_Yuan_2024a"/>
		<link rel="alternate" type="text/html" href="https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;action=history"/>
		<updated>2026-05-10T14:53:12Z</updated>
		<subtitle>Revision history for this page on the wiki</subtitle>
		<generator>MediaWiki 1.27.0-wmf.10</generator>

	<entry>
		<id>https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302980&amp;oldid=prev</id>
		<title>Rimni: /* 2.2. Attention mechanisms */</title>
		<link rel="alternate" type="text/html" href="https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302980&amp;oldid=prev"/>
				<updated>2024-06-10T11:35:07Z</updated>
		
		<summary type="html">&lt;p&gt;‎&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;2.2. Attention mechanisms&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;tr style='vertical-align: top;' lang='en'&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 11:35, 10 June 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l101&quot; &gt;Line 101:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 101:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;where C, W and H represent the number of channels, width and height of the feature map, respectively. Here, AvgPool and StdPool represent the global average pooling and the global standardized difference pooling, respectively. ⊙ stands for broadcast element-wise multiplication, and ⊕ stands for broadcast element-wise summation. &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;three &lt;/del&gt;branches are used to capture the interactions between different dimensions and channels. A substitution operation is used in the first two branches to capture the remote dependence between the channel dimension and any of the spatial dimensions. The final branch aggregates the outputs of all three branches in the integration phase.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;where C, W and H represent the number of channels, width and height of the feature map, respectively. Here, AvgPool and StdPool represent the global average pooling and the global standardized difference pooling, respectively. ⊙ stands for broadcast element-wise multiplication, and ⊕ stands for broadcast element-wise summation. &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Three &lt;/ins&gt;branches are used to capture the interactions between different dimensions and channels. A substitution operation is used in the first two branches to capture the remote dependence between the channel dimension and any of the spatial dimensions. 
The final branch aggregates the outputs of all three branches in the integration phase.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;We made a series of improvements to YOLOv5s: the original C3 convolutional module was replaced with the improved C2f in Backbone, and the Biformer attention mechanism was introduced; a module with both inverse convolution and bilinear interpolation upsampling was designed in Neck, and the original C3 module was replaced with an MSDA module; a convolutional module was added in Head before the Simam attention mechanism to improve the detection accuracy and efficiency; the loss function was replaced with EIOU to obtain more accurate target localization.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;We made a series of improvements to YOLOv5s: the original C3 convolutional module was replaced with the improved C2f in Backbone, and the Biformer attention mechanism was introduced; a module with both inverse convolution and bilinear interpolation upsampling was designed in Neck, and the original C3 module was replaced with an MSDA module; a convolutional module was added in Head before the Simam attention mechanism to improve the detection accuracy and efficiency; the loss function was replaced with EIOU to obtain more accurate target localization.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>Rimni</name></author>	</entry>

	<entry>
		<id>https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302979&amp;oldid=prev</id>
		<title>Rimni: /* 4.3. Comparison with mainstream models */</title>
		<link rel="alternate" type="text/html" href="https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302979&amp;oldid=prev"/>
				<updated>2024-06-10T11:30:12Z</updated>
		
		<summary type="html">&lt;p&gt;‎&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;4.3. Comparison with mainstream models&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;tr style='vertical-align: top;' lang='en'&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 11:30, 10 June 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l163&quot; &gt;Line 163:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 163:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;===4.3. Comparison with mainstream models===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;===4.3. Comparison with mainstream models===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In order to verify the superiority of our model, we compare it with current mainstream models in both one-stage and two-stage categories, which include ATSS, CASCADE-RCNN , FASTER-RCNN , SSD300, Retinanet, and YOLOVX. [[#tab-1|Table 1]] lists the comparison of our proposed SteelGuard-yolo with these mainstream algorithms. It can be seen that our algorithm outperforms most of the algorithms for comparison, with mAP50 reaching 0.690, which is an improvement of 0.072 over the initial YOLOv5s. We made predictions using a one- and two-stage typical model for target detection and our model, respectively, and [[#img-5|Figure 5]] shows the visualization results, where (a) a one-stage typical model of YolovX is used, (b) a two-stage typical model of Faster-Rcnn, (c) is ssdlite, (d) is atss, and (e) is our proposed SteelGuard-yolo.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In order to verify the superiority of our model, we compare it with current mainstream models in both one-stage and two-stage categories, which include ATSS, CASCADE-RCNN, FASTER-RCNN, SSD300, Retinanet, and YOLOVX. [[#tab-1|Table 1]] lists the comparison of our proposed SteelGuard-yolo with these mainstream algorithms. It can be seen that our algorithm outperforms most of the algorithms for comparison, with mAP50 reaching 0.690, which is an improvement of 0.072 over the initial YOLOv5s. 
We made predictions using a one- and two-stage typical model for target detection and our model, respectively, and [[#img-5|Figure 5]] shows the visualization results, where (a) a one-stage typical model of YolovX is used, (b) a two-stage typical model of Faster-Rcnn, (c) is ssdlite, (d) is atss, and (e) is our proposed SteelGuard-yolo.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;div class=&amp;quot;center&amp;quot; style=&amp;quot;font-size: 85%;&amp;quot;&amp;gt;'''Table 1'''. Comparison of SteelGuard-yolo's detection effect with mainstream models on NEU-DET dataset&amp;lt;/div&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;div class=&amp;quot;center&amp;quot; style=&amp;quot;font-size: 85%;&amp;quot;&amp;gt;'''Table 1'''. Comparison of SteelGuard-yolo's detection effect with mainstream models on NEU-DET dataset&amp;lt;/div&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>Rimni</name></author>	</entry>

	<entry>
		<id>https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302978&amp;oldid=prev</id>
		<title>Rimni at 11:29, 10 June 2024</title>
		<link rel="alternate" type="text/html" href="https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302978&amp;oldid=prev"/>
				<updated>2024-06-10T11:29:05Z</updated>
		
		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;tr style='vertical-align: top;' lang='en'&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 11:29, 10 June 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l375&quot; &gt;Line 375:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 375:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-19|[19]]] Jiao J., Tang Y.M., Lin K.Y., et al. Dilateformer: multi-scale dilated transformer for visual recognition. IEEE Transactions on Multimedia, 25:8906-8919, 2023.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-19|[19]]] Jiao J., Tang Y.M., Lin K.Y., et al. Dilateformer: multi-scale dilated transformer for visual recognition. IEEE Transactions on Multimedia, 25:8906-8919, 2023.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[20] Bao Y., Song K., Liu J., Wang Y., Yan Y., Yu H., Li X. Triplet-Graph Reasoning Network for Few-shot Metal Generic Surface Defect Segmentation. IEEE Transactions on Instrumentation and Measurement 70:1-11, 3083561, 2021.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[20] Bao Y., Song K., Liu J., Wang Y., Yan Y., Yu H., Li X. Triplet-Graph Reasoning Network for Few-shot Metal Generic Surface Defect Segmentation. IEEE Transactions on Instrumentation and Measurement&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, &lt;/ins&gt;70:1-11, 3083561, 2021.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[21] Song K.,&amp;#160; Yan Y. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Applied Surface Science, 285:858-864, 2013.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[21] Song K.,&amp;#160; Yan Y. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Applied Surface Science, 285:858-864, 2013.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[22] He Y.,&amp;#160; Song K.,&amp;#160; Meng Q.,&amp;#160; Yan Y. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Transactions on Instrumentation and Measurement, 69(4):1493-1504, 2020.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[22] He Y.,&amp;#160; Song K.,&amp;#160; Meng Q.,&amp;#160; Yan Y. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Transactions on Instrumentation and Measurement, 69(4):1493-1504, 2020.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>Rimni</name></author>	</entry>

	<entry>
		<id>https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302977&amp;oldid=prev</id>
		<title>Rimni at 11:19, 10 June 2024</title>
		<link rel="alternate" type="text/html" href="https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302977&amp;oldid=prev"/>
				<updated>2024-06-10T11:19:34Z</updated>
		
		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;tr style='vertical-align: top;' lang='en'&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 11:19, 10 June 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l128&quot; &gt;Line 128:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 128:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;===3.2. Neck optimization===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;===3.2. Neck optimization===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In Neck, we design a module with both inverse convolution and bilinear interpolation up-sampling, and we replace the original C3 module in Neck with the MSDA module to enhance the feature aggregation capability of the model. Feature up-sampling is a very important part of deep learning and neural networks. Feature maps of different resolutions are matched based on high-resolution supervision. Up-sampling in YOLOv5s uses the nearest neighbor interpolation algorithm by default, which fills new pixel positions by copying the nearest pixel values. This means that there are significant discontinuities in the gray values in the sampled image, resulting in a large loss of image quality. This can manifest itself in the form of noticeable mosaic and jaggedness. This method is overly concerned with the speed of the operation and ignores the accuracy and effectiveness of the up-sampling result. Our module, with both inverse convolution and bilinear interpolation up-sampling, effectively remedies this shortcoming. Bilinear interpolation takes into account the weights of the four pixels around a pixel point, which can effectively compensate for mosaic and jaggedness. Inverse convolution enables the network to automatically learn the appropriate upsampling weights for a particular task. The combination of the two effectively improves the smoothness and accuracy of the upsampling results. 
The original YOLOv5s is not good at detecting small targets, so we replace the C3 module in Neck with the MSDA module to enhance the feature aggregation ability of the model.The MSDA module improves the detection accuracy of small targets by generating larger scale feature maps to differentiate the fine features of small targets. Meanwhile, the MSDA module adopts sliding window feature extraction, which effectively reduces the computational requirements and the number of parameters. Finally, the MSDA module introduces a global attention mechanism that combines channel information with global information to create a weighted feature map. This helps to highlight the attributes of the object of interest while effectively ignoring irrelevant details.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In Neck, we design a module with both inverse convolution and bilinear interpolation up-sampling, and we replace the original C3 module in Neck with the MSDA module to enhance the feature aggregation capability of the model. Feature up-sampling is a very important part of deep learning and neural networks. Feature maps of different resolutions are matched based on high-resolution supervision. Up-sampling in YOLOv5s uses the nearest neighbor interpolation algorithm by default, which fills new pixel positions by copying the nearest pixel values. This means that there are significant discontinuities in the gray values in the sampled image, resulting in a large loss of image quality. This can manifest itself in the form of noticeable mosaic and jaggedness. This method is overly concerned with the speed of the operation and ignores the accuracy and effectiveness of the up-sampling result. 
Our module, with both inverse convolution and bilinear interpolation up-sampling, effectively remedies this shortcoming. Bilinear interpolation takes into account the weights of the four pixels around a pixel point, which can effectively compensate for mosaic and jaggedness. Inverse convolution enables the network to automatically learn the appropriate upsampling weights for a particular task. The combination of the two effectively improves the smoothness and accuracy of the upsampling results. The original YOLOv5s is not good at detecting small targets, so we replace the C3 module in Neck with the MSDA module to enhance the feature aggregation ability of the model. The MSDA module improves the detection accuracy of small targets by generating larger scale feature maps to differentiate the fine features of small targets. Meanwhile, the MSDA module adopts sliding window feature extraction, which effectively reduces the computational requirements and the number of parameters. Finally, the MSDA module introduces a global attention mechanism that combines channel information with global information to create a weighted feature map. This helps to highlight the attributes of the object of interest while effectively ignoring irrelevant details.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;===3.3. Detection head optimization===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;===3.3. Detection head optimization===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>Rimni</name></author>	</entry>

	<entry>
		<id>https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302976&amp;oldid=prev</id>
		<title>Rimni at 11:18, 10 June 2024</title>
		<link rel="alternate" type="text/html" href="https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302976&amp;oldid=prev"/>
				<updated>2024-06-10T11:18:46Z</updated>
		
		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;tr style='vertical-align: top;' lang='en'&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 11:18, 10 June 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l128&quot; &gt;Line 128:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 128:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;===3.2. Neck optimization===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;===3.2. Neck optimization===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In Neck, we design a module with both inverse convolution and bilinear interpolation up-sampling, and we replace the original C3 module in Neck with the MSDA module to enhance the feature aggregation capability of the model. Feature up-sampling is a very important part of deep learning and neural networks. Feature maps of different resolutions are matched based on high-resolution supervision. Up-sampling in YOLOv5s uses the nearest neighbor interpolation algorithm by default, which fills new pixel positions by copying the nearest pixel values.&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;[https://www.bilibili.com/read/cv17309395/ &lt;/del&gt;This means that there are significant discontinuities in the gray values in the sampled image, resulting in a large loss of image quality. This can manifest itself in the form of noticeable mosaic and jaggedness&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;] &lt;/del&gt;. This method is overly concerned with the speed of the operation and ignores the accuracy and effectiveness of the up-sampling result. Our design with both inverse convolution and bilinear interpolation up-sampling module effectively remedies this shortcoming. Bilinear interpolation takes into account the weights of the four pixels around a pixel point, which can effectively compensate for mosaic and jaggedness. Inverse convolution enables the network to automatically learn the appropriate up-sampling weights for a particular task. The combination of the two effectively improves the smoothness and accuracy of the up-sampling results. 
The original YOLOv5s is not good at detecting small targets, so we replace the C3 module in Neck with the MSDA module to enhance the feature aggregation ability of the model. The MSDA module improves the detection accuracy of small targets by generating larger scale feature maps to differentiate the fine features of small targets. Meanwhile, the MSDA module adopts sliding window feature extraction, which effectively reduces the computational requirements and the number of parameters. Finally, the MSDA module introduces a global attention mechanism that combines channel information with global information to create a weighted feature map. This helps to highlight the attributes of the object of interest while effectively ignoring irrelevant details.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In Neck, we design a module with both inverse convolution and bilinear interpolation up-sampling, and we replace the original C3 module in Neck with the MSDA module to enhance the feature aggregation capability of the model. Feature up-sampling is a very important part of deep learning and neural networks. Feature maps of different resolutions are matched based on high-resolution supervision. Up-sampling in YOLOv5s uses the nearest neighbor interpolation algorithm by default, which fills new pixel positions by copying the nearest pixel values. This means that there are significant discontinuities in the gray values in the sampled image, resulting in a large loss of image quality. This can manifest itself in the form of noticeable mosaic and jaggedness. This method is overly concerned with the speed of the operation and ignores the accuracy and effectiveness of the up-sampling result. 
Our design with both inverse convolution and bilinear interpolation up-sampling module effectively remedies this shortcoming. Bilinear interpolation takes into account the weights of the four pixels around a pixel point, which can effectively compensate for mosaic and jaggedness. Inverse convolution enables the network to automatically learn the appropriate up-sampling weights for a particular task. The combination of the two effectively improves the smoothness and accuracy of the up-sampling results. The original YOLOv5s is not good at detecting small targets, so we replace the C3 module in Neck with the MSDA module to enhance the feature aggregation ability of the model. The MSDA module improves the detection accuracy of small targets by generating larger scale feature maps to differentiate the fine features of small targets. Meanwhile, the MSDA module adopts sliding window feature extraction, which effectively reduces the computational requirements and the number of parameters. Finally, the MSDA module introduces a global attention mechanism that combines channel information with global information to create a weighted feature map. This helps to highlight the attributes of the object of interest while effectively ignoring irrelevant details.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;===3.3. Detection head optimization===&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;===3.3. Detection head optimization===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>Rimni</name></author>	</entry>

	<entry>
		<id>https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302975&amp;oldid=prev</id>
		<title>Rimni at 11:18, 10 June 2024</title>
		<link rel="alternate" type="text/html" href="https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302975&amp;oldid=prev"/>
				<updated>2024-06-10T11:18:13Z</updated>
		
		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;tr style='vertical-align: top;' lang='en'&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 11:18, 10 June 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l112&quot; &gt;Line 112:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 112:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;margin: 0em auto 0.1em auto;border-collapse: collapse;width:auto;&amp;quot; &amp;#160;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;margin: 0em auto 0.1em auto;border-collapse: collapse;width:auto;&amp;quot; &amp;#160;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;background:white;&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;background:white;&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|style=&amp;quot;text-align: center;padding:10px;&amp;quot;| [[File:1-33.png|&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;600x600px&lt;/del&gt;]]&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|style=&amp;quot;text-align: center;padding:10px;&amp;quot;| [[File:1-33.png|&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;750px&lt;/ins&gt;]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| style=&amp;quot;background:#efefef;text-align:left;padding:10px;font-size: 85%;&amp;quot;| '''Figure 3'''. SteelGuard-yolo network structure&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| style=&amp;quot;background:#efefef;text-align:left;padding:10px;font-size: 85%;&amp;quot;| '''Figure 3'''. SteelGuard-yolo network structure&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>Rimni</name></author>	</entry>

	<entry>
		<id>https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302974&amp;oldid=prev</id>
		<title>Rimni: /* 1. Introduction */</title>
		<link rel="alternate" type="text/html" href="https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302974&amp;oldid=prev"/>
				<updated>2024-06-10T11:17:12Z</updated>
		
		<summary type="html">&lt;p&gt;‎&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;1. Introduction&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;tr style='vertical-align: top;' lang='en'&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 11:17, 10 June 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l24&quot; &gt;Line 24:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 24:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Two-stage detection methods were the first to appear; their main idea is to generate anchor frames on the input image, then detect the contents of the anchor frames, and finally classify them. Typical algorithms are R-CNN [&amp;lt;span id='cite-2'&amp;gt;&amp;lt;/span&amp;gt;[[#2|2]]], Fast-RCNN [&amp;lt;span id='cite-3'&amp;gt;&amp;lt;/span&amp;gt;[[#3|3]]], Faster-RCNN [&amp;lt;span id='cite-4'&amp;gt;&amp;lt;/span&amp;gt;[[#4|4]]], and Grid-RCNN [&amp;lt;span id='cite-5'&amp;gt;&amp;lt;/span&amp;gt;[[#5|5]]], etc. Girshick et al. proposed R-CNN, which sets a set of anchors at each pixel; an RPN classifies and regresses all of these anchors and then picks the top-K proposals based on their classification confidence. Models such as Fast-RCNN and Faster-RCNN, which improve on this basis, are considered the classical two-stage models. In addition, for target recognition under different viewing angles, lighting and occlusion conditions, Anton Osokin et al. proposed Context-aware CNNs [&amp;lt;span id='cite-6'&amp;gt;&amp;lt;/span&amp;gt;[[#6|6]]], which realized target recognition under different conditions. These models have good detection performance and have all achieved excellent results, but their anchor frame localization modules are very similar and there is still room for improvement. To address this problem, Lu et al. proposed Grid-RCNN, which effectively utilizes explicit spatial representations to achieve high-quality localization. 
Two-stage models have higher accuracy but longer detection times; although the EfficientDet series optimizes detection time, it also consumes more computational resources.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Two-stage detection methods were the first to appear; their main idea is to generate anchor frames on the input image, then detect the contents of the anchor frames, and finally classify them. Typical algorithms are R-CNN [&amp;lt;span id='cite-2'&amp;gt;&amp;lt;/span&amp;gt;[[#2|2]]], Fast-RCNN [&amp;lt;span id='cite-3'&amp;gt;&amp;lt;/span&amp;gt;[[#3|3]]], Faster-RCNN [&amp;lt;span id='cite-4'&amp;gt;&amp;lt;/span&amp;gt;[[#4|4]]], and Grid-RCNN [&amp;lt;span id='cite-5'&amp;gt;&amp;lt;/span&amp;gt;[[#5|5]]], etc. Girshick et al. proposed R-CNN, which sets a set of anchors at each pixel; an RPN classifies and regresses all of these anchors and then picks the top-K proposals based on their classification confidence. Models such as Fast-RCNN and Faster-RCNN, which improve on this basis, are considered the classical two-stage models. In addition, for target recognition under different viewing angles, lighting and occlusion conditions, Anton Osokin et al. proposed Context-aware CNNs [&amp;lt;span id='cite-6'&amp;gt;&amp;lt;/span&amp;gt;[[#6|6]]], which realized target recognition under different conditions. These models have good detection performance and have all achieved excellent results, but their anchor frame localization modules are very similar and there is still room for improvement. To address this problem, Lu et al. 
proposed Grid-RCNN, which effectively utilizes explicit spatial representations to achieve high-quality localization. Two-stage models have higher accuracy but longer detection times; although the EfficientDet series optimizes detection time, it also consumes more computational resources.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;One-stage methods no longer need to generate anchor frames, but directly predict on the whole image, obtaining an improvement in detection speed. The most typical one-stage target detection algorithms include YOLO [&amp;lt;span id='cite-7'&amp;gt;&amp;lt;/span&amp;gt;[[#7|7]]], SSD [&amp;lt;span id='cite-8'&amp;gt;&amp;lt;/span&amp;gt;[[#8|8]]], SqueezeDet [&amp;lt;span id='cite-9'&amp;gt;&amp;lt;/span&amp;gt;[[#9|9]]] and DetectNet [&amp;lt;span id='cite-10'&amp;gt;&amp;lt;/span&amp;gt;[[#10|10]]], etc. The OverFeat algorithm proposed by Sermanet et al. [&amp;lt;span id='cite-11'&amp;gt;&amp;lt;/span&amp;gt;[[#11|11]]] is the basis of one-stage target detection; it classifies image regions at different locations and scales in a sliding-window fashion, and trains a regressor on the same convolutional layer to predict the location of the bounding box. On top of this, the YOLO series of algorithms are also classic one-stage algorithms. Redmon et al. proposed the YOLO (You Only Look Once) model, which has strong generalization ability as well as adaptability. With the proposal of YOLO, various applications have begun to utilize YOLO for target detection and recognition in various contexts. Aiming at the deficiencies of the YOLO family of models in network fusion, Wang et al. proposed Gold-YOLO [&amp;lt;span id='cite-12'&amp;gt;&amp;lt;/span&amp;gt;[[#12|12]]],which improves on convolution and self-attention mechanisms, and employs MAE-style pre-training to allow the model to benefit from unsupervised training. Aiming at the problems of poor performance of the YOLOv2 backbone and underutilization of multi-scale regional features, Huang et al. 
proposed DC-SPP-YOLO [&amp;lt;span id='cite-13'&amp;gt;&amp;lt;/span&amp;gt;[[#13|13]]] based on dense connectivity (DC) and spatial pyramid pooling (SPP), which improves the target detection accuracy of YOLOv2 [&amp;lt;span id='cite-14'&amp;gt;&amp;lt;/span&amp;gt;[[#14|14]]]. Aiming at industrial scenarios where image background interference is large, defect categories are easily confused, defect scales vary greatly, and small defects are poorly detected, Guo et al. proposed MSFT-YOLO [&amp;lt;span id='cite-15'&amp;gt;&amp;lt;/span&amp;gt;[[#15|15]]], which realizes the fusion of features at different scales and enhances the dynamic adjustment of the model to targets at different scales. One-stage models, such as YOLO, SSD, and RetinaNet [&amp;lt;span id='cite-16'&amp;gt;&amp;lt;/span&amp;gt;[[#16|16]]], are excellent in terms of speed and real-time performance, but suffer from limited localization accuracy and relatively low detection accuracy for small targets. One-stage models such as YOLOvX have much lower accuracy than two-stage models, even though their detection speed is faster. Two-stage models such as Faster-RCNN have higher accuracy than one-stage models, but their computational cost and time are much higher.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;One-stage methods no longer need to generate anchor frames, but directly predict on the whole image, obtaining an improvement in detection speed. 
The most typical one-stage target detection algorithms include YOLO [&amp;lt;span id='cite-7'&amp;gt;&amp;lt;/span&amp;gt;[[#7|7]]], SSD [&amp;lt;span id='cite-8'&amp;gt;&amp;lt;/span&amp;gt;[[#8|8]]], SqueezeDet [&amp;lt;span id='cite-9'&amp;gt;&amp;lt;/span&amp;gt;[[#9|9]]] and DetectNet [&amp;lt;span id='cite-10'&amp;gt;&amp;lt;/span&amp;gt;[[#10|10]]], etc. The OverFeat algorithm proposed by Sermanet et al. [&amp;lt;span id='cite-11'&amp;gt;&amp;lt;/span&amp;gt;[[#11|11]]] is the basis of one-stage target detection; it classifies image regions at different locations and scales in a sliding-window fashion, and trains a regressor on the same convolutional layer to predict the location of the bounding box. On top of this, the YOLO series of algorithms are also classic one-stage algorithms. Redmon et al. proposed the YOLO (You Only Look Once) model, which has strong generalization ability as well as adaptability. With the proposal of YOLO, various applications have begun to utilize YOLO for target detection and recognition in various contexts. Aiming at the deficiencies of the YOLO family of models in network fusion, Wang et al. proposed Gold-YOLO [&amp;lt;span id='cite-12'&amp;gt;&amp;lt;/span&amp;gt;[[#12|12]]], which improves on convolution and self-attention mechanisms, and employs MAE-style pre-training to allow the model to benefit from unsupervised training. Aiming at the problems of poor performance of the YOLOv2 backbone and underutilization of multi-scale regional features, Huang et al. proposed DC-SPP-YOLO [&amp;lt;span id='cite-13'&amp;gt;&amp;lt;/span&amp;gt;[[#13|13]]] based on dense connectivity (DC) and spatial pyramid pooling (SPP), which improves the target detection accuracy of YOLOv2 [&amp;lt;span id='cite-14'&amp;gt;&amp;lt;/span&amp;gt;[[#14|14]]]. 
Aiming at industrial scenarios where image background interference is large, defect categories are easily confused, defect scales vary greatly, and small defects are poorly detected, Guo et al. proposed MSFT-YOLO [&amp;lt;span id='cite-15'&amp;gt;&amp;lt;/span&amp;gt;[[#15|15]]], which realizes the fusion of features at different scales and enhances the dynamic adjustment of the model to targets at different scales. One-stage models, such as YOLO, SSD, and RetinaNet [&amp;lt;span id='cite-16'&amp;gt;&amp;lt;/span&amp;gt;[[#16|16]]], are excellent in terms of speed and real-time performance, but suffer from limited localization accuracy and relatively low detection accuracy for small targets. One-stage models such as YOLOvX have much lower accuracy than two-stage models, even though their detection speed is faster. Two-stage models such as Faster-RCNN have higher accuracy than one-stage models, but their computational cost and time are much higher.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To address the above problems, this paper proposes a new defect detection algorithm, SteelGuard-yolo, to improve the detection accuracy of the one-stage model with reduced computation. The main contributions are:&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To address the above problems, this paper proposes a new defect detection algorithm, SteelGuard-yolo, to improve the detection accuracy of the one-stage model with reduced computation. The main contributions are:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>Rimni</name></author>	</entry>

	<entry>
		<id>https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302116&amp;oldid=prev</id>
		<title>Rimni at 12:59, 7 June 2024</title>
		<link rel="alternate" type="text/html" href="https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302116&amp;oldid=prev"/>
				<updated>2024-06-07T12:59:17Z</updated>
		
		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;tr style='vertical-align: top;' lang='en'&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 12:59, 7 June 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l329&quot; &gt;Line 329:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 329:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|}&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|}&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=5. Conclusion=&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;=&lt;/ins&gt;=5. Conclusion&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;=&lt;/ins&gt;=&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In this paper, an algorithm for detecting defects on steel surfaces is proposed. By replacing the original C3 module in Backbone with an improved C2f module with weight aggregation, and introducing the BiFormer attention mechanism in front of the SPPF module to enhance the perceptual ability of the model, the accuracy and robustness of target detection are improved. In Neck, the up-sampling method is designed to combine bilinear interpolation and inverse convolution in parallel, which realizes more accurate fusion of features at different scales. The C3 module of Neck is replaced by the MSDA module, which suppresses background information and emphasizes the perceptual region. The SimAM attention mechanism is introduced in front of the Conv module of the detection head to better extract key target information without introducing too many parameters. Finally, the original loss function is replaced with EIoU to improve localization accuracy. In our future work, we will further develop a lightweight backbone feature extraction network and new feature fusion methods to simplify the network architecture and achieve an effective balance between high speed and high accuracy.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In this paper, an algorithm for detecting defects on steel surfaces is proposed. 
By replacing the original C3 module in Backbone with an improved C2f module with weight aggregation, and introducing the BiFormer attention mechanism in front of the SPPF module to enhance the perceptual ability of the model, the accuracy and robustness of target detection are improved. In Neck, the up-sampling method is designed to combine bilinear interpolation and inverse convolution in parallel, which realizes more accurate fusion of features at different scales. The C3 module of Neck is replaced by the MSDA module, which suppresses background information and emphasizes the perceptual region. The SimAM attention mechanism is introduced in front of the Conv module of the detection head to better extract key target information without introducing too many parameters. Finally, the original loss function is replaced with EIoU to improve localization accuracy. In our future work, we will further develop a lightweight backbone feature extraction network and new feature fusion methods to simplify the network architecture and achieve an effective balance between high speed and high accuracy.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;==&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;REFERENCES&lt;/del&gt;==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;==&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;References&lt;/ins&gt;==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt;&amp;#160;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;div class=&amp;quot;auto&amp;quot; style=&amp;quot;text-align: left;width: auto; margin-left: auto; margin-right: auto;font-size: 85%;&amp;quot;&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;div id=&amp;quot;1&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;div id=&amp;quot;1&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-1|[1]]] Tang B, Chen L, Sun W, et al. Review of surface defect detection of steel products based on machine vision. IET Image Processing&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, 2023&lt;/del&gt;, 17(2): 303-322.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-1|[1]]] Tang B, Chen L&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Sun W&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. Review of surface defect detection of steel products based on machine vision. IET Image Processing, 17(2):303-322&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2023&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-2|[2]]] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;conference &lt;/del&gt;on &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;computer vision &lt;/del&gt;and &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;pattern recognition. 2014: &lt;/del&gt;580-587.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-2|[2]]] Girshick R&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Donahue J&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Darrell T&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Conference &lt;/ins&gt;on &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Computer Vision &lt;/ins&gt;and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Pattern Recognition, Columbus, OH, USA,&amp;#160; &lt;/ins&gt;580-587&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2014&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-3|[3]]] Girshick R. Fast &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;r&lt;/del&gt;-&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;cnn&lt;/del&gt;. Proceedings of the IEEE &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;international conference &lt;/del&gt;on &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;computer vision. 2015: &lt;/del&gt;1440-1448.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-3|[3]]] Girshick R. Fast &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;R&lt;/ins&gt;-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;CNN&lt;/ins&gt;. Proceedings of the IEEE &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;International Conference &lt;/ins&gt;on &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Computer Vision,&amp;#160; Santiago, Chile, &lt;/ins&gt;1440-1448&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2015&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-4|[4]]] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Ieee &lt;/del&gt;Transactions on Pattern Analysis and Machine Intelligence&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, 2017&lt;/del&gt;, 39&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, &lt;/del&gt;1137-1149.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-4|[4]]] Ren S&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, He K&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Girshick R&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;IEEE &lt;/ins&gt;Transactions on Pattern Analysis and Machine Intelligence, 39&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;:&lt;/ins&gt;1137-1149&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2017&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-5|[5]]] Lu X, Li B, Yue Y, et al. Grid &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;r&lt;/del&gt;-&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;cnn&lt;/del&gt;. Proceedings of the IEEE/CVF &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;conference &lt;/del&gt;on &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;computer vision &lt;/del&gt;and &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;pattern recognition. 2019: 7363&lt;/del&gt;-&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;7372&lt;/del&gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-5|[5]]] Lu X&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Li B&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Yue Y&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. Grid &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;R&lt;/ins&gt;-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;CNN&lt;/ins&gt;. Proceedings of the IEEE/CVF &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Conference &lt;/ins&gt;on &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Computer Vision &lt;/ins&gt;and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Pattern Recognition, Long Beach, CA, USA, 7355&lt;/ins&gt;-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;7364, 2019&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-6|[6]]] Shaban M, Awan R, Fraz M M, et al. Context-aware convolutional neural network for grading of colorectal cancer histology images. IEEE &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;transactions &lt;/del&gt;on &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;medical imaging, 2020&lt;/del&gt;, 39(7): 2395-2405.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-6|[6]]] Shaban M&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Awan R&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Fraz M&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;M&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. Context-aware convolutional neural network for grading of colorectal cancer histology images. IEEE &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Transactions &lt;/ins&gt;on &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Medical Imaging&lt;/ins&gt;, 39(7):2395-2405&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2020&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-7|[7]]] Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection. Proceedings of the IEEE &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;conference &lt;/del&gt;on &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;computer vision &lt;/del&gt;and &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;pattern recognition. 2016: &lt;/del&gt;779-788.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-7|[7]]] Redmon J&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Divvala S&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Girshick R&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. You only look once: Unified, real-time object detection. Proceedings of the IEEE &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Conference &lt;/ins&gt;on &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Computer Vision &lt;/ins&gt;and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Pattern Recognition, &lt;/ins&gt;779-788&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2016&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-8|[8]]] Liu W, Anguelov D, Erhan D, et al. SSD: Single shot multibox detector. Computer Vision-ECCV 2016: 14th European Conference,&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;. &lt;/del&gt;Amsterdam, The Netherlands, October 11-14, 2016&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, &lt;/del&gt;Proceedings, Part I 14&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;. &lt;/del&gt;Springer International Publishing, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;2016: &lt;/del&gt;21-37.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-8|[8]]] Liu W&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Anguelov D&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Erhan D&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. SSD: Single shot multibox detector. Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;. &lt;/ins&gt;Proceedings, Part I 14&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, &lt;/ins&gt;Springer International Publishing, 21-37&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2016&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-9|[9]]] Iandola F N, Han S, Moskewicz M W, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and &amp;lt; 0.5 MB model size. arXiv:1602.07360, 2016.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-9|[9]]] Iandola F&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;N&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Han S&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Moskewicz M&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;W&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and &amp;lt; 0.5 MB model size. arXiv:1602.07360, 2016.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-10|[10]]] Li Z, Peng C, Yu G, et al. DetNet: A backbone network for object detection. arXiv:1804.06215, 2018.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-10|[10]]] Li Z&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Peng C&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Yu G&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. DetNet: A backbone network for object detection. arXiv:1804.06215, 2018.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-11|[11]]] Sermanet P, Eigen D, Zhang X, et al. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-11|[11]]] Sermanet P&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Eigen D&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Zhang X&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-12|[12]]] Wang C, He W, Nie Y, et al. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. Advances in Neural Information Processing Systems&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, 2024&lt;/del&gt;, 36&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, &lt;/del&gt;1-10.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-12|[12]]] Wang C&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, He W&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Nie Y&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. Advances in Neural Information Processing Systems, 36&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;:&lt;/ins&gt;1-10&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2024&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-13|[13]]] Huang Z, Wang J, Fu X, et al. DC-SPP-YOLO: Dense connection and spatial pyramid pooling based YOLO for object detection. Information Sciences&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, 2020&lt;/del&gt;, 522: 241-258.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-13|[13]]] Huang Z&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Wang J&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Fu X&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. DC-SPP-YOLO: Dense connection and spatial pyramid pooling based YOLO for object detection. Information Sciences, 522:241-258&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2020&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-14|[14]]] Redmon J, Farhadi A. YOLO9000: &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;better&lt;/del&gt;, faster, stronger. Proceedings of the IEEE &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;conference &lt;/del&gt;on &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;computer vision &lt;/del&gt;and &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;pattern recognition. 2017. 7263&lt;/del&gt;-&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;7271&lt;/del&gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-14|[14]]] Redmon J&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Farhadi A. YOLO9000: &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Better&lt;/ins&gt;, faster, stronger. Proceedings of the IEEE &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Conference &lt;/ins&gt;on &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Computer Vision &lt;/ins&gt;and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Pattern Recognition, Honolulu, HI, USA, 6517&lt;/ins&gt;-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;6525, 2017&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-15|[15]]] Guo Z, Wang C, Yang G, et al. Msft-yolo: Improved yolov5 based on transformer for detecting defects of steel surface. Sensors&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, 2022&lt;/del&gt;, 22(9)&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;: &lt;/del&gt;3467.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-15|[15]]] Guo Z&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Wang C&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Yang G&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. Msft-yolo: Improved yolov5 based on transformer for detecting defects of steel surface. Sensors, 22(9)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, &lt;/ins&gt;3467&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2022&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-16|[16]]] Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection. Proceedings of the IEEE &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;international conference &lt;/del&gt;on &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;computer vision. 2017: 2980&lt;/del&gt;-&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;2988&lt;/del&gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-16|[16]]] Lin T&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;Y&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Goyal P&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Girshick R&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. Focal loss for dense object detection. Proceedings of the IEEE &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;International Conference &lt;/ins&gt;on &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Computer Vision, Venice, Italy, 2999&lt;/ins&gt;-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;3007, 2017&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-17|[17]]] Yu Y, Zhang Y, Cheng Z, et al. MCA: Multidimensional collaborative attention in deep convolutional neural networks for image recognition. Engineering Applications of Artificial Intelligence&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, 2023&lt;/del&gt;, 126&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;: &lt;/del&gt;107079.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-17|[17]]] Yu Y&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Zhang Y&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Cheng Z&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. MCA: Multidimensional collaborative attention in deep convolutional neural networks for image recognition. Engineering Applications of Artificial Intelligence, 126&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, &lt;/ins&gt;107079&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2023&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-18|[18]]] Zhu L, Wang X, Ke Z, et al. Biformer: Vision transformer with bi-level routing attention. Proceedings of the IEEE/CVF &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;conference &lt;/del&gt;on &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;computer vision &lt;/del&gt;and &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;pattern recognition. 2023: &lt;/del&gt;10323-10333.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-18|[18]]] Zhu L&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Wang X&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Ke Z&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. Biformer: Vision transformer with bi-level routing attention. Proceedings of the IEEE/CVF &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Conference &lt;/ins&gt;on &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Computer Vision &lt;/ins&gt;and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Pattern Recognition, Vancouver, BC, Canada, &lt;/ins&gt;10323-10333&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2023&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-19|[19]]] Jiao J, Tang Y M, Lin K Y, et al. Dilateformer: multi-scale dilated transformer for visual recognition. IEEE Transactions on Multimedia, 2023.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[#cite-19|[19]]] Jiao J&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Tang Y&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;M&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, Lin K&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;Y&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;.&lt;/ins&gt;, et al. Dilateformer: multi-scale dilated transformer for visual recognition. IEEE Transactions on Multimedia&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 25:8906-8919&lt;/ins&gt;, 2023.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[20] &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Yanqi &lt;/del&gt;Bao, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Kechen &lt;/del&gt;Song, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Jie &lt;/del&gt;Liu, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Yanyan &lt;/del&gt;Wang, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Yunhui &lt;/del&gt;Yan, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Han &lt;/del&gt;Yu, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Xingjie &lt;/del&gt;Li&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, &amp;quot;&lt;/del&gt;Triplet-Graph Reasoning Network for Few-shot Metal Generic Surface Defect Segmentation&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;,&amp;quot; &lt;/del&gt;IEEE Transactions on Instrumentation and Measurement&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, 2021, &lt;/del&gt;70, 3083561.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[20] Bao &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Y.&lt;/ins&gt;, Song &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;K.&lt;/ins&gt;, Liu &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;J.&lt;/ins&gt;, Wang &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Y.&lt;/ins&gt;, Yan &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Y.&lt;/ins&gt;, Yu &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;H.&lt;/ins&gt;, Li &lt;ins 
class=&quot;diffchange diffchange-inline&quot;&gt;X. &lt;/ins&gt;Triplet-Graph Reasoning Network for Few-shot Metal Generic Surface Defect Segmentation&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;. &lt;/ins&gt;IEEE Transactions on Instrumentation and Measurement 70&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;:1-11&lt;/ins&gt;, 3083561&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2021&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[21] K. &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Song and &lt;/del&gt;Y. &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Yan, &amp;quot;&lt;/del&gt;A &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Noise Robust Method Based &lt;/del&gt;on &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Completed Local Binary Patterns &lt;/del&gt;for &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Hot&lt;/del&gt;-&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Rolled Steel Strip Surface Defects, &amp;quot; &lt;/del&gt;Applied Surface Science&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, 2013&lt;/del&gt;, 285&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, &lt;/del&gt;858-864.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[21] &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Song &lt;/ins&gt;K.&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;,&amp;#160; Yan &lt;/ins&gt;Y. A &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;noise robust method based &lt;/ins&gt;on &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;completed local binary patterns &lt;/ins&gt;for &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;hot&lt;/ins&gt;-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;rolled steel strip surface defects. 
&lt;/ins&gt;Applied Surface Science, 285&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;:&lt;/ins&gt;858-864&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2013&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[22] &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Yu &lt;/del&gt;He, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Kechen &lt;/del&gt;Song, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Qinggang &lt;/del&gt;Meng, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Yunhui &lt;/del&gt;Yan&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, &amp;quot;&lt;/del&gt;An &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;End&lt;/del&gt;-to-end &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Steel Surface Defect Detection Approach &lt;/del&gt;via &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Fusing Multiple Hierarchical Features,&amp;quot; &lt;/del&gt;IEEE Transactions on Instrumentation and Measurement&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, 2020&lt;/del&gt;, 69(4)&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;, &lt;/del&gt;1493-1504.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[22] He &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Y.&lt;/ins&gt;, &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt; &lt;/ins&gt;Song &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;K.&lt;/ins&gt;, &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt; &lt;/ins&gt;Meng &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Q.&lt;/ins&gt;, &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt; &lt;/ins&gt;Yan &lt;ins class=&quot;diffchange 
diffchange-inline&quot;&gt;Y. &lt;/ins&gt;An &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;end&lt;/ins&gt;-to-end &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;steel surface defect detection approach &lt;/ins&gt;via &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;fusing multiple hierarchical features. &lt;/ins&gt;IEEE Transactions on Instrumentation and Measurement, 69(4)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;:&lt;/ins&gt;1493-1504&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, 2020&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Rimni</name></author>	</entry>

	<entry>
		<id>https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302023&amp;oldid=prev</id>
		<title>Rimni at 11:58, 7 June 2024</title>
		<link rel="alternate" type="text/html" href="https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302023&amp;oldid=prev"/>
				<updated>2024-06-07T11:58:46Z</updated>
		
		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;tr style='vertical-align: top;' lang='en'&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 11:58, 7 June 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l265&quot; &gt;Line 265:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 265:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.704 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.704 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;span &lt;/del&gt;style=&amp;quot;text-align: left&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;; &lt;/del&gt;&amp;quot;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;gt;&lt;/del&gt;YOLOV5s-Backbone-EIOU&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/span&amp;gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| style=&amp;quot;text-align:left&amp;quot; &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;|&lt;/ins&gt;YOLOV5s-Backbone-EIOU&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.669 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.669 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.330 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.330 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l274&quot; &gt;Line 274:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 274:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.784 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.784 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;span &lt;/del&gt;style=&amp;quot;text-align: left&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;; &lt;/del&gt;&amp;quot;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;gt;&lt;/del&gt;YOLOV5s-Head-EIOU&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/span&amp;gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| style=&amp;quot;text-align:left&amp;quot; &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;|&lt;/ins&gt;YOLOV5s-Head-EIOU&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.679 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.679 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.335 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.335 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l283&quot; &gt;Line 283:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 283:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.799 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.799 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;span &lt;/del&gt;style=&amp;quot;text-align: left&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;; &lt;/del&gt;&amp;quot;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;gt;&lt;/del&gt;YOLOV5s-Backbone-Neck-Head&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/span&amp;gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| style=&amp;quot;text-align:left&amp;quot; &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;|&lt;/ins&gt;YOLOV5s-Backbone-Neck-Head&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.676 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.676 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.322 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.322 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l292&quot; &gt;Line 292:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 292:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.809 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.809 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;span &lt;/del&gt;style=&amp;quot;text-align: left&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;; &lt;/del&gt;&amp;quot;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;gt;&lt;/del&gt;YOLOV5s-Backbone-Neck-EIOU&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/span&amp;gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| style=&amp;quot;text-align:left&amp;quot; &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;|&lt;/ins&gt;YOLOV5s-Backbone-Neck-EIOU&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.667 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.667 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.330 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.330 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l301&quot; &gt;Line 301:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 301:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.814 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.814 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;span &lt;/del&gt;style=&amp;quot;text-align: left&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;; &lt;/del&gt;&amp;quot;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;gt;&lt;/del&gt;YOLOV5s-Backbone-Head-EIOU&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/span&amp;gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| style=&amp;quot;text-align:left&amp;quot; &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;|&lt;/ins&gt;YOLOV5s-Backbone-Head-EIOU&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.687 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.687 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.358 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.358 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l310&quot; &gt;Line 310:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 310:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.793 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.793 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;span &lt;/del&gt;style=&amp;quot;text-align: left&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;; &lt;/del&gt;&amp;quot;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;gt;&lt;/del&gt;YOLOV5s-Neck-Head-EIOU&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/span&amp;gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| style=&amp;quot;text-align:left&amp;quot; &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;|&lt;/ins&gt;YOLOV5s-Neck-Head-EIOU&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.686 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.686 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.361 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.361 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l319&quot; &gt;Line 319:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 319:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.785 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.785 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-style=&amp;quot;text-align:center&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|&amp;#160; &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;span &lt;/del&gt;style=&amp;quot;text-align: left&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;; &lt;/del&gt;&amp;quot;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;gt;&lt;/del&gt;YOLOV5s-Backbone-Neck-Head-EIOU&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/span&amp;gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|&amp;#160; style=&amp;quot;text-align:left&amp;quot; &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;|&lt;/ins&gt;YOLOV5s-Backbone-Neck-Head-EIOU&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|&amp;#160; &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.690 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|&amp;#160; &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.690 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|&amp;#160; &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.408 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|&amp;#160; &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.408 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Rimni</name></author>	</entry>

	<entry>
		<id>https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302014&amp;oldid=prev</id>
		<title>Rimni: /* 4.4. Ablation experiments */</title>
		<link rel="alternate" type="text/html" href="https://www.scipedia.com/wd/index.php?title=Zhou_Yuan_2024a&amp;diff=302014&amp;oldid=prev"/>
				<updated>2024-06-07T11:57:34Z</updated>
		
		<summary type="html">&lt;p&gt;‎&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;4.4. Ablation experiments&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;col class='diff-marker' /&gt;
				&lt;col class='diff-content' /&gt;
				&lt;tr style='vertical-align: top;' lang='en'&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan='2' style=&quot;background-color: white; color:black; text-align: center;&quot;&gt;Revision as of 11:57, 7 June 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l264&quot; &gt;Line 264:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 264:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.475 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.475 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.704 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.704 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;style=&amp;quot;text-align:center&amp;quot;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Backbone-EIOU&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Backbone-EIOU&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.669 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.669 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l273&quot; &gt;Line 273:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 273:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.485 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.485 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.784 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.784 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;style=&amp;quot;text-align:center&amp;quot;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Head-EIOU&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Head-EIOU&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.679 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.679 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l282&quot; &gt;Line 282:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 282:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.481 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.481 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.799 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.799 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;style=&amp;quot;text-align:center&amp;quot;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Backbone-Neck-Head&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Backbone-Neck-Head&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.676 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.676 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l291&quot; &gt;Line 291:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 291:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.521 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.521 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.809 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.809 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;style=&amp;quot;text-align:center&amp;quot;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Backbone-Neck-EIOU&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Backbone-Neck-EIOU&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.667 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.667 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l300&quot; &gt;Line 300:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 300:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.516 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.516 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.814 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.814 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;style=&amp;quot;text-align:center&amp;quot;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Backbone-Head-EIOU&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Backbone-Head-EIOU&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.687 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.687 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l309&quot; &gt;Line 309:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 309:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.533 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.533 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.793 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.793 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;style=&amp;quot;text-align:center&amp;quot;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Neck-Head-EIOU&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Neck-Head-EIOU&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.686 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.686 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l318&quot; &gt;Line 318:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 318:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.528 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.528 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.785 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.785 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;style=&amp;quot;text-align:center&amp;quot;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|&amp;#160; &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Backbone-Neck-Head-EIOU&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|&amp;#160; &amp;lt;span style=&amp;quot;text-align: left; &amp;quot;&amp;gt;YOLOV5s-Backbone-Neck-Head-EIOU&amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|&amp;#160; &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.690 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;&amp;#160;&lt;/td&gt;&lt;td style=&quot;background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|&amp;#160; &amp;lt;span style=&amp;quot;text-align: center; &amp;quot;&amp;gt;0.690 &amp;lt;/span&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Rimni</name></author>	</entry>

	</feed>