Object detection remains one of the most fundamental and challenging problems in computer vision and image understanding. Deep neural network models and improved object representations have driven significant progress in the field. This paper examines in detail how object detection has evolved in recent years in the deep learning era, surveying the literature on a range of state-of-the-art object detection algorithms and the theoretical underpinnings of these techniques. Deep learning has enabled substantial innovation in object detection: Convolutional Neural Networks (CNNs) laid a solid foundation, and newer models such as You Only Look Once (YOLO) and Vision Transformers (ViTs) have expanded the possibilities further, offering high accuracy and fast detection in a variety of settings. Despite these advances, integrating CNNs, YOLO, and ViTs into a coherent framework still poses challenges in balancing computational demand, speed, and accuracy, especially in dynamic contexts. Real-time applications such as surveillance and autonomous driving require methods that leverage the strengths of each model type. The goal of this work is to present an object detection system that maximizes detection speed and accuracy while reducing processing requirements by integrating YOLO, CNNs, and ViTs. Specific objectives include improving real-time detection performance under changing weather and lighting conditions and detecting small or partially occluded objects in crowded urban scenes. We propose a hybrid architecture that uses CNNs for robust feature extraction, YOLO for rapid detection, and ViTs for global context capture via self-attention. The model is trained on an extensive dataset of urban scenes using a training regimen built around adaptive learning-rate schedules and data augmentation.
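The "global context capture via self-attention" that ViTs contribute can be sketched in a few lines. The following is a deliberately simplified, hypothetical illustration (single head, no learned query/key/value projections, not the paper's actual architecture): CNN feature-map cells are treated as a sequence of patch vectors, and scaled dot-product attention lets every patch mix in information from every other patch.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence of patch features.

    x: array of shape (num_patches, dim).
    Returns context-enriched features of the same shape.
    Simplified sketch: one head, no learned projections.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                        # pairwise patch affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over all patches
    return weights @ x                                   # each patch aggregates global context

# Hypothetical CNN output: an 8x8 feature grid of 32-dim vectors, flattened to 64 patches
feat = np.random.default_rng(0).normal(size=(64, 32))
ctx = self_attention(feat)
print(ctx.shape)  # (64, 32)
```

Because the attention weights span the entire grid, a patch covering a small or occluded object can draw on evidence from distant regions of the image, which is the property the hybrid design relies on ViTs for.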
Compared with standalone YOLO, CNN, or ViT models, the proposed model achieves higher detection accuracy. The improvement is especially pronounced in difficult conditions such as heavy occlusion and low light. It also reduces inference time relative to the baseline models, enabling real-time object detection without loss of performance. This work thus introduces a novel object detection method that integrates CNNs, YOLO, and ViTs in a synergistic way. The resulting framework extends the use of integrated deep learning models in practical applications while also setting a
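Detection-accuracy comparisons of the kind reported above are conventionally computed by matching predicted boxes to ground-truth boxes at an Intersection-over-Union (IoU) threshold. A minimal sketch of the IoU measure (standard practice, not the paper's evaluation code):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])  # overlap corner 1
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])  # overlap corner 2
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)                # overlap area, 0 if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)                     # intersection / union

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.1429
```

A prediction counts as a true positive when its IoU with an unmatched ground-truth box exceeds a threshold (0.5 is a common choice); precision and recall over these matches underlie the accuracy figures that the proposed model improves on.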