Volume 18, Issue 2
  • ISSN: 2352-0965
  • E-ISSN: 2352-0973

Abstract

Background

An image captioning system is a crucial component in the domains of computer vision and natural language processing. In recent years, deep neural networks have become an increasingly popular tool for generating descriptive captions for images.

Objective

However, these models frequently produce captions that are unoriginal and repetitive. Beam search is a well-known search technique used to generate image descriptions both effectively and efficiently.

Methods

The algorithm keeps track of a set of partial captions and expands them iteratively, selecting the most probable next words at each step until a complete caption is generated. This set of partial captions, known as the beam, is updated at every step based on the predicted probabilities of the next words. This paper presents an image caption generation system based on beam search. To encode the image data and generate captions, the system is trained on a deep neural network architecture.
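The expand-and-prune loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `next_word_probs` is a hypothetical stand-in for a trained decoder's next-word distribution, and the toy vocabulary is invented for demonstration.

```python
import math

def beam_search(start_token, end_token, next_word_probs, beam_width=3, max_len=10):
    """Keep a beam of partial captions; expand each with its likely next words."""
    beam = [(0.0, [start_token])]  # (log-probability, word sequence)
    completed = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beam:
            if seq[-1] == end_token:          # finished captions stop expanding
                completed.append((logp, seq))
                continue
            for word, p in next_word_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [word]))
        if not candidates:
            break
        # Prune: keep only the beam_width most probable partial captions.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    completed.extend(beam)
    return max(completed, key=lambda c: c[0])[1]

# Toy next-word distribution standing in for a trained decoder's softmax.
def toy_model(seq):
    table = {
        "<s>":  {"a": 0.6, "the": 0.4},
        "a":    {"dog": 0.9, "</s>": 0.1},
        "the":  {"cat": 0.5, "dog": 0.5},
        "dog":  {"runs": 0.7, "</s>": 0.3},
        "cat":  {"</s>": 1.0},
        "runs": {"</s>": 1.0},
    }
    return table[seq[-1]]

print(beam_search("<s>", "</s>", toy_model, beam_width=3))
# → ['<s>', 'a', 'dog', 'runs', '</s>']
```

Log-probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow on long captions; hypotheses that reach the end token are set aside so the beam keeps exploring alternatives.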

Results

This architecture combines the strengths of a CNN, which encodes the image, with an RNN, which generates the caption. Beam search is then applied to produce the final captions, yielding a more diverse and descriptive set of captions than traditional greedy decoding. The experimental results indicate that the proposed system outperforms existing image caption generation techniques in terms of both the precision and the variety of the generated captions.
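For contrast, greedy decoding, the baseline the beam approach is compared against, commits to a single word at every step. A minimal sketch with a hypothetical toy word distribution (not the paper's trained decoder) shows how this can miss the globally most probable caption:

```python
def greedy_decode(start_token, end_token, next_word_probs, max_len=10):
    # Greedy decoding keeps exactly one hypothesis: at each step it appends
    # the single most probable next word, so it cannot recover from an early
    # locally-optimal choice that leads to a worse overall sequence.
    seq = [start_token]
    for _ in range(max_len):
        if seq[-1] == end_token:
            break
        probs = next_word_probs(seq)
        seq.append(max(probs, key=probs.get))
    return seq

# Hypothetical next-word distribution, invented for illustration.
def toy_probs(seq):
    table = {
        "<s>": {"a": 0.55, "the": 0.45},
        "a":   {"cat": 0.6, "dog": 0.4},
        "the": {"dog": 0.9, "cat": 0.1},
        "cat": {"</s>": 1.0},
        "dog": {"</s>": 1.0},
    }
    return table[seq[-1]]

# Greedy takes "a" first (0.55 > 0.45) and ends with "a cat" (p = 0.33),
# even though "the dog" has higher overall probability (p = 0.405); a beam
# of width 2 would keep "the" alive and recover the better caption.
print(greedy_decode("<s>", "</s>", toy_probs))  # → ['<s>', 'a', 'cat', '</s>']
```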

Conclusion

This demonstrates the effectiveness of beam search in enhancing the efficiency of image caption generation systems.

DOI: 10.2174/0123520965254606231009091711
2023-10-18

  • Article Type: Research Article
Keyword(s): beam search; Captioning; CNN; InceptionResNet_v2; NLP; oracle-based training