Skip to content
2000
Volume 17, Issue 4
  • ISSN: 2666-2558
  • E-ISSN: 2666-2566

Abstract

Background: Visual question answering (VQA) is a challenging task that requires multimodal reasoning and knowledge. The objective of VQA is to answer natural language questions based on corresponding present information in a given image. The challenge of VQA is to extract visual and textual features and pass them into a common space. However, the method faces the challenge of object detection being present in an image and finding the relationship between objects. Methods: In this study, we explored different methods of feature fusion for VQA, using pretrained models to encode the text and image features and then applying different attention mechanisms to fuse them. We evaluated our methods on the DAQUAR dataset. Results: We used three metrics to measure the performance of our methods: WUPS, Acc, and F1. We found that concatenating raw text and image features performs slightly better than selfattention for VQA. We also found that using text as query and image as key and value performs worse than other methods of cross-attention or self-attention for VQA because it might not capture the bidirectional interactions between the text and image modalities. Conclusion: In this paper, we presented a comparative study of different feature fusion methods for VQA, using pre-trained models to encode the text and image features and then applying different attention mechanisms to fuse them. We showed that concatenating raw text and image features is a simple but effective method for VQA while using text as query and image as key and value is a suboptimal method for VQA. We also discussed the limitations and future directions of our work.

Loading

Article metrics loading...

/content/journals/rascs/10.2174/0126662558291150240102111855
2024-06-01
2025-01-06
Loading full text...

Full text loading...

/content/journals/rascs/10.2174/0126662558291150240102111855
Loading
This is a required field
Please enter a valid email address
Approval was a Success
Invalid data
An Error Occurred
Approval was partially successful, following selected items could not be processed due to error
Please enter a valid_number test