Cross-attention Based Text-image Transformer for Visual Question Answering

Mahdi Rezapour

doi:10.2174/0126662558291150240102111855

ISSN: 2666-2558
E-ISSN: 2666-2566

Source: Recent Advances in Computer Science and Communications, Volume 17, Issue 4, Jun 2024, p. 72 - 78
DOI: https://doi.org/10.2174/0126662558291150240102111855
- Available online: 01 Jun 2024

Abstract

Background: Visual question answering (VQA) is a challenging task that requires multimodal reasoning and knowledge. The objective of VQA is to answer natural language questions based on the information present in a given image. The challenge of VQA is to extract visual and textual features and project them into a common space. The task also requires detecting the objects present in an image and finding the relationships between them.

Methods: In this study, we explored different methods of feature fusion for VQA, using pre-trained models to encode the text and image features and then applying different attention mechanisms to fuse them. We evaluated our methods on the DAQUAR dataset.

Results: We used three metrics to measure the performance of our methods: WUPS, accuracy (Acc), and F1. We found that concatenating raw text and image features performs slightly better than self-attention for VQA. We also found that using text as query and image as key and value performs worse than other methods of cross-attention or self-attention for VQA, possibly because it does not capture the bidirectional interactions between the text and image modalities.

Conclusion: In this paper, we presented a comparative study of different feature fusion methods for VQA, using pre-trained models to encode the text and image features and then applying different attention mechanisms to fuse them. We showed that concatenating raw text and image features is a simple but effective method for VQA, while using text as query and image as key and value is a suboptimal method for VQA. We also discussed the limitations and future directions of our work.
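The fusion strategies compared in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`fuse_concat`, `fuse_self_attention`, `fuse_text_to_image`), the mean pooling, the toy feature values, and the omission of learned query/key/value projections are all assumptions made for illustration. It is written in plain Python so that it runs without dependencies, and it assumes the text and image features have already been projected to the same dimension.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d)) V.

    Assumption: no learned projection matrices; a trained model would
    apply W_Q, W_K and W_V before this step.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def mean_pool(rows):
    """Average a sequence of feature vectors into a single vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fuse_concat(text_tokens, image_patches):
    """Concatenate the pooled raw text and image features."""
    return mean_pool(text_tokens) + mean_pool(image_patches)


def fuse_self_attention(text_tokens, image_patches):
    """Self-attention over the joint text + image sequence (hypothetical setup)."""
    joint = text_tokens + image_patches
    return mean_pool(attention(joint, joint, joint))


def fuse_text_to_image(text_tokens, image_patches):
    """Cross-attention with text as query and image as key and value."""
    return mean_pool(attention(text_tokens, image_patches, image_patches))


# Toy features (hypothetical): 2 text tokens and 3 image patches, dimension 2.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

concat_vec = fuse_concat(text_tokens, image_patches)
self_vec = fuse_self_attention(text_tokens, image_patches)
cross_vec = fuse_text_to_image(text_tokens, image_patches)
print(len(concat_vec), len(self_vec), len(cross_vec))
```

In the text-to-image variant each text token attends over the image patches, but the patches never attend back to the text, which is the one-directional flow the abstract suggests may explain its weaker results. In practice the fused vector would be passed to an answer classifier, which is omitted here.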


Article Type: Research Article

Keyword(s): attention mechanism; convolutional neural network; feature fusion; multimodal reasoning; text-image; visual question answering
