Deep-View X-Modalities Visio-Linguistics (DV-XML) Features Engineering Image Retrieval Framework

Authors

  • Adel Alkhalil Department of Software Engineering, College of Computer Science and Engineering, University of Hail, Hail, Saudi Arabia
Volume: 15 | Issue: 2 | Pages: 21951-21962 | April 2025 | https://doi.org/10.48084/etasr.10175

Abstract

This research proposes an advanced framework for efficient image retrieval by integrating visual and linguistic modalities into a unified system. The Deep-View X-Modalities Visio-Linguistics (DV-XML) framework is designed to handle user queries that include both text and image inputs while allowing modifications to align with user preferences. By employing a multimodal Content-Based Image Retrieval (CBIR) system, the framework combines features extracted by a ResNet-50 model for images and a Bidirectional Encoder Representations from Transformers (BERT) model for textual data. These features are harmonized using an inductive learning-based fusion methodology within Multi-Layer Perceptrons (MLPs). A novel Reverse Re-ranking (RR) algorithm enhances retrieval accuracy by optimally aligning the combined representations with the target images during inference. Extensive evaluations on the Fashion-200K and MIT-States datasets demonstrate the model's superior performance compared to baseline CBIR methods. This work advances the field by efficiently merging dual modalities and streamlining the retrieval process with innovative RR strategies, setting a benchmark for future research in multimodal image retrieval systems.
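
The fusion-then-retrieve flow described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the feature vectors are random stand-ins for ResNet-50 and BERT outputs, the MLP weights are untrained, concatenation is assumed as the fusion input, and the dimensions and function names (`fuse`, `rank`, `init_layer`) are hypothetical. The Reverse Re-ranking step is omitted because the abstract does not specify it.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def l2_normalize(x):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def init_layer(n_in, n_out, rng):
    """Random weights stand in for trained parameters (assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse(image_feat, text_feat, layers):
    """Concatenate both modalities and pass them through a two-layer MLP."""
    (w1, b1), (w2, b2) = layers
    hidden = [math.tanh(v) for v in linear(image_feat + text_feat, w1, b1)]
    return l2_normalize(linear(hidden, w2, b2))


def rank(query, gallery):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    scores = [sum(q * g for q, g in zip(query, l2_normalize(item))) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


rng = random.Random(0)

# Toy sizes; pooled ResNet-50 features are 2048-d and BERT-base outputs are 768-d.
IMG_DIM, TXT_DIM, HID_DIM, OUT_DIM = 8, 6, 10, 8
layers = [
    init_layer(IMG_DIM + TXT_DIM, HID_DIM, rng),
    init_layer(HID_DIM, OUT_DIM, rng),
]

image_feat = [rng.gauss(0, 1) for _ in range(IMG_DIM)]  # stand-in for image features
text_feat = [rng.gauss(0, 1) for _ in range(TXT_DIM)]   # stand-in for text features
query = fuse(image_feat, text_feat, layers)

# Gallery of target-image embeddings; the last entry is a planted exact match.
gallery = [[rng.gauss(0, 1) for _ in range(OUT_DIM)] for _ in range(5)]
gallery.append(list(query))

ranking = rank(query, gallery)
print(ranking)
```

In a trained system the MLP would be optimized so that the fused query embedding lands near the embedding of the desired target image; here the planted match only demonstrates that the ranking step orders candidates by similarity.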

Keywords:

image retrieval framework, image and textual embedding extraction, ResNet, BERT-base/large, multimodal embedding fusion, reverse re-ranking

How to Cite

[1]
Alkhalil, A. 2025. Deep-View X-Modalities Visio-Linguistics (DV-XML) Features Engineering Image Retrieval Framework. Engineering, Technology & Applied Science Research. 15, 2 (Apr. 2025), 21951–21962. DOI:https://doi.org/10.48084/etasr.10175.
