In the context of smart image sensors, there is growing demand to move beyond simple inference tasks such as classification or object detection toward more complex applications that provide a semantic understanding of the scene. Among these applications, Visual Question Answering (VQA) enables AI systems to answer natural-language questions about images. This project aims to develop an efficient VQA system that combines a visual encoder based on Binary Neural Networks (BNNs) with a compact language model (tiny LLM). Although a full hardware implementation of LLMs remains out of reach, this project represents a significant step in that direction by using a BNN to analyze the context and the relationships between objects in the scene. The encoder processes images with low resource consumption, enabling real-time deployment on edge devices. Attention mechanisms may be incorporated to extract the semantic information required for scene understanding. The language model can be stored locally and fine-tuned jointly with the BNN to generate precise, contextually relevant answers.
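To make the architecture concrete, the sketch below shows one possible shape for such a binarized encoder: a straight-through estimator (STE) for the sign function, XNOR-style convolutional blocks, and a projection of the feature map into visual tokens for the LLM. This is a minimal PyTorch sketch under stated assumptions, not the project's actual design; the names (BinarizeSTE, BinaryConv, BNNEncoder) are hypothetical, and keeping the first layer full-precision follows common BNN practice rather than any requirement of this project.

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through gradient estimator."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)  # note: sign(0) = 0, a standard caveat

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Clipped STE: pass gradients only where |x| <= 1.
        return grad_output * (x.abs() <= 1).float()

class BinaryConv(nn.Module):
    """3x3 conv with binarized weights and activations (XNOR-style)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        xb = BinarizeSTE.apply(x)                 # binarize activations
        wb = BinarizeSTE.apply(self.conv.weight)  # binarize weights
        out = nn.functional.conv2d(xb, wb, stride=self.conv.stride,
                                   padding=self.conv.padding)
        return self.bn(out)

class BNNEncoder(nn.Module):
    """Tiny binary backbone producing visual tokens for a tiny LLM."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, 2, 1)  # first layer full-precision
        self.blocks = nn.Sequential(
            BinaryConv(32, 64, stride=2),
            BinaryConv(64, 128, stride=2),
            BinaryConv(128, 128, stride=2),
        )
        self.proj = nn.Linear(128, embed_dim)  # map into LLM embedding space

    def forward(self, img):                    # img: (B, 3, H, W)
        f = self.blocks(self.stem(img))        # (B, 128, h, w)
        tokens = f.flatten(2).transpose(1, 2)  # (B, h*w, 128)
        return self.proj(tokens)               # (B, h*w, embed_dim)
```

The binary weights and activations are what make such an encoder attractive for edge hardware: multiply-accumulate operations reduce to XNOR and popcount, drastically lowering memory and compute cost.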
This project offers an opportunity for candidates interested in Tiny Deep Learning and LLMs. It opens a broad field of research with room for significant contributions and results applicable to concrete applications. The work will consist of developing a robust BNN topology for semantic scene analysis under hardware constraints on memory and computation, and of integrating and jointly optimizing the BNN encoder with the LLM, while ensuring the resulting VQA system remains coherent and performant across different types of queries.
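As one way to picture the joint optimization, the sketch below shows a single training step in which the encoder's visual tokens are prepended to the question embeddings and both modules are updated with a next-token cross-entropy loss on the answer. The tiny_llm interface assumed here (an embed method and a call accepting inputs_embeds that returns per-token logits) is an illustrative assumption, not an actual API; padding, masking, and tokenization details are omitted.

```python
import torch

def vqa_train_step(encoder, tiny_llm, optimizer,
                   img, question_ids, answer_ids):
    """One hypothetical joint fine-tuning step for the BNN+LLM pair."""
    vis = encoder(img)                    # (B, N, D) visual tokens
    q = tiny_llm.embed(question_ids)      # (B, T, D) question embeddings
    a = tiny_llm.embed(answer_ids)        # (B, A, D) teacher-forced answer
    inputs = torch.cat([vis, q, a], dim=1)
    logits = tiny_llm(inputs_embeds=inputs)  # (B, N+T+A, vocab)
    A = answer_ids.size(1)
    # With next-token prediction, positions N+T-1 .. N+T+A-2 emit the answer.
    pred = logits[:, -A - 1:-1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), answer_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()  # the STE lets gradients flow through the binary layers
    optimizer.step()
    return loss.item()
```

Because the straight-through estimator makes the binary layers differentiable in practice, a single optimizer can update the encoder and the language model together, which is what allows the two components to be co-adapted rather than trained in isolation.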