In the context of smart image sensors, there is growing demand to move beyond simple inference tasks such as classification or object detection toward more complex applications that provide a semantic understanding of the scene. Among these applications, Visual Question Answering (VQA) enables AI systems to answer natural-language questions about images. This project aims to develop an efficient VQA system that combines a visual encoder based on Binary Neural Networks (BNNs) with a compact language model (tiny LLM). Although a full hardware implementation of LLMs remains out of reach, this project represents a significant step in that direction by using a BNN to analyze the context and the relationships between objects in the scene. The encoder processes images with low resource consumption, enabling real-time deployment on edge devices. Attention mechanisms may be incorporated to extract the semantic information required for scene understanding. The language model can be stored locally and fine-tuned jointly with the BNN to generate precise, contextually relevant answers.
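To make the architecture concrete, the sketch below shows one possible shape for such a binarized encoder: a straight-through estimator (STE) for the sign function, XNOR-style convolutional blocks, and a projection of the feature map into visual tokens for the LLM. This is a minimal PyTorch sketch under stated assumptions, not the project's actual design; the names (BinarizeSTE, BinaryConv, BNNEncoder) are hypothetical, and keeping the first layer full-precision follows common BNN practice rather than any requirement of this project.

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through gradient estimator."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)  # note: sign(0) = 0, a standard caveat

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Clipped STE: pass gradients only where |x| <= 1.
        return grad_output * (x.abs() <= 1).float()

class BinaryConv(nn.Module):
    """3x3 conv with binarized weights and activations (XNOR-style)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        xb = BinarizeSTE.apply(x)                 # binarize activations
        wb = BinarizeSTE.apply(self.conv.weight)  # binarize weights
        out = nn.functional.conv2d(xb, wb, stride=self.conv.stride,
                                   padding=self.conv.padding)
        return self.bn(out)

class BNNEncoder(nn.Module):
    """Tiny binary backbone producing visual tokens for a tiny LLM."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, 2, 1)  # first layer full-precision
        self.blocks = nn.Sequential(
            BinaryConv(32, 64, stride=2),
            BinaryConv(64, 128, stride=2),
            BinaryConv(128, 128, stride=2),
        )
        self.proj = nn.Linear(128, embed_dim)  # map into LLM embedding space

    def forward(self, img):                    # img: (B, 3, H, W)
        f = self.blocks(self.stem(img))        # (B, 128, h, w)
        tokens = f.flatten(2).transpose(1, 2)  # (B, h*w, 128)
        return self.proj(tokens)               # (B, h*w, embed_dim)
```

The binary weights and activations are what make such an encoder attractive for edge hardware: multiply-accumulate operations reduce to XNOR and popcount, drastically lowering memory and compute cost.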
This project offers an opportunity for candidates interested in Tiny Deep Learning and LLMs. It opens a broad field of research with room for significant contributions and results applicable to concrete applications. The work will consist of developing a robust BNN topology for semantic scene analysis under hardware constraints on memory and computation, and of integrating and jointly optimizing the BNN encoder with the LLM, while ensuring the resulting VQA system remains coherent and performant across different types of queries.
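As one way to picture the joint optimization, the sketch below shows a single training step in which the encoder's visual tokens are prepended to the question embeddings and both modules are updated with a next-token cross-entropy loss on the answer. The tiny_llm interface assumed here (an embed method and a call accepting inputs_embeds that returns per-token logits) is an illustrative assumption, not an actual API; padding, masking, and tokenization details are omitted.

```python
import torch

def vqa_train_step(encoder, tiny_llm, optimizer,
                   img, question_ids, answer_ids):
    """One hypothetical joint fine-tuning step for the BNN+LLM pair."""
    vis = encoder(img)                    # (B, N, D) visual tokens
    q = tiny_llm.embed(question_ids)      # (B, T, D) question embeddings
    a = tiny_llm.embed(answer_ids)        # (B, A, D) teacher-forced answer
    inputs = torch.cat([vis, q, a], dim=1)
    logits = tiny_llm(inputs_embeds=inputs)  # (B, N+T+A, vocab)
    A = answer_ids.size(1)
    # With next-token prediction, positions N+T-1 .. N+T+A-2 emit the answer.
    pred = logits[:, -A - 1:-1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), answer_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()  # the STE lets gradients flow through the binary layers
    optimizer.step()
    return loss.item()
```

Because the straight-through estimator makes the binary layers differentiable in practice, a single optimizer can update the encoder and the language model together, which is what allows the two components to be co-adapted rather than trained in isolation.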