Recent Vision-Language Models (VLMs) such as BLIP, LLaVA, and Qwen-VL have achieved impressive results on multimodal tasks, yet they still fall short of genuine spatial and temporal reasoning. Many existing benchmarks conflate visual reasoning with general world knowledge and probe only shallow reasoning. Moreover, these models often struggle to understand complex spatial relations and dynamic scenes, in part because they make suboptimal use of visual features. To address these gaps, recent approaches such as SpatialRGPT, SpaceVLLM, VPD, and ST-VLM have introduced techniques including 3D scene graph integration, spatio-temporal queries, and kinematic instruction tuning to improve reasoning over space and time. This thesis proposes to build on these advances by developing new instruction-tuned models with improved data representations and architectural innovations. The goal is to enable robust spatio-temporal reasoning for applications in robotics, video analysis, and dynamic environment understanding.