Grounding and reasoning over space and time in Vision-Language Models (VLMs)

Recent Vision-Language Models (VLMs) like BLIP, LLaVA, and Qwen-VL have achieved impressive results in multimodal tasks but still face limitations in true spatial and temporal reasoning. Many current benchmarks conflate visual reasoning with general knowledge and involve shallow reasoning tasks. Furthermore, these models often struggle with understanding complex spatial relations and dynamic scenes due to suboptimal visual feature usage. To address this, recent approaches such as SpatialRGPT, SpaceVLLM, VPD, and ST-VLM have introduced techniques like 3D scene graph integration, spatio-temporal queries, and kinematic instruction tuning to improve reasoning over space and time. This thesis proposes to build on these advances by developing new instruction-tuned models with improved data representation and architectural innovations. The goal is to enable robust spatio-temporal reasoning for applications in robotics, video analysis, and dynamic environment understanding.

Adaptive and explainable Video Anomaly Detection

Video Anomaly Detection (VAD) aims to automatically identify unusual events in video that deviate from normal patterns. Existing methods often rely on One-Class or Weakly Supervised learning: the former uses only normal data for training, while the latter leverages video-level labels. Recent advances in Vision-Language Models (VLMs) and Large Language Models (LLMs) have improved both the performance and explainability of VAD systems. Despite progress on public benchmarks, challenges remain. Most methods are limited to a single domain, leading to performance drops when applied to new datasets with different anomaly definitions. Additionally, they assume all training data is available upfront, which is unrealistic for real-world deployment where models must adapt to new data over time. Few approaches explore multimodal adaptation using natural language rules to define normal and abnormal events, offering a more intuitive and flexible way to update VAD systems without needing new video samples.

This PhD research aims to develop adaptable Video Anomaly Detection methods capable of handling new domains or anomaly types using few video examples and/or textual rules.

The main lines of research will be the following:
• Cross-Domain Adaptation in VAD: improving robustness against domain gaps through Few-Shot adaptation;
• Continual Learning in VAD: continually enriching the model to deal with new types of anomalies;
• Multimodal Few-Shot Learning: facilitating the model adaptation process through rules in natural language.
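The Few-Shot adaptation thread above can be sketched in its simplest form: build a normality prototype from a handful of normal clips in the new domain and score incoming clips by their distance to it. The feature vectors and values below are illustrative placeholders for real video features, not part of any proposed method:

```python
import math

def prototype(features):
    """Mean feature vector of a few normal support clips."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]

def anomaly_score(clip_feature, proto):
    """Euclidean distance to the normal prototype: larger = more anomalous."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clip_feature, proto)))

# Hypothetical pre-extracted clip features from a new target domain.
support_normal = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]
proto = prototype(support_normal)

normal_clip = [1.0, 0.05]
abnormal_clip = [3.0, 2.0]
assert anomaly_score(abnormal_clip, proto) > anomaly_score(normal_clip, proto)
```

In a real system the prototype would be computed over features from a pretrained video backbone, and the decision threshold calibrated on the few available target-domain clips.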

A theoretical framework for the task-based optimal design of Modular and Reconfigurable Serial Robots for rapid deployment

The innovations that gave rise to industrial robots date back to the sixties and seventies. They have enabled a massive deployment of industrial robots that transformed factory floors, at least in industrial sectors such as car manufacturing and other mass production lines.

However, such robots do not fit the requirements of other interesting applications that appeared and developed in fields such as laboratory research, space robotics, medical robotics, automation in inspection and maintenance, agricultural robotics, service robotics and, of course, humanoids. A small number of these sectors have seen large-scale deployment and commercialization of robotic systems, with most others advancing slowly and incrementally towards that goal.

This raises the following question: is it due to unsuitable hardware (insufficient physical capabilities to generate the required motions and forces); limited software capabilities (control systems, perception, decision support, learning, etc.); or a lack of new design paradigms capable of meeting the needs of these applications (agile and scalable custom-design approaches)?

The unprecedented explosion of data science, machine learning and AI in all areas of science, technology and society may be seen as a compelling solution, and a radical transformation is taking shape (or is anticipated), with the promise of empowering the next generations of robots with AI (both predictive and generative). Research therefore tends to pay increasing attention to the software aspects (learning, decision support, coding, etc.), perhaps to the detriment of more advanced physical capabilities (hardware) and new concepts (design paradigms). It is clear, however, that the cognitive aspects of robotics, including learning, control and decision support, are useful if and only if suitable physical embodiments are available to meet the needs of the various tasks that can be robotized, hence the need for adapted design methodologies and hardware.

The aim of this thesis is thus to focus on design paradigms and hardware, and in particular on the optimal design of rapidly produced serial robots based on given families of standardized « modules » whose layout will be optimized according to the requirements of tasks that cannot be performed by the industrial robots available on the market. The ambition is to answer the question of whether and how a paradigm shift is possible in robot design, from fixed-catalogue products to rapidly available bespoke machines.
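To make the task-based layout optimization idea concrete, here is a deliberately minimal sketch: an exhaustive search over sequences of standardized link modules, selecting the cheapest layout whose planar kinematic reach covers a set of required task radii. The module catalogue, costs, and the reach model are all illustrative assumptions, standing in for the far richer task requirements (forces, payloads, dexterity) a real design framework would handle:

```python
from itertools import product

# Hypothetical module catalogue: (link length in m, unit cost).
CATALOGUE = {"S": (0.2, 1.0), "M": (0.4, 1.5), "L": (0.8, 2.5)}

def reach_interval(lengths):
    """Reachable radial interval of a planar serial arm with revolute joints."""
    outer = sum(lengths)
    inner = max(0.0, 2 * max(lengths) - outer)
    return inner, outer

def cheapest_layout(task_radii, max_modules=3):
    """Exhaustively search module sequences covering all required task radii."""
    best = None
    for n in range(1, max_modules + 1):
        for layout in product(CATALOGUE, repeat=n):
            lengths = [CATALOGUE[m][0] for m in layout]
            cost = sum(CATALOGUE[m][1] for m in layout)
            inner, outer = reach_interval(lengths)
            if all(inner <= r <= outer for r in task_radii):
                if best is None or cost < best[1]:
                    best = (layout, cost)
    return best

layout, cost = cheapest_layout([0.5, 0.9])  # task: reach points at these radii
```

Exhaustive search only scales to a handful of modules; the thesis would need principled optimization over much larger design spaces and richer task models.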

The successful candidate will enrol at the « Ecole Doctorale Mathématiques, STIC » of Nantes Université (ED-MASTIC), and he or she will be hosted for three years in the CEA-LIST Interactive Robotics Unit under the supervision of Dr Farzam Ranjbaran. Professors Yannick Aoustin (Nantes) and Clément Gosselin (Laval) will provide academic guidance and joint supervision for a successful completion of the thesis.

A follow-up to this thesis is strongly considered in the form of a one-year Post-Doctoral fellowship to which the candidate will be able to apply, upon successful completion of all the requirements of the PhD Degree. This Post-Doctoral fellowship will be hosted at the « Centre de recherche en robotique, vision et intelligence machine (CeRVIM) », Université Laval, Québec, Canada.

Internalisation of external knowledge by foundation models

To perform an unknown task, a subject (human or robot) has to consult external information, which involves a cognitive cost. After several similar experiences, the subject masters the situation and can act automatically. The 1980s and 1990s saw explorations in AI using conceptual graphs and schemas, but their large-scale implementation was limited by the technology available at the time.

Today's neural models, including transformers and LLMs/VLMs, learn universal representations through pre-training on huge amounts of data. They can be used with prompts to provide local context. Fine-tuning allows these models to be specialised for specific tasks.

RAG and GraphRAG methods can be used to exploit external knowledge, but their use at inference time is resource-intensive. This thesis proposes a cognitivist approach in which the system undergoes continuous learning. It consults external sources during inference and regularly uses this information to refine itself, much as the brain consolidates memories during sleep. This method aims to improve performance and reduce resource consumption.
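The consult-then-consolidate loop described above can be sketched as a toy agent: it pays a cost to query an external source at inference time, logs what it retrieved, and during an offline "sleep" phase internalises frequently used facts so later queries are answered locally. The knowledge base, cost model and consolidation rule are all illustrative; in the thesis, consolidation would correspond to fine-tuning the model on retrieved material rather than copying dictionary entries:

```python
EXTERNAL_KB = {"capital:France": "Paris", "capital:Japan": "Tokyo"}  # illustrative

class ConsolidatingAgent:
    def __init__(self, consolidate_after=2):
        self.internal = {}       # knowledge already internalised
        self.episodic_log = {}   # retrieval counts since last "sleep"
        self.external_calls = 0
        self.consolidate_after = consolidate_after

    def answer(self, query):
        if query in self.internal:       # cheap, internalised path
            return self.internal[query]
        self.external_calls += 1         # costly external consultation (RAG step)
        answer = EXTERNAL_KB[query]
        self.episodic_log[query] = self.episodic_log.get(query, 0) + 1
        return answer

    def sleep(self):
        """Offline consolidation: internalise frequently consulted facts."""
        for query, count in self.episodic_log.items():
            if count >= self.consolidate_after:
                self.internal[query] = EXTERNAL_KB[query]
        self.episodic_log = {}

agent = ConsolidatingAgent()
for _ in range(2):
    agent.answer("capital:France")   # two costly external consultations
agent.sleep()                        # fact is internalised offline
agent.answer("capital:France")       # now answered internally, at no extra cost
```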

In humans, these processes are linked to the spatial organisation of the brain. The thesis will also study network architectures inspired by this organisation, with dedicated but interconnected “zones”, such as the vision-language and language models.

These concepts can be applied to the Astir and Ridder projects, which aim to exploit foundation models for software engineering in robotics and the development of generative AI methods for the safe control of robots.

Fine-grained and spatio-temporally grounded large multimodal models

This PhD project focuses on enhancing Large Multimodal Models (LMMs) through the integration of fine-grained and spatio-temporal information into training datasets. While current LMMs such as CLIP and Flamingo show strong performance, they rely on noisy and coarse-grained image-text pairs and often lack spatial or temporal grounding. The thesis aims to develop automatic pipelines to enrich image datasets with geographic and temporal metadata, refine captions using fine-grained semantic descriptors, and balance dataset diversity and compactness by controlling class-wise sample sizes.

Training strategies will incorporate hierarchical class structures and adapt protocols to improve alignment between caption elements and image regions. The work will also explore joint training regimes that integrate fine-grained, spatial, and temporal dimensions, and propose set-based inference to improve the diversity of generated outputs. The enriched datasets and models will be evaluated using existing or newly developed benchmarks targeting contextual relevance and output diversity. The project also addresses challenges in metadata accuracy, efficient model adaptation, and benchmarking methodologies for multi-dimensional model evaluation.
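The dataset balancing step mentioned above, controlling class-wise sample sizes to trade diversity against compactness, can be sketched as a simple capping rule. Class names, sample counts and the cap are illustrative; a real pipeline would combine this with the metadata enrichment and caption refinement steps:

```python
from collections import defaultdict

def balance_by_class(samples, cap):
    """Keep at most `cap` samples per class to bound over-represented classes."""
    kept, counts = [], defaultdict(int)
    for image_id, label in samples:
        if counts[label] < cap:
            kept.append((image_id, label))
            counts[label] += 1
    return kept

# Hypothetical skewed dataset: 100 "car" images, 5 "tram" images.
samples = [(i, "car") for i in range(100)] + [(100 + i, "tram") for i in range(5)]
balanced = balance_by_class(samples, cap=10)
```

After capping, the rare class keeps all 5 samples while the dominant class is cut to 10, yielding a far more balanced training distribution.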

Applications include improved synthetic data generation for autonomous driving, enhanced annotation of media archives through contextual captioning, and better visual reasoning in industrial simulation scenarios.

Automatic modelling language variations for socially responsive chatbots

Conversational agents are increasingly present in our daily lives thanks to advances in natural language processing and artificial intelligence and are attracting growing interest. However, their ability to understand human communication in all its complexity remains a major challenge. This PhD project aims to model linguistic variation to develop agents capable of socially adaptive interactions, taking into account the socio-demographic profile and emotional state of their interlocutors. It also focuses on evaluating linguistic cues at different levels, leveraging both spoken and written language varieties, and assessing the generalization capacity of models trained on multilingual and multi-situational data, with the goal of improving interaction modeling with conversational agents.

Compositional Generalization of Visual Language Models

The advent of foundation models has raised state-of-the-art performance on a large number of tasks in several fields of AI, in particular computer vision and natural language processing. However, despite the huge amount of data used to train them, these models are still limited in their ability to generalize, in particular to use cases in specific domains that are not well represented on the Web. A way to formalize this issue is compositional generalization, i.e. generalizing to a new, unseen concept from concepts learned during training. This "generalization" is the ability to learn disentangled concepts and to recombine them into unseen compositions when the model is in production. The proposed thesis will address this issue, aiming to propose visual representations that enable generic visual language models to generalize compositionally within specific domains. It will investigate strategies to reduce shortcut learning, promoting deeper understanding of compositional structures in multimodal data. It will also address the problem of compositional generalization beyond simple attribute–object pairs, capturing more subtle and complex semantics. The thesis aims at progress at a fairly theoretical level, but has many potential practical applications in the fields of health, administration and services, security and defense, manufacturing and agriculture.
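A minimal sketch of how compositional generalization is typically measured (the attribute–object pairs are illustrative): the evaluation split contains only compositions that never occur in training, while every attribute and object primitive in the test set was seen during training in some other combination:

```python
def compositional_split(pairs, held_out):
    """Split attribute-object pairs so held-out compositions never occur in training."""
    train = [p for p in pairs if p not in held_out]
    test = list(held_out)
    # Sanity check: every primitive in the test set was seen during training.
    train_attrs = {a for a, _ in train}
    train_objs = {o for _, o in train}
    assert all(a in train_attrs and o in train_objs for a, o in test)
    return train, test

pairs = [("red", "car"), ("red", "apple"), ("green", "car"), ("green", "apple")]
train, test = compositional_split(pairs, held_out=[("green", "apple")])
```

A model that has truly disentangled "green" from "apple" should recognize the held-out composition; a model relying on shortcut co-occurrence statistics will not.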

High mobility mobile manipulator control in a dynamic context

The development of mobile manipulators capable of adapting to new conditions is a major step forward in the development of new means of production, whether for industrial or agricultural applications. Such technologies enable repetitive tasks to be carried out with precision and without the constraints of a limited workspace. Nevertheless, the efficiency of such robots depends on their ability to adapt to a changing environment and to the task to be performed. This thesis therefore proposes to design mechanisms for adapting the sensory-motor behaviors of this type of robot, in order to ensure that their actions are appropriate to the situation. It envisages extending the reconfiguration capabilities of perception and control approaches through Artificial Intelligence, here understood in the sense of deep learning. The aim is to develop new decision-making architectures capable of optimizing robotic behaviors for mobile manipulation in changing contexts (notably indoor-outdoor), and for carrying out a range of precision tasks.

Scalability of the Network Digital Twin in Complex Communication Networks

Communication networks are experiencing exponential growth, both in the deployment of network infrastructures (particularly visible in the gradual and sustained evolution towards 6G networks) and in the number of machines, covering a wide range of devices from Cloud servers to lightweight embedded IoT components (e.g. System on Chip: SoC), including mobile terminals such as smartphones.

This ecosystem also encompasses a variety of software components, ranging from applications (e.g. A/V streaming) to the protocols of the different communication network layers. Furthermore, such an ecosystem is intrinsically dynamic because of the following features:
- Changes in network topology: due, for example, to hardware/software failures, user mobility, operator network resource management policies, etc.
- Changes in the usage/consumption of network resources (bandwidth, memory, CPU, battery, etc.), driven by user needs, operator network resource management policies, etc.

To ensure effective supervision and management of communication networks, whether fine-grained or at an abstract level, various network management services/platforms, such as SNMP, CMIP, LWM2M, CoMI and SDN, have been proposed and documented in the networking literature and standards bodies. These management platforms have been broadly adopted by network operators, service providers and industry; they often incorporate advanced features, including automated control loops (e.g. rule-based, expert-system-based, ML-based), further enhancing their capability to optimize network management operations.

Despite the extensive exploration and exploitation of these network management platforms, they do not guarantee an effective (re)configuration without intrinsic risks/errors, which can cause serious outages to network applications and services. This is particularly true when the objective of the network (re)configuration is real-time optimization of the network, analysis/tests in operational mode (what-if analysis), planning updates/modernizations/extensions of the communication network, etc. For such (re)configuration objectives, a new network management paradigm has to be designed.

In recent years, the communication network research community has started exploring the adoption of the digital twin concept in the networking context (Network Digital Twin: NDT). The objective of this adoption is to support the management of the communication network for various purposes, including those mentioned in the previous paragraph.

The NDT is a digital twin of the real/physical communication network (Physical Twin Network: PTN), making it possible to manipulate a digital copy of the real communication network without risk. This allows, in particular, for visualizing/predicting the evolution (or the behavior, or the state) of the real network if a given network configuration is applied. Beyond this aspect, the NDT and the PTN exchange information via one or more communication interfaces with the aim of keeping the states of the NDT and the PTN synchronized.

Nonetheless, setting up a network digital twin (NDT) is not a simple task. Frequent, real-time PTN-NDT synchronization poses a scalability problem for complex networks, where every piece of network information is likely to be reported at the NDT level (e.g. a very large number of network entities, very dynamic topologies, a large volume of information per node/per network link).
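One natural direction for this scalability problem is selective synchronization: report to the NDT only state changes that are significant, rather than every update. The sketch below illustrates the idea with a relative-change threshold; the state variables and threshold are illustrative, and the thesis would instead learn which information to select and predict via machine learning:

```python
def delta_sync(ndt_state, ptn_state, threshold=0.1):
    """Send only metrics whose relative change since the last sync exceeds threshold."""
    updates = {}
    for key, new_value in ptn_state.items():
        old_value = ndt_state.get(key)
        if old_value is None or abs(new_value - old_value) > threshold * abs(old_value):
            updates[key] = new_value
    ndt_state.update(updates)
    return updates   # only this crosses the PTN-NDT interface

ndt = {"link1.bw": 100.0, "node3.cpu": 0.50}
# Bandwidth moved by 1% (below threshold), CPU load by 60% (above threshold):
updates = delta_sync(ndt, {"link1.bw": 101.0, "node3.cpu": 0.80})
```

Here only the CPU metric is transmitted, so the synchronization traffic shrinks at the cost of a bounded staleness in the twin, exactly the trade-off the thesis proposes to study.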

Various scientific contributions have attempted to address the question of the NDT. State-of-the-art contributions focus on establishing scenarios, requirements, and architectures for the NDT. Nevertheless, the literature does not tackle the scalability problem of the NDT.

The objective of this PhD thesis is to address the scalability problem of network digital twins by exploring new machine learning models for network information selection and prediction.

Defense of scene analysis models against adversarial attacks

In many applications, scene analysis modules such as object detection and recognition, or pose recognition, are required. Deep neural networks are nowadays among the most efficient models for performing a large number of vision tasks, sometimes simultaneously in the case of multitask learning. However, it has been shown that they are vulnerable to adversarial attacks: it is possible to add to the input data perturbations, imperceptible to the human eye, that undermine the results of inference by the neural network. A guarantee of reliable results is, however, essential for security-critical applications such as autonomous vehicles or person search in video surveillance. Different types of adversarial attacks and defenses have been proposed, most often for the classification problem (of images, in particular). Some works have addressed attacks on embeddings optimized by metric learning, used especially for open-set tasks such as object re-identification, facial recognition or content-based image retrieval. The types of attacks have multiplied: some are universal, others optimized on a particular instance. The proposed defenses must deal with new threats without sacrificing too much of the initial performance of the model. Protecting decision systems against adversarial attacks on their input data is essential where security vulnerabilities are critical. The objective will therefore be to study and propose attacks and defenses applicable to scene analysis modules, especially those for object detection and object instance search in images.
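To make the perturbation idea concrete, here is a minimal sketch of the fast gradient sign method (FGSM), one of the classic attacks, applied to a toy linear classifier. The weights and input are illustrative; real attacks target deep networks, where the gradient is obtained by backpropagation rather than read off in closed form:

```python
def sign(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def fgsm_attack(x, w, epsilon):
    """FGSM on a linear scorer f(x) = w.x + b: the gradient of the score w.r.t.
    x is simply w, so each input component moves by epsilon in the direction
    -sign(w) that lowers the score of the true (positive) class."""
    return [xi - epsilon * sign(wi) for xi, wi in zip(x, w)]

def predict(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

w, b = [0.5, -0.3, 0.8], -0.2
x = [0.4, 0.1, 0.2]                    # correctly classified as positive
x_adv = fgsm_attack(x, w, epsilon=0.15)

assert predict(x, w, b) and not predict(x_adv, w, b)
```

The per-component perturbation is bounded by epsilon (here 0.15), yet it is enough to flip the decision, which is precisely why imperceptible perturbations are dangerous for security-critical scene analysis.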
