GenPhi: 3D Generative AI conditioned on geometry, structure and physics

The aim of this thesis is to design new 3D model generators based on Generative Artificial Intelligence (GenAI), capable of producing faithful, coherent and physically viable shapes. While 3D generation has become essential in many fields, current automatic generation approaches struggle to respect geometric, structural and physical constraints. The goal is to develop methods that integrate constraints related to geometry, topology, internal structure and physical laws, both stationary (equilibrium, statics) and dynamic (kinematics, deformation), right from the generation stage. The study will combine geometric perception, semantic enrichment and physical simulation approaches to produce robust, realistic 3D models that can be exploited directly, without human intervention.
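As a purely illustrative sketch of what injecting a stationary constraint at generation time could look like (assuming a PyTorch generator producing point clouds; the penalty and its parameters are hypothetical, not part of the project description), a differentiable stability term can be added to the training loss:

```python
import torch

def stability_penalty(points: torch.Tensor, ground_z: float = 0.0) -> torch.Tensor:
    """Toy static-equilibrium proxy: penalize shapes whose center of mass
    (uniform density assumed) projects outside the bounding box of their
    ground-contact points."""
    com = points.mean(dim=0)                            # (3,) center of mass
    contact = points[points[:, 2] < ground_z + 1e-2]    # points near the ground
    if contact.shape[0] == 0:
        return points.new_tensor(1.0)                   # no support at all: flat penalty
    lo = contact[:, :2].min(dim=0).values               # support footprint in (x, y)
    hi = contact[:, :2].max(dim=0).values
    # hinge loss, positive only when the COM leaves the footprint
    return (torch.relu(lo - com[:2]) + torch.relu(com[:2] - hi)).sum()

# total_loss = generation_loss + lambda_phys * stability_penalty(generated_points)
```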

Robust and Secure Federated Learning

Federated Learning (FL) allows multiple clients to collaboratively train a global model without sharing their raw data. While this decentralized setup is appealing for privacy-sensitive domains like healthcare and finance, it is not inherently secure: model updates can leak private information, and malicious clients can corrupt training.

To tackle these challenges, two main strategies are used: Secure Aggregation, which protects privacy by hiding individual updates, and Robust Aggregation, which filters out malicious updates. However, these goals can conflict—privacy mechanisms may obscure signs of malicious behavior, and robustness methods may violate privacy.
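To make this tension concrete, here is a toy sketch (illustrative only, not a protocol proposed by the thesis) of mask-based Secure Aggregation: each pair of clients shares a random mask that cancels in the sum, so the server learns only the aggregate, and the per-client signals a Robust Aggregation rule would inspect are hidden.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, clients = 4, 3
updates = [rng.normal(size=dim) for _ in range(clients)]  # local model updates

# Pairwise masks: for each pair (i, j) with i < j, client i adds the mask and
# client j subtracts it (in practice the seeds come from a key agreement).
masks = {(i, j): rng.normal(size=dim)
         for i in range(clients) for j in range(i + 1, clients)}

def masked(i):
    u = updates[i].copy()
    for j in range(clients):
        if i < j:
            u += masks[(i, j)]
        elif j < i:
            u -= masks[(j, i)]
    return u

server_view = [masked(i) for i in range(clients)]  # individually meaningless
aggregate = sum(server_view)                       # masks cancel exactly
assert np.allclose(aggregate, sum(updates))
```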

Moreover, most research focuses on model-level attacks, neglecting protocol-level threats like message delays or dropped updates, which are common in real-world, asynchronous networks.

This thesis aims to explore the privacy–robustness trade-off in FL, identify feasible security models, and design practical, secure, and robust protocols. Both theoretical analysis and prototype implementation will be conducted, leveraging tools like Secure Multi-Party Computation, cryptographic techniques, and differential privacy.
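As a minimal sketch of one of the listed tools, the snippet below shows differentially private update release through clipping and Gaussian noise; the clipping norm and noise scale are placeholder values, to be set by a privacy budget. It also illustrates the tension stated above: the same noise that protects privacy blurs the statistics that robust aggregators inspect.

```python
import numpy as np

def privatize(update: np.ndarray, clip: float = 1.0, sigma: float = 0.5,
              seed: int | None = None) -> np.ndarray:
    """Gaussian mechanism: clip the update to bound its L2 sensitivity,
    then add noise calibrated to that bound."""
    rng = np.random.default_rng(seed)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip / (norm + 1e-12))  # bound the contribution
    return clipped + rng.normal(scale=sigma * clip, size=update.shape)
```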

AI-Enhanced MBSE framework for joint safety and security analysis of critical systems

Critical systems must simultaneously meet the requirements of both Safety (preventing unintentional failures that could lead to damage) and Security (protecting against malicious attacks). Traditionally, these two areas are treated separately, whereas they are interdependent: an attack (Security) can trigger a failure (Safety), and a functional flaw can be exploited as an attack vector.
Model-Based Systems Engineering (MBSE) approaches enable rigorous system modeling, but they do not always capture the explicit links between Safety [1] and Security [2]; risk analyses remain manual, time-consuming and error-prone. The complexity of modern systems makes it necessary to automate the evaluation of Safety-Security trade-offs.
Joint safety/security MBSE modeling has been widely addressed in several research works such as [3], [4] and [5]. The scientific challenge of this thesis is to use AI to automate and improve the quality of analyses. What type of AI should we use for each analysis step? How can we detect conflicts between safety and security requirements? What are the criteria for assessing the contribution of AI to joint safety/security analysis?

Grounding and reasoning over space and time in Vision-Language Models (VLM)

Recent Vision-Language Models (VLMs) like BLIP, LLaVA, and Qwen-VL have achieved impressive results in multimodal tasks but still face limitations in true spatial and temporal reasoning. Many current benchmarks conflate visual reasoning with general knowledge and involve shallow reasoning tasks. Furthermore, these models often struggle with understanding complex spatial relations and dynamic scenes due to suboptimal visual feature usage. To address this, recent approaches such as SpatialRGPT, SpaceVLLM, VPD, and ST-VLM have introduced techniques like 3D scene graph integration, spatio-temporal queries, and kinematic instruction tuning to improve reasoning over space and time. This thesis proposes to build on these advances by developing new instruction-tuned models with improved data representation and architectural innovations. The goal is to enable robust spatio-temporal reasoning for applications in robotics, video analysis, and dynamic environment understanding.
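As a purely illustrative example of the kind of data such instruction tuning requires (the schema and values below are hypothetical, not drawn from the cited works), a kinematic spatio-temporal sample might pair a clip with a question whose answer demands metric, time-indexed grounding:

```python
sample = {
    "video": "clip_0421.mp4",          # hypothetical file name
    "frames": [12, 24, 36],            # time indices the question refers to
    "question": "Between frames 12 and 36, does the red car move toward "
                "or away from the pedestrian, and by roughly how many meters?",
    "grounding": {                     # normalized boxes per referenced entity
        "red car": [[0.41, 0.55, 0.58, 0.72]],
        "pedestrian": [[0.10, 0.50, 0.18, 0.78]],
    },
    "answer": "Toward the pedestrian, by about 3 meters.",
}
```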

Adaptive and explainable Video Anomaly Detection

Video Anomaly Detection (VAD) aims to automatically identify unusual events in video that deviate from normal patterns. Existing methods often rely on One-Class or Weakly Supervised learning: the former uses only normal data for training, while the latter leverages video-level labels. Recent advances in Vision-Language Models (VLMs) and Large Language Models (LLMs) have improved both the performance and explainability of VAD systems. Despite progress on public benchmarks, challenges remain. Most methods are limited to a single domain, leading to performance drops when applied to new datasets with different anomaly definitions. Additionally, they assume all training data is available upfront, which is unrealistic for real-world deployment where models must adapt to new data over time. Few approaches explore multimodal adaptation using natural language rules to define normal and abnormal events, offering a more intuitive and flexible way to update VAD systems without needing new video samples.
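As a minimal sketch of the One-Class setting, assuming a PyTorch autoencoder trained on normal frames only (the model itself is not specified by the project), anomalies can be flagged when reconstruction error exceeds a threshold calibrated on normal validation data:

```python
import torch

@torch.no_grad()
def anomaly_scores(autoencoder, frames: torch.Tensor) -> torch.Tensor:
    """Per-frame reconstruction error: high error suggests the frame lies
    outside the 'normal' distribution the autoencoder was trained on."""
    recon = autoencoder(frames)
    return ((recon - frames) ** 2).flatten(1).mean(dim=1)

# Calibrate on held-out normal videos, then alarm on test clips, e.g.:
# tau = anomaly_scores(ae, normal_val).quantile(0.99)
# alarms = anomaly_scores(ae, test_clip) > tau
```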

This PhD research aims to develop adaptable Video Anomaly Detection methods capable of handling new domains or anomaly types using few video examples and/or textual rules.

The main lines of research will be the following:
• Cross-Domain Adaptation in VAD: improving robustness against domain gaps through Few-Shot adaptation;
• Continual Learning in VAD: continually enriching the model to deal with new types of anomalies;
• Multimodal Few-Shot Learning: facilitating the model adaptation process through rules in natural language (see the sketch after this list).
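For the multimodal direction above, one plausible mechanism, sketched here under the assumption of a CLIP-style joint embedding (the `model` wrapper and its encode_image / encode_text methods are hypothetical), is to score frames against natural-language rules describing normal and abnormal events:

```python
import torch

def rule_based_scores(model, frames, normal_rules, abnormal_rules):
    """Score each frame by its similarity to textual 'abnormal' rules
    relative to 'normal' rules. `model` is assumed to expose
    encode_image / encode_text on batched frames and raw strings."""
    with torch.no_grad():
        img = torch.nn.functional.normalize(model.encode_image(frames), dim=-1)
        txt = torch.nn.functional.normalize(
            model.encode_text(normal_rules + abnormal_rules), dim=-1)
    sims = img @ txt.T                    # (n_frames, n_rules) cosine similarities
    n = len(normal_rules)
    return sims[:, n:].max(dim=1).values - sims[:, :n].max(dim=1).values

# Updating the detector then amounts to editing the rule lists, e.g.
# abnormal_rules = ["a person climbing over the fence", "smoke in the corridor"],
# without collecting any new video samples.
```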

Internalisation of external knowledge by foundation models

To perform an unknown task, a subject (human or robot) has to consult external information, which involves a cognitive cost. After several similar experiences, it masters the situation and can act automatically. The 1980s and 1990s saw AI explorations of these ideas using conceptual graphs and schemas, but their large-scale implementation was limited by the technology available at the time.

Today's neural models, including transformers and LLMs/VLMs, learn universal representations through pre-training on huge amounts of data. They can be used with prompts that provide local context. Fine-tuning allows these models to be specialised for specific tasks.

RAG and GraphRAG methods can be used to exploit external knowledge, but relying on them at inference time is resource-intensive. This thesis proposes a cognitivist approach in which the system undergoes continuous learning: it consults external sources during inference and regularly uses the retrieved information to refine itself, much as the brain consolidates experience during sleep. This method aims to improve performance and reduce resource consumption.
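For contrast, a minimal RAG loop can be sketched as follows; `llm` and `embed` are placeholder callables, not components specified by the project. Every query pays the retrieval cost at inference time, which is precisely the overhead that periodic internalisation aims to amortise.

```python
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray,
             docs: list[str], k: int = 3) -> list[str]:
    """Cosine-similarity retrieval over a pre-embedded document store."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def rag_answer(llm, embed, question: str, docs, doc_vecs) -> str:
    # Retrieval happens at every query: this is the recurring inference cost.
    context = "\n".join(retrieve(embed(question), doc_vecs, docs))
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```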

In humans, these processes are linked to the spatial organisation of the brain. The thesis will also study network architectures inspired by this organisation, with dedicated but interconnected “zones”, such as vision-language and language models.

These concepts can be applied to the Astir and Ridder projects, which aim, respectively, to exploit foundation models for software engineering in robotics and to develop generative AI methods for the safe control of robots.

Fine-grained and spatio-temporally grounded large multimodal models

This PhD project focuses on enhancing Large Multimodal Models (LMMs) through the integration of fine-grained and spatio-temporal information into training datasets. While current LMMs such as CLIP and Flamingo show strong performance, they rely on noisy and coarse-grained image-text pairs and often lack spatial or temporal grounding. The thesis aims to develop automatic pipelines to enrich image datasets with geographic and temporal metadata, refine captions using fine-grained semantic descriptors, and balance dataset diversity and compactness by controlling class-wise sample sizes.
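A first building block of such an enrichment pipeline could look like the following sketch; the metadata fields and the fine-grained label are assumed to be produced by upstream taggers, and all names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    caption: str          # original noisy web caption
    fine_label: str       # e.g. "Airbus A320" rather than "airplane"
    place: str | None     # geographic metadata, if recoverable
    year: int | None      # temporal metadata, if recoverable

def enrich_caption(s: Sample) -> str:
    """Fold fine-grained and spatio-temporal metadata into the caption."""
    parts = [s.caption.rstrip("."), f"a {s.fine_label}"]
    if s.place:
        parts.append(f"in {s.place}")
    if s.year:
        parts.append(f"around {s.year}")
    return ", ".join(parts) + "."
```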

Training strategies will incorporate hierarchical class structures and adapt protocols to improve alignment between caption elements and image regions. The work will also explore joint training regimes that integrate fine-grained, spatial, and temporal dimensions, and propose set-based inference to improve the diversity of generated outputs. The enriched datasets and models will be evaluated using existing or newly developed benchmarks targeting contextual relevance and output diversity. The project also addresses challenges in metadata accuracy, efficient model adaptation, and benchmarking methodologies for multi-dimensional model evaluation.

Applications include improved synthetic data generation for autonomous driving, enhanced annotation of media archives through contextual captioning, and better visual reasoning in industrial simulation scenarios.

Machine Learning-Accelerated Electron Density Calculations

Density Functional Theory (DFT) in the Kohn-Sham formalism is one of the most widespread methods for simulating microscopic properties in solid-state physics and chemistry. Its main advantage lies in its ability to strike a favorable balance between accuracy and computational cost. The continuous development of increasingly efficient numerical techniques has steadily broadened its scope of applicability.
Among the techniques that can be combined with DFT, machine learning is increasingly used. A very common application today is the construction of interatomic potentials that predict interactions between atoms, using supervised learning models trained on DFT-computed properties.
The objective of the project proposed as part of this thesis is to use machine learning techniques at a deeper level, notably to predict the electronic density in crystals or molecules. Compared to predicting properties such as forces between atoms, predicting the electronic density presents specific challenges: it is high-dimensional, since it must be represented throughout all space, and its characteristics vary strongly from one material to another (metals, insulators, charge transfer, etc.). Ultimately, this can represent a significant computational cost. Several options exist to reduce the dimensionality of the electronic density, such as computing projections or using localization functions.
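One dimensionality-reduction route is to expand the density on atom-centered basis functions and learn the expansion coefficients from descriptors of the local atomic environments. The sketch below, on placeholder arrays, uses ridge regression simply to make the structure of this approach explicit; it is not the method the thesis will necessarily adopt.

```python
import numpy as np
from sklearn.linear_model import Ridge

# X: descriptors of local atomic environments, shape (n_atoms, n_features)
# C: DFT-derived expansion coefficients of the density on atom-centered
#    basis functions, shape (n_atoms, n_basis); both are placeholders here.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
C = rng.normal(size=(500, 32))

model = Ridge(alpha=1e-3).fit(X, C)   # one linear map per basis coefficient
C_pred = model.predict(X)             # predicted coefficients for new atoms

# The density is then reconstructed as rho(r) = sum_{a,k} C[a, k] * phi_k(r - R_a)
# and can serve directly, or seed the self-consistent DFT loop.
```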
The final goal of this project is to predict the electronic density with the highest possible accuracy, in order to use it either directly as a prediction or as a starting point for calculations of electron-specific properties (magnetism or band structure, for example).
In a first stage, the candidate will implement methods recently proposed in the literature; in a second stage of the thesis, new ideas will have to be proposed. Finally, the implemented method will be used to accelerate the prediction of properties of large systems involving charge transfer, such as defect migration in crystals.

Automatic modelling of language variation for socially responsive chatbots

Conversational agents are increasingly present in our daily lives thanks to advances in natural language processing and artificial intelligence, and they are attracting growing interest. However, their ability to understand human communication in all its complexity remains a major challenge. This PhD project aims to model linguistic variation in order to develop agents capable of socially adaptive interactions, taking into account the socio-demographic profile and emotional state of their interlocutors. It also focuses on evaluating linguistic cues at different levels, leveraging both spoken and written language varieties, and on assessing the generalization capacity of models trained on multilingual and multi-situational data, with the goal of improving interaction modeling with conversational agents.

Compositional Generalization of Visual Language Models

The advent of foundation models has raised state-of-the-art performance on a large number of tasks in several fields of AI, in particular computer vision and natural language processing. However, despite the huge amount of data used to train them, these models are still limited in their ability to generalize, in particular to use cases in specific domains that are not well represented on the Web. A way to formalize this issue is compositional generalization, i.e. generalizing to a new, unseen concept from concepts learned during training. This "generalization" is the ability to learn disentangled concepts and to recombine them into unseen compositions once the model is in production. The proposed thesis will address this issue, aiming to propose visual representations that enable generic visual language models to generalize compositionally within specific domains. It will investigate strategies to reduce shortcut learning, promoting deeper understanding of compositional structures in multimodal data. It will also address the problem of compositional generalization beyond simple attribute–object pairs, capturing more subtle and complex semantics. The proposed thesis aims at progress at a rather theoretical level, but it has many potential practical applications in the fields of health, administration and services, security and defense, manufacturing and agriculture.
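To make the evaluation target concrete, a compositional split can be built so that every attribute and every object is seen during training while some attribute-object pairs appear only at test time; the toy vocabulary below is purely illustrative:

```python
from itertools import product

attributes = ["red", "wooden", "broken"]
objects = ["car", "chair", "phone"]
pairs = list(product(attributes, objects))

# Hold out the "diagonal" pairs: each attribute and each object still occurs
# in training, only these specific combinations are unseen at test time.
test_pairs = {(a, o) for a, o in pairs
              if attributes.index(a) == objects.index(o)}
train_pairs = [p for p in pairs if p not in test_pairs]
```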
