GenPhi : 3D Generative AI conditioned by geometry, structure and physics

The aim of this thesis is to design new 3D model generators based on Generative Artificial Intelligence (GenAI), capable of producing faithful, coherent and physically viable shapes. While 3D generation has become essential in many fields, current automatic generation approaches suffer from limitations in terms of respecting geometric, structural and physical constraints. The goal is to develop methods for integrating constraints related to geometry, topology, internal structure and physical laws, both stationary (equilibrium, statics) and dynamic (kinematics, deformation), right from the generation stage. The study will combine geometric perception, semantic enrichment and physical simulation approaches to produce robust, realistic 3D models that can be directly exploited without human intervention.

Towards Reliable and Autonomous Workflow Coordination in Agentic AI-Based Systems

The rise of Large Language Models (LLMs) and agentic AI systems is transforming how complex workflows are designed and managed. Unlike traditional centralized orchestration, modern workflows must support distributed, autonomous agents operating across cloud, edge, and on-premise environments. These agents collaborate with humans and other systems, adapt to evolving goals, and cross organizational and trust boundaries. This paradigm shift is especially relevant in domains like cybersecurity and healthcare emergency response, where workflows must be dynamically constructed and executed under uncertainty. In such settings, rigid automation falls short—agentic workflows require decentralized, secure, and auditable orchestration.
This thesis explores how to enable such systems, asking: How can we achieve secure, distributed orchestration in environments where agentic AI operates autonomously? It will propose a formal modeling framework for distributed agentic workflows, protocols for auditable, privacy-preserving coordination, and a reference architecture with real-world proofs of concept in cybersecurity and healthcare.

AI Enhanced MBSE framework for joint safety and security analysis of critical systems

Critical systems must simultaneously meet the requirements of both Safety (preventing unintentional failures that could lead to damage) and Security (protecting against malicious attacks). Traditionally, these two areas are treated separately, whereas they are interdependent: An attack (Security) can trigger a failure (Safety), and a functional flaw can be exploited as an attack vector.
MBSE approaches enable rigorous system modeling, but they don't always capture the explicit links between Safety [1] and Security [2]; risk analyses are manual, time-consuming and error-prone. The complexity of modern systems makes it necessary to automate the evaluation of Safety-Security trade-offs.
Joint safety/security MBSE modeling has been widely addressed in several research works such as [3], [4] and [5]. The scientific challenge of this thesis is to use AI to automate and improve the quality of analyses. What type of AI should we use for each analysis step? How can we detect conflicts between safety and security requirements? What are the criteria for assessing the contribution of AI to joint safety/security analysis?

Grounding and reasoning over space and time in Vision-Language Models (VLM)

Recent Vision-Language Models (VLMs) like BLIP, LLaVA, and Qwen-VL have achieved impressive results in multimodal tasks but still face limitations in true spatial and temporal reasoning. Many current benchmarks conflate visual reasoning with general knowledge and involve shallow reasoning tasks. Furthermore, these models often struggle with understanding complex spatial relations and dynamic scenes due to suboptimal visual feature usage. To address this, recent approaches such as SpatialRGPT, SpaceVLLM, VPD, and ST-VLM have introduced techniques like 3D scene graph integration, spatio-temporal queries, and kinematic instruction tuning to improve reasoning over space and time. This thesis proposes to build on these advances by developing new instruction-tuned models with improved data representation and architectural innovations. The goal is to enable robust spatio-temporal reasoning for applications in robotics, video analysis, and dynamic environment understanding.

Adaptive and explainable Video Anomaly Detection

Video Anomaly Detection (VAD) aims to automatically identify unusual events in video that deviate from normal patterns. Existing methods often rely on One-Class or Weakly Supervised learning: the former uses only normal data for training, while the latter leverages video-level labels. Recent advances in Vision-Language Models (VLMs) and Large Language Models (LLMs) have improved both the performance and explainability of VAD systems. Despite progress on public benchmarks, challenges remain. Most methods are limited to a single domain, leading to performance drops when applied to new datasets with different anomaly definitions. Additionally, they assume all training data is available upfront, which is unrealistic for real-world deployment where models must adapt to new data over time. Few approaches explore multimodal adaptation using natural language rules to define normal and abnormal events, offering a more intuitive and flexible way to update VAD systems without needing new video samples.

This PhD research aims to develop adaptable Video Anomaly Detection methods capable of handling new domains or anomaly types using few video examples and/or textual rules.

The main lines of research will be the following:
• Cross-Domain Adaptation in VAD: improving robustness against domain gaps through Few-Shot adaptation;
• Continual Learning in VAD: continually enriching the model to deal with new types of anomalies;
• Multimodal Few-Shot Learning: facilitating the model adaptation process through rules in natural language.

Enabling efficient federated learning and fine-tuning for heterogeneous and resource-constrained devices

The goal of this PhD thesis is to develop methods that enhance resource efficiency in federated learning (FL), with particular attention to the constraints and heterogeneity of client resources. The work will first focus on the classical client-server FL architecture, before extending the investigation to decentralised FL settings. The proposed methods will be studied in the context of both federated model training and distributed fine-tuning of large models, such as large language models (LLMs).

Internalisation of external knowledge by foundation models

To perform an unknown task, a subject (human or robot) has to consult external information, which involves a cognitive cost. After several similar experiments, it masters the situation and can act automatically. The 1980s and 1990s saw explorations in AI using conceptual graphs and schemas, but their large-scale implementation was limited by the technology available at the time.

Today's neural models, including transformers and LLM/VLMs, learn universal representations through pre-training on huge amounts of data. They can be used with prompts to provide local context. Fine-tuning allows these models to be specialised for specific tasks.

RAG and GraphRAG methods can be used to exploit external knowledge, but their use for inference is resource-intensive. This thesis proposes a cognitivist approach in which the system undergoes continuous learning. It consults external sources during inference and uses this information to refine itself regularly, as it does during sleep. This method aims to improve performance and reduce resource consumption.

In humans, these processes are linked to the spatial organisation of the brain. The thesis will also study network architectures inspired by this organisation, with dedicated but interconnected “zones”, such as the vision-language and language models.

These concepts can be applied to the Astir and Ridder projects, which aim to exploit foundation models for software engineering in robotics and the development of generative AI methods for the safe control of robots.

Fine-grained and spatio-temporally grounded large multimodal models

This PhD project focuses on enhancing Large Multimodal Models (LMMs) through the integration of fine-grained and spatio-temporal information into training datasets. While current LMMs such as CLIP and Flamingo show strong performance, they rely on noisy and coarse-grained image-text pairs and often lack spatial or temporal grounding. The thesis aims to develop automatic pipelines to enrich image datasets with geographic and temporal metadata, refine captions using fine-grained semantic descriptors, and balance dataset diversity and compactness by controlling class-wise sample sizes.

Training strategies will incorporate hierarchical class structures and adapt protocols to improve alignment between caption elements and image regions. The work will also explore joint training regimes that integrate fine-grained, spatial, and temporal dimensions, and propose set-based inference to improve the diversity of generated outputs. The enriched datasets and models will be evaluated using existing or newly developed benchmarks targeting contextual relevance and output diversity. The project also addresses challenges in metadata accuracy, efficient model adaptation, and benchmarking methodologies for multi-dimensional model evaluation.

Applications include improved synthetic data generation for autonomous driving, enhanced annotation of media archives through contextual captioning, and better visual reasoning in industrial simulation scenarios.

Automatic modelling language variations for socially responsive chatbots

Conversational agents are increasingly present in our daily lives thanks to advances in natural language processing and artificial intelligence and are attracting growing interest. However, their ability to understand human communication in all its complexity remains a major challenge. This PhD project aims to model linguistic variation to develop agents capable of socially adaptive interactions, taking into account the socio-demographic profile and emotional state of their interlocutors. It also focuses on evaluating linguistic cues at different levels, leveraging both spoken and written language varieties, and assessing the generalization capacity of models trained on multilingual and multi-situational data, with the goal of improving interaction modeling with conversational agents.

Compositional Generalization of Visual Language Models

The advent of the foundation models led to increase the state-of-the art performance on a large number of tasks in several fields of AI, in particular computer vision and natural language processing. However, despite the huge amount of data used to train them, these models are still limited in their ability to generalize, in particular for a use case of interest that is in a specific domain, not well represented on the Web. A way to formalize this issue is compositional generalization, i.e. generalising to a new, unseen concept from concepts learned during training. This "generalization" is the ability to learn disentangle concepts and to be able to recombine
them into unseen composition when the model is in production. The proposed thesis will address this issue, aiming at proposing visual representations that enable generic visual language models to generalize compositionally within specific domains. It will investigate strategies to reduce shortcut learning, promoting deeper understanding of compositional structures in multimodal data. It will also address the problem of compositional generalization beyond simple attribute–object pairs, capturing more subtle and complex semantics. The proposed thesis aims at proposing preogress at a quite theoretical level but has many potential practical interest, in the fields of health, administration and services sectors, security and defense, manufacturing and agriculture.