Enabling efficient federated learning and fine-tuning for heterogeneous and resource-constrained devices
The goal of this PhD thesis is to develop methods that enhance resource efficiency in federated learning (FL), with particular attention to the constraints and heterogeneity of client resources. The work will first focus on the classical client-server FL architecture, before extending the investigation to decentralised FL settings. The proposed methods will be studied in the context of both federated model training and distributed fine-tuning of large models, such as large language models (LLMs).
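For reference, the classical client-server architecture follows the FedAvg pattern (McMahan et al., 2017): clients train locally on private data and the server aggregates the resulting weights. Below is a minimal PyTorch sketch with hypothetical model and data loaders; it omits the resource-efficiency and heterogeneity-handling mechanisms the thesis will develop.

```python
import copy
import torch
import torch.nn as nn

def local_update(model, loader, epochs=1, lr=0.01):
    """One client's local training pass on its private data."""
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict(), sum(len(x) for x, _ in loader)

def fedavg_round(global_model, client_loaders):
    """One server round: broadcast, train locally, aggregate weighted by sample count.
    Assumes a model without integer buffers (e.g. a plain MLP)."""
    states, sizes = zip(*[local_update(global_model, dl) for dl in client_loaders])
    total = sum(sizes)
    avg = {k: sum(s[k].float() * (n / total) for s, n in zip(states, sizes))
           for k in states[0]}
    global_model.load_state_dict(avg)
    return global_model
```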
Internalisation of external knowledge by foundation models
To perform an unknown task, a subject (human or robot) has to consult external information, which involves a cognitive cost. After several similar experiences, the subject masters the situation and can act automatically. The 1980s and 1990s saw AI explorations using conceptual graphs and schemas, but their large-scale implementation was limited by the technology available at the time.
Today's neural models, including transformers and LLMs/VLMs, learn universal representations through pre-training on huge amounts of data. Prompts can supply them with local context, and fine-tuning allows these models to be specialised for specific tasks.
RAG and GraphRAG methods can exploit external knowledge, but using them at inference time is resource-intensive. This thesis proposes a cognitivist approach in which the system undergoes continuous learning: it consults external sources during inference and regularly uses this information to refine itself, much as the brain consolidates knowledge during sleep. This method aims to improve performance and reduce resource consumption.
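A toy, self-contained illustration of this consult-then-consolidate cycle follows; the dictionary "knowledge base" and ToyModel are stand-ins for a real RAG stack and LLM, not the thesis's method.

```python
# Toy sketch: consult external knowledge at inference, then periodically
# internalise it ("sleep" consolidation) so future lookups become unnecessary.

class ToyModel:
    def __init__(self):
        self.internal = {}                  # facts already internalised

    def generate(self, query, context=None):
        if query in self.internal:          # cheap: answered from the "weights"
            return self.internal[query]
        return context if context is not None else "unknown"

    def finetune(self, examples):
        # Stand-in for a periodic fine-tuning pass on logged consultations.
        self.internal.update(examples)

knowledge_base = {"capital of France?": "Paris"}   # external source (RAG stand-in)
log = {}

def answer(model, query):
    reply = model.generate(query)
    if reply == "unknown":                  # costly external consultation
        reply = model.generate(query, context=knowledge_base.get(query))
        log[query] = reply                  # remember for the next "sleep" phase
    return reply

model = ToyModel()
answer(model, "capital of France?")         # consults the external source
model.finetune(log); log.clear()            # "sleep": internalise consulted knowledge
answer(model, "capital of France?")         # now answered without external lookup
```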
In humans, these processes are linked to the spatial organisation of the brain. The thesis will also study network architectures inspired by this organisation, with dedicated but interconnected “zones”, analogous to vision-language and language models.
These concepts can be applied to the Astir and Ridder projects, which aim to exploit foundation models for software engineering in robotics and the development of generative AI methods for the safe control of robots.
Fine-grained and spatio-temporally grounded large multimodal models
This PhD project focuses on enhancing Large Multimodal Models (LMMs) through the integration of fine-grained and spatio-temporal information into training datasets. While current LMMs such as CLIP and Flamingo show strong performance, they rely on noisy and coarse-grained image-text pairs and often lack spatial or temporal grounding. The thesis aims to develop automatic pipelines to enrich image datasets with geographic and temporal metadata, refine captions using fine-grained semantic descriptors, and balance dataset diversity and compactness by controlling class-wise sample sizes.
Training strategies will incorporate hierarchical class structures and adapt protocols to improve alignment between caption elements and image regions. The work will also explore joint training regimes that integrate fine-grained, spatial, and temporal dimensions, and propose set-based inference to improve the diversity of generated outputs. The enriched datasets and models will be evaluated using existing or newly developed benchmarks targeting contextual relevance and output diversity. The project also addresses challenges in metadata accuracy, efficient model adaptation, and benchmarking methodologies for multi-dimensional model evaluation.
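To fix ideas, here is a minimal sketch of what one enriched sample might look like after such a pipeline stage; all field names, and the exif/refiner inputs, are hypothetical placeholders for real metadata extraction and caption-refinement components.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EnrichedSample:
    image_path: str
    raw_caption: str
    fine_caption: Optional[str] = None       # refined, fine-grained description
    latitude: Optional[float] = None         # spatial grounding
    longitude: Optional[float] = None
    timestamp: Optional[str] = None          # temporal grounding (ISO 8601)
    class_path: list = field(default_factory=list)  # hierarchical class labels

def enrich(sample: EnrichedSample, exif: dict, refiner) -> EnrichedSample:
    """exif and refiner stand in for real metadata extraction and a
    caption-refinement model; shown only to illustrate the record layout."""
    sample.latitude = exif.get("gps_lat")
    sample.longitude = exif.get("gps_lon")
    sample.timestamp = exif.get("datetime")
    sample.fine_caption = refiner(sample.raw_caption)
    return sample
```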
Applications include improved synthetic data generation for autonomous driving, enhanced annotation of media archives through contextual captioning, and better visual reasoning in industrial simulation scenarios.
Automatic modelling language variations for socially responsive chatbots
Conversational agents are increasingly present in our daily lives thanks to advances in natural language processing and artificial intelligence, and they are attracting growing interest. However, their ability to understand human communication in all its complexity remains a major challenge. This PhD project aims to model linguistic variation in order to develop agents capable of socially adaptive interactions, taking into account the socio-demographic profile and emotional state of their interlocutors. It also focuses on evaluating linguistic cues at different levels, leveraging both spoken and written language varieties, and assessing the generalization capacity of models trained on multilingual and multi-situational data, with the goal of improving interaction modeling with conversational agents.
Compositional Generalization of Visual Language Models
The advent of foundation models has raised state-of-the-art performance on a large number of tasks in several fields of AI, in particular computer vision and natural language processing. However, despite the huge amount of data used to train them, these models are still limited in their ability to generalize, particularly for use cases of interest in specific domains that are not well represented on the Web. One way to formalize this issue is compositional generalization, i.e. generalizing to a new, unseen concept from concepts learned during training. This generalization is the ability to learn disentangled concepts and to recombine them into unseen compositions once the model is in production. The proposed thesis will address this issue, aiming to propose visual representations that enable generic visual language models to generalize compositionally within specific domains. It will investigate strategies to reduce shortcut learning, promoting a deeper understanding of compositional structures in multimodal data. It will also address compositional generalization beyond simple attribute–object pairs, capturing more subtle and complex semantics. The proposed thesis aims at progress at a rather theoretical level, but it has many potential practical applications in the health, administration and services, security and defense, manufacturing and agriculture sectors.
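As an illustration of how compositional generalization can be probed with an off-the-shelf model, the sketch below scores an image against attribute–object compositions using the public Hugging Face CLIP checkpoint; the image path, and the assumption that some compositions were unseen in training, are placeholders.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Attribute-object compositions; by assumption, some are held out of training.
compositions = ["a red cube", "a blue cube", "a red sphere", "a blue sphere"]
image = Image.open("example.jpg")            # placeholder image path

inputs = processor(text=compositions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # image-text similarity scores
probs = logits.softmax(dim=-1)
for c, p in zip(compositions, probs[0]):
    print(f"{c}: {p:.3f}")
```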
Out-of-Distribution Detection with Vision Foundation Models and Post-hoc Methods
The thesis focuses on improving the reliability of deep learning models, particularly in detecting out-of-distribution (OoD) samples, which are data points that differ from the training data and can lead to incorrect predictions. This is especially important in critical fields like healthcare and autonomous vehicles, where errors can have serious consequences. The research leverages vision foundation models (VFMs) like CLIP and DINO, which have revolutionized computer vision by enabling learning from limited data. The proposed work aims to develop methods that maintain the robustness of these models during fine-tuning, ensuring they can still effectively detect OoD samples. Additionally, the thesis will explore solutions for handling changing data distributions over time, a common challenge in real-world applications. The expected results include new techniques for OoD detection and adaptive methods for dynamic environments, ultimately enhancing the safety and reliability of AI systems in practical scenarios.
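For context, two standard post-hoc baselines that such work typically starts from can be computed directly from a frozen classifier's logits; this is a generic sketch of known scores, not the methods the thesis will develop.

```python
import torch
import torch.nn.functional as F

def msp_score(logits: torch.Tensor) -> torch.Tensor:
    """Maximum softmax probability; higher = more in-distribution
    (Hendrycks & Gimpel, 2017)."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Negative free energy; higher = more in-distribution (Liu et al., 2020)."""
    return T * torch.logsumexp(logits / T, dim=-1)

logits = torch.randn(8, 1000)          # stand-in for frozen-VFM classifier logits
is_ood = energy_score(logits) < -5.0   # illustrative threshold; tuned on held-out data
```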
Point Spread Function Modelling for Space Telescopes with a Differentiable Optical Model
Context
Weak gravitational lensing [1] is a powerful probe of the Large Scale Structure of our Universe. Cosmologists use weak lensing to study the nature of dark matter and its spatial distribution. Weak lensing missions require highly accurate shape measurements of galaxy images. The instrumental response of the telescope, called the point spread function (PSF), produces a deformation of the observed images. This deformation can be mistaken for the effects of weak lensing in the galaxy images, making the PSF one of the primary sources of systematic error in weak lensing science. Therefore, estimating a reliable and accurate PSF model is crucial for the success of any weak lensing mission [2]. The PSF field can be interpreted as a convolutional kernel, varying spatially, spectrally, and temporally, that affects each of our observations of interest. The PSF model needs to be able to cope with each of these variations. We use specific stars, considered point sources, in the field of view to constrain our PSF model. These stars, which are unresolved objects, provide us with degraded samples of the PSF field. The observations go through different degradations depending on the properties of the telescope, including undersampling, integration over the instrument passband, and additive noise. We build the PSF model from these degraded observations and then use it to infer the PSF at the positions of galaxies. This procedure constitutes the ill-posed inverse problem of PSF modelling. See [3] for a recent review of PSF modelling.
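For concreteness, the forward model behind this inverse problem can be written schematically, in assumed notation, for a star observation indexed by i:

```latex
% Schematic forward model (assumed notation, for illustration only):
%   I_i : observed star image at focal-plane position x_i
%   H   : PSF field, varying spatially (x) and chromatically (lambda)
%   F_d : downsampling (undersampling) operator, n_i : additive noise
I_i = F_d\!\left[ \int_{\lambda_{\min}}^{\lambda_{\max}}
      \mathrm{SED}_i(\lambda)\, H(x_i;\lambda)\, \mathrm{d}\lambda \right] + n_i
```

The goal is to estimate H from the degraded star images {I_i} and predict it at galaxy positions.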
The recently launched Euclid survey represents one of the most complex challenges for PSF modelling. Because of the very broad passband of Euclid’s visible imager (VIS), ranging from 550 nm to 900 nm, PSF models need to capture not only the PSF field’s spatial variations but also its chromatic variations. Each star observation is integrated with the object’s spectral energy distribution (SED) over the whole VIS passband. As the observations are undersampled, a super-resolution step is also required. A recent model, coined WaveDiff [4], was proposed to tackle the PSF modelling problem for Euclid; it is based on a differentiable optical model. WaveDiff achieved state-of-the-art performance and is currently being tested with recent observations from the Euclid survey.
The James Webb Space Telescope (JWST) was recently launched and is producing outstanding observations. The COSMOS-Web collaboration [5] is a wide-field JWST treasury program that maps a contiguous 0.6 deg² field. The COSMOS-Web observations are available and provide a unique opportunity to test and develop a precise PSF model for JWST. In this context, several science cases beyond weak gravitational lensing can profit greatly from a precise PSF model: for example, strong gravitational lensing [6], where the PSF plays a crucial role in reconstruction, and exoplanet imaging [7], where PSF speckles can mimic the appearance of exoplanets, so subtracting an accurate and precise PSF model is essential to improve their imaging and detection.
PhD project
The candidate will aim to develop more accurate and performant PSF models for space-based telescopes by exploiting a differentiable optical framework, focusing the effort on Euclid and JWST.
The WaveDiff model is based on the wavefront space and does not consider pixel-based or detector-level effects. These pixel errors cannot be modelled accurately in the wavefront, as they arise directly on the detectors and are unrelated to the telescope’s optical aberrations. Therefore, as a first direction, we will extend the PSF modelling approach to detector-level effects by combining a parametric approach with a data-driven (learned) one. We will exploit the automatic differentiation capabilities of the machine learning frameworks underlying WaveDiff (e.g. TensorFlow, PyTorch, JAX) to accomplish this objective.
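One way to picture this hybrid extension is a differentiable detector stage appended to the optical model. The PyTorch sketch below is purely illustrative: the wavefront model is a placeholder for WaveDiff, and the particular detector effects (a learnable Gaussian blur plus a small residual CNN) are assumptions, not the project's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_kernel(sigma, size=7):
    """Differentiable Gaussian kernel: the parametric detector effect."""
    x = (torch.arange(size) - size // 2).float()
    g = torch.exp(-x**2 / (2 * sigma**2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)

class HybridPSF(nn.Module):
    def __init__(self, wavefront_psf: nn.Module):
        super().__init__()
        self.optics = wavefront_psf                     # differentiable optical model
        self.log_sigma = nn.Parameter(torch.zeros(()))  # parametric: blur width
        self.residual = nn.Sequential(                  # data-driven: learned residual
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 1, 3, padding=1))

    def forward(self, positions):
        psf = self.optics(positions)                # (N, 1, H, W) postage stamps
        k = gaussian_kernel(self.log_sigma.exp())
        blurred = F.conv2d(psf, k, padding=3)       # parametric detector effect
        return blurred + self.residual(blurred)     # learned pixel-level residual
```

Both components sit in one computational graph, so autodiff trains them end to end against observed stars.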
As a second direction, we will consider the joint estimation of the PSF field and the stellar spectral energy distributions (SEDs) by exploiting repeated exposures, or dithers. The goal is to improve and calibrate the original SED estimation by exploiting the PSF modelling information. We will rely on our PSF model and on the fact that repeated observations of the same object change the star image (as it is imaged at different focal-plane positions) while sharing the same SED.
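Schematically, in the same assumed notation as above, the joint fit could be posed as follows; the key constraint is that all K dithers of star i share one SED.

```latex
% theta = PSF-model parameters, s_i = SED of star i,
% H_theta = full forward model (optics + degradations), R = SED regulariser.
\hat{\theta}, \{\hat{s}_i\} =
  \arg\min_{\theta,\,\{s_i\}} \sum_{i}\sum_{k=1}^{K}
  \big\| I_i^{(k)} - \mathcal{H}_\theta\big(x_i^{(k)};\, s_i\big) \big\|_2^2
  + \mathcal{R}(s_i)
% Exposure k sees star i at a different focal-plane position x_i^{(k)},
% but the shared s_i regularises the joint estimation.
```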
Another direction will be to extend WaveDiff to more general astronomical observatories, such as JWST, with smaller fields of view. We will constrain the PSF model with observations from several bands, building a unique model informed by more data. The objective is to develop the next PSF model for JWST, available for widespread use, and to validate it with real data from the COSMOS-Web JWST program.
A further direction will be to extend the performance of WaveDiff by including a continuous field in the form of an implicit neural representation [8], also known as a neural field [9], to address the spatial variations of the PSF in the wavefront space with a more powerful and flexible model.
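A minimal sketch of such a neural field, mapping a focal-plane position to wavefront coefficients (e.g. Zernike amplitudes), is given below; the Fourier-feature encoding and all sizes are illustrative assumptions, not WaveDiff's actual design.

```python
import torch
import torch.nn as nn

class WavefrontField(nn.Module):
    def __init__(self, n_coeffs=15, n_freq=16, hidden=128):
        super().__init__()
        # Fixed random Fourier features help the MLP represent fine variation.
        self.B = nn.Parameter(torch.randn(2, n_freq) * 10.0, requires_grad=False)
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_coeffs))

    def forward(self, xy):                   # xy: (N, 2) focal-plane positions
        proj = 2 * torch.pi * xy @ self.B    # (N, n_freq)
        feats = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.mlp(feats)               # (N, n_coeffs) wavefront coefficients

field = WavefrontField()
coeffs = field(torch.rand(4, 2))             # smooth, continuous spatial variation
```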
Finally, throughout the PhD, the candidate will collaborate on Euclid’s data-driven PSF modelling effort, which consists of applying WaveDiff to real Euclid data, and the COSMOS-Web collaboration to exploit JWST observations.
References
[1] R. Mandelbaum. “Weak Lensing for Precision Cosmology”. In: Annual Review of Astronomy and Astrophysics 56 (2018), pp. 393–433. doi: 10.1146/annurev-astro-081817-051928. arXiv: 1710.03235.
[2] T. I. Liaudat et al. “Multi-CCD modelling of the point spread function”. In: A&A 646 (2021), A27. doi: 10.1051/0004-6361/202039584.
[3] T. I. Liaudat, J.-L. Starck, and M. Kilbinger. “Point spread function modelling for astronomical telescopes: a review focused on weak gravitational lensing studies”. In: Frontiers in Astronomy and Space Sciences 10 (2023). doi: 10.3389/fspas.2023.1158213.
[4] T. I. Liaudat, J.-L. Starck, M. Kilbinger, and P.-A. Frugier. “Rethinking data-driven point spread function modeling with a differentiable optical model”. In: Inverse Problems 39.3 (Feb. 2023), p. 035008. doi: 10.1088/1361-6420/acb664.
[5] C. M. Casey et al. “COSMOS-Web: An Overview of the JWST Cosmic Origins Survey”. In: The Astrophysical Journal 954.1 (Aug. 2023), p. 31. doi: 10.3847/1538-4357/acc2bc.
[6] A. Acebron et al. “The Next Step in Galaxy Cluster Strong Lensing: Modeling the Surface Brightness of Multiply Imaged Sources”. In: ApJ 976.1 (Nov. 2024), p. 110. doi: 10.3847/1538-4357/ad8343. arXiv: 2410.01883 [astro-ph.GA].
[7] B. Y. Feng et al. “Exoplanet Imaging via Differentiable Rendering”. In: IEEE Transactions on Computational Imaging 11 (2025), pp. 36–51. doi: 10.1109/TCI.2025.3525971.
[8] Y. Xie et al. “Neural Fields in Visual Computing and Beyond”. In: arXiv e-prints (Nov. 2021). doi: 10.48550/arXiv.2111.11426. arXiv: 2111.11426 [cs.CV].
[9] B. Mildenhall et al. “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”. In: arXiv e-prints (Mar. 2020). doi: 10.48550/arXiv.2003.08934. arXiv: 2003.08934 [cs.CV].
Attention-based Binarized Visual Encoder for LLM-driven Visual Question Answering
In the context of smart image sensors, there is an increasing demand to go beyond simple inferences such as classification or object detection, towards more complex applications enabling a semantic understanding of the scene. Among these applications, Visual Question Answering (VQA) enables AI systems to answer questions by analyzing images. This project aims to develop an efficient VQA system combining a visual encoder based on Binary Neural Networks (BNNs) with a compact language model (tiny LLM). Although LLMs are still far from a complete hardware implementation, this project represents a significant step in that direction by using a BNN to analyze the context and relationships between objects in the scene. This encoder processes images with low resource consumption, allowing real-time deployment on edge devices. Attention mechanisms will be considered for extracting the semantic information necessary for scene understanding. The language model can be stored locally and adjusted jointly with the BNN to generate precise and contextually relevant answers.
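To give a flavour of the building blocks involved, here is a minimal sketch of a binarized projection layer with a straight-through estimator (STE), as commonly used in BNNs; all sizes and design choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return w.sign()                            # weights constrained to {-1, +1}

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()   # STE: pass gradient inside [-1, 1]

class BinaryLinear(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_f, in_f).uniform_(-0.1, 0.1))
        self.scale = nn.Parameter(torch.ones(out_f, 1))   # per-channel scaling

    def forward(self, x):
        wb = BinarizeSTE.apply(self.weight) * self.scale  # binary weights + scale
        return F.linear(x, wb)
```

For instance, the query/key/value projections of a lightweight attention block could be BinaryLinear layers feeding a standard scaled dot-product attention.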
This project offers an opportunity for candidates interested in Tiny Deep Learning and LLMs. It opens a broad field of research, with room for significant contributions and results relevant to concrete applications. The work will consist of developing a robust BNN topology for semantic scene analysis under hardware constraints (memory and computation), then integrating and jointly optimizing the BNN encoder with the LLM, while ensuring a coherent and performant VQA system across different types of queries.
Predictive Diagnosis and Ageing Trajectory Estimation of New Generation Batteries through Multi-modality Fusion and Physics-Informed Machine Learning
Context:
Lithium-ion and emerging Sodium-ion batteries are crucial for energy transition and transportation electrification. Ensuring battery longevity, performance, and safety requires understanding degradation mechanisms at multiple scales.
Research Objective:
Develop innovative battery diagnostic and prognostic methodologies by leveraging multi-sensor data fusion (acoustic, strain-gauge, thermal, electrical, and optical sensors) and Physics-Informed Machine Learning (PIML) approaches that combine physical battery models with deep learning algorithms.
Scientific Approach:
Establish correlations between multi-physical measurements and battery degradation mechanisms
Explore hybrid PIML approaches for multi-physical data fusion (a minimal sketch follows this list)
Develop learning architectures integrating physical constraints while processing heterogeneous data
Extend methodologies to emerging Na-Ion battery technologies
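As announced above, here is a minimal sketch of the kind of hybrid loss PIML involves, under strongly simplified assumptions: a semi-empirical square-root-of-time capacity-fade law (typical of SEI-growth models) stands in for a full physical model, and time is the only input instead of fused multi-sensor features.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
k = 0.05                                     # assumed ageing-law constant

def piml_loss(t_data, q_data, t_phys):
    data_loss = ((net(t_data) - q_data) ** 2).mean()
    t = t_phys.requires_grad_(True)
    q = net(t)
    dq_dt, = torch.autograd.grad(q.sum(), t, create_graph=True)
    residual = dq_dt - k / (2 * torch.sqrt(t))       # physics residual: dQ/dt law
    return data_loss + 0.1 * (residual ** 2).mean()  # weighted physics penalty

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
t_data = torch.rand(32, 1) * 100 + 1
q_data = k * torch.sqrt(t_data)              # toy "measured" capacity loss
for _ in range(100):
    opt.zero_grad()
    piml_loss(t_data, q_data, torch.rand(64, 1) * 100 + 1).backward()
    opt.step()
```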
Methodology:
The research will utilize an extensive multi-instrumented cell database, analyzing measurement signatures, developing innovative PIML algorithms that optimize multi-sensor data fusion, and validating performance using real-world data.
Expected Outcomes:
The thesis aims to provide valuable recommendations for battery system instrumentation, develop advanced diagnostic algorithms, and contribute significantly to improving the reliability and sustainability of electrochemical storage systems, with potential academic and industrial impacts.
Defense of scene analysis models against adversarial attacks
In many applications, scene analysis modules such as object detection and recognition, or pose recognition, are required. Deep neural networks are nowadays among the most efficient models for performing a large number of vision tasks, sometimes simultaneously in the case of multitask learning. However, they have been shown to be vulnerable to adversarial attacks: it is possible to add to the input data perturbations, imperceptible to the human eye, that undermine the results of the network's inference. Yet a guarantee of reliable results is essential for applications such as autonomous vehicles or person search in video surveillance, where security is critical. Different types of adversarial attacks and defenses have been proposed, most often for the classification problem (of images in particular). Some works have addressed attacks on embeddings optimized by metric learning, used especially for open-set tasks such as object re-identification, facial recognition, or content-based image retrieval. The types of attacks have multiplied: some are universal, others optimized for a particular instance. The proposed defenses must deal with these new threats without sacrificing too much of the model's initial performance. Protecting decision systems from adversarial attacks is essential where security vulnerabilities are critical, and one way to do so is to develop defenses against these attacks. The objective will therefore be to study and propose attacks and defenses applicable to scene analysis modules, especially those for object detection and object instance search in images.
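For reference, the canonical white-box attack such defenses must withstand is the fast gradient sign method (FGSM, Goodfellow et al., 2015); below is a minimal sketch with placeholder model and inputs.

```python
import torch
import torch.nn as nn

def fgsm_attack(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                eps: float = 8 / 255) -> torch.Tensor:
    """Return x perturbed by one imperceptible, loss-maximising step."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()    # one signed-gradient step
    return x_adv.clamp(0, 1).detach()              # keep a valid image
```

A basic counter-measure is adversarial training: augmenting each training batch with fgsm_attack(model, x, y) examples before the optimisation step; stronger attacks and defenses build on this pattern.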