



Information extraction from text, a subfield of Natural Language Processing, has been studied for many years. This research has focused primarily on Named Entity Recognition, relation extraction between entities, and, in its most complex form, event extraction, a task typically formulated as filling predefined templates from unstructured text. Within this framework, the objective of this thesis is to design, develop, and evaluate event extraction models that operate on scientific articles. In this context, an "event" may correspond to a set of entities and relations characterizing, for instance, a chemical reaction or an experiment. Furthermore, these models must be trainable from a very small set of annotated data, so that they can be adapted rapidly to new scientific domains.
From a methodological standpoint, the proposed thesis seeks to move beyond the current, almost reflexive tendency to rely exclusively on Large Language Models (LLMs). Instead, it proposes to explore a potential synergy between LLMs and smaller encoder-based models in a few-shot setting. In this synergy, LLMs are used to generate synthetic data and annotations, which provide the resources needed to build the encoder-based models through pre-training mechanisms. This thesis will be conducted as part of the AIKO project of the Digital Programs Agency, which focuses on knowledge extraction from scientific publications.

