
Session 4 replay (Maxime Delmas): Language Models for Information Extraction

During this fourth session, held on March 5, Maxime Delmas (a postdoctoral researcher at IDIAP) presented Language Models for Information Extraction, comparing direct extraction with synthetic data generation.

Many biomedical databases can now rely on trained extraction models to automatically extract information from scientific articles and populate their records. Unfortunately, this advantage is typically restricted to popular domains, such as gene-disease relationships, primarily because of the scarcity of the raw material: training data. Annotating large datasets with domain experts is a time-consuming and costly process, leaving less-explored domains with a dearth of training data. Although of high interest for drug discovery, the natural-products literature, which reports the identification of potential bioactive compounds from organisms, is a concrete example of such an overlooked topic.

The recent emergence of Large Language Models (LLMs) holds great potential for addressing this limitation. First, these models exhibit a remarkable ability to learn new tasks from only a handful of examples, known as few-shot learning. Beyond using them to perform the extraction task directly, we also explored their potential to alleviate the real issue, namely the lack of training data. In a second phase, we used synthetic training data, in the form of automatically generated "fake" abstracts of scientific articles, to train smaller extraction models, and compared their performance against the original LLMs.

In summary, our work explores the versatility of LLMs in extracting information from biomedical text, with a specific focus on the natural-products literature. We assess their performance both as extractors and as generators of synthetic data.
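To make the two roles of the LLM concrete, here is a minimal Python sketch of (1) few-shot relation extraction and (2) synthetic abstract generation. It assumes an OpenAI-style chat API; the model name, prompts, examples, and relation schema are illustrative placeholders, not the actual setup presented in the talk.

```python
# Minimal sketch of the two LLM roles discussed in the talk:
# (1) few-shot extraction of organism-compound relations, and
# (2) generation of synthetic "fake" abstracts whose gold annotations
#     are known by construction, usable to train a smaller extraction model.
# Assumes the openai Python client; model and prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# (1) Few-shot learning: a handful of annotated examples steer the model.
FEW_SHOT_EXAMPLES = """\
Abstract: Two new alkaloids were isolated from Piper nigrum.
Relations: [("Piper nigrum", "produces", "alkaloids")]

Abstract: Screening of Streptomyces griseus extracts yielded streptomycin.
Relations: [("Streptomyces griseus", "produces", "streptomycin")]
"""

def extract_relations(abstract: str) -> str:
    """Ask the LLM to list (organism, relation, compound) triples."""
    prompt = (
        "Extract (organism, relation, compound) triples from the abstract.\n\n"
        f"{FEW_SHOT_EXAMPLES}\nAbstract: {abstract}\nRelations:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model, not the one from the talk
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic output suits extraction
    )
    return response.choices[0].message.content

# (2) Synthetic data generation: invert the task so the label is known
# in advance, then use the generated text to train a smaller model.
def generate_synthetic_abstract(organism: str, compound: str) -> str:
    """Generate a fake natural-products abstract for a known relation."""
    prompt = (
        f"Write a short, realistic scientific abstract reporting that the "
        f"organism {organism} produces the compound {compound}."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,  # some diversity helps a synthetic training set
    )
    return response.choices[0].message.content
```

The design point is the inversion in step (2): because the relation is fixed before the abstract is generated, every synthetic abstract comes with a free gold annotation, sidestepping the costly expert-annotation bottleneck described above.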

Download Maxime Delmas's presentation

Watch the video: