Tutorial for WP10 text classification

This template is a quick-start pack for AIML4OS projects that use R. For Python projects, see this GitHub repository.

Authors

Published

August 18, 2025

What is this tutorial about?

Link to the introductory presentation.

Official statistics typically use standardized classifications like NACE, ISCO, or COICOP to organize data. Assigning codes to text descriptions is known as the text-to-code problem. For example, we map the answer “I work as a software developer” provided by a respondent to the ISCO-08 category 25 Information and Communications Technology Professionals.

Manual coding is slow and labor-intensive, but transformer models can make this process more efficient. This tutorial shows how these models can partially automate the classification task.

The tutorial consists of three parts. Each part is a self-contained notebook. The notebooks can be executed, for example, on the SSP Cloud datalab, a platform built on Onyxia, an open-source web application developed by INSEE. This environment provides data scientists with a modern, cloud-based workspace. We provide links to launch the notebooks in a pre-configured Python environment with the required data. An account on SSP Cloud is needed.

Part 1: Synthetic data generation to improve the training data set

In practice, training data sets that include statistical classification codes are often imbalanced, with certain categories significantly underrepresented. As a result, the model sees too few examples of some categories to learn them properly. One approach for mitigating this issue is data augmentation. In this context, we explore how LLMs, together with the official explanatory notes of the classification, can be leveraged to generate synthetic data points.
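To make the idea concrete, here is a minimal sketch of the preparation step: finding the underrepresented codes and building one prompt per code from its explanatory note. It is not the notebook's implementation. The function names, the `min_count` threshold, the prompt wording, and the placeholder notes are all assumptions made for illustration, and the call to an LLM is deliberately left out.

```python
from collections import Counter


def find_underrepresented(labels, min_count):
    """Return {code: number of missing examples} for codes seen fewer than min_count times."""
    counts = Counter(labels)
    return {code: min_count - n for code, n in counts.items() if n < min_count}


def build_prompt(code, note, n_examples):
    """Build a text prompt asking an LLM for synthetic descriptions of one class (hypothetical wording)."""
    return (
        "You are helping to create training data for a text classifier.\n"
        f"Classification code: {code}\n"
        f"Explanatory note: {note}\n"
        f"Write {n_examples} short, varied descriptions that belong to this class, "
        "one per line."
    )


# Toy labelled data set: code 62 dominates, codes 56 and 01 are rare.
labels = ["62"] * 5 + ["56"] * 2 + ["01"]

# Placeholder texts; in practice these would be the official explanatory notes.
notes = {
    "01": "<explanatory note for code 01>",
    "56": "<explanatory note for code 56>",
    "62": "<explanatory note for code 62>",
}

# Ask for as many synthetic examples as each rare class is missing.
shortfall = find_underrepresented(labels, min_count=4)
prompts = {code: build_prompt(code, notes[code], n) for code, n in shortfall.items()}

for code, prompt in prompts.items():
    print(f"--- {code} ---\n{prompt}\n")
```

Each prompt would then be sent to an LLM of your choice, and the generated lines added to the training set with the corresponding code as label.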

Part 2: Fine-tuning a transformer pipeline for text classification

In this part, we will learn how to fine-tune transformer models from Hugging Face for text classification tasks. Using pre-labeled datasets, we will demonstrate how to fine-tune a pretrained model and evaluate its performance on a test set. We illustrate the pipeline on an open data set that links company descriptions with corresponding NACE codes.
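The fine-tuning itself relies on Hugging Face tooling and is covered in the notebook. The sketch below illustrates only the evaluation step, in plain Python and on made-up predictions. It compares accuracy with macro-averaged F1, which weights every class equally and is therefore often more informative on imbalanced code distributions. The choice of metrics and the toy labels are assumptions for illustration, not necessarily what the notebook reports.

```python
def accuracy(y_true, y_pred):
    """Share of test examples whose predicted code equals the true code."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 score for every code that appears in the true or predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Made-up test set: the model never predicts the rare code 56.
y_true = ["62", "62", "62", "62", "56", "01"]
y_pred = ["62", "62", "62", "62", "62", "01"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro F1 = {f1:.3f}")
```

In this toy case, accuracy looks high because the frequent class is predicted well, while macro F1 drops because one rare class is missed entirely.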

Part 3: Setting up a simple RAG pipeline for automatic text classification

We will explore an alternative approach to text classification using a simple Retrieval-Augmented Generation (RAG) pipeline. We will construct a knowledge base using official NACE category descriptions. Applying semantic search, we then select the k closest classes for a given user query. Finally, we will use an LLM to generate classification predictions.
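The retrieval step can be sketched as follows. This is a toy illustration under stated assumptions: the knowledge base holds only four NACE Rev. 2 division titles, the `embed` function is a bag-of-words stand-in for a real sentence-embedding model, and the prompt wording and function names are hypothetical. The final LLM call is left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative knowledge base: a few NACE Rev. 2 division titles.
KNOWLEDGE_BASE = {
    "01": "Crop and animal production, hunting and related service activities",
    "47": "Retail trade, except of motor vehicles and motorcycles",
    "56": "Food and beverage service activities",
    "62": "Computer programming, consultancy and related activities",
}


def embed(text):
    """Toy embedding: word counts. A real pipeline would use a sentence-embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k):
    """Return the k codes whose descriptions are most similar to the query."""
    q = embed(query)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda code: cosine(q, embed(KNOWLEDGE_BASE[code])),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(query, candidates):
    """Assemble the prompt that would be sent to an LLM (hypothetical wording)."""
    options = "\n".join(f"{code}: {KNOWLEDGE_BASE[code]}" for code in candidates)
    return (
        f"Activity description: {query}\n"
        f"Candidate NACE codes:\n{options}\n"
        "Answer with the single best matching code."
    )


query = "We offer computer programming and software consultancy"
candidates = retrieve(query, k=2)
print(build_rag_prompt(query, candidates))
```

Restricting the LLM to the retrieved candidates keeps the prompt short and limits the answer to codes that actually exist in the classification. With word counts as embeddings, common words such as "and" influence the ranking, which is one reason a real pipeline would use semantic embeddings instead.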

What is this template?

This template is a Quarto Website customized for the AIML4OS project, designed to be used with .

How to use it?

See this tutorial.

Reuse

CC0 1.0 Universal