Project 2: Automatic coding for NACE classification

Authors

Affiliation

Théo FERRY

Julien PRAMIL

Meilame TAYEBJEE

Technical level	Tasks
Beginner	Follow the tutorial and understand provided answers.
Intermediate	Try to complete the exercises without checking the answers.
Expert	Enhance the pipeline and improve results: better models, richer embeddings, prompt tuning, etc.

Introduction

This project, part of the Funathon (a non-competitive hackathon), is about building automatic coding tools that map free-text activity labels (such as “Installation and repair of residential air conditioning systems”) to the corresponding code in NACE 2.1, the European statistical classification of economic activities.

We explore two approaches to the same task. Both take a textual activity label as input and produce a NACE code, but they rely on very different paradigms, with different strengths and weaknesses.

1 Two independent parts

Important

The two parts below are independent. You can tackle either one, or both, in any order, depending on your time and interests. Each part is self-contained and starts from the same initialization (see below).

1.1 Part 1: Supervised learning

Train a deep learning text classifier on a labelled dataset of (activity label, NACE code) pairs, using Insee’s torchTextClassifiers package. You will learn how to:

Load and preprocess a labelled dataset
Train a deep learning model on the multi-class classification task
Evaluate the resulting classifier
Log experiments and model artifacts with MLflow for reproducibility
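
To make the evaluation step concrete, here is a minimal sketch of one way to score predictions, assuming codes are stored as strings whose leading characters encode the coarser levels of the nomenclature. The function name accuracy_at_level and the example code strings are hypothetical and are not taken from the project or from torchTextClassifiers; the actual tutorial may use different metrics.

```python
def accuracy_at_level(y_true, y_pred, n_chars=None):
    """Share of predictions whose first n_chars characters match the true code.

    With n_chars=None the full code strings are compared (exact match).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    hits = sum(t[:n_chars] == p[:n_chars] for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Illustrative code strings only (assumed "DD.DD" format), not real project data.
y_true = ["43.22", "47.11", "62.10", "56.11"]
y_pred = ["43.22", "47.19", "62.20", "10.71"]

exact = accuracy_at_level(y_true, y_pred)                # full-code match
two_chars = accuracy_at_level(y_true, y_pred, n_chars=2)  # first two characters only

print(f"Exact accuracy: {exact:.2f}")
print(f"Accuracy on the first two characters: {two_chars:.2f}")
```

Comparing both numbers shows whether the errors are near misses within the same branch of the classification or land in an entirely different branch.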

1.2 Part 2: RAG pipeline (3 chapters)

Build a retrieval-augmented generation (RAG) system that combines a vector database of official NACE definitions with a Large Language Model:

RAG - Introduction: concepts, components, and overview of the pipeline
RAG - Create the vector DB: embed NACE definitions and store them in Qdrant
RAG - Generate predictions: query the LLM with retrieved context to classify new activities, then evaluate the pipeline
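
The retrieve-then-prompt idea behind these chapters can be sketched in a few lines of plain Python. This is a toy illustration under strong assumptions: a bag-of-words count vector stands in for a real embedding model, an in-memory dictionary stands in for Qdrant, and no LLM is called. The codes CODE_A, CODE_B, CODE_C and their definitions are made up, as are the helper names embed, cosine, retrieve and build_prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical definitions; the real pipeline uses official NACE definitions.
definitions = {
    "CODE_A": "Installation of plumbing, heating and air conditioning systems in buildings",
    "CODE_B": "Retail sale of bread and pastry in specialised stores",
    "CODE_C": "Computer programming and software development activities",
}

# In-memory stand-in for the vector database.
index = {code: embed(text) for code, text in definitions.items()}


def retrieve(query, k=2):
    """Return the k codes whose definitions are most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda code: cosine(q, index[code]), reverse=True)
    return ranked[:k]


def build_prompt(query, k=2):
    """Assemble the text that would be sent to the LLM with retrieved context."""
    context = "\n".join(f"- {code}: {definitions[code]}" for code in retrieve(query, k))
    return f"Candidate codes:\n{context}\n\nActivity: {query}\nAnswer with one code."


query = "Installation and repair of residential air conditioning systems"
print(build_prompt(query))
```

The point of the design is that the LLM only has to choose among a few retrieved candidates rather than recall the whole nomenclature from memory.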

2 Initialization of the project

Before diving into either part, follow these steps to set up your environment. The same setup applies to both parts.

If you need more details on the setup, do not hesitate to visit the Project 1 introduction page and/or watch the replay of the 2026 Funathon onboarding session.

2.1 Fork this repository

On the project’s GitHub page, click Fork (top-right corner) to create a copy of the repository under your own GitHub account. Working on a fork lets you commit your experiments and your progress without affecting the original repository.

2.2 Open a VS Code service on SSPCloud

Go to the SSPCloud catalogue and launch a VS Code Python service. This gives you a browser-based development environment with direct access to the platform’s resources (S3 storage, Qdrant vector database, LLM Lab, MLflow). No local installation required.

2.3 Clone your fork inside the service

Open a terminal in the VS Code service and clone your fork (replace <YOUR_GITHUB_NAME> with your GitHub username, <YOUR_GITHUB_TOKEN> with your token, and so on):

git clone https://<YOUR_GITHUB_TOKEN>@github.com/<YOUR_GITHUB_NAME>/funathon-project2.git
git config --global user.email "<YOUR_EMAIL>"
git config --global user.name "<YOUR_GITHUB_NAME>"
cd funathon-project2

2.4 Install dependencies with `uv sync`

The project uses uv for dependency management. From the repository root, run:

uv sync

This will:

create a virtual environment in .venv/ at the root of the project,
install every package pinned in uv.lock.

Re-run uv sync any time pyproject.toml or uv.lock changes (for instance, after pulling updates from upstream).

2.5 Select the Python interpreter (VS Code)

If you want to run code interactively in VS Code, instead of going through quarto preview, point VS Code to the virtual environment you just created:

Open the Command Palette (Ctrl+Shift+P or Cmd+Shift+P).
Run >Python: Select Interpreter.

Choose Enter interpreter path and provide:

/home/onyxia/work/funathon-project2/.venv/bin/python
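
To check that the selection took effect, a short script run from VS Code can report which interpreter is active. This is a generic sketch using only the standard library; the helper name running_in_venv is hypothetical, and the expectation that the prefix directory is named .venv assumes the default layout described above.

```python
import sys
from pathlib import Path


def running_in_venv():
    """True when the active interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


print("Interpreter path:", sys.executable)
print("Inside a virtual environment:", running_in_venv())
# With the setup above, this directory name is expected to be ".venv".
print("Environment directory:", Path(sys.prefix).name)
```

If the reported path does not point inside the repository's .venv/ directory, repeat the interpreter selection steps.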

2.6 Run code interactively (optional but recommended)

For exploration, debugging, or quick experiments, the easiest workflow is to create your own scripts at the root of the repository. A useful pattern in VS Code is to use cell markers (# %%) inside a regular .py file: each # %% block becomes individually runnable, with outputs and plots rendered inline in the editor. It works like a Jupyter notebook, but with the simplicity of a plain Python file.

Example (scratch.py at the root of the repo):

# %%
# If you need to change working directory (default is your interactive .py file location)
# import os
# os.chdir("<NEW_RELATIVE_LOCATION>")

import pandas as pd

df = pd.read_parquet(
    "https://minio.lab.sspcloud.fr/projet-formation/diffusion/funathon/2026/project2/generation_None_temp08.parquet"
)
df.head()

# %%
df["code"].value_counts().head(10).plot(kind="bar")

A small “Run Cell” lens appears above each # %% marker. Click it to execute that block. You can also use the shortcut Shift + Enter. This is convenient for iterating on small pieces of code or testing specific functions from the project.

You’re now ready to start. Head to Part 1 or Part 2.
