Project 2: Automatic coding for NACE classification
| Technical level | Tasks |
|---|---|
| Follow the tutorial and understand provided answers. | |
| Try to complete the exercises without checking the answers. | |
| Enhance the pipeline and improve results: better models, richer embeddings, prompt tuning, etc. | |
Introduction
This Funathon project (a non-competitive hackathon) is about building automatic coding tools that map free-text activity labels (such as “Installation and repair of residential air conditioning systems”) to the corresponding code in the NACE 2.1 nomenclature, the European statistical classification of economic activities.
We explore two approaches to the same task. Both take a textual activity label as input and produce a NACE code, but they rely on very different paradigms, with different strengths and weaknesses.
1 Two independent parts
The two parts below are independent. You can tackle either one, or both, in any order, depending on your time and interests. Each part is self-contained and starts from the same initialization (see below).
1.1 Part 1: Supervised learning
Train a deep learning text classifier on a labelled dataset of (activity label, NACE code) pairs, using Insee’s torchTextClassifiers package. You will learn how to:
- Load and preprocess a labelled dataset
- Train a deep learning model on the multi-class classification task
- Evaluate the resulting classifier
- Log experiments and model artifacts with MLflow for reproducibility
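The evaluation step above can be sketched without any particular library. The snippet below is only a minimal illustration and does not use the torchTextClassifiers API: the function name is hypothetical, the code strings are illustrative values, and it assumes codes are stored as strings such as "43.22" whose first two characters identify the division, so that accuracy can also be reported at a coarser level.

```python
def accuracy(y_true, y_pred, prefix_len=None):
    """Share of predictions that match the true code.

    If prefix_len is given, only the first prefix_len characters of each
    code are compared (assumption: codes are strings like "43.22", so
    prefix_len=2 compares divisions).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    hits = sum(t[:prefix_len] == p[:prefix_len] for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Illustrative code strings only, not taken from the project dataset.
true_codes = ["43.22", "01.11", "47.11", "43.21"]
pred_codes = ["43.22", "01.12", "47.11", "43.29"]

exact = accuracy(true_codes, pred_codes)
by_division = accuracy(true_codes, pred_codes, prefix_len=2)
print(f"exact match: {exact:.2f}, division level: {by_division:.2f}")
```

Reporting both figures helps tell a classifier that lands in the right division but the wrong class apart from one that is wrong at every level.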
1.2 Part 2: RAG pipeline (3 chapters)
Build a retrieval-augmented generation (RAG) system that combines a vector database of official NACE definitions with a Large Language Model:
- RAG - Introduction: concepts, components, and overview of the pipeline
- RAG - Create the vector DB: embed NACE definitions and store them in Qdrant
- RAG - Generate predictions: query the LLM with retrieved context to classify new activities, then evaluate the pipeline
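The retrieve-then-prompt idea behind these chapters can be sketched in a few lines. This is a toy stand-in under stated assumptions, not the project's pipeline: word counts replace a real embedding model, an in-memory dictionary replaces Qdrant, no LLM is called, and the codes and definitions ("X1", "X2", "X3") are placeholders rather than actual NACE entries. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag of lowercase words (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, definitions, k=2):
    """Return the k (code, definition) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(
        definitions.items(),
        key=lambda item: cosine(q, embed(item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, retrieved):
    """Assemble the text that would be sent to an LLM (no call is made here)."""
    context = "\n".join(f"- {code}: {text}" for code, text in retrieved)
    return (
        "Candidate codes:\n"
        f"{context}\n"
        f"Activity to classify: {query}\n"
        "Answer with the single best code."
    )


# Placeholder definitions, NOT actual NACE entries.
definitions = {
    "X1": "installation of air conditioning systems",
    "X2": "growing of cereals",
    "X3": "retail sale of bread",
}
query = "repair of air conditioning"
prompt = build_prompt(query, retrieve(query, definitions, k=2))
print(prompt)
```

In the real pipeline, the retrieval step narrows the full nomenclature down to a handful of candidates, so the LLM only has to choose among definitions that fit in its prompt.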
2 Initialization of the project
Before diving into either part, follow these steps to set up your environment. The same setup applies to both parts.
If you need more details on the setup, see the Project 1 introduction page or watch the replay of the 2026 Funathon onboarding session.
2.1 Fork this repository
On the project’s GitHub page, click Fork (top-right corner) to create a copy of the repository under your own GitHub account. Working on a fork lets you commit your experiments and your progress without affecting the original repository.
2.2 Open a VS Code service on SSPCloud
Go to the SSPCloud catalogue and launch a VS Code Python service. This gives you a browser-based development environment with direct access to the platform’s resources (S3 storage, Qdrant vector database, LLM Lab, MLflow). No local installation required.
2.3 Clone your fork inside the service
Open a terminal in the VS Code service and clone your fork (replace <YOUR_GITHUB_NAME> with your GitHub username, <YOUR_GITHUB_TOKEN> with your token, and so on):
git clone https://<YOUR_GITHUB_TOKEN>@github.com/<YOUR_GITHUB_NAME>/funathon-project2.git
git config --global user.email "<YOUR_EMAIL>"
git config --global user.name "<YOUR_GITHUB_NAME>"
cd funathon-project2
2.4 Install dependencies with uv sync
The project uses uv for dependency management. From the repository root, run:
uv sync
This will:
- create a virtual environment in .venv/ at the root of the project,
- install every package pinned in uv.lock.
Re-run uv sync whenever pyproject.toml or uv.lock changes (for instance, after pulling updates from upstream).
2.5 Select the Python interpreter (VS Code)
If you want to run code interactively in VS Code, instead of going through quarto preview, point VS Code to the virtual environment you just created:
- Open the Command Palette (Ctrl+Shift+P or Cmd+Shift+P).
- Run >Python: Select Interpreter.
- Choose Enter interpreter path and provide:
/home/onyxia/work/funathon-project2/.venv/bin/python
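To confirm that the selection took effect, a quick check from an interactive cell can help. The sketch below is a minimal illustration with a hypothetical helper name. It assumes the virtual environment lives at the path shown above, and it deliberately compares paths without resolving symlinks, since the interpreter inside a virtual environment is often a symlink to a system Python.

```python
import sys
from pathlib import Path


def interpreter_in_venv(venv_dir, executable=None):
    """Return True if the interpreter path sits inside venv_dir.

    executable defaults to the running interpreter (sys.executable).
    Paths are compared as written, without resolving symlinks.
    """
    exe = Path(executable or sys.executable)
    return Path(venv_dir) in exe.parents


# Assumed location of the environment created by uv sync.
VENV_DIR = "/home/onyxia/work/funathon-project2/.venv"

print("Running interpreter:", sys.executable)
print("Inside project venv:", interpreter_in_venv(VENV_DIR))
```

If the second line prints False, the editor is probably still pointing at another interpreter, and packages installed by uv sync will not be importable.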
2.6 Run code interactively (optional but recommended)
For exploration, debugging, or quick experiments, the easiest workflow is to create your own scripts at the root of the repository. A useful pattern in VS Code is to use cell markers (# %%) inside a regular .py file: each # %% block becomes individually runnable, with outputs and plots rendered inline in the editor. It works like a Jupyter notebook, but with the simplicity of a plain Python file.
Example (scratch.py at the root of the repo):
# %%
# If you need to change working directory (default is your interactive .py file location)
# import os
# os.chdir("<NEW_RELATIVE_LOCATION>")
import pandas as pd
df = pd.read_parquet(
"https://minio.lab.sspcloud.fr/projet-formation/diffusion/funathon/2026/project2/generation_None_temp08.parquet"
)
df.head()
# %%
df["code"].value_counts().head(10).plot(kind="bar")
A small “Run Cell” lens appears above each # %% marker. Click it to execute that block, or press Shift+Enter. This is convenient for iterating on small pieces of code or testing specific functions from the project.
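To make the cell-marker convention concrete, the sketch below splits the text of a script into cells at each # %% line. It is only a conceptual illustration with a hypothetical function name, not how VS Code actually implements the feature.

```python
def split_cells(source):
    """Split script text into cells, one per '# %%' marker.

    Marker lines themselves are dropped, and empty cells are skipped.
    """
    cells = []
    current = []
    for line in source.splitlines():
        if line.strip().startswith("# %%"):
            # A new marker closes the cell collected so far.
            if any(part.strip() for part in current):
                cells.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    # Close the last cell at end of file.
    if any(part.strip() for part in current):
        cells.append("\n".join(current).strip())
    return cells


script = "# %%\nimport math\nx = 1\n# %%\nprint(x)"
for number, cell in enumerate(split_cells(script), start=1):
    print(f"--- cell {number} ---")
    print(cell)
```

Because the markers are ordinary comments, the same file still runs top to bottom as a regular script with python scratch.py.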