Project 2: Automatic coding for NACE classification

Authors

Théo FERRY
Julien PRAMIL
Meilame TAYEBJEE

Tasks by technical level:

  • Beginner: Follow the tutorial and understand the provided answers.
  • Intermediate: Try to complete the exercises without checking the answers.
  • Expert: Enhance the pipeline and improve results: better models, richer embeddings, prompt tuning, etc.

Introduction

This project is part of the Funathon, a non-competitive hackathon. The goal is to build automatic coding tools that map free-text activity labels (such as “Installation and repair of residential air conditioning systems”) to the corresponding code in NACE 2.1, the European statistical classification of economic activities.

We explore two approaches to the same task. Both take a textual activity label as input and produce a NACE code, but they rely on very different paradigms, with different strengths and weaknesses.

1 Two independent parts

Important

The two parts below are independent. You can tackle either one, or both, in any order, depending on your time and interests. Each part is self-contained and starts from the same initialization (see below).

1.1 Part 1: Supervised learning

Train a deep learning text classifier on a labelled dataset of (activity label, NACE code) pairs, using Insee’s torchTextClassifiers package. You will learn how to:

  • Load and preprocess a labelled dataset
  • Train a deep learning model on the multi-class classification task
  • Evaluate the resulting classifier
  • Log experiments and model artifacts with MLflow for reproducibility
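For the evaluation step, one common metric for multi-class coding is top-k accuracy: the share of labels whose true code appears among the k best-ranked predictions. The sketch below is plain Python and does not use the torchTextClassifiers API; the function name `top_k_accuracy` and the codes `CODE_A`, `CODE_B`, `CODE_C` are hypothetical placeholders, not real NACE 2.1 entries.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    gold_codes: Sequence[str],
    k: int = 1,
) -> float:
    """Share of examples whose gold code is among the first k predicted codes."""
    if len(ranked_predictions) != len(gold_codes):
        raise ValueError("predictions and gold codes must have the same length")
    if not gold_codes:
        raise ValueError("cannot evaluate on an empty dataset")
    hits = sum(
        gold in ranked[:k] for ranked, gold in zip(ranked_predictions, gold_codes)
    )
    return hits / len(gold_codes)


# Placeholder codes for illustration only (not real NACE 2.1 codes).
gold = ["CODE_A", "CODE_B", "CODE_C", "CODE_A"]
ranked = [
    ["CODE_A", "CODE_B"],  # correct at rank 1
    ["CODE_C", "CODE_B"],  # correct at rank 2
    ["CODE_A", "CODE_B"],  # gold code missing from the top 2
    ["CODE_A", "CODE_C"],  # correct at rank 1
]

top1 = top_k_accuracy(ranked, gold, k=1)
top2 = top_k_accuracy(ranked, gold, k=2)
print(f"top-1 accuracy: {top1:.2f}, top-2 accuracy: {top2:.2f}")
```

Reporting both top-1 and top-k accuracy can help when the classifier is meant to suggest a short list of candidate codes to a human coder rather than a single answer.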

1.2 Part 2: RAG pipeline (3 chapters)

Build a retrieval-augmented generation (RAG) system that combines a vector database of official NACE definitions with a Large Language Model.
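The retrieve-then-prompt idea behind a RAG pipeline can be sketched in a few lines of standard-library Python. This is only an illustration under strong assumptions: bag-of-words counts stand in for a real embedding model, an in-memory dictionary stands in for the Qdrant vector database, and the prompt is returned as a string instead of being sent to an LLM. The codes and definitions in `NACE_DEFINITIONS`, and the helper names `embed`, `retrieve`, and `build_prompt`, are hypothetical and do not come from the project or from NACE 2.1.

```python
import math
import re
from collections import Counter

# Toy knowledge base: placeholder codes and definitions, NOT real NACE 2.1 entries.
NACE_DEFINITIONS = {
    "CODE_A": "installation and repair of heating and air conditioning systems in buildings",
    "CODE_B": "retail sale of bread and pastry in specialised stores",
    "CODE_C": "computer programming and software development services",
}


def embed(text: str) -> Counter:
    """Bag-of-words term counts, a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(label: str, k: int = 2) -> list[str]:
    """Return the k codes whose definitions are most similar to the label."""
    query = embed(label)
    scored = [
        (cosine(query, embed(definition)), code)
        for code, definition in NACE_DEFINITIONS.items()
    ]
    scored.sort(reverse=True)
    return [code for _, code in scored[:k]]


def build_prompt(label: str, candidates: list[str]) -> str:
    """Assemble the text an LLM would receive: the label plus retrieved definitions."""
    context = "\n".join(f"- {code}: {NACE_DEFINITIONS[code]}" for code in candidates)
    return (
        f"Activity label: {label}\n"
        f"Candidate codes:\n{context}\n"
        "Answer with the single best matching code."
    )


label = "Installation and repair of residential air conditioning systems"
candidates = retrieve(label, k=2)
prompt = build_prompt(label, candidates)
print(prompt)
```

Restricting the LLM to a handful of retrieved candidates keeps the prompt short and makes it harder for the model to answer with a code that does not exist in the nomenclature.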

2 Initialization of the project

Before diving into either part, follow these steps to set up your environment. The same setup applies to both parts.

If you need more details on the setup, visit the Project 1 introduction page or watch the replay of the 2026 Funathon onboarding session.

2.1 Fork this repository

On the project’s GitHub page, click Fork (top-right corner) to create a copy of the repository under your own GitHub account. Working on a fork lets you commit your experiments and your progress without affecting the original repository.

2.2 Open a VS Code service on SSPCloud

Go to the SSPCloud catalogue and launch a VS Code Python service. This gives you a browser-based development environment with direct access to the platform’s resources (S3 storage, Qdrant vector database, LLM Lab, MLflow). No local installation required.

2.3 Clone your fork inside the service

Open a terminal in the VS Code service and clone your fork, replacing <YOUR_GITHUB_NAME> with your GitHub username, <YOUR_GITHUB_TOKEN> with your GitHub personal access token, and <YOUR_EMAIL> with your email address:

git clone https://<YOUR_GITHUB_TOKEN>@github.com/<YOUR_GITHUB_NAME>/funathon-project2.git
git config --global user.email "<YOUR_EMAIL>"
git config --global user.name "<YOUR_GITHUB_NAME>"
cd funathon-project2

2.4 Install dependencies with uv sync

The project uses uv for dependency management. From the repository root, run:

uv sync

This will:

  • create a virtual environment in .venv/ at the root of the project,
  • install every package pinned in uv.lock.

Re-run uv sync whenever pyproject.toml or uv.lock changes (for instance, after pulling updates from upstream).

2.5 Select the Python interpreter (VS Code)

If you want to run code interactively in VS Code instead of going through quarto preview, point VS Code to the virtual environment you just created:

  1. Open the Command Palette (Ctrl+Shift+P or Cmd+Shift+P).

  2. Run >Python: Select Interpreter.

  3. Choose Enter interpreter path and provide:

    /home/onyxia/work/funathon-project2/.venv/bin/python
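To confirm that the selected interpreter really is the one from the project's virtual environment, a small check like the following can be run in a Python file or interactive window. It relies only on the standard library: inside a virtual environment, `sys.prefix` differs from `sys.base_prefix`. The helper names `in_virtualenv` and `describe_interpreter` are hypothetical and not part of the project.

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def describe_interpreter() -> str:
    """Summarise which interpreter is running and where its environment lives."""
    status = "virtual environment" if in_virtualenv() else "system interpreter"
    return f"{sys.executable} ({status}, prefix: {sys.prefix})"


print(describe_interpreter())
```

If the printed path does not point inside the project's .venv/ directory, repeat the interpreter selection steps above.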