Project 2: Automatic coding for NACE classification
| Technical level | Tasks |
|---|---|
| Beginner | Run the notebook start-to-finish, load the dataset, and connect to Qdrant. |
| Intermediate | Modify the data query, train a simple classifier, and log results with MLflow. |
| Advanced | Extend the pipeline with a new embedding/model, deploy a serving endpoint, or integrate additional data sources. |
What you will learn
By following this tutorial, you will learn how to:
- Set up a Python environment and install dependencies for the project.
- Download and inspect a labeled dataset for NACE classification.
- Connect to a Qdrant vector database from Python.
- Explore the solution scripts in the `solutions/` folder.
- Understand how model training and logging are organized for reproducible experiments.
Introduction
This project demonstrates an end-to-end pipeline for automatically classifying text into NACE codes. It covers:
- generating and loading labeled data,
- preprocessing and model training,
- logging experiments with MLflow,
- and preparing deployment-ready artifacts.
1 Structure of the project
This project has five main steps (listed in the banner at the top of the page):
- data generation;
- data preprocessing;
- model fitting and evaluation;
- model logging with MLflow;
- deployment.
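As a taste of the model-fitting step, the sketch below trains a tiny text classifier with a TF-IDF + logistic-regression pipeline. The texts, labels, and the query string are illustrative placeholders, not taken from the project's dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative data only: short activity descriptions with NACE-like labels.
texts = [
    "bakery selling bread and pastries",
    "artisanal bread production",
    "software development consulting",
    "custom software engineering services",
]
labels = ["10.71", "10.71", "62.02", "62.02"]

# TF-IDF features fed into a logistic-regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# "bread" only occurs in the 10.71 examples, so this should map to 10.71.
print(model.predict(["fresh bread shop"]))
```

In the actual project the training step is wrapped with MLflow logging (step 4 above) so that each run's parameters and metrics are recorded.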
2 Initialization of the project
2.1 Fork this repository in your own GitHub
On the project's GitHub page, click Fork to create a copy in your own GitHub account.
Then clone your fork locally using the command below (replace <YOUR_GITHUB_NAME> with your username):
```bash
git clone https://github.com/<YOUR_GITHUB_NAME>/funathon-project2.git
cd funathon-project2
```
2.2 Setup the environment (quick start)
Run the following commands in a terminal inside the cloned repo:
```bash
# install/update dependencies
uv sync
# activate the project virtual environment
source .venv/bin/activate
```
✅ If you prefer to keep your shell clean, you can also prefix commands with `uv run` (e.g. `uv run python script.py`).
2.2.1 Select the Python interpreter (VS Code)
If you are using VS Code, make sure the workspace uses the virtual environment you just created:
- Open the Command Palette (`Ctrl+Shift+P` or `Cmd+Shift+P`).
- Run `>Python: Select Interpreter`.
- Choose Enter interpreter path and point to: `/home/onyxia/work/funathon-project2/.venv/bin/python3`
2.3 Quick start: run this notebook
- Open `index.qmd` in VS Code or Quarto.
- Run the first code cell (the data download cell) and wait for it to finish.
- Confirm you see a small table output (from `annotations.head()`).
2.4 Troubleshooting (common issues)
- If `uv sync` fails, re-run it and check the error message; a missing system package or a network issue is the most common cause.
- If the Qdrant connection fails, ensure your `.env` file contains the correct values for `QDRANT_URL`, `QDRANT_API_KEY`, and `QDRANT_API_PORT`.
- If a terminal command fails, verify you are in the repository root (`pwd` should end with `funathon-project2`).
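For reference, a `.env` file for this project might look like the sketch below. All values are placeholders; use the URL, port, and API key of your own Qdrant instance:

```ini
# .env — placeholder values, replace with your own Qdrant settings
QDRANT_URL=https://qdrant.example.org
QDRANT_API_PORT=443
QDRANT_API_KEY=your-api-key-here
```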
3 Loading the labeled dataset
In the following cell we retrieve a small labeled dataset (text + NACE labels) from a remote parquet file using DuckDB:
``` {.python .cell-code}
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# You can use the internal S3 endpoint if you have access (uncomment the line below):
# query_definition = "SELECT * FROM read_parquet('s3://projet-formation/diffusion/funathon/2026/project2/generation_None_temp08.parquet')"

# Using the public HTTPS endpoint is more likely to work in typical environments:
query_definition = "SELECT * FROM read_parquet('https://minio.lab.sspcloud.fr/projet-formation/diffusion/funathon/2026/project2/generation_None_temp08.parquet')"

annotations = con.sql(query_definition).to_df()
annotations.head()
```
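Once the annotations are loaded, a quick look at the label distribution helps catch loading problems early. The snippet below uses a small hand-built frame as a stand-in for the downloaded table (the real column names may differ):

```python
import pandas as pd

# Stand-in for the downloaded annotations table (hypothetical column names).
annotations = pd.DataFrame({
    "text": ["bakery selling bread", "software consulting", "bread and pastry shop"],
    "nace": ["10.71", "62.02", "10.71"],
})

# How many examples per NACE label?
print(annotations["nace"].value_counts())
```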
It is also useful to verify the Qdrant connection before running vector indexing or retrieval steps.
::: {#qdrant-client .cell execution_count=1}
``` {.python .cell-code}
import os

from dotenv import load_dotenv
from qdrant_client import QdrantClient

load_dotenv()

client_qdrant = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
    port=int(os.environ["QDRANT_API_PORT"]),  # port must be an integer
)

# A successful call confirms the connection works.
collections = client_qdrant.get_collections()
print(collections)
```
:::