Project 2: Automatic coding for NACE classification
| Technical level | Tasks |
|---|---|
| Beginner | Run the notebook start-to-finish, load the dataset, and connect to Qdrant. |
| Intermediate | Modify the data query, train a simple classifier, and log results with MLflow. |
| Advanced | Extend the pipeline with a new embedding/model, deploy a serving endpoint, or integrate additional data sources. |
What you will learn
By following this tutorial, you will learn how to:
- Set up a Python environment and install dependencies for the project.
- Download and inspect a labeled dataset for NACE classification.
- Connect to a Qdrant vector database from Python.
- Explore the solution scripts in the `solutions/` folder.
- Understand how model training and logging are organized for reproducible experiments.
Introduction
This project demonstrates an end-to-end pipeline for automatically classifying text into NACE codes. It covers:
- generating and loading labeled data,
- preprocessing and model training,
- logging experiments with MLflow,
- and preparing deployment-ready artifacts.
1 Structure of the project
This project has five main steps (listed in the banner at the top of the page):
- data generation;
- data preprocessing;
- model fitting and evaluation;
- model logging with MLflow;
- deployment.
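As a taste of the model-fitting step, the sketch below trains a tiny text classifier with a TF-IDF + logistic-regression pipeline. The texts, labels, and the query string are illustrative placeholders, not taken from the project's dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative data only: short activity descriptions with NACE-like labels.
texts = [
    "bakery selling bread and pastries",
    "artisanal bread production",
    "software development consulting",
    "custom software engineering services",
]
labels = ["10.71", "10.71", "62.02", "62.02"]

# TF-IDF features fed into a logistic-regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# "bread" only occurs in the 10.71 examples, so this should map to 10.71.
print(model.predict(["fresh bread shop"]))
```

In the actual project the training step is wrapped with MLflow logging (step 4 above) so that each run's parameters and metrics are recorded.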
2 Initialization of the project
2.1 Fork this repository in your own GitHub
On the project's GitHub page, click Fork to create a copy in your own GitHub account.
Then clone your fork locally using the command below (replace <YOUR_GITHUB_NAME> with your username):
```bash
git clone https://github.com/<YOUR_GITHUB_NAME>/funathon-project2.git
cd funathon-project2
```
2.2 Setup the environment (quick start)
Run the following commands in a terminal inside the cloned repo:
```bash
# install/update dependencies
uv sync
# activate the project virtual environment
source .venv/bin/activate
```
✅ If you prefer to keep your shell clean, you can also prefix commands with `uv run` (e.g. `uv run python script.py`).
2.2.1 Select the Python interpreter (VS Code)
If you are using VS Code, make sure the workspace uses the virtual environment you just created:
- Open the Command Palette (`Ctrl+Shift+P` or `Cmd+Shift+P`).
- Run `>Python: Select Interpreter`.
- Choose Enter interpreter path and point to: `/home/onyxia/work/funathon-project2/.venv/bin/python3`
2.3 Quick start: run this notebook
- Open `index.qmd` in VS Code or Quarto.
- Run the first code cell (the data download cell) and wait for it to finish.
- Confirm you see a small table output (from `annotations.head()`).
2.4 Troubleshooting (common issues)
- If `uv sync` fails, re-run it and check the error message; a missing system package or a network issue is the most common cause.
- If the Qdrant connection fails, ensure your `.env` file contains the correct values for `QDRANT_URL`, `QDRANT_API_KEY`, and `QDRANT_API_PORT`.
- If a terminal command fails, verify you are in the repository root (`pwd` should end with `funathon-project2`).
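For reference, a `.env` file for this project might look like the sketch below. All values are placeholders; use the URL, port, and API key of your own Qdrant instance:

```ini
# .env — placeholder values, replace with your own Qdrant settings
QDRANT_URL=https://qdrant.example.org
QDRANT_API_PORT=443
QDRANT_API_KEY=your-api-key-here
```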
3 Loading the labeled dataset
In the following cell we retrieve a small labeled dataset (text + NACE labels) from a remote parquet file using DuckDB:
``` {.python .cell-code}
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# You can use the internal S3 endpoint if you have access (uncomment the line below):
# query_definition = "SELECT * FROM read_parquet('s3://projet-formation/diffusion/funathon/2026/project2/generation_None_temp08.parquet')"

# Using the public HTTPS endpoint is more likely to work in typical environments:
query_definition = "SELECT * FROM read_parquet('https://minio.lab.sspcloud.fr/projet-formation/diffusion/funathon/2026/project2/generation_None_temp08.parquet')"

annotations = con.sql(query_definition).to_df()
annotations.head()
```
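Once the annotations are loaded, a quick look at the label distribution helps catch loading problems early. The snippet below uses a small hand-built frame as a stand-in for the downloaded table (the real column names may differ):

```python
import pandas as pd

# Stand-in for the downloaded annotations table (hypothetical column names).
annotations = pd.DataFrame({
    "text": ["bakery selling bread", "software consulting", "bread and pastry shop"],
    "nace": ["10.71", "62.02", "10.71"],
})

# How many examples per NACE label?
print(annotations["nace"].value_counts())
```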
It is also useful to verify the Qdrant connection before running vector indexing or retrieval steps.
::: {#qdrant-client .cell execution_count=1}
``` {.python .cell-code}
import os

from dotenv import load_dotenv
from qdrant_client import QdrantClient

load_dotenv()

client_qdrant = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
    port=int(os.environ["QDRANT_API_PORT"]),  # port must be an integer
)

# A successful call confirms the connection works.
collections = client_qdrant.get_collections()
print(collections)
```
:::