Supervised approach using TorchTextClassifiers

1 Foreword

1.1 The problem: classifying free-text activity descriptions

Given a short free-text description of a business activity, such as “I sell croissants and pains au chocolat”, the objective is to automatically find the correct NACE 2.1 activity code among hundreds of possible codes. This is a supervised multiclass classification problem.

1.2 Supervised learning or RAG?

Two main approaches exist for this task:

| Approach | How it works | Best when |
|---|---|---|
| Supervised (this notebook) | Train a neural network on labelled examples | Large labelled dataset available, low-latency inference needed |
| RAG (see notebook 2) | Retrieve similar examples, ask an LLM to decide | Few labelled examples |

In this notebook, we use the supervised approach: the model learns directly from thousands of labelled examples and can predict at inference time without calling an external API.

1.3 Pipeline overview

flowchart LR
    A["Labelled data<br/>text + NACE code"] --> B["Split<br/>train / val / test"]
    B --> C["WordPiece Tokenizer<br/>text → token IDs"]
    C --> D["torchTextClassifiers<br/>token IDs → NACE code"]
    D --> E["Evaluation<br/>& explainability"]

1.4 What you will learn

  • How to prepare and split a labelled text dataset
  • How to train a subword tokenizer (WordPiece) on domain-specific text (tokenizers are introduced in Section 4)
  • How to configure and train a neural text classifier end-to-end
  • How to load a pretrained model from MLflow and run inference
  • How to interpret predictions using the Captum package

1.5 Before you start

Four things need to be in place before running any cell.

1. Install the dependencies

All required packages are listed in pyproject.toml. Run the following command once at the root of the project to install them into a virtual environment:

uv sync

This installs, among others, torchTextClassifiers — the INSEE library that provides the tokenizer, model, and training loop used throughout this notebook. Its full API documentation is available here.

2. Understand what MLflow is and why we need it

MLflow is an open-source platform for tracking machine learning experiments. Every time a model is trained, MLflow records the hyperparameters used, the metrics obtained (loss, accuracy…), and the model artefacts (weights, tokenizer, configuration). This makes it easy to compare runs, reproduce results, and share trained models with others.

In this notebook, we use MLflow in two ways:

  • During training (Exercise 4), metrics are logged to a remote MLflow server so you can monitor progress.
  • After training (Exercise 5, Question 0), we download a pretrained model that was already trained on the full dataset and stored in MLflow, so you do not have to wait hours for training to complete.

3. Start an MLflow service

Launch an MLflow service from the SSP Cloud service catalogue: go to SSP Cloud, click on My Services, then New Service, then Automation, then MLflow, then Launch. Once the service starts, its public URL appears in the service details. It typically looks like https://user-<username>-mlflow.user.lab.sspcloud.fr.

4. Set up your MLflow credentials

Because the MLflow server is a remote, access-controlled service, you need to provide three pieces of information:

  • MLFLOW_TRACKING_URI: the URL of the MLflow server (e.g. https://user-<username>-mlflow.user.lab.sspcloud.fr);
  • MLFLOW_TRACKING_USERNAME;
  • MLFLOW_TRACKING_PASSWORD.

They are given on the information panel that opens when you launch your MLflow service.

These values must not be hardcoded in the notebook — storing credentials in source code is a security risk (anyone with these credentials could access your MLflow). Instead, create a file named .env at the root of the project and fill it in as follows:

MLFLOW_TRACKING_URI=<add the MLflow server URL>
MLFLOW_TRACKING_USERNAME=<add your username>
MLFLOW_TRACKING_PASSWORD=<add your password>

The notebook loads this file automatically at startup with load_dotenv(), which makes the values available as environment variables without them ever appearing in the code. Be careful: do not add spaces around the = sign, and do not wrap the values in single or double quotes. Example: MLFLOW_TRACKING_USERNAME=user-donaldduck.
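As a quick sanity check once the environment is loaded, a minimal sketch like the one below can report which of the three variables are still missing. The helper name missing_mlflow_vars and the example values are hypothetical and not part of the notebook.

```python
import os

REQUIRED_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
)


def missing_mlflow_vars(env=None):
    """Return the required MLflow variables that are absent or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Hypothetical, incomplete environment used only for illustration
example_env = {
    "MLFLOW_TRACKING_URI": "https://user-donaldduck-mlflow.user.lab.sspcloud.fr",
    "MLFLOW_TRACKING_USERNAME": "user-donaldduck",
}
print(missing_mlflow_vars(example_env))  # ['MLFLOW_TRACKING_PASSWORD']
```

Calling missing_mlflow_vars() without arguments inspects the real process environment, so it can be run right after load_dotenv().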


2 Load the data

The dataset consists of synthetic labelled examples generated for the NACE rev 2.1 nomenclature. Each row contains a short text description (label) and the corresponding NACE code (code).

Tip Exercise 1: Load the labelled dataset

2.1 Question 1 — Import libraries and load environment variables

Import the mlflow package and the load_dotenv function from the dotenv package. Then execute load_dotenv(override=True) to load your .env file.

Click to see the answer
import mlflow
from dotenv import load_dotenv

load_dotenv(override=True)

2.2 Question 2 — Load the dataset from s3

Use Polars to load the parquet file directly from this public URL:

https://minio.lab.sspcloud.fr/projet-formation/diffusion/funathon/2026/project2/generation_None_temp08.parquet

Print the first rows and the total number of observations. Do you understand all columns?

Clue here You may use the functions pl.read_parquet() and len(), and the method .head().
Click to see the answer
import polars as pl

df = pl.read_parquet("https://minio.lab.sspcloud.fr/projet-formation/diffusion/funathon/2026/project2/generation_None_temp08.parquet")

print(df.head())
print(f"Total rows: {len(df)}")

2.3 Question 3 — Count unique NACE codes

How many unique NACE codes are present in the dataset? Store this number in a variable called n_classes. This number will define the number of output classes the model must predict.

Clue here The .n_unique() method in Polars may help you compute the number of classes.
Click to see the answer
n_classes = df['code'].n_unique()
print(f"Number of unique NACE codes: {n_classes}")

3 Split the data

Before training a model, we need to split the dataset into three subsets:

  • Train: the examples the model learns from.
  • Validation: used during training to monitor performance and trigger early stopping (i.e. stop training before overfitting).
  • Test: held out until the very end to give an unbiased estimate of the model’s performance on unseen data.

Using the validation set to tune hyperparameters or stop training means it indirectly influences the model. The test set must therefore remain completely untouched during training so that the final evaluation is truly unbiased.

Tip Exercise 2: Prepare the data

3.1 Question 1 — Split the dataset into train / validation / test sets

Use train_test_split from sklearn.model_selection to split the dataset into train, validation, and test sets (70% / 15% / 15%). Do not forget to choose a random_state. Separate the target y from the features X, and convert them to numpy arrays. You should obtain objects X_train, y_train, and so on.

Clue here
  • train_test_split splits a dataset in two — call it twice to get three subsets.
  • You may use the to_numpy method to convert to numpy arrays.
Click to see the answer
from sklearn.model_selection import train_test_split

train_df, tmp_df = train_test_split(df, test_size=0.30, random_state=42)
val_df, test_df  = train_test_split(tmp_df, test_size=0.50, random_state=42)

X_train, y_train = train_df["label"].to_numpy(), train_df["code"].to_numpy()
X_val, y_val = val_df["label"].to_numpy(), val_df["code"].to_numpy()
X_test, y_test = test_df["label"].to_numpy(), test_df["code"].to_numpy()

print(f"Train: {len(train_df)} | Val: {len(val_df)} | Test: {len(test_df)}")
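The two-step split works because 30% of the data is first set aside, then cut in half: 0.30 × 0.50 = 0.15 of the original dataset for validation and 0.15 for test. The sketch below reproduces the same proportions in plain Python; three_way_split is a hypothetical helper written for illustration, not a scikit-learn function.

```python
import random


def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle items and cut them into train / validation / test parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)

    n = len(shuffled)
    n_val = round(n * val_frac)
    n_test = round(n * test_frac)
    n_train = n - n_val - n_test

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

Because each example lands in exactly one part, no test example can leak into training.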

3.2 Question 2 — Encode the labels

The code column contains strings like "47.11A" (codes from the NACE classification). Neural networks do not understand strings and require integer targets (0, 1, 2…), so we need to map each unique NACE code to an integer index. To do that, you will use an encoder.

Use LabelEncoder from sklearn.preprocessing (see here for the documentation). Call .fit() on the code column of your training dataframe (after having converted it to a numpy array with .to_numpy()).

LabelEncoder maps the categorical target y to the integers 0, 1, …, n_classes - 1, which is the target format the classifier expects.

Warning: Make sure all labels appear in the training set!

Fit the LabelEncoder on the training set only: fitting on validation or test data would be data leakage. However, this means that any NACE code absent from the training set will cause an error at inference time. After splitting, check that every code appears at least once in the training set.

Click to see the answer
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(train_df['code'].to_numpy())
Click to see the answer
all_codes  = set(df['code'])
train_codes = set(train_df['code'])
missing = all_codes - train_codes

if missing:
    print(f"WARNING: {len(missing)} code(s) missing from training set: {missing}")
else:
    print(f"OK — all {len(all_codes)} codes appear in the training set.")
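To make the encoding step concrete, here is a minimal plain-Python sketch of what a label encoder does: sort the unique codes seen during fitting and number them from 0. The function name fit_label_mapping and the sample codes are illustrative only.

```python
def fit_label_mapping(codes):
    """Map each unique code to an integer index, in sorted order."""
    classes = sorted(set(codes))
    return {code: index for index, code in enumerate(classes)}


# Illustrative training codes (not taken from the dataset)
train_codes = ["47.11", "10.71", "47.11", "56.10"]
mapping = fit_label_mapping(train_codes)
inverse = {index: code for code, index in mapping.items()}

print(mapping)                             # {'10.71': 0, '47.11': 1, '56.10': 2}
print([mapping[c] for c in train_codes])   # [1, 0, 1, 2]
print(inverse[2])                          # 56.10

# A code that was never seen during fitting cannot be encoded
print("99.99" in mapping)                  # False
```

The last line shows why the check above matters: a code missing from the training set has no integer index.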

3.3 Question 3 — Prepare the labels to use them with ttc

torchTextClassifiers requires labels to be wrapped in a ValueEncoder object. This is straightforward: you just pass your already-fitted LabelEncoder to the ValueEncoder constructor.

Import the ValueEncoder class from the torchTextClassifiers.value_encoder subpackage, then create a value_encoder object by passing your fitted LabelEncoder as the label_encoder argument.

While LabelEncoder plays the role of the “converter” from string labels to integers, working with raw integers at prediction time can be cumbersome — you have to manually map predictions back to their original label names. The ValueEncoder solves this by integrating tightly with the torchTextClassifiers ecosystem: when you pass it to a model, predictions are automatically returned as human-readable labels rather than integers. As a bonus, ValueEncoder also supports additional categorical input features — advanced readers can find examples here.

Click to see the answer
from torchTextClassifiers.value_encoder import ValueEncoder

value_encoder = ValueEncoder(label_encoder=encoder)

4 Applying a tokenizer

4.1 Why tokenization?

Neural networks can only work with numbers: they have no notion of letters or words. As a consequence, a model cannot directly use the string “I sell croissants and pains au chocolat” to predict a NACE code; the text must first be transformed into something the model can use. A tokenizer is the component that bridges this gap: it converts raw text into a sequence of integers that the model can process.

The process has three steps:

  1. Split the text into small units called tokens: these can be whole words, subword fragments, or individual characters.
  2. Map each token to a unique integer using a fixed vocabulary (a lookup table built during training).
  3. Pad or truncate the resulting sequence to a fixed length so that all inputs have the same shape, which is required for batched training.

The diagram below illustrates this pipeline on a short example:

flowchart LR
    A["#quot;I sell croissants#quot;"] -->|"split into tokens"| B["[#quot;i#quot;, #quot;sell#quot;, #quot;croi#quot;, #quot;##ssants#quot;]"]
    B -->|"look up in vocabulary"| C["[45, 112, 87, 203]"]
    C -->|"pad to length 6"| D["[45, 112, 87, 203, 0, 0]"]
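The three steps can be sketched in a few lines of plain Python. This toy version splits on whitespace rather than into subwords, and the vocabulary and its IDs are invented for the example; it only illustrates the split, look-up, pad/truncate sequence.

```python
PAD_ID, UNK_ID = 0, 1

# Hypothetical toy vocabulary: token -> integer ID
vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID, "i": 2, "sell": 3, "croissants": 4}


def encode(text, vocab, max_len):
    """Split on whitespace, map tokens to IDs, then pad or truncate."""
    tokens = text.lower().split()                    # step 1: split
    ids = [vocab.get(tok, UNK_ID) for tok in tokens]  # step 2: look up
    ids = ids[:max_len]                               # step 3a: truncate
    ids += [PAD_ID] * (max_len - len(ids))            # step 3b: pad
    return ids


print(encode("I sell croissants", vocab, max_len=6))
# [2, 3, 4, 0, 0, 0]
print(encode("I sell fresh croissants every single morning", vocab, max_len=6))
# [2, 3, 1, 4, 1, 1]
```

Unknown words all collapse onto the same ID here, which is exactly the weakness that subword tokenization addresses.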

4.2 Why subword tokenization?

There are several ways to split text into tokens: by character, by full word, or by subword fragment. Each has trade-offs:

| Strategy | Example for “croissants” | Trade-off |
|---|---|---|
| Character-level | ["c","r","o","i","s","s","a","n","t","s"] | Sequences become very long; meaning is hard to capture |
| Word-level | ["croissants"] | Rare or misspelled words get mapped to [UNKNOWN] |
| Subword (WordPiece) | ["croi", "##ssants"] | Handles rare words without losing common subparts |

WordPiece, the algorithm used here, builds a vocabulary of frequent subword units from the training corpus. Words it has never seen can still be represented by combining known fragments. The ## prefix marks a fragment that continues the previous token (i.e. it is not the start of a new word).

Because the vocabulary is learned from your data, it naturally adapts to domain-specific vocabulary; in our case, French business descriptions full of technical terms.
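Once a vocabulary exists, WordPiece-style tokenizers typically segment a word greedily, taking the longest known fragment first. The sketch below illustrates that segmentation step only (not how the vocabulary is learned), with a tiny hand-written vocabulary; it is an illustration under these assumptions, not the torchTextClassifiers implementation.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation fragment
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known fragment fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary
toy_vocab = {"croi", "##ssants", "pain", "##s", "sell"}

print(wordpiece_split("croissants", toy_vocab))  # ['croi', '##ssants']
print(wordpiece_split("pains", toy_vocab))       # ['pain', '##s']
print(wordpiece_split("xyz", toy_vocab))         # ['[UNK]']
```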

Tip Exercise 3: Train a WordPiece tokenizer

4.3 Question 1 — Train the tokenizer and inspect a sample

You are going to use a WordPieceTokenizer from the torchTextClassifiers package (see the documentation). Train a WordPieceTokenizer from scratch on the training texts (X_train) with vocab_size=5000 and output_dim=10. Use the fitted tokenizer to tokenize one observation (with the tokenize method), and inspect the result to understand how the text is split.

Clue here Instantiate WordPieceTokenizer(vocab_size=5000, output_dim=10) and call .train(X_train). Use .tokenize() and .convert_ids_to_tokens() to inspect a sample.

Always train the tokenizer on X_train only — never on the validation or test sets. Using the full dataset would mean the tokenizer has seen vocabulary from examples the model is supposed to predict.

Click to see the answer
from torchTextClassifiers.tokenizers import WordPieceTokenizer

tokenizer = WordPieceTokenizer(vocab_size=5000, output_dim=10)
tokenizer.train(X_train)

# Tokenize one example once and reuse the result
encoded = tokenizer.tokenize(X_train[0])
token_ids = encoded.input_ids.squeeze(0)

print("Output tensor size:", encoded.input_ids.shape)
print("Vocabulary size:", tokenizer.vocab_size)

# Look at an example of tokenization
print("Raw text:", X_train[0])
print("Token ids:", token_ids)
print("Tokens:", tokenizer.tokenizer.convert_ids_to_tokens(token_ids))

5 Model architecture

A text classification model in torchTextClassifiers is built from three components (on top of the ValueEncoder and the tokenizer seen above):

  1. TextEmbedder: converts token IDs into dense vectors (embeddings) of size embedding_dim.

The problem: computers can’t work with words directly — they need numbers. A naive solution is to assign each word a unique integer (its token ID). But this is useless for learning: the number 2345 tells the model nothing about what “cat” means, and gives no hint that “cat” and “kitten” should be treated similarly.

The idea: instead of a single integer, represent each token as a dense vector — a list of, say, 768 floating-point numbers. This vector is the word’s embedding: its coordinates in a high-dimensional “semantic space”.

From raw text to a dense vector


Where do these numbers come from? The embedding vectors are learned parameters of the model, initialised randomly and updated during training just like any other weight. The training signal pushes the model to give similar vectors to words that behave similarly in context. After training, words with related meanings end up close together in embedding space.

Words projected into a 2D embedding space — similar concepts cluster together


A memorable analogy: think of each word as a point on a map. Cities in the same country end up near each other; animals cluster together; food words form their own neighbourhood. The “distance” between two points reflects how semantically related the words are. This is why you can do things like:

\[\text{vec}(\textit{king}) - \text{vec}(\textit{man}) + \text{vec}(\textit{woman}) \approx \text{vec}(\textit{queen})\]

The geometry of the space encodes meaning.

In torchTextClassifiers: the TextEmbedder component takes a sequence of token IDs and looks up each one in a learnable embedding matrix (of shape vocab_size × embedding_dim), producing a sequence of dense vectors that the rest of the network can reason about.

  2. CategoricalVariableNet (optional): encodes additional categorical features and merges them with the text representation. Not used here.

  3. ClassificationHead: projects the final representation onto num_classes dimensions. The highest value determines the predicted class.
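The chain from token IDs to a predicted class can be sketched with plain Python lists. This is a toy illustration with random weights; in particular, averaging the token vectors (mean pooling) is an assumption made here for simplicity and may differ from what torchTextClassifiers does internally.

```python
import random

rng = random.Random(0)
vocab_size, embedding_dim, num_classes = 10, 4, 3

# Learnable parameters, randomly initialised as they would be before training
embedding_matrix = [
    [rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in range(vocab_size)
]
head_weights = [
    [rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in range(num_classes)
]


def forward(token_ids, pad_id=0):
    """Embed the tokens, average them, and score each class."""
    # Embedding look-up: one dense vector per non-padding token
    vectors = [embedding_matrix[t] for t in token_ids if t != pad_id]
    # Mean pooling (assumed): one vector for the whole text
    pooled = [sum(column) / len(vectors) for column in zip(*vectors)]
    # Classification head: one score (logit) per class
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in head_weights]
    predicted = max(range(num_classes), key=lambda c: logits[c])
    return logits, predicted


logits, predicted = forward([5, 2, 7, 0, 0])
print(len(logits), predicted)
```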

To learn more about the building blocks of the torchTextClassifiers package, please visit the documentation.

These components are configured through ModelConfig and assembled automatically by torchTextClassifiers. You can see some ModelConfig examples here and some torchTextClassifiers examples here.

6 Training

Training a neural network means finding the weights that minimise a loss function, a measure of how wrong the model’s predictions are. At each step, the model makes predictions on a mini-batch of examples, computes the loss, and updates the weights in the direction that reduces it (using a gradient-based optimization algorithm such as stochastic gradient descent).

The training loop


Key training hyperparameters:

  • num_epochs: how many times the model sees the full training set.
  • batch_size: how many examples are processed at once before a weight update.
  • lr (learning rate): how large each weight update step is.
  • patience_early_stopping: stop training if the validation loss has not improved for this many epochs, to avoid overfitting.
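To see how these four hyperparameters interact, here is a minimal sketch of a training loop on a toy problem: fitting a single weight w so that w * x matches y, with mini-batch gradient descent and early stopping on a validation set. The function train_toy_model and the data are invented for illustration; the real loop is handled by torchTextClassifiers.

```python
import random


def train_toy_model(xs, ys, val_xs, val_ys,
                    num_epochs=50, batch_size=4, lr=0.05, patience=3, seed=0):
    """Fit y = w * x with mini-batch gradient descent and early stopping."""
    rng = random.Random(seed)
    w = 0.0
    best_w, best_val_loss = w, float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    indices = list(range(len(xs)))

    for _ in range(num_epochs):              # one epoch = one pass over the data
        epochs_run += 1
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error with respect to w
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad                   # the learning rate scales each update

        val_loss = sum((w * x - y) ** 2 for x, y in zip(val_xs, val_ys)) / len(val_xs)
        if val_loss < best_val_loss:
            best_w, best_val_loss = w, val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # early stopping
    return best_w, best_val_loss, epochs_run


# Toy data generated from y = 3x
train_x = [0.1 * i for i in range(1, 21)]
train_y = [3 * x for x in train_x]
best_w, best_val_loss, epochs_run = train_toy_model(
    train_x, train_y, val_xs=[0.5, 1.5], val_ys=[1.5, 4.5]
)
print(f"w = {best_w:.4f}, validation loss = {best_val_loss:.2e}, epochs = {epochs_run}")
```

The loop keeps the weight from the epoch with the lowest validation loss, which is the usual purpose of early stopping.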
Tip Exercise 4: Train the model

torchTextClassifiers provides a high-level wrapper that combines the value encoder, tokenizer, model components, and training loop. You will do two things: create a model, and then train the model. It is important to distinguish these two steps clearly. To do so, torchTextClassifiers offers two tools: ModelConfig defines the model and TrainingConfig defines the training process.

6.1 Question 1 — Create the classifier

Using the torchTextClassifiers package, create a ModelConfig. Then create a torchTextClassifiers classifier that uses this ModelConfig and the trained tokenizer. Set the dimension of embeddings to 96. You will need to read the torchTextClassifiers documentation carefully.

Clue here Build a ModelConfig with embedding_dim and num_classes. Pass both to torchTextClassifiers(tokenizer=..., model_config=..., value_encoder=...).

The hyperparameter embedding_dim controls the size of the dense vector representing each token. Larger values capture more information but are slower to train.

Click to see the answer
from torchTextClassifiers import ModelConfig, TrainingConfig, torchTextClassifiers

embedding_dim = 96

model_config = ModelConfig(
    embedding_dim=embedding_dim,
    num_classes=n_classes,
)

ttc = torchTextClassifiers(
    tokenizer=tokenizer,
    model_config=model_config,
    value_encoder=value_encoder,
)

6.2 Question 2 — Prepare training

Create a TrainingConfig(lr=..., batch_size=..., num_epochs=...). Look at the documentation. Do you understand what each argument does? Use only 1 epoch for fast training.

Of course, you can play with hyperparameters if you want: add more epochs, change the learning rate and the batch size… But remember that you are training the model using CPUs, so avoid big models that will take a lot of time and resources to train.

Key hyperparameters used here:

  • lr=5e-4: a standard starting learning rate for Adam-based optimizers.
  • batch_size=128: how many examples are processed before each weight update — larger batches are faster but require more memory.
  • num_epochs=1: one pass over the data, kept short for this demo.
Click to see the answer
training_config = TrainingConfig(
    num_epochs=1,
    batch_size=128,
    lr=5e-4,
    patience_early_stopping=5,
)

6.3 Question 3 — Train on a small subsample

Train the model using the TrainingConfig defined before. Look carefully at the arguments of the train method. You can try to use MLflow or automatic model logging, but this is optional (see below).

Clue here Call ttc.train(X_train=..., y_train=..., X_val=..., y_val=..., training_config=..., verbose=True).

Use mlflow.pytorch.autolog() and wrap your training call in mlflow.start_run() to automatically log metrics (loss, accuracy) to an MLflow experiment:

mlflow.set_experiment("my-experiment")
mlflow.pytorch.autolog()

with mlflow.start_run():
    ttc.train(...)
  • mlflow.set_experiment(...): sets the active experiment (created automatically if it does not exist).
  • Each with mlflow.start_run() block creates a new run inside that experiment, so successive training attempts are kept separate and comparable.
  • If MLFLOW_TRACKING_URI is set in your environment (e.g. via .env), metrics are forwarded to that remote server automatically.
Click to see the answer
mlflow.set_experiment("funathon-2026-project2")
mlflow.pytorch.autolog()

with mlflow.start_run() as run:
    # This should take approximately 1-2 minutes
    ttc.train(
        X_train,
        y_train,
        training_config=training_config,
        X_val=X_val,
        y_val=y_val,
        verbose=True,
    )

    mlflow.log_artifacts(
        training_config.save_path,   # local folder produced by ttc.train()
        artifact_path="model_artifacts",
    )

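To check that the logged artefacts are usable, you can download them from the run and rebuild the classifier. torchTextClassifiers.load() reads model_checkpoint.ckpt, tokenizer.pkl, value_encoder.pkl, and the metadata file that were saved by ttc.save(); these are the same files that ttc.train() produced.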

local_dir = mlflow.artifacts.download_artifacts(
    f"runs:/{run.info.run_id}/model_artifacts"
)

# Rebuild the torchTextClassifiers object from the downloaded files
ttc_loaded = torchTextClassifiers.load(local_dir)

print(ttc_loaded)

7 Prediction and explainability

Once the model is trained, we can use it to predict NACE codes for new text descriptions. Beyond a simple prediction, we also want to understand why a given NACE code was predicted, that is, which words in the input text contributed the most to the decision.

To do so, we use a technique called integrated gradients (via Captum’s LayerIntegratedGradients). It assigns a score to each token in the input: a high score means that the token pushed the model toward the predicted class. You can refer to this tutorial on explainability for more details.

Important: On the interpretation of integrated gradients

Scores based on integrated gradients are a useful guide, not a definitive explanation. Treat them as a way to explore the model’s behavior, not as proof of causality.
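To give an intuition of what integrated gradients computes, the sketch below applies the method by hand to a tiny model, a sigmoid of a weighted sum, whose gradient is known analytically. The weights, input, and all-zero baseline are invented for the example; Captum does the equivalent on the real network with automatic differentiation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, weights):
    """Toy model: sigmoid of a weighted sum of the inputs."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def gradient(x, weights):
    """Analytic gradient of the toy model with respect to each input."""
    s = model(x, weights)
    return [s * (1.0 - s) * w for w in weights]


def integrated_gradients(x, baseline, weights, steps=200):
    """Average the gradient along the straight path baseline -> x (midpoint rule)."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, weights)):
            avg_grad[i] += g / steps
    # Scale the averaged gradient by the distance travelled in each dimension
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


weights = [2.0, -1.0, 0.5]
x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, weights)
output_change = model(x, weights) - model(baseline, weights)
print(attributions)
print(sum(attributions), output_change)  # the two values nearly coincide
```

The attributions add up to the change in the model output between the baseline and the input, and a feature with a negative weight receives a negative score because it pushed the output down.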

Tip Exercise 5: Generate predictions and inspect attributions

7.1 Question 0 — Load the pretrained model from MLflow

We provide a pretrained model trained on the full dataset for many epochs. Its MLflow artefacts are stored in a public S3 bucket; download them and load the model using this code:

import s3fs

fs = s3fs.S3FileSystem(
    anon=True,  # public bucket
    endpoint_url="https://minio.lab.sspcloud.fr",
)

local_dir = "./mlflow-artifacts/"
fs.get(
    "projet-funathon/diffusion/mlflow-artifacts/",
    local_dir,
    recursive=True,
)
# Rebuild the torchTextClassifiers object from the downloaded files
ttc = torchTextClassifiers.load(local_dir)

ttc.pytorch_model.eval()

7.2 Question 1 — Generate top-5 predictions with confidence scores

Sample a few texts from the X_test array into a variable called example_texts. Then call ttc.predict(np.array(example_texts), top_k=5, explain_with_captum=True) and print the main fields of the resulting dictionary. The result is a dictionary with keys prediction, confidence and captum_attributions among others.

Clue here To sample random texts, you can for instance use random.sample(range(len(X_test)), 3) to pick 3 random indices, then use them to index into X_test. Do not forget to import the random module.
Click to see the answer
import random

random_indices = random.sample(range(len(X_test)), 3)
example_texts = X_test[random_indices]
example_true_codes = y_test[random_indices]
print(example_texts)
top_k = 5
results = ttc.predict(example_texts, top_k=top_k, explain_with_captum=True)
for i, text in enumerate(example_texts):
    predicted_codes = [results["prediction"][i][k] for k in range(top_k)]
    confidence = [results["confidence"][i][k].item() for k in range(top_k)]
    print(f"\nText: {text}")
    print(f"  True code: {example_true_codes[i]}")
    for code, conf in zip(predicted_codes, confidence):
        print(f"  {code}  (confidence: {conf:.3f})")
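For intuition about top-k predictions, the sketch below turns raw class scores into probabilities with a softmax and keeps the k most likely classes. It assumes, as is common, that confidence scores are softmax probabilities; the source does not state how torchTextClassifiers computes them, and the labels and scores are invented.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, labels, k):
    """Return the k (label, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for four illustrative codes
labels = ["10.71", "47.11", "56.10", "01.11"]
logits = [2.5, 0.3, 1.1, -0.7]

for label, prob in top_k(logits, labels, k=2):
    print(f"{label}  (confidence: {prob:.3f})")
```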

7.3 Question 2 — Visualise word attributions for the top prediction

Now, we want to retrieve the most important words driving the NACE prediction. During inference, the model computes an attribution score for each token — stored in the captum_attributions field of the results — indicating how much each token contributed to the final decision. To do so:

  • extract captum_attributions from results;
  • use map_attributions_to_word() and map_attributions_to_char() to aggregate token-level attribution scores up to the word and character levels;
  • visualize the resulting attributions with plot_attributions_at_word() and plot_attributions_at_char() from torchTextClassifiers.utilities.plot_explainability.

The relevant fields of results are:

  • captum_attributions: tensor of shape (n_text, top_k, seq_len) holding attribution scores per token (how much a token has driven a given prediction).
  • offset_mapping: character-level start/end positions of each token in the original string, used to map token-level scores back to characters.
  • word_ids: maps each token to its parent word index, used to aggregate token scores at the word level.
Click to see the answer
from torchTextClassifiers.utilities.plot_explainability import (
    map_attributions_to_char, map_attributions_to_word,
    plot_attributions_at_char, plot_attributions_at_word, figshow,
)

text_idx = 0
top_k_idx = 0
text_sample         = example_texts[text_idx]
offsets             = results["offset_mapping"][text_idx]
word_ids            = results["word_ids"][text_idx]
predicted_code = results["prediction"][text_idx][top_k_idx]

attributions  = results["captum_attributions"][text_idx][top_k_idx] # (seq_len,)

words, word_attributions = map_attributions_to_word(
    attributions.unsqueeze(0), text_sample, word_ids, offsets
)
char_attributions = map_attributions_to_char(attributions.unsqueeze(0), offsets, text_sample)

titles = [f"Attributions for NACE code {predicted_code}"]

figshow(plot_attributions_at_char(
    text=text_sample, attributions_per_char=char_attributions, titles=titles,
)[0])

figshow(plot_attributions_at_word(
    text=text_sample, words=words.values(), attributions_per_word=word_attributions, titles=titles,
)[0])
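The role of word_ids can be illustrated with a small plain-Python sketch that sums token scores per parent word. Summing is an assumption made for this example (the library's map_attributions_to_word() may aggregate differently), and the tokens and scores are invented.

```python
def aggregate_by_word(token_scores, word_ids):
    """Sum token-level scores for tokens that belong to the same word."""
    totals = {}
    for score, word_id in zip(token_scores, word_ids):
        if word_id is None:  # special or padding tokens have no parent word
            continue
        totals[word_id] = totals.get(word_id, 0.0) + score
    return [totals[word_id] for word_id in sorted(totals)]


# Hypothetical tokenization of "i sell croissants"
tokens = ["[CLS]", "i", "sell", "croi", "##ssants", "[SEP]"]
word_ids = [None, 0, 1, 2, 2, None]
token_scores = [0.0, 0.125, 0.25, 0.5, 0.25, 0.0]

print(aggregate_by_word(token_scores, word_ids))  # [0.125, 0.25, 0.75]
```

The two fragments of “croissants” share word index 2, so their scores are merged into a single word-level score.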

7.4 Question 3 — Evaluate accuracy on the test set

Run ttc.predict(X_test, top_k=1) and compare the predictions against y_test to compute accuracy.

Click to see the answer
results_test = ttc.predict(X_test, top_k=1)
preds = results_test["prediction"].squeeze(1)
# Count exact matches directly to avoid float rounding in the displayed total
n_correct = int((preds == y_test).sum())
accuracy = n_correct / len(y_test)
print(f"Test accuracy: {accuracy:.4f} ({n_correct}/{len(y_test)} correct)")
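Since the model can return several candidates per text, it can also be informative to measure how often the true code appears among the top k predictions. The sketch below is a plain-Python illustration with invented codes; top_k_accuracy is a hypothetical helper, not a torchTextClassifiers function.

```python
def top_k_accuracy(ranked_predictions, true_codes, k):
    """Share of examples whose true code is among the first k predictions."""
    hits = sum(
        true_code in predictions[:k]
        for predictions, true_code in zip(ranked_predictions, true_codes)
    )
    return hits / len(true_codes)


# Hypothetical ranked predictions (best first) for four texts
ranked_predictions = [
    ["10.71", "47.11", "56.10"],
    ["47.11", "10.71", "56.10"],
    ["56.10", "47.11", "10.71"],
    ["10.71", "56.10", "47.11"],
]
true_codes = ["10.71", "10.71", "10.71", "47.11"]

print(top_k_accuracy(ranked_predictions, true_codes, k=1))  # 0.25
print(top_k_accuracy(ranked_predictions, true_codes, k=3))  # 1.0
```

With k=1 this reduces to the ordinary accuracy computed above.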

7.5 Conclusion

Not bad at all! The model reaches strong accuracy on the held-out test set: a genuinely production-ready [1] classifier for NACE 2.1 codes.

[1] Really? For production purposes, accuracy is not enough: calibration and robustness metrics matter too. Check out this presentation to read more about the metrics to check before trusting a model in production.

In this notebook we learnt how to:

  • Load and explore a labelled dataset of business activity descriptions mapped to NACE codes.
  • Preprocess the data: split into train, validation, and test sets, and encode the string labels to integers with LabelEncoder.
  • Use the torchTextClassifiers package to handle the full workflow: encoding textual labels, training a tokenizer, building a classifier, and explaining its predictions.
  • Train a WordPiece tokenizer from scratch on the training corpus, adapting the subword vocabulary to French domain-specific text.
  • Configure and train a neural text classifier end-to-end using torchTextClassifiers, with MLflow tracking metrics in real time.
  • Load a pretrained model from MLflow artefact storage, so the full-dataset model is available without waiting hours for training.
  • Interpret predictions using Captum’s Integrated Gradients to visualise which words drove each classification decision.
  • Evaluate accuracy on the unseen test set.

Ready to go further? Head over to the next notebook — Introduction to RAG — where we tackle the same classification problem from a completely different angle: retrieval-augmented generation.