Monitoring
Observability
Statistics Austria: Monitoring Reports
After each retraining phase, two quality reports are generated to monitor model performance and ensure consistency over time.
The first report focuses on comparing two model versions, typically the most recent model and its predecessor. Since true labels are not required for this comparison, the analysis is based on the models’ predictions. We compute and visualize the overall agreement rate between the models, as well as the distribution of predicted 1-digit codes. In addition, we evaluate a selected set of input texts that are known to be challenging in manual coding. For these cases, we analyze how both model versions classify them and compare the predictions to human annotations. This provides a qualitative assessment of behavioral changes between model versions.
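The label-free comparison described above can be sketched as follows. This is a minimal illustration, not the production report: the prediction series, the 4-digit code values, and the helper names are hypothetical, and only the two core quantities (agreement rate and 1-digit code distribution) are shown.

```python
import pandas as pd

def agreement_rate(preds_a: pd.Series, preds_b: pd.Series) -> float:
    """Share of observations where both model versions predict the same code."""
    return float((preds_a.values == preds_b.values).mean())

def one_digit_distribution(preds: pd.Series) -> pd.Series:
    """Distribution of predicted 1-digit codes (first character of each code)."""
    return preds.str[0].value_counts(normalize=True).sort_index()

# Hypothetical predictions from the previous and the current model version
# on the same unlabeled input texts.
prev_model = pd.Series(["6201", "4711", "8610", "6202"])
curr_model = pd.Series(["6201", "4719", "8610", "6202"])

rate = agreement_rate(prev_model, curr_model)  # 3 of 4 predictions agree -> 0.75
dist = one_digit_distribution(curr_model)      # shares of leading digits 4, 6, 8
```

In practice these quantities would be computed over the full quarterly prediction set and visualized (e.g., as bar charts per 1-digit code) in the report.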
The second report is based on the labeled validation data and therefore allows for a more formal performance evaluation. Standard metrics such as accuracy, top-k accuracy, and F1 score are calculated and visualized. This enables a quantitative assessment of model performance and facilitates comparisons across retraining cycles.
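The label-based metrics can be computed with scikit-learn; the sketch below assumes a small hypothetical validation set with integer class labels and per-class probability scores, which is not the actual data format used by Statistics Austria.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, top_k_accuracy_score

# Hypothetical labeled validation data: 5 observations, 3 classes,
# with one probability score per class from the retrained model.
y_true = np.array([0, 1, 2, 2, 1])
y_score = np.array([
    [0.70, 0.20, 0.10],
    [0.15, 0.80, 0.05],
    [0.30, 0.45, 0.25],
    [0.20, 0.15, 0.65],
    [0.50, 0.40, 0.10],
])
y_pred = y_score.argmax(axis=1)  # hard predictions from the scores

acc = accuracy_score(y_true, y_pred)                 # overall accuracy
top2 = top_k_accuracy_score(y_true, y_score, k=2)    # true label within top 2
macro_f1 = f1_score(y_true, y_pred, average="macro") # unweighted mean per-class F1
```

Tracking these numbers per retraining cycle gives the time series needed to compare model versions quantitatively.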
Retraining
Statistics Austria: Reproducible retraining pipelines
The models are retrained on a quarterly basis as new training data becomes available. During each retraining cycle, the newly collected data is first fetched and combined with the existing historical training dataset to ensure that the model benefits from both recent and previously observed examples.
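The merge of newly collected and historical data might look like the following sketch. The column names, the deduplication on observation ID, and the toy records are assumptions for illustration, not the agency's actual schema.

```python
import pandas as pd

# Hypothetical historical training data and a newly fetched quarterly batch.
historical = pd.DataFrame(
    {"obs_id": [1, 2, 3], "text": ["a", "b", "c"], "code": ["6201", "4711", "8610"]}
)
new_batch = pd.DataFrame(
    {"obs_id": [3, 4], "text": ["c", "d"], "code": ["8610", "6202"]}
)

# Combine both sources; if an observation was re-coded, keep the newest record.
combined = (
    pd.concat([historical, new_batch], ignore_index=True)
    .drop_duplicates(subset="obs_id", keep="last")
    .reset_index(drop=True)
)
```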
Based on the updated dataset, new training, test, and validation splits are created using predefined random seeds to ensure reproducibility. The observation IDs belonging to each split are stored so that the exact data partition used for a given model version can always be reconstructed and traced.
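A seeded split with persisted observation IDs can be sketched like this. The seed value, split fractions, and output filename are hypothetical; the point is that the same seed always reproduces the same partition, and the stored ID lists make it traceable.

```python
import json
import numpy as np

def make_splits(obs_ids, seed=20250101, frac_train=0.8, frac_test=0.1):
    """Partition observation IDs into train/test/validation with a fixed seed."""
    rng = np.random.default_rng(seed)  # predefined seed -> reproducible shuffle
    ids = np.asarray(list(obs_ids))
    perm = rng.permutation(len(ids))
    n_train = int(len(ids) * frac_train)
    n_test = int(len(ids) * frac_test)
    return {
        "train": ids[perm[:n_train]].tolist(),
        "test": ids[perm[n_train:n_train + n_test]].tolist(),
        "validation": ids[perm[n_train + n_test:]].tolist(),
    }

splits = make_splits(range(1000))

# Persist the ID lists so the exact partition behind a model version
# can always be reconstructed later.
with open("splits_model_vX.json", "w") as f:
    json.dump(splits, f)
```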
The retraining itself follows a standardized and reproducible pipeline, including the reuse of the established hyperparameter configuration and preprocessing steps.
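One way to enforce reuse of the established configuration is to store it once and have the retraining entry point default to it. This is a stub, not the actual pipeline: the hyperparameter names and values and the `retrain` function are invented for illustration.

```python
# Hypothetical stored hyperparameter configuration, reloaded verbatim
# for every quarterly retraining cycle.
ESTABLISHED_CONFIG = {
    "learning_rate": 3e-5,
    "epochs": 4,
    "batch_size": 32,
    "max_seq_length": 128,
}

def retrain(dataset, config=None):
    """Sketch of a retraining entry point that defaults to the stored config."""
    config = config if config is not None else ESTABLISHED_CONFIG
    # ... standardized preprocessing and model fitting would go here ...
    return {"config": config, "n_examples": len(dataset)}

run = retrain(["example text"] * 10)
```

Keeping the configuration in one versioned place (rather than hard-coded per run) is what makes successive retrainings comparable.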
After retraining, the two quality reports described in the Observability section are generated. Together, they provide both label-free and label-based evaluations, ensuring that changes in model behavior are detected even when ground-truth data is limited or delayed.