Model
Model training
Choosing the model, designing the right deep learning architecture, and applying Machine Learning best practices (preprocessing, cross-validation, dealing with class imbalance, hyperparameter tuning…) are out of the MLOps scope per se, so we will not dive into them here and instead refer to further references for details. From an MLOps perspective, what actually matters is being able to replicate the training in a fully reproducible way, at any time, on any machine with sufficient computational power (e.g. a GPU).
Once again, the idea is that if you lose everything - machine, storage… - you should be able to reproduce your results without any additional effort.
This can be achieved with different tools, but minimally requires:
- Fixed Randomness: the random seed should be fixed for any process that involves randomness, including train/val/test splitting, model training…
- Modular Code: the training pipeline should be fully reproducible, as abstract as possible, with minimal hard-coded values, and runnable with any set of hyperparameters (and ideally for different models). In particular, avoid notebooks at all costs.
- Experiment Tracking: there should be a logging tool (Tensorboard, MLFlow, Weights & Biases…) to track experiments, metrics, and hyperparameters (see below), including training code and data versions.
- Environment Definition: define the exact execution environment (e.g., conda.yaml, requirements.txt, or a Docker image) to prevent “it works on my machine” failures.
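As a minimal illustration of the first point, fixing randomness in a Python training script might look like the following sketch (the PyTorch calls are shown only as comments, for readers using that framework; everything else is the standard library and NumPy):

```python
import os
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Fix every source of randomness we control."""
    random.seed(seed)                         # Python's built-in RNG
    np.random.seed(seed)                      # NumPy RNG (splits, shuffles...)
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash-based operations
    # For PyTorch, one would additionally call:
    # torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)


set_seed(123)
first = np.random.rand(3)
set_seed(123)
second = np.random.rand(3)
assert np.allclose(first, second)  # same seed, same "random" numbers
```

Calling `set_seed` once at the top of the training script (with the seed itself stored in the configuration) is usually enough to make splits and training runs repeatable on the same hardware.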
Insee: Hydra, uv, torchTextClassifiers, Argo Workflows & MLFlow
- Insee uses the library Hydra for seamless configuration management, enabling a single, very abstract and unified train.py script for every model architecture and hyperparameter set,
- uv for dependency management and environment definition,
- torchTextClassifiers, an internally developed library that serves as a toolkit for text classification using deep learning and provides a unified source of truth for the model design across training, inference… It relies heavily on PyTorch and Lightning, with minimal other dependencies,
- Argo Workflows, a Kubernetes-native tool used to launch several parallel trainings in a fully reproducible way (on our Onyxia cluster),
- and finally MLFlow, used as a logger and to monitor training.
The training pipeline is available here.
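As an illustration of this setup, a Hydra-style configuration file for such a train.py might look like the following (all names and values are hypothetical, not Insee's actual configuration):

```yaml
# conf/config.yaml -- hypothetical example of a Hydra configuration
model:
  name: torchTextClassifier
  embedding_dim: 96
training:
  seed: 42
  max_epochs: 50
  learning_rate: 1e-3
mlflow:
  experiment_name: nace-coding
```

The train.py script then receives this whole configuration as a single object (typically via Hydra's `@hydra.main` decorator), so the same code can run for any model architecture and hyperparameter set, with overrides passed on the command line.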
Austria:
Due to the small volume of training data, parallelized training is not required. Instead, we use Keras and TensorFlow in R via reticulate to train our models. Training logs are stored for subsequent analysis.
Hyperparameter tuning is not performed every time the model is retrained; instead, we reuse the hyperparameter configuration that was identified as optimal during the initial model training.
Practical experience: Spark-based text classification at Destatis
Model training at scale with Spark: general lessons learned
When text classification datasets grow beyond the limits of single-machine parallelization in Python or R, distributed frameworks such as Apache Spark become a natural consideration. Spark enables horizontal scaling across clusters and is particularly well suited for large-scale data preparation, feature extraction, and the training of classical machine-learning models on datasets that no longer fit into memory. In an MLOps context, Spark integrates well with cloud storage, columnar formats such as Parquet, and fault-tolerant execution, making it attractive for industrial and institutional environments.
However, experience shows that Spark should be used selectively for model training. While it excels at scaling data processing and simple models, its native machine-learning library (MLlib) offers a limited algorithm portfolio and constrained options for hyperparameter tuning and model optimization. As a result, Spark-based ML pipelines are most effective when applied to well-understood, classical models and when training speed, robustness, and operational scalability are prioritized over modeling flexibility. For modern NLP approaches—particularly those based on transformer architectures—Spark is typically better positioned as a data engineering backbone rather than as the primary training framework.
These general observations are confirmed by practical experience at Destatis, where Spark was evaluated for large-scale text-to-code classification in the context of COICOP expenditure coding. Several experiments were conducted on both a traditional R-based infrastructure and a Cloudera Spark environment, focusing on performance, scalability, and operational feasibility.
The results show that classical models can achieve solid predictive performance across platforms. A Random Forest model trained on an R server delivered strong results (Macro F1 ≈ 0.80, Accuracy ≈ 0.90) but required very high memory consumption (~200 GB) and its hyperparameters could only be fine-tuned on data subsamples, limiting its suitability for regular re-trainings. Logistic regression experiments on the R server using scikit-learn failed to converge or required runtimes that were considered too risky for production use.
In contrast, logistic regression implemented with Spark MLlib achieved competitive performance (Macro F1 ≈ 0.78, Accuracy ≈ 0.88) while dramatically reducing training time (3–5 minutes) and keeping memory usage at a manageable level (~40 GB). Even with grid search over multiple parameter combinations, total runtimes remained operationally acceptable. More complex models, such as Random Forests in MLlib, proved unsuitable for this high-cardinality multi-class problem, leading to memory errors and limited model quality.
Overall, this experience illustrates a key MLOps takeaway: Spark can be highly effective for scalable, robust training of classical text classification models, especially when retraining frequency, runtime stability, and infrastructure constraints matter. At the same time, its limitations in model diversity and optimization flexibility mean that Spark-based ML should be carefully scoped, especially in text classification contexts.
Model validation
MLOps implies a notion of operations: your model will be served and deployed for a specific - potentially critical - use case, and used by production teams.
Therefore, you need to ensure your model is trustworthy and will not break the production pipeline via a meaningful set of metrics on top of mere accuracy. The choice of these metrics is heavily dependent on the specific use-case and the design of the production pipeline.
For instance, automatic coding use cases often imply a fallback to human annotation in case of low confidence from the model; confidence scores must then represent real probabilities (e.g., a score of 0.9 should mean 90% accuracy): this is called calibration (see REF for further reading). Hardware-specific metrics like latency (how fast is the inference?) and memory footprint are also common.
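A common way to quantify calibration is the expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence to its empirical accuracy. The sketch below is a minimal NumPy implementation for illustration, not a production metric:

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |confidence - accuracy| over confidence bins, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            bin_conf = confidences[mask].mean()  # average predicted confidence
            bin_acc = correct[mask].mean()       # empirical accuracy in the bin
            ece += mask.mean() * abs(bin_conf - bin_acc)
    return ece


# Toy example: the 0.5-confidence bin is perfectly calibrated (right half the
# time), while the 0.9-confidence bin is slightly under-confident (always right),
# so ECE = 0.5 * |0.9 - 1.0| = 0.05
ece = expected_calibration_error([0.5, 0.5, 0.9, 0.9], [1, 0, 1, 1])
```

An ECE close to zero means the confidence threshold used for the human fallback can be trusted as an actual probability.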
The Gold Standard in MLOps: An automated validation script that runs immediately after training, compares results against a baseline (the current “Champion” model), and logs the report alongside the model artifact.
Insee: Calibration is key
For the NACE coding, if the model confidence is below a threshold, the text is manually annotated (human-in-the-loop). To be developed
Austria:
Before the model is deployed, several validation steps are performed. These include the calculation of evaluation metrics and a comparison with the previous model version. Evaluation metrics are automatically generated each time the model is retrained. Metrics such as accuracy, F1 score, and top-k accuracy are then compared against those of earlier model versions.
Practical experience: evaluation strategy at Destatis
Model evaluation strategy in our specific Text-To-Code context
For countries applying machine learning to large-scale text classification, especially in text-to-code scenarios, model evaluation must be adapted to the specific characteristics of statistical classification systems: many classes, strong class imbalance, and high semantic proximity between categories. In such contexts, overall accuracy alone is rarely sufficient and may even be misleading. Instead, evaluation strategies should emphasize precision, recall, and F1-score, which provide a more meaningful view of model behavior under imbalance.
For multi-class problems, the aggregation strategy of metrics is a critical design choice. Macro, micro, and weighted variants of F1-score answer different questions and should be selected based on how classification errors are valued. If errors on small or rare classes are substantively important, unweighted macro metrics are preferable, as they prevent dominant classes from masking poor performance elsewhere. Conversely, weighted metrics may be appropriate when the operational focus is primarily on high-frequency categories. In all cases, relying solely on aggregated metrics is insufficient; class-level evaluation is essential to identify systematic biases and uneven model performance.
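The difference between aggregation strategies can be made concrete on a toy imbalanced example. The pure-Python sketch below is only illustrative; in practice one would typically use scikit-learn's `f1_score` with its `average` parameter:

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """One-vs-rest F1 for each class."""
    scores = {}
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        scores[cls] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return scores


def aggregate_f1(y_true, y_pred, average="macro"):
    scores = per_class_f1(y_true, y_pred)
    if average == "macro":  # every class counts equally
        return sum(scores.values()) / len(scores)
    counts = Counter(y_true)  # "weighted": by class frequency
    total = sum(counts.values())
    return sum(scores[c] * counts[c] / total for c in scores)


# 8 frequent "A" examples all correct, 2 rare "B" examples all wrong:
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10
# macro F1 punishes the failure on the rare class much more than weighted F1
```

On this example the weighted F1 stays above 0.7 while the macro F1 drops below 0.5, which is exactly why macro metrics are preferred when rare classes matter.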
Dataset stratification is an important design choice rather than a universal requirement. Stratified train–test splits by class are particularly valuable when poor model performance on rare categories is a critical concern, as they ensure that small but substantively relevant classes are adequately represented during evaluation. However, if the primary objective is to optimize performance on high-frequency classes and errors on rare classes are considered less critical, strict stratification may be less essential. The choice should therefore be guided by how classification errors are weighted and interpreted in the specific application context.
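A stratified split simply performs the train-test split within each class, so rare classes keep their share in both sets. The following is a minimal sketch; scikit-learn's `train_test_split(..., stratify=y)` does this out of the box:

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)  # seeded for reproducibility
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))  # keep >= 1 rare example
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = ["frequent"] * 95 + ["rare"] * 5
train_idx, test_idx = stratified_split(labels)
# the rare class is guaranteed to appear in the test set
```

Without stratification, a plain random 20% split has a non-negligible chance of leaving a 5-example class entirely out of the test set.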
Finally, evaluation datasets must be representative of future data, not just historical samples. Regular evaluation on newly collected data, combined with expert review, is necessary to detect model drift and to ensure sustainable model quality over time.
These principles were applied in practice at Destatis in the context of large-scale COICOP text-to-code classification. Given the highly imbalanced class distribution and the presence of several hundred target classes, the evaluation strategy deliberately moved beyond accuracy as a primary metric. While accuracy was still reported, the main focus was placed on precision, recall, and F1-score, which better capture performance differences across classes of varying frequency.
In particular, unweighted macro F1-score was chosen as a key benchmark metric. This choice reflects the statistical objective of treating all COICOP categories as equally important, regardless of their frequency in the data, and of avoiding systematic neglect of rare but substantively relevant classes. Weighted metrics were considered but deemed less suitable, as they tend to understate errors on small classes and can mask biased prediction patterns—for example, consistently favoring dominant categories over closely related but less frequent ones.
To support robust evaluation, all train–test splits were stratified by COICOP class, ensuring that both frequent and rare categories were adequately represented in the test data. In addition to aggregated metrics, class-level performance indicators were systematically analyzed to identify weak classes and guide targeted improvements, such as enriching training data for problematic categories.
Finally, evaluation at Destatis extended beyond static test sets. To measure the robustness of the different models and to ensure relevance for future production use, model outputs were compared with classifications produced by domain experts, focusing on records where the model scores used in production fell below a fixed threshold. This continuous expert-based validation of difficult records ensures that the model remains aligned with evolving data and classification practices, and naturally forms the basis for a human-in-the-loop framework, which is discussed in the following section.
Model wrapping
Any machine learning model expects nicely preprocessed (or tokenized) tensors as input, and outputs logits (or confidence scores).
Yet in production, we have to take a step back: the user directly inputs raw text into the pipeline, and expects a label back.
The wrapper is the object that bridges the gap between machine learning and ops, and is the object that will be actually deployed. It encapsulates the trained model and handles:
- Preprocessing: All cleaning, normalization, and tokenization steps.
- Inference: Passing the data through the model.
- Post-processing: Converting raw logits into human-readable predictions or nomenclature codes.
Standardizing this into a single .predict() method allows the production team to swap models without changing a single line of their application code.
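Schematically, such a wrapper might look like this (class and method names, and the dummy model, are illustrative; a real implementation would plug in the actual tokenizer and trained model):

```python
class TextClassifierWrapper:
    """Bridges raw user text and the trained model: the object that gets deployed."""

    def __init__(self, model, labels):
        self.model = model    # trained model: tokens in, scores out
        self.labels = labels  # index -> human-readable nomenclature code

    def _preprocess(self, text: str) -> list[str]:
        # All cleaning/normalization/tokenization lives here, not in the app code.
        return text.lower().strip().split()

    def _postprocess(self, scores: list[float]) -> str:
        # Convert raw scores into the label the production team expects.
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.labels[best]

    def predict(self, text: str) -> str:
        """The single entry point the production pipeline calls."""
        tokens = self._preprocess(text)
        scores = self.model(tokens)
        return self._postprocess(scores)


# Dummy "model" for illustration: scores class 1 higher if "bakery" appears
wrapper = TextClassifierWrapper(
    model=lambda tokens: [0.1, 0.9] if "bakery" in tokens else [0.9, 0.1],
    labels={0: "47.11", 1: "10.71"},
)
```

Swapping the underlying model then only means constructing a new wrapper; the production code keeps calling `wrapper.predict(text)` unchanged.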
Many libraries, for instance MLFlow or Lightning, provide native support for these wrapper objects.
Insee: MLFlow with PyTorch flavor
The library torchTextClassifiers already offers a Lightning wrapper, which we convert into an MLFlow pyfunc wrapper for easy serving. Include code.
Austria:
We implemented a single function that handles all pre- and post-processing steps required for the model. This function takes raw input data and transforms it into a format suitable for model inference. During pre-processing, text inputs are tokenized, and categorical variables—mostly in textual form—are translated into the integer values used internally by the model.
The model outputs a vector of probabilities, with one probability for each possible code per input. During post-processing, these probabilities are sorted to generate the top-k most likely codes.
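Top-k extraction from such a probability vector can be done with a single argsort. The snippet below is illustrative; the variable names and codes are hypothetical:

```python
import numpy as np


def top_k_codes(probabilities, codes, k=3):
    """Return the k most likely codes with their probabilities, best first."""
    probabilities = np.asarray(probabilities)
    order = np.argsort(probabilities)[::-1][:k]  # indices of the k highest scores
    return [(codes[i], float(probabilities[i])) for i in order]


probs = [0.05, 0.60, 0.10, 0.25]
codes = ["01.1", "01.2", "02.1", "03.1"]
top_k_codes(probs, codes, k=2)  # [("01.2", 0.6), ("03.1", 0.25)]
```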
Practical experience: model wrapping at Destatis
- The model functionality of CML (Cloudera Machine Learning) is used to deploy a model.
- Automated pipelines of analytics workloads can be deployed in different programming languages (R, Python, etc.).
- In addition, models can be trained, evaluated, and deployed as REST APIs to serve predictions.
- Deployment is easy through a user-defined function (e.g. predict()) that generates the predictions in JSON format.
- EXAMPLE CODE VANILLA FUNCTION
- Simplified deployment limits flexibility, e.g.:
  - custom deployment scripts,
  - advanced model diagnostics and performance-tracking options (as in MLflow).
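In its simplest form, the user-defined function mentioned above might look like the following (a hypothetical sketch with placeholder scoring logic, not Destatis' actual deployment code):

```python
import json


def predict(args: dict) -> str:
    """Entry point called by the serving platform; returns predictions as JSON."""
    text = args.get("text", "")
    # Placeholder scoring logic; a real deployment would call the wrapped model here.
    code, score = ("01.1.1", 0.97) if "bread" in text.lower() else ("99.9.9", 0.10)
    return json.dumps({"input": text, "code": code, "score": score})


predict({"text": "fresh bread"})
```

The serving platform passes the request payload as a dictionary and returns the JSON string to the caller of the REST API.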
Model storage & versioning
Exactly as for data, and for the same reasons, models should be stored in stable storage, once again preferably cloud-based, called the Model Registry. The logging tool you (should) have used for training usually also handles model storage.
A standard practice is also to keep track of which model has been deployed over time (versioning) via a promotion system: the registry acts as a “source of truth” for which model is currently in production.
The deployed model is treated exactly as a software application: it is versioned, tagged, updated/deprecated over time.
The rationale is that you should be able to load any experiment that you have tried, at any time. This system ensures that the production pipeline is model-agnostic; the application simply fetches the model tagged as Production, regardless of whether it is an SVM or a Transformer.
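The promotion mechanism can be pictured with the following deliberately simplified in-memory sketch; in practice a registry such as MLflow's stores artifacts and tags remotely:

```python
class ModelRegistry:
    """Toy model registry: versioned models plus a 'Production' pointer."""

    def __init__(self):
        self._versions = {}      # version number -> model artifact
        self._production = None  # version currently tagged for production

    def register(self, model) -> int:
        version = len(self._versions) + 1
        self._versions[version] = model
        return version

    def promote(self, version: int) -> None:
        """Tag a version as Production; the previous champion is implicitly demoted."""
        self._production = version

    def load_production(self):
        # The application only ever asks for "the Production model",
        # regardless of whether it is an SVM or a Transformer underneath.
        return self._versions[self._production]


registry = ModelRegistry()
v1 = registry.register("svm-v1")
v2 = registry.register("transformer-v2")
registry.promote(v2)
registry.load_production()  # -> "transformer-v2"
```

Rolling back to a previous model is then just another call to `promote`, which is exactly what makes the production pipeline model-agnostic.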
Insee: MLFlow
Extensive use of MLFlow for versioning. Include screenshots…
Austria:
Model versions are stored locally. Once deployed to the rsconnect server, we can switch between model versions, enabling the reproduction of past experiments.
Practical experience: model storage & versioning at Destatis
- CML offers a model registry, so models can be tracked and a rollback to previous model versions is possible.
- As with model wrapping, model storage is simplified.