Model evaluation strategy in our specific text-to-code context
For countries applying machine learning to large-scale text classification, especially in text-to-code scenarios, model evaluation must be adapted to the specific characteristics of statistical classification systems: many classes, strong class imbalance, and high semantic proximity between categories. In such contexts, overall accuracy alone is rarely sufficient and may even be misleading. Instead, evaluation strategies should emphasize precision, recall, and F1-score, which provide a more meaningful view of model behavior under imbalance.
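To illustrate why accuracy can mislead under strong class imbalance, consider the following minimal sketch with hypothetical labels: a degenerate model that always predicts the dominant class scores high on accuracy while failing entirely on the rare class, which macro-averaged F1 exposes.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced sample: 95 records of a dominant class, 5 of a rare one.
y_true = [0] * 95 + [1] * 5
# A degenerate model that only ever predicts the dominant class.
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))             # 0.95 — looks good
print(f1_score(y_true, y_pred, average="macro"))  # ≈ 0.487 — reveals the failure
```

The macro F1 averages the per-class F1 scores, so the zero score on the rare class pulls the aggregate down even though it accounts for only 5% of the records.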
For multi-class problems, the aggregation strategy of metrics is a critical design choice. Macro, micro, and weighted variants of F1-score answer different questions and should be selected based on how classification errors are valued. If errors on small or rare classes are substantively important, unweighted macro metrics are preferable, as they prevent dominant classes from masking poor performance elsewhere. Conversely, weighted metrics may be appropriate when the operational focus is primarily on high-frequency categories. In all cases, relying solely on aggregated metrics is insufficient; class-level evaluation is essential to identify systematic biases and uneven model performance.
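The difference between the aggregation variants can be sketched on a small hypothetical multi-class example (the labels below are illustrative, not Destatis data); note how the weighted average sits above the macro average because the dominant class masks weaker performance on the rare ones.

```python
from sklearn.metrics import f1_score

# Hypothetical 3-class sample with one dominant class (0) and two rare ones.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 2]

print(f1_score(y_true, y_pred, average="micro"))     # 0.8  — global error rate
print(f1_score(y_true, y_pred, average="macro"))     # ≈ 0.73 — every class counts equally
print(f1_score(y_true, y_pred, average="weighted"))  # ≈ 0.78 — dominated by class 0
```

Choosing between these variants is exactly the design decision described above: the macro score penalizes the missed rare-class records far more visibly than the weighted score does.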
How the dataset is stratified for evaluation depends on the specific use case. Stratified train–test splits by class are particularly valuable when poor model performance on rare categories is a critical concern, as they ensure that small but substantively relevant classes are adequately represented during evaluation. However, if the primary objective is to optimize performance on high-frequency classes and errors on rare classes are considered less critical, strict stratification may be less essential. The choice should therefore be guided by how classification errors are weighted and interpreted in the specific application context.
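A stratified split can be sketched as follows (the record texts and class labels are hypothetical placeholders); stratification preserves the class proportions in the test set, so the rare class is guaranteed to appear there.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced label set: 90 records in class "A", 10 in class "B".
texts = [f"record {i}" for i in range(100)]
labels = ["A"] * 90 + ["B"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)
print(Counter(y_test))  # Counter({'A': 18, 'B': 2}) — rare class retained in test
```

Without `stratify`, a random 20% split could by chance contain no "B" records at all, making the rare class impossible to evaluate.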
Finally, evaluation datasets must be representative of future data, not just historical samples. Regular evaluation on newly collected data, combined with expert review, is necessary to detect model drift and to ensure sustainable model quality over time.
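Monitoring for drift can be sketched as re-computing the benchmark metric on newly collected, expert-labelled batches per evaluation period; the period names and label arrays below are illustrative assumptions, not real results.

```python
from sklearn.metrics import f1_score

# Hypothetical expert-labelled evaluation batches, one per period.
# A sustained drop in macro F1 on fresh data signals possible model drift.
periods = {
    "period-1": ([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]),
    "period-2": ([0, 0, 1, 1, 2, 2], [0, 0, 1, 0, 2, 0]),
}

for period, (y_true, y_pred) in periods.items():
    score = f1_score(y_true, y_pred, average="macro")
    print(period, round(score, 3))
```

Tracking the same metric on the same kind of stratified evaluation data over time keeps the comparison across periods meaningful.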
These principles were applied in practice at Destatis in the context of large-scale COICOP text-to-code classification. Given the highly imbalanced class distribution and the presence of several hundred target classes, the evaluation strategy deliberately moved beyond accuracy as a primary metric. While accuracy was still reported, the main focus was placed on precision, recall, and F1-score, which better capture performance differences across classes of varying frequency.
In particular, unweighted macro F1-score was chosen as a key benchmark metric. This choice reflects the statistical objective of treating all COICOP categories as equally important, regardless of their frequency in the data, and of avoiding systematic neglect of rare but substantively relevant classes. Weighted metrics were considered but deemed less suitable, as they tend to understate errors on small classes and can mask biased prediction patterns—for example, consistently favoring dominant categories over closely related but less frequent ones.
To support robust evaluation, all train–test splits were stratified by COICOP class, ensuring that both frequent and rare categories were adequately represented in the test data. In addition to aggregated metrics, class-level performance indicators were systematically analyzed to identify weak classes and guide targeted improvements, such as enriching training data for problematic categories.
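The class-level analysis described above can be sketched as computing per-class scores and flagging classes below a chosen F1 cut-off for targeted data enrichment; the labels and the 0.7 threshold here are illustrative assumptions.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical predictions over three classes.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2, 0, 2]

labels = [0, 1, 2]
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

# Flag weak classes (illustrative cut-off) as candidates for more training data.
weak_classes = [c for c, score in zip(labels, f1) if score < 0.7]
print(weak_classes)  # [0, 1]
```

Reviewing precision and recall separately per class also distinguishes classes that attract spurious predictions from classes whose records leak into neighbouring categories.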
Finally, evaluation at Destatis extended beyond static test sets. To measure the robustness of the different models and to ensure relevance for future production use, model outputs were compared with classifications produced by domain experts, focusing on records where the model scores used in production fell below a fixed threshold. This continuous expert-based validation of difficult records ensures that the model remains aligned with evolving data and classification practices, and it naturally forms the basis for a human-in-the-loop framework, which is discussed in the following section.
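The threshold-based routing of records to expert review can be sketched as follows; the threshold value, record structure, and COICOP-style codes are illustrative assumptions, not the production configuration.

```python
# Hypothetical confidence cut-off below which records go to expert review.
THRESHOLD = 0.8

# Illustrative model outputs: predicted code plus the production score.
predictions = [
    {"id": 1, "code": "01.1.1", "score": 0.95},
    {"id": 2, "code": "01.1.2", "score": 0.55},
    {"id": 3, "code": "09.1.4", "score": 0.78},
]

auto_coded = [p for p in predictions if p["score"] >= THRESHOLD]
expert_review = [p for p in predictions if p["score"] < THRESHOLD]

print([p["id"] for p in expert_review])  # [2, 3] — routed to domain experts
```

The expert decisions on the routed records can then be fed back as labelled data, which is what makes this loop the natural starting point for the human-in-the-loop framework.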