From text to code (WP10.4) - MLOps tutorial

This tutorial provides a step-by-step guide to building an MLOps pipeline. It walks through the main building blocks of an ideal MLOps pipeline, with feedback and example implementations from three National Statistical Institutes (Insee, German Federal Statistical Office, Statistics Austria).

Use-Case Descriptions:

German Federal Statistical Office: automatic coding for the Household Budget Survey

Household Budget Surveys (HBS), such as the Income and Expenditure Survey (EVS), provide important data on the income and expenditure of the total population and its various groups. In these surveys, participants record their daily expenditures over a period of up to three months, using either a digital app or a paper diary. All recorded expenditures must be categorized according to the COICOP classification. Data entered via the app are already classified by the participants using a search algorithm, whereas entries from paper diaries must be classified retrospectively, which has until now been done manually.

For the processing of EVS 2023 data, approximately 5 million individual expense records were expected, which would have required manual classification. To handle this volume, a machine learning approach was developed and has been successfully deployed in production.

Statistics Austria: Generic models for multiple coding systems

For several surveys, participants enter their answers to questions as free text. Each of these texts must be assigned a standardized code. While filling out the survey, respondents are given suggestions. If respondents do not select one of the suggested responses and instead provide a new free-text entry, this entry must be manually classified. For the Labour Force Survey (LFS), Job Vacancy Survey (JVS) and Structural Business Statistics Survey (SBS), participants are required to enter their job title (ISCO classification). Additionally, for the LFS, information on the economic sector of participants’ place of employment (NACE) and on their education (ISCED) is required. Daily expenditures recorded during the Household Budget Survey (HBS) are classified according to the ÖKLAP - Austria’s customized version of the COICOP code.

At Statistics Austria, we currently have a separate model for each of these four classification systems (ISCO, ISCED, NACE, and ÖKLAP), with 420, 100, 701, and over 500 classes, respectively. In our classification approach, we restrict the set of classes for each code to those observed in the training data rather than using the entire set of available classes. As a result, the number of classes we classify may be smaller than the total number of classes in the full code structure. The reason for this is that certain codes represent classes that are either highly unlikely or irrelevant for Austria. For instance, ‘A 03.11-0’, the NACE code for marine fishing, describes activities that are not relevant within Austria’s business context due to the country’s lack of access to the sea. By focusing only on classes observed in the training data, we can improve classification relevance and efficiency, tailoring the model to predict only those categories with practical applicability for the Austrian setting.

Not only the number of classes but also the amount of training data available varies significantly across codes: the training dataset for NACE codes is the smallest, comprising about 13,000 instances, whereas the ISCO dataset includes approximately 400,000 instances.
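The restriction of the label space to classes observed in the training data can be illustrated with a minimal Python sketch. It is not Statistics Austria's implementation: the function names, the `min_count` parameter, and the `X 00.0n` placeholder codes are assumptions made for illustration; only ‘A 03.11-0’ is taken from the text above.

```python
from collections import Counter


def observed_label_space(train_labels, min_count=1):
    """Return the sorted codes that occur at least `min_count` times in training data."""
    counts = Counter(train_labels)
    return sorted(code for code, n in counts.items() if n >= min_count)


def build_label_index(label_space):
    """Map each retained code to the integer index a classifier would predict."""
    return {code: i for i, code in enumerate(label_space)}


# Hypothetical full nomenclature: placeholder codes plus the marine fishing code.
full_nomenclature = ["A 03.11-0", "X 00.01", "X 00.02", "X 00.03"]

# Hypothetical training labels: marine fishing never occurs.
train_labels = ["X 00.01", "X 00.02", "X 00.02", "X 00.03"]

label_space = observed_label_space(train_labels)
label_index = build_label_index(label_space)

# Codes that exist in the nomenclature but are excluded from the model.
excluded = sorted(set(full_nomenclature) - set(label_space))

print(label_space)
print(excluded)
```

With this setup, the model's output layer only needs as many units as there are entries in `label_space`. A code that is absent from the training data can never be predicted, so the label space has to be rebuilt whenever new training data introduces a previously unseen code.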

Insee: high-throughput NACE automatic coding

Insee has adopted an MLOps approach for one main use case: automatic NACE coding for the national business registry. The production pipeline classifies more than 10,000 self-declared activity descriptions into 745 codes, with a human-in-the-loop fallback when the model's confidence is low.
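A human-in-the-loop fallback of this kind is often implemented as a confidence threshold on the model's top prediction. The sketch below illustrates that general pattern only; it is not Insee's production code. The `route` function, the `Decision` dataclass, the threshold of 0.8, the codes `X1` to `X3`, and the use of the top-class probability as the confidence score are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Decision:
    text: str
    code: Optional[str]   # None when the description is sent to a human coder
    confidence: float
    needs_review: bool


def route(text: str, probabilities: Dict[str, float], threshold: float = 0.8) -> Decision:
    """Accept the top prediction if it is confident enough, otherwise flag it for review."""
    best_code = max(probabilities, key=probabilities.get)
    confidence = probabilities[best_code]
    if confidence >= threshold:
        return Decision(text, best_code, confidence, needs_review=False)
    return Decision(text, None, confidence, needs_review=True)


# Hypothetical model outputs for two activity descriptions.
confident = route("boulangerie artisanale", {"X1": 0.93, "X2": 0.07})
uncertain = route("activités diverses", {"X1": 0.40, "X2": 0.35, "X3": 0.25})

print(confident)
print(uncertain)
```

The threshold controls the trade-off between the automation rate and the error rate of automatically accepted codes, so in practice it would be tuned on held-out annotated data rather than fixed in advance.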

Other production-oriented automatic coding pipelines include COICOP (household budget survey) and SPC (census).

Authors

Meilame Tayebjee

Public Statistician 2

Published

October 27, 2025