
From text to code (WP10.4) - MLOps tutorial
This tutorial provides a step-by-step guide to building an MLOps pipeline. It walks through the main building blocks of an ideal MLOps pipeline and provides feedback and example implementations from three national statistical institutes (NSIs) in France, Germany, and Austria, as well as from other partners.
Use-Case Descriptions:
Destatis:
Household Budget Surveys (HBS), such as the Income and Expenditure Survey (EVS), provide important data on the income and expenditure of the total population and its various groups. In these surveys, participants record their daily expenditures over a period of up to three months, using either a digital app or a paper diary. All recorded expenditures must be categorized according to the COICOP classification. Entries made via the app are classified by the participants themselves with the help of a search algorithm, whereas entries from paper diaries must be classified retrospectively, which until now has been done manually.
For the processing of EVS 2023 data, approximately 5 million individual expenditure records were expected, which would otherwise have required manual classification. To handle this volume, a machine learning approach was developed and has been successfully deployed in production.
Austria:
In several surveys, participants enter their answers to questions as free text. Each of these texts must be assigned a standardized code. While filling out the survey, respondents are given suggestions. If a respondent does not select one of the suggested responses and instead provides a new free-text entry, this entry must be classified manually. For the Labour Force Survey (LFS), Job Vacancy Survey (JVS) and Structural Business Statistics Survey (SBS), participants are required to provide their job title, which is coded according to the ISCO classification. Additionally, for the LFS, information on the economic sector of participants’ place of employment (NACE) and on their education (ISCED) is required. Daily expenditures recorded during the Household Budget Survey (HBS) are classified according to ÖKLAP, Austria’s customized version of the COICOP classification.
At Statistics Austria, we currently have a separate model for each of these four classification systems (ISCO, ISCED, NACE, and ÖKLAP), with 420, 100, 701, and over 500 classes, respectively. In our classification approach, we restrict the set of classes for each classification system to those observed in the training data rather than using the entire set of available classes. As a result, the number of classes a model can predict may be smaller than the total number of classes in the full code structure. The reason for this is that certain codes represent classes that are either highly unlikely or irrelevant for Austria. For instance, ‘A 03.11-0’, the NACE code for marine fishing, describes activities that are not relevant within Austria’s business context because the country has no access to the sea. By focusing only on classes observed in the training data, we improve classification relevance and efficiency, tailoring each model to predict only those categories with practical applicability in the Austrian setting.
Not only the number of classes but also the amount of available training data varies significantly across classification systems: the training dataset for NACE codes is the smallest, comprising about 13,000 instances, whereas the ISCO dataset includes approximately 400,000 instances.
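The restriction of the label space to codes observed in the training data, as described above for Statistics Austria, can be sketched as follows. This is a minimal illustration and not the production implementation: the function name `build_label_space`, the variable names, and the toy data are assumptions made for this example, and the codes starting with "P" are placeholders rather than real NACE codes.

```python
from collections import Counter


def build_label_space(train_labels, full_codes):
    """Restrict the label space to codes that occur in the training data."""
    counts = Counter(train_labels)  # frequency of each observed code
    classes = sorted(counts)        # stable, reproducible class order
    label_to_index = {code: i for i, code in enumerate(classes)}
    # Codes of the full classification that the model will never predict
    unseen = sorted(set(full_codes) - set(classes))
    return classes, label_to_index, unseen, counts


# Toy data: "A 03.11-0" is the marine-fishing code mentioned in the text;
# the "P ..." codes are hypothetical placeholders.
full_codes = ["A 03.11-0", "P 00.01-0", "P 00.02-0", "P 00.03-0"]
train_labels = ["P 00.02-0", "P 00.01-0", "P 00.02-0", "P 00.03-0"]

classes, label_to_index, unseen, counts = build_label_space(train_labels, full_codes)
print(len(classes), "of", len(full_codes), "codes are kept as classes")
print("never predicted:", unseen)
```

In this sketch the returned `counts` also give a first view of how unevenly the training instances are spread over the retained classes, which is relevant given the large differences in dataset size between classification systems.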