AIML4OS WP10: From Text to Code

This page introduces experiences with, and the potential of, Artificial Intelligence (AI) and Machine Learning (ML) for classification and coding in official statistics, specifically within Work Package 10 “From Text to Code” (WP10) of the AIML4OS project. WP10 explores AI/ML methodologies for improving the accuracy and efficiency of the classification and coding processes used by National Statistical Institutes (NSIs).

WP10 Goals

WP10 addresses one of the most persistent challenges in official statistics: classifying free-text data, such as job titles or household expenditures, into standardized statistical codes from classifications such as ISCO, COICOP, and NACE. AI/ML methods offer significant potential here, particularly where large datasets and complex categorical structures make manual coding inefficient.

However, the diversity of data sources — varying in language, structure, and national adaptations — makes developing a single universal solution nearly impossible.

To tackle this, WP10 brings together nine countries and four observers in a collaborative effort to overcome shared obstacles, including:

  • Limited training data,
  • Linguistic complexity and the diversity of statistical nomenclatures,
  • MLOps deployment challenges, and
  • Frequent revisions of classification systems.
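
To make the coding task concrete, the sketch below assigns a code to a free-text job title by finding the most similar labeled example. It is only an illustration, not a method used by WP10: the training pairs, the `classify` function, the character-trigram similarity, and the 0.2 threshold are all assumptions chosen for brevity, and the four-digit codes are illustrative ISCO-08-style values that should be checked against the official classification.

```python
from collections import Counter
from math import sqrt

# Hypothetical labeled examples: (free-text job title, illustrative code).
TRAINING = [
    ("registered nurse", "2221"),
    ("hospital nurse", "2221"),
    ("primary school teacher", "2341"),
    ("elementary teacher", "2341"),
    ("carpenter", "7115"),
    ("joiner and carpenter", "7115"),
]


def char_ngrams(text, n=3):
    """Count character n-grams; padding lets word boundaries contribute."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(text, training=TRAINING, threshold=0.2):
    """Return (code, score) of the nearest example, or (None, score) if too dissimilar."""
    query = char_ngrams(text)
    best_code, best_score = None, 0.0
    for example, code in training:
        score = cosine(query, char_ngrams(example))
        if score > best_score:
            best_code, best_score = code, score
    if best_score < threshold:
        return None, best_score  # leave low-confidence cases for manual coding
    return best_code, best_score


if __name__ == "__main__":
    for title in ["nurse", "school teacher", "carpentr", "xyzq"]:
        print(title, "->", classify(title))
```

Character n-grams tolerate small spelling errors such as "carpentr", and the threshold routes unfamiliar inputs to manual coding instead of forcing a code. With only a handful of labeled examples per code, a sketch like this also shows why limited training data is listed first among the obstacles.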

WP10 Outputs

WP10 aims to deliver a series of tangible outputs across two main phases.

Phase 1: Understanding the Landscape — Projects Overview and Literature Reviews

Projects Overview:
During the first year, each participating NSI presented its ongoing work on text classification, enabling WP10 to gain a comprehensive overview of how this topic is currently being approached across institutions, and what challenges are commonly encountered.
A synthesis of these presentations — identifying shared priorities, differences, and common obstacles — has been compiled in the Landscape Document (link).

Literature Reviews:
Building on this initial mapping, WP10 identified five key challenges and conducted five corresponding literature reviews. These reviews serve as the foundational groundwork before moving into the experimental phase.
All reviews are accessible here: Literature Reviews (link).


Phase 2: Exploring Solutions — Experiments and Pipeline Implementations

Based on insights from Phase 1, WP10 identified five thematic clusters to structure the experimental work.
Each cluster focuses on one of the main challenges NSIs face, exploring and implementing possible solutions collaboratively.

Rather than enforcing a single unified pipeline, the project aims to document and share a variety of approaches, enabling each NSI to tailor solutions to its specific national and operational context.

The cluster reports will be made available here: cluster reports (link). All experimental code and resources are available through GitHub repositories linked in the cluster sections below.


Clusters Structure

WP10 follows a multi-pronged, collaborative structure organized into five clusters — each targeting a specific statistical and technical challenge.

Cluster | Participants | Problem | Solution
Cluster 1 | National Institute of Statistics (Spain), Destatis (Germany) | Insufficient training data | Generate synthetic training material using LLMs
Cluster 2 | Statistics Norway, Statistics Poland, Statistics Austria | Addressing the complexity and nuances of natural language | Use LLMs and RAGs for text classification
Cluster 3 | Statistics Austria, Statistics Denmark | Traditional models neglect hierarchical structures | Apply hierarchical methods for text classification
Cluster 4 | Insee (France), Statistics Austria, Destatis (Germany) | Challenges in deploying and maintaining classification models in production environments | Provide guidelines for model deployment and monitoring
Cluster 5 | Insee (France), Statistics Norway, Statistics Netherlands | Adapting to classification system revisions | Use LLMs to handle the next NACE revision
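
The hierarchy problem listed for Cluster 3 can be illustrated with a small evaluation sketch. It assumes codes are fixed-width digit strings in which each digit is one level of the hierarchy, as in four-digit ISCO-08-style codes; classifications with letters or dots, such as NACE, would need their own level parsing. The function names and the partial-credit scoring rule are hypothetical and are not taken from the cluster's work.

```python
def shared_levels(true_code, predicted_code):
    """Number of leading hierarchy levels (digits) on which two codes agree."""
    depth = 0
    for t, p in zip(true_code, predicted_code):
        if t != p:
            break
        depth += 1
    return depth


def hierarchical_accuracy(pairs):
    """Mean fraction of hierarchy levels predicted correctly.

    `pairs` is a list of (true_code, predicted_code) tuples.
    A flat accuracy would score every non-exact match as 0.
    """
    total = sum(shared_levels(t, p) / len(t) for t, p in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    pairs = [
        ("2221", "2221"),  # exact match: 4 of 4 levels
        ("2221", "2212"),  # same sub-major group: 2 of 4 levels
        ("2221", "7115"),  # different major group: 0 of 4 levels
    ]
    print(hierarchical_accuracy(pairs))
```

A flat accuracy would treat the second and third predictions as equally wrong, even though the second lands in the same branch of the classification. Scoring by shared levels is one simple way to make that difference visible.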

Reuse

CC-BY 4.0