Introduction
Official statistical nomenclatures, that is, standardized code systems such as NACE, ISCO, and COICOP, are often designed as hierarchical taxonomies. They can be represented as trees of parent and child nodes, with the root at the top and the leaves at the bottom: the leaves are the complete class labels, whereas internal nodes correspond to substrings of those labels, i.e. incomplete classes.
Such hierarchies allow analysis at different levels of detail, and this structure can be leveraged when classifying text into these code systems.
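To make the tree view concrete, the following minimal sketch builds parent-child relations from a handful of codes. It assumes, purely for illustration, that every parent code is a string prefix of its children. The codes, the level lengths (2, 3, and 4 characters), and the function names are hypothetical and do not reproduce any actual nomenclature.

```python
from collections import defaultdict

# Hypothetical leaf codes; real nomenclatures use their own formats.
LEAF_CODES = ["0111", "0112", "0121", "0210"]
# Assumed code length at each level of the hierarchy, from coarse to fine.
LEVEL_LENGTHS = (2, 3, 4)


def ancestors(code, level_lengths=LEVEL_LENGTHS):
    """Return the code truncated to every level, coarsest first."""
    return [code[:n] for n in level_lengths if n <= len(code)]


def build_tree(leaf_codes, level_lengths=LEVEL_LENGTHS):
    """Map each node to its sorted direct children; "" stands for the root."""
    children = defaultdict(set)
    for code in leaf_codes:
        parent = ""
        for node in ancestors(code, level_lengths):
            children[parent].add(node)
            parent = node
    return {parent: sorted(kids) for parent, kids in children.items()}


tree = build_tree(LEAF_CODES)
```

Under these assumptions, truncating a leaf label with `ancestors` yields the label at each coarser level, which is one simple way to train or evaluate a classifier at different levels of detail.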
Several attempts have been made to exploit this structure in statistical modelling. Although a fair amount of research has already been conducted, the resulting hierarchical models do not seem to improve performance substantially.
As part of AIML4OS, WP 10 “Text classification” Cluster 3, we reviewed the existing literature [1] and developed our own hierarchical text classification models. Here, we report our research, starting with a case study that clearly compares a selection of methods on a dataset based on the Austrian version of NACE. The subsequent section compiles various results: (a) similarly clear comparisons outside official statistics and (b) additional efforts at National Statistical Institutes (NSIs) based on their versions of the classification systems. This enables us to reflect the discussions on hierarchical classification for official statistics at Statistics Austria, Statistics Denmark, Destatis (Germany), Statistics Norway, the Spanish Statistical Office, and Insee (France). In conclusion, we offer perspectives of interest to those considering similar research.
Footnotes
[1] https://github.com/AIML4OS/WP10/blob/main/LiteratureReviews/literature_review_hierarchical_models.pdf