Introduction

At Statistics Austria, we have developed text models that classify text from survey responses into standardized codes. Four models are currently in production, one for each of the following codes: ISCO, ISCED, NACE, and ÖKLAP (Austria’s customized version of the COICOP code), with 420, 100, 701, and over 500 classes, respectively. For each code, we restrict the set of classes to those observed in the training data rather than using the entire set of available classes. As a result, the number of classes a model can predict may be smaller than the number of classes in the full code structure. The reason is that certain codes represent classes that are either highly unlikely or irrelevant for Austria. For instance, ‘A 03.11-0’, the NACE code for marine fishing, describes an activity that is not relevant to Austria’s business context because the country has no access to the sea. Restricting each model to the observed classes improves the relevance and efficiency of classification, since the model predicts only categories with practical applicability in the Austrian setting. The amount of available training data also varies considerably across codes: the NACE training set is the smallest, with about 13,000 instances, whereas the ISCO data set includes approximately 400,000 instances.
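The restriction of the label set to observed classes can be illustrated with a minimal sketch. It is not the production code: the function name `restrict_to_observed`, the variable names, and the toy data are assumptions made for illustration, and every code except ‘A 03.11-0’ is a placeholder rather than a real NACE code.

```python
from collections import Counter

# Hypothetical full code structure. Only 'A 03.11-0' is taken from the text;
# the other entries are placeholders, not real NACE codes.
FULL_CODE_STRUCTURE = ["A 03.11-0", "X 00.00-1", "X 00.00-2", "X 00.00-3"]

# Hypothetical training labels, one code per survey response.
train_labels = ["X 00.00-2", "X 00.00-1", "X 00.00-2", "X 00.00-3", "X 00.00-2"]


def restrict_to_observed(full_codes, labels):
    """Return the codes from full_codes that occur at least once in labels.

    The order of full_codes is preserved so that class indices are stable.
    """
    counts = Counter(labels)
    return [code for code in full_codes if counts[code] > 0]


# Classes the model is allowed to predict.
label_set = restrict_to_observed(FULL_CODE_STRUCTURE, train_labels)

# Mapping from code to class index, as a classifier's output layer would need.
label_to_index = {code: i for i, code in enumerate(label_set)}

print(f"{len(label_set)} of {len(FULL_CODE_STRUCTURE)} codes are kept")
print("marine fishing kept:", "A 03.11-0" in label_to_index)
```

In this toy setting, the marine fishing code never occurs in the training labels, so it is dropped from the label set and the model cannot predict it.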