Web data

The main data source is company descriptions collected from the companies' own websites, in particular from subpages that present the business profile, such as “about us” or “company information”. These pages typically describe the company's business activities, services, or products in detail.
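As an illustration of this collection step, the sketch below extracts the visible text of a page and picks out links that look like profile subpages. It is a minimal example using only the Python standard library, not the pipeline actually used in the project. The keyword list `PROFILE_KEYWORDS` and the names `VisibleTextParser` and `parse_page` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical keywords: the report does not say how profile subpages were selected.
PROFILE_KEYWORDS = ("about", "company")


class VisibleTextParser(HTMLParser):
    """Collects visible text and link targets, skipping script/style content."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            # Collapse internal runs of whitespace to single spaces.
            self.text_parts.append(" ".join(data.split()))


def parse_page(html):
    """Return (visible_text, profile_links) for one HTML document."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    profile_links = [
        href for href in parser.links
        if any(keyword in href.lower() for keyword in PROFILE_KEYWORDS)
    ]
    return text, profile_links
```

In this sketch, `parse_page` would first be applied to a company's home page to find candidate subpages, and then to each of those subpages to obtain the description text. Downloading the pages is left out on purpose.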

Synthetic data

Because the available data were scarce and the dataset was heavily imbalanced (for example, one class contained around 800 observations, while others had only one or none), we decided to generate synthetic data.
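The imbalance can be turned into a per-class generation quota. The sketch below counts the observations per class and computes how many synthetic examples each class would need to reach a minimum size. The function name `synthetic_quota`, the class labels, and the target of 50 examples per class are hypothetical; only the counts of roughly 800, one, and zero echo the example above.

```python
from collections import Counter


def synthetic_quota(labels, all_classes, target_per_class):
    """Return how many synthetic examples each class needs to reach the target."""
    counts = Counter(labels)
    return {
        cls: max(0, target_per_class - counts.get(cls, 0))
        for cls in all_classes
    }


# Placeholder labels standing in for PKD codes.
labels = ["class_a"] * 800 + ["class_b"]
quota = synthetic_quota(labels, ["class_a", "class_b", "class_c"], target_per_class=50)
print(quota)
```

Passing the full list of classes separately matters here, because a class with no observations never appears in the labels and would otherwise be missed.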

We generated new textual examples with an LLM accessed through the Groq API. Each generated record followed a standardised structure consisting of:

  • Description – a short description of the enterprise
  • PKD – the corresponding PKD code (Polish Classification of Activities, Polska Klasyfikacja Działalności)
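The two-field structure above can be enforced in code when the model output is read back. The sketch below assumes the model is asked to return a JSON array of objects with the keys `description` and `pkd`. That output format, the prompt wording, and the names `SyntheticRecord`, `build_prompt`, and `parse_records` are assumptions made for illustration; the actual call to the LLM is deliberately left out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticRecord:
    description: str  # short description of the enterprise
    pkd: str          # the corresponding PKD code


def build_prompt(pkd_code, n_examples):
    """Build a prompt asking for records in the assumed JSON format."""
    return (
        f"Write {n_examples} short company descriptions for PKD code {pkd_code}. "
        'Return only a JSON array of objects with the keys "description" and "pkd".'
    )


def parse_records(raw_response, expected_pkd):
    """Keep only well-formed records that carry the expected PKD code."""
    try:
        items = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # the response was not valid JSON
    if not isinstance(items, list):
        return []
    records = []
    for item in items:
        if not isinstance(item, dict):
            continue
        description = item.get("description")
        pkd = item.get("pkd")
        if isinstance(description, str) and description.strip() and pkd == expected_pkd:
            records.append(SyntheticRecord(description.strip(), pkd))
    return records
```

Discarding records whose code differs from the requested one is a conservative choice: it prevents a mislabelled generation from adding noise to a class that already has very few examples.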

In parallel, we used LM Arena to check the style of the generated descriptions, analysing their consistency and overall quality. The resulting synthetic dataset was incorporated into the subsequent stages of the project. This partially mitigated the class imbalance and helped improve the models’ ability to generalise.