Data

Data storage

Germany

To ensure that the data is stored and used efficiently, we make use of the Hadoop Distributed File System (HDFS) and Parquet for data partitioning. HDFS is designed for handling large amounts of data. For programming and data processing, we use Cloudera Machine Learning (CML) with PySpark, which allows us to work on the data efficiently. We store our data in the Parquet format, which is well suited to big data. In addition, to make it easier for users to handle and cross-check the data, we use Hue (Hadoop User Experience), an open-source SQL-based cloud editor. For rights management, we use Ranger, which provides a wide range of access controls to ensure data security.
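As a rough illustration, the following PySpark sketch shows how a dataset could be written to HDFS as partitioned Parquet files and read back for processing in CML. The paths, the partition column and the application name are hypothetical examples, not the actual setup of the pipeline.

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session inside the CML workspace.
    spark = SparkSession.builder.appName("household-data-storage").getOrCreate()

    # Hypothetical source location on HDFS.
    df = spark.read.parquet("hdfs:///data/household_survey/raw")

    # Write the data back to HDFS as Parquet, partitioned by reference quarter,
    # so that downstream queries (e.g. via Hue) only scan the partitions they need.
    (df.write
       .mode("overwrite")
       .partitionBy("reference_quarter")
       .parquet("hdfs:///data/household_survey/curated"))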

Austria

Training data is stored as CSV files. New files are added quarterly by the subject matter experts (between 300 and 500 data entries per delivery), which are then used to retrain the model.
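A minimal sketch of how the quarterly CSV deliveries could be combined into a single training set is given below; the directory layout and file naming are assumptions made for illustration only.

    import glob
    import pandas as pd

    # Hypothetical folder with one CSV file per quarterly delivery
    # (each containing roughly 300 to 500 labelled entries).
    files = sorted(glob.glob("training_data/quarter_*.csv"))

    # Concatenate all deliveries into one data frame used for retraining.
    training_df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    print(f"{len(files)} files, {len(training_df)} training entries")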

Data cleaning

Germany

The data cleaning in our project is quite straightforward, since the text entries consist of short texts (mostly keywords) rather than long ones. First, data augmentation is performed by adding new text entries (e.g. “groceries” or “beverages”) to the dataset, i.e. multiple newly generated text values are added to each household to enrich the data. Adding a variety of new textual entries helps the model generalize better. Secondly, we clean the data by removing punctuation and handling missing values.
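The following sketch shows what the augmentation and cleaning steps could look like, assuming a pandas data frame with a free-text column named “text”; the example keywords, column names and sample rows are illustrative assumptions, not the productive code.

    import string
    import pandas as pd

    df = pd.DataFrame({"household_id": [1, 1, 2],
                       "text": ["Groceries!", None, "beverages,"]})

    # Augmentation: add a few newly generated keyword entries per household
    # so the model sees a broader variety of short texts.
    extra_keywords = ["groceries", "beverages", "rent"]
    augmented = pd.DataFrame(
        [(hid, kw) for hid in df["household_id"].unique() for kw in extra_keywords],
        columns=["household_id", "text"],
    )
    df = pd.concat([df, augmented], ignore_index=True)

    # Cleaning: drop missing text values and strip punctuation.
    df = df.dropna(subset=["text"])
    df["text"] = df["text"].str.translate(str.maketrans("", "", string.punctuation))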

Austria

Duplicated entries are removed from the data. Text inputs are transformed to all lower-case letters. Further, we remove stop words, umlaut characters (ä, ö, ü), special characters (e.g. -, +, #), gender-specific word endings (e.g. “-in”, “:innen”), and numbers. Each categorical variable has a predefined set of valid input classes, since the model can only handle known classes. All known inputs are mapped to this set of classes; unknown inputs are set to their respective “unknown” category.
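The sketch below illustrates these normalisation steps for a single text input and the mapping of a categorical variable onto its predefined classes. The stop-word list, the ending pattern and the class set are simplified placeholders, not the actual lists used in production.

    import re

    GERMAN_STOPWORDS = {"und", "der", "die", "das"}   # placeholder stop-word list
    VALID_CLASSES = {"employee", "self-employed"}     # placeholder class set

    def clean_text(text: str) -> str:
        text = text.lower()
        # Replace umlaut characters.
        text = text.translate(str.maketrans({"ä": "a", "ö": "o", "ü": "u"}))
        # Strip gender-specific word endings (simplified pattern).
        text = re.sub(r"(:innen|in)\b", "", text)
        # Remove special characters and numbers.
        text = re.sub(r"[-+#0-9]", " ", text)
        # Drop stop words.
        tokens = [t for t in text.split() if t not in GERMAN_STOPWORDS]
        return " ".join(tokens)

    def map_category(value: str) -> str:
        # Known inputs keep their class; unknown inputs fall back to "unknown".
        return value if value in VALID_CLASSES else "unknown"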

Data versioning

Germany

None

Austria

None