Useful Links

The following links lead to external training materials, tutorials, projects, documents, and papers related to topics covered by the Work Packages of the AIML4OS project.

Earth Observation

[1] Introduction to Machine Learning for Earth Observation. Educational platform offering courses that combine machine learning methods with Earth Observation data. Focuses on practical applications of AI techniques in environmental and geospatial analysis.

[2] ESA’s Newcomers Earth Observation Guide | Eurostat CROS. Introductory guide explaining the basics of Earth Observation for new users. It presents key concepts, data types, and common applications in a structured and accessible way.

[3] Copernicus Data Space Ecosystem | Europe’s eyes on Earth. Online platform providing access to Copernicus Earth Observation datasets and related tools. It enables users to discover, download, and work with satellite data from the European Earth observation programme.

[4] EUSPA - European Union Agency for the Space Programme (requires login). Portal related to the European Union Space Programme offering resources and information for stakeholders. It supports access to space-related services, data, and applications within the EU ecosystem.

[5] Copernicus MOOC – Learn to harness the power of space data. Free online course introducing the use of Copernicus satellite data for environmental and societal applications. It is designed to help learners understand and apply Earth Observation in practice.

[6] Basics of Optical Remote Sensing – EO4GEO. Educational resource covering the fundamentals of optical remote sensing. It explains principles of data acquisition, sensor types, and interpretation of optical imagery.

[7] Image processing and analysis – EO4GEO. Learning material focused on techniques for processing and analysing Earth Observation imagery. It covers workflows used to transform raw satellite data into usable information.

[8] ESA - EO science for society. Training and educational hub providing resources on Earth Observation science and applications. It supports capacity building for researchers, professionals, and students working with satellite data.

[9] 12th ESA Training Course on Earth Observation 2022. Collection of materials from an ESA training course on Earth Observation. It includes presentations and resources used to support learning in satellite data analysis.

[10] Space4Climate (UK Agency). Platform supporting the use of space-based data for climate monitoring and climate services. It focuses on connecting Earth Observation with climate science and policy applications.

[11] Gateway to NASA Earth Observation Data. Central NASA portal providing access to Earth Observation datasets, tools, and services. It supports exploration and analysis of satellite data for research and applications.

[12] Earth Observation satellite’s data portals provided by ESA, NASA, CSA, JAXA, ISRO. Overview platform linking to major Earth Observation data portals from international space agencies. It serves as a gateway for discovering satellite data sources worldwide.

Programming

[13] Insee - Best programming practices with Git and R. Practical guide covering coding standards, reproducibility, and version control workflows using R and Git in collaborative data science projects.

[14] Lino Galiana from Insee - Data science with Python. Structured introduction to data science in Python, covering core libraries, data manipulation, and foundational analytical techniques.

[15] Insee - Introduction to MLOps with MLflow (slides). Training material introducing MLOps concepts with a focus on MLflow for experiment tracking, model management, and deployment workflows.

[16] Insee - Putting data science projects into production. Resource focused on reproducibility, deployment strategies, and operational aspects of maintaining data science systems in production environments.

[17] Insee - Introduction to ensemble algorithms. Educational content explaining ensemble learning techniques such as bagging and boosting, along with their applications in predictive modeling.

[18] Python Data Science Handbook. Comprehensive reference covering Python tools for data analysis, including NumPy, pandas, Matplotlib, and machine learning workflows.

[19] R for Data Science. Widely used textbook introducing data science workflows in R, including data wrangling, visualization, and modeling using tidyverse principles.

[20] Stanford Machine Learning Notes. Lecture notes providing a foundational overview of machine learning algorithms, optimization, and practical implementation concepts.

[21] An Introduction to Statistical Learning with Applications in R. Introductory textbook on statistical learning methods, offering both theoretical background and applied examples in R.

[22] Author(s): Hastie, T.; Tibshirani, R.; Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. [online] Available at: link (accessed: 2026-04-16). This foundational textbook provides a comprehensive introduction to statistical learning theory and machine learning methods. It covers regression, classification, model selection, and regularization techniques with both theoretical and applied perspectives. The book is widely regarded as a core reference in statistical and machine learning education.

[23] Author(s): Zhang, C.; Ma, Y. (eds.). (2012). Machine Learning Refined: Foundations, Algorithms, and Applications. Cambridge University Press. [online] Available at: link (accessed: 2026-04-16). This book presents modern foundations of machine learning, including core algorithms and their practical applications. It bridges theoretical concepts with implementation-oriented perspectives, making it suitable for both researchers and practitioners. The publication emphasizes algorithmic understanding and real-world use cases.

A selection of resources from the United States Geological Survey on the use of R.

Beyond Basic R

[24] The case for reproducibility. Discussion of the importance of reproducible research practices in data science, emphasizing transparency, reliability, and replicability of analytical workflows.

[25] Beyond Basic R - Introduction and Best Practices. Overview of good programming practices in R, including workflow organization, code structure, and principles of reproducible analysis.

[26] Beyond Basic R - Version Control with Git. Practical guide to using Git for version control in collaborative data science projects and reproducible research workflows.

[27] Beyond Basic R - Mapping. Introduction to spatial data visualization techniques in R, including basic mapping workflows and geospatial representation.

[28] Beyond Basic R - Plotting with ggplot2 and Multiple Plots in One Figure. Explanation of advanced data visualization techniques in R using ggplot2, including multi-panel plot design.

[29] Beyond Basic R - Data Munging. Overview of data cleaning, transformation, and preparation techniques used in R-based analytical workflows.

[30] Beyond Basic R - Data Munging. Resource describing data wrangling and transformation techniques in R, focusing on preparing datasets for analysis.

Reproducible Data Science in R

[31] Reproducible Data Science in R: Say the quiet part out loud with assertion tests. Introduction to using assertion tests in R to validate data and ensure correctness and reproducibility of analytical workflows.

[32] Reproducible Data Science in R: Flexible functions using tidy evaluation. Explanation of tidy evaluation principles in R and their use in building flexible and programmatic functions.

[33] Reproducible Data Science in R: Writing better functions. Guidance on designing clean, reusable, and robust functions in R with a focus on maintainability and reliability.

[34] Reproducible Data Science in R: Writing functions that work for you. Practical introduction to functional programming approaches that support more efficient and scalable data analysis workflows.

[35] Reproducible Data Science in R: Iterate, don’t duplicate. Discussion of iterative programming strategies in R, emphasizing code reuse and reducing duplication in analytical workflows.

Quality

[38] Van Delden, A., J. Burger, and M. Puts. 2023. “Ten Propositions on Machine Learning in Official Statistics.” Discussion of key principles and recommendations for the application of machine learning in official statistics.

[39] Author(s): Kowarik, A., et al. (2020). Deliverable K3: Revised Version of the Quality Guidelines for the Acquisition and Usage of Big Data. [online] Available at: link (accessed: 2026-04-16). This report presents revised quality guidelines for the acquisition, processing, and use of big data in official statistical systems. It focuses on ensuring methodological consistency and data quality in non-traditional data sources. The document provides practical recommendations for statistical offices.

[40] Author(s): Reinert, R., et al. (2016). Work Package 1: Checklist for Evaluating the Quality of Input Data. [online] Available at: link (accessed: 2026-04-16). This document provides a structured checklist for evaluating the quality of input data used in statistical production processes. It focuses on identifying and mitigating data quality issues early in the statistical workflow. The framework supports consistent assessment of data sources in official statistics.

[41] De Waal, T., et al. (2019). “Quality measures for multisource statistics.” Methodological paper proposing approaches for evaluating data quality in multisource statistical frameworks.

[42] Kowarik, A., et al. (2025). “Quality Guidelines for acquiring and using web scraped data.” Guidelines addressing quality assessment and methodological challenges in the use of web-scraped data for official statistics.

[43] AI Act: High Level Summary. Summary of the European Union Artificial Intelligence Act outlining its scope, structure, and regulatory objectives.

[44] Piela, R. (2024). “Incorporating AI into Statistical Standards: Enhancing GSBPM with AI.” Presentation on integrating artificial intelligence into statistical production frameworks, including enhancements to GSBPM.

[45] GPAI models guidelines. Guidelines describing regulatory obligations for providers of general-purpose AI models under the EU AI Act.

[46] Saidani, Y., et al. (2023). “Quality dimensions of machine learning in official statistics.” Study defining key quality dimensions for machine learning applications in official statistics.

[47] Saidani, Y., Dumpert, F. (2025). “Quality dimensions and guidelines for machine learning in official statistics.” Book chapter presenting frameworks and guidelines for assessing machine learning quality in official statistics.

[48] UNECE (2021) Machine Learning for Official Statistics. Comprehensive report on the use of machine learning methods in official statistical production.

[49] Yung, W., et al. (2018). The use of machine learning in official statistics. Overview of early applications and methodological considerations of machine learning in official statistics.

[50] Yung, W., et al. (2022). “A quality framework for statistical algorithms.” Framework proposing quality assessment criteria for statistical algorithms used in official statistics.

[51] UNECE - Machine Learning for Official Statistics. Report discussing opportunities and challenges of applying machine learning methods in official statistics production.

[52] Puts, Daas - Machine Learning from the Perspective of Official Statistics. Academic discussion on the role of machine learning in official statistics and its methodological implications.

[53] UNECE - A quality framework for statistical algorithms. Framework defining quality dimensions and evaluation criteria for statistical algorithms used in official statistics.

[54] A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation and evaluation. Review of methods addressing class imbalance using ensemble learning and data augmentation techniques, including evaluation strategies.

[55] UNECE - Organisational aspects of implementing ML based data editing in statistical production. Report on organizational and methodological aspects of implementing machine learning for automated data editing in statistical systems.

[56] Text classification for ECOICOP classification. [online] Available at: link (accessed: 2026-04-16). This resource focuses on the application of text classification methods for assigning statistical consumption data to ECOICOP categories. It explores how machine learning can support automated classification of expenditure descriptions into standardized statistical groupings. The aim is to improve consistency and efficiency in official statistics classification workflows.

[57] Machine learning introductory lecture – Statistics Norway. This introductory lecture presents fundamental concepts of machine learning in the context of official statistics. It explains basic algorithms, workflows, and typical applications used by statistical offices. The material is designed for beginners with a focus on practical understanding of ML techniques.

[58] Handbook of Statistical Data Editing and Imputation. This handbook provides a comprehensive overview of methods for data editing and imputation in statistical production. It covers both traditional and modern approaches used to handle missing data and correct errors in datasets. The publication serves as a reference guide for official statistics practitioners.

[59] The OECD Laboratory for Geospatial Analysis. Initiative by the OECD dedicated to developing and applying geospatial analysis for policy-making and statistical innovation. It highlights methods for integrating spatial data into public sector decision processes.

[60] FAO Webinar Series: Earth observation data for agricultural statistics. Webinar series presenting applications of Earth Observation in agricultural statistics. It demonstrates how satellite data can support monitoring and reporting in the agricultural sector.

Papers and publications

[61] Author(s): Dumpert, F. (2025). Foundations and Advances of Machine Learning in Official Statistics. Springer. [online] Available at: link (accessed: 2026-04-16). This book presents recent developments in the application of machine learning methods in official statistics. It covers methodological foundations as well as practical implementations in statistical production systems. The focus is on integrating modern ML techniques into official statistical workflows.

[62] Author(s): Lee, D.; Zhang, L.C.; Chen, S. (2022). Robust quasi-randomization-based estimation with ensemble learning for missing data. Wiley Online Library. [online] Available at: link (accessed: 2026-04-16). The article proposes an approach for handling missing data using ensemble learning combined with quasi-randomization-based estimation. It focuses on improving robustness of statistical estimation under incomplete data conditions. The study evaluates performance improvements compared to traditional methods.

[63] Author(s): Khan, A.A.; Chaudhari, O.; Chandra, R. (2023). A review of ensemble learning and data augmentation models for class imbalanced problems. ScienceDirect. [online] Available at: link (accessed: 2026-04-16). This paper provides a systematic review of ensemble learning and data augmentation techniques for addressing class imbalance problems. It compares different methodological approaches and their effectiveness across applications. The study also discusses evaluation strategies for imbalanced datasets.

[64] Author(s): Malley, J.D.; Kruppa, J.; Dasgupta, A.; Malley, K.G.; Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. PLOS/PMC. [online] Available at: link (accessed: 2026-04-16). The paper introduces “probability machines” for consistent probability estimation using nonparametric machine learning methods. It focuses on improving probabilistic prediction rather than only classification accuracy. The approach is evaluated theoretically and empirically for consistency and reliability.

[65] Author(s): Wright, M.N.; König, I.R. (2019). Splitting on categorical predictors in random forests. PLOS/PMC. [online] Available at: link (accessed: 2026-04-16). This study investigates how categorical variables can be effectively handled in random forest models. It proposes improved splitting strategies to better capture information from categorical predictors. The results show improved predictive performance in relevant settings.

[66] Author(s): Forteza, N.; García-Uribe, S. (2024). A Score Function to Prioritize Editing in Household Survey Data: A Machine Learning Approach. SAGE Journals. [online] Available at: link (accessed: 2026-04-16). The article presents a machine learning-based score function designed to prioritize editing tasks in household survey data. It aims to improve efficiency in data cleaning processes by identifying high-impact records. The approach supports more targeted and resource-efficient data validation workflows.

[67] Author(s): Zabala, F. (2015). Let the data speak: Machine learning methods for data editing and imputation. UNECE. [online] Available at: link (accessed: 2026-04-16). This report explores early applications of machine learning methods for data editing and imputation in official statistics. It emphasizes the potential of data-driven approaches in improving statistical data quality. The document provides practical case studies and methodological insights.

[68] Author(s): Rocci, V.; Varriale, R. (2021). Machine Learning tool for editing in the Italian Register of the Public Administration. UNECE StatsWiki. [online] Available at: link (accessed: 2026-04-16). This case study presents a machine learning tool used for data editing in the Italian Public Administration register. It demonstrates how ML techniques can support automated error detection and correction. The work highlights practical implementation in administrative data systems.

[69] Author(s): Côté, P.-O.; Nikanjam, A.; Ahmed, N.; Humeniuk, D.; Khomh, F. (2023). Data cleaning and machine learning: a systematic literature review. [online] Available at: link (accessed: 2026-04-16). This systematic review examines the relationship between data cleaning practices and machine learning performance. It summarizes how data preprocessing impacts model accuracy and robustness. The study highlights common techniques and research gaps in the field.

[70] Author(s): Burakauskaitė, I. (2024). Moving towards the standardized process of automatic statistical data editing using machine learning techniques. UNECE. [online] Available at: link (accessed: 2026-04-16). This presentation discusses efforts to standardize automated statistical data editing using machine learning methods. It focuses on integrating ML into formal statistical production pipelines. The work highlights challenges and benefits of automation in data quality processes.

[71] Author(s): del Monaco, D. (2022). Stacking machine-learning models for anomaly detection: comparing AnaCredit to other banking datasets. UNECE. [online] Available at: link (accessed: 2026-04-16). The study explores stacking ensemble models for anomaly detection in financial datasets. It compares performance across different banking data sources, including AnaCredit. The results show how ensemble strategies improve detection accuracy.

[72] Author(s): Barragán, S.; Salgado, D. (2022). Improving statistical data editing with Machine Learning: first use cases in Statistics Spain (INE). UNECE. [online] Available at: link (accessed: 2026-04-16). This work presents early applications of machine learning for improving statistical data editing at Statistics Spain (INE). It describes practical use cases and implementation experiences. The study highlights improvements in efficiency and data quality.

[73] Author(s): Vásquez, C. (2022). Automatic selective editing approach using machine learning: an application to VAT data. UNECE. [online] Available at: link (accessed: 2026-04-16). The article describes an automatic selective editing approach applied to VAT data using machine learning techniques. It focuses on identifying influential records requiring manual review. The method improves efficiency in statistical data validation processes.

[74] UNECE – HLG-MOS Machine Learning Project Edit and Imputation Theme Report. Documentation of machine learning approaches for data editing and imputation in official statistical production pipelines.