Serving

Model serving

Once a model is trained, validated, and packaged, it must be made accessible to the applications and teams that will use it in production. The standard approach is to expose it as a REST API: the model runs as a persistent service, and any client — regardless of language or platform — can obtain predictions by sending an HTTP request to a URL. This decoupling is what allows data science teams to iterate on the model independently from the engineering teams consuming it.

Two serving patterns are common in practice:

Online (synchronous) inference: the client sends a single request and waits for the response. This is the right pattern for interactive applications where latency matters.
Batch (asynchronous) inference: a script processes a large dataset in bulk, typically on a schedule, in a containerised and reproducible environment (e.g. Argo Workflows). This is preferred when throughput matters more than latency and results do not need to be immediate.
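
The batch pattern above can be sketched in a few lines. This is a minimal illustration, not a production job: `iter_batches`, `predict_in_batches`, and `toy_predict` are hypothetical names, and the toy function stands in for a real model's prediction call.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def iter_batches(records: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `batch_size` records."""
    iterator = iter(records)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def predict_in_batches(
    records: Iterable[str],
    predict_fn: Callable[[list[str]], list[str]],
    batch_size: int = 1000,
) -> list[str]:
    """Run `predict_fn` batch by batch and concatenate the predictions."""
    predictions: list[str] = []
    for batch in iter_batches(records, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions


# Stand-in for a real model (assumption): labels each text by length parity.
def toy_predict(texts: list[str]) -> list[str]:
    return ["even" if len(t) % 2 == 0 else "odd" for t in texts]
```

Processing in fixed-size batches keeps memory use bounded and lets the model vectorise its work, which is why this pattern favours throughput over per-record latency.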

Tip

A well-designed API accepts raw input (text, structured fields) and returns predictions with confidence scores — the model wrapper from the previous chapter is what makes this possible without exposing any ML internals to the caller.
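
To make the tip concrete, here is a toy sketch of such a wrapper. Everything in it is hypothetical (the `ToyWrapper` class, the keyword lookup that stands in for a model, and the fixed confidence values); the point is only that the caller passes raw text and receives a label with a score, while preprocessing stays internal.

```python
from dataclasses import asdict, dataclass


@dataclass
class Prediction:
    label: str         # predicted class
    confidence: float  # score in [0, 1]


class ToyWrapper:
    """Hypothetical wrapper: raw text in, label and confidence out."""

    def __init__(self, keyword_to_label: dict[str, str], default_label: str):
        self._keyword_to_label = keyword_to_label  # stand-in for a real model
        self._default_label = default_label

    def _preprocess(self, text: str) -> list[str]:
        # Internal step the caller never sees (here: lowercase and split).
        return text.lower().split()

    def predict(self, text: str) -> Prediction:
        for token in self._preprocess(text):
            if token in self._keyword_to_label:
                # Toy confidence values, chosen only for illustration.
                return Prediction(self._keyword_to_label[token], 0.9)
        return Prediction(self._default_label, 0.1)
```

An API layer can then return `asdict(wrapper.predict(text))` directly as its JSON response.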

Insee: FastAPI

At Insee, the NACE classification model is served via a Python REST API built with FastAPI. FastAPI was chosen for its performance, its native support for type-validated request/response schemas via Pydantic, and its automatic generation of interactive API documentation (Swagger UI) at /docs with zero configuration.

Loading the model. At startup, the production model is pulled directly from the MLflow Model Registry (fetching the artifact tagged Production) and stored in the application state. This means the model is loaded once and reused across all requests, keeping latency low and the API stateless with respect to any individual request.

The predict endpoint. A single POST /predict endpoint accepts a JSON payload containing the raw text to classify (and optionally categorical features) and returns the predicted NACE code together with its confidence score. The endpoint delegates entirely to the TTC wrapper’s .predict() method, keeping the API layer thin.

from contextlib import asynccontextmanager

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the Production model once at startup and keep it in the app state.
    app.state.model = mlflow.pyfunc.load_model("models:/nace-classifier/Production")
    yield

app = FastAPI(lifespan=lifespan)

@app.post("/predict")
def predict(payload: dict):
    # Build a one-row DataFrame from the JSON payload; the wrapper does the rest.
    df = pd.DataFrame([payload])
    return app.state.model.predict(df).to_dict(orient="records")[0]

The API documentation is available here and the source code here.

Statistics Austria: Plumber APIs on RStudio Connect

Since all models are implemented in R, we use the plumber package to expose them as RESTful APIs. Each trained model is wrapped in an API endpoint that handles incoming requests, calls the standardized .predict() function, and returns the results in a structured JSON format.

These APIs are deployed on RStudio Connect (rsconnect), which provides a managed environment for hosting, scaling, and securing the services. This setup allows other applications to interact with the models in a language-agnostic way, independent of the underlying R implementation.

Deployment

Developing the API locally is only the first step. Deploying it means making it reliably available to production consumers, often in environments that differ significantly from the development machine. Containerisation — packaging the application together with all its dependencies into a portable, self-contained image — is the standard answer to this challenge. A containerised API behaves identically regardless of where it runs: a laptop, a cloud cluster, or an air-gapped production server.

The general deployment pipeline follows three steps:

1. Build: a Dockerfile defines the environment and entry point; docker build produces an immutable image.
2. Distribute: the image is pushed to a container registry (Docker Hub, a private registry, etc.), making it addressable by any deployment target.
3. Deploy: the image is pulled and run on the target infrastructure, typically a Kubernetes cluster, with the serving configuration (replicas, resources, routing) managed separately from the application code.

Steps 1 and 2 should never be manual. A CI pipeline (e.g. GitHub Actions, GitLab CI) automates them: every merge to the main branch of the API repository triggers a workflow that builds the image and pushes it to the registry, tagged with the commit SHA or a semantic version. This guarantees that the registry always reflects the current state of the code, with no human intervention required.
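
As a rough sketch of what such a CI job does, the helper below assembles the two Docker invocations for a given commit. The function names and the registry and repository values in the usage note are hypothetical; only the `docker build -t` and `docker push` commands themselves are standard.

```python
def image_reference(registry: str, repository: str, commit_sha: str) -> str:
    """Return an image reference tagged with the commit SHA."""
    return f"{registry}/{repository}:{commit_sha}"


def build_and_push_commands(image: str, context_dir: str = ".") -> list[list[str]]:
    """Return the two docker invocations a CI job would run, in order."""
    return [
        ["docker", "build", "-t", image, context_dir],  # step 1: build
        ["docker", "push", image],                      # step 2: distribute
    ]
```

A CI runner would execute each command in turn, for example with `subprocess.run(cmd, check=True)`, after computing `image_reference("registry.example.org", "team/api", sha)` from the commit being built.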

Insee: Containerisation with Docker, GitOps with ArgoCD

Containerisation. The API is packaged as a Docker image defined by a Dockerfile that pins the Python runtime, installs dependencies, and sets the FastAPI entry point. The resulting image is pushed to Docker Hub, where it can be pulled by any deployment target. This makes the image the single distributable artefact: the same image is deployed to the SSPCloud for integration testing and handed to production teams for deployment on their secured, offline infrastructure.

Continuous Integration. The build-and-push steps are fully automated via a GitHub Actions workflow defined in the API repository. Every merge to the main branch triggers the workflow: it builds the Docker image and pushes it to Docker Hub tagged with the commit SHA. This means that the registry is always in sync with the codebase — no developer needs to remember to rebuild or push manually, and every image on the registry is traceable back to an exact commit.

GitOps with ArgoCD. Deploying to the SSPCloud Kubernetes cluster is managed with a GitOps approach using ArgoCD. In a GitOps workflow, a dedicated Git repository (the CD repository, available here) is the single source of truth for the desired state of the deployment: image tag, resource limits, replica count, and routing rules are all declared as versioned manifests. ArgoCD watches this repository and automatically reconciles the cluster state with what is declared there. Updating the deployed version is then as simple as opening a pull request to change the image tag — the rollout, and any rollback, goes through Git.
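
The "change the image tag" step can be illustrated with a small helper that rewrites the tag in a manifest. This is a sketch under assumptions: `set_image_tag` is a hypothetical function, the manifest excerpt is invented for the example, and in practice the edit is usually made by hand or by a dedicated tool in a pull request.

```python
import re


def set_image_tag(manifest: str, image_name: str, new_tag: str) -> str:
    """Replace the tag of `image_name` on every `image:` line of a manifest."""
    pattern = re.compile(
        rf"^([ \t]*(?:-[ \t]*)?image:[ \t]*{re.escape(image_name)}):\S+$",
        re.MULTILINE,
    )
    updated, count = pattern.subn(rf"\g<1>:{new_tag}", manifest)
    if count == 0:
        raise ValueError(f"no image line found for {image_name!r}")
    return updated


# Invented manifest excerpt, for illustration only.
MANIFEST = """\
spec:
  containers:
    - name: api
      image: example/api:abc123
"""
```

Calling `set_image_tag(MANIFEST, "example/api", "def456")` returns the same text with the image line pointing at the new tag; committing that change is what triggers the reconciliation.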

This separation between the application repository (the API code) and the CD repository (the deployment configuration) is a key MLOps practice: it keeps infrastructure changes auditable, reviewable, and reversible independently of code changes.

German Federal Statistical Office: automatic deployment through the Cloudera AI Model component

Development and production run in separate environments that are built from the same set-up, which makes deployment through CI/CD workflows (for instance with GitLab) straightforward.
The Cloudera AI Model component standardizes and simplifies the transition from experimentation to production. It packages a trained model (from experiments or the registry), defines how it should be served, deploys it as a live endpoint, monitors and versions it, and rolls it back if needed. Cloudera automatically builds a Docker container image that includes a base runtime (Python, R, etc.), the required dependencies, the model, and the serving code (wrapper); this image is then deployed on a Kubernetes cluster (managed by Cloudera, but possibly hosted on private servers). At the end of the process, the model is exposed as a REST API endpoint, using the wrapper function explained in chapter 2.

Statistics Austria: Standardized Model Query Interfaces

To facilitate interaction with the deployed APIs, we developed an R package that standardizes the process of querying the models. This package acts as a client interface between downstream applications and the deployed endpoints.

Input data, along with optional parameters (e.g. model configuration or code selections), are serialized into JSON format and sent to the API using the httr package. The API response is then parsed and transformed into a structured output table, which can be further processed or stored locally in various formats (e.g. CSV, database tables). This approach ensures a clear separation between model serving and model consumption.
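
The client pattern described above (serialize, send, parse into a table) is language-independent. The sketch below transposes it to Python for illustration; it is not the R package itself, and the function names, the `data`/`params` payload keys, and the response shape (a JSON array of flat objects) are assumptions.

```python
import csv
import io
import json
import urllib.request
from typing import Optional


def build_request(
    url: str, records: list[dict], params: Optional[dict] = None
) -> urllib.request.Request:
    """Serialise input records and optional parameters into a JSON POST request."""
    body = json.dumps({"data": records, "params": params or {}}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def response_to_csv(response_body: bytes) -> str:
    """Parse a JSON array of flat objects and render it as CSV text."""
    rows = json.loads(response_body)
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Sending the request is then a matter of `urllib.request.urlopen(request).read()`, whose result can be passed straight to `response_to_csv`.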

From development to production: managing environment gaps

Development and production environments are rarely identical — and this mismatch is one of the most common sources of deployment failures. Development environments are typically powerful (GPUs, large RAM), open (internet access, permissive package installation), and instrumented for experimentation. Production environments are often the opposite: constrained hardware, air-gapped networks, locked-down security policies, and strict constraints on what software can run.

Containerisation partially solves the technical side of this problem: by packaging the application with all its dependencies into a Docker image, environment parity is enforced at the infrastructure level. However, containerisation does not eliminate the organisational dimension.

The organisational challenge is at least as important as the technical one. Two questions must have clear answers before any model reaches production:

Who is responsible for validating that the model behaves correctly under production constraints? This is typically not the same person who designs or trains the model.
Is there a staging environment that faithfully mirrors production — same hardware constraints, same network isolation, same security rules — where the model can be tested end-to-end before going live?

Without answers to both, the gap between a working experiment and a reliable production service is bridged by luck rather than process.

Insee: Isoprod qualification platform and dedicated MLOps role

At Insee, the dev-to-prod gap has been addressed through two complementary investments — one technical, one organisational.

Model qualification platform. A dedicated isoprod platform replicates the exact conditions of the production environment: same network isolation, same hardware constraints, same security policies. Before any model can be promoted to production, it must pass a qualification phase on this platform. This includes validating inference latency under realistic load (load testing) and confirming that the containerised model behaves identically to what was observed during development. This step is what gives production teams the confidence to accept a new model version.
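
A latency check of this kind can be sketched with the standard library alone. The code below is an illustration, not the tooling used on the isoprod platform: `measure_latencies` and `summarise` are hypothetical names, and `call` stands for whatever function issues one request to the deployed model.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_latencies(
    call: Callable[[], object], n_requests: int, concurrency: int
) -> list[float]:
    """Run `call` n_requests times across `concurrency` threads; return seconds per call."""

    def timed_call(_: int) -> float:
        start = time.perf_counter()
        call()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(n_requests)))


def summarise(latencies: list[float]) -> dict[str, float]:
    """Return the 50th and 95th percentile latency (needs at least two samples)."""
    cuts = statistics.quantiles(latencies, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94]}
```

A qualification run would compare the reported percentiles against a latency budget agreed with the production team and fail the promotion if the budget is exceeded.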

Dedicated MLOps data scientist. A data scientist position has been created within the business registry team with an explicit mandate to bridge the innovation and production sides. This person is responsible for regularly retraining the model, running the qualification process, and owning the model in production. The innovation team, by contrast, focuses exclusively on research and model improvements. This clear separation of concerns avoids the common anti-pattern where a research team hands off a model to a production team with no shared ownership — and no one takes responsibility when something breaks.

The lesson is broadly applicable: closing the dev-to-prod gap requires both a place to test under production conditions and a person accountable for the transition.

Statistics Austria: Synchronized R Environments Across Deployment

To ensure consistency between development and production, package versions are explicitly defined and synchronized across environments using a requirements.txt file that lists each dependency together with its version. This guarantees that the same dependencies used during model development are also available on the rsconnect instance.
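
The underlying check is a comparison between pinned and installed versions. The sketch below shows that comparison in Python for illustration only; `find_version_mismatches` is a hypothetical helper, and how the two mappings are obtained (parsing the pinned file, querying the target environment) is deliberately left out.

```python
from typing import Optional


def find_version_mismatches(
    pinned: dict[str, str], installed: dict[str, str]
) -> dict[str, tuple[str, Optional[str]]]:
    """Map each drifting package to (pinned version, installed version or None)."""
    mismatches: dict[str, tuple[str, Optional[str]]] = {}
    for package, wanted in pinned.items():
        found = installed.get(package)  # None when the package is missing
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches
```

An empty result means the target environment matches the pinned versions; any entry points at a package to install or align before deployment.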

By aligning environments in this way, we reduce the risk of incompatibilities and ensure that models behave identically when deployed. Combined with versioned code and model artifacts, this setup supports a smooth transition from development to production and contributes to the overall reproducibility and stability of the system.