RAG pipeline approach - Query an LLM for automatic coding

Foreword

This second tutorial builds on the vector database created in part 1. It completes the RAG pipeline by loading labelled data, retrieving context from Qdrant for each query, and generating the final answer with an LLM. You will learn how to automatically classify free-text activity labels into the NACE 2.1 nomenclature, and then evaluate the quality of the pipeline at each stage.

RAG pipeline overview

The automated coding workflow is broken down into five steps:

Raw text label
      │
      ▼
 [1] Embedding       ← transform the label into a dense vector
      │
      ▼
 [2] Retrieval       ← find the k nearest neighbours in Qdrant
      │
      ▼
 [3] Augmentation    ← inject retrieved candidates into a structured prompt
      │
      ▼
 [4] Generation      ← LLM inference with a JSON-constrained output
      │
      ▼
 [5] Evaluation      ← compare predictions against reference codes

The RAG acronym specifically refers to the combination of steps 2 (Retrieval) and 4 (Generation): instead of asking the LLM to code from memory — which would be unreliable for fine-grained nomenclatures — we supply it with the most semantically relevant candidates from the vector store.

RAG workflow: Part 2 runs the inference pipeline on new activity labels.

Why do we need the generation step?

Retrieval alone is not enough. Here’s why we need generation:

  • Raw retrieval returns documents, not answers: Qdrant gives us relevant chunks, but a user wants a clear, coherent response tailored to their question.
  • Synthesis across multiple documents: The LLM can read 5+ retrieved chunks and synthesize a unified answer, not just return the closest match.
  • Natural language output: Generation models (GPT-like) produce human-readable text that directly addresses the user’s intent.
  • Reasoning and inference: The LLM can combine retrieved facts with its own reasoning to make predictions or classifications.
  • Evaluation: In our case, we compare the LLM’s predicted labels against ground truth to measure RAG quality.

In this notebook, you will:

  • Load and inspect annotated examples
  • Connect to Qdrant and verify collections are available
  • Build a retrieval + generation function that queries Qdrant and sends retrieved context to an LLM
  • Compare predicted labels with ground truth for evaluation
  • Understand how retrieval quality affects final generation quality

Prerequisites

  • Completion of tutorial 1 (creation of the nace-collection Qdrant collection)
  • Access to the following services: LLM Lab, Qdrant
  • Python libraries: openai, qdrant-client, duckdb, pandas, tqdm, matplotlib

Connections and global parameters

# Models
EMB_MODEL_NAME = "qwen3-embedding:8b"   # Embedding model
GEN_MODEL_NAME = "gpt-oss:20b"          # Generative model

# Qdrant
COLLECTION_NAME = "nace-collection"
RETRIEVER_LIMIT = 5    # Number of candidates returned by the vector search

# Generation
TEMPERATURE = 0.1      # Low temperature → more deterministic, reproducible outputs

# Evaluation
SAMPLE_SIZE = 50       # Number of activities to evaluate (increase for more robust results)
Note

Why a low temperature? Temperature controls the degree of randomness in generation. For a closed-list classification task, we want stable and reproducible answers: a value between 0.0 and 0.2 is recommended. Higher values are better suited to creative tasks.
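
The rest of this tutorial assumes two pre-configured clients: client_llmlab (an OpenAI-compatible client pointing at LLM Lab) and client_qdrant (a Qdrant client for the collection built in tutorial 1). If you need to recreate them, here is a minimal sketch; the base URL and environment variable names below are placeholders for your own platform settings:

import os
from openai import OpenAI
from qdrant_client import QdrantClient

# OpenAI-compatible client for embeddings and chat completions (placeholder URL / keys)
client_llmlab = OpenAI(
    base_url=os.environ["LLM_LAB_BASE_URL"],
    api_key=os.environ["LLM_LAB_API_KEY"],
)

# Qdrant client pointing at the instance used in tutorial 1 (placeholder URL / keys)
client_qdrant = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
)

# Check that the collection created in tutorial 1 is available
print(client_qdrant.collection_exists(COLLECTION_NAME))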


First attempt on a single activity label

Before running the pipeline on the full dataset, test each component step by step.

Tip Exercise 1: Test the pipeline on a single example

Question 1 — Embed an activity label

Embedding is the process of transforming a text into a high-dimensional numerical vector that captures its semantics. Two semantically similar texts will have vectors that are close to each other in the vector space. This property is what enables nearest-neighbour search.

Use client_llmlab.embeddings.create to embed the activity label below.

activity = "Installation, maintenance and repair of residential air conditioning systems for private customers"
print(f"Activity to code: {activity}")
Activity to code: Installation, maintenance and repair of residential air conditioning systems for private customers
Click to see the answer
response = client_llmlab.embeddings.create(
    model=EMB_MODEL_NAME,
    input=activity
)

search_embedding = response.data[0].embedding

print(f"Vector created of length: {len(search_embedding)}")
Note

Vector dimension: the vector size depends on the embedding model and is fixed at collection creation time (tutorial 1). You must use the same embedding model at every stage of the pipeline.
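
To double-check, you can compare the embedding length with the vector size configured on the collection. A quick sketch, assuming the collection uses a single unnamed vector (attribute paths may differ slightly across qdrant-client versions):

collection_info = client_qdrant.get_collection(COLLECTION_NAME)
print("Collection vector size:", collection_info.config.params.vectors.size)
print("Embedding length      :", len(search_embedding))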


Question 2 — Retrieval: search for NACE candidates

Retrieval consists of querying the Qdrant vector store to find the k NACE codes whose descriptions are semantically closest to the label being coded.

Use client_qdrant.query_points with limit=RETRIEVER_LIMIT and inspect the results.

Click to see the answer
points = client_qdrant.query_points(
    collection_name=COLLECTION_NAME,
    query=search_embedding,
    limit=RETRIEVER_LIMIT,
)

descriptions_retrieved = []
codes_retrieved = []

for point in points.model_dump()["points"]:
    descriptions_retrieved.append(point["payload"]["text"])
    codes_retrieved.append(point["payload"]["code"])

print(
    f"✓ Vector search completed: {len(descriptions_retrieved)} codes and descriptions retrieved\n"
)
print("Check the first code retrieved ==============\n")
print(descriptions_retrieved[0])
Note

Retriever accuracy: the retrieval step is critical. If the correct code is not among the k returned candidates, the LLM cannot select it. You will compute this metric in Exercise 3 below.
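
Before moving on, it is worth inspecting the similarity scores attached to the retrieved points. A small sketch (the score range depends on the distance metric chosen at collection creation):

for point in points.points:
    print(f"{point.payload['code']:<8} score = {point.score:.3f}")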


Question 3 — Prompt construction

The quality of the prompt directly affects the LLM’s output. A RAG prompt typically has two parts:

  • The system prompt: defines the model’s general role and behaviour.
  • The user prompt: contains request-specific information (label to code, candidates, output format instructions).

Constraining the output to JSON format is a best practice: it simplifies automatic parsing and reduces format hallucinations.

SYSTEM_PROMPT = """
You are an expert in the Statistical Classification of Economic Activities in the European Community (NACE).

Your task is to classify the main economic activity of a company into the NACE 2.1 classification system, based strictly on:
- the textual description of the company's activity.
- a restricted list of candidate NACE codes and their explanatory notes

You must follow the instructions rigorously and return only the requested JSON output.
"""

USER_PROMPT_TEMPLATE = """

# Company main activity:
{activity}

# Candidate NACE codes and their explanatory notes:
{proposed_nace_descriptions}

========

# Instructions:
1. You MUST select the NACE code strictly from the provided candidate list. No external codes are allowed.
2. If multiple activities are mentioned, ONLY consider the first one.
3. If the description is unclear or insufficient to determine a classification, return:
- "nace2025": null
- "codable": false
4. The selected code MUST belong to this list:
[{proposed_nace_codes}]
5. Provide a realistic confidence score between 0.00 and 1.00 (two decimal places max).
6. Your response MUST be a valid JSON object following the schema below.

- "codable": <boolean>,
- "nace2025": <string or null>,
- "confidence": <float>

7. DO NOT include any explanation, reasoning, or additional text.

"""
Tip

Prompt engineering best practices for classification:

  • Constrain the candidate list (instruction 4): prevents the LLM from inventing a code outside the nomenclature.
  • Handle uncertainty explicitly (instruction 3): an ambiguous label should return codable: false rather than a random code.
  • Require a confidence score: useful for filtering unreliable predictions in production.
  • Forbid explanations (instruction 7): reduces the risk of malformed JSON output.

Question 4 — LLM inference

Compile the prompt with the retrieved candidates and run inference. The steps are:

  1. Format USER_PROMPT_TEMPLATE by injecting activity, descriptions_retrieved and codes_retrieved.
  2. Call client_llmlab.chat.completions.create with the system and user messages.
  3. Parse the response content with json.loads.
Tip

Constraining the output to JSON. The OpenAI-compatible API accepts a response_format parameter that forces the model to return valid JSON:

response_format={"type": "json_object"}

This eliminates free-text preambles and makes parsing reliable. The expected keys (nace2025, codable, confidence) are specified in the user prompt instructions.

Click to see the answer
import json

user_prompt = USER_PROMPT_TEMPLATE.format(
    activity=activity,
    proposed_nace_descriptions="## " + "\n\n## ".join(descriptions_retrieved),
    proposed_nace_codes=", ".join(codes_retrieved)
)

response = client_llmlab.chat.completions.create(
    model=GEN_MODEL_NAME,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt}
    ],
    temperature=TEMPERATURE,
    response_format={"type": "json_object"}
)

llm_response = json.loads(response.choices[0].message.content)
print(json.dumps(llm_response, indent=2))

Inference on multiple activities

Next, let’s use some synthetic data. We will:

  • run automatic classification on a sample of labelled activities,
  • build metrics to evaluate the quality of the coding process.

Load the data

import duckdb
import pandas as pd

con = duckdb.connect(database=":memory:")

con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

query_definition = f"""
SELECT *
FROM read_parquet(
  'https://minio.lab.sspcloud.fr/projet-formation/diffusion/funathon/2026/project2/generation_None_temp08.parquet'
)
USING SAMPLE {SAMPLE_SIZE}
"""

annotations = (
    con.sql(query_definition)
    .to_df()
    .to_dict(orient="records")
)
print(f"Dataset loaded: {len(annotations)} rows")
print(f"Keys: {list(annotations[0].keys())}")
annotations[:2]
Dataset loaded: 50 rows
Keys: ['code', 'name', 'label']
[{'code': '73.30',
  'name': 'Public relations and communication activities',
  'label': 'Corporate messaging and public perception'},
 {'code': '68.20',
  'name': 'Rental and operating of own or leased real estate',
  'label': 'Land rental for long-term purposes'}]
Note

Dataset structure: each row contains a free-text activity label (label), its reference NACE 2.1 code (code) and the title of that code (name). These annotations were generated by an agentic AI system and constitute synthetic data. As such, they are well-suited for testing a RAG pipeline on English-language activity labels, but no claim is made about whether they are representative of real-world annotation data.

Utility function: a full pipeline call

Before looping, we encapsulate steps 1–4 into a reusable function.

def run_rag_pipeline(activity: str) -> dict:
    """
    Run the full RAG pipeline for a single activity label.

    Parameters
    ----------
    activity : str
        Free-text economic activity label to be coded.

    Returns
    -------
    dict with keys:
        - nace2025 (str | None) : predicted NACE code
        - codable (bool)        : True if the label could be coded
        - confidence (float)    : confidence score (0–1)
        - retrieved_codes (list): candidates returned by the retriever
    """
    # --- Step 1: Embedding ---
    emb_response = client_llmlab.embeddings.create(
        model=EMB_MODEL_NAME,
        input=activity
    )
    embedding = emb_response.data[0].embedding

    # --- Step 2: Retrieval ---
    points = client_qdrant.query_points(
        collection_name=COLLECTION_NAME,
        query=embedding,
        limit=RETRIEVER_LIMIT,
    )
    descriptions_retrieved = []
    codes_retrieved = []
    for point in points.model_dump()["points"]:
        descriptions_retrieved.append(point["payload"]["text"])
        codes_retrieved.append(point["payload"]["code"])

    # --- Step 3: Prompt construction ---
    user_prompt = USER_PROMPT_TEMPLATE.format(
        activity=activity,
        proposed_nace_descriptions="## " + "\n\n## ".join(descriptions_retrieved),
        proposed_nace_codes=", ".join(codes_retrieved)
    )

    # --- Step 4: LLM inference ---
    gen_response = client_llmlab.chat.completions.create(
        model=GEN_MODEL_NAME,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt}
        ],
        temperature=TEMPERATURE,
        response_format={"type": "json_object"}
    )

    result = json.loads(gen_response.choices[0].message.content)
    # Keep retrieved candidates for retriever evaluation
    result["retrieved_codes"] = codes_retrieved

    return result
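
A quick smoke test on the single label from Exercise 1 confirms that the function runs end to end (the exact output will vary from run to run):

test_result = run_rag_pipeline(activity)
print(json.dumps(test_result, indent=2))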

Inference loop

Tip Exercise 2: Code all activities in the dataset

Apply run_rag_pipeline to every row of the annotations list. Store the results in a DataFrame called results.

Note

The default sample size is SAMPLE_SIZE = 50, which keeps inference fast and avoids consuming too many resources. Each call to the pipeline involves two API requests (embedding + generation), so larger samples will take proportionally longer. Feel free to increase SAMPLE_SIZE in the global parameters to get more robust evaluation metrics.

Click to see the answer
from tqdm import tqdm

records = []

for row in tqdm(annotations, total=len(annotations), desc="Coding"):
    activity_label = row["label"]
    true_code      = row["code"]

    try:
        pred = run_rag_pipeline(activity_label)
    except Exception as e:
        pred = {
            "nace2025":       None,
            "codable":        False,
            "confidence":     0.0,
            "retrieved_codes": []
        }
        print(f"⚠ Error for '{activity_label[:60]}...': {e}")

    records.append({
        "activity":        activity_label,
        "true_code":       true_code,
        "pred_code":       pred.get("nace2025"),
        "codable":         pred.get("codable", False),
        "confidence":      pred.get("confidence", 0.0),
        "retrieved_codes": pred.get("retrieved_codes", []),
    })

results = pd.DataFrame(records)
print(f"\n✓ Inference complete: {len(results)} activities processed")
results.head()
Note

Robustness: LLM calls can occasionally fail (timeout, malformed JSON). The try/except block ensures the loop does not stop and that every error is logged.
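
If transient failures (timeouts, rate limits) become frequent, a simple retry wrapper helps. A minimal sketch, assuming a fixed number of retries with a short pause between attempts:

import time

def run_rag_pipeline_with_retries(activity: str, n_retries: int = 2, delay: float = 2.0) -> dict:
    """Retry the pipeline a few times before giving up."""
    for attempt in range(n_retries + 1):
        try:
            return run_rag_pipeline(activity)
        except Exception:
            if attempt == n_retries:
                raise
            time.sleep(delay)  # short pause before the next attempt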


Pipeline evaluation

To understand where the pipeline succeeds or fails, we evaluate each stage separately. Three metrics are useful here.

              ┌─────────────────────────────────────────────────┐
              │             EVALUATION METRICS                  │
              ├──────────────────┬──────────────────────────────┤
              │ Retriever@k      │ Is the correct code among    │
              │ accuracy         │ the k retrieved candidates?  │
              ├──────────────────┼──────────────────────────────┤
              │ LLM accuracy     │ When the retriever succeeded,│
              │ (conditional)    │ does the LLM pick the right  │
              │                  │ code?                        │
              ├──────────────────┼──────────────────────────────┤
              │ Pipeline         │ Is the final predicted code  │
              │ accuracy         │ correct? (end-to-end)        │
              └──────────────────┴──────────────────────────────┘

These three metrics are related by:

\[\text{Pipeline accuracy} = \text{Retriever@k} \times \text{LLM accuracy (conditional)}\]

This decomposition helps pinpoint whether errors come from the retriever or the LLM, and guides improvement efforts.
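
For example, with the sample run shown below (Retriever@5 = 35/50 = 70% and conditional LLM accuracy = 30/35 ≈ 85.7%), the end-to-end accuracy is 0.70 × 0.857 ≈ 60%, i.e. 30 of the 50 labels are coded correctly.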

Tip Exercise 3: Compute evaluation metrics

Prepare evaluation columns

Before computing any metric, add three boolean columns to results:

  • retriever_hit: whether the true NACE code is among the k candidates returned by Qdrant
  • pipeline_correct: whether the predicted code matches the true code
  • llm_correct_given_retriever: same as pipeline_correct, but set to None when the retriever did not return the true code
Click to see the answer
# Is the true code among the retriever's candidates?
results["retriever_hit"] = results.apply(
    lambda row: row["true_code"] in row["retrieved_codes"], axis=1
)

# Is the predicted code correct?
results["pipeline_correct"] = results["pred_code"] == results["true_code"]

# Did the LLM pick the right code, given that the retriever found it?
results["llm_correct_given_retriever"] = results.apply(
    lambda row: row["pipeline_correct"] if row["retriever_hit"] else None,
    axis=1
)
   activity                                      true_code pred_code retriever_hit pipeline_correct
0  Corporate messaging and public perception         73.30     73.30          True             True
1  Land rental for long-term purposes                68.20      68.2         False            False
2  Electric vehicle supply equipment setup           43.21     43.21          True             True
3  Corporate headquarters administration             70.10     70.10          True             True
4  Real estate asset operation and maintenance       68.20     68.20          True             True

Question 1 — Retriever accuracy (Retriever@k)

Retriever@k accuracy measures the proportion of cases where the reference code is present among the k candidates returned by Qdrant. It is the theoretical ceiling of the pipeline: if the retriever misses the correct code, the LLM cannot recover it.

\[\text{Retriever@k} = \frac{N_\text{hit}}{N}\]

where \(N_\text{hit}\) is the number of activities for which the true code is among the \(k\) retrieved candidates, and \(N\) is the total number of activities.

Using the retriever_hit column, compute the proportion of rows where the retriever returned the true code. Store the result in a variable called retriever_accuracy.

retriever_hit
Retriever found true code    35
Retriever missed             15
Name: count, dtype: int64
Click to see the answer
retriever_accuracy = results["retriever_hit"].mean()
print(f"Retriever@{RETRIEVER_LIMIT} accuracy: {retriever_accuracy:.1%}")
print(f"  → {results['retriever_hit'].sum()} / {len(results)} correctly retrieved")

Question 2 — Conditional LLM accuracy

Conditional LLM accuracy measures how often the LLM picks the right code when the retriever already returned it as a candidate. This metric isolates the LLM’s own contribution to the pipeline.

\[\text{LLM accuracy} = \frac{N_{\text{correct} \mid \text{hit}}}{N_\text{hit}}\]

where \(N_{\text{correct} \mid \text{hit}}\) is the number of correct predictions in the subset where the retriever succeeded.

Filter results to keep only the rows where retriever_hit is True, then compute the proportion of pipeline_correct in that subset. Store the result in llm_accuracy.

pipeline_correct
LLM correct    30
LLM wrong       5
Name: count, dtype: int64
Click to see the answer
retriever_success = results[results["retriever_hit"]]
llm_accuracy = retriever_success["pipeline_correct"].mean()

print(f"LLM accuracy (conditional on retriever): {llm_accuracy:.1%}")
print(f"  → {retriever_success['pipeline_correct'].sum()} / {len(retriever_success)} correctly coded by the LLM")

Question 3 — End-to-end pipeline accuracy

End-to-end pipeline accuracy is the proportion of activity labels that are correctly coded, regardless of which component failed.

\[\text{Pipeline accuracy} = \text{Retriever@k} \times \text{LLM accuracy}\]

Compute the overall accuracy from the pipeline_correct column and store it in pipeline_accuracy. Then verify empirically that the multiplicative relationship with retriever_accuracy and llm_accuracy holds.

pipeline_correct
Correctly coded      30
Incorrectly coded    20
Name: count, dtype: int64
Click to see the answer
pipeline_accuracy = results["pipeline_correct"].mean()

print(f"Pipeline accuracy (end-to-end)          : {pipeline_accuracy:.1%}")
print(f"  → {results['pipeline_correct'].sum()} / {len(results)} correctly coded")
print()
print(f"Cross-check: Retriever@k × LLM = {retriever_accuracy:.3f} × {llm_accuracy:.3f} = {retriever_accuracy * llm_accuracy:.1%}")

Question 4 — Summary dashboard and error decomposition

Count how many errors come from the retriever (cases where retriever_hit is False) and how many come from the LLM (cases where retriever_hit is True but pipeline_correct is False). Then produce a summary table of all metrics.

Click to see the answer
n_total          = len(results)
n_retriever_miss = (~results["retriever_hit"]).sum()
n_llm_miss       = (results["retriever_hit"] & ~results["pipeline_correct"]).sum()
n_correct        = results["pipeline_correct"].sum()

print("=" * 52)
print("      DASHBOARD — RAG PIPELINE NACE 2.1")
print("=" * 52)
print(f"  Activities processed        : {n_total:>6}")
print(f"  Correctly coded             : {n_correct:>6}  ({pipeline_accuracy:.1%})")
print()
print(f"  Retriever@{RETRIEVER_LIMIT} accuracy        : {retriever_accuracy:>6.1%}")
print(f"  LLM accuracy (conditional)  : {llm_accuracy:>6.1%}")
print(f"  Pipeline accuracy           : {pipeline_accuracy:>6.1%}")
print()
print(f"  Retriever errors            : {n_retriever_miss:>6}  ({n_retriever_miss/n_total:.1%})")
print(f"  LLM errors                  : {n_llm_miss:>6}  ({n_llm_miss/n_total:.1%})")
print("=" * 52)
Note

How to interpret the error decomposition:

  • If retriever errors dominate → improve the embedding model, increase k, or enrich NACE descriptions in the vector store.
  • If LLM errors dominate → refine the prompt, switch to a more capable generative model, or lower the temperature further.
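
A quick way to start this diagnosis is to look at the misses themselves. A small sketch:

# Labels the retriever failed on (the true code was not among the candidates)
retriever_misses = results[~results["retriever_hit"]]
print(retriever_misses[["activity", "true_code", "retrieved_codes"]].head())

# Labels where the retriever succeeded but the LLM picked another candidate
llm_misses = results[results["retriever_hit"] & ~results["pipeline_correct"]]
print(llm_misses[["activity", "true_code", "pred_code"]].head())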
Note

Not all errors are equal. A wrong prediction can mean very different things. Some errors are completely off — predicting a manufacturing code for a services activity. Others are near-misses: the predicted code is a parent (e.g. 47 instead of 47.11) or a sibling at the same level of the hierarchy. Near-misses may be acceptable in practice, depending on how the coded data will be used. A hierarchical accuracy metric — counting a prediction as correct if it matches up to a certain depth — would capture this nuance.
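
A minimal sketch of such a metric, assuming codes of the form "47.11" where the first two characters identify the NACE division:

# Division-level (2-digit) accuracy: a prediction counts as correct if it
# matches the true code on the first two characters.
division_correct = (
    results["pred_code"].fillna("").str[:2] == results["true_code"].str[:2]
)
print(f"Division-level accuracy: {division_correct.mean():.1%}")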

Important

On the optimism of these results. The evaluation dataset is synthetic: activity labels were generated by an AI system at low temperature, producing clean and unambiguous descriptions that are easier to classify than real-world data. Actual labels are often shorter, noisier, or ambiguous. The accuracy figures obtained here are therefore optimistic and should not be taken as representative of production performance.

Confidence score analysis

The confidence score produced by the LLM can serve as a quality signal to filter unreliable predictions in production, at the cost of reduced coverage.

Click to see the answer
from plotnine import (
    ggplot, aes,
    geom_boxplot, geom_line, geom_point,
    scale_color_manual, scale_linetype_manual,
    labs, theme_minimal,
)
import pandas as pd

# --- Plot 1: confidence distribution by correctness ---
results_plot = results.assign(
    correctness=results["pipeline_correct"].map({False: "Incorrect", True: "Correct"})
)

p1 = (
    ggplot(results_plot, aes(x="correctness", y="confidence"))
    + geom_boxplot()
    + labs(
        title="Confidence distribution by pipeline correctness",
        x="Prediction correct",
        y="Confidence score",
    )
    + theme_minimal()
)

# --- Plot 2: precision and coverage vs. confidence threshold ---
thresholds = [i / 10 for i in range(1, 10)]
rows = []
for t in thresholds:
    subset = results[results["confidence"] >= t]
    if len(subset) > 0:
        rows += [
            {"threshold": t, "metric": "Precision", "value": subset["pipeline_correct"].mean()},
            {"threshold": t, "metric": "Coverage",  "value": len(subset) / len(results)},
        ]

df_thresh = pd.DataFrame(rows)

p2 = (
    ggplot(df_thresh, aes(x="threshold", y="value", color="metric", linetype="metric"))
    + geom_line()
    + geom_point()
    + scale_color_manual(values={"Precision": "steelblue", "Coverage": "coral"})
    + scale_linetype_manual(values={"Precision": "solid", "Coverage": "dashed"})
    + labs(
        title="Precision and coverage vs. confidence threshold",
        x="Confidence threshold",
        y="Value",
        color="",
        linetype="",
    )
    + theme_minimal()
)

from IPython.display import display
display(p1)
display(p2)

Note

Precision / coverage trade-off: raising the confidence threshold increases precision (the retained predictions are more likely to be correct) but reduces coverage (fewer labels are automatically coded, the rest requiring manual review). Choosing the right threshold depends on the operational constraints of your use case.
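
As an illustration, here is how a single confidence threshold could split the predictions into an automatically accepted set and a manual-review set (the 0.8 value is an arbitrary example, to be tuned against your own constraints):

THRESHOLD = 0.8  # arbitrary example value

auto_coded    = results[results["confidence"] >= THRESHOLD]
manual_review = results[results["confidence"] < THRESHOLD]

print(f"Auto-coded          : {len(auto_coded)} labels ({len(auto_coded)/len(results):.0%} coverage)")
print(f"Precision (auto set): {auto_coded['pipeline_correct'].mean():.1%}")
print(f"Manual review       : {len(manual_review)} labels")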