RAG pipeline approach - Query an LLM for automatic coding
Foreword
This second tutorial builds on the vector database created in part 1. It completes the RAG pipeline by loading labelled data, retrieving context from Qdrant, and generating final answers from a query. You will learn how to automatically classify free-text activity labels into the NACE 2.1 nomenclature, and then evaluate the quality of the pipeline at each stage.
RAG pipeline overview
The automated coding workflow is broken down into five steps:
Raw text label
│
▼
[1] Embedding ← transform the label into a dense vector
│
▼
[2] Retrieval ← find the k nearest neighbours in Qdrant
│
▼
[3] Augmentation ← inject retrieved candidates into a structured prompt
│
▼
[4] Generation ← LLM inference with a JSON-constrained output
│
▼
[5] Evaluation ← compare predictions against reference codes
The RAG acronym specifically refers to the combination of steps 2 (Retrieval) and 4 (Generation): instead of asking the LLM to code from memory — which would be unreliable for fine-grained nomenclatures — we supply it with the most semantically relevant candidates from the vector store.
RAG workflow: Part 2 runs the inference pipeline on new activity labels.
Why do we need the generation step?
Retrieval alone is not enough. Here’s why we need generation:
Raw retrieval returns documents, not answers: Qdrant gives us relevant chunks, but a user wants a clear, coherent response tailored to their question.
Synthesis across multiple documents: The LLM can read 5+ retrieved chunks and synthesize a unified answer, not just return the closest match.
Natural language output: Generation models (GPT-like) produce human-readable text that directly addresses the user’s intent.
Reasoning and inference: The LLM can combine retrieved facts with its own reasoning to make predictions or classifications.
Evaluation: In our case, we compare the LLM’s predicted labels against ground truth to measure RAG quality.
In this notebook, you will:
Load and inspect annotated examples
Connect to Qdrant and verify collections are available
Build a retrieval + generation function that queries Qdrant and sends retrieved context to an LLM
Compare predicted labels with ground truth for evaluation
Understand how retrieval quality affects final generation quality
Prerequisites
Completion of tutorial 1 (creation of the nace-collection Qdrant collection)
# Models
EMB_MODEL_NAME = "qwen3-embedding:8b"   # Embedding model
GEN_MODEL_NAME = "gpt-oss:20b"          # Generative model

# Qdrant
COLLECTION_NAME = "nace-collection"
RETRIEVER_LIMIT = 5    # Number of candidates returned by the vector search

# Generation
TEMPERATURE = 0.1      # Low temperature → more deterministic, reproducible outputs

# Evaluation
SAMPLE_SIZE = 50       # Number of activities to evaluate (increase for more robust results)
Note
Why a low temperature? Temperature controls the degree of randomness in generation. For a closed-list classification task, we want stable and reproducible answers: a value between 0.0 and 0.2 is recommended. Higher values are better suited to creative tasks.
First attempt on a single activity label
Before running the pipeline on the full dataset, test each component step by step.
Tip Exercise 1: Test the pipeline on a single example
Question 1 — Embed an activity label
Embedding is the process of transforming a text into a high-dimensional numerical vector that captures its semantics. Two semantically similar texts will have vectors that are close to each other in the vector space. This property is what enables nearest-neighbour search.
Use client_llmlab.embeddings.create to embed the activity label below.
activity = "Installation, maintenance and repair of residential air conditioning systems for private customers"
print(f"Activity to code: {activity}")
Activity to code: Installation, maintenance and repair of residential air conditioning systems for private customers
Click to see the answer
response = client_llmlab.embeddings.create(
    model=EMB_MODEL_NAME,
    input=activity
)
search_embedding = response.data[0].embedding
print(f"Vector created of length: {len(search_embedding)}")
Note
Vector dimension: the vector size depends on the embedding model and is fixed at collection creation time (tutorial 1). You must use the same embedding model at every stage of the pipeline.
Question 2 — Retrieval: search for NACE candidates
Retrieval consists of querying the Qdrant vector store to find the k NACE codes whose descriptions are semantically closest to the label being coded.
Use client_qdrant.query_points with limit=RETRIEVER_LIMIT and inspect the results.
Click to see the answer
points = client_qdrant.query_points(
    collection_name=COLLECTION_NAME,
    query=search_embedding,
    limit=RETRIEVER_LIMIT,
)

descriptions_retrieved = []
codes_retrieved = []
for point in points.model_dump()["points"]:
    descriptions_retrieved.append(point["payload"]["text"])
    codes_retrieved.append(point["payload"]["code"])

print(f"✓ Vector search completed: {len(descriptions_retrieved)} codes and descriptions retrieved\n")
print("Check the first code retrieved ==============\n")
print(descriptions_retrieved[0])
Note
Retriever accuracy: the retrieval step is critical. If the correct code is not among the k returned candidates, the LLM cannot select it. You will compute this metric in Exercise 3 below.
Question 3 — Prompt construction
The quality of the prompt directly affects the LLM’s output. A RAG prompt typically has two parts:
The system prompt: defines the model’s general role and behaviour.
The user prompt: contains request-specific information (label to code, candidates, output format instructions).
Constraining the output to JSON format is a best practice: it simplifies automatic parsing and reduces format hallucinations.
SYSTEM_PROMPT = """You are an expert in the Statistical Classification of Economic Activities in the European Community (NACE).
Your task is to classify the main economic activity of a company into the NACE 2.1 classification system, based strictly on:
- the textual description of the company's activity,
- a restricted list of candidate NACE codes and their explanatory notes.
You must follow the instructions rigorously and return only the requested JSON output."""

USER_PROMPT_TEMPLATE = """# Company main activity:
{activity}

# Candidate NACE codes and their explanatory notes:
{proposed_nace_descriptions}

========

# Instructions:
1. You MUST select the NACE code strictly from the provided candidate list. No external codes are allowed.
2. If multiple activities are mentioned, ONLY consider the first one.
3. If the description is unclear or insufficient to determine a classification, return:
- "nace2025": null
- "codable": false
4. The selected code MUST belong to this list:
[{proposed_nace_codes}]
5. Provide a realistic confidence score between 0.00 and 1.00 (two decimal places max).
6. Your response MUST be a valid JSON object following the schema below.
- "codable": <boolean>,
- "nace2025": <string or null>,
- "confidence": <float>
7. DO NOT include any explanation, reasoning, or additional text."""
Tip
Prompt engineering best practices for classification:
Constrain the candidate list (instruction 4): prevents the LLM from inventing a code outside the nomenclature.
Handle uncertainty explicitly (instruction 3): an ambiguous label should return codable: false rather than a random code.
Require a confidence score: useful for filtering unreliable predictions in production.
Forbid explanations (instruction 7): reduces the risk of malformed JSON output.
Question 4 — LLM inference
Compile the prompt with the retrieved candidates and run inference. The steps are:
Format USER_PROMPT_TEMPLATE by injecting activity, descriptions_retrieved and codes_retrieved.
Call the generative model (GEN_MODEL_NAME) with SYSTEM_PROMPT and the formatted user prompt, constraining the output to JSON.
Parse the JSON response to obtain the predicted code, the codable flag and the confidence score (a sketch is given below).
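One possible answer, mirroring the inference step of the run_rag_pipeline utility function defined further down (it assumes the json module has been imported):

import json

# 1. Inject the activity and the retrieved candidates into the user prompt
user_prompt = USER_PROMPT_TEMPLATE.format(
    activity=activity,
    proposed_nace_descriptions="## " + "\n\n## ".join(descriptions_retrieved),
    proposed_nace_codes=", ".join(codes_retrieved),
)

# 2. Run the generation with a JSON-constrained output
gen_response = client_llmlab.chat.completions.create(
    model=GEN_MODEL_NAME,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ],
    temperature=TEMPERATURE,
    response_format={"type": "json_object"},
)

# 3. Parse the JSON answer
result = json.loads(gen_response.choices[0].message.content)
print(result)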
The annotated examples used in the rest of this notebook are stored in a list called annotations; here are its first two records:

[{'code': '73.30',
  'name': 'Public relations and communication activities',
  'label': 'Corporate messaging and public perception'},
 {'code': '68.20',
  'name': 'Rental and operating of own or leased real estate',
  'label': 'Land rental for long-term purposes'}]
Note
Dataset structure: Each row contains a free-text activity label (activity) and its reference NACE 2.1 code (nace_code). These annotations were generated by an agentic AI system and constitute synthetic data. As such, they are well-suited for testing a RAG pipeline on English-language activity labels, but no claim is made about whether they are representative of real-world annotation data.
Utility function: a full pipeline call
Before looping, we encapsulate steps 1–4 into a reusable function.
def run_rag_pipeline(activity: str) -> dict:
    """
    Run the full RAG pipeline for a single activity label.

    Parameters
    ----------
    activity : str
        Free-text economic activity label to be coded.

    Returns
    -------
    dict with keys:
        - nace2025 (str | None) : predicted NACE code
        - codable (bool)        : True if the label could be coded
        - confidence (float)    : confidence score (0–1)
        - retrieved_codes (list): candidates returned by the retriever
    """
    # --- Step 1: Embedding ---
    emb_response = client_llmlab.embeddings.create(
        model=EMB_MODEL_NAME,
        input=activity
    )
    embedding = emb_response.data[0].embedding

    # --- Step 2: Retrieval ---
    points = client_qdrant.query_points(
        collection_name=COLLECTION_NAME,
        query=embedding,
        limit=RETRIEVER_LIMIT,
    )
    descriptions_retrieved = []
    codes_retrieved = []
    for point in points.model_dump()["points"]:
        descriptions_retrieved.append(point["payload"]["text"])
        codes_retrieved.append(point["payload"]["code"])

    # --- Step 3: Prompt construction ---
    user_prompt = USER_PROMPT_TEMPLATE.format(
        activity=activity,
        proposed_nace_descriptions="## " + "\n\n## ".join(descriptions_retrieved),
        proposed_nace_codes=", ".join(codes_retrieved)
    )

    # --- Step 4: LLM inference ---
    gen_response = client_llmlab.chat.completions.create(
        model=GEN_MODEL_NAME,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt}
        ],
        temperature=TEMPERATURE,
        response_format={"type": "json_object"}
    )
    result = json.loads(gen_response.choices[0].message.content)

    # Keep retrieved candidates for retriever evaluation
    result["retrieved_codes"] = codes_retrieved
    return result
Inference loop
Tip Exercise 2: Code all activities in the dataset
Apply run_rag_pipeline to every row of the annotations list. Store the results in a DataFrame called results.
Note
The default sample size is SAMPLE_SIZE = 50, which keeps inference fast and avoids consuming too many resources. Each call to the pipeline involves two API requests (embedding + generation), so larger samples will take proportionally longer. Feel free to increase SAMPLE_SIZE in the global parameters to get more robust evaluation metrics.
Robustness: LLM calls can occasionally fail (timeout, malformed JSON). The try/except block ensures the loop does not stop and that every error is logged.
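A possible answer, shown as a sketch: it assumes each record in annotations carries the free-text label under the key 'label' and the reference code under 'code', as in the sample records shown earlier (adapt the key names if your dataset uses 'activity' / 'nace_code' instead).

import pandas as pd

rows = []
for example in annotations[:SAMPLE_SIZE]:
    activity_label = example["label"]   # free-text activity label (key name assumed)
    true_code = example["code"]         # reference NACE code (key name assumed)
    try:
        prediction = run_rag_pipeline(activity_label)
        rows.append({
            "activity": activity_label,
            "true_code": true_code,
            "pred_code": prediction.get("nace2025"),
            "codable": prediction.get("codable"),
            "confidence": prediction.get("confidence"),
            "retrieved_codes": prediction.get("retrieved_codes", []),
        })
    except Exception as err:
        # Timeouts or malformed JSON should not stop the loop
        print(f"Error on '{activity_label}': {err}")

results = pd.DataFrame(rows)
print(f"{len(results)} activities coded")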
Pipeline evaluation
To understand where the pipeline succeeds or fails, we evaluate each stage separately. Three metrics are useful here.
┌─────────────────────────────────────────────────┐
│ EVALUATION METRICS │
├──────────────────┬──────────────────────────────┤
│ Retriever@k │ Is the correct code among │
│ accuracy │ the k retrieved candidates? │
├──────────────────┼──────────────────────────────┤
│ LLM accuracy │ When the retriever succeeded,│
│ (conditional) │ does the LLM pick the right │
│ │ code? │
├──────────────────┼──────────────────────────────┤
│ Pipeline │ Is the final predicted code │
│ accuracy │ correct? (end-to-end) │
└──────────────────┴──────────────────────────────┘
This decomposition helps pinpoint whether errors come from the retriever or the LLM, and guides improvement efforts.
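Because the prompt forbids any code outside the candidate list, the final prediction can only be correct when the retriever has returned the true code. The three metrics are therefore linked multiplicatively (up to rare cases where the LLM ignores the candidate constraint):
\[\text{Pipeline accuracy} = \text{Retriever@k} \times \text{LLM accuracy}\]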
Tip Exercise 3: Compute evaluation metrics
Prepare evaluation columns
Before computing any metric, add three boolean columns to results:
retriever_hit: whether the true NACE code is among the k candidates returned by Qdrant
pipeline_correct: whether the predicted code matches the true code
llm_correct_given_retriever: same as pipeline_correct, but set to None when the retriever did not return the true code
Click to see the answer
# Is the true code among the retriever's candidates?
results["retriever_hit"] = results.apply(
    lambda row: row["true_code"] in row["retrieved_codes"], axis=1
)

# Is the predicted code correct?
results["pipeline_correct"] = results["pred_code"] == results["true_code"]

# Did the LLM pick the right code, given that the retriever found it?
results["llm_correct_given_retriever"] = results.apply(
    lambda row: row["pipeline_correct"] if row["retriever_hit"] else None, axis=1
)
   activity                                      true_code  pred_code  retriever_hit  pipeline_correct
0  Corporate messaging and public perception     73.30      73.30      True           True
1  Land rental for long-term purposes            68.20      68.2       False          False
2  Electric vehicle supply equipment setup       43.21      43.21      True           True
3  Corporate headquarters administration         70.10      70.10      True           True
4  Real estate asset operation and maintenance   68.20      68.20      True           True
Question 1 — Retriever accuracy (Retriever@k)
Retriever@k accuracy measures the proportion of cases where the reference code is present among the k candidates returned by Qdrant. It is the theoretical ceiling of the pipeline: if the retriever misses the correct code, the LLM cannot recover it.
\[\text{Retriever@k} = \frac{N_\text{hit}}{N}\]
where \(N_\text{hit}\) is the number of activities for which the true code is among the \(k\) retrieved candidates, and \(N\) is the total number of activities.
Using the retriever_hit column, compute the proportion of rows where the retriever returned the true code. Store the result in a variable called retriever_accuracy.
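One possible answer (the mean of a boolean column gives the desired proportion):

retriever_accuracy = results["retriever_hit"].mean()
print(f"Retriever@{RETRIEVER_LIMIT} accuracy: {retriever_accuracy:.2%}")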
Question 2 — Conditional LLM accuracy
Conditional LLM accuracy measures how often the LLM picks the right code when the retriever already returned it as a candidate. This metric isolates the LLM’s own contribution to the pipeline.
\[\text{LLM accuracy} = \frac{N_{\text{correct} \mid \text{hit}}}{N_\text{hit}}\]
where \(N_{\text{correct} \mid \text{hit}}\) is the number of correct predictions in the subset where the retriever succeeded, and \(N_\text{hit}\) is the size of that subset.
Filter results to keep only the rows where retriever_hit is True, then compute the proportion of pipeline_correct in that subset. Store the result in llm_accuracy.
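One possible answer, restricting the computation to the rows where the retriever succeeded:

hits = results[results["retriever_hit"]]
llm_accuracy = hits["pipeline_correct"].mean()
print(f"Conditional LLM accuracy: {llm_accuracy:.2%} (computed on {len(hits)} rows)")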
Question 3 — End-to-end pipeline accuracy
Pipeline accuracy measures the proportion of activities for which the final predicted code matches the reference code, regardless of whether errors come from the retriever or the LLM.
\[\text{Pipeline accuracy} = \frac{N_\text{correct}}{N}\]
Compute the overall accuracy from the pipeline_correct column and store it in pipeline_accuracy. Then verify empirically that the multiplicative relationship with retriever_accuracy and llm_accuracy holds.
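One possible answer, including the empirical check of the multiplicative decomposition:

pipeline_accuracy = results["pipeline_correct"].mean()
print(f"Pipeline accuracy: {pipeline_accuracy:.2%}")

# Empirical check: pipeline accuracy ≈ retriever accuracy × conditional LLM accuracy
print(f"Retriever@k × LLM accuracy: {retriever_accuracy * llm_accuracy:.2%}")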
Question 4 — Summary dashboard and error decomposition
Count how many errors come from the retriever (cases where retriever_hit is False) and how many come from the LLM (cases where retriever_hit is True but pipeline_correct is False). Then produce a summary table of all metrics.
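A possible sketch of the error decomposition and summary table, reusing the metrics computed above:

import pandas as pd

retriever_errors = (~results["retriever_hit"]).sum()
llm_errors = (results["retriever_hit"] & ~results["pipeline_correct"]).sum()

summary = pd.DataFrame({
    "metric": [
        f"Retriever@{RETRIEVER_LIMIT} accuracy",
        "Conditional LLM accuracy",
        "Pipeline accuracy",
        "Retriever errors",
        "LLM errors",
    ],
    "value": [
        retriever_accuracy,
        llm_accuracy,
        pipeline_accuracy,
        retriever_errors,
        llm_errors,
    ],
})
print(summary.to_string(index=False))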
If retriever errors dominate → improve the embedding model, increase k, or enrich NACE descriptions in the vector store.
If LLM errors dominate → refine the prompt, switch to a more capable generative model, or lower the temperature further.
Note
Not all errors are equal. A wrong prediction can mean very different things. Some errors are completely off — predicting a manufacturing code for a services activity. Others are near-misses: the predicted code is a parent (e.g. 47 instead of 47.11) or a sibling at the same level of the hierarchy. Near-misses may be acceptable in practice, depending on how the coded data will be used. A hierarchical accuracy metric — counting a prediction as correct if it matches up to a certain depth — would capture this nuance.
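As an illustration, a minimal sketch of such a hierarchical metric, here only comparing the 2-digit division of the predicted and true codes (the column name division_correct is an arbitrary choice):

# Near-miss metric: agreement at the 2-digit division level only
results["division_correct"] = (
    results["pred_code"].astype(str).str[:2] == results["true_code"].astype(str).str[:2]
)
print(f"Exact-code accuracy   : {results['pipeline_correct'].mean():.2%}")
print(f"2-digit-level accuracy: {results['division_correct'].mean():.2%}")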
Important
On the optimism of these results. The evaluation dataset is synthetic: activity labels were generated by an AI system at low temperature, producing clean and unambiguous descriptions that are easier to classify than real-world data. Actual labels are often shorter, noisier, or ambiguous. The accuracy figures obtained here are therefore optimistic and should not be taken as representative of production performance.
Confidence score analysis
The confidence score produced by the LLM can serve as a quality signal to filter unreliable predictions in production, at the cost of reduced coverage.
Click to see the answer
from plotnine import (
    ggplot, aes, geom_boxplot, geom_line, geom_point,
    scale_color_manual, scale_linetype_manual, labs, theme_minimal,
)
import pandas as pd

# --- Left: confidence distribution by correctness ---
results_plot = results.assign(
    correctness=results["pipeline_correct"].map({False: "Incorrect", True: "Correct"})
)

p1 = (
    ggplot(results_plot, aes(x="correctness", y="confidence"))
    + geom_boxplot()
    + labs(
        title="Confidence distribution by pipeline correctness",
        x="Prediction correct",
        y="Confidence score",
    )
    + theme_minimal()
)

# --- Right: precision and coverage vs confidence threshold ---
thresholds = [i / 10 for i in range(1, 10)]
rows = []
for t in thresholds:
    subset = results[results["confidence"] >= t]
    if len(subset) > 0:
        rows += [
            {"threshold": t, "metric": "Precision", "value": subset["pipeline_correct"].mean()},
            {"threshold": t, "metric": "Coverage", "value": len(subset) / len(results)},
        ]
df_thresh = pd.DataFrame(rows)

p2 = (
    ggplot(df_thresh, aes(x="threshold", y="value", color="metric", linetype="metric"))
    + geom_line()
    + geom_point()
    + scale_color_manual(values={"Precision": "steelblue", "Coverage": "coral"})
    + scale_linetype_manual(values={"Precision": "solid", "Coverage": "dashed"})
    + labs(
        title="Precision and coverage vs. confidence threshold",
        x="Confidence threshold",
        y="Value",
        color="",
        linetype="",
    )
    + theme_minimal()
)

from IPython.display import display
display(p1)
display(p2)
Note
Precision / coverage trade-off: raising the confidence threshold increases precision (the retained predictions are more likely to be correct) but reduces coverage (fewer labels are automatically coded, the rest requiring manual review). Choosing the right threshold depends on the operational constraints of your use case.
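As an illustration, applying a hypothetical operating threshold (the value 0.80 below is arbitrary and should be tuned on your own data) splits predictions between automatic coding and manual review:

THRESHOLD = 0.80  # hypothetical operating point, to be adjusted to your constraints
auto_coded = results[results["confidence"] >= THRESHOLD]
manual_review = results[results["confidence"] < THRESHOLD]
print(f"Automatically coded  : {len(auto_coded)} ({len(auto_coded) / len(results):.0%})")
print(f"Sent to manual review: {len(manual_review)} ({len(manual_review) / len(results):.0%})")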