RAG approach - Create a vector database

This first part will show you how to build a vector database, i.e. the “augmented” knowledge base used to guide your LLM’s responses.

The simplified workflow will be as follows:

0. Create a proper working environment with all the needed connections (S3, Qdrant and llm.lab)
1. Take the NACE2.1 code definitions (code + title + inclusions + exclusions, in plain text)
2. Concatenate these pieces of information for each unique NACE code (one code => one text)
3. Embed all these texts (one text => one vector point)
4. Upsert the points into Qdrant (the vector database service)

1 - Prepare your environment

Tip Exercise 1: Connections
  1. Load your secrets stored in the .env file into your environment.
Click to see the answer
import os
from dotenv import load_dotenv

load_dotenv()
  1. Create your llm.lab connection using the OpenAI client. Please name this client client_llmlab.

You will need your credentials os.environ["LLMLAB_URL"] and os.environ["LLMLAB_API_KEY"] (see the previous chapter).

More information here.

  1. Print all the available models (kindly made available to everyone by the SSPcloud team ❤️)

You can notice that the llm.lab platform provides both generation and embedding models.

Note that the openai Python client is not tied to OpenAI’s own servers: by setting base_url, it can talk to any OpenAI-compatible API, which is exactly what llm.lab exposes.

Click to see the answer
from openai import OpenAI

client_llmlab = OpenAI(
    base_url=os.environ["LLMLAB_URL"],
    api_key=os.environ["LLMLAB_API_KEY"],
)

# Print models list
models = client_llmlab.models.list()
for model in models.data:
    print(f"ID: {model.id}")
  1. Create a connection to your Qdrant server. Please name it client_qdrant.

Find the documentation here to dive deeper 😉.

  1. Check all your existing collections (= databases). You probably don't have any yet.
Click to see the answer
from qdrant_client import QdrantClient

client_qdrant = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
    port=int(os.environ["QDRANT_API_PORT"]),  # the client expects an integer port
)

collections = client_qdrant.get_collections()
for collection in collections.collections:
    print(collection.name)

2. Get and process NACE data

Time to import our data: the official NACE2.1 definitions!

The data are stored in the S3 storage system of the SSPCloud platform (uploaded from Eurostat).

Tip Exercise 2: Handle NACE data
  1. Import the NACE data (please, no dataframe: use the structure shown below)
Click to see the answer
import duckdb
con = duckdb.connect(database=":memory:")

con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

path_nace = 'https://minio.lab.sspcloud.fr/projet-formation/diffusion/funathon/2026/project2/NACE_Rev2.1_Structure_Explanatory_Notes_EN.tsv'
query_definition = f"SELECT * FROM read_csv('{path_nace}')"
table = con.execute(query_definition).fetch_arrow_table()
nace = table.to_pylist()
nace[22]
{'ORDER_KEY': 230,
 'ID': '014',
 'CODE': '01.4',
 'HEADING': 'Animal production',
 'PARENT_ID': '01',
 'PARENT_CODE': '01',
 'LEVEL': 3,
 'Implementation_rule': None,
 'Includes': 'This group includes:\\n- farming (husbandry, raising) and breeding of all animals, except aquatic animals',
 'IncludesAlso': None,
 'Excludes': 'This group excludes:\\n- farm animal boarding and care, see 01.62\\n- production of hides and skins from slaughterhouses, see 10.11'}
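Step 2 of the workflow turns each record like the one above into a single text. A minimal sketch, assuming we keep the code, heading and the inclusion/exclusion notes (the exact field choice is up to you), with missing (None) fields simply dropped:

```python
def build_nace_text(item: dict) -> str:
    """Concatenate the textual fields of one NACE record into one string."""
    parts = [
        f"{item['CODE']} - {item['HEADING']}",
        item.get("Includes"),
        item.get("IncludesAlso"),
        item.get("Excludes"),
    ]
    # Keep only the fields that are actually filled in
    return "\n".join(p for p in parts if p)

# Example on a record shaped like the one shown above (texts truncated)
sample = {
    "CODE": "01.4",
    "HEADING": "Animal production",
    "Includes": "This group includes:\n- farming (husbandry, raising) ...",
    "IncludesAlso": None,
    "Excludes": "This group excludes:\n- farm animal boarding and care ...",
}
print(build_nace_text(sample))
```

Applying this function to every element of `nace` gives the “one code => one text” mapping needed before embedding.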

3. Embed your NACE items

Fine! Now, let’s choose an embedding model (i.e. a large language model able to transform a text into a dense numerical vector). We will use Qwen3-Embedding-8B, available on HuggingFace🤗 (like all the other models provided by llm.lab)!

For later use, we need to know the dimension of the vectors produced by our model. Here, the size is 4096 (don’t trust me, check for yourself on the HF page).

# Embedding model parameters
emb_model = "qwen3-embedding:8b"
emb_dim = 4096
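With the model chosen, step 3 boils down to calling the OpenAI embeddings endpoint on each text. A minimal sketch, sending the texts in batches (the batch size of 64 is an arbitrary choice, not an API requirement):

```python
def batched(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_texts(client, texts, model="qwen3-embedding:8b", batch_size=64):
    """Embed a list of texts with an OpenAI-compatible client, batch by batch.

    `client` is assumed to be the `client_llmlab` connection created earlier.
    Returns one vector (a list of floats) per input text, in order.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        resp = client.embeddings.create(model=model, input=batch)
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```

Each returned vector should have length `emb_dim` (4096); checking that is a cheap sanity test before upserting anything into Qdrant.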