Click to see the answer
```python
import os
from dotenv import load_dotenv

load_dotenv()
```

This first part will show you how to build a vector database, i.e. the “augmented” knowledge aimed at guiding your LLM’s responses.
The simplified workflow is as follows:

0. Create a proper working environment with all the needed connections (S3, Qdrant and llm.lab)
1. Take the NACE 2.1 code definitions (code + title + inclusions + exclusions, in plain text)
2. Concatenate these pieces of information for each unique NACE code (one code => one text)
3. Embed all these texts (one text => one vector point)
4. Upsert the points into Qdrant (the vector database service)
This loads the variables defined in your `.env` file into your environment. Next, instantiate an `openai` client; please name this client `client_llmlab`. You will need to use your credentials `os.environ["LLMLAB_URL"]` and `os.environ["LLMLAB_API_KEY"]` (see the previous chapter).
More information here.
You can notice that the llm.lab platform provides both generation and embedding models.
The `openai` Python client can talk to any OpenAI-compatible API: you only need to point its `base_url` at the llm.lab endpoint and pass your API key. The same client object then exposes both the chat-completion and the embedding endpoints.
Similarly, instantiate a Qdrant client and name it `client_qdrant`. Find the documentation here to dive deeper 😉.
Time to import our data: the official NACE2.1 definitions!
The data are stored in the S3 storage system of the SSPCloud platform (uploaded from Eurostat).
```python
import duckdb

# In-memory DuckDB connection with the httpfs extension, to read the remote file over HTTPS
con = duckdb.connect(database=":memory:")
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

path_nace = 'https://minio.lab.sspcloud.fr/projet-formation/diffusion/funathon/2026/project2/NACE_Rev2.1_Structure_Explanatory_Notes_EN.tsv'
query_definition = f"SELECT * FROM read_csv('{path_nace}')"

# Fetch as an Arrow table, then convert to a list of dicts (one per NACE entry)
table = con.execute(query_definition).fetch_arrow_table()
nace = table.to_pylist()
```
```python
nace[22]
```

```
{'ORDER_KEY': 230,
 'ID': '014',
 'CODE': '01.4',
 'HEADING': 'Animal production',
 'PARENT_ID': '01',
 'PARENT_CODE': '01',
 'LEVEL': 3,
 'Implementation_rule': None,
 'Includes': 'This group includes:\\n- farming (husbandry, raising) and breeding of all animals, except aquatic animals',
 'IncludesAlso': None,
 'Excludes': 'This group excludes:\\n- farm animal boarding and care, see 01.62\\n- production of hides and skins from slaughterhouses, see 10.11'}
```
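Step 2 of the workflow (one code => one text) can be sketched with a small helper. `nace_to_text` is a hypothetical name, and the exact field ordering is a design choice, not an official one:

```python
def nace_to_text(row: dict) -> str:
    """Concatenate the definition pieces of one NACE code into a single plain text.

    Hypothetical helper: joins code, heading and the (optional)
    inclusion/exclusion notes, skipping fields that are None.
    """
    parts = [
        f"{row['CODE']} - {row['HEADING']}",
        row.get("Includes"),
        row.get("IncludesAlso"),
        row.get("Excludes"),
    ]
    return "\n".join(p for p in parts if p)


# Example on a record shaped like nace[22] above
sample = {
    "CODE": "01.4",
    "HEADING": "Animal production",
    "Includes": "This group includes:\n- farming (husbandry, raising) and breeding of all animals, except aquatic animals",
    "IncludesAlso": None,
    "Excludes": None,
}
print(nace_to_text(sample))
```

Applied to every record (`[nace_to_text(row) for row in nace]`), this yields the one-text-per-code corpus that will be embedded next.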
Fine! For now, let’s choose an embedding model (i.e. a large language model able to transform a text into a rather smart numerical vector). We will use Qwen3-Embedding-8B, available on HuggingFace🤗 (like all the other models provided by llm.lab)!
For later use, we need to know the dimension of the vectors produced by our model. Here, the size is 4096 (don’t trust me, check for yourself on the HF page).
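An embedding call through the OpenAI-compatible endpoint can be sketched as below — note that `get_embedding` is a hypothetical helper, and the model identifier is the HuggingFace name, so the exact name exposed by llm.lab may differ:

```python
EMBEDDING_DIM = 4096  # dimension of Qwen3-Embedding-8B vectors (check the HF page)


def get_embedding(client, text: str, model: str = "Qwen/Qwen3-Embedding-8B") -> list[float]:
    """Embed one text and return its vector (1 text => 1 vector point).

    `client` is expected to be an OpenAI-compatible client such as
    client_llmlab; this call requires a live connection to the API.
    """
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding
```

`EMBEDDING_DIM` is precisely the value we will need when creating the Qdrant collection that stores the points.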