
Lab 4 Building a RAG pipeline

In this lab, I'm going to move away from security for a little while and look at how we can augment our LLM using a technique called Retrieval Augmented Generation (RAG).

The idea is that we can use a pre-trained LLM to answer questions about a specific dataset. We can then augment the pre-trained model with additional information from a knowledge base. To do this we usually convert the data into a vector space and store it in a vector database, like ChromaDB. In this lab we'll:

  • Store text embeddings in ChromaDB (an open source vector database)
  • Retrieve relevant knowledge dynamically
  • Use a local model to generate answers

The Lab

View the Lab Repository

What is RAG?

Before we get into the lab, let's break down what RAG is and how it is useful for building domain-specific models. RAG enhances the model by retrieving relevant information before generating a response. Instead of relying solely on the pre-trained model's knowledge, it fetches context from an external knowledge base, stored as embeddings in a vector database.
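
At its core the flow is simple: embed the question, look up the most relevant knowledge, then hand both to the LLM. Here's a minimal sketch of that loop; the embed, search and generate callables are hypothetical stand-ins for the real components we build in this lab.

def rag_answer(question, embed, search, generate):
    # 1. Turn the question into an embedding
    query_vector = embed(question)
    # 2. Retrieve relevant knowledge from the vector database
    context = search(query_vector)
    # 3. Let the LLM answer with the retrieved context in the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)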

Why would we use RAG?

  • Keeps models up to date without the need for retraining the entire model
  • Improves domain accuracy by providing relevant information
  • Reduces hallucinations by providing additional context

Converting data to embeddings

To create embeddings we use a pre-trained embedding model. That could be a hosted model like OpenAI's text-embedding-3-small or text-embedding-3-large, or a local model like all-MiniLM-L6-v2 from sentence-transformers, which is what we'll use here. These models are trained on a large corpus of text and can convert text into a vector space which we can store in ChromaDB.

import chromadb
from sentence_transformers import SentenceTransformer

chroma_client = chromadb.PersistentClient(path="./chroma_db")  # Saves to disk
collection = chroma_client.get_or_create_collection("rag_knowledge_base")

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "The Eiffel Tower is located in Paris, France.",
    "Neural networks are a type of machine learning model.",
    "ChromaDB is an open-source vector database for embeddings."
]

# Generate embeddings and store in ChromaDB
for i, doc in enumerate(documents):
    embedding = embedding_model.encode(doc).tolist()
    collection.add(ids=[str(i)], embeddings=[embedding], metadatas=[{"text": doc}])
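
As a quick sanity check, and to show what "converting text into a vector space" looks like in practice, here's a small sketch of my own (reusing the collection and embedding_model from above) that counts the stored documents and compares a question against two of them: semantically related sentences produce nearby vectors, which is exactly what retrieval relies on.

from sentence_transformers import util

# The collection should now contain our three documents
print(collection.count())  # 3

# Related sentences sit close together in the vector space...
question = embedding_model.encode("Where is the Eiffel Tower?")
related = embedding_model.encode("The Eiffel Tower is located in Paris, France.")
unrelated = embedding_model.encode("Neural networks are a type of machine learning model.")

print(util.cos_sim(question, related))    # relatively high similarity
print(util.cos_sim(question, unrelated))  # noticeably lower similarity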

Retrieving relevant information

Now that we have our embeddings stored in ChromaDB we can retrieve relevant information from the knowledge base.

from transformers import pipeline

query_text = "Where is the Eiffel Tower located?"

# Generate query embedding
query_embedding = embedding_model.encode(query_text).tolist()

# Retrieve the most relevant document
results = collection.query(query_embeddings=[query_embedding], n_results=1)
retrieved_text = results["metadatas"][0][0]["text"]

# Construct prompt with retrieved knowledge and generate an answer with a local model
prompt = f"Answer the question based on the following context:\n\n{retrieved_text}\n\nQuestion: {query_text}\nAnswer:"
llm = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.1")
response = llm(prompt, max_new_tokens=100)[0]["generated_text"]

print(response)

Here we convert the query into an embedding and retrieve the most relevant document from the knowledge base. We then add the retrieved document to the prompt and use the LLM to answer the question. This gives the model additional context via the prompt, which is how we augment it with external knowledge.

For example, instead of storing a hard-coded dictionary of information, we might load a collection of documents from a service desk system and store these in the knowledge base. We can then use the RAG pipeline to answer questions about those documents, as sketched below.
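
To make that easier to reuse, the retrieve-and-generate steps can be wrapped into a single helper. This is just a sketch of how I'd structure it, reusing the embedding_model, collection and llm objects from above; the function name and the default n_results are illustrative.

def answer_question(query: str, n_results: int = 1) -> str:
    # Embed the query and fetch the closest matching documents from ChromaDB
    query_embedding = embedding_model.encode(query).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=n_results)
    context = "\n".join(m["text"] for m in results["metadatas"][0])

    # Ask the local model to answer using only the retrieved context
    prompt = f"Answer the question based on the following context:\n\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt, max_new_tokens=100)[0]["generated_text"]

print(answer_question("Where is the Eiffel Tower located?"))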

Can I have more data?

Vector DBs typically allow you to store metadata alongside each embedding. ChromaDB specifically allows for documents, ids, embeddings and metadata. This means we can store a lot more information than just the text of a document, to better facilitate retrieval and allow the model to respond with contextualised answers.

tickets = [
    "This is the first ticket",
    "This is the second ticket",
    "This is the third ticket"
]

collection.add(
    ids=["ticket_1", "ticket_2", "ticket_3"],
    # Embed the ticket text with the same model as before, rather than using placeholder vectors
    embeddings=embedding_model.encode(tickets).tolist(),
    documents=tickets,
    metadatas=[
        {"customer": "John Smith", "text": "Ticket 1", "status": "Open", "priority": "High"},
        {"customer": "Jane Doe", "text": "Ticket 2", "status": "Closed", "priority": "Medium"},
        {"customer": "John Smith", "text": "Ticket 3", "status": "Open", "priority": "Low"}
    ]
)

In the vector DB the arrays are aligned by index, so the first id, embedding, document and metadata entry all describe the first ticket, the second entries the second ticket, and so on. This lets us retrieve the most relevant ticket for a given query, and it also improves query performance by allowing us to prefilter based on metadata!

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=1,
    where={"customer": "John Smith"}
)

We prefilter the DB down to documents from John Smith and then use the query embedding to see if there is a relevant ticket in their history. This context would then be passed to the model to answer the question.
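
Putting that together, here's a sketch of what that last step could look like, reusing the embedding_model, collection and llm objects from earlier; the question and prompt wording are just illustrative.

question = "Does John Smith have any open tickets?"
question_embedding = embedding_model.encode(question).tolist()

# Prefilter to John Smith's tickets, then search within them
results = collection.query(
    query_embeddings=[question_embedding],
    n_results=1,
    where={"customer": "John Smith"}
)

# Build a prompt from the matching ticket and its metadata, then ask the local model
ticket_text = results["documents"][0][0]
ticket_meta = results["metadatas"][0][0]
prompt = (
    "Answer the question using the customer's ticket history below.\n\n"
    f"Customer: {ticket_meta['customer']} (status: {ticket_meta['status']}, priority: {ticket_meta['priority']})\n"
    f"Ticket: {ticket_text}\n\n"
    f"Question: {question}\nAnswer:"
)
print(llm(prompt, max_new_tokens=100)[0]["generated_text"])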

There we have it! We've used RAG to help augment the model.