Step-by-Step Guide to Local RAG with Ollama's Gemma 2 and LangChain.dart
Learn how to build a fully local RAG (Retrieval-Augmented Generation) system using LangChain.dart, Ollama and local embedding databases with Chroma
Table of contents
- Introduction to Local RAG
- Understanding the Components
- Setting Up Your Local RAG Environment
- Building the RAG System
- Running and Interacting with Your Local RAG
- Enhancing Our RAG System
- Resources
- Conclusion
Introduction to Local RAG
What is Retrieval-Augmented Generation (RAG)?
The core idea is to enhance the generative capabilities of language models by incorporating external knowledge retrieved from a document store or database. This approach is especially useful when the LLM alone doesn't have enough information to answer a user correctly, or when the response depends on proprietary information. Customer support is a good example, because you need the user's data to provide the best support possible. Instead of a human agent querying the user's data and filtering it down to the most relevant records, a RAG helper can fetch the relevant information based on the user's description of the issue. All that's left is for the support agent to read what was retrieved, as if they had queried a search engine.
Key Components of Local RAG
Vector Store: A database that stores document embeddings (vector representations of documents). In this local RAG example, we are going to be using Chroma as the vector store.
Embeddings Model: A model that converts text into vector representations. For this example, we are using the nomic-embed-text model from Ollama.
LLM (Large Language Model): A generative model that produces responses based on the retrieved documents and the user query. The Gemma 2 model is perfect for this.
Document Loader: A component that loads documents into the vector store. In this example, we will use DirectoryLoader to load text data.
Retriever: A component that retrieves the most relevant documents from the vector store based on the user query.
Prompt Template: A template that combines the retrieved documents and the user query to form a prompt for the LLM.
Memory: (Optional) A component that maintains the conversation history to provide context for follow-up questions. We’re not going to be adding memory in this one.
Advantages of local setups vs. cloud-based RAG solutions
You'll find that there are many benefits to handling things locally, such as improved privacy and security. However, I believe the greatest advantages are the cost savings and flexibility that this approach provides. You can iterate on your prompts and switch out models if they're not working the way you want. I say this as someone who spends whole days perfecting prompt templates: the value of that freedom can't be overstated.
What is Local Retrieval-Augmented Generation?
In a local RAG setup, everything happens right on your computer, from loading documents to generating responses, without needing any external APIs or services. I briefly talked about one crucial aspect of it, embeddings, in my previous article. Go check it out if you haven't; it covers some basics that I won't repeat here.
Understanding the Components
Why Gemma 2 and Ollama?
Gemma 2 is one of the models available in Ollama's model library; the default tag is a 9-billion-parameter model that runs locally with fast responses. Ollama's easy setup makes it all possible: as a Mac user, all I have to do is download the dmg, extract it, and open the application. Installation instructions are here: https://github.com/ollama/ollama?tab=readme-ov-file. Once installed, Ollama automatically serves the models you pull locally.
Chroma: Vector database explained
Chroma is an open-source embedding database designed to store and manage vector embeddings efficiently. It’s really useful for us since we need a fast, scalable vector similarity search for use in our Retrieval-Augmented Generation.
The key feature that concerns us is that it can run in client/server mode with the help of Docker or the Python client, which means we can test our implementation on our local computers and still deploy it to any cloud provider like Google Cloud. It also supports collections, which can be configured with metadata and different distance functions, but we're not going to be doing that. Once you load the documents, Chroma can filter queries by metadata and document contents, allowing for more precise and relevant searches. LangChain is going to be making said query, though, so no need to stress about that.
Benefits of using Gemma 2 with Ollama and LangChain
If you don't know, LangChain.dart provides a set of ready-to-use components for working with language models and a standard interface for chaining them together to formulate more advanced use cases. Integrating Gemma 2 through LangChain.dart is beneficial because its unified API makes it easy to switch between different models without touching most of your application code, and LangChain.dart offers support for advanced use cases such as Retrieval-Augmented Generation (RAG).
Setting Up Your Local RAG Environment
Prerequisites
Hardware requirements for running local models
According to Ollama's README.md, you should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models. The 9B Gemma 2 model requires more than 8 GB of RAM, but if you don't have more than that, go with the 2B model instead.
Installing Gemma 2 and Ollama
Download Ollama for macOS or Windows from https://ollama.com/download, or install it on Linux with this:
curl -fsSL https://ollama.com/install.sh | sh
Now that you're done with the installation, open your favourite terminal and pull the model you want to use locally:
ollama pull gemma2
You can then chat with the model locally:
ollama run gemma2
If the model takes too long to respond, it could be that your computer's specs aren't good enough. You might need to pull the 2B model instead.
ollama pull gemma2:2b
If the response is still taking too long, you might have better luck with the Llama 3.2 model instead; that's the default model used by LangChain.dart's ChatOllama instance. For those with 32 GB of RAM or more, Gemma 2's 27B model exists if you want something really powerful.
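By the way, if you ever forget which models you've already pulled, Ollama can list them for you:
ollama list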
That's all we need to do to use Ollama with LangChain. Let's move on to Chroma next.
Chroma vector database installation
From the LangChain.dart docs, you can run a Chroma server in two ways:
Using Python client
The Python client supports spinning up a Chroma server easily:
pip install chromadb
chroma run --path /db_path
Using Docker
Otherwise, you can run the Chroma server using Docker:
docker pull chromadb/chroma
docker run -p 8000:8000 chromadb/chroma
By default, the Chroma client will connect to a server running on http://localhost:8000. To connect to a different server, pass the baseUrl parameter to the constructor.
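For example, a minimal sketch of pointing the client at a remote server (the host below is just a placeholder) would be:
final vectorStore = Chroma(
  embeddings: embeddings,
  // Placeholder host; replace with wherever your Chroma server is running.
  baseUrl: "http://my-chroma-host:8000",
);
The same parameter is what you'd reach for once you deploy Chroma to a cloud provider later on.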
I’m going to be using Docker since its setup is easier.
All you have to do is head to https://www.docker.com/ and download Docker Desktop. Once it's installed, you can either run Chroma via the CLI as shown above, or search for chromadb/chroma in Docker Desktop and run the image with over 1M downloads.
Open the Optional settings when the Run dialog appears and pass 8000 to the Host port.
The above is the same as the command below; to run it via the CLI, just make sure Docker is running first by opening the app.
docker run -p 8000:8000 chromadb/chroma
Dart development environment preparation
This project is going to be a simple command line application so:
Open your favourite text editor
Change into your projects directory
Run
dart create ollama_local_rag
or use an existing Dart CLI project if you already have one set up.
We're going to be working in the lib directory, so feel free to remove the bin directory entirely:
rm -r bin
And then change the directory to lib:
cd lib
Now add the necessary dependencies to your project's pubspec.yaml file:
dependencies:
  langchain: {version}
  langchain_community:
    git:
      url: https://github.com/Nana-Kwame-bot/langchain_dart.git
      ref: directory-loader
      path: packages/langchain_community
  langchain_ollama: {version}
  langchain_chroma: {version}
You can also do this using the CLI, just copy and paste:
dart pub add langchain langchain_ollama langchain_chroma \
'langchain_community:{"git":{"url":"https://github.com/Nana-Kwame-bot/langchain_dart.git","ref":"directory-loader","path":"packages/langchain_community"}}'
We're adding the langchain_community package using a git dependency because langchain_community doesn't have support for DirectoryLoader yet; let's hope the current PR for it gets merged soon. I'll update this article once that happens.
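If you edited pubspec.yaml by hand rather than using dart pub add, remember to fetch the new dependencies afterwards:
dart pub get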
Building the RAG System
Embeddings generation
In your ollama_local_rag.dart file you should have this:
void main() async {
...
}
Let’s initialise an embeddings instance:
final embeddings = OllamaEmbeddings(model: "nomic-embed-text", keepAlive: 30);
From Ollama themselves:
nomic-embed-text is a large context length text encoder that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small performance on short and long context tasks.
The keepAlive parameter sets the duration (in minutes) to keep the model loaded in memory. By default this is 5 minutes, but since we're probably going to want to play around for longer than that, I'm setting it to 30 minutes. You're free to change it to whatever value you like.
Vector store integration
final vectorStore = Chroma(
embeddings: embeddings,
collectionName: "renewable_energy_technologies",
collectionMetadata: {
"description": "Documents related to renewable energy technologies",
},
);
We're passing a collectionName to help us manage and organise the embeddings more efficiently; by default it's set to langchain. The collectionMetadata parameter allows us to associate additional metadata with the collection, which helps in filtering and querying the embeddings later on.
Document loading and preprocessing
final loader = DirectoryLoader(
"../renewable_energy_technologies",
glob: "*.txt",
);
final documents = await loader.load();
I'm using a glob pattern to match only text files; when not set, it defaults to all files. The loader supports .txt and .json files as well.
Next we call loader.load(). This method calls lazyLoad() under the hood, and it's best used with small amounts of data since it loads all the documents into memory at once. lazyLoad() returns a Stream<Document>, which allows you to process each Document as it's loaded.
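If you ever need to work with a much larger set of documents, a minimal sketch of the streaming approach could look like this:
// Process documents one at a time instead of loading everything into memory.
await for (final document in loader.lazyLoad()) {
  // e.g. split and embed each document here as it arrives
  print(document.metadata["name"]);
}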
We need some documents to work with, so I went ahead and created a few text documents about renewable energy technologies. First, create a directory at the same level as your lib directory. So if you're in the lib directory, take a step back up:
cd ..
Then create the directory:
mkdir renewable_energy_technologies
Preparing your knowledge base
You can get the text documents here: renewable_energy_technologies, after which you can download or copy/paste them into the renewable_energy_technologies directory.
Text splitting strategy
final textSplitter = RecursiveCharacterTextSplitter(
chunkSize: 1000,
chunkOverlap: 200,
);
final splitDocuments = textSplitter.splitDocuments(documents);
await vectorStore.addDocuments(documents: splitDocuments);
From LangChain.dart:
When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text.
We could use the other text splitter, CharacterTextSplitter, to split on each new line, but that would make it difficult for the LLM to maintain context and might produce uneven chunks.
Chunk size and overlap considerations
The renewable_energy_technologies documents have about 1000 characters each, which is why a chunkSize of 1000 was chosen. The chunkOverlap helps us maintain context as well as improve coherence. If the chunks were created cleanly without any overlap, it would be difficult for the model to understand the continuity of the text.
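To build an intuition for those numbers, you can split a throwaway string with a much smaller splitter and print the chunks. A rough sketch, assuming splitText works on a raw string the way splitDocuments works on documents (the sample text and sizes are only for illustration):
final demoSplitter = RecursiveCharacterTextSplitter(
  chunkSize: 100,
  chunkOverlap: 20,
);
final chunks = demoSplitter.splitText(
  "Solar, wind, hydro and biomass are all renewable energy sources. " * 5,
);
for (final chunk in chunks) {
  // Neighbouring chunks share roughly 20 characters, preserving continuity.
  print("[${chunk.length}] $chunk");
}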
Once this splitting is done we call upon the Chroma vector store to add these documents for use later.
Implementing the Retriever
final retriever = vectorStore.asRetriever(
defaultOptions: VectorStoreRetrieverOptions(
searchType: VectorStoreSearchType.similarity(k: 5)),
);
From LangChain.dart:
A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) it. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
Configuring retrieval parameters
The search type is similarity, but during testing you can try out VectorStoreSearchType.mmr instead for more diversity. Using similarity means we get the most relevant results even if they're redundant. On the other hand, using mmr means you'll get diverse results that are still relevant to the search query. I want stricter relevance, though, which is why I'm using similarity.
k: 5 means we get the top 5 documents most similar to the query; these are going to be used as the relevant context in the prompt soon.
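If you want to experiment with the more diverse results mentioned above, the swap is a one-liner. A minimal sketch, assuming the mmr search type accepts the same k parameter as similarity:
final diverseRetriever = vectorStore.asRetriever(
  defaultOptions: VectorStoreRetrieverOptions(
    // Maximal Marginal Relevance: still relevant, but less redundant results.
    searchType: VectorStoreSearchType.mmr(k: 5),
  ),
);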
Integrating Gemma 2 LLM
Configuring model parameters: tokens, temperature, keep-alive settings
final chatModel = ChatOllama(
defaultOptions: ChatOllamaOptions(
model: "gemma2",
temperature: 0,
keepAlive: 30,
),
);
We're creating an instance of the ChatOllama class to be used as the chat model in the chain. By default, llama3.2 is used as the model with a temperature of 0.8, and the keepAlive is 5 minutes, just like with the OllamaEmbeddings class. The temperature is set to 0 to keep the model from being creative, which ensures consistent responses; if you're testing on your own, you can change it for more variation. It doesn't affect what the retriever returns, only how differently the model responds given the same retrieved context.
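If you ended up pulling the smaller 2B variant earlier because of hardware constraints, pointing the chat model at it is just a matter of changing the model tag. A minimal variation of the block above:
final chatModel = ChatOllama(
  defaultOptions: ChatOllamaOptions(
    model: "gemma2:2b", // the 2B variant pulled earlier
    temperature: 0,
    keepAlive: 30,
  ),
);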
Prompt engineering
final ragPromptTemplate = ChatPromptTemplate.fromTemplates([
(
ChatMessageType.system,
"""
You are an expert assistant providing precise answers based
strictly on the given context.
Context Guidelines:
- Answer only from the provided context.
- If no direct answer exists, clearly state "I cannot find a specific
answer in the provided documents".
- Prioritize accuracy over comprehensiveness.
- If context is partially relevant, explain the limitation.
- Cite document sources if multiple documents contribute to the answer.
CONTEXT: {context}
QUESTION: {question}
"""
),
(ChatMessageType.human, "{question}"),
]);
This is a generic prompt for most RAG projects you might want to build. Separating the templates into roles helps the LLM understand what's being asked of it. In this case, it gets the question from the user as well as the context from the retriever. The retriever simply finds relevant documents based on the search query, which in this case is the user's question.
The prompt template is the glue that holds the RAG chain together, since it carries all the information and tells the LLM what to do with it. It includes the placeholders {context} and {question}, which will be filled with actual values when the template is used.
Building the RAG Pipeline
final ragChain = Runnable.fromMap<String>({
"context": retriever.pipe(
Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
),
"question": Runnable.passthrough<String>(),
}).pipe(ragPromptTemplate).pipe(chatModel).pipe(const StringOutputParser());
This chain involves retrieving relevant documents, formatting a prompt, invoking a chat model, and parsing the output. That’s what LangChain Expression Language (LCEL) is all about. From the LangChain.dart docs:
To make it as easy as possible to create custom chains, LangChain provides a Runnable interface that most components implement, including chat models, LLMs, output parsers, retrievers, prompt templates, and more.
Here's a detailed explanation of each part of the chain:
Runnable.fromMap: This method creates a Runnable from a map of operations. Each key in the map corresponds to an input variable, and the value is a Runnable that processes that input. The input type is specified as a String; that's the user's question being passed in.
context: This key holds the context retrieved by the retriever we created previously. We use the pipe() method to map the retriever's output, a list of documents, into a single string. That string contains the relevant content the retriever picked out for the LLM to use when answering the question.
question: Here we simply associate this key with a runnable that passes the input question through without any modification, thereby also outputting a string.
.pipe(ragPromptTemplate): This pipes the output of the previous Runnable, which contains the context and the question, into the prompt template. It formats the context and question into a prompt suitable for the language model.
.pipe(chatModel): This pipes the formatted prompt to the LLM so it can generate a response based on it.
.pipe(const StringOutputParser()): This pipes the output of the chatModel into a StringOutputParser, which simply converts any input into a String; here, the input is the output of the chatModel. The StringOutputParser extends BaseOutputParser, meaning it can take either a String or a ChatMessage as input.
In summary, this ragChain takes a question, retrieves relevant documents, formats them into a prompt, generates a response using a language model, and then parses the response into a string. Thanks to its modular design, we can swap out any step for an equivalent component and inspect each one independently.
Running and Interacting with Your Local RAG
CLI Implementation
print("Local RAG CLI Application");
print("Type your question (or \"quit\" to exit):");
while (true) {
stdout.write("> ");
final userInput = stdin.readLineSync()?.trim();
if (userInput == null || userInput.toLowerCase() == "quit") {
break;
}
try {
print("\nThinking...\n");
final stream = ragChain.stream(userInput);
await for (final chunk in stream) {
stdout.write(chunk);
}
print("\n");
} catch (e) {
print("Error processing your question: $e");
}
}
print("\nThank you for using Local RAG CLI!");
Command-line interaction flow
The above is a simple CLI interaction loop. We start with a welcome message and then use an infinite loop to chat with the model continuously until we want to quit. Without this loop, we would be re-computing the embeddings each time we asked a question. We don't have many documents right now, but for larger document sets that wouldn't be wise.
Input processing and Response streaming
Once the input is valid, we call .stream on the ragChain to get the response from the LLM in chunks instead of all at once. You can change it to .invoke, which returns the whole string output at once through a Future.
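For reference, a non-streaming version of the same call, if you'd rather wait for the complete answer, would look roughly like this:
// Waits for the complete answer instead of printing chunks as they arrive.
final answer = await ragChain.invoke(userInput);
print(answer);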
Running the thing
Just make sure you're in the lib directory before you run:
dart run ollama_local_rag.dart
You should see this
Local RAG CLI Application
Type your question (or "quit" to exit):
>
I’m going to go ahead and ask it a question.
Local RAG CLI Application
Type your question (or "quit" to exit):
> What are the applications for hydrogen energy?
Thinking...
You should get this response:
Local RAG CLI Application
Type your question (or "quit" to exit):
> What are the applications for hydrogen energy?
Thinking...
According to the provided documents, the diverse uses of hydrogen energy include:
- Transportation sector
- Industrial processes
- Grid energy storage
- Residential and commercial heating
- Space exploration technologies.
>
Which matches the portion in hydrogen_energy.txt exactly, letting you know that it's all local knowledge. You can turn off your internet to confirm 😊.
Enhancing Our RAG System
Debugging and Testing
Currently (as of 15th December 2024), the LangChain.dart documentation doesn't have specific debugging instructions, but there are still some things we can do. We can access the results of the intermediate steps before the final output is produced using formatPrompt or the following helper function from LangChain.dart.
Runnable<T, RunnableOptions, T> logOutput<T extends Object>(String stepName) {
return Runnable.fromFunction<T, T>(
invoke: (input, options) {
print('Output from step "$stepName":\n$input\n---');
return Future.value(input);
},
stream: (inputStream, options) {
return inputStream.map((input) {
print('Chunk from step "$stepName":\n$input\n---');
return input;
});
},
);
}
Since this helper function is a runnable, we can plug it into the chain and see the output of that particular step.
The context and question chain
final ragChain = Runnable.fromMap<String>({
"context": retriever.pipe(
Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
),
"question": Runnable.passthrough<String>(),
})
.pipe(logOutput("context and question"))
.pipe(ragPromptTemplate)
.pipe(chatModel)
.pipe(const StringOutputParser());
Placing it right after the context and question runnable gives us this output (I've truncated the response to make it more readable):
Local RAG CLI Application
Type your question (or "quit" to exit):
> What is biomass energy?
Thinking...
Chunk from step "context and question":
{question: What is biomass energy?}
---
Chunk from step "context and question":
{context: Document{id: 16fc0824-34b9-47eb-9885-c3a40b51fa65, pageContent: Biomass Energy: Converting Organic Matter to Power
Biomass energy represents a renewable technology that generates power by converting organic materials into usable energy forms. This document provides a comprehensive overview of biomass energy technologies.
1. Biomass Conversion Methods
Primary approaches to biomass energy generation:
- Direct combustion
- Gasification
- Pyrolysis
- Anaerobic digestion
- Fermentation technologies
- Waste management strategies, metadata: {extension: .txt, lastModified: 1734116256000, name: biomass_energy.txt, size: 1210, source: ../renewable_energy_technologies/biomass_energy.txt}}
Document{id: 87462af0-8046-414e-96ee-b22165ee6259, pageContent: Biomass Energy: Converting Organic Matter to Power
Biomass energy represents a renewable technology that generates power by converting organic materials into usable energy forms. This document provides a comprehensive overview of biomass energy technologies.
---
We can see the question passed to the first runnable, and likewise the context, which contains the document the retriever correctly fetched because of its similarity to the question.
The RAG chain
final ragChain = Runnable.fromMap<String>({
"context": retriever.pipe(
Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
),
"question": Runnable.passthrough<String>(),
})
.pipe(ragPromptTemplate)
.pipe(logOutput("ragPromptTemplate"))
.pipe(chatModel)
.pipe(const StringOutputParser());
Placing this runnable after the ragPromptTemplate yields this:
Local RAG CLI Application
Type your question (or "quit" to exit):
> What's the impact of solar energy on the environment?
Thinking...
Chunk from step "ragPromptTemplate":
System: You are an expert assistant providing precise answers based
strictly on the given context.
Context Guidelines:
- Answer only from the provided context.
- If no direct answer exists, clearly state "I cannot find a specific
answer in the provided documents".
- Prioritize accuracy over comprehensiveness.
- If context is partially relevant, explain the limitation.
- Cite document sources if multiple documents contribute to the answer.
CONTEXT: Document{id: b81e4b8a-7930-4e7b-8a57-3fee568b41f1, pageContent: 3. Emerging Solar Technologies
Innovative approaches are pushing the boundaries of solar energy:
- Perovskite solar cells
- Organic photovoltaics
- Quantum dot solar cells
- Transparent solar panels
4. Applications
Solar energy is increasingly used in:
- Residential and commercial electricity generation
- Industrial process heat
- Agricultural irrigation
- Remote power systems
- Satellite and space technology
5. Global Impact
As of 2024, solar energy represents a critical component of global renewable energy strategies, with increasing efficiency and decreasing costs driving widespread adoption., metadata: {extension: .txt, lastModified: 1734116208000, name: solar_energy_overview.txt, size: 1500, source: ../renewable_energy_technologies/solar_energy_overview.txt}}
Document{id: d2a574f6-31e7-40a5-9215-e48155fc1f80, pageContent: 3. Emerging Solar Technologies
Innovative approaches are pushing the boundaries of solar energy:
- Perovskite solar cells
- Organic photovoltaics
- Quantum dot solar cells
- Transparent solar panels
QUESTION: What's the impact of solar energy on the environment?
Human: What's the impact of solar energy on the environment?
---
The template has been filled in just as I described above; this is what gets passed to the ChatModel.
Using formatPrompt
We can pick our prompt template and call formatPrompt on it like this:
final formattedPrompt = ragPromptTemplate.formatPrompt({
"context": retriever.pipe(
Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
),
"question": "What is a solar panel?",
});
Running the project shows this in the terminal:
formattedPrompt: System: You are an expert assistant providing precise answers based
strictly on the given context.
Context Guidelines:
- Answer only from the provided context.
- If no direct answer exists, clearly state "I cannot find a specific
answer in the provided documents".
- Prioritize accuracy over comprehensiveness.
- If context is partially relevant, explain the limitation.
- Cite document sources if multiple documents contribute to the answer.
CONTEXT: Instance of 'RunnableSequence<String, String>'
QUESTION: What is a solar panel?
Human: What is a solar panel?
Local RAG CLI Application
Type your question (or "quit" to exit):
>
If we call toChatMessages on the formatted prompt, we get it laid out the way we defined it in the ragPromptTemplate:
final formattedPrompt = ragPromptTemplate.formatPrompt({
"context": retriever.pipe(
Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
),
"question": "What is a solar panel?",
}).toChatMessages();
The output:
formattedPrompt: [SystemChatMessage{
content: You are an expert assistant providing precise answers based
strictly on the given context.
Context Guidelines:
- Answer only from the provided context.
- If no direct answer exists, clearly state "I cannot find a specific
answer in the provided documents".
- Prioritize accuracy over comprehensiveness.
- If context is partially relevant, explain the limitation.
- Cite document sources if multiple documents contribute to the answer.
CONTEXT: Instance of 'RunnableSequence<String, String>'
QUESTION: What is a solar panel?
,
}, HumanChatMessage{
content: ChatMessageContentText{
text: What is a solar panel?,
},
}]
Local RAG CLI Application
Type your question (or "quit" to exit):
This is really helpful during development when your LLM isn't set up yet and you want to see exactly what is being passed to it.
Adding Advanced Features
I'm sure you've realised by now that the LLM doesn't "remember" what happened in previous iterations of the loop. When we ask it What is hydro electric power?
Local RAG CLI Application
Type your question (or "quit" to exit):
> What is hydro electric power?
Thinking...
Hydroelectric power is a well-established renewable energy technology that generates electricity by utilizing the potential energy of water.
And then we ask how does it differ from wind energy?
Local RAG CLI Application
Type your question (or "quit" to exit):
> What is hydro electric power?
Thinking...
Hydroelectric power is a well-established renewable energy technology that generates electricity by utilizing the potential energy of water.
> how does it differ from wind energy?
Thinking...
I cannot find a specific answer in the provided documents. However, based on general knowledge, wind energy and wind power are often used interchangeably to describe the conversion of kinetic energy from wind into electrical power.
While the terms are commonly used, they may have slightly different connotations or applications. "Wind energy" typically refers to the overall renewable technology that harnesses wind as a source of power, encompassing various technologies such as wind turbines and offshore wind farms.
In contrast, "wind power" often specifically refers to the electricity generated by these wind energy technologies.
However, without specific context or further clarification, it is difficult to provide a more precise answer.
>
We can't get a proper response. This is because the context retrieved for the first question, which was just about hydroelectric power, doesn't contain any documents related to wind power.
However, I'd like to keep this tutorial from running too long, so check out this example: https://langchaindart.dev/#/expression_language/cookbook/retrieval?id=with-memory-and-returning-source-documents
I’ll write another article later where we add and use past conversations to guide the model’s responses.
Complete code
Here's the entire ollama_local_rag code:
import "dart:io";
import "package:langchain/langchain.dart";
import "package:langchain_chroma/langchain_chroma.dart";
import "package:langchain_community/langchain_community.dart";
import "package:langchain_ollama/langchain_ollama.dart";
void main() async {
// Initialize embeddings using Ollama
final embeddings = OllamaEmbeddings(model: "nomic-embed-text", keepAlive: 30);
// Initialize vector store (using Chroma in this example)
final vectorStore = Chroma(
embeddings: embeddings,
collectionName: "renewable_energy_technologies",
collectionMetadata: {
"description": "Documents related to renewable energy technologies",
},
);
final loader = DirectoryLoader(
"../renewable_energy_technologies",
glob: "*.txt",
);
final documents = await loader.load();
// Split documents
final textSplitter = RecursiveCharacterTextSplitter(
chunkSize: 1000,
chunkOverlap: 200,
);
final splitDocuments = textSplitter.splitDocuments(documents);
// Add documents to vector store
await vectorStore.addDocuments(documents: splitDocuments);
// Initialize chat model
final chatModel = ChatOllama(
defaultOptions: ChatOllamaOptions(
model: "gemma2",
temperature: 0,
keepAlive: 30,
),
);
// Create retriever
final retriever = vectorStore.asRetriever(
defaultOptions: VectorStoreRetrieverOptions(
searchType: VectorStoreSearchType.similarity(k: 5)),
);
// Create RAG prompt template
final ragPromptTemplate = ChatPromptTemplate.fromTemplates([
(
ChatMessageType.system,
"""
You are an expert assistant providing precise answers based
strictly on the given context.
Context Guidelines:
- Answer only from the provided context.
- If no direct answer exists, clearly state "I cannot find a specific
answer in the provided documents".
- Prioritize accuracy over comprehensiveness.
- If context is partially relevant, explain the limitation.
- Cite document sources if multiple documents contribute to the answer.
CONTEXT: {context}
QUESTION: {question}
"""
),
(ChatMessageType.human, "{question}"),
]);
// Runnable<T, RunnableOptions, T> logOutput<T extends Object>(String stepName) {
// return Runnable.fromFunction<T, T>(
// invoke: (input, options) {
// print('Output from step "$stepName":\n$input\n---');
// return Future.value(input);
// },
// stream: (inputStream, options) {
// return inputStream.map((input) {
// print('Chunk from step "$stepName":\n$input\n---');
// return input;
// });
// },
// );
// }
// final formattedPrompt = ragPromptTemplate.formatPrompt({
// "context": retriever.pipe(
// Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
// ),
// "question": "What is a solar panel?",
// }).toChatMessages();
// print("formattedPrompt: $formattedPrompt");
final ragChain = Runnable.fromMap<String>({
"context": retriever.pipe(
Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
),
"question": Runnable.passthrough<String>(),
}).pipe(ragPromptTemplate).pipe(chatModel).pipe(const StringOutputParser());
print("Local RAG CLI Application");
print("Type your question (or \"quit\" to exit):");
// CLI interaction loop
while (true) {
stdout.write("> ");
final userInput = stdin.readLineSync()?.trim();
if (userInput == null || userInput.toLowerCase() == "quit") {
break;
}
try {
print("\nThinking...\n");
final stream = ragChain.stream(userInput);
await for (final chunk in stream) {
stdout.write(chunk);
}
print("\n");
} catch (e) {
print("Error processing your question: $e");
}
}
print("\nThank you for using Local RAG CLI!");
}
Resources
Recommended reading
LangChain.dart docs
I really recommend that you check out the LangChain.dart docs by David here: https://langchaindart.dev. Most of the information in this article was pillaged from there.
Gemma 2 - Local RAG with Ollama and LangChain
For those who don’t mind watching videos, check out the Python tutorial this article was based upon:
You would learn a lot of important concepts in the LangChain ecosystem which you can apply to your projects using LangChain.dart.
Chroma
You can learn more about Chroma here: https://docs.trychroma.com
Ollama
And more about Ollama here: https://ollama.com
Conclusion
As you’ve realised, setting up a local Retrieval-Augmented Generation (RAG) system using Ollama's Gemma 2 and LangChain.dart is quite easy and it offers significant advantages in terms of privacy, cost, and flexibility. By leveraging local resources, you can maintain control over your data and experiment with different models and configurations without relying on external services.
As you continue to explore and enhance your RAG system, consider adding advanced features like memory and agents to improve the system's capabilities and user experience. You can also try deploying your Chroma database to the cloud with the help of this guide: https://docs.trychroma.com/deployment. This will allow you to embed and store larger amounts of documents for your production apps. Remember that your users can't connect to your http://localhost:8000 😊.
If you found this valuable, like and share this article with your developer friends. If you have any questions, you can leave a comment as well. There's still more coming, so subscribe to my newsletter to get more LangChain.dart tutorials.