Step-by-Step Guide to Local RAG with Ollama's Gemma 2 and LangChain.dart
Learn how to build a fully local RAG (Retrieval-Augmented Generation) system using LangChain.dart, Ollama and local embedding databases with Chroma
Table of contents
- Introduction to Local RAG
- Understanding the Components
- Setting Up Your Local RAG Environment
- Building the RAG System
- Running and Interacting with Your Local RAG
- Enhancing Our RAG System
- Resources
- Conclusion
Introduction to Local RAG
What is Retrieval-Augmented Generation (RAG)?
The core idea of RAG is to enhance the generative capabilities of a language model by incorporating external knowledge retrieved from a document store or database. This approach is very useful when the LLM alone may not have sufficient information to answer a user correctly, or when the response depends on proprietary information. Customer support is a good example, because you need the user’s data to provide the best support possible. Instead of a human support agent querying the user’s data and filtering it for the most relevant records, a RAG helper can fetch the relevant information based on the user’s description of the issue. All that’s left is for the support agent to read the information provided, as if they had queried a search engine.
Key Components of Local RAG
- Vector Store: A database that stores document embeddings (vector representations of documents). In this local RAG example, we are going to be using Chroma as the vector store.
- Embeddings Model: A model that converts text into vector representations. For this example, we are using the nomic-embed-text model from Ollama.
- LLM (Large Language Model): A generative model that produces responses based on the retrieved documents and the user query. The Gemma 2 model is perfect for this.
- Document Loader: A component that loads documents into the vector store. In the example, we will use DirectoryLoader to load text data.
- Retriever: A component that retrieves the most relevant documents from the vector store based on the user query.
- Prompt Template: A template that combines the retrieved documents and the user query to form a prompt for the LLM.
- Memory: (Optional) A component that maintains the conversation history to provide context for follow-up questions. We’re not going to be adding memory in this one.
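To make the vector store and retriever roles a bit more concrete, here’s a minimal sketch in plain Dart. It is not how Chroma works internally: ToyVectorStore and cosineSimilarity are hypothetical names, and the three-number embeddings are made up by hand rather than produced by nomic-embed-text. It only shows the idea of storing vectors and returning the texts closest to a query vector.

```dart
import 'dart:math';

/// Cosine similarity between two equal-length vectors.
double cosineSimilarity(List<double> a, List<double> b) {
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0;
  return dot / (sqrt(normA) * sqrt(normB));
}

/// Hypothetical in-memory stand-in for a vector store such as Chroma.
class ToyVectorStore {
  final Map<String, List<double>> _embeddings = {};

  /// Stores [text] together with its (hand-made) embedding.
  void add(String text, List<double> embedding) {
    _embeddings[text] = embedding;
  }

  /// Returns the [k] stored texts most similar to [queryEmbedding].
  List<String> search(List<double> queryEmbedding, {int k = 1}) {
    final entries = _embeddings.entries.toList()
      ..sort((x, y) => cosineSimilarity(queryEmbedding, y.value)
          .compareTo(cosineSimilarity(queryEmbedding, x.value)));
    return entries.take(k).map((e) => e.key).toList();
  }
}

void main() {
  final store = ToyVectorStore()
    ..add('Solar panels convert sunlight into electricity.', [0.9, 0.1, 0.0])
    ..add('Wind turbines capture kinetic energy.', [0.1, 0.9, 0.0])
    ..add('Hydropower uses flowing water.', [0.0, 0.2, 0.9]);

  // Made-up embedding for a question about solar power.
  print(store.search([0.8, 0.2, 0.1], k: 1));
}
```

In the real setup the embeddings model produces the vectors and Chroma does the storing and searching, so none of this code is needed in the project itself.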
Advantages of local setups vs. cloud-based RAG solutions
You'll find that there are many benefits to handling things locally, such as improved privacy and security. However, I believe the greatest advantages are the cost savings and flexibility that this approach provides. You can iterate on and improve your prompts, and switch out models if they’re not working as you’d want. I say this as someone who spends the whole day perfecting prompt templates: it can’t be overstated.
What is Local Retrieval-Augmented Generation?
In a local RAG setup, everything happens right on your computer, from getting documents to generating responses, without needing any external APIs or services. I briefly talked about one crucial aspect of it, embeddings, in my previous article. Go check it out if you haven’t; it might explain some basics that I fail to cover in this one.
Understanding the Components
Why Gemma 2 and Ollama?
Gemma 2 is one of the models available in Ollama’s model library, and it’s a 9 billion parameter model that can be run locally with fast responses. Ollama’s easy setup makes it all possible: as a Mac user, all I have to do is download the dmg, extract it and open the application. Installation instructions are here: https://github.com/ollama/ollama?tab=readme-ov-file. Once installed, it automatically serves the models you pull locally.
Chroma: Vector database explained
Chroma is an open-source embedding database designed to store and manage vector embeddings efficiently. It’s really useful for us since we need a fast, scalable vector similarity search for use in our Retrieval-Augmented Generation.
The key feature that concerns us is that it can run in client/server mode with the help of Docker or the Python client. This means we can test our implementation on our local computers and still be able to deploy it to a cloud provider like Google Cloud. It also supports collections, which can be configured with metadata and different distance functions, but we’re not going to be doing that. Once you load the documents, Chroma can filter queries by metadata and document contents, which makes searches more focused and relevant. LangChain is going to be making those queries though, so no need to stress about that.
Benefits of using Gemma 2 with Ollama and LangChain
If you don’t know, LangChain.dart provides a set of ready-to-use components for working with language models and a standard interface for chaining them together to build more advanced use cases. Integrating Gemma 2 with LangChain.dart is beneficial for two reasons: its unified API makes it easy to switch between different models without changing 99% of the application code, and it offers support for advanced use cases such as Retrieval-Augmented Generation (RAG).
Setting Up Your Local RAG Environment
Prerequisites
Hardware requirements for running local models
According to Ollama’s README.md, you should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models. The 9B Gemma 2 model requires more than 8 GB of RAM, so if you don’t have more than that, go with the 2B model instead.
Installing Gemma 2 and Ollama
Download Ollama for macOS here, Windows here and Linux with this:
curl -fsSL https://ollama.com/install.sh | sh
Now that you’re done with the installation open your favourite terminal and pull the model you want to use locally:
ollama pull gemma2
You can then chat with the model locally:
ollama run gemma2
If the model takes too long to respond, it could be that your computer’s specs aren’t good enough. You might need to pull the 2B model instead.
ollama pull gemma2:2b
If the response is still taking too long, you might have a better chance with the Llama 3.2 model instead; that’s the default model used by LangChain.dart’s ChatOllama instance. For those with 32 GB of RAM or more, Gemma 2’s 27B model exists if you want something really powerful.
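The rule of thumb above can be written down as a tiny helper. This is only a sketch: pickGemma2Tag is a hypothetical function, and the thresholds are the rough RAM figures mentioned in this section, not anything Ollama enforces.

```dart
/// Suggests an Ollama tag for Gemma 2 given the available RAM in GB.
/// The thresholds follow the rough guidance in this article and are
/// illustrative only.
String pickGemma2Tag(int ramGb) {
  if (ramGb >= 32) return 'gemma2:27b'; // 27B model
  if (ramGb > 8) return 'gemma2'; // default 9B model
  return 'gemma2:2b'; // 2B model
}

void main() {
  for (final ram in [8, 16, 32]) {
    print('$ram GB -> ollama pull ${pickGemma2Tag(ram)}');
  }
}
```

Treat the result as a starting point; actual response times also depend on your CPU or GPU and on what else is running.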
That’s all we need to do to use Ollama with LangChain. Let’s move on to Chroma next.
Chroma vector database installation
From the LangChain.dart docs, you can run a Chroma server in two ways:
Using Python client
The Python client supports spinning up a Chroma server easily:
pip install chromadb
chroma run --path /db_path
Using Docker
Otherwise, you can run the Chroma server using Docker:
docker pull chromadb/chroma
docker run -p 8000:8000 chromadb/chroma
By default, the Chroma client will connect to a server running on http://localhost:8000. To connect to a different server, pass the baseUrl parameter to the constructor.
I’m going to be using Docker since its setup is easier.
All you have to do is head to https://www.docker.com/ and download Docker Desktop. Once it’s installed, you can either run Chroma via the CLI like above or search for chromadb/chroma in Docker and run the first one you see with 1M downloads.
Open the Optional settings when the Run dialog appears and pass 8000 to the Host port.
The above is the same as this, but to run it via the CLI, make sure Docker is running first by opening the app.
docker run -p 8000:8000 chromadb/chroma
Dart development environment preparation
This project is going to be a simple command line application so:
Open your favourite text editor
Change your directory to your projects directory
Run
dart create ollama_local_rag
or use an existing Dart CLI project if you already have one set up.
We’re going to be working in the lib directory, so feel free to remove any code in the bin directory:
rm -r bin
And then change the directory to lib:
cd lib
- Now add the necessary dependencies to your project’s pubspec.yaml file:
dependencies:
  langchain: {version}
  langchain_community: {version}
  langchain_ollama: {version}
  langchain_chroma: {version}
You can also do this using the CLI, just copy and paste:
dart pub add langchain langchain_ollama langchain_chroma langchain_community
The langchain_community package provides support for document loaders like DirectoryLoader, which we’re going to use to load our documents en masse.
Building the RAG System
Embeddings generation
In your ollama_local_rag.dart file you should have this:
void main() async {
...
}
Let’s initialise an embeddings instance:
final embeddings = OllamaEmbeddings(model: "nomic-embed-text", keepAlive: 30);
From Ollama themselves:
nomic-embed-text is a large context length text encoder that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small performance on short and long context tasks.
The keepAlive parameter sets the duration (in minutes) to keep the model loaded in memory. By default this is 5 minutes, but since we’re probably going to want to play around for longer than that, I’m setting it to 30 minutes. You’re free to change it to whatever value you like.
Vector store integration
final vectorStore = Chroma(
  embeddings: embeddings,
  collectionName: "renewable_energy_technologies",
  collectionMetadata: {
    "description": "Documents related to renewable energy technologies",
  },
);
We’re passing a collectionName to help us manage and organise the embeddings more efficiently; by default it’s set to langchain. The collectionMetadata parameter allows us to associate additional metadata with the collection, which helps in filtering and querying the embeddings later on.
Document loading and preprocessing
final loader = DirectoryLoader(
  "../renewable_energy_technologies",
  glob: "*.txt",
);
final documents = await loader.load();
I’m using a glob pattern to match only text files; it defaults to all files when not set. DirectoryLoader can load both .txt and .json files.
Next we call loader.load(). This method calls lazyLoad() under the hood and loads all the documents into memory at once, so it’s best used with small amounts of data. lazyLoad() returns a Stream<Document>, which allows you to process each Document as it’s loaded.
We need some documents to work with, so I went ahead and created a few text documents about renewable energy technologies. First, create a directory at the same level as your lib directory.
So if you’re in the lib directory, take a step back up:
cd ..
Then create the directory:
mkdir renewable_energy_technologies
Preparing your knowledge base
You can get the text documents here: renewable_energy_technologies, after which you can download or copy/paste them into the renewable_energy_technologies directory.
Text splitting strategy
final textSplitter = RecursiveCharacterTextSplitter(
  chunkSize: 1000,
  chunkOverlap: 200,
);
final splitDocuments = textSplitter.splitDocuments(documents);
await vectorStore.addDocuments(documents: splitDocuments);
From LangChain.dart:
When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text.
We could use the other text splitter, CharacterTextSplitter, to split on each new line, but that would make it difficult for the LLM to maintain context and might produce uneven chunks.
Chunk size and overlap considerations
The renewable_energy_technologies documents have about 1000 characters each, which is why a chunkSize of 1000 was chosen. The chunkOverlap helps us maintain context as well as improve coherence. If the chunks were cut cleanly without any overlap, it would be difficult for the model to follow the continuity of the text.
Once the splitting is done, we call upon the Chroma vector store to add these documents for use later.
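To see what chunk size and overlap actually do, here’s a deliberately naive sketch in plain Dart. It is not the algorithm RecursiveCharacterTextSplitter uses (that one tries to split on separators such as paragraphs and sentences first); chunkWithOverlap is a hypothetical helper that just cuts fixed-size windows so the overlap is easy to see.

```dart
/// Splits [text] into chunks of at most [chunkSize] characters, where each
/// chunk starts with the last [chunkOverlap] characters of the previous one.
List<String> chunkWithOverlap(
  String text, {
  required int chunkSize,
  required int chunkOverlap,
}) {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw ArgumentError('Require 0 <= chunkOverlap < chunkSize');
  }
  final chunks = <String>[];
  final step = chunkSize - chunkOverlap; // how far each window advances
  for (var start = 0; start < text.length; start += step) {
    final end =
        (start + chunkSize < text.length) ? start + chunkSize : text.length;
    chunks.add(text.substring(start, end));
    if (end == text.length) break; // the last window reached the end
  }
  return chunks;
}

void main() {
  final chunks = chunkWithOverlap('abcdefghij', chunkSize: 4, chunkOverlap: 2);
  print(chunks); // [abcd, cdef, efgh, ghij]
}
```

Each chunk repeats the tail of the one before it, which is the same idea behind chunkOverlap: 200 with chunkSize: 1000, just at a scale you can read.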
Implementing the Retriever
final retriever = vectorStore.asRetriever(
  defaultOptions: VectorStoreRetrieverOptions(
    searchType: VectorStoreSearchType.similarity(k: 5),
  ),
);
From LangChain.dart:
A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) it. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
Configuring retrieval parameters
The search type here is similarity, but during testing you can try VectorStoreSearchType.mmr instead for more diversity. Using similarity means we get the most relevant results even if they’re redundant. Using mmr, on the other hand, means you’ll get results that are more diverse but still relevant to the search query. I want stricter relevance, which is why I’m using similarity.
k: 5 here means we get the top 5 most similar documents to the query. These are going to be used as the relevant context in the prompt soon.
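The difference between the two search types is easier to see with numbers. Below is a sketch in plain Dart of top-k similarity next to the usual greedy formulation of MMR (maximal marginal relevance). It is not LangChain.dart’s or Chroma’s implementation: topKBySimilarity, mmr and the lambda weight are my own illustrative names, and the two-number vectors are invented.

```dart
import 'dart:math';

/// Cosine similarity between two equal-length vectors.
double cosine(List<double> a, List<double> b) {
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0;
  return dot / (sqrt(normA) * sqrt(normB));
}

/// Indices of the [k] documents most similar to [query].
List<int> topKBySimilarity(
    List<double> query, List<List<double>> docs, int k) {
  final indices = List<int>.generate(docs.length, (i) => i)
    ..sort((x, y) => cosine(query, docs[y]).compareTo(cosine(query, docs[x])));
  return indices.take(k).toList();
}

/// Greedy MMR: each pick balances relevance to [query] against similarity
/// to the documents that were already picked.
List<int> mmr(List<double> query, List<List<double>> docs, int k,
    {double lambda = 0.5}) {
  final selected = <int>[];
  final remaining = List<int>.generate(docs.length, (i) => i);
  while (selected.length < k && remaining.isNotEmpty) {
    var bestIndex = remaining.first;
    var bestScore = double.negativeInfinity;
    for (final i in remaining) {
      var redundancy = 0.0;
      for (final s in selected) {
        redundancy = max(redundancy, cosine(docs[i], docs[s]));
      }
      final score = lambda * cosine(query, docs[i]) - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIndex = i;
      }
    }
    selected.add(bestIndex);
    remaining.remove(bestIndex); // removes the value, not the position
  }
  return selected;
}

void main() {
  // Documents 0 and 1 are near-duplicates; document 2 is different.
  final docs = [
    [1.0, 0.12],
    [1.0, 0.1],
    [0.5, 1.0],
  ];
  final query = [1.0, 0.4];
  print(topKBySimilarity(query, docs, 2)); // [0, 1]
  print(mmr(query, docs, 2)); // [0, 2]
}
```

Similarity returns both near-duplicates, while MMR swaps the second one for the less similar but non-redundant document.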
Integrating Gemma 2 LLM
Configuring model parameters: tokens, temperature, keep-alive settings
final chatModel = ChatOllama(
  defaultOptions: ChatOllamaOptions(
    model: "gemma2",
    temperature: 0,
    keepAlive: 30,
  ),
);
We’re creating an instance of the ChatOllama class to be used as the chat model in the chain. By default, llama3.2 is used as the model with a temperature of 0.8, and the keepAlive is 5 minutes, just like with the OllamaEmbeddings class. Here the temperature is set to 0 to keep the model from being creative, which gives consistent responses. If you’re testing on your own, you can raise it for more variation. It doesn’t affect what the retriever returns, only how much the model’s responses vary for the same retrieved context.
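If you’re wondering why a temperature of 0 makes responses consistent, here’s a rough sketch of how temperature is commonly applied when picking the next token. This is a general illustration, not Ollama’s or Gemma 2’s actual sampling code; softmaxWithTemperature and the example scores are assumptions of mine.

```dart
import 'dart:math';

/// Converts raw token scores (logits) into probabilities, scaled by
/// [temperature]. A temperature of 0 is treated as greedy decoding.
List<double> softmaxWithTemperature(List<double> logits, double temperature) {
  final maxLogit = logits.reduce(max);
  if (temperature <= 0) {
    // All probability goes to the highest-scoring token.
    final best = logits.indexOf(maxLogit);
    return [for (var i = 0; i < logits.length; i++) i == best ? 1.0 : 0.0];
  }
  // Subtracting the max keeps exp() numerically stable.
  final exps = [for (final l in logits) exp((l - maxLogit) / temperature)];
  final sum = exps.reduce((a, b) => a + b);
  return [for (final e in exps) e / sum];
}

void main() {
  final logits = [1.0, 3.0, 2.0]; // made-up scores for three tokens
  for (final t in [0.0, 0.8, 2.0]) {
    print('temperature $t -> ${softmaxWithTemperature(logits, t)}');
  }
}
```

The lower the temperature, the more the probability piles onto the top token, so the model keeps choosing the same continuation for the same prompt.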
Prompt engineering
final ragPromptTemplate = ChatPromptTemplate.fromTemplates([
  (
    ChatMessageType.system,
    """
You are an expert assistant providing precise answers based
strictly on the given context.
Context Guidelines:
- Answer only from the provided context.
- If no direct answer exists, clearly state "I cannot find a specific
answer in the provided documents".
- Prioritize accuracy over comprehensiveness.
- If context is partially relevant, explain the limitation.
- Cite document sources if multiple documents contribute to the answer.
CONTEXT: {context}
QUESTION: {question}
"""
  ),
  (ChatMessageType.human, "{question}"),
]);
This is a generic prompt for most RAG projects you might want to build. Separating the templates into roles helps the LLM understand what’s being asked of it. In this case, it gets the question from the user, as well as the context from the retriever. The retriever simply finds relevant documents based on the search query, which in this case is the question from the user.
The prompt template is like the glue that holds everything together in the RAG chain, since it holds all the information and tells the LLM what to do with it. It includes the placeholders {context} and {question}, which will be filled with actual values when the template is used.
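Filling the placeholders is conceptually just string substitution. Here’s a small sketch in plain Dart of that idea; fillTemplate is a hypothetical helper for illustration and not how ChatPromptTemplate is implemented.

```dart
/// Replaces every `{name}` placeholder in [template] with `values[name]`.
/// Throws if a placeholder has no matching value.
String fillTemplate(String template, Map<String, String> values) {
  return template.replaceAllMapped(RegExp(r'\{(\w+)\}'), (match) {
    final key = match.group(1)!;
    final value = values[key];
    if (value == null) {
      throw ArgumentError('Missing value for placeholder "$key"');
    }
    return value;
  });
}

void main() {
  const template = 'CONTEXT: {context}\nQUESTION: {question}';
  print(fillTemplate(template, {
    'context': 'Solar panels convert sunlight into electricity.',
    'question': 'What is a solar panel?',
  }));
}
```

The real template does the same job for each message in the list, then hands the filled-in messages to the chat model.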
Building the RAG Pipeline
final ragChain = Runnable.fromMap<String>({
  "context": retriever.pipe(
    Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
  ),
  "question": Runnable.passthrough<String>(),
}).pipe(ragPromptTemplate).pipe(chatModel).pipe(const StringOutputParser());
This chain involves retrieving relevant documents, formatting a prompt, invoking a chat model, and parsing the output. That’s what LangChain Expression Language (LCEL) is all about. From the LangChain.dart docs:
To make it as easy as possible to create custom chains, LangChain provides a
Runnable
interface that most components implement, including chat models, LLMs, output parsers, retrievers, prompt templates, and more.
Here's a detailed explanation of each part of the chain:
- Runnable.fromMap: This method creates a Runnable from a map of operations. Each key in the map corresponds to an input variable, and the value is a Runnable that processes that input. The input type is specified as a String, which is going to be the user’s question.
- context: The context key holds the context retrieved by the retriever we created previously. We then use the pipe() method to map the output of the retriever, which is going to be a list of documents, into a single string. This string contains the relevant content which the retriever picked out for the LLM to use when answering the question.
- question: Here we simply associate this key with a runnable that passes the input question through without any modification, thereby also outputting a string.
- .pipe(ragPromptTemplate): This pipes the output of the previous Runnable, which contains the context and the question, into the prompt template. It formats the context and question into a prompt suitable for the language model.
- .pipe(chatModel): This then pipes the formatted prompt to the LLM for it to generate a response based on the prompt.
- .pipe(const StringOutputParser()): This pipes the output of the chatModel into a StringOutputParser. The StringOutputParser simply converts any input into a String; here, the input is the output of the chatModel. The StringOutputParser extends BaseOutputParser, meaning it can take either a String or a ChatMessage as input.
In summary, this ragChain takes a question, retrieves relevant documents, formats them into a prompt, generates a response using a language model, and then parses the response into a string. Due to its modular design, we can swap out each step for an equivalent one and work on each step independently.
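If the pipe() idea still feels abstract, here’s a stripped-down sketch of it in plain Dart, with no LangChain.dart involved. Step and buildChain are hypothetical names, and the retriever and model are faked with hard-coded strings; the point is only that each step’s output becomes the next step’s input.

```dart
/// A pipeline step that turns an input of type [I] into an output of type [O].
class Step<I, O> {
  const Step(this.run);

  final O Function(I input) run;

  /// Returns a new step that feeds this step's output into [next].
  Step<I, R> pipe<R>(Step<O, R> next) =>
      Step<I, R>((input) => next.run(run(input)));
}

/// Builds a toy chain: gather inputs -> format prompt -> model -> parse.
Step<String, String> buildChain() {
  // Stand-in for the fromMap step: pairs the question with fake context.
  final gatherInputs = Step<String, Map<String, String>>((question) => {
        'context': 'Hydroelectric power uses the potential energy of water.',
        'question': question,
      });
  // Stand-in for the prompt template.
  final formatPrompt = Step<Map<String, String>, String>((vars) =>
      'CONTEXT: ${vars['context']}\nQUESTION: ${vars['question']}');
  // Stand-in for the chat model: it just echoes what it received.
  final fakeModel =
      Step<String, String>((prompt) => '  [model saw] $prompt  ');
  // Stand-in for the output parser: tidies the raw model output.
  final parseOutput = Step<String, String>((raw) => raw.trim());

  return gatherInputs.pipe(formatPrompt).pipe(fakeModel).pipe(parseOutput);
}

void main() {
  print(buildChain().run('What is hydroelectric power?'));
}
```

Swapping a step, say a different fake model, only requires that its input and output types line up with its neighbours, which is the same property that lets you swap models in the real chain.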
Running and Interacting with Your Local RAG
CLI Implementation
print("Local RAG CLI Application");
print("Type your question (or \"quit\" to exit):");
while (true) {
  stdout.write("> ");
  final userInput = stdin.readLineSync()?.trim();
  if (userInput == null || userInput.toLowerCase() == "quit") {
    break;
  }
  try {
    print("\nThinking...\n");
    final stream = ragChain.stream(userInput);
    await for (final chunk in stream) {
      stdout.write(chunk);
    }
    print("\n");
  } catch (e) {
    print("Error processing your question: $e");
  }
}
print("\nThank you for using Local RAG CLI!");
Command-line interaction flow
The above is a simple CLI interaction loop. We start with the welcome message and then use an infinite loop to keep chatting with the model until we want to quit. Without the loop, we would have to restart the program, and so re-compute the embeddings, each time we ask a question. We don’t have that many documents right now, but with larger ones that wouldn’t be wise.
Input processing and Response streaming
Once the input is valid, we call .stream on the ragChain to get the response from the LLM in chunks instead of all at once. You can change it to .invoke instead, which returns the whole string output at once through a Future.
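The difference between consuming chunks and waiting for the whole answer can be shown with a plain Dart Stream. This sketch does not call the chain at all: fakeAnswerStream is a hypothetical stand-in that emits three hard-coded chunks.

```dart
/// Hypothetical stand-in for a chain that emits its answer in chunks.
Stream<String> fakeAnswerStream() async* {
  for (final chunk in ['Hydro', 'electric ', 'power']) {
    yield chunk;
  }
}

Future<void> main() async {
  // Streaming: handle each chunk as soon as it arrives.
  final buffer = StringBuffer();
  await for (final chunk in fakeAnswerStream()) {
    buffer.write(chunk);
  }

  // All at once: wait for the whole stream and join it into one string.
  final whole = await fakeAnswerStream().join();

  print(buffer.toString() == whole); // true
  print(whole); // Hydroelectric power
}
```

Both approaches end up with the same text; streaming just lets the user start reading while the rest is still being generated.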
Running the thing
Just make sure you’re in the lib directory before you run:
dart run ollama_local_rag.dart
You should see this
Local RAG CLI Application
Type your question (or "quit" to exit):
>
I’m going to go ahead and ask it a question.
Local RAG CLI Application
Type your question (or "quit" to exit):
> What are the applications for hydrogen energy?
Thinking...
You should get this response:
Local RAG CLI Application
Type your question (or "quit" to exit):
> What are the applications for hydrogen energy?
Thinking...
According to the provided documents, the diverse uses of hydrogen energy include:
- Transportation sector
- Industrial processes
- Grid energy storage
- Residential and commercial heating
- Space exploration technologies.
>
This matches the relevant portion of hydrogen_energy.txt exactly, letting you know that it’s all local knowledge. You can turn off your internet to confirm 😊.
Enhancing Our RAG System
Debugging and Testing
Currently (as of 15th December, 2024), the documentation of LangChain.dart doesn’t have specific instructions for debugging, but there are still some things we can do. We can inspect the results of the intermediate steps before the final output is produced, using formatPrompt or the following helper function from LangChain.dart.
Runnable<T, RunnableOptions, T> logOutput<T extends Object>(String stepName) {
  return Runnable.fromFunction<T, T>(
    invoke: (input, options) {
      print('Output from step "$stepName":\n$input\n---');
      return Future.value(input);
    },
    stream: (inputStream, options) {
      return inputStream.map((input) {
        print('Chunk from step "$stepName":\n$input\n---');
        return input;
      });
    },
  );
}
Since this helper function is a runnable we can plug it into the chain and see the output of that particular step.
The context and question chain
final ragChain = Runnable.fromMap<String>({
  "context": retriever.pipe(
    Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
  ),
  "question": Runnable.passthrough<String>(),
})
    .pipe(logOutput("context and question"))
    .pipe(ragPromptTemplate)
    .pipe(chatModel)
    .pipe(const StringOutputParser());
Placing it right after the context and question runnable gives us the output below. I’ve truncated the response to make it more readable.
Local RAG CLI Application
Type your question (or "quit" to exit):
> What is biomass energy?
Thinking...
Chunk from step "context and question":
{question: What is biomass energy?}
---
Chunk from step "context and question":
{context: Document{id: 16fc0824-34b9-47eb-9885-c3a40b51fa65, pageContent: Biomass Energy: Converting Organic Matter to Power
Biomass energy represents a renewable technology that generates power by converting organic materials into usable energy forms. This document provides a comprehensive overview of biomass energy technologies.
1. Biomass Conversion Methods
Primary approaches to biomass energy generation:
- Direct combustion
- Gasification
- Pyrolysis
- Anaerobic digestion
- Fermentation technologies
- Waste management strategies, metadata: {extension: .txt, lastModified: 1734116256000, name: biomass_energy.txt, size: 1210, source: ../renewable_energy_technologies/biomass_energy.txt}}
Document{id: 87462af0-8046-414e-96ee-b22165ee6259, pageContent: Biomass Energy: Converting Organic Matter to Power
Biomass energy represents a renewable technology that generates power by converting organic materials into usable energy forms. This document provides a comprehensive overview of biomass energy technologies.
---
We can see the question passed to the first runnable, and the same goes for the context, which contains the documents that the retriever correctly fetched because of their similarity to the question.
The RAG chain
final ragChain = Runnable.fromMap<String>({
  "context": retriever.pipe(
    Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
  ),
  "question": Runnable.passthrough<String>(),
})
    .pipe(ragPromptTemplate)
    .pipe(logOutput("ragPromptTemplate"))
    .pipe(chatModel)
    .pipe(const StringOutputParser());
Placing this runnable after the ragPromptTemplate yields this:
Local RAG CLI Application
Type your question (or "quit" to exit):
> What's the impact of solar energy on the environment?
Thinking...
Chunk from step "ragPromptTemplate":
System: You are an expert assistant providing precise answers based
strictly on the given context.
Context Guidelines:
- Answer only from the provided context.
- If no direct answer exists, clearly state "I cannot find a specific
answer in the provided documents".
- Prioritize accuracy over comprehensiveness.
- If context is partially relevant, explain the limitation.
- Cite document sources if multiple documents contribute to the answer.
CONTEXT: Document{id: b81e4b8a-7930-4e7b-8a57-3fee568b41f1, pageContent: 3. Emerging Solar Technologies
Innovative approaches are pushing the boundaries of solar energy:
- Perovskite solar cells
- Organic photovoltaics
- Quantum dot solar cells
- Transparent solar panels
4. Applications
Solar energy is increasingly used in:
- Residential and commercial electricity generation
- Industrial process heat
- Agricultural irrigation
- Remote power systems
- Satellite and space technology
5. Global Impact
As of 2024, solar energy represents a critical component of global renewable energy strategies, with increasing efficiency and decreasing costs driving widespread adoption., metadata: {extension: .txt, lastModified: 1734116208000, name: solar_energy_overview.txt, size: 1500, source: ../renewable_energy_technologies/solar_energy_overview.txt}}
Document{id: d2a574f6-31e7-40a5-9215-e48155fc1f80, pageContent: 3. Emerging Solar Technologies
Innovative approaches are pushing the boundaries of solar energy:
- Perovskite solar cells
- Organic photovoltaics
- Quantum dot solar cells
- Transparent solar panels
QUESTION: What's the impact of solar energy on the environment?
Human: What's the impact of solar energy on the environment?
---
The template has been filled in just as I said above; this is what’s going to be passed to the chat model.
Using formatPrompt
We can take our prompt template and call formatPrompt on it like this:
final formattedPrompt = ragPromptTemplate.formatPrompt({
  "context": retriever.pipe(
    Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
  ),
  "question": "What is a solar panel?",
});
print("formattedPrompt: $formattedPrompt");
Running the project shows this in the terminal:
formattedPrompt: System: You are an expert assistant providing precise answers based
strictly on the given context.
Context Guidelines:
- Answer only from the provided context.
- If no direct answer exists, clearly state "I cannot find a specific
answer in the provided documents".
- Prioritize accuracy over comprehensiveness.
- If context is partially relevant, explain the limitation.
- Cite document sources if multiple documents contribute to the answer.
CONTEXT: Instance of 'RunnableSequence<String, String>'
QUESTION: What is a solar panel?
Human: What is a solar panel?
Local RAG CLI Application
Type your question (or "quit" to exit):
>
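Notice that CONTEXT shows Instance of 'RunnableSequence<String, String>': formatPrompt only fills in the values it is given and doesn’t run the retriever, so the runnable is inserted as its string representation. If we were to call toChatMessages on the formatted prompt, we would get it split into the messages we defined in the ragPromptTemplate.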
final formattedPrompt = ragPromptTemplate.formatPrompt({
  "context": retriever.pipe(
    Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
  ),
  "question": "What is a solar panel?",
}).toChatMessages();
print("formattedPrompt: $formattedPrompt");
The output:
formattedPrompt: [SystemChatMessage{
content: You are an expert assistant providing precise answers based
strictly on the given context.
Context Guidelines:
- Answer only from the provided context.
- If no direct answer exists, clearly state "I cannot find a specific
answer in the provided documents".
- Prioritize accuracy over comprehensiveness.
- If context is partially relevant, explain the limitation.
- Cite document sources if multiple documents contribute to the answer.
CONTEXT: Instance of 'RunnableSequence<String, String>'
QUESTION: What is a solar panel?
,
}, HumanChatMessage{
content: ChatMessageContentText{
text: What is a solar panel?,
},
}]
Local RAG CLI Application
Type your question (or "quit" to exit):
This is really helpful during development when your LLM is not set up yet and you want to see exactly what is being passed to it.
Adding Advanced Features
I’m sure you’ve realised by now that the LLM doesn’t “remember” what goes on in past loops. When we ask it What is hydro electric power?
Local RAG CLI Application
Type your question (or "quit" to exit):
> What is hydro electric power?
Thinking...
Hydroelectric power is a well-established renewable energy technology that generates electricity by utilizing the potential energy of water.
And then we ask how does it differ from wind energy?
Local RAG CLI Application
Type your question (or "quit" to exit):
> What is hydro electric power?
Thinking...
Hydroelectric power is a well-established renewable energy technology that generates electricity by utilizing the potential energy of water.
> how does it differ from wind energy?
Thinking...
I cannot find a specific answer in the provided documents. However, based on general knowledge, wind energy and wind power are often used interchangeably to describe the conversion of kinetic energy from wind into electrical power.
While the terms are commonly used, they may have slightly different connotations or applications. "Wind energy" typically refers to the overall renewable technology that harnesses wind as a source of power, encompassing various technologies such as wind turbines and offshore wind farms.
In contrast, "wind power" often specifically refers to the electricity generated by these wind energy technologies.
However, without specific context or further clarification, it is difficult to provide a more precise answer.
>
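We can’t get a proper response. This is because each question is handled independently: the chain keeps no record of the previous exchange, so the retriever searches using only the follow-up question, and neither it nor the model can tell that “it” refers to hydroelectric power.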
However, I’d like for this tutorial to not go on for too long so check this example out https://langchaindart.dev/#/expression_language/cookbook/retrieval?id=with-memory-and-returning-source-documents
I’ll write another article later where we add and use past conversations to guide the model’s responses.
Complete code
Here’s the entire ollama_local_rag.dart code:
import "dart:io";
import "package:langchain/langchain.dart";
import "package:langchain_chroma/langchain_chroma.dart";
import "package:langchain_community/langchain_community.dart";
import "package:langchain_ollama/langchain_ollama.dart";

void main() async {
  // Initialize embeddings using Ollama
  final embeddings = OllamaEmbeddings(model: "nomic-embed-text", keepAlive: 30);

  // Initialize vector store (using Chroma in this example)
  final vectorStore = Chroma(
    embeddings: embeddings,
    collectionName: "renewable_energy_technologies",
    collectionMetadata: {
      "description": "Documents related to renewable energy technologies",
    },
  );

  // Load every .txt file from the documents directory
  final loader = DirectoryLoader(
    "../renewable_energy_technologies",
    glob: "*.txt",
  );
  final documents = await loader.load();

  // Split documents
  final textSplitter = RecursiveCharacterTextSplitter(
    chunkSize: 1000,
    chunkOverlap: 200,
  );
  final splitDocuments = textSplitter.splitDocuments(documents);

  // Add documents to vector store
  await vectorStore.addDocuments(documents: splitDocuments);

  // Initialize chat model
  final chatModel = ChatOllama(
    defaultOptions: ChatOllamaOptions(
      model: "gemma2",
      temperature: 0,
      keepAlive: 30,
    ),
  );

  // Create retriever
  final retriever = vectorStore.asRetriever(
    defaultOptions: VectorStoreRetrieverOptions(
      searchType: VectorStoreSearchType.similarity(k: 5),
    ),
  );

  // Create RAG prompt template
  final ragPromptTemplate = ChatPromptTemplate.fromTemplates([
    (
      ChatMessageType.system,
      """
You are an expert assistant providing precise answers based
strictly on the given context.
Context Guidelines:
- Answer only from the provided context.
- If no direct answer exists, clearly state "I cannot find a specific
answer in the provided documents".
- Prioritize accuracy over comprehensiveness.
- If context is partially relevant, explain the limitation.
- Cite document sources if multiple documents contribute to the answer.
CONTEXT: {context}
QUESTION: {question}
"""
    ),
    (ChatMessageType.human, "{question}"),
  ]);

  // Debugging helper: uncomment to log the output of a step in the chain.
  // Runnable<T, RunnableOptions, T> logOutput<T extends Object>(String stepName) {
  //   return Runnable.fromFunction<T, T>(
  //     invoke: (input, options) {
  //       print('Output from step "$stepName":\n$input\n---');
  //       return Future.value(input);
  //     },
  //     stream: (inputStream, options) {
  //       return inputStream.map((input) {
  //         print('Chunk from step "$stepName":\n$input\n---');
  //         return input;
  //       });
  //     },
  //   );
  // }

  // Debugging helper: uncomment to inspect the formatted prompt.
  // final formattedPrompt = ragPromptTemplate.formatPrompt({
  //   "context": retriever.pipe(
  //     Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
  //   ),
  //   "question": "What is a solar panel?",
  // }).toChatMessages();
  // print("formattedPrompt: $formattedPrompt");

  // Build the RAG chain: retrieve context, fill the prompt, call the model,
  // then parse the output into a string
  final ragChain = Runnable.fromMap<String>({
    "context": retriever.pipe(
      Runnable.mapInput<List<Document>, String>((docs) => docs.join('\n')),
    ),
    "question": Runnable.passthrough<String>(),
  }).pipe(ragPromptTemplate).pipe(chatModel).pipe(const StringOutputParser());

  print("Local RAG CLI Application");
  print("Type your question (or \"quit\" to exit):");

  // CLI interaction loop
  while (true) {
    stdout.write("> ");
    final userInput = stdin.readLineSync()?.trim();
    if (userInput == null || userInput.toLowerCase() == "quit") {
      break;
    }
    try {
      print("\nThinking...\n");
      final stream = ragChain.stream(userInput);
      await for (final chunk in stream) {
        stdout.write(chunk);
      }
      print("\n");
    } catch (e) {
      print("Error processing your question: $e");
    }
  }
  print("\nThank you for using Local RAG CLI!");
}
Resources
Recommended reading
LangChain.dart docs
I really recommend that you check out the LangChain.dart docs by David here:
Most of the information in this article was pillaged from there.
Gemma 2 - Local RAG with Ollama and LangChain
For those who don’t mind watching videos, check out the Python tutorial this article was based upon:
You would learn a lot of important concepts in the LangChain ecosystem which you can apply to your projects using LangChain.dart.
Chroma
You can learn more about Chroma here:
Ollama
And more about Ollama here:
Conclusion
As you’ve realised, setting up a local Retrieval-Augmented Generation (RAG) system using Ollama's Gemma 2 and LangChain.dart is quite easy and it offers significant advantages in terms of privacy, cost, and flexibility. By leveraging local resources, you can maintain control over your data and experiment with different models and configurations without relying on external services.
As you continue to explore and enhance your RAG system, consider adding advanced features like memory and agents to improve the system's capabilities and user experience. You can also try deploying your Chroma database to the cloud with the help of this guide: https://docs.trychroma.com/deployment. This will allow you to embed and store larger amounts of documents for your production apps. Remember that your users can’t connect to your http://localhost:8000 😊.
If you found this valuable, like and share this article with your developer friends. If you have any questions, you can leave a comment as well. There’s still more coming, so subscribe to my newsletter to get more LangChain.dart tutorials.