Building a 100% Free On-Prem RAG System with Open Source LLMs, Embeddings, Pinecone, and n8n
After the last post on building a financial statement analyzer using OpenAI and n8n, many readers reached out with a common question:
“Can I build a similar RAG system without relying on OpenAI APIs or paid cloud services?”
The answer is — yes, absolutely.
In this tutorial, I’ll walk you through building a complete Retrieval-Augmented Generation (RAG) system entirely on-prem, using free and open-source tools. No API keys, no vendor lock-in, and no code required.
With the help of:
- n8n for orchestrating your workflow
- Pinecone as a vector database (free tier available)
- Ollama for running open-source LLMs and embedding models locally
- Windows Command Prompt for setup and automation
You’ll create a fully functional RAG pipeline that:
- Accepts documents
- Converts them to embeddings
- Stores and retrieves relevant context
- Answers user queries intelligently, all from your own machine
This project is perfect for developers, hobbyists, and enterprise teams who want more control, privacy, or cost-efficiency in their GenAI experiments.
What You'll Build
- A local RAG pipeline powered by Ollama models (nomic-embed-text:latest for embeddings, mistral-small3.1:latest for text generation)
- Integration with Pinecone for storing and retrieving vector embeddings
- n8n workflow automation to run the entire system without writing code
- A privacy-focused, cost-free solution for knowledge retrieval and generation, with all models running on your own machine
Components of the Stack
- n8n Workflow: Automates the entire RAG pipeline
- Document Chunking: Splits large documents into smaller pieces for better retrieval performance
- Embedding Model: Converts document chunks into vector embeddings. Model: nomic-embed-text:latest via Ollama
- Vector Database: Stores and retrieves embeddings efficiently. Tool: Pinecone for vector storage. If you want the vector database on-prem as well, you can use ChromaDB instead; let me know if you are interested and I can write a separate post on it.
- LLM: Generates responses based on the retrieved content. Model: mistral-small3.1:latest via Ollama
Step-by-Step Guide to Set Up Your RAG System
1. Install Ollama Locally
To get started, install Ollama and the required models on your Windows machine:
- Download and Install Ollama: Visit the official Ollama website (Ollama Download) to download the latest Windows version.
- Install the Models: Open a Command Prompt and run the following commands to download the necessary models:
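```
REM Pull the embedding model and the generation model used in this tutorial
ollama pull nomic-embed-text:latest
ollama pull mistral-small3.1:latest
```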
This will download and install both the nomic-embed-text:latest and mistral-small3.1:latest models.
2. Set Up Pinecone
Pinecone is a powerful vector database, and we will use it to store the embeddings generated from our documents. Follow these steps:
- Create a Pinecone Account: Go to Pinecone and create a free account.
- Create an Index: Use the Pinecone console to create a new index. Set the dimension to 768, which matches the output size of the nomic-embed-text:latest embedding model.
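If you prefer to script this step rather than click through the UI, the index can also be created via Pinecone's REST API. The snippet below is only a rough sketch: the index name rag-index and the cloud/region are assumptions, and the exact endpoint, headers, and payload can vary by plan and API version, so verify against Pinecone's current documentation.
```
REM Hypothetical example: create a 768-dimension serverless index named "rag-index"
curl -X POST https://api.pinecone.io/indexes ^
  -H "Api-Key: YOUR_PINECONE_API_KEY" ^
  -H "Content-Type: application/json" ^
  -d "{\"name\": \"rag-index\", \"dimension\": 768, \"metric\": \"cosine\", \"spec\": {\"serverless\": {\"cloud\": \"aws\", \"region\": \"us-east-1\"}}}"
```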
3. Set Up n8n
Next, install n8n to automate the workflow:
- Install n8n: Open a Command Prompt and install n8n globally using npm:
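```
REM Install n8n globally (requires Node.js to be installed first)
npm install -g n8n
```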
- Start n8n: Run the following command to start the n8n server:
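```
REM Start the n8n server (the editor UI listens on port 5678 by default)
n8n start
```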
After starting n8n, navigate to http://localhost:5678 in your browser to access the n8n UI.
4. Create and Automate the Workflow in n8n
Now that your environment is set up, create a workflow in n8n to automate the document processing and RAG pipeline.
- Load Document: Use the Google Drive node in n8n to load documents from a Google Drive folder.
- Chunk the Document: Use the Text Splitter node to divide large documents into smaller chunks (e.g., 1000 characters each). Smaller, focused chunks produce embeddings that retrieve more precisely.
- Generate Embeddings: Use an Embeddings Ollama node in n8n to call the Ollama API and generate embeddings for each document chunk. This sends each chunk to Ollama's nomic-embed-text:latest model, which returns a vector embedding (see the example request after this list).
- Store Embeddings in Pinecone: After generating the embeddings, store them in your Pinecone index for later retrieval, using the Pinecone node in n8n.
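For reference, the embedding step is essentially a local HTTP call to Ollama. A minimal sketch of roughly what that request looks like from the Command Prompt (the chunk text is a placeholder):
```
REM Ask the local Ollama server for an embedding of one chunk
curl http://localhost:11434/api/embeddings -d "{\"model\": \"nomic-embed-text:latest\", \"prompt\": \"<your chunk text here>\"}"
REM The response contains a 768-dimensional vector: {"embedding": [0.013, -0.067, ...]}
```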
Next, build a second workflow to chat with the newly built knowledge base.
- Query and Response Generation: When a user submits a query, embed the query, retrieve the top-k nearest document chunks from Pinecone, and pass them to Ollama's mistral-small3.1:latest model for response generation. Ollama will generate a response grounded in the retrieved document chunks (see the example request below).
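Under the hood, the generation step is another local Ollama call. A minimal sketch, with the retrieved chunks and the user question combined into a single prompt (both placeholders here):
```
REM Generate an answer that uses only the retrieved context
curl http://localhost:11434/api/generate -d "{\"model\": \"mistral-small3.1:latest\", \"stream\": false, \"prompt\": \"Answer the question using only the context below.\nContext: <retrieved chunks>\nQuestion: <user query>\"}"
```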
5. Expose the RAG Pipeline via Webhook (Optional)
To interact with the RAG system from an external application, you can expose the workflow via n8n's Webhook node:
- Use the Webhook node in n8n to listen for incoming HTTP requests with a query.
- Trigger the entire RAG workflow using the received query.
- Return the generated response to the user via the Webhook node.
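Once the workflow is activated, an external application can call it with a plain HTTP request. A sketch, assuming a hypothetical webhook path of rag-query (use whatever path you configure on the Webhook node):
```
REM Send a question to the RAG workflow's webhook
curl -X POST http://localhost:5678/webhook/rag-query -H "Content-Type: application/json" -d "{\"query\": \"What were the key findings in the report?\"}"
```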
Why Build This System On-Prem?
- Privacy: Your documents and models stay in your local environment; only the vector embeddings are sent to Pinecone.
- Cost-Free: You avoid recurring fees from paid APIs and cloud services.
- Customizable: Tailor the workflow, models, and retrieval process as needed.
- Offline Capability: The LLM and embedding models run entirely offline; swap Pinecone for a local vector database such as ChromaDB and the whole pipeline can run without internet access.
Final Thoughts
By leveraging n8n, Ollama, and Pinecone, you can build a fully functional, privacy-respecting RAG system on-prem. This approach eliminates the dependency on paid AI APIs and offers complete control over your data and infrastructure, all with a no-code setup.
✍️ Author’s Note
This blog reflects the author’s personal point of view — shaped by 22+ years of industry experience, along with a deep passion for continuous learning and teaching.
The content has been phrased and structured using Generative AI tools, with the intent to make it engaging, accessible, and insightful for a broader audience.