Building a 100% Free On-Prem RAG System with Open Source LLMs, Embeddings, Pinecone, and n8n
After the last post on building a financial statement analyzer using OpenAI and n8n, many readers reached out with a common question:
“Can I build a similar RAG system without relying on OpenAI APIs or paid cloud services?”
The answer is — yes, absolutely.
In this tutorial, I’ll walk you through building a complete Retrieval-Augmented Generation (RAG) system entirely on-prem, using free and open-source tools. No API keys, no vendor lock-in, and no code required.
With the help of:
- n8n for orchestrating your workflow
- Pinecone as a vector database (free tier available)
- Ollama for running open-source LLMs and embedding models locally
- Windows Command Prompt for setup and automation
You’ll create a fully functional RAG pipeline that:
- Accepts documents
- Converts them to embeddings
- Stores and retrieves relevant context
- Answers user queries intelligently, all from your own machine
This project is perfect for developers, hobbyists, and enterprise teams who want more control, privacy, or cost-efficiency in their GenAI experiments.
What You'll Build
- A local RAG pipeline powered by Ollama models (nomic-embed-text:latest for embeddings, mistral-small3.1:latest for text generation)
- Integration with Pinecone for storing and retrieving vector embeddings
- n8n workflow automation to run the entire system without writing code
- A privacy-focused, cost-free solution for knowledge retrieval and generation, with all models running on your own machine
Components of the Stack
- n8n Workflow: Automates the entire RAG pipeline
- Document Chunking: Splits large documents into smaller pieces for better retrieval performance
- Embedding Model: Converts document chunks into vector embeddings. Model: nomic-embed-text:latest via Ollama
- Vector Database: Stores and retrieves embeddings efficiently. Tool: Pinecone for vector storage. If you want the vector database on-prem as well, you can use ChromaDB instead; let me know if you are interested and I can write a separate post on it.
- LLM: Generates responses based on the retrieved content. Model: mistral-small3.1:latest via Ollama
Step-by-Step Guide to Set Up Your RAG System
1. Install Ollama Locally
To get started, install Ollama and the required models on your Windows machine:
- Download and Install Ollama: Visit the official Ollama website (Ollama Download) to download the latest Windows version.
- Install the Models: Open a Command Prompt and run the following commands to download the necessary models:
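```
REM Pull the embedding model and the generation model used in this tutorial
ollama pull nomic-embed-text:latest
ollama pull mistral-small3.1:latest
```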
This will download and install both the nomic-embed-text:latest and mistral-small3.1:latest models.
2. Set Up Pinecone
Pinecone is a powerful vector database, and we will use it to store the embeddings generated from our documents. Follow these steps:
- Create a Pinecone Account: Go to Pinecone and create a free account.
- Create an Index: Use the Pinecone console to create a new index. Set the dimension to 768, which matches the output size of the nomic-embed-text:latest embedding model.
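If you prefer to script this step rather than click through the UI, the index can also be created via Pinecone's REST API. The snippet below is only a rough sketch: the index name rag-index and the cloud/region are assumptions, and the exact endpoint, headers, and payload can vary by plan and API version, so verify against Pinecone's current documentation.
```
REM Hypothetical example: create a 768-dimension serverless index named "rag-index"
curl -X POST https://api.pinecone.io/indexes ^
  -H "Api-Key: YOUR_PINECONE_API_KEY" ^
  -H "Content-Type: application/json" ^
  -d "{\"name\": \"rag-index\", \"dimension\": 768, \"metric\": \"cosine\", \"spec\": {\"serverless\": {\"cloud\": \"aws\", \"region\": \"us-east-1\"}}}"
```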
3. Set Up n8n
Next, install n8n to automate the workflow:
- Install n8n: Open a Command Prompt and install n8n globally using npm:
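```
REM Install n8n globally (requires Node.js to be installed first)
npm install -g n8n
```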
- Start n8n: Run the following command to start the n8n server:
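```
REM Start the n8n server (the editor UI listens on port 5678 by default)
n8n start
```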
After starting n8n, navigate to http://localhost:5678 in your browser to access the n8n UI.
4. Create and Automate the Workflow in n8n
Now that your environment is set up, create a workflow in n8n to automate the document processing and RAG pipeline.
- Load Document: Use the Google Drive node in n8n to load documents from a Google Drive folder.
- Chunk the Document: Use the Text Splitter node to divide large documents into smaller chunks (e.g., 1000 characters each). Smaller, focused chunks produce embeddings that retrieve more precisely.
- Generate Embeddings: Use an Embeddings Ollama node in n8n to call the Ollama API and generate embeddings for each document chunk. This sends each chunk to Ollama's nomic-embed-text:latest model, which returns a vector embedding (see the example request after this list).
- Store Embeddings in Pinecone: After generating the embeddings, store them in your Pinecone index for later retrieval, using the Pinecone node in n8n.
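For reference, the embedding step is essentially a local HTTP call to Ollama. A minimal sketch of roughly what that request looks like from the Command Prompt (the chunk text is a placeholder):
```
REM Ask the local Ollama server for an embedding of one chunk
curl http://localhost:11434/api/embeddings -d "{\"model\": \"nomic-embed-text:latest\", \"prompt\": \"<your chunk text here>\"}"
REM The response contains a 768-dimensional vector: {"embedding": [0.013, -0.067, ...]}
```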
Next, build a second workflow to chat with the newly built knowledge base.
- Query and Response Generation: When a user submits a query, embed the query, retrieve the top-k nearest document chunks from Pinecone, and pass them to Ollama's mistral-small3.1:latest model for response generation. Ollama will generate a response grounded in the retrieved document chunks (see the example request below).
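Under the hood, the generation step is another local Ollama call. A minimal sketch, with the retrieved chunks and the user question combined into a single prompt (both placeholders here):
```
REM Generate an answer that uses only the retrieved context
curl http://localhost:11434/api/generate -d "{\"model\": \"mistral-small3.1:latest\", \"stream\": false, \"prompt\": \"Answer the question using only the context below.\nContext: <retrieved chunks>\nQuestion: <user query>\"}"
```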
5. Expose the RAG Pipeline via Webhook (Optional)
To interact with the RAG system from an external application, you can expose the workflow via n8n's Webhook node:
- Use the Webhook node in n8n to listen for incoming HTTP requests with a query.
- Trigger the entire RAG workflow using the received query.
- Return the generated response to the user via the Webhook node.
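Once the workflow is activated, an external application can call it with a plain HTTP request. A sketch, assuming a hypothetical webhook path of rag-query (use whatever path you configure on the Webhook node):
```
REM Send a question to the RAG workflow's webhook
curl -X POST http://localhost:5678/webhook/rag-query -H "Content-Type: application/json" -d "{\"query\": \"What were the key findings in the report?\"}"
```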
Why Build This System On-Prem?
- Privacy: Your documents and models stay in your local environment; only the vector embeddings are sent to Pinecone.
- Cost-Free: You avoid recurring fees from paid APIs and cloud services.
- Customizable: Tailor the workflow, models, and retrieval process as needed.
- Offline Capability: The LLM and embedding models run entirely offline; swap Pinecone for a local vector database such as ChromaDB and the whole pipeline can run without internet access.
Final Thoughts
By leveraging n8n, Ollama, and Pinecone, you can build a fully functional, privacy-respecting RAG system on-prem. This approach eliminates the dependency on paid AI APIs and offers complete control over your data and infrastructure, all with a no-code setup.
✍️ Author’s Note
This blog reflects the author’s personal point of view — shaped by 22+ years of industry experience, along with a deep passion for continuous learning and teaching.
The content has been phrased and structured using Generative AI tools, with the intent to make it engaging, accessible, and insightful for a broader audience.