
RAG: The System Works (Terms and Conditions Apply)

Well, this is unexpected. I managed to build a functional basic RAG system. And it actually works. Who saw that coming?

What Do We Have Here

Meet Reference Chat — a local document search and chat assistant that lets you upload PDF and TXT files, then ask questions about them using a local language model. Everything runs locally, so the documents never leave my machine. No cloud dependencies, no privacy concerns, no additional cost - just me and my PDFs having a conversation.

I've noticed that the RAG system struggles with converted PDFs, e.g. when I convert PPTs to PDF. Most of the time, the text that the extractor and indexer receive is missing the content from embedded images, and I'm guessing that's the issue. I need to attend to it, most likely over the next weekend.

The Core Features

The Architecture (Or: How I Organized the Chaos)

The backend is built with FastAPI and handles the document processing, search, and chat endpoints. The frontend is a Streamlit app that provides a web interface for uploading, searching, and chatting with documents. Ollama handles local language model inference, and all document data and processing stay on my machine.

High-Level Setup

Backend (FastAPI)

backend/
├── main.py              # Application setup with lifespan management
├── settings.py          # Environment-based configuration
├── models.py            # Pydantic request/response models
├── logging_config.py    # Structured logging setup
├── config.py            # Backward compatibility layer
├── core/
│   ├── document_manager.py  # File handling, validation, text extraction
│   ├── indexer.py           # Semantic indexing and search
│   ├── model.py             # Ollama LLM integration
│   ├── utils.py             # Text processing utilities
│   └── state.py             # Application state management
└── api/
    ├── knowledge.py       # Document upload/management endpoints
    ├── search.py          # Search endpoints (keyword/semantic)
    └── chat.py            # RAG chat endpoints
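
To make the layout concrete, here's a minimal sketch of how main.py might wire these pieces together. The router variables, prefixes, and the state setup/teardown calls are assumptions for illustration, not the project's actual code.

```python
# backend/main.py (illustrative sketch; router wiring and state calls are assumed)
from contextlib import asynccontextmanager

from fastapi import FastAPI

from api import knowledge, search, chat
from core import state


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Set up shared application state (index, settings, model client)
    # before the first request, and clean up on shutdown.
    state.init()
    yield
    state.shutdown()


app = FastAPI(title="Reference Chat", lifespan=lifespan)

# Each concern lives in its own router, mirroring the api/ layout above.
app.include_router(knowledge.router, prefix="/knowledge", tags=["knowledge"])
app.include_router(search.router, prefix="/search", tags=["search"])
app.include_router(chat.router, prefix="/chat", tags=["chat"])
```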

Frontend (Streamlit)

frontend/
├── app.py               # Main Streamlit application
├── components/          # Reusable UI components (future)
└── pages/               # Multi-page app structure (future)
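
And a rough sketch of what app.py could look like on the frontend side; the backend URL and endpoint paths are assumptions, chosen only to show the upload-then-ask flow.

```python
# frontend/app.py (illustrative sketch; backend URL and endpoint paths are assumed)
import requests
import streamlit as st

BACKEND = "http://localhost:8000"  # assumed FastAPI address

st.title("Reference Chat")

# Upload a PDF/TXT document to the backend for indexing.
uploaded = st.file_uploader("Upload a document", type=["pdf", "txt"])
if uploaded is not None:
    resp = requests.post(
        f"{BACKEND}/knowledge/upload",
        files={"file": (uploaded.name, uploaded.getvalue())},
    )
    st.success(f"Indexed {uploaded.name}" if resp.ok else "Upload failed")

# Ask a question against the indexed documents.
question = st.text_input("Ask a question about your documents")
if question:
    resp = requests.post(f"{BACKEND}/chat", json={"question": question})
    st.write(resp.json().get("answer", "No answer returned"))
```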

The Tech Stack and Implementation Details

I ended up using:

  - FastAPI for the backend API
  - Pydantic for the request/response models
  - Streamlit for the frontend
  - Ollama running Llama3 for local inference

How It Actually Works

The pipeline follows that classic three-stage approach:

  1. Indexing: Upload a document, extract text, chunk it into overlapping sentences, generate embeddings, store in memory
  2. Retrieval: Ask a question, embed the query, calculate cosine similarity with all stored chunks, return top-k matches
  3. Generation: Feed the retrieved context to Llama3 via Ollama and get an answer with proper citations

The whole thing runs locally using Ollama, which means no API keys, no usage limits.
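
Here's a stripped-down sketch of that loop, assuming Ollama's REST API on its default port and a tiny in-memory index; the embedding model choice and the prompt wording are placeholders, not the project's actual indexer code.

```python
# Minimal RAG loop sketch (assumptions: Ollama running locally with llama3,
# exposing /api/embeddings and /api/generate; chunks come from the chunker).
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # The real system may use a dedicated embedding model; llama3 is a placeholder.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "llama3", "prompt": text})
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

# 1. Indexing: embed each overlapping chunk and keep it in memory.
chunks = ["chunk one of the document...", "chunk two..."]  # produced by the chunker
index = [(c, embed(c)) for c in chunks]

# 2. Retrieval: embed the query and take the top-k most similar chunks.
query = "What does the document say about X?"
q_vec = embed(query)
top_k = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:3]
context = "\n\n".join(c for c, _ in top_k)

# 3. Generation: hand the retrieved context to Llama3 via Ollama.
prompt = f"Answer using only this context and cite it:\n{context}\n\nQuestion: {query}"
answer = requests.post(f"{OLLAMA}/api/generate",
                       json={"model": "llama3", "prompt": prompt, "stream": False})
print(answer.json()["response"])
```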

The Surprising Parts

A few things caught me off guard:

What's Next

Now that I have a working Simple RAG system, the real fun begins. The roadmap includes:

But for now, I'm going to enjoy the rare feeling of having built something that works on the first major attempt. It won't last long, but I'll take the win.

Here is a link to the repository

#rag