Your Local AI Assistant for Reading Documents

There’s a flood of RAG (Retrieval-Augmented Generation) tools on the market, but if you want to quickly see how the approach works in practice without wading through documentation, you’ve come to the right place. I didn’t get picky; I took the first setup I found, a combination of Kotaemon + Ollama, and fired it up. The result? A private AI assistant that runs 100% locally. Zero cloud.

It’s worth remembering, though, that this solution, tempting as it is for its privacy, has significant limitations. A local system like this often makes mistakes: offline LLMs tend to generate imprecise information, misinterpret data, and hallucinate, especially on complex queries. So we gain full control over our data, but we have to accept that reliability and analytical capabilities may be far from ideal.

How to run Kotaemon with Ollama in a few minutes

The kotaemon:main-ollama image has built-in support for Ollama, so all the models needed for the system to work can be downloaded inside the container.

docker run \
  --name kotaemon-local \
  -e GRADIO_SERVER_NAME=0.0.0.0 \
  -e GRADIO_SERVER_PORT=7860 \
  -v ./ktem_app_data:/app/ktem_app_data \
  -p 7860:7860 \
  -it --rm \
  ghcr.io/cinnamon/kotaemon:main-ollama
Bash
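
Before moving on, you can quickly check that the container actually started; the name kotaemon-local matches the --name flag used above:

docker ps --filter "name=kotaemon-local"
Bash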

Next, open a new terminal and connect to the running container:

docker exec -it kotaemon-local /bin/bash
Bash

To download the models, execute the following commands inside the container:

ollama pull qwen2.5:7b 
ollama pull nomic-embed-text
Bash

This will download the LLM and embedding models. The interface will be available at:

http://localhost:7860
Bash

Thanks to this simple process, the entire system works without any need for manual editing of configuration files.

If you want to try out other AI models, visit the official Ollama.com website to browse the available options. When you find a model that interests you, you can download it inside the container (using docker exec ...) with the command ollama pull <model-name>. Then, in the Kotaemon web interface, under the resource settings, you will be able to select the newly downloaded model as the default for generating responses or creating embeddings.
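
For example, pulling an additional model from the host might look like this (the model name is only an illustration; any model available on Ollama.com will do):

docker exec -it kotaemon-local ollama pull llama3.2:3b
Bash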

How the system works

Kotaemon is based on Retrieval-Augmented Generation (RAG) technology. Documents are divided into chunks, which are then converted into so-called embeddings—mathematical representations of the text’s meaning. These embeddings are placed in a local vector store (Chroma), allowing searches to be based on the meaning of the words, not just keywords.
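
If you are curious what an embedding looks like, you can call the Ollama API directly. This is only an illustrative sketch, not something Kotaemon asks you to do; run it inside the container, since the docker run command above publishes only port 7860 to the host:

# turn a sentence into an embedding vector (a long list of numbers)
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "Kotaemon stores document chunks as vectors."}'
Bash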

When you ask a question, Kotaemon converts it into an embedding and searches the database for the most semantically similar chunks. It then passes those chunks to the LLM, which turns them into an answer in natural language. The effect? A relevant, contextual response based solely on your documents.
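
The generation step can be sketched the same way. This is not Kotaemon’s internal code, just a rough illustration of what happens under the hood: the retrieved chunks are pasted into the prompt as context, and the LLM answers using only that context:

# ask the LLM a question, with retrieved chunks pasted in as context
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "Context: <retrieved chunks go here>\n\nQuestion: What does the document say about deadlines?\n\nAnswer using only the context above.",
  "stream": false
}'
Bash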

Why two models?

A RAG system uses two models because each has a different task. The embedding model (nomic-embed-text) understands the meaning of the text and converts it into numbers, which allows for precise fragment retrieval. The LLM (qwen2.5:7b) generates responses in natural language; it can write, translate, and summarize, but it is not optimized for comparing meaning.

One understands, the other speaks—this combination gives you both accurate content matching and natural, coherent answers.
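
If you want to confirm that both models are in place, you can list them inside the container:

docker exec -it kotaemon-local ollama list
Bash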

Why it’s worth it

The biggest advantage of this solution is privacy and full control. Everything runs locally, without the cloud and without sending data externally. There are no API fees or token limits.

You can tailor the models to your hardware—lightweight ones for quick tests or larger, more precise ones for working with a large number of documents. The whole system is fast, flexible, and works without unnecessary configuration.

Why it’s not always worth it

Despite its advantages, a local RAG is not the perfect solution for everyone. Firstly, it requires relatively powerful hardware, especially with larger LLMs. Secondly, local models have their limitations in terms of size and capabilities. They are smaller and less “intelligent” than full cloud-based models, so don’t expect perfect answers to very complex questions, which you will quickly see when you run the system.