Knowledge Bases

Knowledge bases give your agents access to your own documents through RAG (Retrieval-Augmented Generation). Drop files into a folder, point a knowledge base at it, and agents can search the indexed content when answering questions.

How It Works

You configure a knowledge base pointing to a folder of documents
MindRoom indexes the files into a vector database (ChromaDB) using an embedder
Agents assigned to that knowledge base get a search tool that queries the indexed documents
When the agent uses the tool, relevant document chunks are included in its context

Indexing (startup + file changes):

  ┌──────────────┐      ┌──────────┐      ┌──────────┐
  │ Files/Folder │ ───▶ │ Embedder │ ───▶ │ ChromaDB │
  └──────────────┘      └──────────┘      └──────────┘
         ▲
         │ file watcher
         │ git sync

Querying (agentic RAG):

  ┌───────┐  search   ┌──────────┐
  │ Agent │ ────────▶ │ ChromaDB │
  │       │ ◀──────── │          │
  └───────┘  chunks   └──────────┘
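
To make the two flows above concrete, here is a toy sketch of indexing and querying. It is not MindRoom's implementation: the bag-of-words `embed` function and the in-memory `ToyIndex` class are hypothetical stand-ins for the real embedder and ChromaDB, used only to show how chunks go in and ranked chunks come back.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedder: term counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyIndex:
    """In-memory stand-in for the vector database (hypothetical)."""

    def __init__(self) -> None:
        self._chunks: list[tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        # Indexing: embed each chunk once and store it with its vector.
        self._chunks.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 2) -> list[str]:
        # Querying: embed the query and return the k most similar chunks.
        q = embed(query)
        ranked = sorted(self._chunks, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


index = ToyIndex()
for chunk in [
    "deploy with docker compose",
    "vacation policy for employees",
    "docker image build steps",
]:
    index.add(chunk)

top = index.search("how do I deploy docker", k=2)
```

In the real system the agent calls a search tool rather than an object like `index`, and the returned chunks are added to its context.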

Quick Start

Add a knowledge base and assign it to an agent:

knowledge_bases:
  docs:
    path: ./knowledge_docs
    watch: true
    chunk_size: 5000
    chunk_overlap: 0

agents:
  assistant:
    display_name: Assistant
    role: A helpful assistant with access to our docs
    knowledge_bases: [docs]

Place files in ./knowledge_docs/ and they'll be indexed automatically on startup. With watch: true, new or modified files are re-indexed as soon as the change is detected.

Configuration

Basic Knowledge Base

knowledge_bases:
  my_docs:
    path: ./knowledge_docs/my_docs   # Folder containing documents
    watch: true                       # Auto-reindex on file changes
    chunk_size: 5000                  # Max characters per chunk
    chunk_overlap: 0                  # Overlap between adjacent chunks

Field	Type	Default	Description
`path`	string	`./knowledge_docs`	Folder path (relative to the config file directory or absolute)
`watch`	bool	`true`	Watch for filesystem changes and reindex automatically
`chunk_size`	int	`5000`	Maximum characters per chunk for text-like files (minimum: `128`)
`chunk_overlap`	int	`0`	Overlap characters between adjacent chunks (must be `< chunk_size`)
`git`	object	`null`	Optional Git repository sync settings

Use smaller chunk_size values when your embedding server has lower token or batch limits. If chunks are too large, indexing fails with 500 errors from the embedder, and retries fail the same way.
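
The effect of chunk_size and chunk_overlap can be illustrated with a minimal character-based splitter. This is a sketch under assumptions, not MindRoom's actual splitter, which may respect paragraph or sentence boundaries; the function names `validate_chunking` and `chunk_text` are hypothetical. Only the limits (minimum of 128, overlap smaller than chunk size) come from the table above.

```python
def validate_chunking(chunk_size: int, chunk_overlap: int) -> None:
    """Enforce the documented limits on the two chunking fields."""
    if chunk_size < 128:
        raise ValueError("chunk_size must be at least 128")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")


def chunk_text(text: str, chunk_size: int = 5000, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so adjacent chunks share chunk_overlap characters.
    """
    validate_chunking(chunk_size, chunk_overlap)
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(chr(65 + i % 26) for i in range(300))
no_overlap = chunk_text(sample, chunk_size=128, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=128, chunk_overlap=28)
```

A smaller chunk_size produces more, shorter chunks, which keeps each embedding request under the server's limits at the cost of more requests.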

Private Agent Knowledge

Use agents.<name>.private.knowledge when a single shared agent definition should index knowledge that is local to each requester, read from that requester's private root.

knowledge_bases:
  company_docs:
    path: ./company_docs
    watch: true

agents:
  mind:
    display_name: Mind
    role: A persistent personal AI companion
    model: sonnet
    private:
      per: user
      root: mind_data
      template_dir: ./mind_template
      knowledge:
        path: memory
        watch: true
    knowledge_bases: [company_docs]

With this configuration, each requester's private knowledge path becomes <their private root>/memory.

The template source is explicit, so you can see and edit the files being copied into each requester's private root.
private.template_dir only copies files.
Requester-local knowledge is enabled only when you explicitly configure private.knowledge.path.
private.knowledge.path must be relative to the private root; it cannot be absolute or escape the root with a .. segment.
private.knowledge.path can point to any folder inside the private root, including . (the private root itself).
MindRoom keeps a separate index per effective private root, so one requester's indexed data is not shared with another requester's runtime.
For isolating scopes such as user and user_agent, MindRoom refreshes the private index on access instead of keeping a background watcher alive for every requester root.
Git-backed knowledge still keeps the repository fresh when watch: false; that flag only disables filesystem watchers.
Top-level knowledge_bases remain the shared/global mechanism, so the same agent can combine private local knowledge with shared company knowledge.
This requester-local private knowledge flow applies to the normal agent runtime path, not the OpenAI-compatible /v1 API.
If you enable private.knowledge.git, use a dedicated subtree such as kb_repo.
Do not point Git-backed private knowledge at . or memory/, and do not use a Git checkout path that your template or private file memory also writes into.

Field	Type	Default	Description
`private.knowledge.enabled`	bool	`true`	Whether requester-local knowledge indexing is active for this agent
`private.knowledge.path`	string	`null`	Private-root-relative folder to index. Required when `private.knowledge.enabled` is `true`; set `enabled: false` to disable private knowledge
`private.knowledge.watch`	bool	`true`	Whether local filesystem changes should be watched. For isolating scopes, MindRoom refreshes on access instead of keeping a background watcher per requester root. Git sync still runs even when this is `false`
`private.knowledge.chunk_size`	int	`5000`	Maximum characters per indexed chunk
`private.knowledge.chunk_overlap`	int	`0`	Overlap characters between adjacent chunks. Must be smaller than `chunk_size`
`private.knowledge.git`	object	`null`	Optional Git sync configuration for requester-local knowledge. Git-backed private knowledge must use a dedicated subtree outside requester-writable memory/template content

Use private.knowledge when the data itself should be private to that requester's private instance. Use top-level knowledge_bases when the same documents should stay shared across agents or users.
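
The rules for private.knowledge.path described above (relative to the private root, never absolute, no escaping with a parent-directory segment) can be sketched as a small check. The function name `resolve_private_knowledge_path` and the example root are hypothetical, and MindRoom's real validation may differ in detail.

```python
from pathlib import Path, PurePosixPath


def resolve_private_knowledge_path(private_root: Path, relative: str) -> Path:
    """Return the folder to index for one requester, or raise on an invalid path."""
    rel = PurePosixPath(relative)
    if rel.is_absolute():
        raise ValueError("private.knowledge.path must be relative to the private root")
    if ".." in rel.parts:
        raise ValueError("private.knowledge.path must not escape the private root")
    # "." has no parts, so it resolves to the private root itself.
    return private_root.joinpath(*rel.parts)


root = Path("/data/private/alice/mind_data")  # hypothetical requester root
memory_dir = resolve_private_knowledge_path(root, "memory")
whole_root = resolve_private_knowledge_path(root, ".")
```

Because the root differs per requester, the same relative path yields a different folder, and therefore a separate index, for each requester.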

Multiple Knowledge Bases

You can define multiple knowledge bases and assign them to different agents:

knowledge_bases:
  engineering:
    path: ./knowledge_docs/engineering
    watch: true
    chunk_size: 5000
    chunk_overlap: 0
  product:
    path: ./knowledge_docs/product
    watch: true
    chunk_size: 5000
    chunk_overlap: 0
  legal:
    path: ./knowledge_docs/legal
    watch: false
    chunk_size: 1000
    chunk_overlap: 100

agents:
  developer:
    display_name: Developer
    role: Engineering assistant
    knowledge_bases: [engineering]

  pm:
    display_name: Product Manager
    role: Product planning assistant
    knowledge_bases: [product, engineering]  # Can access multiple bases

  compliance:
    display_name: Compliance
    role: Legal and compliance reviewer
    knowledge_bases: [legal]

When an agent has multiple knowledge bases, results are interleaved fairly so no single base dominates the top results.
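
One way to interleave results fairly is a round-robin merge: take the best hit from each base, then the second best from each, and so on. The exact strategy MindRoom uses is not specified here, so treat this as an illustrative assumption; `interleave_results` is a hypothetical name.

```python
from itertools import zip_longest


def interleave_results(per_base: dict[str, list[str]], limit: int) -> list[str]:
    """Merge ranked hits from several knowledge bases in round-robin order."""
    merged: list[str] = []
    # Each round holds the i-th hit of every base (None once a base runs out).
    for round_hits in zip_longest(*per_base.values()):
        merged.extend(hit for hit in round_hits if hit is not None)
    return merged[:limit]


hits = interleave_results(
    {"product": ["p1", "p2", "p3"], "engineering": ["e1"]},
    limit=3,
)
```

With this approach a base that returns many hits cannot push another base's best hit out of the top results.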

Git-Backed Knowledge Bases

Knowledge bases can sync from a Git repository. MindRoom clones the repo on first run and periodically pulls updates.

knowledge_bases:
  pipefunc_docs:
    path: ./knowledge_docs/pipefunc
    watch: false
    chunk_size: 1200
    chunk_overlap: 120
    git:
      repo_url: https://github.com/pipefunc/pipefunc
      branch: main
      poll_interval_seconds: 300
      skip_hidden: true
      include_patterns:
        - "docs/**"

Git Configuration Fields

Field	Type	Default	Description
`repo_url`	string	required	HTTPS repository URL to clone/fetch
`branch`	string	`main`	Branch to track
`poll_interval_seconds`	int	`300`	How often to check for updates (minimum: 5)
`credentials_service`	string	`null`	Service name in CredentialsManager for private repos
`skip_hidden`	bool	`true`	Skip files/folders starting with `.`
`include_patterns`	list	`[]`	Root-anchored glob patterns to include
`exclude_patterns`	list	`[]`	Root-anchored glob patterns to exclude

Sync Behavior

On startup, the repo is cloned (or fetched if it already exists)
Every poll_interval_seconds, MindRoom runs git fetch + git reset --hard origin/<branch>
Local uncommitted changes in the checkout folder are discarded on each sync
Only changed files are re-indexed (not the entire repo each time)
Deleted files are automatically removed from the index
Git polling runs regardless of the watch setting — watch controls only local filesystem events
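
The incremental part of the sync (re-index only changed files, drop deleted ones) can be sketched by comparing two snapshots of the checkout. This is a simplified illustration: the snapshot format (path mapped to a content digest) and the name `plan_reindex` are assumptions, not MindRoom internals.

```python
def plan_reindex(before: dict[str, str], after: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare two {path: content_digest} snapshots of a checkout.

    Returns (to_index, to_remove): files that are new or whose content
    changed, and files that no longer exist after the sync.
    """
    to_index = sorted(path for path, digest in after.items() if before.get(path) != digest)
    to_remove = sorted(path for path in before if path not in after)
    return to_index, to_remove


before_sync = {"docs/a.md": "1", "docs/b.md": "2", "docs/c.md": "3"}
after_sync = {"docs/a.md": "1", "docs/b.md": "9", "docs/d.md": "4"}
to_index, to_remove = plan_reindex(before_sync, after_sync)
```

Unchanged files such as docs/a.md appear in neither list, so they keep their existing index entries.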

File Filtering with Patterns

Patterns are matched from the repository root. * matches within a single path segment, while ** matches zero or more whole segments.

knowledge_bases:
  project_docs:
    path: ./knowledge_docs/project
    git:
      repo_url: https://github.com/org/project
      include_patterns:
        - "docs/**"                    # All files under docs/
        - "README.md"                  # Root README only
        - "content/posts/*/index.md"   # Specific nested files
      exclude_patterns:
        - "docs/internal/**"           # Exclude internal docs

If include_patterns is empty, all non-hidden files are eligible
If include_patterns is set, a file must match at least one pattern
exclude_patterns are applied last and remove matching files
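
The filtering rules above can be sketched as a segment-by-segment matcher. This is an approximation, not MindRoom's matcher: it uses Python's `fnmatch` for single segments (which also honors ? and [...]), and the order in which skip_hidden is checked relative to the include patterns is an assumption. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def _match(pattern: list[str], path: list[str]) -> bool:
    """Match pattern segments against path segments, anchored at the root."""
    if not pattern:
        return not path
    head, rest = pattern[0], pattern[1:]
    if head == "**":
        # "**" may consume zero or more leading path segments.
        return any(_match(rest, path[i:]) for i in range(len(path) + 1))
    # Any other segment must match exactly one path segment.
    return bool(path) and fnmatchcase(path[0], head) and _match(rest, path[1:])


def matches(pattern: str, path: str) -> bool:
    return _match(pattern.split("/"), path.split("/"))


def is_eligible(path: str, include: list[str], exclude: list[str], skip_hidden: bool = True) -> bool:
    """Decide whether a repository-relative file path should be indexed."""
    if skip_hidden and any(part.startswith(".") for part in path.split("/")):
        return False
    if include and not any(matches(p, path) for p in include):
        return False
    # Exclusions are applied last and always win.
    return not any(matches(p, path) for p in exclude)


include = ["docs/**", "README.md", "content/posts/*/index.md"]
exclude = ["docs/internal/**"]
```

For example, `is_eligible("docs/guide/intro.md", include, exclude)` is true, while `docs/internal/plan.md` is rejected by the exclude pattern and `sub/README.md` matches no include pattern because patterns are anchored at the root.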

Private Repository Authentication

For private HTTPS repositories, store credentials and reference them in the config.

Step 1: Store credentials via the API or Dashboard (Credentials tab):

curl -X POST http://localhost:8765/api/credentials/github_private \
  -H "Content-Type: application/json" \
  -d '{"credentials":{"username":"x-access-token","token":"ghp_your_token_here"}}'

Step 2: Reference the service name in your knowledge base config:

knowledge_bases:
  private_docs:
    path: ./knowledge_docs/private
    git:
      repo_url: https://github.com/org/private-repo
      credentials_service: github_private

Accepted credential fields:

Fields	Notes
`username` + `token`	Standard GitHub/GitLab access token auth
`username` + `password`	Basic HTTP auth
`api_key`	Uses `x-access-token` as username automatically
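
The accepted field combinations can be read as a small resolution step that turns a stored credentials object into a username and secret for HTTPS auth. The sketch below is illustrative only: the function name `resolve_git_auth` is hypothetical, and the precedence between combinations when several are present is an assumption.

```python
def resolve_git_auth(credentials: dict[str, str]) -> tuple[str, str]:
    """Return (username, secret) for HTTPS Git auth from stored credential fields."""
    username = credentials.get("username")
    if username and credentials.get("token"):
        return username, credentials["token"]
    if username and credentials.get("password"):
        return username, credentials["password"]
    if credentials.get("api_key"):
        # With only an API key, x-access-token is used as the username.
        return "x-access-token", credentials["api_key"]
    raise ValueError("no usable credential fields found")


auth = resolve_git_auth({"username": "x-access-token", "token": "ghp_your_token_here"})
```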

Embedder Configuration

Knowledge bases use the same embedder configured in the memory section:

memory:
  embedder:
    provider: openai        # or "ollama", "huggingface", or "sentence_transformers"
    config:
      model: text-embedding-3-small
      host: null             # For self-hosted (Ollama)
      dimensions: null       # Optional: embedding dimension override (e.g., 256)

Provider	Model Example	Notes
`openai`	`text-embedding-3-small`	Requires `OPENAI_API_KEY`
`ollama`	`nomic-embed-text`	Self-hosted, set `host` or `OLLAMA_HOST`
`sentence_transformers`	`sentence-transformers/all-MiniLM-L6-v2`	Fully local Python runtime; auto-installs the optional extra on first use

Storage

Knowledge data is stored under <storage_path>/knowledge_db/<sanitized_base_id>_<hash>/. Each knowledge base gets its own ChromaDB collection named mindroom_knowledge_<sanitized_base_id>_<hash>. The base ID is sanitized to alphanumerics, hyphens, and underscores, and the hash is a digest of the resolved knowledge path. For requester-private agent knowledge, the effective private-root path is part of that hash, so each requester-local root gets an isolated index.

The storage path defaults to mindroom_data/ next to your config.yaml, or can be set with MINDROOM_STORAGE_PATH.
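
The naming scheme can be sketched as follows. Several details are assumptions made for illustration and are not confirmed by this page: the digest algorithm (SHA-256), its truncation to 8 hex characters, and replacing disallowed characters with underscores. The function names are hypothetical.

```python
import hashlib
import os
import re
from pathlib import Path


def storage_root(config_file: Path) -> Path:
    """MINDROOM_STORAGE_PATH wins; otherwise use mindroom_data/ next to the config."""
    override = os.environ.get("MINDROOM_STORAGE_PATH")
    return Path(override) if override else config_file.parent / "mindroom_data"


def knowledge_names(base_id: str, knowledge_path: Path, config_file: Path) -> tuple[Path, str]:
    """Return (storage directory, collection name) for one knowledge base."""
    # Assumption: characters outside [A-Za-z0-9_-] are replaced with "_".
    sanitized = re.sub(r"[^A-Za-z0-9_-]", "_", base_id)
    # Assumption: SHA-256 of the resolved path, truncated to 8 hex characters.
    digest = hashlib.sha256(str(knowledge_path.resolve()).encode("utf-8")).hexdigest()[:8]
    suffix = f"{sanitized}_{digest}"
    return storage_root(config_file) / "knowledge_db" / suffix, f"mindroom_knowledge_{suffix}"
```

Because the resolved path feeds the hash, two requester-local roots that use the same base ID still map to different directories and collections.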

Dashboard Management

The web dashboard provides a Knowledge tab for managing knowledge bases without editing YAML:

Create, edit, and delete knowledge bases
Configure chunk size and overlap per knowledge base
Configure Git sync settings
Upload and remove files
Trigger a full reindex on demand
Monitor indexing status (file count vs. indexed count)
Assign knowledge bases to agents from the Agents tab

API Endpoints

See the Dashboard API reference for the full list of knowledge base endpoints (list, upload, delete, reindex, status).

Hot Reload

Knowledge base configuration supports hot reload. When you change config.yaml:

New knowledge bases are created and indexed
Removed knowledge bases are stopped and cleaned up
Changed settings (path, chunking, embedder, git config) trigger a re-initialization
Unchanged knowledge bases continue running without interruption
Background watchers are preserved across reloads for knowledge bases that actually run one
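
The reload rules above amount to a diff between the old and new knowledge_bases sections. A minimal sketch, assuming each knowledge base's settings are compared as plain dictionaries (the function name `diff_knowledge_bases` is hypothetical, and the real comparison also accounts for embedder changes outside this section):

```python
def diff_knowledge_bases(old: dict[str, dict], new: dict[str, dict]) -> dict[str, list[str]]:
    """Classify knowledge bases by what a config reload should do with them."""
    shared = old.keys() & new.keys()
    return {
        "created": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "reinitialized": sorted(k for k in shared if old[k] != new[k]),
        "unchanged": sorted(k for k in shared if old[k] == new[k]),
    }


old_config = {
    "docs": {"path": "./knowledge_docs", "chunk_size": 5000},
    "legal": {"path": "./knowledge_docs/legal", "chunk_size": 1000},
    "product": {"path": "./knowledge_docs/product", "chunk_size": 5000},
}
new_config = {
    "docs": {"path": "./knowledge_docs", "chunk_size": 5000},
    "legal": {"path": "./knowledge_docs/legal", "chunk_size": 2000},
    "engineering": {"path": "./knowledge_docs/engineering", "chunk_size": 5000},
}
plan = diff_knowledge_bases(old_config, new_config)
```

Here docs keeps running untouched, legal is re-initialized because its chunk size changed, product is cleaned up, and engineering is created and indexed.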