Voice Messages

MindRoom can process voice messages sent to Matrix rooms: it transcribes them, turns the text into a regular room message, and lets the addressed agent respond.

Overview

When voice message handling is enabled:

1. Voice messages are detected in Matrix rooms
2. The audio is downloaded and, if end-to-end encrypted, decrypted
3. The audio is sent to an OpenAI-compatible speech-to-text (STT) service
4. The transcription is processed by an AI model to recognize agent mentions and commands
5. The formatted message is sent to the room, prefixed with a microphone emoji (🎤)
6. The appropriate agent responds

Configuration

Enable voice in config.yaml:

voice:
  enabled: true
  stt:
    provider: openai
    model: whisper-1
    # Optional: custom endpoint (without /v1 suffix)
    # host: http://localhost:8080
  intelligence:
    model: default  # Model used for command recognition

Or use the dashboard's Voice tab.

STT Providers

MindRoom uses the OpenAI-compatible transcription API. Any service that implements the /v1/audio/transcriptions endpoint will work.
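As a rough illustration of what an OpenAI-compatible transcription request looks like, the sketch below builds (but does not send) a multipart POST using only the Python standard library. The file and model form fields and the Bearer authorization header follow the OpenAI transcription API. The function name build_transcription_request, the default file name, and the audio/ogg content type are assumptions made for this example, not MindRoom's actual code.

```python
import urllib.request
import uuid
from typing import Optional


def build_transcription_request(
    url: str,
    audio: bytes,
    model: str = "whisper-1",
    api_key: Optional[str] = None,
    filename: str = "voice.ogg",  # hypothetical default file name
) -> urllib.request.Request:
    """Build a multipart/form-data POST for an OpenAI-compatible STT endpoint."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        # Form field: model name
        f"--{boundary}".encode(),
        b'Content-Disposition: form-data; name="model"',
        b"",
        model.encode(),
        # Form field: the audio file itself
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: audio/ogg",
        b"",
        audio,
        # Closing boundary
        f"--{boundary}--".encode(),
        b"",
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_transcription_request(
    "http://localhost:8080/v1/audio/transcriptions",
    audio=b"fake-ogg-bytes",
    api_key="your-custom-api-key",
)
# urllib.request.urlopen(request) would send it; with the default JSON
# response format, the reply carries the transcription in a "text" field.
```

Any server that accepts a request of this shape at /v1/audio/transcriptions should be usable as an STT backend.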

OpenAI Whisper (Cloud)

voice:
  enabled: true
  stt:
    provider: openai
    model: whisper-1

Requires the OPENAI_API_KEY environment variable.

Self-Hosted Whisper

voice:
  enabled: true
  stt:
    provider: openai
    model: whisper-1
    host: http://localhost:8080

Note: Do not include /v1 in the host URL. MindRoom appends /v1/audio/transcriptions automatically.

This configuration works with faster-whisper-server or similar OpenAI-compatible STT servers.
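The effect of the /v1 rule can be seen in a small sketch of how the endpoint URL might be assembled from the host setting. The helper name transcription_url and the default host constant are assumptions for illustration; only the appended path comes from the note above.

```python
from typing import Optional

# Assumed default when no host is configured (the public OpenAI API).
OPENAI_DEFAULT_HOST = "https://api.openai.com"


def transcription_url(host: Optional[str] = None) -> str:
    """Append the transcription path to the configured host."""
    base = (host or OPENAI_DEFAULT_HOST).rstrip("/")
    return f"{base}/v1/audio/transcriptions"


good = transcription_url("http://localhost:8080")
bad = transcription_url("http://localhost:8080/v1")
print(good)  # http://localhost:8080/v1/audio/transcriptions
print(bad)   # http://localhost:8080/v1/v1/audio/transcriptions
```

The second call shows why a host that already ends in /v1 produces a doubled path segment.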

Custom API Key

For self-hosted solutions that require authentication:

voice:
  enabled: true
  stt:
    provider: openai
    model: whisper-1
    host: http://localhost:8080
    api_key: your-custom-api-key

If api_key is not set, MindRoom falls back to the OPENAI_API_KEY environment variable.
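The fallback order can be sketched as follows. The function name resolve_stt_api_key and the injectable env parameter are hypothetical and exist only to make the example testable; the precedence itself (configured key first, then OPENAI_API_KEY) is the behavior described above.

```python
import os
from typing import Mapping, Optional


def resolve_stt_api_key(
    configured_key: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Prefer the api_key from the stt config; otherwise use OPENAI_API_KEY."""
    env = os.environ if env is None else env
    return configured_key or env.get("OPENAI_API_KEY")


fake_env = {"OPENAI_API_KEY": "sk-from-environment"}
print(resolve_stt_api_key("your-custom-api-key", fake_env))  # your-custom-api-key
print(resolve_stt_api_key(None, fake_env))                   # sk-from-environment
print(resolve_stt_api_key(None, {}))                         # None
```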

Command Recognition

The intelligence component uses an AI model to analyze transcriptions and rewrite them so that agents and commands are addressed correctly:

Agent mentions - Converts spoken agent names to @agent format
Command patterns - Identifies and formats !command syntax
Smart formatting - Handles speech recognition errors and natural language variations
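To make the expected input and output of this step concrete, here is a deliberately simplified stand-in that uses a lookup table and regular expressions instead of an AI model. It is not how MindRoom performs command recognition; the function names, the display-name mapping, and the matching rules are all assumptions chosen for illustration, and it only covers agent mentions.

```python
import re
from typing import Dict

VOICE_PREFIX = "🎤"  # microphone emoji prepended to transcribed messages


def normalize(name: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def format_transcription(text: str, agents: Dict[str, str]) -> str:
    """Replace spoken display names with @mentions and add the voice prefix.

    `agents` maps a display name (e.g. "HomeAssistant") to an agent id ("home").
    """
    for display_name, agent_id in agents.items():
        # Allow optional spaces or hyphens between the characters of the name,
        # so "Home Assistant" and "home-assistant" both match "HomeAssistant".
        pattern = r"[\s-]*".join(re.escape(ch) for ch in normalize(display_name))
        if not pattern:
            continue  # nothing matchable in this display name
        text = re.sub(rf"\b{pattern}\b", f"@{agent_id}", text, flags=re.IGNORECASE)
    return f"{VOICE_PREFIX} {text}"


print(format_transcription("Home Assistant, turn off the lights", {"HomeAssistant": "home"}))
# 🎤 @home, turn off the lights
```

A language model can go further than this sketch, for example by correcting misheard names or recognizing commands from natural phrasing.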

Intelligence Model

The model used for command recognition is set under voice.intelligence:

voice:
  intelligence:
    model: default  # Uses the default model from your models config

To trade speed against accuracy in command recognition, set model to a different entry from your models config.

How It Works

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Voice Msg   │────▶│ Download &  │────▶│ Transcribe  │────▶│ Format with │
│ (Audio)     │     │ Decrypt     │     │ (STT)       │     │ AI (LLM)    │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                                                                  │
                                                                  ▼
                                                            ┌─────────────┐
                                                            │ 🎤 Message  │
                                                            │ to Room     │
                                                            └─────────────┘
                                                                  │
                                                                  ▼
                                                            ┌─────────────┐
                                                            │ Agent       │
                                                            │ Responds    │
                                                            └─────────────┘
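The flow in the diagram can be sketched as a small pipeline in which every stage is an injected callable. The VoicePipeline class, its field names, and the stand-in stages below are hypothetical; the sketch shows only the order of operations and stops at posting the message, after which the agent responds as it would to any other room message.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoicePipeline:
    """Hypothetical wiring of the stages in the diagram; each stage is injected."""

    download: Callable[[str], bytes]         # mxc:// URL -> raw bytes
    decrypt: Callable[[bytes, dict], bytes]  # ciphertext + key info -> audio
    transcribe: Callable[[bytes], str]       # audio -> raw transcription (STT)
    format_text: Callable[[str], str]        # raw text -> mentions/commands (LLM)
    send: Callable[[str], None]              # post the message to the room

    def handle(self, mxc_url: str, key_info: Optional[dict] = None) -> str:
        data = self.download(mxc_url)
        if key_info is not None:  # only encrypted events carry key material
            data = self.decrypt(data, key_info)
        message = f"🎤 {self.format_text(self.transcribe(data))}"
        self.send(message)
        return message


# Usage with stand-in stages (no network, no real STT or LLM):
sent = []
pipeline = VoicePipeline(
    download=lambda url: b"encrypted-audio",
    decrypt=lambda data, key: b"audio",
    transcribe=lambda audio: "assistant what's the weather",
    format_text=lambda text: "@assistant what's the weather?",
    send=sent.append,
)
pipeline.handle("mxc://example.org/abc123", key_info={"k": "base64-key"})
print(sent[0])  # 🎤 @assistant what's the weather?
```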

Matrix Integration

Voice messages in Matrix are:

1. Detected as RoomMessageAudio or RoomEncryptedAudio events
2. Downloaded from the Matrix media server
3. Decrypted if end-to-end encrypted (using the encryption key from the event)
4. Saved temporarily as .ogg files for processing
5. Sent to the STT service via the OpenAI-compatible API

The router agent handles all voice message processing to avoid duplicate transcriptions.
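A minimal sketch of three of these checks, working on raw event content dictionaries rather than on MindRoom's own event classes. In the Matrix specification, audio messages use the m.audio message type, unencrypted attachments carry a top-level url, and encrypted attachments carry a file object holding the URL and key material. The helper names and the agent name "router" are assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def audio_source(content: dict) -> Optional[Tuple[str, bool]]:
    """Return (mxc_url, is_encrypted) for an m.audio event content, else None."""
    if content.get("msgtype") != "m.audio":
        return None
    if "file" in content:  # encrypted attachment: URL and keys live under "file"
        return content["file"]["url"], True
    return content["url"], False


def should_transcribe(agent_name: str, router_name: str = "router") -> bool:
    """Only the router handles voice messages, so each one is transcribed once."""
    return agent_name == router_name


def save_temp_ogg(audio: bytes) -> Path:
    """Write audio to a temporary .ogg file; the caller deletes it afterwards."""
    with tempfile.NamedTemporaryFile(suffix=".ogg", delete=False) as handle:
        handle.write(audio)
        return Path(handle.name)


plain = {"msgtype": "m.audio", "url": "mxc://example.org/abc123"}
encrypted = {"msgtype": "m.audio", "file": {"url": "mxc://example.org/def456"}}
print(audio_source(plain))      # ('mxc://example.org/abc123', False)
print(audio_source(encrypted))  # ('mxc://example.org/def456', True)

path = save_temp_ogg(b"audio-bytes")
path.unlink()  # clean up once the STT request has finished
```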

Environment Variables

Variable          Description
OPENAI_API_KEY    API key for the OpenAI Whisper API (used as a fallback if no api_key is configured)

Text-to-Speech Tools

MindRoom also supports text-to-speech (TTS) through agent tools. These are separate from voice message transcription and allow agents to generate audio responses:

OpenAI - Speech synthesis via the openai tool
ElevenLabs - High-quality AI voices and sound effects via the eleven_labs tool
Cartesia - Voice AI with optional voice localization via the cartesia tool
Groq - Fast speech generation via the groq tool

See the Tools documentation for configuration details.

Limitations

Only OpenAI-compatible STT APIs are supported
Audio quality and background noise affect transcription accuracy

Tips

Say the agent name first - "Hey @assistant, what's the weather?"
Use display names - The AI converts spoken names like "HomeAssistant" to the correct @home mention