# Voice Messages

MindRoom can process voice messages sent to Matrix rooms, transcribing them and responding appropriately.
## Overview

When voice message handling is enabled:

1. Voice messages are detected in Matrix rooms
2. Audio is downloaded and decrypted (if E2E encrypted)
3. Audio is sent to an OpenAI-compatible speech-to-text (STT) service
4. The transcription is processed by an AI to recognize agent mentions and commands
5. The formatted message is sent to the room (prefixed with a microphone emoji)
6. The appropriate agent responds
## Configuration

Enable voice in `config.yaml`:

```yaml
voice:
  enabled: true
  stt:
    provider: openai
    model: whisper-1
    # Optional: custom endpoint (without /v1 suffix)
    # host: http://localhost:8080
  intelligence:
    model: default  # Model used for command recognition
```

Or use the dashboard's Voice tab.
## STT Providers

MindRoom uses the OpenAI-compatible transcription API. Any service that implements the `/v1/audio/transcriptions` endpoint will work.
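For reference, a transcription request against such an endpoint can be sketched with the official `openai` Python client. The server URL and file name below are placeholders, and this is illustrative rather than MindRoom's internal code:

```python
from openai import OpenAI

# Point the client at any OpenAI-compatible server. Note the SDK expects
# the /v1 prefix in base_url; MindRoom's `host` setting omits it.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumption: a local STT server
    api_key="unused-for-local",           # many self-hosted servers ignore this
)

# Upload the audio and request a transcription from /v1/audio/transcriptions.
with open("message.ogg", "rb") as audio:
    result = client.audio.transcriptions.create(model="whisper-1", file=audio)

print(result.text)  # the raw transcription, before command recognition
```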
### OpenAI Whisper (Cloud)

Requires the `OPENAI_API_KEY` environment variable.
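With the hosted API, a minimal config needs no `host` (this mirrors the example above, trimmed to the essentials):

```yaml
voice:
  enabled: true
  stt:
    provider: openai
    model: whisper-1
```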
### Self-Hosted Whisper

Use with faster-whisper-server or similar OpenAI-compatible STT servers.

Note: Do not include `/v1` in the host URL; MindRoom appends `/v1/audio/transcriptions` automatically.
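For example, pointing at a hypothetical server on port 8080:

```yaml
voice:
  enabled: true
  stt:
    provider: openai
    model: whisper-1
    host: http://localhost:8080  # no /v1 suffix
```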
### Custom API Key

For self-hosted solutions that require authentication:

```yaml
voice:
  enabled: true
  stt:
    provider: openai
    model: whisper-1
    host: http://localhost:8080
    api_key: your-custom-api-key
```

If `api_key` is not set, MindRoom falls back to the `OPENAI_API_KEY` environment variable.
## Command Recognition

The intelligence component uses an AI model to analyze transcriptions and format them properly:

- Agent mentions - Converts spoken agent names to `@agent` format
- Command patterns - Identifies and formats `!command` syntax
- Smart formatting - Handles speech recognition errors and natural language variations
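For example, a raw transcription like "hey assistant what's the weather" would end up in the room as `🎤 @assistant what's the weather?` (assuming an agent whose spoken name is recognized as "assistant").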
### Intelligence Model

The intelligence model processes raw transcriptions to recognize commands and agent names. You can specify a different model for faster or more accurate command recognition.
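For example (the model name here is purely illustrative; use any model your deployment is configured with):

```yaml
voice:
  intelligence:
    model: gpt-4o-mini  # example only; replace with a model your setup knows
```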
## How It Works

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Voice Msg  │────▶│ Download &  │────▶│ Transcribe  │────▶│ Format with │
│  (Audio)    │     │  Decrypt    │     │   (STT)     │     │  AI (LLM)   │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                                                                   │
                                                                   ▼
                                                            ┌─────────────┐
                                                            │ 🎤 Message  │
                                                            │   to Room   │
                                                            └─────────────┘
                                                                   │
                                                                   ▼
                                                            ┌─────────────┐
                                                            │    Agent    │
                                                            │  Responds   │
                                                            └─────────────┘
```
## Matrix Integration

Voice messages in Matrix are:

- Detected as `RoomMessageAudio` or `RoomEncryptedAudio` events
- Downloaded from the Matrix media server
- Decrypted if end-to-end encrypted (using the encryption key from the event)
- Saved temporarily as `.ogg` files for processing
- Sent to the STT service via the OpenAI-compatible API

The router agent handles all voice message processing to avoid duplicate transcriptions.
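The event classes above come from the matrix-nio library. A minimal sketch of how such a handler could be wired up with nio, assuming an already-logged-in `AsyncClient` (illustrative only, not MindRoom's actual code):

```python
from nio import AsyncClient, MatrixRoom, RoomMessageAudio

client = AsyncClient("https://matrix.example.org", "@router:example.org")

async def on_voice_message(room: MatrixRoom, event: RoomMessageAudio) -> None:
    # Fetch the audio from the Matrix media server via its mxc:// URI.
    response = await client.download(mxc=event.url)

    # Save it temporarily as an .ogg file for the STT service.
    with open("/tmp/voice-message.ogg", "wb") as f:  # path is an assumption
        f.write(response.body)

    # ...send the file to the OpenAI-compatible transcription endpoint,
    # then post the formatted, 🎤-prefixed text back to room.room_id.

# Register the handler for unencrypted audio events. Encrypted rooms
# surface RoomEncryptedAudio events instead, whose key material from
# the event is needed to decrypt the downloaded payload.
client.add_event_callback(on_voice_message, RoomMessageAudio)
```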
## Environment Variables

| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | For the OpenAI Whisper API (used as a fallback if no `api_key` is configured) |
## Text-to-Speech Tools

MindRoom also supports text-to-speech (TTS) through agent tools. These are separate from voice message transcription and allow agents to generate audio responses:

- OpenAI - Speech synthesis via the `openai` tool
- ElevenLabs - High-quality AI voices and sound effects via the `eleven_labs` tool
- Cartesia - Voice AI with optional voice localization via the `cartesia` tool
- Groq - Fast speech generation via the `groq` tool

See the Tools documentation for configuration details.
## Limitations

- Only OpenAI-compatible STT APIs are supported
- Audio quality and background noise affect transcription accuracy
## Tips

- Say the agent name first - "Hey @assistant, what's the weather?"
- Use display names - The AI converts spoken names like "HomeAssistant" to the correct `@home` mention