Image Messages

MindRoom can process images sent to Matrix rooms, passing them to vision-capable AI models for analysis.

Overview

When a user sends an image in a Matrix room:

The agent determines whether it should respond (via mention, thread participation, or DM)
The image is downloaded and decrypted (if E2E encrypted)
The image is wrapped as an agno.media.Image and passed to the AI model
The agent responds with its analysis

Image support works automatically for all agents -- no configuration is needed. The AI model must support vision (e.g., Claude, GPT-5.4).

Supported Formats

MindRoom detects image format from file byte signatures:

PNG
JPEG
GIF
WebP
BMP
TIFF

If the declared MIME type in the Matrix event does not match the detected byte signature, MindRoom logs a warning and uses the detected type.

How It Works

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Image Msg   │────>│ Download &  │────>│ Pass to AI  │
│ (Matrix)    │     │ Decrypt     │     │ Model       │
└─────────────┘     └─────────────┘     └─────────────┘
                                              │
                                              v
                                        ┌─────────────┐
                                        │ Agent       │
                                        │ Responds    │
                                        └─────────────┘

Usage

Send an image in a Matrix room and mention the agent in the caption:

With caption: @assistant What does this diagram show? -- the caption is used as the prompt
Without caption: The agent receives [Attached image] as the prompt and describes what it sees
Bare filename: If the body is just a filename (e.g., IMG_1234.jpg), it is treated the same as no caption

Images work in both direct messages and threads, and with both individual agents and teams.

Captions (MSC2530)

If the Matrix event's filename field differs from body, the body is used as a user caption. This follows MSC2530 semantics and works with clients that set the caption in the body.

Image Persistence

Images are saved under mindroom_data/attachments/ and mindroom_data/incoming_media/ and registered as attachment records with 30-day retention. In addition to being passed to the AI model as vision input, each image is also registered as an att_* attachment ID so agents can reference it via tool calls. See Attachments for details on retention and context scoping.

Encryption

Both unencrypted and E2E encrypted images are supported. Encrypted images are decrypted transparently using the key material from the Matrix event.

Caching

AI response caching is automatically skipped when images are present, since image payloads are large and unlikely to repeat.

Media Fallback

If a model rejects inline media (images, audio, video, or documents), MindRoom automatically retries the request without the inline media. The retried prompt includes [Inline media unavailable for this model] to inform the agent that attachments were dropped. Agents can still reference the files via attachment IDs and tools.

This fallback is transparent — no user action is required. It detects provider-specific error patterns such as unsupported media type, base64 field validation failures, and capability rejections.

Limitations

Routing in multi-agent rooms -- in multi-agent rooms without an @mention, the router selects the best agent based on the image caption.
Bridge mention detection uses m.mentions in the event, falling back to parsing HTML pills from formatted_body when m.mentions is absent (e.g., mautrix-telegram). Bridges that set neither may not trigger agent responses.
Model support -- the configured model must support vision. Text-only models will ignore the image or return an error. If the model rejects the image entirely, the media fallback retries without the inline image.