Web Scraping & Browser

Use these tools to read pages, extract article text, crawl websites, and drive either a local browser runtime or a hosted browser API.

What This Page Covers

This page documents the built-in tools in the web-scraping-and-browser group. Use these tools when you need lightweight text extraction, structured scraping APIs, or browser automation against live websites.

Tools On This Page

[crawl4ai] - Local Crawl4AI crawling with readable-text extraction and optional query-aware filtering.
[website] - Simple website reader and optional knowledge-base ingester.
[trafilatura] - Local page extraction, metadata extraction, HTML-to-text conversion, batch extraction, and focused crawling.
[newspaper] - News-article reader backed by the newspaper4k dependency.
[jina] - Jina Reader URL reading and optional web search with an optional API key.
[firecrawl] - Firecrawl API for scrape, crawl, map, and search jobs.
[spider] - Spider Cloud API for search, scrape, and crawl.
[scrapegraph] - ScrapeGraph AI extraction, markdown conversion, search scraping, and agentic crawling.
[apify] - Apify Actor runner that turns configured actors into tool functions.
[brightdata] - Bright Data scraping, screenshots, SERP queries, and feed endpoints.
[oxylabs] - Oxylabs Google search, Amazon data, and general web scraping.
[agentql] - AgentQL browser-assisted scraping with optional custom extraction queries.
[browserbase] - Browserbase-hosted browser sessions with remote navigation, screenshots, and page reads.
[browser] - MindRoom's local Playwright browser controller.
[web_browser_tools] - Host OS browser opener for launching a real browser tab or window.

Common Setup Notes

crawl4ai, website, trafilatura, newspaper, and web_browser_tools are the lowest-friction no-config options on this page.
firecrawl, browserbase, agentql, scrapegraph, apify, brightdata, and oxylabs are all credentialed tools that normally need stored credentials or SDK environment variables before they are useful.
spider also needs credentials in practice even though the current MindRoom metadata marks it as setup_type: none, because the installed spider-client raises when SPIDER_API_KEY is missing.
jina is the middle ground here, because the installed JinaReaderTools only adds an Authorization header when api_key is present, so public read_url() usage works without a key while authenticated plans can still set one.
browser is local Playwright automation, browserbase is a hosted browser API that you connect to over CDP, and web_browser_tools simply asks the host operating system to open a browser tab or window.
src/mindroom/api/integrations.py currently only exposes Spotify OAuth routes on this branch, so none of the tools on this page have a dedicated MindRoom OAuth flow.
Store password fields through the dashboard or credential store instead of inline YAML, and use environment variables such as FIRECRAWL_API_KEY, SPIDER_API_KEY, BROWSERBASE_API_KEY, BROWSERBASE_PROJECT_ID, AGENTQL_API_KEY, SGAI_API_KEY, APIFY_API_TOKEN, BRIGHT_DATA_API_KEY, OXYLABS_USERNAME, OXYLABS_PASSWORD, and JINA_API_KEY when you prefer SDK-native auth.
crawl4ai, agentql, browserbase, and browser also depend on a working browser runtime, and web_browser_tools only makes sense on a host that can open a real desktop browser.
Missing optional dependencies can auto-install at first use unless MINDROOM_NO_AUTO_INSTALL_TOOLS=1 is set.
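
The credential notes above can be turned into a quick preflight check. The sketch below is illustrative only: the environment variable names come from this page, but the REQUIRED_ENV mapping and the missing_env_vars() helper are hypothetical and are not part of MindRoom. It also only looks at environment variables, so a tool whose credentials live in the dashboard or credential store may be reported as missing even though it is configured.

```python
import os

# Environment variable names are taken from the setup notes on this page.
# The mapping and helper are hypothetical, not MindRoom code.
REQUIRED_ENV = {
    "firecrawl": ["FIRECRAWL_API_KEY"],
    "spider": ["SPIDER_API_KEY"],
    "browserbase": ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID"],
    "agentql": ["AGENTQL_API_KEY"],
    "scrapegraph": ["SGAI_API_KEY"],
    "apify": ["APIFY_API_TOKEN"],
    "brightdata": ["BRIGHT_DATA_API_KEY"],
    "oxylabs": ["OXYLABS_USERNAME", "OXYLABS_PASSWORD"],
}


def missing_env_vars(tool, environ=None):
    """Return the env vars a tool expects that are unset or empty.

    Tools without an entry (for example website or jina, whose key is
    optional) are treated as needing nothing.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV.get(tool, []) if not env.get(name)]


if __name__ == "__main__":
    for tool in sorted(REQUIRED_ENV):
        missing = missing_env_vars(tool)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{tool}: {status}")
```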

No-Config Scrapers

[`crawl4ai`]

crawl4ai is the best local option on this page when you want one tool that can fetch readable page content from one URL or a short URL list.

What It Does

crawl4ai exposes crawl(url, search_query=None). It accepts either one URL string or a list of URLs and returns readable extracted content for each one. When you pass search_query, the tool enables BM25-based content filtering to keep the extracted text focused on that query. When use_pruning is enabled without a query, the tool uses Crawl4AI pruning to trim noisy page content. The current implementation bypasses Crawl4AI cache for fresher reads and truncates the result to max_length when needed. This is a local crawler rather than a hosted API, so it does not need an API key, but it still needs a working browser runtime.
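
The filter selection and truncation described above can be summarized in a few lines. This is a sketch of the decision order only, not the toolkit's code: the helper names are hypothetical, and truncation is assumed to be a plain character slice with no ellipsis or marker appended.

```python
def choose_content_filter(search_query=None, use_pruning=False):
    """Mirror the documented precedence: a query wins over pruning."""
    if search_query:
        return "bm25"      # query-aware BM25 filtering
    if use_pruning:
        return "pruning"   # Crawl4AI pruning, only without a query
    return "none"          # no extra content filter


def truncate(text, max_length=5000):
    """Cut extracted text to max_length characters (assumed plain slice)."""
    return text if len(text) <= max_length else text[:max_length]
```

Note that setting use_pruning: true has no effect on calls that pass search_query, because the query-aware filter takes over for those calls.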

Configuration

Option	Type	Required	Default	Notes
`max_length`	`number`	`no`	`5000`	Maximum returned character count after extraction.
`timeout`	`number`	`no`	`60`	Crawl timeout in seconds.
`use_pruning`	`boolean`	`no`	`false`	Enable pruning-based cleanup when no `search_query` is provided.
`pruning_threshold`	`number`	`no`	`0.48`	Threshold passed to Crawl4AI pruning mode.
`bm25_threshold`	`number`	`no`	`1.0`	Threshold passed to BM25 filtering when `search_query` is used.
`headless`	`boolean`	`no`	`true`	Launch Crawl4AI's browser in headless mode.
`wait_until`	`text`	`no`	`domcontentloaded`	Playwright wait condition before extraction.
`proxy_config`	`object`	`no`	`null`	Raw browser proxy config passed into Crawl4AI `BrowserConfig`, while the current MindRoom metadata exposes this as text.
`enable_crawl`	`boolean`	`no`	`true`	Enable `crawl()`.
`all`	`boolean`	`no`	`false`	Enable the full upstream toolkit surface.

Example

agents:
  researcher:
    tools:
      - crawl4ai:
          max_length: 8000
          use_pruning: true
          wait_until: networkidle

crawl("https://matrix.org/blog/", search_query="bridges and federation")

Notes

Use crawl4ai when you want a local scraper instead of a hosted API.
proxy_config maps directly to Crawl4AI browser settings, so treat it as an advanced raw config object.
For heavily protected or browser-hostile sites, browserbase, brightdata, or browser can be a better fit.

[`website`]

website is the lightest built-in page reader on this page.

What It Does

With normal MindRoom YAML configuration, website exposes read_url(url) and returns JSON-serialized Document objects from MindRoom's WebsiteReader variant. That reader keeps Agno's crawl and document shape while filtering search UI, navigation, headers, footers, sidebars, hidden content, and modals before choosing the page text. If a Knowledge object is injected programmatically through the knowledge constructor argument, the tool exposes add_website_to_knowledge(url) instead of read_url(). That means the same registry entry can act either as a simple page reader or as a knowledge-base ingestion hook depending on how it is constructed. In normal hand-authored config.yaml, you should treat this as a quick page-reading tool.

Configuration

Option	Type	Required	Default	Notes
`knowledge`	`object`	`no`	`null`	Advanced programmatic `Knowledge` object injection that changes the tool surface from `read_url()` to `add_website_to_knowledge()`.

Example

agents:
  assistant:
    tools:
      - website

read_url("https://docs.mindroom.chat")

Notes

website is the simplest default when you just need to read one page.
The knowledge field is not normally set in hand-written YAML and is mainly useful in programmatic setups.
If you need metadata-only extraction, batch extraction, or crawling, trafilatura is usually a better fit.

[`trafilatura`]

trafilatura is the most capable local extractor on this page when you want text extraction, metadata, HTML conversion, and lightweight crawling from one toolkit.

What It Does

trafilatura exposes extract_text(), extract_metadata_only(), crawl_website(), html_to_text(), and extract_batch(). It fetches pages locally through Trafilatura and can return plain text, Markdown, JSON, XML, CSV, or HTML output depending on output_format. extract_metadata_only() returns metadata without full article text. extract_batch() loops over multiple URLs and returns one JSON payload with successes and failures. crawl_website() uses Trafilatura's focused spider support when that module is importable in the runtime. If the spider module is missing, the tool skips crawler registration instead of exposing a broken crawl function.
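
To make the extract_batch() behavior concrete, here is a minimal sketch of a batch loop that collects successes and failures into one JSON payload. It is not the toolkit's implementation: the function name batch_extract, the injected extract callable, and the payload keys (results, failed, total, succeeded) are all assumptions chosen for illustration.

```python
import json


def batch_extract(urls, extract):
    """Run extract(url) for each URL and report successes and failures.

    `extract` is any callable that returns extracted text, returns a falsy
    value when nothing could be extracted, or raises on fetch errors.
    """
    results, failed = {}, {}
    for url in urls:
        try:
            text = extract(url)
        except Exception as exc:  # keep going; one bad URL should not abort the batch
            failed[url] = str(exc)
            continue
        if text:
            results[url] = text
        else:
            failed[url] = "no content extracted"
    payload = {
        "results": results,
        "failed": failed,
        "total": len(urls),
        "succeeded": len(results),
    }
    return json.dumps(payload)
```

The point of the shape is that a single failing URL shows up under its own key instead of turning the whole call into an error.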

Configuration

Option	Type	Required	Default	Notes
`output_format`	`text`	`no`	`txt`	Default extraction format such as `txt`, `json`, `markdown`, `xml`, `csv`, or `html`.
`include_comments`	`boolean`	`no`	`true`	Include comment content in extracted output.
`include_tables`	`boolean`	`no`	`true`	Keep table content in extracted output.
`include_images`	`boolean`	`no`	`false`	Include image information where Trafilatura supports it.
`include_formatting`	`boolean`	`no`	`false`	Preserve formatting markers in extracted output.
`include_links`	`boolean`	`no`	`false`	Preserve links in extracted output.
`with_metadata`	`boolean`	`no`	`false`	Include metadata in extraction output.
`favor_precision`	`boolean`	`no`	`false`	Bias extraction toward precision.
`favor_recall`	`boolean`	`no`	`false`	Bias extraction toward recall.
`target_language`	`text`	`no`	`null`	Optional ISO 639-1 language filter such as `en` or `de`.
`deduplicate`	`boolean`	`no`	`false`	Deduplicate repeated content segments.
`max_tree_size`	`number`	`no`	`null`	Optional parser tree-size limit.
`max_crawl_urls`	`number`	`no`	`10`	Maximum URLs to visit when crawling.
`max_known_urls`	`number`	`no`	`100000`	Maximum discovered URLs to track while crawling.
`enable_extract_text`	`boolean`	`no`	`true`	Enable `extract_text()`.
`enable_extract_metadata_only`	`boolean`	`no`	`true`	Enable `extract_metadata_only()`.
`enable_html_to_text`	`boolean`	`no`	`true`	Enable `html_to_text()`.
`enable_extract_batch`	`boolean`	`no`	`true`	Enable `extract_batch()`.
`enable_crawl_website`	`boolean`	`no`	`true`	Enable `crawl_website()` when Trafilatura spider support is available.
`all`	`boolean`	`no`	`false`	Enable the full upstream toolkit surface.

Example

agents:
  analyst:
    tools:
      - trafilatura:
          output_format: markdown
          with_metadata: true
          include_links: true

extract_text("https://matrix.org/blog/", output_format="markdown")
extract_metadata_only("https://matrix.org/blog/")

Notes

trafilatura is the strongest no-key option when you want more than a plain page read.
crawl_website() depends on Trafilatura spider support in the runtime, so verify the crawler function exists if crawling matters to your workflow.
For news-article specific extraction with titles, authors, and summaries, newspaper can be a better fit.

[`newspaper`]

newspaper is the article-focused extractor for news pages and blog posts.

What It Does

newspaper exposes read_article(url). It returns JSON with whichever article fields were extracted successfully, including title, authors, text, publish date, and optional summary. article_length truncates article text after extraction. The registry name is newspaper, but the underlying module and dependency still come from newspaper4k. That means old references to newspaper4k are stale for current MindRoom config.

Configuration

Option	Type	Required	Default	Notes
`include_summary`	`boolean`	`no`	`false`	Include article summary when available.
`article_length`	`number`	`no`	`null`	Truncate article text to this many characters.
`enable_read_article`	`boolean`	`no`	`true`	Enable `read_article()`.
`all`	`boolean`	`no`	`false`	Enable the full upstream toolkit surface.

Example

agents:
  newsdesk:
    tools:
      - newspaper:
          include_summary: true
          article_length: 6000

read_article("https://matrix.org/blog/")

Notes

Use newspaper in tools:, not newspaper4k.
This tool is tuned for article-style pages rather than arbitrary websites.
For generic site crawling or metadata extraction across many URLs, use trafilatura or crawl4ai.

[`jina`]

jina wraps Jina Reader's read and search endpoints and is the easiest hosted option on this page when you want a reader whose API key is optional rather than required.

What It Does

jina exposes read_url(url) and, when enabled, search_query(query). read_url() prepends the target URL to base_url, which defaults to https://r.jina.ai/. search_query() posts the query to search_url, which defaults to https://s.jina.ai/. When search_query_content is false, the tool adds X-Respond-With: no-content to avoid returning full page text in search results. Returned content is truncated to max_content_length. The installed implementation only adds the Authorization header when an API key is present, so unauthenticated public-reader usage still works.
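
The request shape described above can be sketched as follows. The default URLs and the X-Respond-With header come from this page; the helper names are hypothetical, and the Bearer scheme for the Authorization header is an assumption about how the key is sent rather than something this page states.

```python
def build_read_request(url, api_key=None, base_url="https://r.jina.ai/"):
    """Return (request_url, headers) for a reader call.

    The target URL is appended to base_url, and auth is only added
    when a key is present.
    """
    headers = {}
    if api_key:
        # Assumption: the key is sent as a Bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return base_url + url, headers


def build_search_headers(api_key=None, search_query_content=True):
    """Return headers for a search call to search_url."""
    headers = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"  # same assumption as above
    if not search_query_content:
        # Ask the service to omit full page text from search results.
        headers["X-Respond-With"] = "no-content"
    return headers
```

With no key configured, both helpers return requests without an Authorization header, which is why unauthenticated reads still work.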

Configuration

Option	Type	Required	Default	Notes
`api_key`	`password`	`no`	`null`	Optional Jina API key, with `JINA_API_KEY` as the SDK fallback.
`base_url`	`url`	`no`	`https://r.jina.ai/`	Base URL for `read_url()`.
`search_url`	`url`	`no`	`https://s.jina.ai/`	Base URL for `search_query()`.
`max_content_length`	`number`	`no`	`10000`	Maximum returned character count.
`timeout`	`number`	`no`	`null`	Optional Jina timeout header in seconds.
`search_query_content`	`boolean`	`no`	`true`	Return full content in search results instead of metadata-only search summaries.
`enable_read_url`	`boolean`	`no`	`true`	Enable `read_url()`.
`enable_search_query`	`boolean`	`no`	`false`	Enable `search_query()`.
`all`	`boolean`	`no`	`false`	Enable the full upstream toolkit surface.

Example

agents:
  researcher:
    tools:
      - jina:
          enable_search_query: true
          search_query_content: false

read_url("https://matrix.org/blog/")
search_query("latest Matrix bridge updates")

Notes

jina works without a key for public reader endpoints, but a key is still useful for authenticated plans or rate limits.
The current MindRoom metadata marks this tool as requires_config, but the installed code treats authentication as optional.
Pick jina when you specifically want Jina Reader semantics instead of local extraction libraries.

API-Based Scrapers

[`firecrawl`]

firecrawl is the hosted scraper on this page that covers scrape, crawl, map, and search from one API.

What It Does

firecrawl exposes scrape_website(), crawl_website(), map_website(), and search_web(). formats is applied to scrape, crawl, and search requests. limit acts as the default result cap for crawl and search operations. poll_interval controls how often crawl jobs are polled. search_params is passed through to Firecrawl search calls as raw provider-specific options. The upstream tool falls back to FIRECRAWL_API_KEY when api_key is not provided directly.

Configuration

Option	Type	Required	Default	Notes
`api_key`	`password`	`yes`	`null`	Firecrawl API key, with `FIRECRAWL_API_KEY` as the SDK fallback.
`enable_scrape`	`boolean`	`no`	`true`	Enable `scrape_website()`.
`enable_crawl`	`boolean`	`no`	`false`	Enable `crawl_website()`.
`enable_mapping`	`boolean`	`no`	`false`	Enable `map_website()`.
`enable_search`	`boolean`	`no`	`false`	Enable `search_web()`.
`all`	`boolean`	`no`	`false`	Enable the full upstream toolkit surface.
`formats`	`string[]`	`no`	`null`	Requested Firecrawl formats such as `markdown` or `html`, while the current MindRoom metadata exposes this field as text.
`limit`	`number`	`no`	`10`	Default page or result limit for crawl and search.
`poll_interval`	`number`	`no`	`30`	Crawl polling interval in seconds.
`search_params`	`object`	`no`	`null`	Raw Firecrawl search parameters object, while the current MindRoom metadata exposes this field as text.
`api_url`	`url`	`no`	`https://api.firecrawl.dev`	Firecrawl API base URL.

Example

agents:
  research:
    tools:
      - firecrawl:
          enable_crawl: true
          enable_search: true
          limit: 5

scrape_website("https://matrix.org/blog/")
search_web("latest Matrix bridges")

Notes

Use firecrawl when you want scrape, crawl, map, and search in one hosted API.
formats and search_params are raw upstream arguments, so verify them against your Firecrawl plan and endpoint version.
This is usually a better fit than crawl4ai when you want provider-hosted crawling instead of local browser work.

[`spider`]

spider is Spider Cloud's search, scrape, and crawl toolkit for LLM-ready output.

What It Does

spider exposes search_web(query, max_results=5), scrape(url), and crawl(url, limit=None). The current wrapper calls Spider search with fetch_page_content: false, so search is primarily discovery rather than full-content extraction. scrape() and crawl() request Markdown-style output from Spider. optional_params is merged into Spider API requests as a raw provider options object. The installed spider-client constructor raises when no API key is available, even though the current MindRoom metadata says this tool is available without setup.

Configuration

Option	Type	Required	Default	Notes
`max_results`	`number`	`no`	`null`	Default result count override for `search_web()`.
`url`	`url`	`no`	`null`	Optional default URL constructor argument from the upstream toolkit.
`optional_params`	`object`	`no`	`null`	Raw Spider API parameters merged into search, scrape, and crawl requests, while the current MindRoom metadata exposes this field as text.
`enable_search`	`boolean`	`no`	`true`	Enable `search_web()`.
`enable_scrape`	`boolean`	`no`	`true`	Enable `scrape()`.
`enable_crawl`	`boolean`	`no`	`true`	Enable `crawl()`.
`all`	`boolean`	`no`	`false`	Enable the full upstream toolkit surface.

Example

agents:
  crawler:
    tools:
      - spider:
          max_results: 8
          enable_crawl: true

search_web("MindRoom Matrix setup", max_results=5)
scrape("https://matrix.org/blog/")

Notes

Treat spider as a credentialed tool and set SPIDER_API_KEY, even though the current MindRoom metadata still says setup_type: none.
optional_params is a raw provider object and is best used only when you already know the Spider API field names you want.
If you want a cleaner, explicitly credentialed hosted scraper with clearer metadata, firecrawl is usually simpler.

[`scrapegraph`]

scrapegraph is the prompt-driven extractor on this page for turning web pages into structured answers.

What It Does

scrapegraph exposes smartscraper(), markdownify(), crawl(), agentic_crawler(), searchscraper(), and scrape(). smartscraper() extracts structured data from one page based on a natural-language prompt. markdownify() returns a Markdown version of a page. crawl() applies a prompt plus JSON schema across a crawl. agentic_crawler() performs automated steps in the browser and can optionally run AI extraction over the resulting content. searchscraper() searches the web before extracting information. render_heavy_js only affects the low-level scrape() path.

Configuration

Option	Type	Required	Default	Notes
`api_key`	`password`	`yes`	`null`	ScrapeGraph API key, with `SGAI_API_KEY` as the SDK fallback.
`enable_smartscraper`	`boolean`	`no`	`true`	Enable `smartscraper()`.
`enable_markdownify`	`boolean`	`no`	`false`	Enable `markdownify()`.
`enable_crawl`	`boolean`	`no`	`false`	Enable `crawl()`.
`enable_searchscraper`	`boolean`	`no`	`false`	Enable `searchscraper()`.
`enable_agentic_crawler`	`boolean`	`no`	`false`	Enable `agentic_crawler()`.
`enable_scrape`	`boolean`	`no`	`false`	Enable raw `scrape()`.
`render_heavy_js`	`boolean`	`no`	`false`	Ask ScrapeGraph to render heavy JavaScript for `scrape()`.
`all`	`boolean`	`no`	`false`	Enable the full upstream toolkit surface.

Example

agents:
  extractor:
    tools:
      - scrapegraph:
          enable_searchscraper: true
          enable_agentic_crawler: true

smartscraper("https://matrix.org/blog/", "Extract the title, date, and three main points.")
markdownify("https://matrix.org/blog/")

Notes

If you disable enable_smartscraper without enabling all, the installed upstream toolkit auto-enables markdownify() so the tool still has a useful default surface.
Use scrapegraph when you want prompt-shaped extraction rather than generic page text.
For purely local extraction with no hosted API dependency, use crawl4ai or trafilatura.

[`apify`]

apify is the dynamic tool on this page, because its callable surface depends on which Actors you register.

What It Does

apify does not expose one fixed method like the other tools on this page. Instead, it reads the configured Actor IDs and registers one tool function per Actor at startup. Each generated tool uses the Actor's input schema to build parameters and returns that Actor's dataset items as JSON. Without configured actors, there is no practical tool surface. This is best thought of as a hosted Actor adapter rather than a single scraper API.

Configuration

Option	Type	Required	Default	Notes
`apify_api_token`	`password`	`yes`	`null`	Apify API token, with `APIFY_API_TOKEN` as the SDK fallback.
`actors`	`text`	`yes`	`null`	Actor ID string such as `apify/rag-web-browser`. The current MindRoom metadata also advertises comma-separated lists, but the installed upstream class treats a plain string as one Actor ID.

Example

agents:
  extractor:
    tools:
      - apify:
          actors: apify/rag-web-browser

Notes

actors is the important field here, because it determines which functions actually exist at runtime.
The current metadata advertises comma-separated Actor IDs, but the installed upstream constructor does not split plain strings, so the safest documented path on this branch is a single Actor ID.
Generated tool names are derived from the Actor ID, so check the runtime tool list if you need the exact callable name.
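
The string-versus-list behavior is easy to get wrong, so here is a small sketch of the normalization described above. The helper name actor_ids is hypothetical and this is not the upstream constructor; it only models the documented outcome that a plain string is never split on commas.

```python
def actor_ids(actors):
    """Normalize the `actors` setting into a list of Actor IDs.

    A plain string is treated as exactly one Actor ID, even if it
    contains commas. Only a real list yields several Actors.
    """
    if actors is None:
        return []
    if isinstance(actors, str):
        return [actors]  # no comma splitting
    return list(actors)
```

Under this model, a value such as "apify/rag-web-browser,apify/other-actor" becomes one Actor ID containing a comma, which is why a single Actor ID is the safest form in YAML.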

[`brightdata`]

brightdata is the hosted toolkit for markdown scraping, screenshots, SERP queries, and provider-specific web data feeds.

What It Does

brightdata exposes scrape_as_markdown(), get_screenshot(), search_engine(), and web_data_feed(). scrape_as_markdown() uses the configured web-unlocker zone and returns Markdown output. get_screenshot() returns a ToolResult with an image artifact instead of just raw text. search_engine() supports Google, Bing, and Yandex search through Bright Data's SERP infrastructure. web_data_feed() accesses Bright Data feed endpoints for supported source types. Zone selection is controlled by serp_zone and web_unlocker_zone, which can also be overridden by environment variables.
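
Zone resolution can be pictured as an environment lookup with the configured value as fallback. This is a sketch under the assumption that a non-empty environment variable always wins; resolve_zone is a hypothetical helper, while the variable names and defaults come from the configuration table for this tool.

```python
import os


def resolve_zone(configured, env_name, environ=None):
    """Return the zone to use: the env var if set and non-empty, else config."""
    env = os.environ if environ is None else environ
    return env.get(env_name) or configured


if __name__ == "__main__":
    print(resolve_zone("serp_api", "BRIGHT_DATA_SERP_ZONE"))
    print(resolve_zone("web_unlocker1", "BRIGHT_DATA_WEB_UNLOCKER_ZONE"))
```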

Configuration

Option	Type	Required	Default	Notes
`api_key`	`password`	`yes`	`null`	Bright Data API key, with `BRIGHT_DATA_API_KEY` as the SDK fallback.
`enable_scrape_markdown`	`boolean`	`no`	`true`	Enable `scrape_as_markdown()`.
`enable_screenshot`	`boolean`	`no`	`true`	Enable `get_screenshot()`.
`enable_search_engine`	`boolean`	`no`	`true`	Enable `search_engine()`.
`enable_web_data_feed`	`boolean`	`no`	`true`	Enable `web_data_feed()`.
`all`	`boolean`	`no`	`false`	Enable the full upstream toolkit surface.
`serp_zone`	`text`	`no`	`serp_api`	SERP zone, with `BRIGHT_DATA_SERP_ZONE` able to override it.
`web_unlocker_zone`	`text`	`no`	`web_unlocker1`	Web unlocker zone, with `BRIGHT_DATA_WEB_UNLOCKER_ZONE` able to override it.
`verbose`	`boolean`	`no`	`false`	Emit extra Bright Data request logging.
`timeout`	`number`	`no`	`600`	Timeout in seconds.

Example

agents:
  research:
    tools:
      - brightdata:
          enable_web_data_feed: false
          timeout: 300

scrape_as_markdown("https://matrix.org/blog/")
search_engine("Matrix hosting", engine="google", num_results=5)

Notes

brightdata is a better fit than firecrawl when screenshots and feed endpoints matter.
Zone environment variables can override the inline config values, so document your deployment defaults if multiple zones exist.
get_screenshot() returns an image artifact rather than a file path string, which is useful for agents that need to hand the screenshot to a model immediately.

[`oxylabs`]

oxylabs is the e-commerce and SERP-oriented scraper on this page.

What It Does

oxylabs exposes search_google(), get_amazon_product(), search_amazon_products(), and scrape_website(). It uses the Oxylabs realtime client for Google and Amazon scraping rather than a generic HTML fetch path. search_google() returns parsed organic results with title, URL, description, and position. The Amazon functions expose both product-detail and product-search workflows. scrape_website() is the generic fallback when you just want one URL scraped. This tool is credentialed with a username and password pair rather than one API key.

Configuration

Option	Type	Required	Default	Notes
`username`	`text`	`yes`	`null`	Oxylabs username, with `OXYLABS_USERNAME` as the SDK fallback.
`password`	`password`	`yes`	`null`	Oxylabs password, with `OXYLABS_PASSWORD` as the SDK fallback.

Example

agents:
  commerce:
    tools:
      - oxylabs

search_google("Matrix hosting", domain_code="com")
search_amazon_products("ergonomic keyboard", domain_code="com")

Notes

oxylabs needs both username and password, so it is not a single-key setup like firecrawl or brightdata.
Use domain_code to switch between regional Google and Amazon domains.
Pick oxylabs when Google SERP plus Amazon data matters more than generic website crawling.

Browser Tools

[`agentql`]

agentql is the browser-assisted extractor for sites where you want AgentQL queries rather than plain text scraping.

What It Does

agentql exposes scrape_website(url) and, when enabled, custom_scrape_website(url). scrape_website() uses a built-in query that extracts generic page text. custom_scrape_website() only becomes useful when agentql_query is non-empty. The installed upstream toolkit registers the custom scrape function automatically when agentql_query is set, even if enable_custom_scrape_website is false. The current upstream implementation launches Playwright with headless=False, which matters on headless-only runtimes.
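
The registration rule above is easier to see as code. The sketch below models which functions end up exposed for a given configuration; agentql_surface and the enable_all parameter (standing in for the all option) are hypothetical names, and the exact upstream condition may differ in details such as whitespace-only queries.

```python
def agentql_surface(
    enable_scrape_website=True,
    enable_custom_scrape_website=False,
    agentql_query="",
    enable_all=False,
):
    """Return the function names exposed for a given configuration."""
    names = []
    if enable_scrape_website or enable_all:
        names.append("scrape_website")
    # A non-empty query registers the custom function even when the
    # explicit enable flag is left at false.
    if enable_custom_scrape_website or enable_all or agentql_query:
        names.append("custom_scrape_website")
    return names
```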

Configuration

Option	Type	Required	Default	Notes
`api_key`	`password`	`yes`	`null`	AgentQL API key, with `AGENTQL_API_KEY` as the SDK fallback.
`enable_scrape_website`	`boolean`	`no`	`true`	Enable `scrape_website()`.
`enable_custom_scrape_website`	`boolean`	`no`	`false`	Enable `custom_scrape_website()`, which is only useful when `agentql_query` is non-empty.
`all`	`boolean`	`no`	`false`	Enable the full upstream toolkit surface.
`agentql_query`	`text`	`no`	`""`	Custom AgentQL query used by `custom_scrape_website()`.

Example

agents:
  extractor:
    tools:
      - agentql:
          agentql_query: |
            {
              title
              links[]
            }

scrape_website("https://matrix.org/blog/")
custom_scrape_website("https://matrix.org/blog/")

Notes

The installed upstream code launches Playwright with headless=False, so this tool may need a GUI-capable runtime or virtual display.
Setting agentql_query is enough to register the custom scrape function on this branch.
Use agentql when you want AgentQL query semantics rather than a generic readable-text scraper.

[`browserbase`]

browserbase is the hosted browser session tool for navigation, screenshots, and page-content reads over a remote browser.

What It Does

browserbase exposes navigate_to(), screenshot(), get_page_content(), and close_session(), plus async variants for async agent execution. The tool auto-creates a Browserbase session, stores its connect_url, and connects to it over Playwright CDP. get_page_content() returns visible cleaned text when parse_html is true and raw HTML when parse_html is false. Long page content is truncated to max_content_length. base_url configures the Browserbase API endpoint, not the website you want to visit. This is simpler than browser when you only need remote navigation, screenshots, and page reads.
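
The parse_html and max_content_length behavior can be approximated with the standard library. This is a rough sketch, not the tool's implementation: the real cleanup rules are not documented here, so the version below simply drops script and style content, joins the remaining text, and truncates with a plain slice. The names page_content and _VisibleText are hypothetical.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def page_content(html, parse_html=True, max_content_length=100000):
    """Return cleaned text or raw HTML, truncated to max_content_length."""
    if parse_html:
        parser = _VisibleText()
        parser.feed(html)
        parser.close()
        content = " ".join(parser.parts)
    else:
        content = html
    return content[:max_content_length]
```

Because truncation applies to whichever form is returned, raw HTML reaches the limit much sooner than cleaned text for the same page.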

Configuration

Option	Type	Required	Default	Notes
`api_key`	`password`	`yes`	`null`	Browserbase API key, with `BROWSERBASE_API_KEY` as the SDK fallback.
`project_id`	`text`	`yes`	`null`	Browserbase project ID, with `BROWSERBASE_PROJECT_ID` as the SDK fallback.
`base_url`	`url`	`no`	`null`	Optional Browserbase API endpoint override, with `BROWSERBASE_BASE_URL` as the SDK fallback.
`enable_navigate_to`	`boolean`	`no`	`true`	Enable `navigate_to()`.
`enable_screenshot`	`boolean`	`no`	`true`	Enable `screenshot()`.
`enable_get_page_content`	`boolean`	`no`	`true`	Enable `get_page_content()`.
`enable_close_session`	`boolean`	`no`	`true`	Enable `close_session()`.
`all`	`boolean`	`no`	`false`	Enable the full upstream toolkit surface.
`parse_html`	`boolean`	`no`	`true`	Return cleaned visible text instead of raw HTML.
`max_content_length`	`number`	`no`	`100000`	Maximum returned character count for page content.

Example

agents:
  browser_worker:
    tools:
      - browserbase:
          parse_html: true
          max_content_length: 20000

navigate_to("https://matrix.org/blog/")
get_page_content()

Notes

browserbase needs both api_key and project_id.
It still depends on local Playwright support because the client connects to the remote browser over CDP.
Use browserbase when you want a hosted browser session but do not need the broader local action surface of browser.

[`browser`]

browser is MindRoom's local browser controller for multi-step browser sessions, snapshots, screenshots, PDFs, uploads, dialogs, and low-level actions.

What It Does

browser exposes one callable, browser(action=...), with actions such as status, start, stop, profiles, tabs, open, focus, close, snapshot, screenshot, navigate, console, pdf, upload, dialog, act, help, and actions. It manages named browser profiles, with mindroom as the default profile name. It creates tabs, tracks the active tab, records console entries, and resolves temporary element refs from snapshot() into later act() and screenshot() calls. snapshot() can return either ai or aria format. act() currently supports click, type, press, hover, drag, select, fill, resize, wait, evaluate, and close. Only target="host" is supported on this branch, so sandbox or node targeting fields currently return an error. If output_dir is unset, screenshots and PDFs are written under <storage>/browser. The runtime picks Chromium from BROWSER_EXECUTABLE_PATH, chromium, or google-chrome-stable when available.
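
The executable lookup mentioned at the end of the paragraph above can be sketched like this. The order shown (explicit environment variable first, then chromium, then google-chrome-stable) follows the order in which this page lists them and is otherwise an assumption; pick_browser_executable is a hypothetical helper, and what the runtime does when nothing is found is not covered here, so the sketch just returns None.

```python
import os
import shutil


def pick_browser_executable(environ=None, which=shutil.which):
    """Return a browser executable path, or None if no candidate is found."""
    env = os.environ if environ is None else environ
    explicit = env.get("BROWSER_EXECUTABLE_PATH")
    if explicit:
        return explicit  # an explicit override wins
    for name in ("chromium", "google-chrome-stable"):
        path = which(name)  # search PATH for the binary
        if path:
            return path
    return None


if __name__ == "__main__":
    print(pick_browser_executable())
```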

Configuration

Option	Type	Required	Default	Notes
`output_dir`	`text`	`no`	`null`	Optional directory for screenshots, PDFs, and other browser artifacts, with `<storage>/browser` as the runtime default when omitted.

Example

agents:
  browser_worker:
    tools:
      - browser:
          output_dir: browser-artifacts

browser(action="open", targetUrl="https://matrix.org/blog/")
browser(action="snapshot", snapshotFormat="ai")
browser(action="act", request={"kind": "click", "ref": "e1"})
browser(action="screenshot", fullPage=True)

Notes

This tool is local Playwright automation rather than a hosted browser API.

[`web_browser_tools`]

web_browser_tools is the simplest browser-related tool here, because it just opens a URL in the host's real browser.

What It Does

web_browser_tools exposes open_page(url, new_window=False). It uses Python's standard-library webbrowser module to open a tab or window on the host operating system. It does not return page content, DOM state, screenshots, or automation handles. This makes it useful for human handoff or local desktop workflows, but not for scraping.
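
For reference, the behavior described above maps onto two functions in Python's standard webbrowser module. The sketch below is an approximation rather than the toolkit's code; the opener parameter is a hypothetical addition so the function can be exercised without launching a real browser.

```python
import webbrowser


def open_page(url, new_window=False, opener=webbrowser):
    """Open url in a new window or a new tab of the host's default browser.

    Returns whatever the opener reports; the standard module returns
    True when a browser was launched. No page content is returned.
    """
    if new_window:
        return opener.open_new(url)
    return opener.open_new_tab(url)
```

On a server without a desktop session the standard module typically finds no usable browser, which is why this tool belongs on hosts with a real desktop.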

Configuration

Option	Type	Required	Default	Notes
`enable_open_page`	`boolean`	`no`	`true`	Enable `open_page()`.
`all`	`boolean`	`no`	`false`	Enable the full upstream toolkit surface.

Example

agents:
  assistant:
    tools:
      - web_browser_tools

open_page("https://docs.mindroom.chat")
open_page("https://matrix.org/blog/", new_window=True)

Notes

web_browser_tools only makes sense on a host that can launch a real browser window or tab.
This tool is not a scraper and does not feed page content back to the model.
Use browser or browserbase when you need browser automation or content returned to the agent.
