| name | description | homepage | user-invocable | metadata |
|---|---|---|---|---|
| noxa | Web extraction engine with antibot bypass. Scrape, crawl, extract, summarize, search, map, diff, monitor, research, and analyze any URL — including Cloudflare-protected sites. Use when you need reliable web content, the built-in web_fetch fails, or you need structured data extraction from web pages. | https://noxa.io | true | |

# noxa
High-quality web extraction with automatic antibot bypass. Beats Firecrawl on extraction quality and handles Cloudflare, DataDome, and JS-rendered pages automatically.
## When to use this skill
- Always when you need to fetch web content and want reliable results
- When `web_fetch` returns empty/blocked content (403, Cloudflare challenges)
- When you need structured data extraction (pricing tables, product info)
- When you need to crawl an entire site or discover all URLs
- When you need LLM-optimized content (cleaner than raw markdown)
- When you need to summarize a page without reading the full content
- When you need to detect content changes between visits
- When you need brand identity analysis (colors, fonts, logos)
- When you need web search results with optional page scraping
- When you need deep multi-source research on a topic
- When you need AI-guided scraping to accomplish a goal on a page
- When you need to monitor a URL for changes over time
## API base

All requests go to `https://api.noxa.io/v1/`.

Authentication: `Authorization: Bearer $NOXA_API_KEY`

## Endpoints

### 1. Scrape — extract content from a single URL

```bash
curl -X POST https://api.noxa.io/v1/scrape \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'
```
Request fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | URL to scrape |
| `formats` | string[] | `["markdown"]` | Output formats: `markdown`, `text`, `llm`, `json` |
| `include_selectors` | string[] | `[]` | CSS selectors to keep (e.g. `["article", ".content"]`) |
| `exclude_selectors` | string[] | `[]` | CSS selectors to remove (e.g. `["nav", "footer", ".ads"]`) |
| `only_main_content` | bool | `false` | Extract only the main article/content area |
| `no_cache` | bool | `false` | Skip cache, fetch fresh |
| `max_cache_age` | int | server default | Max acceptable cache age in seconds |
Response:

```json
{
  "url": "https://example.com",
  "metadata": {
    "title": "Example",
    "description": "...",
    "language": "en",
    "word_count": 1234
  },
  "markdown": "# Page Title\n\nContent here...",
  "cache": { "status": "miss" }
}
```

Format options:

- `markdown` — clean markdown, best for general use
- `text` — plain text without formatting
- `llm` — optimized for LLM consumption: includes page title, URL, and cleaned content with link references. Best for feeding to AI models.
- `json` — full extraction result with all metadata
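For scripting, the scrape call above can be wrapped in a few lines of Python. This is a sketch under the request/response shapes documented above; the helper names are our own, not part of any official client:

```python
import json
import urllib.request

API = "https://api.noxa.io/v1"

def build_scrape_payload(url, formats=("markdown",), only_main_content=False,
                         include_selectors=(), exclude_selectors=()):
    """Assemble a /scrape request body, omitting fields left at their defaults."""
    payload = {"url": url, "formats": list(formats)}
    if only_main_content:
        payload["only_main_content"] = True
    if include_selectors:
        payload["include_selectors"] = list(include_selectors)
    if exclude_selectors:
        payload["exclude_selectors"] = list(exclude_selectors)
    return payload

def scrape(url, api_key, **opts):
    """POST to /scrape and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{API}/scrape",
        data=json.dumps(build_scrape_payload(url, **opts)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Omitting default-valued fields keeps request bodies small and lets the server's own defaults apply.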
When antibot bypass activates (automatic, no extra config):

```json
{
  "antibot": {
    "bypass": true,
    "elapsed_ms": 3200
  }
}
```

### 2. Crawl — scrape an entire website

Starts an async job. Poll for results.

Start crawl:
```bash
curl -X POST https://api.noxa.io/v1/crawl \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "max_depth": 3,
    "max_pages": 50,
    "use_sitemap": true
  }'
```

Response: `{ "job_id": "abc-123", "status": "running" }`

Poll status:

```bash
curl https://api.noxa.io/v1/crawl/abc-123 \
  -H "Authorization: Bearer $NOXA_API_KEY"
```
Response when complete:

```json
{
  "job_id": "abc-123",
  "status": "completed",
  "total": 47,
  "completed": 45,
  "errors": 2,
  "pages": [
    {
      "url": "https://docs.example.com/intro",
      "markdown": "# Introduction\n...",
      "metadata": { "title": "Intro", "word_count": 500 }
    }
  ]
}
```
Request fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | Starting URL |
| `max_depth` | int | `3` | How many links deep to follow |
| `max_pages` | int | `100` | Maximum pages to crawl |
| `use_sitemap` | bool | `false` | Seed URLs from sitemap.xml |
| `formats` | string[] | `["markdown"]` | Output formats per page |
| `include_selectors` | string[] | `[]` | CSS selectors to keep |
| `exclude_selectors` | string[] | `[]` | CSS selectors to remove |
| `only_main_content` | bool | `false` | Main content only |
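Since crawl is asynchronous, clients typically poll the status endpoint until the job reaches a terminal state. A minimal polling sketch — the callable, interval, and timeout are client-side choices, not API parameters:

```python
import time

def poll_job(fetch_status, interval=2.0, timeout=300.0):
    """Poll a job-status callable until it reports a terminal state.

    fetch_status: zero-arg callable returning the job dict, e.g. a GET
    to /crawl/{job_id}. Raises TimeoutError if no terminal state arrives
    within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job.get("status") in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job.get('status')!r} after {timeout}s")
        time.sleep(interval)
```

Passing a callable keeps the helper reusable for any async job on the API that exposes a `status` field.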
### 3. Map — discover all URLs on a site

Fast URL discovery without full content extraction.

```bash
curl -X POST https://api.noxa.io/v1/map \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```
Response:

```json
{
  "url": "https://example.com",
  "count": 142,
  "urls": [
    "https://example.com/about",
    "https://example.com/pricing",
    "https://example.com/docs/intro"
  ]
}
```
### 4. Batch — scrape multiple URLs in parallel

```bash
curl -X POST https://api.noxa.io/v1/batch \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://a.com",
      "https://b.com",
      "https://c.com"
    ],
    "formats": ["markdown"],
    "concurrency": 5
  }'
```
Response:

```json
{
  "total": 3,
  "completed": 2,
  "errors": 1,
  "results": [
    { "url": "https://a.com", "markdown": "...", "metadata": {} },
    { "url": "https://b.com", "markdown": "...", "metadata": {} },
    { "url": "https://c.com", "error": "timeout" }
  ]
}
```
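Per-URL failures appear inline as `error` fields rather than failing the whole batch, so callers usually partition the results before processing. A small sketch (the helper name is ours):

```python
def split_batch_results(batch):
    """Separate successful scrapes from failures in a /batch response.

    A result entry is treated as failed iff it carries an "error" field,
    matching the response shape shown above.
    """
    results = batch.get("results", [])
    ok = [r for r in results if "error" not in r]
    failed = [r for r in results if "error" in r]
    return ok, failed
```

Failed entries still carry their `url`, so they can be retried individually or passed back through `/batch`.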
### 5. Extract — LLM-powered structured extraction

Pull structured data from any page using a JSON schema or plain-text prompt.

With JSON schema:

```bash
curl -X POST https://api.noxa.io/v1/extract \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "schema": {
      "type": "object",
      "properties": {
        "plans": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "price": { "type": "string" },
              "features": { "type": "array", "items": { "type": "string" } }
            }
          }
        }
      }
    }
  }'
```
With prompt:

```bash
curl -X POST https://api.noxa.io/v1/extract \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "prompt": "Extract all pricing tiers with names, monthly prices, and key features"
  }'
```
Response:

```json
{
  "url": "https://example.com/pricing",
  "data": {
    "plans": [
      { "name": "Starter", "price": "$49/mo", "features": ["10k pages", "Email support"] },
      { "name": "Pro", "price": "$99/mo", "features": ["100k pages", "Priority support", "API access"] }
    ]
  }
}
```
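Nested JSON schemas like the pricing example get verbose quickly; tiny builder helpers keep them readable when constructing requests programmatically. A sketch that reconstructs the schema from the example above (the helper names are illustrative, not part of the API):

```python
STRING = {"type": "string"}

def object_schema(**properties):
    """Build a JSON-schema object node from keyword'd property schemas."""
    return {"type": "object", "properties": properties}

def array_of(item_schema):
    """Build a JSON-schema array node with the given item schema."""
    return {"type": "array", "items": item_schema}

# Rebuild the pricing schema from the curl example above:
pricing_schema = object_schema(
    plans=array_of(
        object_schema(name=STRING, price=STRING, features=array_of(STRING))
    )
)
```

The resulting dict can be serialized straight into the `schema` field of the `/extract` request body.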
### 6. Summarize — get a quick summary of any page

```bash
curl -X POST https://api.noxa.io/v1/summarize \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/long-article",
    "max_sentences": 3
  }'
```

Response:

```json
{
  "url": "https://example.com/long-article",
  "summary": "The article discusses... Key findings include... The author concludes that..."
}
```
### 7. Diff — detect content changes

Compare current page content against a previous snapshot.

```bash
curl -X POST https://api.noxa.io/v1/diff \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "previous": {
      "markdown": "# Old content...",
      "metadata": { "title": "Old Title" }
    }
  }'
```
Response:

```json
{
  "url": "https://example.com",
  "status": "changed",
  "diff": "--- previous\n+++ current\n@@ -1 +1 @@\n-# Old content\n+# New content",
  "metadata_changes": [
    { "field": "title", "old": "Old Title", "new": "New Title" }
  ]
}
```
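A common workflow is to store the result of an earlier scrape and replay it as the `previous` snapshot. A sketch of building the diff request body from a saved scrape response, assuming the scrape response shape shown earlier (the helper name is ours):

```python
def build_diff_payload(url, previous_scrape):
    """Turn a stored /scrape result into a /diff request body.

    previous_scrape is the JSON returned by an earlier scrape of the
    same URL; only its markdown and metadata are carried forward.
    """
    return {
        "url": url,
        "previous": {
            "markdown": previous_scrape["markdown"],
            "metadata": previous_scrape.get("metadata", {}),
        },
    }
```

Persist the scrape JSON wherever convenient (a file, a database row); the diff endpoint only needs the fields above.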
### 8. Brand — extract brand identity

Analyze a website's visual identity: colors, fonts, logo.

```bash
curl -X POST https://api.noxa.io/v1/brand \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```
Response:

```json
{
  "url": "https://example.com",
  "brand": {
    "colors": [
      { "hex": "#FF6B35", "usage": "primary" },
      { "hex": "#1A1A2E", "usage": "background" }
    ],
    "fonts": ["Inter", "JetBrains Mono"],
    "logo_url": "https://example.com/logo.svg",
    "favicon_url": "https://example.com/favicon.ico"
  }
}
```
### 9. Search — web search with optional scraping

Search the web and optionally scrape each result page.

```bash
curl -X POST https://api.noxa.io/v1/search \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "best rust web frameworks 2026",
    "num_results": 5,
    "scrape": true,
    "formats": ["markdown"]
  }'
```
Request fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Search query |
| `num_results` | int | `10` | Number of search results to return |
| `scrape` | bool | `false` | Also scrape each result page for full content |
| `formats` | string[] | `["markdown"]` | Output formats when `scrape` is true |
| `country` | string | none | Country code for localized results (e.g. `"us"`, `"de"`) |
| `lang` | string | none | Language code for results (e.g. `"en"`, `"fr"`) |
Response:

```json
{
  "query": "best rust web frameworks 2026",
  "results": [
    {
      "title": "Top Rust Web Frameworks in 2026",
      "url": "https://blog.example.com/rust-frameworks",
      "snippet": "A comprehensive comparison of Axum, Actix, and Rocket...",
      "position": 1,
      "markdown": "# Top Rust Web Frameworks\n\n..."
    },
    {
      "title": "Choosing a Rust Backend Framework",
      "url": "https://dev.to/rust-backends",
      "snippet": "When starting a new Rust web project...",
      "position": 2,
      "markdown": "# Choosing a Rust Backend\n\n..."
    }
  ]
}
```

The `markdown` field on each result is only present when `scrape: true`. Without it, you get titles, URLs, snippets, and positions only.
### 10. Research — deep multi-source research

Starts an async research job that searches, scrapes, and synthesizes information across multiple sources. Poll for results.

Start research:

```bash
curl -X POST https://api.noxa.io/v1/research \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
    "max_iterations": 5,
    "max_sources": 10,
    "topic": "security",
    "deep": true
  }'
```
Request fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Research question or topic |
| `max_iterations` | int | server default | Maximum research iterations (search-read-analyze cycles) |
| `max_sources` | int | server default | Maximum number of sources to consult |
| `topic` | string | none | Topic hint to guide search strategy (e.g. `"security"`, `"finance"`, `"engineering"`) |
| `deep` | bool | `false` | Enable deep research mode for more thorough analysis (costs 10 credits instead of 1) |
Response: `{ "id": "res-abc-123", "status": "running" }`

Poll results:

```bash
curl https://api.noxa.io/v1/research/res-abc-123 \
  -H "Authorization: Bearer $NOXA_API_KEY"
```
Response when complete:

```json
{
  "id": "res-abc-123",
  "status": "completed",
  "query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
  "report": "# Cloudflare Turnstile Analysis\n\n## Overview\nCloudflare Turnstile is a CAPTCHA replacement that...\n\n## How It Works\n...\n\n## Known Bypass Methods\n...",
  "sources": [
    { "url": "https://developers.cloudflare.com/turnstile/", "title": "Turnstile Documentation" },
    { "url": "https://blog.cloudflare.com/turnstile-ga/", "title": "Turnstile GA Announcement" }
  ],
  "findings": [
    "Turnstile uses browser environment signals and proof-of-work challenges",
    "Managed mode auto-selects challenge difficulty based on visitor risk score",
    "Known bypass approaches include instrumented browser automation"
  ],
  "iterations": 5,
  "elapsed_ms": 34200
}
```

Status values: `running`, `completed`, `failed`
### 11. Agent Scrape — AI-guided scraping

Use an AI agent to navigate and interact with a page to accomplish a specific goal. The agent can click, scroll, fill forms, and extract data across multiple steps.

```bash
curl -X POST https://api.noxa.io/v1/agent-scrape \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products",
    "goal": "Find the cheapest laptop with at least 16GB RAM and extract its full specs",
    "max_steps": 10
  }'
```
Request fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | Starting URL |
| `goal` | string | required | What the agent should accomplish |
| `max_steps` | int | server default | Maximum number of actions the agent can take |
Response:

```json
{
  "url": "https://example.com/products",
  "result": "The cheapest laptop with 16GB+ RAM is the ThinkPad E14 Gen 6 at $649. Specs: AMD Ryzen 5 7535U, 16GB DDR4, 512GB SSD, 14\" FHD IPS display, 57Wh battery.",
  "steps": [
    { "action": "navigate", "detail": "Loaded products page" },
    { "action": "click", "detail": "Clicked 'Laptops' category filter" },
    { "action": "click", "detail": "Applied '16GB+' RAM filter" },
    { "action": "click", "detail": "Sorted by price: low to high" },
    { "action": "extract", "detail": "Extracted specs from first matching product" }
  ]
}
```
### 12. Watch — monitor a URL for changes

Create persistent monitors that check a URL on a schedule and notify via webhook when content changes.

Create a monitor:

```bash
curl -X POST https://api.noxa.io/v1/watch \
  -H "Authorization: Bearer $NOXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "interval": "0 */6 * * *",
    "webhook_url": "https://hooks.example.com/pricing-changed",
    "formats": ["markdown"]
  }'
```
Request fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | URL to monitor |
| `interval` | string | required | Check frequency as cron expression or seconds (e.g. `"0 */6 * * *"` or `"3600"`) |
| `webhook_url` | string | none | URL to POST when changes are detected |
| `formats` | string[] | `["markdown"]` | Output formats for snapshots |
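Because `interval` accepts either plain seconds or a five-field cron expression, it can help to check which form you are sending before creating a monitor. A hypothetical client-side sketch (the API does its own parsing; this only distinguishes the two accepted shapes):

```python
def classify_interval(interval):
    """Classify a watch `interval` as plain seconds or a cron expression.

    Returns ("seconds", int) for an all-digit string, or ("cron", str)
    for a five-field expression; raises ValueError otherwise.
    """
    s = str(interval).strip()
    if s.isdigit():
        return ("seconds", int(s))
    if len(s.split()) == 5:
        return ("cron", s)
    raise ValueError(f"not a seconds value or 5-field cron expression: {s!r}")
```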
Response:

```json
{
  "id": "watch-abc-123",
  "url": "https://example.com/pricing",
  "interval": "0 */6 * * *",
  "webhook_url": "https://hooks.example.com/pricing-changed",
  "formats": ["markdown"],
  "created_at": "2026-03-20T10:00:00Z",
  "last_check": null,
  "status": "active"
}
```
List all monitors:

```bash
curl https://api.noxa.io/v1/watch \
  -H "Authorization: Bearer $NOXA_API_KEY"
```

Response:

```json
{
  "monitors": [
    {
      "id": "watch-abc-123",
      "url": "https://example.com/pricing",
      "interval": "0 */6 * * *",
      "status": "active",
      "last_check": "2026-03-20T16:00:00Z",
      "checks": 4
    }
  ]
}
```
Get a monitor with snapshots:

```bash
curl https://api.noxa.io/v1/watch/watch-abc-123 \
  -H "Authorization: Bearer $NOXA_API_KEY"
```

Response:

```json
{
  "id": "watch-abc-123",
  "url": "https://example.com/pricing",
  "interval": "0 */6 * * *",
  "status": "active",
  "snapshots": [
    {
      "checked_at": "2026-03-20T16:00:00Z",
      "status": "changed",
      "diff": "--- previous\n+++ current\n@@ -5 +5 @@\n-Pro: $99/mo\n+Pro: $119/mo"
    },
    {
      "checked_at": "2026-03-20T10:00:00Z",
      "status": "baseline"
    }
  ]
}
```
Trigger an immediate check:

```bash
curl -X POST https://api.noxa.io/v1/watch/watch-abc-123/check \
  -H "Authorization: Bearer $NOXA_API_KEY"
```

Delete a monitor:

```bash
curl -X DELETE https://api.noxa.io/v1/watch/watch-abc-123 \
  -H "Authorization: Bearer $NOXA_API_KEY"
```
## Choosing the right format

| Goal | Format | Why |
|---|---|---|
| Read and understand a page | `markdown` | Clean structure, headings, links preserved |
| Feed content to an AI model | `llm` | Optimized: includes title + URL header, clean link refs |
| Search or index content | `text` | Plain text, no formatting noise |
| Programmatic analysis | `json` | Full metadata, structured data, DOM statistics |
## Tips

- Use `llm` format when passing content to yourself or another AI — it's specifically optimized for LLM consumption with better context framing.
- Use `only_main_content: true` to skip navigation, sidebars, and footers. Reduces noise significantly.
- Use `include_selectors`/`exclude_selectors` for fine-grained control when `only_main_content` isn't enough.
- Batch over individual scrapes when fetching multiple URLs — it's faster and more efficient.
- Use `map` before `crawl` to discover the site structure first, then crawl specific sections.
- Use `extract` with a JSON schema for reliable structured output (e.g., pricing tables, product specs, contact info).
- Antibot bypass is automatic — no extra configuration needed. Works on Cloudflare, DataDome, AWS WAF, and JS-rendered SPAs.
- Use `search` with `scrape: true` to get full page content for each search result in one call instead of searching then scraping separately.
- Use `research` for complex questions that need multiple sources — it handles the search-read-synthesize loop automatically. Enable `deep: true` for thorough analysis.
- Use `agent-scrape` for interactive pages where data is behind filters, pagination, or form submissions that a simple scrape cannot reach.
- Use `watch` for ongoing monitoring — set up a cron schedule and a webhook to get notified when a page changes without polling manually.
## Smart Fetch Architecture

The noxa MCP server uses a local-first approach:

- Local fetch — fast, free, no API credits used (~80% of sites)
- Cloud API fallback — automatic when bot protection or JS rendering is detected

This means:

- Most scrapes cost zero credits (local extraction)
- Cloudflare, DataDome, AWS WAF sites automatically fall back to the cloud API
- JS-rendered SPAs (React, Next.js, Vue) also fall back automatically
- Set `NOXA_API_KEY` to enable cloud fallback
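The local-vs-cloud decision can be pictured as a simple heuristic: fall back when the local fetch looks blocked or client-rendered. The sketch below is illustrative only — the status codes and page markers are our assumptions, not the server's actual detection logic:

```python
def needs_cloud_fallback(status_code, body):
    """Decide whether a local fetch result warrants a cloud API retry.

    Illustrative heuristic: bot-protection status codes, Cloudflare
    interstitial markers, or a near-empty page that ships scripts
    (a likely client-rendered SPA) all trigger fallback.
    """
    if status_code in (403, 429, 503):
        return True  # typical bot-protection responses
    lowered = body.lower()
    if "cf-challenge" in lowered or "just a moment" in lowered:
        return True  # Cloudflare challenge interstitial
    if len(lowered) < 500 and "<script" in lowered:
        return True  # almost no text but scripts present: JS-rendered
    return False
```

A real client would also consider redirects, content type, and extraction quality before spending cloud credits.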
## vs web_fetch

| | noxa | web_fetch |
|---|---|---|
| Cloudflare bypass | Automatic (cloud fallback) | Fails (403) |
| JS-rendered pages | Automatic fallback | Readability only |
| Output quality | 20-step optimization pipeline | Basic HTML parsing |
| Structured extraction | LLM-powered, schema-based | None |
| Crawling | Full site crawl with sitemap | Single page only |
| Caching | Built-in, configurable TTL | Per-session |
| Rate limiting | Managed server-side | Client responsibility |

Use `web_fetch` for simple, fast lookups. Use noxa when you need reliability, quality, or advanced features.