Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server. https://webclaw.io

noxa

The fastest web scraper for AI agents.
67% fewer tokens. Sub-millisecond extraction. Zero browser overhead.



Claude Code's built-in web_fetch → 403 Forbidden. noxa → clean markdown.


Your AI agent calls fetch() and gets a 403. Or 142KB of raw HTML that burns through your token budget. noxa fixes both.

It extracts clean, structured content from any URL using Chrome-level TLS fingerprinting — no headless browser, no Selenium, no Puppeteer. Output is optimized for LLMs: 67% fewer tokens than raw HTML, with metadata, links, and images preserved.

                     Raw HTML                          noxa
┌──────────────────────────────────┐    ┌──────────────────────────────────┐
│ <div class="ad-wrapper">         │    │ # Breaking: AI Breakthrough      │
│ <nav class="global-nav">         │    │                                  │
│ <script>window.__NEXT_DATA__     │    │ Researchers achieved 94%         │
│ ={...8KB of JSON...}</script>    │    │ accuracy on cross-domain         │
│ <div class="social-share">       │    │ reasoning benchmarks.            │
│ <button>Tweet</button>           │    │                                  │
│ <footer class="site-footer">     │    │ ## Key Findings                  │
│ <!-- 142,847 characters -->      │    │ - 3x faster inference            │
│                                  │    │ - Open-source weights            │
│         4,820 tokens             │    │         1,590 tokens             │
└──────────────────────────────────┘    └──────────────────────────────────┘

Get Started (30 seconds)

For AI agents (Claude, Cursor, Windsurf, VS Code)

npx create-noxa

Auto-detects your AI tools, downloads the MCP server, and configures everything. One command.

Homebrew (macOS/Linux)

brew tap jmagar/noxa
brew install noxa

Prebuilt binaries

Download from GitHub Releases for macOS (arm64, x86_64) and Linux (x86_64, aarch64).

Cargo (from source)

cargo install --git https://github.com/jmagar/noxa.git noxa
cargo install --git https://github.com/jmagar/noxa.git noxa-mcp

Docker

docker run --rm ghcr.io/0xmassi/noxa https://example.com

Docker Compose (with Ollama for LLM features)

cp env.example .env
docker compose up -d

Why noxa?

                      noxa     Firecrawl   Trafilatura   Readability
Extraction accuracy   95.1%    —           80.6%         83.5%
Token efficiency      -67%     —           -55%          -51%
Speed (100KB page)    3.2ms    ~500ms      18.4ms        8.7ms
TLS fingerprinting    Yes      No          No            No
Self-hosted           Yes      No          Yes           Yes
MCP (Claude/Cursor)   Yes      No          No            No
No browser required   Yes      No          Yes           Yes
Cost                  Free     —           Free          Free

Choose noxa if you want fast local extraction, LLM-optimized output, and native AI agent integration.


What it looks like

$ noxa https://stripe.com -f llm

> URL: https://stripe.com
> Title: Stripe | Financial Infrastructure for the Internet
> Language: en
> Word count: 847

# Stripe | Financial Infrastructure for the Internet

Stripe is a suite of APIs powering online payment processing
and commerce solutions for internet businesses of all sizes.

## Products
- Payments — Accept payments online and in person
- Billing — Manage subscriptions and invoicing
- Connect — Build a marketplace or platform
...

$ noxa https://github.com --brand

{
  "name": "GitHub",
  "colors": [{"hex": "#59636E", "usage": "Primary"}, ...],
  "fonts": ["Mona Sans", "ui-monospace"],
  "logos": [{"url": "https://github.githubassets.com/...", "kind": "svg"}]
}

$ noxa https://docs.rust-lang.org --crawl --depth 2 --max-pages 50

Crawling... 50/50 pages extracted
---
# Page 1: https://docs.rust-lang.org/
...
# Page 2: https://docs.rust-lang.org/book/
...

MCP Server — 10 tools for AI agents

noxa MCP server

noxa ships as an MCP server that plugs into Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Antigravity, Codex CLI, and any MCP-compatible client.

npx create-noxa    # auto-detects and configures everything

Or manual setup — add to your Claude Desktop config:

{
  "mcpServers": {
    "noxa": {
      "command": "~/.noxa/noxa-mcp"
    }
  }
}

Then in Claude: "Scrape the top 5 results for 'web scraping tools' and compare their pricing" — it just works.

Available tools

Tool        Description                          Requires API key?
scrape      Extract content from any URL         No
crawl       Recursive site crawl                 No
map         Discover URLs from sitemaps          No
batch       Parallel multi-URL extraction        No
extract     LLM-powered structured extraction    No (needs Ollama)
summarize   Page summarization                   No (needs Ollama)
diff        Content change detection             No
brand       Brand identity extraction            No
search      Web search + scrape results          Yes
research    Deep multi-source research           Yes

8 of 10 tools work locally — no account, no API key, fully private.
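
MCP clients drive these tools over JSON-RPC. A scrape invocation from a client would look roughly like this — the tool name comes from the table above, but the exact argument schema is an assumption, not taken from noxa's docs:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "scrape",
    "arguments": { "url": "https://example.com" }
  }
}
```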


Features

Extraction

  • Readability scoring — multi-signal content detection (text density, semantic tags, link ratio)
  • Noise filtering — strips nav, footer, ads, modals, cookie banners (Tailwind-safe)
  • Data island extraction — catches React/Next.js JSON payloads, JSON-LD, hydration data
  • YouTube metadata — structured data from any YouTube video
  • PDF extraction — auto-detected via Content-Type
  • 5 output formats — markdown, text, JSON, LLM-optimized, HTML

Content control

noxa URL --include "article, .content"       # CSS selector include
noxa URL --exclude "nav, footer, .sidebar"    # CSS selector exclude
noxa URL --only-main-content                  # Auto-detect main content

Crawling

noxa URL --crawl --depth 3 --max-pages 100   # BFS same-origin crawl
noxa URL --crawl --sitemap                    # Seed from sitemap
noxa URL --map                                # Discover URLs only

LLM features (Ollama / OpenAI / Anthropic)

noxa URL --summarize                          # Page summary
noxa URL --extract-prompt "Get all prices"    # Natural language extraction
noxa URL --extract-json '{"type":"object"}'   # Schema-enforced extraction

Change tracking

noxa URL -f json > snap.json                  # Take snapshot
noxa URL --diff-with snap.json                # Compare later

Brand extraction

noxa URL --brand                              # Colors, fonts, logos, OG image

Proxy rotation

noxa URL --proxy http://user:pass@host:port   # Single proxy
noxa URLs --proxy-file proxies.txt            # Pool rotation
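
Pool rotation can be pictured as a simple round-robin over the proxy list. This is a hypothetical sketch of that behavior, not noxa's actual implementation:

```rust
// Round-robin proxy pool: each request takes the next proxy in order,
// wrapping back to the start when the list is exhausted.
struct ProxyPool {
    proxies: Vec<String>,
    next: usize,
}

impl ProxyPool {
    fn new(proxies: Vec<String>) -> Self {
        Self { proxies, next: 0 }
    }

    // Returns the next proxy URL, or None if the pool is empty.
    fn next_proxy(&mut self) -> Option<&str> {
        if self.proxies.is_empty() {
            return None;
        }
        let i = self.next % self.proxies.len();
        self.next += 1;
        Some(&self.proxies[i])
    }
}

fn main() {
    let mut pool = ProxyPool::new(vec![
        "http://p1:8080".into(),
        "http://p2:8080".into(),
    ]);
    assert_eq!(pool.next_proxy(), Some("http://p1:8080"));
    assert_eq!(pool.next_proxy(), Some("http://p2:8080"));
    assert_eq!(pool.next_proxy(), Some("http://p1:8080")); // wraps around
    println!("rotation ok");
}
```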

Benchmarks

All numbers from real tests on 50 diverse pages. See benchmarks/ for methodology and reproduction instructions.

Extraction quality

Accuracy      noxa        ███████████████████  95.1%
              readability ████████████████▋    83.5%
              trafilatura ████████████████     80.6%
              newspaper3k █████████████▎       66.4%

Noise removal noxa        ███████████████████  96.1%
              readability █████████████████▊   89.4%
              trafilatura ██████████████████▏  91.2%
              newspaper3k ███████████████▎     76.8%

Speed (pure extraction, no network)

10KB page     noxa        ██                   0.8ms
              readability █████                2.1ms
              trafilatura ██████████           4.3ms

100KB page    noxa        ██                   3.2ms
              readability █████                8.7ms
              trafilatura ██████████           18.4ms

Token efficiency (feeding to Claude/GPT)

Format        Tokens   vs Raw HTML
Raw HTML      4,820    baseline
readability   2,340    -51%
trafilatura   2,180    -55%
noxa llm      1,590    -67%
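
The percentages follow directly from the token counts; a quick sanity check of the arithmetic:

```rust
// Percent token reduction relative to the raw-HTML baseline, rounded
// to the nearest whole percent as in the table above.
fn reduction(baseline: f64, tokens: f64) -> i64 {
    ((1.0 - tokens / baseline) * 100.0).round() as i64
}

fn main() {
    let baseline = 4820.0;
    println!("readability: -{}%", reduction(baseline, 2340.0)); // -51%
    println!("trafilatura: -{}%", reduction(baseline, 2180.0)); // -55%
    println!("noxa llm:    -{}%", reduction(baseline, 1590.0)); // -67%
}
```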

Crawl speed

Concurrency   noxa        Crawl4AI    Scrapy
5             9.8 pg/s    5.2 pg/s    7.1 pg/s
10            18.4 pg/s   8.7 pg/s    12.3 pg/s
20            32.1 pg/s   14.2 pg/s   21.8 pg/s

Architecture

noxa/
  crates/
    noxa-core     Pure extraction engine. Zero network deps. WASM-safe.
    noxa-fetch    HTTP client + TLS fingerprinting (wreq/BoringSSL). Crawler. Batch ops.
    noxa-llm      LLM provider chain (Ollama -> OpenAI -> Anthropic)
    noxa-pdf      PDF text extraction
    noxa-mcp      MCP server (10 tools for AI agents)
    noxa          CLI binary

noxa-core takes raw HTML as a &str and returns structured output. No I/O, no network, no allocator tricks. Can compile to WASM.
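
As a mental model only — this is not the real noxa-core API — the engine's shape is a pure function from an HTML string to a structured result. The sketch below substitutes naive tag stripping for the real readability scoring, just to show the `&str` in, struct out, zero-I/O design:

```rust
// Illustrative sketch of a pure extraction function: no I/O, no network,
// so it could compile to WASM. Field and function names are hypothetical.
struct Extracted {
    title: Option<String>,
    text: String,
}

fn extract(html: &str) -> Extracted {
    // Naive title grab; the real engine does multi-signal scoring.
    let title = html
        .split("<title>")
        .nth(1)
        .and_then(|rest| rest.split("</title>").next())
        .map(str::to_owned);

    // Crude tag stripping stands in for full content extraction.
    let mut text = String::new();
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' => {
                in_tag = false;
                text.push(' '); // keep words from fusing across tags
            }
            c if !in_tag => text.push(c),
            _ => {}
        }
    }
    let text = text.split_whitespace().collect::<Vec<_>>().join(" ");
    Extracted { title, text }
}

fn main() {
    let doc = extract("<html><title>Example</title><body><p>Hello world</p></body></html>");
    assert_eq!(doc.title.as_deref(), Some("Example"));
    println!("{}", doc.text);
}
```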


Configuration

Variable Description
NOXA_API_KEY Cloud API key (enables bot bypass, JS rendering, search, research)
OLLAMA_HOST Ollama URL for local LLM features (default: http://localhost:11434)
OPENAI_API_KEY OpenAI API key for LLM features
ANTHROPIC_API_KEY Anthropic API key for LLM features
NOXA_PROXY Single proxy URL
NOXA_PROXY_FILE Path to proxy pool file

Cloud API (optional)

For bot-protected sites, JS rendering, and advanced features, noxa offers a hosted API at noxa.io.

The CLI and MCP server work locally first. Cloud is used as a fallback when:

  • A site has bot protection (Cloudflare, DataDome, WAF)
  • A page requires JavaScript rendering
  • You use search or research tools

export NOXA_API_KEY=wc_your_key

# Automatic: tries local first, cloud on bot detection
noxa https://protected-site.com

# Force cloud
noxa --cloud https://spa-site.com

SDKs

npm install @noxa/sdk                  # TypeScript/JavaScript
pip install noxa                        # Python
go get github.com/jmagar/noxa-go      # Go

Use cases

  • AI agents — Give Claude/Cursor/GPT real-time web access via MCP
  • Research — Crawl documentation, competitor sites, news archives
  • Price monitoring — Track changes with --diff-with snapshots
  • Training data — Prepare web content for fine-tuning with token-optimized output
  • Content pipelines — Batch extract + summarize in CI/CD
  • Brand intelligence — Extract visual identity from any website

Community

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Acknowledgments

TLS and HTTP/2 browser fingerprinting is powered by wreq and http2 by @0x676e67, who pioneered browser-grade HTTP/2 fingerprinting in Rust.

License

AGPL-3.0