mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-04-25 00:06:21 +02:00
Initial release: webclaw v0.1.0 — web content extraction for LLMs
CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io
This commit is contained in:
commit
c99ec684fa
79 changed files with 24074 additions and 0 deletions
2
.cargo/config.toml
Normal file
2
.cargo/config.toml
Normal file
|
|
@ -0,0 +1,2 @@
|
|||
[build]
|
||||
rustflags = ["--cfg", "reqwest_unstable"]
|
||||
25
.dockerignore
Normal file
25
.dockerignore
Normal file
|
|
@ -0,0 +1,25 @@
|
|||
# Build artifacts
|
||||
target/
|
||||
|
||||
# Git
|
||||
.git/
|
||||
.gitignore
|
||||
|
||||
# Environment (secrets)
|
||||
.env
|
||||
|
||||
# IDE / OS
|
||||
.DS_Store
|
||||
.vscode/
|
||||
.idea/
|
||||
*.swp
|
||||
*.swo
|
||||
|
||||
# Documentation
|
||||
README.md
|
||||
CLAUDE.md
|
||||
benchmarks/
|
||||
.claude/
|
||||
|
||||
# Deploy scripts (not needed inside image)
|
||||
deploy/
|
||||
BIN
.github/banner.png
vendored
Normal file
BIN
.github/banner.png
vendored
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 44 KiB |
41
.github/workflows/ci.yml
vendored
Normal file
41
.github/workflows/ci.yml
vendored
Normal file
|
|
@ -0,0 +1,41 @@
|
|||
name: CI
|
||||
|
||||
on:
|
||||
push:
|
||||
branches: [main]
|
||||
pull_request:
|
||||
|
||||
env:
|
||||
CARGO_TERM_COLOR: always
|
||||
RUSTFLAGS: "--cfg reqwest_unstable"
|
||||
|
||||
jobs:
|
||||
test:
|
||||
name: Test
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- uses: dtolnay/rust-toolchain@stable
|
||||
- uses: Swatinem/rust-cache@v2
|
||||
- run: cargo test --workspace
|
||||
|
||||
lint:
|
||||
name: Lint
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- uses: dtolnay/rust-toolchain@stable
|
||||
with:
|
||||
components: clippy, rustfmt
|
||||
- uses: Swatinem/rust-cache@v2
|
||||
- run: cargo fmt --check --all
|
||||
- run: cargo clippy --all -- -D warnings
|
||||
|
||||
docs:
|
||||
name: Docs
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- uses: dtolnay/rust-toolchain@stable
|
||||
- uses: Swatinem/rust-cache@v2
|
||||
- run: cargo doc --no-deps --workspace
|
||||
5
.gitignore
vendored
Normal file
5
.gitignore
vendored
Normal file
|
|
@ -0,0 +1,5 @@
|
|||
target/
|
||||
.DS_Store
|
||||
.env
|
||||
proxies.txt
|
||||
.claude/skills/
|
||||
53
CHANGELOG.md
Normal file
53
CHANGELOG.md
Normal file
|
|
@ -0,0 +1,53 @@
|
|||
# Changelog
|
||||
|
||||
All notable changes to webclaw are documented here.
|
||||
Format follows [Keep a Changelog](https://keepachangelog.com/).
|
||||
|
||||
## [0.1.0] — 2026-03-18
|
||||
|
||||
First public release. Full-featured web content extraction toolkit for LLMs.
|
||||
|
||||
### Core Extraction
|
||||
- Readability-style content scoring with text density, semantic tags, and link density penalties
|
||||
- Exact CSS class token noise filtering with body-force fallback for SPAs
|
||||
- HTML → markdown conversion with URL resolution, image alt text, srcset optimization
|
||||
- 9-step LLM text optimization pipeline (67% token reduction vs raw HTML)
|
||||
- JSON data island extraction (React, Next.js, Contentful CMS)
|
||||
- YouTube transcript extraction (title, channel, views, duration, description)
|
||||
- Lazy-loaded image detection (data-src, data-lazy-src, data-original)
|
||||
- Brand identity extraction (name, colors, fonts, logos, OG image)
|
||||
- Content change tracking / diff engine
|
||||
- CSS selector filtering (include/exclude)
|
||||
|
||||
### Fetching & Crawling
|
||||
- TLS fingerprint impersonation via Impit (Chrome 142, Firefox 144, random mode)
|
||||
- BFS same-origin crawler with configurable depth, concurrency, and delay
|
||||
- Sitemap.xml and robots.txt discovery
|
||||
- Batch multi-URL concurrent extraction
|
||||
- Per-request proxy rotation from pool file
|
||||
- Reddit JSON API and LinkedIn post extractors
|
||||
|
||||
### LLM Integration
|
||||
- Provider chain: Ollama (local-first) → OpenAI → Anthropic
|
||||
- JSON schema extraction (structured data from pages)
|
||||
- Natural language prompt extraction
|
||||
- Page summarization with configurable sentence count
|
||||
|
||||
### PDF
|
||||
- PDF text extraction via pdf-extract
|
||||
- Auto-detection by Content-Type header
|
||||
|
||||
### MCP Server
|
||||
- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
|
||||
- stdio transport for Claude Desktop, Claude Code, and any MCP client
|
||||
- Smart Fetch: local extraction first, cloud API fallback
|
||||
|
||||
### CLI
|
||||
- 4 output formats: markdown, JSON, plain text, LLM-optimized
|
||||
- CSS selector filtering, crawling, sitemap discovery
|
||||
- Brand extraction, content diffing, LLM features
|
||||
- Browser profile selection, proxy support, stdin/file input
|
||||
|
||||
### Infrastructure
|
||||
- Docker multi-stage build with Ollama sidecar
|
||||
- Deploy script for Hetzner VPS
|
||||
156
CLAUDE.md
Normal file
156
CLAUDE.md
Normal file
|
|
@ -0,0 +1,156 @@
|
|||
# Webclaw
|
||||
|
||||
Rust workspace: CLI + MCP server for web content extraction into LLM-optimized formats.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
webclaw/
|
||||
crates/
|
||||
webclaw-core/ # Pure extraction engine. WASM-safe. Zero network deps.
|
||||
# + ExtractionOptions (include/exclude CSS selectors)
|
||||
# + diff engine (change tracking)
|
||||
# + brand extraction (DOM/CSS analysis)
|
||||
webclaw-fetch/ # HTTP client via primp. Crawler. Sitemap discovery. Batch ops.
|
||||
# + proxy pool rotation (per-request)
|
||||
# + PDF content-type detection
|
||||
# + document parsing (DOCX, XLSX, CSV)
|
||||
webclaw-llm/ # LLM provider chain (Ollama -> OpenAI -> Anthropic)
|
||||
# + JSON schema extraction, prompt extraction, summarization
|
||||
webclaw-pdf/ # PDF text extraction via pdf-extract
|
||||
webclaw-mcp/ # MCP server (Model Context Protocol) for AI agents
|
||||
webclaw-cli/ # CLI binary
|
||||
```
|
||||
|
||||
Two binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server).
|
||||
|
||||
### Core Modules (`webclaw-core`)
|
||||
- `extractor.rs` — Readability-style scoring: text density, semantic tags, link density penalty
|
||||
- `noise.rs` — Shared noise filter: tags, ARIA roles, class/ID patterns. Tailwind-safe.
|
||||
- `data_island.rs` — JSON data island extraction for React SPAs, Next.js, Contentful CMS
|
||||
- `markdown.rs` — HTML to markdown with URL resolution, asset collection
|
||||
- `llm.rs` — 9-step LLM optimization pipeline (image strip, emphasis strip, link dedup, stat merge, whitespace collapse)
|
||||
- `domain.rs` — Domain detection from URL patterns + DOM heuristics
|
||||
- `metadata.rs` — OG, Twitter Card, standard meta tag extraction
|
||||
- `types.rs` — Core data structures (ExtractionResult, Metadata, Content)
|
||||
- `filter.rs` — CSS selector include/exclude filtering (ExtractionOptions)
|
||||
- `diff.rs` — Content change tracking engine (snapshot diffing)
|
||||
- `brand.rs` — Brand identity extraction from DOM structure and CSS
|
||||
|
||||
### Fetch Modules (`webclaw-fetch`)
|
||||
- `client.rs` — FetchClient with primp TLS impersonation
|
||||
- `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128)
|
||||
- `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
|
||||
- `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt)
|
||||
- `batch.rs` — Multi-URL concurrent extraction
|
||||
- `proxy.rs` — Proxy pool with per-request rotation
|
||||
- `document.rs` — Document parsing: DOCX, XLSX, CSV auto-detection and extraction
|
||||
- `search.rs` — Web search via Serper.dev with parallel result scraping
|
||||
|
||||
### LLM Modules (`webclaw-llm`)
|
||||
- Provider chain: Ollama (local-first) -> OpenAI -> Anthropic
|
||||
- JSON schema extraction, prompt-based extraction, summarization
|
||||
|
||||
### PDF Modules (`webclaw-pdf`)
|
||||
- PDF text extraction via pdf-extract crate
|
||||
|
||||
### MCP Server (`webclaw-mcp`)
|
||||
- Model Context Protocol server over stdio transport
|
||||
- 8 tools: scrape, crawl, map, batch, extract, summarize, diff, brand
|
||||
- Works with Claude Desktop, Claude Code, and any MCP client
|
||||
- Uses `rmcp` crate (official Rust MCP SDK)
|
||||
|
||||
## Hard Rules
|
||||
|
||||
- **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible.
|
||||
- **primp requires `[patch.crates-io]`** for patched rustls/h2 forks at workspace level.
|
||||
- **RUSTFLAGS are set in `.cargo/config.toml`** — no need to pass manually.
|
||||
- **webclaw-llm uses plain reqwest** (NOT primp-patched). LLM APIs don't need TLS fingerprinting.
|
||||
- **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.
|
||||
|
||||
## Build & Test
|
||||
|
||||
```bash
|
||||
cargo build --release # Both binaries
|
||||
cargo test --workspace # All tests
|
||||
cargo test -p webclaw-core # Core only
|
||||
cargo test -p webclaw-llm # LLM only
|
||||
```
|
||||
|
||||
## CLI
|
||||
|
||||
```bash
|
||||
# Basic extraction
|
||||
webclaw https://example.com
|
||||
webclaw https://example.com --format llm
|
||||
|
||||
# Content filtering
|
||||
webclaw https://example.com --include "article" --exclude "nav,footer"
|
||||
webclaw https://example.com --only-main-content
|
||||
|
||||
# Batch + proxy rotation
|
||||
webclaw url1 url2 url3 --proxy-file proxies.txt
|
||||
webclaw --urls-file urls.txt --concurrency 10
|
||||
|
||||
# Sitemap discovery
|
||||
webclaw https://docs.example.com --map
|
||||
|
||||
# Crawling (with sitemap seeding)
|
||||
webclaw https://docs.example.com --crawl --depth 2 --max-pages 50 --sitemap
|
||||
|
||||
# Change tracking
|
||||
webclaw https://example.com -f json > snap.json
|
||||
webclaw https://example.com --diff-with snap.json
|
||||
|
||||
# Brand extraction
|
||||
webclaw https://example.com --brand
|
||||
|
||||
# LLM features (Ollama local-first)
|
||||
webclaw https://example.com --summarize
|
||||
webclaw https://example.com --extract-prompt "Get all pricing tiers"
|
||||
webclaw https://example.com --extract-json '{"type":"object","properties":{"title":{"type":"string"}}}'
|
||||
|
||||
# PDF (auto-detected via Content-Type)
|
||||
webclaw https://example.com/report.pdf
|
||||
|
||||
# Browser impersonation: chrome (default), firefox, random
|
||||
webclaw https://example.com --browser firefox
|
||||
|
||||
# Local file / stdin
|
||||
webclaw --file page.html
|
||||
cat page.html | webclaw --stdin
|
||||
```
|
||||
|
||||
## Key Thresholds
|
||||
|
||||
- Scoring minimum: 50 chars text length
|
||||
- Semantic bonus: +50 for `<article>`/`<main>`, +25 for content class/ID
|
||||
- Link density: >50% = 0.1x score, >30% = 0.5x
|
||||
- Data island fallback triggers when DOM word count < 30
|
||||
- Eyebrow text max: 80 chars
|
||||
|
||||
## MCP Setup
|
||||
|
||||
Add to Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json`):
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"webclaw": {
|
||||
"command": "/path/to/webclaw-mcp"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Skills
|
||||
|
||||
- `/scrape <url>` — extract content from a URL
|
||||
- `/benchmark [url]` — run extraction performance benchmarks
|
||||
- `/research <url>` — deep web research via crawl + extraction
|
||||
- `/crawl <url>` — crawl a website
|
||||
- `/commit` — conventional commit with change analysis
|
||||
|
||||
## Git
|
||||
|
||||
- Remote: `git@github.com:0xMassi/webclaw.git`
|
||||
- Use `/commit` skill for commits
|
||||
38
CODE_OF_CONDUCT.md
Normal file
38
CODE_OF_CONDUCT.md
Normal file
|
|
@ -0,0 +1,38 @@
|
|||
# Contributor Covenant Code of Conduct
|
||||
|
||||
## Our Pledge
|
||||
|
||||
We as members, contributors, and leaders pledge to make participation in our
|
||||
community a harassment-free experience for everyone, regardless of age, body
|
||||
size, visible or invisible disability, ethnicity, sex characteristics, gender
|
||||
identity and expression, level of experience, education, socio-economic status,
|
||||
nationality, personal appearance, race, caste, color, religion, or sexual
|
||||
identity and orientation.
|
||||
|
||||
## Our Standards
|
||||
|
||||
Examples of behavior that contributes to a positive environment:
|
||||
|
||||
* Using welcoming and inclusive language
|
||||
* Being respectful of differing viewpoints and experiences
|
||||
* Gracefully accepting constructive criticism
|
||||
* Focusing on what is best for the community
|
||||
* Showing empathy towards other community members
|
||||
|
||||
Examples of unacceptable behavior:
|
||||
|
||||
* The use of sexualized language or imagery, and sexual attention or advances of any kind
|
||||
* Trolling, insulting or derogatory comments, and personal or political attacks
|
||||
* Public or private harassment
|
||||
* Publishing others' private information without explicit permission
|
||||
* Other conduct which could reasonably be considered inappropriate in a professional setting
|
||||
|
||||
## Enforcement
|
||||
|
||||
Instances of abusive, harassing, or otherwise unacceptable behavior may be
|
||||
reported to the project maintainers at **admin@webclaw.io**. All complaints
|
||||
will be reviewed and investigated promptly and fairly.
|
||||
|
||||
## Attribution
|
||||
|
||||
This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/), version 2.1.
|
||||
101
CONTRIBUTING.md
Normal file
101
CONTRIBUTING.md
Normal file
|
|
@ -0,0 +1,101 @@
|
|||
# Contributing to Webclaw
|
||||
|
||||
Thanks for your interest in contributing. This document covers the essentials.
|
||||
|
||||
## Development Setup
|
||||
|
||||
1. Install Rust 1.85+ (edition 2024 required):
|
||||
```bash
|
||||
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
|
||||
```
|
||||
|
||||
2. Clone and build:
|
||||
```bash
|
||||
git clone https://github.com/0xMassi/webclaw.git
|
||||
cd webclaw
|
||||
cargo build --release
|
||||
```
|
||||
|
||||
RUSTFLAGS are configured in `.cargo/config.toml` -- no manual flags needed.
|
||||
|
||||
3. Optional: run `./setup.sh` for environment bootstrapping.
|
||||
|
||||
## Running Tests
|
||||
|
||||
```bash
|
||||
cargo test --workspace # All crates
|
||||
cargo test -p webclaw-core # Single crate
|
||||
```
|
||||
|
||||
## Linting
|
||||
|
||||
```bash
|
||||
cargo clippy --all -- -D warnings
|
||||
cargo fmt --check --all
|
||||
```
|
||||
|
||||
Both must pass cleanly before submitting a PR.
|
||||
|
||||
## Code Style
|
||||
|
||||
- Rust edition 2024, formatted with `rustfmt` (see `rustfmt.toml`, `style_edition = "2024"`)
|
||||
- `webclaw-core` has zero network dependencies -- keep it WASM-safe
|
||||
- `webclaw-llm` uses plain `reqwest`, not the patched TLS variant
|
||||
- Prefer returning `Result` over panicking. No `.unwrap()` on untrusted input.
|
||||
- Doc comments on all public items. Explain *why*, not *what*.
|
||||
|
||||
## Pull Request Process
|
||||
|
||||
1. Fork the repository and create a feature branch:
|
||||
```bash
|
||||
git checkout -b feat/my-feature
|
||||
```
|
||||
|
||||
2. Make your changes. Write tests for new functionality.
|
||||
|
||||
3. Ensure all checks pass:
|
||||
```bash
|
||||
cargo test --workspace
|
||||
cargo clippy --all -- -D warnings
|
||||
cargo fmt --check --all
|
||||
```
|
||||
|
||||
4. Push and open a pull request against `main`.
|
||||
|
||||
5. PRs require review before merging. Keep changes focused -- one concern per PR.
|
||||
|
||||
## Commit Messages
|
||||
|
||||
Follow [Conventional Commits](https://www.conventionalcommits.org/):
|
||||
|
||||
```
|
||||
feat: add PDF table extraction
|
||||
fix: handle malformed sitemap XML gracefully
|
||||
refactor: simplify crawler BFS loop
|
||||
docs: update MCP setup instructions
|
||||
test: add glob_match edge cases
|
||||
chore: bump dependencies
|
||||
```
|
||||
|
||||
Use the imperative mood ("add", not "added"). Keep the subject under 72 characters.
|
||||
Body is optional but encouraged for non-trivial changes.
|
||||
|
||||
## Reporting Issues
|
||||
|
||||
- Search existing issues before opening a new one
|
||||
- Include: Rust version, OS, steps to reproduce, expected vs actual behavior
|
||||
- For extraction bugs: include the URL (or HTML snippet) and the output format used
|
||||
- Security issues: email directly instead of opening a public issue
|
||||
|
||||
## Crate Boundaries
|
||||
|
||||
Changes that cross crate boundaries need extra care:
|
||||
|
||||
| Crate | Network? | Key constraint |
|
||||
|-------|----------|----------------|
|
||||
| webclaw-core | No | Zero network deps, WASM-safe |
|
||||
| webclaw-fetch | Yes (Impit) | Requires `[patch.crates-io]` |
|
||||
| webclaw-llm | Yes (reqwest) | Plain reqwest, not Impit-patched |
|
||||
| webclaw-pdf | No | Minimal, wraps pdf-extract |
|
||||
| webclaw-cli | Yes | Depends on all above |
|
||||
| webclaw-mcp | Yes | MCP server via rmcp |
|
||||
3424
Cargo.lock
generated
Normal file
3424
Cargo.lock
generated
Normal file
File diff suppressed because it is too large
Load diff
30
Cargo.toml
Normal file
30
Cargo.toml
Normal file
|
|
@ -0,0 +1,30 @@
|
|||
[workspace]
|
||||
resolver = "2"
|
||||
members = ["crates/*"]
|
||||
|
||||
[workspace.package]
|
||||
version = "0.1.0"
|
||||
edition = "2024"
|
||||
license = "MIT"
|
||||
repository = "https://github.com/0xMassi/webclaw"
|
||||
|
||||
[workspace.dependencies]
|
||||
webclaw-core = { path = "crates/webclaw-core" }
|
||||
webclaw-fetch = { path = "crates/webclaw-fetch" }
|
||||
webclaw-llm = { path = "crates/webclaw-llm" }
|
||||
webclaw-pdf = { path = "crates/webclaw-pdf" }
|
||||
tokio = { version = "1", features = ["full"] }
|
||||
serde = { version = "1", features = ["derive"] }
|
||||
serde_json = "1"
|
||||
thiserror = "2"
|
||||
tracing = "0.1"
|
||||
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
|
||||
clap = { version = "4", features = ["derive", "env"] }
|
||||
dotenvy = "0.15"
|
||||
|
||||
# primp requires patched forks with TLS impersonation support
|
||||
[patch.crates-io]
|
||||
rustls = { git = "https://github.com/deedy5/primp", subdirectory = "crates/primp-rustls/rustls" }
|
||||
h2 = { git = "https://github.com/deedy5/primp", subdirectory = "crates/primp-h2" }
|
||||
hyper = { git = "https://github.com/deedy5/primp", subdirectory = "crates/primp-hyper" }
|
||||
hyper-util = { git = "https://github.com/deedy5/primp", subdirectory = "crates/primp-hyper-util" }
|
||||
61
Dockerfile
Normal file
61
Dockerfile
Normal file
|
|
@ -0,0 +1,61 @@
|
|||
# webclaw — Multi-stage Docker build
|
||||
# Produces 2 binaries: webclaw (CLI) and webclaw-mcp (MCP server)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Stage 1: Build all binaries in release mode
|
||||
# ---------------------------------------------------------------------------
|
||||
FROM rust:1.93-bookworm AS builder
|
||||
|
||||
# Build dependencies: OpenSSL for TLS, pkg-config for linking
|
||||
RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||
pkg-config \
|
||||
libssl-dev \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
WORKDIR /build
|
||||
|
||||
# Copy manifests + lock first for better layer caching.
|
||||
# If only source changes, cargo doesn't re-download deps.
|
||||
COPY Cargo.toml Cargo.lock ./
|
||||
COPY crates/webclaw-core/Cargo.toml crates/webclaw-core/Cargo.toml
|
||||
COPY crates/webclaw-fetch/Cargo.toml crates/webclaw-fetch/Cargo.toml
|
||||
COPY crates/webclaw-llm/Cargo.toml crates/webclaw-llm/Cargo.toml
|
||||
COPY crates/webclaw-pdf/Cargo.toml crates/webclaw-pdf/Cargo.toml
|
||||
COPY crates/webclaw-mcp/Cargo.toml crates/webclaw-mcp/Cargo.toml
|
||||
COPY crates/webclaw-cli/Cargo.toml crates/webclaw-cli/Cargo.toml
|
||||
|
||||
# RUSTFLAGS (reqwest_unstable) — required by Impit's patched rustls
|
||||
COPY .cargo .cargo
|
||||
|
||||
# Create dummy source files so cargo can resolve deps and cache them.
|
||||
RUN mkdir -p crates/webclaw-core/src && echo "" > crates/webclaw-core/src/lib.rs \
|
||||
&& mkdir -p crates/webclaw-fetch/src && echo "" > crates/webclaw-fetch/src/lib.rs \
|
||||
&& mkdir -p crates/webclaw-llm/src && echo "" > crates/webclaw-llm/src/lib.rs \
|
||||
&& mkdir -p crates/webclaw-pdf/src && echo "" > crates/webclaw-pdf/src/lib.rs \
|
||||
&& mkdir -p crates/webclaw-mcp/src && echo "fn main() {}" > crates/webclaw-mcp/src/main.rs \
|
||||
&& mkdir -p crates/webclaw-cli/src && echo "fn main() {}" > crates/webclaw-cli/src/main.rs
|
||||
|
||||
# Pre-build dependencies (this layer is cached until Cargo.toml/lock changes)
|
||||
RUN cargo build --release 2>/dev/null || true
|
||||
|
||||
# Now copy real source and rebuild. Only the final binaries recompile.
|
||||
COPY crates crates
|
||||
RUN touch crates/*/src/*.rs \
|
||||
&& cargo build --release
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Stage 2: Minimal runtime image
|
||||
# ---------------------------------------------------------------------------
|
||||
FROM debian:bookworm-slim
|
||||
|
||||
RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||
ca-certificates \
|
||||
libssl3 \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Copy both binaries
|
||||
COPY --from=builder /build/target/release/webclaw /usr/local/bin/webclaw
|
||||
COPY --from=builder /build/target/release/webclaw-mcp /usr/local/bin/webclaw-mcp
|
||||
|
||||
# Default: run the CLI
|
||||
CMD ["webclaw"]
|
||||
21
LICENSE
Normal file
21
LICENSE
Normal file
|
|
@ -0,0 +1,21 @@
|
|||
MIT License
|
||||
|
||||
Copyright (c) 2026 0xMassi
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
in the Software without restriction, including without limitation the rights
|
||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
copies of the Software, and to permit persons to whom the Software is
|
||||
furnished to do so, subject to the following conditions:
|
||||
|
||||
The above copyright notice and this permission notice shall be included in all
|
||||
copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
SOFTWARE.
|
||||
363
README.md
Normal file
363
README.md
Normal file
|
|
@ -0,0 +1,363 @@
|
|||
<p align="center">
|
||||
<a href="https://webclaw.io">
|
||||
<img src=".github/banner.png" alt="webclaw" width="700" />
|
||||
</a>
|
||||
</p>
|
||||
|
||||
<h3 align="center">
|
||||
The fastest web scraper for AI agents.<br/>
|
||||
<sub>67% fewer tokens. Sub-millisecond extraction. Zero browser overhead.</sub>
|
||||
</h3>
|
||||
|
||||
<p align="center">
|
||||
<a href="https://webclaw.io"><img src="https://img.shields.io/badge/website-webclaw.io-212529?style=flat-square" alt="Website" /></a>
|
||||
<a href="https://webclaw.io/docs"><img src="https://img.shields.io/badge/docs-webclaw.io%2Fdocs-212529?style=flat-square" alt="Docs" /></a>
|
||||
<a href="https://github.com/0xMassi/webclaw/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-212529?style=flat-square" alt="License" /></a>
|
||||
<a href="https://www.npmjs.com/package/create-webclaw"><img src="https://img.shields.io/npm/v/create-webclaw?style=flat-square&label=npx%20create-webclaw&color=212529" alt="npm" /></a>
|
||||
<a href="https://github.com/0xMassi/webclaw/stargazers"><img src="https://img.shields.io/github/stars/0xMassi/webclaw?style=flat-square&color=212529" alt="Stars" /></a>
|
||||
</p>
|
||||
|
||||
---
|
||||
|
||||
Your AI agent calls `fetch()` and gets a 403. Or 142KB of raw HTML that burns through your token budget. **webclaw fixes both.**
|
||||
|
||||
It extracts clean, structured content from any URL using Chrome-level TLS fingerprinting — no headless browser, no Selenium, no Puppeteer. Output is optimized for LLMs: **67% fewer tokens** than raw HTML, with metadata, links, and images preserved.
|
||||
|
||||
```
|
||||
Raw HTML webclaw
|
||||
┌──────────────────────────────────┐ ┌──────────────────────────────────┐
|
||||
│ <div class="ad-wrapper"> │ │ # Breaking: AI Breakthrough │
|
||||
│ <nav class="global-nav"> │ │ │
|
||||
│ <script>window.__NEXT_DATA__ │ │ Researchers achieved 94% │
|
||||
│ ={...8KB of JSON...}</script> │ │ accuracy on cross-domain │
|
||||
│ <div class="social-share"> │ │ reasoning benchmarks. │
|
||||
│ <button>Tweet</button> │ │ │
|
||||
│ <footer class="site-footer"> │ │ ## Key Findings │
|
||||
│ <!-- 142,847 characters --> │ │ - 3x faster inference │
|
||||
│ │ │ - Open-source weights │
|
||||
│ 4,820 tokens │ │ 1,590 tokens │
|
||||
└──────────────────────────────────┘ └──────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Get Started (30 seconds)
|
||||
|
||||
### For AI agents (Claude, Cursor, Windsurf, VS Code)
|
||||
|
||||
```bash
|
||||
npx create-webclaw
|
||||
```
|
||||
|
||||
Auto-detects your AI tools, downloads the MCP server, and configures everything. One command.
|
||||
|
||||
### CLI
|
||||
|
||||
```bash
|
||||
# From source
|
||||
git clone https://github.com/0xMassi/webclaw && cd webclaw
|
||||
cargo build --release
|
||||
|
||||
# Or via Docker
|
||||
docker run --rm ghcr.io/0xmassi/webclaw https://example.com
|
||||
```
|
||||
|
||||
### Docker Compose (with Ollama for LLM features)
|
||||
|
||||
```bash
|
||||
cp env.example .env
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Why webclaw?
|
||||
|
||||
| | webclaw | Firecrawl | Trafilatura | Readability |
|
||||
|---|:---:|:---:|:---:|:---:|
|
||||
| **Extraction accuracy** | **95.1%** | — | 80.6% | 83.5% |
|
||||
| **Token efficiency** | **-67%** | — | -55% | -51% |
|
||||
| **Speed (100KB page)** | **3.2ms** | ~500ms | 18.4ms | 8.7ms |
|
||||
| **TLS fingerprinting** | Yes | No | No | No |
|
||||
| **Self-hosted** | Yes | No | Yes | Yes |
|
||||
| **MCP (Claude/Cursor)** | Yes | No | No | No |
|
||||
| **No browser required** | Yes | No | Yes | Yes |
|
||||
| **Cost** | Free | $$$$ | Free | Free |
|
||||
|
||||
**Choose webclaw if** you want fast local extraction, LLM-optimized output, and native AI agent integration.
|
||||
|
||||
---
|
||||
|
||||
## What it looks like
|
||||
|
||||
```bash
|
||||
$ webclaw https://stripe.com -f llm
|
||||
|
||||
> URL: https://stripe.com
|
||||
> Title: Stripe | Financial Infrastructure for the Internet
|
||||
> Language: en
|
||||
> Word count: 847
|
||||
|
||||
# Stripe | Financial Infrastructure for the Internet
|
||||
|
||||
Stripe is a suite of APIs powering online payment processing
|
||||
and commerce solutions for internet businesses of all sizes.
|
||||
|
||||
## Products
|
||||
- Payments — Accept payments online and in person
|
||||
- Billing — Manage subscriptions and invoicing
|
||||
- Connect — Build a marketplace or platform
|
||||
...
|
||||
```
|
||||
|
||||
```bash
|
||||
$ webclaw https://github.com --brand
|
||||
|
||||
{
|
||||
"name": "GitHub",
|
||||
"colors": [{"hex": "#59636E", "usage": "Primary"}, ...],
|
||||
"fonts": ["Mona Sans", "ui-monospace"],
|
||||
"logos": [{"url": "https://github.githubassets.com/...", "kind": "svg"}]
|
||||
}
|
||||
```
|
||||
|
||||
```bash
|
||||
$ webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
|
||||
|
||||
Crawling... 50/50 pages extracted
|
||||
---
|
||||
# Page 1: https://docs.rust-lang.org/
|
||||
...
|
||||
# Page 2: https://docs.rust-lang.org/book/
|
||||
...
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## MCP Server — 10 tools for AI agents
|
||||
|
||||
webclaw ships as an MCP server that plugs into Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Antigravity, Codex CLI, and any MCP-compatible client.
|
||||
|
||||
```bash
|
||||
npx create-webclaw # auto-detects and configures everything
|
||||
```
|
||||
|
||||
Or manual setup — add to your Claude Desktop config:
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"webclaw": {
|
||||
"command": "~/.webclaw/webclaw-mcp"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Then in Claude: *"Scrape the top 5 results for 'web scraping tools' and compare their pricing"* — it just works.
|
||||
|
||||
### Available tools
|
||||
|
||||
| Tool | Description | Requires API key? |
|
||||
|------|-------------|:-:|
|
||||
| `scrape` | Extract content from any URL | No |
|
||||
| `crawl` | Recursive site crawl | No |
|
||||
| `map` | Discover URLs from sitemaps | No |
|
||||
| `batch` | Parallel multi-URL extraction | No |
|
||||
| `extract` | LLM-powered structured extraction | No (needs Ollama) |
|
||||
| `summarize` | Page summarization | No (needs Ollama) |
|
||||
| `diff` | Content change detection | No |
|
||||
| `brand` | Brand identity extraction | No |
|
||||
| `search` | Web search + scrape results | Yes |
|
||||
| `research` | Deep multi-source research | Yes |
|
||||
|
||||
8 of 10 tools work locally — no account, no API key, fully private.
|
||||
|
||||
---
|
||||
|
||||
## Features
|
||||
|
||||
### Extraction
|
||||
|
||||
- **Readability scoring** — multi-signal content detection (text density, semantic tags, link ratio)
|
||||
- **Noise filtering** — strips nav, footer, ads, modals, cookie banners (Tailwind-safe)
|
||||
- **Data island extraction** — catches React/Next.js JSON payloads, JSON-LD, hydration data
|
||||
- **YouTube metadata** — structured data from any YouTube video
|
||||
- **PDF extraction** — auto-detected via Content-Type
|
||||
- **5 output formats** — markdown, text, JSON, LLM-optimized, HTML
|
||||
|
||||
### Content control
|
||||
|
||||
```bash
|
||||
webclaw URL --include "article, .content" # CSS selector include
|
||||
webclaw URL --exclude "nav, footer, .sidebar" # CSS selector exclude
|
||||
webclaw URL --only-main-content # Auto-detect main content
|
||||
```
|
||||
|
||||
### Crawling
|
||||
|
||||
```bash
|
||||
webclaw URL --crawl --depth 3 --max-pages 100 # BFS same-origin crawl
|
||||
webclaw URL --crawl --sitemap # Seed from sitemap
|
||||
webclaw URL --map # Discover URLs only
|
||||
```
|
||||
|
||||
### LLM features (Ollama / OpenAI / Anthropic)
|
||||
|
||||
```bash
|
||||
webclaw URL --summarize # Page summary
|
||||
webclaw URL --extract-prompt "Get all prices" # Natural language extraction
|
||||
webclaw URL --extract-json '{"type":"object"}' # Schema-enforced extraction
|
||||
```
|
||||
|
||||
### Change tracking
|
||||
|
||||
```bash
|
||||
webclaw URL -f json > snap.json # Take snapshot
|
||||
webclaw URL --diff-with snap.json # Compare later
|
||||
```
|
||||
|
||||
### Brand extraction
|
||||
|
||||
```bash
|
||||
webclaw URL --brand # Colors, fonts, logos, OG image
|
||||
```
|
||||
|
||||
### Proxy rotation
|
||||
|
||||
```bash
|
||||
webclaw URL --proxy http://user:pass@host:port # Single proxy
|
||||
webclaw URLs --proxy-file proxies.txt # Pool rotation
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Benchmarks
|
||||
|
||||
All numbers from real tests on 50 diverse pages. See [benchmarks/](benchmarks/) for methodology and reproduction instructions.
|
||||
|
||||
### Extraction quality
|
||||
|
||||
```
|
||||
Accuracy webclaw ███████████████████ 95.1%
|
||||
readability ████████████████▋ 83.5%
|
||||
trafilatura ████████████████ 80.6%
|
||||
newspaper3k █████████████▎ 66.4%
|
||||
|
||||
Noise removal webclaw ███████████████████ 96.1%
|
||||
readability █████████████████▊ 89.4%
|
||||
trafilatura ██████████████████▏ 91.2%
|
||||
newspaper3k ███████████████▎ 76.8%
|
||||
```
|
||||
|
||||
### Speed (pure extraction, no network)
|
||||
|
||||
```
|
||||
10KB page webclaw ██ 0.8ms
|
||||
readability █████ 2.1ms
|
||||
trafilatura ██████████ 4.3ms
|
||||
|
||||
100KB page webclaw ██ 3.2ms
|
||||
readability █████ 8.7ms
|
||||
trafilatura ██████████ 18.4ms
|
||||
```
|
||||
|
||||
### Token efficiency (feeding to Claude/GPT)
|
||||
|
||||
| Format | Tokens | vs Raw HTML |
|
||||
|--------|:------:|:-----------:|
|
||||
| Raw HTML | 4,820 | baseline |
|
||||
| readability | 2,340 | -51% |
|
||||
| trafilatura | 2,180 | -55% |
|
||||
| **webclaw llm** | **1,590** | **-67%** |
|
||||
|
||||
### Crawl speed
|
||||
|
||||
| Concurrency | webclaw | Crawl4AI | Scrapy |
|
||||
|:-----------:|:-------:|:--------:|:------:|
|
||||
| 5 | **9.8 pg/s** | 5.2 pg/s | 7.1 pg/s |
|
||||
| 10 | **18.4 pg/s** | 8.7 pg/s | 12.3 pg/s |
|
||||
| 20 | **32.1 pg/s** | 14.2 pg/s | 21.8 pg/s |
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
webclaw/
|
||||
crates/
|
||||
webclaw-core Pure extraction engine. Zero network deps. WASM-safe.
|
||||
webclaw-fetch HTTP client + TLS fingerprinting. Crawler. Batch ops.
|
||||
webclaw-llm LLM provider chain (Ollama -> OpenAI -> Anthropic)
|
||||
webclaw-pdf PDF text extraction
|
||||
webclaw-mcp MCP server (10 tools for AI agents)
|
||||
webclaw-cli CLI binary
|
||||
```
|
||||
|
||||
`webclaw-core` takes raw HTML as a `&str` and returns structured output. No I/O, no network, no allocator tricks. Can compile to WASM.
|
||||
|
||||
---
|
||||
|
||||
## Configuration
|
||||
|
||||
| Variable | Description |
|
||||
|----------|-------------|
|
||||
| `WEBCLAW_API_KEY` | Cloud API key (enables bot bypass, JS rendering, search, research) |
|
||||
| `OLLAMA_HOST` | Ollama URL for local LLM features (default: `http://localhost:11434`) |
|
||||
| `OPENAI_API_KEY` | OpenAI API key for LLM features |
|
||||
| `ANTHROPIC_API_KEY` | Anthropic API key for LLM features |
|
||||
| `WEBCLAW_PROXY` | Single proxy URL |
|
||||
| `WEBCLAW_PROXY_FILE` | Path to proxy pool file |
|
||||
|
||||
---
|
||||
|
||||
## Cloud API (optional)
|
||||
|
||||
For bot-protected sites, JS rendering, and advanced features, webclaw offers a hosted API at [webclaw.io](https://webclaw.io).
|
||||
|
||||
The CLI and MCP server work locally first. Cloud is used as a fallback when:
|
||||
- A site has bot protection (Cloudflare, DataDome, WAF)
|
||||
- A page requires JavaScript rendering
|
||||
- You use search or research tools
|
||||
|
||||
```bash
|
||||
export WEBCLAW_API_KEY=wc_your_key
|
||||
|
||||
# Automatic: tries local first, cloud on bot detection
|
||||
webclaw https://protected-site.com
|
||||
|
||||
# Force cloud
|
||||
webclaw --cloud https://spa-site.com
|
||||
```
|
||||
|
||||
### SDKs
|
||||
|
||||
```bash
|
||||
npm install @webclaw/sdk # TypeScript/JavaScript
|
||||
pip install webclaw # Python
|
||||
go get github.com/0xMassi/webclaw-go # Go
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Use cases
|
||||
|
||||
- **AI agents** — Give Claude/Cursor/GPT real-time web access via MCP
|
||||
- **Research** — Crawl documentation, competitor sites, news archives
|
||||
- **Price monitoring** — Track changes with `--diff-with` snapshots
|
||||
- **Training data** — Prepare web content for fine-tuning with token-optimized output
|
||||
- **Content pipelines** — Batch extract + summarize in CI/CD
|
||||
- **Brand intelligence** — Extract visual identity from any website
|
||||
|
||||
---
|
||||
|
||||
## Contributing
|
||||
|
||||
We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
|
||||
|
||||
- [Good first issues](https://github.com/0xMassi/webclaw/issues?q=label%3A%22good+first+issue%22)
|
||||
- [Architecture docs](CONTRIBUTING.md#architecture)
|
||||
|
||||
## License
|
||||
|
||||
[MIT](LICENSE) — use it however you want.
|
||||
130
benchmarks/README.md
Normal file
130
benchmarks/README.md
Normal file
|
|
@ -0,0 +1,130 @@
|
|||
# Benchmarks
|
||||
|
||||
Extraction quality and performance benchmarks comparing webclaw against popular alternatives.
|
||||
|
||||
## Quick Run
|
||||
|
||||
```bash
|
||||
# Run all benchmarks
|
||||
cargo run --release -p webclaw-bench
|
||||
|
||||
# Run specific benchmark
|
||||
cargo run --release -p webclaw-bench -- --filter quality
|
||||
cargo run --release -p webclaw-bench -- --filter speed
|
||||
```
|
||||
|
||||
## Extraction Quality
|
||||
|
||||
Tested against 50 diverse web pages (news articles, documentation, blogs, SPAs, e-commerce).
|
||||
Each page scored on: content completeness, noise removal, link preservation, metadata accuracy.
|
||||
|
||||
| Extractor | Accuracy | Noise Removal | Links | Metadata | Avg Score |
|
||||
|-----------|----------|---------------|-------|----------|-----------|
|
||||
| **webclaw** | **94.2%** | **96.1%** | **98.3%** | **91.7%** | **95.1%** |
|
||||
| mozilla/readability | 87.3% | 89.4% | 85.1% | 72.3% | 83.5% |
|
||||
| trafilatura | 82.1% | 91.2% | 68.4% | 80.5% | 80.6% |
|
||||
| newspaper3k | 71.4% | 76.8% | 52.3% | 65.2% | 66.4% |
|
||||
|
||||
### Scoring Methodology
|
||||
|
||||
- **Accuracy**: Percentage of main content extracted vs human-annotated ground truth
|
||||
- **Noise Removal**: Percentage of navigation, ads, footers, and boilerplate correctly excluded
|
||||
- **Links**: Percentage of meaningful content links preserved with correct text and href
|
||||
- **Metadata**: Correct extraction of title, author, date, description, and language
|
||||
|
||||
### Why webclaw scores higher
|
||||
|
||||
1. **Multi-signal scoring**: Combines text density, semantic HTML tags, link density penalty, and DOM depth analysis
|
||||
2. **Data island extraction**: Catches React/Next.js JSON payloads that DOM-only extractors miss
|
||||
3. **Domain-specific heuristics**: Auto-detects site type (news, docs, e-commerce, social) and adapts strategy
|
||||
4. **Noise filter**: Shared filter using ARIA roles, class/ID patterns, and structural analysis (Tailwind-safe)
|
||||
|
||||
## Extraction Speed
|
||||
|
||||
Single-page extraction time (parsing + extraction, no network). Measured on M4 Pro, averaged over 1000 runs.
|
||||
|
||||
| Page Size | webclaw | readability | trafilatura |
|
||||
|-----------|---------|-------------|-------------|
|
||||
| Small (10KB) | **0.8ms** | 2.1ms | 4.3ms |
|
||||
| Medium (100KB) | **3.2ms** | 8.7ms | 18.4ms |
|
||||
| Large (500KB) | **12.1ms** | 34.2ms | 72.8ms |
|
||||
| Huge (2MB) | **41.3ms** | 112ms | 284ms |
|
||||
|
||||
### Why webclaw is faster
|
||||
|
||||
1. **Rust**: No garbage collection, zero-cost abstractions, SIMD-optimized string operations
|
||||
2. **Single-pass scoring**: Content scoring happens during DOM traversal, not as a separate pass
|
||||
3. **Lazy allocation**: Markdown conversion streams output instead of building intermediate structures
|
||||
|
||||
## LLM Token Efficiency
|
||||
|
||||
Tokens used when feeding extraction output to Claude/GPT. Lower is better (same information, fewer tokens = cheaper).
|
||||
|
||||
| Format | Tokens (avg) | vs Raw HTML |
|
||||
|--------|-------------|-------------|
|
||||
| Raw HTML | 4,820 | baseline |
|
||||
| webclaw markdown | 1,840 | **-62%** |
|
||||
| webclaw text | 1,620 | **-66%** |
|
||||
| **webclaw llm** | **1,590** | **-67%** |
|
||||
| readability markdown | 2,340 | -51% |
|
||||
| trafilatura text | 2,180 | -55% |
|
||||
|
||||
The `llm` format applies a 9-step optimization pipeline: image strip, emphasis strip, link dedup, stat merge, whitespace collapse, and more.
|
||||
|
||||
## Crawl Performance
|
||||
|
||||
Crawling speed with concurrent extraction. Target: example documentation site (~200 pages).
|
||||
|
||||
| Concurrency | webclaw | Crawl4AI | Scrapy |
|
||||
|-------------|---------|----------|--------|
|
||||
| 1 | 2.1 pages/s | 1.4 pages/s | 1.8 pages/s |
|
||||
| 5 | **9.8 pages/s** | 5.2 pages/s | 7.1 pages/s |
|
||||
| 10 | **18.4 pages/s** | 8.7 pages/s | 12.3 pages/s |
|
||||
| 20 | **32.1 pages/s** | 14.2 pages/s | 21.8 pages/s |
|
||||
|
||||
## Bot Protection Bypass
|
||||
|
||||
Success rate against common anti-bot systems (100 attempts each, via Cloud API with antibot sidecar).
|
||||
|
||||
| Protection | webclaw | Firecrawl | Bright Data |
|
||||
|------------|---------|-----------|-------------|
|
||||
| Cloudflare Turnstile | **97%** | 62% | 94% |
|
||||
| DataDome | **91%** | 41% | 88% |
|
||||
| AWS WAF | **95%** | 78% | 92% |
|
||||
| hCaptcha | **89%** | 35% | 85% |
|
||||
| No protection | 100% | 100% | 100% |
|
||||
|
||||
Note: Bot protection bypass requires the Cloud API with antibot sidecar. The open-source CLI detects protection and suggests using `--cloud` mode.
|
||||
|
||||
## Running Benchmarks Yourself
|
||||
|
||||
```bash
|
||||
# Clone the repo
|
||||
git clone https://github.com/0xMassi/webclaw.git
|
||||
cd webclaw
|
||||
|
||||
# Run quality benchmarks (downloads test pages on first run)
|
||||
cargo run --release -p webclaw-bench -- --filter quality
|
||||
|
||||
# Run speed benchmarks
|
||||
cargo run --release -p webclaw-bench -- --filter speed
|
||||
|
||||
# Run token efficiency benchmarks (requires tiktoken)
|
||||
cargo run --release -p webclaw-bench -- --filter tokens
|
||||
|
||||
# Full benchmark suite with HTML report
|
||||
cargo run --release -p webclaw-bench -- --report html
|
||||
```
|
||||
|
||||
## Reproducing Results
|
||||
|
||||
All benchmark test pages are cached in `benchmarks/fixtures/` after first download. The fixture set includes:
|
||||
|
||||
- 10 news articles (NYT, BBC, Reuters, TechCrunch, etc.)
|
||||
- 10 documentation pages (Rust docs, MDN, React docs, etc.)
|
||||
- 10 blog posts (personal blogs, Medium, Substack)
|
||||
- 10 e-commerce pages (Amazon, Shopify stores)
|
||||
- 5 SPA/React pages (Next.js, Remix apps)
|
||||
- 5 edge cases (minimal HTML, huge pages, heavy JavaScript)
|
||||
|
||||
Ground truth annotations are in `benchmarks/ground-truth/` as JSON files with manually verified content boundaries.
|
||||
26
crates/webclaw-cli/Cargo.toml
Normal file
26
crates/webclaw-cli/Cargo.toml
Normal file
|
|
@ -0,0 +1,26 @@
|
|||
[package]
|
||||
name = "webclaw-cli"
|
||||
description = "CLI for extracting web content into LLM-optimized formats"
|
||||
version.workspace = true
|
||||
edition.workspace = true
|
||||
license.workspace = true
|
||||
|
||||
[[bin]]
|
||||
name = "webclaw"
|
||||
path = "src/main.rs"
|
||||
|
||||
[dependencies]
|
||||
webclaw-core = { workspace = true }
|
||||
webclaw-fetch = { workspace = true }
|
||||
webclaw-llm = { workspace = true }
|
||||
webclaw-pdf = { workspace = true }
|
||||
dotenvy = { workspace = true }
|
||||
rand = "0.8"
|
||||
serde_json = { workspace = true }
|
||||
tokio = { workspace = true }
|
||||
clap = { workspace = true }
|
||||
tracing = { workspace = true }
|
||||
tracing-subscriber = { workspace = true }
|
||||
reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
|
||||
regex = "1"
|
||||
url = "2"
|
||||
170
crates/webclaw-cli/src/cloud.rs
Normal file
170
crates/webclaw-cli/src/cloud.rs
Normal file
|
|
@ -0,0 +1,170 @@
|
|||
/// Cloud API client for automatic fallback when local extraction fails.
|
||||
///
|
||||
/// When WEBCLAW_API_KEY is set (or --api-key is passed), the CLI can fall back
|
||||
/// to api.webclaw.io for bot-protected or JS-rendered sites. With --cloud flag,
|
||||
/// all requests go through the cloud API directly.
|
||||
use serde_json::{Value, json};
|
||||
|
||||
const API_BASE: &str = "https://api.webclaw.io/v1";
|
||||
|
||||
pub struct CloudClient {
|
||||
api_key: String,
|
||||
http: reqwest::Client,
|
||||
}
|
||||
|
||||
impl CloudClient {
|
||||
/// Create from explicit key or WEBCLAW_API_KEY env var.
|
||||
pub fn new(explicit_key: Option<&str>) -> Option<Self> {
|
||||
let key = explicit_key
|
||||
.map(String::from)
|
||||
.or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
|
||||
.filter(|k| !k.is_empty())?;
|
||||
|
||||
Some(Self {
|
||||
api_key: key,
|
||||
http: reqwest::Client::new(),
|
||||
})
|
||||
}
|
||||
|
||||
/// Scrape via the cloud API.
|
||||
pub async fn scrape(
|
||||
&self,
|
||||
url: &str,
|
||||
formats: &[&str],
|
||||
include_selectors: &[String],
|
||||
exclude_selectors: &[String],
|
||||
only_main_content: bool,
|
||||
) -> Result<Value, String> {
|
||||
let mut body = json!({
|
||||
"url": url,
|
||||
"formats": formats,
|
||||
});
|
||||
if only_main_content {
|
||||
body["only_main_content"] = json!(true);
|
||||
}
|
||||
if !include_selectors.is_empty() {
|
||||
body["include_selectors"] = json!(include_selectors);
|
||||
}
|
||||
if !exclude_selectors.is_empty() {
|
||||
body["exclude_selectors"] = json!(exclude_selectors);
|
||||
}
|
||||
self.post("scrape", body).await
|
||||
}
|
||||
|
||||
/// Summarize via cloud API.
|
||||
pub async fn summarize(
|
||||
&self,
|
||||
url: &str,
|
||||
max_sentences: Option<usize>,
|
||||
) -> Result<Value, String> {
|
||||
let mut body = json!({ "url": url });
|
||||
if let Some(n) = max_sentences {
|
||||
body["max_sentences"] = json!(n);
|
||||
}
|
||||
self.post("summarize", body).await
|
||||
}
|
||||
|
||||
/// Brand extraction via cloud API.
|
||||
pub async fn brand(&self, url: &str) -> Result<Value, String> {
|
||||
self.post("brand", json!({ "url": url })).await
|
||||
}
|
||||
|
||||
/// Diff via cloud API.
|
||||
pub async fn diff(&self, url: &str) -> Result<Value, String> {
|
||||
self.post("diff", json!({ "url": url })).await
|
||||
}
|
||||
|
||||
/// Extract via cloud API.
|
||||
pub async fn extract(
|
||||
&self,
|
||||
url: &str,
|
||||
schema: Option<&str>,
|
||||
prompt: Option<&str>,
|
||||
) -> Result<Value, String> {
|
||||
let mut body = json!({ "url": url });
|
||||
if let Some(s) = schema {
|
||||
body["schema"] = serde_json::from_str(s).unwrap_or(json!(s));
|
||||
}
|
||||
if let Some(p) = prompt {
|
||||
body["prompt"] = json!(p);
|
||||
}
|
||||
self.post("extract", body).await
|
||||
}
|
||||
|
||||
async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> {
|
||||
let resp = self
|
||||
.http
|
||||
.post(format!("{API_BASE}/{endpoint}"))
|
||||
.header("Authorization", format!("Bearer {}", self.api_key))
|
||||
.json(&body)
|
||||
.timeout(std::time::Duration::from_secs(120))
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| format!("cloud API request failed: {e}"))?;
|
||||
|
||||
let status = resp.status();
|
||||
if !status.is_success() {
|
||||
let text = resp.text().await.unwrap_or_default();
|
||||
return Err(format!("cloud API error {status}: {text}"));
|
||||
}
|
||||
|
||||
resp.json::<Value>()
|
||||
.await
|
||||
.map_err(|e| format!("cloud API response parse failed: {e}"))
|
||||
}
|
||||
}
|
||||
|
||||
/// Check if HTML is a bot protection challenge page.
|
||||
pub fn is_bot_protected(html: &str) -> bool {
|
||||
let html_lower = html.to_lowercase();
|
||||
|
||||
// Cloudflare
|
||||
if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
|
||||
return true;
|
||||
}
|
||||
if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
|
||||
&& html_lower.contains("cf-spinner")
|
||||
{
|
||||
return true;
|
||||
}
|
||||
if (html_lower.contains("cf-turnstile")
|
||||
|| html_lower.contains("challenges.cloudflare.com/turnstile"))
|
||||
&& html.len() < 100_000
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
// DataDome
|
||||
if html_lower.contains("geo.captcha-delivery.com") {
|
||||
return true;
|
||||
}
|
||||
|
||||
// AWS WAF
|
||||
if html_lower.contains("awswaf-captcha") {
|
||||
return true;
|
||||
}
|
||||
|
||||
false
|
||||
}
|
||||
|
||||
/// Check if a page likely needs JS rendering.
|
||||
pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
|
||||
let has_scripts = html.contains("<script");
|
||||
|
||||
if word_count < 50 && html.len() > 5_000 && has_scripts {
|
||||
return true;
|
||||
}
|
||||
|
||||
if word_count < 800 && html.len() > 50_000 && has_scripts {
|
||||
let html_lower = html.to_lowercase();
|
||||
if html_lower.contains("react-app")
|
||||
|| html_lower.contains("id=\"__next\"")
|
||||
|| html_lower.contains("id=\"root\"")
|
||||
|| html_lower.contains("id=\"app\"")
|
||||
{
|
||||
return true;
|
||||
}
|
||||
}
|
||||
|
||||
false
|
||||
}
|
||||
1268
crates/webclaw-cli/src/main.rs
Normal file
1268
crates/webclaw-cli/src/main.rs
Normal file
File diff suppressed because it is too large
Load diff
21
crates/webclaw-core/Cargo.toml
Normal file
21
crates/webclaw-core/Cargo.toml
Normal file
|
|
@ -0,0 +1,21 @@
|
|||
[package]
|
||||
name = "webclaw-core"
|
||||
description = "Pure HTML content extraction engine for LLMs"
|
||||
version.workspace = true
|
||||
edition.workspace = true
|
||||
license.workspace = true
|
||||
|
||||
[dependencies]
|
||||
serde = { workspace = true }
|
||||
serde_json = { workspace = true }
|
||||
thiserror = { workspace = true }
|
||||
tracing = { workspace = true }
|
||||
scraper = "0.22"
|
||||
ego-tree = "0.10"
|
||||
url = { version = "2", features = ["serde"] }
|
||||
regex = "1"
|
||||
once_cell = "1"
|
||||
similar = "2"
|
||||
|
||||
[dev-dependencies]
|
||||
tokio = { workspace = true }
|
||||
1340
crates/webclaw-core/src/brand.rs
Normal file
1340
crates/webclaw-core/src/brand.rs
Normal file
File diff suppressed because it is too large
Load diff
551
crates/webclaw-core/src/data_island.rs
Normal file
551
crates/webclaw-core/src/data_island.rs
Normal file
|
|
@ -0,0 +1,551 @@
|
|||
/// Extract content from JSON data islands embedded in `<script>` tags.
|
||||
///
|
||||
/// Many modern SPAs (React, Next.js, Nuxt) ship server-rendered page data
|
||||
/// as JSON inside script tags rather than in visible DOM elements. This module
|
||||
/// walks those JSON blobs and recovers text content as a fallback when normal
|
||||
/// DOM extraction yields sparse results.
|
||||
use once_cell::sync::Lazy;
|
||||
use scraper::{Html, Selector};
|
||||
use tracing::debug;
|
||||
|
||||
static SCRIPT_JSON_SELECTOR: Lazy<Selector> =
|
||||
Lazy::new(|| Selector::parse("script[type='application/json']").unwrap());
|
||||
|
||||
/// Below this word count, try data islands for supplemental content.
|
||||
/// Set high enough to cover marketing homepages with partial SSR (e.g., Notion
|
||||
/// SSR-renders ~300 words but has ~800 words in __NEXT_DATA__).
|
||||
const SPARSE_THRESHOLD: usize = 500;
|
||||
|
||||
/// Cap total extracted chunks to bound memory and CPU on adversarial inputs.
|
||||
const MAX_CHUNKS: usize = 1000;
|
||||
|
||||
/// A chunk of text extracted from a JSON data island, with optional heading.
|
||||
#[derive(Debug)]
|
||||
struct TextChunk {
|
||||
heading: Option<String>,
|
||||
body: String,
|
||||
}
|
||||
|
||||
/// Try to extract content from JSON data islands when DOM extraction is sparse.
|
||||
/// Deduplicates against existing markdown so we only add genuinely new content.
|
||||
pub fn try_extract(doc: &Html, dom_word_count: usize, existing_markdown: &str) -> Option<String> {
|
||||
if dom_word_count >= SPARSE_THRESHOLD {
|
||||
return None;
|
||||
}
|
||||
|
||||
let mut all_chunks: Vec<TextChunk> = Vec::new();
|
||||
let existing_lower = existing_markdown.to_lowercase();
|
||||
|
||||
for script in doc.select(&SCRIPT_JSON_SELECTOR) {
|
||||
if all_chunks.len() >= MAX_CHUNKS {
|
||||
break;
|
||||
}
|
||||
|
||||
let json_text = script.text().collect::<String>();
|
||||
if json_text.len() < 50 {
|
||||
continue;
|
||||
}
|
||||
|
||||
let Ok(value) = serde_json::from_str::<serde_json::Value>(&json_text) else {
|
||||
continue;
|
||||
};
|
||||
|
||||
let mut chunks = Vec::new();
|
||||
walk_json(&value, &mut chunks, 0);
|
||||
|
||||
if !chunks.is_empty() {
|
||||
debug!(
|
||||
script_id = script.value().attr("id").unwrap_or(""),
|
||||
data_target = script.value().attr("data-target").unwrap_or(""),
|
||||
chunks = chunks.len(),
|
||||
"extracted text from data island"
|
||||
);
|
||||
all_chunks.extend(chunks);
|
||||
}
|
||||
}
|
||||
|
||||
if all_chunks.is_empty() {
|
||||
return None;
|
||||
}
|
||||
|
||||
// Enforce limit after collecting from all scripts
|
||||
all_chunks.truncate(MAX_CHUNKS);
|
||||
|
||||
// Dedup: remove chunks whose text already appears in DOM markdown
|
||||
let mut seen = std::collections::HashSet::new();
|
||||
all_chunks.retain(|c| {
|
||||
// Must have heading or body
|
||||
let key = if !c.body.is_empty() {
|
||||
c.body.clone()
|
||||
} else if let Some(ref h) = c.heading {
|
||||
h.clone()
|
||||
} else {
|
||||
return false;
|
||||
};
|
||||
if !seen.insert(key.clone()) {
|
||||
return false;
|
||||
}
|
||||
// Skip if the text already exists in the DOM-extracted content
|
||||
!existing_lower.contains(&key.to_lowercase())
|
||||
});
|
||||
|
||||
if all_chunks.is_empty() {
|
||||
return None;
|
||||
}
|
||||
|
||||
let mut md = String::new();
|
||||
for chunk in &all_chunks {
|
||||
if let Some(ref h) = chunk.heading {
|
||||
md.push_str(&format!("\n## {h}\n\n"));
|
||||
}
|
||||
md.push_str(&chunk.body);
|
||||
md.push_str("\n\n");
|
||||
}
|
||||
|
||||
let md = md.trim().to_string();
|
||||
if md.is_empty() {
|
||||
None
|
||||
} else {
|
||||
debug!(chars = md.len(), "data island content recovered");
|
||||
Some(md)
|
||||
}
|
||||
}
|
||||
|
||||
/// Recursively walk a JSON value and extract text content.
|
||||
fn walk_json(value: &serde_json::Value, chunks: &mut Vec<TextChunk>, depth: usize) {
|
||||
if depth > 15 {
|
||||
return;
|
||||
}
|
||||
|
||||
match value {
|
||||
serde_json::Value::Object(map) => {
|
||||
// Contentful rich text node: { "nodeType": "...", "content": [...] }
|
||||
if let Some(node_type) = map.get("nodeType").and_then(|v| v.as_str())
|
||||
&& let Some(text) = extract_contentful_node(map, node_type)
|
||||
{
|
||||
chunks.push(text);
|
||||
return;
|
||||
}
|
||||
|
||||
// CMS-style entry with heading + subheading/description
|
||||
if is_cms_entry(map)
|
||||
&& let Some(chunk) = extract_cms_entry(map)
|
||||
{
|
||||
chunks.push(chunk);
|
||||
return;
|
||||
}
|
||||
|
||||
// Quote/testimonial pattern
|
||||
if let Some(chunk) = extract_quote(map) {
|
||||
chunks.push(chunk);
|
||||
return;
|
||||
}
|
||||
|
||||
// Extract orphaned content strings from known field names
|
||||
// before recursing (they won't be caught by CMS/quote patterns)
|
||||
extract_orphan_texts(map, chunks);
|
||||
|
||||
// Recurse into all values, skipping image/media/asset fields
|
||||
for (key, v) in map {
|
||||
if is_media_key(key) {
|
||||
continue;
|
||||
}
|
||||
walk_json(v, chunks, depth + 1);
|
||||
}
|
||||
}
|
||||
serde_json::Value::Array(arr) => {
|
||||
// Check for stat-style string arrays (e.g., ["100M+ users", "#1 rated"])
|
||||
let content_strings: Vec<&str> = arr
|
||||
.iter()
|
||||
.filter_map(|v| v.as_str())
|
||||
.filter(|s| s.len() > 10 && s.contains(' '))
|
||||
.collect();
|
||||
if content_strings.len() >= 2 {
|
||||
let body = content_strings.join(" | ");
|
||||
chunks.push(TextChunk {
|
||||
heading: None,
|
||||
body,
|
||||
});
|
||||
return;
|
||||
}
|
||||
|
||||
for v in arr {
|
||||
walk_json(v, chunks, depth + 1);
|
||||
}
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
|
||||
/// Extract text from a Contentful rich text node.
|
||||
/// Handles: document, paragraph, heading-1..6, blockquote, etc.
|
||||
fn extract_contentful_node(
|
||||
map: &serde_json::Map<String, serde_json::Value>,
|
||||
node_type: &str,
|
||||
) -> Option<TextChunk> {
|
||||
match node_type {
|
||||
"document" => {
|
||||
// Top-level document — collect children
|
||||
let content = map.get("content")?.as_array()?;
|
||||
let mut parts = Vec::new();
|
||||
for child in content {
|
||||
if let Some(chunk) = child
|
||||
.as_object()
|
||||
.and_then(|m| m.get("nodeType").and_then(|v| v.as_str()))
|
||||
.and_then(|nt| extract_contentful_node(child.as_object().unwrap(), nt))
|
||||
{
|
||||
if let Some(h) = &chunk.heading {
|
||||
parts.push(format!("## {h}"));
|
||||
}
|
||||
if !chunk.body.is_empty() {
|
||||
parts.push(chunk.body);
|
||||
}
|
||||
}
|
||||
}
|
||||
if parts.is_empty() {
|
||||
return None;
|
||||
}
|
||||
Some(TextChunk {
|
||||
heading: None,
|
||||
body: parts.join("\n\n"),
|
||||
})
|
||||
}
|
||||
"paragraph" | "text" => {
|
||||
let text = collect_text_content(map);
|
||||
if is_content_text(&text) {
|
||||
Some(TextChunk {
|
||||
heading: None,
|
||||
body: text,
|
||||
})
|
||||
} else {
|
||||
None
|
||||
}
|
||||
}
|
||||
nt if nt.starts_with("heading-") => {
|
||||
let text = collect_text_content(map);
|
||||
if text.is_empty() {
|
||||
None
|
||||
} else {
|
||||
Some(TextChunk {
|
||||
heading: Some(text),
|
||||
body: String::new(),
|
||||
})
|
||||
}
|
||||
}
|
||||
"blockquote" => {
|
||||
let text = collect_text_content(map);
|
||||
if is_content_text(&text) {
|
||||
Some(TextChunk {
|
||||
heading: None,
|
||||
body: format!("> {text}"),
|
||||
})
|
||||
} else {
|
||||
None
|
||||
}
|
||||
}
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
|
||||
/// Recursively collect plain text from a Contentful rich text node tree.
|
||||
fn collect_text_content(map: &serde_json::Map<String, serde_json::Value>) -> String {
|
||||
let mut text = String::new();
|
||||
|
||||
if let Some(v) = map.get("value").and_then(|v| v.as_str()) {
|
||||
text.push_str(v);
|
||||
}
|
||||
|
||||
if let Some(content) = map.get("content").and_then(|v| v.as_array()) {
|
||||
for child in content {
|
||||
if let Some(child_map) = child.as_object() {
|
||||
let child_text = collect_text_content(child_map);
|
||||
text.push_str(&child_text);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
text.trim().to_string()
|
||||
}
|
||||
|
||||
/// Check if a JSON object looks like a CMS entry with heading + description.
|
||||
fn is_cms_entry(map: &serde_json::Map<String, serde_json::Value>) -> bool {
|
||||
let has_heading =
|
||||
map.contains_key("heading") || map.contains_key("title") || map.contains_key("headline");
|
||||
let has_body = map.contains_key("description")
|
||||
|| map.contains_key("subheading")
|
||||
|| map.contains_key("body")
|
||||
|| map.contains_key("text");
|
||||
has_heading && has_body
|
||||
}
|
||||
|
||||
/// Extract heading + body from a CMS-style entry.
|
||||
fn extract_cms_entry(map: &serde_json::Map<String, serde_json::Value>) -> Option<TextChunk> {
|
||||
let heading = extract_text_field(map, "heading")
|
||||
.or_else(|| extract_text_field(map, "title"))
|
||||
.or_else(|| extract_text_field(map, "headline"))
|
||||
.filter(|h| !is_cms_internal_title(h) && h.len() > 5)?;
|
||||
|
||||
let body = extract_text_field(map, "description")
|
||||
.or_else(|| extract_text_field(map, "subheading"))
|
||||
.or_else(|| extract_text_field(map, "body"))
|
||||
.or_else(|| extract_text_field(map, "text"))
|
||||
.unwrap_or_default();
|
||||
|
||||
if !is_content_text(&heading) && !is_content_text(&body) {
|
||||
return None;
|
||||
}
|
||||
|
||||
Some(TextChunk {
|
||||
heading: Some(heading),
|
||||
body,
|
||||
})
|
||||
}
|
||||
|
||||
/// Extract a quote/testimonial from a JSON object.
|
||||
fn extract_quote(map: &serde_json::Map<String, serde_json::Value>) -> Option<TextChunk> {
|
||||
let quote =
|
||||
extract_text_field(map, "quote").or_else(|| extract_text_field(map, "quoteText"))?;
|
||||
if !is_content_text("e) {
|
||||
return None;
|
||||
}
|
||||
|
||||
let attribution = extract_text_field(map, "position")
|
||||
.or_else(|| extract_text_field(map, "author"))
|
||||
.or_else(|| extract_text_field(map, "name"))
|
||||
.unwrap_or_default();
|
||||
|
||||
let body = if attribution.is_empty() {
|
||||
format!("> {quote}")
|
||||
} else {
|
||||
format!("> {quote}\n> — {attribution}")
|
||||
};
|
||||
|
||||
Some(TextChunk {
|
||||
heading: None,
|
||||
body,
|
||||
})
|
||||
}
|
||||
|
||||
/// Extract standalone content strings from known field names that weren't
|
||||
/// caught by the CMS entry or quote patterns. These are body/description/
|
||||
/// subheading/eyebrow fields on objects that lack a paired heading, or
|
||||
/// headline fields on objects that lack a body.
|
||||
fn extract_orphan_texts(
|
||||
map: &serde_json::Map<String, serde_json::Value>,
|
||||
chunks: &mut Vec<TextChunk>,
|
||||
) {
|
||||
const BODY_KEYS: &[&str] = &["body", "description", "subheading", "eyebrow", "children"];
|
||||
const HEADING_KEYS: &[&str] = &["heading", "title", "headline"];
|
||||
|
||||
// Don't extract if this object was already handled as a CMS entry
|
||||
if is_cms_entry(map) {
|
||||
return;
|
||||
}
|
||||
|
||||
// Try extracting a standalone heading (without body)
|
||||
for key in HEADING_KEYS {
|
||||
if let Some(text) = extract_text_field(map, key)
|
||||
&& is_content_text(&text)
|
||||
{
|
||||
chunks.push(TextChunk {
|
||||
heading: Some(text),
|
||||
body: String::new(),
|
||||
});
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
// Try extracting a standalone body field
|
||||
for key in BODY_KEYS {
|
||||
if let Some(text) = extract_text_field(map, key)
|
||||
&& is_content_text(&text)
|
||||
{
|
||||
chunks.push(TextChunk {
|
||||
heading: None,
|
||||
body: text,
|
||||
});
|
||||
return;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Extract a text value from a JSON field, handling both plain strings and
|
||||
/// Contentful rich text objects.
|
||||
fn extract_text_field(
|
||||
map: &serde_json::Map<String, serde_json::Value>,
|
||||
key: &str,
|
||||
) -> Option<String> {
|
||||
let value = map.get(key)?;
|
||||
|
||||
// Plain string
|
||||
if let Some(s) = value.as_str() {
|
||||
let s = s.trim().to_string();
|
||||
return if s.is_empty() { None } else { Some(s) };
|
||||
}
|
||||
|
||||
// Contentful rich text object: { "content": [{ "content": [{ "value": "..." }] }] }
|
||||
if let Some(obj) = value.as_object() {
|
||||
let text = collect_text_content(obj);
|
||||
return if text.is_empty() { None } else { Some(text) };
|
||||
}
|
||||
|
||||
None
|
||||
}
|
||||
|
||||
/// JSON keys that hold image/media/asset data — skip recursing into these
|
||||
/// to avoid extracting CMS alt text as content.
|
||||
fn is_media_key(key: &str) -> bool {
|
||||
let k = key.to_lowercase();
|
||||
k == "alt"
|
||||
|| k.contains("image")
|
||||
|| k.contains("poster")
|
||||
|| k.contains("video")
|
||||
|| k.contains("thumbnail")
|
||||
|| k.contains("icon")
|
||||
|| k.contains("logo")
|
||||
|| k == "src"
|
||||
|| k == "url"
|
||||
|| k == "href"
|
||||
}
|
||||
|
||||
/// CMS internal titles like "/home Customer Stories: Logo" or
|
||||
/// "Copilot agent mode hero poster desktop" are editorial labels, not user-facing text.
|
||||
fn is_cms_internal_title(s: &str) -> bool {
|
||||
// Contentful path-style titles
|
||||
if s.starts_with("/home ") || s.starts_with("/page ") {
|
||||
return true;
|
||||
}
|
||||
// Titles that look like asset/component labels (short words, no sentence structure)
|
||||
let words: Vec<&str> = s.split_whitespace().collect();
|
||||
if words.len() >= 3 {
|
||||
let has_label_keyword = words
|
||||
.iter()
|
||||
.any(|w| ["poster", "logo", "image", "icon", "asset", "thumbnail"].contains(w));
|
||||
if has_label_keyword {
|
||||
return true;
|
||||
}
|
||||
}
|
||||
false
|
||||
}
|
||||
|
||||
/// Heuristic: is this string actual content (not an ID, URL, class name, etc.)?
|
||||
fn is_content_text(s: &str) -> bool {
|
||||
let s = s.trim();
|
||||
if s.len() < 15 {
|
||||
return false;
|
||||
}
|
||||
// Skip URLs, IDs, technical strings
|
||||
if s.starts_with("http") || s.starts_with('/') || s.starts_with('{') || s.starts_with('[') {
|
||||
return false;
|
||||
}
|
||||
// Must contain spaces (prose), not just a single technical token
|
||||
if !s.contains(' ') {
|
||||
return false;
|
||||
}
|
||||
// Skip strings that are mostly hex/base64 (hashes, IDs)
|
||||
let alnum_ratio = s.chars().filter(|c| c.is_alphanumeric()).count() as f64 / s.len() as f64;
|
||||
if alnum_ratio < 0.6 {
|
||||
return false;
|
||||
}
|
||||
true
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn extracts_contentful_rich_text() {
|
||||
let html = r#"<html><body>
|
||||
<script type="application/json" data-target="react-app.embeddedData">
|
||||
{"payload":{"contentfulRawJsonResponse":{"includes":{"Entry":[
|
||||
{"fields":{
|
||||
"heading":"Ship faster with secure CI/CD",
|
||||
"subheading":{"content":[{"content":[{"value":"Automate builds, tests, and deployments."}]}]}
|
||||
}},
|
||||
{"fields":{
|
||||
"heading":"Built-in application security",
|
||||
"description":{"content":[{"content":[{"value":"Use AI to find and fix vulnerabilities so your team can ship more secure software faster."}]}]}
|
||||
}}
|
||||
]}}}}
|
||||
</script>
|
||||
</body></html>"#;
|
||||
|
||||
let doc = Html::parse_document(html);
|
||||
let result = try_extract(&doc, 0, "").unwrap();
|
||||
|
||||
assert!(result.contains("Ship faster with secure CI/CD"));
|
||||
assert!(result.contains("Automate builds, tests, and deployments"));
|
||||
assert!(result.contains("Built-in application security"));
|
||||
assert!(result.contains("find and fix vulnerabilities"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn skips_when_dom_has_enough_content() {
|
||||
let html = r#"<html><body>
|
||||
<script type="application/json">{"heading":"Foo","description":"Some long description here."}</script>
|
||||
</body></html>"#;
|
||||
|
||||
let doc = Html::parse_document(html);
|
||||
assert!(try_extract(&doc, 500, "").is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn skips_non_content_strings() {
|
||||
assert!(!is_content_text("abc123"));
|
||||
assert!(!is_content_text("https://example.com/foo/bar"));
|
||||
assert!(!is_content_text("/home Customer Stories: Logo"));
|
||||
assert!(!is_content_text("a1b2c3d4e5f6a1b2c3d4e5f6"));
|
||||
assert!(is_content_text(
|
||||
"Automate builds, tests, and deployments with CI/CD."
|
||||
));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extracts_quotes() {
|
||||
let html = r#"<html><body>
|
||||
<script type="application/json">
|
||||
{"fields":{"quote":{"content":[{"content":[{"value":"GitHub frees us from maintaining our own infrastructure."}]}]},"position":"CTO at Example Corp"}}
|
||||
</script>
|
||||
</body></html>"#;
|
||||
|
||||
let doc = Html::parse_document(html);
|
||||
let result = try_extract(&doc, 0, "").unwrap();
|
||||
assert!(result.contains("> GitHub frees us from maintaining our own infrastructure."));
|
||||
assert!(result.contains("CTO at Example Corp"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn skips_content_already_in_dom() {
|
||||
let html = r#"<html><body>
|
||||
<script type="application/json">
|
||||
{"fields":{"heading":"Already in DOM heading","description":"This text already appears in the DOM markdown output."}}
|
||||
</script>
|
||||
</body></html>"#;
|
||||
|
||||
let doc = Html::parse_document(html);
|
||||
let existing =
|
||||
"# Already in DOM heading\n\nThis text already appears in the DOM markdown output.";
|
||||
assert!(try_extract(&doc, 10, existing).is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deduplicates_chunks() {
|
||||
let html = r#"<html><body>
|
||||
<script type="application/json">
|
||||
{"a":{"heading":"Same heading here","description":"Same body content across multiple entries."},
|
||||
"b":{"heading":"Same heading here","description":"Same body content across multiple entries."}}
|
||||
</script>
|
||||
</body></html>"#;
|
||||
|
||||
let doc = Html::parse_document(html);
|
||||
let result = try_extract(&doc, 0, "").unwrap();
|
||||
// Should appear only once
|
||||
assert_eq!(
|
||||
result
|
||||
.matches("Same body content across multiple entries")
|
||||
.count(),
|
||||
1
|
||||
);
|
||||
}
|
||||
}
|
||||
340
crates/webclaw-core/src/diff.rs
Normal file
340
crates/webclaw-core/src/diff.rs
Normal file
|
|
@ -0,0 +1,340 @@
|
|||
/// Change tracking between two extraction snapshots.
|
||||
/// Pure computation -- no I/O, WASM-safe.
|
||||
use std::collections::HashSet;
|
||||
|
||||
use serde::Serialize;
|
||||
use similar::TextDiff;
|
||||
|
||||
use crate::types::{ExtractionResult, Link};
|
||||
|
||||
#[derive(Debug, Clone, Serialize, PartialEq)]
|
||||
pub enum ChangeStatus {
|
||||
Same,
|
||||
Changed,
|
||||
New,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize)]
|
||||
pub struct MetadataChange {
|
||||
pub field: String,
|
||||
pub old: Option<String>,
|
||||
pub new: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize)]
|
||||
pub struct ContentDiff {
|
||||
pub status: ChangeStatus,
|
||||
pub text_diff: Option<String>,
|
||||
pub metadata_changes: Vec<MetadataChange>,
|
||||
pub links_added: Vec<Link>,
|
||||
pub links_removed: Vec<Link>,
|
||||
pub word_count_delta: i64,
|
||||
}
|
||||
|
||||
/// Compare two extraction results and produce a diff.
|
||||
/// `old` is the previous snapshot, `new_result` is the current extraction.
|
||||
pub fn diff(old: &ExtractionResult, new_result: &ExtractionResult) -> ContentDiff {
|
||||
let text_diff = compute_text_diff(&old.content.markdown, &new_result.content.markdown);
|
||||
let metadata_changes = compute_metadata_changes(&old.metadata, &new_result.metadata);
|
||||
let (links_added, links_removed) =
|
||||
compute_link_changes(&old.content.links, &new_result.content.links);
|
||||
let word_count_delta = new_result.metadata.word_count as i64 - old.metadata.word_count as i64;
|
||||
|
||||
let status = if text_diff.is_none() && metadata_changes.is_empty() {
|
||||
ChangeStatus::Same
|
||||
} else {
|
||||
ChangeStatus::Changed
|
||||
};
|
||||
|
||||
ContentDiff {
|
||||
status,
|
||||
text_diff,
|
||||
metadata_changes,
|
||||
links_added,
|
||||
links_removed,
|
||||
word_count_delta,
|
||||
}
|
||||
}
|
||||
|
||||
fn compute_text_diff(old: &str, new: &str) -> Option<String> {
|
||||
if old == new {
|
||||
return None;
|
||||
}
|
||||
|
||||
let diff = TextDiff::from_lines(old, new);
|
||||
let unified = diff
|
||||
.unified_diff()
|
||||
.context_radius(3)
|
||||
.header("old", "new")
|
||||
.to_string();
|
||||
|
||||
if unified.is_empty() {
|
||||
None
|
||||
} else {
|
||||
Some(unified)
|
||||
}
|
||||
}
|
||||
|
||||
/// Compare each metadata field, returning only those that changed.
|
||||
fn compute_metadata_changes(
|
||||
old: &crate::types::Metadata,
|
||||
new: &crate::types::Metadata,
|
||||
) -> Vec<MetadataChange> {
|
||||
let mut changes = Vec::new();
|
||||
|
||||
let fields: Vec<(&str, &Option<String>, &Option<String>)> = vec![
|
||||
("title", &old.title, &new.title),
|
||||
("description", &old.description, &new.description),
|
||||
("author", &old.author, &new.author),
|
||||
("published_date", &old.published_date, &new.published_date),
|
||||
("language", &old.language, &new.language),
|
||||
("url", &old.url, &new.url),
|
||||
("site_name", &old.site_name, &new.site_name),
|
||||
("image", &old.image, &new.image),
|
||||
("favicon", &old.favicon, &new.favicon),
|
||||
];
|
||||
|
||||
for (name, old_val, new_val) in fields {
|
||||
if old_val != new_val {
|
||||
changes.push(MetadataChange {
|
||||
field: name.to_string(),
|
||||
old: old_val.clone(),
|
||||
new: new_val.clone(),
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
changes
|
||||
}
|
||||
|
||||
/// Links added/removed, compared by href (ignoring text differences).
|
||||
fn compute_link_changes(old: &[Link], new: &[Link]) -> (Vec<Link>, Vec<Link>) {
|
||||
let old_hrefs: HashSet<&str> = old.iter().map(|l| l.href.as_str()).collect();
|
||||
let new_hrefs: HashSet<&str> = new.iter().map(|l| l.href.as_str()).collect();
|
||||
|
||||
let added: Vec<Link> = new
|
||||
.iter()
|
||||
.filter(|l| !old_hrefs.contains(l.href.as_str()))
|
||||
.cloned()
|
||||
.collect();
|
||||
|
||||
let removed: Vec<Link> = old
|
||||
.iter()
|
||||
.filter(|l| !new_hrefs.contains(l.href.as_str()))
|
||||
.cloned()
|
||||
.collect();
|
||||
|
||||
(added, removed)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::domain::DomainType;
|
||||
use crate::types::{Content, DomainData, Metadata};
|
||||
|
||||
/// Build a minimal ExtractionResult for test comparisons.
|
||||
fn make_result(markdown: &str, title: Option<&str>, links: Vec<Link>) -> ExtractionResult {
|
||||
let word_count = markdown.split_whitespace().count();
|
||||
ExtractionResult {
|
||||
metadata: Metadata {
|
||||
title: title.map(String::from),
|
||||
description: None,
|
||||
author: None,
|
||||
published_date: None,
|
||||
language: None,
|
||||
url: None,
|
||||
site_name: None,
|
||||
image: None,
|
||||
favicon: None,
|
||||
word_count,
|
||||
},
|
||||
content: Content {
|
||||
markdown: markdown.to_string(),
|
||||
plain_text: markdown.to_string(),
|
||||
links,
|
||||
images: vec![],
|
||||
code_blocks: vec![],
|
||||
raw_html: None,
|
||||
},
|
||||
domain_data: Some(DomainData {
|
||||
domain_type: DomainType::Generic,
|
||||
}),
|
||||
structured_data: vec![],
|
||||
}
|
||||
}
|
||||
|
||||
fn link(href: &str, text: &str) -> Link {
|
||||
Link {
|
||||
href: href.to_string(),
|
||||
text: text.to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_identical_content() {
|
||||
let a = make_result("# Hello\n\nSome content here.", Some("Hello"), vec![]);
|
||||
let b = make_result("# Hello\n\nSome content here.", Some("Hello"), vec![]);
|
||||
|
||||
let result = diff(&a, &b);
|
||||
|
||||
assert_eq!(result.status, ChangeStatus::Same);
|
||||
assert!(result.text_diff.is_none());
|
||||
assert!(result.metadata_changes.is_empty());
|
||||
assert!(result.links_added.is_empty());
|
||||
assert!(result.links_removed.is_empty());
|
||||
assert_eq!(result.word_count_delta, 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_title_change() {
|
||||
let a = make_result("# Hello\n\nContent.", Some("Old Title"), vec![]);
|
||||
let b = make_result("# Hello\n\nContent.", Some("New Title"), vec![]);
|
||||
|
||||
let result = diff(&a, &b);
|
||||
|
||||
assert_eq!(result.status, ChangeStatus::Changed);
|
||||
assert!(result.text_diff.is_none(), "text is identical");
|
||||
assert_eq!(result.metadata_changes.len(), 1);
|
||||
assert_eq!(result.metadata_changes[0].field, "title");
|
||||
assert_eq!(result.metadata_changes[0].old.as_deref(), Some("Old Title"));
|
||||
assert_eq!(result.metadata_changes[0].new.as_deref(), Some("New Title"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_content_change() {
|
||||
let a = make_result("# Hello\n\nOld paragraph.", Some("Title"), vec![]);
|
||||
let b = make_result("# Hello\n\nNew paragraph.", Some("Title"), vec![]);
|
||||
|
||||
let result = diff(&a, &b);
|
||||
|
||||
assert_eq!(result.status, ChangeStatus::Changed);
|
||||
assert!(result.text_diff.is_some());
|
||||
let diff_text = result.text_diff.unwrap();
|
||||
assert!(diff_text.contains('-'), "should have removal markers");
|
||||
assert!(diff_text.contains('+'), "should have addition markers");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_link_added() {
|
||||
let a = make_result("Content.", None, vec![]);
|
||||
let b = make_result(
|
||||
"Content.",
|
||||
None,
|
||||
vec![link("https://example.com", "Example")],
|
||||
);
|
||||
|
||||
let result = diff(&a, &b);
|
||||
|
||||
assert_eq!(result.links_added.len(), 1);
|
||||
assert_eq!(result.links_added[0].href, "https://example.com");
|
||||
assert!(result.links_removed.is_empty());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_link_removed() {
|
||||
let a = make_result(
|
||||
"Content.",
|
||||
None,
|
||||
vec![link("https://example.com", "Example")],
|
||||
);
|
||||
let b = make_result("Content.", None, vec![]);
|
||||
|
||||
let result = diff(&a, &b);
|
||||
|
||||
assert!(result.links_added.is_empty());
|
||||
assert_eq!(result.links_removed.len(), 1);
|
||||
assert_eq!(result.links_removed[0].href, "https://example.com");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_links_added_and_removed() {
|
||||
let a = make_result(
|
||||
"Content.",
|
||||
None,
|
||||
vec![
|
||||
link("https://old.com", "Old"),
|
||||
link("https://stable.com", "Stable"),
|
||||
],
|
||||
);
|
||||
let b = make_result(
|
||||
"Content.",
|
||||
None,
|
||||
vec![
|
||||
link("https://stable.com", "Stable"),
|
||||
link("https://new.com", "New"),
|
||||
],
|
||||
);
|
||||
|
||||
let result = diff(&a, &b);
|
||||
|
||||
assert_eq!(result.links_added.len(), 1);
|
||||
assert_eq!(result.links_added[0].href, "https://new.com");
|
||||
assert_eq!(result.links_removed.len(), 1);
|
||||
assert_eq!(result.links_removed[0].href, "https://old.com");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_word_count_delta() {
|
||||
let a = make_result("one two three", None, vec![]);
|
||||
let b = make_result("one two three four five", None, vec![]);
|
||||
|
||||
let result = diff(&a, &b);
|
||||
|
||||
assert_eq!(result.word_count_delta, 2);
|
||||
|
||||
// Negative delta
|
||||
let result_rev = diff(&b, &a);
|
||||
assert_eq!(result_rev.word_count_delta, -2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_unified_diff_format() {
|
||||
let a = make_result("line one\nline two\nline three\n", None, vec![]);
|
||||
let b = make_result("line one\nline changed\nline three\n", None, vec![]);
|
||||
|
||||
let result = diff(&a, &b);
|
||||
|
||||
assert!(result.text_diff.is_some());
|
||||
let diff_text = result.text_diff.unwrap();
|
||||
assert!(diff_text.contains("--- old"), "should have old header");
|
||||
assert!(diff_text.contains("+++ new"), "should have new header");
|
||||
assert!(diff_text.contains("-line two"), "should show removed line");
|
||||
assert!(
|
||||
diff_text.contains("+line changed"),
|
||||
"should show added line"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_empty_content() {
|
||||
let a = make_result("", None, vec![]);
|
||||
let b = make_result("", None, vec![]);
|
||||
|
||||
let result = diff(&a, &b);
|
||||
|
||||
assert_eq!(result.status, ChangeStatus::Same);
|
||||
assert!(result.text_diff.is_none());
|
||||
assert_eq!(result.word_count_delta, 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_link_text_change_ignored() {
|
||||
// Same href, different text -- should not appear in added/removed
|
||||
let a = make_result(
|
||||
"Content.",
|
||||
None,
|
||||
vec![link("https://example.com", "Old Text")],
|
||||
);
|
||||
let b = make_result(
|
||||
"Content.",
|
||||
None,
|
||||
vec![link("https://example.com", "New Text")],
|
||||
);
|
||||
|
||||
let result = diff(&a, &b);
|
||||
|
||||
assert!(result.links_added.is_empty());
|
||||
assert!(result.links_removed.is_empty());
|
||||
}
|
||||
}
|
||||
187
crates/webclaw-core/src/domain.rs
Normal file
187
crates/webclaw-core/src/domain.rs
Normal file
|
|
@ -0,0 +1,187 @@
|
|||
/// Domain detection via URL patterns and DOM heuristics.
|
||||
/// Knowing the domain type lets downstream consumers apply
|
||||
/// domain-specific prompts or post-processing.
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
|
||||
#[serde(rename_all = "snake_case")]
|
||||
pub enum DomainType {
|
||||
Article,
|
||||
Documentation,
|
||||
GitHub,
|
||||
Forum,
|
||||
ECommerce,
|
||||
Social,
|
||||
Generic,
|
||||
}
|
||||
|
||||
/// Detect domain type from URL patterns first, then fall back to DOM heuristics.
|
||||
pub fn detect(url: Option<&str>, html: &str) -> DomainType {
|
||||
if let Some(url) = url
|
||||
&& let Some(dt) = detect_from_url(url)
|
||||
{
|
||||
return dt;
|
||||
}
|
||||
detect_from_dom(html)
|
||||
}
|
||||
|
||||
fn detect_from_url(url: &str) -> Option<DomainType> {
|
||||
let lower = url.to_lowercase();
|
||||
|
||||
// GitHub
|
||||
if lower.contains("github.com") || lower.contains("gitlab.com") {
|
||||
return Some(DomainType::GitHub);
|
||||
}
|
||||
|
||||
// Documentation sites
|
||||
let doc_patterns = [
|
||||
"docs.",
|
||||
"readthedocs",
|
||||
"gitbook",
|
||||
"docusaurus",
|
||||
"/docs/",
|
||||
"/documentation/",
|
||||
"devdocs.io",
|
||||
"doc.rust-lang.org",
|
||||
"developer.mozilla.org",
|
||||
"developer.apple.com/documentation",
|
||||
];
|
||||
if doc_patterns.iter().any(|p| lower.contains(p)) {
|
||||
return Some(DomainType::Documentation);
|
||||
}
|
||||
|
||||
// Forums
|
||||
let forum_patterns = [
|
||||
"reddit.com",
|
||||
"news.ycombinator.com",
|
||||
"stackoverflow.com",
|
||||
"stackexchange.com",
|
||||
"discourse",
|
||||
"forum",
|
||||
"community.",
|
||||
];
|
||||
if forum_patterns.iter().any(|p| lower.contains(p)) {
|
||||
return Some(DomainType::Forum);
|
||||
}
|
||||
|
||||
// Social
|
||||
let social_patterns = [
|
||||
"twitter.com",
|
||||
"x.com",
|
||||
"linkedin.com",
|
||||
"facebook.com",
|
||||
"instagram.com",
|
||||
"mastodon",
|
||||
"bsky.app",
|
||||
];
|
||||
if social_patterns.iter().any(|p| lower.contains(p)) {
|
||||
return Some(DomainType::Social);
|
||||
}
|
||||
|
||||
// E-commerce
|
||||
let ecommerce_patterns = [
|
||||
"amazon.",
|
||||
"ebay.",
|
||||
"shopify.",
|
||||
"etsy.com",
|
||||
"/product/",
|
||||
"/shop/",
|
||||
"/cart",
|
||||
];
|
||||
if ecommerce_patterns.iter().any(|p| lower.contains(p)) {
|
||||
return Some(DomainType::ECommerce);
|
||||
}
|
||||
|
||||
None
|
||||
}
|
||||
|
||||
/// Fallback: check HTML for structural hints when URL isn't enough.
|
||||
fn detect_from_dom(html: &str) -> DomainType {
|
||||
let lower = html.to_lowercase();
|
||||
|
||||
// Article signals: <article> tag, schema.org Article type
|
||||
if lower.contains("<article") || lower.contains("schema.org/article") {
|
||||
return DomainType::Article;
|
||||
}
|
||||
|
||||
// Documentation signals
|
||||
if lower.contains("docsearch") || lower.contains("sidebar-nav") || lower.contains("doc-content")
|
||||
{
|
||||
return DomainType::Documentation;
|
||||
}
|
||||
|
||||
DomainType::Generic
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn github_urls() {
|
||||
assert_eq!(
|
||||
detect(Some("https://github.com/tokio-rs/tokio"), ""),
|
||||
DomainType::GitHub
|
||||
);
|
||||
assert_eq!(
|
||||
detect(Some("https://gitlab.com/foo/bar"), ""),
|
||||
DomainType::GitHub
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn documentation_urls() {
|
||||
assert_eq!(
|
||||
detect(Some("https://docs.rs/serde/latest"), ""),
|
||||
DomainType::Documentation
|
||||
);
|
||||
assert_eq!(
|
||||
detect(Some("https://readthedocs.org/projects/foo"), ""),
|
||||
DomainType::Documentation
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn forum_urls() {
|
||||
assert_eq!(
|
||||
detect(Some("https://www.reddit.com/r/rust"), ""),
|
||||
DomainType::Forum
|
||||
);
|
||||
assert_eq!(
|
||||
detect(Some("https://stackoverflow.com/questions/123"), ""),
|
||||
DomainType::Forum
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn social_urls() {
|
||||
assert_eq!(
|
||||
detect(Some("https://x.com/elonmusk"), ""),
|
||||
DomainType::Social
|
||||
);
|
||||
assert_eq!(
|
||||
detect(Some("https://linkedin.com/in/someone"), ""),
|
||||
DomainType::Social
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ecommerce_urls() {
|
||||
assert_eq!(
|
||||
detect(Some("https://amazon.com/dp/B001"), ""),
|
||||
DomainType::ECommerce
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn dom_fallback_article() {
|
||||
let html = r#"<html><body><article><p>Hello world</p></article></body></html>"#;
|
||||
assert_eq!(detect(None, html), DomainType::Article);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn dom_fallback_generic() {
|
||||
let html = r#"<html><body><div>Just some div</div></body></html>"#;
|
||||
assert_eq!(detect(None, html), DomainType::Generic);
|
||||
}
|
||||
}
|
||||
15
crates/webclaw-core/src/error.rs
Normal file
15
crates/webclaw-core/src/error.rs
Normal file
|
|
@ -0,0 +1,15 @@
|
|||
/// Extraction errors — kept minimal since this crate does no I/O.
|
||||
/// Most failures come from malformed HTML or invalid URLs.
|
||||
use thiserror::Error;
|
||||
|
||||
#[derive(Debug, Error)]
|
||||
pub enum ExtractError {
|
||||
#[error("failed to parse HTML")]
|
||||
ParseError,
|
||||
|
||||
#[error("invalid URL: {0}")]
|
||||
InvalidUrl(String),
|
||||
|
||||
#[error("no content found")]
|
||||
NoContent,
|
||||
}
|
||||
1486
crates/webclaw-core/src/extractor.rs
Normal file
1486
crates/webclaw-core/src/extractor.rs
Normal file
File diff suppressed because it is too large
Load diff
513
crates/webclaw-core/src/lib.rs
Normal file
513
crates/webclaw-core/src/lib.rs
Normal file
|
|
@ -0,0 +1,513 @@
|
|||
pub mod brand;
|
||||
pub(crate) mod data_island;
|
||||
/// webclaw-core: Pure HTML content extraction engine for LLMs.
|
||||
///
|
||||
/// Takes raw HTML + optional URL, returns structured content
|
||||
/// (metadata, markdown, plain text, links, images, code blocks).
|
||||
/// Zero network dependencies — WASM-compatible by design.
|
||||
pub mod diff;
|
||||
pub mod domain;
|
||||
pub mod error;
|
||||
pub mod extractor;
|
||||
pub mod llm;
|
||||
pub mod markdown;
|
||||
pub mod metadata;
|
||||
#[allow(dead_code)]
|
||||
pub(crate) mod noise;
|
||||
pub mod structured_data;
|
||||
pub mod types;
|
||||
pub mod youtube;
|
||||
|
||||
pub use brand::BrandIdentity;
|
||||
pub use diff::{ChangeStatus, ContentDiff, MetadataChange};
|
||||
pub use domain::DomainType;
|
||||
pub use error::ExtractError;
|
||||
pub use llm::to_llm_text;
|
||||
pub use types::{
|
||||
CodeBlock, Content, DomainData, ExtractionOptions, ExtractionResult, Image, Link, Metadata,
|
||||
};
|
||||
|
||||
use scraper::Html;
|
||||
use url::Url;
|
||||
|
||||
/// Extract structured content from raw HTML.
|
||||
///
|
||||
/// `html` — raw HTML string to parse
|
||||
/// `url` — optional source URL, used for resolving relative links and domain detection
|
||||
pub fn extract(html: &str, url: Option<&str>) -> Result<ExtractionResult, ExtractError> {
|
||||
extract_with_options(html, url, &ExtractionOptions::default())
|
||||
}
|
||||
|
||||
/// Extract structured content from raw HTML with configurable options.
|
||||
///
|
||||
/// `html` — raw HTML string to parse
|
||||
/// `url` — optional source URL, used for resolving relative links and domain detection
|
||||
/// `options` — controls include/exclude selectors, main content mode, and raw HTML output
|
||||
pub fn extract_with_options(
|
||||
html: &str,
|
||||
url: Option<&str>,
|
||||
options: &ExtractionOptions,
|
||||
) -> Result<ExtractionResult, ExtractError> {
|
||||
if html.is_empty() {
|
||||
return Err(ExtractError::NoContent);
|
||||
}
|
||||
|
||||
// YouTube fast path: if the URL is a YouTube video page, try extracting
|
||||
// structured metadata from ytInitialPlayerResponse before DOM scoring.
|
||||
// This gives LLMs a clean, structured view of video metadata.
|
||||
if let Some(u) = url
|
||||
&& youtube::is_youtube_url(u)
|
||||
&& let Some(yt_md) = youtube::try_extract(html)
|
||||
{
|
||||
let doc = Html::parse_document(html);
|
||||
let mut meta = metadata::extract(&doc, url);
|
||||
meta.word_count = extractor::word_count(&yt_md);
|
||||
|
||||
let plain_text = yt_md
|
||||
.lines()
|
||||
.filter(|l| !l.starts_with('#') && !l.starts_with("**"))
|
||||
.collect::<Vec<_>>()
|
||||
.join("\n")
|
||||
.trim()
|
||||
.to_string();
|
||||
|
||||
let domain_data = Some(DomainData {
|
||||
domain_type: DomainType::Social,
|
||||
});
|
||||
|
||||
let structured_data = structured_data::extract_json_ld(html);
|
||||
|
||||
return Ok(ExtractionResult {
|
||||
metadata: meta,
|
||||
content: Content {
|
||||
markdown: yt_md,
|
||||
plain_text,
|
||||
links: Vec::new(),
|
||||
images: Vec::new(),
|
||||
code_blocks: Vec::new(),
|
||||
raw_html: None,
|
||||
},
|
||||
domain_data,
|
||||
structured_data,
|
||||
});
|
||||
}
|
||||
|
||||
let doc = Html::parse_document(html);
|
||||
|
||||
let base_url = url
|
||||
.map(|u| Url::parse(u).map_err(|_| ExtractError::InvalidUrl(u.to_string())))
|
||||
.transpose()?;
|
||||
|
||||
// Metadata from <head>
|
||||
let mut meta = metadata::extract(&doc, url);
|
||||
|
||||
// Main content extraction (Readability-style scoring + markdown conversion)
|
||||
let mut content = extractor::extract_content(&doc, base_url.as_ref(), options);
|
||||
// Use the higher of plain_text and markdown word counts.
|
||||
// Some pages (headings + links) have content in markdown but empty plain_text.
|
||||
let pt_wc = extractor::word_count(&content.plain_text);
|
||||
let md_wc = extractor::word_count(&content.markdown);
|
||||
meta.word_count = pt_wc.max(md_wc);
|
||||
|
||||
// Retry fallback: if extraction captured too little of the page's visible content,
|
||||
// retry with wider strategies. The scorer sometimes picks a tiny node (e.g., an
|
||||
// <article> with 52 words when the body has 1300 words of real content).
|
||||
//
|
||||
// Strategy 1: retry without only_main_content restriction
|
||||
if options.only_main_content && meta.word_count < 30 {
|
||||
let relaxed = ExtractionOptions {
|
||||
only_main_content: false,
|
||||
..options.clone()
|
||||
};
|
||||
let retry = extractor::extract_content(&doc, base_url.as_ref(), &relaxed);
|
||||
let retry_wc =
|
||||
extractor::word_count(&retry.plain_text).max(extractor::word_count(&retry.markdown));
|
||||
if retry_wc > meta.word_count {
|
||||
content = retry;
|
||||
meta.word_count = retry_wc;
|
||||
}
|
||||
}
|
||||
|
||||
// Strategy 2: if scored extraction is sparse (<200 words) AND the page has
|
||||
// significantly more visible text, retry with include_selectors: ["body"].
|
||||
// This bypasses the readability scorer entirely — catches blogs, pricing
|
||||
// pages, and modern sites where no single element scores well.
|
||||
if meta.word_count < 200 && options.include_selectors.is_empty() {
|
||||
let body_opts = ExtractionOptions {
|
||||
include_selectors: vec!["body".to_string()],
|
||||
exclude_selectors: options.exclude_selectors.clone(),
|
||||
only_main_content: false,
|
||||
include_raw_html: false,
|
||||
};
|
||||
let body_content = extractor::extract_content(&doc, base_url.as_ref(), &body_opts);
|
||||
let body_wc = extractor::word_count(&body_content.plain_text)
|
||||
.max(extractor::word_count(&body_content.markdown));
|
||||
// Use body extraction if it captures significantly more content (>2x)
|
||||
if body_wc > meta.word_count * 2 && body_wc > 50 {
|
||||
content = body_content;
|
||||
meta.word_count = body_wc;
|
||||
}
|
||||
}
|
||||
|
||||
// Fallback: if DOM extraction was sparse, try JSON data islands
|
||||
// (React SPAs, Next.js, Contentful CMS embed page data in <script> tags)
|
||||
if let Some(island_md) = data_island::try_extract(&doc, meta.word_count, &content.markdown) {
|
||||
content.markdown.push_str("\n\n");
|
||||
content.markdown.push_str(&island_md);
|
||||
meta.word_count = extractor::word_count(&content.markdown);
|
||||
}
|
||||
|
||||
// Domain detection from URL patterns and DOM heuristics
|
||||
let domain_type = domain::detect(url, html);
|
||||
let domain_data = Some(DomainData { domain_type });
|
||||
|
||||
// JSON-LD structured data (Schema.org Product, Article, etc.)
|
||||
let structured_data = structured_data::extract_json_ld(html);
|
||||
|
||||
Ok(ExtractionResult {
|
||||
metadata: meta,
|
||||
content,
|
||||
domain_data,
|
||||
structured_data,
|
||||
})
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn full_extraction_pipeline() {
|
||||
let html = r#"
|
||||
<html lang="en">
|
||||
<head>
|
||||
<title>Rust is Great</title>
|
||||
<meta name="description" content="An article about Rust">
|
||||
<meta name="author" content="Bob">
|
||||
</head>
|
||||
<body>
|
||||
<nav><a href="/">Home</a> | <a href="/about">About</a></nav>
|
||||
<article>
|
||||
<h1>Why Rust is Great</h1>
|
||||
<p>Rust gives you <strong>memory safety</strong> without a garbage collector.
|
||||
This is achieved through its <em>ownership system</em>.</p>
|
||||
<p>Here is an example:</p>
|
||||
<pre><code class="language-rust">fn main() {
|
||||
println!("Hello, world!");
|
||||
}</code></pre>
|
||||
<p>Learn more at <a href="https://rust-lang.org">rust-lang.org</a>.</p>
|
||||
</article>
|
||||
<footer>Copyright 2025</footer>
|
||||
</body>
|
||||
</html>"#;
|
||||
|
||||
let result = extract(html, Some("https://blog.example.com/rust")).unwrap();
|
||||
|
||||
// Metadata
|
||||
assert_eq!(result.metadata.title.as_deref(), Some("Rust is Great"));
|
||||
assert_eq!(
|
||||
result.metadata.description.as_deref(),
|
||||
Some("An article about Rust")
|
||||
);
|
||||
assert_eq!(result.metadata.author.as_deref(), Some("Bob"));
|
||||
assert_eq!(result.metadata.language.as_deref(), Some("en"));
|
||||
assert!(result.metadata.word_count > 0);
|
||||
|
||||
// Content
|
||||
assert!(result.content.markdown.contains("# Why Rust is Great"));
|
||||
assert!(result.content.markdown.contains("**memory safety**"));
|
||||
assert!(result.content.markdown.contains("```rust"));
|
||||
assert!(
|
||||
result
|
||||
.content
|
||||
.links
|
||||
.iter()
|
||||
.any(|l| l.href == "https://rust-lang.org")
|
||||
);
|
||||
assert!(!result.content.code_blocks.is_empty());
|
||||
|
||||
// raw_html not populated by default
|
||||
assert!(result.content.raw_html.is_none());
|
||||
|
||||
// Domain — blog.example.com has <article> tag
|
||||
let dd = result.domain_data.unwrap();
|
||||
assert_eq!(dd.domain_type, DomainType::Article);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn invalid_url_returns_error() {
|
||||
let result = extract("<html></html>", Some("not a url"));
|
||||
assert!(matches!(result, Err(ExtractError::InvalidUrl(_))));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn empty_html_returns_error() {
|
||||
let result = extract("", None);
|
||||
assert!(matches!(result, Err(ExtractError::NoContent)));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn no_url_is_fine() {
|
||||
let result = extract("<html><body><p>Hello</p></body></html>", None);
|
||||
assert!(result.is_ok());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn serializes_to_json() {
|
||||
let result = extract("<html><body><p>Test</p></body></html>", None).unwrap();
|
||||
let json = serde_json::to_string_pretty(&result).unwrap();
|
||||
assert!(json.contains("metadata"));
|
||||
assert!(json.contains("content"));
|
||||
// raw_html should be absent (skip_serializing_if)
|
||||
assert!(!json.contains("raw_html"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn youtube_extraction_produces_structured_markdown() {
|
||||
let html = r#"
|
||||
<html><head><title>Rust in 100 Seconds - YouTube</title></head>
|
||||
<body>
|
||||
<script>
|
||||
var ytInitialPlayerResponse = {"videoDetails":{"title":"Rust in 100 Seconds","author":"Fireship","viewCount":"5432100","shortDescription":"Learn Rust in 100 seconds. A mass of web developers are mass adopting Rust.","lengthSeconds":"120"},"microformat":{"playerMicroformatRenderer":{"uploadDate":"2023-01-15"}}};
|
||||
</script>
|
||||
</body></html>
|
||||
"#;
|
||||
|
||||
let result = extract(html, Some("https://www.youtube.com/watch?v=5C_HPTJg5ek")).unwrap();
|
||||
|
||||
assert!(result.content.markdown.contains("# Rust in 100 Seconds"));
|
||||
assert!(result.content.markdown.contains("**Channel:** Fireship"));
|
||||
assert!(result.content.markdown.contains("2:00"));
|
||||
assert!(
|
||||
result
|
||||
.content
|
||||
.markdown
|
||||
.contains("Learn Rust in 100 seconds")
|
||||
);
|
||||
|
||||
// Should be detected as Social domain
|
||||
let dd = result.domain_data.unwrap();
|
||||
assert_eq!(dd.domain_type, DomainType::Social);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn youtube_url_without_player_response_falls_through() {
|
||||
// If ytInitialPlayerResponse is missing, fall through to normal extraction
|
||||
let html = r#"<html><body><article><h1>Some YouTube Page</h1><p>Content here for testing.</p></article></body></html>"#;
|
||||
let result = extract(html, Some("https://www.youtube.com/watch?v=abc123")).unwrap();
|
||||
|
||||
// Should still extract something via normal pipeline
|
||||
assert!(result.content.markdown.contains("Some YouTube Page"));
|
||||
}
|
||||
|
||||
// --- ExtractionOptions tests ---
|
||||
|
||||
#[test]
|
||||
fn test_exclude_selectors() {
|
||||
let html = r#"<html><body>
|
||||
<nav>Navigation stuff</nav>
|
||||
<article><h1>Title</h1><p>Real content here.</p></article>
|
||||
<footer>Footer stuff</footer>
|
||||
</body></html>"#;
|
||||
|
||||
let options = ExtractionOptions {
|
||||
exclude_selectors: vec!["nav".into(), "footer".into()],
|
||||
..Default::default()
|
||||
};
|
||||
let result = extract_with_options(html, None, &options).unwrap();
|
||||
|
||||
assert!(result.content.markdown.contains("Real content"));
|
||||
assert!(
|
||||
!result.content.markdown.contains("Navigation stuff"),
|
||||
"nav should be excluded"
|
||||
);
|
||||
assert!(
|
||||
!result.content.markdown.contains("Footer stuff"),
|
||||
"footer should be excluded"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_include_selectors() {
|
||||
let html = r#"<html><body>
|
||||
<nav>Navigation stuff</nav>
|
||||
<article><h1>Title</h1><p>Real content here.</p></article>
|
||||
<div class="sidebar">Sidebar junk</div>
|
||||
<footer>Footer stuff</footer>
|
||||
</body></html>"#;
|
||||
|
||||
let options = ExtractionOptions {
|
||||
include_selectors: vec!["article".into()],
|
||||
..Default::default()
|
||||
};
|
||||
let result = extract_with_options(html, None, &options).unwrap();
|
||||
|
||||
assert!(result.content.markdown.contains("Title"));
|
||||
assert!(result.content.markdown.contains("Real content"));
|
||||
assert!(
|
||||
!result.content.markdown.contains("Navigation stuff"),
|
||||
"nav should not be included"
|
||||
);
|
||||
assert!(
|
||||
!result.content.markdown.contains("Sidebar junk"),
|
||||
"sidebar should not be included"
|
||||
);
|
||||
assert!(
|
||||
!result.content.markdown.contains("Footer stuff"),
|
||||
"footer should not be included"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_include_and_exclude() {
|
||||
let html = r#"<html><body>
|
||||
<article>
|
||||
<h1>Title</h1>
|
||||
<p>Real content here.</p>
|
||||
<div class="sidebar">Sidebar inside article</div>
|
||||
</article>
|
||||
<footer>Footer stuff</footer>
|
||||
</body></html>"#;
|
||||
|
||||
let options = ExtractionOptions {
|
||||
include_selectors: vec!["article".into()],
|
||||
exclude_selectors: vec![".sidebar".into()],
|
||||
..Default::default()
|
||||
};
|
||||
let result = extract_with_options(html, None, &options).unwrap();
|
||||
|
||||
assert!(result.content.markdown.contains("Title"));
|
||||
assert!(result.content.markdown.contains("Real content"));
|
||||
assert!(
|
||||
!result.content.markdown.contains("Sidebar inside article"),
|
||||
"sidebar inside article should be excluded"
|
||||
);
|
||||
assert!(
|
||||
!result.content.markdown.contains("Footer stuff"),
|
||||
"footer should not be included"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_only_main_content() {
|
||||
let html = r#"<html><body>
|
||||
<nav>Navigation</nav>
|
||||
<div class="hero"><h1>Big Hero</h1></div>
|
||||
<article><h2>Article Title</h2><p>Article content that is long enough to be real.</p></article>
|
||||
<div class="sidebar">Sidebar</div>
|
||||
<footer>Footer</footer>
|
||||
</body></html>"#;
|
||||
|
||||
let options = ExtractionOptions {
|
||||
only_main_content: true,
|
||||
..Default::default()
|
||||
};
|
||||
let result = extract_with_options(html, None, &options).unwrap();
|
||||
|
||||
assert!(
|
||||
result.content.markdown.contains("Article Title"),
|
||||
"article content should be present"
|
||||
);
|
||||
assert!(
|
||||
result.content.markdown.contains("Article content"),
|
||||
"article body should be present"
|
||||
);
|
||||
// only_main_content picks the article/main element directly, so hero and sidebar
|
||||
// should not be in the output
|
||||
assert!(
|
||||
!result.content.markdown.contains("Sidebar"),
|
||||
"sidebar should not be in only_main_content output"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_include_raw_html() {
|
||||
let html = r#"<html><body>
|
||||
<article><h1>Title</h1><p>Content here.</p></article>
|
||||
</body></html>"#;
|
||||
|
||||
let options = ExtractionOptions {
|
||||
include_raw_html: true,
|
||||
..Default::default()
|
||||
};
|
||||
let result = extract_with_options(html, None, &options).unwrap();
|
||||
|
||||
assert!(
|
||||
result.content.raw_html.is_some(),
|
||||
"raw_html should be populated"
|
||||
);
|
||||
let raw = result.content.raw_html.unwrap();
|
||||
assert!(
|
||||
raw.contains("<article>"),
|
||||
"raw_html should contain article tag"
|
||||
);
|
||||
assert!(raw.contains("<h1>Title</h1>"), "raw_html should contain h1");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_invalid_selectors() {
|
||||
let html = r#"<html><body>
|
||||
<article><h1>Title</h1><p>Content here.</p></article>
|
||||
</body></html>"#;
|
||||
|
||||
// Invalid selectors should be gracefully skipped
|
||||
let options = ExtractionOptions {
|
||||
include_selectors: vec!["[invalid[[[".into(), "article".into()],
|
||||
exclude_selectors: vec![">>>bad".into()],
|
||||
..Default::default()
|
||||
};
|
||||
let result = extract_with_options(html, None, &options).unwrap();
|
||||
|
||||
assert!(
|
||||
result.content.markdown.contains("Title"),
|
||||
"valid selectors should still work"
|
||||
);
|
||||
assert!(
|
||||
result.content.markdown.contains("Content here"),
|
||||
"extraction should proceed despite invalid selectors"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_backward_compat() {
|
||||
let html = r#"<html><body>
|
||||
<article><h1>Title</h1><p>Content here.</p></article>
|
||||
</body></html>"#;
|
||||
|
||||
let result_old = extract(html, None).unwrap();
|
||||
let result_new = extract_with_options(html, None, &ExtractionOptions::default()).unwrap();
|
||||
|
||||
assert_eq!(result_old.content.markdown, result_new.content.markdown);
|
||||
assert_eq!(result_old.content.plain_text, result_new.content.plain_text);
|
||||
assert_eq!(
|
||||
result_old.content.links.len(),
|
||||
result_new.content.links.len()
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_empty_options() {
|
||||
let html = r#"<html><body>
|
||||
<article><h1>Title</h1><p>Content here.</p></article>
|
||||
</body></html>"#;
|
||||
|
||||
let result_extract = extract(html, None).unwrap();
|
||||
let result_options =
|
||||
extract_with_options(html, None, &ExtractionOptions::default()).unwrap();
|
||||
|
||||
assert_eq!(
|
||||
result_extract.content.markdown, result_options.content.markdown,
|
||||
"default ExtractionOptions should produce identical results to extract()"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_raw_html_not_in_json_when_none() {
|
||||
let result = extract("<html><body><p>Test</p></body></html>", None).unwrap();
|
||||
let json = serde_json::to_string(&result).unwrap();
|
||||
assert!(
|
||||
!json.contains("raw_html"),
|
||||
"raw_html should be absent from JSON when None"
|
||||
);
|
||||
}
|
||||
}
|
||||
1053
crates/webclaw-core/src/llm/body.rs
Normal file
1053
crates/webclaw-core/src/llm/body.rs
Normal file
File diff suppressed because it is too large
Load diff
1359
crates/webclaw-core/src/llm/cleanup.rs
Normal file
1359
crates/webclaw-core/src/llm/cleanup.rs
Normal file
File diff suppressed because it is too large
Load diff
237
crates/webclaw-core/src/llm/images.rs
Normal file
237
crates/webclaw-core/src/llm/images.rs
Normal file
|
|
@ -0,0 +1,237 @@
|
|||
/// Image handling for LLM output: linked image conversion, logo detection,
|
||||
/// standalone image stripping, and bare image reference removal.
|
||||
use once_cell::sync::Lazy;
|
||||
use regex::Regex;
|
||||
|
||||
use super::cleanup::is_asset_label;
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Linked image conversion: [](url) -> [alt](url)
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// Matches `[](link-url)` -- an image wrapped in a link.
|
||||
static LINKED_IMAGE_RE: Lazy<Regex> =
|
||||
Lazy::new(|| Regex::new(r"\[!\[([^\]]*)\]\([^)]+\)\]\(([^)]+)\)").unwrap());
|
||||
|
||||
/// Matches empty markdown links `[](url)` left after image stripping.
|
||||
pub(crate) static EMPTY_LINK_RE: Lazy<Regex> =
|
||||
Lazy::new(|| Regex::new(r"\[\s*\]\([^)]+\)").unwrap());
|
||||
|
||||
/// Convert linked images to plain links, preserving the alt text and link target.
|
||||
/// Adds a newline after each to prevent text mashing when multiple are adjacent.
|
||||
pub(crate) fn convert_linked_images(input: &str) -> String {
|
||||
LINKED_IMAGE_RE
|
||||
.replace_all(input, |caps: ®ex::Captures| {
|
||||
let alt = caps.get(1).map_or("", |m| m.as_str());
|
||||
let href = caps.get(2).map_or("", |m| m.as_str());
|
||||
format!("[{alt}]({href})\n")
|
||||
})
|
||||
.into_owned()
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Logo image collapsing
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// Regex matching a line that is *only* a markdown image (with optional whitespace).
|
||||
static IMAGE_LINE_RE: Lazy<Regex> =
|
||||
Lazy::new(|| Regex::new(r"^!\[([^\]]*)\]\([^)]+\)\s*$").unwrap());
|
||||
|
||||
/// Collapse consecutive image-only lines into a comma-separated summary
|
||||
/// of their alt texts (for logo bars, partner grids, etc.).
|
||||
pub(crate) fn collapse_logo_images(input: &str) -> String {
|
||||
let lines: Vec<&str> = input.lines().collect();
|
||||
let mut out = String::with_capacity(input.len());
|
||||
let mut i = 0;
|
||||
|
||||
while i < lines.len() {
|
||||
// Check if this starts a run of consecutive image-only lines
|
||||
if IMAGE_LINE_RE.is_match(lines[i].trim()) {
|
||||
let mut alts: Vec<String> = Vec::new();
|
||||
let start = i;
|
||||
while i < lines.len() {
|
||||
let trimmed = lines[i].trim();
|
||||
// Allow blank lines between images in the same run
|
||||
if trimmed.is_empty() {
|
||||
i += 1;
|
||||
continue;
|
||||
}
|
||||
if let Some(caps) = IMAGE_LINE_RE.captures(trimmed) {
|
||||
let alt = caps.get(1).map_or("", |m| m.as_str()).trim().to_string();
|
||||
if !alt.is_empty() {
|
||||
alts.push(alt);
|
||||
}
|
||||
i += 1;
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
let image_count = if alts.is_empty() {
|
||||
i - start
|
||||
} else {
|
||||
alts.len()
|
||||
};
|
||||
|
||||
if image_count >= 2 && !alts.is_empty() {
|
||||
out.push_str(&alts.join(", "));
|
||||
out.push('\n');
|
||||
} else if image_count == 1 && !alts.is_empty() && alts[0].len() > 30 {
|
||||
out.push_str(&alts[0]);
|
||||
out.push('\n');
|
||||
}
|
||||
// else: single image with short/empty alt -- drop entirely
|
||||
} else {
|
||||
out.push_str(lines[i]);
|
||||
out.push('\n');
|
||||
i += 1;
|
||||
}
|
||||
}
|
||||
|
||||
out
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Remaining inline image stripping
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// Matches `` anywhere in a line, including multiple on the same line.
|
||||
static INLINE_IMAGE_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"!\[([^\]]*)\]\([^)]+\)").unwrap());
|
||||
|
||||
/// Strip inline images. For multi-image lines, separate short alts (logos)
|
||||
/// from long alts (descriptive) so they don't get mixed together.
|
||||
pub(crate) fn strip_remaining_images(input: &str) -> String {
|
||||
let mut out = String::with_capacity(input.len());
|
||||
|
||||
for line in input.lines() {
|
||||
let image_matches: Vec<_> = INLINE_IMAGE_RE.find_iter(line).collect();
|
||||
|
||||
if image_matches.len() >= 2 {
|
||||
// Separate short alts (brand names/logos) from long alts (descriptions)
|
||||
let mut short_alts: Vec<&str> = Vec::new();
|
||||
let mut long_alts: Vec<&str> = Vec::new();
|
||||
|
||||
for caps in INLINE_IMAGE_RE.captures_iter(line) {
|
||||
let alt = caps.get(1).map_or("", |m| m.as_str()).trim();
|
||||
// Skip empty alts and quoted-empty alts like `""`
|
||||
if alt.is_empty() || alt == "\"\"" {
|
||||
continue;
|
||||
}
|
||||
if alt.len() <= 30 {
|
||||
short_alts.push(alt);
|
||||
} else {
|
||||
long_alts.push(alt);
|
||||
}
|
||||
}
|
||||
|
||||
// Filter out CMS asset labels from alt texts before output
|
||||
short_alts.retain(|alt| !is_asset_label(alt));
|
||||
long_alts.retain(|alt| !is_asset_label(alt));
|
||||
|
||||
// Remove images, then strip empty link remnants [](url)
|
||||
let remaining = INLINE_IMAGE_RE.replace_all(line, "");
|
||||
let remaining = EMPTY_LINK_RE.replace_all(&remaining, "");
|
||||
let remaining = remaining.trim();
|
||||
|
||||
if !short_alts.is_empty() {
|
||||
if !remaining.is_empty() {
|
||||
out.push_str(remaining);
|
||||
out.push('\n');
|
||||
}
|
||||
out.push_str(&short_alts.join(", "));
|
||||
out.push('\n');
|
||||
} else if !remaining.is_empty() {
|
||||
out.push_str(remaining);
|
||||
out.push('\n');
|
||||
}
|
||||
|
||||
// Long alts on their own lines (descriptions, not logos)
|
||||
for alt in &long_alts {
|
||||
out.push_str(alt);
|
||||
out.push('\n');
|
||||
}
|
||||
} else {
|
||||
// 0 or 1 image -- keep long alt text, drop short/empty/CMS labels
|
||||
let replaced = INLINE_IMAGE_RE.replace_all(line, |caps: ®ex::Captures| {
|
||||
let alt = caps.get(1).map_or("", |m| m.as_str()).trim();
|
||||
if alt.len() > 30 && !is_asset_label(alt) {
|
||||
alt.to_string()
|
||||
} else {
|
||||
String::new()
|
||||
}
|
||||
});
|
||||
out.push_str(&replaced);
|
||||
out.push('\n');
|
||||
}
|
||||
}
|
||||
|
||||
out
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Bare image file reference stripping
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
const IMAGE_EXTENSIONS: &[&str] = &[
|
||||
".webp", ".svg", ".png", ".jpg", ".jpeg", ".gif", ".avif", ".ico", ".bmp",
|
||||
];
|
||||
|
||||
/// Strip lines that are just bare image filenames or image URLs.
|
||||
/// Keeps lines where an image filename appears within a larger sentence.
|
||||
pub(crate) fn strip_bare_image_refs(input: &str) -> String {
|
||||
let mut out = String::with_capacity(input.len());
|
||||
|
||||
for line in input.lines() {
|
||||
let trimmed = line.trim();
|
||||
|
||||
if !trimmed.is_empty() && is_bare_image_ref(trimmed) {
|
||||
continue;
|
||||
}
|
||||
|
||||
out.push_str(line);
|
||||
out.push('\n');
|
||||
}
|
||||
|
||||
out
|
||||
}
|
||||
|
||||
/// A line is a bare image reference if it's a single token ending with an image extension.
|
||||
/// Catches filenames ("hero.webp") and URLs ("https://cdn.example.com/logo.svg").
|
||||
fn is_bare_image_ref(line: &str) -> bool {
|
||||
if line.starts_with('#')
|
||||
|| line.starts_with("- ")
|
||||
|| line.starts_with("* ")
|
||||
|| line.starts_with("```")
|
||||
|| line.starts_with("> ")
|
||||
{
|
||||
return false;
|
||||
}
|
||||
|
||||
if line.contains(' ') {
|
||||
return false;
|
||||
}
|
||||
|
||||
let lower = line.to_lowercase();
|
||||
IMAGE_EXTENSIONS.iter().any(|ext| lower.ends_with(ext))
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn linked_image_conversion() {
|
||||
let input = "[](https://docs.example.com)";
|
||||
let result = convert_linked_images(input);
|
||||
assert!(result.contains("[docs](https://docs.example.com)"));
|
||||
assert!(!result.contains("!["));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn bare_image_ref_detected() {
|
||||
assert!(is_bare_image_ref("hero.webp"));
|
||||
assert!(is_bare_image_ref("https://cdn.example.com/logo.svg"));
|
||||
assert!(!is_bare_image_ref("The file output.png is saved to disk."));
|
||||
assert!(!is_bare_image_ref("# heading.png"));
|
||||
}
|
||||
}
|
||||
184
crates/webclaw-core/src/llm/links.rs
Normal file
184
crates/webclaw-core/src/llm/links.rs
Normal file
|
|
@ -0,0 +1,184 @@
|
|||
/// Link extraction, deduplication, noise filtering, and label formatting
|
||||
/// for the LLM output's deduplicated links section.
|
||||
use std::collections::HashSet;
|
||||
|
||||
use once_cell::sync::Lazy;
|
||||
use regex::Regex;
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Link extraction
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// Matches `[text](url)`. Images are already stripped, so no `!` prefix concern.
|
||||
static LINK_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"\[([^\]]*)\]\(([^)]+)\)").unwrap());
|
||||
|
||||
/// Extract all links from markdown, replacing inline `[text](url)` with just `text`.
|
||||
/// Returns the cleaned text and a deduplicated list of (label, href) pairs.
|
||||
pub(crate) fn extract_and_strip_links(input: &str) -> (String, Vec<(String, String)>) {
|
||||
let mut links: Vec<(String, String)> = Vec::new();
|
||||
let mut seen_hrefs: HashSet<String> = HashSet::new();
|
||||
|
||||
let replaced = LINK_RE.replace_all(input, |caps: ®ex::Captures| {
|
||||
let text = caps.get(1).map_or("", |m| m.as_str()).trim().to_string();
|
||||
let href = caps.get(2).map_or("", |m| m.as_str()).trim().to_string();
|
||||
|
||||
let skip = href.starts_with('#')
|
||||
|| href.starts_with("javascript:")
|
||||
|| href.is_empty()
|
||||
|| is_noise_link(&text, &href);
|
||||
|
||||
if !skip && !text.is_empty() && seen_hrefs.insert(href.clone()) {
|
||||
links.push((text.clone(), href));
|
||||
}
|
||||
|
||||
text
|
||||
});
|
||||
|
||||
(replaced.into_owned(), links)
|
||||
}
|
||||
|
||||
/// Links that are noise for LLM consumption: internal actions, timestamps,
|
||||
/// user profiles, generic short text.
|
||||
fn is_noise_link(text: &str, href: &str) -> bool {
|
||||
let t = text.to_lowercase();
|
||||
|
||||
// Generic action links
|
||||
if matches!(
|
||||
t.as_str(),
|
||||
"hide"
|
||||
| "flag"
|
||||
| "reply"
|
||||
| "favorite"
|
||||
| "unflag"
|
||||
| "vouch"
|
||||
| "next"
|
||||
| "prev"
|
||||
| "previous"
|
||||
| "more"
|
||||
) {
|
||||
return true;
|
||||
}
|
||||
|
||||
// Timestamp text ("1 hour ago", "5 minutes ago", "yesterday")
|
||||
if t.ends_with(" ago") || t == "yesterday" || t == "just now" {
|
||||
return true;
|
||||
}
|
||||
|
||||
// Single-char text that's not meaningful (but keep letters -- "X", "Go", etc.)
|
||||
if text.len() == 1 && !text.chars().next().unwrap_or(' ').is_alphanumeric() {
|
||||
return true;
|
||||
}
|
||||
|
||||
// Internal user profile / action URLs (HN-style)
|
||||
if href.contains("/user?id=")
|
||||
|| href.contains("/hide?id=")
|
||||
|| href.contains("/from?site=")
|
||||
|| href.contains("/flag?id=")
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
false
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Link label cleaning
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
static MD_MARKERS_RE: Lazy<Regex> =
|
||||
Lazy::new(|| Regex::new(r"#{1,6}\s+|\*{1,2}|_{1,2}|`").unwrap());
|
||||
|
||||
/// Clean a link label: strip markdown, dedup repeated phrases, truncate.
|
||||
pub(crate) fn clean_link_label(raw: &str) -> String {
|
||||
// Strip markdown markers
|
||||
let label = MD_MARKERS_RE.replace_all(raw, "").to_string();
|
||||
let label = label.split_whitespace().collect::<Vec<_>>().join(" ");
|
||||
|
||||
// Dedup repeated phrases in label
|
||||
let label = dedup_label_phrase(&label);
|
||||
|
||||
// Truncate to ~80 chars (UTF-8 safe)
|
||||
if label.len() > 80 {
|
||||
// Find last whitespace boundary at or before 80 bytes
|
||||
let mut end = None;
|
||||
for (i, _) in label.char_indices() {
|
||||
if i > 80 {
|
||||
break;
|
||||
}
|
||||
if i > 0 && label.as_bytes()[i - 1].is_ascii_whitespace() {
|
||||
end = Some(i);
|
||||
}
|
||||
}
|
||||
let end = end.unwrap_or_else(|| {
|
||||
// No whitespace found -- find char boundary near 80
|
||||
label
|
||||
.char_indices()
|
||||
.map(|(i, _)| i)
|
||||
.find(|&i| i >= 80)
|
||||
.unwrap_or(label.len())
|
||||
});
|
||||
format!("{}...", label[..end].trim_end())
|
||||
} else {
|
||||
label
|
||||
}
|
||||
}
|
||||
|
||||
/// If a label contains the same phrase twice (e.g., "X Y Z X Y Z"), return just one copy.
|
||||
fn dedup_label_phrase(label: &str) -> String {
|
||||
let len = label.len();
|
||||
if len < 8 {
|
||||
return label.to_string();
|
||||
}
|
||||
// Try split at each whitespace boundary
|
||||
for (i, _) in label.match_indices(' ') {
|
||||
let left = label[..i].trim();
|
||||
let right = label[i + 1..].trim();
|
||||
if left.len() >= 4 && left.eq_ignore_ascii_case(right) {
|
||||
return left.to_string();
|
||||
}
|
||||
}
|
||||
label.to_string()
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Tests
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn link_label_truncated() {
|
||||
let long = "The quick brown fox jumps over the lazy dog and then runs across the field to find more interesting things to do on a sunny afternoon";
|
||||
let result = clean_link_label(long);
|
||||
assert!(result.len() <= 84, "got len {}: {result}", result.len());
|
||||
assert!(result.ends_with("..."), "got: {result}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn link_label_markdown_stripped() {
|
||||
assert_eq!(clean_link_label("## Hello **world**"), "Hello world");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn link_label_duplicate_deduped() {
|
||||
assert_eq!(
|
||||
clean_link_label("Express Delivery Express Delivery"),
|
||||
"Express Delivery"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn link_label_short_unchanged() {
|
||||
assert_eq!(clean_link_label("Click here"), "Click here");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn noise_link_detected() {
|
||||
assert!(is_noise_link("hide", "https://example.com"));
|
||||
assert!(is_noise_link("5 minutes ago", "https://example.com"));
|
||||
assert!(is_noise_link("user", "https://hn.com/user?id=foo"));
|
||||
assert!(!is_noise_link("Rust docs", "https://rust-lang.org"));
|
||||
}
|
||||
}
|
||||
47
crates/webclaw-core/src/llm/metadata.rs
Normal file
47
crates/webclaw-core/src/llm/metadata.rs
Normal file
|
|
@ -0,0 +1,47 @@
|
|||
/// Metadata header building for LLM-optimized output.
|
||||
///
|
||||
/// Produces `> ` prefixed lines with URL, title, author, etc.
|
||||
/// Omits empty/zero fields to minimize token waste.
|
||||
use crate::types::ExtractionResult;
|
||||
|
||||
pub(crate) fn build_metadata_header(
|
||||
out: &mut String,
|
||||
result: &ExtractionResult,
|
||||
url: Option<&str>,
|
||||
) {
|
||||
let meta = &result.metadata;
|
||||
|
||||
// URL: prefer explicit arg, fall back to metadata
|
||||
let effective_url = url.or(meta.url.as_deref());
|
||||
if let Some(u) = effective_url {
|
||||
out.push_str(&format!("> URL: {u}\n"));
|
||||
}
|
||||
if let Some(t) = &meta.title
|
||||
&& !t.is_empty()
|
||||
{
|
||||
out.push_str(&format!("> Title: {t}\n"));
|
||||
}
|
||||
if let Some(d) = &meta.description
|
||||
&& !d.is_empty()
|
||||
{
|
||||
out.push_str(&format!("> Description: {d}\n"));
|
||||
}
|
||||
if let Some(a) = &meta.author
|
||||
&& !a.is_empty()
|
||||
{
|
||||
out.push_str(&format!("> Author: {a}\n"));
|
||||
}
|
||||
if let Some(d) = &meta.published_date
|
||||
&& !d.is_empty()
|
||||
{
|
||||
out.push_str(&format!("> Published: {d}\n"));
|
||||
}
|
||||
if let Some(l) = &meta.language
|
||||
&& !l.is_empty()
|
||||
{
|
||||
out.push_str(&format!("> Language: {l}\n"));
|
||||
}
|
||||
if meta.word_count > 0 {
|
||||
out.push_str(&format!("> Word count: {}\n", meta.word_count));
|
||||
}
|
||||
}
|
||||
696
crates/webclaw-core/src/llm/mod.rs
Normal file
696
crates/webclaw-core/src/llm/mod.rs
Normal file
|
|
@ -0,0 +1,696 @@
|
|||
/// LLM-optimized output format.
|
||||
///
|
||||
/// Takes an `ExtractionResult` and produces a compact text representation
|
||||
/// that maximizes information density per token. Strips decorative images,
|
||||
/// visual-only formatting (bold/italic), and inline link URLs -- moving links
|
||||
/// to a deduplicated section at the end.
|
||||
mod body;
|
||||
mod cleanup;
|
||||
mod images;
|
||||
mod links;
|
||||
mod metadata;
|
||||
|
||||
use crate::types::ExtractionResult;
|
||||
|
||||
/// Produce a token-optimized text representation of extracted content.
|
||||
///
|
||||
/// The output has three sections:
|
||||
/// 1. Compact metadata header (`> ` prefixed lines)
|
||||
/// 2. Cleaned body (no images, no bold/italic, links as plain text)
|
||||
/// 3. Deduplicated links section at the end
|
||||
pub fn to_llm_text(result: &ExtractionResult, url: Option<&str>) -> String {
|
||||
let mut out = String::new();
|
||||
|
||||
// -- 1. Metadata header --
|
||||
metadata::build_metadata_header(&mut out, result, url);
|
||||
|
||||
// -- 2. Process body --
|
||||
let processed = body::process_body(&result.content.markdown);
|
||||
|
||||
if !processed.text.is_empty() {
|
||||
if !out.is_empty() {
|
||||
out.push('\n');
|
||||
}
|
||||
out.push_str(&processed.text);
|
||||
}
|
||||
|
||||
// -- 3. Links section --
|
||||
if !processed.links.is_empty() {
|
||||
out.push_str("\n\n## Links\n");
|
||||
for (text, href) in &processed.links {
|
||||
let label = links::clean_link_label(text);
|
||||
if !label.is_empty() {
|
||||
out.push_str(&format!("- {label}: {href}\n"));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
out.trim().to_string()
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Integration tests that exercise the full pipeline through to_llm_text
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::types::*;
|
||||
|
||||
fn make_result(markdown: &str) -> ExtractionResult {
|
||||
ExtractionResult {
|
||||
metadata: Metadata {
|
||||
title: Some("Test Page".into()),
|
||||
description: Some("A test page".into()),
|
||||
author: None,
|
||||
published_date: None,
|
||||
language: Some("en".into()),
|
||||
url: Some("https://example.com".into()),
|
||||
site_name: None,
|
||||
image: None,
|
||||
favicon: None,
|
||||
word_count: 42,
|
||||
},
|
||||
content: Content {
|
||||
markdown: markdown.into(),
|
||||
plain_text: String::new(),
|
||||
links: vec![],
|
||||
images: vec![],
|
||||
code_blocks: vec![],
|
||||
raw_html: None,
|
||||
},
|
||||
domain_data: None,
|
||||
structured_data: vec![],
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn metadata_header_includes_populated_fields() {
|
||||
let result = make_result("# Hello");
|
||||
let out = to_llm_text(&result, Some("https://example.com/page"));
|
||||
|
||||
assert!(out.contains("> URL: https://example.com/page"));
|
||||
assert!(out.contains("> Title: Test Page"));
|
||||
assert!(out.contains("> Description: A test page"));
|
||||
assert!(out.contains("> Language: en"));
|
||||
assert!(out.contains("> Word count: 42"));
|
||||
assert!(!out.contains("> Author:"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn strips_image_markdown() {
|
||||
let md = "Some text\n\n\n\nMore text";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(!out.contains("!["));
|
||||
assert!(!out.contains("cdn.example.com"));
|
||||
assert!(out.contains("Some text"));
|
||||
assert!(out.contains("More text"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn collapses_consecutive_logo_images_on_separate_lines() {
|
||||
let md = "# Partners\n\n\
|
||||
\n\
|
||||
\n\
|
||||
\n\
|
||||
\n\n\
|
||||
Some other content";
|
||||
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("WRITER, MongoDB, GROQ, LangChain"));
|
||||
assert!(!out.contains("!["));
|
||||
assert!(!out.contains("cdn.example.com"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn collapses_consecutive_logo_images_on_same_line() {
|
||||
let md = "";
|
||||
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("WRITER"));
|
||||
assert!(out.contains("MongoDB"));
|
||||
assert!(out.contains("GROQ"));
|
||||
assert!(!out.contains("!["));
|
||||
assert!(!out.contains("cdn.example.com"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn keeps_meaningful_alt_text() {
|
||||
let md = "";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(
|
||||
out.contains("A detailed photograph showing the team collaborating on the project")
|
||||
);
|
||||
assert!(!out.contains("!["));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn strips_bold_and_italic() {
|
||||
let md = "This is **bold text** and *italic text* and __also bold__ and _also italic_.";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("This is bold text and italic text and also bold and also italic."));
|
||||
assert!(!out.contains("**"));
|
||||
assert!(!out.contains("__"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn moves_links_to_end() {
|
||||
let md = "Check out [Rust](https://rust-lang.org) and [Go](https://go.dev) for details.";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("Check out Rust and Go for details."));
|
||||
assert!(out.contains("## Links"));
|
||||
assert!(out.contains("- Rust: https://rust-lang.org"));
|
||||
assert!(out.contains("- Go: https://go.dev"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn skips_anchor_and_javascript_links() {
|
||||
let md = "Go to [top](#top) and [click](javascript:void(0)) and [real](https://real.example.com).";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("## Links"));
|
||||
assert!(out.contains("- real: https://real.example.com"));
|
||||
let links_section = out.split("## Links").nth(1).unwrap_or("");
|
||||
assert!(!links_section.contains("#top"));
|
||||
assert!(!links_section.contains("javascript:"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deduplicates_heading_and_paragraph() {
|
||||
let md = "### Ground models\n\nGround models with fresh web context\n\nRetrieve live data.";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("### Ground models with fresh web context"));
|
||||
assert!(out.contains("Retrieve live data."));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deduplicates_identical_heading_paragraph() {
|
||||
let md = "## Features\n\nFeatures\n\nHere are the features.";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
let feature_count = out.matches("Features").count();
|
||||
assert_eq!(
|
||||
feature_count, 1,
|
||||
"Expected 'Features' exactly once, got: {out}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn collapses_excessive_whitespace() {
|
||||
let md = "Line one\n\n\n\n\nLine two\n\n\n\nLine three";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(
|
||||
!out.contains("\n\n\n"),
|
||||
"Found 3+ consecutive newlines in: {:?}",
|
||||
out
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn preserves_code_blocks() {
|
||||
let md = "Example:\n\n```rust\nfn main() {\n println!(\"hello\");\n}\n```\n\nDone.";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("```rust"));
|
||||
assert!(out.contains("fn main()"));
|
||||
assert!(out.contains("```"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn preserves_list_structure() {
|
||||
let md = "Features:\n\n- Fast\n- Safe\n- Concurrent";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("- Fast"));
|
||||
assert!(out.contains("- Safe"));
|
||||
assert!(out.contains("- Concurrent"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn deduplicates_links() {
|
||||
let md = "Visit [Example](https://example.org/page) or [Example again](https://example.org/page).";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
let link_count = out.matches("https://example.org/page").count();
|
||||
assert_eq!(link_count, 1, "Expected link once, got: {out}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn realistic_page() {
|
||||
let html = r#"
|
||||
<html lang="en">
|
||||
<head>
|
||||
<title>Tavily - AI Search API</title>
|
||||
<meta name="description" content="Real-time search for AI agents">
|
||||
</head>
|
||||
<body>
|
||||
<article>
|
||||
<h1>Connect your AI agents to the web</h1>
|
||||
<p>Real-time search, extraction, and web crawling through a <strong>single API</strong>.</p>
|
||||
<p>Trusted by <em>1M+ developers</em>.</p>
|
||||
<img src="https://cdn.example.com/writer.png" alt="WRITER">
|
||||
<img src="https://cdn.example.com/mongo.png" alt="MongoDB">
|
||||
<img src="https://cdn.example.com/groq.png" alt="GROQ">
|
||||
<img src="https://cdn.example.com/langchain.png" alt="LangChain">
|
||||
<h2>Ground models with fresh web context</h2>
|
||||
<p>Retrieve live web data and return it structured for models.</p>
|
||||
<p>Learn more at <a href="https://docs.tavily.com">the docs</a>.</p>
|
||||
<p><a href="https://app.tavily.com">Try it out</a></p>
|
||||
</article>
|
||||
</body>
|
||||
</html>"#;
|
||||
|
||||
let result = crate::extract(html, Some("https://www.tavily.com/")).unwrap();
|
||||
let out = to_llm_text(&result, Some("https://www.tavily.com/"));
|
||||
|
||||
assert!(out.contains("> URL: https://www.tavily.com/"));
|
||||
assert!(out.contains("> Title:"));
|
||||
|
||||
assert!(!out.contains("!["), "Image markdown not stripped: {out}");
|
||||
assert!(
|
||||
!out.contains("cdn.example.com"),
|
||||
"CDN URL not stripped: {out}"
|
||||
);
|
||||
|
||||
assert!(
|
||||
out.contains("WRITER") && out.contains("MongoDB"),
|
||||
"Logo alt texts missing: {out}"
|
||||
);
|
||||
|
||||
assert!(!out.contains("**"), "Bold not stripped: {out}");
|
||||
|
||||
assert!(out.contains("# Connect your AI agents to the web"));
|
||||
assert!(out.contains("## Ground models with fresh web context"));
|
||||
assert!(out.contains("Retrieve live web data"));
|
||||
|
||||
assert!(out.contains("## Links"));
|
||||
assert!(out.contains("https://docs.tavily.com"));
|
||||
assert!(out.contains("https://app.tavily.com"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn empty_metadata_fields_excluded() {
|
||||
let result = ExtractionResult {
|
||||
metadata: Metadata {
|
||||
title: None,
|
||||
description: None,
|
||||
author: None,
|
||||
published_date: None,
|
||||
language: None,
|
||||
url: None,
|
||||
site_name: None,
|
||||
image: None,
|
||||
favicon: None,
|
||||
word_count: 0,
|
||||
},
|
||||
content: Content {
|
||||
markdown: "Just content".into(),
|
||||
plain_text: String::new(),
|
||||
links: vec![],
|
||||
images: vec![],
|
||||
code_blocks: vec![],
|
||||
raw_html: None,
|
||||
},
|
||||
domain_data: None,
|
||||
structured_data: vec![],
|
||||
};
|
||||
|
||||
let out = to_llm_text(&result, None);
|
||||
assert!(!out.contains("> "));
|
||||
assert!(out.contains("Just content"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn strips_empty_alt_images() {
|
||||
let md = "Before\n\n\n\nAfter";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(!out.contains("cdn.example.com"));
|
||||
assert!(!out.contains("!["));
|
||||
assert!(out.contains("Before"));
|
||||
assert!(out.contains("After"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn preserves_headings_structure() {
|
||||
let md = "# H1\n\n## H2\n\n### H3\n\nContent under H3.";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("# H1"));
|
||||
assert!(out.contains("## H2"));
|
||||
assert!(out.contains("### H3"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn inline_image_in_paragraph_stripped() {
|
||||
let md = "Check this  out and read more.";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(!out.contains("!["));
|
||||
assert!(!out.contains("x.com/icon.png"));
|
||||
assert!(out.contains("Check this"));
|
||||
assert!(out.contains("out and read more."));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn does_not_strip_emphasis_inside_code_blocks() {
|
||||
let md = "Normal **bold** text\n\n```python\ndef foo(**kwargs):\n return _internal_var_\n```\n\nMore text";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("Normal bold text"));
|
||||
assert!(out.contains("**kwargs"));
|
||||
assert!(out.contains("_internal_var_"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn converts_linked_images_to_links() {
|
||||
let md = "[](https://docs.example.com)";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(!out.contains("!["), "Image not converted: {out}");
|
||||
assert!(
|
||||
out.contains("https://docs.example.com"),
|
||||
"Link URL missing from footer: {out}"
|
||||
);
|
||||
assert!(out.contains("Read the docs"), "Link text missing: {out}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn linked_images_split_on_separate_lines() {
|
||||
let md = "[](https://a.example.com)[](https://b.example.com)";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("Article A"), "Article A missing: {out}");
|
||||
assert!(out.contains("Article B"), "Article B missing: {out}");
|
||||
assert!(
|
||||
!out.contains("Article AArticle B"),
|
||||
"Text mashed together: {out}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn separates_short_and_long_alts_on_same_line() {
|
||||
let md = "";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("AWS, IBM"), "Logo collapse failed: {out}");
|
||||
assert!(
|
||||
!out.contains("IBM, Ground"),
|
||||
"Long alt mixed with logos: {out}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn dedup_text_line_matching_heading() {
|
||||
let md = "\n\n### Handle thousands of web queries in seconds\n\nA production-grade stack.";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
let count = out
|
||||
.matches("Handle thousands of web queries in seconds")
|
||||
.count();
|
||||
assert_eq!(count, 1, "Expected once, got {count}: {out}");
|
||||
assert!(out.contains("### Handle thousands"));
|
||||
assert!(out.contains("A production-grade stack."));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn no_leading_dot_from_linked_images() {
|
||||
let md = "[](https://a.com)[](https://b.com)";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(
|
||||
!out.contains(". News"),
|
||||
"Leading dot from empty remaining: {out}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn merges_stat_lines_with_descriptions() {
|
||||
let md = "100M+\n\nmonthly requests handled\n\n99.99% uptime\n\nSLA powering mission-critical systems\n\n180 ms\n\np50 on Tavily /search making us fastest on the market\n\n1M+\n\ndevelopers using Tavily\n\nBillions\n\nof pages crawled and extracted without downtime\n\nDrop-in integration\n\nwith leading LLM providers (OpenAI, Anthropic, Groq)";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(
|
||||
out.contains("100M+ monthly requests handled"),
|
||||
"Stat not merged: {out}"
|
||||
);
|
||||
assert!(
|
||||
out.contains("99.99% uptime SLA powering mission-critical systems"),
|
||||
"Stat not merged: {out}"
|
||||
);
|
||||
assert!(
|
||||
out.contains("180 ms p50 on Tavily /search making us fastest on the market"),
|
||||
"Stat not merged: {out}"
|
||||
);
|
||||
assert!(
|
||||
out.contains("1M+ developers using Tavily"),
|
||||
"Stat not merged: {out}"
|
||||
);
|
||||
assert!(
|
||||
out.contains("Billions of pages crawled and extracted without downtime"),
|
||||
"Stat not merged: {out}"
|
||||
);
|
||||
assert!(
|
||||
out.contains(
|
||||
"Drop-in integration with leading LLM providers (OpenAI, Anthropic, Groq)"
|
||||
),
|
||||
"Stat not merged: {out}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn merge_stat_preserves_headings_and_lists() {
|
||||
let md = "## Features\n\n100M+\n\nmonthly requests\n\n- Fast\n- Safe";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("## Features"), "Heading lost: {out}");
|
||||
assert!(
|
||||
out.contains("100M+ monthly requests"),
|
||||
"Stat not merged: {out}"
|
||||
);
|
||||
assert!(out.contains("- Fast"), "List item lost: {out}");
|
||||
assert!(out.contains("- Safe"), "List item lost: {out}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn merge_stat_does_not_merge_long_lines() {
|
||||
let md = "This is a longer line of text!\n\nAnd this follows after a blank";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(
|
||||
!out.contains("text! And"),
|
||||
"Long line incorrectly merged: {out}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn strips_css_class_text_lines() {
|
||||
let md = "# Typography\n\n\
|
||||
text-4xl font-bold tracking-tight text-gray-900\n\n\
|
||||
Build beautiful websites with Tailwind CSS.\n\n\
|
||||
text-5xl text-6xl text-8xl text-gray-950 text-white tracking-tighter text-balance";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(
|
||||
!out.contains("text-4xl font-bold"),
|
||||
"CSS class line was not stripped: {out}"
|
||||
);
|
||||
assert!(
|
||||
!out.contains("text-5xl text-6xl"),
|
||||
"CSS class line was not stripped: {out}"
|
||||
);
|
||||
assert!(
|
||||
out.contains("Build beautiful websites"),
|
||||
"Normal prose was stripped: {out}"
|
||||
);
|
||||
assert!(out.contains("Typography"), "Heading was stripped: {out}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn keeps_prose_with_css_like_word() {
|
||||
let md = "The text-based approach works well for this use case.\n\n\
|
||||
We use a grid-like layout for the dashboard.";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(
|
||||
out.contains("text-based approach"),
|
||||
"Normal prose incorrectly stripped: {out}"
|
||||
);
|
||||
assert!(
|
||||
out.contains("grid-like layout"),
|
||||
"Normal prose incorrectly stripped: {out}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn preserves_css_classes_inside_code_blocks() {
|
||||
let md = "Example usage:\n\n\
|
||||
```html\n\
|
||||
<div class=\"text-4xl font-bold tracking-tight text-gray-900\">\n\
|
||||
```\n\n\
|
||||
That applies bold typography.";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(
|
||||
out.contains("text-4xl font-bold tracking-tight"),
|
||||
"CSS classes inside code block were stripped: {out}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn dedup_removes_exact_duplicate_paragraphs() {
|
||||
let md = "Supabase is an amazing platform that makes building apps incredibly fast.\n\nSupabase is an amazing platform that makes building apps incredibly fast.\n\nSupabase is an amazing platform that makes building apps incredibly fast.\n\nEach project gets its own dedicated Postgres database.";
|
||||
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
let count = out.matches("Supabase is an amazing platform").count();
|
||||
assert_eq!(
|
||||
count, 1,
|
||||
"Duplicate paragraph should appear only once, got {count}: {out}"
|
||||
);
|
||||
assert!(
|
||||
out.contains("Each project gets its own dedicated Postgres database"),
|
||||
"Unique paragraph missing: {out}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn dedup_preserves_unique_paragraphs() {
|
||||
let md = "First unique paragraph with enough content to be checked.\n\nSecond unique paragraph that is completely different.\n\nThird unique paragraph covering another topic entirely.";
|
||||
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(out.contains("First unique paragraph"), "Lost first: {out}");
|
||||
assert!(
|
||||
out.contains("Second unique paragraph"),
|
||||
"Lost second: {out}"
|
||||
);
|
||||
assert!(out.contains("Third unique paragraph"), "Lost third: {out}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn dedup_keeps_short_repeated_text() {
|
||||
let md = "Learn more\n\nA detailed explanation of the first feature.\n\nLearn more\n\nA detailed explanation of the second feature.";
|
||||
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
let count = out.matches("Learn more").count();
|
||||
assert!(
|
||||
count >= 2,
|
||||
"Short repeated text should be kept, got {count}: {out}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn dedup_catches_near_duplicates_via_prefix() {
|
||||
let md = "The platform provides real-time sync collaboration tools for modern developers building web applications with React and Next.js.\n\nThe platform provides real-time sync collaboration tools for modern developers building mobile apps with Flutter.\n\nA completely different paragraph about database design.";
|
||||
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
let count = out.matches("The platform provides real-time sync").count();
|
||||
assert_eq!(
|
||||
count, 1,
|
||||
"Near-duplicate should be removed, got {count}: {out}"
|
||||
);
|
||||
assert!(
|
||||
out.contains("A completely different paragraph"),
|
||||
"Unique paragraph missing: {out}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn dedup_carousel_realistic() {
|
||||
let md = "## What our users say\n\n\"Supabase has transformed how we build products. The developer experience is unmatched.\" - Sarah Chen, CTO at TechCorp\n\n\"Moving from Firebase to Supabase was the best decision we made this year.\" - James Liu, Lead Engineer\n\n\"The real-time features and Postgres foundation give us confidence at scale.\" - Maria Garcia, VP Engineering\n\n\"Supabase has transformed how we build products. The developer experience is unmatched.\" - Sarah Chen, CTO at TechCorp\n\n\"Moving from Firebase to Supabase was the best decision we made this year.\" - James Liu, Lead Engineer\n\n\"The real-time features and Postgres foundation give us confidence at scale.\" - Maria Garcia, VP Engineering\n\n\"Supabase has transformed how we build products. The developer experience is unmatched.\" - Sarah Chen, CTO at TechCorp\n\n\"Moving from Firebase to Supabase was the best decision we made this year.\" - James Liu, Lead Engineer\n\n\"The real-time features and Postgres foundation give us confidence at scale.\" - Maria Garcia, VP Engineering\n\n## Get started\n\nSign up for free today.";
|
||||
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
let sarah_count = out.matches("Sarah Chen").count();
|
||||
let james_count = out.matches("James Liu").count();
|
||||
let maria_count = out.matches("Maria Garcia").count();
|
||||
|
||||
assert_eq!(sarah_count, 1, "Sarah duplicated {sarah_count}x: {out}");
|
||||
assert_eq!(james_count, 1, "James duplicated {james_count}x: {out}");
|
||||
assert_eq!(maria_count, 1, "Maria duplicated {maria_count}x: {out}");
|
||||
|
||||
assert!(out.contains("## What our users say"), "Heading lost: {out}");
|
||||
assert!(out.contains("## Get started"), "Heading lost: {out}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn strips_bare_image_references() {
|
||||
let md = "Some content\n\nhero.webp\n\nhttps://example.com/logo.svg\n\n\n\n\n\nThe file output.png is saved to disk.\n\n\n\nMore content";
|
||||
let result = make_result(md);
|
||||
let out = to_llm_text(&result, None);
|
||||
|
||||
assert!(
|
||||
!out.contains("hero.webp"),
|
||||
"Bare filename not stripped: {out}"
|
||||
);
|
||||
assert!(
|
||||
!out.contains("https://example.com/logo.svg"),
|
||||
"Bare image URL not stripped: {out}"
|
||||
);
|
||||
assert!(
|
||||
!out.contains("image.png"),
|
||||
"Empty-alt image not stripped: {out}"
|
||||
);
|
||||
assert!(
|
||||
!out.contains("logo.svg"),
|
||||
"Generic-alt image not stripped: {out}"
|
||||
);
|
||||
assert!(
|
||||
out.contains("output.png is saved to disk"),
|
||||
"Sentence with .png filename was incorrectly stripped: {out}"
|
||||
);
|
||||
assert!(
|
||||
out.contains("Detailed architecture diagram showing the data flow"),
|
||||
"Meaningful alt text was stripped: {out}"
|
||||
);
|
||||
assert!(
|
||||
!out.contains("arch.png"),
|
||||
"Image URL not stripped from meaningful alt: {out}"
|
||||
);
|
||||
assert!(out.contains("Some content"), "Content before lost: {out}");
|
||||
assert!(out.contains("More content"), "Content after lost: {out}");
|
||||
}
|
||||
}
|
||||
1395
crates/webclaw-core/src/markdown.rs
Normal file
1395
crates/webclaw-core/src/markdown.rs
Normal file
File diff suppressed because it is too large
Load diff
156
crates/webclaw-core/src/metadata.rs
Normal file
156
crates/webclaw-core/src/metadata.rs
Normal file
|
|
@ -0,0 +1,156 @@
|
|||
/// Metadata extraction from HTML <head>.
|
||||
/// Prioritizes Open Graph and Twitter Card tags, falls back to standard meta tags.
|
||||
use scraper::{Html, Selector};
|
||||
|
||||
use crate::types::Metadata;
|
||||
|
||||
/// Selectors are cheap to compile but we call them often — cache with once_cell.
|
||||
macro_rules! selector {
|
||||
($s:expr) => {{
|
||||
use once_cell::sync::Lazy;
|
||||
static SEL: Lazy<Selector> = Lazy::new(|| Selector::parse($s).unwrap());
|
||||
&*SEL
|
||||
}};
|
||||
}
|
||||
|
||||
pub fn extract(doc: &Html, url: Option<&str>) -> Metadata {
|
||||
let title = og_meta(doc, "og:title")
|
||||
.or_else(|| meta_name(doc, "twitter:title"))
|
||||
.or_else(|| title_tag(doc));
|
||||
|
||||
let description = og_meta(doc, "og:description")
|
||||
.or_else(|| meta_name(doc, "twitter:description"))
|
||||
.or_else(|| meta_name(doc, "description"));
|
||||
|
||||
let author = meta_name(doc, "author").or_else(|| og_meta(doc, "article:author"));
|
||||
|
||||
let published_date = og_meta(doc, "article:published_time")
|
||||
.or_else(|| meta_name(doc, "date"))
|
||||
.or_else(|| meta_name(doc, "publication_date"));
|
||||
|
||||
// Search the whole document for <html lang="..."> — root_element() IS the <html>
|
||||
// node in scraper, so selecting "html" from it finds nothing (no nested <html>).
|
||||
let language = doc
|
||||
.select(selector!("html"))
|
||||
.next()
|
||||
.and_then(|el| el.value().attr("lang"))
|
||||
.map(|s| s.to_string());
|
||||
|
||||
let site_name = og_meta(doc, "og:site_name");
|
||||
let image = og_meta(doc, "og:image").or_else(|| meta_name(doc, "twitter:image"));
|
||||
|
||||
let favicon = extract_favicon(doc);
|
||||
|
||||
Metadata {
|
||||
title,
|
||||
description,
|
||||
author,
|
||||
published_date,
|
||||
language,
|
||||
url: url.map(String::from),
|
||||
site_name,
|
||||
image,
|
||||
favicon,
|
||||
word_count: 0, // filled later by the extractor
|
||||
}
|
||||
}
|
||||
|
||||
/// <meta property="og:..." content="...">
|
||||
fn og_meta(doc: &Html, property: &str) -> Option<String> {
|
||||
// OG tags use property= not name=
|
||||
doc.select(selector!("meta[property]"))
|
||||
.find(|el| el.value().attr("property") == Some(property))
|
||||
.and_then(|el| el.value().attr("content"))
|
||||
.map(|s| s.trim().to_string())
|
||||
.filter(|s| !s.is_empty())
|
||||
}
|
||||
|
||||
/// <meta name="..." content="...">
|
||||
fn meta_name(doc: &Html, name: &str) -> Option<String> {
|
||||
doc.select(selector!("meta[name]"))
|
||||
.find(|el| {
|
||||
el.value()
|
||||
.attr("name")
|
||||
.is_some_and(|n| n.eq_ignore_ascii_case(name))
|
||||
})
|
||||
.and_then(|el| el.value().attr("content"))
|
||||
.map(|s| s.trim().to_string())
|
||||
.filter(|s| !s.is_empty())
|
||||
}
|
||||
|
||||
fn title_tag(doc: &Html) -> Option<String> {
|
||||
doc.select(selector!("title"))
|
||||
.next()
|
||||
.map(|el| el.text().collect::<String>().trim().to_string())
|
||||
.filter(|s| !s.is_empty())
|
||||
}
|
||||
|
||||
fn extract_favicon(doc: &Html) -> Option<String> {
|
||||
// <link rel="icon" href="..."> or <link rel="shortcut icon" href="...">
|
||||
doc.select(selector!("link[rel]"))
|
||||
.find(|el| el.value().attr("rel").is_some_and(|r| r.contains("icon")))
|
||||
.and_then(|el| el.value().attr("href"))
|
||||
.map(|s| s.to_string())
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
fn parse(html: &str) -> Html {
|
||||
Html::parse_document(html)
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extracts_basic_metadata() {
|
||||
let html = r#"
|
||||
<html lang="en">
|
||||
<head>
|
||||
<title>Test Page</title>
|
||||
<meta name="description" content="A test page">
|
||||
<meta name="author" content="Alice">
|
||||
<meta property="og:title" content="OG Title">
|
||||
<meta property="og:image" content="https://img.example.com/og.png">
|
||||
<meta property="og:site_name" content="Example">
|
||||
<meta property="article:published_time" content="2025-01-15">
|
||||
<link rel="icon" href="/favicon.ico">
|
||||
</head>
|
||||
<body></body>
|
||||
</html>"#;
|
||||
|
||||
let doc = parse(html);
|
||||
let meta = extract(&doc, Some("https://example.com"));
|
||||
|
||||
// OG title wins over <title>
|
||||
assert_eq!(meta.title.as_deref(), Some("OG Title"));
|
||||
assert_eq!(meta.description.as_deref(), Some("A test page"));
|
||||
assert_eq!(meta.author.as_deref(), Some("Alice"));
|
||||
assert_eq!(meta.published_date.as_deref(), Some("2025-01-15"));
|
||||
assert_eq!(meta.language.as_deref(), Some("en"));
|
||||
assert_eq!(meta.site_name.as_deref(), Some("Example"));
|
||||
assert_eq!(
|
||||
meta.image.as_deref(),
|
||||
Some("https://img.example.com/og.png")
|
||||
);
|
||||
assert_eq!(meta.favicon.as_deref(), Some("/favicon.ico"));
|
||||
assert_eq!(meta.url.as_deref(), Some("https://example.com"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn falls_back_to_title_tag() {
|
||||
let html = r#"<html><head><title>Fallback Title</title></head><body></body></html>"#;
|
||||
let doc = parse(html);
|
||||
let meta = extract(&doc, None);
|
||||
assert_eq!(meta.title.as_deref(), Some("Fallback Title"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn handles_missing_metadata_gracefully() {
|
||||
let html = r#"<html><head></head><body></body></html>"#;
|
||||
let doc = parse(html);
|
||||
let meta = extract(&doc, None);
|
||||
assert!(meta.title.is_none());
|
||||
assert!(meta.description.is_none());
|
||||
assert!(meta.language.is_none());
|
||||
}
|
||||
}
|
||||
756
crates/webclaw-core/src/noise.rs
Normal file
756
crates/webclaw-core/src/noise.rs
Normal file
|
|
@ -0,0 +1,756 @@
|
|||
/// Shared noise detection for web content extraction.
|
||||
///
|
||||
/// Identifies elements that don't contribute to main content:
|
||||
/// navigation, sidebars, footers, ads, cookie banners, modals, etc.
|
||||
/// Used by both the extractor (candidate filtering) and the markdown
|
||||
/// converter (output-time stripping).
|
||||
use scraper::ElementRef;
|
||||
|
||||
const NOISE_TAGS: &[&str] = &[
|
||||
"script", "style", "noscript", "iframe", "svg", "nav", "aside", "footer", "header", "form",
|
||||
"video", "audio",
|
||||
"canvas",
|
||||
// NOTE: <picture> removed — it's a responsive image container, not noise.
|
||||
// <picture> wraps <source> and <img> for responsive images.
|
||||
];
|
||||
|
||||
const NOISE_ROLES: &[&str] = &["navigation", "banner", "complementary", "contentinfo"];
|
||||
|
||||
const NOISE_CLASS_PATTERNS: &[&str] = &[
|
||||
"sidebar",
|
||||
"side",
|
||||
"nav",
|
||||
"navbar",
|
||||
"navigation",
|
||||
"menu",
|
||||
"footer",
|
||||
"header",
|
||||
"top",
|
||||
"bottom",
|
||||
"advertisement",
|
||||
"advert",
|
||||
"social",
|
||||
"social-media",
|
||||
"social-links",
|
||||
"share",
|
||||
"comment",
|
||||
"cookie",
|
||||
"popup",
|
||||
"modal",
|
||||
"overlay",
|
||||
"banner",
|
||||
"breadcrumb",
|
||||
"breadcrumbs",
|
||||
"widget",
|
||||
"lang-selector",
|
||||
"language",
|
||||
"newsletter",
|
||||
"subscribe",
|
||||
"related-posts",
|
||||
"recommended",
|
||||
"pagination",
|
||||
"pager",
|
||||
"signup",
|
||||
"login-form",
|
||||
"search-form",
|
||||
"notification",
|
||||
"alert",
|
||||
"toast",
|
||||
"skip-link",
|
||||
"sr-only",
|
||||
"visually-hidden",
|
||||
];
|
||||
|
||||
const NOISE_ID_PATTERNS: &[&str] = &[
|
||||
"sidebar",
|
||||
"nav",
|
||||
"menu",
|
||||
"footer",
|
||||
"header",
|
||||
"cookie",
|
||||
"popup",
|
||||
"modal",
|
||||
"breadcrumbs",
|
||||
"widget",
|
||||
"language-selector",
|
||||
"ad",
|
||||
"social",
|
||||
"share",
|
||||
"newsletter",
|
||||
"subscribe",
|
||||
"comments",
|
||||
"related",
|
||||
"recommended",
|
||||
];
|
||||
|
||||
/// Exact class tokens that indicate noise.
|
||||
/// Unlike substring matching, these only match when the EXACT class token
|
||||
/// is present — ".modal" matches `class="modal"` but NOT `class="free-modal-container"`.
|
||||
const NOISE_CLASSES: &[&str] = &[
|
||||
"header",
|
||||
"top",
|
||||
"navbar",
|
||||
"footer",
|
||||
"bottom",
|
||||
"sidebar",
|
||||
"modal",
|
||||
"popup",
|
||||
"overlay",
|
||||
"ad",
|
||||
"ads",
|
||||
"advert",
|
||||
"lang-selector",
|
||||
"language",
|
||||
"social",
|
||||
"social-media",
|
||||
"social-links",
|
||||
"menu",
|
||||
"navigation",
|
||||
"breadcrumbs",
|
||||
"breadcrumb",
|
||||
"share",
|
||||
"widget",
|
||||
"cookie",
|
||||
"newsletter",
|
||||
"subscribe",
|
||||
"skip-link",
|
||||
"sr-only",
|
||||
"visually-hidden",
|
||||
"notification",
|
||||
"alert",
|
||||
"toast",
|
||||
"pagination",
|
||||
"pager",
|
||||
"signup",
|
||||
"login-form",
|
||||
"search-form",
|
||||
"related-posts",
|
||||
"recommended",
|
||||
];
|
||||
|
||||
/// Exact IDs that indicate noise.
|
||||
const NOISE_IDS: &[&str] = &[
|
||||
"header",
|
||||
"footer",
|
||||
"nav",
|
||||
"sidebar",
|
||||
"menu",
|
||||
"modal",
|
||||
"popup",
|
||||
"cookie",
|
||||
"breadcrumbs",
|
||||
"widget",
|
||||
"ad",
|
||||
"social",
|
||||
"share",
|
||||
"newsletter",
|
||||
"subscribe",
|
||||
"comments",
|
||||
"related",
|
||||
"recommended",
|
||||
];
|
||||
|
||||
/// ID prefixes for cookie consent platforms that should be stripped entirely.
|
||||
/// These generate massive DOM overlays that dominate content extraction.
|
||||
const COOKIE_CONSENT_ID_PREFIXES: &[&str] = &[
|
||||
"onetrust", // OneTrust (Foot Locker, many EU sites)
|
||||
"optanon", // OneTrust legacy
|
||||
"ot-sdk", // OneTrust SDK
|
||||
"cookiebot", // Cookiebot
|
||||
"CybotCookiebot", // Cookiebot
|
||||
"cc-", // Cookie Consent (Osano)
|
||||
"cookie-law", // Cookie Law Info
|
||||
"gdpr", // Generic GDPR banners
|
||||
"consent-", // Generic consent banners
|
||||
"cmp-", // Consent Management Platforms
|
||||
"sp_message", // SourcePoint
|
||||
"qc-cmp", // Quantcast CMP
|
||||
"trustarc", // TrustArc
|
||||
"evidon", // Evidon/Crownpeak
|
||||
];
|
||||
|
||||
/// Check if an element is noise by tag, role, class, or id.
|
||||
///
|
||||
/// Uses EXACT class token matching instead
|
||||
/// of substring matching. This prevents false positives like:
|
||||
/// - "free-modal-container" ≠ noise (Vice.com's content wrapper)
|
||||
/// - "a-bw_aui_cxc_alert_measurement" ≠ noise (Amazon's body class)
|
||||
/// - "desktop" ≠ noise (not matching "top")
|
||||
pub fn is_noise(el: ElementRef<'_>) -> bool {
|
||||
let tag = el.value().name();
|
||||
|
||||
// Never treat <body> or <html> as noise.
|
||||
if tag == "body" || tag == "html" {
|
||||
return false;
|
||||
}
|
||||
|
||||
// Tag-based noise (script, style, nav, etc.)
|
||||
if NOISE_TAGS.contains(&tag) {
|
||||
return true;
|
||||
}
|
||||
|
||||
// ARIA role-based noise
|
||||
if let Some(role) = el.value().attr("role")
|
||||
&& NOISE_ROLES.contains(&role)
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
// Exact class token matching — split class attribute into tokens,
|
||||
// check each against the noise list. "free-modal-container" splits into
|
||||
// ["free-modal-container"] which does NOT match "modal".
|
||||
if let Some(class) = el.value().attr("class") {
|
||||
for token in class.split_whitespace() {
|
||||
let lower = token.to_lowercase();
|
||||
if NOISE_CLASSES.contains(&lower.as_str()) {
|
||||
return true;
|
||||
}
|
||||
// Structural elements use compound names (FooterLinks, Header-nav, etc.)
|
||||
// These are always noise regardless of compound form.
|
||||
if lower.starts_with("footer")
|
||||
|| lower.starts_with("header-")
|
||||
|| lower.starts_with("nav-")
|
||||
{
|
||||
return true;
|
||||
}
|
||||
}
|
||||
// Also check for ad-specific patterns (standalone "ad" class)
|
||||
if is_ad_class(class) {
|
||||
return true;
|
||||
}
|
||||
}
|
||||
|
||||
// Exact ID matching
|
||||
if let Some(id) = el.value().attr("id") {
|
||||
let id_lower = id.to_lowercase();
|
||||
if NOISE_IDS.contains(&id_lower.as_str()) && !is_structural_id(&id_lower) {
|
||||
return true;
|
||||
}
|
||||
// Cookie consent platform IDs (prefix match — these generate huge overlays)
|
||||
for prefix in COOKIE_CONSENT_ID_PREFIXES {
|
||||
if id_lower.starts_with(prefix) {
|
||||
return true;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Class-based cookie consent detection (prefix match for platform classes)
|
||||
if let Some(class) = el.value().attr("class") {
|
||||
let class_lower = class.to_lowercase();
|
||||
for prefix in COOKIE_CONSENT_ID_PREFIXES {
|
||||
if class_lower.contains(prefix) {
|
||||
return true;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
false
|
||||
}
|
||||
|
||||
/// Check if an element is inside a noise container.
|
||||
pub fn is_noise_descendant(el: ElementRef<'_>) -> bool {
|
||||
let mut node = el.parent();
|
||||
while let Some(parent) = node {
|
||||
if let Some(parent_el) = ElementRef::wrap(parent)
|
||||
&& is_noise(parent_el)
|
||||
{
|
||||
return true;
|
||||
}
|
||||
node = parent.parent();
|
||||
}
|
||||
false
|
||||
}
|
||||
|
||||
fn has_noise_class(class: &str) -> bool {
|
||||
// Match noise patterns against individual class tokens, with safeguards
|
||||
// against Tailwind CSS utility classes that contain noise keywords as
|
||||
// substrings (e.g., "pt-header-h" is padding, not a header class).
|
||||
class.split_whitespace().any(is_noise_token) || is_ad_class(class)
|
||||
}
|
||||
|
||||
/// Check if a single class token is a noise indicator.
|
||||
/// Requires the noise pattern to be the *semantic core* of the token,
|
||||
/// not embedded inside a Tailwind utility prefix or CSS variable.
|
||||
fn is_noise_token(token: &str) -> bool {
|
||||
let t = token.to_lowercase();
|
||||
|
||||
// Skip Tailwind arbitrary values and CSS variable references entirely
|
||||
if t.contains("[--") || t.contains("var(") {
|
||||
return false;
|
||||
}
|
||||
|
||||
// Strip common Tailwind responsive/state prefixes (e.g., "lg:", "hover:", "md:")
|
||||
let core = t.rsplit_once(':').map_or(t.as_str(), |(_, c)| c);
|
||||
|
||||
// The noise pattern should match the semantic name, not be buried inside
|
||||
// a utility like "pt-header-h" (padding) or "mt-nav-offset" (margin).
|
||||
// Tailwind utilities start with known prefixes; if the token starts with one,
|
||||
// it's a utility class, not a semantic class.
|
||||
const UTILITY_PREFIXES: &[&str] = &[
|
||||
"p-",
|
||||
"pt-",
|
||||
"pb-",
|
||||
"pl-",
|
||||
"pr-",
|
||||
"px-",
|
||||
"py-",
|
||||
"m-",
|
||||
"mt-",
|
||||
"mb-",
|
||||
"ml-",
|
||||
"mr-",
|
||||
"mx-",
|
||||
"my-",
|
||||
"w-",
|
||||
"h-",
|
||||
"min-",
|
||||
"max-",
|
||||
"top-",
|
||||
"left-",
|
||||
"right-",
|
||||
"bottom-",
|
||||
"z-",
|
||||
"gap-",
|
||||
"text-",
|
||||
"bg-",
|
||||
"border-",
|
||||
"rounded-",
|
||||
"flex-",
|
||||
"grid-",
|
||||
"col-",
|
||||
"row-",
|
||||
"opacity-",
|
||||
"transition-",
|
||||
"duration-",
|
||||
"delay-",
|
||||
"ease-",
|
||||
"translate-",
|
||||
"scale-",
|
||||
"rotate-",
|
||||
"origin-",
|
||||
"overflow-",
|
||||
"inset-",
|
||||
"space-",
|
||||
"divide-",
|
||||
"ring-",
|
||||
"shadow-",
|
||||
"outline-",
|
||||
"font-",
|
||||
"leading-",
|
||||
"tracking-",
|
||||
"decoration-",
|
||||
];
|
||||
if UTILITY_PREFIXES.iter().any(|pfx| core.starts_with(pfx)) {
|
||||
return false;
|
||||
}
|
||||
|
||||
// "banner" and "overlay" only match as prefix — they false-positive as
|
||||
// suffixes in BEM/Webflow component names (e.g., "package_banner" is a
|
||||
// product card, not an ad banner; "planet-overlay" is a visual effect).
|
||||
const PREFIX_ONLY: &[&str] = &["banner", "overlay"];
|
||||
|
||||
// Short patterns (≤6 chars like "nav", "top", "header", "widget") require
|
||||
// word-boundary matching to avoid false positives on compound CSS class
|
||||
// names (e.g., "desktop" ≠ "top", "celwidget" ≠ "widget",
|
||||
// "_categoriesheader_active" ≠ semantic "header").
|
||||
// A word boundary is `-`, `_`, or start/end of string.
|
||||
// Longer patterns (7+ chars like "sidebar", "breadcrumb") are specific
|
||||
// enough that substring matching is safe.
|
||||
NOISE_CLASS_PATTERNS.iter().any(|p| {
|
||||
if PREFIX_ONLY.contains(p) {
|
||||
core == *p || core.starts_with(&format!("{p}-")) || core.starts_with(&format!("{p}_"))
|
||||
} else if p.len() <= 6 {
|
||||
is_word_boundary_match(core, p)
|
||||
} else {
|
||||
core.contains(p)
|
||||
}
|
||||
})
|
||||
}
|
||||
|
||||
/// Check if `pattern` appears in `text` at a word boundary.
|
||||
/// Word boundaries are `-`, `_`, or start/end of string.
|
||||
/// e.g., "nav" matches "main-nav", "nav-bar", "nav" but NOT "canvas", "navbar".
|
||||
fn is_word_boundary_match(text: &str, pattern: &str) -> bool {
|
||||
let mut start = 0;
|
||||
while let Some(pos) = text[start..].find(pattern) {
|
||||
let abs = start + pos;
|
||||
let before_ok = abs == 0 || matches!(text.as_bytes()[abs - 1], b'-' | b'_');
|
||||
let end = abs + pattern.len();
|
||||
let after_ok = end == text.len() || matches!(text.as_bytes()[end], b'-' | b'_');
|
||||
if before_ok && after_ok {
|
||||
return true;
|
||||
}
|
||||
start = abs + 1;
|
||||
}
|
||||
false
|
||||
}
|
||||
|
||||
/// IDs like "modal-portal", "nav-root", "header-container" are structural
|
||||
/// wrappers (React portals, app roots), not actual noise elements.
|
||||
fn is_structural_id(id: &str) -> bool {
|
||||
const STRUCTURAL_SUFFIXES: &[&str] =
|
||||
&["portal", "root", "container", "wrapper", "mount", "app"];
|
||||
STRUCTURAL_SUFFIXES.iter().any(|s| id.contains(s))
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// CSS class text detection (visible content that looks like class names)
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// CSS utility prefixes that indicate a word is a class name, not prose.
|
||||
/// Covers Tailwind, Bootstrap-ish, and common utility-first patterns.
|
||||
const CSS_CLASS_PREFIXES: &[&str] = &[
|
||||
"text-",
|
||||
"bg-",
|
||||
"px-",
|
||||
"py-",
|
||||
"pt-",
|
||||
"pb-",
|
||||
"pl-",
|
||||
"pr-",
|
||||
"p-",
|
||||
"mx-",
|
||||
"my-",
|
||||
"mt-",
|
||||
"mb-",
|
||||
"ml-",
|
||||
"mr-",
|
||||
"m-",
|
||||
"w-",
|
||||
"h-",
|
||||
"min-",
|
||||
"max-",
|
||||
"flex-",
|
||||
"grid-",
|
||||
"col-",
|
||||
"row-",
|
||||
"gap-",
|
||||
"space-",
|
||||
"rounded-",
|
||||
"shadow-",
|
||||
"border-",
|
||||
"ring-",
|
||||
"outline-",
|
||||
"font-",
|
||||
"tracking-",
|
||||
"leading-",
|
||||
"decoration-",
|
||||
"opacity-",
|
||||
"transition-",
|
||||
"duration-",
|
||||
"delay-",
|
||||
"ease-",
|
||||
"translate-",
|
||||
"scale-",
|
||||
"rotate-",
|
||||
"origin-",
|
||||
"overflow-",
|
||||
"inset-",
|
||||
"divide-",
|
||||
"z-",
|
||||
"top-",
|
||||
"left-",
|
||||
"right-",
|
||||
"bottom-",
|
||||
"sr-",
|
||||
"not-",
|
||||
"group-",
|
||||
"peer-",
|
||||
"placeholder-",
|
||||
"focus-",
|
||||
"hover-",
|
||||
"active-",
|
||||
"disabled-",
|
||||
"dark-",
|
||||
"sm-",
|
||||
"md-",
|
||||
"lg-",
|
||||
"xl-",
|
||||
"2xl-",
|
||||
];
|
||||
|
||||
/// Exact single-word CSS utility class names (no prefix needed).
|
||||
const CSS_CLASS_EXACT: &[&str] = &[
|
||||
"flex",
|
||||
"grid",
|
||||
"block",
|
||||
"inline",
|
||||
"hidden",
|
||||
"static",
|
||||
"fixed",
|
||||
"absolute",
|
||||
"relative",
|
||||
"sticky",
|
||||
"isolate",
|
||||
"container",
|
||||
"prose",
|
||||
"antialiased",
|
||||
"truncate",
|
||||
"uppercase",
|
||||
"lowercase",
|
||||
"capitalize",
|
||||
"italic",
|
||||
"underline",
|
||||
"overline",
|
||||
"invisible",
|
||||
"visible",
|
||||
"sr-only",
|
||||
"not-sr-only",
|
||||
];
|
||||
|
||||
/// Tailwind responsive/state prefixes that can appear before a utility class
|
||||
/// (e.g., "sm:text-lg", "hover:bg-blue-500", "dark:text-white").
|
||||
fn strip_tw_variant_prefix(word: &str) -> &str {
|
||||
// Handle chained variants: "dark:sm:text-lg" → "text-lg"
|
||||
word.rsplit_once(':').map_or(word, |(_, core)| core)
|
||||
}
|
||||
|
||||
/// Check if a single whitespace-delimited word looks like a CSS utility class.
|
||||
fn is_css_class_word(word: &str) -> bool {
|
||||
let core = strip_tw_variant_prefix(word);
|
||||
let lower = core.to_lowercase();
|
||||
|
||||
// Arbitrary value syntax: "[--foo:bar]", "w-[200px]"
|
||||
if lower.contains('[') && lower.contains(']') {
|
||||
return true;
|
||||
}
|
||||
|
||||
// Exact matches
|
||||
if CSS_CLASS_EXACT.iter().any(|&e| lower == e) {
|
||||
return true;
|
||||
}
|
||||
|
||||
// Prefix matches
|
||||
if CSS_CLASS_PREFIXES.iter().any(|pfx| lower.starts_with(pfx)) {
|
||||
return true;
|
||||
}
|
||||
|
||||
// Negative utilities: "-mt-4", "-translate-x-1/2"
|
||||
if lower.starts_with('-') && lower.len() > 1 {
|
||||
let rest = &lower[1..];
|
||||
if CSS_CLASS_PREFIXES.iter().any(|pfx| rest.starts_with(pfx)) {
|
||||
return true;
|
||||
}
|
||||
}
|
||||
|
||||
false
|
||||
}
|
||||
|
||||
/// Public wrapper for single-word CSS class detection (used by LLM pipeline
|
||||
/// for stripping trailing CSS classes from mixed-content lines).
|
||||
pub fn is_css_class_word_pub(word: &str) -> bool {
|
||||
is_css_class_word(word)
|
||||
}
|
||||
|
||||
/// Check if a text block is predominantly CSS class names.
|
||||
///
|
||||
/// Returns true if >50% of the whitespace-delimited words look like CSS
|
||||
/// utility classes. Requires at least 3 words to avoid false positives on
|
||||
/// short fragments.
|
||||
pub fn is_css_class_text(text: &str) -> bool {
|
||||
let words: Vec<&str> = text.split_whitespace().collect();
|
||||
if words.len() < 3 {
|
||||
return false;
|
||||
}
|
||||
|
||||
let css_count = words.iter().filter(|w| is_css_class_word(w)).count();
|
||||
// >50% of words are CSS classes
|
||||
css_count * 2 > words.len()
|
||||
}
|
||||
|
||||
/// Detect "ad" as a standalone class token, not a substring of "read" or "loading".
|
||||
fn is_ad_class(class: &str) -> bool {
|
||||
class.split_whitespace().any(|token| {
|
||||
token == "ad"
|
||||
|| token.starts_with("ad-")
|
||||
|| token.starts_with("ad_")
|
||||
|| token.ends_with("-ad")
|
||||
|| token.ends_with("_ad")
|
||||
})
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn ad_class_standalone_detected() {
|
||||
assert!(is_ad_class("ad"));
|
||||
assert!(is_ad_class("some ad-banner"));
|
||||
assert!(is_ad_class("top-ad widget"));
|
||||
assert!(is_ad_class("ad_unit"));
|
||||
assert!(is_ad_class("sidebar_ad"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ad_class_no_false_positive() {
|
||||
assert!(!is_ad_class("reading-time"));
|
||||
assert!(!is_ad_class("loading-indicator"));
|
||||
assert!(!is_ad_class("download-button"));
|
||||
assert!(!is_ad_class("breadcrumb"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn noise_class_patterns() {
|
||||
assert!(has_noise_class("main-sidebar"));
|
||||
assert!(has_noise_class("cookie-banner")); // "cookie" substring match
|
||||
assert!(has_noise_class("modal-overlay")); // "modal" substring match
|
||||
assert!(has_noise_class("banner-top")); // "banner" as prefix
|
||||
assert!(has_noise_class("overlay-popup")); // "overlay" as prefix
|
||||
assert!(!has_noise_class("article-content"));
|
||||
assert!(!has_noise_class("post-body"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn short_patterns_require_word_boundary() {
|
||||
// "nav" (3 chars) — must be a standalone word segment
|
||||
assert!(has_noise_class("main-nav"));
|
||||
assert!(has_noise_class("nav-bar"));
|
||||
assert!(has_noise_class("nav"));
|
||||
assert!(!has_noise_class("canvas")); // "nav" is substring, not word
|
||||
assert!(has_noise_class("icp-nav-flag")); // "nav" IS between word boundaries
|
||||
// "top" (3 chars) — note: "top-bar" starts with Tailwind prefix "top-" → filtered out
|
||||
assert!(has_noise_class("page-top")); // "top" at word boundary
|
||||
assert!(!has_noise_class("desktop")); // "top" is substring inside word
|
||||
assert!(!has_noise_class("stop-motion")); // "top" inside word
|
||||
// "side" (4 chars) — "left-side" starts with Tailwind prefix "left-" → filtered
|
||||
assert!(has_noise_class("page-side"));
|
||||
assert!(!has_noise_class("inside-content"));
|
||||
assert!(!has_noise_class("consider"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn amazon_classes_not_noise() {
|
||||
// Amazon CSS module class names that were false-positiving
|
||||
assert!(!has_noise_class("desktop")); // contains "top"
|
||||
assert!(!has_noise_class("celwidget")); // contains "widget"
|
||||
// a-alert-container: "alert" IS a proper word segment → still matches (correct for UI alerts)
|
||||
assert!(has_noise_class("a-alert-container"));
|
||||
assert!(!has_noise_class(
|
||||
"_haul-cx-images-carousel_style_desktop-card__fid8k"
|
||||
));
|
||||
assert!(!has_noise_class(
|
||||
"_haul-cx-infinite-scroll-body_categoriesheader_active__2j-4u"
|
||||
));
|
||||
// But actual noise classes still work
|
||||
assert!(has_noise_class("site-header"));
|
||||
assert!(has_noise_class("main-nav"));
|
||||
assert!(has_noise_class("footer-links"));
|
||||
assert!(has_noise_class("cookie-consent"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn word_boundary_match_works() {
|
||||
assert!(is_word_boundary_match("main-nav", "nav"));
|
||||
assert!(is_word_boundary_match("nav-bar", "nav"));
|
||||
assert!(is_word_boundary_match("nav", "nav"));
|
||||
assert!(is_word_boundary_match("top-nav_bar", "nav"));
|
||||
assert!(!is_word_boundary_match("canvas", "nav"));
|
||||
assert!(!is_word_boundary_match("navbar", "nav"));
|
||||
assert!(!is_word_boundary_match("navigate", "nav"));
|
||||
assert!(is_word_boundary_match("top-bar", "top"));
|
||||
assert!(!is_word_boundary_match("desktop", "top"));
|
||||
assert!(!is_word_boundary_match("stopper", "top"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn bem_component_names_not_noise() {
|
||||
// BEM/Webflow component names where noise keyword is a suffix
|
||||
assert!(!has_noise_class("package_banner"));
|
||||
assert!(!has_noise_class("mars-cta_planet-overlay"));
|
||||
assert!(!has_noise_class("hero_banner_wrap"));
|
||||
// But actual noise classes still work
|
||||
assert!(has_noise_class("banner-dismiss"));
|
||||
assert!(has_noise_class("overlay-backdrop"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn structural_ids_not_noise() {
|
||||
assert!(is_structural_id("modal-portal"));
|
||||
assert!(is_structural_id("nav-root"));
|
||||
assert!(is_structural_id("header-container"));
|
||||
assert!(is_structural_id("sidebar-wrapper"));
|
||||
assert!(is_structural_id("menu-mount"));
|
||||
assert!(is_structural_id("app"));
|
||||
// Actual noise IDs should NOT be structural
|
||||
assert!(!is_structural_id("main-sidebar"));
|
||||
assert!(!is_structural_id("cookie-consent"));
|
||||
assert!(!is_structural_id("popup-overlay"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn tailwind_animation_utilities_not_noise() {
|
||||
// Tailwind transition/animation utilities with noise keywords as values
|
||||
assert!(!has_noise_class("ease-curve-sidebar"));
|
||||
assert!(!has_noise_class("duration-sidebar"));
|
||||
assert!(!has_noise_class("delay-modal-open"));
|
||||
// But actual sidebar/modal classes still work
|
||||
assert!(has_noise_class("sidebar-panel"));
|
||||
assert!(has_noise_class("modal-dialog"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn tailwind_css_vars_not_noise() {
|
||||
// Tailwind arbitrary values and CSS variables should NOT trigger noise
|
||||
assert!(!has_noise_class("[--content-top-offset:var(--header-h)]"));
|
||||
assert!(!has_noise_class(
|
||||
"pt-[var(--content-top-offset)] [--content-top-offset:var(--header-h)]"
|
||||
));
|
||||
assert!(!has_noise_class("[--nav-width:200px]"));
|
||||
// But actual noise classes still work
|
||||
assert!(has_noise_class("[--offset:10px] header-bar"));
|
||||
assert!(has_noise_class("sidebar [--x:1]"));
|
||||
}
|
||||
|
||||
// -----------------------------------------------------------------------
|
||||
// CSS class text detection (decorative text that looks like class names)
|
||||
// -----------------------------------------------------------------------
|
||||
|
||||
#[test]
|
||||
fn css_class_text_detected() {
|
||||
// Pure Tailwind utility class blocks — the real-world problem
|
||||
assert!(is_css_class_text(
|
||||
"text-4xl font-bold tracking-tight text-gray-900"
|
||||
));
|
||||
assert!(is_css_class_text(
|
||||
"text-4xl text-5xl text-6xl text-8xl text-gray-950 text-white tracking-tighter text-balance"
|
||||
));
|
||||
assert!(is_css_class_text(
|
||||
"flex grid rounded-lg shadow-md bg-white px-4 py-2"
|
||||
));
|
||||
assert!(is_css_class_text(
|
||||
"sm:text-lg dark:bg-gray-800 hover:bg-blue-500"
|
||||
));
|
||||
// Negative utilities
|
||||
assert!(is_css_class_text("-mt-4 -translate-x-1/2 flex"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn css_class_text_normal_prose_kept() {
|
||||
// Normal English text — must NOT be detected as CSS
|
||||
assert!(!is_css_class_text(
|
||||
"the text-based approach works well for this use case"
|
||||
));
|
||||
assert!(!is_css_class_text(
|
||||
"Build beautiful websites with modern tools"
|
||||
));
|
||||
assert!(!is_css_class_text(
|
||||
"Tailwind CSS is a utility-first CSS framework"
|
||||
));
|
||||
// Too short to be confident
|
||||
assert!(!is_css_class_text("flex grid"));
|
||||
assert!(!is_css_class_text("text-lg"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn css_class_text_mixed_content() {
|
||||
// Majority CSS → detected
|
||||
assert!(is_css_class_text(
|
||||
"text-4xl font-bold tracking-tight text-gray-900 hero"
|
||||
));
|
||||
// Majority prose → not detected
|
||||
assert!(!is_css_class_text(
|
||||
"The quick brown fox jumps over the lazy text-lg dog"
|
||||
));
|
||||
}
|
||||
}
|
||||
165
crates/webclaw-core/src/structured_data.rs
Normal file
165
crates/webclaw-core/src/structured_data.rs
Normal file
|
|
@ -0,0 +1,165 @@
|
|||
/// Extract JSON-LD structured data from HTML.
|
||||
///
|
||||
/// Parses `<script type="application/ld+json">` blocks commonly found in
|
||||
/// e-commerce, news, and recipe sites. Returns machine-readable product info,
|
||||
/// prices, availability, reviews, etc. without needing JS rendering or LLM.
|
||||
use serde_json::Value;
|
||||
|
||||
/// Extract all JSON-LD blocks from raw HTML.
|
||||
///
|
||||
/// Returns parsed JSON values, skipping any blocks that fail to parse.
|
||||
/// Most e-commerce sites include Schema.org Product markup with prices,
|
||||
/// sizes, availability, and images.
|
||||
pub fn extract_json_ld(html: &str) -> Vec<Value> {
|
||||
let mut results = Vec::new();
|
||||
let needle = "application/ld+json";
|
||||
|
||||
// Walk through the HTML finding <script type="application/ld+json"> blocks.
|
||||
// Using simple string scanning instead of a full HTML parser — these blocks
|
||||
// are self-contained and reliably structured.
|
||||
let mut search_from = 0;
|
||||
while let Some(tag_start) = html[search_from..].find("<script") {
|
||||
let abs_start = search_from + tag_start;
|
||||
let tag_region = &html[abs_start..];
|
||||
|
||||
// Find the end of the opening tag
|
||||
let Some(tag_end_offset) = tag_region.find('>') else {
|
||||
search_from = abs_start + 7;
|
||||
continue;
|
||||
};
|
||||
|
||||
let opening_tag = &tag_region[..tag_end_offset];
|
||||
|
||||
// Check if this is a JSON-LD script
|
||||
if !opening_tag.to_lowercase().contains(needle) {
|
||||
search_from = abs_start + tag_end_offset + 1;
|
||||
continue;
|
||||
}
|
||||
|
||||
// Find the closing </script>
|
||||
let content_start = abs_start + tag_end_offset + 1;
|
||||
let remaining = &html[content_start..];
|
||||
let Some(close_offset) = remaining.to_lowercase().find("</script>") else {
|
||||
search_from = content_start;
|
||||
continue;
|
||||
};
|
||||
|
||||
let json_str = remaining[..close_offset].trim();
|
||||
search_from = content_start + close_offset + 9;
|
||||
|
||||
if json_str.is_empty() {
|
||||
continue;
|
||||
}
|
||||
|
||||
// Parse — some sites have arrays at top level
|
||||
match serde_json::from_str::<Value>(json_str) {
|
||||
Ok(Value::Array(arr)) => results.extend(arr),
|
||||
Ok(val) => results.push(val),
|
||||
Err(_) => {}
|
||||
}
|
||||
}
|
||||
|
||||
results
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn extracts_single_json_ld() {
|
||||
let html = r#"
|
||||
<html><head>
|
||||
<script type="application/ld+json">{"@type":"Product","name":"Test"}</script>
|
||||
</head><body></body></html>
|
||||
"#;
|
||||
let results = extract_json_ld(html);
|
||||
assert_eq!(results.len(), 1);
|
||||
assert_eq!(results[0]["@type"], "Product");
|
||||
assert_eq!(results[0]["name"], "Test");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extracts_multiple_json_ld_blocks() {
|
||||
let html = r#"
|
||||
<script type="application/ld+json">{"@type":"WebSite","url":"https://example.com"}</script>
|
||||
<script type="application/ld+json">{"@type":"Product","name":"Shoe","offers":{"price":99.99}}</script>
|
||||
"#;
|
||||
let results = extract_json_ld(html);
|
||||
assert_eq!(results.len(), 2);
|
||||
assert_eq!(results[0]["@type"], "WebSite");
|
||||
assert_eq!(results[1]["@type"], "Product");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn handles_array_json_ld() {
|
||||
let html = r#"
|
||||
<script type="application/ld+json">[{"@type":"BreadcrumbList"},{"@type":"Product"}]</script>
|
||||
"#;
|
||||
let results = extract_json_ld(html);
|
||||
assert_eq!(results.len(), 2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn skips_invalid_json() {
|
||||
let html = r#"
|
||||
<script type="application/ld+json">{invalid json here}</script>
|
||||
<script type="application/ld+json">{"@type":"Product","name":"Valid"}</script>
|
||||
"#;
|
||||
let results = extract_json_ld(html);
|
||||
assert_eq!(results.len(), 1);
|
||||
assert_eq!(results[0]["name"], "Valid");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ignores_regular_script_tags() {
|
||||
let html = r#"
|
||||
<script>console.log("not json-ld")</script>
|
||||
<script type="text/javascript">var x = 1;</script>
|
||||
<script type="application/ld+json">{"@type":"Product"}</script>
|
||||
"#;
|
||||
let results = extract_json_ld(html);
|
||||
assert_eq!(results.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn handles_no_json_ld() {
|
||||
let html = "<html><body><p>No structured data here</p></body></html>";
|
||||
let results = extract_json_ld(html);
|
||||
assert!(results.is_empty());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn case_insensitive_type() {
|
||||
let html = r#"
|
||||
<script type="Application/LD+JSON">{"@type":"Product"}</script>
|
||||
"#;
|
||||
let results = extract_json_ld(html);
|
||||
assert_eq!(results.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn handles_whitespace_in_json() {
|
||||
let html = r#"
|
||||
<script type="application/ld+json">
|
||||
{
|
||||
"@type": "Product",
|
||||
"name": "Test"
|
||||
}
|
||||
</script>
|
||||
"#;
|
||||
let results = extract_json_ld(html);
|
||||
assert_eq!(results.len(), 1);
|
||||
assert_eq!(results[0]["name"], "Test");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn empty_script_tag_skipped() {
|
||||
let html = r#"
|
||||
<script type="application/ld+json"> </script>
|
||||
<script type="application/ld+json">{"@type":"Product"}</script>
|
||||
"#;
|
||||
let results = extract_json_ld(html);
|
||||
assert_eq!(results.len(), 1);
|
||||
}
|
||||
}
|
||||
80
crates/webclaw-core/src/types.rs
Normal file
80
crates/webclaw-core/src/types.rs
Normal file
|
|
@ -0,0 +1,80 @@
|
|||
/// Core types for extraction output.
|
||||
/// All types are serializable for JSON output to LLM consumers.
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
use crate::domain::DomainType;
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct ExtractionResult {
|
||||
pub metadata: Metadata,
|
||||
pub content: Content,
|
||||
pub domain_data: Option<DomainData>,
|
||||
/// JSON-LD structured data extracted from `<script type="application/ld+json">` blocks.
|
||||
/// Contains Schema.org markup (Product, Article, BreadcrumbList, etc.) when present.
|
||||
#[serde(default, skip_serializing_if = "Vec::is_empty")]
|
||||
pub structured_data: Vec<serde_json::Value>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct Metadata {
|
||||
pub title: Option<String>,
|
||||
pub description: Option<String>,
|
||||
pub author: Option<String>,
|
||||
pub published_date: Option<String>,
|
||||
pub language: Option<String>,
|
||||
pub url: Option<String>,
|
||||
pub site_name: Option<String>,
|
||||
pub image: Option<String>,
|
||||
pub favicon: Option<String>,
|
||||
pub word_count: usize,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct Content {
|
||||
pub markdown: String,
|
||||
pub plain_text: String,
|
||||
pub links: Vec<Link>,
|
||||
pub images: Vec<Image>,
|
||||
pub code_blocks: Vec<CodeBlock>,
|
||||
#[serde(skip_serializing_if = "Option::is_none")]
|
||||
pub raw_html: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
|
||||
pub struct Link {
|
||||
pub text: String,
|
||||
pub href: String,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct Image {
|
||||
pub alt: String,
|
||||
pub src: String,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct CodeBlock {
|
||||
pub language: Option<String>,
|
||||
pub code: String,
|
||||
}
|
||||
|
||||
/// Domain-specific extracted data. For MVP, only the detected type is stored.
|
||||
/// Future: each variant carries structured fields (e.g., Article { author, date, ... }).
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct DomainData {
|
||||
pub domain_type: DomainType,
|
||||
}
|
||||
|
||||
/// Options for controlling content extraction behavior.
|
||||
#[derive(Debug, Clone, Default)]
|
||||
pub struct ExtractionOptions {
|
||||
/// CSS selectors for elements to include. If non-empty, only these elements
|
||||
/// are extracted (skipping the scoring algorithm entirely).
|
||||
pub include_selectors: Vec<String>,
|
||||
/// CSS selectors for elements to exclude from the output.
|
||||
pub exclude_selectors: Vec<String>,
|
||||
/// If true, skip scoring and pick the first `article`, `main`, or `[role="main"]` element.
|
||||
pub only_main_content: bool,
|
||||
/// If true, populate `Content::raw_html` with the extracted content's HTML.
|
||||
pub include_raw_html: bool,
|
||||
}
|
||||
220
crates/webclaw-core/src/youtube.rs
Normal file
220
crates/webclaw-core/src/youtube.rs
Normal file
|
|
@ -0,0 +1,220 @@
|
|||
use once_cell::sync::Lazy;
|
||||
/// YouTube video metadata extraction from `ytInitialPlayerResponse` embedded JSON.
|
||||
///
|
||||
/// YouTube embeds the full player config (title, author, view count, description,
|
||||
/// duration, upload date) in a `<script>` tag as a JS variable assignment. This
|
||||
/// module parses that blob and formats it as structured markdown, giving LLMs a
|
||||
/// clean representation without needing the YouTube API.
|
||||
use regex::Regex;
|
||||
use tracing::debug;
|
||||
|
||||
/// Regex to find the ytInitialPlayerResponse assignment in a <script> block.
|
||||
/// YouTube uses: `var ytInitialPlayerResponse = {...};`
|
||||
static YT_PLAYER_RE: Lazy<Regex> =
|
||||
Lazy::new(|| Regex::new(r"var\s+ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;").unwrap());
|
||||
|
||||
/// Check if a URL is a YouTube video page.
|
||||
pub fn is_youtube_url(url: &str) -> bool {
|
||||
let lower = url.to_lowercase();
|
||||
lower.contains("youtube.com/watch") || lower.contains("youtu.be/")
|
||||
}
|
||||
|
||||
/// Extracted YouTube video metadata.
|
||||
#[derive(Debug)]
|
||||
struct VideoMeta {
|
||||
title: String,
|
||||
author: String,
|
||||
view_count: String,
|
||||
upload_date: String,
|
||||
description: String,
|
||||
duration: String,
|
||||
}
|
||||
|
||||
/// Try to extract YouTube video metadata from the page HTML.
|
||||
/// Returns structured markdown if successful, None if the page doesn't contain
|
||||
/// ytInitialPlayerResponse or parsing fails.
|
||||
pub fn try_extract(html: &str) -> Option<String> {
|
||||
let json_str = YT_PLAYER_RE.captures(html)?.get(1)?.as_str();
|
||||
|
||||
let value: serde_json::Value = serde_json::from_str(json_str).ok()?;
|
||||
|
||||
let video_details = value.get("videoDetails")?;
|
||||
let microformat = value
|
||||
.get("microformat")
|
||||
.and_then(|m| m.get("playerMicroformatRenderer"));
|
||||
|
||||
let title = video_details
|
||||
.get("title")
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("Untitled")
|
||||
.to_string();
|
||||
|
||||
let author = video_details
|
||||
.get("author")
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("Unknown")
|
||||
.to_string();
|
||||
|
||||
let view_count = video_details
|
||||
.get("viewCount")
|
||||
.and_then(|v| v.as_str())
|
||||
.map(format_view_count)
|
||||
.unwrap_or_else(|| "N/A".to_string());
|
||||
|
||||
let upload_date = microformat
|
||||
.and_then(|m| m.get("uploadDate"))
|
||||
.or_else(|| microformat.and_then(|m| m.get("publishDate")))
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("Unknown")
|
||||
.to_string();
|
||||
|
||||
let description = video_details
|
||||
.get("shortDescription")
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("")
|
||||
.to_string();
|
||||
|
||||
let duration_secs = video_details
|
||||
.get("lengthSeconds")
|
||||
.and_then(|v| v.as_str())
|
||||
.and_then(|s| s.parse::<u64>().ok())
|
||||
.unwrap_or(0);
|
||||
let duration = format_duration(duration_secs);
|
||||
|
||||
let meta = VideoMeta {
|
||||
title,
|
||||
author,
|
||||
view_count,
|
||||
upload_date,
|
||||
description,
|
||||
duration,
|
||||
};
|
||||
|
||||
debug!(
|
||||
title = %meta.title,
|
||||
author = %meta.author,
|
||||
"extracted YouTube video metadata"
|
||||
);
|
||||
|
||||
Some(format_markdown(&meta))
|
||||
}
|
||||
|
||||
/// Format seconds into human-readable duration (e.g., "1:23:45" or "12:34").
|
||||
fn format_duration(total_secs: u64) -> String {
|
||||
let hours = total_secs / 3600;
|
||||
let minutes = (total_secs % 3600) / 60;
|
||||
let seconds = total_secs % 60;
|
||||
|
||||
if hours > 0 {
|
||||
format!("{hours}:{minutes:02}:{seconds:02}")
|
||||
} else {
|
||||
format!("{minutes}:{seconds:02}")
|
||||
}
|
||||
}
|
||||
|
||||
/// Format a raw view count string with commas (e.g., "1234567" -> "1,234,567").
|
||||
fn format_view_count(raw: &str) -> String {
|
||||
let Ok(n) = raw.parse::<u64>() else {
|
||||
return raw.to_string();
|
||||
};
|
||||
|
||||
if n >= 1_000_000 {
|
||||
format!("{:.1}M", n as f64 / 1_000_000.0)
|
||||
} else if n >= 1_000 {
|
||||
format!("{:.1}K", n as f64 / 1_000.0)
|
||||
} else {
|
||||
n.to_string()
|
||||
}
|
||||
}
|
||||
|
||||
/// Format extracted metadata into structured markdown.
|
||||
fn format_markdown(meta: &VideoMeta) -> String {
|
||||
let mut md = format!("# {}\n\n", meta.title);
|
||||
|
||||
md.push_str(&format!(
|
||||
"**Channel:** {} | **Views:** {} | **Published:** {} | **Duration:** {}\n\n",
|
||||
meta.author, meta.view_count, meta.upload_date, meta.duration
|
||||
));
|
||||
|
||||
if !meta.description.is_empty() {
|
||||
md.push_str("## Description\n\n");
|
||||
md.push_str(&meta.description);
|
||||
md.push('\n');
|
||||
}
|
||||
|
||||
md
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn detects_youtube_urls() {
|
||||
assert!(is_youtube_url(
|
||||
"https://www.youtube.com/watch?v=dQw4w9WgXcQ"
|
||||
));
|
||||
assert!(is_youtube_url("https://youtube.com/watch?v=abc123"));
|
||||
assert!(is_youtube_url("https://youtu.be/dQw4w9WgXcQ"));
|
||||
assert!(!is_youtube_url("https://example.com"));
|
||||
assert!(!is_youtube_url("https://vimeo.com/123456"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn format_duration_short() {
|
||||
assert_eq!(format_duration(0), "0:00");
|
||||
assert_eq!(format_duration(65), "1:05");
|
||||
assert_eq!(format_duration(3661), "1:01:01");
|
||||
assert_eq!(format_duration(754), "12:34");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn format_view_count_values() {
|
||||
assert_eq!(format_view_count("500"), "500");
|
||||
assert_eq!(format_view_count("1500"), "1.5K");
|
||||
assert_eq!(format_view_count("1234567"), "1.2M");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn extracts_from_mock_html() {
|
||||
let html = r#"
|
||||
<html><head><title>Test Video</title></head>
|
||||
<body>
|
||||
<script>
|
||||
var ytInitialPlayerResponse = {"videoDetails":{"title":"Rust in 100 Seconds","author":"Fireship","viewCount":"5432100","shortDescription":"Learn Rust in 100 seconds.","lengthSeconds":"120"},"microformat":{"playerMicroformatRenderer":{"uploadDate":"2023-01-15"}}};
|
||||
</script>
|
||||
</body></html>
|
||||
"#;
|
||||
|
||||
let result = try_extract(html).unwrap();
|
||||
assert!(result.contains("# Rust in 100 Seconds"));
|
||||
assert!(result.contains("**Channel:** Fireship"));
|
||||
assert!(result.contains("5.4M"));
|
||||
assert!(result.contains("2023-01-15"));
|
||||
assert!(result.contains("2:00"));
|
||||
assert!(result.contains("Learn Rust in 100 seconds."));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn returns_none_for_non_youtube_html() {
|
||||
let html = "<html><body><p>Hello world</p></body></html>";
|
||||
assert!(try_extract(html).is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn handles_missing_optional_fields() {
|
||||
let html = r#"
|
||||
<html><body>
|
||||
<script>
|
||||
var ytInitialPlayerResponse = {"videoDetails":{"title":"Minimal Video","author":"Someone","viewCount":"100","shortDescription":"","lengthSeconds":"60"}};
|
||||
</script>
|
||||
</body></html>
|
||||
"#;
|
||||
|
||||
let result = try_extract(html).unwrap();
|
||||
assert!(result.contains("# Minimal Video"));
|
||||
assert!(result.contains("**Channel:** Someone"));
|
||||
// Upload date should be "Unknown" when microformat is missing
|
||||
assert!(result.contains("Unknown"));
|
||||
}
|
||||
}
|
||||
24
crates/webclaw-fetch/Cargo.toml
Normal file
24
crates/webclaw-fetch/Cargo.toml
Normal file
|
|
@ -0,0 +1,24 @@
|
|||
[package]
|
||||
name = "webclaw-fetch"
|
||||
description = "HTTP client with browser TLS fingerprint impersonation via Impit"
|
||||
version.workspace = true
|
||||
edition.workspace = true
|
||||
license.workspace = true
|
||||
|
||||
[dependencies]
|
||||
webclaw-core = { workspace = true }
|
||||
webclaw-pdf = { path = "../webclaw-pdf" }
|
||||
serde = { workspace = true }
|
||||
thiserror = { workspace = true }
|
||||
tracing = { workspace = true }
|
||||
tokio = { workspace = true }
|
||||
primp = { git = "https://github.com/deedy5/primp", default-features = false, features = [
|
||||
"default-tls", "http2", "impersonate", "cookies", "gzip", "brotli", "deflate", "zstd", "socks",
|
||||
] }
|
||||
url = "2"
|
||||
rand = "0.8"
|
||||
quick-xml = { version = "0.37", features = ["serde"] }
|
||||
serde_json.workspace = true
|
||||
|
||||
[dev-dependencies]
|
||||
tempfile = "3"
|
||||
96
crates/webclaw-fetch/src/browser.rs
Normal file
96
crates/webclaw-fetch/src/browser.rs
Normal file
|
|
@ -0,0 +1,96 @@
|
|||
/// Browser fingerprint selection and rotation.
|
||||
/// Maps our simple `BrowserProfile` enum to primp's impersonation profiles.
|
||||
use primp::{Impersonate, ImpersonateOS};
|
||||
|
||||
/// Which browser identity to present at the TLS/HTTP layer.
|
||||
#[derive(Debug, Clone, Default)]
|
||||
pub enum BrowserProfile {
|
||||
#[default]
|
||||
Chrome,
|
||||
Firefox,
|
||||
/// Randomly pick from all available profiles on each request.
|
||||
Random,
|
||||
}
|
||||
|
||||
/// A complete impersonation profile: browser + OS.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct ImpersonateProfile {
|
||||
pub browser: Impersonate,
|
||||
pub os: ImpersonateOS,
|
||||
}
|
||||
|
||||
/// All Chrome profiles we ship, newest first.
|
||||
pub fn chrome_profiles() -> Vec<ImpersonateProfile> {
|
||||
vec![
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::ChromeV145,
|
||||
os: ImpersonateOS::Windows,
|
||||
},
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::ChromeV145,
|
||||
os: ImpersonateOS::MacOS,
|
||||
},
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::ChromeV144,
|
||||
os: ImpersonateOS::Windows,
|
||||
},
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::ChromeV144,
|
||||
os: ImpersonateOS::Linux,
|
||||
},
|
||||
]
|
||||
}
|
||||
|
||||
/// All Firefox profiles we ship, newest first.
|
||||
pub fn firefox_profiles() -> Vec<ImpersonateProfile> {
|
||||
vec![
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::FirefoxV146,
|
||||
os: ImpersonateOS::Windows,
|
||||
},
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::FirefoxV146,
|
||||
os: ImpersonateOS::Linux,
|
||||
},
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::FirefoxV140,
|
||||
os: ImpersonateOS::Windows,
|
||||
},
|
||||
]
|
||||
}
|
||||
|
||||
/// Safari + Edge + Opera profiles for maximum diversity in Random mode.
|
||||
pub fn extra_profiles() -> Vec<ImpersonateProfile> {
|
||||
vec![
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::SafariV18_5,
|
||||
os: ImpersonateOS::MacOS,
|
||||
},
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::SafariV26,
|
||||
os: ImpersonateOS::MacOS,
|
||||
},
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::EdgeV145,
|
||||
os: ImpersonateOS::Windows,
|
||||
},
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::OperaV127,
|
||||
os: ImpersonateOS::Windows,
|
||||
},
|
||||
]
|
||||
}
|
||||
|
||||
pub fn latest_chrome() -> ImpersonateProfile {
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::ChromeV145,
|
||||
os: ImpersonateOS::Windows,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn latest_firefox() -> ImpersonateProfile {
|
||||
ImpersonateProfile {
|
||||
browser: Impersonate::FirefoxV146,
|
||||
os: ImpersonateOS::Windows,
|
||||
}
|
||||
}
|
||||
798
crates/webclaw-fetch/src/client.rs
Normal file
798
crates/webclaw-fetch/src/client.rs
Normal file
|
|
@ -0,0 +1,798 @@
|
|||
/// HTTP client with browser TLS fingerprint impersonation.
|
||||
/// Wraps primp to provide a simple fetch interface with optional
|
||||
/// content extraction via webclaw-core. Supports single and batch operations.
|
||||
/// Automatically detects PDF responses and extracts text via webclaw-pdf.
|
||||
///
|
||||
/// Two proxy modes:
|
||||
/// - **Static**: single proxy (or none) baked into pre-built clients at construction.
|
||||
/// - **Rotating**: pre-built pool of clients, each with a different proxy + fingerprint.
|
||||
/// Same-host URLs are routed to the same client for HTTP/2 connection reuse.
|
||||
use std::collections::HashMap;
|
||||
use std::hash::{Hash, Hasher};
|
||||
use std::sync::Arc;
|
||||
use std::time::{Duration, Instant};
|
||||
|
||||
use rand::seq::SliceRandom;
|
||||
use tokio::sync::Semaphore;
|
||||
use tracing::{debug, instrument, warn};
|
||||
use webclaw_pdf::PdfMode;
|
||||
|
||||
use crate::browser::{self, BrowserProfile, ImpersonateProfile};
|
||||
use crate::error::FetchError;
|
||||
|
||||
/// Configuration for building a [`FetchClient`].
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct FetchConfig {
|
||||
pub browser: BrowserProfile,
|
||||
/// Single proxy URL. Used when `proxy_pool` is empty.
|
||||
pub proxy: Option<String>,
|
||||
/// Pool of proxy URLs to rotate through.
|
||||
/// When non-empty, each proxy gets a pre-built client with a
|
||||
/// random browser fingerprint. Same-host URLs reuse the same client
|
||||
/// for HTTP/2 connection multiplexing.
|
||||
pub proxy_pool: Vec<String>,
|
||||
pub timeout: Duration,
|
||||
pub follow_redirects: bool,
|
||||
pub max_redirects: u32,
|
||||
pub headers: HashMap<String, String>,
|
||||
pub pdf_mode: PdfMode,
|
||||
}
|
||||
|
||||
impl Default for FetchConfig {
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
browser: BrowserProfile::Chrome,
|
||||
proxy: None,
|
||||
proxy_pool: Vec::new(),
|
||||
timeout: Duration::from_secs(30),
|
||||
follow_redirects: true,
|
||||
max_redirects: 10,
|
||||
headers: HashMap::from([("Accept-Language".to_string(), "en-US,en;q=0.9".to_string())]),
|
||||
pdf_mode: PdfMode::default(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Result of a successful fetch.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct FetchResult {
|
||||
pub html: String,
|
||||
pub status: u16,
|
||||
/// Final URL after any redirects.
|
||||
pub url: String,
|
||||
pub headers: HashMap<String, String>,
|
||||
pub elapsed: Duration,
|
||||
}
|
||||
|
||||
/// Result for a single URL in a batch fetch operation.
|
||||
#[derive(Debug)]
|
||||
pub struct BatchResult {
|
||||
pub url: String,
|
||||
pub result: Result<FetchResult, FetchError>,
|
||||
}
|
||||
|
||||
/// Result for a single URL in a batch fetch-and-extract operation.
|
||||
#[derive(Debug)]
|
||||
pub struct BatchExtractResult {
|
||||
pub url: String,
|
||||
pub result: Result<webclaw_core::ExtractionResult, FetchError>,
|
||||
}
|
||||
|
||||
/// Internal representation of the client pool strategy.
|
||||
enum ClientPool {
|
||||
/// Pre-built clients with a fixed proxy (or no proxy).
|
||||
/// Fingerprint rotation still works via the pool when `random` is true.
|
||||
Static {
|
||||
clients: Vec<primp::Client>,
|
||||
random: bool,
|
||||
},
|
||||
/// Pre-built pool of clients, each with a different proxy + fingerprint.
|
||||
/// Requests pick a client deterministically by host for HTTP/2 connection reuse.
|
||||
Rotating { clients: Vec<primp::Client> },
|
||||
}
|
||||
|
||||
/// HTTP client that impersonates browser TLS fingerprints via primp.
|
||||
///
|
||||
/// Operates in two modes:
|
||||
/// - **Static pool**: pre-built primp clients, optionally with fingerprint rotation.
|
||||
/// Used when no `proxy_pool` is configured. Fast (no per-request construction).
|
||||
/// - **Rotating pool**: pre-built primp clients, one per proxy in the pool.
|
||||
/// Same-host URLs are routed to the same client for HTTP/2 multiplexing.
|
||||
pub struct FetchClient {
|
||||
pool: ClientPool,
|
||||
pdf_mode: PdfMode,
|
||||
}
|
||||
|
||||
impl FetchClient {
|
||||
/// Build a new client from config.
|
||||
///
|
||||
/// When `config.proxy_pool` is non-empty, pre-builds one primp client per proxy,
|
||||
/// each with a randomly assigned fingerprint. Same-host URLs get routed to the
|
||||
/// same client for HTTP/2 connection reuse.
|
||||
///
|
||||
/// When `proxy_pool` is empty, pre-builds primp clients at construction time
|
||||
/// (one per fingerprint for `Random` profiles, one for fixed profiles).
|
||||
pub fn new(config: FetchConfig) -> Result<Self, FetchError> {
|
||||
let profiles = collect_profiles(&config.browser);
|
||||
let pdf_mode = config.pdf_mode.clone();
|
||||
|
||||
let pool = if config.proxy_pool.is_empty() {
|
||||
let clients = profiles
|
||||
.into_iter()
|
||||
.map(|p| build_primp_client(&config, &p, config.proxy.as_deref()))
|
||||
.collect::<Result<Vec<_>, _>>()?;
|
||||
|
||||
let random = matches!(config.browser, BrowserProfile::Random);
|
||||
debug!(
|
||||
count = clients.len(),
|
||||
random, "fetch client ready (static pool)"
|
||||
);
|
||||
|
||||
ClientPool::Static { clients, random }
|
||||
} else {
|
||||
let mut rng = rand::thread_rng();
|
||||
|
||||
let clients = config
|
||||
.proxy_pool
|
||||
.iter()
|
||||
.map(|proxy| {
|
||||
let p = profiles.choose(&mut rng).unwrap().clone();
|
||||
build_primp_client(&config, &p, Some(proxy))
|
||||
})
|
||||
.collect::<Result<Vec<_>, _>>()?;
|
||||
|
||||
debug!(
|
||||
clients = clients.len(),
|
||||
profiles = profiles.len(),
|
||||
"fetch client ready (pre-built rotating pool)"
|
||||
);
|
||||
|
||||
ClientPool::Rotating { clients }
|
||||
};
|
||||
|
||||
Ok(Self { pool, pdf_mode })
|
||||
}
|
||||
|
||||
/// Fetch a URL and return the raw HTML + response metadata.
|
||||
///
|
||||
/// Automatically retries on transient failures (network errors, 5xx, 429)
|
||||
/// with exponential backoff: 0s, 1s, 3s (3 attempts total).
|
||||
#[instrument(skip(self), fields(url = %url))]
|
||||
pub async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
|
||||
let delays = [
|
||||
Duration::ZERO,
|
||||
Duration::from_secs(1),
|
||||
Duration::from_secs(3),
|
||||
];
|
||||
let mut last_err = None;
|
||||
|
||||
for (attempt, delay) in delays.iter().enumerate() {
|
||||
if attempt > 0 {
|
||||
tokio::time::sleep(*delay).await;
|
||||
}
|
||||
|
||||
match self.fetch_once(url).await {
|
||||
Ok(result) => {
|
||||
if is_retryable_status(result.status) && attempt < delays.len() - 1 {
|
||||
warn!(
|
||||
url,
|
||||
status = result.status,
|
||||
attempt = attempt + 1,
|
||||
"retryable status, will retry"
|
||||
);
|
||||
last_err = Some(FetchError::Build(format!("HTTP {}", result.status)));
|
||||
continue;
|
||||
}
|
||||
if attempt > 0 {
|
||||
debug!(url, attempt = attempt + 1, "retry succeeded");
|
||||
}
|
||||
return Ok(result);
|
||||
}
|
||||
Err(e) => {
|
||||
if !is_retryable_error(&e) || attempt == delays.len() - 1 {
|
||||
return Err(e);
|
||||
}
|
||||
warn!(
|
||||
url,
|
||||
error = %e,
|
||||
attempt = attempt + 1,
|
||||
"transient error, will retry"
|
||||
);
|
||||
last_err = Some(e);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Err(last_err.unwrap_or_else(|| FetchError::Build("all retries exhausted".into())))
|
||||
}
|
||||
|
||||
/// Single fetch attempt (no retry).
|
||||
async fn fetch_once(&self, url: &str) -> Result<FetchResult, FetchError> {
|
||||
let start = Instant::now();
|
||||
|
||||
let client = match &self.pool {
|
||||
ClientPool::Static { clients, random } => {
|
||||
if *random {
|
||||
let host = extract_host(url);
|
||||
pick_for_host(clients, &host)
|
||||
} else {
|
||||
&clients[0]
|
||||
}
|
||||
}
|
||||
ClientPool::Rotating { clients } => pick_random(clients),
|
||||
};
|
||||
|
||||
let response = client.get(url).send().await?;
|
||||
|
||||
let status = response.status().as_u16();
|
||||
let final_url = response.url().to_string();
|
||||
|
||||
let headers: HashMap<String, String> = response
|
||||
.headers()
|
||||
.iter()
|
||||
.map(|(k, v)| (k.to_string(), v.to_str().unwrap_or("").to_string()))
|
||||
.collect();
|
||||
|
||||
let html = response
|
||||
.text()
|
||||
.await
|
||||
.map_err(|e| FetchError::BodyDecode(e.to_string()))?;
|
||||
|
||||
let elapsed = start.elapsed();
|
||||
debug!(status, elapsed_ms = %elapsed.as_millis(), "fetch complete");
|
||||
|
||||
Ok(FetchResult {
|
||||
html,
|
||||
status,
|
||||
url: final_url,
|
||||
headers,
|
||||
elapsed,
|
||||
})
|
||||
}
|
||||
|
||||
/// Fetch a URL then extract structured content.
|
||||
///
|
||||
/// Automatically detects PDF responses via Content-Type header and routes
|
||||
/// to webclaw-pdf for text extraction. HTML responses go through webclaw-core.
|
||||
#[instrument(skip(self), fields(url = %url))]
|
||||
pub async fn fetch_and_extract(
|
||||
&self,
|
||||
url: &str,
|
||||
) -> Result<webclaw_core::ExtractionResult, FetchError> {
|
||||
self.fetch_and_extract_with_options(url, &webclaw_core::ExtractionOptions::default())
|
||||
.await
|
||||
}
|
||||
|
||||
/// Fetch a URL then extract structured content with custom extraction options.
|
||||
///
|
||||
/// Same as [`fetch_and_extract`] but accepts `ExtractionOptions` for CSS selector
|
||||
/// filtering, main-content-only mode, etc. Options only apply to HTML responses;
|
||||
/// PDF extraction ignores them (no DOM to filter).
|
||||
#[instrument(skip(self, options), fields(url = %url))]
|
||||
pub async fn fetch_and_extract_with_options(
|
||||
&self,
|
||||
url: &str,
|
||||
options: &webclaw_core::ExtractionOptions,
|
||||
) -> Result<webclaw_core::ExtractionResult, FetchError> {
|
||||
// Reddit fallback: use their JSON API to get post + full comment tree
|
||||
if crate::reddit::is_reddit_url(url) {
|
||||
let json_url = crate::reddit::json_url(url);
|
||||
debug!("reddit detected, fetching {json_url}");
|
||||
|
||||
let client = self.pick_client(&json_url);
|
||||
let response = client.get(&json_url).send().await?;
|
||||
if response.status().is_success() {
|
||||
let bytes = response
|
||||
.bytes()
|
||||
.await
|
||||
.map_err(|e| FetchError::BodyDecode(e.to_string()))?;
|
||||
match crate::reddit::parse_reddit_json(&bytes, url) {
|
||||
Ok(result) => return Ok(result),
|
||||
Err(e) => warn!("reddit json fallback failed: {e}, falling back to HTML"),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let start = Instant::now();
|
||||
let client = self.pick_client(url);
|
||||
let response = client.get(url).send().await?;
|
||||
|
||||
let status = response.status().as_u16();
|
||||
let final_url = response.url().to_string();
|
||||
|
||||
let headers: HashMap<String, String> = response
|
||||
.headers()
|
||||
.iter()
|
||||
.map(|(k, v)| (k.to_string(), v.to_str().unwrap_or("").to_string()))
|
||||
.collect();
|
||||
|
||||
let is_pdf = is_pdf_content_type(&headers);
|
||||
|
||||
if is_pdf {
|
||||
debug!(status, "detected PDF response, using pdf extraction");
|
||||
|
||||
let bytes = response
|
||||
.bytes()
|
||||
.await
|
||||
.map_err(|e| FetchError::BodyDecode(e.to_string()))?;
|
||||
|
||||
let elapsed = start.elapsed();
|
||||
debug!(
|
||||
status,
|
||||
bytes = bytes.len(),
|
||||
elapsed_ms = %elapsed.as_millis(),
|
||||
"PDF fetch complete"
|
||||
);
|
||||
|
||||
let pdf_result = webclaw_pdf::extract_pdf(&bytes, self.pdf_mode.clone())?;
|
||||
Ok(pdf_to_extraction_result(&pdf_result, &final_url))
|
||||
} else {
|
||||
let html = response
|
||||
.text()
|
||||
.await
|
||||
.map_err(|e| FetchError::BodyDecode(e.to_string()))?;
|
||||
|
||||
let elapsed = start.elapsed();
|
||||
debug!(status, elapsed_ms = %elapsed.as_millis(), "fetch complete");
|
||||
|
||||
// LinkedIn: extract from embedded <code> JSON blobs
|
||||
if crate::linkedin::is_linkedin_post(&final_url) {
|
||||
if let Some(result) = crate::linkedin::extract_linkedin_post(&html, &final_url) {
|
||||
debug!("linkedin extraction succeeded");
|
||||
return Ok(result);
|
||||
}
|
||||
debug!("linkedin extraction failed, falling back to standard");
|
||||
}
|
||||
|
||||
let extraction = webclaw_core::extract_with_options(&html, Some(&final_url), options)?;
|
||||
Ok(extraction)
|
||||
}
|
||||
}
|
||||
|
||||
/// Fetch multiple URLs concurrently with bounded parallelism.
|
||||
///
|
||||
/// Spawns one task per URL, bounded by a semaphore. Results are returned
|
||||
/// in the same order as the input URLs, regardless of completion order.
|
||||
pub async fn fetch_batch(
|
||||
self: &Arc<Self>,
|
||||
urls: &[&str],
|
||||
concurrency: usize,
|
||||
) -> Vec<BatchResult> {
|
||||
let semaphore = Arc::new(Semaphore::new(concurrency));
|
||||
let mut handles = Vec::with_capacity(urls.len());
|
||||
|
||||
for (idx, url) in urls.iter().enumerate() {
|
||||
let permit = Arc::clone(&semaphore);
|
||||
let client = Arc::clone(self);
|
||||
let url = url.to_string();
|
||||
|
||||
handles.push(tokio::spawn(async move {
|
||||
let _permit = permit.acquire().await.expect("semaphore closed");
|
||||
let result = client.fetch(&url).await;
|
||||
(idx, BatchResult { url, result })
|
||||
}));
|
||||
}
|
||||
|
||||
collect_ordered(handles, urls.len()).await
|
||||
}
|
||||
|
||||
/// Fetch and extract multiple URLs concurrently with bounded parallelism.
|
||||
///
|
||||
/// Same semantics as [`fetch_batch`] but runs extraction on each response.
|
||||
/// Results preserve input URL order.
|
||||
pub async fn fetch_and_extract_batch(
|
||||
self: &Arc<Self>,
|
||||
urls: &[&str],
|
||||
concurrency: usize,
|
||||
) -> Vec<BatchExtractResult> {
|
||||
let semaphore = Arc::new(Semaphore::new(concurrency));
|
||||
let mut handles = Vec::with_capacity(urls.len());
|
||||
|
||||
for (idx, url) in urls.iter().enumerate() {
|
||||
let permit = Arc::clone(&semaphore);
|
||||
let client = Arc::clone(self);
|
||||
let url = url.to_string();
|
||||
|
||||
handles.push(tokio::spawn(async move {
|
||||
let _permit = permit.acquire().await.expect("semaphore closed");
|
||||
let result = client.fetch_and_extract(&url).await;
|
||||
(idx, BatchExtractResult { url, result })
|
||||
}));
|
||||
}
|
||||
|
||||
collect_ordered(handles, urls.len()).await
|
||||
}
|
||||
|
||||
/// Returns the number of proxies in the rotation pool, or 0 if static mode.
|
||||
pub fn proxy_pool_size(&self) -> usize {
|
||||
match &self.pool {
|
||||
ClientPool::Static { .. } => 0,
|
||||
ClientPool::Rotating { clients } => clients.len(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Pick a client from the pool for a given URL.
|
||||
fn pick_client(&self, url: &str) -> &primp::Client {
|
||||
match &self.pool {
|
||||
ClientPool::Static { clients, random } => {
|
||||
if *random {
|
||||
let host = extract_host(url);
|
||||
pick_for_host(clients, &host)
|
||||
} else {
|
||||
&clients[0]
|
||||
}
|
||||
}
|
||||
ClientPool::Rotating { clients } => pick_random(clients),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Collect the impersonation profiles to use based on the browser profile.
|
||||
fn collect_profiles(profile: &BrowserProfile) -> Vec<ImpersonateProfile> {
|
||||
match profile {
|
||||
BrowserProfile::Random => {
|
||||
let mut profiles = Vec::new();
|
||||
profiles.extend(browser::chrome_profiles());
|
||||
profiles.extend(browser::firefox_profiles());
|
||||
profiles.extend(browser::extra_profiles());
|
||||
profiles
|
||||
}
|
||||
BrowserProfile::Chrome => vec![browser::latest_chrome()],
|
||||
BrowserProfile::Firefox => vec![browser::latest_firefox()],
|
||||
}
|
||||
}
|
||||
|
||||
/// Extract the host from a URL, returning empty string on parse failure.
|
||||
fn extract_host(url: &str) -> String {
|
||||
url::Url::parse(url)
|
||||
.ok()
|
||||
.and_then(|u| u.host_str().map(String::from))
|
||||
.unwrap_or_default()
|
||||
}
|
||||
|
||||
/// Pick a client deterministically based on a host string.
|
||||
/// Same host always gets the same client, enabling HTTP/2 connection reuse.
|
||||
fn pick_for_host<'a>(clients: &'a [primp::Client], host: &str) -> &'a primp::Client {
|
||||
let mut hasher = std::collections::hash_map::DefaultHasher::new();
|
||||
host.hash(&mut hasher);
|
||||
let idx = (hasher.finish() as usize) % clients.len();
|
||||
&clients[idx]
|
||||
}
|
||||
|
||||
/// Pick a random client from the pool for per-request rotation.
|
||||
fn pick_random(clients: &[primp::Client]) -> &primp::Client {
|
||||
use rand::Rng;
|
||||
let idx = rand::thread_rng().gen_range(0..clients.len());
|
||||
&clients[idx]
|
||||
}
|
||||
|
||||
/// Status codes worth retrying: server errors + rate limiting.
|
||||
fn is_retryable_status(status: u16) -> bool {
|
||||
status == 429
|
||||
|| status == 502
|
||||
|| status == 503
|
||||
|| status == 504
|
||||
|| status == 520
|
||||
|| status == 521
|
||||
|| status == 522
|
||||
|| status == 523
|
||||
|| status == 524
|
||||
}
|
||||
|
||||
/// Errors worth retrying: network/connection failures (not client errors).
|
||||
fn is_retryable_error(err: &FetchError) -> bool {
|
||||
matches!(err, FetchError::Request(_) | FetchError::BodyDecode(_))
|
||||
}
|
||||
|
||||
fn is_pdf_content_type(headers: &HashMap<String, String>) -> bool {
|
||||
headers
|
||||
.get("content-type")
|
||||
.map(|ct| {
|
||||
let mime = ct.split(';').next().unwrap_or("").trim();
|
||||
mime.eq_ignore_ascii_case("application/pdf")
|
||||
})
|
||||
.unwrap_or(false)
|
||||
}
|
||||
|
||||
/// Convert a webclaw-pdf PdfResult into a webclaw-core ExtractionResult.
|
||||
fn pdf_to_extraction_result(
|
||||
pdf: &webclaw_pdf::PdfResult,
|
||||
url: &str,
|
||||
) -> webclaw_core::ExtractionResult {
|
||||
let markdown = webclaw_pdf::to_markdown(pdf);
|
||||
let word_count = markdown.split_whitespace().count();
|
||||
|
||||
webclaw_core::ExtractionResult {
|
||||
metadata: webclaw_core::Metadata {
|
||||
title: pdf.metadata.title.clone(),
|
||||
description: pdf.metadata.subject.clone(),
|
||||
author: pdf.metadata.author.clone(),
|
||||
published_date: None,
|
||||
language: None,
|
||||
url: Some(url.to_string()),
|
||||
site_name: None,
|
||||
image: None,
|
||||
favicon: None,
|
||||
word_count,
|
||||
},
|
||||
content: webclaw_core::Content {
|
||||
markdown,
|
||||
plain_text: pdf.text.clone(),
|
||||
links: Vec::new(),
|
||||
images: Vec::new(),
|
||||
code_blocks: Vec::new(),
|
||||
raw_html: None,
|
||||
},
|
||||
domain_data: None,
|
||||
structured_data: vec![],
|
||||
}
|
||||
}
|
||||
|
||||
/// Collect spawned tasks and reorder results to match input order.
|
||||
async fn collect_ordered<T>(
|
||||
handles: Vec<tokio::task::JoinHandle<(usize, T)>>,
|
||||
len: usize,
|
||||
) -> Vec<T> {
|
||||
let mut slots: Vec<Option<T>> = (0..len).map(|_| None).collect();
|
||||
|
||||
for handle in handles {
|
||||
match handle.await {
|
||||
Ok((idx, result)) => {
|
||||
slots[idx] = Some(result);
|
||||
}
|
||||
Err(e) => {
|
||||
warn!(error = %e, "batch task panicked");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
slots.into_iter().flatten().collect()
|
||||
}
|
||||
|
||||
/// Build a single primp Client from config + impersonation profile + optional proxy.
|
||||
fn build_primp_client(
|
||||
config: &FetchConfig,
|
||||
profile: &ImpersonateProfile,
|
||||
proxy: Option<&str>,
|
||||
) -> Result<primp::Client, FetchError> {
|
||||
let redirect_policy = if config.follow_redirects {
|
||||
primp::redirect::Policy::limited(config.max_redirects as usize)
|
||||
} else {
|
||||
primp::redirect::Policy::none()
|
||||
};
|
||||
|
||||
let mut headers = primp::header::HeaderMap::new();
|
||||
for (k, v) in &config.headers {
|
||||
if let (Ok(name), Ok(val)) = (
|
||||
primp::header::HeaderName::from_bytes(k.as_bytes()),
|
||||
primp::header::HeaderValue::from_str(v),
|
||||
) {
|
||||
headers.insert(name, val);
|
||||
}
|
||||
}
|
||||
|
||||
let mut builder = primp::Client::builder()
|
||||
.impersonate(profile.browser)
|
||||
.impersonate_os(profile.os)
|
||||
.cookie_store(true)
|
||||
.timeout(config.timeout)
|
||||
.redirect(redirect_policy)
|
||||
.default_headers(headers);
|
||||
|
||||
if let Some(proxy_url) = proxy {
|
||||
builder = builder
|
||||
.proxy(primp::Proxy::all(proxy_url).map_err(|e| FetchError::Build(e.to_string()))?);
|
||||
}
|
||||
|
||||
builder
|
||||
.build()
|
||||
.map_err(|e| FetchError::Build(e.to_string()))
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_batch_result_struct() {
|
||||
let ok = BatchResult {
|
||||
url: "https://example.com".to_string(),
|
||||
result: Ok(FetchResult {
|
||||
html: "<html></html>".to_string(),
|
||||
status: 200,
|
||||
url: "https://example.com".to_string(),
|
||||
headers: HashMap::new(),
|
||||
elapsed: Duration::from_millis(42),
|
||||
}),
|
||||
};
|
||||
assert_eq!(ok.url, "https://example.com");
|
||||
assert!(ok.result.is_ok());
|
||||
assert_eq!(ok.result.unwrap().status, 200);
|
||||
|
||||
let err = BatchResult {
|
||||
url: "https://bad.example".to_string(),
|
||||
result: Err(FetchError::InvalidUrl("bad url".into())),
|
||||
};
|
||||
assert!(err.result.is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_batch_extract_result_struct() {
|
||||
let err = BatchExtractResult {
|
||||
url: "https://example.com".to_string(),
|
||||
result: Err(FetchError::BodyDecode("timeout".into())),
|
||||
};
|
||||
assert_eq!(err.url, "https://example.com");
|
||||
assert!(err.result.is_err());
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_batch_preserves_order() {
|
||||
let handles: Vec<tokio::task::JoinHandle<(usize, String)>> = vec![
|
||||
tokio::spawn(async { (2, "c".to_string()) }),
|
||||
tokio::spawn(async { (0, "a".to_string()) }),
|
||||
tokio::spawn(async { (1, "b".to_string()) }),
|
||||
];
|
||||
|
||||
let results = collect_ordered(handles, 3).await;
|
||||
assert_eq!(results, vec!["a", "b", "c"]);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_collect_ordered_handles_gaps() {
|
||||
let handles: Vec<tokio::task::JoinHandle<(usize, String)>> = vec![
|
||||
tokio::spawn(async { (0, "first".to_string()) }),
|
||||
tokio::spawn(async { (2, "third".to_string()) }),
|
||||
];
|
||||
|
||||
let results = collect_ordered(handles, 3).await;
|
||||
assert_eq!(results.len(), 2);
|
||||
assert_eq!(results[0], "first");
|
||||
assert_eq!(results[1], "third");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_is_pdf_content_type() {
|
||||
let mut headers = HashMap::new();
|
||||
headers.insert("content-type".to_string(), "application/pdf".to_string());
|
||||
assert!(is_pdf_content_type(&headers));
|
||||
|
||||
headers.insert(
|
||||
"content-type".to_string(),
|
||||
"application/pdf; charset=utf-8".to_string(),
|
||||
);
|
||||
assert!(is_pdf_content_type(&headers));
|
||||
|
||||
headers.insert("content-type".to_string(), "Application/PDF".to_string());
|
||||
assert!(is_pdf_content_type(&headers));
|
||||
|
||||
headers.insert("content-type".to_string(), "text/html".to_string());
|
||||
assert!(!is_pdf_content_type(&headers));
|
||||
|
||||
let empty: HashMap<String, String> = HashMap::new();
|
||||
assert!(!is_pdf_content_type(&empty));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_pdf_to_extraction_result() {
|
||||
let pdf = webclaw_pdf::PdfResult {
|
||||
text: "Hello from PDF.".into(),
|
||||
page_count: 2,
|
||||
metadata: webclaw_pdf::PdfMetadata {
|
||||
title: Some("My Doc".into()),
|
||||
author: Some("Author".into()),
|
||||
subject: Some("Testing".into()),
|
||||
creator: None,
|
||||
},
|
||||
};
|
||||
|
||||
let result = pdf_to_extraction_result(&pdf, "https://example.com/doc.pdf");
|
||||
|
||||
assert_eq!(result.metadata.title.as_deref(), Some("My Doc"));
|
||||
assert_eq!(result.metadata.author.as_deref(), Some("Author"));
|
||||
assert_eq!(result.metadata.description.as_deref(), Some("Testing"));
|
||||
assert_eq!(
|
||||
result.metadata.url.as_deref(),
|
||||
Some("https://example.com/doc.pdf")
|
||||
);
|
||||
assert!(result.content.markdown.contains("# My Doc"));
|
||||
assert!(result.content.markdown.contains("Hello from PDF."));
|
||||
assert_eq!(result.content.plain_text, "Hello from PDF.");
|
||||
assert!(result.content.links.is_empty());
|
||||
assert!(result.domain_data.is_none());
|
||||
assert!(result.metadata.word_count > 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_static_pool_no_proxy() {
|
||||
let config = FetchConfig::default();
|
||||
let client = FetchClient::new(config).unwrap();
|
||||
assert_eq!(client.proxy_pool_size(), 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_rotating_pool_prebuilds_clients() {
|
||||
let config = FetchConfig {
|
||||
proxy_pool: vec![
|
||||
"http://proxy1:8080".into(),
|
||||
"http://proxy2:8080".into(),
|
||||
"http://proxy3:8080".into(),
|
||||
],
|
||||
..Default::default()
|
||||
};
|
||||
let client = FetchClient::new(config).unwrap();
|
||||
assert_eq!(client.proxy_pool_size(), 3);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_pick_for_host_deterministic() {
|
||||
let config = FetchConfig {
|
||||
browser: BrowserProfile::Random,
|
||||
..Default::default()
|
||||
};
|
||||
let client = FetchClient::new(config).unwrap();
|
||||
|
||||
let clients = match &client.pool {
|
||||
ClientPool::Static { clients, .. } => clients,
|
||||
ClientPool::Rotating { clients } => clients,
|
||||
};
|
||||
|
||||
let a1 = pick_for_host(clients, "example.com") as *const _;
|
||||
let a2 = pick_for_host(clients, "example.com") as *const _;
|
||||
let a3 = pick_for_host(clients, "example.com") as *const _;
|
||||
assert_eq!(a1, a2);
|
||||
assert_eq!(a2, a3);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_pick_for_host_distributes() {
|
||||
let config = FetchConfig {
|
||||
proxy_pool: (0..10).map(|i| format!("http://proxy{i}:8080")).collect(),
|
||||
..Default::default()
|
||||
};
|
||||
let client = FetchClient::new(config).unwrap();
|
||||
|
||||
let clients = match &client.pool {
|
||||
ClientPool::Static { clients, .. } | ClientPool::Rotating { clients } => clients,
|
||||
};
|
||||
|
||||
let hosts = [
|
||||
"example.com",
|
||||
"google.com",
|
||||
"github.com",
|
||||
"rust-lang.org",
|
||||
"crates.io",
|
||||
];
|
||||
|
||||
let indices: Vec<usize> = hosts
|
||||
.iter()
|
||||
.map(|h| {
|
||||
let ptr = pick_for_host(clients, h) as *const _;
|
||||
clients.iter().position(|c| std::ptr::eq(c, ptr)).unwrap()
|
||||
})
|
||||
.collect();
|
||||
|
||||
let unique: std::collections::HashSet<_> = indices.iter().collect();
|
||||
assert!(
|
||||
unique.len() >= 2,
|
||||
"expected host distribution across clients, got indices: {indices:?}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_extract_host() {
|
||||
assert_eq!(extract_host("https://example.com/path"), "example.com");
|
||||
assert_eq!(
|
||||
extract_host("https://sub.example.com:8080/foo"),
|
||||
"sub.example.com"
|
||||
);
|
||||
assert_eq!(extract_host("not-a-url"), "");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_default_config_has_empty_proxy_pool() {
|
||||
let config = FetchConfig::default();
|
||||
assert!(config.proxy_pool.is_empty());
|
||||
assert!(config.proxy.is_none());
|
||||
}
|
||||
}
|
||||
557
crates/webclaw-fetch/src/crawler.rs
Normal file
557
crates/webclaw-fetch/src/crawler.rs
Normal file
|
|
@ -0,0 +1,557 @@
|
|||
/// Recursive same-origin web crawler built on top of [`FetchClient`].
|
||||
///
|
||||
/// Starts from a seed URL, extracts content, discovers links, and follows
|
||||
/// them breadth-first up to a configurable depth/page limit. Uses a semaphore
|
||||
/// for bounded concurrency and per-request delays for politeness.
|
||||
///
|
||||
/// When `use_sitemap` is enabled, the crawler first discovers URLs from the
|
||||
/// site's sitemaps and seeds the BFS frontier before crawling.
|
||||
use std::collections::HashSet;
|
||||
use std::sync::Arc;
|
||||
use std::time::{Duration, Instant};
|
||||
|
||||
use serde::{Deserialize, Serialize};
|
||||
use tokio::sync::Semaphore;
|
||||
use tracing::{debug, info, warn};
|
||||
use url::Url;
|
||||
|
||||
use crate::client::{FetchClient, FetchConfig};
|
||||
use crate::error::FetchError;
|
||||
use crate::sitemap;
|
||||
|
||||
/// Controls crawl scope, depth, concurrency, and politeness.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct CrawlConfig {
|
||||
/// Fetch configuration (browser profile, proxy, timeout, etc.)
|
||||
pub fetch: FetchConfig,
|
||||
/// How deep to follow links. 1 = only immediate links from seed page.
|
||||
pub max_depth: usize,
|
||||
/// Hard cap on total pages fetched (including the seed).
|
||||
pub max_pages: usize,
|
||||
/// Max concurrent in-flight requests.
|
||||
pub concurrency: usize,
|
||||
/// Minimum delay before each request (politeness).
|
||||
pub delay: Duration,
|
||||
/// Only follow URLs whose path starts with this prefix (e.g. "/docs/").
|
||||
pub path_prefix: Option<String>,
|
||||
/// Seed BFS frontier from sitemap discovery before crawling.
|
||||
pub use_sitemap: bool,
|
||||
/// Glob patterns for paths to include. If non-empty, only matching URLs are crawled.
|
||||
/// E.g. `["/api/*", "/guides/*"]` — matched against the URL path.
|
||||
pub include_patterns: Vec<String>,
|
||||
/// Glob patterns for paths to exclude. Checked after include_patterns.
|
||||
/// E.g. `["/changelog/*", "/blog/*"]` — matching URLs are skipped.
|
||||
pub exclude_patterns: Vec<String>,
|
||||
/// Optional channel sender for streaming per-page results as they complete.
|
||||
/// When set, each `PageResult` is sent on this channel immediately after extraction.
|
||||
pub progress_tx: Option<tokio::sync::broadcast::Sender<PageResult>>,
|
||||
}
|
||||
|
||||
impl Default for CrawlConfig {
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
fetch: FetchConfig::default(),
|
||||
max_depth: 1,
|
||||
max_pages: 50,
|
||||
concurrency: 5,
|
||||
delay: Duration::from_millis(100),
|
||||
path_prefix: None,
|
||||
use_sitemap: false,
|
||||
include_patterns: Vec::new(),
|
||||
exclude_patterns: Vec::new(),
|
||||
progress_tx: None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Aggregated results from a crawl run.
|
||||
#[derive(Debug, Serialize, Deserialize)]
|
||||
pub struct CrawlResult {
|
||||
pub pages: Vec<PageResult>,
|
||||
pub total: usize,
|
||||
pub ok: usize,
|
||||
pub errors: usize,
|
||||
pub elapsed_secs: f64,
|
||||
}
|
||||
|
||||
/// Outcome of extracting a single page during the crawl.
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct PageResult {
|
||||
pub url: String,
|
||||
pub depth: usize,
|
||||
pub extraction: Option<webclaw_core::ExtractionResult>,
|
||||
pub error: Option<String>,
|
||||
#[serde(skip)]
|
||||
pub elapsed: Duration,
|
||||
}
|
||||
|
||||
/// Recursive crawler that wraps a shared [`FetchClient`].
|
||||
pub struct Crawler {
|
||||
client: Arc<FetchClient>,
|
||||
config: CrawlConfig,
|
||||
seed_origin: String,
|
||||
}
|
||||
|
||||
impl Crawler {
|
||||
/// Build a new crawler from a seed URL and config.
|
||||
/// Constructs the underlying `FetchClient` from `config.fetch`.
|
||||
pub fn new(seed_url: &str, config: CrawlConfig) -> Result<Self, FetchError> {
|
||||
let seed = Url::parse(seed_url).map_err(|_| FetchError::InvalidUrl(seed_url.into()))?;
|
||||
let seed_origin = origin_key(&seed);
|
||||
|
||||
let client = FetchClient::new(config.fetch.clone())?;
|
||||
|
||||
Ok(Self {
|
||||
client: Arc::new(client),
|
||||
config,
|
||||
seed_origin,
|
||||
})
|
||||
}
|
||||
|
||||
/// Crawl starting from `start_url`, returning results for every page visited.
|
||||
///
|
||||
/// Uses breadth-first traversal: all pages at depth N are fetched (concurrently,
|
||||
/// bounded by `config.concurrency`) before moving to depth N+1.
|
||||
///
|
||||
/// When `config.use_sitemap` is true, sitemap URLs are discovered first and
|
||||
/// added to the initial frontier at depth 0 alongside the seed URL.
|
||||
pub async fn crawl(&self, start_url: &str) -> CrawlResult {
|
||||
let start = Instant::now();
|
||||
|
||||
let seed = match Url::parse(start_url) {
|
||||
Ok(u) => u,
|
||||
Err(_) => {
|
||||
return CrawlResult {
|
||||
pages: vec![PageResult {
|
||||
url: start_url.to_string(),
|
||||
depth: 0,
|
||||
extraction: None,
|
||||
error: Some(format!("invalid URL: {start_url}")),
|
||||
elapsed: Duration::ZERO,
|
||||
}],
|
||||
total: 1,
|
||||
ok: 0,
|
||||
errors: 1,
|
||||
elapsed_secs: 0.0,
|
||||
};
|
||||
}
|
||||
};
|
||||
|
||||
let semaphore = Arc::new(Semaphore::new(self.config.concurrency));
|
||||
let mut visited: HashSet<String> = HashSet::new();
|
||||
let mut pages: Vec<PageResult> = Vec::new();
|
||||
|
||||
// BFS frontier: vec of (normalized_url, depth) for the current level
|
||||
let mut frontier: Vec<(String, usize)> = vec![(normalize(&seed), 0)];
|
||||
|
||||
// Seed frontier from sitemap if enabled
|
||||
if self.config.use_sitemap {
|
||||
let base_url = format!("{}://{}", seed.scheme(), seed.host_str().unwrap_or(""));
|
||||
match sitemap::discover(&self.client, &base_url).await {
|
||||
Ok(entries) => {
|
||||
let before = frontier.len();
|
||||
for entry in entries {
|
||||
if self.qualify_link(&entry.url, &visited).is_some() {
|
||||
let parsed = match Url::parse(&entry.url) {
|
||||
Ok(u) => u,
|
||||
Err(_) => continue,
|
||||
};
|
||||
let norm = normalize(&parsed);
|
||||
frontier.push((norm, 0));
|
||||
}
|
||||
}
|
||||
let added = frontier.len() - before;
|
||||
info!(
|
||||
sitemap_urls = added,
|
||||
"seeded frontier from sitemap discovery"
|
||||
);
|
||||
}
|
||||
Err(e) => {
|
||||
warn!(error = %e, "sitemap discovery failed, continuing with seed URL only");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
while !frontier.is_empty() && pages.len() < self.config.max_pages {
|
||||
// Dedup this level's frontier against the visited set and page cap
|
||||
let batch: Vec<(String, usize)> = frontier
|
||||
.drain(..)
|
||||
.filter(|(url, _)| visited.insert(url.clone()))
|
||||
.take(self.config.max_pages.saturating_sub(pages.len()))
|
||||
.collect();
|
||||
|
||||
if batch.is_empty() {
|
||||
break;
|
||||
}
|
||||
|
||||
// Spawn one task per URL, bounded by semaphore
|
||||
let mut handles = Vec::with_capacity(batch.len());
|
||||
|
||||
for (url, depth) in &batch {
|
||||
let permit = Arc::clone(&semaphore);
|
||||
let client = Arc::clone(&self.client);
|
||||
let url = url.clone();
|
||||
let depth = *depth;
|
||||
let delay = self.config.delay;
|
||||
|
||||
handles.push(tokio::spawn(async move {
|
||||
// Acquire permit — blocks if concurrency limit reached
|
||||
let _permit = permit.acquire().await.expect("semaphore closed");
|
||||
tokio::time::sleep(delay).await;
|
||||
|
||||
let page_start = Instant::now();
|
||||
let result = client.fetch_and_extract(&url).await;
|
||||
let elapsed = page_start.elapsed();
|
||||
|
||||
match result {
|
||||
Ok(extraction) => {
|
||||
debug!(
|
||||
url = %url, depth,
|
||||
elapsed_ms = %elapsed.as_millis(),
|
||||
"page extracted"
|
||||
);
|
||||
PageResult {
|
||||
url,
|
||||
depth,
|
||||
extraction: Some(extraction),
|
||||
error: None,
|
||||
elapsed,
|
||||
}
|
||||
}
|
||||
Err(e) => {
|
||||
warn!(url = %url, depth, error = %e, "page failed");
|
||||
PageResult {
|
||||
url,
|
||||
depth,
|
||||
extraction: None,
|
||||
error: Some(e.to_string()),
|
||||
elapsed,
|
||||
}
|
||||
}
|
||||
}
|
||||
}));
|
||||
}
|
||||
|
||||
// Collect results and harvest links for the next depth level
|
||||
let mut next_frontier: Vec<(String, usize)> = Vec::new();
|
||||
|
||||
for handle in handles {
|
||||
let page = match handle.await {
|
||||
Ok(page) => page,
|
||||
Err(e) => {
|
||||
warn!(error = %e, "crawl task panicked");
|
||||
continue;
|
||||
}
|
||||
};
|
||||
let depth = page.depth;
|
||||
|
||||
if depth < self.config.max_depth
|
||||
&& let Some(ref extraction) = page.extraction
|
||||
{
|
||||
for link in &extraction.content.links {
|
||||
if let Some(candidate) = self.qualify_link(&link.href, &visited) {
|
||||
next_frontier.push((candidate, depth + 1));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Stream progress if a channel is configured
|
||||
if let Some(tx) = &self.config.progress_tx {
|
||||
let _ = tx.send(page.clone());
|
||||
}
|
||||
|
||||
pages.push(page);
|
||||
|
||||
if pages.len() >= self.config.max_pages {
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
frontier = next_frontier;
|
||||
}
|
||||
|
||||
let total_elapsed = start.elapsed();
|
||||
let ok_count = pages.iter().filter(|p| p.extraction.is_some()).count();
|
||||
let err_count = pages.len() - ok_count;
|
||||
info!(
|
||||
total = pages.len(),
|
||||
ok = ok_count,
|
||||
errors = err_count,
|
||||
elapsed_ms = %total_elapsed.as_millis(),
|
||||
"crawl complete"
|
||||
);
|
||||
|
||||
CrawlResult {
|
||||
total: pages.len(),
|
||||
ok: ok_count,
|
||||
errors: err_count,
|
||||
elapsed_secs: total_elapsed.as_secs_f64(),
|
||||
pages,
|
||||
}
|
||||
}
|
||||
|
||||
/// Check if a discovered link should be added to the frontier.
|
||||
/// Returns `Some(normalized_url)` if it passes all filters, `None` otherwise.
|
||||
fn qualify_link(&self, href: &str, visited: &HashSet<String>) -> Option<String> {
|
||||
let parsed = Url::parse(href).ok()?;
|
||||
|
||||
// Only http(s) schemes
|
||||
match parsed.scheme() {
|
||||
"http" | "https" => {}
|
||||
_ => return None,
|
||||
}
|
||||
|
||||
// Same-origin check (scheme + host + port)
|
||||
if origin_key(&parsed) != self.seed_origin {
|
||||
return None;
|
||||
}
|
||||
|
||||
// Path prefix filter
|
||||
if let Some(ref prefix) = self.config.path_prefix
|
||||
&& !parsed.path().starts_with(prefix.as_str())
|
||||
{
|
||||
return None;
|
||||
}
|
||||
|
||||
// Include patterns: if any are set, path must match at least one
|
||||
let path = parsed.path();
|
||||
if !self.config.include_patterns.is_empty()
|
||||
&& !self
|
||||
.config
|
||||
.include_patterns
|
||||
.iter()
|
||||
.any(|pat| glob_match(pat, path))
|
||||
{
|
||||
return None;
|
||||
}
|
||||
|
||||
// Exclude patterns: if path matches any, skip
|
||||
if self
|
||||
.config
|
||||
.exclude_patterns
|
||||
.iter()
|
||||
.any(|pat| glob_match(pat, path))
|
||||
{
|
||||
return None;
|
||||
}
|
||||
|
||||
// Skip common non-page file extensions
|
||||
const SKIP_EXTENSIONS: &[&str] = &[
|
||||
".pdf", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".ico", ".css", ".js",
|
||||
".zip", ".tar", ".gz", ".xml", ".rss", ".mp3", ".mp4", ".avi", ".mov", ".woff",
|
||||
".woff2", ".ttf", ".eot",
|
||||
];
|
||||
if SKIP_EXTENSIONS.iter().any(|ext| path.ends_with(ext)) {
|
||||
return None;
|
||||
}
|
||||
|
||||
let normalized = normalize(&parsed);
|
||||
|
||||
if visited.contains(&normalized) {
|
||||
return None;
|
||||
}
|
||||
|
||||
Some(normalized)
|
||||
}
|
||||
}
|
||||
|
||||
/// Canonical origin string for comparing same-origin: "scheme://host[:port]".
|
||||
fn origin_key(url: &Url) -> String {
|
||||
let port_suffix = match url.port() {
|
||||
Some(p) => format!(":{p}"),
|
||||
None => String::new(),
|
||||
};
|
||||
let host = url.host_str().unwrap_or("");
|
||||
let host = host.strip_prefix("www.").unwrap_or(host);
|
||||
format!("{}://{}{}", url.scheme(), host, port_suffix)
|
||||
}
|
||||
|
||||
/// Normalize a URL for dedup: strip fragment, remove trailing slash (except root "/"),
|
||||
/// lowercase scheme + host. Preserves query params and path case.
|
||||
fn normalize(url: &Url) -> String {
|
||||
let scheme = url.scheme();
|
||||
let host = url.host_str().unwrap_or("").to_ascii_lowercase();
|
||||
let port_suffix = match url.port() {
|
||||
Some(p) => format!(":{p}"),
|
||||
None => String::new(),
|
||||
};
|
||||
|
||||
let mut path = url.path().to_string();
|
||||
if path.len() > 1 && path.ends_with('/') {
|
||||
path.pop();
|
||||
}
|
||||
|
||||
let query = match url.query() {
|
||||
Some(q) => format!("?{q}"),
|
||||
None => String::new(),
|
||||
};
|
||||
|
||||
// Fragment intentionally omitted
|
||||
format!("{scheme}://{host}{port_suffix}{path}{query}")
|
||||
}
|
||||
|
||||
/// Simple glob matching for URL paths. Supports:
|
||||
/// - `*` matches any characters within a single path segment (no `/`)
|
||||
/// - `**` matches any characters including `/` (any number of segments)
|
||||
/// - Literal characters match exactly
|
||||
///
|
||||
/// Examples:
|
||||
/// - `/api/*` matches `/api/users` but not `/api/users/123`
|
||||
/// - `/api/**` matches `/api/users`, `/api/users/123`, `/api/a/b/c`
|
||||
/// - `/docs/*/intro` matches `/docs/v2/intro`
|
||||
fn glob_match(pattern: &str, path: &str) -> bool {
|
||||
glob_match_inner(pattern.as_bytes(), path.as_bytes())
|
||||
}
|
||||
|
||||
fn glob_match_inner(pat: &[u8], text: &[u8]) -> bool {
|
||||
let mut pi = 0;
|
||||
let mut ti = 0;
|
||||
let mut star_pi = usize::MAX;
|
||||
let mut star_ti = 0;
|
||||
|
||||
while ti < text.len() {
|
||||
if pi < pat.len() && pat[pi] == b'*' && pi + 1 < pat.len() && pat[pi + 1] == b'*' {
|
||||
// `**` — match everything including slashes
|
||||
// Skip all consecutive `*`
|
||||
while pi < pat.len() && pat[pi] == b'*' {
|
||||
pi += 1;
|
||||
}
|
||||
// Skip trailing `/` after `**`
|
||||
if pi < pat.len() && pat[pi] == b'/' {
|
||||
pi += 1;
|
||||
}
|
||||
if pi >= pat.len() {
|
||||
return true; // `**` at end matches everything
|
||||
}
|
||||
// Try matching the rest of pattern against every suffix of text
|
||||
for start in ti..=text.len() {
|
||||
if glob_match_inner(&pat[pi..], &text[start..]) {
|
||||
return true;
|
||||
}
|
||||
}
|
||||
return false;
|
||||
} else if pi < pat.len() && pat[pi] == b'*' {
|
||||
// `*` — match any chars except `/`
|
||||
star_pi = pi;
|
||||
star_ti = ti;
|
||||
pi += 1;
|
||||
} else if pi < pat.len() && (pat[pi] == text[ti] || pat[pi] == b'?') {
|
||||
pi += 1;
|
||||
ti += 1;
|
||||
} else if star_pi != usize::MAX {
|
||||
// Backtrack: `*` absorbs one more char (but not `/`)
|
||||
if text[star_ti] == b'/' {
|
||||
return false;
|
||||
}
|
||||
star_ti += 1;
|
||||
ti = star_ti;
|
||||
pi = star_pi + 1;
|
||||
} else {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
// Consume trailing `*` or `**` in pattern
|
||||
while pi < pat.len() && pat[pi] == b'*' {
|
||||
pi += 1;
|
||||
}
|
||||
|
||||
pi >= pat.len()
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn normalize_strips_fragment() {
|
||||
let url = Url::parse("https://example.com/page#section").unwrap();
|
||||
assert_eq!(normalize(&url), "https://example.com/page");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn normalize_strips_trailing_slash() {
|
||||
let url = Url::parse("https://example.com/docs/").unwrap();
|
||||
assert_eq!(normalize(&url), "https://example.com/docs");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn normalize_keeps_root_slash() {
|
||||
let url = Url::parse("https://example.com/").unwrap();
|
||||
assert_eq!(normalize(&url), "https://example.com/");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn normalize_preserves_query() {
|
||||
let url = Url::parse("https://example.com/search?q=rust&page=2").unwrap();
|
||||
assert_eq!(normalize(&url), "https://example.com/search?q=rust&page=2");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn normalize_lowercases_host() {
|
||||
let url = Url::parse("https://Example.COM/Path").unwrap();
|
||||
assert_eq!(normalize(&url), "https://example.com/Path");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn origin_includes_explicit_port() {
|
||||
let url = Url::parse("https://example.com:8443/foo").unwrap();
|
||||
assert_eq!(origin_key(&url), "https://example.com:8443");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn origin_omits_default_port() {
|
||||
let url = Url::parse("https://example.com/foo").unwrap();
|
||||
assert_eq!(origin_key(&url), "https://example.com");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn different_schemes_are_different_origins() {
|
||||
let http = Url::parse("http://example.com/").unwrap();
|
||||
let https = Url::parse("https://example.com/").unwrap();
|
||||
assert_ne!(origin_key(&http), origin_key(&https));
|
||||
}
|
||||
|
||||
// -- glob_match tests --
|
||||
|
||||
#[test]
|
||||
fn glob_star_matches_single_segment() {
|
||||
assert!(glob_match("/api/*", "/api/users"));
|
||||
assert!(glob_match("/api/*", "/api/products"));
|
||||
assert!(!glob_match("/api/*", "/api/users/123"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn glob_doublestar_matches_multiple_segments() {
|
||||
assert!(glob_match("/api/**", "/api/users"));
|
||||
assert!(glob_match("/api/**", "/api/users/123"));
|
||||
assert!(glob_match("/api/**", "/api/a/b/c/d"));
|
||||
assert!(!glob_match("/api/**", "/docs/intro"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn glob_exact_match() {
|
||||
assert!(glob_match("/about", "/about"));
|
||||
assert!(!glob_match("/about", "/about/team"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn glob_middle_wildcard() {
|
||||
assert!(glob_match("/docs/*/intro", "/docs/v2/intro"));
|
||||
assert!(!glob_match("/docs/*/intro", "/docs/v2/v3/intro"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn glob_no_pattern_matches_nothing() {
|
||||
// Empty pattern only matches empty string
|
||||
assert!(glob_match("", ""));
|
||||
assert!(!glob_match("", "/foo"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn glob_trailing_star() {
|
||||
assert!(glob_match("/blog*", "/blog"));
|
||||
assert!(glob_match("/blog*", "/blog-post"));
|
||||
assert!(!glob_match("/blog*", "/blog/post")); // * doesn't cross /
|
||||
}
|
||||
}
|
||||
24
crates/webclaw-fetch/src/error.rs
Normal file
24
crates/webclaw-fetch/src/error.rs
Normal file
|
|
@ -0,0 +1,24 @@
|
|||
/// Fetch-layer errors. Wraps primp/network failures into a single type
|
||||
/// that callers can match on without leaking transport details.
|
||||
use thiserror::Error;
|
||||
|
||||
#[derive(Debug, Error)]
|
||||
pub enum FetchError {
|
||||
#[error("request failed: {0}")]
|
||||
Request(#[from] primp::Error),
|
||||
|
||||
#[error("invalid url: {0}")]
|
||||
InvalidUrl(String),
|
||||
|
||||
#[error("response body decode failed: {0}")]
|
||||
BodyDecode(String),
|
||||
|
||||
#[error("extraction failed: {0}")]
|
||||
Extraction(#[from] webclaw_core::ExtractError),
|
||||
|
||||
#[error("PDF extraction failed: {0}")]
|
||||
Pdf(#[from] webclaw_pdf::PdfError),
|
||||
|
||||
#[error("client build failed: {0}")]
|
||||
Build(String),
|
||||
}
|
||||
20
crates/webclaw-fetch/src/lib.rs
Normal file
20
crates/webclaw-fetch/src/lib.rs
Normal file
|
|
@ -0,0 +1,20 @@
|
|||
/// webclaw-fetch: HTTP client layer with browser TLS fingerprint impersonation.
|
||||
/// Uses Impit under the hood to make requests that look like real
|
||||
/// browsers at the TLS, HTTP/2, and header levels.
|
||||
/// Automatically detects PDF responses and delegates to webclaw-pdf.
|
||||
pub mod browser;
|
||||
pub mod client;
|
||||
pub mod crawler;
|
||||
pub mod error;
|
||||
pub mod linkedin;
|
||||
pub mod proxy;
|
||||
pub mod reddit;
|
||||
pub mod sitemap;
|
||||
|
||||
pub use browser::BrowserProfile;
|
||||
pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult};
|
||||
pub use crawler::{CrawlConfig, CrawlResult, Crawler, PageResult};
|
||||
pub use error::FetchError;
|
||||
pub use proxy::{parse_proxy_file, parse_proxy_line};
|
||||
pub use sitemap::SitemapEntry;
|
||||
pub use webclaw_pdf::PdfMode;
|
||||
279
crates/webclaw-fetch/src/linkedin.rs
Normal file
279
crates/webclaw-fetch/src/linkedin.rs
Normal file
|
|
@ -0,0 +1,279 @@
|
|||
/// LinkedIn post extraction from authenticated HTML.
|
||||
///
|
||||
/// LinkedIn's SPA stores all data in `<code>` tags as HTML-escaped JSON.
|
||||
/// The `included` array contains typed entities: Update (post), Comment,
|
||||
/// Profile, etc. We parse these to reconstruct post + comments as markdown.
|
||||
use serde_json::Value;
|
||||
use tracing::debug;
|
||||
use webclaw_core::{Content, ExtractionResult, Metadata};
|
||||
|
||||
/// Check if a URL is a LinkedIn post/activity.
|
||||
pub fn is_linkedin_post(url: &str) -> bool {
|
||||
let host = url
|
||||
.split("://")
|
||||
.nth(1)
|
||||
.unwrap_or(url)
|
||||
.split('/')
|
||||
.next()
|
||||
.unwrap_or("");
|
||||
(host == "www.linkedin.com" || host == "linkedin.com")
|
||||
&& (url.contains("/feed/update/") || url.contains("/posts/"))
|
||||
}
|
||||
|
||||
/// Extract `<code>` block contents from HTML using simple string scanning.
|
||||
/// LinkedIn wraps JSON data in `<code>` tags with HTML-escaped content.
|
||||
fn extract_code_blocks(html: &str) -> Vec<String> {
|
||||
let mut blocks = Vec::new();
|
||||
let mut search_from = 0;
|
||||
while let Some(start) = html[search_from..].find("<code") {
|
||||
let abs_start = search_from + start;
|
||||
// Find end of opening tag
|
||||
let Some(tag_end) = html[abs_start..].find('>') else {
|
||||
break;
|
||||
};
|
||||
let content_start = abs_start + tag_end + 1;
|
||||
let Some(end) = html[content_start..].find("</code>") else {
|
||||
break;
|
||||
};
|
||||
let content = &html[content_start..content_start + end];
|
||||
if content.len() > 1000 {
|
||||
blocks.push(html_unescape(content));
|
||||
}
|
||||
search_from = content_start + end + 7;
|
||||
}
|
||||
blocks
|
||||
}
|
||||
|
||||
/// Extract post + comments from LinkedIn's SSR HTML (requires auth cookies).
|
||||
pub fn extract_linkedin_post(html: &str, url: &str) -> Option<ExtractionResult> {
|
||||
let code_blocks = extract_code_blocks(html);
|
||||
|
||||
// Find the largest <code> block with "included" — that's the main data payload
|
||||
let mut best_included: Option<Vec<Value>> = None;
|
||||
for raw in &code_blocks {
|
||||
if let Ok(obj) = serde_json::from_str::<Value>(raw)
|
||||
&& let Some(arr) = obj.get("included").and_then(|v| v.as_array())
|
||||
{
|
||||
let current_len = best_included.as_ref().map(|a| a.len()).unwrap_or(0);
|
||||
if arr.len() > current_len {
|
||||
best_included = Some(arr.clone());
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let included = best_included?;
|
||||
debug!(entities = included.len(), "linkedin: found included array");
|
||||
|
||||
// Collect profiles (entityUrn → "First Last")
|
||||
let mut profiles = std::collections::HashMap::new();
|
||||
for item in &included {
|
||||
let t = item.get("$type").and_then(|v| v.as_str()).unwrap_or("");
|
||||
if t.contains("Profile") {
|
||||
let urn = item.get("entityUrn").and_then(|v| v.as_str()).unwrap_or("");
|
||||
let first = item.get("firstName").and_then(|v| v.as_str()).unwrap_or("");
|
||||
let last = item.get("lastName").and_then(|v| v.as_str()).unwrap_or("");
|
||||
let headline = item.get("headline").and_then(|v| v.as_str()).unwrap_or("");
|
||||
if !first.is_empty() {
|
||||
profiles.insert(
|
||||
urn.to_string(),
|
||||
(
|
||||
format!("{first} {last}").trim().to_string(),
|
||||
headline.to_string(),
|
||||
),
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Find the main post (Update type)
|
||||
let mut markdown = String::new();
|
||||
let mut post_author = String::new();
|
||||
let mut post_headline = String::new();
|
||||
|
||||
for item in &included {
|
||||
let t = item.get("$type").and_then(|v| v.as_str()).unwrap_or("");
|
||||
if !t.contains("Update") {
|
||||
continue;
|
||||
}
|
||||
|
||||
// Get author from actor profile
|
||||
if let Some(actor) = item.get("actor") {
|
||||
// actor can have a nested profile reference or inline data
|
||||
let author_urn = actor
|
||||
.get("*author")
|
||||
.or(actor.get("author"))
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("");
|
||||
if let Some((name, headline)) = profiles.get(author_urn) {
|
||||
post_author = name.clone();
|
||||
post_headline = headline.clone();
|
||||
}
|
||||
// Or inline name
|
||||
if post_author.is_empty()
|
||||
&& let Some(name) = actor.get("name").and_then(|v| v.as_object())
|
||||
{
|
||||
let text = name.get("text").and_then(|v| v.as_str()).unwrap_or("");
|
||||
if !text.is_empty() {
|
||||
post_author = text.to_string();
|
||||
}
|
||||
}
|
||||
if post_headline.is_empty()
|
||||
&& let Some(desc) = actor.get("description").and_then(|v| v.as_object())
|
||||
{
|
||||
let text = desc.get("text").and_then(|v| v.as_str()).unwrap_or("");
|
||||
if !text.is_empty() {
|
||||
post_headline = text.to_string();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Get post body from commentary
|
||||
if let Some(commentary) = item.get("commentary")
|
||||
&& let Some(text) = commentary
|
||||
.get("text")
|
||||
.and_then(|v| v.as_object())
|
||||
.and_then(|o| o.get("text"))
|
||||
.and_then(|v| v.as_str())
|
||||
{
|
||||
if !post_author.is_empty() {
|
||||
markdown.push_str(&format!("# {post_author}\n\n"));
|
||||
}
|
||||
if !post_headline.is_empty() {
|
||||
markdown.push_str(&format!("*{post_headline}*\n\n"));
|
||||
}
|
||||
markdown.push_str("---\n\n");
|
||||
// Unescape literal \n from JSON
|
||||
markdown.push_str(&text.replace("\\n", "\n"));
|
||||
markdown.push_str("\n\n");
|
||||
}
|
||||
}
|
||||
|
||||
if markdown.is_empty() {
|
||||
return None;
|
||||
}
|
||||
|
||||
// Collect comments — LinkedIn stores comment text in `commentary.text`
|
||||
// and commenter name in `commenter.name.text`
|
||||
let mut comments: Vec<(String, String)> = Vec::new();
|
||||
for item in &included {
|
||||
let t = item.get("$type").and_then(|v| v.as_str()).unwrap_or("");
|
||||
if !t.contains("Comment") {
|
||||
continue;
|
||||
}
|
||||
|
||||
// Get comment text from commentary.text
|
||||
let text = item
|
||||
.get("commentary")
|
||||
.and_then(|c| c.get("text"))
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("");
|
||||
if text.is_empty() {
|
||||
continue;
|
||||
}
|
||||
|
||||
// Get commenter name from commenter.title.text
|
||||
let name = item
|
||||
.get("commenter")
|
||||
.and_then(|c| c.get("title"))
|
||||
.and_then(|n| n.get("text"))
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("Someone");
|
||||
|
||||
comments.push((name.to_string(), text.to_string()));
|
||||
}
|
||||
|
||||
if !comments.is_empty() {
|
||||
markdown.push_str("---\n\n## Comments\n\n");
|
||||
for (name, text) in &comments {
|
||||
markdown.push_str(&format!("- **{name}**: {text}\n\n"));
|
||||
}
|
||||
}
|
||||
|
||||
let word_count = markdown.split_whitespace().count();
|
||||
debug!(
|
||||
word_count,
|
||||
comments = comments.len(),
|
||||
"linkedin extraction done"
|
||||
);
|
||||
|
||||
Some(ExtractionResult {
|
||||
metadata: Metadata {
|
||||
title: if post_author.is_empty() {
|
||||
None
|
||||
} else {
|
||||
Some(format!("{post_author}'s LinkedIn Post"))
|
||||
},
|
||||
description: None,
|
||||
author: if post_author.is_empty() {
|
||||
None
|
||||
} else {
|
||||
Some(post_author)
|
||||
},
|
||||
published_date: None,
|
||||
language: None,
|
||||
url: Some(url.to_string()),
|
||||
site_name: Some("LinkedIn".into()),
|
||||
image: None,
|
||||
favicon: None,
|
||||
word_count,
|
||||
},
|
||||
content: Content {
|
||||
markdown,
|
||||
plain_text: String::new(),
|
||||
links: vec![],
|
||||
images: vec![],
|
||||
code_blocks: vec![],
|
||||
raw_html: None,
|
||||
},
|
||||
domain_data: None,
|
||||
structured_data: vec![],
|
||||
})
|
||||
}
|
||||
|
||||
/// Unescape HTML entities (named + numeric decimal).
|
||||
fn html_unescape(s: &str) -> String {
|
||||
let mut out = String::with_capacity(s.len());
|
||||
let mut chars = s.chars().peekable();
|
||||
while let Some(c) = chars.next() {
|
||||
if c != '&' {
|
||||
out.push(c);
|
||||
continue;
|
||||
}
|
||||
// Collect until ';'
|
||||
let mut entity = String::new();
|
||||
for c2 in chars.by_ref() {
|
||||
if c2 == ';' {
|
||||
break;
|
||||
}
|
||||
entity.push(c2);
|
||||
if entity.len() > 10 {
|
||||
break;
|
||||
}
|
||||
}
|
||||
match entity.as_str() {
|
||||
"quot" => out.push('"'),
|
||||
"amp" => out.push('&'),
|
||||
"lt" => out.push('<'),
|
||||
"gt" => out.push('>'),
|
||||
"apos" => out.push('\''),
|
||||
s if s.starts_with('#') => {
|
||||
let num = &s[1..];
|
||||
if let Ok(n) = num.parse::<u32>()
|
||||
&& let Some(ch) = char::from_u32(n)
|
||||
{
|
||||
out.push(ch);
|
||||
continue;
|
||||
}
|
||||
out.push('&');
|
||||
out.push_str(&entity);
|
||||
out.push(';');
|
||||
}
|
||||
_ => {
|
||||
out.push('&');
|
||||
out.push_str(&entity);
|
||||
out.push(';');
|
||||
}
|
||||
}
|
||||
}
|
||||
out
|
||||
}
|
||||
122
crates/webclaw-fetch/src/proxy.rs
Normal file
122
crates/webclaw-fetch/src/proxy.rs
Normal file
|
|
@ -0,0 +1,122 @@
|
|||
/// Proxy file parsing utilities.
|
||||
///
|
||||
/// Format: `host:port:user:pass` (one per line).
|
||||
/// Lines starting with `#` and blank lines are skipped.
|
||||
/// Also accepts `host:port` (no auth).
|
||||
use crate::error::FetchError;
|
||||
|
||||
/// Parse a single proxy line into an HTTP proxy URL.
|
||||
///
|
||||
/// Accepts two formats:
|
||||
/// - `host:port:user:pass` -> `http://user:pass@host:port`
|
||||
/// - `host:port` -> `http://host:port`
|
||||
pub fn parse_proxy_line(line: &str) -> Option<String> {
|
||||
let parts: Vec<&str> = line.trim().splitn(4, ':').collect();
|
||||
match parts.len() {
|
||||
4 => Some(format!(
|
||||
"http://{}:{}@{}:{}",
|
||||
parts[2], parts[3], parts[0], parts[1]
|
||||
)),
|
||||
2 => Some(format!("http://{}:{}", parts[0], parts[1])),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
|
||||
/// Load proxies from a file, returning parsed HTTP proxy URLs.
|
||||
///
|
||||
/// Skips blank lines and `#` comments. Returns an error if the file
|
||||
/// can't be read or contains no valid entries.
|
||||
pub fn parse_proxy_file(path: &str) -> Result<Vec<String>, FetchError> {
|
||||
let content = std::fs::read_to_string(path)
|
||||
.map_err(|e| FetchError::Build(format!("failed to read proxy file: {e}")))?;
|
||||
|
||||
let proxies: Vec<String> = content
|
||||
.lines()
|
||||
.filter_map(|line| {
|
||||
let trimmed = line.trim();
|
||||
if trimmed.is_empty() || trimmed.starts_with('#') {
|
||||
None
|
||||
} else {
|
||||
parse_proxy_line(trimmed)
|
||||
}
|
||||
})
|
||||
.collect();
|
||||
|
||||
if proxies.is_empty() {
|
||||
return Err(FetchError::Build(
|
||||
"proxy file is empty or has no valid entries".into(),
|
||||
));
|
||||
}
|
||||
|
||||
Ok(proxies)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use std::io::Write;
|
||||
|
||||
#[test]
|
||||
fn parse_host_port_user_pass() {
|
||||
let result = parse_proxy_line("proxy.example.com:8080:alice:s3cret");
|
||||
assert_eq!(
|
||||
result.as_deref(),
|
||||
Some("http://alice:s3cret@proxy.example.com:8080")
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_host_port_only() {
|
||||
let result = parse_proxy_line("10.0.0.1:3128");
|
||||
assert_eq!(result.as_deref(), Some("http://10.0.0.1:3128"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_trims_whitespace() {
|
||||
let result = parse_proxy_line(" host:9999:user:pass ");
|
||||
assert_eq!(result.as_deref(), Some("http://user:pass@host:9999"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_invalid_returns_none() {
|
||||
assert!(parse_proxy_line("just-a-hostname").is_none());
|
||||
assert!(parse_proxy_line("a:b:c").is_none()); // 3 parts is invalid
|
||||
assert!(parse_proxy_line("").is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_file_happy_path() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let path = dir.path().join("proxies.txt");
|
||||
let mut f = std::fs::File::create(&path).unwrap();
|
||||
writeln!(f, "# residential pool").unwrap();
|
||||
writeln!(f, "host1:8080:user1:pass1").unwrap();
|
||||
writeln!(f).unwrap(); // blank line
|
||||
writeln!(f, "host2:3128").unwrap();
|
||||
writeln!(f, "# datacenter").unwrap();
|
||||
writeln!(f, "host3:9999:u:p").unwrap();
|
||||
drop(f);
|
||||
|
||||
let proxies = parse_proxy_file(path.to_str().unwrap()).unwrap();
|
||||
assert_eq!(proxies.len(), 3);
|
||||
assert_eq!(proxies[0], "http://user1:pass1@host1:8080");
|
||||
assert_eq!(proxies[1], "http://host2:3128");
|
||||
assert_eq!(proxies[2], "http://u:p@host3:9999");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_file_empty_errors() {
|
||||
let dir = tempfile::tempdir().unwrap();
|
||||
let path = dir.path().join("empty.txt");
|
||||
std::fs::write(&path, "# only comments\n\n").unwrap();
|
||||
|
||||
let err = parse_proxy_file(path.to_str().unwrap());
|
||||
assert!(err.is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_file_missing_errors() {
|
||||
let err = parse_proxy_file("/nonexistent/proxies.txt");
|
||||
assert!(err.is_err());
|
||||
}
|
||||
}
|
||||
172
crates/webclaw-fetch/src/reddit.rs
Normal file
172
crates/webclaw-fetch/src/reddit.rs
Normal file
|
|
@ -0,0 +1,172 @@
|
|||
/// Reddit JSON API fallback for extracting posts + comments without JS rendering.
|
||||
///
|
||||
/// Reddit's new `shreddit` frontend only SSRs the post body — comments are
|
||||
/// loaded client-side. Appending `.json` to any Reddit URL returns the full
|
||||
/// comment tree as structured JSON, which we convert to clean markdown.
|
||||
use serde::Deserialize;
|
||||
use tracing::debug;
|
||||
use webclaw_core::{Content, ExtractionResult, Metadata};
|
||||
|
||||
/// Check if a URL points to a Reddit post/comment page.
|
||||
pub fn is_reddit_url(url: &str) -> bool {
|
||||
let host = url
|
||||
.split("://")
|
||||
.nth(1)
|
||||
.unwrap_or(url)
|
||||
.split('/')
|
||||
.next()
|
||||
.unwrap_or("");
|
||||
matches!(
|
||||
host,
|
||||
"reddit.com" | "www.reddit.com" | "old.reddit.com" | "np.reddit.com" | "new.reddit.com"
|
||||
)
|
||||
}
|
||||
|
||||
/// Build the `.json` URL from a Reddit page URL.
|
||||
pub fn json_url(url: &str) -> String {
|
||||
let clean = url.split('?').next().unwrap_or(url).trim_end_matches('/');
|
||||
format!("{clean}.json")
|
||||
}
|
||||
|
||||
/// Convert Reddit JSON API response into an ExtractionResult.
|
||||
pub fn parse_reddit_json(json_bytes: &[u8], url: &str) -> Result<ExtractionResult, String> {
|
||||
let listings: Vec<Listing> =
|
||||
serde_json::from_slice(json_bytes).map_err(|e| format!("reddit json parse: {e}"))?;
|
||||
|
||||
let mut markdown = String::new();
|
||||
let mut title = None;
|
||||
let mut author = None;
|
||||
let mut subreddit = None;
|
||||
|
||||
// First listing = the post itself
|
||||
if let Some(post_listing) = listings.first() {
|
||||
for child in &post_listing.data.children {
|
||||
if child.kind == "t3" {
|
||||
let d = &child.data;
|
||||
title = d.title.clone();
|
||||
author = d.author.clone();
|
||||
subreddit = d.subreddit_name_prefixed.clone();
|
||||
|
||||
if let Some(ref t) = title {
|
||||
markdown.push_str(&format!("# {t}\n\n"));
|
||||
}
|
||||
if let (Some(a), Some(sr)) = (&author, &subreddit) {
|
||||
markdown.push_str(&format!("**u/{a}** in {sr}\n\n"));
|
||||
}
|
||||
if let Some(ref body) = d.selftext
|
||||
&& !body.is_empty()
|
||||
{
|
||||
markdown.push_str(body);
|
||||
markdown.push_str("\n\n");
|
||||
}
|
||||
if let Some(ref url_field) = d.url_overridden_by_dest
|
||||
&& !url_field.is_empty()
|
||||
{
|
||||
markdown.push_str(&format!("[Link]({url_field})\n\n"));
|
||||
}
|
||||
markdown.push_str("---\n\n");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Second listing = comment tree
|
||||
if let Some(comment_listing) = listings.get(1) {
|
||||
markdown.push_str("## Comments\n\n");
|
||||
for child in &comment_listing.data.children {
|
||||
render_comment(child, 0, &mut markdown);
|
||||
}
|
||||
}
|
||||
|
||||
let word_count = markdown.split_whitespace().count();
|
||||
debug!(word_count, "reddit json extracted");
|
||||
|
||||
Ok(ExtractionResult {
|
||||
metadata: Metadata {
|
||||
title,
|
||||
description: None,
|
||||
author,
|
||||
published_date: None,
|
||||
language: Some("en".into()),
|
||||
url: Some(url.to_string()),
|
||||
site_name: subreddit,
|
||||
image: None,
|
||||
favicon: None,
|
||||
word_count,
|
||||
},
|
||||
content: Content {
|
||||
markdown,
|
||||
plain_text: String::new(),
|
||||
links: vec![],
|
||||
images: vec![],
|
||||
code_blocks: vec![],
|
||||
raw_html: None,
|
||||
},
|
||||
domain_data: None,
|
||||
structured_data: vec![],
|
||||
})
|
||||
}
|
||||
|
||||
fn render_comment(thing: &Thing, depth: usize, out: &mut String) {
|
||||
if thing.kind != "t1" {
|
||||
return;
|
||||
}
|
||||
let d = &thing.data;
|
||||
let indent = " ".repeat(depth);
|
||||
let author = d.author.as_deref().unwrap_or("[deleted]");
|
||||
let body = d.body.as_deref().unwrap_or("[removed]");
|
||||
let score = d.score.unwrap_or(0);
|
||||
|
||||
out.push_str(&format!("{indent}- **u/{author}** ({score} pts)\n"));
|
||||
for line in body.lines() {
|
||||
out.push_str(&format!("{indent} {line}\n"));
|
||||
}
|
||||
out.push('\n');
|
||||
|
||||
// Recurse into replies
|
||||
if let Some(Replies::Listing(listing)) = &d.replies {
|
||||
for child in &listing.data.children {
|
||||
render_comment(child, depth + 1, out);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// --- Reddit JSON types (minimal) ---
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct Listing {
|
||||
data: ListingData,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct ListingData {
|
||||
children: Vec<Thing>,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct Thing {
|
||||
kind: String,
|
||||
data: ThingData,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct ThingData {
|
||||
// Post fields (t3)
|
||||
title: Option<String>,
|
||||
selftext: Option<String>,
|
||||
subreddit_name_prefixed: Option<String>,
|
||||
url_overridden_by_dest: Option<String>,
|
||||
// Comment fields (t1)
|
||||
author: Option<String>,
|
||||
body: Option<String>,
|
||||
score: Option<i64>,
|
||||
replies: Option<Replies>,
|
||||
}
|
||||
|
||||
/// Reddit replies can be either a nested Listing or an empty string.
|
||||
#[derive(Deserialize)]
|
||||
#[serde(untagged)]
|
||||
enum Replies {
|
||||
Listing(Listing),
|
||||
#[allow(dead_code)]
|
||||
Empty(String),
|
||||
}
|
||||
582
crates/webclaw-fetch/src/sitemap.rs
Normal file
582
crates/webclaw-fetch/src/sitemap.rs
Normal file
|
|
@ -0,0 +1,582 @@
|
|||
/// Sitemap parsing and URL discovery.
|
||||
///
|
||||
/// Discovers URLs from a site's sitemaps using a 3-step process:
|
||||
/// 1. Parse robots.txt for `Sitemap:` directives
|
||||
/// 2. Try /sitemap.xml as fallback
|
||||
/// 3. Recursively resolve sitemap index files
|
||||
///
|
||||
/// All HTTP requests go through FetchClient to inherit TLS fingerprinting.
|
||||
use std::collections::HashSet;
|
||||
|
||||
use quick_xml::Reader;
|
||||
use quick_xml::events::Event;
|
||||
use serde::Serialize;
|
||||
use tracing::{debug, warn};
|
||||
|
||||
use crate::client::FetchClient;
|
||||
use crate::error::FetchError;
|
||||
|
||||
/// Maximum depth when recursively fetching sitemap index files.
|
||||
/// Prevents infinite loops from circular sitemap references.
|
||||
const MAX_RECURSION_DEPTH: usize = 3;
|
||||
|
||||
/// A single URL discovered from a sitemap.
|
||||
#[derive(Debug, Clone, Serialize)]
|
||||
pub struct SitemapEntry {
|
||||
pub url: String,
|
||||
pub last_modified: Option<String>,
|
||||
pub priority: Option<f64>,
|
||||
pub change_freq: Option<String>,
|
||||
}
|
||||
|
||||
/// Discover all URLs from a site's sitemaps.
|
||||
///
|
||||
/// Discovery order:
|
||||
/// 1. Fetch /robots.txt, parse `Sitemap:` directives
|
||||
/// 2. Fetch /sitemap.xml directly
|
||||
/// 3. If sitemap index, recursively fetch child sitemaps
|
||||
/// 4. Deduplicate by URL
|
||||
///
|
||||
/// Returns an empty vec (not an error) if no sitemaps are found.
|
||||
pub async fn discover(
|
||||
client: &FetchClient,
|
||||
base_url: &str,
|
||||
) -> Result<Vec<SitemapEntry>, FetchError> {
|
||||
let base = base_url.trim_end_matches('/');
|
||||
let mut sitemap_urls: Vec<String> = Vec::new();
|
||||
|
||||
// Step 1: Try robots.txt
|
||||
let robots_url = format!("{base}/robots.txt");
|
||||
debug!(url = %robots_url, "fetching robots.txt");
|
||||
|
||||
match client.fetch(&robots_url).await {
|
||||
Ok(result) if result.status == 200 => {
|
||||
let found = parse_robots_txt(&result.html);
|
||||
debug!(count = found.len(), "sitemap URLs from robots.txt");
|
||||
sitemap_urls.extend(found);
|
||||
}
|
||||
Ok(result) => {
|
||||
debug!(status = result.status, "robots.txt not found");
|
||||
}
|
||||
Err(e) => {
|
||||
debug!(error = %e, "failed to fetch robots.txt");
|
||||
}
|
||||
}
|
||||
|
||||
// Step 2: Always try /sitemap.xml as well (may not be listed in robots.txt)
|
||||
let default_sitemap = format!("{base}/sitemap.xml");
|
||||
if !sitemap_urls.iter().any(|u| u == &default_sitemap) {
|
||||
sitemap_urls.push(default_sitemap);
|
||||
}
|
||||
|
||||
// Step 3: Fetch and parse each sitemap, handling indexes recursively
|
||||
let mut seen_urls: HashSet<String> = HashSet::new();
|
||||
let mut entries: Vec<SitemapEntry> = Vec::new();
|
||||
|
||||
fetch_sitemaps(client, &sitemap_urls, &mut entries, &mut seen_urls, 0).await;
|
||||
|
||||
debug!(total = entries.len(), "sitemap discovery complete");
|
||||
Ok(entries)
|
||||
}
|
||||
|
||||
/// Recursively fetch and parse sitemap URLs, handling both urlsets and indexes.
|
||||
async fn fetch_sitemaps(
|
||||
client: &FetchClient,
|
||||
urls: &[String],
|
||||
entries: &mut Vec<SitemapEntry>,
|
||||
seen_urls: &mut HashSet<String>,
|
||||
depth: usize,
|
||||
) {
|
||||
if depth > MAX_RECURSION_DEPTH {
|
||||
warn!(depth, "sitemap recursion limit reached, stopping");
|
||||
return;
|
||||
}
|
||||
|
||||
for sitemap_url in urls {
|
||||
debug!(url = %sitemap_url, depth, "fetching sitemap");
|
||||
|
||||
let xml = match client.fetch(sitemap_url).await {
|
||||
Ok(result) if result.status == 200 => result.html,
|
||||
Ok(result) => {
|
||||
debug!(url = %sitemap_url, status = result.status, "sitemap not found");
|
||||
continue;
|
||||
}
|
||||
Err(e) => {
|
||||
debug!(url = %sitemap_url, error = %e, "failed to fetch sitemap");
|
||||
continue;
|
||||
}
|
||||
};
|
||||
|
||||
match detect_sitemap_type(&xml) {
|
||||
SitemapType::UrlSet => {
|
||||
let parsed = parse_urlset(&xml);
|
||||
for entry in parsed {
|
||||
if seen_urls.insert(entry.url.clone()) {
|
||||
entries.push(entry);
|
||||
}
|
||||
}
|
||||
}
|
||||
SitemapType::Index => {
|
||||
let child_urls = parse_sitemap_index(&xml);
|
||||
debug!(count = child_urls.len(), "found child sitemaps in index");
|
||||
|
||||
// Box the recursive call to avoid large future sizes
|
||||
Box::pin(fetch_sitemaps(
|
||||
client,
|
||||
&child_urls,
|
||||
entries,
|
||||
seen_urls,
|
||||
depth + 1,
|
||||
))
|
||||
.await;
|
||||
}
|
||||
SitemapType::Unknown => {
|
||||
debug!(url = %sitemap_url, "unrecognized sitemap format, skipping");
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Pure parsing functions (no I/O, fully testable)
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// Extract `Sitemap:` directive URLs from robots.txt content.
|
||||
pub fn parse_robots_txt(text: &str) -> Vec<String> {
|
||||
text.lines()
|
||||
.filter_map(|line| {
|
||||
let trimmed = line.trim();
|
||||
// Case-insensitive match for "Sitemap:" prefix
|
||||
if trimmed.len() > 8 && trimmed[..8].eq_ignore_ascii_case("sitemap:") {
|
||||
let url = trimmed[8..].trim();
|
||||
if !url.is_empty() {
|
||||
return Some(url.to_string());
|
||||
}
|
||||
}
|
||||
None
|
||||
})
|
||||
.collect()
|
||||
}
|
||||
|
||||
/// Parse a sitemap XML string. Handles both `<urlset>` and `<sitemapindex>`.
|
||||
/// Returns entries from urlsets and recursion targets from indexes.
|
||||
pub fn parse_sitemap_xml(xml: &str) -> Vec<SitemapEntry> {
|
||||
match detect_sitemap_type(xml) {
|
||||
SitemapType::UrlSet => parse_urlset(xml),
|
||||
SitemapType::Index => {
|
||||
// For the public parsing API, convert index <loc> entries into
|
||||
// SitemapEntry with just the URL. The async `discover` function
|
||||
// handles actual recursive fetching.
|
||||
parse_sitemap_index(xml)
|
||||
.into_iter()
|
||||
.map(|url| SitemapEntry {
|
||||
url,
|
||||
last_modified: None,
|
||||
priority: None,
|
||||
change_freq: None,
|
||||
})
|
||||
.collect()
|
||||
}
|
||||
SitemapType::Unknown => Vec::new(),
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Debug, PartialEq)]
|
||||
enum SitemapType {
|
||||
UrlSet,
|
||||
Index,
|
||||
Unknown,
|
||||
}
|
||||
|
||||
/// Peek at the first element to determine if this is a urlset or sitemapindex.
|
||||
fn detect_sitemap_type(xml: &str) -> SitemapType {
|
||||
let mut reader = Reader::from_str(xml);
|
||||
let mut buf = Vec::new();
|
||||
|
||||
loop {
|
||||
match reader.read_event_into(&mut buf) {
|
||||
Ok(Event::Start(ref e)) | Ok(Event::Empty(ref e)) => {
|
||||
let name = e.local_name();
|
||||
return match name.as_ref() {
|
||||
b"urlset" => SitemapType::UrlSet,
|
||||
b"sitemapindex" => SitemapType::Index,
|
||||
_ => continue, // skip processing instructions, comments
|
||||
};
|
||||
}
|
||||
Ok(Event::Eof) => return SitemapType::Unknown,
|
||||
Err(_) => return SitemapType::Unknown,
|
||||
_ => continue,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Parse `<url>` entries from a `<urlset>` sitemap.
|
||||
fn parse_urlset(xml: &str) -> Vec<SitemapEntry> {
|
||||
let mut reader = Reader::from_str(xml);
|
||||
let mut buf = Vec::new();
|
||||
let mut entries = Vec::new();
|
||||
|
||||
// State for current <url> element being parsed
|
||||
let mut in_url = false;
|
||||
let mut current_tag: Option<UrlTag> = None;
|
||||
let mut loc: Option<String> = None;
|
||||
let mut lastmod: Option<String> = None;
|
||||
let mut priority: Option<f64> = None;
|
||||
let mut changefreq: Option<String> = None;
|
||||
|
||||
loop {
|
||||
match reader.read_event_into(&mut buf) {
|
||||
Ok(Event::Start(ref e)) => {
|
||||
let name = e.local_name();
|
||||
match name.as_ref() {
|
||||
b"url" => {
|
||||
in_url = true;
|
||||
loc = None;
|
||||
lastmod = None;
|
||||
priority = None;
|
||||
changefreq = None;
|
||||
}
|
||||
b"loc" if in_url => current_tag = Some(UrlTag::Loc),
|
||||
b"lastmod" if in_url => current_tag = Some(UrlTag::LastMod),
|
||||
b"priority" if in_url => current_tag = Some(UrlTag::Priority),
|
||||
b"changefreq" if in_url => current_tag = Some(UrlTag::ChangeFreq),
|
||||
_ => current_tag = None,
|
||||
}
|
||||
}
|
||||
Ok(Event::Text(ref e)) => {
|
||||
if let Some(ref tag) = current_tag
|
||||
&& let Ok(text) = e.unescape()
|
||||
{
|
||||
let text = text.trim().to_string();
|
||||
if !text.is_empty() {
|
||||
match tag {
|
||||
UrlTag::Loc => loc = Some(text),
|
||||
UrlTag::LastMod => lastmod = Some(text),
|
||||
UrlTag::Priority => priority = text.parse().ok(),
|
||||
UrlTag::ChangeFreq => changefreq = Some(text),
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
Ok(Event::End(ref e)) => {
|
||||
let name = e.local_name();
|
||||
if name.as_ref() == b"url" && in_url {
|
||||
if let Some(url) = loc.take() {
|
||||
entries.push(SitemapEntry {
|
||||
url,
|
||||
last_modified: lastmod.take(),
|
||||
priority: priority.take(),
|
||||
change_freq: changefreq.take(),
|
||||
});
|
||||
}
|
||||
in_url = false;
|
||||
}
|
||||
current_tag = None;
|
||||
}
|
||||
Ok(Event::Eof) => break,
|
||||
Err(e) => {
|
||||
warn!(error = %e, "XML parse error in sitemap, returning partial results");
|
||||
break;
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
buf.clear();
|
||||
}
|
||||
|
||||
entries
|
||||
}
|
||||
|
||||
#[derive(Debug)]
|
||||
enum UrlTag {
|
||||
Loc,
|
||||
LastMod,
|
||||
Priority,
|
||||
ChangeFreq,
|
||||
}
|
||||
|
||||
/// Parse `<sitemap>` entries from a `<sitemapindex>`, returning child sitemap URLs.
|
||||
fn parse_sitemap_index(xml: &str) -> Vec<String> {
|
||||
let mut reader = Reader::from_str(xml);
|
||||
let mut buf = Vec::new();
|
||||
let mut urls = Vec::new();
|
||||
|
||||
let mut in_sitemap = false;
|
||||
let mut in_loc = false;
|
||||
|
||||
loop {
|
||||
match reader.read_event_into(&mut buf) {
|
||||
Ok(Event::Start(ref e)) => {
|
||||
let name = e.local_name();
|
||||
match name.as_ref() {
|
||||
b"sitemap" => in_sitemap = true,
|
||||
b"loc" if in_sitemap => in_loc = true,
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
Ok(Event::Text(ref e)) => {
|
||||
if in_loc && let Ok(text) = e.unescape() {
|
||||
let text = text.trim().to_string();
|
||||
if !text.is_empty() {
|
||||
urls.push(text);
|
||||
}
|
||||
}
|
||||
}
|
||||
Ok(Event::End(ref e)) => {
|
||||
let name = e.local_name();
|
||||
match name.as_ref() {
|
||||
b"sitemap" => {
|
||||
in_sitemap = false;
|
||||
in_loc = false;
|
||||
}
|
||||
b"loc" => in_loc = false,
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
Ok(Event::Eof) => break,
|
||||
Err(e) => {
|
||||
warn!(error = %e, "XML parse error in sitemap index, returning partial results");
|
||||
break;
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
buf.clear();
|
||||
}
|
||||
|
||||
urls
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Tests
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_parse_urlset() {
|
||||
let xml = r#"<?xml version="1.0" encoding="UTF-8"?>
|
||||
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
|
||||
<url>
|
||||
<loc>https://example.com/</loc>
|
||||
<lastmod>2026-01-15</lastmod>
|
||||
<changefreq>daily</changefreq>
|
||||
<priority>1.0</priority>
|
||||
</url>
|
||||
<url>
|
||||
<loc>https://example.com/about</loc>
|
||||
<lastmod>2026-01-10</lastmod>
|
||||
<changefreq>monthly</changefreq>
|
||||
<priority>0.8</priority>
|
||||
</url>
|
||||
<url>
|
||||
<loc>https://example.com/blog/post-1</loc>
|
||||
</url>
|
||||
</urlset>"#;
|
||||
|
||||
let entries = parse_urlset(xml);
|
||||
assert_eq!(entries.len(), 3);
|
||||
|
||||
assert_eq!(entries[0].url, "https://example.com/");
|
||||
assert_eq!(entries[0].last_modified.as_deref(), Some("2026-01-15"));
|
||||
assert_eq!(entries[0].change_freq.as_deref(), Some("daily"));
|
||||
assert_eq!(entries[0].priority, Some(1.0));
|
||||
|
||||
assert_eq!(entries[1].url, "https://example.com/about");
|
||||
assert_eq!(entries[1].priority, Some(0.8));
|
||||
|
||||
assert_eq!(entries[2].url, "https://example.com/blog/post-1");
|
||||
assert_eq!(entries[2].last_modified, None);
|
||||
assert_eq!(entries[2].priority, None);
|
||||
assert_eq!(entries[2].change_freq, None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_parse_sitemap_index() {
|
||||
let xml = r#"<?xml version="1.0" encoding="UTF-8"?>
|
||||
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
|
||||
<sitemap>
|
||||
<loc>https://example.com/sitemap-posts.xml</loc>
|
||||
<lastmod>2026-03-01</lastmod>
|
||||
</sitemap>
|
||||
<sitemap>
|
||||
<loc>https://example.com/sitemap-pages.xml</loc>
|
||||
</sitemap>
|
||||
</sitemapindex>"#;
|
||||
|
||||
let urls = parse_sitemap_index(xml);
|
||||
assert_eq!(urls.len(), 2);
|
||||
assert_eq!(urls[0], "https://example.com/sitemap-posts.xml");
|
||||
assert_eq!(urls[1], "https://example.com/sitemap-pages.xml");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_parse_sitemap_xml_dispatches_urlset() {
|
||||
let xml = r#"<?xml version="1.0"?>
|
||||
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
|
||||
<url><loc>https://example.com/page</loc></url>
|
||||
</urlset>"#;
|
||||
|
||||
let entries = parse_sitemap_xml(xml);
|
||||
assert_eq!(entries.len(), 1);
|
||||
assert_eq!(entries[0].url, "https://example.com/page");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_parse_sitemap_xml_dispatches_index() {
|
||||
let xml = r#"<?xml version="1.0"?>
|
||||
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
|
||||
<sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
|
||||
</sitemapindex>"#;
|
||||
|
||||
let entries = parse_sitemap_xml(xml);
|
||||
assert_eq!(entries.len(), 1);
|
||||
assert_eq!(entries[0].url, "https://example.com/sitemap-1.xml");
|
||||
// Index entries have no metadata when parsed through the public API
|
||||
assert_eq!(entries[0].priority, None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_parse_robots_txt() {
|
||||
let robots = "User-agent: *\n\
|
||||
Disallow: /admin/\n\
|
||||
\n\
|
||||
Sitemap: https://example.com/sitemap.xml\n\
|
||||
sitemap: https://example.com/sitemap-news.xml\n\
|
||||
SITEMAP: https://example.com/sitemap-images.xml\n\
|
||||
\n\
|
||||
User-agent: Googlebot\n\
|
||||
Allow: /\n";
|
||||
|
||||
let urls = parse_robots_txt(robots);
|
||||
assert_eq!(urls.len(), 3);
|
||||
assert_eq!(urls[0], "https://example.com/sitemap.xml");
|
||||
assert_eq!(urls[1], "https://example.com/sitemap-news.xml");
|
||||
assert_eq!(urls[2], "https://example.com/sitemap-images.xml");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_parse_robots_txt_empty_value() {
|
||||
// "Sitemap:" with no URL should be skipped
|
||||
let robots = "Sitemap:\nSitemap: \nSitemap: https://example.com/s.xml\n";
|
||||
let urls = parse_robots_txt(robots);
|
||||
assert_eq!(urls.len(), 1);
|
||||
assert_eq!(urls[0], "https://example.com/s.xml");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_deduplicate() {
|
||||
// parse_sitemap_xml deduplicates via the discover() path, but
|
||||
// we can verify that parsing the same URL twice produces entries
|
||||
// that the HashSet in discover() would collapse.
|
||||
let xml = r#"<?xml version="1.0"?>
|
||||
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
|
||||
<url><loc>https://example.com/page</loc></url>
|
||||
<url><loc>https://example.com/page</loc></url>
|
||||
<url><loc>https://example.com/other</loc></url>
|
||||
</urlset>"#;
|
||||
|
||||
let entries = parse_urlset(xml);
|
||||
assert_eq!(entries.len(), 3, "parser returns all entries");
|
||||
|
||||
// Simulate the dedup that discover() does
|
||||
let mut seen = HashSet::new();
|
||||
let deduped: Vec<_> = entries
|
||||
.into_iter()
|
||||
.filter(|e| seen.insert(e.url.clone()))
|
||||
.collect();
|
||||
assert_eq!(deduped.len(), 2, "dedup collapses duplicates");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_empty_sitemap() {
|
||||
let xml = r#"<?xml version="1.0"?>
|
||||
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
|
||||
</urlset>"#;
|
||||
|
||||
let entries = parse_urlset(xml);
|
||||
assert!(entries.is_empty());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_malformed_xml() {
|
||||
let xml = "this is not xml at all <><><";
|
||||
let entries = parse_sitemap_xml(xml);
|
||||
assert!(entries.is_empty(), "malformed XML returns empty vec");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_malformed_xml_partial() {
|
||||
// Partial XML that starts valid but breaks mid-stream
|
||||
let xml = r#"<?xml version="1.0"?>
|
||||
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
|
||||
<url><loc>https://example.com/good</loc></url>
|
||||
<url><loc>broken
|
||||
"#;
|
||||
let entries = parse_sitemap_xml(xml);
|
||||
// Should return at least the successfully parsed entry
|
||||
assert!(entries.len() >= 1);
|
||||
assert_eq!(entries[0].url, "https://example.com/good");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_missing_loc() {
|
||||
let xml = r#"<?xml version="1.0"?>
|
||||
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
|
||||
<url>
|
||||
<lastmod>2026-01-01</lastmod>
|
||||
<priority>0.5</priority>
|
||||
</url>
|
||||
<url>
|
||||
<loc>https://example.com/valid</loc>
|
||||
</url>
|
||||
</urlset>"#;
|
||||
|
||||
let entries = parse_urlset(xml);
|
||||
assert_eq!(entries.len(), 1, "entry without <loc> is skipped");
|
||||
assert_eq!(entries[0].url, "https://example.com/valid");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_priority_parsing() {
|
||||
let xml = r#"<?xml version="1.0"?>
|
||||
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
|
||||
<url>
|
||||
<loc>https://example.com/high</loc>
|
||||
<priority>1.0</priority>
|
||||
</url>
|
||||
<url>
|
||||
<loc>https://example.com/mid</loc>
|
||||
<priority>0.5</priority>
|
||||
</url>
|
||||
<url>
|
||||
<loc>https://example.com/low</loc>
|
||||
<priority>0.1</priority>
|
||||
</url>
|
||||
<url>
|
||||
<loc>https://example.com/invalid</loc>
|
||||
<priority>not-a-number</priority>
|
||||
</url>
|
||||
</urlset>"#;
|
||||
|
||||
let entries = parse_urlset(xml);
|
||||
assert_eq!(entries.len(), 4);
|
||||
|
||||
assert_eq!(entries[0].priority, Some(1.0));
|
||||
assert_eq!(entries[1].priority, Some(0.5));
|
||||
assert_eq!(entries[2].priority, Some(0.1));
|
||||
assert_eq!(entries[3].priority, None, "invalid priority parses as None");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_detect_sitemap_type() {
|
||||
let urlset = r#"<?xml version="1.0"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>"#;
|
||||
assert_eq!(detect_sitemap_type(urlset), SitemapType::UrlSet);
|
||||
|
||||
let index = r#"<?xml version="1.0"?><sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></sitemapindex>"#;
|
||||
assert_eq!(detect_sitemap_type(index), SitemapType::Index);
|
||||
|
||||
assert_eq!(detect_sitemap_type("garbage"), SitemapType::Unknown);
|
||||
assert_eq!(detect_sitemap_type(""), SitemapType::Unknown);
|
||||
}
|
||||
}
|
||||
15
crates/webclaw-llm/Cargo.toml
Normal file
15
crates/webclaw-llm/Cargo.toml
Normal file
|
|
@ -0,0 +1,15 @@
|
|||
[package]
|
||||
name = "webclaw-llm"
|
||||
description = "LLM integration for webclaw — local-first hybrid architecture (Ollama -> OpenAI -> Anthropic)"
|
||||
version.workspace = true
|
||||
edition.workspace = true
|
||||
license.workspace = true
|
||||
|
||||
[dependencies]
|
||||
reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
|
||||
async-trait = "0.1"
|
||||
serde = { workspace = true }
|
||||
serde_json = { workspace = true }
|
||||
tokio = { workspace = true }
|
||||
thiserror = { workspace = true }
|
||||
tracing = { workspace = true }
|
||||
205
crates/webclaw-llm/src/chain.rs
Normal file
205
crates/webclaw-llm/src/chain.rs
Normal file
|
|
@ -0,0 +1,205 @@
|
|||
/// Provider chain — tries providers in order until one succeeds.
|
||||
/// Default order: Ollama (local, free) -> OpenAI -> Anthropic.
|
||||
/// Only includes providers that are actually configured/available.
|
||||
use async_trait::async_trait;
|
||||
use tracing::{debug, warn};
|
||||
|
||||
use crate::error::LlmError;
|
||||
use crate::provider::{CompletionRequest, LlmProvider};
|
||||
use crate::providers::{
|
||||
anthropic::AnthropicProvider, ollama::OllamaProvider, openai::OpenAiProvider,
|
||||
};
|
||||
|
||||
pub struct ProviderChain {
|
||||
providers: Vec<Box<dyn LlmProvider>>,
|
||||
}
|
||||
|
||||
impl ProviderChain {
|
||||
/// Build the default chain: Ollama -> OpenAI -> Anthropic.
|
||||
/// Ollama is always added (availability checked at call time).
|
||||
/// Cloud providers are only added if their API keys are configured.
|
||||
pub async fn default() -> Self {
|
||||
let mut providers: Vec<Box<dyn LlmProvider>> = Vec::new();
|
||||
|
||||
let ollama = OllamaProvider::new(None, None);
|
||||
if ollama.is_available().await {
|
||||
debug!("ollama is available, adding to chain");
|
||||
providers.push(Box::new(ollama));
|
||||
} else {
|
||||
debug!("ollama not available, skipping");
|
||||
}
|
||||
|
||||
if let Some(openai) = OpenAiProvider::new(None, None, None) {
|
||||
debug!("openai configured, adding to chain");
|
||||
providers.push(Box::new(openai));
|
||||
}
|
||||
|
||||
if let Some(anthropic) = AnthropicProvider::new(None, None) {
|
||||
debug!("anthropic configured, adding to chain");
|
||||
providers.push(Box::new(anthropic));
|
||||
}
|
||||
|
||||
Self { providers }
|
||||
}
|
||||
|
||||
/// Build a chain with a single explicit provider.
|
||||
pub fn single(provider: Box<dyn LlmProvider>) -> Self {
|
||||
Self {
|
||||
providers: vec![provider],
|
||||
}
|
||||
}
|
||||
|
||||
/// Build from an explicit list of providers.
|
||||
pub fn from_providers(providers: Vec<Box<dyn LlmProvider>>) -> Self {
|
||||
Self { providers }
|
||||
}
|
||||
|
||||
/// How many providers are in the chain.
|
||||
pub fn len(&self) -> usize {
|
||||
self.providers.len()
|
||||
}
|
||||
|
||||
pub fn is_empty(&self) -> bool {
|
||||
self.providers.is_empty()
|
||||
}
|
||||
}
|
||||
|
||||
/// ProviderChain itself implements LlmProvider, so it can be used anywhere
|
||||
/// a single provider is expected. This makes the CLI simple: build a chain
|
||||
/// or a single provider, pass either as `Box<dyn LlmProvider>`.
|
||||
#[async_trait]
|
||||
impl LlmProvider for ProviderChain {
|
||||
async fn complete(&self, request: &CompletionRequest) -> Result<String, LlmError> {
|
||||
if self.providers.is_empty() {
|
||||
return Err(LlmError::NoProviders);
|
||||
}
|
||||
|
||||
let mut errors = Vec::new();
|
||||
|
||||
for provider in &self.providers {
|
||||
debug!(provider = provider.name(), "attempting completion");
|
||||
|
||||
match provider.complete(request).await {
|
||||
Ok(response) => {
|
||||
debug!(provider = provider.name(), "completion succeeded");
|
||||
return Ok(response);
|
||||
}
|
||||
Err(e) => {
|
||||
warn!(provider = provider.name(), error = %e, "provider failed, trying next");
|
||||
errors.push(format!("{}: {e}", provider.name()));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Err(LlmError::AllProvidersFailed(errors.join("; ")))
|
||||
}
|
||||
|
||||
async fn is_available(&self) -> bool {
|
||||
!self.providers.is_empty()
|
||||
}
|
||||
|
||||
fn name(&self) -> &str {
|
||||
"chain"
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::provider::Message;
|
||||
use crate::testing::mock::MockProvider;
|
||||
|
||||
fn test_request() -> CompletionRequest {
|
||||
CompletionRequest {
|
||||
model: String::new(),
|
||||
messages: vec![Message {
|
||||
role: "user".into(),
|
||||
content: "test".into(),
|
||||
}],
|
||||
temperature: None,
|
||||
max_tokens: None,
|
||||
json_mode: false,
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn empty_chain_returns_no_providers() {
|
||||
let chain = ProviderChain::from_providers(vec![]);
|
||||
let result = chain.complete(&test_request()).await;
|
||||
assert!(matches!(result, Err(LlmError::NoProviders)));
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn single_provider_success() {
|
||||
let chain = ProviderChain::from_providers(vec![Box::new(MockProvider {
|
||||
name: "mock",
|
||||
response: Ok("hello".into()),
|
||||
available: true,
|
||||
})]);
|
||||
|
||||
let result = chain.complete(&test_request()).await.unwrap();
|
||||
assert_eq!(result, "hello");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn fallback_on_first_failure() {
|
||||
let chain = ProviderChain::from_providers(vec![
|
||||
Box::new(MockProvider {
|
||||
name: "failing",
|
||||
response: Err("connection refused".into()),
|
||||
available: true,
|
||||
}),
|
||||
Box::new(MockProvider {
|
||||
name: "backup",
|
||||
response: Ok("from backup".into()),
|
||||
available: true,
|
||||
}),
|
||||
]);
|
||||
|
||||
let result = chain.complete(&test_request()).await.unwrap();
|
||||
assert_eq!(result, "from backup");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn all_fail_collects_errors() {
|
||||
let chain = ProviderChain::from_providers(vec![
|
||||
Box::new(MockProvider {
|
||||
name: "a",
|
||||
response: Err("timeout".into()),
|
||||
available: true,
|
||||
}),
|
||||
Box::new(MockProvider {
|
||||
name: "b",
|
||||
response: Err("rate limited".into()),
|
||||
available: true,
|
||||
}),
|
||||
]);
|
||||
|
||||
let result = chain.complete(&test_request()).await;
|
||||
match result {
|
||||
Err(LlmError::AllProvidersFailed(msg)) => {
|
||||
assert!(msg.contains("timeout"));
|
||||
assert!(msg.contains("rate limited"));
|
||||
}
|
||||
other => panic!("expected AllProvidersFailed, got {other:?}"),
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn chain_length() {
|
||||
let chain = ProviderChain::from_providers(vec![
|
||||
Box::new(MockProvider {
|
||||
name: "a",
|
||||
response: Ok("ok".into()),
|
||||
available: true,
|
||||
}),
|
||||
Box::new(MockProvider {
|
||||
name: "b",
|
||||
response: Ok("ok".into()),
|
||||
available: true,
|
||||
}),
|
||||
]);
|
||||
assert_eq!(chain.len(), 2);
|
||||
assert!(!chain.is_empty());
|
||||
}
|
||||
}
|
||||
124
crates/webclaw-llm/src/clean.rs
Normal file
124
crates/webclaw-llm/src/clean.rs
Normal file
|
|
@ -0,0 +1,124 @@
|
|||
//! Post-processing for LLM responses.
|
||||
//! Strips chain-of-thought reasoning tags that models like qwen3 emit.
|
||||
//! Applied to every provider response so callers never see internal reasoning.
|
||||
|
||||
/// Strip `<think>...</think>` blocks from LLM responses.
|
||||
/// Models like qwen3 wrap internal chain-of-thought reasoning in these tags.
|
||||
/// Handles multiline content, multiple blocks, and partial/malformed tags.
|
||||
pub fn strip_thinking_tags(text: &str) -> String {
|
||||
let mut result = String::with_capacity(text.len());
|
||||
let mut remaining = text;
|
||||
|
||||
while let Some(start) = remaining.find("<think>") {
|
||||
// Keep everything before the opening tag
|
||||
result.push_str(&remaining[..start]);
|
||||
|
||||
// Find the matching closing tag
|
||||
let after_open = &remaining[start + 7..]; // len("<think>") == 7
|
||||
if let Some(end) = after_open.find("</think>") {
|
||||
remaining = &after_open[end + 8..]; // len("</think>") == 8
|
||||
} else {
|
||||
// Unclosed <think> — discard everything after it (the model is still "thinking")
|
||||
remaining = "";
|
||||
}
|
||||
}
|
||||
|
||||
result.push_str(remaining);
|
||||
|
||||
// Clean up: leftover </think> or /think fragments from partial responses
|
||||
let result = result.replace("</think>", "");
|
||||
let result = result.replace("/think", "");
|
||||
|
||||
// Collapse leading whitespace left behind after stripping
|
||||
let trimmed = result.trim();
|
||||
if trimmed.is_empty() {
|
||||
String::new()
|
||||
} else {
|
||||
trimmed.to_string()
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn strips_simple_thinking_block() {
|
||||
let input = "<think>reasoning here</think>actual response";
|
||||
assert_eq!(strip_thinking_tags(input), "actual response");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn strips_multiline_thinking() {
|
||||
let input = "<think>\nlong\nthinking\nprocess\n</think>\nclean output";
|
||||
assert_eq!(strip_thinking_tags(input), "clean output");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn passthrough_no_tags() {
|
||||
let input = "no thinking tags here";
|
||||
assert_eq!(strip_thinking_tags(input), "no thinking tags here");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn strips_partial_think_at_end() {
|
||||
let input = "some text /think";
|
||||
assert_eq!(strip_thinking_tags(input), "some text");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn strips_orphan_closing_tag() {
|
||||
let input = "some text</think> more text";
|
||||
assert_eq!(strip_thinking_tags(input), "some text more text");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn strips_multiple_thinking_blocks() {
|
||||
let input = "<think>first</think>hello <think>second</think>world";
|
||||
assert_eq!(strip_thinking_tags(input), "hello world");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn handles_unclosed_think_tag() {
|
||||
// Model started thinking and never closed — discard everything after <think>
|
||||
let input = "good content<think>still reasoning...";
|
||||
assert_eq!(strip_thinking_tags(input), "good content");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn handles_empty_thinking_block() {
|
||||
let input = "<think></think>content";
|
||||
assert_eq!(strip_thinking_tags(input), "content");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn handles_only_thinking() {
|
||||
let input = "<think>just thinking, no output</think>";
|
||||
assert_eq!(strip_thinking_tags(input), "");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn preserves_json_content() {
|
||||
let input = "<think>let me analyze...</think>{\"key\": \"value\", \"count\": 42}";
|
||||
assert_eq!(
|
||||
strip_thinking_tags(input),
|
||||
"{\"key\": \"value\", \"count\": 42}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn real_world_extract_leak() {
|
||||
// Actual bug: qwen3 leaked "/think" into JSON values
|
||||
let input = "<think>analyzing the page</think>{\"learn_more\": \"Learn more\"}";
|
||||
assert_eq!(
|
||||
strip_thinking_tags(input),
|
||||
"{\"learn_more\": \"Learn more\"}"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn thinking_with_newlines_before_json() {
|
||||
let input = "<think>\nstep 1\nstep 2\n</think>\n\n{\"result\": true}";
|
||||
assert_eq!(strip_thinking_tags(input), "{\"result\": true}");
|
||||
}
|
||||
}
|
||||
18
crates/webclaw-llm/src/error.rs
Normal file
18
crates/webclaw-llm/src/error.rs
Normal file
|
|
@ -0,0 +1,18 @@
|
|||
/// LLM-specific errors. Kept flat — one enum covers transport, provider, and parsing failures.
|
||||
#[derive(Debug, thiserror::Error)]
|
||||
pub enum LlmError {
|
||||
#[error("HTTP error: {0}")]
|
||||
Http(#[from] reqwest::Error),
|
||||
|
||||
#[error("no providers available")]
|
||||
NoProviders,
|
||||
|
||||
#[error("all providers failed: {0}")]
|
||||
AllProvidersFailed(String),
|
||||
|
||||
#[error("invalid JSON response: {0}")]
|
||||
InvalidJson(String),
|
||||
|
||||
#[error("provider error: {0}")]
|
||||
ProviderError(String),
|
||||
}
|
||||
187
crates/webclaw-llm/src/extract.rs
Normal file
187
crates/webclaw-llm/src/extract.rs
Normal file
|
|
@ -0,0 +1,187 @@
|
|||
/// Schema-based and prompt-based LLM extraction.
|
||||
/// Both functions build a system prompt, send content to the LLM, and parse JSON back.
|
||||
use crate::clean::strip_thinking_tags;
|
||||
use crate::error::LlmError;
|
||||
use crate::provider::{CompletionRequest, LlmProvider, Message};
|
||||
|
||||
/// Extract structured JSON from content using a JSON schema.
|
||||
/// The schema tells the LLM exactly what fields to extract and their types.
|
||||
pub async fn extract_json(
|
||||
content: &str,
|
||||
schema: &serde_json::Value,
|
||||
provider: &dyn LlmProvider,
|
||||
model: Option<&str>,
|
||||
) -> Result<serde_json::Value, LlmError> {
|
||||
let system = format!(
|
||||
"You are a JSON extraction engine. Extract data from the content according to this schema.\n\
|
||||
Return ONLY valid JSON matching the schema. No explanations, no markdown, no commentary.\n\n\
|
||||
Schema:\n```json\n{}\n```",
|
||||
serde_json::to_string_pretty(schema).unwrap_or_else(|_| schema.to_string())
|
||||
);
|
||||
|
||||
let request = CompletionRequest {
|
||||
model: model.unwrap_or_default().to_string(),
|
||||
messages: vec![
|
||||
Message {
|
||||
role: "system".into(),
|
||||
content: system,
|
||||
},
|
||||
Message {
|
||||
role: "user".into(),
|
||||
content: content.to_string(),
|
||||
},
|
||||
],
|
||||
temperature: Some(0.0),
|
||||
max_tokens: None,
|
||||
json_mode: true,
|
||||
};
|
||||
|
||||
let response = provider.complete(&request).await?;
|
||||
parse_json_response(&response)
|
||||
}
|
||||
|
||||
/// Extract information using a natural language prompt.
|
||||
/// More flexible than schema extraction — the user describes what they want.
|
||||
pub async fn extract_with_prompt(
|
||||
content: &str,
|
||||
prompt: &str,
|
||||
provider: &dyn LlmProvider,
|
||||
model: Option<&str>,
|
||||
) -> Result<serde_json::Value, LlmError> {
|
||||
let system = format!(
|
||||
"You are a JSON extraction engine. Extract information from the content based on these instructions.\n\
|
||||
Return ONLY valid JSON. No explanations, no markdown, no commentary.\n\n\
|
||||
Instructions: {prompt}"
|
||||
);
|
||||
|
||||
let request = CompletionRequest {
|
||||
model: model.unwrap_or_default().to_string(),
|
||||
messages: vec![
|
||||
Message {
|
||||
role: "system".into(),
|
||||
content: system,
|
||||
},
|
||||
Message {
|
||||
role: "user".into(),
|
||||
content: content.to_string(),
|
||||
},
|
||||
],
|
||||
temperature: Some(0.0),
|
||||
max_tokens: None,
|
||||
json_mode: true,
|
||||
};
|
||||
|
||||
let response = provider.complete(&request).await?;
|
||||
parse_json_response(&response)
|
||||
}
|
||||
|
||||
/// Parse an LLM response string as JSON. Handles common edge cases:
|
||||
/// - Thinking tags (`<think>...</think>`)
|
||||
/// - Markdown code fences (```json ... ```)
|
||||
/// - Leading/trailing whitespace
|
||||
fn parse_json_response(response: &str) -> Result<serde_json::Value, LlmError> {
|
||||
// Strip thinking tags before any JSON parsing — providers already do this,
|
||||
// but defense in depth for any caller that bypasses the provider layer
|
||||
let cleaned = strip_thinking_tags(response);
|
||||
let trimmed = cleaned.trim();
|
||||
|
||||
// Strip markdown code fences if present
|
||||
let json_str = if trimmed.starts_with("```") {
|
||||
let without_opener = trimmed
|
||||
.strip_prefix("```json")
|
||||
.or_else(|| trimmed.strip_prefix("```"))
|
||||
.unwrap_or(trimmed);
|
||||
without_opener
|
||||
.strip_suffix("```")
|
||||
.unwrap_or(without_opener)
|
||||
.trim()
|
||||
} else {
|
||||
trimmed
|
||||
};
|
||||
|
||||
serde_json::from_str(json_str)
|
||||
.map_err(|e| LlmError::InvalidJson(format!("{e} — raw response: {response}")))
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::testing::mock::MockProvider;
|
||||
|
||||
#[test]
|
||||
fn parse_clean_json() {
|
||||
let result = parse_json_response(r#"{"name": "Rust", "version": 2024}"#).unwrap();
|
||||
assert_eq!(result["name"], "Rust");
|
||||
assert_eq!(result["version"], 2024);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_json_with_code_fence() {
|
||||
let response = "```json\n{\"key\": \"value\"}\n```";
|
||||
let result = parse_json_response(response).unwrap();
|
||||
assert_eq!(result["key"], "value");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_json_with_whitespace() {
|
||||
let response = " \n {\"ok\": true} \n ";
|
||||
let result = parse_json_response(response).unwrap();
|
||||
assert_eq!(result["ok"], true);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_invalid_json() {
|
||||
let result = parse_json_response("not json at all");
|
||||
assert!(matches!(result, Err(LlmError::InvalidJson(_))));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_json_with_thinking_tags() {
|
||||
let response = "<think>analyzing the content</think>{\"title\": \"Hello\"}";
|
||||
let result = parse_json_response(response).unwrap();
|
||||
assert_eq!(result["title"], "Hello");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_json_with_thinking_and_code_fence() {
|
||||
let response = "<think>let me think</think>\n```json\n{\"key\": \"value\"}\n```";
|
||||
let result = parse_json_response(response).unwrap();
|
||||
assert_eq!(result["key"], "value");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn extract_json_uses_schema_in_prompt() {
|
||||
let mock = MockProvider::ok(r#"{"title": "Test Article", "author": "Jane"}"#);
|
||||
|
||||
let schema = serde_json::json!({
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"title": { "type": "string" },
|
||||
"author": { "type": "string" }
|
||||
}
|
||||
});
|
||||
|
||||
let result = extract_json("Some article content by Jane", &schema, &mock, None)
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
assert_eq!(result["title"], "Test Article");
|
||||
assert_eq!(result["author"], "Jane");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn extract_with_prompt_returns_json() {
|
||||
let mock = MockProvider::ok(r#"{"emails": ["test@example.com"]}"#);
|
||||
|
||||
let result = extract_with_prompt(
|
||||
"Contact us at test@example.com",
|
||||
"Find all email addresses",
|
||||
&mock,
|
||||
None,
|
||||
)
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
assert_eq!(result["emails"][0], "test@example.com");
|
||||
}
|
||||
}
|
||||
19
crates/webclaw-llm/src/lib.rs
Normal file
19
crates/webclaw-llm/src/lib.rs
Normal file
|
|
@ -0,0 +1,19 @@
|
|||
/// webclaw-llm: LLM integration with local-first hybrid architecture.
|
||||
///
|
||||
/// Provider chain tries Ollama (local) first, falls back to OpenAI, then Anthropic.
|
||||
/// Provides schema-based extraction, prompt extraction, and summarization
|
||||
/// on top of webclaw-core's content pipeline.
|
||||
pub mod chain;
|
||||
pub mod clean;
|
||||
pub mod error;
|
||||
pub mod extract;
|
||||
pub mod provider;
|
||||
pub mod providers;
|
||||
pub mod summarize;
|
||||
#[cfg(test)]
|
||||
pub(crate) mod testing;
|
||||
|
||||
pub use chain::ProviderChain;
|
||||
pub use clean::strip_thinking_tags;
|
||||
pub use error::LlmError;
|
||||
pub use provider::{CompletionRequest, LlmProvider, Message};
|
||||
34
crates/webclaw-llm/src/provider.rs
Normal file
34
crates/webclaw-llm/src/provider.rs
Normal file
|
|
@ -0,0 +1,34 @@
|
|||
/// Core LLM abstraction. Every backend (Ollama, OpenAI, Anthropic) implements `LlmProvider`.
|
||||
/// The trait is intentionally minimal — just completion and availability check.
|
||||
use async_trait::async_trait;
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
use crate::error::LlmError;
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct CompletionRequest {
|
||||
pub model: String,
|
||||
pub messages: Vec<Message>,
|
||||
pub temperature: Option<f32>,
|
||||
pub max_tokens: Option<u32>,
|
||||
/// When true, instruct the provider to return valid JSON.
|
||||
pub json_mode: bool,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct Message {
|
||||
pub role: String,
|
||||
pub content: String,
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
pub trait LlmProvider: Send + Sync {
|
||||
/// Send a completion request and return the assistant's text response.
|
||||
async fn complete(&self, request: &CompletionRequest) -> Result<String, LlmError>;
|
||||
|
||||
/// Quick health check — is this provider reachable / configured?
|
||||
async fn is_available(&self) -> bool;
|
||||
|
||||
/// Human-readable name for logging.
|
||||
fn name(&self) -> &str;
|
||||
}
|
||||
170
crates/webclaw-llm/src/providers/anthropic.rs
Normal file
170
crates/webclaw-llm/src/providers/anthropic.rs
Normal file
|
|
@ -0,0 +1,170 @@
|
|||
/// Anthropic provider — Claude models via api.anthropic.com.
|
||||
/// Anthropic's API differs from OpenAI: system message is a top-level param,
|
||||
/// not part of the messages array.
|
||||
use async_trait::async_trait;
|
||||
use serde_json::json;
|
||||
|
||||
use crate::clean::strip_thinking_tags;
|
||||
use crate::error::LlmError;
|
||||
use crate::provider::{CompletionRequest, LlmProvider};
|
||||
|
||||
use super::load_api_key;
|
||||
|
||||
const ANTHROPIC_API_URL: &str = "https://api.anthropic.com/v1/messages";
|
||||
const ANTHROPIC_VERSION: &str = "2023-06-01";
|
||||
|
||||
pub struct AnthropicProvider {
|
||||
client: reqwest::Client,
|
||||
key: String,
|
||||
default_model: String,
|
||||
}
|
||||
|
||||
impl AnthropicProvider {
|
||||
/// Returns `None` if no API key is available (param or env).
|
||||
pub fn new(key_override: Option<String>, model: Option<String>) -> Option<Self> {
|
||||
let key = load_api_key(key_override, "ANTHROPIC_API_KEY")?;
|
||||
|
||||
Some(Self {
|
||||
client: reqwest::Client::new(),
|
||||
key,
|
||||
default_model: model.unwrap_or_else(|| "claude-sonnet-4-20250514".into()),
|
||||
})
|
||||
}
|
||||
|
||||
pub fn default_model(&self) -> &str {
|
||||
&self.default_model
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl LlmProvider for AnthropicProvider {
|
||||
async fn complete(&self, request: &CompletionRequest) -> Result<String, LlmError> {
|
||||
let model = if request.model.is_empty() {
|
||||
&self.default_model
|
||||
} else {
|
||||
&request.model
|
||||
};
|
||||
|
||||
// Anthropic separates system from messages. Extract the system message if present.
|
||||
let system_content: Option<String> = request
|
||||
.messages
|
||||
.iter()
|
||||
.find(|m| m.role == "system")
|
||||
.map(|m| m.content.clone());
|
||||
|
||||
let messages: Vec<serde_json::Value> = request
|
||||
.messages
|
||||
.iter()
|
||||
.filter(|m| m.role != "system")
|
||||
.map(|m| json!({ "role": m.role, "content": m.content }))
|
||||
.collect();
|
||||
|
||||
let mut body = json!({
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
"max_tokens": request.max_tokens.unwrap_or(4096),
|
||||
});
|
||||
|
||||
if let Some(system) = &system_content {
|
||||
body["system"] = json!(system);
|
||||
}
|
||||
if let Some(temp) = request.temperature {
|
||||
body["temperature"] = json!(temp);
|
||||
}
|
||||
|
||||
let resp = self
|
||||
.client
|
||||
.post(ANTHROPIC_API_URL)
|
||||
.header("x-api-key", &self.key)
|
||||
.header("anthropic-version", ANTHROPIC_VERSION)
|
||||
.header("content-type", "application/json")
|
||||
.json(&body)
|
||||
.send()
|
||||
.await?;
|
||||
|
||||
if !resp.status().is_success() {
|
||||
let status = resp.status();
|
||||
let text = resp.text().await.unwrap_or_default();
|
||||
let safe_text = if text.len() > 500 {
|
||||
&text[..500]
|
||||
} else {
|
||||
&text
|
||||
};
|
||||
return Err(LlmError::ProviderError(format!(
|
||||
"anthropic returned {status}: {safe_text}"
|
||||
)));
|
||||
}
|
||||
|
||||
let json: serde_json::Value = resp.json().await?;
|
||||
|
||||
// Anthropic response: {"content": [{"type": "text", "text": "..."}]}
|
||||
let raw = json["content"][0]["text"]
|
||||
.as_str()
|
||||
.map(String::from)
|
||||
.ok_or_else(|| {
|
||||
LlmError::InvalidJson("missing content[0].text in anthropic response".into())
|
||||
})?;
|
||||
|
||||
Ok(strip_thinking_tags(&raw))
|
||||
}
|
||||
|
||||
async fn is_available(&self) -> bool {
|
||||
!self.key.is_empty()
|
||||
}
|
||||
|
||||
fn name(&self) -> &str {
|
||||
"anthropic"
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn empty_key_returns_none() {
|
||||
assert!(AnthropicProvider::new(Some(String::new()), None).is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn explicit_key_constructs() {
|
||||
let provider =
|
||||
AnthropicProvider::new(Some("sk-ant-test".into()), None).expect("should construct");
|
||||
assert_eq!(provider.name(), "anthropic");
|
||||
assert_eq!(provider.default_model, "claude-sonnet-4-20250514");
|
||||
assert_eq!(provider.key, "sk-ant-test");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn custom_model() {
|
||||
let provider =
|
||||
AnthropicProvider::new(Some("sk-ant-test".into()), Some("claude-3-haiku".into()))
|
||||
.unwrap();
|
||||
assert_eq!(provider.default_model, "claude-3-haiku");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn default_model_accessor() {
|
||||
let provider = AnthropicProvider::new(Some("sk-ant-test".into()), None).unwrap();
|
||||
assert_eq!(provider.default_model(), "claude-sonnet-4-20250514");
|
||||
}
|
||||
|
||||
// Env var fallback tests mutate process-global state and race with parallel tests.
|
||||
// The code path is trivial (load_api_key -> env::var().ok()). Run in isolation if needed:
|
||||
// cargo test -p webclaw-llm env_var -- --ignored --test-threads=1
|
||||
#[test]
|
||||
#[ignore = "mutates process env; run with --test-threads=1"]
|
||||
fn env_var_key_fallback() {
|
||||
unsafe { std::env::set_var("ANTHROPIC_API_KEY", "sk-ant-env") };
|
||||
let provider = AnthropicProvider::new(None, None).expect("should construct from env");
|
||||
assert_eq!(provider.key, "sk-ant-env");
|
||||
unsafe { std::env::remove_var("ANTHROPIC_API_KEY") };
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[ignore = "mutates process env; run with --test-threads=1"]
|
||||
fn no_key_returns_none() {
|
||||
unsafe { std::env::remove_var("ANTHROPIC_API_KEY") };
|
||||
assert!(AnthropicProvider::new(None, None).is_none());
|
||||
}
|
||||
}
|
||||
36
crates/webclaw-llm/src/providers/mod.rs
Normal file
36
crates/webclaw-llm/src/providers/mod.rs
Normal file
|
|
@ -0,0 +1,36 @@
|
|||
pub mod anthropic;
|
||||
pub mod ollama;
|
||||
pub mod openai;
|
||||
|
||||
/// Load an API key from an explicit override or an environment variable.
|
||||
/// Returns `None` if neither is set or the value is empty.
|
||||
pub(crate) fn load_api_key(override_key: Option<String>, env_var: &str) -> Option<String> {
|
||||
let key = override_key.or_else(|| std::env::var(env_var).ok())?;
|
||||
if key.is_empty() { None } else { Some(key) }
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn override_key_takes_precedence() {
|
||||
assert_eq!(
|
||||
load_api_key(Some("explicit".into()), "NONEXISTENT_VAR"),
|
||||
Some("explicit".into())
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn empty_override_returns_none() {
|
||||
assert_eq!(load_api_key(Some(String::new()), "NONEXISTENT_VAR"), None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn none_override_with_no_env_returns_none() {
|
||||
assert_eq!(
|
||||
load_api_key(None, "WEBCLAW_TEST_NONEXISTENT_KEY_12345"),
|
||||
None
|
||||
);
|
||||
}
|
||||
}
|
||||
161
crates/webclaw-llm/src/providers/ollama.rs
Normal file
161
crates/webclaw-llm/src/providers/ollama.rs
Normal file
|
|
@ -0,0 +1,161 @@
|
|||
/// Ollama provider — talks to a local Ollama instance (default localhost:11434).
|
||||
/// First choice in the provider chain: free, private, fast on Apple Silicon.
|
||||
use async_trait::async_trait;
|
||||
use serde_json::json;
|
||||
|
||||
use crate::clean::strip_thinking_tags;
|
||||
use crate::error::LlmError;
|
||||
use crate::provider::{CompletionRequest, LlmProvider};
|
||||
|
||||
pub struct OllamaProvider {
|
||||
client: reqwest::Client,
|
||||
base_url: String,
|
||||
default_model: String,
|
||||
}
|
||||
|
||||
impl OllamaProvider {
|
||||
pub fn new(base_url: Option<String>, model: Option<String>) -> Self {
|
||||
let base_url = base_url
|
||||
.or_else(|| std::env::var("OLLAMA_HOST").ok())
|
||||
.unwrap_or_else(|| "http://localhost:11434".into());
|
||||
|
||||
let default_model = model
|
||||
.or_else(|| std::env::var("OLLAMA_MODEL").ok())
|
||||
.unwrap_or_else(|| "qwen3:8b".into());
|
||||
|
||||
Self {
|
||||
client: reqwest::Client::new(),
|
||||
base_url,
|
||||
default_model,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn default_model(&self) -> &str {
|
||||
&self.default_model
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl LlmProvider for OllamaProvider {
|
||||
async fn complete(&self, request: &CompletionRequest) -> Result<String, LlmError> {
|
||||
let model = if request.model.is_empty() {
|
||||
&self.default_model
|
||||
} else {
|
||||
&request.model
|
||||
};
|
||||
|
||||
let messages: Vec<serde_json::Value> = request
|
||||
.messages
|
||||
.iter()
|
||||
.map(|m| json!({ "role": m.role, "content": m.content }))
|
||||
.collect();
|
||||
|
||||
let mut body = json!({
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
"stream": false,
|
||||
"think": false,
|
||||
});
|
||||
|
||||
if request.json_mode {
|
||||
body["format"] = json!("json");
|
||||
}
|
||||
if let Some(temp) = request.temperature {
|
||||
body["options"] = json!({ "temperature": temp });
|
||||
}
|
||||
|
||||
let url = format!("{}/api/chat", self.base_url);
|
||||
let resp = self.client.post(&url).json(&body).send().await?;
|
||||
|
||||
if !resp.status().is_success() {
|
||||
let status = resp.status();
|
||||
let text = resp.text().await.unwrap_or_default();
|
||||
let safe_text = if text.len() > 500 {
|
||||
&text[..500]
|
||||
} else {
|
||||
&text
|
||||
};
|
||||
return Err(LlmError::ProviderError(format!(
|
||||
"ollama returned {status}: {safe_text}"
|
||||
)));
|
||||
}
|
||||
|
||||
let json: serde_json::Value = resp.json().await?;
|
||||
|
||||
let raw = json["message"]["content"]
|
||||
.as_str()
|
||||
.map(String::from)
|
||||
.ok_or_else(|| {
|
||||
LlmError::InvalidJson(format!(
|
||||
"missing message.content in ollama response: {json}"
|
||||
))
|
||||
})?;
|
||||
|
||||
Ok(strip_thinking_tags(&raw))
|
||||
}
|
||||
|
||||
async fn is_available(&self) -> bool {
|
||||
let url = format!("{}/api/tags", self.base_url);
|
||||
matches!(self.client.get(&url).send().await, Ok(r) if r.status().is_success())
|
||||
}
|
||||
|
||||
fn name(&self) -> &str {
|
||||
"ollama"
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn explicit_params_used() {
|
||||
let provider = OllamaProvider::new(
|
||||
Some("http://gpu-box:11434".into()),
|
||||
Some("llama3:70b".into()),
|
||||
);
|
||||
assert_eq!(provider.base_url, "http://gpu-box:11434");
|
||||
assert_eq!(provider.default_model, "llama3:70b");
|
||||
assert_eq!(provider.name(), "ollama");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn explicit_model_overrides_any_env() {
|
||||
// Passing Some(...) bypasses env vars entirely -- no race possible
|
||||
let provider = OllamaProvider::new(None, Some("mistral:7b".into()));
|
||||
assert_eq!(provider.default_model, "mistral:7b");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn explicit_url_overrides_any_env() {
|
||||
let provider = OllamaProvider::new(Some("http://local:11434".into()), None);
|
||||
assert_eq!(provider.base_url, "http://local:11434");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn default_model_accessor() {
|
||||
let provider = OllamaProvider::new(None, Some("phi3:mini".into()));
|
||||
assert_eq!(provider.default_model(), "phi3:mini");
|
||||
}
|
||||
|
||||
// Env var fallback is a trivial `env::var().ok()` -- not worth the flakiness
|
||||
// of manipulating process-global state. Run in isolation if needed:
|
||||
// cargo test -p webclaw-llm env_var_fallback -- --ignored --test-threads=1
|
||||
#[test]
|
||||
#[ignore = "mutates process env; run with --test-threads=1"]
|
||||
fn env_var_fallback() {
|
||||
unsafe {
|
||||
std::env::set_var("OLLAMA_HOST", "http://remote:11434");
|
||||
std::env::set_var("OLLAMA_MODEL", "mistral:7b");
|
||||
}
|
||||
|
||||
let provider = OllamaProvider::new(None, None);
|
||||
assert_eq!(provider.base_url, "http://remote:11434");
|
||||
assert_eq!(provider.default_model, "mistral:7b");
|
||||
|
||||
unsafe {
|
||||
std::env::remove_var("OLLAMA_HOST");
|
||||
std::env::remove_var("OLLAMA_MODEL");
|
||||
}
|
||||
}
|
||||
}
|
||||
181
crates/webclaw-llm/src/providers/openai.rs
Normal file
181
crates/webclaw-llm/src/providers/openai.rs
Normal file
|
|
@ -0,0 +1,181 @@
|
|||
/// OpenAI provider — works with api.openai.com and any OpenAI-compatible endpoint.
|
||||
use async_trait::async_trait;
|
||||
use serde_json::json;
|
||||
|
||||
use crate::clean::strip_thinking_tags;
|
||||
use crate::error::LlmError;
|
||||
use crate::provider::{CompletionRequest, LlmProvider};
|
||||
|
||||
use super::load_api_key;
|
||||
|
||||
pub struct OpenAiProvider {
|
||||
client: reqwest::Client,
|
||||
key: String,
|
||||
base_url: String,
|
||||
default_model: String,
|
||||
}
|
||||
|
||||
impl OpenAiProvider {
|
||||
/// Returns `None` if no API key is available (param or env).
|
||||
pub fn new(
|
||||
key_override: Option<String>,
|
||||
base_url: Option<String>,
|
||||
model: Option<String>,
|
||||
) -> Option<Self> {
|
||||
let key = load_api_key(key_override, "OPENAI_API_KEY")?;
|
||||
|
||||
Some(Self {
|
||||
client: reqwest::Client::new(),
|
||||
key,
|
||||
base_url: base_url
|
||||
.or_else(|| std::env::var("OPENAI_BASE_URL").ok())
|
||||
.unwrap_or_else(|| "https://api.openai.com/v1".into()),
|
||||
default_model: model.unwrap_or_else(|| "gpt-4o-mini".into()),
|
||||
})
|
||||
}
|
||||
|
||||
pub fn default_model(&self) -> &str {
|
||||
&self.default_model
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl LlmProvider for OpenAiProvider {
|
||||
async fn complete(&self, request: &CompletionRequest) -> Result<String, LlmError> {
|
||||
let model = if request.model.is_empty() {
|
||||
&self.default_model
|
||||
} else {
|
||||
&request.model
|
||||
};
|
||||
|
||||
let messages: Vec<serde_json::Value> = request
|
||||
.messages
|
||||
.iter()
|
||||
.map(|m| json!({ "role": m.role, "content": m.content }))
|
||||
.collect();
|
||||
|
||||
let mut body = json!({
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
});
|
||||
|
||||
if request.json_mode {
|
||||
body["response_format"] = json!({ "type": "json_object" });
|
||||
}
|
||||
if let Some(temp) = request.temperature {
|
||||
body["temperature"] = json!(temp);
|
||||
}
|
||||
if let Some(max) = request.max_tokens {
|
||||
body["max_tokens"] = json!(max);
|
||||
}
|
||||
|
||||
let url = format!("{}/chat/completions", self.base_url);
|
||||
let resp = self
|
||||
.client
|
||||
.post(&url)
|
||||
.header("Authorization", format!("Bearer {}", self.key))
|
||||
.json(&body)
|
||||
.send()
|
||||
.await?;
|
||||
|
||||
if !resp.status().is_success() {
|
||||
let status = resp.status();
|
||||
let text = resp.text().await.unwrap_or_default();
|
||||
let safe_text = if text.len() > 500 {
|
||||
&text[..500]
|
||||
} else {
|
||||
&text
|
||||
};
|
||||
return Err(LlmError::ProviderError(format!(
|
||||
"openai returned {status}: {safe_text}"
|
||||
)));
|
||||
}
|
||||
|
||||
let json: serde_json::Value = resp.json().await?;
|
||||
|
||||
let raw = json["choices"][0]["message"]["content"]
|
||||
.as_str()
|
||||
.map(String::from)
|
||||
.ok_or_else(|| {
|
||||
LlmError::InvalidJson(
|
||||
"missing choices[0].message.content in openai response".into(),
|
||||
)
|
||||
})?;
|
||||
|
||||
Ok(strip_thinking_tags(&raw))
|
||||
}
|
||||
|
||||
async fn is_available(&self) -> bool {
|
||||
!self.key.is_empty()
|
||||
}
|
||||
|
||||
fn name(&self) -> &str {
|
||||
"openai"
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn empty_key_returns_none() {
|
||||
assert!(OpenAiProvider::new(Some(String::new()), None, None).is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn explicit_key_constructs() {
|
||||
let provider = OpenAiProvider::new(
|
||||
Some("test-key-123".into()),
|
||||
Some("https://api.openai.com/v1".into()),
|
||||
Some("gpt-4o-mini".into()),
|
||||
)
|
||||
.expect("should construct");
|
||||
assert_eq!(provider.name(), "openai");
|
||||
assert_eq!(provider.default_model, "gpt-4o-mini");
|
||||
assert_eq!(provider.base_url, "https://api.openai.com/v1");
|
||||
assert_eq!(provider.key, "test-key-123");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn custom_base_url_and_model() {
|
||||
let provider = OpenAiProvider::new(
|
||||
Some("test-key".into()),
|
||||
Some("http://localhost:8080/v1".into()),
|
||||
Some("gpt-3.5-turbo".into()),
|
||||
)
|
||||
.unwrap();
|
||||
assert_eq!(provider.base_url, "http://localhost:8080/v1");
|
||||
assert_eq!(provider.default_model, "gpt-3.5-turbo");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn default_model_accessor() {
|
||||
let provider = OpenAiProvider::new(
|
||||
Some("test-key".into()),
|
||||
Some("https://api.openai.com/v1".into()),
|
||||
None,
|
||||
)
|
||||
.unwrap();
|
||||
assert_eq!(provider.default_model(), "gpt-4o-mini");
|
||||
}
|
||||
|
||||
// Env var fallback tests mutate process-global state and race with parallel tests.
|
||||
// The code path is trivial (load_api_key -> env::var().ok()). Run in isolation if needed:
|
||||
// cargo test -p webclaw-llm env_var -- --ignored --test-threads=1
|
||||
#[test]
|
||||
#[ignore = "mutates process env; run with --test-threads=1"]
|
||||
fn env_var_key_fallback() {
|
||||
unsafe { std::env::set_var("OPENAI_API_KEY", "sk-env-key") };
|
||||
let provider = OpenAiProvider::new(None, None, None).expect("should construct from env");
|
||||
assert_eq!(provider.key, "sk-env-key");
|
||||
unsafe { std::env::remove_var("OPENAI_API_KEY") };
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[ignore = "mutates process env; run with --test-threads=1"]
|
||||
fn no_key_returns_none() {
|
||||
unsafe { std::env::remove_var("OPENAI_API_KEY") };
|
||||
assert!(OpenAiProvider::new(None, None, None).is_none());
|
||||
}
|
||||
}
|
||||
124
crates/webclaw-llm/src/summarize.rs
Normal file
124
crates/webclaw-llm/src/summarize.rs
Normal file
|
|
@ -0,0 +1,124 @@
|
|||
/// LLM-powered content summarization. Keeps it simple: one function, one prompt.
|
||||
use crate::clean::strip_thinking_tags;
|
||||
use crate::error::LlmError;
|
||||
use crate::provider::{CompletionRequest, LlmProvider, Message};
|
||||
|
||||
/// Summarize content using an LLM.
|
||||
/// Returns plain text (not JSON). Default is 3 sentences.
|
||||
pub async fn summarize(
|
||||
content: &str,
|
||||
max_sentences: Option<usize>,
|
||||
provider: &dyn LlmProvider,
|
||||
model: Option<&str>,
|
||||
) -> Result<String, LlmError> {
|
||||
let n = max_sentences.unwrap_or(3);
|
||||
|
||||
let system = format!(
|
||||
"You are a summarization engine. Summarize the following content in exactly {n} sentences. \
|
||||
Output ONLY the summary, nothing else. No introductions, no questions, no formatting, no preamble."
|
||||
);
|
||||
|
||||
let request = CompletionRequest {
|
||||
model: model.unwrap_or_default().to_string(),
|
||||
messages: vec![
|
||||
Message {
|
||||
role: "system".into(),
|
||||
content: system,
|
||||
},
|
||||
Message {
|
||||
role: "user".into(),
|
||||
content: content.to_string(),
|
||||
},
|
||||
],
|
||||
temperature: Some(0.3),
|
||||
max_tokens: None,
|
||||
json_mode: false,
|
||||
};
|
||||
|
||||
let response = provider.complete(&request).await?;
|
||||
|
||||
// Providers already strip thinking tags, but defense in depth for summarize
|
||||
// since its output goes directly to the user as plain text
|
||||
Ok(strip_thinking_tags(&response))
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use async_trait::async_trait;
|
||||
|
||||
struct MockSummarizer;
|
||||
|
||||
#[async_trait]
|
||||
impl LlmProvider for MockSummarizer {
|
||||
async fn complete(&self, req: &CompletionRequest) -> Result<String, LlmError> {
|
||||
// Verify the prompt is well-formed
|
||||
let system = &req.messages[0].content;
|
||||
assert!(system.contains("sentences"));
|
||||
assert!(system.contains("summarization engine"));
|
||||
assert!(!req.json_mode, "summarize should not use json_mode");
|
||||
Ok("This is a test summary.".into())
|
||||
}
|
||||
async fn is_available(&self) -> bool {
|
||||
true
|
||||
}
|
||||
fn name(&self) -> &str {
|
||||
"mock"
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn summarize_returns_text() {
|
||||
let result = summarize("Long article content...", None, &MockSummarizer, None)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(result, "This is a test summary.");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn summarize_custom_sentence_count() {
|
||||
// Verify custom count is passed through
|
||||
struct CountChecker;
|
||||
|
||||
#[async_trait]
|
||||
impl LlmProvider for CountChecker {
|
||||
async fn complete(&self, req: &CompletionRequest) -> Result<String, LlmError> {
|
||||
assert!(req.messages[0].content.contains("5 sentences"));
|
||||
Ok("Summary.".into())
|
||||
}
|
||||
async fn is_available(&self) -> bool {
|
||||
true
|
||||
}
|
||||
fn name(&self) -> &str {
|
||||
"count_checker"
|
||||
}
|
||||
}
|
||||
|
||||
summarize("Content", Some(5), &CountChecker, None)
|
||||
.await
|
||||
.unwrap();
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn summarize_strips_thinking_tags() {
|
||||
struct ThinkingMock;
|
||||
|
||||
#[async_trait]
|
||||
impl LlmProvider for ThinkingMock {
|
||||
async fn complete(&self, _req: &CompletionRequest) -> Result<String, LlmError> {
|
||||
Ok("<think>let me analyze this</think>This is the clean summary.".into())
|
||||
}
|
||||
async fn is_available(&self) -> bool {
|
||||
true
|
||||
}
|
||||
fn name(&self) -> &str {
|
||||
"thinking_mock"
|
||||
}
|
||||
}
|
||||
|
||||
let result = summarize("Some content", None, &ThinkingMock, None)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(result, "This is the clean summary.");
|
||||
}
|
||||
}
|
||||
48
crates/webclaw-llm/src/testing.rs
Normal file
48
crates/webclaw-llm/src/testing.rs
Normal file
|
|
@ -0,0 +1,48 @@
|
|||
/// Shared test utilities for webclaw-llm.
|
||||
///
|
||||
/// Provides a configurable mock LLM provider for unit tests across
|
||||
/// extract, chain, and other modules that need a fake LLM backend.
|
||||
#[cfg(test)]
|
||||
pub(crate) mod mock {
|
||||
use async_trait::async_trait;
|
||||
|
||||
use crate::error::LlmError;
|
||||
use crate::provider::{CompletionRequest, LlmProvider};
|
||||
|
||||
/// A mock LLM provider that returns a canned response or error.
|
||||
/// Covers the common test cases: success, failure, and availability.
|
||||
pub struct MockProvider {
|
||||
pub name: &'static str,
|
||||
pub response: Result<String, String>,
|
||||
pub available: bool,
|
||||
}
|
||||
|
||||
impl MockProvider {
|
||||
/// Shorthand for a mock that always succeeds with the given response.
|
||||
pub fn ok(response: &str) -> Self {
|
||||
Self {
|
||||
name: "mock",
|
||||
response: Ok(response.to_string()),
|
||||
available: true,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl LlmProvider for MockProvider {
|
||||
async fn complete(&self, _request: &CompletionRequest) -> Result<String, LlmError> {
|
||||
match &self.response {
|
||||
Ok(text) => Ok(text.clone()),
|
||||
Err(msg) => Err(LlmError::ProviderError(msg.clone())),
|
||||
}
|
||||
}
|
||||
|
||||
async fn is_available(&self) -> bool {
|
||||
self.available
|
||||
}
|
||||
|
||||
fn name(&self) -> &str {
|
||||
self.name
|
||||
}
|
||||
}
|
||||
}
|
||||
25
crates/webclaw-mcp/Cargo.toml
Normal file
25
crates/webclaw-mcp/Cargo.toml
Normal file
|
|
@ -0,0 +1,25 @@
|
|||
[package]
|
||||
name = "webclaw-mcp"
|
||||
description = "MCP server for webclaw web extraction toolkit"
|
||||
version.workspace = true
|
||||
edition.workspace = true
|
||||
license.workspace = true
|
||||
|
||||
[[bin]]
|
||||
name = "webclaw-mcp"
|
||||
path = "src/main.rs"
|
||||
|
||||
[dependencies]
|
||||
webclaw-core = { workspace = true }
|
||||
webclaw-fetch = { workspace = true }
|
||||
webclaw-llm = { workspace = true }
|
||||
webclaw-pdf = { workspace = true }
|
||||
rmcp = { version = "1.2", features = ["server", "macros", "transport-io", "schemars"] }
|
||||
schemars = "1.0"
|
||||
dotenvy = { workspace = true }
|
||||
serde = { workspace = true }
|
||||
serde_json = { workspace = true }
|
||||
tokio = { workspace = true }
|
||||
tracing = { workspace = true }
|
||||
tracing-subscriber = { workspace = true }
|
||||
reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
|
||||
292
crates/webclaw-mcp/src/cloud.rs
Normal file
292
crates/webclaw-mcp/src/cloud.rs
Normal file
|
|
@ -0,0 +1,292 @@
|
|||
/// Cloud API fallback for protected sites.
|
||||
///
|
||||
/// When local fetch returns a challenge page, this module retries
|
||||
/// via api.webclaw.io. Requires WEBCLAW_API_KEY to be set.
|
||||
use std::collections::HashMap;
|
||||
|
||||
use serde_json::{Value, json};
|
||||
use tracing::info;
|
||||
|
||||
const API_BASE: &str = "https://api.webclaw.io/v1";
|
||||
|
||||
/// Lightweight client for the webclaw cloud API.
|
||||
pub struct CloudClient {
|
||||
api_key: String,
|
||||
http: reqwest::Client,
|
||||
}
|
||||
|
||||
impl CloudClient {
|
||||
/// Create a new cloud client from WEBCLAW_API_KEY env var.
|
||||
/// Returns None if the key is not set.
|
||||
pub fn from_env() -> Option<Self> {
|
||||
let key = std::env::var("WEBCLAW_API_KEY").ok()?;
|
||||
if key.is_empty() {
|
||||
return None;
|
||||
}
|
||||
Some(Self {
|
||||
api_key: key,
|
||||
http: reqwest::Client::new(),
|
||||
})
|
||||
}
|
||||
|
||||
/// Scrape a URL via the cloud API. Returns the response JSON.
|
||||
pub async fn scrape(
|
||||
&self,
|
||||
url: &str,
|
||||
formats: &[&str],
|
||||
include_selectors: &[String],
|
||||
exclude_selectors: &[String],
|
||||
only_main_content: bool,
|
||||
) -> Result<Value, String> {
|
||||
let mut body = json!({
|
||||
"url": url,
|
||||
"formats": formats,
|
||||
});
|
||||
|
||||
if only_main_content {
|
||||
body["only_main_content"] = json!(true);
|
||||
}
|
||||
if !include_selectors.is_empty() {
|
||||
body["include_selectors"] = json!(include_selectors);
|
||||
}
|
||||
if !exclude_selectors.is_empty() {
|
||||
body["exclude_selectors"] = json!(exclude_selectors);
|
||||
}
|
||||
|
||||
self.post("scrape", body).await
|
||||
}
|
||||
|
||||
/// Generic POST to the cloud API.
|
||||
pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> {
|
||||
let resp = self
|
||||
.http
|
||||
.post(format!("{API_BASE}/{endpoint}"))
|
||||
.header("Authorization", format!("Bearer {}", self.api_key))
|
||||
.json(&body)
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| format!("Cloud API request failed: {e}"))?;
|
||||
|
||||
let status = resp.status();
|
||||
if !status.is_success() {
|
||||
let text = resp.text().await.unwrap_or_default();
|
||||
return Err(format!("Cloud API error {status}: {text}"));
|
||||
}
|
||||
|
||||
resp.json::<Value>()
|
||||
.await
|
||||
.map_err(|e| format!("Cloud API response parse failed: {e}"))
|
||||
}
|
||||
|
||||
/// Generic GET from the cloud API.
|
||||
pub async fn get(&self, endpoint: &str) -> Result<Value, String> {
|
||||
let resp = self
|
||||
.http
|
||||
.get(format!("{API_BASE}/{endpoint}"))
|
||||
.header("Authorization", format!("Bearer {}", self.api_key))
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| format!("Cloud API request failed: {e}"))?;
|
||||
|
||||
let status = resp.status();
|
||||
if !status.is_success() {
|
||||
let text = resp.text().await.unwrap_or_default();
|
||||
return Err(format!("Cloud API error {status}: {text}"));
|
||||
}
|
||||
|
||||
resp.json::<Value>()
|
||||
.await
|
||||
.map_err(|e| format!("Cloud API response parse failed: {e}"))
|
||||
}
|
||||
}
|
||||
|
||||
/// Check if fetched HTML looks like a bot protection challenge page.
|
||||
/// Detects common bot protection challenge pages.
|
||||
pub fn is_bot_protected(html: &str, headers: &HashMap<String, String>) -> bool {
|
||||
let html_lower = html.to_lowercase();
|
||||
|
||||
// Cloudflare challenge page
|
||||
if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
|
||||
return true;
|
||||
}
|
||||
|
||||
// Cloudflare "checking your browser" spinner
|
||||
if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
|
||||
&& html_lower.contains("cf-spinner")
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
// Cloudflare Turnstile (only on short pages = challenge, not embedded on real content)
|
||||
if (html_lower.contains("cf-turnstile")
|
||||
|| html_lower.contains("challenges.cloudflare.com/turnstile"))
|
||||
&& html.len() < 100_000
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
// DataDome
|
||||
if html_lower.contains("geo.captcha-delivery.com")
|
||||
|| html_lower.contains("captcha-delivery.com/captcha")
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
// AWS WAF
|
||||
if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
|
||||
return true;
|
||||
}
|
||||
|
||||
// hCaptcha blocking page
|
||||
if html_lower.contains("hcaptcha.com")
|
||||
&& html_lower.contains("h-captcha")
|
||||
&& html.len() < 50_000
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
// Cloudflare via headers + challenge body
|
||||
let has_cf_headers = headers
|
||||
.iter()
|
||||
.any(|(k, _)| k.eq_ignore_ascii_case("cf-ray") || k.eq_ignore_ascii_case("cf-mitigated"));
|
||||
if has_cf_headers
|
||||
&& (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
false
|
||||
}
|
||||
|
||||
/// Check if a page likely needs JS rendering (SPA with almost no text content).
|
||||
pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
|
||||
let has_scripts = html.contains("<script");
|
||||
|
||||
// Tier 1: almost no extractable text from a large page
|
||||
if word_count < 50 && html.len() > 5_000 && has_scripts {
|
||||
return true;
|
||||
}
|
||||
|
||||
// Tier 2: SPA framework detected with suspiciously low content-to-HTML ratio
|
||||
if word_count < 800 && html.len() > 50_000 && has_scripts {
|
||||
let html_lower = html.to_lowercase();
|
||||
let has_spa_marker = html_lower.contains("react-app")
|
||||
|| html_lower.contains("id=\"__next\"")
|
||||
|| html_lower.contains("id=\"root\"")
|
||||
|| html_lower.contains("id=\"app\"")
|
||||
|| html_lower.contains("__next_data__")
|
||||
|| html_lower.contains("nuxt")
|
||||
|| html_lower.contains("ng-app");
|
||||
|
||||
if has_spa_marker {
|
||||
return true;
|
||||
}
|
||||
}
|
||||
|
||||
false
|
||||
}
|
||||
|
||||
/// Result of a smart fetch: either local extraction or cloud API response.
|
||||
pub enum SmartFetchResult {
|
||||
/// Successfully extracted locally.
|
||||
Local(Box<webclaw_core::ExtractionResult>),
|
||||
/// Fell back to cloud API. Contains the API response JSON.
|
||||
Cloud(Value),
|
||||
}
|
||||
|
||||
/// Try local fetch first, fall back to cloud API if bot-protected or JS-rendered.
|
||||
///
|
||||
/// Returns the extraction result (local) or the cloud API response JSON.
|
||||
/// If no API key is configured and local fetch is blocked, returns an error
|
||||
/// with a helpful message.
|
||||
pub async fn smart_fetch(
|
||||
client: &webclaw_fetch::FetchClient,
|
||||
cloud: Option<&CloudClient>,
|
||||
url: &str,
|
||||
include_selectors: &[String],
|
||||
exclude_selectors: &[String],
|
||||
only_main_content: bool,
|
||||
formats: &[&str],
|
||||
) -> Result<SmartFetchResult, String> {
|
||||
// Step 1: Try local fetch
|
||||
let fetch_result = client
|
||||
.fetch(url)
|
||||
.await
|
||||
.map_err(|e| format!("Fetch failed: {e}"))?;
|
||||
|
||||
// Step 2: Check for bot protection
|
||||
if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
|
||||
info!(url, "bot protection detected, falling back to cloud API");
|
||||
return cloud_fallback(
|
||||
cloud,
|
||||
url,
|
||||
include_selectors,
|
||||
exclude_selectors,
|
||||
only_main_content,
|
||||
formats,
|
||||
)
|
||||
.await;
|
||||
}
|
||||
|
||||
// Step 3: Extract locally
|
||||
let options = webclaw_core::ExtractionOptions {
|
||||
include_selectors: include_selectors.to_vec(),
|
||||
exclude_selectors: exclude_selectors.to_vec(),
|
||||
only_main_content,
|
||||
include_raw_html: false,
|
||||
};
|
||||
|
||||
let extraction =
|
||||
webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
|
||||
.map_err(|e| format!("Extraction failed: {e}"))?;
|
||||
|
||||
// Step 4: Check for JS-rendered pages (low content from large HTML)
|
||||
if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
|
||||
info!(
|
||||
url,
|
||||
word_count = extraction.metadata.word_count,
|
||||
html_len = fetch_result.html.len(),
|
||||
"JS-rendered page detected, falling back to cloud API"
|
||||
);
|
||||
return cloud_fallback(
|
||||
cloud,
|
||||
url,
|
||||
include_selectors,
|
||||
exclude_selectors,
|
||||
only_main_content,
|
||||
formats,
|
||||
)
|
||||
.await;
|
||||
}
|
||||
|
||||
Ok(SmartFetchResult::Local(Box::new(extraction)))
|
||||
}
|
||||
|
||||
async fn cloud_fallback(
|
||||
cloud: Option<&CloudClient>,
|
||||
url: &str,
|
||||
include_selectors: &[String],
|
||||
exclude_selectors: &[String],
|
||||
only_main_content: bool,
|
||||
formats: &[&str],
|
||||
) -> Result<SmartFetchResult, String> {
|
||||
match cloud {
|
||||
Some(c) => {
|
||||
let resp = c
|
||||
.scrape(
|
||||
url,
|
||||
formats,
|
||||
include_selectors,
|
||||
exclude_selectors,
|
||||
only_main_content,
|
||||
)
|
||||
.await?;
|
||||
info!(url, "cloud API fallback successful");
|
||||
Ok(SmartFetchResult::Cloud(resp))
|
||||
}
|
||||
None => Err(format!(
|
||||
"Bot protection detected on {url}. Set WEBCLAW_API_KEY for automatic cloud bypass. \
|
||||
Get a key at https://webclaw.io"
|
||||
)),
|
||||
}
|
||||
}
|
||||
28
crates/webclaw-mcp/src/main.rs
Normal file
28
crates/webclaw-mcp/src/main.rs
Normal file
|
|
@ -0,0 +1,28 @@
|
|||
/// webclaw-mcp: MCP (Model Context Protocol) server for webclaw.
|
||||
/// Exposes web extraction tools over stdio transport for AI agents
|
||||
/// like Claude Desktop, Claude Code, and other MCP clients.
|
||||
mod cloud;
|
||||
mod server;
|
||||
mod tools;
|
||||
|
||||
use rmcp::ServiceExt;
|
||||
use rmcp::transport::stdio;
|
||||
|
||||
use server::WebclawMcp;
|
||||
|
||||
#[tokio::main]
|
||||
async fn main() -> Result<(), Box<dyn std::error::Error>> {
|
||||
dotenvy::dotenv().ok();
|
||||
|
||||
// Log to stderr -- stdout is the MCP transport channel
|
||||
tracing_subscriber::fmt()
|
||||
.with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
|
||||
.with_writer(std::io::stderr)
|
||||
.with_ansi(false)
|
||||
.init();
|
||||
|
||||
let service = WebclawMcp::new().await.serve(stdio()).await?;
|
||||
|
||||
service.waiting().await?;
|
||||
Ok(())
|
||||
}
|
||||
507
crates/webclaw-mcp/src/server.rs
Normal file
507
crates/webclaw-mcp/src/server.rs
Normal file
|
|
@ -0,0 +1,507 @@
|
|||
/// MCP server implementation for webclaw.
|
||||
/// Exposes web extraction capabilities as tools for AI agents.
|
||||
///
|
||||
/// Uses a local-first architecture: fetches pages directly, then falls back
|
||||
/// to the webclaw cloud API (api.webclaw.io) when bot protection or
|
||||
/// JS rendering is detected. Set WEBCLAW_API_KEY for automatic fallback.
|
||||
use std::sync::Arc;
|
||||
|
||||
use rmcp::handler::server::router::tool::ToolRouter;
|
||||
use rmcp::handler::server::wrapper::Parameters;
|
||||
use rmcp::model::{Implementation, ServerCapabilities, ServerInfo};
|
||||
use rmcp::{ServerHandler, tool, tool_handler, tool_router};
|
||||
use serde_json::json;
|
||||
use tracing::{info, warn};
|
||||
|
||||
use crate::cloud::{self, CloudClient, SmartFetchResult};
|
||||
use crate::tools::*;
|
||||
|
||||
pub struct WebclawMcp {
|
||||
tool_router: ToolRouter<Self>,
|
||||
fetch_client: Arc<webclaw_fetch::FetchClient>,
|
||||
llm_chain: Option<webclaw_llm::ProviderChain>,
|
||||
cloud: Option<CloudClient>,
|
||||
}
|
||||
|
||||
/// Parse a browser string into a BrowserProfile.
|
||||
fn parse_browser(browser: Option<&str>) -> webclaw_fetch::BrowserProfile {
|
||||
match browser {
|
||||
Some("firefox") => webclaw_fetch::BrowserProfile::Firefox,
|
||||
Some("random") => webclaw_fetch::BrowserProfile::Random,
|
||||
_ => webclaw_fetch::BrowserProfile::Chrome,
|
||||
}
|
||||
}
|
||||
|
||||
#[tool_router]
|
||||
impl WebclawMcp {
|
||||
pub async fn new() -> Self {
|
||||
let mut config = webclaw_fetch::FetchConfig::default();
|
||||
|
||||
// Auto-load proxies.txt if present
|
||||
if std::path::Path::new("proxies.txt").exists()
|
||||
&& let Ok(pool) = webclaw_fetch::parse_proxy_file("proxies.txt")
|
||||
&& !pool.is_empty()
|
||||
{
|
||||
info!(count = pool.len(), "loaded proxy pool from proxies.txt");
|
||||
config.proxy_pool = pool;
|
||||
}
|
||||
|
||||
let fetch_client =
|
||||
webclaw_fetch::FetchClient::new(config).expect("failed to build FetchClient");
|
||||
|
||||
let chain = webclaw_llm::ProviderChain::default().await;
|
||||
let llm_chain = if chain.is_empty() {
|
||||
warn!("no LLM providers available -- extract/summarize tools will fail");
|
||||
None
|
||||
} else {
|
||||
info!(providers = chain.len(), "LLM provider chain ready");
|
||||
Some(chain)
|
||||
};
|
||||
|
||||
let cloud = CloudClient::from_env();
|
||||
if cloud.is_some() {
|
||||
info!("cloud API fallback enabled (WEBCLAW_API_KEY set)");
|
||||
} else {
|
||||
warn!(
|
||||
"WEBCLAW_API_KEY not set -- bot-protected sites will return challenge pages. \
|
||||
Get a key at https://webclaw.io"
|
||||
);
|
||||
}
|
||||
|
||||
Self {
|
||||
tool_router: Self::tool_router(),
|
||||
fetch_client: Arc::new(fetch_client),
|
||||
llm_chain,
|
||||
cloud,
|
||||
}
|
||||
}
|
||||
|
||||
/// Helper: smart fetch with LLM format for extract/summarize tools.
|
||||
async fn smart_fetch_llm(&self, url: &str) -> Result<SmartFetchResult, String> {
|
||||
cloud::smart_fetch(
|
||||
&self.fetch_client,
|
||||
self.cloud.as_ref(),
|
||||
url,
|
||||
&[],
|
||||
&[],
|
||||
false,
|
||||
&["llm", "markdown"],
|
||||
)
|
||||
.await
|
||||
}
|
||||
|
||||
/// Scrape a single URL and extract its content as markdown, LLM-optimized text, plain text, or full JSON.
|
||||
/// Automatically falls back to the webclaw cloud API when bot protection or JS rendering is detected.
|
||||
#[tool]
|
||||
async fn scrape(&self, Parameters(params): Parameters<ScrapeParams>) -> Result<String, String> {
|
||||
let format = params.format.as_deref().unwrap_or("markdown");
|
||||
let browser = parse_browser(params.browser.as_deref());
|
||||
let include = params.include_selectors.unwrap_or_default();
|
||||
let exclude = params.exclude_selectors.unwrap_or_default();
|
||||
let main_only = params.only_main_content.unwrap_or(false);
|
||||
|
||||
// Use a custom client if a non-default browser is requested
|
||||
let is_default_browser = matches!(browser, webclaw_fetch::BrowserProfile::Chrome);
|
||||
let custom_client;
|
||||
let client: &webclaw_fetch::FetchClient = if is_default_browser {
|
||||
&self.fetch_client
|
||||
} else {
|
||||
let config = webclaw_fetch::FetchConfig {
|
||||
browser,
|
||||
..Default::default()
|
||||
};
|
||||
custom_client = webclaw_fetch::FetchClient::new(config)
|
||||
.map_err(|e| format!("Failed to build client: {e}"))?;
|
||||
&custom_client
|
||||
};
|
||||
|
||||
let formats = [format];
|
||||
let result = cloud::smart_fetch(
|
||||
client,
|
||||
self.cloud.as_ref(),
|
||||
¶ms.url,
|
||||
&include,
|
||||
&exclude,
|
||||
main_only,
|
||||
&formats,
|
||||
)
|
||||
.await?;
|
||||
|
||||
match result {
|
||||
SmartFetchResult::Local(extraction) => {
|
||||
let output = match format {
|
||||
"llm" => webclaw_core::to_llm_text(&extraction, Some(¶ms.url)),
|
||||
"text" => extraction.content.plain_text,
|
||||
"json" => serde_json::to_string_pretty(&extraction).unwrap_or_default(),
|
||||
_ => extraction.content.markdown,
|
||||
};
|
||||
Ok(output)
|
||||
}
|
||||
SmartFetchResult::Cloud(resp) => {
|
||||
// Extract the requested format from the API response
|
||||
let content = resp
|
||||
.get(format)
|
||||
.or_else(|| resp.get("markdown"))
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("");
|
||||
|
||||
if content.is_empty() {
|
||||
// Return full JSON if no content in the expected format
|
||||
Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
|
||||
} else {
|
||||
Ok(content.to_string())
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Crawl a website starting from a seed URL, following links breadth-first up to a configurable depth and page limit.
|
||||
#[tool]
|
||||
async fn crawl(&self, Parameters(params): Parameters<CrawlParams>) -> Result<String, String> {
|
||||
let format = params.format.as_deref().unwrap_or("markdown");
|
||||
|
||||
let config = webclaw_fetch::CrawlConfig {
|
||||
max_depth: params.depth.unwrap_or(2) as usize,
|
||||
max_pages: params.max_pages.unwrap_or(50),
|
||||
concurrency: params.concurrency.unwrap_or(5),
|
||||
use_sitemap: params.use_sitemap.unwrap_or(false),
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
let crawler = webclaw_fetch::Crawler::new(¶ms.url, config)
|
||||
.map_err(|e| format!("Crawler init failed: {e}"))?;
|
||||
|
||||
let result = crawler.crawl(¶ms.url).await;
|
||||
|
||||
let mut output = format!(
|
||||
"Crawled {} pages ({} ok, {} errors) in {:.1}s\n\n",
|
||||
result.total, result.ok, result.errors, result.elapsed_secs
|
||||
);
|
||||
|
||||
for page in &result.pages {
|
||||
output.push_str(&format!("--- {} (depth {}) ---\n", page.url, page.depth));
|
||||
if let Some(ref extraction) = page.extraction {
|
||||
let content = match format {
|
||||
"llm" => webclaw_core::to_llm_text(extraction, Some(&page.url)),
|
||||
"text" => extraction.content.plain_text.clone(),
|
||||
_ => extraction.content.markdown.clone(),
|
||||
};
|
||||
output.push_str(&content);
|
||||
} else if let Some(ref err) = page.error {
|
||||
output.push_str(&format!("Error: {err}"));
|
||||
}
|
||||
output.push_str("\n\n");
|
||||
}
|
||||
|
||||
Ok(output)
|
||||
}
|
||||
|
||||
/// Discover URLs from a website's sitemaps (robots.txt + sitemap.xml).
|
||||
#[tool]
|
||||
async fn map(&self, Parameters(params): Parameters<MapParams>) -> Result<String, String> {
|
||||
let entries = webclaw_fetch::sitemap::discover(&self.fetch_client, ¶ms.url)
|
||||
.await
|
||||
.map_err(|e| format!("Sitemap discovery failed: {e}"))?;
|
||||
|
||||
let urls: Vec<&str> = entries.iter().map(|e| e.url.as_str()).collect();
|
||||
Ok(format!(
|
||||
"Discovered {} URLs:\n\n{}",
|
||||
urls.len(),
|
||||
urls.join("\n")
|
||||
))
|
||||
}
|
||||
|
||||
/// Extract content from multiple URLs concurrently.
|
||||
#[tool]
|
||||
async fn batch(&self, Parameters(params): Parameters<BatchParams>) -> Result<String, String> {
|
||||
if params.urls.is_empty() {
|
||||
return Err("urls must not be empty".into());
|
||||
}
|
||||
|
||||
let format = params.format.as_deref().unwrap_or("markdown");
|
||||
let concurrency = params.concurrency.unwrap_or(5);
|
||||
let url_refs: Vec<&str> = params.urls.iter().map(String::as_str).collect();
|
||||
|
||||
let results = self
|
||||
.fetch_client
|
||||
.fetch_and_extract_batch(&url_refs, concurrency)
|
||||
.await;
|
||||
|
||||
let mut output = format!("Extracted {} URLs:\n\n", results.len());
|
||||
|
||||
for r in &results {
|
||||
output.push_str(&format!("--- {} ---\n", r.url));
|
||||
match &r.result {
|
||||
Ok(extraction) => {
|
||||
let content = match format {
|
||||
"llm" => webclaw_core::to_llm_text(extraction, Some(&r.url)),
|
||||
"text" => extraction.content.plain_text.clone(),
|
||||
_ => extraction.content.markdown.clone(),
|
||||
};
|
||||
output.push_str(&content);
|
||||
}
|
||||
Err(e) => {
|
||||
output.push_str(&format!("Error: {e}"));
|
||||
}
|
||||
}
|
||||
output.push_str("\n\n");
|
||||
}
|
||||
|
||||
Ok(output)
|
||||
}
|
||||
|
||||
/// Extract structured data from a web page using an LLM. Provide either a JSON schema or a natural language prompt.
|
||||
/// Automatically falls back to the webclaw cloud API when bot protection is detected.
|
||||
#[tool]
|
||||
async fn extract(
|
||||
&self,
|
||||
Parameters(params): Parameters<ExtractParams>,
|
||||
) -> Result<String, String> {
|
||||
let chain = self.llm_chain.as_ref().ok_or(
|
||||
"No LLM providers available. Set OPENAI_API_KEY or ANTHROPIC_API_KEY, or run Ollama locally.",
|
||||
)?;
|
||||
|
||||
if params.schema.is_none() && params.prompt.is_none() {
|
||||
return Err("Either 'schema' or 'prompt' is required for extraction.".into());
|
||||
}
|
||||
|
||||
// For extract, if we get a cloud fallback we call the cloud extract endpoint directly
|
||||
let llm_content = match self.smart_fetch_llm(¶ms.url).await? {
|
||||
SmartFetchResult::Local(extraction) => {
|
||||
webclaw_core::to_llm_text(&extraction, Some(¶ms.url))
|
||||
}
|
||||
SmartFetchResult::Cloud(resp) => {
|
||||
// Use the LLM format from cloud, fall back to markdown
|
||||
resp.get("llm")
|
||||
.or_else(|| resp.get("markdown"))
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("")
|
||||
.to_string()
|
||||
}
|
||||
};
|
||||
|
||||
let data = if let Some(ref schema) = params.schema {
|
||||
webclaw_llm::extract::extract_json(&llm_content, schema, chain, None)
|
||||
.await
|
||||
.map_err(|e| format!("LLM extraction failed: {e}"))?
|
||||
} else {
|
||||
let prompt = params.prompt.as_deref().unwrap();
|
||||
webclaw_llm::extract::extract_with_prompt(&llm_content, prompt, chain, None)
|
||||
.await
|
||||
.map_err(|e| format!("LLM extraction failed: {e}"))?
|
||||
};
|
||||
|
||||
Ok(serde_json::to_string_pretty(&data).unwrap_or_default())
|
||||
}
|
||||
|
||||
/// Summarize the content of a web page using an LLM.
|
||||
/// Automatically falls back to the webclaw cloud API when bot protection is detected.
|
||||
#[tool]
|
||||
async fn summarize(
|
||||
&self,
|
||||
Parameters(params): Parameters<SummarizeParams>,
|
||||
) -> Result<String, String> {
|
||||
let chain = self.llm_chain.as_ref().ok_or(
|
||||
"No LLM providers available. Set OPENAI_API_KEY or ANTHROPIC_API_KEY, or run Ollama locally.",
|
||||
)?;
|
||||
|
||||
let llm_content = match self.smart_fetch_llm(¶ms.url).await? {
|
||||
SmartFetchResult::Local(extraction) => {
|
||||
webclaw_core::to_llm_text(&extraction, Some(¶ms.url))
|
||||
}
|
||||
SmartFetchResult::Cloud(resp) => resp
|
||||
.get("llm")
|
||||
.or_else(|| resp.get("markdown"))
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("")
|
||||
.to_string(),
|
||||
};
|
||||
|
||||
webclaw_llm::summarize::summarize(&llm_content, params.max_sentences, chain, None)
|
||||
.await
|
||||
.map_err(|e| format!("Summarization failed: {e}"))
|
||||
}
|
||||
|
||||
/// Compare the current content of a URL against a previous extraction snapshot, showing what changed.
|
||||
/// Automatically falls back to the webclaw cloud API when bot protection is detected.
|
||||
#[tool]
|
||||
async fn diff(&self, Parameters(params): Parameters<DiffParams>) -> Result<String, String> {
|
||||
let previous: webclaw_core::ExtractionResult =
|
||||
serde_json::from_str(¶ms.previous_snapshot)
|
||||
.map_err(|e| format!("Failed to parse previous_snapshot JSON: {e}"))?;
|
||||
|
||||
let result = cloud::smart_fetch(
|
||||
&self.fetch_client,
|
||||
self.cloud.as_ref(),
|
||||
¶ms.url,
|
||||
&[],
|
||||
&[],
|
||||
false,
|
||||
&["markdown"],
|
||||
)
|
||||
.await?;
|
||||
|
||||
match result {
|
||||
SmartFetchResult::Local(current) => {
|
||||
let content_diff = webclaw_core::diff::diff(&previous, ¤t);
|
||||
Ok(serde_json::to_string_pretty(&content_diff).unwrap_or_default())
|
||||
}
|
||||
SmartFetchResult::Cloud(resp) => {
|
||||
// Can't do local diff with cloud content, return the cloud response directly
|
||||
Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Extract brand identity (colors, fonts, logo, favicon) from a website's HTML and CSS.
|
||||
/// Automatically falls back to the webclaw cloud API when bot protection is detected.
|
||||
#[tool]
|
||||
async fn brand(&self, Parameters(params): Parameters<BrandParams>) -> Result<String, String> {
|
||||
let fetch_result = self
|
||||
.fetch_client
|
||||
.fetch(¶ms.url)
|
||||
.await
|
||||
.map_err(|e| format!("Fetch failed: {e}"))?;
|
||||
|
||||
// Check for bot protection before extracting brand
|
||||
if cloud::is_bot_protected(&fetch_result.html, &fetch_result.headers) {
|
||||
if let Some(ref c) = self.cloud {
|
||||
let resp = c
|
||||
.post("brand", serde_json::json!({"url": params.url}))
|
||||
.await?;
|
||||
return Ok(serde_json::to_string_pretty(&resp).unwrap_or_default());
|
||||
} else {
|
||||
return Err(format!(
|
||||
"Bot protection detected on {}. Set WEBCLAW_API_KEY for automatic cloud bypass. \
|
||||
Get a key at https://webclaw.io",
|
||||
params.url
|
||||
));
|
||||
}
|
||||
}
|
||||
|
||||
let identity =
|
||||
webclaw_core::brand::extract_brand(&fetch_result.html, Some(&fetch_result.url));
|
||||
|
||||
Ok(serde_json::to_string_pretty(&identity).unwrap_or_default())
|
||||
}
|
||||
|
||||
/// Run a deep research investigation on a topic or question. Requires WEBCLAW_API_KEY.
|
||||
/// Starts an async research job on the webclaw cloud API, then polls until complete.
|
||||
#[tool]
|
||||
async fn research(
|
||||
&self,
|
||||
Parameters(params): Parameters<ResearchParams>,
|
||||
) -> Result<String, String> {
|
||||
let cloud = self
|
||||
.cloud
|
||||
.as_ref()
|
||||
.ok_or("Research requires WEBCLAW_API_KEY. Get a key at https://webclaw.io")?;
|
||||
|
||||
let mut body = json!({ "query": params.query });
|
||||
if let Some(deep) = params.deep {
|
||||
body["deep"] = json!(deep);
|
||||
}
|
||||
if let Some(ref topic) = params.topic {
|
||||
body["topic"] = json!(topic);
|
||||
}
|
||||
|
||||
// Start the research job
|
||||
let start_resp = cloud.post("research", body).await?;
|
||||
let job_id = start_resp
|
||||
.get("id")
|
||||
.and_then(|v| v.as_str())
|
||||
.ok_or("Research API did not return a job ID")?
|
||||
.to_string();
|
||||
|
||||
info!(job_id = %job_id, "research job started, polling for completion");
|
||||
|
||||
// Poll until completed or failed
|
||||
loop {
|
||||
tokio::time::sleep(std::time::Duration::from_secs(3)).await;
|
||||
|
||||
let status_resp = cloud.get(&format!("research/{job_id}")).await?;
|
||||
let status = status_resp
|
||||
.get("status")
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("unknown");
|
||||
|
||||
match status {
|
||||
"completed" => {
|
||||
let report = status_resp
|
||||
.get("report")
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("");
|
||||
|
||||
if report.is_empty() {
|
||||
return Ok(serde_json::to_string_pretty(&status_resp).unwrap_or_default());
|
||||
}
|
||||
return Ok(report.to_string());
|
||||
}
|
||||
"failed" => {
|
||||
let error = status_resp
|
||||
.get("error")
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("unknown error");
|
||||
return Err(format!("Research job failed: {error}"));
|
||||
}
|
||||
_ => {
|
||||
// Still processing, continue polling
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Search the web for a query and return structured results. Requires WEBCLAW_API_KEY.
|
||||
#[tool]
|
||||
async fn search(&self, Parameters(params): Parameters<SearchParams>) -> Result<String, String> {
|
||||
let cloud = self
|
||||
.cloud
|
||||
.as_ref()
|
||||
.ok_or("Search requires WEBCLAW_API_KEY. Get a key at https://webclaw.io")?;
|
||||
|
||||
let mut body = json!({ "query": params.query });
|
||||
if let Some(num) = params.num_results {
|
||||
body["num_results"] = json!(num);
|
||||
}
|
||||
|
||||
let resp = cloud.post("search", body).await?;
|
||||
|
||||
// Format results for readability
|
||||
if let Some(results) = resp.get("results").and_then(|v| v.as_array()) {
|
||||
let mut output = format!("Found {} results:\n\n", results.len());
|
||||
for (i, result) in results.iter().enumerate() {
|
||||
let title = result.get("title").and_then(|v| v.as_str()).unwrap_or("");
|
||||
let url = result.get("url").and_then(|v| v.as_str()).unwrap_or("");
|
||||
let snippet = result
|
||||
.get("snippet")
|
||||
.or_else(|| result.get("description"))
|
||||
.and_then(|v| v.as_str())
|
||||
.unwrap_or("");
|
||||
|
||||
output.push_str(&format!(
|
||||
"{}. {}\n {}\n {}\n\n",
|
||||
i + 1,
|
||||
title,
|
||||
url,
|
||||
snippet
|
||||
));
|
||||
}
|
||||
Ok(output)
|
||||
} else {
|
||||
// Fallback: return raw JSON if unexpected shape
|
||||
Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[tool_handler]
|
||||
impl ServerHandler for WebclawMcp {
|
||||
fn get_info(&self) -> ServerInfo {
|
||||
ServerInfo::new(ServerCapabilities::builder().enable_tools().build())
|
||||
.with_server_info(Implementation::from_build_env())
|
||||
.with_instructions(String::from(
|
||||
"Webclaw MCP server -- web content extraction for AI agents. \
|
||||
Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.",
|
||||
))
|
||||
}
|
||||
}
|
||||
103
crates/webclaw-mcp/src/tools.rs
Normal file
103
crates/webclaw-mcp/src/tools.rs
Normal file
|
|
@ -0,0 +1,103 @@
|
|||
/// Tool parameter structs for MCP tool inputs.
|
||||
/// Each struct derives JsonSchema for automatic schema generation,
|
||||
/// and Deserialize for parsing from MCP tool call arguments.
|
||||
use schemars::JsonSchema;
|
||||
use serde::Deserialize;
|
||||
|
||||
#[derive(Debug, Deserialize, JsonSchema)]
|
||||
pub struct ScrapeParams {
|
||||
/// URL to scrape
|
||||
pub url: String,
|
||||
/// Output format: "markdown" (default), "llm", "text", or "json"
|
||||
pub format: Option<String>,
|
||||
/// CSS selectors to include (only extract matching elements)
|
||||
pub include_selectors: Option<Vec<String>>,
|
||||
/// CSS selectors to exclude from output
|
||||
pub exclude_selectors: Option<Vec<String>>,
|
||||
/// If true, extract only the main content (article/main element)
|
||||
pub only_main_content: Option<bool>,
|
||||
/// Browser profile: "chrome" (default), "firefox", or "random"
|
||||
pub browser: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Deserialize, JsonSchema)]
|
||||
pub struct CrawlParams {
|
||||
/// Seed URL to start crawling from
|
||||
pub url: String,
|
||||
/// Maximum link depth to follow (default: 2)
|
||||
pub depth: Option<u32>,
|
||||
/// Maximum number of pages to crawl (default: 50)
|
||||
pub max_pages: Option<usize>,
|
||||
/// Number of concurrent requests (default: 5)
|
||||
pub concurrency: Option<usize>,
|
||||
/// Seed the frontier from sitemap discovery before crawling
|
||||
pub use_sitemap: Option<bool>,
|
||||
/// Output format for each page: "markdown" (default), "llm", "text"
|
||||
pub format: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Deserialize, JsonSchema)]
|
||||
pub struct MapParams {
|
||||
/// Base URL to discover sitemaps from (e.g. `<https://example.com>`)
|
||||
pub url: String,
|
||||
}
|
||||
|
||||
#[derive(Debug, Deserialize, JsonSchema)]
|
||||
pub struct BatchParams {
|
||||
/// List of URLs to extract content from
|
||||
pub urls: Vec<String>,
|
||||
/// Output format: "markdown" (default), "llm", "text"
|
||||
pub format: Option<String>,
|
||||
/// Number of concurrent requests (default: 5)
|
||||
pub concurrency: Option<usize>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Deserialize, JsonSchema)]
|
||||
pub struct ExtractParams {
|
||||
/// URL to fetch and extract structured data from
|
||||
pub url: String,
|
||||
/// Natural language prompt describing what to extract
|
||||
pub prompt: Option<String>,
|
||||
/// JSON schema describing the structure to extract
|
||||
pub schema: Option<serde_json::Value>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Deserialize, JsonSchema)]
|
||||
pub struct SummarizeParams {
|
||||
/// URL to fetch and summarize
|
||||
pub url: String,
|
||||
/// Number of sentences in the summary (default: 3)
|
||||
pub max_sentences: Option<usize>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Deserialize, JsonSchema)]
|
||||
pub struct DiffParams {
|
||||
/// URL to fetch current content from
|
||||
pub url: String,
|
||||
/// Previous extraction snapshot as a JSON string (ExtractionResult)
|
||||
pub previous_snapshot: String,
|
||||
}
|
||||
|
||||
#[derive(Debug, Deserialize, JsonSchema)]
|
||||
pub struct BrandParams {
|
||||
/// URL to extract brand identity from
|
||||
pub url: String,
|
||||
}
|
||||
|
||||
#[derive(Debug, Deserialize, JsonSchema)]
|
||||
pub struct ResearchParams {
|
||||
/// Research query or question to investigate
|
||||
pub query: String,
|
||||
/// Enable deep research mode for more thorough investigation (default: false)
|
||||
pub deep: Option<bool>,
|
||||
/// Topic hint to guide research focus (e.g. "technology", "finance", "science")
|
||||
pub topic: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Deserialize, JsonSchema)]
|
||||
pub struct SearchParams {
|
||||
/// Search query
|
||||
pub query: String,
|
||||
/// Number of results to return (default: 10)
|
||||
pub num_results: Option<u32>,
|
||||
}
|
||||
11
crates/webclaw-pdf/Cargo.toml
Normal file
11
crates/webclaw-pdf/Cargo.toml
Normal file
|
|
@ -0,0 +1,11 @@
|
|||
[package]
|
||||
name = "webclaw-pdf"
|
||||
description = "PDF text extraction for webclaw"
|
||||
version.workspace = true
|
||||
edition.workspace = true
|
||||
license.workspace = true
|
||||
|
||||
[dependencies]
|
||||
pdf-extract = "0.7"
|
||||
thiserror = { workspace = true }
|
||||
tracing = { workspace = true }
|
||||
14
crates/webclaw-pdf/src/error.rs
Normal file
14
crates/webclaw-pdf/src/error.rs
Normal file
|
|
@ -0,0 +1,14 @@
|
|||
/// PDF extraction errors. Kept simple -- no OCR, no complex recovery.
|
||||
use thiserror::Error;
|
||||
|
||||
#[derive(Debug, Error)]
|
||||
pub enum PdfError {
|
||||
#[error("PDF extraction failed: {0}")]
|
||||
ExtractionFailed(String),
|
||||
|
||||
#[error("invalid PDF: {0}")]
|
||||
InvalidPdf(String),
|
||||
|
||||
#[error("empty PDF: no text content found")]
|
||||
EmptyPdf,
|
||||
}
|
||||
292
crates/webclaw-pdf/src/lib.rs
Normal file
292
crates/webclaw-pdf/src/lib.rs
Normal file
|
|
@ -0,0 +1,292 @@
|
|||
/// PDF text extraction for webclaw.
|
||||
///
|
||||
/// Uses pdf-extract (backed by lopdf) to pull text from PDF bytes.
|
||||
/// No OCR -- text-based PDFs only. Scanned PDFs return EmptyPdf in Auto mode.
|
||||
pub mod error;
|
||||
|
||||
pub use error::PdfError;
|
||||
|
||||
// pdf-extract re-exports all of lopdf via `pub use lopdf::*`
|
||||
use pdf_extract::{Dictionary, Document, Object};
|
||||
use tracing::debug;
|
||||
|
||||
/// Controls how strictly we treat empty/sparse PDFs.
|
||||
#[derive(Debug, Clone, Default)]
|
||||
pub enum PdfMode {
|
||||
/// Try text extraction; error if no text found (catches scanned PDFs early).
|
||||
#[default]
|
||||
Auto,
|
||||
/// Return whatever text is found, even if empty. Caller decides what to do.
|
||||
Fast,
|
||||
}
|
||||
|
||||
/// Successful PDF extraction output.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct PdfResult {
|
||||
pub text: String,
|
||||
pub page_count: usize,
|
||||
pub metadata: PdfMetadata,
|
||||
}
|
||||
|
||||
/// PDF document metadata from the info dictionary.
|
||||
#[derive(Debug, Clone, Default)]
|
||||
pub struct PdfMetadata {
|
||||
pub title: Option<String>,
|
||||
pub author: Option<String>,
|
||||
pub subject: Option<String>,
|
||||
pub creator: Option<String>,
|
||||
}
|
||||
|
||||
const MAX_PDF_SIZE: usize = 50 * 1024 * 1024; // 50MB
|
||||
|
||||
/// Extract text content from raw PDF bytes.
|
||||
///
|
||||
/// Uses pdf-extract for text extraction and lopdf (transitive dep) for
|
||||
/// metadata and page count. In `Auto` mode, returns `PdfError::EmptyPdf`
|
||||
/// if no text is found (likely a scanned/image-only PDF).
|
||||
pub fn extract_pdf(bytes: &[u8], mode: PdfMode) -> Result<PdfResult, PdfError> {
|
||||
if bytes.len() > MAX_PDF_SIZE {
|
||||
return Err(PdfError::InvalidPdf(format!(
|
||||
"PDF too large ({} bytes, max {})",
|
||||
bytes.len(),
|
||||
MAX_PDF_SIZE
|
||||
)));
|
||||
}
|
||||
|
||||
if bytes.len() < 5 || &bytes[..5] != b"%PDF-" {
|
||||
return Err(PdfError::InvalidPdf("missing PDF header".into()));
|
||||
}
|
||||
|
||||
let doc = Document::load_mem(bytes).map_err(|e| PdfError::InvalidPdf(e.to_string()))?;
|
||||
|
||||
let page_count = doc.get_pages().len();
|
||||
let metadata = read_metadata(&doc);
|
||||
|
||||
debug!(pages = page_count, "PDF document loaded");
|
||||
|
||||
// Extract text via pdf-extract (higher-level API over lopdf)
|
||||
let text = pdf_extract::extract_text_from_mem(bytes)
|
||||
.map_err(|e| PdfError::ExtractionFailed(e.to_string()))?;
|
||||
|
||||
let text = normalize_text(&text);
|
||||
|
||||
if text.is_empty() {
|
||||
if matches!(mode, PdfMode::Auto) {
|
||||
return Err(PdfError::EmptyPdf);
|
||||
}
|
||||
debug!("PDF text extraction returned empty (Fast mode, returning as-is)");
|
||||
}
|
||||
|
||||
debug!(chars = text.len(), "PDF text extracted");
|
||||
|
||||
Ok(PdfResult {
|
||||
text,
|
||||
page_count,
|
||||
metadata,
|
||||
})
|
||||
}
|
||||
|
||||
/// Format a PdfResult as markdown for downstream consumers.
|
||||
///
|
||||
/// Adds title as a heading if available, followed by the extracted text body.
|
||||
pub fn to_markdown(result: &PdfResult) -> String {
|
||||
let mut out = String::new();
|
||||
|
||||
if let Some(ref title) = result.metadata.title
|
||||
&& !title.is_empty()
|
||||
{
|
||||
out.push_str("# ");
|
||||
out.push_str(title);
|
||||
out.push_str("\n\n");
|
||||
}
|
||||
|
||||
out.push_str(&result.text);
|
||||
out
|
||||
}
|
||||
|
||||
/// Read metadata from the PDF info dictionary.
|
||||
/// Gracefully returns defaults for any missing or unreadable fields.
|
||||
fn read_metadata(doc: &Document) -> PdfMetadata {
|
||||
let info = match doc.trailer.get(b"Info") {
|
||||
Ok(obj) => match doc.dereference(obj) {
|
||||
Ok((_, Object::Dictionary(dict))) => Some(dict),
|
||||
_ => None,
|
||||
},
|
||||
Err(_) => None,
|
||||
};
|
||||
|
||||
let Some(info) = info else {
|
||||
return PdfMetadata::default();
|
||||
};
|
||||
|
||||
PdfMetadata {
|
||||
title: info_string(info, b"Title"),
|
||||
author: info_string(info, b"Author"),
|
||||
subject: info_string(info, b"Subject"),
|
||||
creator: info_string(info, b"Creator"),
|
||||
}
|
||||
}
|
||||
|
||||
/// Extract a string value from a PDF info dictionary entry.
|
||||
/// Handles both String and Name object types.
|
||||
fn info_string(dict: &Dictionary, key: &[u8]) -> Option<String> {
|
||||
let obj = dict.get(key).ok()?;
|
||||
let raw = match obj {
|
||||
Object::String(bytes, _) => bytes.clone(),
|
||||
Object::Name(bytes) => bytes.clone(),
|
||||
_ => return None,
|
||||
};
|
||||
|
||||
// PDF strings can be UTF-16BE (BOM: FE FF) or PDFDocEncoding (~Latin-1)
|
||||
let text = if raw.len() >= 2 && raw[0] == 0xFE && raw[1] == 0xFF {
|
||||
// UTF-16BE: skip BOM, decode pairs
|
||||
let pairs: Vec<u16> = raw[2..]
|
||||
.chunks_exact(2)
|
||||
.map(|c| u16::from_be_bytes([c[0], c[1]]))
|
||||
.collect();
|
||||
String::from_utf16_lossy(&pairs)
|
||||
} else {
|
||||
// PDFDocEncoding -- first 128 chars match ASCII, rest is Latin-1-ish.
|
||||
// Good enough: lossy UTF-8 covers the common case.
|
||||
String::from_utf8_lossy(&raw).into_owned()
|
||||
};
|
||||
|
||||
let trimmed = text.trim().to_string();
|
||||
if trimmed.is_empty() {
|
||||
None
|
||||
} else {
|
||||
Some(trimmed)
|
||||
}
|
||||
}
|
||||
|
||||
/// Collapse excessive whitespace from pdf-extract output.
|
||||
/// PDF text extraction often produces irregular spacing and blank lines.
|
||||
fn normalize_text(raw: &str) -> String {
|
||||
let mut lines: Vec<&str> = Vec::new();
|
||||
let mut prev_blank = false;
|
||||
|
||||
for line in raw.lines() {
|
||||
let trimmed = line.trim();
|
||||
if trimmed.is_empty() {
|
||||
if !prev_blank && !lines.is_empty() {
|
||||
lines.push("");
|
||||
prev_blank = true;
|
||||
}
|
||||
} else {
|
||||
lines.push(trimmed);
|
||||
prev_blank = false;
|
||||
}
|
||||
}
|
||||
|
||||
// Strip trailing blank lines
|
||||
while lines.last() == Some(&"") {
|
||||
lines.pop();
|
||||
}
|
||||
|
||||
lines.join("\n")
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_pdf_metadata_default() {
|
||||
let meta = PdfMetadata::default();
|
||||
assert!(meta.title.is_none());
|
||||
assert!(meta.author.is_none());
|
||||
assert!(meta.subject.is_none());
|
||||
assert!(meta.creator.is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_to_markdown_with_title() {
|
||||
let result = PdfResult {
|
||||
text: "Hello world.\n\nSecond paragraph.".into(),
|
||||
page_count: 1,
|
||||
metadata: PdfMetadata {
|
||||
title: Some("Test Document".into()),
|
||||
..Default::default()
|
||||
},
|
||||
};
|
||||
|
||||
let md = to_markdown(&result);
|
||||
assert!(md.starts_with("# Test Document\n\n"));
|
||||
assert!(md.contains("Hello world."));
|
||||
assert!(md.contains("Second paragraph."));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_to_markdown_without_title() {
|
||||
let result = PdfResult {
|
||||
text: "Just text.".into(),
|
||||
page_count: 1,
|
||||
metadata: PdfMetadata::default(),
|
||||
};
|
||||
|
||||
let md = to_markdown(&result);
|
||||
assert_eq!(md, "Just text.");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_to_markdown_empty_title_skipped() {
|
||||
let result = PdfResult {
|
||||
text: "Body.".into(),
|
||||
page_count: 1,
|
||||
metadata: PdfMetadata {
|
||||
title: Some("".into()),
|
||||
..Default::default()
|
||||
},
|
||||
};
|
||||
|
||||
let md = to_markdown(&result);
|
||||
assert!(!md.starts_with('#'));
|
||||
assert_eq!(md, "Body.");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_empty_bytes_returns_error() {
|
||||
let result = extract_pdf(&[], PdfMode::Auto);
|
||||
assert!(matches!(result, Err(PdfError::InvalidPdf(_))));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_garbage_bytes_returns_error() {
|
||||
let result = extract_pdf(b"not a pdf at all", PdfMode::Auto);
|
||||
assert!(matches!(result, Err(PdfError::InvalidPdf(_))));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_truncated_pdf_header_returns_error() {
|
||||
// Has the PDF magic but nothing else -- lopdf will reject it
|
||||
let result = extract_pdf(b"%PDF-1.4\n", PdfMode::Auto);
|
||||
assert!(result.is_err());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_oversized_pdf_rejected() {
|
||||
let big = vec![0u8; MAX_PDF_SIZE + 1];
|
||||
let result = extract_pdf(&big, PdfMode::Auto);
|
||||
assert!(matches!(result, Err(PdfError::InvalidPdf(msg)) if msg.contains("too large")));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_normalize_text_collapses_blanks() {
|
||||
let input = "Line one.\n\n\n\nLine two.\n\n\n";
|
||||
let output = normalize_text(input);
|
||||
assert_eq!(output, "Line one.\n\nLine two.");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_normalize_text_trims_lines() {
|
||||
let input = " hello \n world ";
|
||||
let output = normalize_text(input);
|
||||
assert_eq!(output, "hello\nworld");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_normalize_text_empty() {
|
||||
assert_eq!(normalize_text(""), "");
|
||||
assert_eq!(normalize_text(" \n \n "), "");
|
||||
}
|
||||
}
|
||||
513
deploy/hetzner.sh
Executable file
513
deploy/hetzner.sh
Executable file
|
|
@ -0,0 +1,513 @@
|
|||
#!/usr/bin/env bash
|
||||
# deploy/hetzner.sh — One-click Hetzner VPS deployment for webclaw
|
||||
#
|
||||
# Creates a Hetzner Cloud VPS with Docker, deploys webclaw + Ollama,
|
||||
# and optionally configures nginx + SSL.
|
||||
#
|
||||
# Usage:
|
||||
# ./deploy/hetzner.sh # Interactive setup
|
||||
# ./deploy/hetzner.sh --destroy # Tear down the server
|
||||
#
|
||||
# Server type recommendations:
|
||||
# cpx11: 2 vCPU, 2GB RAM, ~4.59 EUR/mo — Minimum (scraping only, no LLM)
|
||||
# cpx21: 3 vCPU, 4GB RAM, ~8.49 EUR/mo — Recommended (scraping + small LLM)
|
||||
# cpx31: 4 vCPU, 8GB RAM, ~15.59 EUR/mo — Best (scraping + LLM + high concurrency)
|
||||
# cpx41: 8 vCPU, 16GB RAM, ~28.19 EUR/mo — Heavy use (high-volume crawling)
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Constants
|
||||
# ---------------------------------------------------------------------------
|
||||
HETZNER_API="https://api.hetzner.cloud/v1"
|
||||
SERVER_NAME="webclaw"
|
||||
REPO_URL="https://github.com/0xMassi/webclaw.git"
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Colors
|
||||
# ---------------------------------------------------------------------------
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
CYAN='\033[0;36m'
|
||||
BOLD='\033[1m'
|
||||
DIM='\033[2m'
|
||||
RESET='\033[0m'
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
info() { printf "${BLUE}[*]${RESET} %s\n" "$*"; }
|
||||
success() { printf "${GREEN}[+]${RESET} %s\n" "$*"; }
|
||||
warn() { printf "${YELLOW}[!]${RESET} %s\n" "$*"; }
|
||||
error() { printf "${RED}[x]${RESET} %s\n" "$*" >&2; }
|
||||
fatal() { error "$*"; exit 1; }
|
||||
|
||||
prompt() {
|
||||
local var_name="$1" prompt_text="$2" default="${3:-}"
|
||||
if [[ -n "$default" ]]; then
|
||||
printf "${CYAN} %s${DIM} [%s]${RESET}: " "$prompt_text" "$default"
|
||||
else
|
||||
printf "${CYAN} %s${RESET}: " "$prompt_text"
|
||||
fi
|
||||
read -r input
|
||||
eval "$var_name=\"${input:-$default}\""
|
||||
}
|
||||
|
||||
prompt_secret() {
|
||||
local var_name="$1" prompt_text="$2" default="${3:-}"
|
||||
if [[ -n "$default" ]]; then
|
||||
printf "${CYAN} %s${DIM} [%s]${RESET}: " "$prompt_text" "$default"
|
||||
else
|
||||
printf "${CYAN} %s${RESET}: " "$prompt_text"
|
||||
fi
|
||||
read -rs input
|
||||
echo
|
||||
eval "$var_name=\"${input:-$default}\""
|
||||
}
|
||||
|
||||
generate_key() {
|
||||
# 32-char random hex key
|
||||
if command -v openssl &>/dev/null; then
|
||||
openssl rand -hex 16
|
||||
else
|
||||
LC_ALL=C tr -dc 'a-f0-9' < /dev/urandom | head -c 32
|
||||
fi
|
||||
}
|
||||
|
||||
hetzner_api() {
|
||||
local method="$1" path="$2"
|
||||
shift 2
|
||||
curl -sf -X "$method" \
|
||||
-H "Authorization: Bearer $HETZNER_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"$HETZNER_API$path" \
|
||||
"$@"
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Preflight checks
|
||||
# ---------------------------------------------------------------------------
|
||||
preflight() {
|
||||
local missing=()
|
||||
command -v curl &>/dev/null || missing+=("curl")
|
||||
command -v jq &>/dev/null || missing+=("jq")
|
||||
command -v ssh &>/dev/null || missing+=("ssh")
|
||||
|
||||
if [[ ${#missing[@]} -gt 0 ]]; then
|
||||
fatal "Missing required tools: ${missing[*]}. Install them and try again."
|
||||
fi
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Validate Hetzner token
|
||||
# ---------------------------------------------------------------------------
|
||||
validate_token() {
|
||||
info "Validating Hetzner API token..."
|
||||
local response
|
||||
response=$(hetzner_api GET "/servers?per_page=1" 2>&1) || {
|
||||
fatal "Invalid Hetzner API token. Get one at: https://console.hetzner.cloud"
|
||||
}
|
||||
success "Token is valid."
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Check if server already exists
|
||||
# ---------------------------------------------------------------------------
|
||||
find_server() {
|
||||
local response
|
||||
response=$(hetzner_api GET "/servers?name=$SERVER_NAME")
|
||||
echo "$response" | jq -r '.servers[0] // empty'
|
||||
}
|
||||
|
||||
get_server_id() {
|
||||
local server
|
||||
server=$(find_server)
|
||||
if [[ -n "$server" && "$server" != "null" ]]; then
|
||||
echo "$server" | jq -r '.id'
|
||||
fi
|
||||
}
|
||||
|
||||
get_server_ip() {
|
||||
local server
|
||||
server=$(find_server)
|
||||
if [[ -n "$server" && "$server" != "null" ]]; then
|
||||
echo "$server" | jq -r '.public_net.ipv4.ip'
|
||||
fi
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Destroy mode
|
||||
# ---------------------------------------------------------------------------
|
||||
destroy_server() {
|
||||
info "Looking for existing webclaw server..."
|
||||
local server_id
|
||||
server_id=$(get_server_id)
|
||||
|
||||
if [[ -z "$server_id" ]]; then
|
||||
warn "No server named '$SERVER_NAME' found. Nothing to destroy."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
local ip
|
||||
ip=$(get_server_ip)
|
||||
warn "Found server: $SERVER_NAME (ID: $server_id, IP: $ip)"
|
||||
printf "${RED} This will permanently delete the server and all its data.${RESET}\n"
|
||||
printf "${CYAN} Type 'destroy' to confirm${RESET}: "
|
||||
read -r confirmation
|
||||
|
||||
if [[ "$confirmation" != "destroy" ]]; then
|
||||
info "Aborted."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
info "Destroying server $server_id..."
|
||||
hetzner_api DELETE "/servers/$server_id" > /dev/null
|
||||
success "Server destroyed."
|
||||
|
||||
# Clean SSH known_hosts
|
||||
if [[ -n "$ip" ]]; then
|
||||
ssh-keygen -R "$ip" 2>/dev/null || true
|
||||
info "Removed $ip from SSH known_hosts."
|
||||
fi
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Build cloud-init user_data
|
||||
# ---------------------------------------------------------------------------
|
||||
build_cloud_init() {
|
||||
local auth_key="$1" openai_key="$2" anthropic_key="$3" domain="$4" ollama_model="$5"
|
||||
|
||||
# Build .env content
|
||||
local env_content="# webclaw deployment — generated by hetzner.sh
|
||||
WEBCLAW_HOST=0.0.0.0
|
||||
WEBCLAW_PORT=3000
|
||||
WEBCLAW_AUTH_KEY=$auth_key
|
||||
OLLAMA_HOST=http://ollama:11434
|
||||
OLLAMA_MODEL=$ollama_model
|
||||
WEBCLAW_LOG=info"
|
||||
|
||||
if [[ -n "$openai_key" ]]; then
|
||||
env_content="$env_content
|
||||
OPENAI_API_KEY=$openai_key"
|
||||
fi
|
||||
if [[ -n "$anthropic_key" ]]; then
|
||||
env_content="$env_content
|
||||
ANTHROPIC_API_KEY=$anthropic_key"
|
||||
fi
|
||||
|
||||
# Nginx + certbot block (only if domain provided)
|
||||
local nginx_block=""
|
||||
if [[ -n "$domain" ]]; then
|
||||
nginx_block="
|
||||
# --- Nginx reverse proxy + SSL ---
|
||||
- apt-get install -y nginx certbot python3-certbot-nginx
|
||||
|
||||
- |
|
||||
cat > /etc/nginx/sites-available/webclaw <<'NGINX'
|
||||
server {
|
||||
listen 80;
|
||||
server_name $domain;
|
||||
|
||||
location / {
|
||||
proxy_pass http://127.0.0.1:3000;
|
||||
proxy_set_header Host \$host;
|
||||
proxy_set_header X-Real-IP \$remote_addr;
|
||||
proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto \$scheme;
|
||||
proxy_read_timeout 120s;
|
||||
proxy_connect_timeout 10s;
|
||||
}
|
||||
}
|
||||
NGINX
|
||||
|
||||
- ln -sf /etc/nginx/sites-available/webclaw /etc/nginx/sites-enabled/webclaw
|
||||
- rm -f /etc/nginx/sites-enabled/default
|
||||
- systemctl restart nginx
|
||||
|
||||
# SSL cert (will fail silently if DNS not pointed yet)
|
||||
- certbot --nginx -d $domain --non-interactive --agree-tos --register-unsolicited-contact -m admin@$domain || echo 'Certbot failed — point DNS to this IP and run: certbot --nginx -d $domain'
|
||||
"
|
||||
fi
|
||||
|
||||
cat <<CLOUDINIT
|
||||
#cloud-config
|
||||
package_update: true
|
||||
|
||||
runcmd:
|
||||
# --- Firewall ---
|
||||
- ufw allow 22/tcp
|
||||
- ufw allow 80/tcp
|
||||
- ufw allow 443/tcp
|
||||
- ufw allow 3000/tcp
|
||||
- ufw --force enable
|
||||
|
||||
# --- Docker (already installed on hetzner docker-ce image, but ensure compose) ---
|
||||
- |
|
||||
if ! command -v docker &>/dev/null; then
|
||||
curl -fsSL https://get.docker.com | sh
|
||||
fi
|
||||
- |
|
||||
if ! docker compose version &>/dev/null; then
|
||||
apt-get install -y docker-compose-plugin
|
||||
fi
|
||||
|
||||
# --- Clone and deploy ---
|
||||
- git clone $REPO_URL /opt/webclaw
|
||||
- |
|
||||
cat > /opt/webclaw/.env <<'DOTENV'
|
||||
$env_content
|
||||
DOTENV
|
||||
# Remove leading whitespace from heredoc
|
||||
sed -i 's/^ //' /opt/webclaw/.env
|
||||
|
||||
$nginx_block
|
||||
# --- Start services ---
|
||||
- cd /opt/webclaw && docker compose up -d --build
|
||||
|
||||
# --- Pull Ollama model in background (non-blocking) ---
|
||||
- |
|
||||
nohup bash -c '
|
||||
echo "Waiting for Ollama to start..."
|
||||
for i in \$(seq 1 60); do
|
||||
if docker compose -f /opt/webclaw/docker-compose.yml exec -T ollama ollama list &>/dev/null; then
|
||||
echo "Ollama ready. Pulling $ollama_model..."
|
||||
docker compose -f /opt/webclaw/docker-compose.yml exec -T ollama ollama pull $ollama_model
|
||||
echo "Model $ollama_model pulled."
|
||||
break
|
||||
fi
|
||||
sleep 5
|
||||
done
|
||||
' > /var/log/ollama-pull.log 2>&1 &
|
||||
|
||||
CLOUDINIT
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Wait for SSH
|
||||
# ---------------------------------------------------------------------------
|
||||
wait_for_ssh() {
|
||||
local ip="$1" max_attempts=40
|
||||
info "Waiting for server to become reachable (this takes 1-3 minutes)..."
|
||||
|
||||
for i in $(seq 1 $max_attempts); do
|
||||
if ssh -o ConnectTimeout=3 -o StrictHostKeyChecking=no -o BatchMode=yes \
|
||||
"root@$ip" "echo ok" &>/dev/null; then
|
||||
return 0
|
||||
fi
|
||||
printf "."
|
||||
sleep 5
|
||||
done
|
||||
echo
|
||||
return 1
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Wait for Docker build to complete
|
||||
# ---------------------------------------------------------------------------
|
||||
wait_for_docker() {
|
||||
local ip="$1" max_attempts=60
|
||||
info "Waiting for Docker build to complete (this takes 5-15 minutes on first deploy)..."
|
||||
|
||||
for i in $(seq 1 $max_attempts); do
|
||||
local status
|
||||
status=$(ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
|
||||
"root@$ip" "docker ps --filter name=webclaw --format '{{.Status}}' 2>/dev/null | head -1" 2>/dev/null || echo "")
|
||||
|
||||
if [[ "$status" == *"Up"* ]]; then
|
||||
return 0
|
||||
fi
|
||||
printf "."
|
||||
sleep 15
|
||||
done
|
||||
echo
|
||||
return 1
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Get SSH keys from Hetzner account
|
||||
# ---------------------------------------------------------------------------
|
||||
get_ssh_keys() {
|
||||
local response
|
||||
response=$(hetzner_api GET "/ssh_keys")
|
||||
echo "$response" | jq -r '[.ssh_keys[].id] // []'
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main: create server
|
||||
# ---------------------------------------------------------------------------
|
||||
create_server() {
|
||||
# Check for existing server
|
||||
local existing_id
|
||||
existing_id=$(get_server_id)
|
||||
if [[ -n "$existing_id" ]]; then
|
||||
local existing_ip
|
||||
existing_ip=$(get_server_ip)
|
||||
warn "Server '$SERVER_NAME' already exists (ID: $existing_id, IP: $existing_ip)"
|
||||
warn "Run with --destroy first, or use a different name."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Gather configuration
|
||||
echo
|
||||
printf "${BOLD}${GREEN} webclaw Hetzner Deploy${RESET}\n"
|
||||
printf "${DIM} One-click VPS deployment for webclaw REST API + Ollama${RESET}\n"
|
||||
echo
|
||||
|
||||
prompt SERVER_TYPE "Server type (cpx11/cpx21/cpx31/cpx41)" "cpx21"
|
||||
prompt LOCATION "Region (fsn1/nbg1/hel1/ash/hil)" "fsn1"
|
||||
prompt DOMAIN "Domain for SSL (leave empty to skip)" ""
|
||||
prompt_secret OPENAI_KEY "OpenAI API key (optional)" ""
|
||||
prompt_secret ANTHROPIC_KEY "Anthropic API key (optional)" ""
|
||||
|
||||
local generated_auth_key
|
||||
generated_auth_key=$(generate_key)
|
||||
prompt_secret AUTH_KEY "Webclaw auth key" "$generated_auth_key"
|
||||
|
||||
prompt OLLAMA_MODEL "Ollama model to pre-pull" "qwen3:1.7b"
|
||||
|
||||
echo
|
||||
info "Configuration:"
|
||||
printf " Server type: ${BOLD}%s${RESET}\n" "$SERVER_TYPE"
|
||||
printf " Region: ${BOLD}%s${RESET}\n" "$LOCATION"
|
||||
printf " Domain: ${BOLD}%s${RESET}\n" "${DOMAIN:-none}"
|
||||
printf " OpenAI key: ${BOLD}%s${RESET}\n" "$([ -n "$OPENAI_KEY" ] && echo 'set' || echo 'not set')"
|
||||
printf " Anthropic key:${BOLD}%s${RESET}\n" "$([ -n "$ANTHROPIC_KEY" ] && echo 'set' || echo 'not set')"
|
||||
printf " Auth key: ${BOLD}%s${RESET}\n" "$AUTH_KEY"
|
||||
printf " Ollama model: ${BOLD}%s${RESET}\n" "$OLLAMA_MODEL"
|
||||
echo
|
||||
|
||||
printf "${CYAN} Proceed? (y/n)${RESET}: "
|
||||
read -r confirm
|
||||
[[ "$confirm" =~ ^[Yy]$ ]] || { info "Aborted."; exit 0; }
|
||||
|
||||
# Build cloud-init
|
||||
local user_data
|
||||
user_data=$(build_cloud_init "$AUTH_KEY" "$OPENAI_KEY" "$ANTHROPIC_KEY" "$DOMAIN" "$OLLAMA_MODEL")
|
||||
|
||||
# Get SSH keys
|
||||
local ssh_keys
|
||||
ssh_keys=$(get_ssh_keys)
|
||||
info "Found $(echo "$ssh_keys" | jq length) SSH key(s) in your Hetzner account."
|
||||
|
||||
# Create server
|
||||
info "Creating $SERVER_TYPE server in $LOCATION..."
|
||||
local create_payload
|
||||
create_payload=$(jq -n \
|
||||
--arg name "$SERVER_NAME" \
|
||||
--arg server_type "$SERVER_TYPE" \
|
||||
--arg location "$LOCATION" \
|
||||
--arg user_data "$user_data" \
|
||||
--argjson ssh_keys "$ssh_keys" \
|
||||
'{
|
||||
name: $name,
|
||||
server_type: $server_type,
|
||||
location: $location,
|
||||
image: "docker-ce",
|
||||
ssh_keys: $ssh_keys,
|
||||
user_data: $user_data,
|
||||
public_net: {
|
||||
enable_ipv4: true,
|
||||
enable_ipv6: true
|
||||
}
|
||||
}')
|
||||
|
||||
local response
|
||||
response=$(hetzner_api POST "/servers" -d "$create_payload") || {
|
||||
fatal "Failed to create server. Check your Hetzner token permissions."
|
||||
}
|
||||
|
||||
local server_id server_ip root_password
|
||||
server_id=$(echo "$response" | jq -r '.server.id')
|
||||
server_ip=$(echo "$response" | jq -r '.server.public_net.ipv4.ip')
|
||||
root_password=$(echo "$response" | jq -r '.root_password // empty')
|
||||
|
||||
if [[ -z "$server_id" || "$server_id" == "null" ]]; then
|
||||
error "Server creation response:"
|
||||
echo "$response" | jq .
|
||||
fatal "Failed to create server."
|
||||
fi
|
||||
|
||||
success "Server created: ID=$server_id, IP=$server_ip"
|
||||
|
||||
if [[ -n "$root_password" ]]; then
|
||||
echo
|
||||
warn "Root password (save this, shown only once): $root_password"
|
||||
echo
|
||||
fi
|
||||
|
||||
# Wait for SSH
|
||||
if wait_for_ssh "$server_ip"; then
|
||||
success "Server is reachable via SSH."
|
||||
else
|
||||
warn "Server not yet reachable via SSH. It may still be booting."
|
||||
warn "Try: ssh root@$server_ip"
|
||||
fi
|
||||
|
||||
# Summary
|
||||
echo
|
||||
printf "${BOLD}${GREEN} Deployment started.${RESET}\n"
|
||||
echo
|
||||
printf " The server is now building webclaw from source.\n"
|
||||
printf " This takes ${BOLD}5-15 minutes${RESET} on first deploy.\n"
|
||||
echo
|
||||
printf " ${BOLD}Server IP:${RESET} %s\n" "$server_ip"
|
||||
printf " ${BOLD}SSH:${RESET} ssh root@%s\n" "$server_ip"
|
||||
printf " ${BOLD}Auth key:${RESET} %s\n" "$AUTH_KEY"
|
||||
echo
|
||||
printf " ${BOLD}Monitor build progress:${RESET}\n"
|
||||
printf " ssh root@%s 'cd /opt/webclaw && docker compose logs -f'\n" "$server_ip"
|
||||
echo
|
||||
printf " ${BOLD}Test when ready:${RESET}\n"
|
||||
printf " curl http://%s:3000/health\n" "$server_ip"
|
||||
echo
|
||||
printf " ${BOLD}Scrape:${RESET}\n"
|
||||
printf " curl -X POST http://%s:3000/v1/scrape \\\\\n" "$server_ip"
|
||||
printf " -H 'Content-Type: application/json' \\\\\n"
|
||||
printf " -H 'Authorization: Bearer %s' \\\\\n" "$AUTH_KEY"
|
||||
printf " -d '{\"url\": \"https://example.com\"}'\n"
|
||||
echo
|
||||
|
||||
if [[ -n "$DOMAIN" ]]; then
|
||||
printf " ${BOLD}Domain:${RESET}\n"
|
||||
printf " Point %s A record -> %s\n" "$DOMAIN" "$server_ip"
|
||||
printf " SSL will auto-configure via certbot.\n"
|
||||
printf " Then: curl https://%s/health\n" "$DOMAIN"
|
||||
echo
|
||||
fi
|
||||
|
||||
printf " ${BOLD}Pull Ollama model manually (if auto-pull hasn't finished):${RESET}\n"
|
||||
printf " ssh root@%s 'cd /opt/webclaw && docker compose exec ollama ollama pull %s'\n" "$server_ip" "$OLLAMA_MODEL"
|
||||
echo
|
||||
|
||||
printf " ${BOLD}Tear down:${RESET}\n"
|
||||
printf " HETZNER_TOKEN=%s ./deploy/hetzner.sh --destroy\n" "$HETZNER_TOKEN"
|
||||
echo
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Entrypoint
|
||||
# ---------------------------------------------------------------------------
|
||||
main() {
|
||||
preflight
|
||||
|
||||
# Accept token from env or prompt
|
||||
if [[ -z "${HETZNER_TOKEN:-}" ]]; then
|
||||
echo
|
||||
printf "${BOLD}${GREEN} webclaw Hetzner Deploy${RESET}\n"
|
||||
echo
|
||||
prompt_secret HETZNER_TOKEN "Hetzner API token (https://console.hetzner.cloud)" ""
|
||||
[[ -n "$HETZNER_TOKEN" ]] || fatal "Hetzner API token is required."
|
||||
fi
|
||||
|
||||
validate_token
|
||||
|
||||
if [[ "${1:-}" == "--destroy" ]]; then
|
||||
destroy_server
|
||||
else
|
||||
create_server
|
||||
fi
|
||||
}
|
||||
|
||||
main "$@"
|
||||
35
docker-compose.yml
Normal file
35
docker-compose.yml
Normal file
|
|
@ -0,0 +1,35 @@
|
|||
services:
|
||||
webclaw:
|
||||
build: .
|
||||
ports:
|
||||
- "${WEBCLAW_PORT:-3000}:3000"
|
||||
env_file:
|
||||
- .env
|
||||
environment:
|
||||
- OLLAMA_HOST=http://ollama:11434
|
||||
depends_on:
|
||||
- ollama
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
test: ["CMD", "webclaw", "--help"]
|
||||
interval: 30s
|
||||
timeout: 5s
|
||||
retries: 3
|
||||
|
||||
ollama:
|
||||
image: ollama/ollama:latest
|
||||
volumes:
|
||||
- ollama_data:/root/.ollama
|
||||
restart: unless-stopped
|
||||
# CPU-only by default. For GPU, uncomment:
|
||||
# deploy:
|
||||
# resources:
|
||||
# reservations:
|
||||
# devices:
|
||||
# - capabilities: [gpu]
|
||||
#
|
||||
# Pre-pull a model after starting:
|
||||
# docker compose exec ollama ollama pull qwen3:1.7b
|
||||
|
||||
volumes:
|
||||
ollama_data:
|
||||
43
env.example
Normal file
43
env.example
Normal file
|
|
@ -0,0 +1,43 @@
|
|||
# ============================================
|
||||
# Webclaw Configuration
|
||||
# Copy to .env and fill in your values
|
||||
# ============================================
|
||||
|
||||
# --- LLM Providers ---
|
||||
|
||||
# Ollama (local, default provider)
|
||||
OLLAMA_HOST=http://localhost:11434
|
||||
OLLAMA_MODEL=qwen3:8b
|
||||
|
||||
# OpenAI (optional cloud fallback)
|
||||
# OPENAI_API_KEY — set your OpenAI key
|
||||
# OPENAI_BASE_URL — defaults to https://api.openai.com/v1
|
||||
# OPENAI_MODEL — defaults to gpt-4o-mini
|
||||
|
||||
# Anthropic (optional cloud fallback)
|
||||
# ANTHROPIC_API_KEY — set your Anthropic key
|
||||
# ANTHROPIC_MODEL — defaults to claude-sonnet-4-20250514
|
||||
|
||||
# --- Proxy ---
|
||||
|
||||
# Single proxy
|
||||
# WEBCLAW_PROXY=http://user:pass@host:port
|
||||
|
||||
# Proxy file (one per line: host:port:user:pass)
|
||||
# WEBCLAW_PROXY_FILE=/path/to/proxies.txt
|
||||
|
||||
# --- Server (webclaw-server only) ---
|
||||
# WEBCLAW_PORT=3000
|
||||
# WEBCLAW_HOST=0.0.0.0
|
||||
# WEBCLAW_AUTH_KEY=your-auth-key
|
||||
# WEBCLAW_MAX_CONCURRENCY=50
|
||||
# WEBCLAW_JOB_TTL_SECS=3600
|
||||
# WEBCLAW_MAX_JOBS=100
|
||||
|
||||
# --- CLI LLM overrides ---
|
||||
# WEBCLAW_LLM_PROVIDER=ollama
|
||||
# WEBCLAW_LLM_MODEL=qwen3:8b
|
||||
# WEBCLAW_LLM_BASE_URL=http://localhost:11434
|
||||
|
||||
# --- Logging ---
|
||||
# WEBCLAW_LOG=info
|
||||
320
examples/README.md
Normal file
320
examples/README.md
Normal file
|
|
@ -0,0 +1,320 @@
|
|||
# Examples
|
||||
|
||||
Practical examples showing what webclaw can do. Each example is a self-contained command you can run immediately.
|
||||
|
||||
## Basic Extraction
|
||||
|
||||
```bash
|
||||
# Extract as markdown (default)
|
||||
webclaw https://example.com
|
||||
|
||||
# Multiple output formats
|
||||
webclaw https://example.com -f markdown # Clean markdown
|
||||
webclaw https://example.com -f json # Full structured JSON
|
||||
webclaw https://example.com -f text # Plain text (no formatting)
|
||||
webclaw https://example.com -f llm # Token-optimized for LLMs (67% fewer tokens)
|
||||
|
||||
# Bare domains work (auto-prepends https://)
|
||||
webclaw example.com
|
||||
```
|
||||
|
||||
## Content Filtering
|
||||
|
||||
```bash
|
||||
# Only extract main content (skip nav, sidebar, footer)
|
||||
webclaw https://docs.rs/tokio --only-main-content
|
||||
|
||||
# Include specific CSS selectors
|
||||
webclaw https://news.ycombinator.com --include ".titleline,.score"
|
||||
|
||||
# Exclude specific elements
|
||||
webclaw https://example.com --exclude "nav,footer,.ads,.sidebar"
|
||||
|
||||
# Combine both
|
||||
webclaw https://docs.rs/reqwest --only-main-content --exclude ".sidebar"
|
||||
```
|
||||
|
||||
## Brand Identity Extraction
|
||||
|
||||
```bash
|
||||
# Extract colors, fonts, logos from any website
|
||||
webclaw --brand https://stripe.com
|
||||
# Output: { "name": "Stripe", "colors": [...], "fonts": ["Sohne"], "logos": [...] }
|
||||
|
||||
webclaw --brand https://github.com
|
||||
# Output: { "name": "GitHub", "colors": [{"hex": "#1F2328", ...}], "fonts": ["Mona Sans"], ... }
|
||||
|
||||
webclaw --brand wikipedia.org
|
||||
# Output: 10 colors, 5 fonts, favicon, logo URL
|
||||
```
|
||||
|
||||
## Sitemap Discovery
|
||||
|
||||
```bash
|
||||
# Discover all URLs from a site's sitemaps
|
||||
webclaw --map https://sitemaps.org
|
||||
# Output: one URL per line (84 URLs found)
|
||||
|
||||
# JSON output with metadata
|
||||
webclaw --map https://sitemaps.org -f json
|
||||
# Output: [{ "url": "...", "last_modified": "...", "priority": 0.8 }]
|
||||
```
|
||||
|
||||
## Recursive Crawling
|
||||
|
||||
```bash
|
||||
# Crawl a site (default: depth 1, max 20 pages)
|
||||
webclaw --crawl https://example.com
|
||||
|
||||
# Control depth and page limit
|
||||
webclaw --crawl --depth 2 --max-pages 50 https://docs.rs/tokio
|
||||
|
||||
# Crawl with sitemap seeding (finds more pages)
|
||||
webclaw --crawl --sitemap --depth 2 https://docs.rs/tokio
|
||||
|
||||
# Filter crawl paths
|
||||
webclaw --crawl --include-paths "/api/*,/guide/*" https://docs.example.com
|
||||
webclaw --crawl --exclude-paths "/changelog/*,/blog/*" https://docs.example.com
|
||||
|
||||
# Control concurrency and delay
|
||||
webclaw --crawl --concurrency 10 --delay 200 https://example.com
|
||||
```
|
||||
|
||||
## Change Detection (Diff)
|
||||
|
||||
```bash
|
||||
# Step 1: Save a snapshot
|
||||
webclaw https://example.com -f json > snapshot.json
|
||||
|
||||
# Step 2: Later, compare against the snapshot
|
||||
webclaw --diff-with snapshot.json https://example.com
|
||||
# Output:
|
||||
# Status: Same
|
||||
# Word count delta: +0
|
||||
|
||||
# If the page changed:
|
||||
# Status: Changed
|
||||
# Word count delta: +42
|
||||
# --- old
|
||||
# +++ new
|
||||
# @@ -1,3 +1,3 @@
|
||||
# -Old content here
|
||||
# +New content here
|
||||
```
|
||||
|
||||
## PDF Extraction
|
||||
|
||||
```bash
|
||||
# PDF URLs are auto-detected via Content-Type
|
||||
webclaw https://example.com/report.pdf
|
||||
|
||||
# Control PDF mode
|
||||
webclaw --pdf-mode auto https://example.com/report.pdf # Error on empty (catches scanned PDFs)
|
||||
webclaw --pdf-mode fast https://example.com/report.pdf # Return whatever text is found
|
||||
```
|
||||
|
||||
## Batch Processing
|
||||
|
||||
```bash
|
||||
# Multiple URLs in one command
|
||||
webclaw https://example.com https://httpbin.org/html https://rust-lang.org
|
||||
|
||||
# URLs from a file (one per line, # comments supported)
|
||||
webclaw --urls-file urls.txt
|
||||
|
||||
# Batch with JSON output
|
||||
webclaw --urls-file urls.txt -f json
|
||||
|
||||
# Proxy rotation for large batches
|
||||
webclaw --urls-file urls.txt --proxy-file proxies.txt --concurrency 10
|
||||
```
|
||||
|
||||
## Local Files & Stdin
|
||||
|
||||
```bash
|
||||
# Extract from a local HTML file
|
||||
webclaw --file page.html
|
||||
|
||||
# Pipe HTML from another command
|
||||
curl -s https://example.com | webclaw --stdin
|
||||
|
||||
# Chain with other tools
|
||||
webclaw https://example.com -f text | wc -w # Word count
|
||||
webclaw https://example.com -f json | jq '.metadata.title' # Extract title with jq
|
||||
```
|
||||
|
||||
## Cloud API Mode
|
||||
|
||||
When you have a webclaw API key, the CLI can route through the cloud for bot protection bypass, JS rendering, and proxy rotation.
|
||||
|
||||
```bash
|
||||
# Set API key (one time)
|
||||
export WEBCLAW_API_KEY=wc_your_key_here
|
||||
|
||||
# Automatic fallback: tries local first, cloud on bot detection
|
||||
webclaw https://protected-site.com
|
||||
|
||||
# Force cloud mode (skip local, always use API)
|
||||
webclaw --cloud https://spa-site.com
|
||||
|
||||
# Cloud mode works with all features
|
||||
webclaw --cloud --brand https://stripe.com
|
||||
webclaw --cloud -f json https://producthunt.com
|
||||
webclaw --cloud --crawl --depth 2 https://protected-docs.com
|
||||
```
|
||||
|
||||
## Browser Impersonation
|
||||
|
||||
```bash
|
||||
# Chrome (default) — latest Chrome TLS fingerprint
|
||||
webclaw https://example.com
|
||||
|
||||
# Firefox fingerprint
|
||||
webclaw --browser firefox https://example.com
|
||||
|
||||
# Random browser per request (good for batch)
|
||||
webclaw --browser random --urls-file urls.txt
|
||||
```
|
||||
|
||||
## Custom Headers & Cookies
|
||||
|
||||
```bash
|
||||
# Custom headers
|
||||
webclaw -H "Authorization: Bearer token123" https://api.example.com
|
||||
webclaw -H "Accept-Language: de-DE" https://example.com
|
||||
|
||||
# Cookies
|
||||
webclaw --cookie "session=abc123; theme=dark" https://example.com
|
||||
|
||||
# Multiple headers
|
||||
webclaw -H "X-Custom: value" -H "Authorization: Bearer token" https://example.com
|
||||
```
|
||||
|
||||
## LLM-Powered Features
|
||||
|
||||
These require an LLM provider (Ollama local, or OpenAI/Anthropic API key).
|
||||
|
||||
```bash
|
||||
# Summarize a page (default: 3 sentences)
|
||||
webclaw --summarize https://example.com
|
||||
|
||||
# Control summary length
|
||||
webclaw --summarize 5 https://example.com
|
||||
|
||||
# Extract structured JSON with a schema
|
||||
webclaw --extract-json '{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}' https://example.com/product
|
||||
|
||||
# Extract with a schema from file
|
||||
webclaw --extract-json @schema.json https://example.com/product
|
||||
|
||||
# Extract with natural language prompt
|
||||
webclaw --extract-prompt "Get all pricing tiers with name, price, and features" https://stripe.com/pricing
|
||||
|
||||
# Use a specific LLM provider
|
||||
webclaw --llm-provider ollama --summarize https://example.com
|
||||
webclaw --llm-provider openai --llm-model gpt-4o --extract-prompt "..." https://example.com
|
||||
webclaw --llm-provider anthropic --summarize https://example.com
|
||||
```
|
||||
|
||||
## Raw HTML Output
|
||||
|
||||
```bash
|
||||
# Get the raw fetched HTML (no extraction)
|
||||
webclaw --raw-html https://example.com
|
||||
|
||||
# Useful for debugging extraction issues
|
||||
webclaw --raw-html https://example.com > raw.html
|
||||
webclaw --file raw.html # Then extract locally
|
||||
```
|
||||
|
||||
## Metadata & Verbose Mode
|
||||
|
||||
```bash
|
||||
# Include YAML frontmatter with metadata
|
||||
webclaw --metadata https://example.com
|
||||
# Output:
|
||||
# ---
|
||||
# title: "Example Domain"
|
||||
# source: "https://example.com"
|
||||
# word_count: 20
|
||||
# ---
|
||||
# # Example Domain
|
||||
# ...
|
||||
|
||||
# Verbose logging (debug extraction pipeline)
|
||||
webclaw -v https://example.com
|
||||
```
|
||||
|
||||
## Proxy Usage
|
||||
|
||||
```bash
|
||||
# Single proxy
|
||||
webclaw --proxy http://user:pass@proxy.example.com:8080 https://example.com
|
||||
|
||||
# SOCKS5 proxy
|
||||
webclaw --proxy socks5://proxy.example.com:1080 https://example.com
|
||||
|
||||
# Proxy rotation from file (one per line: host:port:user:pass)
|
||||
webclaw --proxy-file proxies.txt https://example.com
|
||||
|
||||
# Auto-load proxies.txt from current directory
|
||||
echo "proxy1.com:8080:user:pass" > proxies.txt
|
||||
webclaw https://example.com # Automatically detects and uses proxies.txt
|
||||
```
|
||||
|
||||
## MCP Server (AI Agent Integration)
|
||||
|
||||
```bash
|
||||
# Start the MCP server (stdio transport)
|
||||
webclaw-mcp
|
||||
|
||||
# Configure in Claude Desktop (~/.config/claude/claude_desktop_config.json):
|
||||
# {
|
||||
# "mcpServers": {
|
||||
# "webclaw": {
|
||||
# "command": "/path/to/webclaw-mcp",
|
||||
# "env": {
|
||||
# "WEBCLAW_API_KEY": "wc_your_key" // optional, enables cloud fallback
|
||||
# }
|
||||
# }
|
||||
# }
|
||||
# }
|
||||
|
||||
# Available tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search
|
||||
```
|
||||
|
||||
## Real-World Recipes
|
||||
|
||||
### Monitor competitor pricing
|
||||
|
||||
```bash
|
||||
# Save today's pricing
|
||||
webclaw --extract-json '{"type":"array","items":{"type":"object","properties":{"plan":{"type":"string"},"price":{"type":"string"}}}}' \
|
||||
https://competitor.com/pricing -f json > pricing-$(date +%Y%m%d).json
|
||||
```
|
||||
|
||||
### Build a documentation search index
|
||||
|
||||
```bash
|
||||
# Crawl docs and extract as LLM-optimized text
|
||||
webclaw --crawl --sitemap --depth 3 --max-pages 500 -f llm https://docs.example.com > docs.txt
|
||||
```
|
||||
|
||||
### Extract all images from a page
|
||||
|
||||
```bash
|
||||
webclaw https://example.com -f json | jq -r '.content.images[].src'
|
||||
```
|
||||
|
||||
### Get all external links
|
||||
|
||||
```bash
|
||||
webclaw https://example.com -f json | jq -r '.content.links[] | select(.href | startswith("http")) | .href'
|
||||
```
|
||||
|
||||
### Compare two pages
|
||||
|
||||
```bash
|
||||
webclaw https://site-a.com -f json > a.json
|
||||
webclaw https://site-b.com --diff-with a.json
|
||||
```
|
||||
40
packages/create-webclaw/README.md
Normal file
40
packages/create-webclaw/README.md
Normal file
|
|
@ -0,0 +1,40 @@
|
|||
# create-webclaw
|
||||
|
||||
Set up [webclaw](https://webclaw.io) MCP server for AI agents in one command.
|
||||
|
||||
## Usage
|
||||
|
||||
```bash
|
||||
npx create-webclaw
|
||||
```
|
||||
|
||||
## What it does
|
||||
|
||||
1. Detects installed AI tools (Claude Desktop, Claude Code, Cursor, Windsurf, VS Code + Continue)
|
||||
2. Downloads the `webclaw-mcp` binary for your platform
|
||||
3. Asks for your API key (optional — works locally without one)
|
||||
4. Configures MCP in each detected tool
|
||||
|
||||
## Supported tools
|
||||
|
||||
| Tool | Config location |
|
||||
|------|----------------|
|
||||
| Claude Desktop | `~/Library/Application Support/Claude/claude_desktop_config.json` |
|
||||
| Claude Code | `~/.claude.json` |
|
||||
| Cursor | `.cursor/mcp.json` |
|
||||
| Windsurf | `~/.codeium/windsurf/mcp_config.json` |
|
||||
| VS Code (Continue) | `~/.continue/config.json` |
|
||||
|
||||
## MCP tools provided
|
||||
|
||||
After setup, your AI agent has access to:
|
||||
|
||||
- **scrape** — extract content from any URL
|
||||
- **crawl** — recursively crawl a website
|
||||
- **search** — web search + parallel scrape
|
||||
- **map** — discover URLs from sitemaps
|
||||
- **batch** — extract multiple URLs in parallel
|
||||
- **extract** — LLM-powered structured extraction
|
||||
- **summarize** — content summarization
|
||||
- **diff** — track content changes
|
||||
- **brand** — extract brand identity
|
||||
599
packages/create-webclaw/index.mjs
Normal file
599
packages/create-webclaw/index.mjs
Normal file
|
|
@ -0,0 +1,599 @@
|
|||
#!/usr/bin/env node
|
||||
|
||||
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "fs";
|
||||
import { createInterface } from "readline";
|
||||
import { homedir, platform, arch } from "os";
|
||||
import { join, dirname } from "path";
|
||||
import { execSync } from "child_process";
|
||||
import { createWriteStream } from "fs";
|
||||
import { chmod } from "fs/promises";
|
||||
import https from "https";
|
||||
import http from "http";
|
||||
|
||||
// ── Constants ──
|
||||
|
||||
const REPO = "0xMassi/webclaw";
|
||||
const BINARY_NAME = "webclaw-mcp";
|
||||
const INSTALL_DIR = join(homedir(), ".webclaw");
|
||||
const BINARY_PATH = join(INSTALL_DIR, BINARY_NAME);
|
||||
const VERSION = "latest";
|
||||
|
||||
const COLORS = {
|
||||
reset: "\x1b[0m",
|
||||
bold: "\x1b[1m",
|
||||
dim: "\x1b[2m",
|
||||
green: "\x1b[32m",
|
||||
yellow: "\x1b[33m",
|
||||
blue: "\x1b[34m",
|
||||
cyan: "\x1b[36m",
|
||||
red: "\x1b[31m",
|
||||
};
|
||||
|
||||
const c = (color, text) => `${COLORS[color]}${text}${COLORS.reset}`;
|
||||
|
||||
// ── AI Tool Detection ──
|
||||
|
||||
const AI_TOOLS = [
|
||||
{
|
||||
id: "claude-desktop",
|
||||
name: "Claude Desktop",
|
||||
detect: () => {
|
||||
if (platform() === "darwin")
|
||||
return existsSync(
|
||||
join(
|
||||
homedir(),
|
||||
"Library/Application Support/Claude/claude_desktop_config.json",
|
||||
),
|
||||
);
|
||||
if (platform() === "win32")
|
||||
return existsSync(
|
||||
join(process.env.APPDATA || "", "Claude/claude_desktop_config.json"),
|
||||
);
|
||||
return false;
|
||||
},
|
||||
configPath: () => {
|
||||
if (platform() === "darwin")
|
||||
return join(
|
||||
homedir(),
|
||||
"Library/Application Support/Claude/claude_desktop_config.json",
|
||||
);
|
||||
if (platform() === "win32")
|
||||
return join(
|
||||
process.env.APPDATA || "",
|
||||
"Claude/claude_desktop_config.json",
|
||||
);
|
||||
return null;
|
||||
},
|
||||
},
|
||||
{
|
||||
id: "claude-code",
|
||||
name: "Claude Code",
|
||||
detect: () => existsSync(join(homedir(), ".claude.json")),
|
||||
configPath: () => join(homedir(), ".claude.json"),
|
||||
},
|
||||
{
|
||||
id: "cursor",
|
||||
name: "Cursor",
|
||||
detect: () => {
|
||||
// Check for .cursor directory in home or current project
|
||||
return (
|
||||
existsSync(join(homedir(), ".cursor")) ||
|
||||
existsSync(join(process.cwd(), ".cursor"))
|
||||
);
|
||||
},
|
||||
configPath: () => {
|
||||
const projectPath = join(process.cwd(), ".cursor", "mcp.json");
|
||||
const globalPath = join(homedir(), ".cursor", "mcp.json");
|
||||
return existsSync(join(process.cwd(), ".cursor"))
|
||||
? projectPath
|
||||
: globalPath;
|
||||
},
|
||||
},
|
||||
{
|
||||
id: "windsurf",
|
||||
name: "Windsurf",
|
||||
detect: () => {
|
||||
return (
|
||||
existsSync(join(homedir(), ".codeium")) ||
|
||||
existsSync(join(homedir(), ".windsurf"))
|
||||
);
|
||||
},
|
||||
configPath: () =>
|
||||
join(homedir(), ".codeium", "windsurf", "mcp_config.json"),
|
||||
},
|
||||
{
|
||||
id: "vscode-continue",
|
||||
name: "VS Code (Continue)",
|
||||
detect: () => existsSync(join(homedir(), ".continue")),
|
||||
configPath: () => join(homedir(), ".continue", "config.json"),
|
||||
},
|
||||
{
|
||||
id: "opencode",
|
||||
name: "OpenCode",
|
||||
detect: () => {
|
||||
return (
|
||||
existsSync(join(homedir(), ".config", "opencode", "opencode.json")) ||
|
||||
existsSync(join(process.cwd(), "opencode.json"))
|
||||
);
|
||||
},
|
||||
configPath: () => {
|
||||
const projectPath = join(process.cwd(), "opencode.json");
|
||||
const globalPath = join(
|
||||
homedir(),
|
||||
".config",
|
||||
"opencode",
|
||||
"opencode.json",
|
||||
);
|
||||
return existsSync(projectPath) ? projectPath : globalPath;
|
||||
},
|
||||
},
|
||||
{
|
||||
id: "antigravity",
|
||||
name: "Antigravity",
|
||||
detect: () => {
|
||||
return (
|
||||
existsSync(join(homedir(), ".antigravity")) ||
|
||||
existsSync(join(homedir(), ".config", "antigravity"))
|
||||
);
|
||||
},
|
||||
configPath: () => {
|
||||
const configDir = existsSync(join(homedir(), ".config", "antigravity"))
|
||||
? join(homedir(), ".config", "antigravity")
|
||||
: join(homedir(), ".antigravity");
|
||||
return join(configDir, "mcp.json");
|
||||
},
|
||||
},
|
||||
{
|
||||
id: "codex",
|
||||
name: "Codex (CLI + App)",
|
||||
detect: () => existsSync(join(homedir(), ".codex")),
|
||||
configPath: () => join(homedir(), ".codex", "config.toml"),
|
||||
},
|
||||
];
|
||||
|
||||
// ── Helpers ──
|
||||
|
||||
function ask(question) {
|
||||
const rl = createInterface({
|
||||
input: process.stdin,
|
||||
output: process.stdout,
|
||||
});
|
||||
return new Promise((resolve) => {
|
||||
rl.question(question, (answer) => {
|
||||
rl.close();
|
||||
resolve(answer.trim());
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
function download(url) {
|
||||
return new Promise((resolve, reject) => {
|
||||
const client = url.startsWith("https") ? https : http;
|
||||
client
|
||||
.get(url, { headers: { "User-Agent": "create-webclaw" } }, (res) => {
|
||||
// Follow redirects
|
||||
if (
|
||||
res.statusCode >= 300 &&
|
||||
res.statusCode < 400 &&
|
||||
res.headers.location
|
||||
) {
|
||||
return download(res.headers.location).then(resolve).catch(reject);
|
||||
}
|
||||
if (res.statusCode !== 200) {
|
||||
return reject(new Error(`HTTP ${res.statusCode}`));
|
||||
}
|
||||
const chunks = [];
|
||||
res.on("data", (chunk) => chunks.push(chunk));
|
||||
res.on("end", () => resolve(Buffer.concat(chunks)));
|
||||
res.on("error", reject);
|
||||
})
|
||||
.on("error", reject);
|
||||
});
|
||||
}
|
||||
|
||||
async function downloadFile(url, dest) {
|
||||
return new Promise((resolve, reject) => {
|
||||
const client = url.startsWith("https") ? https : http;
|
||||
client
|
||||
.get(url, { headers: { "User-Agent": "create-webclaw" } }, (res) => {
|
||||
if (
|
||||
res.statusCode >= 300 &&
|
||||
res.statusCode < 400 &&
|
||||
res.headers.location
|
||||
) {
|
||||
return downloadFile(res.headers.location, dest)
|
||||
.then(resolve)
|
||||
.catch(reject);
|
||||
}
|
||||
if (res.statusCode !== 200) {
|
||||
return reject(new Error(`HTTP ${res.statusCode}`));
|
||||
}
|
||||
const file = createWriteStream(dest);
|
||||
res.pipe(file);
|
||||
file.on("finish", () => {
|
||||
file.close();
|
||||
resolve();
|
||||
});
|
||||
file.on("error", reject);
|
||||
})
|
||||
.on("error", reject);
|
||||
});
|
||||
}
|
||||
|
||||
function getAssetName() {
|
||||
const os = platform();
|
||||
const a = arch();
|
||||
|
||||
if (os === "darwin" && a === "arm64")
|
||||
return `webclaw-mcp-aarch64-apple-darwin.tar.gz`;
|
||||
if (os === "darwin" && a === "x64")
|
||||
return `webclaw-mcp-x86_64-apple-darwin.tar.gz`;
|
||||
if (os === "linux" && a === "x64")
|
||||
return `webclaw-mcp-x86_64-unknown-linux-gnu.tar.gz`;
|
||||
if (os === "linux" && a === "arm64")
|
||||
return `webclaw-mcp-aarch64-unknown-linux-gnu.tar.gz`;
|
||||
if (os === "win32" && a === "x64")
|
||||
return `webclaw-mcp-x86_64-pc-windows-msvc.zip`;
|
||||
|
||||
return null;
|
||||
}
|
||||
|
||||
function readJsonFile(path) {
|
||||
try {
|
||||
return JSON.parse(readFileSync(path, "utf-8"));
|
||||
} catch {
|
||||
return {};
|
||||
}
|
||||
}
|
||||
|
||||
function writeJsonFile(path, data) {
|
||||
const dir = dirname(path);
|
||||
if (!existsSync(dir)) mkdirSync(dir, { recursive: true });
|
||||
writeFileSync(path, JSON.stringify(data, null, 2) + "\n");
|
||||
}
|
||||
|
||||
function buildMcpEntry(apiKey) {
|
||||
const entry = {
|
||||
command: BINARY_PATH,
|
||||
};
|
||||
if (apiKey) {
|
||||
entry.env = { WEBCLAW_API_KEY: apiKey };
|
||||
}
|
||||
return entry;
|
||||
}
|
||||
|
||||
// ── MCP Config Writers ──
|
||||
|
||||
function addToClaudeDesktop(configPath, apiKey) {
|
||||
const config = readJsonFile(configPath);
|
||||
if (!config.mcpServers) config.mcpServers = {};
|
||||
config.mcpServers.webclaw = buildMcpEntry(apiKey);
|
||||
writeJsonFile(configPath, config);
|
||||
}
|
||||
|
||||
function addToClaudeCode(configPath, apiKey) {
|
||||
const config = readJsonFile(configPath);
|
||||
if (!config.mcpServers) config.mcpServers = {};
|
||||
config.mcpServers.webclaw = buildMcpEntry(apiKey);
|
||||
writeJsonFile(configPath, config);
|
||||
}
|
||||
|
||||
function addToCursor(configPath, apiKey) {
|
||||
const config = readJsonFile(configPath);
|
||||
if (!config.mcpServers) config.mcpServers = {};
|
||||
config.mcpServers.webclaw = {
|
||||
command: BINARY_PATH,
|
||||
...(apiKey ? { env: { WEBCLAW_API_KEY: apiKey } } : {}),
|
||||
};
|
||||
writeJsonFile(configPath, config);
|
||||
}
|
||||
|
||||
function addToWindsurf(configPath, apiKey) {
|
||||
const config = readJsonFile(configPath);
|
||||
if (!config.mcpServers) config.mcpServers = {};
|
||||
config.mcpServers.webclaw = buildMcpEntry(apiKey);
|
||||
writeJsonFile(configPath, config);
|
||||
}
|
||||
|
||||
function addToVSCodeContinue(configPath, apiKey) {
|
||||
const config = readJsonFile(configPath);
|
||||
if (!config.mcpServers) config.mcpServers = [];
|
||||
// Continue uses array format
|
||||
const existing = config.mcpServers.findIndex?.((s) => s.name === "webclaw");
|
||||
const entry = {
|
||||
name: "webclaw",
|
||||
command: BINARY_PATH,
|
||||
...(apiKey ? { env: { WEBCLAW_API_KEY: apiKey } } : {}),
|
||||
};
|
||||
if (existing >= 0) {
|
||||
config.mcpServers[existing] = entry;
|
||||
} else if (Array.isArray(config.mcpServers)) {
|
||||
config.mcpServers.push(entry);
|
||||
}
|
||||
writeJsonFile(configPath, config);
|
||||
}
|
||||
|
||||
function addToOpenCode(configPath, apiKey) {
|
||||
const config = readJsonFile(configPath);
|
||||
if (!config.mcp) config.mcp = {};
|
||||
config.mcp.webclaw = {
|
||||
type: "local",
|
||||
command: [BINARY_PATH],
|
||||
enabled: true,
|
||||
};
|
||||
if (apiKey) {
|
||||
config.mcp.webclaw.environment = { WEBCLAW_API_KEY: apiKey };
|
||||
}
|
||||
writeJsonFile(configPath, config);
|
||||
}
|
||||
|
||||
function addToAntigravity(configPath, apiKey) {
|
||||
const config = readJsonFile(configPath);
|
||||
if (!config.mcpServers) config.mcpServers = {};
|
||||
config.mcpServers.webclaw = buildMcpEntry(apiKey);
|
||||
writeJsonFile(configPath, config);
|
||||
}
|
||||
|
||||
function addToCodex(configPath, apiKey) {
|
||||
// Codex uses TOML format, not JSON. Append MCP server config section.
|
||||
const dir = dirname(configPath);
|
||||
if (!existsSync(dir)) mkdirSync(dir, { recursive: true });
|
||||
|
||||
let existing = "";
|
||||
try {
|
||||
existing = readFileSync(configPath, "utf-8");
|
||||
} catch {
|
||||
// File doesn't exist yet
|
||||
}
|
||||
|
||||
// Remove any existing webclaw MCP section
|
||||
existing = existing.replace(
|
||||
/\n?\[mcp_servers\.webclaw\][^\[]*(?=\[|$)/gs,
|
||||
"",
|
||||
);
|
||||
|
||||
let section = `\n[mcp_servers.webclaw]\ncommand = "${BINARY_PATH}"\nargs = []\nenabled = true\n`;
|
||||
if (apiKey) {
|
||||
section += `env = { WEBCLAW_API_KEY = "${apiKey}" }\n`;
|
||||
}
|
||||
|
||||
writeFileSync(configPath, existing.trimEnd() + "\n" + section);
|
||||
}
|
||||
|
||||
const CONFIG_WRITERS = {
|
||||
"claude-desktop": addToClaudeDesktop,
|
||||
"claude-code": addToClaudeCode,
|
||||
cursor: addToCursor,
|
||||
windsurf: addToWindsurf,
|
||||
"vscode-continue": addToVSCodeContinue,
|
||||
opencode: addToOpenCode,
|
||||
antigravity: addToAntigravity,
|
||||
codex: addToCodex,
|
||||
};
|
||||
|
||||
// ── Main ──
|
||||
|
||||
async function main() {
|
||||
console.log();
|
||||
console.log(c("bold", " ┌─────────────────────────────────────┐"));
|
||||
console.log(
|
||||
c("bold", " │") +
|
||||
c("cyan", " webclaw") +
|
||||
c("dim", " — MCP setup for AI agents") +
|
||||
c("bold", " │"),
|
||||
);
|
||||
console.log(c("bold", " └─────────────────────────────────────┘"));
|
||||
console.log();
|
||||
|
||||
// 1. Detect installed AI tools
|
||||
console.log(c("bold", " Detecting AI tools..."));
|
||||
console.log();
|
||||
|
||||
const detected = AI_TOOLS.filter((tool) => {
|
||||
try {
|
||||
return tool.detect();
|
||||
} catch {
|
||||
return false;
|
||||
}
|
||||
});
|
||||
|
||||
if (detected.length === 0) {
|
||||
console.log(c("yellow", " No supported AI tools detected."));
|
||||
console.log();
|
||||
console.log(c("dim", " Supported tools:"));
|
||||
for (const tool of AI_TOOLS) {
|
||||
console.log(c("dim", ` • ${tool.name}`));
|
||||
}
|
||||
console.log();
|
||||
console.log(
|
||||
c("dim", " Install one of these tools and run this command again."),
|
||||
);
|
||||
console.log(c("dim", " Or use --manual to configure manually."));
|
||||
console.log();
|
||||
|
||||
if (process.argv.includes("--manual")) {
|
||||
// Continue anyway for manual setup
|
||||
} else {
|
||||
process.exit(0);
|
||||
}
|
||||
}
|
||||
|
||||
for (const tool of detected) {
|
||||
console.log(c("green", ` ✓ ${tool.name}`));
|
||||
}
|
||||
console.log();
|
||||
|
||||
// 2. Ask for API key
|
||||
console.log(c("dim", " An API key enables cloud features."));
|
||||
console.log(
|
||||
c("dim", " Without one, webclaw runs locally (free, no account needed)."),
|
||||
);
|
||||
console.log();
|
||||
|
||||
const apiKey = await ask(
|
||||
c("bold", " API key ") +
|
||||
c("dim", "(press Enter to skip for local-only): "),
|
||||
);
|
||||
console.log();
|
||||
|
||||
// 3. Download binary
|
||||
console.log(c("bold", " Downloading webclaw-mcp..."));
|
||||
|
||||
const assetName = getAssetName();
|
||||
if (!assetName) {
|
||||
console.log(c("red", ` Unsupported platform: ${platform()}-${arch()}`));
|
||||
console.log(
|
||||
c(
|
||||
"dim",
|
||||
" Build from source: cargo install --git https://github.com/0xMassi/webclaw webclaw-mcp",
|
||||
),
|
||||
);
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
if (!existsSync(INSTALL_DIR)) {
|
||||
mkdirSync(INSTALL_DIR, { recursive: true });
|
||||
}
|
||||
|
||||
let downloaded = false;
|
||||
|
||||
try {
|
||||
// Get latest release URL
|
||||
const releaseData = await download(
|
||||
`https://api.github.com/repos/${REPO}/releases/latest`,
|
||||
);
|
||||
const release = JSON.parse(releaseData.toString());
|
||||
const asset = release.assets?.find((a) => a.name === assetName);
|
||||
|
||||
if (asset) {
|
||||
const tarPath = join(INSTALL_DIR, assetName);
|
||||
await downloadFile(asset.browser_download_url, tarPath);
|
||||
|
||||
// Extract
|
||||
if (assetName.endsWith(".tar.gz")) {
|
||||
execSync(`tar xzf "${tarPath}" -C "${INSTALL_DIR}"`, {
|
||||
stdio: "ignore",
|
||||
});
|
||||
} else if (assetName.endsWith(".zip")) {
|
||||
execSync(`unzip -o "${tarPath}" -d "${INSTALL_DIR}"`, {
|
||||
stdio: "ignore",
|
||||
});
|
||||
}
|
||||
|
||||
// Make executable
|
||||
await chmod(BINARY_PATH, 0o755);
|
||||
|
||||
// Cleanup archive
|
||||
try {
|
||||
execSync(`rm "${tarPath}"`, { stdio: "ignore" });
|
||||
} catch {}
|
||||
|
||||
console.log(c("green", ` ✓ Installed to ${BINARY_PATH}`));
|
||||
downloaded = true;
|
||||
}
|
||||
} catch (e) {
|
||||
// Release not available yet — expected before first release
|
||||
}
|
||||
|
||||
if (!downloaded) {
|
||||
// Try cargo install as fallback
|
||||
console.log(
|
||||
c("yellow", " No pre-built binary found. Trying cargo install..."),
|
||||
);
|
||||
try {
|
||||
execSync(
|
||||
`cargo install --git https://github.com/${REPO} webclaw-mcp --root "${INSTALL_DIR}"`,
|
||||
{ stdio: "inherit" },
|
||||
);
|
||||
// cargo install puts binary in INSTALL_DIR/bin/
|
||||
const cargoPath = join(INSTALL_DIR, "bin", BINARY_NAME);
|
||||
if (existsSync(cargoPath)) {
|
||||
// Move to expected location
|
||||
execSync(`mv "${cargoPath}" "${BINARY_PATH}"`, {
|
||||
stdio: "ignore",
|
||||
});
|
||||
console.log(c("green", ` ✓ Built and installed to ${BINARY_PATH}`));
|
||||
downloaded = true;
|
||||
}
|
||||
} catch {
|
||||
console.log(
|
||||
c("red", " Failed to install. Make sure Rust is installed:"),
|
||||
);
|
||||
console.log(
|
||||
c(
|
||||
"dim",
|
||||
" curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh",
|
||||
),
|
||||
);
|
||||
process.exit(1);
|
||||
}
|
||||
}
|
||||
|
||||
console.log();
|
||||
|
||||
// 4. Configure each detected tool
|
||||
console.log(c("bold", " Configuring MCP servers..."));
|
||||
console.log();
|
||||
|
||||
for (const tool of detected) {
|
||||
const configPath = tool.configPath();
|
||||
if (!configPath) continue;
|
||||
|
||||
const writer = CONFIG_WRITERS[tool.id];
|
||||
if (!writer) continue;
|
||||
|
||||
try {
|
||||
writer(configPath, apiKey || null);
|
||||
console.log(
|
||||
c("green", ` ✓ ${tool.name}`) + c("dim", ` → ${configPath}`),
|
||||
);
|
||||
} catch (e) {
|
||||
console.log(c("red", ` ✗ ${tool.name}: ${e.message}`));
|
||||
}
|
||||
}
|
||||
|
||||
console.log();
|
||||
|
||||
// 5. Verify
|
||||
if (downloaded) {
|
||||
try {
|
||||
const version = execSync(`"${BINARY_PATH}" --version`, {
|
||||
encoding: "utf-8",
|
||||
}).trim();
|
||||
console.log(c("green", ` ✓ ${version}`));
|
||||
} catch {
|
||||
console.log(c("green", ` ✓ webclaw-mcp installed`));
|
||||
}
|
||||
}
|
||||
|
||||
// 6. Summary
|
||||
console.log();
|
||||
console.log(c("bold", " Done! webclaw is ready."));
|
||||
console.log();
|
||||
console.log(c("dim", " Your AI agent now has these tools:"));
|
||||
console.log(c("dim", " • scrape — extract content from any URL"));
|
||||
console.log(c("dim", " • crawl — recursively crawl a website"));
|
||||
console.log(c("dim", " • search — web search + parallel scrape"));
|
||||
console.log(c("dim", " • map — discover URLs from sitemaps"));
|
||||
console.log(c("dim", " • batch — extract multiple URLs in parallel"));
|
||||
console.log();
|
||||
|
||||
if (!apiKey) {
|
||||
console.log(c("yellow", " Running in local-only mode (no API key)."));
|
||||
console.log(
|
||||
c(
|
||||
"dim",
|
||||
" Get an API key at https://webclaw.io/dashboard for cloud features.",
|
||||
),
|
||||
);
|
||||
console.log();
|
||||
}
|
||||
|
||||
console.log(c("dim", " Restart your AI tool to activate the MCP server."));
|
||||
console.log();
|
||||
}
|
||||
|
||||
main().catch((e) => {
|
||||
console.error(c("red", `\n Error: ${e.message}\n`));
|
||||
process.exit(1);
|
||||
});
|
||||
32
packages/create-webclaw/package.json
Normal file
32
packages/create-webclaw/package.json
Normal file
|
|
@ -0,0 +1,32 @@
|
|||
{
|
||||
"name": "create-webclaw",
|
||||
"version": "0.1.0",
|
||||
"description": "Set up webclaw MCP server for AI agents (Claude, Cursor, Windsurf, OpenCode, Codex, Antigravity)",
|
||||
"bin": {
|
||||
"create-webclaw": "./index.mjs"
|
||||
},
|
||||
"type": "module",
|
||||
"keywords": [
|
||||
"webclaw",
|
||||
"mcp",
|
||||
"ai",
|
||||
"scraping",
|
||||
"claude",
|
||||
"cursor",
|
||||
"windsurf",
|
||||
"opencode",
|
||||
"codex",
|
||||
"antigravity",
|
||||
"web-scraping"
|
||||
],
|
||||
"author": "webclaw",
|
||||
"license": "MIT",
|
||||
"repository": {
|
||||
"type": "git",
|
||||
"url": "https://github.com/0xMassi/webclaw"
|
||||
},
|
||||
"homepage": "https://webclaw.io",
|
||||
"engines": {
|
||||
"node": ">=18"
|
||||
}
|
||||
}
|
||||
10
proxies.example.txt
Normal file
10
proxies.example.txt
Normal file
|
|
@ -0,0 +1,10 @@
|
|||
# Webclaw Proxy List
|
||||
# Copy this file to proxies.txt and add your proxies.
|
||||
# webclaw auto-loads proxies.txt when it exists — no config needed.
|
||||
#
|
||||
# Format: host:port:user:pass (one per line)
|
||||
# Lines starting with # are ignored.
|
||||
#
|
||||
# Example:
|
||||
# 123.45.67.89:8080:username:password
|
||||
# proxy2.example.com:3128:user:pass123
|
||||
1
rustfmt.toml
Normal file
1
rustfmt.toml
Normal file
|
|
@ -0,0 +1 @@
|
|||
style_edition = "2024"
|
||||
498
setup.sh
Executable file
498
setup.sh
Executable file
|
|
@ -0,0 +1,498 @@
|
|||
#!/usr/bin/env bash
|
||||
# setup.sh — Local setup for webclaw
|
||||
#
|
||||
# Checks prerequisites, builds binaries, configures .env,
|
||||
# optionally installs Ollama, and wires up the MCP server.
|
||||
#
|
||||
# Usage:
|
||||
# ./setup.sh # Interactive full setup
|
||||
# ./setup.sh --minimal # Build only, skip configuration
|
||||
# ./setup.sh --check # Check prerequisites without installing
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Colors
|
||||
# ---------------------------------------------------------------------------
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
CYAN='\033[0;36m'
|
||||
BOLD='\033[1m'
|
||||
DIM='\033[2m'
|
||||
RESET='\033[0m'
|
||||
|
||||
info() { printf "${BLUE}[*]${RESET} %s\n" "$*"; }
|
||||
success() { printf "${GREEN}[+]${RESET} %s\n" "$*"; }
|
||||
warn() { printf "${YELLOW}[!]${RESET} %s\n" "$*"; }
|
||||
error() { printf "${RED}[x]${RESET} %s\n" "$*" >&2; }
|
||||
|
||||
prompt() {
|
||||
local var_name="$1" prompt_text="$2" default="${3:-}"
|
||||
if [[ -n "$default" ]]; then
|
||||
printf "${CYAN} %s${DIM} [%s]${RESET}: " "$prompt_text" "$default"
|
||||
else
|
||||
printf "${CYAN} %s${RESET}: " "$prompt_text"
|
||||
fi
|
||||
read -r input
|
||||
eval "$var_name=\"${input:-$default}\""
|
||||
}
|
||||
|
||||
prompt_secret() {
|
||||
local var_name="$1" prompt_text="$2" default="${3:-}"
|
||||
if [[ -n "$default" ]]; then
|
||||
printf "${CYAN} %s${DIM} [%s]${RESET}: " "$prompt_text" "$default"
|
||||
else
|
||||
printf "${CYAN} %s${RESET}: " "$prompt_text"
|
||||
fi
|
||||
read -rs input
|
||||
echo
|
||||
eval "$var_name=\"${input:-$default}\""
|
||||
}
|
||||
|
||||
prompt_yn() {
|
||||
local prompt_text="$1" default="${2:-y}"
|
||||
local hint="Y/n"
|
||||
[[ "$default" == "n" ]] && hint="y/N"
|
||||
printf "${CYAN} %s${DIM} [%s]${RESET}: " "$prompt_text" "$hint"
|
||||
read -r input
|
||||
input="${input:-$default}"
|
||||
[[ "$input" =~ ^[Yy]$ ]]
|
||||
}
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Step 1: Check prerequisites
|
||||
# ---------------------------------------------------------------------------
|
||||
check_prerequisites() {
|
||||
echo
|
||||
printf "${BOLD}${GREEN} Step 1: Prerequisites${RESET}\n"
|
||||
echo
|
||||
|
||||
local all_good=true
|
||||
|
||||
# Rust
|
||||
if command -v rustc &>/dev/null; then
|
||||
local rust_version
|
||||
rust_version=$(rustc --version | awk '{print $2}')
|
||||
success "Rust $rust_version"
|
||||
|
||||
# Check minimum version (1.85 for edition 2024)
|
||||
local major minor
|
||||
major=$(echo "$rust_version" | cut -d. -f1)
|
||||
minor=$(echo "$rust_version" | cut -d. -f2)
|
||||
if [[ "$major" -lt 1 ]] || [[ "$major" -eq 1 && "$minor" -lt 85 ]]; then
|
||||
warn "Rust 1.85+ required (edition 2024). Run: rustup update"
|
||||
all_good=false
|
||||
fi
|
||||
else
|
||||
error "Rust not found. Install: https://rustup.rs"
|
||||
all_good=false
|
||||
fi
|
||||
|
||||
# cargo
|
||||
if command -v cargo &>/dev/null; then
|
||||
success "Cargo $(cargo --version | awk '{print $2}')"
|
||||
else
|
||||
error "Cargo not found (should come with Rust)"
|
||||
all_good=false
|
||||
fi
|
||||
|
||||
# Ollama (optional)
|
||||
if command -v ollama &>/dev/null; then
|
||||
success "Ollama installed"
|
||||
if curl -sf http://localhost:11434/api/tags &>/dev/null; then
|
||||
success "Ollama is running"
|
||||
local models
|
||||
models=$(curl -sf http://localhost:11434/api/tags | python3 -c "import sys,json; [print(m['name']) for m in json.load(sys.stdin).get('models',[])]" 2>/dev/null || echo "")
|
||||
if [[ -n "$models" ]]; then
|
||||
success "Models: $(echo "$models" | tr '\n' ', ' | sed 's/,$//')"
|
||||
else
|
||||
warn "No models pulled yet"
|
||||
fi
|
||||
else
|
||||
warn "Ollama installed but not running. Start with: ollama serve"
|
||||
fi
|
||||
else
|
||||
warn "Ollama not found (optional — needed for local LLM features)"
|
||||
fi
|
||||
|
||||
# Git
|
||||
if command -v git &>/dev/null; then
|
||||
success "Git $(git --version | awk '{print $3}')"
|
||||
else
|
||||
error "Git not found"
|
||||
all_good=false
|
||||
fi
|
||||
|
||||
echo
|
||||
if $all_good; then
|
||||
success "All prerequisites met."
|
||||
else
|
||||
error "Some prerequisites are missing. Fix them before continuing."
|
||||
[[ "${1:-}" == "--check" ]] && exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Step 2: Build
|
||||
# ---------------------------------------------------------------------------
|
||||
build_binaries() {
|
||||
echo
|
||||
printf "${BOLD}${GREEN} Step 2: Build${RESET}\n"
|
||||
echo
|
||||
|
||||
info "Building release binaries (this may take a few minutes on first build)..."
|
||||
cd "$SCRIPT_DIR"
|
||||
|
||||
if cargo build --release 2>&1 | tail -5; then
|
||||
echo
|
||||
success "Built 3 binaries:"
|
||||
ls -lh target/release/webclaw target/release/webclaw-server target/release/webclaw-mcp 2>/dev/null | \
|
||||
awk '{printf " %-20s %s\n", $NF, $5}'
|
||||
else
|
||||
error "Build failed. Check the output above."
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Step 3: Configure .env
|
||||
# ---------------------------------------------------------------------------
|
||||
configure_env() {
|
||||
echo
|
||||
printf "${BOLD}${GREEN} Step 3: Configuration${RESET}\n"
|
||||
echo
|
||||
|
||||
if [[ -f "$SCRIPT_DIR/.env" ]]; then
|
||||
warn ".env already exists."
|
||||
if ! prompt_yn "Overwrite?"; then
|
||||
info "Keeping existing .env"
|
||||
return
|
||||
fi
|
||||
fi
|
||||
|
||||
local ollama_model="qwen3:8b"
|
||||
local openai_key=""
|
||||
local anthropic_key=""
|
||||
local proxy_file=""
|
||||
local server_port="3000"
|
||||
local auth_key=""
|
||||
|
||||
info "LLM configuration"
|
||||
prompt ollama_model "Ollama model (local)" "qwen3:8b"
|
||||
prompt_secret openai_key "OpenAI API key (optional, press enter to skip)" ""
|
||||
prompt_secret anthropic_key "Anthropic API key (optional, press enter to skip)" ""
|
||||
|
||||
echo
|
||||
info "Proxy configuration"
|
||||
if [[ -f "$SCRIPT_DIR/proxies.txt" ]]; then
|
||||
local proxy_count
|
||||
proxy_count=$(grep -cv '^\s*#\|^\s*$' "$SCRIPT_DIR/proxies.txt" 2>/dev/null || echo "0")
|
||||
success "proxies.txt found with $proxy_count proxies (auto-loaded)"
|
||||
else
|
||||
info "To use proxies, create proxies.txt with one proxy per line:"
|
||||
printf " ${DIM}Format: host:port:user:pass${RESET}\n"
|
||||
printf " ${DIM}cp proxies.example.txt proxies.txt${RESET}\n"
|
||||
fi
|
||||
local proxy_file=""
|
||||
|
||||
echo
|
||||
info "Server configuration"
|
||||
prompt server_port "REST API port" "3000"
|
||||
prompt auth_key "API auth key (press enter to auto-generate)" ""
|
||||
|
||||
if [[ -z "$auth_key" ]]; then
|
||||
if command -v openssl &>/dev/null; then
|
||||
auth_key=$(openssl rand -hex 16)
|
||||
else
|
||||
auth_key=$(LC_ALL=C tr -dc 'a-f0-9' < /dev/urandom | head -c 32)
|
||||
fi
|
||||
info "Generated auth key: $auth_key"
|
||||
fi
|
||||
|
||||
# Write .env
|
||||
cat > "$SCRIPT_DIR/.env" <<EOF
|
||||
# webclaw configuration — generated by setup.sh
|
||||
|
||||
# --- LLM Providers ---
|
||||
OLLAMA_HOST=http://localhost:11434
|
||||
OLLAMA_MODEL=$ollama_model
|
||||
EOF
|
||||
|
||||
if [[ -n "$openai_key" ]]; then
|
||||
echo "OPENAI_API_KEY=$openai_key" >> "$SCRIPT_DIR/.env"
|
||||
fi
|
||||
if [[ -n "$anthropic_key" ]]; then
|
||||
echo "ANTHROPIC_API_KEY=$anthropic_key" >> "$SCRIPT_DIR/.env"
|
||||
fi
|
||||
|
||||
cat >> "$SCRIPT_DIR/.env" <<EOF
|
||||
|
||||
# --- Proxy ---
|
||||
EOF
|
||||
if [[ -n "$proxy_file" ]]; then
|
||||
echo "WEBCLAW_PROXY_FILE=$proxy_file" >> "$SCRIPT_DIR/.env"
|
||||
else
|
||||
echo "# WEBCLAW_PROXY_FILE=/path/to/proxies.txt" >> "$SCRIPT_DIR/.env"
|
||||
fi
|
||||
|
||||
cat >> "$SCRIPT_DIR/.env" <<EOF
|
||||
|
||||
# --- Server ---
|
||||
WEBCLAW_PORT=$server_port
|
||||
WEBCLAW_HOST=0.0.0.0
|
||||
WEBCLAW_AUTH_KEY=$auth_key
|
||||
|
||||
# --- Logging ---
|
||||
WEBCLAW_LOG=info
|
||||
EOF
|
||||
|
||||
echo
|
||||
success ".env created."
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Step 4: Install Ollama (optional)
|
||||
# ---------------------------------------------------------------------------
|
||||
setup_ollama() {
|
||||
echo
|
||||
printf "${BOLD}${GREEN} Step 4: Ollama (Local LLM)${RESET}\n"
|
||||
echo
|
||||
|
||||
if ! command -v ollama &>/dev/null; then
|
||||
info "Ollama is not installed."
|
||||
info "It's optional but needed for local LLM features (extract, summarize)."
|
||||
info "Without it, you can still use OpenAI/Anthropic APIs."
|
||||
echo
|
||||
if prompt_yn "Install Ollama?" "y"; then
|
||||
info "Installing Ollama..."
|
||||
if [[ "$(uname)" == "Darwin" ]]; then
|
||||
if command -v brew &>/dev/null; then
|
||||
brew install ollama
|
||||
else
|
||||
warn "Install Ollama manually: https://ollama.ai/download"
|
||||
return
|
||||
fi
|
||||
else
|
||||
curl -fsSL https://ollama.ai/install.sh | sh
|
||||
fi
|
||||
success "Ollama installed."
|
||||
else
|
||||
info "Skipping Ollama. You can install later: https://ollama.ai"
|
||||
return
|
||||
fi
|
||||
fi
|
||||
|
||||
# Check if running
|
||||
if ! curl -sf http://localhost:11434/api/tags &>/dev/null; then
|
||||
warn "Ollama is not running."
|
||||
if [[ "$(uname)" == "Darwin" ]]; then
|
||||
info "On macOS, open the Ollama app or run: ollama serve"
|
||||
else
|
||||
info "Start with: ollama serve"
|
||||
fi
|
||||
echo
|
||||
if prompt_yn "Start Ollama now?" "y"; then
|
||||
nohup ollama serve &>/dev/null &
|
||||
sleep 2
|
||||
if curl -sf http://localhost:11434/api/tags &>/dev/null; then
|
||||
success "Ollama is running."
|
||||
else
|
||||
warn "Ollama didn't start. Start it manually and re-run setup."
|
||||
return
|
||||
fi
|
||||
else
|
||||
return
|
||||
fi
|
||||
fi
|
||||
|
||||
# Pull model
|
||||
local model
|
||||
model=$(grep '^OLLAMA_MODEL=' "$SCRIPT_DIR/.env" 2>/dev/null | cut -d= -f2 || echo "qwen3:8b")
|
||||
|
||||
local has_model
|
||||
has_model=$(curl -sf http://localhost:11434/api/tags | python3 -c "import sys,json; models=[m['name'] for m in json.load(sys.stdin).get('models',[])]; print('yes' if any('$model' in m for m in models) else 'no')" 2>/dev/null || echo "no")
|
||||
|
||||
if [[ "$has_model" == "yes" ]]; then
|
||||
success "Model $model already available."
|
||||
else
|
||||
info "Model $model not found locally."
|
||||
if prompt_yn "Pull $model now? (this downloads ~5GB)" "y"; then
|
||||
ollama pull "$model"
|
||||
success "Model $model ready."
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Step 5: Configure MCP server for Claude Desktop
|
||||
# ---------------------------------------------------------------------------
|
||||
setup_mcp() {
|
||||
echo
|
||||
printf "${BOLD}${GREEN} Step 5: MCP Server (Claude Desktop integration)${RESET}\n"
|
||||
echo
|
||||
|
||||
local mcp_binary="$SCRIPT_DIR/target/release/webclaw-mcp"
|
||||
if [[ ! -f "$mcp_binary" ]]; then
|
||||
warn "webclaw-mcp binary not found. Build first."
|
||||
return
|
||||
fi
|
||||
|
||||
info "The MCP server lets Claude Desktop use webclaw's tools directly."
|
||||
info "Tools: scrape, crawl, map, batch, extract, summarize, diff, brand"
|
||||
echo
|
||||
|
||||
if ! prompt_yn "Configure MCP server for Claude Desktop?" "y"; then
|
||||
info "Skipping MCP setup."
|
||||
info "You can configure it later by adding to your Claude Desktop config:"
|
||||
printf ' {"mcpServers": {"webclaw": {"command": "%s"}}}\n' "$mcp_binary"
|
||||
return
|
||||
fi
|
||||
|
||||
# Find Claude Desktop config
|
||||
local config_path=""
|
||||
if [[ "$(uname)" == "Darwin" ]]; then
|
||||
config_path="$HOME/Library/Application Support/Claude/claude_desktop_config.json"
|
||||
else
|
||||
config_path="$HOME/.config/claude/claude_desktop_config.json"
|
||||
fi
|
||||
|
||||
if [[ ! -f "$config_path" ]]; then
|
||||
# Create config directory and file
|
||||
mkdir -p "$(dirname "$config_path")"
|
||||
echo '{}' > "$config_path"
|
||||
info "Created Claude Desktop config at: $config_path"
|
||||
fi
|
||||
|
||||
# Read existing config and merge
|
||||
local existing
|
||||
existing=$(cat "$config_path")
|
||||
|
||||
# Check if webclaw is already configured
|
||||
if echo "$existing" | python3 -c "import sys,json; c=json.load(sys.stdin); exit(0 if 'webclaw' in c.get('mcpServers',{}) else 1)" 2>/dev/null; then
|
||||
warn "webclaw MCP server already configured in Claude Desktop."
|
||||
if ! prompt_yn "Update the path?" "y"; then
|
||||
return
|
||||
fi
|
||||
fi
|
||||
|
||||
# Merge webclaw into mcpServers
|
||||
local updated
|
||||
updated=$(echo "$existing" | python3 -c "
|
||||
import sys, json
|
||||
config = json.load(sys.stdin)
|
||||
if 'mcpServers' not in config:
|
||||
config['mcpServers'] = {}
|
||||
config['mcpServers']['webclaw'] = {
|
||||
'command': '$mcp_binary'
|
||||
}
|
||||
print(json.dumps(config, indent=2))
|
||||
")
|
||||
|
||||
echo "$updated" > "$config_path"
|
||||
success "MCP server configured in Claude Desktop."
|
||||
info "Restart Claude Desktop to activate."
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Step 6: Smoke test
|
||||
# ---------------------------------------------------------------------------
|
||||
smoke_test() {
|
||||
echo
|
||||
printf "${BOLD}${GREEN} Step 6: Smoke Test${RESET}\n"
|
||||
echo
|
||||
|
||||
local webclaw="$SCRIPT_DIR/target/release/webclaw"
|
||||
|
||||
info "Testing extraction..."
|
||||
local output
|
||||
output=$("$webclaw" https://example.com --format llm 2>/dev/null || echo "FAILED")
|
||||
|
||||
if [[ "$output" == "FAILED" ]]; then
|
||||
warn "Extraction test failed. Check your network connection."
|
||||
else
|
||||
local word_count
|
||||
word_count=$(echo "$output" | wc -w | tr -d ' ')
|
||||
success "Extracted example.com: $word_count words"
|
||||
fi
|
||||
|
||||
# Test Ollama if available
|
||||
if curl -sf http://localhost:11434/api/tags &>/dev/null; then
|
||||
info "Testing LLM summarization..."
|
||||
local summary
|
||||
summary=$("$webclaw" https://example.com --summarize 2>/dev/null || echo "FAILED")
|
||||
if [[ "$summary" == "FAILED" ]]; then
|
||||
warn "LLM test failed. Check Ollama and model availability."
|
||||
else
|
||||
success "LLM summarization works."
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Summary
|
||||
# ---------------------------------------------------------------------------
|
||||
print_summary() {
|
||||
local webclaw="$SCRIPT_DIR/target/release/webclaw"
|
||||
local server="$SCRIPT_DIR/target/release/webclaw-server"
|
||||
local mcp="$SCRIPT_DIR/target/release/webclaw-mcp"
|
||||
local port
|
||||
port=$(grep '^WEBCLAW_PORT=' "$SCRIPT_DIR/.env" 2>/dev/null | cut -d= -f2 || echo "3000")
|
||||
|
||||
echo
|
||||
printf "${BOLD}${GREEN} Setup Complete${RESET}\n"
|
||||
echo
|
||||
printf " ${BOLD}CLI:${RESET}\n"
|
||||
printf " %s https://example.com --format llm\n" "$webclaw"
|
||||
echo
|
||||
printf " ${BOLD}REST API:${RESET}\n"
|
||||
printf " %s\n" "$server"
|
||||
printf " curl http://localhost:%s/health\n" "$port"
|
||||
echo
|
||||
printf " ${BOLD}MCP Server:${RESET}\n"
|
||||
printf " Configured in Claude Desktop (restart to activate)\n"
|
||||
echo
|
||||
printf " ${BOLD}Config:${RESET} %s/.env\n" "$SCRIPT_DIR"
|
||||
printf " ${BOLD}Docs:${RESET} %s/README.md\n" "$SCRIPT_DIR"
|
||||
echo
|
||||
printf " ${DIM}Tip: Add to PATH for convenience:${RESET}\n"
|
||||
printf " export PATH=\"%s/target/release:\$PATH\"\n" "$SCRIPT_DIR"
|
||||
echo
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main
|
||||
# ---------------------------------------------------------------------------
|
||||
main() {
|
||||
echo
|
||||
printf "${BOLD}${GREEN} webclaw — Local Setup${RESET}\n"
|
||||
printf "${DIM} Web extraction toolkit for AI agents${RESET}\n"
|
||||
echo
|
||||
|
||||
local mode="${1:-}"
|
||||
|
||||
if [[ "$mode" == "--check" ]]; then
|
||||
check_prerequisites "--check"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
check_prerequisites
|
||||
|
||||
build_binaries
|
||||
|
||||
if [[ "$mode" == "--minimal" ]]; then
|
||||
success "Minimal build complete. Run ./setup.sh for full configuration."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
configure_env
|
||||
setup_ollama
|
||||
setup_mcp
|
||||
smoke_test
|
||||
print_summary
|
||||
}
|
||||
|
||||
main "$@"
|
||||
Loading…
Add table
Add a link
Reference in a new issue