mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-04-25 00:06:21 +02:00
feat(cli+mcp): vertical extractor support (28 extractors discoverable + callable)
Wires the vertical extractor catalog into both the CLI and the MCP
server so users don't have to hit the HTTP API to invoke them. Same
semantics as `/v1/scrape/{vertical}` + `/v1/extractors`.
CLI (webclaw-cli):
- New subcommand `webclaw extractors` lists all 28 extractors with
name, label, and sample URL. `--json` flag emits the full catalog
as machine-readable JSON.
- New subcommand `webclaw vertical <name> <url>` runs a specific
extractor and prints typed JSON. Pretty-printed by default; `--raw`
for single-line. Exits 1 with a clear "URL does not match" error
on mismatch.
- FetchClient built with Firefox profile + cloud fallback attached
when WEBCLAW_API_KEY is set, so antibot-gated verticals escalate.
MCP (webclaw-mcp):
- New tool `list_extractors` (no args) returns the catalog as
pretty-printed JSON for in-session discovery.
- New tool `vertical_scrape` takes `{name, url}` and returns typed
JSON. Reuses the long-lived self.fetch_client.
- Tool count goes from 10 to 12. Server-info instruction string
updated accordingly.
Tests: 215 passing, clippy clean. Manually smoke-tested end-to-end:
the CLI prints real Reddit/GitHub/PyPI data; an MCP JSON-RPC session returns
the 28-entry catalog plus typed responses for pypi/requests and rust-lang/rust
in 200-400ms.
Version bumped to 0.5.2 (patch-level bump; the API additions are backwards compatible).
This commit is contained in:
parent
058493bc8f
commit
0daa2fec1a
6 changed files with 190 additions and 9 deletions
14
CHANGELOG.md
14
CHANGELOG.md
|
|
@ -3,6 +3,20 @@
|
||||||
All notable changes to webclaw are documented here.
|
All notable changes to webclaw are documented here.
|
||||||
Format follows [Keep a Changelog](https://keepachangelog.com/).
|
Format follows [Keep a Changelog](https://keepachangelog.com/).
|
||||||
|
|
||||||
|
## [0.5.2] — 2026-04-22
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- **`webclaw vertical <name> <url>` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`. URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1.
|
||||||
|
|
||||||
|
- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick.
|
||||||
|
|
||||||
|
- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set.
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded 10). Also `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## [0.5.1] — 2026-04-22
|
## [0.5.1] — 2026-04-22
|
||||||
|
|
||||||
### Added
|
### Added
|
||||||
|
|
|
||||||
14
Cargo.lock
generated
14
Cargo.lock
generated
|
|
@ -3199,7 +3199,7 @@ dependencies = [
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "webclaw-cli"
|
name = "webclaw-cli"
|
||||||
version = "0.5.1"
|
version = "0.5.2"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
"clap",
|
"clap",
|
||||||
"dotenvy",
|
"dotenvy",
|
||||||
|
|
@ -3220,7 +3220,7 @@ dependencies = [
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "webclaw-core"
|
name = "webclaw-core"
|
||||||
version = "0.5.1"
|
version = "0.5.2"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
"ego-tree",
|
"ego-tree",
|
||||||
"once_cell",
|
"once_cell",
|
||||||
|
|
@ -3238,7 +3238,7 @@ dependencies = [
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "webclaw-fetch"
|
name = "webclaw-fetch"
|
||||||
version = "0.5.1"
|
version = "0.5.2"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
"async-trait",
|
"async-trait",
|
||||||
"bytes",
|
"bytes",
|
||||||
|
|
@ -3263,7 +3263,7 @@ dependencies = [
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "webclaw-llm"
|
name = "webclaw-llm"
|
||||||
version = "0.5.1"
|
version = "0.5.2"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
"async-trait",
|
"async-trait",
|
||||||
"reqwest",
|
"reqwest",
|
||||||
|
|
@ -3276,7 +3276,7 @@ dependencies = [
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "webclaw-mcp"
|
name = "webclaw-mcp"
|
||||||
version = "0.5.1"
|
version = "0.5.2"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
"dirs",
|
"dirs",
|
||||||
"dotenvy",
|
"dotenvy",
|
||||||
|
|
@ -3296,7 +3296,7 @@ dependencies = [
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "webclaw-pdf"
|
name = "webclaw-pdf"
|
||||||
version = "0.5.1"
|
version = "0.5.2"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
"pdf-extract",
|
"pdf-extract",
|
||||||
"thiserror",
|
"thiserror",
|
||||||
|
|
@ -3305,7 +3305,7 @@ dependencies = [
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "webclaw-server"
|
name = "webclaw-server"
|
||||||
version = "0.5.1"
|
version = "0.5.2"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
"anyhow",
|
"anyhow",
|
||||||
"axum",
|
"axum",
|
||||||
|
|
|
||||||
|
|
@ -3,7 +3,7 @@ resolver = "2"
|
||||||
members = ["crates/*"]
|
members = ["crates/*"]
|
||||||
|
|
||||||
[workspace.package]
|
[workspace.package]
|
||||||
version = "0.5.1"
|
version = "0.5.2"
|
||||||
edition = "2024"
|
edition = "2024"
|
||||||
license = "AGPL-3.0"
|
license = "AGPL-3.0"
|
||||||
repository = "https://github.com/0xMassi/webclaw"
|
repository = "https://github.com/0xMassi/webclaw"
|
||||||
|
|
|
||||||
|
|
@ -308,6 +308,34 @@ enum Commands {
|
||||||
#[arg(long)]
|
#[arg(long)]
|
||||||
facts: Option<PathBuf>,
|
facts: Option<PathBuf>,
|
||||||
},
|
},
|
||||||
|
|
||||||
|
/// List all vertical extractors in the catalog.
|
||||||
|
///
|
||||||
|
/// Each entry has a stable `name` (usable with `webclaw vertical <name>`),
|
||||||
|
/// a human-friendly label, a one-line description, and the URL
|
||||||
|
/// patterns it claims. The same data is served by `/v1/extractors`
|
||||||
|
/// when running the REST API.
|
||||||
|
Extractors {
|
||||||
|
/// Emit JSON instead of a human-friendly table.
|
||||||
|
#[arg(long)]
|
||||||
|
json: bool,
|
||||||
|
},
|
||||||
|
|
||||||
|
/// Run a vertical extractor by name. Returns typed JSON with fields
|
||||||
|
/// specific to the target site (title, price, author, rating, etc.)
|
||||||
|
/// rather than generic markdown.
|
||||||
|
///
|
||||||
|
/// Use `webclaw extractors` to see the full list. Example:
|
||||||
|
/// `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/`.
|
||||||
|
Vertical {
|
||||||
|
/// Vertical name (e.g. `reddit`, `github_repo`, `trustpilot_reviews`).
|
||||||
|
name: String,
|
||||||
|
/// URL to extract.
|
||||||
|
url: String,
|
||||||
|
/// Emit compact JSON (single line). Default is pretty-printed.
|
||||||
|
#[arg(long)]
|
||||||
|
raw: bool,
|
||||||
|
},
|
||||||
}
|
}
|
||||||
|
|
||||||
#[derive(Clone, ValueEnum)]
|
#[derive(Clone, ValueEnum)]
|
||||||
|
|
@ -2288,6 +2316,83 @@ async fn main() {
|
||||||
}
|
}
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
Commands::Extractors { json } => {
|
||||||
|
let entries = webclaw_fetch::extractors::list();
|
||||||
|
if *json {
|
||||||
|
// Serialize with serde_json. ExtractorInfo derives
|
||||||
|
// Serialize so this is a one-liner.
|
||||||
|
match serde_json::to_string_pretty(&entries) {
|
||||||
|
Ok(s) => println!("{s}"),
|
||||||
|
Err(e) => {
|
||||||
|
eprintln!("error: failed to serialise catalog: {e}");
|
||||||
|
process::exit(1);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
// Human-friendly table: NAME + LABEL + one URL
|
||||||
|
// pattern sample. Keeps the output scannable on a
|
||||||
|
// narrow terminal.
|
||||||
|
println!("{} vertical extractors available:\n", entries.len());
|
||||||
|
let name_w = entries.iter().map(|e| e.name.len()).max().unwrap_or(0);
|
||||||
|
let label_w = entries.iter().map(|e| e.label.len()).max().unwrap_or(0);
|
||||||
|
for e in &entries {
|
||||||
|
let pattern_sample = e.url_patterns.first().copied().unwrap_or("");
|
||||||
|
println!(
|
||||||
|
" {:<nw$} {:<lw$} {}",
|
||||||
|
e.name,
|
||||||
|
e.label,
|
||||||
|
pattern_sample,
|
||||||
|
nw = name_w,
|
||||||
|
lw = label_w,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
println!("\nRun one: webclaw vertical <name> <url>");
|
||||||
|
}
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
Commands::Vertical { name, url, raw } => {
|
||||||
|
// Build a FetchClient with cloud fallback attached when
|
||||||
|
// WEBCLAW_API_KEY is set. Antibot-gated verticals
|
||||||
|
// (amazon, ebay, etsy, trustpilot) need this to escalate
|
||||||
|
// on bot protection.
|
||||||
|
let fetch_cfg = webclaw_fetch::FetchConfig {
|
||||||
|
browser: webclaw_fetch::BrowserProfile::Firefox,
|
||||||
|
..webclaw_fetch::FetchConfig::default()
|
||||||
|
};
|
||||||
|
let mut client = match webclaw_fetch::FetchClient::new(fetch_cfg) {
|
||||||
|
Ok(c) => c,
|
||||||
|
Err(e) => {
|
||||||
|
eprintln!("error: failed to build fetch client: {e}");
|
||||||
|
process::exit(1);
|
||||||
|
}
|
||||||
|
};
|
||||||
|
if let Some(cloud) = webclaw_fetch::cloud::CloudClient::from_env() {
|
||||||
|
client = client.with_cloud(cloud);
|
||||||
|
}
|
||||||
|
match webclaw_fetch::extractors::dispatch_by_name(&client, name, url).await {
|
||||||
|
Ok(data) => {
|
||||||
|
let rendered = if *raw {
|
||||||
|
serde_json::to_string(&data)
|
||||||
|
} else {
|
||||||
|
serde_json::to_string_pretty(&data)
|
||||||
|
};
|
||||||
|
match rendered {
|
||||||
|
Ok(s) => println!("{s}"),
|
||||||
|
Err(e) => {
|
||||||
|
eprintln!("error: JSON encode failed: {e}");
|
||||||
|
process::exit(1);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
Err(e) => {
|
||||||
|
// UrlMismatch / UnknownVertical / Fetch all get
|
||||||
|
// Display impls with actionable messages.
|
||||||
|
eprintln!("error: {e}");
|
||||||
|
process::exit(1);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -718,6 +718,50 @@ impl WebclawMcp {
|
||||||
Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
|
Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// List every vertical extractor the server knows about. Returns a
|
||||||
|
/// JSON array of `{name, label, description, url_patterns}` entries.
|
||||||
|
/// Call this to discover what verticals are available before using
|
||||||
|
/// `vertical_scrape`.
|
||||||
|
#[tool]
|
||||||
|
async fn list_extractors(
|
||||||
|
&self,
|
||||||
|
Parameters(_params): Parameters<ListExtractorsParams>,
|
||||||
|
) -> Result<String, String> {
|
||||||
|
let catalog = webclaw_fetch::extractors::list();
|
||||||
|
serde_json::to_string_pretty(&catalog)
|
||||||
|
.map_err(|e| format!("failed to serialise extractor catalog: {e}"))
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Run a vertical extractor by name and return typed JSON specific
|
||||||
|
/// to the target site (title, price, rating, author, etc.), not
|
||||||
|
/// generic markdown. Use `list_extractors` to discover available
|
||||||
|
/// names. Example names: `reddit`, `github_repo`, `trustpilot_reviews`,
|
||||||
|
/// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`.
|
||||||
|
///
|
||||||
|
/// Antibot-gated verticals (amazon_product, ebay_listing,
|
||||||
|
/// etsy_listing, trustpilot_reviews) will automatically escalate to
|
||||||
|
/// the webclaw cloud API when local fetch hits bot protection,
|
||||||
|
/// provided `WEBCLAW_API_KEY` is set.
|
||||||
|
#[tool]
|
||||||
|
async fn vertical_scrape(
|
||||||
|
&self,
|
||||||
|
Parameters(params): Parameters<VerticalParams>,
|
||||||
|
) -> Result<String, String> {
|
||||||
|
validate_url(¶ms.url)?;
|
||||||
|
// Reuse the long-lived default FetchClient. Extractors accept
|
||||||
|
// `&dyn Fetcher`; FetchClient implements the trait so this just
|
||||||
|
// works (see webclaw_fetch::Fetcher and client::FetchClient).
|
||||||
|
let data = webclaw_fetch::extractors::dispatch_by_name(
|
||||||
|
self.fetch_client.as_ref(),
|
||||||
|
¶ms.name,
|
||||||
|
¶ms.url,
|
||||||
|
)
|
||||||
|
.await
|
||||||
|
.map_err(|e| e.to_string())?;
|
||||||
|
serde_json::to_string_pretty(&data)
|
||||||
|
.map_err(|e| format!("failed to serialise extractor output: {e}"))
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
#[tool_handler]
|
#[tool_handler]
|
||||||
|
|
@ -727,7 +771,8 @@ impl ServerHandler for WebclawMcp {
|
||||||
.with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION")))
|
.with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION")))
|
||||||
.with_instructions(String::from(
|
.with_instructions(String::from(
|
||||||
"Webclaw MCP server -- web content extraction for AI agents. \
|
"Webclaw MCP server -- web content extraction for AI agents. \
|
||||||
Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.",
|
Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \
|
||||||
|
list_extractors, vertical_scrape.",
|
||||||
))
|
))
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -103,3 +103,20 @@ pub struct SearchParams {
|
||||||
/// Number of results to return (default: 10)
|
/// Number of results to return (default: 10)
|
||||||
pub num_results: Option<u32>,
|
pub num_results: Option<u32>,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Parameters for `vertical_scrape`: run a site-specific extractor by name.
|
||||||
|
#[derive(Debug, Deserialize, JsonSchema)]
|
||||||
|
pub struct VerticalParams {
|
||||||
|
/// Name of the vertical extractor. Call `list_extractors` to see all
|
||||||
|
/// available names. Examples: "reddit", "github_repo", "pypi",
|
||||||
|
/// "trustpilot_reviews", "youtube_video", "shopify_product".
|
||||||
|
pub name: String,
|
||||||
|
/// URL to extract. Must match the URL patterns the extractor claims;
|
||||||
|
/// otherwise the tool returns a clear "URL mismatch" error.
|
||||||
|
pub url: String,
|
||||||
|
}
|
||||||
|
|
||||||
|
/// `list_extractors` takes no arguments but we still need an empty struct
|
||||||
|
/// so rmcp can generate a schema and parse the (empty) JSON-RPC params.
|
||||||
|
#[derive(Debug, Deserialize, JsonSchema)]
|
||||||
|
pub struct ListExtractorsParams {}
|
||||||
|
|
|
||||||
Loading…
Add table
Add a link
Reference in a new issue