feat(cli+mcp): vertical extractor support (28 extractors discoverable + callable)
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run

Wires the vertical extractor catalog into both the CLI and the MCP
server so users don't have to hit the HTTP API to invoke them. Same
semantics as `/v1/scrape/{vertical}` + `/v1/extractors`.

CLI (webclaw-cli):
- New subcommand `webclaw extractors` lists all 28 extractors with
  name, label, and sample URL. `--json` flag emits the full catalog
  as machine-readable JSON.
- New subcommand `webclaw vertical <name> <url>` runs a specific
  extractor and prints typed JSON. Pretty-printed by default; `--raw`
  for single-line. Exits 1 with a clear "URL does not match" error
  on mismatch.
- FetchClient built with Firefox profile + cloud fallback attached
  when WEBCLAW_API_KEY is set, so antibot-gated verticals escalate.

MCP (webclaw-mcp):
- New tool `list_extractors` (no args) returns the catalog as
  pretty-printed JSON for in-session discovery.
- New tool `vertical_scrape` takes `{name, url}` and returns typed
  JSON. Reuses the long-lived self.fetch_client.
- Tool count goes from 10 to 12. Server-info instruction string
  updated accordingly.

Tests: 215 passing, clippy clean. Manual surface-tested end-to-end:
CLI prints real Reddit/github/pypi data; MCP JSON-RPC session returns
28-entry catalog + typed responses for pypi/requests + rust-lang/rust
in 200-400ms.

Version bumped to 0.5.2 (minor for API additions, backwards compatible).
This commit is contained in:
Valerio 2026-04-22 21:41:15 +02:00
parent 058493bc8f
commit 0daa2fec1a
6 changed files with 190 additions and 9 deletions

View file

@ -103,3 +103,20 @@ pub struct SearchParams {
/// Number of results to return (default: 10)
pub num_results: Option<u32>,
}
/// Parameters for `vertical_scrape`: run a site-specific extractor by name.
#[derive(Debug, Deserialize, JsonSchema)]
pub struct VerticalParams {
/// Name of the vertical extractor. Call `list_extractors` to see all
/// available names. Examples: "reddit", "github_repo", "pypi",
/// "trustpilot_reviews", "youtube_video", "shopify_product".
pub name: String,
/// URL to extract. Must match the URL patterns the extractor claims;
/// otherwise the tool returns a clear "URL mismatch" error.
pub url: String,
}
/// `list_extractors` takes no arguments but we still need an empty struct
/// so rmcp can generate a schema and parse the (empty) JSON-RPC params.
#[derive(Debug, Deserialize, JsonSchema)]
pub struct ListExtractorsParams {}