feat(extractors): add vertical extractors module + first 6 verticals

New extractors module returns site-specific typed JSON instead of
generic markdown. Each extractor:

- declares a URL pattern via matches()
- fetches from the site's official JSON API where one exists
- returns a typed serde_json::Value with documented field names
- exposes an INFO struct that powers the /v1/extractors catalog

First 6 verticals shipped, all hitting public JSON APIs (no HTML
scraping, zero antibot risk):

- reddit            → www.reddit.com/*/.json
- hackernews        → hn.algolia.com/api/v1/items/{id} (full thread in one call)
- github_repo       → api.github.com/repos/{owner}/{repo}
- pypi              → pypi.org/pypi/{name}/json
- npm               → registry.npmjs.org/{name} + downloads/point/last-week
- huggingface_model → huggingface.co/api/models/{owner}/{name}

Server-side routes added:

- POST /v1/scrape/{vertical}  explicit per-vertical extraction
- GET  /v1/extractors         catalog (name, label, description, url_patterns)

The dispatcher validates that the URL matches the requested vertical
before running, so users get "URL doesn't match the X extractor"
instead of opaque parse failures inside the extractor.

17 unit tests cover URL matching + path parsing for each vertical.
Live tests against canonical URLs (rust-lang/rust, requests pypi,
react npm, whisper-large-v3 hf, item 8863 hn, an r/micro_saas post)
all return correct typed JSON in 100-300ms. Sample sizes: github 863B,
npm 700B, pypi 1.7KB, hf 3.2KB, hn 38KB (full comment tree).

Marketing positioning: Firecrawl charges 5 credits per /extract call
and you write the schema. Webclaw returns the same JSON in 1 credit
per /scrape/{vertical} call with hand-written deterministic extractors
per site.
2026-07-02 04:08:08 +02:00 · 2026-04-22 14:11:43 +02:00 · 2026-04-22 14:11:43 +02:00 · 8ba7538c37
commit 8ba7538c37
parent ccdb6d364b
11 changed files with 1535 additions and 0 deletions
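
The extractor contract and dispatcher check described in the commit message can be pictured roughly as follows. This is a minimal std-only sketch, not the code from this commit: `ExtractorInfo`, `Extractor`, `registry`, `dispatch`, the two parser functions, and the `url_patterns` strings are all hypothetical names chosen for illustration, and the real module fetches from the upstream API and returns a `serde_json::Value`, which is omitted here.

```rust
// Illustrative sketch only. Every name here is an assumption, not the
// actual webclaw API.

/// Catalog entry with the fields the commit lists for GET /v1/extractors.
#[derive(Debug, Clone, Copy)]
pub struct ExtractorInfo {
    pub name: &'static str,
    pub label: &'static str,
    pub description: &'static str,
    pub url_patterns: &'static [&'static str],
}

/// One vertical: its catalog info plus a URL matcher.
pub struct Extractor {
    pub info: ExtractorInfo,
    pub matches: fn(&str) -> bool,
}

/// Strip scheme + host, then drop any query string or fragment.
fn path_after<'a>(url: &'a str, host: &str) -> Option<&'a str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?
        .strip_prefix(host)?
        .strip_prefix('/')?;
    rest.split(|c| c == '?' || c == '#').next()
}

/// Parse `https://github.com/{owner}/{repo}` into (owner, repo).
pub fn parse_github_repo(url: &str) -> Option<(&str, &str)> {
    let mut parts = path_after(url, "github.com")?.trim_end_matches('/').split('/');
    let owner = parts.next().filter(|s| !s.is_empty())?;
    let repo = parts.next().filter(|s| !s.is_empty())?;
    // Deeper paths (issues, blobs, ...) are not a repository root.
    if parts.next().is_some() {
        return None;
    }
    Some((owner, repo))
}

/// Parse `https://pypi.org/project/{name}/` into the package name.
pub fn parse_pypi_name(url: &str) -> Option<&str> {
    let mut parts = path_after(url, "pypi.org")?.trim_end_matches('/').split('/');
    if parts.next()? != "project" {
        return None;
    }
    parts.next().filter(|s| !s.is_empty())
}

pub fn registry() -> Vec<Extractor> {
    vec![
        Extractor {
            info: ExtractorInfo {
                name: "github_repo",
                label: "GitHub repository",
                description: "Repository metadata from api.github.com",
                url_patterns: &["https://github.com/{owner}/{repo}"],
            },
            matches: |url| parse_github_repo(url).is_some(),
        },
        Extractor {
            info: ExtractorInfo {
                name: "pypi",
                label: "PyPI package",
                description: "Package metadata from pypi.org",
                url_patterns: &["https://pypi.org/project/{name}/"],
            },
            matches: |url| parse_pypi_name(url).is_some(),
        },
    ]
}

/// Look up the requested vertical and check the URL before any fetch runs,
/// so a mismatch is reported up front rather than as a parse failure later.
pub fn dispatch<'a>(
    registry: &'a [Extractor],
    vertical: &str,
    url: &str,
) -> Result<&'a Extractor, String> {
    let ex = registry
        .iter()
        .find(|e| e.info.name == vertical)
        .ok_or_else(|| format!("unknown vertical: {vertical}"))?;
    if !(ex.matches)(url) {
        return Err(format!("URL doesn't match the {} extractor", ex.info.name));
    }
    Ok(ex)
}
```

Under these assumptions, `dispatch(&registry(), "github_repo", "https://pypi.org/project/requests/")` returns the mismatch error without touching the network, while the same call with `https://github.com/rust-lang/rust` returns the `github_repo` extractor.
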
--- a/crates/webclaw-server/src/main.rs
+++ b/crates/webclaw-server/src/main.rs
@@ -79,10 +79,15 @@ async fn main() -> anyhow::Result<()> {

    let v1 = Router::new()
        .route("/scrape", post(routes::scrape::scrape))
+        .route(
+            "/scrape/{vertical}",
+            post(routes::structured::scrape_vertical),
+        )
        .route("/crawl", post(routes::crawl::crawl))
        .route("/map", post(routes::map::map))
        .route("/batch", post(routes::batch::batch))
        .route("/extract", post(routes::extract::extract))
+        .route("/extractors", get(routes::structured::list_extractors))
        .route("/summarize", post(routes::summarize::summarize_route))
        .route("/diff", post(routes::diff::diff_route))
        .route("/brand", post(routes::brand::brand))