mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-15 23:35:14 +02:00
New extractors module returns site-specific typed JSON instead of
generic markdown. Each extractor:
- declares a URL pattern via matches()
- fetches from the site's official JSON API where one exists
- returns a typed serde_json::Value with documented field names
- exposes an INFO struct that powers the /v1/extractors catalog
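A minimal sketch of that extractor shape, using only the standard library. The `ExtractorInfo` field names follow the catalog columns (name, label, description, url_patterns); the struct name, the `pypi.org/project/{name}` page pattern, and the matching logic are assumptions for illustration, not the crate's actual code. The fetch step and the `serde_json::Value` result are left out.

```rust
/// Hypothetical catalog entry for one extractor.
#[derive(Debug)]
pub struct ExtractorInfo {
    pub name: &'static str,
    pub label: &'static str,
    pub description: &'static str,
    pub url_patterns: &'static [&'static str],
}

/// Assumed metadata for a PyPI extractor (values are illustrative).
pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "pypi",
    label: "PyPI package",
    description: "Package metadata from the PyPI JSON API",
    url_patterns: &["https://pypi.org/project/{name}"],
};

/// Returns the package name from a `pypi.org/project/{name}` URL, if present.
pub fn parse_package(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let path = rest.strip_prefix("pypi.org/project/")?;
    // Keep only the first path segment; drop trailing slash, query, fragment.
    let name = path
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    if name.is_empty() {
        None
    } else {
        Some(name)
    }
}

/// The URL pattern check: true only when a package name can be parsed.
pub fn matches(url: &str) -> bool {
    parse_package(url).is_some()
}
```

Deriving `matches()` from the parser keeps the two from drifting apart: a URL is accepted only if the extractor can also pull the identifier it needs out of it.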
First 6 verticals shipped, all hitting public JSON APIs (no HTML
scraping, zero antibot risk):
- reddit → www.reddit.com/*/.json
- hackernews → hn.algolia.com/api/v1/items/{id} (full thread in one call)
- github_repo → api.github.com/repos/{owner}/{repo}
- pypi → pypi.org/pypi/{name}/json
- npm → registry.npmjs.org/{name} + downloads/point/last-week
- huggingface_model → huggingface.co/api/models/{owner}/{name}
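As an illustration of the path parsing behind one of these mappings, the sketch below turns a `github.com/{owner}/{repo}` page URL into the `api.github.com/repos/{owner}/{repo}` endpoint. The function name and the rule that deeper paths (issues, tree, blob) are rejected are assumptions, not taken from the crate.

```rust
/// Hypothetical helper: maps a repository page URL to the REST endpoint.
/// Returns None for anything that is not exactly `{owner}/{repo}`.
pub fn github_api_url(url: &str) -> Option<String> {
    let rest = url.strip_prefix("https://github.com/")?;
    // Drop any query string or fragment before splitting the path.
    let rest = rest
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(rest);
    let mut segments = rest.split('/').filter(|s| !s.is_empty());
    let owner = segments.next()?;
    let repo = segments.next()?;
    // A third segment means the URL points inside the repo, not at its root.
    if segments.next().is_some() {
        return None;
    }
    // Accept clone-style URLs that end in `.git`.
    let repo = repo.strip_suffix(".git").unwrap_or(repo);
    Some(format!("https://api.github.com/repos/{}/{}", owner, repo))
}
```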
Server-side routes added:
- POST /v1/scrape/{vertical} explicit per-vertical extraction
- GET /v1/extractors catalog (name, label, description, url_patterns)
The dispatcher validates that the URL matches the requested vertical
before running the extractor, so users get "URL doesn't match the X
extractor" instead of opaque parse failures inside the extractor.
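The validate-before-run step could look roughly like the sketch below. The `DispatchError` type, the `validate` function, and the two simplified matchers (including the front-end URL prefixes they check) are hypothetical stand-ins; only the error wording comes from the message above.

```rust
use std::fmt;

/// Hypothetical error type for the dispatcher.
#[derive(Debug, PartialEq)]
pub enum DispatchError {
    UnknownVertical(String),
    UrlMismatch { vertical: String },
}

impl fmt::Display for DispatchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DispatchError::UnknownVertical(v) => write!(f, "unknown vertical: {}", v),
            DispatchError::UrlMismatch { vertical } => {
                write!(f, "URL doesn't match the {} extractor", vertical)
            }
        }
    }
}

// Simplified stand-ins for each extractor's matches(); the prefixes are assumed.
fn matches_hackernews(url: &str) -> bool {
    url.starts_with("https://news.ycombinator.com/item?id=")
}

fn matches_npm(url: &str) -> bool {
    url.starts_with("https://www.npmjs.com/package/")
}

/// Checks the URL against the requested vertical before any fetch happens.
pub fn validate(vertical: &str, url: &str) -> Result<(), DispatchError> {
    let matcher: fn(&str) -> bool = match vertical {
        "hackernews" => matches_hackernews,
        "npm" => matches_npm,
        other => return Err(DispatchError::UnknownVertical(other.to_string())),
    };
    if matcher(url) {
        Ok(())
    } else {
        Err(DispatchError::UrlMismatch {
            vertical: vertical.to_string(),
        })
    }
}
```

Failing here, before the extractor runs, means a mismatched request never reaches the upstream API and the caller sees which vertical rejected the URL.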
17 unit tests cover URL matching + path parsing for each vertical.
Live tests against canonical URLs (rust-lang/rust, requests pypi,
react npm, whisper-large-v3 hf, item 8863 hn, an r/micro_saas post)
all return correct typed JSON in 100-300ms. Sample sizes: github
863B, npm 700B, pypi 1.7KB, hf 3.2KB, hn 38KB (full comment tree).
Marketing positioning: Firecrawl charges 5 credits per /extract call
and you write the schema. Webclaw returns the same JSON in 1 credit
per /scrape/{vertical} call with hand-written deterministic
extractors per site.
19 lines
598 B
Rust
//! HTTP route handlers.
//!
//! The OSS server exposes a deliberately small surface that mirrors the
//! hosted-API JSON shapes where the underlying capability exists in the
//! OSS crates. Endpoints that depend on private infrastructure
//! (anti-bot bypass with stealth Chrome, JS rendering at scale,
//! per-user auth, billing, async job queues, agent loops) are
//! intentionally not implemented here. Use api.webclaw.io for those.

pub mod batch;
pub mod brand;
pub mod crawl;
pub mod diff;
pub mod extract;
pub mod health;
pub mod map;
pub mod scrape;
pub mod structured;
pub mod summarize;