webclaw

apunkt/webclaw

Fork 0

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-08 22:25:12 +02:00

Commit graph

Author	SHA1	Message	Date
Valerio	058493bc8f	feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now take `client: &dyn Fetcher` instead of `client: &FetchClient` directly. Backwards-compatible: FetchClient implements Fetcher, blanket impls cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server callers keep working unchanged. Motivation: the production API server (api.webclaw.io) must not do in-process TLS fingerprinting; it delegates all HTTP to the Go tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on production would have required importing wreq into the server's dep graph, violating the CLAUDE.md rule. Now production can provide its own TlsSidecarFetcher implementation and pass it to the same dispatcher the OSS server uses. Changes: - New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus blanket impls for `&T` and `Arc<T>`. - `FetchClient` gains a tiny impl block in client.rs that forwards to its existing public methods. - All 28 extractor signatures migrated from `&FetchClient` to `&dyn Fetcher` (sed-driven bulk rewrite, no semantic change). - `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`. - `extractors::dispatch_by_url` and `extractors::dispatch_by_name` take `&dyn Fetcher`. - `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has native async-fn-in-trait but dyn dispatch still needs async_trait). - Version bumped to 0.5.1, CHANGELOG updated. Tests: 215 passing in webclaw-fetch (no new tests needed — the existing extractor tests exercise the trait methods transparently). Clippy: clean workspace-wide.	2026-04-22 21:17:50 +02:00
Valerio	b041f3cddd	feat(extractors): wave 2 \u2014 8 more verticals (14 total) Adds 8 more vertical extractors using public JSON APIs. All hit deterministic endpoints with no antibot risk. Live tests pass against canonical URLs for each. AI / ML ecosystem (3): - crates_io \u2192 crates.io/api/v1/crates/{name} - huggingface_dataset \u2192 huggingface.co/api/datasets/{path} (handles both legacy /datasets/{name} and canonical {owner}/{name}) - arxiv \u2192 export.arxiv.org/api/query (Atom XML parsed by quick-xml) Code / version control (2): - github_pr \u2192 api.github.com/repos/{owner}/{repo}/pulls/{number} - github_release \u2192 api.github.com/repos/{owner}/{repo}/releases/tags/{tag} Infrastructure (1): - docker_hub \u2192 hub.docker.com/v2/repositories/{namespace}/{name} (official-image shorthand /_/nginx normalized to library/nginx) Community / publishing (2): - dev_to \u2192 dev.to/api/articles/{username}/{slug} - stackoverflow \u2192 api.stackexchange.com/2.3/questions/{id} + answers, filter=withbody for rendered HTML, sort=votes for consistent top-answers ordering Live test results (real URLs): - serde: 942M downloads, 838B response - 'Attention Is All You Need': abstract + authors, 1.8KB - nginx official: 12.9B pulls, 21k stars, 17KB - openai/gsm8k: 822k downloads, 1.7KB - rust-lang/rust#138000: merged by RalfJung, +3/-2, 1KB - webclaw v0.4.0: 2.4KB - a real dev.to article: 2.2KB body, 3.1KB total - python yield Q&A: score 13133, 51 answers, 104KB Catalog now exposes 14 extractors via GET /v1/extractors. Total unit tests across the module: 34 passing. Clippy clean. Fmt clean. Marketing positioning sharpens: 14 dedicated extractors, all deterministic, all 1-credit-per-call. Firecrawl's /extract is 5 credits per call and you write the schema yourself.	2026-04-22 14:20:21 +02:00

Author

SHA1

Message

Date

Valerio

058493bc8f

feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend

Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now
take `client: &dyn Fetcher` instead of `client: &FetchClient` directly.
Backwards-compatible: FetchClient implements Fetcher, blanket impls
cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server
callers keep working unchanged.

Motivation: the production API server (api.webclaw.io) must not do
in-process TLS fingerprinting; it delegates all HTTP to the Go
tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on
production would have required importing wreq into the server's
dep graph, violating the CLAUDE.md rule. Now production can provide
its own TlsSidecarFetcher implementation and pass it to the same
dispatcher the OSS server uses.

Changes:
- New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus
  blanket impls for `&T` and `Arc<T>`.
- `FetchClient` gains a tiny impl block in client.rs that forwards to
  its existing public methods.
- All 28 extractor signatures migrated from `&FetchClient` to
  `&dyn Fetcher` (sed-driven bulk rewrite, no semantic change).
- `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`.
- `extractors::dispatch_by_url` and `extractors::dispatch_by_name`
  take `&dyn Fetcher`.
- `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has
  native async-fn-in-trait but dyn dispatch still needs async_trait).
- Version bumped to 0.5.1, CHANGELOG updated.

Tests: 215 passing in webclaw-fetch (no new tests needed — the existing
extractor tests exercise the trait methods transparently).
Clippy: clean workspace-wide.

2026-04-22 21:17:50 +02:00

Valerio

b041f3cddd

feat(extractors): wave 2 \u2014 8 more verticals (14 total)

Adds 8 more vertical extractors using public JSON APIs. All hit
deterministic endpoints with no antibot risk. Live tests pass
against canonical URLs for each.

AI / ML ecosystem (3):
- crates_io          \u2192 crates.io/api/v1/crates/{name}
- huggingface_dataset \u2192 huggingface.co/api/datasets/{path} (handles both
                       legacy /datasets/{name} and canonical {owner}/{name})
- arxiv              \u2192 export.arxiv.org/api/query (Atom XML parsed by quick-xml)

Code / version control (2):
- github_pr      \u2192 api.github.com/repos/{owner}/{repo}/pulls/{number}
- github_release \u2192 api.github.com/repos/{owner}/{repo}/releases/tags/{tag}

Infrastructure (1):
- docker_hub \u2192 hub.docker.com/v2/repositories/{namespace}/{name}
              (official-image shorthand /_/nginx normalized to library/nginx)

Community / publishing (2):
- dev_to        \u2192 dev.to/api/articles/{username}/{slug}
- stackoverflow \u2192 api.stackexchange.com/2.3/questions/{id} + answers,
                  filter=withbody for rendered HTML, sort=votes for
                  consistent top-answers ordering

Live test results (real URLs):
- serde:                 942M downloads, 838B response
- 'Attention Is All You Need': abstract + authors, 1.8KB
- nginx official:        12.9B pulls, 21k stars, 17KB
- openai/gsm8k:          822k downloads, 1.7KB
- rust-lang/rust#138000: merged by RalfJung, +3/-2, 1KB
- webclaw v0.4.0:        2.4KB
- a real dev.to article: 2.2KB body, 3.1KB total
- python yield Q&A:      score 13133, 51 answers, 104KB

Catalog now exposes 14 extractors via GET /v1/extractors. Total
unit tests across the module: 34 passing. Clippy clean. Fmt clean.

Marketing positioning sharpens: 14 dedicated extractors, all
deterministic, all 1-credit-per-call. Firecrawl's /extract is
5 credits per call and you write the schema yourself.

2026-04-22 14:20:21 +02:00

2 commits