Commit graph

6 commits

Author SHA1 Message Date
Valerio
480d3187db docs(claude-md): document search, map, and perf; refresh stale details
Bring core/CLAUDE.md current with the slices rescued this cycle, and fold
in earlier /init corrections that were never committed.

New capabilities documented:
- search: webclaw-fetch `search.rs` (Serper BYO-key) + the CLI `search`
  subcommand + the OSS `POST /v1/search` route (gated on SERPER_API_KEY)
  + the now-local-first MCP `search` tool.
- map: webclaw-fetch `map.rs` (`discover_urls`/`MapOptions`, sitemap +
  bounded crawl fallback), gzip sitemap support, and the new
  `--map-pages`/`--no-map-crawl`/`--map-limit` CLI flags.
- perf: shared `extractors/og.rs` parser and the QuickJS runtime gate /
  parsed-document reuse noted on `js_eval.rs`.

Corrections folded in: real browser fingerprint versions live in tls.rs
(not browser.rs), accurate module/route lists, Repo Layout section, and
removal of the now-false "search lives only in production" notes.
Bumped the stated workspace version to 0.6.13.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:10:36 +02:00
Valerio
23544f8fac docs(claude): note youtube.rs role and yt-dlp short-circuit in server
The webclaw-core youtube module produces structured markdown but no
transcript; document that and point at the production server's
youtube_transcript.rs short-circuit for the full YoutubeData + caption
text shape.
2026-05-03 21:17:23 +02:00
Valerio
e1af2da509 docs(claude): drop sidecar references, mention ProductionFetcher
2026-04-23 13:25:23 +02:00
Valerio
aaa5103504 docs(claude): fix stale primp references, document wreq + Fetcher trait
webclaw-fetch switched from primp to wreq 6.x (BoringSSL) a while ago
but CLAUDE.md still documented primp, the `[patch.crates-io]`
requirement, and RUSTFLAGS that no longer apply. Refreshed four
sections:

- Crate listing: webclaw-fetch uses wreq, not primp
- client.rs description: wreq BoringSSL, plus a note that FetchClient
  will implement the new Fetcher trait so production can swap in a
  tls-sidecar-backed fetcher without importing wreq
- Hard Rules: dropped obsolete `[patch.crates-io]` and RUSTFLAGS lines,
  added the "Vertical extractors take `&dyn Fetcher`" rule that makes
  the architectural separation explicit for the upcoming production
  integration
- Removed language about primp being "patched"; reqwest in webclaw-llm
  is now just "plain reqwest" with no relationship to wreq
2026-04-22 21:11:18 +02:00
Valerio
2ba682adf3 feat(server): add OSS webclaw-server REST API binary (closes #29)
Self-hosters hitting docs/self-hosting were promised three binaries
but the OSS Docker image only shipped two. webclaw-server lived in
the closed-source hosted-platform repo, which couldn't be opened. This
adds a minimal axum REST API in the OSS repo so self-hosting actually
works without pretending to ship the cloud platform.

Crate at crates/webclaw-server/. Stateless, no database, no job queue,
single binary. Endpoints: GET /health, POST /v1/{scrape, crawl, map,
batch, extract, summarize, diff, brand}. JSON shapes mirror
api.webclaw.io for the endpoints OSS can support, so swapping between
self-hosted and hosted is a base-URL change.

Auth: optional bearer token via WEBCLAW_API_KEY / --api-key. Comparison
is constant-time (subtle::ConstantTimeEq). Open mode (no key) is
allowed and binds 127.0.0.1 by default; the Docker image flips
WEBCLAW_HOST=0.0.0.0 so the container is reachable out of the box.
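
The bearer-token check described above could look roughly like the sketch below. It is a std-only illustration, not the server's code: the commit states that the real comparison uses `subtle::ConstantTimeEq`, whereas `constant_time_eq` and `is_authorized` here are hypothetical names, and the hand-rolled loop has none of the optimizer barriers that the `subtle` crate provides.

```rust
/// Compare two byte strings without returning early on the first mismatch.
/// Illustrative stand-in for `subtle::ConstantTimeEq`; the length check
/// still reveals whether the lengths differ.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        // Accumulate differences instead of branching per byte.
        diff |= x ^ y;
    }
    diff == 0
}

/// Hypothetical helper. `expected` is `None` in open mode (no key
/// configured), in which case every request is allowed. Otherwise the
/// `Authorization` header value must be `Bearer <key>`.
fn is_authorized(expected: Option<&str>, auth_header: Option<&str>) -> bool {
    match expected {
        None => true,
        Some(key) => match auth_header.and_then(|h| h.strip_prefix("Bearer ")) {
            Some(token) => constant_time_eq(key.as_bytes(), token.as_bytes()),
            None => false,
        },
    }
}
```

In this sketch, `is_authorized(Some("s3cret"), Some("Bearer s3cret"))` is accepted, while a missing header or a token differing in any byte is rejected.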

Hard caps to keep naive callers from OOMing the process: crawl capped
at 500 pages synchronously, batch capped at 100 URLs / 20 concurrent.
For unbounded crawls or anti-bot bypass the docs point users at the
hosted API.
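
A minimal sketch of how caps like these might be enforced, using the numbers from the paragraph above. The constant and function names are hypothetical, and whether the server clamps an oversized request or rejects it is an assumption made here for illustration; the commit message does not say.

```rust
// Limits quoted in the commit message.
const MAX_CRAWL_PAGES: usize = 500;
const MAX_BATCH_URLS: usize = 100;
const MAX_BATCH_CONCURRENCY: usize = 20;

/// Assumption: an oversized or missing crawl limit is clamped to the cap.
fn effective_crawl_limit(requested: Option<usize>) -> usize {
    requested.unwrap_or(MAX_CRAWL_PAGES).min(MAX_CRAWL_PAGES)
}

/// Assumption: an oversized batch is rejected outright, while the
/// requested concurrency is clamped into the range 1..=20.
fn check_batch(urls: &[String], concurrency: usize) -> Result<usize, String> {
    if urls.len() > MAX_BATCH_URLS {
        return Err(format!(
            "batch of {} URLs exceeds the cap of {}",
            urls.len(),
            MAX_BATCH_URLS
        ));
    }
    Ok(concurrency.clamp(1, MAX_BATCH_CONCURRENCY))
}
```

Applying the limits before any work starts keeps a single oversized request from allocating an unbounded amount of memory in the process.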

Dockerfile + Dockerfile.ci updated to copy webclaw-server into
/usr/local/bin and EXPOSE 3000. Workspace version bumped to 0.4.0
(new public binary).
2026-04-22 12:25:11 +02:00
Valerio
c99ec684fa Initial release: webclaw v0.1.0 — web content extraction for LLMs
CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io
2026-03-23 18:31:11 +01:00