webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-17 23:55:13 +02:00

Author	SHA1	Message	Date
Valerio	480d3187db	docs(claude-md): document search, map, and perf; refresh stale details Bring core/CLAUDE.md current with the slices rescued this cycle, and fold in earlier /init corrections that were never committed. New capabilities documented: - search: webclaw-fetch `search.rs` (Serper BYO-key) + the CLI `search` subcommand + the OSS `POST /v1/search` route (gated on SERPER_API_KEY) + the now-local-first MCP `search` tool. - map: webclaw-fetch `map.rs` (`discover_urls`/`MapOptions`, sitemap + bounded crawl fallback), gzip sitemap support, and the new `--map-pages`/`--no-map-crawl`/`--map-limit` CLI flags. - perf: shared `extractors/og.rs` parser and the QuickJS runtime gate / parsed-document reuse noted on `js_eval.rs`. Corrections folded in: real browser fingerprint versions live in tls.rs (not browser.rs), accurate module/route lists, Repo Layout section, and removal of the now-false "search lives only in production" notes. Bumped the stated workspace version to 0.6.13. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 17:10:36 +02:00
Valerio	23544f8fac	docs(claude): note youtube.rs role and yt-dlp short-circuit in server The webclaw-core youtube module produces structured markdown but no transcript; document that and point at the production server's youtube_transcript.rs short-circuit for the full YoutubeData + caption text shape.	2026-05-03 21:17:23 +02:00
Valerio	e1af2da509	docs(claude): drop sidecar references, mention ProductionFetcher	2026-04-23 13:25:23 +02:00
Valerio	aaa5103504	docs(claude): fix stale primp references, document wreq + Fetcher trait webclaw-fetch switched from primp to wreq 6.x (BoringSSL) a while ago but CLAUDE.md still documented primp, the `[patch.crates-io]` requirement, and RUSTFLAGS that no longer apply. Refreshed four sections: - Crate listing: webclaw-fetch uses wreq, not primp - client.rs description: wreq BoringSSL, plus a note that FetchClient will implement the new Fetcher trait so production can swap in a tls-sidecar-backed fetcher without importing wreq - Hard Rules: dropped obsolete `[patch.crates-io]` and RUSTFLAGS lines, added the "Vertical extractors take `&dyn Fetcher`" rule that makes the architectural separation explicit for the upcoming production integration - Removed language about primp being "patched"; reqwest in webclaw-llm is now just "plain reqwest" with no relationship to wreq	2026-04-22 21:11:18 +02:00
Valerio	2ba682adf3	feat(server): add OSS webclaw-server REST API binary (closes #29) Self-hosters hitting docs/self-hosting were promised three binaries but the OSS Docker image only shipped two. webclaw-server lived in the closed-source hosted-platform repo, which couldn't be opened. This adds a minimal axum REST API in the OSS repo so self-hosting actually works without pretending to ship the cloud platform. Crate at crates/webclaw-server/. Stateless, no database, no job queue, single binary. Endpoints: GET /health, POST /v1/{scrape, crawl, map, batch, extract, summarize, diff, brand}. JSON shapes mirror api.webclaw.io for the endpoints OSS can support, so swapping between self-hosted and hosted is a base-URL change. Auth: optional bearer token via WEBCLAW_API_KEY / --api-key. Comparison is constant-time (subtle::ConstantTimeEq). Open mode (no key) is allowed and binds 127.0.0.1 by default; the Docker image flips WEBCLAW_HOST=0.0.0.0 so the container is reachable out of the box. Hard caps to keep naive callers from OOMing the process: crawl capped at 500 pages synchronously, batch capped at 100 URLs / 20 concurrent. For unbounded crawls or anti-bot bypass the docs point users at the hosted API. Dockerfile + Dockerfile.ci updated to copy webclaw-server into /usr/local/bin and EXPOSE 3000. Workspace version bumped to 0.4.0 (new public binary).	2026-04-22 12:25:11 +02:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io	2026-03-23 18:31:11 +01:00

6 commits