From 480d3187db00207fb393719ce1076478851e41fc Mon Sep 17 00:00:00 2001
From: Valerio
Date: Wed, 17 Jun 2026 17:10:36 +0200
Subject: [PATCH] docs(claude-md): document search, map, and perf; refresh stale details

Bring core/CLAUDE.md current with the slices rescued this cycle, and fold
in earlier /init corrections that were never committed.

New capabilities documented:
- search: webclaw-fetch `search.rs` (Serper BYO-key) + the CLI `search`
  subcommand + the OSS `POST /v1/search` route (gated on SERPER_API_KEY)
  + the now-local-first MCP `search` tool.
- map: webclaw-fetch `map.rs` (`discover_urls`/`MapOptions`, sitemap +
  bounded crawl fallback), gzip sitemap support, and the new
  `--map-pages`/`--no-map-crawl`/`--map-limit` CLI flags.
- perf: shared `extractors/og.rs` parser and the QuickJS runtime gate /
  parsed-document reuse noted on `js_eval.rs`.

Corrections folded in: real browser fingerprint versions live in tls.rs
(not browser.rs), accurate module/route lists, Repo Layout section, and
removal of the now-false "search lives only in production" notes.

Bumped the stated workspace version to 0.6.13.
Co-Authored-By: Claude Opus 4.8 (1M context)
---
 CLAUDE.md | 75 +++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 56 insertions(+), 19 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index b30bd84..387c2dd 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -15,6 +15,7 @@ webclaw/
 # + proxy pool rotation (per-request)
 # + PDF content-type detection
 # + document parsing (DOCX, XLSX, CSV)
+ # + layered URL discovery (map) + Serper web search (BYO key)
 webclaw-llm/ # LLM provider chain (Ollama -> OpenAI -> Anthropic)
 # + JSON schema extraction, prompt extraction, summarization
 webclaw-pdf/ # PDF text extraction via pdf-extract
@@ -30,25 +31,34 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
 - `extractor.rs` — Readability-style scoring: text density, semantic tags, link density penalty
 - `noise.rs` — Shared noise filter: tags, ARIA roles, class/ID patterns. Tailwind-safe.
 - `data_island.rs` — JSON data island extraction for React SPAs, Next.js, Contentful CMS
+- `structured_data.rs` — JSON-LD, Next.js `__NEXT_DATA__`, and SvelteKit data-island extraction
+- `js_eval.rs` — QuickJS sandbox (rquickjs) that runs inline `