diff --git a/CLAUDE.md b/CLAUDE.md
index b30bd84..387c2dd 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -15,6 +15,7 @@ webclaw/
  # + proxy pool rotation (per-request)
  # + PDF content-type detection
  # + document parsing (DOCX, XLSX, CSV)
+ # + layered URL discovery (map) + Serper web search (BYO key)
 webclaw-llm/ # LLM provider chain (Ollama -> OpenAI -> Anthropic)
  # + JSON schema extraction, prompt extraction, summarization
 webclaw-pdf/ # PDF text extraction via pdf-extract
@@ -30,25 +31,34 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
 - `extractor.rs` — Readability-style scoring: text density, semantic tags, link density penalty
 - `noise.rs` — Shared noise filter: tags, ARIA roles, class/ID patterns. Tailwind-safe.
 - `data_island.rs` — JSON data island extraction for React SPAs, Next.js, Contentful CMS
+- `structured_data.rs` — JSON-LD, Next.js `__NEXT_DATA__`, and SvelteKit data-island extraction
+- `js_eval.rs` — QuickJS sandbox (rquickjs) that runs inline `