Commit graph

5 commits

Author SHA1 Message Date
devnen
74bac87435 fix: prevent stack overflow on deeply nested HTML pages
Pages like Express.co.uk live blogs nest 200+ DOM levels deep, overflowing
the default 1 MB main-thread stack on Windows during recursive markdown
conversion.

Two-layer fix:

1. markdown.rs: add depth parameter to node_to_md/children_to_md/inline_text
   with MAX_DOM_DEPTH=200 guard — falls back to plain text collection at limit

2. lib.rs: wrap extract_with_options in a worker thread with 8 MB stack so
   html5ever parsing and extraction both have room on deeply nested pages

Tested with Express.co.uk live blog (previously crashed, now extracts 2000+
lines of clean markdown) and drudgereport.com (still works correctly).
2026-04-03 23:45:19 +02:00
Valerio
8d29382b25 feat: extract __NEXT_DATA__ into structured_data
Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">.
Now extracted as structured JSON (pageProps) in the structured_data field.

Tested on 45 sites — 13 return rich structured data including prices,
product info, and page state not visible in the DOM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:04:51 +02:00
Valerio
84b2e6092e feat: SvelteKit data extraction + license change to AGPL-3.0
- Extract structured JSON from SvelteKit kit.start() data arrays
- Convert JS object literals (unquoted keys) to valid JSON
- Data appears in structured_data field (machine-readable)
- License changed from MIT to AGPL-3.0
- Bump to v0.3.4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 20:37:56 +02:00
Valerio
32c035c543 feat: v0.1.4 — QuickJS integration for inline JavaScript data extraction
Embeds QuickJS (rquickjs) to execute inline <script> tags and extract
data hidden in JavaScript variable assignments. Captures window.__*
objects like __preloadedData (NYTimes), __PRELOADED_STATE__ (Wired),
and self.__next_f (Next.js RSC flight data).

Results:
- NYTimes: 1,552 → 4,162 words (+168%)
- Wired: 1,459 → 9,937 words (+580%)
- Zero measurable performance overhead (<15ms per page)
- Feature-gated: disable with --no-default-features for WASM

Smart text filtering rejects CSS, base64, file paths, code strings.
Only readable prose is appended under "## Additional Content".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 10:28:16 +01:00
Valerio
c99ec684fa Initial release: webclaw v0.1.0 — web content extraction for LLMs
CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io
2026-03-23 18:31:11 +01:00