webclaw/crates/webclaw-core/src
devnen 74bac87435 fix: prevent stack overflow on deeply nested HTML pages
Pages like Express.co.uk live blogs nest 200+ DOM levels deep, overflowing
the default 1 MB main-thread stack on Windows during recursive markdown
conversion.

Two-layer fix:

1. markdown.rs: add depth parameter to node_to_md/children_to_md/inline_text
   with MAX_DOM_DEPTH=200 guard — falls back to plain text collection at limit

2. lib.rs: wrap extract_with_options in a worker thread with 8 MB stack so
   html5ever parsing and extraction both have room on deeply nested pages
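
The depth guard from item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in markdown.rs: the `Node` enum is a hypothetical stand-in for the real DOM type, the signature of `node_to_md` is assumed, and `collect_text` is an assumed name for the plain-text fallback. Only `MAX_DOM_DEPTH = 200` and the idea of threading a depth parameter through the recursion come from the commit message.

```rust
/// Hypothetical stand-in for a DOM node (assumption; the real crate
/// walks the tree produced by its HTML parser).
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Depth limit named in the commit message.
const MAX_DOM_DEPTH: usize = 200;

/// Recursive conversion that carries its current depth.
/// At the limit it stops recursing and falls back to plain text.
fn node_to_md(node: &Node, depth: usize) -> String {
    if depth >= MAX_DOM_DEPTH {
        return collect_text(node);
    }
    match node {
        Node::Text(t) => t.clone(),
        Node::Element { tag, children } => {
            // Each child is converted one level deeper.
            let inner: String = children
                .iter()
                .map(|c| node_to_md(c, depth + 1))
                .collect();
            match tag.as_str() {
                "strong" | "b" => format!("**{inner}**"),
                "em" | "i" => format!("*{inner}*"),
                _ => inner,
            }
        }
    }
}

/// Assumed fallback: gathers text iteratively with an explicit stack,
/// so it uses no call-stack depth no matter how deep the subtree is.
fn collect_text(node: &Node) -> String {
    let mut out = String::new();
    let mut stack = vec![node];
    while let Some(n) = stack.pop() {
        match n {
            Node::Text(t) => out.push_str(t),
            Node::Element { children, .. } => {
                // Push in reverse so children pop in document order.
                stack.extend(children.iter().rev());
            }
        }
    }
    out
}
```

With this shape, formatting is lost below the limit but the text itself is preserved, which matches the "falls back to plain text collection" behaviour described above.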

Tested with Express.co.uk live blog (previously crashed, now extracts 2000+
lines of clean markdown) and drudgereport.com (still works correctly).
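
The second layer (running extraction on a thread with a larger stack) could look roughly like the sketch below. It uses only `std::thread::Builder` from the standard library; `run_with_big_stack`, the thread name, and the `String` error type are hypothetical choices for illustration, not the actual wrapper around extract_with_options in lib.rs.

```rust
use std::thread;

/// Stack size from the commit message (8 MB).
const WORKER_STACK_BYTES: usize = 8 * 1024 * 1024;

/// Runs `work` on a dedicated thread with an enlarged stack and
/// returns its result. The name and error type are assumptions.
fn run_with_big_stack<T, F>(work: F) -> Result<T, String>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let handle = thread::Builder::new()
        .name("extract-worker".into())
        .stack_size(WORKER_STACK_BYTES)
        .spawn(work)
        .map_err(|e| format!("failed to spawn worker: {e}"))?;

    // join() reports a panic in the worker as Err.
    handle
        .join()
        .map_err(|_| "worker thread panicked".to_string())
}
```

A caller would wrap the whole parse-and-extract step in the closure, for example `run_with_big_stack(move || parse_and_extract(html))`, where `parse_and_extract` is a placeholder name. Note that a real stack overflow aborts the process instead of panicking, so `join()` cannot recover from it; the larger stack only adds headroom, which is why the depth guard is still needed.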
2026-04-03 23:45:19 +02:00
llm feat: structured data in markdown/LLM output + v0.3.6 2026-04-02 19:16:56 +02:00
brand.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
data_island.rs feat: SvelteKit data extraction + license change to AGPL-3.0 2026-04-01 20:37:56 +02:00
diff.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
domain.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
error.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
extractor.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
js_eval.rs feat: v0.1.4 — QuickJS integration for inline JavaScript data extraction 2026-03-26 10:28:16 +01:00
lib.rs fix: prevent stack overflow on deeply nested HTML pages 2026-04-03 23:45:19 +02:00
markdown.rs fix: prevent stack overflow on deeply nested HTML pages 2026-04-03 23:45:19 +02:00
metadata.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
noise.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
structured_data.rs feat: extract __NEXT_DATA__ into structured_data 2026-04-02 16:04:51 +02:00
types.rs Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
youtube.rs feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra 2026-03-25 18:50:07 +01:00