webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-08 22:25:12 +02:00

Valerio 7f0420bbf0 fix(core): UTF-8 char boundary panic in find_content_position (#16) (#24)	2026-04-17 12:02:52 +02:00

`search_from = abs_pos + 1` landed mid-char when a rejected match started on a multi-byte UTF-8 character, panicking on the next `markdown[search_from..]` slice. Advance by `needle.len()` instead; this is always a valid char boundary, and it skips the whole rejected match instead of re-scanning inside it.

Repro:
    webclaw https://bruler.ru/about_brand -f json
Before: panic "byte index 782 is not a char boundary; it is inside 'Ч'"
After: extracts 2.3KB of clean Cyrillic markdown with 7 sections

Two regression tests cover multi-byte rejected matches and all-rejected cycles in Cyrillic text.

Closes #16

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
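The fix described in the commit message can be illustrated with a minimal sketch. The real `find_content_position` is not shown on this page, so the function name `find_accepted`, its signature, and the `accept` predicate below are hypothetical stand-ins; only the `search_from`, `abs_pos`, `markdown`, and `needle` names come from the commit message.

```rust
/// Hypothetical sketch (not the actual webclaw code): return the byte offset
/// of the first occurrence of `needle` in `markdown` that `accept` approves.
fn find_accepted(
    markdown: &str,
    needle: &str,
    accept: impl Fn(usize) -> bool,
) -> Option<usize> {
    // An empty needle would never advance `search_from`, so bail out early.
    if needle.is_empty() {
        return None;
    }

    let mut search_from = 0;
    while let Some(rel_pos) = markdown[search_from..].find(needle) {
        let abs_pos = search_from + rel_pos;
        if accept(abs_pos) {
            return Some(abs_pos);
        }
        // `abs_pos + 1` can land inside a multi-byte character and make the
        // next `markdown[search_from..]` slice panic. The end of the match,
        // `abs_pos + needle.len()`, is always a char boundary because
        // `needle` is itself valid UTF-8.
        search_from = abs_pos + needle.len();
    }
    None
}
```

With a Cyrillic input such as `"Что Что Что"`, every character is two bytes wide, so stepping one byte past a rejected match at offset 0 would point into the middle of `Ч`. Stepping by `needle.len()` moves to offset 6, which sits on the following space.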
webclaw-cli	feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22)	2026-04-16 19:44:08 +02:00
webclaw-core	fix(core): UTF-8 char boundary panic in find_content_position (#16) (#24)	2026-04-17 12:02:52 +02:00
webclaw-fetch	polish(fetch,mcp): robots parser + firefox client cache + Acquire ordering (P3) (#23)	2026-04-16 20:21:32 +02:00
webclaw-llm	feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22)	2026-04-16 19:44:08 +02:00
webclaw-mcp	polish(fetch,mcp): robots parser + firefox client cache + Acquire ordering (P3) (#23)	2026-04-16 20:21:32 +02:00
webclaw-pdf	Initial release: webclaw v0.1.0 — web content extraction for LLMs	2026-03-23 18:31:11 +01:00