fix(core): UTF-8 char boundary panic in find_content_position (#16) (#24)

`search_from = abs_pos + 1` landed mid-char when a rejected match
started on a multi-byte UTF-8 character, panicking on the next
`markdown[search_from..]` slice. Advance by `needle.len()` instead —
always a valid char boundary, and skips the whole rejected match
instead of re-scanning inside it.

Repro: webclaw https://bruler.ru/about_brand -f json
Before: panic "byte index 782 is not a char boundary; it is inside 'Ч'"
After:  extracts 2.3KB of clean Cyrillic markdown with 7 sections

Two regression tests cover multi-byte rejected matches and
all-rejected cycles in Cyrillic text.

Closes #16

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Valerio 2026-04-17 12:02:52 +02:00 committed by GitHub
parent 095ae5d4b1
commit 7f0420bbf0
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
4 changed files with 49 additions and 8 deletions

View file

@ -3,7 +3,7 @@ resolver = "2"
members = ["crates/*"]
[workspace.package]
version = "0.3.17"
version = "0.3.18"
edition = "2024"
license = "AGPL-3.0"
repository = "https://github.com/0xMassi/webclaw"