mirror of https://github.com/0xMassi/webclaw.git (synced 2026-06-06 22:05:13 +02:00)
fix: clean llm output noise
Port the valid PR #43 LLM cleanup fixes onto current main without stale branch regressions.

Includes comment-count link cleanup, bare numeric paragraph cleanup, pagination leftover cleanup, JSON-LD article body scrubbing, clearer CLI consent-wall warnings, and quieter parser logs by default.

Thanks to @devnen for the report and patch work.
parent 5eef8358b0
commit 3fabdc1d02
8 changed files with 348 additions and 18 deletions
11
CHANGELOG.md
@@ -3,6 +3,17 @@
 All notable changes to webclaw are documented here.
 Format follows [Keep a Changelog](https://keepachangelog.com/).
 
+## [0.6.2] — 2026-05-18
+
+### Fixed
+- Cleaned up `--format llm` output on noisy news and documentation pages. Comment-count links, bare page-number paragraphs, pagination leftovers such as `0 Next`, and duplicated JSON-LD article bodies are now removed before they reach the LLM context.
+- The CLI now recognizes common cookie-consent redirects and prints a clearer warning when a page returns a consent wall instead of usable content.
+- The CLI keeps noisy parser warnings from real-world malformed HTML out of stderr by default. `WEBCLAW_LOG` still lets advanced users opt into deeper parser logs.
+
+Thanks to Nenad Oric (`@devnen`) for the report and patch work in PR #43.
+
+---
+
 ## [0.6.1] — 2026-05-12
 
 ### Fixed
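To make the first 0.6.2 item more concrete, here is a minimal sketch of the kind of paragraph-level filtering the changelog describes. It is not webclaw's implementation: the regular expressions, the function names `is_noise` and `clean_llm_output`, and the sample input are all assumptions made for illustration, and JSON-LD scrubbing is left out.

```python
import re

# Hypothetical patterns. The actual webclaw rules are not shown in this commit view.
BARE_NUMBER = re.compile(r"^\d+$")  # a paragraph that is only a number, e.g. "2"
PAGINATION = re.compile(r"^\d+\s+(Next|Previous|Prev)$", re.IGNORECASE)  # e.g. "0 Next"
COMMENT_LINK = re.compile(r"^\[\d+\s+comments?\]\([^)]*\)$", re.IGNORECASE)  # e.g. "[12 comments](/post#comments)"


def is_noise(paragraph: str) -> bool:
    """Return True if the whole paragraph matches one of the noise patterns."""
    text = paragraph.strip()
    return any(p.match(text) for p in (BARE_NUMBER, PAGINATION, COMMENT_LINK))


def clean_llm_output(markdown: str) -> str:
    """Drop empty paragraphs and paragraphs that consist only of noise."""
    paragraphs = markdown.split("\n\n")
    kept = [p for p in paragraphs if p.strip() and not is_noise(p)]
    return "\n\n".join(kept)
```

Each pattern is anchored to the whole paragraph, so a sentence that merely contains a number (such as "Version 2 is out") is kept. Only paragraphs made up entirely of a comment-count link, a bare number, or a pagination leftover are dropped.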