webclaw

apunkt/webclaw

Fork 0

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-07 22:15:12 +02:00

Commit graph

Author	SHA1	Message	Date
Valerio	095ae5d4b1	polish(fetch,mcp): robots parser + firefox client cache + Acquire ordering (P3) (#23 ) Some checks are pending CI / Test (push) Waiting to run CI / Lint (push) Waiting to run CI / Docs (push) Waiting to run Three P3 items from the 2026-04-16 audit. Bump to 0.3.17. webclaw-fetch/sitemap.rs: parse_robots_txt used trimmed[..8] slice plus eq_ignore_ascii_case for the directive test. That was fragile: "Sitemap :" (space before colon) fell through silently, inline "# ..." comments leaked into the URL, and a line with no URL at all returned an empty string. Rewritten to split on the first colon, match any-case "sitemap" as the directive name, strip comments, and require `://` in the value. +7 unit tests cover case variants, space-before-colon, comments, empty values, non-URL values, and non-sitemap directives. webclaw-fetch/crawler.rs: is_cancelled uses Ordering::Acquire instead of Relaxed. Behaviourally equivalent on current hardware for single-word atomic loads, but the explicit ordering documents intent for readers + compilers. webclaw-mcp/server.rs: add lazy OnceLock cache for the Firefox FetchClient. Tool calls that repeatedly request the firefox profile without cookies used to build a fresh reqwest pool + TLS stack per call. Chrome (default) already used the long-lived field; Random is per-call by design; cookie-bearing requests still build ad-hoc since the cookie header is part of the client shape. Tests: 85 webclaw-fetch (was 78, +7 new sitemap), 272 webclaw-core, 43 webclaw-llm, 11 CLI — all green. Clippy clean across workspace. Refs: docs/AUDIT-2026-04-16.md P3 section Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 20:21:32 +02:00
Valerio	a4c351d5ae	feat: add fallback sitemap paths for broader discovery Try /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml after the standard /sitemap.xml. WordPress 5.5+ and many CMS platforms use non-standard paths that were previously missed. Paths found via robots.txt are deduplicated to avoid double-fetching. Bump to 0.3.11. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 18:22:57 +02:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io	2026-03-23 18:31:11 +01:00

Author

SHA1

Message

Date

Valerio

095ae5d4b1

polish(fetch,mcp): robots parser + firefox client cache + Acquire ordering (P3) (#23 )

CI / Test (push) Waiting to run

CI / Lint (push) Waiting to run

CI / Docs (push) Waiting to run

Three P3 items from the 2026-04-16 audit. Bump to 0.3.17.

webclaw-fetch/sitemap.rs: parse_robots_txt used trimmed[..8] slice
plus eq_ignore_ascii_case for the directive test. That was fragile:
"Sitemap :" (space before colon) fell through silently, inline
"# ..." comments leaked into the URL, and a line with no URL at all
returned an empty string. Rewritten to split on the first colon,
match any-case "sitemap" as the directive name, strip comments, and
require `://` in the value. +7 unit tests cover case variants,
space-before-colon, comments, empty values, non-URL values, and
non-sitemap directives.

webclaw-fetch/crawler.rs: is_cancelled uses Ordering::Acquire
instead of Relaxed. Behaviourally equivalent on current hardware for
single-word atomic loads, but the explicit ordering documents intent
for readers + compilers.

webclaw-mcp/server.rs: add lazy OnceLock cache for the Firefox
FetchClient. Tool calls that repeatedly request the firefox profile
without cookies used to build a fresh reqwest pool + TLS stack per
call. Chrome (default) already used the long-lived field; Random is
per-call by design; cookie-bearing requests still build ad-hoc since
the cookie header is part of the client shape.

Tests: 85 webclaw-fetch (was 78, +7 new sitemap), 272 webclaw-core,
43 webclaw-llm, 11 CLI — all green. Clippy clean across workspace.

Refs: docs/AUDIT-2026-04-16.md P3 section

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-16 20:21:32 +02:00

Valerio

a4c351d5ae

feat: add fallback sitemap paths for broader discovery

Try /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml
after the standard /sitemap.xml. WordPress 5.5+ and many CMS platforms
use non-standard paths that were previously missed. Paths found via
robots.txt are deduplicated to avoid double-fetching.

Bump to 0.3.11.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-10 18:22:57 +02:00

Valerio

c99ec684fa

Initial release: webclaw v0.1.0 — web content extraction for LLMs

CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io

2026-03-23 18:31:11 +01:00

3 commits