- config.rs: NoxaConfig::load() now exits with an error when an explicit path
(via --config arg or NOXA_CONFIG env var) does not exist; implicit ./config.json
missing still silently returns default
- Updated test: test_load_missing_file_returns_default replaced with
test_load_implicit_missing_file_returns_default (tests None path, not explicit)
- env.example: replaced fat legacy content with slim secrets-only template
matching .env.example; deleted redundant .env.example
- README.md: replaced bare env-var table with full Configuration section
covering config.json workflow, precedence, NOXA_CONFIG override, and bool
flag limitation
The previous implementation used wrong flags (-p without value, --json,
--max-output-tokens) that don't exist in the real gemini CLI.
Correct invocation:
- Pass prompt as -p STRING value (not via stdin)
- Use --output-format json to get structured {response, stats} output
- Add --yolo to suppress interactive confirmation prompts
- Remove nonexistent --json and --max-output-tokens flags
- Parse `.response` field from JSON output, skipping MCP noise lines
- Extend timeout from 30s to 60s (agentic CLI is slower than raw API)
Smoke tested end-to-end: stdin HTML → summarize and --extract-json
both produce correct output via Gemini CLI.
- Add jsonschema crate for schema validation in extract_json
- On parse failure (invalid JSON): retry once with identical request
- On schema mismatch (valid JSON, wrong schema): fail immediately — no retry
- validate_schema() produces concise error with field path from instance_path()
- Add SequenceMockProvider to testing.rs for first-fail/second-success tests
- Fix env var test flakiness: mark env_model_override as ignored
- ProviderChain::default() order: Gemini CLI -> OpenAI -> Ollama -> Anthropic
- Add --llm-provider gemini arm to build_llm_provider() in noxa-cli
- Update unknown-provider error to mention gemini
- Update empty-chain error messages in CLI and MCP to mention gemini CLI
- Update MCP startup warn! to list gemini CLI as first option
Try /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml
after the standard /sitemap.xml. WordPress 5.5+ and many CMS platforms
use non-standard paths that were previously missed. Paths found via
robots.txt are deduplicated to avoid double-fetching.
Bump to 0.3.11.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stress testing showed 33% of proxies are dead, causing 30s+ timeouts
per request with 3 retries (worst case 94s). Reducing timeout from 30s
to 12s and retries from 3 to 2 brings worst case to 25s. Combined with
disabling 509 dead proxies from the pool, this should significantly
improve response times under load.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI runs Rust 1.94 which flags these. Collapsed nested if-let in
cell_has_block_content() and replaced .map()+return with .inspect()
in table_to_md().
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Version bump for layout table, stack overflow, and noise filter fixes
contributed by @devnen. Also fixes cargo fmt issues that caused CI lint
failure on the merge commit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two related fixes for content being stripped by the noise filter:
1. Remove <form> from unconditional noise tags. ASP.NET and similar
frameworks wrap entire pages in a <form> tag — these are not input
forms. Forms with >500 chars of text are now treated as content
wrappers, not noise.
2. Add safety valve for class/ID noise matching. When malformed HTML
leaves a noise container unclosed (e.g., <div class="header"> missing
its </div>), the HTML5 parser makes all subsequent siblings into
children of that container. A header/nav/footer with >5000 chars of
text is almost certainly a broken wrapper absorbing real content —
exempt it from noise filtering.
Pages like Express.co.uk live blogs nest 200+ DOM levels deep, overflowing
the default 1 MB main-thread stack on Windows during recursive markdown
conversion.
Two-layer fix:
1. markdown.rs: add depth parameter to node_to_md/children_to_md/inline_text
with MAX_DOM_DEPTH=200 guard — falls back to plain text collection at limit
2. lib.rs: wrap extract_with_options in a worker thread with 8 MB stack so
html5ever parsing and extraction both have room on deeply nested pages
Tested with Express.co.uk live blog (previously crashed, now extracts 2000+
lines of clean markdown) and drudgereport.com (still works correctly).
Sites like Drudge Report use <table> for page layout, not data. Each cell
contains extensive block-level content (divs, hrs, paragraphs, links).
Previously, table_to_md() called inline_text() on every cell, collapsing
all whitespace and flattening block elements into a single unreadable line.
Changes:
- Add cell_has_block_content() heuristic: scans for block-level descendants
(p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables
- Layout tables render each cell as a standalone section separated by blank
lines, using children_to_md() to preserve block structure
- Data tables (no block elements in cells) keep existing markdown table format
- Bold/italic tags containing block elements are treated as containers
instead of wrapping in **/**/* (fixes Drudge's <b><font>...</font></b>
column wrappers that contain the entire column content)
- Add tests for layout tables with paragraphs and with links
Research results saved to ~/.webclaw/research/ (report.md + full.json).
MCP returns file paths + findings instead of the full report, preventing
"exceeds maximum allowed tokens" errors in Claude/Cursor.
Same query returns cached result instantly without spending credits.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- --research "query": deep research via cloud API, saves JSON file with
report + sources + findings, prints report to stdout
- --deep: longer, more thorough research mode
- MCP extract/summarize: cloud fallback when no local LLM available
- MCP research: returns structured JSON instead of raw text
- Bump to v0.3.7
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
__NEXT_DATA__, SvelteKit, and JSON-LD now appear as a
## Structured Data section in -f markdown and -f llm output.
Works with --only-main-content and all extraction flags.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous approach used mislav/bump-homebrew-formula-action which only
updated macOS arm64 SHA. Now downloads all 4 tarballs after Docker
finishes, computes SHAs, and writes the complete formula.
Fixes#12 (brew install checksum mismatch on Linux)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">.
Now extracted as structured JSON (pageProps) in the structured_data field.
Tested on 45 sites — 13 return rich structured data including prices,
product info, and page state not visible in the DOM.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract structured JSON from SvelteKit kit.start() data arrays
- Convert JS object literals (unquoted keys) to valid JSON
- Data appears in structured_data field (machine-readable)
- License changed from MIT to AGPL-3.0
- Bump to v0.3.4
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
boring-sys2 builds BoringSSL from C source via cmake. For aarch64 cross-
compilation, we need g++, cmake, and CC/CXX env vars pointing to the
cross-compiler. Also removed stale reqwest_unstable RUSTFLAG.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The --cfg reqwest_unstable flag was required by the old patched reqwest.
wreq handles everything internally — no special build flags needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wreq uses BoringSSL (via boring-sys2) which needs cmake and clang
at build time. Removed stale reference to Impit's patched rustls.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.
This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.
84% pass rate across 1000 real sites. 384 unit tests green.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- --cookie-file reads Chrome extension format ([{name, value, domain, ...}])
- Works with EditThisCookie, Cookie-Editor, and similar browser extensions
- Merges with --cookie when both provided
- MCP scrape tool now accepts cookies parameter
- Closes#7
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a fetch returns a challenge page (small HTML with Akamai markers),
automatically visit the homepage first to collect _abck/bm_sz cookies,
then retry the original URL. This bypasses Akamai's cookie-based gate
on subpages without needing JS execution.
Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr
sensor marker on responses under 15KB.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Response.headers() now returns &http::HeaderMap instead of
&HashMap<String, String>. Updated FetchResult, is_pdf_content_type,
is_document_content_type, is_bot_protected, and all related tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Core has reqwest 0.12 (direct) and 0.13 (via webclaw-tls patch).
Disambiguate with version specs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the weekly primp compatibility check (which fails since primp
was removed in v0.3.0) with an automated dependency sync workflow.
Triggered by webclaw-tls pushes via repository_dispatch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>