Commit graph

29 commits

Author SHA1 Message Date
Valerio
0221c151dc feat(extractors): wave 4 \u2014 ecommerce (shopify + generic JSON-LD)
Two ecommerce extractors covering the long tail of online stores:

- shopify_product: hits the public /products/{handle}.json endpoint
  that every Shopify store exposes. Undocumented but stable for 10+
  years. Returns title, vendor, product_type, tags, full variants
  array (price, SKU, stock, options), images, options matrix, and
  the price_min/price_max/any_available summary fields. Covers the
  ~4M Shopify stores out there, modulo stores that put Cloudflare
  in front of the shop. Rejects known non-Shopify hosts (amazon,
  etsy, walmart, etc.) to save a failed request.

- ecommerce_product: generic Schema.org Product JSON-LD extractor.
  Works on any modern store that ships the Google-required Product
  rich-result markup: Shopify, WooCommerce, BigCommerce, Squarespace,
  Magento, custom storefronts. Returns name, brand, SKU, GTIN, MPN,
  images, normalized offers (Offer and AggregateOffer flattened into
  one shape with price, currency, availability, condition),
  aggregateRating, and the raw JSON-LD block for anyone who wants it.
  Reuses webclaw_core::structured_data::extract_json_ld so the
  JSON-LD parser stays shared across the extraction pipeline.

Both are explicit-call only — /v1/scrape/shopify_product and
/v1/scrape/ecommerce_product. Not in auto-dispatch because any
arbitrary /products/{slug} URL could belong to either platform
(or to a custom site that uses the same path shape), and claiming
such URLs blindly would steal from the default markdown /v1/scrape
flow.

Live test results against real stores:
- Shopify / Allbirds Tree Runners: $100, 7 size variants, 4 images,
  Size option, all SKUs. 250ms.
- ecommerce_product / same Allbirds URL: ProductGroup schema, name
  'Men's Tree Runner', brand 'Allbirds', $100 USD InStock offer.
  300ms. Different extraction path, same product.
- ecommerce_product / huel.com: 'Huel Black Edition' / 'Huel' brand,
  200ms.
- Shopify stores behind Cloudflare (Gymshark, Tesla Shop) 403 as
  expected \u2014 the error message points callers at the ecommerce_product
  fallback, but Cloudflare also blocks the HTML path so those stores
  are cloud-tier territory.

Catalog now exposes 19 extractors via GET /v1/extractors. Unit
tests: 59 passing across the module.

Scope not in v1:
- trustpilot_reviews: file written and tested (JSON-LD walker), but
  NOT registered in the catalog or dispatch. Trustpilot's Cloudflare
  turnstile blocks our Firefox + Chrome + Safari + mobile profiles
  at the TLS layer. Shipping it would return 403 more often than 200.
  Code kept in-tree under #[allow(dead_code)] for when the cloud
  tier has residential-proxy support.
- Amazon / Walmart / Target / AliExpress: same Cloudflare / WAF
  story. Not fixable without real browser + proxy pool.
- WooCommerce explicit: most WooCommerce stores ship Product JSON-LD,
  so ecommerce_product covers them. A dedicated WooCommerce REST
  extractor (/wp-json/wc/store/products) would be marginal on top of
  that and only works on ~30% of stores that expose the REST API.

Wave 4 positioning: we now own the OSS structured-scrape space for
any site that respects Schema.org. That's Google's entire rich-result
index \u2014 meaningful territory competitors won't try to replicate as
named endpoints.
2026-04-22 15:36:01 +02:00
Valerio
3bb0a4bca0 feat(extractors): add LinkedIn + Instagram with profile-to-posts fan-out
3 social-network extractors that work entirely without auth, using
public embed/preview endpoints + Instagram's own SEO-facing API:

- linkedin_post:      /embed/feed/update/{urn} returns full body,
                      author, image, OG tags. Accepts both the urn:li:share
                      and urn:li:activity URN forms plus the pretty
                      /posts/{slug}-{id}-{suffix} URLs.

- instagram_post:     /p/{shortcode}/embed/captioned/ returns the full
                      caption, username, thumbnail. Same endpoint serves
                      reels and IGTV, kind correctly classified.

- instagram_profile:  /api/v1/users/web_profile_info/?username=X with the
                      x-ig-app-id header (Instagram's public web-app id,
                      sent by their own JS bundle). Returns the full
                      profile + the 12 most recent posts with shortcodes,
                      kinds, like/comment counts, thumbnails, and caption
                      previews. Falls back to OG-tag scraping of the
                      public HTML if the API ever 401/403s.

The IG profile output is shaped so callers can fan out cleanly:
  for p in profile.recent_posts:
      scrape('instagram_post', p.url)
giving you 'whole profile + every recent post' in one loop. End-to-end
tested against ticketswave: 1 profile call + 12 post calls in ~3.5s.
Pagination beyond 12 posts requires authenticated cookies and is left
for the cloud where we can stash a session.

Infrastructure change: added FetchClient::fetch_with_headers so
extractors can satisfy site-specific request headers (here x-ig-app-id;
later github_pr will use this for Authorization, etc.) without polluting
the global FetchConfig.headers map. Same retry semantics as fetch().

Catalog now exposes 17 extractors via /v1/extractors. Total unit tests
across the module: 47 passing. Clippy clean. Fmt clean.

Live test on the maintainer's example URLs:
- LinkedIn post (urn:li:share:7452618582213144577): 'Orc Dev' / full body
  / shipper.club link / CDN image extracted in 250ms.
- Instagram post (DT-RICMjeK5): 835-char Slovak caption, ticketswave
  username, thumbnail. 200ms.
- Instagram profile (ticketswave): 18,473 followers (exact, not
  rounded), is_verified=True, is_business=True, biography with emojis,
  12 recent posts with shortcodes + kinds + likes. 400ms.

Out of scope for this wave (require infra we don't have):
- linkedin_profile: returns 999 to all bot UAs, needs OAuth
- facebook_post / facebook_page: content is JS-loaded, needs cloud Chrome
- facebook_profile (personal): not publicly accessible by design
2026-04-22 14:39:49 +02:00
Valerio
b041f3cddd feat(extractors): wave 2 \u2014 8 more verticals (14 total)
Adds 8 more vertical extractors using public JSON APIs. All hit
deterministic endpoints with no antibot risk. Live tests pass
against canonical URLs for each.

AI / ML ecosystem (3):
- crates_io          \u2192 crates.io/api/v1/crates/{name}
- huggingface_dataset \u2192 huggingface.co/api/datasets/{path} (handles both
                       legacy /datasets/{name} and canonical {owner}/{name})
- arxiv              \u2192 export.arxiv.org/api/query (Atom XML parsed by quick-xml)

Code / version control (2):
- github_pr      \u2192 api.github.com/repos/{owner}/{repo}/pulls/{number}
- github_release \u2192 api.github.com/repos/{owner}/{repo}/releases/tags/{tag}

Infrastructure (1):
- docker_hub \u2192 hub.docker.com/v2/repositories/{namespace}/{name}
              (official-image shorthand /_/nginx normalized to library/nginx)

Community / publishing (2):
- dev_to        \u2192 dev.to/api/articles/{username}/{slug}
- stackoverflow \u2192 api.stackexchange.com/2.3/questions/{id} + answers,
                  filter=withbody for rendered HTML, sort=votes for
                  consistent top-answers ordering

Live test results (real URLs):
- serde:                 942M downloads, 838B response
- 'Attention Is All You Need': abstract + authors, 1.8KB
- nginx official:        12.9B pulls, 21k stars, 17KB
- openai/gsm8k:          822k downloads, 1.7KB
- rust-lang/rust#138000: merged by RalfJung, +3/-2, 1KB
- webclaw v0.4.0:        2.4KB
- a real dev.to article: 2.2KB body, 3.1KB total
- python yield Q&A:      score 13133, 51 answers, 104KB

Catalog now exposes 14 extractors via GET /v1/extractors. Total
unit tests across the module: 34 passing. Clippy clean. Fmt clean.

Marketing positioning sharpens: 14 dedicated extractors, all
deterministic, all 1-credit-per-call. Firecrawl's /extract is
5 credits per call and you write the schema yourself.
2026-04-22 14:20:21 +02:00
Valerio
8ba7538c37 feat(extractors): add vertical extractors module + first 6 verticals
New extractors module returns site-specific typed JSON instead of
generic markdown. Each extractor:
- declares a URL pattern via matches()
- fetches from the site's official JSON API where one exists
- returns a typed serde_json::Value with documented field names
- exposes an INFO struct that powers the /v1/extractors catalog

First 6 verticals shipped, all hitting public JSON APIs (no HTML
scraping, zero antibot risk):

- reddit       → www.reddit.com/*/.json
- hackernews   → hn.algolia.com/api/v1/items/{id} (full thread in one call)
- github_repo  → api.github.com/repos/{owner}/{repo}
- pypi         → pypi.org/pypi/{name}/json
- npm          → registry.npmjs.org/{name} + downloads/point/last-week
- huggingface_model → huggingface.co/api/models/{owner}/{name}

Server-side routes added:
- POST /v1/scrape/{vertical}  explicit per-vertical extraction
- GET  /v1/extractors         catalog (name, label, description, url_patterns)

The dispatcher validates that URL matches the requested vertical
before running, so users get "URL doesn't match the X extractor"
instead of opaque parse failures inside the extractor.

17 unit tests cover URL matching + path parsing for each vertical.
Live tests against canonical URLs (rust-lang/rust, requests pypi,
react npm, whisper-large-v3 hf, item 8863 hn, an r/micro_saas post)
all return correct typed JSON in 100-300ms. Sample sizes: github
863B, npm 700B, pypi 1.7KB, hf 3.2KB, hn 38KB (full comment tree).

Marketing positioning: Firecrawl charges 5 credits per /extract call
and you write the schema. Webclaw returns the same JSON in 1 credit
per /scrape/{vertical} call with hand-written deterministic
extractors per site.
2026-04-22 14:11:43 +02:00
Valerio
095ae5d4b1
polish(fetch,mcp): robots parser + firefox client cache + Acquire ordering (P3) (#23)
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run
Three P3 items from the 2026-04-16 audit. Bump to 0.3.17.

webclaw-fetch/sitemap.rs: parse_robots_txt used trimmed[..8] slice
plus eq_ignore_ascii_case for the directive test. That was fragile:
"Sitemap :" (space before colon) fell through silently, inline
"# ..." comments leaked into the URL, and a line with no URL at all
returned an empty string. Rewritten to split on the first colon,
match any-case "sitemap" as the directive name, strip comments, and
require `://` in the value. +7 unit tests cover case variants,
space-before-colon, comments, empty values, non-URL values, and
non-sitemap directives.

webclaw-fetch/crawler.rs: is_cancelled uses Ordering::Acquire
instead of Relaxed. Behaviourally equivalent on current hardware for
single-word atomic loads, but the explicit ordering documents intent
for readers + compilers.

webclaw-mcp/server.rs: add lazy OnceLock cache for the Firefox
FetchClient. Tool calls that repeatedly request the firefox profile
without cookies used to build a fresh reqwest pool + TLS stack per
call. Chrome (default) already used the long-lived field; Random is
per-call by design; cookie-bearing requests still build ad-hoc since
the cookie header is part of the client shape.

Tests: 85 webclaw-fetch (was 78, +7 new sitemap), 272 webclaw-core,
43 webclaw-llm, 11 CLI — all green. Clippy clean across workspace.

Refs: docs/AUDIT-2026-04-16.md P3 section

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 20:21:32 +02:00
Valerio
d69c50a31d
feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22)
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run
* feat(fetch,llm): DoS hardening via response caps + glob validation (P2)

Response body caps:
- webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks
  Content-Length up front (before the allocation) and the actual
  .bytes() length after (belt-and-braces against lying upstreams).
  Previously the HTML -> markdown conversion downstream could allocate
  multiple String copies per page; a 100 MB page would OOM the process.
- webclaw-llm providers (anthropic/openai/ollama) share a new
  response_json_capped helper with a 5 MB cap. Protects against a
  malicious or runaway provider response exhausting memory.

Crawler frontier cap: after each BFS depth level the frontier is
truncated to max(max_pages * 10, 100) entries, keeping the most
recently discovered links. Dense pages (tag clouds, search results)
used to push the frontier into the tens of thousands even after
max_pages halted new fetches.

Glob pattern validation: user-supplied include_patterns /
exclude_patterns are rejected at Crawler::new if they contain more
than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher
degrades exponentially on deeply-nested `**` against long paths.

Cleanup:
- Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs;
  no warnings surfaced, the suppression was obsolete.
- core/.gitignore: replaced overbroad *.json with specific local-
  artifact patterns (previous rule would have swallowed package.json,
  components.json, .smithery/*.json).

Tests: +4 validate_glob tests. Full workspace test: 283 passed
(webclaw-core + webclaw-fetch + webclaw-llm).

Version: 0.3.15 -> 0.3.16
CHANGELOG updated.

Refs: docs/AUDIT-2026-04-16.md (P2 section)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: gitignore CLI research dumps, drop accidentally-tracked file

research-*.json output from `webclaw ... --research ...` got silently
swept into git by the relaxed *.json gitignore in the preceding commit.
The old blanket *.json rule was hiding both this legitimate scratch
file AND packages/create-webclaw/server.json (MCP registry config that
we DO want tracked).

Removes the research dump from git and adds a narrower research-*.json
ignore pattern so future CLI output doesn't get re-tracked by accident.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 19:44:08 +02:00
Valerio
7773c8af2a
fix(fetch): surface semaphore-closed as typed error instead of panic (P1) (#21)
Three call sites in webclaw-fetch used .expect("semaphore closed") on
`Semaphore::acquire()`. Under normal operation they never fire, but
under a shutdown race or adversarial runtime state the spawned task
would panic and be silently dropped from the batch / crawl run — the
caller would see fewer results than URLs with no indication why.

Rewritten to match on the acquire result:
- client::fetch_batch and client::fetch_and_extract_batch_with_options
  now emit BatchResult/BatchExtractResult carrying
  FetchError::Build("semaphore closed before acquire").
- crawler's inner loop emits a failed PageResult with the same error
  string instead of panicking.

Behaviorally a no-op for the happy path. Fixes the silent-dropped-task
class of bug noted in the 2026-04-16 audit.

Version: 0.3.14 -> 0.3.15
CHANGELOG updated.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 19:20:26 +02:00
Valerio
6316b1a6e7 fix: handle raw newlines in JSON-LD strings
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run
Sites like Bluesky emit JSON-LD with literal newline characters inside
string values (technically invalid JSON). Add sanitize_json_newlines()
fallback that escapes control characters inside quoted strings before
retrying the parse. This recovers ProfilePage, Product, and other
structured data that was previously silently dropped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 11:40:25 +02:00
Valerio
050b2ef463 feat: add allow_subdomains and allow_external_links to CrawlConfig
Crawls are same-origin by default. Enable allow_subdomains to follow
sibling/child subdomains (blog.example.com from example.com), or
allow_external_links for full cross-origin crawling.

Root domain extraction uses a heuristic that handles two-part TLDs
(co.uk, com.au). Includes 5 unit tests for root_domain().

Bump to 0.3.12.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:33:06 +02:00
Valerio
a4c351d5ae feat: add fallback sitemap paths for broader discovery
Try /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml
after the standard /sitemap.xml. WordPress 5.5+ and many CMS platforms
use non-standard paths that were previously missed. Paths found via
robots.txt are deduplicated to avoid double-fetching.

Bump to 0.3.11.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 18:22:57 +02:00
Valerio
25b6282d5f style: fix rustfmt for 2-element delay array
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 17:21:53 +02:00
Valerio
954aabe3e8 perf: reduce fetch timeout to 12s and retries to 2
Stress testing showed 33% of proxies are dead, causing 30s+ timeouts
per request with 3 retries (worst case 94s). Reducing timeout from 30s
to 12s and retries from 3 to 2 brings worst case to 25s. Combined with
disabling 509 dead proxies from the pool, this should significantly
improve response times under load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 17:18:57 +02:00
Valerio
124352e0b4 style: cargo fmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:25:40 +02:00
Valerio
aaf51eddef feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3
Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.

This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.

84% pass rate across 1000 real sites. 384 unit tests green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:04:55 +02:00
Valerio
44f23332cc style: collapse nested if per clippy
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:13:55 +02:00
Valerio
20c810b8d2 chore: bump v0.3.1, update CHANGELOG, fix fmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:11:54 +02:00
Valerio
7041a1d992 feat: cookie warmup fallback for Akamai-protected pages
When a fetch returns a challenge page (small HTML with Akamai markers),
automatically visit the homepage first to collect _abck/bm_sz cookies,
then retry the original URL. This bypasses Akamai's cookie-based gate
on subpages without needing JS execution.

Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr
sensor marker on responses under 15KB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:09:31 +02:00
Valerio
4cba36337b style: fix fmt in client.rs test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 12:18:57 +02:00
Valerio
199dab6dfa fix: adapt to webclaw-tls v0.1.1 HeaderMap API change
Response.headers() now returns &http::HeaderMap instead of
&HashMap<String, String>. Updated FetchResult, is_pdf_content_type,
is_document_content_type, is_bot_protected, and all related tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 12:09:50 +02:00
Valerio
e3b0d0bd74 fix: make reddit and linkedin modules public for server access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:54:35 +02:00
Valerio
f275a93bec fix: clippy empty-line-after-doc-comment in browser.rs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:45:05 +02:00
Valerio
140234c139 style: cargo fmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:43:11 +02:00
Valerio
f13cb83c73 feat: replace primp with webclaw-tls, bump to v0.3.0
Replace primp dependency with our own TLS fingerprinting stack
(webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match.

- Remove primp entirely (zero references remaining)
- webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls
- Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains)
- Skip unknown certificate extensions (SCT tolerance)
- 99% bypass rate on 102 sites (was ~85% with primp)
- Fixes #5 (HTTPS broken — example.com and similar sites now work)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:40:10 +02:00
Valerio
ea14848772 feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM
Document extraction:
- DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml)
- XLSX/XLS: markdown tables with multi-sheet support (via calamine)
- CSV: quoted field handling, markdown table output
- All auto-detected by Content-Type header or URL extension

New features:
- -f html output format (sanitized HTML)
- Multi-URL watch: --urls-file + --watch monitors all URLs in parallel
- Batch + LLM: --extract-prompt/--extract-json works with multiple URLs
- Mixed batch: HTML pages + DOCX + XLSX + CSV in one command

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:28:23 +01:00
Valerio
0e4128782a fix: v0.1.7 — extraction options now work in batch mode (#3)
--only-main-content, --include, and --exclude were ignored in batch
mode because run_batch used default ExtractionOptions. Added
fetch_and_extract_batch_with_options to pass CLI options through.

Closes #3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 13:30:20 +01:00
Valerio
0c91c6d5a9 feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support
Crawl:
- Real-time progress on stderr as pages complete
- --crawl-state saves progress on Ctrl+C, resumes from saved state
- Visited set + remaining frontier persisted for accurate resume

MCP server:
- Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars
- Falls back to proxies.txt in CWD (existing behavior)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 21:38:28 +01:00
Valerio
afe4d3077d feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra
- Switch default profile to Safari26/Mac (best CF pass rate)
- Auto-fallback to plain client on connection error or 403
- Fixes: ycombinator.com, producthunt.com, and similar CF-strict sites
- Reddit .json endpoint uses plain client (TLS fingerprint was blocked)
- YouTube caption track extraction + timed text parser (core, not yet wired)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 18:50:07 +01:00
Valerio
907966a983 fix: use plain client for Reddit JSON endpoint
Reddit blocks TLS-fingerprinted clients on their .json API but
accepts standard requests with a browser User-Agent. Switch to
a non-impersonated primp client for the Reddit fallback path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 18:43:47 +01:00
Valerio
c99ec684fa Initial release: webclaw v0.1.0 — web content extraction for LLMs
CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io
2026-03-23 18:31:11 +01:00