webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-07 22:15:12 +02:00

Author	SHA1	Message	Date
Valerio	6d886c44f6	docs: enlarge studio partner banner	2026-05-18 12:27:11 +02:00
Valerio	8e3ad17428	docs: tighten studio partner layout	2026-05-18 12:23:19 +02:00
Valerio	7321549412	docs: add studio partner section	2026-05-18 12:17:34 +02:00
Valerio	72edb61881	Merge pull request #42 from jal-co/docs/add-community-plugins docs: add community plugins section	2026-05-16 11:24:33 +02:00
Valerio	00d86a12bc	docs: refine community plugin copy	2026-05-16 11:19:15 +02:00
Justin Levine	c8be5214f6	docs: add community plugins section with OpenClaw and Hermes integrations	2026-05-15 17:51:22 -07:00
Valerio	0ea189c5b2	fix(ci): pass repository to release cli	2026-05-12 12:28:14 +02:00
Valerio	a629534490	fix(security): prepare 0.6.1 hardening Merge the 0.6.1 security hardening release candidate after local and CI verification.	2026-05-12 12:16:42 +02:00
Valerio	fd2e75d509	chore(fetch): satisfy clippy for resolver setup	2026-05-12 12:09:18 +02:00
Valerio	e2f89941ac	chore(release): prepare 0.6.1	2026-05-12 12:06:06 +02:00
Valerio	307b4f980d	fix(extractors): harden marketplace host matching	2026-05-12 12:03:43 +02:00
Valerio	dbf9ce08a6	fix(ci): scope release workflow token permissions	2026-05-12 12:00:47 +02:00
Valerio	3bcb288d13	fix(fetch): guard challenge detection before utf8 decoding	2026-05-12 12:00:47 +02:00
Valerio	a611ae26f3	fix(security): harden local fetch surfaces	2026-05-12 12:00:25 +02:00
Valerio	af96628dc9	Revise README for clarity and updated content Updated the README to reflect changes in the project description, banner image size, and various content sections. Enhanced clarity on features and usage.	2026-05-10 22:44:57 +02:00
devnen	e8ca1417d6	Improve --format llm output quality (#37) Improve LLM-format output for modern news and documentation pages. - Filter noisy hydration and low-value page chrome structured data while preserving content-bearing Schema.org records - Fix element/text spacing without detaching punctuation on docs, forums, and reference pages - Remove common accessibility link chrome from LLM text and link labels - Bump workspace version to 0.6.0 and update the changelog Thanks to Nenad Oric (@devnen) for the original PR and contribution.	2026-05-10 15:11:12 +02:00
Valerio	7f75143954	docs: update hosted api trial copy	2026-05-06 17:16:35 +02:00
Valerio	e6a95f783d	chore: bump version to 0.5.9	2026-05-06 11:42:09 +02:00
Valerio	a3aa4bce6f	fix: support LLM provider compatibility options Closes #36	2026-05-06 11:36:53 +02:00
Valerio	86183b11e4	docs: credit Windows release contribution	2026-05-05 11:44:07 +02:00
SURYANSH MISHRA	513b0e493e	ci: add Windows release artifacts Closes #34	2026-05-05 11:38:30 +02:00
Valerio	a1242a1c1d	docs: credit README badge refresh	2026-05-05 11:18:58 +02:00
Justin Levine	a542e45768	docs: refresh README badges Replace README badges with shieldcn-styled badges.	2026-05-05 11:17:21 +02:00
Valerio	615f326660	docs: update changelog for brand extraction	2026-05-04 21:52:49 +02:00
Valerio	72b8dbc285	fix: improve brand extraction signals	2026-05-04 21:25:07 +02:00
Valerio	1c9def2fde	fix: validate self-host route URLs consistently	2026-05-04 14:30:06 +02:00
Valerio	eede2f6953	docs: credit SSRF report	2026-05-04 12:08:11 +02:00
Valerio	bdf81fe6bf	fix: harden fetch URL validation	2026-05-04 11:50:57 +02:00
Valerio	23544f8fac	docs(claude): note youtube.rs role and yt-dlp short-circuit in server The webclaw-core youtube module produces structured markdown but no transcript; document that and point at the production server's youtube_transcript.rs short-circuit for the full YoutubeData + caption text shape.	2026-05-03 21:17:23 +02:00
Valerio	923445f4a8	docs(readme): add h1 brand heading The repo had no heading-level brand anchor, only a banner image and an h3 slogan. Search engines indexing the README were missing the canonical brand signal. The new h1 is what GitHub renders as the title of the page and what Google co-ranks with webclaw.io. Bumps workspace version to 0.5.7.	2026-04-30 11:47:02 +02:00
Valerio	0e6c7cdc97	Add GitHub Sponsors username to FUNDING.yml Updated funding model with GitHub Sponsors username.	2026-04-27 13:18:22 +02:00
Valerio	5795c5c422	docs(readme): add star history chart	2026-04-26 17:55:22 +02:00
Valerio	4908367720	docs(readme): add hosted API callout above Get Started Surface webclaw.io as a clear alternative path for visitors who want the antibot, JS rendering, async jobs, search, and watches the OSS server doesn't ship. Sits between the value-prop and the install instructions so self-host stays the primary on-ramp.	2026-04-26 17:15:44 +02:00
Valerio	a5c3433372	fix(core+server): guard markdown pipe slice + detect trustpilot/reddit verify walls	2026-04-23 15:26:31 +02:00
Valerio	966981bc42	fix(fetch): send bot-identifying UA on reddit .json API to bypass browser UA block	2026-04-23 15:17:04 +02:00
Valerio	866fa88aa0	fix(fetch): reject HTML verification pages served at .json reddit URL	2026-04-23 15:06:35 +02:00
Valerio	b413d702b2	feat(fetch): add fetch_smart with Reddit + Akamai rescue paths, bump 0.5.6	2026-04-23 14:59:29 +02:00
Valerio	98a177dec4	feat(cli): expose safari-ios browser profile + bump to 0.5.5	2026-04-23 13:32:55 +02:00
Valerio	e1af2da509	docs(claude): drop sidecar references, mention ProductionFetcher	2026-04-23 13:25:23 +02:00
Valerio	2285c585b1	docs(changelog): simplify 0.5.4 entry	2026-04-23 13:01:02 +02:00
Valerio	b77767814a	Bump to 0.5.4: SafariIos profile + Chrome fingerprint alignment + locale helper - New BrowserProfile::SafariIos mapped to BrowserVariant::SafariIos26. Built on wreq_util::Emulation::SafariIos26 with 4 overrides (TLS extension order, HTTP/2 HEADERS priority, real Safari iOS 26 headers, gzip/deflate/br). Matches bogdanfinn safari_ios_26_0 JA3 8d909525bd5bbb79f133d11cc05159fe exactly. Empirically 9/10 on immobiliare.it with country-it residential. - BrowserProfile::Chrome aligned to bogdanfinn chrome_133: dropped MAX_CONCURRENT_STREAMS from H2 SETTINGS, priority weight 256, explicit extension_permutation, advertise h3 in ALPN and ALPS. JA3 43067709b025da334de1279a120f8e14, akamai_fp 52d84b11737d980aef856699f885ca86. Fixes indeed.com and other Cloudflare-fronted sites. - New locale module: accept_language_for_url / accept_language_for_tld. TLD to Accept-Language mapping, unknown TLDs default to en-US. DataDome geo-vs-locale cross-checks are now trivially satisfiable. - wreq-util bumped 2.2.6 to 3.0.0-rc.10 for Emulation::SafariIos26.	2026-04-23 12:58:24 +02:00
Valerio	4bf11d902f	fix(mcp): vertical_scrape uses Firefox profile, not default Chrome Reddit's .json API rejects the wreq-Chrome TLS fingerprint with a 403 even from residential IPs. Their block list includes known browser-emulation library fingerprints. wreq-Firefox passes. The CLI `vertical` subcommand already forced Firefox; MCP `vertical_scrape` was still falling back to the long-lived `self.fetch_client` which defaults to Chrome, so reddit failed on MCP and nobody noticed because the earlier test runs all had an API key set that masked the issue. Switched vertical_scrape to reuse `self.firefox_or_build()` which gives us the cached Firefox client (same pattern the scrape tool uses when the caller requests `browser: firefox`). Firefox is strictly-safer-than-Chrome for every vertical in the catalog, so making it the hard default for `vertical_scrape` is the right call. Verified end-to-end from a clean shell with no WEBCLAW_API_KEY: - MCP reddit: 679ms, post/author/6 comments correct - MCP instagram_profile: 1157ms, 18471 followers No change to the `scrape` tool -- it keeps the user-selectable browser param. Bumps version to 0.5.3.	2026-04-22 23:18:11 +02:00
Valerio	0daa2fec1a	feat(cli+mcp): vertical extractor support (28 extractors discoverable + callable) Wires the vertical extractor catalog into both the CLI and the MCP server so users don't have to hit the HTTP API to invoke them. Same semantics as `/v1/scrape/{vertical}` + `/v1/extractors`. CLI (webclaw-cli): - New subcommand `webclaw extractors` lists all 28 extractors with name, label, and sample URL. `--json` flag emits the full catalog as machine-readable JSON. - New subcommand `webclaw vertical <name> <url>` runs a specific extractor and prints typed JSON. Pretty-printed by default; `--raw` for single-line. Exits 1 with a clear "URL does not match" error on mismatch. - FetchClient built with Firefox profile + cloud fallback attached when WEBCLAW_API_KEY is set, so antibot-gated verticals escalate. MCP (webclaw-mcp): - New tool `list_extractors` (no args) returns the catalog as pretty-printed JSON for in-session discovery. - New tool `vertical_scrape` takes `{name, url}` and returns typed JSON. Reuses the long-lived self.fetch_client. - Tool count goes from 10 to 12. Server-info instruction string updated accordingly. Tests: 215 passing, clippy clean. Manual surface-tested end-to-end: CLI prints real Reddit/github/pypi data; MCP JSON-RPC session returns 28-entry catalog + typed responses for pypi/requests + rust-lang/rust in 200-400ms. Version bumped to 0.5.2 (minor for API additions, backwards compatible).	2026-04-22 21:41:15 +02:00
Valerio	058493bc8f	feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now take `client: &dyn Fetcher` instead of `client: &FetchClient` directly. Backwards-compatible: FetchClient implements Fetcher, blanket impls cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server callers keep working unchanged. Motivation: the production API server (api.webclaw.io) must not do in-process TLS fingerprinting; it delegates all HTTP to the Go tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on production would have required importing wreq into the server's dep graph, violating the CLAUDE.md rule. Now production can provide its own TlsSidecarFetcher implementation and pass it to the same dispatcher the OSS server uses. Changes: - New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus blanket impls for `&T` and `Arc<T>`. - `FetchClient` gains a tiny impl block in client.rs that forwards to its existing public methods. - All 28 extractor signatures migrated from `&FetchClient` to `&dyn Fetcher` (sed-driven bulk rewrite, no semantic change). - `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`. - `extractors::dispatch_by_url` and `extractors::dispatch_by_name` take `&dyn Fetcher`. - `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has native async-fn-in-trait but dyn dispatch still needs async_trait). - Version bumped to 0.5.1, CHANGELOG updated. Tests: 215 passing in webclaw-fetch (no new tests needed — the existing extractor tests exercise the trait methods transparently). Clippy: clean workspace-wide.	2026-04-22 21:17:50 +02:00
Valerio	aaa5103504	docs(claude): fix stale primp references, document wreq + Fetcher trait webclaw-fetch switched from primp to wreq 6.x (BoringSSL) a while ago but CLAUDE.md still documented primp, the `[patch.crates-io]` requirement, and RUSTFLAGS that no longer apply. Refreshed four sections: - Crate listing: webclaw-fetch uses wreq, not primp - client.rs description: wreq BoringSSL, plus a note that FetchClient will implement the new Fetcher trait so production can swap in a tls-sidecar-backed fetcher without importing wreq - Hard Rules: dropped obsolete `[patch.crates-io]` and RUSTFLAGS lines, added the "Vertical extractors take `&dyn Fetcher`" rule that makes the architectural separation explicit for the upcoming production integration - Removed language about primp being "patched"; reqwest in webclaw-llm is now just "plain reqwest" with no relationship to wreq	2026-04-22 21:11:18 +02:00
Valerio	2373162c81	chore: release v0.5.0 (28 vertical extractors + cloud integration) See CHANGELOG.md for the full entry. Headline: 28 site-specific extractors returning typed JSON, five with automatic antibot cloud-escalation via api.webclaw.io, `POST /v1/scrape/{vertical}` + `GET /v1/extractors` on webclaw-server.	2026-04-22 20:59:43 +02:00
Valerio	b2e7dbf365	fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs) Addresses the four follow-ups surfaced by the cloud-key smoke test. trustpilot_reviews — full rewrite for 2025 schema: - Trustpilot moved from single-Organization+aggregateRating to three separate JSON-LD blocks: a site-level Organization (Trustpilot itself), a Dataset with a csvw:Table mainEntity carrying the per-star distribution for the target business, and an aiSummary + aiSummaryReviews block with the AI-generated summary and recent review objects. - Parser now: skips the site-level Org, walks @graph as either array or single object, picks the Dataset whose about.@id references the target domain, parses each csvw:column for rating buckets, computes weighted-average rating + total from the distribution, extracts the aiSummary text, and turns aiSummaryReviews into a clean reviews array with author/country/date/rating/title/text/likes. - OG-title regex fallbacks for business_name, rating_label, and average_rating when the Dataset block is absent. OG-description regex for review_count. - Returned shape: url, domain, business_name, rating_label, average_rating, review_count, rating_distribution (per-star count and percent), ai_summary, recent_reviews, review_count_listed, data_source. - Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 / 226 reviews with full distribution + AI summary + 2 recent reviews. amazon_product — force-cloud-escalation + OG fallback: - Amazon serves Product JSON-LD intermittently even on non-CAPTCHA pages. When local fetch returns HTML without Product JSON-LD and a cloud client is configured, force-escalate to the cloud path which reliably surfaces title + description via its render engine. - New OG meta-tag fallback for title/image/description so the cloud's synthesize_html output (OG tags only, no #productTitle DOM ID) still yields useful data. Real Amazon pages still prefer the DOM regex. 
- Verified live: B0BSHF7WHW escalates to cloud, returns Apple MacBook Pro title + description + asin. etsy_listing — slug humanization + generic-page filtering + shop from brand: - Etsy serves various placeholder pages when a listing is delisted, blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...", "This item is unavailable - Etsy", plus the OG description "Sorry, the page you were looking for was not found." is_generic_* helpers catch all three shapes. - When the OG title is generic, humanise the URL slug: the path `/listing/123456789/personalized-stainless-steel-tumbler` becomes `Personalized Stainless Steel Tumbler` so callers always get a meaningful title even on dead listings. - Etsy uses `brand` (top-level JSON-LD field) for the shop name on listings that don't ship offers[].seller.name. Shop now falls through offers -> brand so either schema resolves. - Verified live: listing/1097462299 returns full rich data (title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating / 225 reviews, InStock). cloud.rs — module doc update: - Added an architecture section documenting that api.webclaw.io does not return raw HTML by design and that [`synthesize_html`] reassembles the parsed response (metadata + structured_data + markdown) back into minimal HTML so existing local parsers run unchanged across both paths. Also notes the DOM-regex limitation for extractors that need live-page-specific DOM IDs. Tests: 215 passing in webclaw-fetch (18 new), clippy clean. Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY: 28/28 clean, 0 partial, 0 failed.	2026-04-22 17:49:50 +02:00
Valerio	e10066f527	fix(cloud): synthesize HTML from cloud response instead of requesting raw html api.webclaw.io/v1/scrape does not return a `html` field even when `formats=["html"]` is requested, by design: the cloud API returns pre-parsed `structured_data` (JSON-LD blocks), `metadata` (OG tags, title, description, image, site_name), and `markdown`. Our CloudClient::fetch_html helper was premised on the API returning raw HTML. Without a key set, the error message was hidden behind CloudError::NotConfigured so the bug never surfaced. With a key set, every extractor that escalated to cloud (trustpilot_reviews, etsy_listing, amazon_product, ebay_listing, substack_post HTML fallback) got back "cloud /v1/scrape returned no html field". Fix: reassemble a minimal synthetic HTML document from the cloud's parsed output. Each JSON-LD block goes back into a `<script type="application/ld+json">` tag, metadata fields become OG `<meta>` tags, and the markdown body lands in a `<pre>` tag. Existing local extractor parsers (find_product_jsonld, find_business, og() regex) see the same shapes they'd see from a real page, so no per-extractor changes needed. Verified end-to-end with WEBCLAW_CLOUD_API_KEY set: - trustpilot_reviews: escalates, returns Organization JSON-LD data (parser picks Trustpilot site-level Org not the reviewed business; tracked as a follow-up to update Trustpilot schema handling) - etsy_listing: escalates via antibot render path; listing-specific data depends on target listing having JSON-LD (many Etsy listings don't) - amazon_product, ebay_listing: stay local because their pages ship enough content not to trigger bot-detection escalation - The other 24 extractors unchanged (local path, zero cloud credits) Tests: 200 passing in webclaw-fetch (3 new), clippy clean.	2026-04-22 17:24:50 +02:00
Valerio	a53578e45c	fix(extractors): detect AWS WAF verifying-connection page, add OG fallback to ecommerce_product Two targeted fixes surfaced by the manual extractor smoke test. cloud::is_bot_protected: - Trustpilot serves a ~565-byte AWS WAF interstitial with the string "Verifying your connection..." and an `interstitial-spinner` div. That pattern was not in our detector, so local fetch returned the challenge page, JSON-LD parsing found nothing, and the extractor emitted a confusing "no Organization/LocalBusiness JSON-LD" error. - Added the pattern plus a <10KB size gate so real articles that happen to mention the phrase aren't misclassified. Two new tests cover positive + negative cases. - With the fix, trustpilot_reviews now correctly escalates via smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY" actionable error without a key, or cloud-bypassed HTML with one. ecommerce_product: - Previously hard-failed when a page had no Product JSON-LD, and produced an empty `offers` list when JSON-LD was present but its `offers` node was. Many sites (Patagonia-style catalog pages, smaller Squarespace stores) ship one or the other of OG / JSON-LD but not both with price data. - Added OG meta-tag fallback that handles: * no JSON-LD at all -> build minimal payload from og:title, og:image, og:description, product:price:amount, product:price:currency, product:availability, product:brand * JSON-LD present but offers empty -> augment with an OG-derived offer so price comes through - New `data_source` field: "jsonld", "jsonld+og", or "og_fallback" so callers can tell which branch populated the data. - `has_og_product_signal()` requires og:type=product or a price tag so blog posts don't get mis-classified as products. Tests: 197 passing in webclaw-fetch (6 new), clippy clean.	2026-04-22 17:07:31 +02:00
Valerio	7f5eb93b65	feat(extractors): wave 6b, etsy_listing + HTML fallbacks for substack/youtube Adds etsy_listing and hardens two existing extractors with HTML fallbacks so transient API failures still return useful data. New: - etsy_listing: /listing/{id}(/slug) with Schema.org Product JSON-LD + OG fallback. Antibot-gated, routes through cloud::smart_fetch_html like amazon_product and ebay_listing. Auto-dispatched (etsy host is unique). Hardened: - substack_post: when /api/v1/posts/{slug} returns non-200 (rate limit, 403 on hardened custom domains, 5xx), fall back to HTML fetch and parse OG tags + Article JSON-LD. Response shape is stable across both paths, with a `data_source` field of "api" or "html_fallback". - youtube_video: when ytInitialPlayerResponse is missing (EU-consent interstitial, age-gated, some live pre-shows), fall back to OG tags for title/description/thumbnail. `data_source` now "player_response" or "og_fallback". Tests: 91 passing in webclaw-fetch (9 new), clippy clean.	2026-04-22 16:44:51 +02:00

150 commits
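
Commit b2e7dbf365 describes humanising an Etsy URL slug when the OG title is generic, so that `/listing/123456789/personalized-stainless-steel-tumbler` yields `Personalized Stainless Steel Tumbler`. The following is a minimal sketch of that idea only, not the repository's code: the function name `humanize_listing_slug`, the digits-only check on the listing id, and the word-capitalisation rule are all assumptions made for illustration.

```rust
/// Hypothetical helper (not from the webclaw source): derive a readable title
/// from the slug segment of a path shaped like /listing/{id}/{slug}.
fn humanize_listing_slug(path: &str) -> Option<String> {
    let mut segments = path.trim_matches('/').split('/');

    // The first segment must be the literal "listing".
    if segments.next()? != "listing" {
        return None;
    }

    // Assumption: the listing id is a non-empty run of ASCII digits.
    let id = segments.next()?;
    if id.is_empty() || !id.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }

    // The slug is optional in the URL; without it there is nothing to humanise.
    let slug = segments.next()?;

    // Split on hyphens and capitalise the first character of each word.
    let words: Vec<String> = slug
        .split('-')
        .filter(|w| !w.is_empty())
        .map(|w| {
            let mut chars = w.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().chain(chars).collect::<String>(),
                None => String::new(),
            }
        })
        .collect();

    if words.is_empty() {
        None
    } else {
        Some(words.join(" "))
    }
}

fn main() {
    let title =
        humanize_listing_slug("/listing/123456789/personalized-stainless-steel-tumbler");
    // prints: Some("Personalized Stainless Steel Tumbler")
    println!("{:?}", title);
}
```

Returning `Option<String>` lets a caller fall back to the original OG title when the path carries no slug; whether the real extractor handles that case the same way is not stated in the commit message.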