Commit graph

163 commits

Author SHA1 Message Date
Valerio
f6000cba52 chore(release): v0.6.5
Reddit extraction moves from the dead .json API to old.reddit.com HTML.
2026-06-04 17:36:02 +02:00
Valerio
217bfe088b feat(reddit): parse old.reddit.com HTML instead of the dead .json API
Reddit blocked unauthenticated `.json` access, so the previous extractor
returned block pages or timed out on every thread. Switch to parsing
old.reddit.com's server-rendered HTML, which needs no API key or JS.

Fetch layer:
- Rewrite every Reddit host to old.reddit.com before fetching; drop all
  `.json` URL handling and the JSON response parser.

Extraction (webclaw-core::reddit):
- New HTML parser producing a typed post + nested comment tree.
- Comments nest structurally (.comment > .child > .sitetable > .comment);
  old.reddit omits a usable depth attribute, so the tree is walked
  recursively. Bodies live in .entry > form > .usertext-body > .md.
- Post metadata: title, author, subreddit, score, comment count
  (data-comments-count), self-vs-link (self class / self.* domain),
  flair, self-text body.
- Comment scores read the .score.unvoted title (the displayed value, not
  the ±1 vote-state siblings); hidden scores are None, not 0.
- Deleted comments are kept in place so their replies aren't orphaned;
  "load more comments" stubs are skipped.

Markdown output:
- Reply nesting via blockquote depth (avoids 4-space indentation turning
  text and code fences into broken indented-code blocks).
- Links keep their target as [text](url); root-relative reddit links
  resolve against old.reddit.com. Nested lists indent correctly.
- A recognised but unparseable /comments/ page returns no content rather
  than falling through to generic extraction of Reddit chrome.

Tests: regression suite runs against real old.reddit.com fixtures
(testdata/reddit/), the ground truth that surfaced the parsing and
markdown bugs synthetic HTML had hidden. Fixtures are excluded from the
published crate.
2026-06-04 17:36:02 +02:00
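The blockquote-depth nesting described in 217bfe088b can be sketched roughly as follows. This is a minimal illustration, not the actual webclaw-core::reddit code: the `Comment` struct, its field names, and `render_comment` are hypothetical, chosen only to show why one `>` per reply level keeps code fences intact where 4-space indentation would turn them into indented code blocks. Hidden scores are modelled as `None`, as the commit message states.

```rust
/// Hypothetical comment node (assumption: the real types are not shown in the log).
#[derive(Debug, Clone)]
pub struct Comment {
    pub author: String,
    /// `None` means the score is hidden, which is distinct from a real score of 0.
    pub score: Option<i64>,
    pub body: String,
    pub replies: Vec<Comment>,
}

/// Write one line with the blockquote prefix; empty lines get a bare `>`
/// so no trailing whitespace is emitted.
fn push_line(out: &mut String, prefix: &str, line: &str) {
    if line.is_empty() {
        out.push_str(prefix.trim_end());
    } else {
        out.push_str(prefix);
        out.push_str(line);
    }
    out.push('\n');
}

/// Render a comment and, recursively, its replies.
/// Each nesting level adds one `> ` to the line prefix.
pub fn render_comment(c: &Comment, depth: usize, out: &mut String) {
    let prefix = "> ".repeat(depth);
    let score = match c.score {
        Some(s) => format!("{s} points"),
        None => "score hidden".to_string(),
    };
    push_line(out, &prefix, &format!("**{}** ({score})", c.author));
    push_line(out, &prefix, "");
    for line in c.body.lines() {
        push_line(out, &prefix, line);
    }
    // A blank line ends the blockquote before the next comment starts.
    out.push('\n');
    for reply in &c.replies {
        render_comment(reply, depth + 1, out);
    }
}
```

Calling `render_comment(&root, 0, &mut out)` on a top-level comment emits the root unquoted and each reply one blockquote level deeper, so a fenced block inside a reply stays a fenced block.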
Valerio
3b7d11328e Add sponsor preview placements 2026-06-04 10:04:32 +02:00
Valerio
363e17d362 docs: add ColdProxy infrastructure partner 2026-05-31 18:35:45 +02:00
Valerio
8fe8bcb479 chore(ci): bump actions/checkout and artifact actions to v5
GitHub flagged checkout@v4 / upload-artifact@v4 / download-artifact@v4
as Node.js 20 actions, force-migrated to Node 24 on 2026-06-02. Bump
all nine references to v5 ahead of the deadline. The artifact steps are
v5-compatible: upload uses a unique matrix-target name and the download
step flattens subdirectories with find afterward.
2026-05-21 15:11:29 +02:00
Valerio
51260ae4e3 chore(release): record v0.6.4 version bump and changelog
The v0.6.4 tag shipped the API surface discovery module but the
release commit left the workspace version at 0.6.3 with no matching
changelog entry. Bump [workspace.package] to 0.6.4 and add the
[0.6.4] CHANGELOG section so the code matches the tag.
2026-05-21 12:58:47 +02:00
Valerio
fe567a6af1
feat(core): endpoints module for API surface extraction from HTML and JS (#47)
* feat(core): endpoints module — extract API surface from HTML + JS bundles

* fix(docker): source CA bundle from distroless instead of apt (fixes arm64 release build)

* fix(test): serialize env-mutating CloudClient tests to stop flaky CI

* feat(core): filter endpoint-extractor noise (invalid hosts, schema domains, bare paths)
2026-05-19 19:05:16 +02:00
Valerio
be8bcfebd9
fix: harden resource limits, path safety, and WASM build (#46)
Security audit follow-up across the workspace:

- webclaw-core: keep the crate WASM-safe. quickjs/rquickjs is now a
  cfg(not(wasm32)) target dependency and the extraction entry point uses
  a direct call on wasm instead of spawning a thread, so it builds and
  runs on wasm32 with or without default features.
- webclaw-core: bound the structured-data scrubber recursion (depth cap)
  so deeply nested attacker JSON-LD / __NEXT_DATA__ cannot exhaust the
  stack.
- webclaw-fetch: stream the response body with a running ceiling so a
  small highly compressed payload cannot inflate to gigabytes in memory;
  redact user:pass@ from proxy URLs before they reach error strings.
- webclaw-cli: contain output filenames inside the chosen directory
  (reject .. / absolute, drop traversal path segments), run --webhook
  URLs through the public-URL SSRF guard, clamp --watch-interval to >=1s,
  and make research slug truncation char-safe.
- webclaw-mcp: char-safe slug truncation (no multibyte slice panic).
- setup.sh / deploy/hetzner.sh: replace eval on read input with
  printf -v, and mask auth key / API token in console output.
- CI: enforce the wasm32 build invariant for webclaw-core.

Tests added for every behavioral change. Bump to 0.6.3 + CHANGELOG.
2026-05-19 17:03:52 +02:00
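Two of the hardening items in be8bcfebd9, char-safe slug truncation and output-filename containment, lend themselves to a short sketch. The function names `truncate_chars` and `contain_filename` are hypothetical and the behaviour is an assumption based only on the commit message; the real webclaw-cli and webclaw-mcp code may reject bad input rather than silently dropping segments.

```rust
use std::path::{Component, Path, PathBuf};

/// Truncate `s` to at most `max_chars` characters.
/// Slicing at a byte index taken from `char_indices` never lands inside
/// a multibyte UTF-8 sequence, so this cannot panic.
pub fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s, // already short enough
    }
}

/// Build an output path that stays inside `dir`.
/// Only normal components are kept: `..`, `.`, root and prefix components
/// are dropped. Returns `None` when nothing usable remains.
pub fn contain_filename(dir: &Path, requested: &str) -> Option<PathBuf> {
    let mut out = dir.to_path_buf();
    let mut kept_any = false;
    for component in Path::new(requested).components() {
        if let Component::Normal(part) = component {
            out.push(part);
            kept_any = true;
        }
    }
    kept_any.then_some(out)
}
```

A naive `&slug[..max_bytes]` panics when the cut falls inside a multibyte character, which is the failure mode the commit describes. Walking `char_indices` trades a linear scan for a guaranteed valid boundary.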
Valerio
aab51bea91 docs: add workflow examples 2026-05-18 18:56:00 +02:00
Valerio
b75b768ec3 Update Quantum Proxies sponsor copy 2026-05-18 18:50:38 +02:00
Valerio
3fabdc1d02
fix: clean llm output noise
Port the valid PR #43 LLM cleanup fixes onto current main without stale branch regressions.

Includes comment-count link cleanup, bare numeric paragraph cleanup, pagination leftover cleanup, JSON-LD article body scrubbing, clearer CLI consent-wall warnings, and quieter parser logs by default.

Thanks to @devnen for the report and patch work.
2026-05-18 18:39:33 +02:00
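As an illustration of the "bare numeric paragraph cleanup" mentioned in 3fabdc1d02, one plausible shape is a filter over blank-line-separated paragraphs. The function name and the exact rule (digits plus `,` and `.` separators only) are assumptions for this sketch; the commit does not show the real heuristic.

```rust
/// True when a paragraph is nothing but a number, e.g. "42" or "1,204".
/// Assumption: at least one digit, and only digits, commas, or dots.
fn is_bare_numeric(paragraph: &str) -> bool {
    let t = paragraph.trim();
    t.chars().any(|c| c.is_ascii_digit())
        && t.chars().all(|c| c.is_ascii_digit() || c == ',' || c == '.')
}

/// Drop paragraphs that consist only of a number (stray counters,
/// vote totals, page indices) and keep everything else unchanged.
pub fn drop_bare_numeric_paragraphs(text: &str) -> String {
    text.split("\n\n")
        .filter(|para| !is_bare_numeric(para))
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Paragraphs that merely contain numbers, such as "Body with 3 items", pass through untouched because the rule requires every character to be numeric.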
Valerio
5eef8358b0 docs: update sponsor partner details 2026-05-18 13:09:02 +02:00
Valerio
7dfd62ec1d docs: add proxy-seller studio partner 2026-05-18 12:37:28 +02:00
Valerio
6d886c44f6 docs: enlarge studio partner banner 2026-05-18 12:27:11 +02:00
Valerio
8e3ad17428 docs: tighten studio partner layout 2026-05-18 12:23:19 +02:00
Valerio
7321549412 docs: add studio partner section 2026-05-18 12:17:34 +02:00
Valerio
72edb61881
Merge pull request #42 from jal-co/docs/add-community-plugins
docs: add community plugins section
2026-05-16 11:24:33 +02:00
Valerio
00d86a12bc docs: refine community plugin copy 2026-05-16 11:19:15 +02:00
Justin Levine
c8be5214f6
docs: add community plugins section with OpenClaw and Hermes integrations 2026-05-15 17:51:22 -07:00
Valerio
0ea189c5b2 fix(ci): pass repository to release cli
2026-05-12 12:28:14 +02:00
Valerio
a629534490
fix(security): prepare 0.6.1 hardening
Merge the 0.6.1 security hardening release candidate after local and CI verification.
2026-05-12 12:16:42 +02:00
Valerio
fd2e75d509 chore(fetch): satisfy clippy for resolver setup 2026-05-12 12:09:18 +02:00
Valerio
e2f89941ac chore(release): prepare 0.6.1 2026-05-12 12:06:06 +02:00
Valerio
307b4f980d fix(extractors): harden marketplace host matching 2026-05-12 12:03:43 +02:00
Valerio
dbf9ce08a6 fix(ci): scope release workflow token permissions 2026-05-12 12:00:47 +02:00
Valerio
3bcb288d13 fix(fetch): guard challenge detection before utf8 decoding 2026-05-12 12:00:47 +02:00
Valerio
a611ae26f3 fix(security): harden local fetch surfaces 2026-05-12 12:00:25 +02:00
Valerio
af96628dc9
Revise README for clarity and updated content
Updated the README to reflect changes in the project description, banner image size, and various content sections. Enhanced clarity on features and usage.
2026-05-10 22:44:57 +02:00
devnen
e8ca1417d6
Improve --format llm output quality (#37)
Improve LLM-format output for modern news and documentation pages.

- Filter noisy hydration and low-value page chrome structured data while preserving content-bearing Schema.org records
- Fix element/text spacing without detaching punctuation on docs, forums, and reference pages
- Remove common accessibility link chrome from LLM text and link labels
- Bump workspace version to 0.6.0 and update the changelog

Thanks to Nenad Oric (@devnen) for the original PR and contribution.
2026-05-10 15:11:12 +02:00
Valerio
7f75143954 docs: update hosted api trial copy
2026-05-06 17:16:35 +02:00
Valerio
e6a95f783d chore: bump version to 0.5.9
2026-05-06 11:42:09 +02:00
Valerio
a3aa4bce6f fix: support LLM provider compatibility options
Closes #36
2026-05-06 11:36:53 +02:00
Valerio
86183b11e4 docs: credit Windows release contribution
2026-05-05 11:44:07 +02:00
SURYANSH MISHRA
513b0e493e ci: add Windows release artifacts
Closes #34
2026-05-05 11:38:30 +02:00
Valerio
a1242a1c1d docs: credit README badge refresh 2026-05-05 11:18:58 +02:00
Justin Levine
a542e45768
docs: refresh README badges
Replace README badges with shieldcn-styled badges.
2026-05-05 11:17:21 +02:00
Valerio
615f326660 docs: update changelog for brand extraction
2026-05-04 21:52:49 +02:00
Valerio
72b8dbc285 fix: improve brand extraction signals 2026-05-04 21:25:07 +02:00
Valerio
1c9def2fde fix: validate self-host route URLs consistently 2026-05-04 14:30:06 +02:00
Valerio
eede2f6953 docs: credit SSRF report
2026-05-04 12:08:11 +02:00
Valerio
bdf81fe6bf fix: harden fetch URL validation 2026-05-04 11:50:57 +02:00
Valerio
23544f8fac docs(claude): note youtube.rs role and yt-dlp short-circuit in server
The webclaw-core youtube module produces structured markdown but no
transcript; document that and point at the production server's
youtube_transcript.rs short-circuit for the full YoutubeData + caption
text shape.
2026-05-03 21:17:23 +02:00
Valerio
923445f4a8 docs(readme): add h1 brand heading
The repo had no heading-level brand anchor, only a banner image and
an h3 slogan. Search engines indexing the README were missing the
canonical brand signal. The new h1 is what GitHub renders as the
title of the page and what Google co-ranks with webclaw.io.

Bumps workspace version to 0.5.7.
2026-04-30 11:47:02 +02:00
Valerio
0e6c7cdc97
Add GitHub Sponsors username to FUNDING.yml
Updated funding model with GitHub Sponsors username.
2026-04-27 13:18:22 +02:00
Valerio
5795c5c422 docs(readme): add star history chart
2026-04-26 17:55:22 +02:00
Valerio
4908367720 docs(readme): add hosted API callout above Get Started
Surface webclaw.io as a clear alternative path for visitors who want
the antibot, JS rendering, async jobs, search, and watches the OSS
server doesn't ship. Sits between the value-prop and the install
instructions so self-host stays the primary on-ramp.
2026-04-26 17:15:44 +02:00
Valerio
a5c3433372 fix(core+server): guard markdown pipe slice + detect trustpilot/reddit verify walls
2026-04-23 15:26:31 +02:00
Valerio
966981bc42 fix(fetch): send bot-identifying UA on reddit .json API to bypass browser UA block
2026-04-23 15:17:04 +02:00
Valerio
866fa88aa0 fix(fetch): reject HTML verification pages served at .json reddit URL 2026-04-23 15:06:35 +02:00
Valerio
b413d702b2 feat(fetch): add fetch_smart with Reddit + Akamai rescue paths, bump 0.5.6 2026-04-23 14:59:29 +02:00