webclaw/crates/webclaw-fetch/src/reddit.rs
Valerio 217bfe088b feat(reddit): parse old.reddit.com HTML instead of the dead .json API
Reddit blocked unauthenticated `.json` access, so the previous extractor
returned block pages or timed out on every thread. Switch to parsing
old.reddit.com's server-rendered HTML, which needs no API key or JS.

Fetch layer:
- Rewrite every Reddit host to old.reddit.com before fetching; drop all
  `.json` URL handling and the JSON response parser.

Extraction (webclaw-core::reddit):
- New HTML parser producing a typed post + nested comment tree.
- Comments nest structurally (.comment > .child > .sitetable > .comment);
  old.reddit omits a usable depth attribute, so the tree is walked
  recursively. Bodies live in .entry > form > .usertext-body > .md.
- Post metadata: title, author, subreddit, score, comment count
  (data-comments-count), self-vs-link (self class / self.* domain),
  flair, self-text body.
- Comment scores read the .score.unvoted title (the displayed value, not
  the ±1 vote-state siblings); hidden scores are None, not 0.
- Deleted comments are kept in place so their replies aren't orphaned;
  "load more comments" stubs are skipped.

Markdown output:
- Reply nesting via blockquote depth (avoids 4-space indentation turning
  text and code fences into broken indented-code blocks).
- Links keep their target as [text](url); root-relative reddit links
  resolve against old.reddit.com. Nested lists indent correctly.
- A recognised but unparseable /comments/ page returns no content rather
  than falling through to generic extraction of Reddit chrome.
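
The blockquote nesting and the hidden-score rule above can be sketched as follows. This is a minimal illustration, not the code in webclaw-core: the `Comment` struct, its field names, `render_comment`, `render_thread`, and the header format are all hypothetical.

```rust
/// Hypothetical comment node; the field names are illustrative only.
#[derive(Debug)]
struct Comment {
    /// `None` stands in for a deleted author.
    author: Option<String>,
    /// `None` means the score is hidden, which is distinct from `Some(0)`.
    score: Option<i64>,
    body: String,
    replies: Vec<Comment>,
}

/// Render one comment and its replies, adding one `> ` level per depth
/// instead of indenting with spaces.
fn render_comment(comment: &Comment, depth: usize, out: &mut String) {
    let prefix = "> ".repeat(depth);
    let author = comment.author.as_deref().unwrap_or("[deleted]");
    let score = match comment.score {
        Some(points) => format!("{points} points"),
        None => "score hidden".to_string(),
    };
    out.push_str(&format!("{prefix}**{author}** ({score})\n"));
    for line in comment.body.lines() {
        out.push_str(&format!("{prefix}{line}\n"));
    }
    // Separator line at the current quote depth (empty at depth 0).
    out.push_str(prefix.trim_end());
    out.push('\n');
    // A deleted comment is still rendered, so its replies keep their parent.
    for reply in &comment.replies {
        render_comment(reply, depth + 1, out);
    }
}

/// Render a whole tree starting at depth 0.
fn render_thread(root: &Comment) -> String {
    let mut out = String::new();
    render_comment(root, 0, &mut out);
    out
}
```

Because every nested line starts with `> ` rather than four or more spaces, a Markdown renderer should not mistake reply text for an indented code block.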

Tests: regression suite runs against real old.reddit.com fixtures
(testdata/reddit/), the ground truth that surfaced the parsing and
markdown bugs synthetic HTML had hidden. Fixtures are excluded from the
published crate.
2026-06-04 17:36:02 +02:00

//! Reddit URL helpers for the fetch layer.
//!
//! Reddit blocks unauthenticated access to its `.json` endpoints, so we
//! rewrite all Reddit hosts to `old.reddit.com`, which serves stable
//! server-rendered HTML that `webclaw_core::reddit` parses directly.
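
/// Returns `true` if `url` is a Reddit URL.
///
/// Delegates to `webclaw_core::reddit::is_reddit_url`.
pub fn is_reddit_url(url: &str) -> bool {
    webclaw_core::reddit::is_reddit_url(url)
}

/// Rewrite any Reddit host to old.reddit.com, preserving path, query, and
/// fragment.
///
/// The host is replaced unconditionally, so call this only on URLs that
/// `is_reddit_url` accepts. Input without a `://` separator is returned
/// unchanged.
pub fn to_old_reddit_url(url: &str) -> String {
    let Some(scheme_end) = url.find("://") else {
        return url.to_string();
    };
    // Everything after `scheme://`: the host, then path, query, and fragment.
    let after = &url[scheme_end + 3..];
    // The host ends at the first `/`, `?`, or `#`, or at the end of the input.
    let host_end = after.find(['/', '?', '#']).unwrap_or(after.len());
    let scheme = &url[..scheme_end + 3];
    let rest = &after[host_end..];
    format!("{scheme}old.reddit.com{rest}")
}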

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn rewrites_www_to_old() {
        assert_eq!(
            to_old_reddit_url("https://www.reddit.com/r/rust/comments/abc/x/"),
            "https://old.reddit.com/r/rust/comments/abc/x/"
        );
    }

    #[test]
    fn rewrites_bare_to_old() {
        assert_eq!(
            to_old_reddit_url("https://reddit.com/r/rust/"),
            "https://old.reddit.com/r/rust/"
        );
    }

    #[test]
    fn preserves_old_reddit_unchanged() {
        let url = "https://old.reddit.com/r/rust/comments/abc/x/?context=3";
        assert_eq!(to_old_reddit_url(url), url);
    }

    #[test]
    fn preserves_query_and_hash() {
        assert_eq!(
            to_old_reddit_url("https://www.reddit.com/r/rust/?sort=top#anchor"),
            "https://old.reddit.com/r/rust/?sort=top#anchor"
        );
    }
}