feat(reddit): parse old.reddit.com HTML instead of the dead .json API

Reddit blocked unauthenticated `.json` access, so the previous extractor
returned block pages or timed out on every thread. Switch to parsing
old.reddit.com's server-rendered HTML, which needs no API key or JS.

Fetch layer:
- Rewrite every Reddit host to old.reddit.com before fetching; drop all
  `.json` URL handling and the JSON response parser.

Extraction (webclaw-core::reddit):
- New HTML parser producing a typed post + nested comment tree.
- Comments nest structurally (.comment > .child > .sitetable > .comment);
  old.reddit omits a usable depth attribute, so the tree is walked
  recursively. Bodies live in .entry > form > .usertext-body > .md.
- Post metadata: title, author, subreddit, score, comment count
  (data-comments-count), self-vs-link (self class / self.* domain),
  flair, self-text body.
- Comment scores read the .score.unvoted title (the displayed value, not
  the ±1 vote-state siblings); hidden scores are None, not 0.
- Deleted comments are kept in place so their replies aren't orphaned;
  "load more comments" stubs are skipped.

Markdown output:
- Reply nesting via blockquote depth (avoids 4-space indentation turning
  text and code fences into broken indented-code blocks).
- Links keep their target as [text](url); root-relative reddit links
  resolve against old.reddit.com. Nested lists indent correctly.
- A recognised but unparseable /comments/ page returns no content rather
  than falling through to generic extraction of Reddit chrome.

Tests: regression suite runs against real old.reddit.com fixtures
(testdata/reddit/), the ground truth that surfaced the parsing and
markdown bugs synthetic HTML had hidden. Fixtures are excluded from the
published crate.
2026-06-04 16:16:08 +02:00
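
The two markdown decisions in the commit message (reply nesting via blockquote depth, and hidden scores as None rather than 0) can be illustrated with a minimal sketch. The `Comment` struct, its field names, and `render_comment` are hypothetical stand-ins, not the types in `webclaw_core::reddit`; the output wording ("points", "score hidden") is likewise assumed.

```rust
/// Hypothetical comment node; field names are illustrative only.
struct Comment {
    author: String,
    /// `None` when the score is hidden, so it is never confused with a real 0.
    score: Option<i64>,
    body: String,
    replies: Vec<Comment>,
}

/// Render a comment and its replies recursively, prefixing every line
/// with one `> ` per nesting level instead of indenting with spaces.
fn render_comment(c: &Comment, depth: usize, out: &mut String) {
    let prefix = "> ".repeat(depth);
    let score = match c.score {
        Some(n) => format!("{n} points"),
        None => "score hidden".to_string(),
    };
    // Header line, then an empty quoted line so the body is its own paragraph.
    out.push_str(&format!("{prefix}**{}** ({score})\n", c.author));
    out.push_str(prefix.trim_end());
    out.push('\n');
    for line in c.body.lines() {
        out.push_str(&format!("{prefix}{line}\n"));
    }
    // A blank line closes this comment's blockquote before the next one starts.
    out.push('\n');
    for reply in &c.replies {
        render_comment(reply, depth + 1, out);
    }
}
```

Because the depth is carried by `>` markers, a reply's text and any code fences inside it never pick up the four leading spaces that markdown would read as an indented code block.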
//! Reddit URL helpers for the fetch layer.
//!
//! The JSON API (`*.json`) is blocked. We rewrite all Reddit hosts to
//! `old.reddit.com`, which serves stable server-rendered HTML that
//! `webclaw_core::reddit` parses directly.
2026-03-23 18:31:11 +01:00
pub fn is_reddit_url(url: &str) -> bool {
    webclaw_core::reddit::is_reddit_url(url)
}

/// Rewrite the host of `url` to old.reddit.com, preserving path, query,
/// and fragment. Input without a `://` separator is returned unchanged.
/// This does not itself check that the host is a Reddit host; pair it
/// with `is_reddit_url`.
pub fn to_old_reddit_url(url: &str) -> String {
    let Some(scheme_end) = url.find("://") else {
        return url.to_string();
    };
    // Everything after "scheme://".
    let after = &url[scheme_end + 3..];
    // The host ends at the first path, query, or fragment delimiter.
    let host_end = after.find(['/', '?', '#']).unwrap_or(after.len());
    let scheme = &url[..scheme_end + 3];
    let rest = &after[host_end..];
    format!("{scheme}old.reddit.com{rest}")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn rewrites_www_to_old() {
        assert_eq!(
            to_old_reddit_url("https://www.reddit.com/r/rust/comments/abc/x/"),
            "https://old.reddit.com/r/rust/comments/abc/x/"
        );
    }

    #[test]
    fn rewrites_bare_to_old() {
        assert_eq!(
            to_old_reddit_url("https://reddit.com/r/rust/"),
            "https://old.reddit.com/r/rust/"
        );
    }

    #[test]
    fn preserves_old_reddit_unchanged() {
        let url = "https://old.reddit.com/r/rust/comments/abc/x/?context=3";
        assert_eq!(to_old_reddit_url(url), url);
    }

    #[test]
    fn preserves_query_and_hash() {
        assert_eq!(
            to_old_reddit_url("https://www.reddit.com/r/rust/?sort=top#anchor"),
            "https://old.reddit.com/r/rust/?sort=top#anchor"
        );
    }
}
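
Since `to_old_reddit_url` replaces whatever host it is given, a caller would presumably guard it with a host check first. The sketch below shows one way the two helpers might be combined before fetching. `host_of`, `looks_like_reddit`, and `normalize_fetch_url` are hypothetical names introduced for illustration; `looks_like_reddit` is only a stand-in for `webclaw_core::reddit::is_reddit_url`, whose actual rules are not shown in this file.

```rust
/// Extract the host portion of a URL that contains "://" (hypothetical helper).
fn host_of(url: &str) -> Option<&str> {
    let scheme_end = url.find("://")?;
    let after = &url[scheme_end + 3..];
    let host_end = after.find(['/', '?', '#']).unwrap_or(after.len());
    Some(&after[..host_end])
}

/// Assumed stand-in for `webclaw_core::reddit::is_reddit_url`:
/// accepts reddit.com and any of its subdomains.
fn looks_like_reddit(url: &str) -> bool {
    match host_of(url) {
        Some(host) => host == "reddit.com" || host.ends_with(".reddit.com"),
        None => false,
    }
}

/// Same logic as the helper in this file, repeated so the sketch is self-contained.
fn to_old_reddit_url(url: &str) -> String {
    let Some(scheme_end) = url.find("://") else {
        return url.to_string();
    };
    let after = &url[scheme_end + 3..];
    let host_end = after.find(['/', '?', '#']).unwrap_or(after.len());
    let scheme = &url[..scheme_end + 3];
    let rest = &after[host_end..];
    format!("{scheme}old.reddit.com{rest}")
}

/// Rewrite Reddit URLs to old.reddit.com and leave every other URL untouched.
fn normalize_fetch_url(url: &str) -> String {
    if looks_like_reddit(url) {
        to_old_reddit_url(url)
    } else {
        url.to_string()
    }
}
```

With this guard, `normalize_fetch_url("https://example.com/r/rust/")` comes back unchanged, whereas calling `to_old_reddit_url` on it directly would point the request at old.reddit.com.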