mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-06 22:05:13 +02:00
Reddit blocked unauthenticated `.json` access, so the previous extractor returned block pages or timed out on every thread. Switch to parsing old.reddit.com's server-rendered HTML, which needs no API key or JS.

Fetch layer:
- Rewrite every Reddit host to old.reddit.com before fetching; drop all `.json` URL handling and the JSON response parser.

Extraction (webclaw-core::reddit):
- New HTML parser producing a typed post + nested comment tree.
- Comments nest structurally (.comment > .child > .sitetable > .comment); old.reddit omits a usable depth attribute, so the tree is walked recursively. Bodies live in .entry > form > .usertext-body > .md.
- Post metadata: title, author, subreddit, score, comment count (data-comments-count), self-vs-link (self class / self.* domain), flair, self-text body.
- Comment scores read the .score.unvoted title (the displayed value, not the ±1 vote-state siblings); hidden scores are None, not 0.
- Deleted comments are kept in place so their replies aren't orphaned; "load more comments" stubs are skipped.

Markdown output:
- Reply nesting via blockquote depth (avoids 4-space indentation turning text and code fences into broken indented-code blocks).
- Links keep their target as [text](url); root-relative reddit links resolve against old.reddit.com. Nested lists indent correctly.
- A recognised but unparseable /comments/ page returns no content rather than falling through to generic extraction of Reddit chrome.

Tests: regression suite runs against real old.reddit.com fixtures (testdata/reddit/), the ground truth that surfaced the parsing and markdown bugs synthetic HTML had hidden. Fixtures are excluded from the published crate.
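The blockquote-depth nesting described under "Markdown output" could look roughly like the sketch below. The `Comment` struct, its field names, and `render_comment` are hypothetical and are not taken from webclaw-core; the sketch only illustrates three points from the message: replies are prefixed with one `> ` per nesting level instead of being indented, a hidden score is `None` rather than 0, and a deleted comment stays in the tree so its replies keep their parent.

```rust
/// Hypothetical comment node. Field names are illustrative only and are
/// not webclaw-core's actual types.
#[derive(Debug, Clone)]
pub struct Comment {
    pub author: Option<String>, // None for a deleted comment
    pub score: Option<i64>,     // None when the score is hidden (not 0)
    pub body: String,
    pub replies: Vec<Comment>,
}

/// Append `c` and all of its replies to `out`, prefixing every line with
/// one "> " per nesting level. No 4-space indentation is used, so body
/// text is never reinterpreted as an indented code block.
pub fn render_comment(c: &Comment, depth: usize, out: &mut String) {
    let prefix = "> ".repeat(depth);
    let blank = prefix.trim_end(); // an "empty" line at this quote depth
    let author = c.author.as_deref().unwrap_or("[deleted]");
    let score = match c.score {
        Some(s) => format!("{s} points"),
        None => "score hidden".to_string(),
    };
    out.push_str(&format!("{prefix}**{author}** ({score})\n{blank}\n"));
    for line in c.body.lines() {
        out.push_str(&format!("{prefix}{line}\n"));
    }
    out.push_str(&format!("{blank}\n"));
    // Replies are walked recursively, one quote level deeper.
    for reply in &c.replies {
        render_comment(reply, depth + 1, out);
    }
}
```

Calling `render_comment(&root, 0, &mut out)` on a top-level comment leaves it unquoted and puts each reply one quote level deeper than its parent.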
56 lines
1.6 KiB
Rust
//! Reddit URL helpers for the fetch layer.
//!
//! The JSON API (`*.json`) is blocked. We rewrite all Reddit hosts to
//! `old.reddit.com`, which serves stable server-rendered HTML that
//! `webclaw-core::reddit` parses directly.

pub fn is_reddit_url(url: &str) -> bool {
    webclaw_core::reddit::is_reddit_url(url)
}

/// Rewrite any Reddit host to old.reddit.com, preserving path and query.
pub fn to_old_reddit_url(url: &str) -> String {
    let Some(scheme_end) = url.find("://") else {
        return url.to_string();
    };
    let after = &url[scheme_end + 3..];
    let host_end = after.find(['/', '?', '#']).unwrap_or(after.len());
    let scheme = &url[..scheme_end + 3];
    let rest = &after[host_end..];
    format!("{scheme}old.reddit.com{rest}")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn rewrites_www_to_old() {
        assert_eq!(
            to_old_reddit_url("https://www.reddit.com/r/rust/comments/abc/x/"),
            "https://old.reddit.com/r/rust/comments/abc/x/"
        );
    }

    #[test]
    fn rewrites_bare_to_old() {
        assert_eq!(
            to_old_reddit_url("https://reddit.com/r/rust/"),
            "https://old.reddit.com/r/rust/"
        );
    }

    #[test]
    fn preserves_old_reddit_unchanged() {
        let url = "https://old.reddit.com/r/rust/comments/abc/x/?context=3";
        assert_eq!(to_old_reddit_url(url), url);
    }

    #[test]
    fn preserves_query_and_hash() {
        assert_eq!(
            to_old_reddit_url("https://www.reddit.com/r/rust/?sort=top#anchor"),
            "https://old.reddit.com/r/rust/?sort=top#anchor"
        );
    }
}
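`to_old_reddit_url` replaces whatever host it is given, so it presumably relies on callers checking `is_reddit_url` first. A minimal sketch of a variant that combines the check and the rewrite is shown below. The names `is_reddit_host`, `split_url`, and `rewrite_if_reddit` are hypothetical, and the host rule used here (`reddit.com` or any subdomain) is an assumption standing in for `webclaw_core::reddit::is_reddit_url`, whose actual rules are not shown in this file.

```rust
/// Hypothetical host check: an assumption standing in for
/// `webclaw_core::reddit::is_reddit_url`, not its real logic.
fn is_reddit_host(host: &str) -> bool {
    let host = host.to_ascii_lowercase();
    host == "reddit.com" || host.ends_with(".reddit.com")
}

/// Split "scheme://host/rest" into (scheme including "://", host, rest).
/// Returns None when the input has no "://" separator.
fn split_url(url: &str) -> Option<(&str, &str, &str)> {
    let scheme_end = url.find("://")? + 3;
    let after = &url[scheme_end..];
    let host_end = after.find(['/', '?', '#']).unwrap_or(after.len());
    Some((&url[..scheme_end], &after[..host_end], &after[host_end..]))
}

/// Rewrite to old.reddit.com only when the host looks like Reddit;
/// return None for every other URL so the caller can leave it alone.
pub fn rewrite_if_reddit(url: &str) -> Option<String> {
    let (scheme, host, rest) = split_url(url)?;
    is_reddit_host(host).then(|| format!("{scheme}old.reddit.com{rest}"))
}
```

Like the original helper, this sketch treats everything between `://` and the first `/`, `?`, or `#` as the host, so a URL carrying a port or userinfo would not match the host check.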