webclaw/crates/webclaw-fetch/src/reddit.rs

//! Reddit URL helpers for the fetch layer.
//!
//! Reddit blocks unauthenticated access to its JSON API (`*.json`), so we
//! rewrite all Reddit hosts to `old.reddit.com`, which serves stable
//! server-rendered HTML that `webclaw_core::reddit` parses directly.

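/// Returns whether `url` is recognised as a Reddit URL.
///
/// Delegates to `webclaw_core::reddit::is_reddit_url`.
pub fn is_reddit_url(url: &str) -> bool {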
    webclaw_core::reddit::is_reddit_url(url)
}

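/// Rewrite the host of `url` to `old.reddit.com`, preserving the scheme,
/// path, query, and fragment.
///
/// The host is replaced unconditionally, so check `is_reddit_url` first.
/// Any userinfo or port is dropped along with the original host. Input
/// without a `://` separator is returned unchanged.
pub fn to_old_reddit_url(url: &str) -> String {
    let Some(scheme_end) = url.find("://") else {
        return url.to_string();
    };
    // Everything after "scheme://".
    let after = &url[scheme_end + 3..];
    // The host ends at the first path, query, or fragment delimiter.
    let host_end = after.find(['/', '?', '#']).unwrap_or(after.len());
    let scheme = &url[..scheme_end + 3];
    let rest = &after[host_end..];
    format!("{scheme}old.reddit.com{rest}")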
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn rewrites_www_to_old() {
        assert_eq!(
            to_old_reddit_url("https://www.reddit.com/r/rust/comments/abc/x/"),
            "https://old.reddit.com/r/rust/comments/abc/x/"
        );
    }

    #[test]
    fn rewrites_bare_to_old() {
        assert_eq!(
            to_old_reddit_url("https://reddit.com/r/rust/"),
            "https://old.reddit.com/r/rust/"
        );
    }

    #[test]
    fn preserves_old_reddit_unchanged() {
        let url = "https://old.reddit.com/r/rust/comments/abc/x/?context=3";
        assert_eq!(to_old_reddit_url(url), url);
    }

    #[test]
    fn preserves_query_and_hash() {
        assert_eq!(
            to_old_reddit_url("https://www.reddit.com/r/rust/?sort=top#anchor"),
            "https://old.reddit.com/r/rust/?sort=top#anchor"
        );
    }
}
feat(reddit): parse old.reddit.com HTML instead of the dead .json API

Reddit blocked unauthenticated `.json` access, so the previous extractor
returned block pages or timed out on every thread. Switch to parsing
old.reddit.com's server-rendered HTML, which needs no API key or JS.

Fetch layer:
- Rewrite every Reddit host to old.reddit.com before fetching; drop all
  `.json` URL handling and the JSON response parser.

Extraction (webclaw-core::reddit):
- New HTML parser producing a typed post + nested comment tree.
- Comments nest structurally (.comment > .child > .sitetable > .comment);
  old.reddit omits a usable depth attribute, so the tree is walked
  recursively. Bodies live in .entry > form > .usertext-body > .md.
- Post metadata: title, author, subreddit, score, comment count
  (data-comments-count), self-vs-link (self class / self.* domain),
  flair, self-text body.
- Comment scores read the .score.unvoted title (the displayed value, not
  the ±1 vote-state siblings); hidden scores are None, not 0.
- Deleted comments are kept in place so their replies aren't orphaned;
  "load more comments" stubs are skipped.

Markdown output:
- Reply nesting via blockquote depth (avoids 4-space indentation turning
  text and code fences into broken indented-code blocks).
- Links keep their target as [text](url); root-relative reddit links
  resolve against old.reddit.com. Nested lists indent correctly.
- A recognised but unparseable /comments/ page returns no content rather
  than falling through to generic extraction of Reddit chrome.

Tests: regression suite runs against real old.reddit.com fixtures
(testdata/reddit/), the ground truth that surfaced the parsing and
markdown bugs synthetic HTML had hidden. Fixtures are excluded from the
published crate.

2026-06-04 16:16:08 +02:00
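
The blockquote-based reply nesting and the `None`-for-hidden-score rule described in this commit message could look roughly like the following. This is a minimal sketch under stated assumptions, not webclaw's actual renderer: the `Comment` struct, its fields, `render_comment`, and `example` are hypothetical names introduced only for illustration, and the header format is invented.

```rust
// Hypothetical types for illustration; webclaw's real structs may differ.
#[derive(Debug, Clone)]
pub struct Comment {
    pub author: String,
    /// `None` means the score is hidden, which is distinct from a real 0.
    pub score: Option<i64>,
    pub body: String,
    pub replies: Vec<Comment>,
}

/// Append `comment` and all of its replies to `out`, quoting each line
/// once per nesting level instead of indenting it.
pub fn render_comment(comment: &Comment, depth: usize, out: &mut String) {
    let prefix = "> ".repeat(depth);
    // Quoted blank lines keep the markers but carry no trailing space.
    let blank = prefix.trim_end();

    let score = match comment.score {
        Some(points) => format!("{points} points"),
        None => "score hidden".to_string(),
    };
    out.push_str(&format!("{prefix}**{}** ({score})\n{blank}\n", comment.author));

    for line in comment.body.lines() {
        if line.is_empty() {
            out.push_str(blank);
        } else {
            out.push_str(&prefix);
            out.push_str(line);
        }
        out.push('\n');
    }
    // An unquoted blank line separates this comment from the next one.
    out.push('\n');

    // Recurse structurally: each reply is rendered one level deeper.
    for reply in &comment.replies {
        render_comment(reply, depth + 1, out);
    }
}

/// Build a two-level thread and render it to Markdown.
pub fn example() -> String {
    let thread = Comment {
        author: "alice".to_string(),
        score: Some(12),
        body: "Try old.reddit.com.".to_string(),
        replies: vec![Comment {
            author: "bob".to_string(),
            score: None,
            body: "Thanks!\n\nThat worked.".to_string(),
            replies: Vec::new(),
        }],
    };
    let mut out = String::new();
    render_comment(&thread, 0, &mut out);
    out
}
```

Because depth is expressed with `> ` markers rather than leading spaces, no line in the output ever starts with four spaces, so a Markdown renderer has no reason to treat reply text as an indented code block.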

Initial release: webclaw v0.1.0 - web content extraction for LLMs

CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io

2026-03-23 18:31:11 +01:00