mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-06 22:05:13 +02:00
feat(reddit): parse old.reddit.com HTML instead of the dead .json API
Reddit blocked unauthenticated `.json` access, so the previous extractor returned block pages or timed out on every thread. Switch to parsing old.reddit.com's server-rendered HTML, which needs no API key or JS.

Fetch layer:
- Rewrite every Reddit host to old.reddit.com before fetching; drop all `.json` URL handling and the JSON response parser.

Extraction (webclaw-core::reddit):
- New HTML parser producing a typed post + nested comment tree.
- Comments nest structurally (.comment > .child > .sitetable > .comment); old.reddit omits a usable depth attribute, so the tree is walked recursively. Bodies live in .entry > form > .usertext-body > .md.
- Post metadata: title, author, subreddit, score, comment count (data-comments-count), self-vs-link (self class / self.* domain), flair, self-text body.
- Comment scores read the .score.unvoted title (the displayed value, not the ±1 vote-state siblings); hidden scores are None, not 0.
- Deleted comments are kept in place so their replies aren't orphaned; "load more comments" stubs are skipped.

Markdown output:
- Reply nesting via blockquote depth (avoids 4-space indentation turning text and code fences into broken indented-code blocks).
- Links keep their target as [text](url); root-relative reddit links resolve against old.reddit.com. Nested lists indent correctly.
- A recognised but unparseable /comments/ page returns no content rather than falling through to generic extraction of Reddit chrome.

Tests: regression suite runs against real old.reddit.com fixtures (testdata/reddit/), the ground truth that surfaced the parsing and markdown bugs synthetic HTML had hidden. Fixtures are excluded from the published crate.
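
For illustration only, the blockquote-depth nesting described under "Markdown output" could look roughly like the sketch below. It is not the code from this commit: the `Comment` struct, its field names, and the `render_thread` / `render_comment` / `push_line` helpers are all hypothetical, and the header format ("N points", "score hidden") is an assumption. It only shows why prefixing each line with one `> ` per reply level keeps code fences intact where 4-space indentation would not, and how a deleted comment with a hidden score can stay in place so its replies keep their parent.

```rust
/// Hypothetical comment node; names are illustrative, not webclaw's actual types.
#[derive(Debug, Clone)]
pub struct Comment {
    /// `None` stands for a deleted comment that is kept in the tree.
    pub author: Option<String>,
    /// `None` stands for a hidden score (deliberately not 0).
    pub score: Option<i64>,
    /// Comment body, already converted to markdown.
    pub body: String,
    pub replies: Vec<Comment>,
}

/// Render a list of top-level comments and all of their replies.
pub fn render_thread(comments: &[Comment]) -> String {
    let mut out = String::new();
    for c in comments {
        render_comment(c, 0, &mut out);
    }
    out
}

/// Walk the tree recursively; each reply level adds one "> " prefix.
fn render_comment(c: &Comment, depth: usize, out: &mut String) {
    let prefix = "> ".repeat(depth);
    let author = c.author.as_deref().unwrap_or("[deleted]");
    let header = match c.score {
        Some(s) => format!("**{}** ({} points)", author, s),
        None => format!("**{}** (score hidden)", author),
    };
    push_line(out, &prefix, &header);
    push_line(out, &prefix, "");
    // Prefix every body line, so a fenced code block stays a fenced block
    // inside the quote instead of becoming indented code.
    for line in c.body.lines() {
        push_line(out, &prefix, line);
    }
    push_line(out, &prefix, "");
    for reply in &c.replies {
        render_comment(reply, depth + 1, out);
    }
}

/// Append one prefixed line, trimming trailing whitespace on blank lines.
fn push_line(out: &mut String, prefix: &str, line: &str) {
    let full = format!("{}{}", prefix, line);
    out.push_str(full.trim_end());
    out.push('\n');
}
```

Calling `render_thread` on a top-level comment with one deleted reply, which in turn has one live reply, yields an unquoted top-level block, a `> ` block for the deleted comment, and a `> > ` block for the surviving reply beneath it.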
parent
3b7d11328e
commit
217bfe088b
11 changed files with 2522 additions and 391 deletions
596
crates/webclaw-core/testdata/reddit/askreddit_deep_morechildren.html
vendored
Normal file
File diff suppressed because one or more lines are too long
82
crates/webclaw-core/testdata/reddit/ebpf_6comments.html
vendored
Normal file
File diff suppressed because one or more lines are too long
312
crates/webclaw-core/testdata/reddit/elixir_60comments.html
vendored
Normal file
File diff suppressed because one or more lines are too long
227
crates/webclaw-core/testdata/reddit/pandas_34comments.html
vendored
Normal file
File diff suppressed because one or more lines are too long
234
crates/webclaw-core/testdata/reddit/rust_selfpost_36comments.html
vendored
Normal file
File diff suppressed because one or more lines are too long