feat: v0.1.4 — QuickJS integration for inline JavaScript data extraction

Embeds QuickJS (rquickjs) to execute inline <script> tags and extract
data hidden in JavaScript variable assignments. Captures window.__*
objects like __preloadedData (NYTimes), __PRELOADED_STATE__ (Wired),
and self.__next_f (Next.js RSC flight data).

Results:
- NYTimes: 1,552 → 4,162 words (+168%)
- Wired: 1,459 → 9,937 words (+581%)
- Negligible performance overhead (<15ms per page)
- Feature-gated: build with --no-default-features to disable it for WASM targets

Smart text filtering rejects CSS, base64, file paths, and code strings.
Only readable prose is appended under "## Additional Content".
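
The filtering rule above can be illustrated with a minimal sketch. This is not the code from this commit: the function names `looks_like_prose` and `filter_prose`, and every threshold (5 words, 40-character tokens, 80% letters), are assumptions chosen only to show the kind of heuristic that separates prose from CSS, base64, paths, and code.

```rust
/// Hypothetical heuristic (not the commit's implementation): returns true
/// when `s` looks like readable prose rather than CSS, base64, a file path,
/// or a code string. All thresholds are illustrative assumptions.
fn looks_like_prose(s: &str) -> bool {
    let s = s.trim();
    let words: Vec<&str> = s.split_whitespace().collect();

    // Single tokens and very short strings are usually IDs, paths, or base64.
    if words.len() < 5 {
        return false;
    }

    // Braces and arrow syntax point to CSS rules or JavaScript source.
    if s.contains('{') || s.contains('}') || s.contains("=>") {
        return false;
    }

    // Very long unbroken tokens are typical of base64 or minified code.
    if words.iter().any(|w| w.chars().count() > 40) {
        return false;
    }

    // Prose is dominated by letters and whitespace.
    let total = s.chars().count();
    let texty = s
        .chars()
        .filter(|c| c.is_alphabetic() || c.is_whitespace())
        .count();
    texty * 100 >= total * 80
}

/// Keeps only the candidates that pass the prose heuristic, preserving order.
fn filter_prose<'a>(candidates: &[&'a str]) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|s| looks_like_prose(s))
        .collect()
}
```

In a pipeline like the one described, strings pulled from the captured JavaScript objects would pass through a filter of this kind before being appended under "## Additional Content".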

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Valerio 2026-03-26 10:28:16 +01:00
parent 0c91c6d5a9
commit 32c035c543
6 changed files with 665 additions and 7 deletions


@@ -5,6 +5,10 @@ version.workspace = true
edition.workspace = true
license.workspace = true
[features]
default = ["quickjs"]
quickjs = ["rquickjs"]
[dependencies]
serde = { workspace = true }
serde_json = { workspace = true }
@@ -16,6 +20,7 @@ url = { version = "2", features = ["serde"] }
regex = "1"
once_cell = "1"
similar = "2"
rquickjs = { version = "0.9", features = ["classes", "properties"], optional = true }
[dev-dependencies]
tokio = { workspace = true }