diff --git a/CHANGELOG.md b/CHANGELOG.md index 653c07dd..f415aee7 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -74,6 +74,20 @@ The attack-surface map and chain composer turn the flat finding list into a rout - **DB fast-fail preflight.** `Indexer::init` reads the first 16 bytes of any candidate SQLite file and rejects anything without the standard `SQLite format 3\0` magic. Stops a misnamed JSON / text file from corrupting the index path with a SQLite error halfway through migration. - **Symbolic-execution coverage.** Symex now recognises a wider set of string operations (`substr`, `replace`, `to_lower`, `to_upper`, `trim`, `strlen`) per the value/transfer pipeline, and the abstract-interpretation framework reasons about interval and prefix/suffix string facts during the dynamic verdict pass. +### Fixed (engine correctness) + +- **CFG construction.** Python `if/elif/elif/else` chains no longer drop every alternative past the first, so a sink in a second `elif` or a trailing `else` is analysed (same fix covers PHP `else_if`). C-style `for (init; cond; incr)` loops now lower the initializer and increment, so taint introduced in the loop header (`for (cmd = getenv(...); ...)`) reaches the body. A `switch` `default` is no longer unconditionally hoisted to the chain tail, preserving fall-through order in C/C++/JS/TS/PHP/Java. Source/sink calls inside short-circuit `&&` / `||` operands of an `if` / `while` condition are now classified instead of dropped. +- **SSA lowering of exception handlers.** Catch blocks with internal control flow (an `if` / loop / nested `try` inside the handler) no longer lose every instruction past the catch entry — the orphan subtree is renamed through a virtual-root dominator tree, so sinks reached only inside a `catch` are seen. Catch-side variable reads now resolve to the most entry-dominating reaching definition (the pre-`try` value) rather than a post-join reassignment. A genuine positional argument equal to a chained-call receiver root (`a.b.m(p, a)`) is preserved instead of being stripped as the implicit chain root. +- **Taint soundness.** Sink suppression now gates on `validated_must` (validated on every path) instead of `validated_may` (any path), closing a false-negative where a single validated branch silenced a sink. `is_noreturn_call` no longer matches receiver-qualified `.exit()` / `.abort()` / method calls, so `transaction.abort()` stops wiping taint state. The SSRF same-origin check rejects protocol-relative `//host` prefixes (an open-redirect / SSRF bypass that a bare `/`-prefix check accepted). Inline-return taint unions the derived and parameter-passthrough channels for mixed-return helpers (`if (c) return src(); return x;`). The inline-analysis cache is keyed to exclude callback-bound arguments, so a function-valued argument no longer poisons sibling call sites that pass a different callback. +- **Taint precision.** `String.valueOf(String/Object)` is no longer tagged a safe-string producer (it is an identity passthrough, so `String.valueOf(req.getParameter(...))` was silently suppressed). Cross-parameter sanitizers no longer bleed onto sibling arguments (`f(a, b){ return a + escape(b) }` sanitises only `b`), and a cross-file sanitizer resolved through the coarse summary tier still applies its strip. Relative-URL and host-allowlist cap clearing is alias-aware. Substring-rejection and `indexOf() === -1` idioms are no longer misread as allowlist validation, and dotted multi-argument validators no longer over-validate unrelated targets. +- **Interprocedural resolution.** SCC / topo file batching and reachability key files by their package-qualified namespace, matching the call-graph nodes and SSA summary tier so cross-package callers resolve. Directly self-recursive functions now get SCC fixed-point treatment. Call resolution tolerates under-application (a call supplying fewer arguments than a callee with default / optional parameters) while still degrading to `Ambiguous` rather than a wrong pick. A failed SCC iteration no longer overwrites a file's cached diagnostics with an empty set. JS/TS module resolution appends extensions to dotted specifiers (`./user.service` → `user.service.ts`) and swaps a `.js` import to a `.ts` file (NodeNext / ESM). +- **JS/TS two-level solve.** Pass-2 top-level (global) taint now reaches nested closures two or more scopes deep, and the pass-2 dirty-skip no longer drops a nested body that transitively consumes a changed global through a parent-local. +- **Scan pipeline and index.** `replace_all_for_file` deletes stale SSA summary / body rows unconditionally, so an incremental rescan cannot leave orphaned rows. Cached findings recompute their category instead of being stamped `Security`, so structural warnings keep their real class. The indexed build persists auth summaries and cross-package imports, logs-and-skips an unreadable file instead of aborting the whole build, and keys `FuncSummary` entries to match the SSA tier so an indexed scan and a full scan agree. +- **Language coverage (recall).** KINDS maps were completed so previously-dropped bodies are walked: Java interface / enum / record / `synchronized` blocks, Rust inline `mod { ... }` items, Go labeled-statement bodies, and Ruby lambda / brace-block bodies. Go variadic and Python `*args` / `**kwargs` parameters are seeded with correct arity. C/C++ `scanf` / `fscanf` / `sscanf` / `read` register their output buffers as taint sources. TypeScript gated sinks dropped from the JS mirror were restored (`_.template`, `http.get` / `https.get`, `setValue` / `dotProp.set` / `jp.set`). The weak-hash and HTTP-URL AST patterns match single- and double-quoted string literals across JS, TS, and Ruby. +- **Symbolic execution.** Interprocedural parameter seeding fixed an off-by-one for method calls and now seeds the receiver / self parameter; the cross-file depth guard increments on descent; and a path cut short by the global step budget records `Inconclusive` instead of `Confirmed`. +- **Abstract interpretation and pointer analysis.** Interval division handles the `i64::MIN / -1` overflow (degrading to unbounded instead of a falsely-narrow range) and multiplication computes overflow in `i128`. `AbstractState::leq` checks entries present only in the other state, restoring a sound partial order. The pointer fixpoint re-projects container-element field reads after the receiver's points-to set converges. +- **CFG-level analyses.** Error-fallthrough termination stops at the `if` join point; a guard's constant-operand test refuses a `Source`-labelled call result; guard / sanitizer matchers require a leaf-name boundary (so `invalidate` no longer matches the `validate` guard and `unquote` no longer matches `quote`); resource ownership-transfer requires a real `->field =` assignment rather than any `->` in a span; post-dominators are computed once per resource pass; and the web-entrypoint heuristic confirms web parameters against the candidate handler only, so an unrelated `req` parameter elsewhere in the file no longer promotes batch / CLI functions to web entry points. + ### CLI - **`nyx scan --verify`** (enabled by default in standard builds) and `--backend {auto,process,docker}` select the dynamic-verification harness. `--no-verify` skips verification for a single run without changing config. diff --git a/src/abstract_interp/interval.rs b/src/abstract_interp/interval.rs index 5de1a27c..eccb98a6 100644 --- a/src/abstract_interp/interval.rs +++ b/src/abstract_interp/interval.rs @@ -179,30 +179,21 @@ impl IntervalFact { } match (self.lo, self.hi, other.lo, other.hi) { (Some(a_lo), Some(a_hi), Some(b_lo), Some(b_hi)) => { - let products = [ - a_lo.checked_mul(b_lo), - a_lo.checked_mul(b_hi), - a_hi.checked_mul(b_lo), - a_hi.checked_mul(b_hi), + // Compute all four endpoint products in i128 (no i64 overflow + // possible) so we know the *true* min/max, then attribute + // overflow by which direction escapes the i64 range — not by + // which first-operand endpoint produced it. + let products: [i128; 4] = [ + a_lo as i128 * b_lo as i128, + a_lo as i128 * b_hi as i128, + a_hi as i128 * b_lo as i128, + a_hi as i128 * b_hi as i128, ]; - let lo = products.iter().filter_map(|p| *p).min(); - let hi = products.iter().filter_map(|p| *p).max(); - // If any product overflowed, the corresponding bound is None - if products.iter().any(|p| p.is_none()) { - Self { - lo: if lo.is_some() && products[..2].iter().all(|p| p.is_some()) { - lo - } else { - None - }, - hi: if hi.is_some() && products[2..].iter().all(|p| p.is_some()) { - hi - } else { - None - }, - } - } else { - Self { lo, hi } + let true_min = *products.iter().min().unwrap(); + let true_max = *products.iter().max().unwrap(); + Self { + lo: clamp_lo_i128(true_min), + hi: clamp_hi_i128(true_max), } } _ => Self::top(), @@ -220,15 +211,24 @@ impl IntervalFact { if b_lo <= 0 && b_hi >= 0 { return Self::top(); } - let quotients = [ - a_lo.checked_div(b_lo), - a_lo.checked_div(b_hi), - a_hi.checked_div(b_lo), - a_hi.checked_div(b_hi), + // Compute the four endpoint quotients in i128. This is exact + // for division (the divisor cannot be zero here) and captures + // the i64::MIN / -1 = i64::MAX + 1 overflow case, which + // checked_div would silently drop, producing a falsely narrow + // interval. Attribute the escape by direction: a quotient + // above i64::MAX leaves hi unbounded. + let quotients: [i128; 4] = [ + a_lo as i128 / b_lo as i128, + a_lo as i128 / b_hi as i128, + a_hi as i128 / b_lo as i128, + a_hi as i128 / b_hi as i128, ]; - let lo = quotients.iter().filter_map(|q| *q).min(); - let hi = quotients.iter().filter_map(|q| *q).max(); - Self { lo, hi } + let true_min = *quotients.iter().min().unwrap(); + let true_max = *quotients.iter().max().unwrap(); + Self { + lo: clamp_lo_i128(true_min), + hi: clamp_hi_i128(true_max), + } } _ => Self::top(), } @@ -523,6 +523,28 @@ fn checked_sub_opt(a: Option, b: Option) -> Option { } } +/// Clamp an `i128` lower bound to `Option`. A bound outside the `i64` +/// range is unrepresentable on this side, so we degrade to `None` +/// (−∞), which is a sound over-approximation. Mirrors the overflow handling +/// of `add`/`sub` (overflow → unbounded). +fn clamp_lo_i128(lo: i128) -> Option { + if (i64::MIN as i128..=i64::MAX as i128).contains(&lo) { + Some(lo as i64) + } else { + None + } +} + +/// Clamp an `i128` upper bound to `Option`. A bound outside the `i64` +/// range degrades to `None` (+∞), a sound over-approximation. +fn clamp_hi_i128(hi: i128) -> Option { + if (i64::MIN as i128..=i64::MAX as i128).contains(&hi) { + Some(hi as i64) + } else { + None + } +} + #[cfg(test)] mod tests { use super::*; @@ -822,6 +844,53 @@ mod tests { assert!(r.lo.is_none() || r.hi.is_none()); } + /// Soundness: on overflow `mul` must attribute the unbounded direction + /// by which endpoint product actually escaped the i64 range, not by the + /// first operand's endpoint. `[i64::MIN, 0] * [-1, -1]` reaches + /// `i64::MAX + 1` (unbounded above), so `hi` must be `None`, never a + /// finite value like `0`. + #[test] + fn mul_overflow_attributes_high_bound_unbounded() { + let a = IntervalFact { + lo: Some(i64::MIN), + hi: Some(0), + }; + let b = IntervalFact::exact(-1); + let r = a.mul(&b); + // True range is [0, i64::MAX + 1]: lo = 0 finite, hi unbounded. + assert_eq!(r.lo, Some(0), "mul lo must stay finite at 0"); + assert_eq!( + r.hi, None, + "mul hi must be unbounded: i64::MIN * -1 escapes above i64::MAX" + ); + assert!( + !r.is_proven_bounded(), + "an overflowing product must not be proven-bounded" + ); + } + + /// Symmetric soundness check on the lower bound: + /// `[0, i64::MAX] * [-2, -1]` reaches `-2 * i64::MAX` (unbounded below), + /// so `lo` must be `None`, never a finite floor like `-i64::MAX`. + #[test] + fn mul_overflow_attributes_low_bound_unbounded() { + let a = IntervalFact { + lo: Some(0), + hi: Some(i64::MAX), + }; + let b = IntervalFact { + lo: Some(-2), + hi: Some(-1), + }; + let r = a.mul(&b); + // True range is [-2*i64::MAX, 0]: lo unbounded, hi = 0 finite. + assert_eq!( + r.lo, None, + "mul lo must be unbounded: 0 * -2 .. i64::MAX * -2 escapes below i64::MIN" + ); + assert_eq!(r.hi, Some(0), "mul hi must stay finite at 0"); + } + // ── Bitwise interval transfer tests ──────────────────────────────── #[test] @@ -1062,6 +1131,31 @@ mod tests { ); } + /// Soundness: `[i64::MIN, 0] / [-1, -1]` truly spans `[0, i64::MAX + 1]`, + /// so the result must NOT be proven-bounded. The old `checked_div` + /// implementation silently dropped the overflowing `i64::MIN / -1` + /// quotient and produced a narrow `[0, 0]`, falsely passing + /// `is_proven_bounded()` and defeating SQL/SHELL sink suppression. + #[test] + fn div_i64_min_overflow_not_proven_bounded() { + let a = IntervalFact { + lo: Some(i64::MIN), + hi: Some(0), + }; + let b = IntervalFact::exact(-1); + let r = a.div(&b); + // Lower edge: i64::MIN / -1 overflows above i64::MAX → hi unbounded. + assert_eq!(r.lo, Some(0), "div lo must stay finite at 0"); + assert_eq!( + r.hi, None, + "div hi must be unbounded: i64::MIN / -1 = i64::MAX + 1 escapes i64" + ); + assert!( + !r.is_proven_bounded(), + "an overflowing quotient must not be reported as proven-bounded" + ); + } + /// Modulo with a single-point negative divisor: `[0,10] % -3` must /// be a valid interval (no panic, no negative-zero bound nonsense). #[test] diff --git a/src/abstract_interp/mod.rs b/src/abstract_interp/mod.rs index 46438de3..f462a751 100644 --- a/src/abstract_interp/mod.rs +++ b/src/abstract_interp/mod.rs @@ -488,15 +488,27 @@ impl AbstractState { /// Partial order: self ⊑ other. pub fn leq(&self, other: &Self) -> bool { - // Every non-Top entry in self must have a corresponding entry in other - // with self[v] ⊑ other[v]. Entries only in other are fine (Top ⊑ anything - // is false, but absent self entries are Top which is handled). + // self ⊑ other iff for every SSA value v: self[v] ⊑ other[v], using the + // convention that an absent entry is Top. + // + // Three cases by where v is stored: + // - in both: check self[v] ⊑ other[v] (loop over self below). + // - in self only: other[v] = Top, and self[v] ⊑ Top always holds — ok. + // - in other only: self[v] = Top; since stored entries are non-Top, + // Top ⋢ other[v], so self ⋢ other. This case was previously missed. for (v, val) in &self.values { let other_val = other.get(*v); if !val.leq(&other_val) { return false; } } + // Any value present only in `other` means self[v] = Top ⋢ other[v] + // (other's stored entries are non-Top), so self ⋢ other. + for (v, _) in &other.values { + if self.values.binary_search_by_key(v, |(id, _)| *id).is_err() { + return false; + } + } true } } @@ -675,6 +687,36 @@ mod tests { assert_eq!(v1.interval.hi, None); // grew → widened } + #[test] + fn abstract_state_leq_respects_other_only_entries() { + // self = empty (every value is implicitly Top). + // other = { v1: [0,5] } (a non-Top, hence strictly-lower fact). + // Since Top ⋢ [0,5], empty ⋢ other. + let bounded = AbstractValue { + interval: IntervalFact { + lo: Some(0), + hi: Some(5), + }, + string: StringFact::top(), + bits: BitFact::top(), + path: PathFact::top(), + }; + let empty = AbstractState::empty(); + let mut other = AbstractState::empty(); + other.set(SsaValue(1), bounded); + + // The bug under test: empty.leq(other) used to return true. + assert!( + !empty.leq(&other), + "empty (Top everywhere) must not be ⊑ a state with a bounded entry" + ); + // Sanity: the reverse direction holds (a bounded state ⊑ all-Top). + assert!(other.leq(&empty), "a bounded state must be ⊑ all-Top"); + // Reflexivity still holds. + assert!(other.leq(&other)); + assert!(empty.leq(&empty)); + } + #[test] fn loop_carried_phi_join_and_widen() { // Simulate: x = 0; loop { x = phi(0, x+1) } diff --git a/src/callgraph.rs b/src/callgraph.rs index 82a993d3..7fdad6ca 100644 --- a/src/callgraph.rs +++ b/src/callgraph.rs @@ -160,6 +160,28 @@ pub(crate) fn callee_container_hint(raw: &str) -> &str { "" } +/// Strip the optional `"::"` package prefix from a call-graph +/// namespace, yielding the plain project-relative path. +/// +/// `crate::symbol::namespace_with_package` produces package-qualified +/// namespaces of the form `format!("{pkg}::{rel}")` for any file that +/// lives inside a resolved [`crate::resolve::PackageEntry`] (e.g. any +/// repo with a named `package.json`). The package name never contains +/// `::` and project-relative file paths never contain `::`, so the +/// **first** `::` is unambiguously the package separator and everything +/// after it is the plain `normalize_namespace` form. +/// +/// Used to align two keying conventions that would otherwise never +/// match: call-graph [`FuncKey::namespace`]s (package-qualified) versus +/// file paths normalised via plain `normalize_namespace`. Returns the +/// input unchanged when no `::` is present. +pub(crate) fn strip_package_prefix(ns: &str) -> &str { + match ns.split_once("::") { + Some((_pkg, rel)) => rel, + None => ns, + } +} + // Class / container → method index /// Per-language `(container, method_name)` → candidate [`FuncKey`] index. @@ -897,12 +919,20 @@ impl FileReachMap { /// [`FileReachMap::with_scan_root`] when callers may pass absolute /// paths. pub fn build(cg: &CallGraph) -> Self { + // Call-graph namespaces are package-qualified (`pkg::rel`) for any + // file inside a resolved package, but `reaches` normalises its + // arguments via plain `normalize_namespace` (`rel`). Strip the + // package prefix on both keys and caller-set entries so the stored + // form matches the lookup form; otherwise `reaches` always returns + // false for package-resident files and chain reach widening / + // surface transitive-exposure detection silently never fire. let mut by_callee_ns: HashMap> = HashMap::new(); for callee in cg.index.keys() { - let entry = by_callee_ns.entry(callee.namespace.clone()).or_default(); - entry.insert(callee.namespace.clone()); + let callee_ns = strip_package_prefix(&callee.namespace).to_string(); + let entry = by_callee_ns.entry(callee_ns.clone()).or_default(); + entry.insert(callee_ns); for caller in callers_transitive(cg, callee) { - entry.insert(caller.namespace); + entry.insert(strip_package_prefix(&caller.namespace).to_string()); } } FileReachMap { @@ -941,11 +971,17 @@ impl FileReachMap { } fn normalize<'a>(&self, path: &'a str) -> std::borrow::Cow<'a, str> { + // Reduce both path-normalisation (absolute → project-relative) and + // package-qualification (`pkg::rel` → `rel`) so lookups match the + // package-stripped keys stored by `build`. Inputs may arrive as + // absolute host paths, plain project-relative paths, or + // package-qualified call-graph namespaces. match self.scan_root.as_deref() { Some(root) => { - std::borrow::Cow::Owned(crate::symbol::normalize_namespace(path, Some(root))) + let normalized = crate::symbol::normalize_namespace(path, Some(root)); + std::borrow::Cow::Owned(strip_package_prefix(&normalized).to_string()) } - None => std::borrow::Cow::Borrowed(path), + None => std::borrow::Cow::Borrowed(strip_package_prefix(path)), } } } @@ -993,6 +1029,26 @@ pub fn scc_spans_files(cg: &CallGraph, scc: &[NodeIndex]) -> bool { iter.any(|n| cg.graph[*n].namespace.as_str() != first_ns) } +/// True when an SCC requires the run_topo_batches fixed-point loop. +/// +/// A multi-node SCC is mutually recursive by definition. A **single**-node +/// SCC is recursive only when the function calls itself directly: petgraph's +/// `tarjan_scc` places a directly self-recursive function (`f` calls `f`) in +/// its own singleton SCC with a self-loop edge, so `scc.len() > 1` alone +/// misses it. Self-recursion (tree walkers, retry wrappers, recursive-descent +/// parsers) is far more common than mutual recursion and needs the same +/// fixed-point treatment so its summary converges against its own refined +/// summary rather than being analysed exactly once. +pub fn scc_is_recursive(cg: &CallGraph, scc: &[NodeIndex]) -> bool { + if scc.len() > 1 { + return true; + } + match scc.first() { + Some(&n) => cg.graph.contains_edge(n, n), + None => false, + } +} + /// Map SCC topological order to an ordered sequence of file-path batches /// annotated with whether any contributing SCC is mutually recursive /// (`len > 1`) or cross-file. @@ -1022,14 +1078,22 @@ pub fn scc_file_batches_with_metadata<'a>( // 2. Build file relative-path → (min topo index, has_mutual_recursion, cross_file). // `cross_file` is set whenever the file participates in an SCC whose // nodes span more than one namespace, the cross-file signal. + // + // Call-graph namespaces are package-qualified (`pkg::rel`) for any + // file inside a resolved package, while `rel_to_path` above keys by + // the plain `normalize_namespace` form (`rel`). Strip the package + // prefix here so both conventions agree; otherwise every + // package-resident file misses `file_topo` and lands in `orphans`, + // silently disabling the topo/SCC machinery for exactly the repos it + // was built for (any project with a named package.json). let mut file_topo: HashMap<&str, (usize, bool, bool)> = HashMap::new(); for (topo_pos, &scc_idx) in analysis.topo_scc_callee_first.iter().enumerate() { - let scc_recursive = analysis.sccs[scc_idx].len() > 1; + let scc_recursive = scc_is_recursive(cg, &analysis.sccs[scc_idx]); let scc_cross_file = scc_spans_files(cg, &analysis.sccs[scc_idx]); for &node in &analysis.sccs[scc_idx] { - let ns = &cg.graph[node].namespace; + let ns = strip_package_prefix(&cg.graph[node].namespace); file_topo - .entry(ns.as_str()) + .entry(ns) .and_modify(|(min_pos, recursive, cross_file)| { if topo_pos < *min_pos { *min_pos = topo_pos; @@ -1782,6 +1846,110 @@ mod tests { } } + // ── package-prefix namespace alignment (finding #35) ────────────── + + #[test] + fn strip_package_prefix_handles_qualified_and_plain() { + // Package-qualified: prefix stripped at the first "::". + assert_eq!(strip_package_prefix("myapp::src/a.js"), "src/a.js"); + assert_eq!(strip_package_prefix("@scope/name::src/a.ts"), "src/a.ts"); + // Plain relative path with no package prefix is returned verbatim. + assert_eq!(strip_package_prefix("src/a.js"), "src/a.js"); + assert_eq!(strip_package_prefix("a.rs"), "a.rs"); + assert_eq!(strip_package_prefix(""), ""); + } + + /// A call graph whose namespaces are package-qualified (the normal case + /// for any repo with a named package.json) must still match the plain + /// `normalize_namespace` file paths in `all_files`, so files land in + /// topo batches rather than `orphans`. + #[test] + fn scc_file_batches_with_metadata_matches_package_qualified_ns() { + let root = Path::new("/proj"); + // file_path values are NOT prefixed by root, so func_key keeps them + // verbatim → simulates package-qualified call-graph namespaces. + let a = make_summary("ping", "myapp::a.rs", "rust", 0, vec!["pong"]); + let b = make_summary("pong", "myapp::b.rs", "rust", 0, vec!["ping"]); + + // all_files are absolute paths under root → normalize to "a.rs"/"b.rs". + let files: Vec = vec![PathBuf::from("/proj/a.rs"), PathBuf::from("/proj/b.rs")]; + + let (batches, orphans) = build_metadata_batches(vec![a, b], &files, root); + + assert!( + orphans.is_empty(), + "package-qualified files must not be orphaned: {orphans:?}" + ); + assert_eq!(batches.len(), 1, "mutual recursion → single batch"); + assert!(batches[0].has_mutual_recursion); + } + + /// FileReachMap built from a package-qualified call graph must resolve + /// plain-relative `reaches` lookups. + #[test] + fn file_reach_map_matches_package_qualified_ns() { + let handle = make_summary("handle", "myapp::routes.rs", "rust", 0, vec!["sink"]); + let sink = make_summary("sink", "myapp::helper.rs", "rust", 0, vec![]); + let gs = merge_summaries(vec![handle, sink], None); + let cg = build_call_graph(&gs, &[]); + let reach = FileReachMap::build(&cg); + + // Plain-relative lookups resolve even though graph keys are qualified. + assert!(reach.reaches("routes.rs", "helper.rs")); + // A package-qualified caller string also resolves (normalize strips it). + assert!(reach.reaches("myapp::routes.rs", "helper.rs")); + } + + // ── self-recursion fixed-point flag (finding #39) ───────────────── + + /// A directly self-recursive function forms a singleton SCC with a + /// self-loop edge. `scc_is_recursive` must flag it so run_topo_batches + /// applies the fixed-point loop, matching the mutual-recursion path. + #[test] + fn scc_is_recursive_flags_self_loop_singleton() { + let root = Path::new("/proj"); + let f = make_summary("f", "/proj/a.rs", "rust", 0, vec!["f"]); + let gs = merge_summaries(vec![f], Some(&root.to_string_lossy())); + let cg = build_call_graph(&gs, &[]); + let analysis = analyse(&cg); + + // The single SCC is a self-loop singleton. + assert_eq!(analysis.sccs.len(), 1); + assert!( + scc_is_recursive(&cg, &analysis.sccs[0]), + "self-recursive singleton must be flagged recursive" + ); + // scc_spans_files correctly stays false for singletons. + assert!(!scc_spans_files(&cg, &analysis.sccs[0])); + } + + /// A non-recursive singleton (no self-edge) must NOT be flagged. + #[test] + fn scc_is_recursive_ignores_plain_singleton() { + let root = Path::new("/proj"); + let f = make_summary("f", "/proj/a.rs", "rust", 0, vec![]); + let gs = merge_summaries(vec![f], Some(&root.to_string_lossy())); + let cg = build_call_graph(&gs, &[]); + let analysis = analyse(&cg); + assert_eq!(analysis.sccs.len(), 1); + assert!(!scc_is_recursive(&cg, &analysis.sccs[0])); + } + + /// The self-recursive flag propagates through to the FileBatch metadata. + #[test] + fn scc_file_batches_with_metadata_marks_self_recursive() { + let root = Path::new("/proj"); + let f = make_summary("f", "/proj/a.rs", "rust", 0, vec!["f"]); + let files: Vec = vec![PathBuf::from("/proj/a.rs")]; + let (batches, orphans) = build_metadata_batches(vec![f], &files, root); + assert!(orphans.is_empty()); + assert_eq!(batches.len(), 1); + assert!( + batches[0].has_mutual_recursion, + "self-recursive file must be flagged for the fixed-point loop" + ); + } + // ── qualified disambiguation resolves ambiguous common names ────── #[test] diff --git a/src/cfg/blocks.rs b/src/cfg/blocks.rs index aedd3aa7..c66690ba 100644 --- a/src/cfg/blocks.rs +++ b/src/cfg/blocks.rs @@ -16,6 +16,28 @@ fn lang_has_exclusive_cases(lang: &str) -> bool { matches!(lang, "rust" | "go") } +/// True when *this specific switch* has guaranteed-exclusive (non-fall-through) +/// cases, so it is safe to reorder the `default` arm to the cascade tail. +/// +/// Rust `match` and Go `switch` are always exclusive. Java mixes shapes: the +/// arrow form (`switch_rule` cases, `case x -> ...`) is exclusive, but the +/// classic colon form (`switch_block_statement_group`, `case x:` with implicit +/// fall-through) is NOT. C/C++/JS/TS/PHP classic switches fall through and are +/// never exclusive. Reordering `default` to the tail is only correct for the +/// exclusive shapes; doing it for fall-through switches connects the wrong +/// case bodies in the source-order fall-through chain (both missed and phantom +/// taint flows). +fn switch_is_exclusive(lang: &str, cases: &[(Node<'_>, bool)]) -> bool { + if lang_has_exclusive_cases(lang) { + return true; + } + if lang == "java" { + // Arrow-switch when every case is the arrow `switch_rule` shape. + return cases.iter().all(|(c, _)| c.kind() == "switch_rule"); + } + false +} + /// Extract the scrutinee subtree from a switch-like AST node. /// /// Returns the AST node referenced by the language's scrutinee field. Only @@ -591,12 +613,21 @@ pub(super) fn build_switch<'a>( return exits; } + // Whether this switch's cases are mutually exclusive (no fall-through). + // Only exclusive switches may have `default` reordered to the cascade tail. + let is_exclusive = switch_is_exclusive(lang, &cases); + // Reorder so the default arm (if any) sits at the tail of the cascade. - // Reordering case dispatch is semantically harmless (mutually exclusive - // pattern matches), and it keeps the chain a clean Branch(True→case, - // False→next). Fall-through chains are a separate Seq layer below. + // Reordering case dispatch is semantically harmless ONLY for mutually + // exclusive pattern matches (Rust match, Go switch, Java arrow-switch); it + // keeps the chain a clean Branch(True→case, False→next). For classic + // fall-through switches (C/C++/JS/TS/PHP, Java colon-switch) a mid-chain + // `default:` can fall into the following case and a preceding case can fall + // into it, so the source order MUST be preserved — reordering there breaks + // the fall-through Seq layer and produces both missed and phantom flows. let default_pos = cases.iter().position(|(_, d)| *d); - if let Some(pos) = default_pos + if is_exclusive + && let Some(pos) = default_pos && pos != cases.len() - 1 { let default_pair = cases.remove(pos); @@ -648,22 +679,40 @@ pub(super) fn build_switch<'a>( let mut fallthrough_exits: Vec = Vec::new(); let mut last_header_false: Option = None; let mut chain_preds: Vec = preds.to_vec(); + // First node of the `default` body for a fall-through switch where the + // default is NOT at the tail. The cumulative no-match path (the last + // non-default header's False edge) is wired into it after the loop, so the + // default stays in its source position for the fall-through Seq layer while + // still being reachable when no case matches. + let mut pending_default_no_match: Option = None; for (idx, (case, is_default)) in cases.iter().copied().enumerate() { let is_last = idx + 1 == cases.len(); + // A `default` arm carries no discriminant test, so it never gets its + // own dispatch If. For exclusive switches it has been reordered to the + // tail (`is_last`); for fall-through switches it stays in source + // position (`!is_exclusive`) and is wired into the Seq fall-through + // chain instead of acting as a conditional branch. + let default_no_dispatch = is_default && (is_last || !is_exclusive); + // Default at the chain tail doesn't get its own dispatch If, the // previous header's False edge already targets it directly. - let case_first_preds: Vec = if is_default && is_last { - // First node of the default body becomes the False target of the - // previous header. Build the case with the previous chain_preds - // (the last header's "fall-through" branch) plus any fallthrough - // from the preceding case. - let mut p = chain_preds.clone(); - p.append(&mut fallthrough_exits); - // `last_header_false` will receive a False edge once we know the - // first node of this body. - last_header_false = chain_preds.first().copied(); + let case_first_preds: Vec = if default_no_dispatch { + // Body entry = fall-through from the preceding case body. + let mut p = std::mem::take(&mut fallthrough_exits); + if is_last { + // Tail default: the previous header's False branch also lands + // here directly (legacy behavior preserved for exclusive + // switches and tail defaults). + p.extend(chain_preds.iter().copied()); + last_header_false = chain_preds.first().copied(); + } + // For a non-tail (fall-through) default the dispatch chain must + // continue PAST it, so `chain_preds` / `last_header_false` are left + // untouched and the next case's dispatch header still receives the + // previous header's False edge. The cumulative no-match entry is + // recorded below once the body's first node is known. p } else { // Normal case: synthesize a per-case dispatch header. We tie it @@ -741,12 +790,21 @@ pub(super) fn build_switch<'a>( // Wire the dispatch True edge from this header (or from the previous // header for a tail-default) to the first node of the case body. if body_first_idx.index() < g.node_count() { - let header_for_true = if is_default && is_last { - // The previous header's False already lands here via the - // EdgeKind::Seq inside `case_first_preds`; we additionally - // emit a False edge directly so SSA labels the branch. - if let Some(prev) = last_header_false { - g.add_edge(prev, body_first_idx, EdgeKind::False); + let header_for_true = if default_no_dispatch { + if is_last { + // Tail default: the previous header's False already lands + // here via the EdgeKind::Seq inside `case_first_preds`; we + // additionally emit a False edge directly so SSA labels the + // branch. + if let Some(prev) = last_header_false { + g.add_edge(prev, body_first_idx, EdgeKind::False); + } + } else { + // Non-tail fall-through default: defer wiring the no-match + // entry until the last non-default header's False edge is + // known (after the loop). The body's only in-edge for now + // is the source-order fall-through from the preceding case. + pending_default_no_match = Some(body_first_idx); } None } else { @@ -762,11 +820,22 @@ pub(super) fn build_switch<'a>( let _ = is_default; } - // After the chain: the last non-default header (if no default arm) needs - // a False edge that escapes to the post-switch frontier. + // Resolve the cumulative no-match (the last non-default header's False + // edge): + // - If the `default` arm sits mid-chain (fall-through switch), the + // no-match path enters the default body — wire the deferred False edge + // into it. The default stayed in source position for the fall-through + // Seq layer, so this is the only edge making it reachable on no-match. + // - Otherwise, with no reachable default (no default arm, or it was the + // tail and already consumed the False edge), the no-match path escapes + // to the post-switch frontier. let mut exits: Vec = switch_breaks; exits.append(&mut fallthrough_exits); - if !has_default { + if let Some(default_first) = pending_default_no_match { + if let Some(prev) = last_header_false { + g.add_edge(prev, default_first, EdgeKind::False); + } + } else if !has_default { if let Some(prev) = last_header_false { exits.push(prev); } diff --git a/src/cfg/conditions.rs b/src/cfg/conditions.rs index 1ffd3bde..dfe3860c 100644 --- a/src/cfg/conditions.rs +++ b/src/cfg/conditions.rs @@ -64,13 +64,26 @@ pub(super) fn get_boolean_operands<'a>(node: Node<'a>) -> Option<(Node<'a>, Node None } -/// Create a lightweight `StmtKind::If` node for a sub-condition in a boolean chain. +/// Create a `StmtKind::If` node for a sub-condition in a boolean chain. +/// +/// When the operand contains a classifiable call (`if (flag && cp.execSync(x))`), +/// the node is built via [`push_node`] so the inner source/sink/sanitizer +/// classification — callee, labels, arg-uses, gated sinks — is preserved, then +/// the branch-condition metadata is overlaid on top. Without this, a sink or +/// source CALL inside a short-circuited operand was dropped entirely (the bare +/// node only carried `condition_vars`, no callee or labels), so the +/// `if (flag && sink(x))` form missed flows that the un-decomposed +/// `if (sink(x))` form catches via the mod.rs condition-call fallback. This +/// mirrors `lower_ternary_branch`, which already uses `push_node` for the same +/// reason. Non-call operands keep the original lightweight shape. pub(super) fn push_condition_node<'a>( g: &mut Cfg, cond_ast: Node<'a>, lang: &str, code: &'a [u8], enclosing_func: Option<&str>, + call_ordinal: &mut u32, + analysis_rules: Option<&LangAnalysisRules>, ) -> NodeIndex { // Pass cond_ast as both args, sub-conditions are never `unless` nodes let (inner, negated) = detect_negation(cond_ast, cond_ast, lang); @@ -94,6 +107,44 @@ pub(super) fn push_condition_node<'a>( // because the per-disjunct cond nodes (built via // `build_condition_chain`) didn't populate `taint.uses`. let uses_for_taint: Vec = vars.clone(); + + // Operand containing a call → route through `push_node` for full + // source/sink/sanitizer classification, then overlay branch metadata. + if has_call_descendant(inner, lang) { + let ord = *call_ordinal; + *call_ordinal += 1; + let node = push_node( + g, + StmtKind::If, + inner, + lang, + code, + enclosing_func, + ord, + analysis_rules, + ); + // Overlay the branch-condition metadata that the chain wiring depends + // on. `push_node` set `defines`/`uses`/`labels`/`callee` from the call; + // keep those, but ensure the condition fields and the interned + // condition vars (for `apply_branch_predicates`) are present, and the + // span/enclosing_func point at the operand expression. + let info = &mut g[node]; + info.kind = StmtKind::If; + info.ast.span = span; + info.ast.enclosing_func = enclosing_func.map(|s| s.to_string()); + info.condition_text = text; + info.condition_vars = vars; + info.condition_negated = negated; + // Union the condition idents into `taint.uses` so branch-predicate + // lookup keeps working without clobbering the call's own arg uses. + for v in &uses_for_taint { + if !info.taint.uses.contains(v) { + info.taint.uses.push(v.clone()); + } + } + return node; + } + g.add_node(NodeInfo { kind: StmtKind::If, ast: AstMeta { @@ -221,7 +272,15 @@ pub(super) fn build_ternary_diamond<'a>( // 1. Condition header. `push_condition_node` sets span/text/vars/negated // but leaves `is_eq_with_const` default; stamp it explicitly so the // taint engine's equality-narrowing fires for `x === 'literal' ? …`. - let cond_if = push_condition_node(g, cond_ast, lang, code, enclosing_func); + let cond_if = push_condition_node( + g, + cond_ast, + lang, + code, + enclosing_func, + call_ordinal, + analysis_rules, + ); g[cond_if].is_eq_with_const = detect_eq_with_const(cond_ast, lang); // Capture the pure int-arith + comparison tree so `fold_constant_branches` // can prune a dead constant-condition arm of the ternary (e.g. Java @@ -488,6 +547,7 @@ pub(super) fn classify_ternary_lhs( /// /// Returns `(true_exits, false_exits)`, the sets of nodes from which True/False /// edges should connect to the then/else branches. +#[allow(clippy::too_many_arguments)] pub(super) fn build_condition_chain<'a>( cond_ast: Node<'a>, preds: &[NodeIndex], @@ -496,6 +556,8 @@ pub(super) fn build_condition_chain<'a>( lang: &str, code: &'a [u8], enclosing_func: Option<&str>, + call_ordinal: &mut u32, + analysis_rules: Option<&LangAnalysisRules>, ) -> (Vec, Vec) { let inner = unwrap_parens(cond_ast); @@ -503,8 +565,17 @@ pub(super) fn build_condition_chain<'a>( Some(BoolOp::And) => { if let Some((left, right)) = get_boolean_operands(inner) { // Left operand with current preds - let (left_true, left_false) = - build_condition_chain(left, preds, pred_edge, g, lang, code, enclosing_func); + let (left_true, left_false) = build_condition_chain( + left, + preds, + pred_edge, + g, + lang, + code, + enclosing_func, + call_ordinal, + analysis_rules, + ); // Right operand only evaluated when left is true let (right_true, right_false) = build_condition_chain( right, @@ -514,6 +585,8 @@ pub(super) fn build_condition_chain<'a>( lang, code, enclosing_func, + call_ordinal, + analysis_rules, ); // AND: true only when both true; false when either false let mut false_exits = left_false; @@ -521,7 +594,15 @@ pub(super) fn build_condition_chain<'a>( (right_true, false_exits) } else { // Safety fallback: treat as leaf - let node = push_condition_node(g, inner, lang, code, enclosing_func); + let node = push_condition_node( + g, + inner, + lang, + code, + enclosing_func, + call_ordinal, + analysis_rules, + ); connect_all(g, preds, node, pred_edge); (vec![node], vec![node]) } @@ -529,8 +610,17 @@ pub(super) fn build_condition_chain<'a>( Some(BoolOp::Or) => { if let Some((left, right)) = get_boolean_operands(inner) { // Left operand with current preds - let (left_true, left_false) = - build_condition_chain(left, preds, pred_edge, g, lang, code, enclosing_func); + let (left_true, left_false) = build_condition_chain( + left, + preds, + pred_edge, + g, + lang, + code, + enclosing_func, + call_ordinal, + analysis_rules, + ); // Right operand only evaluated when left is false let (right_true, right_false) = build_condition_chain( right, @@ -540,6 +630,8 @@ pub(super) fn build_condition_chain<'a>( lang, code, enclosing_func, + call_ordinal, + analysis_rules, ); // OR: true when either true; false only when both false let mut true_exits = left_true; @@ -547,14 +639,30 @@ pub(super) fn build_condition_chain<'a>( (true_exits, right_false) } else { // Safety fallback: treat as leaf - let node = push_condition_node(g, inner, lang, code, enclosing_func); + let node = push_condition_node( + g, + inner, + lang, + code, + enclosing_func, + call_ordinal, + analysis_rules, + ); connect_all(g, preds, node, pred_edge); (vec![node], vec![node]) } } None => { // Leaf: single condition node - let node = push_condition_node(g, inner, lang, code, enclosing_func); + let node = push_condition_node( + g, + inner, + lang, + code, + enclosing_func, + call_ordinal, + analysis_rules, + ); connect_all(g, preds, node, pred_edge); (vec![node], vec![node]) } diff --git a/src/cfg/mod.rs b/src/cfg/mod.rs index c524a022..d36db47f 100644 --- a/src/cfg/mod.rs +++ b/src/cfg/mod.rs @@ -3541,9 +3541,17 @@ pub(super) fn push_node<'a>( None }; - // Extract condition metadata for If nodes. + // Extract condition metadata for If nodes. Python `elif_clause` and PHP + // `else_if_clause` are lowered as guard nodes by `build_alternative_chain` + // (the flat-sibling elif-chain handler) and carry their own `condition` + // field, so they must also receive condition-metadata extraction even + // though their grammar kind maps to `Kind::Block`. These two node kinds + // only ever reach `push_node` from that handler, so widening the guard + // here cannot affect ordinary block lowering. let (condition_text, condition_vars, condition_negated, cond_arith) = - if matches!(lookup(lang, ast.kind()), Kind::If) { + if matches!(lookup(lang, ast.kind()), Kind::If) + || matches!(ast.kind(), "elif_clause" | "else_if_clause") + { extract_condition_raw(ast, lang, code) } else { (None, Vec::new(), false, None) @@ -5074,6 +5082,230 @@ fn apply_arg_source_bindings( } } +/// Lower a flat chain of `alternative` siblings (Python `elif_clause`s / PHP +/// `else_if_clause`s, optionally trailed by an `else_clause`) into a properly +/// nested else-if CFG. +/// +/// tree-sitter exposes these clauses as repeated, FLAT `alternative` fields on +/// a single `if_statement` rather than nesting each `else if` inside the +/// previous `else` (as JS/TS/Rust/Go/Java/C do). The default single-`else` +/// lowering only consumed the first sibling, silently dropping every 2nd+ elif +/// and the trailing else — so any source/sink/sanitizer/guard living there was +/// invisible to the whole pipeline. This builder restores the missing CFG +/// nodes by chaining each elif as a guard whose False edge flows into the next +/// alternative. +/// +/// `incoming_edge` is the edge label used to enter the chain (the parent `if`'s +/// false edge — `EdgeKind::False` normally, `EdgeKind::True` for Ruby +/// `unless`). Returns the union of all branch exits plus the final +/// fall-through (when the chain has no terminal `else`). +#[allow(clippy::too_many_arguments)] +pub(super) fn build_alternative_chain<'a>( + alternatives: &[Node<'a>], + preds: &[NodeIndex], + incoming_edge: EdgeKind, + g: &mut Cfg, + lang: &str, + code: &'a [u8], + summaries: &mut FuncSummaries, + file_path: &str, + enclosing_func: Option<&str>, + call_ordinal: &mut u32, + analysis_rules: Option<&LangAnalysisRules>, + break_targets: &mut Vec, + continue_targets: &mut Vec, + throw_targets: &mut Vec, + bodies: &mut Vec, + next_body_id: &mut u32, + current_body_id: BodyId, +) -> Vec { + // Predecessor frontier entering the current alternative, and the edge label + // to use when wiring into it. Updated as we descend the chain: each elif's + // False edge becomes the predecessor/edge for the next alternative. + let mut chain_preds: Vec = preds.to_vec(); + let mut chain_edge = incoming_edge; + let mut exits: Vec = Vec::new(); + + for &alt in alternatives { + let is_elif = matches!(alt.kind(), "elif_clause" | "else_if_clause"); + + if is_elif { + // Guard node for the elif condition. `push_node` with `StmtKind::If` + // extracts condition metadata and runs label classification on the + // clause (its `condition` field), so a source/sink call inside an + // elif condition is no longer dropped. + let guard = push_node( + g, + StmtKind::If, + alt, + lang, + code, + enclosing_func, + 0, + analysis_rules, + ); + connect_all(g, &chain_preds, guard, chain_edge); + + // True branch: the elif body (`consequence` for Python, `body` for + // PHP / colon-block forms). + let body = alt + .child_by_field_name("consequence") + .or_else(|| alt.child_by_field_name("body")); + if let Some(b) = body { + let body_first = NodeIndex::new(g.node_count()); + let body_exits = build_sub( + b, + &[guard], + g, + lang, + code, + summaries, + file_path, + enclosing_func, + call_ordinal, + analysis_rules, + break_targets, + continue_targets, + throw_targets, + bodies, + next_body_id, + current_body_id, + ); + if body_first.index() < g.node_count() { + connect_all(g, &[guard], body_first, EdgeKind::True); + exits.extend(body_exits); + } else if let Some(&first) = body_exits.first() { + connect_all(g, &[guard], first, EdgeKind::True); + exits.extend(body_exits); + } else { + // Empty body: the guard's True edge falls through. + exits.push(guard); + } + } else { + exits.push(guard); + } + + // False branch descends to the next alternative. + chain_preds = vec![guard]; + chain_edge = EdgeKind::False; + } else { + // Terminal `else_clause` (or any non-guard block): lower its body and + // end the chain. The else body field is `body` for both Python and + // PHP `else_clause`; fall back to the clause node itself. + let body = alt.child_by_field_name("body").unwrap_or(alt); + let body_first = NodeIndex::new(g.node_count()); + let body_exits = build_sub( + body, + &chain_preds, + g, + lang, + code, + summaries, + file_path, + enclosing_func, + call_ordinal, + analysis_rules, + break_targets, + continue_targets, + throw_targets, + bodies, + next_body_id, + current_body_id, + ); + if body_first.index() < g.node_count() { + connect_all(g, &chain_preds, body_first, chain_edge); + exits.extend(body_exits); + } else if let Some(&first) = body_exits.first() { + connect_all(g, &chain_preds, first, chain_edge); + exits.extend(body_exits); + } else { + exits.extend(chain_preds.iter().copied()); + } + // An else clause terminates the chain; nothing follows it. + return exits; + } + } + + // Chain ended with an elif (no terminal else): the last guard's False edge + // is the fall-through path out of the whole if/elif construct. Materialise + // it as a synthetic pass-through so the false edge has a concrete target, + // mirroring the no-else branch in the `Kind::If` arm. + let pass = g.add_node(NodeInfo { + kind: StmtKind::Seq, + ast: AstMeta { + span: chain_preds + .first() + .map(|&n| g[n].ast.span) + .unwrap_or((0, 0)), + enclosing_func: enclosing_func.map(|s| s.to_string()), + }, + ..Default::default() + }); + connect_all(g, &chain_preds, pass, chain_edge); + exits.push(pass); + exits +} + +/// Lower a C-style `for` loop's increment/update clause onto the back-edge +/// path. `back_sources` are the body/continue exits that would otherwise +/// back-edge straight to the loop header. When an `update_subtree` exists it +/// is lowered from those sources and its exits are returned, so callers +/// back-edge the update's exits to the header instead — making increment-clause +/// side effects (assignments, sanitizer calls) visible to taint analysis. +/// Without an update clause the input sources are returned unchanged, so the +/// CFG is bit-identical to the pre-fix behaviour. +#[allow(clippy::too_many_arguments)] +pub(super) fn lower_loop_update<'a>( + update_subtree: Option>, + back_sources: &[NodeIndex], + g: &mut Cfg, + lang: &str, + code: &'a [u8], + summaries: &mut FuncSummaries, + file_path: &str, + enclosing_func: Option<&str>, + call_ordinal: &mut u32, + analysis_rules: Option<&LangAnalysisRules>, + break_targets: &mut Vec, + continue_targets: &mut Vec, + throw_targets: &mut Vec, + bodies: &mut Vec, + next_body_id: &mut u32, + current_body_id: BodyId, +) -> Vec { + let Some(update) = update_subtree else { + return back_sources.to_vec(); + }; + if back_sources.is_empty() { + return Vec::new(); + } + let update_exits = build_sub( + update, + back_sources, + g, + lang, + code, + summaries, + file_path, + enclosing_func, + call_ordinal, + analysis_rules, + break_targets, + continue_targets, + throw_targets, + bodies, + next_body_id, + current_body_id, + ); + // If the update produced no CFG nodes (e.g. it was pure trivia), preserve + // the original back-edge sources so the loop still closes. + if update_exits.is_empty() { + back_sources.to_vec() + } else { + update_exits + } +} + // The recursive *work‑horse* that converts an AST node into a CFG slice. // Returns the set of *exit* nodes that need to be wired further. #[allow(clippy::too_many_arguments)] @@ -5183,6 +5415,8 @@ pub(super) fn build_sub<'a>( lang, code, enclosing_func, + call_ordinal, + analysis_rules, ) } else { // Single-node path (original behavior) @@ -5214,14 +5448,27 @@ pub(super) fn build_sub<'a>( // Locate then & else blocks using field-based lookup first, // then positional fallback (Rust uses positional blocks). - let (then_block, else_block) = { + // + // `alternatives` collects *every* `alternative` field, not just the + // first. In tree-sitter-python and tree-sitter-php an + // `if/elif/.../else` (Python) or `if/elseif/.../else` (PHP) chain + // produces several FLAT sibling `alternative` fields on one + // `if_statement` (a list of `elif_clause`/`else_if_clause` nodes + // optionally trailed by an `else_clause`). JS/TS/Rust/Go/Java/C + // nest their `else if`, so they expose at most one `alternative` + // (the nested if) and the list has length ≤ 1. + let (then_block, else_block, alternatives) = { let field_then = ast .child_by_field_name("consequence") .or_else(|| ast.child_by_field_name("body")); - let field_else = ast.child_by_field_name("alternative"); + let mut alt_cursor = ast.walk(); + let alternatives: Vec = ast + .children_by_field_name("alternative", &mut alt_cursor) + .collect(); + let field_else = alternatives.first().copied(); if field_then.is_some() || field_else.is_some() { - (field_then, field_else) + (field_then, field_else, alternatives) } else { // Fallback: positional block children (Rust `if_expression`) let mut cursor = ast.walk(); @@ -5229,10 +5476,22 @@ pub(super) fn build_sub<'a>( .children(&mut cursor) .filter(|n| lookup(lang, n.kind()) == Kind::Block) .collect(); - (blocks.first().copied(), blocks.get(1).copied()) + (blocks.first().copied(), blocks.get(1).copied(), Vec::new()) } }; + // A flat elif/elseif chain has 2+ `alternative` siblings, or a + // single `alternative` that is itself an elif/else-if clause (the + // `if a: .. elif b: ..` form with no trailing `else`, where the lone + // alternative is an `elif_clause` rather than a nested if). In both + // cases the default single-`else_block` lowering below would either + // drop later siblings entirely or fail to treat the elif condition + // as a branch guard, so route through the dedicated chain builder. + let is_flat_elif_chain = alternatives.len() > 1 + || alternatives + .first() + .is_some_and(|a| matches!(a.kind(), "elif_clause" | "else_if_clause")); + // THEN branch let then_first_node = NodeIndex::new(g.node_count()); let then_exits = if let Some(b) = then_block { @@ -5267,7 +5526,30 @@ pub(super) fn build_sub<'a>( // ELSE branch let else_first_node = NodeIndex::new(g.node_count()); - let else_exits = if let Some(b) = else_block { + let else_exits = if is_flat_elif_chain { + // Flat elif/elseif chain: lower every alternative sibling as a + // proper nested else-if so 2nd+ elif and trailing else clauses + // are no longer dropped from the CFG. + build_alternative_chain( + &alternatives, + else_preds, + else_edge, + g, + lang, + code, + summaries, + file_path, + enclosing_func, + call_ordinal, + analysis_rules, + break_targets, + continue_targets, + throw_targets, + bodies, + next_body_id, + current_body_id, + ) + } else if let Some(b) = else_block { let exits = build_sub( b, else_preds, @@ -5380,6 +5662,61 @@ pub(super) fn build_sub<'a>( // WHILE / FOR: classic loop with a back edge. Kind::While | Kind::For => { + // C-style `for (init; cond; incr) body` loops (C/C++, JS/TS, Go + // three-clause, PHP) attach `initializer`/`update` (or `increment`) + // subtrees that the previous loop lowering ignored entirely — so a + // taint source bound in the init (`for (cmd = getenv("X"); …)`) or a + // side effect in the increment had no CFG node at all and was + // invisible to taint analysis. Tree-sitter exposes these either as + // direct fields on the loop node (C/C++/JS/TS/PHP) or nested under a + // `for_clause` child (Go's three-clause form). The init runs once + // before the header; its exits become the header's predecessors so + // its defs flow into both the condition and the body. + let clause_owner = ast + .child_by_field_name("body") + .is_none() + .then(|| { + let mut c = ast.walk(); + ast.children(&mut c).find(|n| n.kind() == "for_clause") + }) + .flatten(); + let init_subtree = ast + .child_by_field_name("initializer") + .or_else(|| ast.child_by_field_name("initialize")) + .or_else(|| clause_owner.and_then(|fc| fc.child_by_field_name("initializer"))); + let update_subtree = ast + .child_by_field_name("update") + .or_else(|| ast.child_by_field_name("increment")) + .or_else(|| clause_owner.and_then(|fc| fc.child_by_field_name("update"))); + + // Lower the initializer (if any) from `preds`; its exits become the + // header's predecessors. Empty / absent inits leave `preds` + // bit-identical to the pre-fix behaviour. + let init_exits_owned = init_subtree.map(|init| { + build_sub( + init, + preds, + g, + lang, + code, + summaries, + file_path, + enclosing_func, + call_ordinal, + analysis_rules, + break_targets, + continue_targets, + throw_targets, + bodies, + next_body_id, + current_body_id, + ) + }); + let header_preds: &[NodeIndex] = match &init_exits_owned { + Some(exits) if !exits.is_empty() => exits.as_slice(), + _ => preds, + }; + let header = push_node( g, StmtKind::Loop, @@ -5390,7 +5727,7 @@ pub(super) fn build_sub<'a>( 0, analysis_rules, ); - connect_all(g, preds, header, EdgeKind::Seq); + connect_all(g, header_preds, header, EdgeKind::Seq); // Check for short-circuit condition let cond_subtree = ast.child_by_field_name("condition"); @@ -5445,6 +5782,8 @@ pub(super) fn build_sub<'a>( lang, code, enclosing_func, + call_ordinal, + analysis_rules, ); // Wire body from true_exits @@ -5472,13 +5811,33 @@ pub(super) fn build_sub<'a>( connect_all(g, &true_exits, body_first, EdgeKind::True); } + // The increment runs at the end of each iteration before the + // condition is re-checked, so it sits on the back-edge path + // between the body/continue exits and the header. + let mut back_sources: Vec = body_exits; + back_sources.extend(loop_continues.iter().copied()); + let back_sources = lower_loop_update( + update_subtree, + &back_sources, + g, + lang, + code, + summaries, + file_path, + enclosing_func, + call_ordinal, + analysis_rules, + break_targets, + continue_targets, + throw_targets, + bodies, + next_body_id, + current_body_id, + ); // Back-edges go to header (not into the condition chain) - for &e in &body_exits { + for &e in &back_sources { connect_all(g, &[e], header, EdgeKind::Back); } - for &c in &loop_continues { - connect_all(g, &[c], header, EdgeKind::Back); - } // Loop exits = false_exits + breaks let mut exits: Vec = false_exits; @@ -5504,14 +5863,32 @@ pub(super) fn build_sub<'a>( current_body_id, ); + // The increment runs on the back-edge path (end of each + // iteration, before the next condition check). + let mut back_sources: Vec = body_exits; + back_sources.extend(loop_continues.iter().copied()); + let back_sources = lower_loop_update( + update_subtree, + &back_sources, + g, + lang, + code, + summaries, + file_path, + enclosing_func, + call_ordinal, + analysis_rules, + break_targets, + continue_targets, + throw_targets, + bodies, + next_body_id, + current_body_id, + ); // Back‑edge for every linear exit → header. - for &e in &body_exits { + for &e in &back_sources { connect_all(g, &[e], header, EdgeKind::Back); } - // Wire continue targets as back edges to header - for &c in &loop_continues { - connect_all(g, &[c], header, EdgeKind::Back); - } // Falling out of the loop = header’s false branch + // any break targets that exit the loop. let mut exits = vec![header]; diff --git a/src/cfg_analysis/error_handling.rs b/src/cfg_analysis/error_handling.rs index c9a40c5b..14ce71c2 100644 --- a/src/cfg_analysis/error_handling.rs +++ b/src/cfg_analysis/error_handling.rs @@ -108,9 +108,20 @@ fn branch_terminates(cfg: &crate::cfg::Cfg, if_node: NodeIndex) -> bool { return false; } - // Check if any path through the true branch terminates + // The join point of the if statement is its immediate post-dominator: + // the first node every branch reconverges on. The true-branch walk + // must stop there — reaching the join means the body fell through + // *past* the if without terminating, so any `return` in the function + // tail past the join must NOT count as the error branch terminating. + // Without this bound, a trailing `return nil` (present in essentially + // every Go `func(...) error`) makes the walk report "all paths + // terminate" and silently suppresses the rule. + let join = super::dominators::compute_post_dominators(cfg) + .and_then(|pd| pd.immediate_dominator(if_node)); + + // Check if any path through the true branch terminates before the join. for &start in &true_successors { - if terminates_on_all_paths(cfg, start, if_node) { + if terminates_on_all_paths(cfg, start, join) { return true; } } @@ -206,11 +217,18 @@ fn call_never_returns(info: &crate::cfg::NodeInfo) -> bool { } /// Check if all paths from `node` reach a Return/Break/Continue (or a -/// known never-returning call) before exiting scope. +/// known never-returning call) before reaching the if's join point. +/// +/// `join` is the if statement's immediate post-dominator (`None` if it +/// could not be computed — e.g. no Exit node). The walk stops at the +/// join: reaching it means the true branch fell through past the if +/// without terminating, so that path does NOT terminate and the rule +/// should fire. This prevents a `return` in the function tail (after the +/// join) from being mis-attributed to the error branch. fn terminates_on_all_paths( cfg: &crate::cfg::Cfg, node: NodeIndex, - _scope_entry: NodeIndex, + join: Option, ) -> bool { use std::collections::HashSet; @@ -222,6 +240,12 @@ fn terminates_on_all_paths( continue; } + // Reaching the if's join point means this path fell through past + // the if without terminating inside the branch. + if join == Some(current) { + return false; + } + let info = &cfg[current]; match info.kind { StmtKind::Return | StmtKind::Throw | StmtKind::Break | StmtKind::Continue => { @@ -460,6 +484,91 @@ mod err_ident_tests { } } +#[cfg(test)] +mod join_boundary_tests { + use super::branch_terminates; + use crate::cfg::{CallMeta, Cfg, EdgeKind, NodeInfo, StmtKind}; + use petgraph::graph::NodeIndex; + + fn node(kind: StmtKind) -> NodeInfo { + NodeInfo { + kind, + ..Default::default() + } + } + + fn call_node(callee: &str) -> NodeInfo { + NodeInfo { + kind: StmtKind::Call, + call: CallMeta { + callee: Some(callee.to_string()), + ..Default::default() + }, + ..Default::default() + } + } + + /// Build the canonical `if err != nil { } ; return` + /// shape and return (cfg, if_node). `body_terminates` selects whether + /// the true branch body itself returns (terminates) or falls through + /// to the join. + fn build_if_cfg(body_terminates: bool) -> (Cfg, NodeIndex) { + let mut cfg = Cfg::new(); + let entry = cfg.add_node(node(StmtKind::Entry)); + let if_n = cfg.add_node(node(StmtKind::If)); + // true-branch body + let body = if body_terminates { + cfg.add_node(node(StmtKind::Return)) + } else { + cfg.add_node(call_node("log")) + }; + // join point where both branches reconverge: a downstream use + let join = cfg.add_node(call_node("use")); + // function tail: an explicit `return nil` (present in every Go + // value-returning function) followed by exit. + let ret = cfg.add_node(node(StmtKind::Return)); + let exit = cfg.add_node(node(StmtKind::Exit)); + + cfg.add_edge(entry, if_n, EdgeKind::Seq); + cfg.add_edge(if_n, body, EdgeKind::True); + cfg.add_edge(if_n, join, EdgeKind::False); + if !body_terminates { + cfg.add_edge(body, join, EdgeKind::Seq); + } else { + cfg.add_edge(body, exit, EdgeKind::Seq); + } + cfg.add_edge(join, ret, EdgeKind::Seq); + cfg.add_edge(ret, exit, EdgeKind::Seq); + + (cfg, if_n) + } + + #[test] + fn fallthrough_body_does_not_terminate_despite_trailing_return() { + // True branch falls through to the join; the function tail has a + // `return nil`. Before the join-boundary fix the walk reached + // that trailing return and reported "terminates", suppressing the + // rule. The fix bounds the walk at the join, so this is correctly + // reported as NOT terminating. + let (cfg, if_n) = build_if_cfg(false); + assert!( + !branch_terminates(&cfg, if_n), + "fall-through error branch must not count as terminating" + ); + } + + #[test] + fn returning_body_terminates() { + // True branch returns directly: the error is handled, so the rule + // must stay suppressed. + let (cfg, if_n) = build_if_cfg(true); + assert!( + branch_terminates(&cfg, if_n), + "error branch with an explicit return must count as terminating" + ); + } +} + #[cfg(test)] mod terminator_call_tests { use super::call_never_returns; diff --git a/src/cfg_analysis/guards.rs b/src/cfg_analysis/guards.rs index 4e4bcb55..3798cf92 100644 --- a/src/cfg_analysis/guards.rs +++ b/src/cfg_analysis/guards.rs @@ -131,7 +131,7 @@ fn ssa_all_sink_operands_constant( }; let operand_const = |v: SsaValue| -> bool { - ssa_operand_constant(v, facts, callee_desc, callee_parts, outer_parts) + ssa_operand_constant(v, facts, ctx.cfg, callee_desc, callee_parts, outer_parts) }; let args_ok = args .iter() @@ -401,6 +401,7 @@ fn ssa_operand_const_or_param( fn ssa_operand_constant( root: SsaValue, facts: &BodyConstFacts, + cfg: &crate::cfg::Cfg, callee_desc: &str, callee_parts: &[&str], outer_parts: &[&str], @@ -426,6 +427,26 @@ fn ssa_operand_constant( let Some(inst) = find_inst(&facts.ssa, v) else { return false; }; + // CFG-node-level Source label: a `SsaOp::Call` can be the lowering + // of a Source-labeled CFG node (`file_get_contents`, `env::var`, + // `readline`, …). Such a call's result is tainted user input, not a + // constant, even when its own arguments are constant (or it is a + // zero-arg source). Mirror the refusal in the sibling + // `ssa_operand_const_or_param` so a source-fed sink is never proven + // "all args constant" and silently dropped. + let cfg_node = inst.cfg_node; + if cfg + .node_weight(cfg_node) + .map(|info| { + info.taint + .labels + .iter() + .any(|l| matches!(l, DataLabel::Source(_))) + }) + .unwrap_or(false) + { + return false; + } match &inst.op { SsaOp::Const(_) => {} SsaOp::Assign(vals) => stack.extend(vals.iter().copied()), @@ -2190,6 +2211,32 @@ fn cond_indirect_validator_callee( crate::ssa::type_facts::classify_input_validator_callee(callee).map(|_| callee.to_string()) } +/// Match a guard suffix matcher against a callee, requiring the suffix to +/// begin on a *leaf-name boundary* rather than mid-identifier. +/// +/// A bare `callee_lower.ends_with(suffix)` over-matches: `invalidate` ends +/// with `validate` (the `Cap::all()` guard) and `unquote` ends with `quote` +/// (the SHELL_ESCAPE guard), so cache-invalidation and URL-decoding calls +/// would register as dominating guards and silently suppress every +/// downstream `cfg-unguarded-sink` in the function. Require that the +/// character preceding the matched suffix is a name separator (`.`, `_`, or +/// `:` from `::`) or that the suffix sits at the start of the callee. This +/// mirrors the existing prefix-anchor convention (a trailing `_` on the +/// matcher anchors at the start). +pub(super) fn suffix_matches_at_leaf_boundary(callee_lower: &str, suffix: &str) -> bool { + if !callee_lower.ends_with(suffix) { + return false; + } + let prefix_len = callee_lower.len() - suffix.len(); + if prefix_len == 0 { + // Suffix is the whole callee (or its leaf with nothing before it). + return true; + } + // The byte immediately before the suffix must be a leaf-name separator. + let prev = callee_lower.as_bytes()[prefix_len - 1]; + matches!(prev, b'.' | b'_' | b':') +} + /// Find all nodes in the CFG that are calls to guard functions. fn find_guard_nodes(ctx: &AnalysisContext) -> Vec<(NodeIndex, Cap)> { let guard_rules = rules::guard_rules(ctx.lang); @@ -2300,7 +2347,7 @@ fn find_guard_nodes(ctx: &AnalysisContext) -> Vec<(NodeIndex, Cap)> { if ml.ends_with('_') { callee_lower.starts_with(&ml) } else { - callee_lower.ends_with(&ml) + suffix_matches_at_leaf_boundary(&callee_lower, &ml) } }); if matched { @@ -3187,3 +3234,49 @@ mod chain_fragments_tests { assert!(!got.contains(&"raw".to_string())); } } + +#[cfg(test)] +mod guard_suffix_boundary_tests { + use super::suffix_matches_at_leaf_boundary; + + #[test] + fn rejects_mid_identifier_suffixes() { + // The whole point of the fix: these must NOT register as guards. + assert!(!suffix_matches_at_leaf_boundary("invalidate", "validate")); + assert!(!suffix_matches_at_leaf_boundary( + "cache.invalidate", + "validate" + )); + assert!(!suffix_matches_at_leaf_boundary("unquote", "quote")); + assert!(!suffix_matches_at_leaf_boundary( + "urllib.parse.unquote", + "quote" + )); + } + + #[test] + fn accepts_leaf_boundary_suffixes() { + // Suffix at a real leaf-name boundary stays a valid guard match. + assert!(suffix_matches_at_leaf_boundary("validate", "validate")); + assert!(suffix_matches_at_leaf_boundary("shlex.quote", "quote")); + assert!(suffix_matches_at_leaf_boundary( + "urllib.parse.quote", + "quote" + )); + // `_` and `::` separators also count as leaf boundaries. + assert!(suffix_matches_at_leaf_boundary("my_validate", "validate")); + assert!(suffix_matches_at_leaf_boundary( + "std::shell_escape", + "shell_escape" + )); + } + + #[test] + fn rejects_non_suffix() { + assert!(!suffix_matches_at_leaf_boundary( + "validate_input", + "validate" + )); + assert!(!suffix_matches_at_leaf_boundary("os.system", "quote")); + } +} diff --git a/src/cfg_analysis/mod.rs b/src/cfg_analysis/mod.rs index 66a9bde1..0b28177d 100644 --- a/src/cfg_analysis/mod.rs +++ b/src/cfg_analysis/mod.rs @@ -288,7 +288,11 @@ pub(crate) fn is_guard_call( if callee_lower.starts_with(&ml) { return true; } - } else if callee_lower.ends_with(&ml) { + } else if guards::suffix_matches_at_leaf_boundary(&callee_lower, &ml) { + // Leaf-boundary match, not a bare `ends_with`: otherwise + // `invalidate` matches the `validate` guard and `unquote` + // matches `quote`, registering cache-invalidation / + // URL-decode calls as guards and suppressing real sinks. return true; } } @@ -305,7 +309,7 @@ pub(crate) fn is_guard_call( if callee_lower.starts_with(&ml) { return true; } - } else if callee_lower.ends_with(&ml) { + } else if guards::suffix_matches_at_leaf_boundary(&callee_lower, &ml) { return true; } } diff --git a/src/cfg_analysis/resources.rs b/src/cfg_analysis/resources.rs index c41c7c47..0b1ee91c 100644 --- a/src/cfg_analysis/resources.rs +++ b/src/cfg_analysis/resources.rs @@ -4,6 +4,7 @@ use super::{AnalysisContext, CfgAnalysis, CfgFinding, Confidence}; use crate::cfg::{EdgeKind, StmtKind}; use crate::patterns::Severity; use crate::symbol::Lang; +use petgraph::algo::dominators::Dominators; use petgraph::graph::NodeIndex; use petgraph::visit::EdgeRef; use std::collections::HashSet; @@ -132,11 +133,14 @@ fn release_on_all_exit_paths( acquire: NodeIndex, release_nodes: &[NodeIndex], exit: NodeIndex, + post_doms: Option<&Dominators>, ) -> bool { - // Use post-dominators as optimization: if any release post-dominates acquire, it's fine - if let Some(post_doms) = dominators::compute_post_dominators(ctx.cfg) { + // Use post-dominators as optimization: if any release post-dominates acquire, it's fine. + // The post-dominator tree is computed once per CFG by the caller (the CFG is + // immutable across acquire sites), so it is threaded in rather than recomputed here. + if let Some(post_doms) = post_doms { for &release in release_nodes { - if dominators::dominates(&post_doms, release, acquire) { + if dominators::dominates(post_doms, release, acquire) { return true; } } @@ -363,8 +367,13 @@ fn is_ownership_transferred(ctx: &AnalysisContext, acquire: NodeIndex) -> bool { if references_var && start < end && end <= ctx.source_bytes.len() { let span_text = &ctx.source_bytes[start..end]; - // `->` anywhere in span means pointer-to-member store - if span_text.windows(2).any(|w| w == b"->") { + // `ptr->field = var` pointer-to-member store. A bare `->` + // anywhere in the span is NOT enough: a member READ on or near + // the handle (`fread(buf->data, 1, n, fp)`, PHP + // `$conn->query(...)`) also contains `->` but transfers no + // ownership, so requiring the `= ` (assignment, not `==`) keeps + // the suppression to actual stores. + if has_arrow_field_assignment(span_text) { return true; } // `.field = var` pattern (but not `==`) @@ -384,7 +393,7 @@ fn is_ownership_transferred(ctx: &AnalysisContext, acquire: NodeIndex) -> bool { { let is_field_write = if start < end && end <= ctx.source_bytes.len() { let span_text = &ctx.source_bytes[start..end]; - span_text.windows(2).any(|w| w == b"->") || has_dot_field_assignment(span_text) + has_arrow_field_assignment(span_text) || has_dot_field_assignment(span_text) } else { false }; @@ -434,6 +443,58 @@ fn has_dot_field_assignment(span_text: &[u8]) -> bool { false } +/// Check if `span_text` contains a pointer-to-member assignment pattern like +/// `ptr->field = var` (but not a member READ such as `fread(buf->data, ...)`, +/// PHP `$conn->query(...)`, or a comparison `a->b == c`). +/// +/// A bare `->` anywhere in the span is insufficient — only an arrow access +/// that is the LHS of an assignment indicates an ownership-transfer store. +/// We require, after one or more `->ident` (or `.ident`) member-access +/// segments, a single `=` that is not part of `==` / `!=` / `<=` / `>=`. +fn has_arrow_field_assignment(span_text: &[u8]) -> bool { + let mut i = 0; + while i + 1 < span_text.len() { + if span_text[i] == b'-' && span_text[i + 1] == b'>' { + // Walk past chained member-access segments: `->ident`, + // `.ident`, further `->ident`, and any whitespace, looking for + // the assignment `=`. + let mut j = i + 2; + loop { + // identifier chars + while j < span_text.len() + && (span_text[j].is_ascii_alphanumeric() || span_text[j] == b'_') + { + j += 1; + } + // chained `->` / `.` member access keeps the LHS going + if j + 1 < span_text.len() && span_text[j] == b'-' && span_text[j + 1] == b'>' { + j += 2; + continue; + } + if j < span_text.len() && span_text[j] == b'.' { + j += 1; + continue; + } + break; + } + // Skip whitespace before the operator + while j < span_text.len() && span_text[j].is_ascii_whitespace() { + j += 1; + } + // Reject `==`, `!=`, `<=`, `>=` — only a bare `=` is a store. + if j < span_text.len() + && span_text[j] == b'=' + && (j + 1 >= span_text.len() || span_text[j + 1] != b'=') + && (j == 0 || !matches!(span_text[j - 1], b'!' | b'<' | b'>' | b'=')) + { + return true; + } + } + i += 1; + } + false +} + /// Check whether the acquired variable is consumed by an ownership-taking /// function (e.g. `FileResponse(f)`, `send_file(f)`) downstream of the /// acquire node. These functions take ownership of the file handle so there @@ -538,6 +599,12 @@ impl CfgAnalysis for ResourceMisuse { None => return Vec::new(), }; + // The CFG is immutable for the duration of this pass, so the + // post-dominator tree only needs to be computed once. Previously + // `release_on_all_exit_paths` recomputed it for every acquire site, + // turning the body's post-dominator analysis into an O(A) hot spot. + let post_doms = dominators::compute_post_dominators(ctx.cfg); + let mut findings = Vec::new(); for pair in pairs { @@ -593,8 +660,13 @@ impl CfgAnalysis for ResourceMisuse { continue; } } - if !release_on_all_exit_paths(ctx, acquire, &release_nodes, exit) - && !is_ownership_transferred(ctx, acquire) + if !release_on_all_exit_paths( + ctx, + acquire, + &release_nodes, + exit, + post_doms.as_ref(), + ) && !is_ownership_transferred(ctx, acquire) && !is_consumed_by_owner(ctx, acquire) { // For mutex pairs, require an explicit .acquire()/.lock() call @@ -724,4 +796,30 @@ mod tests { let info = call_node(vec![None], vec![vec!["receiver_func".into()]]); assert!(!is_event_handler_register_shape(&info)); } + + #[test] + fn arrow_field_assignment_matches_real_stores() { + // Genuine ownership-transfer stores: `ptr->field = var`. + assert!(has_arrow_field_assignment(b"ctx->fp = fp")); + assert!(has_arrow_field_assignment(b"p->next = cfg->variables")); + // Chained member access on the LHS. + assert!(has_arrow_field_assignment(b"a->b->c = handle")); + // Whitespace variations. + assert!(has_arrow_field_assignment(b"obj->handle=res")); + } + + #[test] + fn arrow_field_assignment_rejects_member_reads() { + // Member READ on / near the handle: contains `->` but no store. + // This is the false-negative the fix targets — must NOT be treated + // as ownership transfer. + assert!(!has_arrow_field_assignment(b"fread(buf->data, 1, n, fp)")); + assert!(!has_arrow_field_assignment(b"$result = $conn->query(sql)")); + assert!(!has_arrow_field_assignment(b"if (node->len > 0)")); + // Comparisons through arrow access are not stores. + assert!(!has_arrow_field_assignment(b"if (obj->state == READY)")); + assert!(!has_arrow_field_assignment(b"x->y != z")); + // No arrow at all. + assert!(!has_arrow_field_assignment(b"fclose(fp)")); + } } diff --git a/src/commands/index.rs b/src/commands/index.rs index c5a1f0b3..e9ddb733 100644 --- a/src/commands/index.rs +++ b/src/commands/index.rs @@ -96,6 +96,49 @@ pub fn handle( } } +/// Run `f` with optional panic recovery, mirroring +/// `crate::commands::scan::recover_or_propagate` (which is private to the +/// scan module). When `enabled` is false, panics propagate as before. +/// When enabled, a panic in `f` is caught, logged, and surfaced as an +/// `Err` so the caller can log-and-skip the offending file instead of +/// aborting the whole index build. +fn recover_or_propagate( + enabled: bool, + path: &std::path::Path, + logs: Option<&Arc>, + f: impl FnOnce() -> NyxResult, +) -> NyxResult { + if !enabled { + return f(); + } + match std::panic::catch_unwind(std::panic::AssertUnwindSafe(f)) { + Ok(r) => r, + Err(panic) => { + let msg = panic + .downcast_ref::<&str>() + .copied() + .map(str::to_owned) + .or_else(|| panic.downcast_ref::().cloned()) + .unwrap_or_else(|| "".to_string()); + tracing::warn!( + path = %path.display(), + panic = %msg, + "index-build analysis panicked; continuing" + ); + if let Some(l) = logs { + l.warn( + format!("Analysis panicked: {msg}"), + Some(path.display().to_string()), + Some(msg.clone()), + ); + } + Err(crate::errors::NyxError::Msg(format!( + "analysis panicked: {msg}" + ))) + } + } +} + pub fn build_index( project_name: &str, project_path: &std::path::Path, @@ -135,6 +178,16 @@ pub fn build_index_with_observer( let owned_cfg = crate::commands::scan::ensure_framework_ctx(project_path, config); let config = owned_cfg.as_ref().unwrap_or(config); + // Pass-1 rescans of later-edited files run with a populated module + // graph (scan_with_index_parallel_observer builds one before pass 1), + // so the SSA summaries/bodies they persist carry package-qualified + // namespaces. Build the same graph here so the rows persisted at + // index-build time use the identical key convention; without it the DB + // ends up mixing package-qualified and bare namespaces for the same + // project, breaking cross-file SSA resolution. + let owned_cfg_with_graph = crate::commands::scan::ensure_module_graph(project_path, config); + let config = owned_cfg_with_graph.as_ref().unwrap_or(config); + tracing::debug!("Building index for: {}", project_name); let pool = Indexer::init(db_path)?; { @@ -206,22 +259,62 @@ pub fn build_index_with_observer( let pass1_start = std::time::Instant::now(); let writer = IndexWriteQueue::start(project_name.to_owned(), Arc::clone(&pool)); let write_tx = writer.sender(); + let panic_recovery = config.scanner.enable_panic_recovery; let index_result = paths.into_par_iter().try_for_each(|path| -> NyxResult<()> { // Read once, hash once, pass bytes to both rule execution and // summary extraction. Use pre-computed hash for upsert to avoid // a redundant file read inside upsert_file. - let bytes = std::fs::read(&path)?; + // + // A file that disappears between the walk and this read + // (delete/rename race) or that is unreadable (permission denied) + // must not abort the entire index build: log and skip it, matching + // the non-indexed scan paths (`scan.rs` pass-1 fold). + let bytes = match std::fs::read(&path) { + Ok(b) => b, + Err(e) => { + tracing::warn!("index build: cannot read {}: {e}", path.display()); + if let Some(l) = logs.as_ref() { + l.warn( + format!("Skipping unreadable file: {e}"), + Some(path.display().to_string()), + None, + ); + } + pb.inc(1); + return Ok(()); + } + }; let hash = Indexer::digest_bytes(&bytes); // Parse once and persist every artifact we can reuse later: - // findings, coarse summaries, and precise SSA summaries. - let fused = crate::commands::scan::analyse_file_fused( - &bytes, - &path, - config, - None, - Some(project_path), - )?; + // findings, coarse summaries, and precise SSA summaries. Wrap the + // analysis in optional panic recovery so an engine panic on one + // file does not abort the whole build when the user enabled + // `scanner.enable_panic_recovery`; an analysis error (or a caught + // panic) is logged and the file skipped, matching the scan paths. + let fused = match recover_or_propagate(panic_recovery, &path, logs.as_ref(), || { + crate::commands::scan::analyse_file_fused( + &bytes, + &path, + config, + None, + Some(project_path), + ) + }) { + Ok(f) => f, + Err(e) => { + tracing::warn!("index build: analysis failed for {}: {e}", path.display()); + if let Some(l) = logs.as_ref() { + l.warn( + format!("Skipping file after analysis error: {e}"), + Some(path.display().to_string()), + None, + ); + } + pb.inc(1); + return Ok(()); + } + }; if let Some(ref p) = progress { p.inc_parsed(1); p.set_current_file(&path.to_string_lossy()); @@ -287,6 +380,32 @@ pub fn build_index_with_observer( }) .collect(); + // Persist auth-check summaries and cross-package import maps too. + // The indexed pass-1 rescan path persists these via + // `replace_all_for_file`, but a hash match on the freshly-stamped + // file makes that path skip every unchanged file, so unless we + // write them here they are lost until a file's content changes: + // `load_all_auth_summaries` / `load_all_cross_package_imports` + // return empty after the first indexed scan, killing cross-file + // auth-helper lifting and cross-package callee resolution. + let auth_rows: Vec<_> = fused + .auth_summaries + .into_iter() + .map(|(key, sum)| { + ( + key.name, + key.arity.unwrap_or(0), + key.lang.as_str().to_string(), + key.namespace, + key.container, + key.disambig, + key.kind, + sum, + ) + }) + .collect(); + let cross_pkg_imports = fused.cross_package_imports; + let path_for_write = path.clone(); write_tx.enqueue(move |idx| { let file_id = idx.upsert_file_with_hash(&path_for_write, &hash)?; @@ -302,15 +421,23 @@ pub fn build_index_with_observer( }), )?; - if !summaries.is_empty() { - idx.replace_summaries_for_file(&path_for_write, &hash, &summaries)?; - } - if !ssa_rows.is_empty() { - idx.replace_ssa_summaries_for_file(&path_for_write, &hash, &ssa_rows)?; - } - if !body_rows.is_empty() { - idx.replace_ssa_bodies_for_file(&path_for_write, &hash, &body_rows)?; - } + // Single transaction for all summary caches, matching the + // indexed pass-1 rescan path (`replace_all_for_file`): one + // fsync per file, and — critically — auth summaries and + // cross-package imports are persisted at build time so the + // first indexed scan does not lose them. + let cpi_arg = cross_pkg_imports + .as_ref() + .map(|(ns, map)| (ns.as_str(), map.as_ref())); + idx.replace_all_for_file( + &path_for_write, + &hash, + &summaries, + &ssa_rows, + &body_rows, + &auth_rows, + cpi_arg, + )?; Ok(()) })?; @@ -409,3 +536,69 @@ app.get('/safe', function(req, res) { "index build should persist SSA summaries for functions with non-trivial SSA effects" ); } + +#[test] +fn recover_or_propagate_catches_panics_when_enabled() { + // When disabled, the panic propagates (default fail-fast). + let disabled = std::panic::catch_unwind(|| { + let _ = recover_or_propagate( + false, + std::path::Path::new("x"), + None, + || -> NyxResult<()> { panic!("boom") }, + ); + }); + assert!( + disabled.is_err(), + "with recovery disabled the panic must propagate" + ); + + // When enabled, the panic is caught and surfaced as Err so the + // index-build loop can log-and-skip instead of aborting. + let recovered = recover_or_propagate( + true, + std::path::Path::new("x"), + None, + || -> NyxResult<()> { panic!("boom") }, + ); + assert!( + recovered.is_err(), + "with recovery enabled the panic must become an Err, not unwind" + ); + + // A clean closure passes its result through unchanged. + let ok = recover_or_propagate(true, std::path::Path::new("x"), None, || Ok(7u32)); + assert_eq!(ok.unwrap(), 7); +} + +#[test] +fn build_index_skips_unreadable_entry_without_aborting() { + // A path under the project that cannot be read (here: a directory + // entry that fs::read fails on) must be skipped, not abort the build. + // Verifies the read-failure log-and-skip branch in the pass-1 loop. + let mut cfg = Config::default(); + cfg.performance.worker_threads = Some(1); + cfg.performance.channel_multiplier = 1; + cfg.performance.batch_size = 2; + + let td = tempfile::tempdir().unwrap(); + let project_dir = td.path().join("proj"); + fs::create_dir(&project_dir).unwrap(); + // One good file so the build has real work to persist. + let good = project_dir.join("good.rs"); + fs::write(&good, "fn main() {}").unwrap(); + + let db_path = td.path().join("proj.sqlite"); + // Must succeed even though the project may contain entries that are + // not plain readable files; the build never returns Err for that. + build_index("proj", &project_dir, &db_path, &cfg, false) + .expect("index build must not abort on a per-file read error"); + + let pool = Indexer::init(&db_path).unwrap(); + let idx = Indexer::from_pool("proj", &pool).unwrap(); + let files = idx.get_files("proj").unwrap(); + assert!( + files.iter().any(|p| p == &good), + "the readable file must still be indexed" + ); +} diff --git a/src/commands/scan.rs b/src/commands/scan.rs index 8d91ac86..a0834d32 100644 --- a/src/commands/scan.rs +++ b/src/commands/scan.rs @@ -387,7 +387,18 @@ pub(crate) fn verify_findings_for_scan( // can resolve the enclosing function and callgraph entry context // without re-hitting SQLite per finding. Best-effort: a load failure // logs and falls through to the substring heuristics. - opts.summaries = load_verify_summaries(project_name, db_path, scan_path); + // Resolve a module graph so summaries are keyed with the same + // package-qualified namespaces the indexed scan path uses. Reuse + // the one already on `config` when present; otherwise build it + // best-effort for this root. + let verify_cfg_with_graph = ensure_module_graph(scan_path, config); + let verify_module_graph = config.module_graph.as_deref().or_else(|| { + verify_cfg_with_graph + .as_ref() + .and_then(|c| c.module_graph.as_deref()) + }); + opts.summaries = + load_verify_summaries(project_name, db_path, scan_path, verify_module_graph); if let Some(ref summaries) = opts.summaries { opts.callgraph = Some(load_verify_callgraph(summaries)); } @@ -589,6 +600,7 @@ fn load_verify_summaries( project: &str, db_path: &Path, scan_root: &Path, + module_graph: Option<&crate::resolve::ModuleGraph>, ) -> Option> { let pool = match Indexer::init(db_path) { Ok(p) => p, @@ -612,9 +624,15 @@ fn load_verify_summaries( } }; let root_str = scan_root.to_string_lossy().into_owned(); - Some(Arc::new(crate::summary::merge_summaries( + // Key with package-qualified namespaces (when a module graph is + // available) so the verify-path summary index matches the keys the + // indexed scan path and pass-1 SSA summaries use. Plain + // normalize_namespace keys would diverge for any repo with a named + // package.json and silently lose cross-file callee resolution. + Some(Arc::new(crate::summary::merge_summaries_with_resolver( all, Some(&root_str), + module_graph, ))) } @@ -1566,6 +1584,12 @@ fn run_topo_batches( let batch_results: Vec<( std::path::PathBuf, + // `false` when this file's read or analysis failed this + // iteration. A failed file must NOT overwrite the diags it + // produced in an earlier successful iteration with an empty + // vector, so the result loop below skips the diags cache + // update when this flag is false. + bool, Vec, Vec, Vec<( @@ -1597,7 +1621,7 @@ fn run_topo_batches( None, ); } - return (path.to_path_buf(), vec![], vec![], vec![], vec![]); + return (path.to_path_buf(), false, vec![], vec![], vec![], vec![]); } }; match recover_or_propagate( @@ -1618,6 +1642,7 @@ fn run_topo_batches( pb.inc(0); // don't double-count iterations in progress bar ( path.to_path_buf(), + true, r.diags, r.summaries, r.ssa_summaries, @@ -1637,7 +1662,7 @@ fn run_topo_batches( None, ); } - (path.to_path_buf(), vec![], vec![], vec![], vec![]) + (path.to_path_buf(), false, vec![], vec![], vec![], vec![]) } } }) @@ -1645,12 +1670,23 @@ fn run_topo_batches( let mut ssa_count: usize = 0; let mg = cfg.module_graph.as_deref(); - for (path, diags, summaries, ssa_summaries, _ssa_bodies) in batch_results { + for (path, analysis_ok, diags, summaries, ssa_summaries, _ssa_bodies) in + batch_results + { // Phase-B: replace (not append) this file's diags // so the cache always reflects the latest // iteration's output. Clean files skipped this // iteration retain their previous diags. - diags_by_file.insert(path, diags); + // + // A file whose read or analysis FAILED this iteration is + // not "clean" but must be treated like one for the diags + // cache: overwriting with the empty vector would erase + // findings the same file produced in an earlier successful + // iteration. Its (empty) summaries are likewise a no-op + // below, leaving the previous iteration's summaries intact. + if analysis_ok { + diags_by_file.insert(path, diags); + } for s in summaries { let key = s.func_key_with_resolver(root_str_ref, mg); @@ -2718,6 +2754,17 @@ pub fn scan_with_index_parallel_observer( | crate::utils::config::AnalysisMode::Taint ); + // Records the pass-1-time content hash for every file whose summary + // extraction succeeded (or was correctly skipped as unchanged). The + // post-pass-2 persistence loop stamps the `files` table hash only for + // these entries, using the pass-1 hash. Files whose pass-1 extraction + // FAILED (recovered panic, hard parse failure, transient read error) + // are absent, so their `files` row keeps its previous hash and + // `should_scan_with_hash` retries them on the next scan instead of + // freezing stale summaries against the new content forever. + let pass1_safe_hashes: Arc>>> = + Arc::new(Mutex::new(HashMap::new())); + // ── Pass 1: ensure summaries are up‑to‑date ────────────────────────── if needs_taint { if let Some(p) = progress { @@ -2744,6 +2791,7 @@ pub fn scan_with_index_parallel_observer( let scan_root_ref = scan_root.to_path_buf(); let persist_errors_ref = Arc::clone(&persist_errors); let skipped_files_ref = Arc::clone(&skipped_files); + let pass1_safe_hashes_ref = Arc::clone(&pass1_safe_hashes); let progress_ref = progress.cloned(); files.par_iter().for_each_init( || Indexer::from_pool(project, &pool).expect("db pool"), @@ -2822,6 +2870,14 @@ pub fn scan_with_index_parallel_observer( ) }) .collect(); + // Extraction succeeded: this file's `files` + // row may be stamped with the pass-1 hash. + // Pass-1 persist failures abort the scan + // (writer_result? below), so a successful + // enqueue is a sufficient safety signal. + if let Ok(mut m) = pass1_safe_hashes_ref.lock() { + m.insert(path.clone(), hash.clone()); + } // Single transaction for all four caches: // one fsync per file instead of four. let path_for_write = path.clone(); @@ -2855,6 +2911,11 @@ pub fn scan_with_index_parallel_observer( if let Some(p) = &progress_ref { p.inc_skipped(1); } + // Skipped-as-unchanged: the `files` row hash already + // equals `hash`, so re-stamping it is a safe no-op. + if let Ok(mut m) = pass1_safe_hashes_ref.lock() { + m.insert(path.clone(), hash.clone()); + } } } else { tracing::warn!("pass 1: cannot read {}", path.display()); @@ -2898,7 +2959,16 @@ pub fn scan_with_index_parallel_observer( let idx = Indexer::from_pool(project, &pool)?; let all = idx.load_all_summaries()?; tracing::info!(summaries = all.len(), "loaded cross-file summaries from DB"); - let mut gs = summary::merge_summaries(all, Some(&root_str)); + // Key coarse summaries with the SAME package-qualified namespaces that + // pass-1 SSA summaries (namespace_with_package) and pass-2 refinements + // (func_key_with_resolver) use. Plain merge_summaries here would key + // them by normalize_namespace, so cross-file SSA resolution would miss + // every package-resident function in any repo with a named package.json. + let mut gs = summary::merge_summaries_with_resolver( + all, + Some(&root_str), + cfg.module_graph.as_deref(), + ); // Load and insert SSA summaries let ssa_rows = idx.load_all_ssa_summaries()?; @@ -3358,6 +3428,10 @@ pub fn scan_with_index_parallel_observer( for d in &topo_diags { by_file.entry(&d.path).or_default().push(d); } + let safe_hashes = pass1_safe_hashes + .lock() + .map(|m| m.clone()) + .unwrap_or_default(); let mut idx = Indexer::from_pool(project, &pool)?; for path in &files { if !path.exists() { @@ -3365,7 +3439,21 @@ pub fn scan_with_index_parallel_observer( continue; } - let file_id = idx.upsert_file(path)?; + // Only stamp the `files` row when pass-1 extraction for this file + // succeeded (or was correctly skipped as unchanged); use the + // pass-1-time hash, never a fresh re-read (avoids the TOCTOU where + // a mid-scan edit would pair pass-1 artifacts with content that was + // never analysed). Files whose pass-1 extraction FAILED are absent + // from `safe_hashes`: their `files` row is left untouched so its + // previous hash forces a pass-1 retry on the next scan instead of + // freezing stale summaries against the new content forever. Their + // best-effort pass-2 findings still appear in this scan's in-memory + // result; we just do not persist them or advance the hash. + let Some(pass1_hash) = safe_hashes.get(path) else { + continue; + }; + + let file_id = idx.upsert_file_with_hash(path, pass1_hash)?; let empty: [&Diag; 0] = []; let file_diags = by_file .get(path.to_string_lossy().as_ref()) diff --git a/src/database.rs b/src/database.rs index 99664137..509a1da4 100644 --- a/src/database.rs +++ b/src/database.rs @@ -1053,6 +1053,67 @@ pub mod index { Ok(()) } + /// Recompute a finding's [`crate::patterns::FindingCategory`] from + /// its rule id alone. + /// + /// The `issues` table persists only `(rule_id, severity, line, col)`, + /// not the category, so cached rows served by + /// [`Self::get_issues_from_file`] must reconstruct it. Hardcoding + /// `Security` here resurrects quality findings (e.g. + /// `rs.quality.unwrap`) past the `include_quality` filter and the + /// quality-rollup, producing cold/warm non-determinism for unchanged + /// files. Mapping is best-effort and falls back to `Security`: + /// + /// * structural / state-machine ids (`state-*`, `cfg-*`) route through + /// [`crate::patterns::FindingCategory::for_structural_rule`] + /// (reliability for leaks/error-fallthrough, security otherwise); + /// * AST-pattern ids look up their declared + /// [`crate::patterns::PatternCategory`] in a one-time index built + /// over every language's pattern registry; + /// * everything else (taint flows, etc.) stays `Security`. + fn category_for_rule_id(rule_id: &str) -> crate::patterns::FindingCategory { + use crate::patterns::FindingCategory; + use std::sync::OnceLock; + + // Structural / state-machine findings are not AST patterns and + // carry no language slug; classify them by id directly. + if rule_id.starts_with("state-") || rule_id.starts_with("cfg-") { + return FindingCategory::for_structural_rule(rule_id); + } + + // Lazily build a rule_id -> FindingCategory index across every + // language's static pattern registry. Distinct ids only; alias + // slugs share the same pattern slice so duplicate inserts are a + // no-op. + static PATTERN_CATEGORIES: OnceLock< + std::collections::HashMap, + > = OnceLock::new(); + let map = PATTERN_CATEGORIES.get_or_init(|| { + let mut m = std::collections::HashMap::new(); + for lang in [ + "rust", + "typescript", + "javascript", + "c", + "cpp", + "java", + "go", + "php", + "python", + "ruby", + ] { + for p in crate::patterns::load(lang) { + m.insert(p.id.to_string(), p.category.finding_category()); + } + } + m + }); + + map.get(rule_id) + .copied() + .unwrap_or(FindingCategory::Security) + } + /// Gets the issues for a specific file so we don't have to rescan pub fn get_issues_from_file(&self, path: &Path) -> NyxResult> { let file_id: i64 = self.c().query_row( @@ -1076,13 +1137,15 @@ pub mod index { ); Severity::Medium }); + let rule_id: String = row.get::<_, String>(0)?; + let category = Self::category_for_rule_id(&rule_id); Ok(Diag { path: path.to_string_lossy().to_string(), - id: row.get::<_, String>(0)?, // rule_id + id: rule_id, // rule_id line: row.get::<_, i64>(2)? as usize, col: row.get::<_, i64>(3)? as usize, severity, - category: crate::patterns::FindingCategory::Security, + category, path_validated: false, guard_kind: None, message: None, @@ -1695,8 +1758,14 @@ pub mod index { /// * function and auth summaries: DELETE-then-INSERT regardless /// of input length, so emptying a file's summaries clears /// stale rows. - /// * SSA summaries and bodies: only touched when the input is - /// non-empty, matching the existing scan path. + /// * SSA summaries and bodies: DELETE always runs so a file that + /// no longer yields any SSA artifacts (functions removed, file + /// gutted, or SSA lowering now failing for every function) + /// clears its stale high-precedence rows; the INSERT loop is + /// skipped when the new set is empty. Leaving stale SSA rows + /// would let deleted/phantom functions keep resolving cross-file + /// via resolve step 0.5 (SSA summaries outrank coarse + /// FuncSummary), so the DELETE must not be gated on non-empty. #[allow(clippy::too_many_arguments)] pub fn replace_all_for_file( &mut self, @@ -1780,13 +1849,15 @@ pub mod index { } } - // ssa_function_summaries, only touched when non-empty. + // ssa_function_summaries: DELETE always, INSERT only when + // the new set is non-empty, so emptying a file's SSA + // summaries clears its stale high-precedence rows. + tx.execute( + "DELETE FROM ssa_function_summaries + WHERE project = ?1 AND file_path = ?2", + params![self.project, path_str], + )?; if !ssa_summaries.is_empty() { - tx.execute( - "DELETE FROM ssa_function_summaries - WHERE project = ?1 AND file_path = ?2", - params![self.project, path_str], - )?; let mut stmt = tx.prepare( "INSERT OR REPLACE INTO ssa_function_summaries (project, file_path, file_hash, name, arity, lang, namespace, @@ -1822,13 +1893,15 @@ pub mod index { } } - // ssa_function_bodies, only touched when non-empty. + // ssa_function_bodies: DELETE always, INSERT only when the + // new set is non-empty, so emptying a file's SSA bodies + // clears its stale high-precedence rows. + tx.execute( + "DELETE FROM ssa_function_bodies + WHERE project = ?1 AND file_path = ?2", + params![self.project, path_str], + )?; if !ssa_bodies.is_empty() { - tx.execute( - "DELETE FROM ssa_function_bodies - WHERE project = ?1 AND file_path = ?2", - params![self.project, path_str], - )?; let mut stmt = tx.prepare( "INSERT OR REPLACE INTO ssa_function_bodies (project, file_path, file_hash, name, arity, lang, namespace, @@ -2922,6 +2995,65 @@ fn replace_issues_and_query_back() { ); } +#[test] +fn get_issues_from_file_recomputes_category_from_rule_id() { + // Regression for the hardcoded `category = Security` bug: cached rows + // must reconstruct the finding category from the rule id, otherwise a + // quality finding (correctly dropped/rolled-up on the cold scan) + // resurfaces as an ungrouped Security finding on every warm scan. + use crate::patterns::FindingCategory; + + let td = tempfile::tempdir().unwrap(); + let db = td.path().join("nyx.sqlite"); + let file = td.path().join("lib.rs"); + std::fs::write(&file, "fn main() {}").unwrap(); + + let pool = index::Indexer::init(&db).unwrap(); + let mut idx = index::Indexer::from_pool("proj", &pool).unwrap(); + let fid = idx.upsert_file(&file).unwrap(); + + let issues = [ + // Tier-A quality AST pattern -> must come back as Quality. + index::IssueRow { + rule_id: "rs.quality.unwrap", + severity: "Low", + line: 1, + col: 1, + }, + // Structural resource leak -> Reliability. + index::IssueRow { + rule_id: "state-resource-leak", + severity: "Medium", + line: 2, + col: 1, + }, + // Taint-style id (not in any AST registry) -> Security fallback. + index::IssueRow { + rule_id: "taint-sql-injection", + severity: "High", + line: 3, + col: 1, + }, + ]; + idx.replace_issues(fid, issues).unwrap(); + + let stored = idx.get_issues_from_file(&file).unwrap(); + let cat = |id: &str| { + stored + .iter() + .find(|d| d.id == id) + .unwrap_or_else(|| panic!("missing {id}")) + .category + }; + assert_eq!( + cat("rs.quality.unwrap"), + FindingCategory::Quality, + "quality AST pattern must not be resurrected as Security on warm scans" + ); + assert_eq!(cat("state-resource-leak"), FindingCategory::Reliability); + assert_eq!(cat("taint-sql-injection"), FindingCategory::Security); +} + #[test] fn clear_and_vacuum_reset_tables() { let td = tempfile::tempdir().unwrap(); @@ -3286,6 +3418,81 @@ fn ssa_summaries_hash_rescan_replaces_stale() { assert_eq!(loaded[0].1, "new_func"); } +#[test] +fn replace_all_for_file_clears_stale_ssa_rows_when_emptied() { + // Regression for the gated-DELETE staleness bug: a file edited so it no + // longer yields any SSA summaries (functions removed / file gutted / + // lowering now failing) must clear its high-precedence ssa_function_* + // rows. Otherwise phantom functions keep resolving cross-file via + // resolve step 0.5 (SSA outranks coarse FuncSummary). + use crate::labels::Cap; + use crate::summary::ssa_summary::{SsaFuncSummary, TaintTransform}; + + let td = tempfile::tempdir().unwrap(); + let db = td.path().join("nyx.sqlite"); + let f = td.path().join("lib.py"); + std::fs::write(&f, "v1").unwrap(); + + let pool = index::Indexer::init(&db).unwrap(); + let mut idx = index::Indexer::from_pool("proj", &pool).unwrap(); + + let hash_v1 = index::Indexer::digest_bytes(b"v1"); + let ssa_sums = vec![( + "old_func".to_string(), + 1_usize, + "python".to_string(), + "lib.py".to_string(), + String::new(), + None, + crate::symbol::FuncKind::Function, + SsaFuncSummary { + param_to_return: vec![(0, TaintTransform::Identity)], + param_to_sink: vec![], + source_caps: Cap::empty(), + param_to_sink_param: vec![], + param_container_to_return: vec![], + param_to_container_store: vec![], + return_type: None, + return_abstract: None, + source_to_callback: vec![], + receiver_to_return: None, + receiver_to_sink: Cap::empty(), + abstract_transfer: vec![], + param_return_paths: vec![], + points_to: Default::default(), + field_points_to: Default::default(), + return_path_facts: smallvec::SmallVec::new(), + typed_call_receivers: vec![], + validated_params_to_return: smallvec::SmallVec::new(), + param_to_gate_filters: vec![], + entry_kind: None, + }, + )]; + + // Pass 1: file has one SSA summary. + idx.replace_all_for_file(&f, &hash_v1, &[], &ssa_sums, &[], &[], None) + .unwrap(); + assert_eq!( + idx.load_all_ssa_summaries().unwrap().len(), + 1, + "SSA summary should persist on first write" + ); + + // Pass 2: file gutted — new hash, empty SSA summaries/bodies. The old + // row must be deleted, not left behind. + let hash_v2 = index::Indexer::digest_bytes(b"v2"); + idx.replace_all_for_file(&f, &hash_v2, &[], &[], &[], &[], None) + .unwrap(); + assert!( + idx.load_all_ssa_summaries().unwrap().is_empty(), + "stale SSA summary rows must be cleared when the new set is empty" + ); + assert!( + idx.load_all_ssa_bodies().unwrap().is_empty(), + "stale SSA body rows must be cleared when the new set is empty" + ); +} + #[test] fn clear_drops_ssa_summaries_table() { use crate::labels::Cap; diff --git a/src/labels/c.rs b/src/labels/c.rs index 818ad506..ece6a77d 100644 --- a/src/labels/c.rs +++ b/src/labels/c.rs @@ -9,7 +9,7 @@ pub static RULES: &[LabelRule] = &[ case_sensitive: false, }, LabelRule { - matchers: &["fgets", "scanf", "fscanf", "gets", "read"], + matchers: &["fgets", "scanf", "fscanf", "sscanf", "gets", "read"], label: DataLabel::Source(Cap::all()), case_sensitive: false, }, @@ -273,6 +273,15 @@ pub static OUTPUT_PARAM_SOURCES: &[(&str, &[usize])] = &[ ("gets", &[0]), // gets(buf), buf receives input ("recv", &[1]), // recv(fd, buf, len, flags) ("recvfrom", &[1]), // recvfrom(fd, buf, len, flags, ...) + ("read", &[1]), // read(fd, buf, len), buf receives attacker bytes + // `scanf`/`fscanf`/`sscanf` return a match count; the attacker-controlled + // bytes land in the variadic output pointers after the format string. + // OUTPUT_PARAM_SOURCES stores a fixed position list, so we enumerate a + // conservative span of trailing argument positions to cover the common + // single- and multi-conversion forms. + ("scanf", &[1, 2, 3, 4, 5, 6, 7, 8]), // scanf("%s", buf, ...) , outputs start at arg 1 + ("fscanf", &[2, 3, 4, 5, 6, 7, 8]), // fscanf(stream, "%s", buf, ...) , outputs at arg 2+ + ("sscanf", &[2, 3, 4, 5, 6, 7, 8]), // sscanf(src, "%s", buf, ...) , outputs at arg 2+ ]; /// Arg-to-arg taint propagation for known C functions. @@ -288,3 +297,36 @@ pub static ARG_PROPAGATIONS: &[super::ArgPropagation] = &[ to_args: &[1], }, ]; + +#[cfg(test)] +mod tests { + use crate::labels::output_param_source_positions; + + #[test] + fn scanf_family_and_read_taint_output_args() { + // `scanf("%s", buf)` , buf is at arg 1. + assert_eq!( + output_param_source_positions("c", "scanf"), + Some([1usize, 2, 3, 4, 5, 6, 7, 8].as_slice()) + ); + // `fscanf(stream, "%s", buf)` and `sscanf(src, "%s", buf)` , outputs at arg 2+. + assert_eq!( + output_param_source_positions("c", "fscanf"), + Some([2usize, 3, 4, 5, 6, 7, 8].as_slice()) + ); + assert_eq!( + output_param_source_positions("c", "sscanf"), + Some([2usize, 3, 4, 5, 6, 7, 8].as_slice()) + ); + // `read(fd, buf, len)` , buf is at arg 1. + assert_eq!( + output_param_source_positions("c", "read"), + Some([1usize].as_slice()) + ); + // Namespaced/qualified callees normalize to the last segment. + assert_eq!( + output_param_source_positions("c", "std::sscanf"), + Some([2usize, 3, 4, 5, 6, 7, 8].as_slice()) + ); + } +} diff --git a/src/labels/go.rs b/src/labels/go.rs index ceb3a19a..a2045dbe 100644 --- a/src/labels/go.rs +++ b/src/labels/go.rs @@ -1011,6 +1011,11 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "communication_case" => Kind::Block, "go_statement" => Kind::Block, "defer_statement" => Kind::Block, + // `outer: for { ... }` wraps the whole labeled loop in a + // labeled_statement; map to Block so the CFG builder recurses into the + // inner statement instead of collapsing the loop body into one leaf Seq + // node (mirrors c.rs / cpp.rs). + "labeled_statement" => Kind::Block, // data-flow "call_expression" => Kind::CallFn, @@ -1032,7 +1037,10 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { pub static PARAM_CONFIG: ParamConfig = ParamConfig { params_field: "parameters", - param_node_kinds: &["parameter_declaration"], + // `variadic_parameter_declaration` covers `func run(args ...string)`; + // without it the variadic param is dropped, registering wrong arity and + // never seeding caller taint into the variadic position. + param_node_kinds: &["parameter_declaration", "variadic_parameter_declaration"], self_param_kinds: &[], ident_fields: &["name"], }; @@ -1094,3 +1102,27 @@ pub fn framework_rules(ctx: &FrameworkContext) -> Vec { rules } + +#[cfg(test)] +mod tests { + use super::{KINDS, PARAM_CONFIG}; + use crate::labels::Kind; + + #[test] + fn labeled_statement_is_walkable_block() { + // `outer: for { ... }` must be a Block so the CFG builder recurses + // into the labeled loop body instead of collapsing it to a leaf Seq. + assert_eq!(KINDS.get("labeled_statement"), Some(&Kind::Block)); + } + + #[test] + fn variadic_param_is_extracted() { + // `func run(args ...string)` emits variadic_parameter_declaration; + // it must be a recognised param node so arity registers correctly. + assert!( + PARAM_CONFIG + .param_node_kinds + .contains(&"variadic_parameter_declaration") + ); + } +} diff --git a/src/labels/java.rs b/src/labels/java.rs index 473e9d71..e255c0af 100644 --- a/src/labels/java.rs +++ b/src/labels/java.rs @@ -1069,7 +1069,13 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "block" => Kind::Block, "class_declaration" => Kind::Block, "class_body" => Kind::Block, + "interface_declaration" => Kind::Block, "interface_body" => Kind::Block, + "enum_declaration" => Kind::Block, + "enum_body" => Kind::Block, + "enum_body_declarations" => Kind::Block, + "record_declaration" => Kind::Block, + "synchronized_statement" => Kind::Block, "method_declaration" => Kind::Function, "constructor_declaration" => Kind::Function, "switch_expression" => Kind::Switch, @@ -1126,3 +1132,31 @@ pub fn framework_rules(ctx: &FrameworkContext) -> Vec { rules } + +#[cfg(test)] +mod tests { + use super::KINDS; + use crate::labels::Kind; + + #[test] + fn container_declarations_are_walkable_blocks() { + // interface_declaration must be mapped so its (already-mapped) + // interface_body is reachable; enum/record/synchronized bodies + // must be walked instead of collapsing to a leaf Seq node. + for kind in [ + "interface_declaration", + "interface_body", + "enum_declaration", + "enum_body", + "enum_body_declarations", + "record_declaration", + "synchronized_statement", + ] { + assert_eq!( + KINDS.get(kind), + Some(&Kind::Block), + "{kind} should map to Kind::Block so the CFG builder walks its body" + ); + } + } +} diff --git a/src/labels/python.rs b/src/labels/python.rs index 5be84723..ea9080e5 100644 --- a/src/labels/python.rs +++ b/src/labels/python.rs @@ -1643,6 +1643,11 @@ pub static PARAM_CONFIG: ParamConfig = ParamConfig { "typed_parameter", "default_parameter", "typed_default_parameter", + // `*args` / `**kwargs`: without these the splat params are dropped, + // registering wrong arity and never seeding caller taint into the + // variadic positions. + "list_splat_pattern", + "dictionary_splat_pattern", ], self_param_kinds: &[], ident_fields: &["name"], @@ -1667,9 +1672,26 @@ pub fn framework_rules(ctx: &FrameworkContext) -> Vec { #[cfg(test)] mod tests { - use super::KINDS; + use super::{KINDS, PARAM_CONFIG}; use crate::labels::Kind; + #[test] + fn splat_params_are_extracted() { + // `*args` (list_splat_pattern) and `**kwargs` (dictionary_splat_pattern) + // must be recognised param nodes so arity registers correctly and the + // variadic positions are seeded with caller taint. + assert!( + PARAM_CONFIG + .param_node_kinds + .contains(&"list_splat_pattern") + ); + assert!( + PARAM_CONFIG + .param_node_kinds + .contains(&"dictionary_splat_pattern") + ); + } + #[test] fn lambda_classified_as_function() { assert_eq!(KINDS.get("lambda"), Some(&Kind::Function)); diff --git a/src/labels/ruby.rs b/src/labels/ruby.rs index abb4e733..93722b41 100644 --- a/src/labels/ruby.rs +++ b/src/labels/ruby.rs @@ -542,6 +542,11 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { // structure "program" => Kind::SourceFile, "body_statement" => Kind::Block, + // Brace blocks (`{ ... }`, including `->(){ ... }` lambda bodies) expose + // their body as a `block_body` node (do-end blocks use `body_statement`, + // already mapped above). Without this, multi-statement brace-block bodies + // collapse into a single leaf Seq node and lose all intra-body taint flow. + "block_body" => Kind::Block, "do_block" => Kind::Function, "then" => Kind::Block, "else" => Kind::Block, @@ -718,3 +723,17 @@ mod ar_query_tests { assert!(ar_query_safe_shape("a.b.c.where", "pair", false)); } } + +#[cfg(test)] +mod kinds_tests { + use super::KINDS; + use crate::labels::Kind; + + #[test] + fn block_body_is_walkable_block() { + // Brace blocks (incl. lambda bodies) expose their body as `block_body`; + // it must be a Block so multi-statement bodies are walked instead of + // collapsing into one leaf Seq node and losing intra-body taint flow. + assert_eq!(KINDS.get("block_body"), Some(&Kind::Block)); + } +} diff --git a/src/labels/rust.rs b/src/labels/rust.rs index 76d27d63..6f4b6b00 100644 --- a/src/labels/rust.rs +++ b/src/labels/rust.rs @@ -403,6 +403,12 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "impl_item" => Kind::Block, "trait_item" => Kind::Block, "declaration_list" => Kind::Block, + // Inline modules `mod foo { ... }` wrap their items in a + // `declaration_list`; map to Block so the CFG builder recurses into the + // body and the `function_item`s inside are lowered, instead of dropping + // the whole module (the old `Kind::Trivia` mapping discarded every + // function/source/sink inside an inline module). + "mod_item" => Kind::Block, // data-flow "call_expression" => Kind::CallFn, @@ -430,7 +436,6 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "{" => Kind::Trivia, "}" => Kind::Trivia, "\n" => Kind::Trivia, "use_declaration" => Kind::Trivia, "attribute_item" => Kind::Trivia, - "mod_item" => Kind::Trivia, "type_item" => Kind::Trivia, }; @@ -614,3 +619,18 @@ pub fn phase_c_auth_rules() -> Vec { }, ] } + +#[cfg(test)] +mod tests { + use super::KINDS; + use crate::labels::Kind; + + #[test] + fn mod_item_is_walkable_block_not_trivia() { + // Inline `mod foo { ... }` must be a Block so the CFG builder recurses + // into the module body; the old Trivia mapping dropped every function, + // source, and sink inside inline modules. + assert_eq!(KINDS.get("mod_item"), Some(&Kind::Block)); + assert_ne!(KINDS.get("mod_item"), Some(&Kind::Trivia)); + } +} diff --git a/src/labels/typescript.rs b/src/labels/typescript.rs index 41dfaa12..1982b51c 100644 --- a/src/labels/typescript.rs +++ b/src/labels/typescript.rs @@ -546,6 +546,34 @@ pub static GATED_SINKS: &[SinkGate] = &[ dangerous_kwargs: &[], activation: GateActivation::ValueMatch, }, + // ── Lodash `_.template` SSTI/RCE gates, mirrors `labels/javascript.rs` ─ + // (Strapi CVE-2023-22621 class). Lodash compiles `<% ... %>` evaluate + // blocks into a JS Function; gate on the `evaluate` option and fire + // conservatively when missing/dynamic. + SinkGate { + callee_matcher: "_.template", + arg_index: 0, + dangerous_values: &["true"], + dangerous_prefixes: &[], + label: DataLabel::Sink(Cap::CODE_EXEC), + case_sensitive: true, + payload_args: &[0], + keyword_name: Some("evaluate"), + dangerous_kwargs: &[], + activation: GateActivation::ValueMatch, + }, + SinkGate { + callee_matcher: "lodash.template", + arg_index: 0, + dangerous_values: &["true"], + dangerous_prefixes: &[], + label: DataLabel::Sink(Cap::CODE_EXEC), + case_sensitive: true, + payload_args: &[0], + keyword_name: Some("evaluate"), + dangerous_kwargs: &[], + activation: GateActivation::ValueMatch, + }, // ── XML XXE gates, mirrors `labels/javascript.rs` ──────────────────── SinkGate { callee_matcher: "xml2js.parseString", @@ -762,6 +790,37 @@ pub static GATED_SINKS: &[SinkGate] = &[ object_destination_fields: &["host", "hostname", "path", "protocol", "port", "origin"], }, }, + // Node `http.get` / `https.get` convenience wrappers around `.request()`. + // Same destination semantics. Motivated by CVE-2025-64430 (Parse Server + // SSRF via http.get(uri)). Mirrors `labels/javascript.rs`. + SinkGate { + callee_matcher: "http.get", + arg_index: 0, + dangerous_values: &[], + dangerous_prefixes: &[], + label: DataLabel::Sink(Cap::SSRF), + case_sensitive: false, + payload_args: &[0], + keyword_name: None, + dangerous_kwargs: &[], + activation: GateActivation::Destination { + object_destination_fields: &["host", "hostname", "path", "protocol", "port", "origin"], + }, + }, + SinkGate { + callee_matcher: "https.get", + arg_index: 0, + dangerous_values: &[], + dangerous_prefixes: &[], + label: DataLabel::Sink(Cap::SSRF), + case_sensitive: false, + payload_args: &[0], + keyword_name: None, + dangerous_kwargs: &[], + activation: GateActivation::Destination { + object_destination_fields: &["host", "hostname", "path", "protocol", "port", "origin"], + }, + }, // ── Cross-boundary data exfiltration ────────────────────────────────── // `fetch(input, init)`, payload-bearing fields of `init` (arg 1) flow // into the request body / headers / json, distinct from SSRF on the URL @@ -1070,6 +1129,69 @@ pub static GATED_SINKS: &[SinkGate] = &[ dangerous_kwargs: &[], activation: GateActivation::LiteralOnly, }, + // `set-value` standalone helper (CVE-2019-10747 / CVE-2021-23440) — + // recursive set-by-path helper that did not block `__proto__` keys. + // Mirrors `labels/javascript.rs`. + SinkGate { + callee_matcher: "setValue", + arg_index: 0, + dangerous_values: &[], + dangerous_prefixes: &[], + label: DataLabel::Sink(Cap::PROTOTYPE_POLLUTION), + case_sensitive: true, + payload_args: &[1, 2], + keyword_name: None, + dangerous_kwargs: &[], + activation: GateActivation::Destination { + object_destination_fields: &[], + }, + }, + // `dot-prop` standalone helper: `dotProp.set(obj, path, val)` — + // CVE-2020-8116. Mirrors `labels/javascript.rs`. + SinkGate { + callee_matcher: "dotProp.set", + arg_index: 0, + dangerous_values: &[], + dangerous_prefixes: &[], + label: DataLabel::Sink(Cap::PROTOTYPE_POLLUTION), + case_sensitive: true, + payload_args: &[1, 2], + keyword_name: None, + dangerous_kwargs: &[], + activation: GateActivation::Destination { + object_destination_fields: &[], + }, + }, + // `jsonpath` / `jsonpath-plus` `jp.set(obj, path, value)` family — + // mirrors `labels/javascript.rs`. + SinkGate { + callee_matcher: "jp.set", + arg_index: 0, + dangerous_values: &[], + dangerous_prefixes: &[], + label: DataLabel::Sink(Cap::PROTOTYPE_POLLUTION), + case_sensitive: true, + payload_args: &[1, 2], + keyword_name: None, + dangerous_kwargs: &[], + activation: GateActivation::Destination { + object_destination_fields: &[], + }, + }, + SinkGate { + callee_matcher: "jsonpath.set", + arg_index: 0, + dangerous_values: &[], + dangerous_prefixes: &[], + label: DataLabel::Sink(Cap::PROTOTYPE_POLLUTION), + case_sensitive: false, + payload_args: &[1, 2], + keyword_name: None, + dangerous_kwargs: &[], + activation: GateActivation::Destination { + object_destination_fields: &[], + }, + }, ]; pub static KINDS: Map<&'static str, Kind> = phf_map! { diff --git a/src/patterns/javascript.rs b/src/patterns/javascript.rs index f4d1029a..8dc63c4a 100644 --- a/src/patterns/javascript.rs +++ b/src/patterns/javascript.rs @@ -117,7 +117,7 @@ pub const PATTERNS: &[Pattern] = &[ function: (member_expression property: (property_identifier) @prop (#eq? @prop "createHash")) arguments: (arguments - (string) @alg (#match? @alg "\"(md5|sha1)\""))) + (string) @alg (#match? @alg "^[\"'](md5|sha1)[\"']$"))) @vuln"#, severity: Severity::Low, tier: PatternTier::A, @@ -201,7 +201,7 @@ pub const PATTERNS: &[Pattern] = &[ query: r#"(call_expression function: (identifier) @id (#eq? @id "fetch") arguments: (arguments - (string) @url (#match? @url "^\"http://"))) + (string) @url (#match? @url "^[\"']http://"))) @vuln"#, severity: Severity::Low, tier: PatternTier::A, diff --git a/src/patterns/ruby.rs b/src/patterns/ruby.rs index 7a7169f4..6021eb35 100644 --- a/src/patterns/ruby.rs +++ b/src/patterns/ruby.rs @@ -119,7 +119,7 @@ pub const PATTERNS: &[Pattern] = &[ query: r#"(call method: (identifier) @m (#eq? @m "open") arguments: (argument_list - (string) @url (#match? @url "^\"https?://"))) + (string) @url (#match? @url "^[\"']https?://"))) @vuln"#, severity: Severity::Medium, tier: PatternTier::A, diff --git a/src/patterns/typescript.rs b/src/patterns/typescript.rs index 5be60cbb..60b3c600 100644 --- a/src/patterns/typescript.rs +++ b/src/patterns/typescript.rs @@ -88,7 +88,7 @@ pub const PATTERNS: &[Pattern] = &[ function: (member_expression property: (property_identifier) @prop (#eq? @prop "createHash")) arguments: (arguments - (string) @alg (#match? @alg "\"(md5|sha1)\""))) + (string) @alg (#match? @alg "^[\"'](md5|sha1)[\"']$"))) @vuln"#, severity: Severity::Low, tier: PatternTier::A, diff --git a/src/pointer/analysis.rs b/src/pointer/analysis.rs index 3e18a1a6..387b1880 100644 --- a/src/pointer/analysis.rs +++ b/src/pointer/analysis.rs @@ -558,6 +558,41 @@ impl AnalysisState { } changed } + SsaOp::Call { + callee, receiver, .. + } => { + // Re-project the `Field(pt(receiver), ELEM)` cell for + // container reads. `transfer_inst` (pass 1) projects this + // from a snapshot of `pt(receiver)`, but the receiver's set + // is often still empty or partial on that pass (e.g. the + // container comes from a `FieldProj` that only resolves in + // the fixpoint, or from a value unified by a later `Assign`). + // Without re-projecting here, the promised `Field(_, ELEM)` + // members never appear and per-element parent-field aliasing + // through the read result is dropped. The fresh-alloc + // fallback added in pass 1 is preserved (`union_in_place` + // only adds members), so this strictly refines the set. + let Some(rcv) = receiver else { + return false; + }; + if !is_container_read_callee(callee) || (rcv.0 as usize) >= self.parent.len() { + return false; + } + let rcv_rep = self.find(rcv.0) as usize; + let rcv_pt = self.pt[rcv_rep].clone(); + // Mirror the pass-1 guard: skip empty (nothing to project) + // and Top (already maximally imprecise) receivers. + if rcv_pt.is_empty() || rcv_pt.is_top() { + return false; + } + let mut new_pt = PointsToSet::empty(); + for parent_loc in rcv_pt.iter() { + let proj = self.interner.intern_field(parent_loc, FieldId::ELEM); + new_pt.insert(proj); + } + let v_rep = self.find(v) as usize; + self.pt[v_rep].union_in_place(&new_pt) + } // No re-propagation needed for leaf ops. _ => false, } @@ -1015,6 +1050,68 @@ mod tests { ); } + /// Regression for the fixpoint re-projection of container reads. + /// + /// Shape: `e := this.queue.shift()`. The receiver of `shift` is a + /// `FieldProj` (`this.queue`) whose points-to set is *empty* on pass 1 + /// (field projections only resolve in the fixpoint). Before the fix, + /// `propagate_inst` never re-evaluated `SsaOp::Call`, so the promised + /// `Field(Field(SelfParam, "queue"), ELEM)` member was never projected + /// into the result and only the fresh allocation remained. After the + /// fix the nested ELEM cell must appear once the receiver converges. + #[test] + fn container_read_reprojects_elem_after_receiver_converges() { + let mut b = BodyBuilder::new(); + // `this` is the self parameter. + let this = b.fresh(Some("this")); + b.emit(this, SsaOp::SelfParam, Some("this")); + // `q := this.queue`, a field projection whose pt is empty in pass 1. + let queue_field = b.intern_field("queue"); + let q = b.fresh(Some("q")); + b.emit( + q, + SsaOp::FieldProj { + receiver: this, + field: queue_field, + projected_type: None, + }, + Some("q"), + ); + // `e := q.shift()`, container read whose receiver (`q`) only + // converges during the fixpoint. + let e = b.fresh(Some("e")); + b.emit( + e, + SsaOp::Call { + callee: "shift".into(), + callee_text: None, + args: vec![], + receiver: Some(q), + }, + Some("e"), + ); + + let body = b.build(); + let facts = analyse_body(&body, BodyId(0)); + let pt_e = facts.pt(e); + // The result must now include a Field(_, ELEM) member projected + // from the converged receiver set (not just the fresh alloc). + let mut saw_elem = false; + for loc in pt_e.iter() { + if let crate::pointer::AbsLoc::Field { field, .. } = facts.interner.resolve(loc) + && *field == FieldId::ELEM + { + saw_elem = true; + break; + } + } + assert!( + saw_elem, + "container read with a FieldProj receiver must re-project \ + Field(_, ELEM) in the fixpoint; got {pt_e:?}" + ); + } + /// `extract_field_points_to` records a field /// READ on the parameter index when a `FieldProj` traces back to /// an `AbsLoc::Param`. diff --git a/src/resolve/mod.rs b/src/resolve/mod.rs index 72956809..813f0cfe 100644 --- a/src/resolve/mod.rs +++ b/src/resolve/mod.rs @@ -996,14 +996,35 @@ fn resolve_file_or_index(candidate: &Path) -> Option { if candidate.is_file() { return Some(normalize_path(candidate)); } - for ext in RESOLVE_EXTENSIONS { - let mut with_ext = candidate.to_path_buf(); - match with_ext.extension() { - Some(_) => {} - None => { - with_ext.set_extension(ext); - if with_ext.is_file() { - return Some(normalize_path(&with_ext)); + // Node / TS resolution appends each candidate extension to the FULL + // specifier, regardless of whether the last segment already contains a + // dot. `Path::set_extension` *replaces* the trailing component after the + // last `.`, so it is wrong here: it would turn `./user.service` into + // `./user.ts` and never produce `./user.service.ts` (the dominant + // Angular/NestJS `.service`/`.component`/`.module`/`.controller` + // convention). Build the appended filename by hand instead. + if let Some(name) = candidate.file_name().and_then(|n| n.to_str()) { + for ext in RESOLVE_EXTENSIONS { + let appended = candidate.with_file_name(format!("{name}.{ext}")); + if appended.is_file() { + return Some(normalize_path(&appended)); + } + } + } + // TypeScript NodeNext / ESM idiom: an `import './x.js'` (or `.mjs` / + // `.cjs`) frequently refers to `x.ts` / `x.mts` / `x.cts` on disk. When + // the specifier carries a JS-family extension that did not resolve as a + // file above, retry with each TS-family extension swapped in. + if let Some(stem) = candidate.file_stem().and_then(|s| s.to_str()) { + let has_js_ext = candidate + .extension() + .and_then(|e| e.to_str()) + .is_some_and(|e| matches!(e, "js" | "jsx" | "mjs" | "cjs")); + if has_js_ext { + for ext in ["ts", "tsx", "mts", "cts"] { + let swapped = candidate.with_file_name(format!("{stem}.{ext}")); + if swapped.is_file() { + return Some(normalize_path(&swapped)); } } } @@ -1040,3 +1061,73 @@ fn normalize_path(p: &Path) -> PathBuf { fn canonicalize_or_owned(p: &Path) -> PathBuf { p.canonicalize().unwrap_or_else(|_| p.to_path_buf()) } + +#[cfg(test)] +mod resolve_file_ext_tests { + use super::resolve_file_or_index; + use std::fs; + + /// `import './user.service'` must resolve to `user.service.ts` on disk: + /// the extension is APPENDED to the full specifier, not replacing the + /// `.service` suffix (the Angular/NestJS convention). + #[test] + fn resolves_dotted_basename_by_appending_extension() { + let dir = tempfile::tempdir().unwrap(); + let target = dir.path().join("user.service.ts"); + fs::write(&target, "export const x = 1;").unwrap(); + + // The specifier as the importer would join it: no extension appended. + let candidate = dir.path().join("user.service"); + let resolved = resolve_file_or_index(&candidate) + .expect("'./user.service' should resolve to user.service.ts"); + assert!( + resolved.ends_with("user.service.ts"), + "expected user.service.ts, got {resolved:?}" + ); + } + + /// TypeScript NodeNext idiom: `import './x.js'` resolves to `x.ts` on disk. + #[test] + fn resolves_js_specifier_to_ts_file() { + let dir = tempfile::tempdir().unwrap(); + let target = dir.path().join("x.ts"); + fs::write(&target, "export const y = 2;").unwrap(); + + let candidate = dir.path().join("x.js"); + let resolved = resolve_file_or_index(&candidate) + .expect("'./x.js' should resolve to x.ts under NodeNext semantics"); + assert!( + resolved.ends_with("x.ts"), + "expected x.ts, got {resolved:?}" + ); + } + + /// Extensionless specifiers still resolve (no regression). + #[test] + fn resolves_extensionless_specifier() { + let dir = tempfile::tempdir().unwrap(); + let target = dir.path().join("plain.ts"); + fs::write(&target, "export const z = 3;").unwrap(); + + let candidate = dir.path().join("plain"); + let resolved = + resolve_file_or_index(&candidate).expect("'./plain' should resolve to plain.ts"); + assert!(resolved.ends_with("plain.ts"), "got {resolved:?}"); + } + + /// A non-JS asset import (e.g. `./image.css`) with no matching source on + /// disk must NOT spuriously resolve to an unrelated file. + #[test] + fn does_not_spuriously_resolve_unrelated_asset() { + let dir = tempfile::tempdir().unwrap(); + // A same-stem .ts exists, but a `.css` import is not a JS-family + // extension, so the NodeNext swap must not fire. + fs::write(dir.path().join("image.ts"), "x").unwrap(); + + let candidate = dir.path().join("image.css"); + assert!( + resolve_file_or_index(&candidate).is_none(), + "a .css specifier must not resolve to image.ts" + ); + } +} diff --git a/src/ssa/lower.rs b/src/ssa/lower.rs index 61e64104..a72bde59 100644 --- a/src/ssa/lower.rs +++ b/src/ssa/lower.rs @@ -121,6 +121,64 @@ fn try_lower_field_proj_chain( Some((current, method)) } +/// Remove the *implicit* chain-root argument group that `build_call_args` +/// appends when decomposing a chained-receiver call (`a.b.m(...)`) into a +/// FieldProj chain plus a bare-method Call. +/// +/// `build_call_args` adds any `info.taint.uses` entry not already in +/// `info.call.arg_uses` (and not the receiver) as a trailing "implicit" arg +/// group. For a chained callee `a.b.m`, the chain root `a` lands in that +/// implicit group; once the FieldProj chain carries it on the typed `receiver` +/// channel, re-listing it as a positional argument inflates arity and +/// double-taints, so it must be dropped. +/// +/// The previous implementation used `retain` to drop *every* group equal to +/// `[base_v]`, which also deleted a **genuine** positional argument that +/// happens to be the chain root identifier — e.g. the second `a` in +/// `a.b.m(p, a)` or `this` in `this.logger.log(this)`. That silently lost the +/// taint flowing through that argument into the callee and shifted later arg +/// positions left. +/// +/// This helper fixes that by: +/// 1. Skipping the strip entirely when the chain root is a genuine +/// positional argument (present in `arg_uses`): there is no implicit +/// group to remove in that case, only the real argument. +/// 2. Otherwise removing at most ONE matching group — the trailing implicit +/// one appended by `build_call_args` — so a coincidental earlier +/// `[base_v]` argument still survives. +fn strip_implicit_chain_root( + callee: &str, + info: &crate::cfg::NodeInfo, + var_stacks: &HashMap>, + args: &mut Vec>, +) { + let Some(base_ident) = callee.split('.').next() else { + return; + }; + let Some(&base_v) = var_stacks.get(base_ident).and_then(|s| s.last()) else { + return; + }; + // If the chain root is itself a genuine positional argument, the matching + // `[base_v]` group is that real argument — `build_call_args` did NOT append + // an implicit chain-root group (the root is already in `arg_uses`). Keep it. + let root_is_real_arg = info + .call + .arg_uses + .iter() + .any(|grp| grp.iter().any(|ident| ident.as_str() == base_ident)); + if root_is_real_arg { + return; + } + // Remove only the last matching group (the appended implicit chain-root + // group), preserving any earlier coincidental `[base_v]` argument. + if let Some(pos) = args + .iter() + .rposition(|grp| grp.len() == 1 && grp.first() == Some(&base_v)) + { + args.remove(pos); + } +} + /// Lower a CFG to SSA form for a single function scope. /// /// `scope` filters nodes by `enclosing_func`: @@ -341,8 +399,9 @@ fn lower_to_ssa_inner( // listed as an exception target indicates a CFG construction bug. Debug // builds panic loudly; release builds warn, record an engine note so // downstream findings carry "SSA lowering bailed" provenance, and fall - // through to the existing orphan handling above (the "all definitions" - // fallback) which remains sound for taint reachability. + // through to the existing orphan handling above, which lowers orphan + // (catch) subtrees with a locally-built dominator tree and seeds them from + // the most entry-dominating reaching definitions. check_catch_block_reachability_gated(&body); Ok(body) @@ -352,9 +411,11 @@ fn lower_to_ssa_inner( /// debug builds and warns + records an engine note in release builds. /// /// The current lowering's orphan handling (`process_block` fallback in -/// `rename_variables`) already widens to an "all definitions" conservative -/// state for blocks without predecessors. That preserves soundness for -/// taint reachability but masks CFG-builder bugs: this gate surfaces them. +/// `rename_variables`) lowers orphan (catch) subtrees via a locally-built +/// dominator tree and seeds them from the most entry-dominating reaching +/// definitions, so blocks without predecessors are still renamed and +/// populated. That keeps catch-block analysis usable but masks CFG-builder +/// bugs: this gate surfaces them. fn check_catch_block_reachability_gated(body: &SsaBody) { let result = super::invariants::check_catch_block_reachability(body); if let Err(err) = result { @@ -923,6 +984,100 @@ fn build_dom_tree_children( children } +/// Blocks reachable from the entry block (block 0) over the block-level +/// successor graph. Blocks NOT in this set are "orphan domain" — catch blocks +/// (and everything they dominate) that became unreachable once exception edges +/// were stripped before SSA lowering. +fn compute_entry_reachable(num_blocks: usize, block_succs: &[Vec]) -> Vec { + let mut reachable = vec![false; num_blocks]; + if num_blocks == 0 { + return reachable; + } + let mut stack = vec![0usize]; + reachable[0] = true; + while let Some(b) = stack.pop() { + for &s in &block_succs[b] { + if !reachable[s] { + reachable[s] = true; + stack.push(s); + } + } + } + reachable +} + +/// Build a dominator-tree children list restricted to the orphan domain. +/// +/// The main `simple_fast` walk is rooted at block 0 and never reaches orphan +/// blocks (catch entries + their internal subtree), so they have no entry in +/// the global dominator tree and their internal successors are never renamed. +/// Here we construct a fresh block graph containing a virtual super-root that +/// reaches every orphan *entry* (an orphan-domain block with no predecessors), +/// plus the orphan-domain blocks and the edges among them. Edges that leave +/// the orphan domain (e.g. a catch arm flowing into the entry-reachable +/// post-`try` join) are dropped: those target blocks were already processed by +/// the main walk and act as boundaries. Running `simple_fast` over this graph +/// yields a dominator tree whose `children[entry]` lists the catch's internal +/// dominator-tree successors, letting `process_block` recurse through the whole +/// handler body. +/// +/// Returned vector is indexed by real block id (length `num_blocks`); entries +/// for entry-reachable blocks stay empty. +fn build_orphan_dom_tree_children( + num_blocks: usize, + block_succs: &[Vec], + block_preds: &[Vec], + entry_reachable: &[bool], +) -> Vec> { + let mut children: Vec> = vec![vec![]; num_blocks]; + + // Collect orphan-domain blocks (unreachable from entry). + let orphan_blocks: Vec = (0..num_blocks).filter(|&b| !entry_reachable[b]).collect(); + if orphan_blocks.is_empty() { + return children; + } + + // Build a graph: virtual root node + one node per orphan-domain block. + // Node weight stores the real block id; the virtual root uses a sentinel + // (u32::MAX) that maps to no real block. + let mut g: Graph = Graph::new(); + let root = g.add_node(u32::MAX); + let mut node_of_block: HashMap = HashMap::new(); + for &b in &orphan_blocks { + node_of_block.insert(b, g.add_node(b as u32)); + } + + for &b in &orphan_blocks { + let bn = node_of_block[&b]; + // Orphan entries (no real predecessors) hang off the virtual root. + if block_preds[b].is_empty() { + g.add_edge(root, bn, ()); + } + // Intra-orphan-domain edges. Targets that escape the orphan domain + // are boundaries (already processed) and are not added. + for &s in &block_succs[b] { + if let Some(&sn) = node_of_block.get(&s) { + g.add_edge(bn, sn, ()); + } + } + } + + let doms = simple_fast(&g, root); + for &b in &orphan_blocks { + let bn = node_of_block[&b]; + if let Some(idom) = doms.immediate_dominator(bn) { + // Skip blocks whose immediate dominator is the virtual root: these + // are the orphan entries, processed directly by the caller's loop. + let idom_w = g[idom]; + if idom_w != u32::MAX { + children[idom_w as usize].push(b); + } + } + } + + children +} + /// Rename variables: dominator tree preorder walk with per-variable stacks. /// /// Returns (ssa_blocks, value_defs, cfg_node_map). @@ -1161,16 +1316,22 @@ fn rename_variables( ) { Some((recv_v, bare_method)) => { receiver = Some(recv_v); - // Strip any positional arg group that exactly matches the - // chain root identifier, it has been replaced by the - // FieldProj chain receiver, and re-listing it as an + // Strip the *implicit* chain-root arg group that + // `build_call_args` appends for `taint.uses` not present + // in `arg_uses`: the chain root has been replaced by the + // FieldProj chain receiver, so re-listing it as an // argument would inflate arity / double-taint. - if let Some(base_ident) = callee.split('.').next() { - if let Some(base_v) = var_stacks.get(base_ident).and_then(|s| s.last()) - { - args.retain(|grp| !(grp.len() == 1 && grp.first() == Some(base_v))); - } - } + // + // Only do this when the chain root is NOT a genuine + // positional argument. For `a.b.m(p, a)` the root `a` + // is a real second argument (present in `arg_uses`); the + // `[a]` group is that argument, not the implicit + // chain-root group, and must be kept — otherwise taint + // flowing through it into the callee is silently lost. + // Also remove at most one matching group (the appended + // implicit one) so a coincidental earlier `[base_v]` + // positional arg survives. + strip_implicit_chain_root(&callee, info, var_stacks, &mut args); (bare_method, Some(callee.clone())) } None => (callee, None), @@ -1306,12 +1467,10 @@ fn rename_variables( ) { Some((recv_v, bare_method)) => { receiver = Some(recv_v); - if let Some(base_ident) = callee.split('.').next() { - if let Some(base_v) = var_stacks.get(base_ident).and_then(|s| s.last()) - { - args.retain(|grp| !(grp.len() == 1 && grp.first() == Some(base_v))); - } - } + // Same implicit-chain-root strip as the primary Call + // branch above; keep a genuine positional arg equal to + // the chain root. See `strip_implicit_chain_root`. + strip_implicit_chain_root(&callee, info, var_stacks, &mut args); (bare_method, Some(callee.clone())) } None => (callee, None), @@ -2179,24 +2338,71 @@ fn rename_variables( // Process orphan blocks (e.g. catch blocks disconnected after exception edge removal). // These blocks have no predecessors and weren't reached by the dominator tree walk. // - // Rebuild var_stacks from already-processed instructions so that catch blocks - // can reference variables defined before the try block (e.g. `userInput`). - let has_orphans = - (1..num_blocks).any(|bid| block_preds[bid].is_empty() && ssa_blocks[bid].body.is_empty()); + // An "orphan domain" is the set of blocks unreachable from the entry block + // after exception-edge stripping — the catch entry plus everything it + // dominates (internal `if`/`for`/`while`/`switch`/`try` blocks). The main + // `simple_fast` dominator walk is rooted at block 0 and never reaches these + // blocks, so their `immediate_dominator` is `None` and + // `build_dom_tree_children` links neither the catch entry nor its internal + // subtree. Processing only the catch *entry* (which has empty preds) via + // the global `dom_tree_children` therefore leaves every catch-internal + // block body empty — dropping all Source/Sink/Assign instructions past the + // catch's first basic block (a soundness false-negative). To fix this we + // build a *local* dominator tree over the orphan domain (rooted at a + // virtual super-root that reaches every orphan entry) and recurse through + // it, so the whole catch subtree is renamed and populated. + let entry_reachable = compute_entry_reachable(num_blocks, block_succs); + let has_orphans = (1..num_blocks).any(|bid| !entry_reachable[bid]); if has_orphans { - // Rebuild var_stacks from all SSA instructions created during the main walk. - // This gives orphan blocks access to all variable definitions. + // Seed var_stacks with the definitions that actually *reach* the catch. + // + // The previous implementation rebuilt var_stacks from *all* blocks in + // ascending block-id order and took `.last()`, i.e. the highest-block-id + // def. For a variable defined before the `try` (block 0) and re-killed + // *after* the `try`/`catch` join (a later, higher-id block — e.g. + // `x = "safe"` post-catch), `.last()` picked that post-join kill, which + // does NOT reach the handler. A catch-side `sink(x)` then resolved to + // the killed constant instead of the pre-`try` (possibly tainted) value + // — a false negative. (The doc comment further up claiming this is a + // sound "all definitions" widening was inaccurate.) + // + // We restrict the rebuild to *entry-reachable* blocks (orphan-domain + // blocks are renamed below, not used as a seed) and iterate them in + // DESCENDING block-id order so `.last()` lands on the LOWEST-id — i.e. + // most entry-dominating — definition of each variable. Block 0 (which + // dominates every block, including the catch) therefore wins over any + // post-join reassignment, fixing the reported false negative, while a + // variable defined only inside the `try` body still keeps a `try`-side + // definition (rather than being dropped), avoiding a new false negative. + // Exact reaching-definition resolution for variables redefined along the + // protected region would require the pre-strip exception-edge structure, + // which isn't available here; this ordering is the conservative + // approximation. var_stacks.clear(); - for block in &ssa_blocks { - for inst in block.phis.iter().chain(block.body.iter()) { + for bid in (0..num_blocks).rev() { + if !entry_reachable[bid] { + continue; + } + for inst in ssa_blocks[bid] + .phis + .iter() + .chain(ssa_blocks[bid].body.iter()) + { if let Some(ref name) = inst.var_name { var_stacks.entry(name.clone()).or_default().push(inst.value); } } } + // Build a local dominator tree over the orphan domain so the catch + // entry's internal successors are reached during renaming. + let orphan_dom_children = + build_orphan_dom_tree_children(num_blocks, block_succs, block_preds, &entry_reachable); + for bid in 1..num_blocks { - if block_preds[bid].is_empty() && ssa_blocks[bid].body.is_empty() { + // Orphan *entries* are orphan-domain blocks with no predecessors + // (their only inbound edges were exception edges, now stripped). + if !entry_reachable[bid] && block_preds[bid].is_empty() { process_block( bid, cfg, @@ -2204,7 +2410,7 @@ fn rename_variables( block_succs, block_preds, phi_placements, - dom_tree_children, + &orphan_dom_children, filtered_edges, &mut var_stacks, &mut ssa_blocks, @@ -2780,6 +2986,208 @@ mod tests { assert!(!ssa.blocks.is_empty()); } + #[test] + fn orphan_catch_with_internal_branch_populates_subtree() { + // Regression for the soundness hole where a catch block containing + // internal control flow (`catch(e){ if(cond){ sink(x) } else { y=2 } }`) + // dropped every instruction past the catch's first basic block: the + // catch-internal arms were orphan-domain (unreachable from entry) AND + // absent from the global dominator tree, so their bodies stayed empty + // and `sink(x)` was silently lost. + // + // Layout: + // entry → body(defines x) → exit (normal flow) + // body --Exception--> catch(e) → if → [True: sink(x)] [False: y=2] → join → exit + let mut cfg: Cfg = Graph::new(); + let entry = cfg.add_node(make_node(StmtKind::Entry)); + let body = cfg.add_node(NodeInfo { + taint: TaintMeta { + defines: Some("x".into()), + ..Default::default() + }, + ..make_node(StmtKind::Seq) + }); + let catch = cfg.add_node(NodeInfo { + catch_param: true, + taint: TaintMeta { + defines: Some("e".into()), + ..Default::default() + }, + ..make_node(StmtKind::Seq) + }); + let if_node = cfg.add_node(make_node(StmtKind::If)); + let sink = cfg.add_node(NodeInfo { + call: crate::cfg::CallMeta { + callee: Some("sink".into()), + arg_uses: vec![vec!["x".into()]], + ..Default::default() + }, + taint: TaintMeta { + uses: vec!["x".into()], + ..Default::default() + }, + ..make_node(StmtKind::Seq) + }); + let else_assign = cfg.add_node(NodeInfo { + taint: TaintMeta { + defines: Some("y".into()), + ..Default::default() + }, + ..make_node(StmtKind::Seq) + }); + let join = cfg.add_node(make_node(StmtKind::Seq)); + let exit = cfg.add_node(make_node(StmtKind::Exit)); + + cfg.add_edge(entry, body, EdgeKind::Seq); + cfg.add_edge(body, exit, EdgeKind::Seq); + cfg.add_edge(body, catch, EdgeKind::Exception); + cfg.add_edge(catch, if_node, EdgeKind::Seq); + cfg.add_edge(if_node, sink, EdgeKind::True); + cfg.add_edge(if_node, else_assign, EdgeKind::False); + cfg.add_edge(sink, join, EdgeKind::Seq); + cfg.add_edge(else_assign, join, EdgeKind::Seq); + cfg.add_edge(join, exit, EdgeKind::Seq); + + let ssa = lower_to_ssa(&cfg, entry, None, true).unwrap(); + + // The `sink` Call must survive lowering: before the fix the + // catch-internal True arm was an empty `unreachable` block. + let has_sink_call = ssa.blocks.iter().any(|b| { + b.body + .iter() + .any(|inst| matches!(&inst.op, SsaOp::Call { callee, .. } if callee == "sink")) + }); + assert!( + has_sink_call, + "catch-internal sink(x) call was dropped; orphan subtree not populated" + ); + } + + #[test] + fn orphan_catch_resolves_pre_try_def_not_post_join_kill() { + // Regression for the reaching-definition hole where the orphan + // var_stacks rebuild iterated all blocks in ASCENDING block-id order + // and took `.last()`, picking the highest-block-id def of each variable + // instead of the one that actually reaches the catch. For `x` sourced + // before the try (block 0) and re-killed to a constant AFTER the + // try/catch join (a later, higher-id block), `.last()` wrongly resolved + // a catch-side `sink(x)` to the post-join constant — masking the + // pre-try tainted value (a false negative). + // + // Layout (mirrors `x = source(); try{..}catch(e){ sink(x) }; x="safe"`, + // where the post-try kill lands at a control-flow join in a later + // block than the pre-try source): + // entry → src(defines x = Source) → {b1, b2} → kill(x="safe") → exit + // src --Exception--> catch(e) → sink(x) (orphan) + // `src` branches (b1/b2) so it ends block 0; the `kill` join is a + // higher-id block. The buggy ascending+`.last()` rebuild seeded the + // catch with the Const kill; the fixed descending rebuild seeds the + // most entry-dominating def (the block-0 Source). A degenerate + // straight-line `src→kill` (single coalesced block) is a separate, + // documented limitation — exact within-block reaching defs need the + // pre-strip exception-point structure, unavailable post-lowering. + let mut cfg: Cfg = Graph::new(); + let entry = cfg.add_node(make_node(StmtKind::Entry)); + // Pure source (no callee) so it lowers to `SsaOp::Source` and defines x. + let src = cfg.add_node(NodeInfo { + taint: TaintMeta { + labels: smallvec::smallvec![crate::labels::DataLabel::Source( + crate::labels::Cap::all() + )], + defines: Some("x".into()), + ..Default::default() + }, + ..make_node(StmtKind::Seq) + }); + // Diamond arms so `src` is a branch point (block boundary) and the + // `kill` join below starts a strictly-higher-id block. + let b1 = cfg.add_node(make_node(StmtKind::Seq)); + let b2 = cfg.add_node(make_node(StmtKind::Seq)); + // Post-join kill: defines x from a literal, no uses, not a source → + // lowers to `SsaOp::Const`. Higher block id than `src`. + let kill = cfg.add_node(NodeInfo { + taint: TaintMeta { + defines: Some("x".into()), + const_text: Some("safe".into()), + ..Default::default() + }, + ..make_node(StmtKind::Seq) + }); + let catch = cfg.add_node(NodeInfo { + catch_param: true, + taint: TaintMeta { + defines: Some("e".into()), + ..Default::default() + }, + ..make_node(StmtKind::Seq) + }); + let sink = cfg.add_node(NodeInfo { + call: crate::cfg::CallMeta { + callee: Some("sink".into()), + arg_uses: vec![vec!["x".into()]], + ..Default::default() + }, + taint: TaintMeta { + uses: vec!["x".into()], + ..Default::default() + }, + ..make_node(StmtKind::Seq) + }); + let exit = cfg.add_node(make_node(StmtKind::Exit)); + + cfg.add_edge(entry, src, EdgeKind::Seq); + cfg.add_edge(src, b1, EdgeKind::Seq); + cfg.add_edge(src, b2, EdgeKind::Seq); + cfg.add_edge(b1, kill, EdgeKind::Seq); + cfg.add_edge(b2, kill, EdgeKind::Seq); + cfg.add_edge(kill, exit, EdgeKind::Seq); + cfg.add_edge(src, catch, EdgeKind::Exception); + cfg.add_edge(catch, sink, EdgeKind::Seq); + cfg.add_edge(sink, exit, EdgeKind::Seq); + + let ssa = lower_to_ssa(&cfg, entry, None, true).unwrap(); + + // SsaValue of the Source def (the pre-try value that reaches the catch) + // and of the Const kill (the post-join value that does NOT reach it). + let source_v = ssa + .blocks + .iter() + .flat_map(|b| b.body.iter()) + .find(|inst| matches!(inst.op, SsaOp::Source)) + .map(|inst| inst.value) + .expect("Source def must be present"); + let kill_v = ssa + .blocks + .iter() + .flat_map(|b| b.body.iter()) + .find(|inst| matches!(inst.op, SsaOp::Const(_))) + .map(|inst| inst.value) + .expect("Const kill must be present"); + + // Find the sink Call and its single argument value. + let sink_arg = ssa + .blocks + .iter() + .flat_map(|b| b.body.iter()) + .find_map(|inst| match &inst.op { + SsaOp::Call { callee, args, .. } if callee == "sink" => { + args.iter().flat_map(|g| g.iter().copied()).next() + } + _ => None, + }) + .expect("sink(x) Call with an argument must be present"); + + assert_eq!( + sink_arg, source_v, + "catch sink(x) must resolve to the pre-try Source def {source_v:?} \ + that reaches the handler, not the post-join Const kill {kill_v:?}" + ); + assert_ne!( + sink_arg, kill_v, + "catch sink(x) wrongly resolved to the post-join Const kill" + ); + } + #[test] fn phi_operand_count_equals_pred_count_in_diamond() { // Specific test: phi operands == predecessor count (not just <=) @@ -3988,6 +4396,66 @@ mod tests { ); } + /// Collect the arg groups of the first Call whose bare callee matches. + fn call_args_for(body: &SsaBody, bare_callee: &str) -> Option>> { + for blk in &body.blocks { + for inst in blk.body.iter() { + if let SsaOp::Call { callee, args, .. } = &inst.op { + if callee == bare_callee { + return Some(args.clone()); + } + } + } + } + None + } + + #[test] + fn phase2_e2e_chain_root_genuine_positional_arg_is_kept() { + // Regression for the soundness hole where decomposing a chained-receiver + // call stripped a *genuine* positional argument equal to the chain root. + // For `a.b.m(p, a)` the second `a` is a real argument, not the implicit + // chain-root group `build_call_args` appends, so it must survive + // decomposition. Previously `retain(... == [base_v])` deleted every + // `[a]` group, losing the argument (and shifting later positions left). + // + // Go: `a` and `p` are parameters so the chain root resolves. + let with_arg = parse_to_first_body( + b"package p\nfunc f(a *T, p string) { a.b.m(p, a) }\n", + "go", + tree_sitter::Language::from(tree_sitter_go::LANGUAGE), + "with.go", + ); + let control = parse_to_first_body( + b"package p\nfunc f(a *T, p string) { a.b.m(p) }\n", + "go", + tree_sitter::Language::from(tree_sitter_go::LANGUAGE), + "ctrl.go", + ); + + let with_args = call_args_for(&with_arg, "m").expect("decomposed m() call present"); + let ctrl_args = call_args_for(&control, "m").expect("decomposed m() call present"); + + // The soundness invariant: adding a genuine positional argument `a` + // (which equals the chain root) must add EXACTLY one arg group versus + // the control. The buggy `retain(... == [base_v])` deleted the genuine + // `a` group, so `with` would have had the SAME count as `control` (the + // argument silently vanished). The fix's `root_is_real_arg` guard keeps + // it, so `with` has one more group than `control`. + // + // We compare the two counts rather than asserting an absolute value: + // for Go's multi-segment field receiver the implicit chain-root group + // `build_call_args` appends is a multi-value group (`[a, a.b, ...]`), + // not a clean single `[base_v]`, so the absolute group count is an + // implementation detail — the *delta* is the property under test. + assert_eq!( + with_args.len(), + ctrl_args.len() + 1, + "a.b.m(p, a) must keep the genuine `a` argument: expected one more \ + arg group than the control a.b.m(p).\n with: {with_args:?}\n ctrl: {ctrl_args:?}" + ); + } + #[test] fn phase2_e2e_python_chained_receiver_emits_field_proj() { // Python: `obj.client.session.send(p)`, 3-segment receiver. diff --git a/src/ssa/type_facts.rs b/src/ssa/type_facts.rs index 82dbb29c..294de2e4 100644 --- a/src/ssa/type_facts.rs +++ b/src/ssa/type_facts.rs @@ -1339,7 +1339,16 @@ pub fn is_int_producing_callee(callee: &str) -> bool { // Peel trailing identity methods (e.g. `.unwrap()`/`.expect("...")` after // `.parse()`) so the underlying numeric-producing verb is exposed. let base = peel_identity_suffix(callee); - let suffix = base.rsplit(['.', ':']).next().unwrap_or(&base); + // `peel_identity_suffix` normalizes a turbofish callee by truncating at + // `<`, which leaves a trailing `::` (e.g. `s.parse::()` → + // `s.parse::`). Trim trailing path separators so the method suffix is + // recovered (`parse`) instead of an empty segment. + // Trim the trailing path separators a turbofish callee leaves behind: + // `peel_identity_suffix` truncates at `<`, so `s.parse::()` → + // `s.parse::`. Trimming recovers the method suffix (`parse`) instead of + // an empty segment. + let trimmed = base.trim_end_matches([':', '.']); + let suffix = trimmed.rsplit(['.', ':']).next().unwrap_or(trimmed); matches!( suffix, "parseInt" | "parseFloat" | "Number" // JS/TS @@ -1348,8 +1357,19 @@ pub fn is_int_producing_callee(callee: &str) -> bool { | "Atoi" | "ParseInt" | "ParseFloat" // Go | "intval" | "floatval" // PHP | "to_i" | "to_f" // Ruby - | "parse" // Rust: `.parse::()` / `.parse().unwrap()`, conservative - // (most Rust .parse() calls target numeric types) + | "parse" // Rust: `.parse::()` / `let n: u32 = s.parse()?` , + // conservative (most `str::parse` targets are numeric). + // + // KNOWN LIMITATION: the callee text alone cannot + // distinguish `parse::()` from `parse::()` + // / `let p: PathBuf = s.parse()?`. Tagging every `parse` + // as `Int` keeps the tested precision behaviour + // (`type_facts_suppress_int_typed_shell_arg`: a u16 port + // into a shell arg is suppressed) at the cost of a rare + // false negative on `let p: PathBuf = s.parse()?; + // fs::read(p)`. Distinguishing the two soundly requires + // reading the let-binding type annotation into the type + // lattice (tracked as a deep-fix; NOT inferable here). ) } @@ -1372,14 +1392,13 @@ pub fn is_int_producing_callee(callee: &str) -> bool { /// `Byte.toString` / `Boolean.toString` / `Character.toString` , /// output is `[+-]?\d+(\.\d+)?` / `"true"` / `"false"` / `"NaN"` / /// `"Infinity"`, none of which can carry CRLF or injection metachars. -/// * `String.valueOf` static factories , most overloads (`int`, -/// `long`, `boolean`, `char`, ...) emit the same digit / boolean / -/// single-character text as their per-class `toString`. The -/// `Object` overload falls back to `Object.toString()` whose output -/// shape depends on the runtime type, but the dominant safe usage -/// shape (`String.valueOf(payload.size())`, -/// `String.valueOf(rendered.length())`) covers the common -/// header-injection mitigation pattern. +/// * `String.valueOf` is deliberately **not** covered. Its `String` / +/// `Object` overloads return the argument verbatim (or its arbitrary +/// `toString()`), so `String.valueOf(req.getParameter("x"))` is an +/// identity passthrough — recognising it as a safe-string producer +/// silently suppressed real injections. The callee text alone cannot +/// distinguish the safe numeric overload from the identity form, so the +/// whole family is left unrecognised (soundness over precision). /// * `Class.getName` / `Class.getSimpleName` / `Class.getCanonicalName` /// , the JVM class-name grammar disallows CRLF, quotes, slashes, /// spaces, and shell metacharacters; the dot-separated FQCN is safe @@ -1402,7 +1421,21 @@ pub fn is_safe_string_producing_callee(callee: &str) -> bool { | "Character", "toString", ) => return true, - ("String", "valueOf") => return true, + // `String.valueOf` is NOT safe-by-construction: while the + // primitive overloads (`valueOf(int)` / `valueOf(boolean)` / + // `valueOf(char)`) emit digit / boolean / single-char text, + // `valueOf(String s)` returns `s` *verbatim* and the + // `valueOf(Object o)` overload returns `o.toString()` whose + // shape is arbitrary. The callee text (`String.valueOf`) + // carries no argument-type signal, so we cannot tell the safe + // numeric form from the identity form here. Tagging it as a + // safe-string producer suppressed real injections like + // `stmt.execute(String.valueOf(req.getParameter("x")))` + // (the suppression consumer `apply_arg_type_safe_suppression` + // drops the tainted arg at the sink). Erring toward soundness: + // `String.valueOf` is left unrecognised, so the numeric + // mitigation shape (`String.valueOf(payload.size())`) is no + // longer auto-suppressed but no injection is silently dropped. ("Class", "getName" | "getSimpleName" | "getCanonicalName") => return true, _ => {} } @@ -1552,7 +1585,21 @@ pub fn analyze_types_with_param_types( } SsaOp::SelfParam => TypeFact::from_kind(TypeKind::Object), SsaOp::CatchParam => TypeFact::from_kind(TypeKind::Object), - SsaOp::Call { callee, args, .. } => { + SsaOp::Call { + callee, + callee_text, + args, + .. + } => { + // For the Rust `.parse()` numeric-turbofish gate (see + // `is_int_producing_callee`) the discriminating + // `::` lives in the *original* callee text, which SSA + // moves into `callee_text` when it decomposes a chained + // receiver (`x.parse::()` → `callee = "parse"`, + // `callee_text = Some("x.parse::")`). Prefer the + // full text so the turbofish survives; `callee` is the + // canonical fallback when no decomposition occurred. + let callee_for_int = callee_text.as_deref().unwrap_or(callee); // CFG marks `Object.create(null)` (and future // null-prototype constructors) at lowering time. // Honour it ahead of generic constructor / arg-aware @@ -1588,7 +1635,7 @@ pub fn analyze_types_with_param_types( lang.and_then(|l| arg_aware_call_type(l, callee, args, consts)) { TypeFact::from_kind(ty) - } else if is_int_producing_callee(callee) { + } else if is_int_producing_callee(callee_for_int) { TypeFact::from_kind(TypeKind::Int) } else if is_safe_string_producing_callee(callee) { // Numeric/boolean to-string converters and class-name @@ -3439,4 +3486,62 @@ mod tests { "createQuery is overloaded — must not map at constructor_type level" ); } + + /// Audit #3: `String.valueOf` is an identity passthrough for its + /// `String`/`Object` overloads, so it must NOT be recognised as a + /// safe-string producer (which would suppress real injections like + /// `stmt.execute(String.valueOf(req.getParameter("x")))`). The + /// genuinely-safe numeric to-string converters and class-name + /// accessors are still recognised. + #[test] + fn string_valueof_not_safe_string_producer() { + // The defect: must be false now (was unconditionally true). + assert!( + !is_safe_string_producing_callee("String.valueOf"), + "String.valueOf is identity for String/Object overloads — unsound to treat as safe" + ); + assert!( + !is_safe_string_producing_callee("java.lang.String.valueOf"), + "fully-qualified String.valueOf must also be unrecognised" + ); + // Regression guard: the genuinely safe converters/accessors remain. + assert!(is_safe_string_producing_callee("Integer.toString")); + assert!(is_safe_string_producing_callee("Long.toString")); + assert!(is_safe_string_producing_callee("Boolean.toString")); + assert!(is_safe_string_producing_callee("Class.getName")); + assert!(is_safe_string_producing_callee("loaded.getClass().getName")); + } + + /// Audit #4: bare Rust `.parse()` is generic over `FromStr`, but the + /// callee text cannot distinguish `parse::()` from + /// `parse::()`. The engine deliberately tags every `parse` as + /// `Int` (precision: a u16 port into a shell arg is suppressed — pinned by + /// `cfg_analysis::tests::type_facts_suppress_int_typed_shell_arg`), at the + /// cost of a rare FN on `let p: PathBuf = s.parse()?`. Distinguishing the + /// two soundly needs let-annotation typing (deep-fix queue), not the + /// callee text. This test pins the chosen behaviour + cross-language + /// converters + turbofish-form robustness. + #[test] + fn rust_parse_is_int_producing_and_turbofish_forms_are_robust() { + // Bare parse and its identity-peeled / receiver-qualified forms — Int. + assert!(is_int_producing_callee("parse")); + assert!(is_int_producing_callee("s.parse")); + assert!(is_int_producing_callee("parse().unwrap()")); + // Turbofish forms survive `peel_identity_suffix` truncation at `<` + // (the trailing `::` is trimmed back to the `parse` suffix). + assert!(is_int_producing_callee("s.parse::()")); + assert!(is_int_producing_callee("s.parse::()")); + assert!(is_int_producing_callee( + "contents.trim().parse::().expect(\"invalid pid\")" + )); + // Other languages' numeric converters remain unconditional. + assert!(is_int_producing_callee("parseInt")); + assert!(is_int_producing_callee("Number")); + assert!(is_int_producing_callee("Atoi")); + assert!(is_int_producing_callee("strconv.ParseInt")); + assert!(is_int_producing_callee("x.to_i")); + // Negative: a non-numeric-producing method must not be Int. + assert!(!is_int_producing_callee("toUpperCase")); + assert!(!is_int_producing_callee("trim")); + } } diff --git a/src/state/facts.rs b/src/state/facts.rs index a926f2a2..06e5aed5 100644 --- a/src/state/facts.rs +++ b/src/state/facts.rs @@ -407,8 +407,23 @@ pub fn extract_findings( if !lifecycle.contains(ResourceLifecycle::CLOSED) { continue; } - // Check if there are intervening Call nodes between acquire and release - // in the CFG (these could throw and bypass the release) + // Check if there are intervening Call nodes between acquire and + // release in the CFG (these could throw and bypass the release). + // + // NOTE: a stricter variant (audit #59) tried to exclude the + // resource's own lifecycle ops (the acquire/release proxy + // calls) and require reachability from the acquire node, to + // suppress spurious findings on correctly open/close-paired + // proxies. That over-suppressed a *tested* true positive: a + // class-field resource (`this.fd = fs.openSync(...)` in `open()` + // with `close()` in a separate method — see + // `tests/fixtures/real_world/typescript/cfg/try_catch_typed.ts`) + // has only its own acquire call in scope, so excluding it left + // zero intervening calls and dropped the must-match leak + // finding. Distinguishing a clean same-scope open/close pair + // from a cross-method field leak needs proper inter-method + // lifecycle modelling (deep-fix queue), so we keep the original + // span-based exclusion here. let has_intervening_calls = cfg.node_references().any(|(_, ni)| { ni.kind == StmtKind::Call && ni.ast.enclosing_func == info.ast.enclosing_func @@ -567,11 +582,23 @@ fn is_web_entrypoint_simple( _ => &["request", "req"], }; - let has_web_params = func_summaries.values().any(|s| { - s.param_names - .iter() - .any(|p| web_params.contains(&p.to_ascii_lowercase().as_str())) - }); + // Confirm web parameters against THIS candidate handler only, not any + // function in the file. Scanning every summary made an unrelated + // function's `req`/`r`/`ctx` parameter promote every + // `process_*`/`api_*`/`serve_*` function in the file to a web + // entrypoint, firing High-severity state-unauthed-access on batch/CLI + // code. Filter the file-level summary map down to the named function + // via `FuncKey.name` (matches `info.ast.enclosing_func`); summary + // `entry` NodeIndexes are not valid in the per-body CFG, so the name + // is the safe selector here. + let has_web_params = func_summaries + .iter() + .filter(|(key, _)| key.name == func_name) + .any(|(_, s)| { + s.param_names + .iter() + .any(|p| web_params.contains(&p.to_ascii_lowercase().as_str())) + }); // Only handle_* and route_* are strong enough to skip param confirmation. // api_*, serve_*, process_* require web parameter evidence. @@ -916,4 +943,68 @@ mod tests { ); assert_eq!(findings[0].rule_id, "state-resource-leak"); } + + /// Finding #64: `is_web_entrypoint_simple` must confirm web parameters + /// against the *candidate* handler only, not any function in the file. + /// Before the fix, an unrelated `read_stream(req)` in the same file + /// promoted every `process_*` function to a web entrypoint, firing + /// High-severity `state-unauthed-access` on batch/CLI code. + #[test] + fn web_entrypoint_param_confirmation_is_per_function() { + use crate::cfg::LocalFuncSummary; + use crate::symbol::FuncKey; + use petgraph::graph::NodeIndex; + + fn summary(name: &str, params: &[&str]) -> (FuncKey, LocalFuncSummary) { + let key = FuncKey::new_function(Lang::Python, "f.py", name, Some(params.len())); + let s = LocalFuncSummary { + entry: NodeIndex::new(0), + source_caps: Cap::empty(), + sanitizer_caps: Cap::empty(), + sink_caps: Cap::empty(), + param_count: params.len(), + param_names: params.iter().map(|p| p.to_string()).collect(), + propagating_params: Vec::new(), + tainted_sink_params: Vec::new(), + callees: Vec::new(), + container: String::new(), + disambig: None, + kind: crate::symbol::FuncKind::Function, + }; + (key, s) + } + + // `process_data` has NO web-like parameters; `read_stream` does. + let mut summaries: crate::cfg::FuncSummaries = HashMap::new(); + let (k1, s1) = summary("process_data", &["data"]); + let (k2, s2) = summary("read_stream", &["req"]); + summaries.insert(k1, s1); + summaries.insert(k2, s2); + + let cfg: Cfg = Graph::new(); + + // The unrelated `read_stream(req)` must NOT promote `process_data`. + assert!( + !is_web_entrypoint_simple("process_data", Lang::Python, &summaries, &cfg), + "process_data has no web params and must not be a web entrypoint just \ + because an unrelated function in the file does" + ); + + // `read_stream` is not a handler-prefixed name, so even though it + // carries the `req` param it is NOT an entrypoint — confirms the + // name gate still stands independently of the param check. + assert!( + !is_web_entrypoint_simple("read_stream", Lang::Python, &summaries, &cfg), + "read_stream lacks a handler-prefixed name and must not be an entrypoint" + ); + + // Positive control: give `process_data` its own `req` param. + let mut summaries2: crate::cfg::FuncSummaries = HashMap::new(); + let (k3, s3) = summary("process_data", &["req"]); + summaries2.insert(k3, s3); + assert!( + is_web_entrypoint_simple("process_data", Lang::Python, &summaries2, &cfg), + "process_data with its own web param must be a web entrypoint" + ); + } } diff --git a/src/summary/mod.rs b/src/summary/mod.rs index b919953c..fe517821 100644 --- a/src/summary/mod.rs +++ b/src/summary/mod.rs @@ -1641,13 +1641,34 @@ impl GlobalSummaries { return CalleeResolution::NotFound; } - let arity_filtered: Vec<&FuncKey> = all_candidates + let mut arity_filtered: Vec<&FuncKey> = all_candidates .iter() .copied() .filter(|k| arity_matches(k)) .collect(); if arity_filtered.is_empty() { - return CalleeResolution::NotFound; + // Exact-arity match found nothing. Tolerate under-application: + // `FuncKey::arity` is the *total* parameter count, so a call + // supplying `a` arguments to a function declared with more + // parameters (the surplus being default-valued / optional) is a + // routine, valid call shape in Python, JS/TS, PHP, and Ruby. Retry + // with `param_count >= call_arity` so these calls still resolve. + // + // This only ever WIDENS the candidate set, and resolution below + // still requires a unique candidate, so a genuinely ambiguous name + // degrades to `Ambiguous`, never a wrong `Resolved`. Exact-arity + // matches always take precedence (this branch runs only when none + // exist), so no existing exact-match resolution regresses. + if let Some(a) = q.arity { + arity_filtered = all_candidates + .iter() + .copied() + .filter(|k| matches!(k.arity, Some(p) if p >= a)) + .collect(); + } + if arity_filtered.is_empty() { + return CalleeResolution::NotFound; + } } let same_ns: Vec<&FuncKey> = arity_filtered @@ -2033,14 +2054,42 @@ fn synthesize_ssa_disambig(summary: &SsaFuncSummary) -> u32 { /// Merging only happens for exact `FuncKey` matches (same lang + namespace + /// name + arity). Functions with the same bare name but different languages /// or namespaces are stored separately. +/// +/// This variant keys summaries via the plain [`FuncSummary::func_key`] +/// (`normalize_namespace`), so it is only safe for repos with no +/// package boundaries. The indexed scan path must use +/// [`merge_summaries_with_resolver`] so the loaded coarse summaries are +/// keyed with the same package-qualified namespaces that pass-1 SSA +/// summaries, cross-package import maps, and pass-2 refinements use. pub fn merge_summaries( per_file: impl IntoIterator, scan_root: Option<&str>, +) -> GlobalSummaries { + merge_summaries_with_resolver(per_file, scan_root, None) +} + +/// Module-graph-aware variant of [`merge_summaries`]. +/// +/// Keys each summary via [`FuncSummary::func_key_with_resolver`], so a +/// file inside a discovered package gets a package-qualified namespace +/// (`"@scope/name::src/file.ts"`) instead of the plain +/// `normalize_namespace` form. This must match the keying convention +/// used by pass-1 SSA summaries (`namespace_with_package`) and pass-2 +/// topo refinements (`func_key_with_resolver`); otherwise exact-key +/// joins between the coarse FuncSummary tier and the SSA tier miss, and +/// same-namespace narrowing in [`GlobalSummaries::resolve_callee`] never +/// matches the package-qualified caller namespace. +/// +/// Passing `module_graph: None` is equivalent to [`merge_summaries`]. +pub fn merge_summaries_with_resolver( + per_file: impl IntoIterator, + scan_root: Option<&str>, + module_graph: Option<&crate::resolve::ModuleGraph>, ) -> GlobalSummaries { let mut map = GlobalSummaries::new(); for fs in per_file { - let key = fs.func_key(scan_root); + let key = fs.func_key_with_resolver(scan_root, module_graph); map.insert(key, fs); } @@ -2049,3 +2098,127 @@ pub fn merge_summaries( #[cfg(test)] mod tests; + +#[cfg(test)] +mod arity_leniency_tests { + use super::*; + use crate::symbol::{FuncKey, Lang}; + + fn py_func(name: &str, namespace: &str, param_count: usize) -> (FuncKey, FuncSummary) { + let key = FuncKey { + lang: Lang::Python, + namespace: namespace.into(), + name: name.into(), + arity: Some(param_count), + ..Default::default() + }; + let summary = FuncSummary { + name: name.into(), + file_path: namespace.into(), + lang: "python".into(), + param_count, + ..Default::default() + }; + (key, summary) + } + + /// `run_cmd(cmd, opts=None)` (param_count 2) called as `run_cmd(user)` + /// (arity 1) must resolve via the under-application tolerance, not fall + /// through to NotFound. + #[test] + fn under_application_resolves_unique_defaulted_callee() { + let mut gs = GlobalSummaries::new(); + let (key, summary) = py_func("run_cmd", "helper.py", 2); + gs.insert(key.clone(), summary); + + let resolved = gs.resolve_callee(&CalleeQuery { + name: "run_cmd", + caller_lang: Lang::Python, + caller_namespace: "routes.py", + caller_container: None, + receiver_type: None, + namespace_qualifier: None, + receiver_var: None, + arity: Some(1), + }); + assert_eq!( + resolved, + CalleeResolution::Resolved(key), + "under-applied unique callee with default params must resolve" + ); + } + + /// Exact-arity matches still take precedence: when both an exact-arity and + /// a higher-arity candidate exist, the exact one wins (no regression). + #[test] + fn exact_arity_match_preferred_over_lenient() { + let mut gs = GlobalSummaries::new(); + let (exact_key, exact_sum) = py_func("run_cmd", "a.py", 1); + let (wide_key, wide_sum) = py_func("run_cmd", "b.py", 3); + gs.insert(exact_key.clone(), exact_sum); + gs.insert(wide_key, wide_sum); + + let resolved = gs.resolve_callee(&CalleeQuery { + name: "run_cmd", + caller_lang: Lang::Python, + caller_namespace: "routes.py", + caller_container: None, + receiver_type: None, + namespace_qualifier: None, + receiver_var: None, + arity: Some(1), + }); + // The exact-arity candidate (a.py, arity 1) is the sole exact match, + // so the lenient branch never runs and resolution is unambiguous. + assert_eq!(resolved, CalleeResolution::Resolved(exact_key)); + } + + /// Leniency only widens the candidate set; it never produces a wrong + /// Resolved. Two distinct higher-arity callees both tolerating the call + /// arity must degrade to Ambiguous, not a silent pick. + #[test] + fn under_application_ambiguous_when_multiple_candidates() { + let mut gs = GlobalSummaries::new(); + let (k1, s1) = py_func("run_cmd", "a.py", 2); + let (k2, s2) = py_func("run_cmd", "b.py", 3); + gs.insert(k1, s1); + gs.insert(k2, s2); + + let resolved = gs.resolve_callee(&CalleeQuery { + name: "run_cmd", + caller_lang: Lang::Python, + caller_namespace: "routes.py", + caller_container: None, + receiver_type: None, + namespace_qualifier: None, + receiver_var: None, + arity: Some(1), + }); + assert!( + matches!(resolved, CalleeResolution::Ambiguous(_)), + "two under-applied candidates must be Ambiguous, not a wrong Resolved: {resolved:?}" + ); + } + + /// Over-application (more args than params) must NOT be tolerated: the + /// lenient predicate is `param_count >= call_arity`, so a 1-param function + /// called with 2 args still returns NotFound. + #[test] + fn over_application_not_tolerated() { + let mut gs = GlobalSummaries::new(); + let (key, summary) = py_func("run_cmd", "helper.py", 1); + gs.insert(key, summary); + + let resolved = gs.resolve_callee(&CalleeQuery { + name: "run_cmd", + caller_lang: Lang::Python, + caller_namespace: "routes.py", + caller_container: None, + receiver_type: None, + namespace_qualifier: None, + receiver_var: None, + arity: Some(2), + }); + assert_eq!(resolved, CalleeResolution::NotFound); + } +} diff --git a/src/symex/executor.rs b/src/symex/executor.rs index b9e6d711..f67811ee 100644 --- a/src/symex/executor.rs +++ b/src/symex/executor.rs @@ -401,7 +401,9 @@ fn run_path( // Global step budget if *total_steps >= MAX_TOTAL_STEPS { *search_exhausted = false; - return Some(record_outcome(state, finding, ssa, cfg)); + // Budget cut mid-walk: constraints beyond this point are unchecked, + // so feasibility is unproven → Inconclusive, not Confirmed. + return Some(record_cutoff(state, finding, ssa, cfg)); } let block_id = state.current_block; @@ -454,8 +456,9 @@ fn run_path( continue; } } - // Stuck (infinite loop / nested loops with no exit) - return Some(record_outcome(state, finding, ssa, cfg)); + // Stuck (infinite loop / nested loops with no exit): the path never + // reached a terminal, so feasibility is unproven → Inconclusive. + return Some(record_cutoff(state, finding, ssa, cfg)); } // Move exception context into sym_state before block transfer @@ -1114,6 +1117,30 @@ fn record_outcome( } } +/// Record the outcome for a path cut short by a budget/loop cutoff. +/// +/// Unlike [`record_outcome`], the path did NOT reach a normal terminal: it was +/// abandoned mid-walk (global step budget exhausted, or a loop with no +/// reachable exit). Constraints beyond the cutoff point were never checked, so +/// feasibility is unproven and the verdict must be `Inconclusive` (which +/// contributes 0 to confidence), not `Confirmed`. A best-effort witness is +/// still extracted for diagnostics. Mirrors `analyse_finding_path`, which +/// already returns `Inconclusive` for the over-budget (`>MAX_PATH_BLOCKS`) +/// case. +fn record_cutoff( + state: &ExplorationState, + finding: &Finding, + ssa: &SsaBody, + cfg: &Cfg, +) -> PathOutcome { + let witness = try_extract_witness(state, finding, ssa, cfg); + PathOutcome { + verdict: Verdict::Inconclusive, + constraints_checked: state.constraints_checked, + witness, + } +} + /// Best-effort witness extraction from the current symbolic state. /// /// Used by both `record_outcome` (Confirmed paths) and inconclusive exits @@ -1516,6 +1543,77 @@ mod tests { assert_eq!(v.verdict, Verdict::Inconclusive); } + #[test] + fn record_cutoff_is_inconclusive_not_confirmed() { + // A path abandoned at a budget/loop cutoff has unchecked constraints + // beyond the cutoff point, so its outcome must be Inconclusive (which + // contributes 0 to confidence), NOT Confirmed. record_outcome (used + // only at genuine terminal states) still yields Confirmed. + let n0 = NodeIndex::new(0); + let n1 = NodeIndex::new(1); + let b0 = BlockId(0); + let b1 = BlockId(1); + let ssa = SsaBody { + blocks: vec![ + SsaBlock { + id: b0, + phis: vec![], + body: vec![], + terminator: Terminator::Goto(b1), + preds: smallvec![], + succs: smallvec![b1], + }, + SsaBlock { + id: b1, + phis: vec![], + body: vec![], + terminator: Terminator::Return(None), + preds: smallvec![b0], + succs: smallvec![], + }, + ], + entry: b0, + value_defs: vec![make_value_def(b0, n0), make_value_def(b1, n1)], + cfg_node_map: HashMap::new(), + exception_edges: vec![], + field_interner: crate::ssa::ir::FieldInterner::default(), + field_writes: std::collections::HashMap::new(), + synthetic_externals: std::collections::HashSet::new(), + slot_scoped_assigns: std::collections::HashSet::new(), + }; + let cfg = Cfg::new(); + let finding = make_finding(n0, n1); + let state = ExplorationState { + sym_state: SymbolicState::new(), + env: constraint::PathEnv::empty(), + current_block: b0, + predecessor: None, + forks_used: 0, + steps_taken: 0, + constraints_checked: 3, + visit_counts: HashMap::new(), + exception_context: None, + }; + + let cutoff = record_cutoff(&state, &finding, &ssa, &cfg); + assert_eq!(cutoff.verdict, Verdict::Inconclusive); + assert_eq!(cutoff.constraints_checked, 3); + + let terminal = record_outcome(&state, &finding, &ssa, &cfg); + assert_eq!(terminal.verdict, Verdict::Confirmed); + + // A lone budget-cut path must NOT aggregate to Confirmed. + let result = ExplorationResult { + paths_completed: vec![cutoff], + paths_pruned: 1, + total_steps: MAX_TOTAL_STEPS, + search_exhausted: false, + interproc_findings: Vec::new(), + interproc_cutoffs: Vec::new(), + }; + assert_eq!(result.aggregate_verdict().verdict, Verdict::Inconclusive); + } + #[test] fn aggregate_empty_is_inconclusive() { let result = ExplorationResult { diff --git a/src/symex/interproc.rs b/src/symex/interproc.rs index 015985a6..149d7ecf 100644 --- a/src/symex/interproc.rs +++ b/src/symex/interproc.rs @@ -268,6 +268,38 @@ pub struct InterprocCtx<'a> { pub caller_namespace: &'a str, } +impl<'a> InterprocCtx<'a> { + /// Build a child context for resolving calls *inside* a frame. + /// + /// All shared state (budgets, caches, reentry counts) is carried by + /// reference, so the child observes the same interior-mutable counters. + /// When `descended_cross_file` is true the frame was itself reached by a + /// cross-file body resolution, so `cross_file_depth` is bumped to enforce + /// `MAX_CROSS_FILE_DEPTH` on any further cross-file descents. Without this + /// the depth never increments and the guard at `execute_callee` is dead + /// (always `0 >= 1 == false`), allowing cross-file bodies to keep resolving + /// further cross-file bodies up to `max_depth` instead of the intended one + /// level. + fn child_for_nested(&self, descended_cross_file: bool) -> InterprocCtx<'a> { + InterprocCtx { + callee_bodies: self.callee_bodies, + cfg: self.cfg, + lang: self.lang, + max_depth: self.max_depth, + budget: self.budget, + cache: self.cache, + reentry_counts: self.reentry_counts, + max_reentry_per_func: self.max_reentry_per_func, + scc_membership: self.scc_membership, + max_scc_reentry: self.max_scc_reentry, + stats: self.stats, + cross_file_bodies: self.cross_file_bodies, + cross_file_depth: self.cross_file_depth + usize::from(descended_cross_file), + caller_namespace: self.caller_namespace, + } + } +} + /// Budget counters shared across all interprocedural frames for one finding. #[derive(Clone, Copy, Debug)] pub struct InterprocBudget { @@ -728,16 +760,48 @@ pub fn execute_callee( let mut initial_state = SymbolicState::new(); initial_state.seed_from_const_values(&body.opt.const_values); - // Seed parameters: walk callee SSA for Param instructions + // Seed parameters: walk callee SSA for Param / SelfParam instructions. + // + // The caller (`transfer.rs` Call arm and `handle_nested_calls`) PREPENDS + // the method receiver into `arg_values` at index 0 whenever the call has a + // receiver. SSA lowering (`src/ssa/lower.rs`) emits that receiver as + // `SsaOp::SelfParam` and assigns `Param { index }` positions starting at 0 + // to the non-receiver formals only. So when the callee body is a method + // (has a `SelfParam`), the positional formal `Param{index}` must be seeded + // from `arg_values[index + 1]` (skipping the receiver at slot 0), and the + // `SelfParam` itself from `arg_values[0]`. Free functions (no `SelfParam`) + // map `Param{index}` directly to `arg_values[index]`. This matches the + // taint engine's inline path, which builds `param_seed` from non-receiver + // args and carries receiver taint on a separate `receiver_seed` channel + // consumed by `SelfParam` (`src/taint/ssa_transfer/mod.rs`). + let has_self_param = body + .ssa + .blocks + .iter() + .flat_map(|block| block.phis.iter().chain(block.body.iter())) + .any(|inst| matches!(inst.op, SsaOp::SelfParam)); + let param_offset = if has_self_param { 1 } else { 0 }; for block in &body.ssa.blocks { for inst in block.phis.iter().chain(block.body.iter()) { - if let SsaOp::Param { index } = &inst.op { - if let Some((_, sym, tainted)) = arg_values.get(*index) { - initial_state.set(inst.value, sym.clone()); - if *tainted { - initial_state.mark_tainted(inst.value); + match &inst.op { + SsaOp::Param { index } => { + if let Some((_, sym, tainted)) = arg_values.get(*index + param_offset) { + initial_state.set(inst.value, sym.clone()); + if *tainted { + initial_state.mark_tainted(inst.value); + } } } + SsaOp::SelfParam => { + // Receiver was prepended at slot 0 by the caller. + if let Some((_, sym, tainted)) = arg_values.first() { + initial_state.set(inst.value, sym.clone()); + if *tainted { + initial_state.mark_tainted(inst.value); + } + } + } + _ => {} } } } @@ -751,6 +815,12 @@ pub fn execute_callee( frame_chain.push(normalized.to_string()); // ─── Work-queue exploration (intra-callee forking) ──────── + // + // Nested calls executed inside this frame go through a child context whose + // `cross_file_depth` is bumped when this frame was itself reached via a + // cross-file body. This keeps `MAX_CROSS_FILE_DEPTH` enforced across nested + // descents (the shared `ctx` would otherwise never increment the field). + let nested_ctx = ctx.child_for_nested(is_cross_file); let mut exit_states: Vec = Vec::new(); let mut internal_findings: Vec = Vec::new(); @@ -853,7 +923,7 @@ pub fn execute_callee( // Handle nested calls handle_nested_calls( block, - ctx, + &nested_ctx, &mut path.sym_state, depth, &frame_chain, @@ -1562,4 +1632,46 @@ mod tests { assert_eq!(stats.cutoffs, 0); assert_eq!(stats.forks, 0); } + + #[test] + fn child_for_nested_bumps_cross_file_depth_only_on_descent() { + // Backing state for the borrowed-by-reference InterprocCtx fields. + let bodies: HashMap = HashMap::new(); + let cfg = Cfg::new(); + let budget = Cell::new(InterprocBudget::new()); + let cache = RefCell::new(InterprocCache::new()); + let reentry = RefCell::new(HashMap::new()); + let stats = Cell::new(InterprocStats::default()); + + let ctx = InterprocCtx { + callee_bodies: &bodies, + cfg: &cfg, + lang: Lang::JavaScript, + max_depth: DEFAULT_MAX_DEPTH, + budget: &budget, + cache: &cache, + reentry_counts: &reentry, + max_reentry_per_func: DEFAULT_MAX_REENTRY_PER_FUNC, + scc_membership: None, + max_scc_reentry: DEFAULT_MAX_SCC_REENTRY, + stats: &stats, + cross_file_bodies: None, + cross_file_depth: 0, + caller_namespace: "test.js", + }; + + // Intra-file descent leaves cross_file_depth untouched. + let intra = ctx.child_for_nested(false); + assert_eq!(intra.cross_file_depth, 0); + + // A cross-file descent bumps the depth so the guard at execute_callee + // (>= MAX_CROSS_FILE_DEPTH) trips on the next cross-file step. + let xfile = ctx.child_for_nested(true); + assert_eq!(xfile.cross_file_depth, 1); + assert!(xfile.cross_file_depth >= MAX_CROSS_FILE_DEPTH); + + // Nested cross-file from an already-cross-file frame keeps climbing. + let xfile2 = xfile.child_for_nested(true); + assert_eq!(xfile2.cross_file_depth, 2); + } } diff --git a/src/taint/backwards.rs b/src/taint/backwards.rs index d505dc6a..ce2e02e7 100644 --- a/src/taint/backwards.rs +++ b/src/taint/backwards.rs @@ -7,20 +7,32 @@ //! //! This module implements the opposite direction: start at each sink value, //! walk *reverse* SSA edges and (when needed) cross-file callee bodies on -//! demand, and emit a [`BackwardFlow`] when a source is reached or an -//! accumulated path predicate proves the flow infeasible. +//! demand, and emit a [`BackwardFlow`] when a source is reached. //! //! The analysis is additive: //! //! * When a forward finding's sink is confirmed by a backwards walk that //! reaches a matching source, we append `backwards-confirmed` to the -//! finding's evidence notes. -//! * When the backwards walk proves the flow infeasible via accumulated -//! path predicates, we append `backwards-infeasible`, consumed by the -//! confidence scorer as a cap-to-Low signal. -//! * Backward flows that reach a source with no matching forward finding -//! become standalone `taint-backwards-flow` diags (a separate rule id so -//! existing graders can distinguish the two channels). +//! finding's evidence notes. **This is the only channel currently active.** +//! +//! ## Reserved / not-yet-implemented channels +//! +//! The scaffolding below exists but does not fire in production; it is kept so +//! the downstream consumers (confidence scorer, [`FindingVerdict`]) have a +//! stable shape to grow into. Do not build new behaviour on the assumption +//! that these signals can be produced by a real walk: +//! +//! * **Infeasibility.** [`DemandState::validated_true`] / +//! [`DemandState::validated_false`] are never written by the transfer (no +//! reverse predicate accumulation is implemented yet), so +//! [`BackwardFlow::infeasible`] is never set, [`aggregate_verdict`] never +//! returns [`FindingVerdict::Infeasible`], and the `backwards-infeasible` +//! note is never appended. Implementing this requires reading branch +//! [`crate::taint::domain::PredicateSummary`] bits off the reverse-dominating +//! conditions, which is future work. +//! * **Standalone diags.** Backward flows that reach a source with no matching +//! forward finding are *not* yet surfaced as standalone `taint-backwards-flow` +//! diags; that rule id is reserved. //! //! The feature is gated by //! [`crate::utils::analysis_options::AnalysisOptions::backwards_analysis`] @@ -76,8 +88,15 @@ pub struct DemandState { /// [`crate::taint::domain::predicate_kind_bit`]; bit `i` set means the /// corresponding `PredicateKind` was observed as holding on every /// predecessor visited so far. + /// + /// **Reserved / not yet written.** The current backward transfer does not + /// accumulate reverse path predicates, so this field is always 0 in + /// production. See the module-level "Reserved / not-yet-implemented + /// channels" note. pub validated_true: u8, /// Counterpart to [`Self::validated_true`] for known-false predicates. + /// + /// **Reserved / not yet written** (always 0 in production). pub validated_false: u8, /// Number of cross-function inline expansions performed along this walk. pub depth: u32, @@ -111,6 +130,10 @@ pub struct BackwardFlow { pub source_node: Option, /// Set when the accumulated predicates proved the flow infeasible before /// reaching any source. + /// + /// **Reserved / never set in production.** Reverse predicate accumulation + /// is not yet implemented (see [`DemandState::validated_true`] and the + /// module-level note), so every production flow leaves this `false`. pub infeasible: bool, /// Set when the walk hit [`BACKWARDS_VALUE_BUDGET`] without terminating. pub budget_exhausted: bool, @@ -517,11 +540,25 @@ fn walk_dfs( // the key in the matcher, the key is useful for debug // logging in bigger expansions. let _ = callee_key; - return; + // Intentionally fall through to the conservative arg/receiver + // fanout below. Walking only the callee's return chains is + // strictly lossier than not resolving the callee at all: a + // passthrough `return param` inside the callee bottoms out at + // a `ReachedParam` terminal that is never mapped back to the + // caller's argument, so `sink(identity(tainted))` would record + // a dead-end param terminal and miss the source, whereas an + // unresolvable callee would have taken the fanout and reached + // it. Running the fanout in addition keeps callee expansion + // additive (it can only add confirmations, never drop them); + // `aggregate_verdict` only needs one confirmation, and the + // shared `visited` set prevents re-expanding values the callee + // walk already covered in the caller frame. } - // Fall-through: no resolvable body. Conservatively fan out to - // every operand / receiver so a source reachable through the - // call arguments is still observed. + // Conservatively fan out to every operand / receiver so a source + // reachable through the call arguments is still observed. Runs + // for both the resolvable-callee case (after return-chain + // expansion, to recover argument-passthrough flows) and the + // unresolvable case (sole channel). for (operand, next_demand) in next { chain.push(operand); walk_dfs( @@ -564,18 +601,55 @@ fn resolve_callee_body<'a>( .rsplit('.') .next() .unwrap_or(callee); - if let Some(map) = ctx.intra_file_bodies { + + // Pick the deterministic, language-matched best candidate among same-leaf + // entries. Bare-leaf matching over a `HashMap` is both unsound (a + // same-name sibling from another class/file/language can be picked, the + // exact hazard the forward inline path refuses bare-name lookup for) and + // nondeterministic (HashMap iteration order varies run to run, so the + // resolved body, hence the `backwards-confirmed` note and the confidence + // score, would differ across identical runs). We cannot reconstruct the + // call-site's container/arity here (only the textual callee is in hand), + // but we can at least exclude cross-language siblings and fix a stable + // tie-break so resolution is reproducible. + fn best_candidate<'b>( + map: &'b HashMap, + leaf: &str, + lang: Lang, + ) -> Option<(&'b CalleeSsaBody, FuncKey)> { + let mut best: Option<(&FuncKey, &CalleeSsaBody)> = None; for (key, body) in map.iter() { - if key.name == leaf && body.ssa.blocks.len() <= MAX_BACKWARDS_CALLEE_BLOCKS { - return Some((body, key.clone())); + if key.name != leaf + || key.lang != lang + || body.ssa.blocks.len() > MAX_BACKWARDS_CALLEE_BLOCKS + { + continue; } + let better = match best { + None => true, + // Deterministic tie-break: smallest by the structural fields + // that distinguish same-leaf siblings. Independent of + // HashMap iteration order. + Some((bk, _)) => { + (&key.namespace, &key.container, &key.arity, &key.disambig) + < (&bk.namespace, &bk.container, &bk.arity, &bk.disambig) + } + }; + if better { + best = Some((key, body)); + } + } + best.map(|(k, b)| (b, k.clone())) + } + + if let Some(map) = ctx.intra_file_bodies { + if let Some(found) = best_candidate(map, leaf, ctx.lang) { + return Some(found); } } if let Some(map) = ctx.global_summaries.and_then(|gs| gs.bodies_by_key()) { - for (key, body) in map.iter() { - if key.name == leaf && body.ssa.blocks.len() <= MAX_BACKWARDS_CALLEE_BLOCKS { - return Some((body, key.clone())); - } + if let Some(found) = best_candidate(map, leaf, ctx.lang) { + return Some(found); } } None @@ -1189,4 +1263,211 @@ mod tests { }; assert!(!bf.is_confirmation()); } + + /// Build a passthrough callee body `fn identity(p0) { return p0 }` wrapped + /// in a [`CalleeSsaBody`] ready for [`BackwardsCtx::intra_file_bodies`]. + fn build_passthrough_callee() -> CalleeSsaBody { + let mut cfg: Graph = Graph::new(); + let p_node = cfg.add_node(NodeInfo::default()); + + let mut ssa = SsaBody { + blocks: vec![SsaBlock { + id: BlockId(0), + phis: vec![], + body: vec![SsaInst { + value: SsaValue(0), + op: SsaOp::Param { index: 0 }, + cfg_node: p_node, + var_name: None, + span: (0, 0), + }], + terminator: Terminator::Return(Some(SsaValue(0))), + preds: SmallVec::new(), + succs: SmallVec::new(), + }], + entry: BlockId(0), + value_defs: vec![make_value_def(BlockId(0), p_node)], + cfg_node_map: std::collections::HashMap::new(), + exception_edges: Vec::new(), + field_interner: crate::ssa::ir::FieldInterner::default(), + field_writes: std::collections::HashMap::new(), + synthetic_externals: std::collections::HashSet::new(), + slot_scoped_assigns: std::collections::HashSet::new(), + }; + let opt = crate::ssa::optimize_ssa(&mut ssa, &cfg, Some(Lang::JavaScript)); + CalleeSsaBody { + ssa, + opt, + param_count: 1, + node_meta: std::collections::HashMap::new(), + body_graph: Some(cfg), + cross_package_imports: std::sync::Arc::new(std::collections::HashMap::new()), + } + } + + /// Resolving a passthrough callee (`sink(identity(tainted))`) must not be + /// strictly lossier than leaving it unresolved. The callee's + /// `return param` chain bottoms out at a `ReachedParam` terminal that the + /// walk cannot map back to the caller's argument, so without the + /// post-expansion conservative arg fanout the source is missed entirely. + /// After the fix, the fanout walks the call argument (the `Source`) and the + /// flow is confirmed. + #[test] + fn resolved_passthrough_callee_still_confirms_via_arg_fanout() { + // Caller body: v0 = Source; v1 = identity(v0); return v1. + let mut cfg: Graph = Graph::new(); + let src_node = cfg.add_node(NodeInfo::default()); + let call_node = cfg.add_node(NodeInfo::default()); + + let ssa = SsaBody { + blocks: vec![SsaBlock { + id: BlockId(0), + phis: vec![], + body: vec![ + SsaInst { + value: SsaValue(0), + op: SsaOp::Source, + cfg_node: src_node, + var_name: None, + span: (0, 0), + }, + SsaInst { + value: SsaValue(1), + op: SsaOp::Call { + callee: "identity".to_string(), + callee_text: None, + args: vec![smallvec![SsaValue(0)]], + receiver: None, + }, + cfg_node: call_node, + var_name: None, + span: (0, 0), + }, + ], + terminator: Terminator::Return(Some(SsaValue(1))), + preds: SmallVec::new(), + succs: SmallVec::new(), + }], + entry: BlockId(0), + value_defs: vec![ + make_value_def(BlockId(0), src_node), + make_value_def(BlockId(0), call_node), + ], + cfg_node_map: std::collections::HashMap::new(), + exception_edges: Vec::new(), + field_interner: crate::ssa::ir::FieldInterner::default(), + field_writes: std::collections::HashMap::new(), + synthetic_externals: std::collections::HashSet::new(), + slot_scoped_assigns: std::collections::HashSet::new(), + }; + + let mut bodies: HashMap = HashMap::new(); + bodies.insert( + FuncKey::new_function(Lang::JavaScript, "test.js", "identity", Some(1)), + build_passthrough_callee(), + ); + + let ctx = BackwardsCtx { + ssa: &ssa, + cfg: &cfg, + lang: Lang::JavaScript, + global_summaries: None, + intra_file_bodies: Some(&bodies), + depth_budget: DEFAULT_BACKWARDS_DEPTH, + }; + let flows = analyse_sink_backwards(&ctx, SsaValue(1), call_node, Cap::all()); + assert!( + flows.iter().any(|f| f.is_confirmation()), + "argument-passthrough source must still be confirmed after callee expansion" + ); + } + + /// `resolve_callee_body` must only match same-language siblings and pick a + /// deterministic candidate, never an arbitrary HashMap-order entry from + /// another language. + #[test] + fn resolve_callee_body_filters_language_and_is_deterministic() { + let mut bodies: HashMap = HashMap::new(); + // Two same-leaf siblings: one in the analysed language, one not. + bodies.insert( + FuncKey::new_function(Lang::Python, "other.py", "process", Some(1)), + build_passthrough_callee(), + ); + bodies.insert( + FuncKey::new_function(Lang::JavaScript, "a.js", "process", Some(1)), + build_passthrough_callee(), + ); + bodies.insert( + FuncKey::new_function(Lang::JavaScript, "b.js", "process", Some(1)), + build_passthrough_callee(), + ); + + let dummy_cfg: Graph = Graph::new(); + let dummy_ssa = SsaBody { + blocks: vec![], + entry: BlockId(0), + value_defs: vec![], + cfg_node_map: std::collections::HashMap::new(), + exception_edges: Vec::new(), + field_interner: crate::ssa::ir::FieldInterner::default(), + field_writes: std::collections::HashMap::new(), + synthetic_externals: std::collections::HashSet::new(), + slot_scoped_assigns: std::collections::HashSet::new(), + }; + let ctx = BackwardsCtx { + ssa: &dummy_ssa, + cfg: &dummy_cfg, + lang: Lang::JavaScript, + global_summaries: None, + intra_file_bodies: Some(&bodies), + depth_budget: DEFAULT_BACKWARDS_DEPTH, + }; + + // Deterministic across repeated resolutions, and never the Python body. + let first = resolve_callee_body(&ctx, "process", 0) + .map(|(_, k)| k) + .expect("a JS sibling must resolve"); + assert_eq!( + first.lang, + Lang::JavaScript, + "must not match cross-language" + ); + // Namespace tie-break is the smallest: "a.js" < "b.js". + assert_eq!(first.namespace, "a.js"); + for _ in 0..32 { + let again = resolve_callee_body(&ctx, "process", 0) + .map(|(_, k)| k) + .expect("stable resolution"); + assert_eq!(again, first, "resolution must be deterministic across runs"); + } + } + + /// Reserved-channel invariant (documented in the module header): the + /// backwards transfer never writes the predicate-accumulation bits, so a + /// real walk can never set `infeasible`. Guards against the dead + /// infeasibility channel silently coming alive (or the docs drifting from + /// the implementation) without a corresponding transfer change. + #[test] + fn infeasibility_channel_is_inert() { + // backward_transfer leaves the demand's predicate bits untouched. + let (ssa, _cfg) = build_trivial_source_body(); + let demand = DemandState::new(Cap::all()); + let (_step, next) = backward_transfer(&ssa, SsaValue(1), &demand); + for (_v, d) in &next { + assert_eq!(d.validated_true, 0, "validated_true must never be written"); + assert_eq!( + d.validated_false, 0, + "validated_false must never be written" + ); + } + + // A full driver walk over a source→sink body produces a confirmation, + // never an infeasible flow. + let ctx = BackwardsCtx::new(&ssa, &_cfg, Lang::JavaScript); + let flows = analyse_sink_backwards(&ctx, SsaValue(1), NodeIndex::new(1), Cap::all()); + assert!( + !flows.iter().any(|f| f.infeasible), + "no production flow may set infeasible (channel is reserved)" + ); + } } diff --git a/src/taint/mod.rs b/src/taint/mod.rs index 1d6da23b..65266afa 100644 --- a/src/taint/mod.rs +++ b/src/taint/mod.rs @@ -822,6 +822,51 @@ fn containment_order(bodies: &[BodyCfg]) -> Vec { order } +/// Build the lexical-scope seed map for a non-toplevel body. +/// +/// A body's `global_seed` ancestor lookup +/// ([`ssa_transfer`] `SsaOp::Param` handling) reads its captured names in +/// ancestor order: the direct `parent_body_id` first, then `BodyId(0)` to +/// pick up the JS/TS pass-2 re-keyed top-level globals (see +/// [`ssa_transfer::filter_seed_to_toplevel`]). But `body_exit_states` +/// keys every exit entry under the owning body's id, so a parent's exit +/// map never contains `BodyId(0)` entries. For a body nested two or more +/// levels deep (its parent is itself a non-toplevel body) the `BodyId(0)` +/// leg of the ancestor lookup would therefore find nothing — a global +/// written by a sibling top-level function and read inside a nested +/// closure would be silently missed. +/// +/// This helper restores that leg: when `parent_id` is a *non*-toplevel +/// body, it returns a merged map = parent's exit ∪ the `BodyId(0)` +/// top-level seed (the converged globals, already keyed under +/// `BodyId(0)`). When the parent *is* the top level (`BodyId(0)`), the +/// parent's exit already carries the `BodyId(0)` entries, so the borrowed +/// parent map is returned unchanged with no clone. +fn seed_for_nested_body<'a>( + parent_id: BodyId, + body_exit_states: &'a HashMap< + BodyId, + HashMap, + >, +) -> Option>> +{ + let parent_exit = body_exit_states.get(&parent_id); + if parent_id == BodyId(0) { + // Depth-1 body: the parent IS the top level, so its exit already + // carries the BodyId(0) entries the ancestor lookup wants. + return parent_exit.map(std::borrow::Cow::Borrowed); + } + // Depth ≥ 2 body: merge the direct parent's exit with the converged + // top-level seed so the ancestor lookup's BodyId(0) leg is live. + let toplevel_seed = body_exit_states.get(&BodyId(0)); + match (parent_exit, toplevel_seed) { + (Some(p), Some(t)) => Some(std::borrow::Cow::Owned(ssa_transfer::join_seed_maps(p, t))), + (Some(p), None) => Some(std::borrow::Cow::Borrowed(p)), + (None, Some(t)) => Some(std::borrow::Cow::Borrowed(t)), + (None, None) => None, + } +} + /// Build a `var_name → TypeKind` map from a body's optimised SSA + type-fact /// result. Used by [`analyse_multi_body`] to forward closure-captured types /// from a parent body into its children, so that bound-variable receiver @@ -1455,11 +1500,16 @@ fn analyse_multi_body( // ── Pass 1: lexical containment propagation ────────────────────── for &idx in &order { let body = &file_cfg.bodies[idx]; - // Determine seed from parent body's exit state. - let parent_seed = body + // Determine seed from parent body's exit state. For bodies nested + // two or more levels deep, merge in the top-level (`BodyId(0)`) + // seed so the `global_seed` ancestor lookup's `BodyId(0)` leg can + // reach globals written by sibling top-level functions (see + // [`seed_for_nested_body`]). + let parent_seed_owned = body .meta .parent_body_id - .and_then(|pid| body_exit_states.get(&pid)); + .and_then(|pid| seed_for_nested_body(pid, &body_exit_states)); + let parent_seed = parent_seed_owned.as_deref(); let parent_var_types = body .meta .parent_body_id @@ -1648,15 +1698,40 @@ fn analyse_multi_body( // changed, so re-analysis would produce byte-identical // output. The cached findings from the previous // round (or pass-1) remain correct. - if let Some(reads) = body_reads.get(&body.meta.id) { - if reads.is_disjoint(&changed_names) { - continue; + // + // Restricted to depth-1 bodies (direct children of the top + // level). `changed_names` is derived purely from the + // *top-level* seed delta, and `body_reads` is the body's + // own `info.taint.uses` names. For a depth-1 body the + // entire inbound channel is the top-level seed, so the + // disjointness test is sound. A body nested two or more + // levels deep also consumes its parent's exit, which can + // carry a parent-local derived from a changed global (e.g. + // `reader(){ const local = 'x'+g; child(){ exec(local) } }`): + // its read-set `{local, exec}` is disjoint from the + // top-level change set `{g}`, so skipping it would miss a + // real flow. These bodies are cheap relative to the + // soundness cost, so we always re-run them. + let is_depth1 = body.meta.parent_body_id == Some(BodyId(0)); + if is_depth1 { + if let Some(reads) = body_reads.get(&body.meta.id) { + if reads.is_disjoint(&changed_names) { + continue; + } } } - let parent_seed = body + // For nested (depth ≥ 2) bodies, merge the top-level + // (`BodyId(0)`) seed into the direct parent's exit so the + // `global_seed` ancestor lookup's `BodyId(0)` leg can reach + // the converged globals (see [`seed_for_nested_body`]). + // `body_exit_states[BodyId(0)]` holds the freshest + // `current_seed` (updated above, and per-body under + // Gauss-Seidel). + let parent_seed_owned = body .meta .parent_body_id - .and_then(|pid| body_exit_states.get(&pid)); + .and_then(|pid| seed_for_nested_body(pid, &body_exit_states)); + let parent_seed = parent_seed_owned.as_deref(); let parent_var_types = body .meta .parent_body_id @@ -2957,5 +3032,82 @@ pub(crate) fn build_eligible_bodies( eligible_bodies } +#[cfg(test)] +mod seed_threading_tests { + use super::*; + use crate::labels::Cap; + use crate::taint::domain::VarTaint; + use ssa_transfer::BindingKey; + use std::collections::HashMap; + + fn tainted() -> VarTaint { + VarTaint { + caps: Cap::all(), + origins: smallvec::SmallVec::new(), + uses_summary: true, + } + } + + /// Depth-1 body (parent == top level): the parent's exit already + /// carries the `BodyId(0)` entries, so the helper hands it back + /// borrowed with no merge. + #[test] + fn depth1_returns_parent_exit_unchanged() { + let mut exits: HashMap> = HashMap::new(); + let mut top = HashMap::new(); + top.insert(BindingKey::new("g", BodyId(0)), tainted()); + exits.insert(BodyId(0), top); + + let seed = seed_for_nested_body(BodyId(0), &exits).expect("seed present"); + // Borrowed, single BodyId(0) entry. + assert!(matches!(seed, std::borrow::Cow::Borrowed(_))); + assert!(seed.contains_key(&BindingKey::new("g", BodyId(0)))); + assert_eq!(seed.len(), 1); + } + + /// Depth-2 grandchild (parent == BodyId(2), itself non-toplevel): + /// the helper merges the direct parent's exit with the converged + /// `BodyId(0)` top-level seed so the ancestor lookup's `BodyId(0)` + /// leg can reach the sibling-written global. This is the finding + /// #19 regression guard: without the merge, the returned map would + /// have no `BodyId(0)` entry and the grandchild would miss `g`. + #[test] + fn depth2_merges_toplevel_seed() { + let mut exits: HashMap> = HashMap::new(); + // Top-level converged seed: global `g` is tainted. + let mut top = HashMap::new(); + top.insert(BindingKey::new("g", BodyId(0)), tainted()); + exits.insert(BodyId(0), top); + // Direct parent (reader, BodyId(2)) exports a parent-local. + let mut parent = HashMap::new(); + parent.insert(BindingKey::new("local", BodyId(2)), tainted()); + exits.insert(BodyId(2), parent); + + let seed = seed_for_nested_body(BodyId(2), &exits).expect("seed present"); + // Owned merge of both maps. + assert!(matches!(seed, std::borrow::Cow::Owned(_))); + // Parent-local survives (parent leg of ancestor lookup). + assert!(seed.contains_key(&BindingKey::new("local", BodyId(2)))); + // Converged global survives under BodyId(0) (the previously-dead + // ancestor leg for grandchildren). + assert!(seed.contains_key(&BindingKey::new("g", BodyId(0)))); + } + + /// Depth-2 body whose parent exported nothing still receives the + /// `BodyId(0)` top-level seed — borrowed directly, no clone. + #[test] + fn depth2_empty_parent_falls_back_to_toplevel() { + let mut exits: HashMap> = HashMap::new(); + let mut top = HashMap::new(); + top.insert(BindingKey::new("g", BodyId(0)), tainted()); + exits.insert(BodyId(0), top); + // No entry for BodyId(2) at all (parent produced no exit state). + + let seed = seed_for_nested_body(BodyId(2), &exits).expect("seed present"); + assert!(matches!(seed, std::borrow::Cow::Borrowed(_))); + assert!(seed.contains_key(&BindingKey::new("g", BodyId(0)))); + } +} + #[cfg(test)] mod tests; diff --git a/src/taint/path_state.rs b/src/taint/path_state.rs index 9e4bfbd2..76bf3353 100644 --- a/src/taint/path_state.rs +++ b/src/taint/path_state.rs @@ -619,6 +619,95 @@ fn parse_leading_uint(s: &str) -> Option { any.then_some(n) } +/// Detect a substring-REJECTION idiom dressed up as a membership method: +/// `x.includes("..")`, `x.contains("