nyx/src/summary/mod.rs
Eli Peter 41128177d2
Release/0.5.0 (#35)
* feat: Introduce function-scoped variable interning for state analysis with new tests and fixtures

* feat: Add Phase 26 symbolic execution enhancements with bitwise operator support, abstract interpretation refinements, and new taint analysis tests

* feat: Refine state analysis to handle factory-pattern resource returns with mixed-path tests and leak detection enhancements

* feat: Add Phase 27 debug views with symbolic execution, abstract interpretation, SSA, and call graph viewers; integrate with debug layout and styles

* feat: Add Phase 31 type-qualified symbolic resolution with receiver-based callee disambiguation and testing

* feat: Extend symbolic execution with state iteration, enhanced debug views, and debounced input handling

* feat: Add Phase 13 resource and auth pattern extensions with new tests and fixtures

* feat: Introduce CFG debug graph renderer with compact mode, toolbar, and DAG layout integration

* feat: Add Phase 28 encoding and decoding transform modeling with structural symex enhancements and new taint analysis tests

* feat: Extend abstract interpretation with type facts and constant value tracking in debug views and server logic

* feat: Add linear path handling and witness extraction to symbolic execution with Phase 28 transform mismatch detection

* feat: Refine Go auth and sanitizer handling with enhanced rules, state updates, and benchmark improvements

* feat: Enable auth-state analysis by default and update relevant tests in benchmark config

* test: Update state_tests to reflect default enablement of auth-state analysis and add auth suppression test

* docs: update CHANGELOG.md

* feat: Introduce per-index taint tracking in `HeapState` with `HeapSlot`, overflow handling, and revised SSA transfers

* feat: Introduce C/C++ language labels and refine heap state tracking in SSA transfers

* feat: Implement per-index array slot tracking in symbolic heap with overflow collapse

* feat: Add implicit definition handling for uninitialized declarations in SSA value allocation

* feat: Refactor function parameters and constants for improved clarity and maintainability

* refactor: Reorder module imports and improve formatting for consistency

* refactor: Fix formatting errors

* refactor: Fix clippy warnings

* refactor: Fix fmt warnings (again)

* chore: Update dependencies and improve feature configuration

* Add comprehensive tests for undertested modules (#36) (COPILOT)

* Add comprehensive tests for undertested modules

Co-authored-by: elicpeter <54954007+elicpeter@users.noreply.github.com>
Agent-Logs-Url: https://github.com/elicpeter/nyx/sessions/f3fc877e-f386-49ba-9793-fc93d3805083

* Add comprehensive tests for ext, project, walk, and errors modules

Co-authored-by: elicpeter <54954007+elicpeter@users.noreply.github.com>
Agent-Logs-Url: https://github.com/elicpeter/nyx/sessions/f3fc877e-f386-49ba-9793-fc93d3805083

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: elicpeter <54954007+elicpeter@users.noreply.github.com>

* chore: Update dependencies and improve feature configuration

* fix: formatting errors in new tests

* chore: Update license list in about.toml

* chore: made functions input inline

* chore: updated cfg graph to take up the full page

* chore: add Prettier configuration and update code formatting

* Add frontend test suite with Vitest (111 tests) (#37)

* Add Vitest test suite for frontend - 111 tests across utils, components, hooks, and graph utilities

Co-authored-by: elicpeter <54954007+elicpeter@users.noreply.github.com>
Agent-Logs-Url: https://github.com/elicpeter/nyx/sessions/7cf0dba2-ecff-4740-ba4d-92717e74a0b7

* ci: add frontend test step to CI workflow

Co-authored-by: elicpeter <54954007+elicpeter@users.noreply.github.com>
Agent-Logs-Url: https://github.com/elicpeter/nyx/sessions/5bc0ac9f-0a32-4d03-9cb7-7a15aea53fca

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: elicpeter <54954007+elicpeter@users.noreply.github.com>

* chore: simplify array initialization in test files for consistency

* ran typecheck

* feat: add AnalysisWorkspace component and integrate it into CfgViewerPage

* feat: update routing in AppLayout and improve empty state message in ExplorerPage

* feat: enhance scan progress tracking with additional metrics and stages

* feat: update license information and add license check script

* feat: implement cross-file symbolic execution with callee body persistence

* feat: replace dagre graphs with Graphology + ELK + Sigma for more advanced call stack and cfg rendering

* feat: ensure CFG function view is scoped to the selected function, preventing bleed into sibling functions

* feat: enhance resource tracking with proxy method summaries and improve finding extraction

* feat: add terminal function exit detection for accurate resource leak analysis

* feat: add warnings for loops and functions without bodies to improve error recovery

* feat: update lambda expression handling to ensure proper function classification and control flow

* feat: remove bounded formatting/string ops and add JSON.parse sanitizer for improved data handling

* feat: add inline return taint analysis and regression tests for improved security checks

* feat: add engine version management and migration handling for database schema updates

* feat: enhance first_call_ident to skip nested function bodies and add regression tests

* feat: enhance callee name resolution with two-segment normalization and disambiguation

* feat: add cross-file context flags and debug assertions for taint analysis

* feat: refactor taint analysis structure to unify context handling and improve clarity

* feat: enhance dead code elimination to preserve Sink, Source, and Sanitizer labels with new tests

* docs: updated CHANGELOG.md

* fmt: formatting fixes

* fix: fixed frontend formatting and lint warnings

* fix: optimized ci

* fix: optimized ci

* Add comprehensive multi-file test coverage to Nyx (#38)

* Initial checklist for multi-file test suite expansion

Agent-Logs-Url: https://github.com/elicpeter/nyx/sessions/e550cb88-9767-4442-94d4-101bf5bb0e23

Co-authored-by: elicpeter <54954007+elicpeter@users.noreply.github.com>

* Add 12 new multi-file test fixtures with TP/TN/near-miss coverage

Agent-Logs-Url: https://github.com/elicpeter/nyx/sessions/e550cb88-9767-4442-94d4-101bf5bb0e23

Co-authored-by: elicpeter <54954007+elicpeter@users.noreply.github.com>

* deleted root repo

* rebuilt to test for regressions

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: elicpeter <54954007+elicpeter@users.noreply.github.com>
Co-authored-by: elipeter <elicpeter@gmail.com>

* feat: enhance import alias resolution and taint tracking

* feat: implement security hardening with CSRF protection and path validation

* feat: add support for import alias bindings in Python, PHP, and Rust

* feat: enhance CFG analysis modes and improve code readability

* feat: add detection for parameterized SQL queries to enhance security

* feat: add safe internal redirect handling and enhance session destroy validation

* feat: implement security improvements by addressing vulnerabilities in execAsync, session management, and file downloads

* feat: enhance taint detection by adding support for inline source member expressions in call arguments

* feat: implement pre-emission of Source nodes for inline source member expressions in call arguments

* feat: add support for Throw statement in control flow and error handling

* feat: add debug and echo endpoints with potential information leakage

* feat: implement internal redirect suppression and enhance taint detection

* feat: implement module alias tracking for dynamic dispatch in JS/TS

* feat: add authorization analysis module with Express support

* feat: add authorization analysis module with Express support

* feat: add tests for admin guard requirements and clean checks in authorization analysis

* feat: integrate Koa and Fastify frameworks into authorization analysis

* feat: add Flask and Django support to authorization analysis module

* feat: add support for Rails and Sinatra frameworks in authorization analysis

* feat: add support for Axum, ActixWeb, and Rocket frameworks in authorization analysis

* feat: add support for ActixWeb, Axum, and Rocket frameworks in authorization analysis

* feat: add support for Rails and Sinatra in authorization analysis

* chore: add .DS_Store to .gitignore

* refactor: simplify conditional checks and improve readability in multiple files

* refactor: update usage of Option methods for improved clarity and consistency

* refactor: improve code readability by simplifying conditional checks and formatting

* refactor: improve code formatting and readability by simplifying conditional checks

* refactor: simplify conditional checks and improve readability in multiple files

* refactor: simplify conditional checks in axum.rs for improved readability

* feat: add CodeQL analysis configuration for enhanced security scanning

* test: add comprehensive tests for `src/output.rs` SARIF builder (#39)

* chore: start test coverage improvement work

Agent-Logs-Url: https://github.com/elicpeter/nyx/sessions/cd7ff398-134e-4728-a5e7-0353a0744423

Co-authored-by: elicpeter <54954007+elicpeter@users.noreply.github.com>

* test: add comprehensive tests for src/output.rs SARIF builder

Agent-Logs-Url: https://github.com/elicpeter/nyx/sessions/cd7ff398-134e-4728-a5e7-0353a0744423

Co-authored-by: elicpeter <54954007+elicpeter@users.noreply.github.com>

* refactor: improve code formatting and readability in output.rs

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: elicpeter <54954007+elicpeter@users.noreply.github.com>
Co-authored-by: elipeter <elicpeter@gmail.com>

* refactor: improve code formatting and readability in output.rs

* Potential fix for code scanning alert no. 210: Uncontrolled data used in path expression

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* Potential fix for code scanning alert no. 211: Uncontrolled data used in path expression

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* refactor: enhance triage file path handling with improved error management and validation

* refactor: updated func summaries for richer detail

* refactor: update SSA summary extraction to use canonical FuncKey for distinct entries

* refactor: enhance callee metadata structure to support arity, receiver, and qualifier for better overload resolution

* refactor: add support for keyword arguments in function calls and enhance receiver extraction for method-style calls

* refactor: implement new Flask routes for safe and unsafe shell command execution

* refactor: separate receiver handling in SSA operations and enhance taint propagation

* refactor: improve arity handling by using arg_uses for positional argument count and enhance witness scoring for tainted arguments

* refactor: implement auth decorator extraction and classification for multiple languages

* refactor: enhance Rust module path resolution and use map handling for cross-file disambiguation

* refactor: introduce CalleeQuery struct for structured callee resolution and enhance resolver logic

* refactor: implement same-file identity collision handling for `runTask` to ensure correct resolver behavior

* refactor: standardize default struct initialization across multiple files

* feat: add scripts for formatting checks and auto-fixes with test summaries

* refactor: simplify character splitting and enhance namespace qualifier handling

* refactor: improve documentation clarity and enhance code readability in resolver logic

* refactor: replace default struct initialization with explicit field assignments for clarity

* feat: enhance anonymous function naming by deriving context-based bindings

* refactor: streamline match expressions for improved readability and performance

* refactor: streamline match expressions for improved readability and performance

* refactor: replace loop with while let for improved clarity and performance

* feat: add SSA constant propagation support to analysis context for improved accuracy

* feat: add SSA constant propagation support to analysis context for improved accuracy

* feat: implement shell metacharacter validation and bounded-length checks in Rust analysis

* feat: add static map analysis for command injection suppression and type safety

* refactor: simplify match statements and reduce line breaks for improved readability

* feat(summary): phase 1/5 SinkSite data model for primary sink-location attribution

Introduce SinkSite (file_rel, line, col, snippet, cap) carrying the
primary sink source-location through function summaries. Swap
SsaFuncSummary.param_to_sink and FuncSummary.param_to_sink from a coarse
Cap map to a deduped SmallVec<[SinkSite; 1]> per parameter, with a
backward-compatible cap_sites() helper and serde defaults so pre-phase-1
on-disk rows continue to deserialise cleanly.

Extraction: SinkSiteLocator bundles the tree/bytes/file_rel needed by
extract_ssa_func_summary; ParsedFile::extract_ssa_artifacts wires the
locator in for the persisted pass-1 path, while pass-2 intra-file
transient summaries fall back to cap-only sites (behavior unchanged).
Merge: GlobalSummaries::insert now unions sink sites with
(file_rel, line, col, cap) dedup via shared union_param_sink_sites
helper.

Database: JSON-serialised summary columns carry the new shape
automatically; no schema change needed.

Phase 2 will consume SinkSite in build_taint_diag() to overwrite the
caller-site Finding.line with the callee's sink line when resolved via
summary. Phase 1 keeps behavior unchanged: scanning
tests/benchmark/corpus/rust/cmdi/cmdi_indirect.rs still produces the
same (wrong) line 10 finding.

Adds round-trip tests covering SinkSite solo, SsaFuncSummary with sink
sites, legacy-JSON default handling for both summary types, and merge
dedup.
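
A minimal sketch of the merge dedup described above, assuming plain std types: the field types, the `u32` stand-in for `Cap`, and the name `union_sink_sites` are illustrative assumptions, not the real definitions (the source names the helper `union_param_sink_sites` and stores sites in a SmallVec).

```rust
// Hypothetical stand-in for the real SinkSite; field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct SinkSite {
    file_rel: String,
    line: u32,
    col: u32,
    snippet: Option<String>,
    cap: u32, // assumed bitmask in place of the real Cap type
}

impl SinkSite {
    /// Dedup key: (file_rel, line, col, cap). The snippet is deliberately excluded.
    fn dedup_key(&self) -> (&str, u32, u32, u32) {
        (self.file_rel.as_str(), self.line, self.col, self.cap)
    }
}

/// Union `incoming` into `existing`, skipping any site whose key is already present.
fn union_sink_sites(existing: &mut Vec<SinkSite>, incoming: &[SinkSite]) {
    for site in incoming {
        if !existing.iter().any(|s| s.dedup_key() == site.dedup_key()) {
            existing.push(site.clone());
        }
    }
}
```

The linear scan is a reasonable fit here only because the per-parameter site list is expected to stay very small.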

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(taint): phase 2/5 thread SinkSite into SsaTaintEvent and Finding

Plumb Phase 1's SinkSite through the event pipeline into Findings,
with no output change yet.  SsaTaintEvent gains `primary_sink_site:
Option<SinkSite>`; when the main or callback sink-emission path has
non-empty `param_to_sink_sites`, filter to sites whose
`(line != 0) && (cap ∩ sink_caps != ∅)` and emit one event per
distinct site — the multi-primary collapse keeps each downstream
Finding single-primary.

Resolution: ResolvedSummary and SinkInfo gain mirror
`param_to_sink_sites` fields, populated from `SsaFuncSummary.param_to_sink`
(SSA + callback paths) and `FuncSummary.param_to_sink` (global paths).
Label, local-summary, and interop resolution paths leave the field
empty — they only ever had cap-level info to begin with.

Finding: new `primary_location: Option<SinkLocation>` with
`file_rel/line/col`.  `ssa_events_to_findings` maps
`event.primary_sink_site` → `Finding.primary_location`, filtering
cap-only sites (`line == 0`) to `None` so the (0,0) sentinel never
leaks to formatters.  Dedup key extended with the primary location
so multi-site events aren't collapsed back together.

Invariants (debug_assert!):
* every SinkSite reaching emission has `line != 0 && cap ∩ sink_caps
  != ∅` — enforced by the pick_primary_sink_sites* filters;
* every populated Finding.primary_location has `line != 0` AND
  non-empty `file_rel` — the cap-only → None translation upstream
  guarantees this.
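
The filter and the cap-only translation above can be sketched roughly as follows. This is an illustration under assumptions: `pick_primary` and `to_primary_location` are hypothetical names, the structs are reduced stand-ins, and `cap` is modelled as a plain `u32` bitmask.

```rust
// Hypothetical, reduced mirrors of the real types.
#[derive(Debug, Clone, PartialEq)]
struct SinkSite {
    file_rel: String,
    line: u32,
    col: u32,
    cap: u32, // assumed bitmask in place of the real Cap type
}

#[derive(Debug, Clone, PartialEq)]
struct SinkLocation {
    file_rel: String,
    line: u32,
    col: u32,
}

/// Keep only sites with a real line whose caps overlap the sink caps being reported.
fn pick_primary(sites: &[SinkSite], sink_caps: u32) -> Vec<SinkSite> {
    sites
        .iter()
        .filter(|s| s.line != 0 && (s.cap & sink_caps) != 0)
        .cloned()
        .collect()
}

/// Cap-only sites (line == 0) map to None so the sentinel never reaches formatters.
fn to_primary_location(site: Option<&SinkSite>) -> Option<SinkLocation> {
    site.filter(|s| s.line != 0).map(|s| SinkLocation {
        file_rel: s.file_rel.clone(),
        line: s.line,
        col: s.col,
    })
}
```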

Deliberately independent of `uses_summary`: that flag tracks whether
the *taint chain* used a summary, whereas primary attribution
requires only that the *sink* itself was summary-resolved.  A local
source reaching a cross-file sink produces `uses_summary=false`
alongside a populated primary_location — documented on
Finding.primary_location, covered by
`cross_file_sink_finding_carries_primary_location`.

build_taint_diag, SARIF/JSON/explanation formatters, and the
benchmark scorer remain untouched: finding.line still comes from
`cfg_graph[finding.sink]`, so cmdi_indirect.rs still reports line 10
and the benchmark's rs-cmdi-003 row still shows FN in the LOC column.

Tests: `cross_file_sink_finding_carries_primary_location` (proves
plumbing via a synthetic FuncSummary carrying a SinkSite at 42:5) and
`cross_file_sink_cap_only_site_leaves_primary_location_none`
(regression guard against cap-only sites surfacing).  All 1566 lib
tests + integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(output): phase 3/5 consume primary sink location in diag + SARIF

When a finding's primary_location (populated in phase 2 from a callee
summary's SinkSite) names the dangerous instruction inside a callee
body, attribute the diagnostic line to that location instead of the
caller's call site. The call site is demoted to a Call step in
flow_steps, and a synthetic Sink step at the primary location is
appended so analysts still see the full trace.

Changes:
- Add scan_root parameter to build_taint_diag so file_rel can be
  resolved back to an absolute path via a shared resolve_file_rel
  helper. Empty file_rel (single-file scans where namespace == "")
  resolves to the file under analysis.
- Extend SinkLocation with snippet, carried from the upstream
  SinkSite so the formatter needs no second file read.
- Relax the ssa_events_to_findings debug_assert to allow empty
  file_rel, which is valid when scan root equals the file itself.
- SARIF: emit data-flow as codeFlows[0].threadFlows[0].locations[];
  locations[0] already reflects the primary sink position via the
  updated diag line/col.
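
A rough sketch of the file_rel resolution rule from the first bullet. The function name and signature here are assumptions for illustration; only the behavior (empty file_rel falls back to the file under analysis, otherwise join onto the scan root) comes from the description above.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical signature: resolve a summary-relative path back to a full path.
fn resolve_file_rel_sketch(scan_root: &Path, file_rel: &str, current_file: &Path) -> PathBuf {
    if file_rel.is_empty() {
        // Single-file scan: the sink lives in the file being analysed.
        current_file.to_path_buf()
    } else {
        scan_root.join(file_rel)
    }
}
```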

Acceptance: scan on tests/benchmark/corpus/rust/cmdi/cmdi_indirect.rs
now reports line 5 (Command::new) as the primary sink, with the call
site at line 10 visible in flow_steps.

Two expect.json fixtures updated (must_match line_range widened):
- javascript/taint/context_sensitive_call: 12-14 -> 7-14 (line 8 is
  the real sink inside run()).
- rust/cfg/closure_async: 10-10 -> 10-11 (line 11 is Command::new
  inside the closure).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(bench): phase 4/5 validate primary sink attribution across corpus

Extend the benchmark scorer and ground truth to lock in phase 3's
primary-location behavior, and add fixtures that exercise the new
capability end-to-end.

Scorer (tests/benchmark_test.rs):
- Add optional `expected_call_site_lines: Option<Vec<[usize; 2]>>` on
  Case. When present, score_location_level additionally requires at
  least one flow_step in the finding's evidence trace to fall within
  ±2 of the call-site range. When absent, the check is skipped —
  fully forward-compatible with existing fixtures.
- Retain ±2 tolerance on expected_sink_lines (compared against the
  now-primary Diag.line post-phase-3).
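
The tolerance and optional call-site check could look roughly like this. The names `within_tolerance` and `call_site_ok` are hypothetical, and treating multiple expected ranges as "any range may match" is an assumption rather than something stated above.

```rust
const TOLERANCE: usize = 2;

/// True when `line` falls inside `range` widened by TOLERANCE on both ends.
fn within_tolerance(line: usize, range: [usize; 2]) -> bool {
    let lo = range[0].saturating_sub(TOLERANCE);
    let hi = range[1] + TOLERANCE;
    (lo..=hi).contains(&line)
}

/// When no call-site expectation is present the check is skipped (passes).
/// Otherwise at least one flow-step line must land near an expected range.
fn call_site_ok(flow_step_lines: &[usize], expected: Option<&[[usize; 2]]>) -> bool {
    match expected {
        None => true,
        Some(ranges) => flow_step_lines
            .iter()
            .any(|&line| ranges.iter().any(|&r| within_tolerance(line, r))),
    }
}
```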

Ground truth edits:
- rs-cmdi-cross-001: expected_sink_lines [8,8] -> [9,9]. Line 8 is the
  transform::wrap call site (a cross-file propagator, not a sink);
  line 9 is Command::new, the real sink. The ±2 tolerance happened to
  mask this stale attribution but it was semantically wrong — phase 4
  is the right time to correct it. Also adds expected_call_site_lines
  [8,8] so the new field is exercised on an existing cross-file case.
- rs-cmdi-003: adds expected_call_site_lines [10,10] (run_cmd call).
  This fixture's sink (Command::new inside run_cmd at line 5) was the
  motivating case for phases 1-3; adding the call-site assertion
  guards against regression to caller-line attribution.

New fixtures:
- rust/cmdi/cmdi_indirect_multisink.rs (rs-cmdi-009): helper run_both
  takes two tainted params and invokes two Command sinks on
  consecutive lines. Locks in that primary line lands inside the
  helper (lines 5-6), not at the caller (line 12). Notes document
  that SinkSite is currently one-per-callee so both findings today
  collapse onto the first sink; expected_sink_lines=[5,6] and
  expected_call_site_lines=[12,12] stay valid either way.
- python/cmdi/cross_indirect_sink/{app.py,helper.py} (py-cmdi-cross-
  004): sink os.system lives in helper.py (cross-file), caller in
  app.py reads env source and calls run_cmd. Verifies phase 3's
  cross-file primary attribution: Diag.path = helper.py, Diag.line =
  5, with app.py:7 recorded in flow_steps as a Call step.

Acceptance:
- `cargo test --test benchmark_test -- --ignored --nocapture` passes.
- rs-cmdi-003 is TP/TP/TP (the target flip FN->TP at LOC). All
  pre-existing TP/TP/TP fixtures remain TP/TP/TP; 2 new fixtures are
  TP/TP/TP.
- Aggregate rule-level: TP=158 FP=10 FN=1 TN=97, P=0.940 R=0.994
  F1=0.966 on the 266-case corpus (was TP=156 FP=10 FN=1 TN=97 on
  264 pre-phase-4, delta is the +2 new cases both resolving TP).
- Full `cargo test` green (1566 lib tests + all integration tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(taint): phase 5/5 lock Finding.primary_location contract via regression test

Add a regression test in src/taint/ssa_transfer.rs that wires up a synthetic
SsaFuncSummary with a SinkSite at other.rs:42:10 and drives the three
emission stages (pick_primary_sink_sites → emit_ssa_taint_events →
ssa_events_to_findings) against a minimal caller SSA body.  Asserts the
resulting Finding.primary_location is exactly that triple.

The existing integration tests in src/taint/tests.rs cover the coarse
FuncSummary path end-to-end through analyse_file.  This test locks in the
lower-level SSA-side plumbing so a future refactor that silently drops the
site between pick → emit → findings fails here rather than only at the
benchmark layer.

Also refreshes tests/benchmark/results/latest.json (timestamp only; rs-cmdi-003
remains TP/TP/TP and the aggregate P/R/F1 are unchanged from phase 4).

Closes the primary sink-location attribution feature (phases 1-5/5):
* Phase 1 — SinkSite data model on summaries.
* Phase 2 — SinkSite threaded into SsaTaintEvent and Finding.
* Phase 3 — diag + SARIF consume primary_location.
* Phase 4 — benchmark validates expected_call_site_lines across corpus.
* Phase 5 — regression test locks the event→finding contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: clean up formatting and improve readability in multiple files

* refactor: simplify type definition for deduplication key in findings

* test(harness): add must_not_match expectation for FP regression guards

Extends ExpectedFinding with must_not_match field that asserts a
diagnostic must NOT fire — presence is a hard failure. Non-consuming
scan so it coexists with must_match entries on the same rule_id.
Adds forbidden_violations accumulator and updates summary line.
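
One way to picture the non-consuming scan, as a sketch only: the `Expectation` struct and the rule-id-only matching are simplifying assumptions (the real ExpectedFinding presumably matches on more than the rule id).

```rust
// Hypothetical, reduced form of an expectation entry.
struct Expectation {
    rule_id: String,
    must_not_match: bool,
}

/// Collect rule ids that were forbidden but fired anyway. The fired list is only
/// borrowed, so must_match entries on the same rule_id can still be checked.
fn forbidden_violations(expectations: &[Expectation], fired_rule_ids: &[&str]) -> Vec<String> {
    expectations
        .iter()
        .filter(|e| e.must_not_match)
        .filter(|e| fired_rule_ids.iter().any(|id| *id == e.rule_id.as_str()))
        .map(|e| e.rule_id.clone())
        .collect()
}
```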

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(regression): update expectations to ensure must_not_match for various taint and resource leak rules

* feat: implement auto-seeding for JS/TS handler parameters to enhance taint tracking

* feat: update switch statement handling to improve control flow analysis

* feat: implement promisify alias handling for JS/TS to enhance taint tracking

* feat: enhance taint tracking by refining expectation handling and adding mode filtering

* feat: refine SQL handling in stream processing and enhance auto-seeding for handler parameters

* feat: update taint tracking rules to enforce full mode matching and improve flow analysis

* feat: enhance Ruby subshell handling to improve taint tracking and flow analysis

* feat: update xss_response expectations to refine taint flow analysis and enhance regression guarding

* feat: refine framework detection and update expectation handling for Echo and Sinatra

* feat: implement max_count for taint tracking expectations and deduplicate findings

* feat: add strict_unexpected handling for taint-unsanitised-flow in expectation files

* feat: enhance deduplication of taint-unsanitised-flow findings by collapsing based on line and severity

* feat: add strict_unexpected handling for taint-unsanitised-flow in multiple expectation files

* feat: add structural invariant checks for SSA bodies

* feat: ensure deterministic phi emission order using BTreeSet

* feat: enhance handling of terminators to ensure authoritative flow through successor edges

* feat: enhance Goto terminator handling to ensure all successors are marked executable

* feat: refactor code for improved readability and organization

* feat: simplify predicate checks and enhance readability in SSA handling

* feat: implement per-file parse timeout and enhance file size handling

* feat: migrate analysis engine toggles from environment variables to configuration file

* feat: remove unnecessary whitespace in hostile_input_tests.rs

* feat: remove unnecessary whitespace in hostile_input_tests.rs

* feat: update dependencies and enhance documentation on language maturity

* feat: enhance security headers and improve request body limits

* feat: implement sink capability bits for deduplication and enhance evidence tagging

* feat: implement dynamic activation handling for gated sinks and enhance validation logic

* feat: enhance configuration documentation and clarify inline analysis cache behavior

* feat: implement panic recovery during analysis to continue scans past errors

* feat: add expectations configuration for taint analysis and performance metrics

* feat: enhance error handling and logging during file reading and mutex locking

* feat: add cross-file body loading tests and plumbing for CF-1 phase

* feat: implement cross-file k=1 context-sensitive inline taint analysis with new tests and fixtures

* feat: implement indexed-scan parity in cross-file inline analysis with new dropdown and copy functionality

* feat: enhance classification span handling in CFG and AST for improved source attribution

* feat: add new Express routes for handling user input and telemetry data

* feat: implement ternary expression handling in CFG with diamond structure for JS/TS

* feat: implement Phase CF-3 abstract-domain transfer channels in summaries

* feat: add support for string-prefix transfer in cross-file calls and update tests

* docs: reduce RESULTS.md doc size

* feat: implement Phase CF-4 per-return-path summary decomposition with tests

* feat: update parameter handling in pass1 and refactor SsaFuncSummary initialization

* feat: implement Phase CF-5 for cross-file SCC joint fixed-point convergence with new flags and tests

* feat: implement Phase CF-6 with parameter-granularity points-to summaries and associated tests

* refactor: update comments and documentation for clarity and consistency

* style: format code for consistency and readability

* refactor: simplify verdict handling and improve edge checking logic

* refactor: optimize path and identifier collection by avoiding unnecessary cloning

* chore: update Cargo.toml for Rust version 1.85 and add ignored files; modify CHANGELOG and README for clarity on state analysis defaults

* refactor: update documentation and improve clarity in configuration files

* refactor: update documentation and improve clarity in configuration files

* feat: add JS/TS pass-2 convergence tests and expectations configuration

* feat: add Phase 5 regression tests for inline cache origin attribution and update related logic

* feat: implement Phase 7 deduplication and alternative path linking for taint findings

* feat: implement structural DFS index for anonymous functions and update naming conventions

* feat: add Phase 8 regression tests for container-element taint in JS and Python

* feat: add engine-depth profiles and explain-engine option for CLI

* feat: update expectations and add new README fixtures for multi-file scan regression

* feat: implement Phase 11 callback-alias and factory patterns with regression tests

* feat: implement Terminator::Switch for multi-way dispatch and add regression tests

* feat: add real-CVE benchmark fixtures for CVE-2023-48022, CVE-2019-14939, and CVE-2023-26159 with corresponding patched variants

* refactor: extract cfg and ssa_transfer to submodules

* refactor: cargo fmt

* refactor: remove unnecessary blank line in cfg_tests.rs

* refactor: remove unnecessary planning file

* chore: update Rust version to 1.88 and bump dependencies in Cargo files

* feat: enhance triage UI with new layout and controls, update README for clarity

* feat: enhance triage UI with new layout and controls, update README for clarity

* chore: remove outdated section from README for version 0.5.0

* docs: improve clarity and consistency in README content

* chore: add "GPL-3.0-or-later" to license options in about.toml

* chore: update license handling in about.toml and check-licenses.mjs

* style: format code for improved readability in TriagePage component

* style: format code for improved readability in TriagePage component

* chore: enhance license handling and improve body_id scoping in seed lookup

* feat: introduce owner and parent body IDs for enhanced seed scoping

* feat: implement direction-aware engine provenance with new CLI flag for strict CI gating

* feat: add Undef SSA operation for improved control-flow handling

* style: improve code formatting for consistency and readability in multiple files

* feat: add 16-function chain SCC across multiple files for enhanced analysis

* style: simplify code formatting for improved readability in multiple files

* fix: update CapHitReason default implementation and improve README clarity

* docs: enhance README with detailed explanations of taint analysis and limitations

* docs: refine README for clarity and consistency in taint analysis section

* style: improve code formatting for better readability in NewScanModal and scans

* fix: update cargo-about command to use --offline for deterministic license generation

* fix: update cargo-about command to use --offline for deterministic license generation

* ci: add step to prime cargo registry cache for deterministic license generation

* feat: add support for non-sink collections in authorization analysis

* feat: enhance authorization checks with row-level ownership equality and binding tracking

* feat: implement self-scoped user handling and enhance ownership checks

* refactor: simplify assertions and formatting in authorization analysis tests

* fix: normalize line endings in THIRDPARTY-LICENSES.html generation and update README with AI disclosure

* docs: update AI disclosure section for clarity and conciseness

* feat: add AI Contribution Policy and update contributing guidelines for AI assistance disclosure

* feat: enhance authorization analysis with SSA-derived variable type classification

* feat: implement auth_finding_to_diag function for enhanced security diagnostics

* feat: add args_value_refs to CallSite struct for enhanced argument tracking

* feat: add args_value_refs to CallSite struct for enhanced argument tracking

* feat: add direction-aware engine provenance with LossDirection classification and new CLI flag

* feat: simplify strip_cap_from_call_args call by removing unnecessary line breaks

* feat: enhance error message handling in cli_validation_tests for better Windows compatibility

* feat: optimize release profile settings in Cargo.toml and update CodeQL configuration

* feat: enhance release build process with SBOM generation and SLSA provenance

* feat: update actions/checkout and actions/setup-node to v6, enhance CLI options, and improve auth-check summaries

* feat: introduce PathFact handling for path safety checks and rejection logic

* feat: introduce PathFact handling for path safety checks and rejection logic

* feat: update benchmark data and enhance path sanitization logic with new safety checks

* feat: document AI assistance in frontend UI development and human review process

* feat: add return path facts for enhanced path safety checks and update documentation

* chore: update release date for version 0.5.0 in CHANGELOG.md

* chore: clean up ci.yml by removing outdated comments and clarifying steps

* feat: implement cross-language path sanitizers and validators for enhanced security

* feat: enhance SSA value usage tracking by including block terminators and improve path safety checks

* feat: enhance switch statement handling by adding per-case path constraints and support for exclusive cases

* refactor: simplify conditional formatting and improve code readability in executor and lower modules

* feat: add vulnerable examples for various languages demonstrating authentication and sanitization issues

* feat: enhance actor context recognition for self-actor identifiers and add support for global non-sink receivers

* feat: add transform classifiers for Java, Go, and Ruby with corresponding tests

* refactor: clarify comments on reassign-to-constant idiom and sink behavior in guards.rs

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 17:59:11 -04:00

1555 lines
64 KiB
Rust

pub mod points_to;
pub mod ssa_summary;
use crate::labels::Cap;
use crate::summary::ssa_summary::SsaFuncSummary;
use crate::symbol::{FuncKey, FuncKind, Lang, normalize_namespace};
use serde::{Deserialize, Deserializer, Serialize};
use smallvec::SmallVec;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};
// ── Sink site (primary sink-location attribution) ───────────────────────
/// A single dangerous-instruction site recorded inside a function's body.
///
/// `SinkSite` pairs a [`Cap`] (the bits this particular site consumes) with
/// the file-relative source location of the instruction that consumes them.
/// Carrying this alongside a summary's `param_to_sink` map lets cross-file
/// findings attribute the finding line to the actual dangerous call inside
/// the callee, rather than to the caller's call-site (which is all a
/// bare `(param_idx, Cap)` pair could support).
///
/// Primary sink-location attribution stores this data in the summary so
/// `build_taint_diag()` can consume it and overwrite the caller-site
/// `Finding.line` when the sink was resolved via summary.
///
/// Fields
/// ──────
/// * `file_rel` — the callee file's path relative to the workspace root
/// being scanned. Matches the `FuncKey::namespace` convention so the
/// site's origin is addressable without additional workspace context.
/// * `line` / `col` — 1-based source coordinates of the sink instruction.
/// `0` indicates the extractor could not resolve coordinates (e.g. a
/// pass-2 transient summary without tree access).
/// * `snippet` — the trimmed source line, capped at 120 characters, empty
/// when coordinates could not be resolved.
/// * `cap` — the [`Cap`] bits this specific site consumes. A parameter's
/// total sink caps are the union across every site associated with it.
#[derive(Debug, Clone, Default, Serialize, Deserialize, PartialEq)]
pub struct SinkSite {
#[serde(default, skip_serializing_if = "String::is_empty")]
pub file_rel: String,
#[serde(default, skip_serializing_if = "is_zero_u32")]
pub line: u32,
#[serde(default, skip_serializing_if = "is_zero_u32")]
pub col: u32,
#[serde(default, skip_serializing_if = "String::is_empty")]
pub snippet: String,
pub cap: Cap,
}
impl SinkSite {
/// Dedup key comparing the full identity of a site. Two sites with the
/// same `(file_rel, line, col, cap)` describe the same consumption of
/// the same bits at the same source location and should collapse when
/// summaries are merged.
pub(crate) fn dedup_key(&self) -> (&str, u32, u32, u16) {
(self.file_rel.as_str(), self.line, self.col, self.cap.bits())
}
/// Build a site that only carries a [`Cap`] — no resolved source
/// coordinates. Used by extraction paths that have no tree/bytes
/// context (e.g. pass-2 transient summaries), so downstream consumers
/// unioning caps across sites still see the correct bits even when
/// primary-location attribution is not available.
pub fn cap_only(cap: Cap) -> Self {
Self {
file_rel: String::new(),
line: 0,
col: 0,
snippet: String::new(),
cap,
}
}
}
/// Tree/bytes context for resolving a CFG span to a [`SinkSite`].
///
/// Summary extraction runs deep inside the taint engine, far from the
/// `ParsedFile` that owns the tree; `SinkSiteLocator` is the narrow
/// reference bundle the extractor needs to populate `SinkSite.line`,
/// `col`, and `snippet`. The struct is intentionally plain references
/// so construction is free and threading it as `Option<&Locator>` is
/// cheap.
pub struct SinkSiteLocator<'a> {
pub tree: &'a tree_sitter::Tree,
pub bytes: &'a [u8],
pub file_rel: &'a str,
}
impl<'a> SinkSiteLocator<'a> {
/// Resolve a `(start_byte, end_byte)` span to a [`SinkSite`] with the
/// given `cap`. Coordinates fall back to `(0, 0)` and the snippet to
/// empty when the byte offset is out of range (should not happen for
/// spans that came from the same tree).
pub fn site_for_span(&self, span: (usize, usize), cap: Cap) -> SinkSite {
let byte = span.0;
// `(0, 0)` stays the "unresolved" sentinel: only a located node is
// shifted to 1-based coordinates.
let (line, col) = self
.tree
.root_node()
.descendant_for_byte_range(byte, byte)
.map(|n| {
let p = n.start_position();
((p.row + 1) as u32, (p.column + 1) as u32)
})
.unwrap_or((0, 0));
let snippet = line_snippet(self.bytes, byte).unwrap_or_default();
SinkSite {
file_rel: self.file_rel.to_string(),
line,
col,
snippet,
cap,
}
}
}
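// ── Sketch: byte offset to 1-based coordinates (illustrative only) ──────
// A minimal, std-only sketch of the coordinate convention described above:
// 1-based `line` / `col`, with `(0, 0)` reserved for "could not resolve".
// It scans raw bytes instead of consulting a tree-sitter tree, so it only
// approximates `site_for_span` and is not a replacement for it. The module
// and function names are hypothetical, and columns are counted in bytes.
#[allow(dead_code)]
mod sketch_line_col {
    /// Return the 1-based `(line, col)` of `byte_offset` inside `src`, or
    /// `(0, 0)` when the offset lies outside the buffer.
    pub fn line_col(src: &[u8], byte_offset: usize) -> (u32, u32) {
        if byte_offset >= src.len() {
            return (0, 0);
        }
        let before = &src[..byte_offset];
        // Every newline before the offset starts a new line.
        let line = before.iter().filter(|&&b| b == b'\n').count() + 1;
        // The current line starts right after the last preceding newline.
        let line_start = before
            .iter()
            .rposition(|&b| b == b'\n')
            .map_or(0, |p| p + 1);
        (line as u32, (byte_offset - line_start + 1) as u32)
    }
}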
/// Extract the source line containing `byte_offset`, trimmed and capped at
/// 120 chars. Returns `None` when the offset is out of range or the line
/// is entirely blank after trimming.
pub(crate) fn line_snippet(src: &[u8], byte_offset: usize) -> Option<String> {
if byte_offset >= src.len() {
return None;
}
let line_start = src[..byte_offset]
.iter()
.rposition(|&b| b == b'\n')
.map_or(0, |p| p + 1);
let line_end = src[byte_offset..]
.iter()
.position(|&b| b == b'\n')
.map_or(src.len(), |p| byte_offset + p);
let line = std::str::from_utf8(&src[line_start..line_end]).ok()?;
let trimmed = line.trim();
if trimmed.is_empty() {
return None;
}
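if trimmed.len() > 120 {
// Back up to a char boundary so multi-byte UTF-8 text cannot panic the slice.
let mut end = 120;
while !trimmed.is_char_boundary(end) {
end -= 1;
}
Some(format!("{}...", &trimmed[..end]))
} else {
Some(trimmed.to_string())
}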
}
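// ── Sketch: line extraction with a length cap (illustrative only) ───────
// A minimal, std-only sketch of the steps `line_snippet` describes: find
// the line enclosing a byte offset, trim it, and cap its length. The cap is
// rounded down to a UTF-8 char boundary so the slice cannot split a
// multi-byte character. The module name, function name, and the `max_len`
// parameter are hypothetical; the real function uses a fixed cap of 120.
#[allow(dead_code)]
mod sketch_line_snippet {
    /// Trimmed line containing `byte_offset`, truncated to at most
    /// `max_len` bytes (plus a `...` suffix when truncation happened).
    pub fn snippet_at(src: &[u8], byte_offset: usize, max_len: usize) -> Option<String> {
        if byte_offset >= src.len() {
            return None;
        }
        let start = src[..byte_offset]
            .iter()
            .rposition(|&b| b == b'\n')
            .map_or(0, |p| p + 1);
        let end = src[byte_offset..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(src.len(), |p| byte_offset + p);
        let line = std::str::from_utf8(&src[start..end]).ok()?.trim();
        if line.is_empty() {
            return None;
        }
        if line.len() <= max_len {
            return Some(line.to_string());
        }
        // Index 0 is always a char boundary, so this loop terminates.
        let mut cut = max_len;
        while !line.is_char_boundary(cut) {
            cut -= 1;
        }
        Some(format!("{}...", &line[..cut]))
    }
}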
/// Union two `SmallVec<[SinkSite; 1]>` lists with `(file_rel, line, col,
/// cap)` dedup. Preserves insertion order of `existing` then appends any
/// new sites from `incoming` not already present.
pub(crate) fn union_sink_sites(existing: &mut SmallVec<[SinkSite; 1]>, incoming: &[SinkSite]) {
for site in incoming {
let key = site.dedup_key();
if !existing.iter().any(|s| s.dedup_key() == key) {
existing.push(site.clone());
}
}
}
/// Union two `Vec<(usize, SmallVec<[SinkSite; 1]>)>` lists keyed by
/// parameter index. Each parameter keeps its own deduped site list.
pub(crate) fn union_param_sink_sites(
existing: &mut Vec<(usize, SmallVec<[SinkSite; 1]>)>,
incoming: &[(usize, SmallVec<[SinkSite; 1]>)],
) {
for (idx, sites) in incoming {
if let Some((_, ex)) = existing.iter_mut().find(|(i, _)| *i == *idx) {
union_sink_sites(ex, sites);
} else {
existing.push((*idx, sites.clone()));
}
}
}
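// ── Sketch: order-preserving union with a dedup key (illustrative only) ─
// A self-contained sketch of the two union helpers above, using plain
// `Vec` and a stand-in `Site` type so it needs neither `smallvec` nor
// `Cap`. All names in this module are hypothetical. It shows that the
// snippet text is not part of a site's identity, and that sites at the
// same location with different cap bits are kept apart.
#[allow(dead_code)]
mod sketch_union_sites {
    #[derive(Debug, Clone, PartialEq)]
    pub struct Site {
        pub file_rel: String,
        pub line: u32,
        pub col: u32,
        pub cap: u16,
        pub snippet: String,
    }

    impl Site {
        fn dedup_key(&self) -> (&str, u32, u32, u16) {
            (self.file_rel.as_str(), self.line, self.col, self.cap)
        }
    }

    /// Convenience constructor for the examples (empty snippet).
    pub fn site(file_rel: &str, line: u32, col: u32, cap: u16) -> Site {
        Site { file_rel: file_rel.to_string(), line, col, cap, snippet: String::new() }
    }

    /// Append sites from `incoming` whose key is not yet in `existing`.
    pub fn union_sites(existing: &mut Vec<Site>, incoming: &[Site]) {
        for s in incoming {
            let key = s.dedup_key();
            if !existing.iter().any(|e| e.dedup_key() == key) {
                existing.push(s.clone());
            }
        }
    }

    /// Same union, applied per parameter index.
    pub fn union_param_sites(
        existing: &mut Vec<(usize, Vec<Site>)>,
        incoming: &[(usize, Vec<Site>)],
    ) {
        for (idx, sites) in incoming {
            if let Some((_, ex)) = existing.iter_mut().find(|(i, _)| *i == *idx) {
                union_sites(ex, sites);
            } else {
                existing.push((*idx, sites.clone()));
            }
        }
    }
}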
/// Top bit of [`FuncKey::disambig`] reserved for synthetic discriminators
/// minted by [`GlobalSummaries`] when an identity collision is detected
/// between structurally incompatible summaries.
///
/// Real disambigs come from `tree_sitter::Node::start_byte` (see
/// `cfg.rs:fn_disambig`), which is a byte offset into the source file.
/// Source files in practice are far below 2 GiB, so bit 31 of a real
/// disambig is always zero — setting it marks a value as synthetic and
/// keeps it in a disjoint namespace from byte-offset disambigs.
const SYNTHETIC_DISAMBIG_BIT: u32 = 0x8000_0000;
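// ── Sketch: synthetic disambigs and bounded probing (illustrative only) ─
// A simplified sketch of how a reserved top bit can keep synthetic
// discriminators disjoint from byte-offset ones, and how a bounded probe
// loop can look for a free key. The `Key` alias, the `seed` parameter, and
// the "compatible means equal parameter count" rule are assumptions made
// for this example; the real reconcile helpers use richer checks.
#[allow(dead_code)]
mod sketch_synthetic_disambig {
    use std::collections::HashMap;

    pub const SYNTHETIC_BIT: u32 = 0x8000_0000;

    /// Force the top bit on, keeping the low 31 bits of `seed`.
    pub fn mark_synthetic(seed: u32) -> u32 {
        SYNTHETIC_BIT | (seed & !SYNTHETIC_BIT)
    }

    pub fn is_synthetic(disambig: u32) -> bool {
        disambig & SYNTHETIC_BIT != 0
    }

    /// Hypothetical simplified identity: `(name, disambig)`.
    pub type Key = (String, Option<u32>);

    /// Return `key` unchanged when its slot is free or holds a function
    /// with the same parameter count; otherwise probe synthetic disambigs,
    /// giving up after 1024 attempts.
    pub fn reconcile(
        map: &HashMap<Key, usize>,
        mut key: Key,
        param_count: usize,
        seed: u32,
    ) -> Key {
        let mut probe: u32 = 0;
        loop {
            match map.get(&key) {
                Some(&existing) if existing != param_count => {
                    key.1 = Some(mark_synthetic(seed.wrapping_add(probe)));
                    probe += 1;
                    if probe >= 1024 {
                        return key;
                    }
                }
                _ => return key,
            }
        }
    }
}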
// ── Callee site metadata ────────────────────────────────────────────────
/// Richer per-call-site metadata preserved in a function's summary.
///
/// Replaces the legacy `Vec<String>` callee list. Carries enough structure
/// to disambiguate same-name overloads and method calls at resolution time
/// without having to re-parse the raw callee string.
///
/// * `name` — the raw callee text as it appeared in source
/// (`"obj.method"`, `"env::var"`, `"helper"`). Preserved for diagnostics.
/// * `arity` — number of positional arguments at the call site. `None`
/// when splats / keyword-args / rest-params make the count unreliable.
/// * `receiver` — structured receiver identifier for method calls
/// (e.g. `"obj"` in `obj.method()`). Carries the root receiver for
/// chained calls; `None` for non-method or complex receivers.
/// * `qualifier` — the segment immediately before the leaf for non-method
/// qualified calls (e.g. `"env"` in `env::var`). Extracted once at CFG
/// time rather than re-parsed downstream.
/// * `ordinal` — the per-function call ordinal matching
/// `CallMeta.call_ordinal`, allowing cross-file consumers to address a
/// specific call site rather than just a callee name.
#[derive(Debug, Clone, Default, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct CalleeSite {
pub name: String,
#[serde(default, skip_serializing_if = "Option::is_none")]
pub arity: Option<usize>,
#[serde(default, skip_serializing_if = "Option::is_none")]
pub receiver: Option<String>,
#[serde(default, skip_serializing_if = "Option::is_none")]
pub qualifier: Option<String>,
#[serde(default, skip_serializing_if = "is_zero_u32")]
pub ordinal: u32,
}
fn is_zero_u32(n: &u32) -> bool {
*n == 0
}
impl CalleeSite {
/// Construct a bare call-site reference from a name, with no other metadata.
pub fn bare(name: impl Into<String>) -> Self {
Self {
name: name.into(),
..Default::default()
}
}
}
impl From<String> for CalleeSite {
fn from(name: String) -> Self {
Self {
name,
..Default::default()
}
}
}
impl From<&str> for CalleeSite {
fn from(name: &str) -> Self {
Self {
name: name.to_string(),
..Default::default()
}
}
}
/// Deserialize a `Vec<CalleeSite>` while tolerating the legacy
/// on-disk form where callees were a plain array of strings.
///
/// Accepts:
/// * `[{"name": "foo", "arity": 1, ...}, ...]` ← current structured form
/// * `["foo", "bar", ...]` ← legacy string form
fn deserialize_callee_sites<'de, D>(de: D) -> Result<Vec<CalleeSite>, D::Error>
where
D: Deserializer<'de>,
{
#[derive(Deserialize)]
#[serde(untagged)]
enum Entry {
Structured(CalleeSite),
Bare(String),
}
let raw: Vec<Entry> = Vec::deserialize(de)?;
Ok(raw
.into_iter()
.map(|e| match e {
Entry::Structured(s) => s,
Entry::Bare(name) => CalleeSite::bare(name),
})
.collect())
}
/// Serialisable summary of a single function's taint behaviour.
///
/// One of these is produced per function during **pass 1** of a scan and
/// persisted to the `function_summaries` SQLite table. During **pass 2** the
/// full set of summaries across every file is loaded into memory so the taint
/// engine can resolve cross-file calls.
///
/// Design notes
/// ────────────
/// * **All three cap fields are independent.** A function can simultaneously
/// act as a source (introduces fresh taint), a sanitizer (cleans certain
/// bits), and a sink (passes tainted data to a dangerous operation).
/// The old code picked a single `DataLabel`, which lost information.
///
/// * **`propagating_params`** captures per-argument pass-through behaviour:
/// which parameter indices (0-based) flow through to the return value.
/// This is essential for chains like `let y = transform(tainted_x); sink(y);`.
/// The legacy boolean `propagates_taint` is kept for deserialising old JSON.
///
/// * **`callees`** drive callgraph construction in `callgraph.rs`, which
/// yields the topological order and SCC batches used between pass 1 and
/// pass 2 (see `scan::run_topo_batches` and `scc_file_batches_with_metadata`).
///
/// * **`tainted_sink_params`** marks which parameter *positions* flow to
/// internal sinks and is consumed by SSA callee resolution
/// (`ssa_transfer::mod.rs` `resolve_callee`) to build the per-parameter
/// `param_to_sink` list, so caller-side sink propagation fires on the
/// specific argument positions rather than the whole call.
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct FuncSummary {
/// Function name as it appears in the source (`my_func`, not the full path).
pub name: String,
/// Absolute path of the file that defines this function.
pub file_path: String,
/// Language slug (`"rust"`, `"javascript"`, …).
pub lang: String,
// ── Signature information ────────────────────────────────────────────
/// Total number of parameters (including `self`/`&self` for methods).
pub param_count: usize,
/// Parameter names in declaration order.
pub param_names: Vec<String>,
// ── Taint behaviour ──────────────────────────────────────────────────
// Stored as raw `u16` so serde doesn't need to know about `bitflags`.
/// Caps this function **introduces** — i.e. the return value carries
/// freshly-tainted data even if no argument was tainted.
pub source_caps: u16,
/// Caps this function **cleans** — passing tainted data through this
/// function strips the corresponding bits.
pub sanitizer_caps: u16,
/// Caps this function **consumes unsafely** — calling it with tainted
/// arguments that still carry these bits is a finding.
pub sink_caps: u16,
/// Which parameter indices (0-based) flow through to the return value.
#[serde(default)]
pub propagating_params: Vec<usize>,
/// Legacy field — kept only for deserialising old JSON from SQLite.
/// New code should use `propagating_params` instead.
#[serde(default, skip_serializing)]
pub propagates_taint: bool,
/// Indices of parameters that flow to internal sinks (0-based).
pub tainted_sink_params: Vec<usize>,
/// Per-parameter [`SinkSite`] records — mirrors
/// [`SsaFuncSummary::param_to_sink`] so the coarse legacy summary also
/// carries primary sink-location attribution through the two-pass
/// architecture. Empty when the extractor lacked tree access.
#[serde(default, skip_serializing_if = "Vec::is_empty")]
pub param_to_sink: Vec<(usize, SmallVec<[SinkSite; 1]>)>,
/// Per-call-site metadata for every function/method/macro invoked
/// inside this body (`CalleeSite`). Carries arity, receiver,
/// qualifier, and call ordinal so downstream resolution does not have
/// to re-parse the raw callee string.
///
/// A custom deserializer tolerates legacy on-disk rows whose callees
/// field was a plain `Vec<String>`; those are lifted to
/// `CalleeSite { name, .. }` with no additional metadata.
#[serde(default, deserialize_with = "deserialize_callee_sites")]
pub callees: Vec<CalleeSite>,
// ── Identity discriminators ──────────────────────────────────────────
/// Enclosing container path (class / impl / module / outer function),
/// segments joined with `::`. Empty for free top-level functions.
#[serde(default)]
pub container: String,
/// Numeric discriminator for same-name siblings (closure byte offset,
/// nested-function occurrence index). `None` when no sibling collision.
#[serde(default)]
pub disambig: Option<u32>,
/// Structural role of this definition. Defaults to `Function` when
/// deserialising legacy JSON.
#[serde(default)]
pub kind: FuncKind,
// ── Rust-specific module-resolution metadata ────────────────────────
/// Crate-relative module path for this function's defining file
/// (e.g. `"auth::token"` for `src/auth/token.rs`). Only populated
/// when `lang == "rust"`. Used by the call graph to resolve
/// `use`-imported callees to their fully-qualified module.
///
/// `None` for non-Rust files and for Rust files outside a recognised
/// `src/` tree (tests, examples, build scripts).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub module_path: Option<String>,
/// Per-file `use`-alias map for the defining Rust source.
///
/// Maps the local identifier introduced by a `use` declaration to its
/// fully qualified path (`"validate"` → `"crate::auth::token::validate"`).
/// Carried on every summary for the file even though it is per-file
/// information; the duplication keeps the persistence schema simple
/// and lets resolution operate purely off the caller's summary.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub rust_use_map: Option<BTreeMap<String, String>>,
/// Fully qualified prefixes of any wildcard `use ...::*` imports in
/// the defining Rust source. Stored separately because they expand
/// the candidate space at resolution time rather than naming a single
/// alias.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub rust_wildcards: Option<Vec<String>>,
}
// ── Cap conversion helpers ──────────────────────────────────────────────
impl FuncSummary {
#[inline]
pub fn source_caps(&self) -> Cap {
Cap::from_bits_truncate(self.source_caps)
}
#[inline]
pub fn sanitizer_caps(&self) -> Cap {
Cap::from_bits_truncate(self.sanitizer_caps)
}
#[inline]
pub fn sink_caps(&self) -> Cap {
Cap::from_bits_truncate(self.sink_caps)
}
/// Returns `true` when any parameter flows to the return value.
/// Also returns `true` for legacy summaries with `propagates_taint: true`
/// but empty `propagating_params` (backward compat).
pub fn propagates_any(&self) -> bool {
!self.propagating_params.is_empty() || self.propagates_taint
}
/// Build a [`FuncKey`] from this summary, normalizing the namespace
/// relative to `scan_root`.
pub fn func_key(&self, scan_root: Option<&str>) -> FuncKey {
FuncKey {
lang: Lang::from_slug(&self.lang).unwrap_or(Lang::Rust),
namespace: normalize_namespace(&self.file_path, scan_root),
container: self.container.clone(),
name: self.name.clone(),
arity: Some(self.param_count),
disambig: self.disambig,
kind: self.kind,
}
}
}
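// ── Sketch: three independent cap masks (illustrative only) ─────────────
// A small sketch of why the source, sanitizer, and sink masks are kept as
// separate `u16` fields. The bit names (`HTML`, `SHELL`, `SQL`) and the
// helpers `strip_sanitized` / `is_finding` are hypothetical and only show
// how a consumer might combine the masks; they are not the crate's API.
// Masking with `KNOWN` stands in for `Cap::from_bits_truncate`.
#[allow(dead_code)]
mod sketch_caps {
    pub const HTML: u16 = 1 << 0;
    pub const SHELL: u16 = 1 << 1;
    pub const SQL: u16 = 1 << 2;
    const KNOWN: u16 = HTML | SHELL | SQL;

    #[derive(Debug, Clone, Default)]
    pub struct Summary {
        pub source_caps: u16,
        pub sanitizer_caps: u16,
        pub sink_caps: u16,
        pub propagating_params: Vec<usize>,
        pub propagates_taint: bool,
    }

    impl Summary {
        /// Drop any bits that are not defined flags.
        pub fn sink_caps(&self) -> u16 {
            self.sink_caps & KNOWN
        }

        pub fn sanitizer_caps(&self) -> u16 {
            self.sanitizer_caps & KNOWN
        }

        /// Taint bits that survive a pass through this function.
        pub fn strip_sanitized(&self, taint: u16) -> u16 {
            taint & !self.sanitizer_caps()
        }

        /// True when tainted input still carries a bit this sink consumes.
        pub fn is_finding(&self, taint: u16) -> bool {
            (taint & self.sink_caps()) != 0
        }

        /// Legacy boolean is honoured when the index list is empty.
        pub fn propagates_any(&self) -> bool {
            !self.propagating_params.is_empty() || self.propagates_taint
        }
    }
}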
// ── Callee resolution ────────────────────────────────────────────────────
/// Result of resolving a bare callee name to a [`FuncKey`].
///
/// Three-valued: the call graph builder and taint engine need to distinguish
/// "no candidates at all" from "multiple candidates, can't pick one".
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CalleeResolution {
/// Exactly one candidate matched.
Resolved(FuncKey),
/// No candidates found at all.
NotFound,
/// Multiple candidates — ambiguous, cannot pick one.
Ambiguous(Vec<FuncKey>),
}
/// Structured query describing a call site.
///
/// Carries every hint needed to pick the right callee *by qualified identity*
/// first and only fall back on bare-leaf lookup as a last resort. The old
/// entry points (`resolve_callee_key`, `resolve_callee_key_with_container`)
/// are now thin wrappers that build a `CalleeQuery` with partial information.
///
/// Hint categories, ordered from strongest to weakest:
///
/// * `receiver_type` — authoritative class/impl/module name (e.g. from
/// type inference or a `use ...` resolution). When set, the resolver
/// *requires* the callee's container to equal this name and refuses to
/// fall back to a leaf-name collision if the qualified lookup misses.
/// * `namespace_qualifier` — syntactic qualifier parsed from the callee
/// (e.g. `"env"` in `env::var`, `"http"` in `http.Get`). Treated as a
/// container hint but not authoritative: a miss falls through.
/// * `receiver_var` — syntactic receiver variable name (e.g. `"obj"` in
/// `obj.method()`). Soft hint, used only to tie-break ambiguity.
/// * `caller_container` — caller's own enclosing container, used to
/// resolve bare self-calls inside a class/impl body.
///
/// `arity` is a hard filter — when `Some`, every candidate whose arity
/// differs is excluded from consideration.
#[derive(Debug, Clone)]
pub struct CalleeQuery<'a> {
/// Leaf (unqualified) callee name, e.g. `"process"` for `OrderService::process`.
pub name: &'a str,
pub caller_lang: Lang,
/// Project-relative namespace (file path) of the caller. Used for
/// same-namespace disambiguation when qualified hints miss.
pub caller_namespace: &'a str,
/// The caller's own container (`FuncKey::container`), for resolving
/// bare `self`/intra-class calls without a receiver.
pub caller_container: Option<&'a str>,
/// Authoritative receiver class/impl name. Populated from type facts
/// (`TypeKind::label_prefix`) or from Rust use-map resolution.
pub receiver_type: Option<&'a str>,
/// Syntactic namespace qualifier (non-authoritative). For
/// `std::env::var` in Rust the caller passes `"env"`; for `http.Get`
/// in Go, `"http"`. Left `None` for purely bare calls.
pub namespace_qualifier: Option<&'a str>,
/// Syntactic receiver variable name. Used only as a tie-breaker — a
/// variable name is a weak proxy for a class name.
pub receiver_var: Option<&'a str>,
/// Positional-argument count at the call site. Hard filter when set.
pub arity: Option<usize>,
}
impl<'a> CalleeQuery<'a> {
/// Whether this query carries any qualified identity hint stronger than
/// a bare leaf name. Used by the resolver to decide whether an
/// unresolved qualified match should still fall through to leaf lookup
/// (no hints → fall through; authoritative hints → refuse to guess).
pub fn has_qualified_hint(&self) -> bool {
self.receiver_type.is_some()
|| self.namespace_qualifier.is_some()
|| self.caller_container.is_some_and(|s| !s.is_empty())
}
}
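// ── Sketch: three-valued resolution with hard filters (illustrative) ────
// A reduced sketch of the resolution contract described above: `arity` is
// a hard filter, an authoritative `receiver_type` must equal the callee's
// container with no fallback, and the outcome is one of three cases. The
// `Key` type and `resolve` function are hypothetical simplifications; in
// particular, treating a candidate with unknown arity as a mismatch is an
// assumption of this example.
#[allow(dead_code)]
mod sketch_resolution {
    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct Key {
        pub container: String,
        pub name: String,
        pub arity: Option<usize>,
    }

    #[derive(Debug, Clone, PartialEq, Eq)]
    pub enum Resolution {
        Resolved(Key),
        NotFound,
        Ambiguous(Vec<Key>),
    }

    /// Convenience constructor for the examples.
    pub fn key(container: &str, name: &str, arity: usize) -> Key {
        Key { container: container.to_string(), name: name.to_string(), arity: Some(arity) }
    }

    pub fn resolve(
        candidates: &[Key],
        name: &str,
        receiver_type: Option<&str>,
        arity: Option<usize>,
    ) -> Resolution {
        let mut hits: Vec<Key> = candidates
            .iter()
            .filter(|k| k.name == name)
            // Hard filter: a differing arity excludes the candidate.
            .filter(|k| arity.map_or(true, |a| k.arity == Some(a)))
            // Authoritative hint: the container must match exactly.
            .filter(|k| receiver_type.map_or(true, |t| k.container == t))
            .cloned()
            .collect();
        match hits.len() {
            0 => Resolution::NotFound,
            1 => Resolution::Resolved(hits.remove(0)),
            _ => Resolution::Ambiguous(hits),
        }
    }
}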
// ── Lookup map used by the taint engine ─────────────────────────────────
/// A merged view of all function summaries keyed by qualified [`FuncKey`].
///
/// Functions are partitioned by language + namespace + name + arity. Two
/// functions with the same bare name but different languages or namespaces
/// are stored separately — no implicit cross-language merging occurs.
///
/// A secondary index `(Lang, name)` supports fast lookup by language + name
/// for same-language resolution in the taint engine.
#[derive(Default)]
pub struct GlobalSummaries {
by_key: HashMap<FuncKey, FuncSummary>,
/// Bare leaf-name index — kept for compatibility with callers that only
/// see an unqualified call string. A single name may map to many keys
/// across containers / files / arities.
by_lang_name: HashMap<(Lang, String), Vec<FuncKey>>,
/// Container-qualified index: keyed on `"{container}::{name}"` (or just
/// `name` for free functions). Used to resolve calls when the call-site
/// can supply a receiver / container hint (e.g. `OrderService::process`).
by_lang_qualified: HashMap<(Lang, String), Vec<FuncKey>>,
/// Rust-only secondary index keyed on `(module_path, name)`.
///
/// Populated whenever a Rust [`FuncSummary`] is inserted with a
/// `module_path` set. Used by use-map driven resolution to look up
/// candidates by their crate-relative module rather than their
/// filesystem path. Same name / module / arity overloads land on the
/// same vector — arity narrowing happens at resolution time.
by_rust_module: HashMap<(String, String), Vec<FuncKey>>,
/// Precise SSA-derived per-parameter summaries, keyed by `FuncKey`.
/// These take precedence over `FuncSummary` during callee resolution.
ssa_by_key: HashMap<FuncKey, SsaFuncSummary>,
/// Cross-file callee bodies for interprocedural symbolic execution.
/// Keyed by `FuncKey` (same identity model as SSA summaries).
bodies_by_key: HashMap<FuncKey, crate::taint::ssa_transfer::CalleeSsaBody>,
/// Per-function auth-check summaries for cross-file helper lifting.
/// Keyed by `FuncKey` so a call-site resolver can go from a resolved
/// callee name to the helper's auth-check signature. Populated in
/// pass 1 and consumed by
/// [`crate::auth_analysis::run_auth_analysis`] during pass 2.
auth_by_key: HashMap<FuncKey, crate::auth_analysis::model::AuthCheckSummary>,
}
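// ── Sketch: primary map plus secondary name index (illustrative only) ───
// A stripped-down sketch of the storage layout above: one primary map
// keyed by full identity, plus a secondary index from bare name to keys.
// Inserting under an existing key merges conservatively (OR the caps,
// union the parameter indices). The types and field set are hypothetical
// and far smaller than `FuncKey` / `FuncSummary`.
#[allow(dead_code)]
mod sketch_indexed_summaries {
    use std::collections::hash_map::Entry;
    use std::collections::HashMap;

    #[derive(Debug, Clone, PartialEq, Eq, Hash)]
    pub struct Key {
        pub namespace: String,
        pub name: String,
        pub arity: usize,
    }

    #[derive(Debug, Clone, Default, PartialEq)]
    pub struct Summary {
        pub sink_caps: u16,
        pub propagating_params: Vec<usize>,
    }

    #[derive(Default)]
    pub struct Summaries {
        by_key: HashMap<Key, Summary>,
        by_name: HashMap<String, Vec<Key>>,
    }

    impl Summaries {
        pub fn insert(&mut self, key: Key, summary: Summary) {
            match self.by_key.entry(key.clone()) {
                Entry::Occupied(mut slot) => {
                    let existing = slot.get_mut();
                    existing.sink_caps |= summary.sink_caps;
                    for idx in summary.propagating_params {
                        if !existing.propagating_params.contains(&idx) {
                            existing.propagating_params.push(idx);
                        }
                    }
                }
                Entry::Vacant(slot) => {
                    slot.insert(summary);
                }
            }
            // Keep the secondary index free of duplicate keys.
            let keys = self.by_name.entry(key.name.clone()).or_default();
            if !keys.contains(&key) {
                keys.push(key);
            }
        }

        pub fn get(&self, key: &Key) -> Option<&Summary> {
            self.by_key.get(key)
        }

        /// Every entry whose bare name matches, across namespaces.
        pub fn lookup(&self, name: &str) -> Vec<(&Key, &Summary)> {
            self.by_name
                .get(name)
                .map(|keys| {
                    keys.iter()
                        .filter_map(|k| self.by_key.get(k).map(|v| (k, v)))
                        .collect()
                })
                .unwrap_or_default()
        }
    }
}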
impl GlobalSummaries {
pub fn new() -> Self {
Self::default()
}
/// Walk a proposed insertion key, bumping the synthetic disambig
/// until either (a) the key is unoccupied, or (b) the entry found at
/// that key is compatible with the incoming summary (safe to merge).
///
/// Identity collisions are extraordinarily rare in practice (they
/// require two structurally distinct functions to land on the same
/// non-synthetic key, e.g. both with `disambig: None`). The loop
/// bound is defensive — if synthetic probing still collides after
/// 1024 attempts we fall through and let the caller merge, which
/// degrades gracefully to the old behaviour rather than looping
/// forever.
fn reconcile_func_summary_key(&self, mut key: FuncKey, summary: &FuncSummary) -> FuncKey {
let mut probe: u32 = 0;
loop {
match self.by_key.get(&key) {
Some(existing) if !summaries_compatible(existing, summary) => {
let synth = synthesize_disambig(summary).wrapping_add(probe);
key.disambig = Some(SYNTHETIC_DISAMBIG_BIT | (synth & !SYNTHETIC_DISAMBIG_BIT));
probe = probe.wrapping_add(1);
if probe >= 1024 {
tracing::warn!(
"summary identity collision probe gave up after 1024 attempts; \
falling back to union-merge for {}",
key
);
return key;
}
}
_ => return key,
}
}
}
/// SSA-summary variant of [`Self::reconcile_func_summary_key`].
///
/// Distinctness signals for SSA summaries are weaker than for
/// coarse `FuncSummary`s — the summary itself carries no explicit
/// `param_count`, only references to parameter indices. We combine:
///
/// * **Key arity fit** — any parameter index referenced by the new
/// summary that exceeds `key.arity` is a structural mismatch.
/// * **Existing-entry compare** — if an entry already lives at
/// this key and it disagrees on the set of referenced parameter
/// indices, the two cannot both describe the same function.
fn reconcile_ssa_summary_key(&self, mut key: FuncKey, summary: &SsaFuncSummary) -> FuncKey {
let mut probe: u32 = 0;
loop {
let conflict = match self.ssa_by_key.get(&key) {
Some(existing) => !ssa_summaries_compatible(existing, summary, key.arity),
None => !ssa_summary_fits_arity(summary, key.arity),
};
if !conflict {
return key;
}
let synth = synthesize_ssa_disambig(summary).wrapping_add(probe);
key.disambig = Some(SYNTHETIC_DISAMBIG_BIT | (synth & !SYNTHETIC_DISAMBIG_BIT));
probe = probe.wrapping_add(1);
if probe >= 1024 {
tracing::warn!(
"SSA summary identity collision probe gave up after 1024 attempts \
for {}",
key
);
return key;
}
}
}
/// Body variant of [`Self::reconcile_func_summary_key`].
///
/// `CalleeSsaBody` carries an explicit `param_count`, which must
/// agree with both `key.arity` and any co-located body's
/// `param_count`. A mismatch is a hard collision.
fn reconcile_body_key(
&self,
mut key: FuncKey,
body: &crate::taint::ssa_transfer::CalleeSsaBody,
) -> FuncKey {
let mut probe: u32 = 0;
loop {
let conflict = match self.bodies_by_key.get(&key) {
Some(existing) => existing.param_count != body.param_count,
None => match key.arity {
Some(a) => a != body.param_count,
None => false,
},
};
if !conflict {
return key;
}
let synth = (body.param_count as u32)
.wrapping_mul(0x9E37_79B9)
.wrapping_add(probe);
key.disambig = Some(SYNTHETIC_DISAMBIG_BIT | (synth & !SYNTHETIC_DISAMBIG_BIT));
probe = probe.wrapping_add(1);
if probe >= 1024 {
tracing::warn!(
"SSA body identity collision probe gave up after 1024 attempts for {}",
key
);
return key;
}
}
}
/// Insert or merge a summary. If an exact `FuncKey` match exists and
/// the two summaries describe the same function, merge conservatively
/// (OR caps/booleans, union params/callees).
///
/// `FuncKey` is structurally precise *when every producer populates
/// `disambig`*. Legacy on-disk JSON, interop configs, DB rows written
/// by older versions, and any code path that keeps `disambig: None`
/// can produce two keys that hash-equal even though they belong to
/// structurally distinct functions (e.g. different `param_count`,
/// `kind`, `container`, or `param_names`). Silently unioning those
/// would leak security-relevant caps across unrelated functions and
/// drop one of the two summaries entirely.
///
/// We therefore inspect the existing entry first. If the new summary
/// is not [`summaries_compatible`] with it, we mint a synthetic
/// disambig (top bit set to stay disjoint from byte-offset disambigs)
/// and retry the insert under the fresh key so *both* functions are
/// preserved.
pub fn insert(&mut self, key: FuncKey, summary: FuncSummary) {
let key = self.reconcile_func_summary_key(key, &summary);
let lang = key.lang;
let name = key.name.clone();
let qualified = key.qualified_name();
let rust_module = if lang == Lang::Rust {
summary.module_path.clone()
} else {
None
};
self.by_key
.entry(key.clone())
.and_modify(|existing| {
existing.source_caps |= summary.source_caps;
existing.sanitizer_caps |= summary.sanitizer_caps;
existing.sink_caps |= summary.sink_caps;
existing.propagates_taint |= summary.propagates_taint;
for &idx in &summary.propagating_params {
if !existing.propagating_params.contains(&idx) {
existing.propagating_params.push(idx);
}
}
for &idx in &summary.tainted_sink_params {
if !existing.tainted_sink_params.contains(&idx) {
existing.tainted_sink_params.push(idx);
}
}
union_param_sink_sites(&mut existing.param_to_sink, &summary.param_to_sink);
for c in &summary.callees {
if !existing.callees.iter().any(|e| {
e.name == c.name
&& e.arity == c.arity
&& e.receiver == c.receiver
&& e.qualifier == c.qualifier
&& e.ordinal == c.ordinal
}) {
existing.callees.push(c.clone());
}
}
})
.or_insert(summary);
let keys = self.by_lang_name.entry((lang, name)).or_default();
if !keys.contains(&key) {
keys.push(key.clone());
}
let q_keys = self.by_lang_qualified.entry((lang, qualified)).or_default();
if !q_keys.contains(&key) {
q_keys.push(key.clone());
}
if let Some(mp) = rust_module {
let mk = self
.by_rust_module
.entry((mp, key.name.clone()))
.or_default();
if !mk.contains(&key) {
mk.push(key);
}
}
}
/// Exact lookup by fully-qualified key.
pub fn get(&self, key: &FuncKey) -> Option<&FuncSummary> {
self.by_key.get(key)
}
/// Interop / external-edge lookup: tolerant of `disambig` being `None`.
///
/// Interop edges originate outside the source code (user-specified JSON,
/// language-bridge config) and cannot know a callee's internal byte-offset
/// disambiguator. When the query key has `disambig = None` we fall back to
/// scanning for a single match on `(lang, namespace, container, name,
/// arity, kind)`. If exactly one matches it is returned; otherwise we
/// return `None` to preserve determinism (ambiguity is treated as unknown).
pub fn get_for_interop(&self, key: &FuncKey) -> Option<&FuncSummary> {
if let Some(hit) = self.by_key.get(key) {
return Some(hit);
}
if key.disambig.is_some() {
return None;
}
let mut matches = self.by_key.iter().filter(|(k, _)| {
k.lang == key.lang
&& k.namespace == key.namespace
&& k.container == key.container
&& k.name == key.name
&& k.arity == key.arity
&& k.kind == key.kind
});
let first = matches.next()?;
if matches.next().is_some() {
None
} else {
Some(first.1)
}
}
/// All same-language matches for a bare function name.
pub fn lookup_same_lang(&self, lang: Lang, name: &str) -> Vec<(&FuncKey, &FuncSummary)> {
self.by_lang_name
.get(&(lang, name.to_string()))
.map(|keys| {
keys.iter()
.filter_map(|k| self.by_key.get(k).map(|v| (k, v)))
.collect()
})
.unwrap_or_default()
}
/// Rust-only lookup by `(module_path, name)`.
///
/// Returns every candidate that was inserted with a matching module
/// path. Arity filtering is applied by the caller so that the index
/// stays ambiguity-aware (two overloads legitimately share a module
/// path + name and only differ in arity).
pub fn lookup_rust_module(
&self,
module_path: &str,
name: &str,
) -> Vec<(&FuncKey, &FuncSummary)> {
self.by_rust_module
.get(&(module_path.to_string(), name.to_string()))
.map(|keys| {
keys.iter()
.filter_map(|k| self.by_key.get(k).map(|v| (k, v)))
.collect()
})
.unwrap_or_default()
}
/// Container-qualified lookup. `qualified` should be
/// `"Container::name"` (use [`FuncKey::qualified_name`]) or `"name"`.
pub fn lookup_qualified(&self, lang: Lang, qualified: &str) -> Vec<(&FuncKey, &FuncSummary)> {
self.by_lang_qualified
.get(&(lang, qualified.to_string()))
.map(|keys| {
keys.iter()
.filter_map(|k| self.by_key.get(k).map(|v| (k, v)))
.collect()
})
.unwrap_or_default()
}
/// Merge another `GlobalSummaries` into this one (for parallel fold/reduce).
pub fn merge(&mut self, other: GlobalSummaries) {
// `insert` rebuilds every secondary index (by_lang_name, by_lang_qualified,
// by_rust_module) from the summary itself, so we do not need to copy
// `other.by_rust_module` explicitly — draining `other.by_key` is enough.
for (key, summary) in other.by_key {
self.insert(key, summary);
}
// SSA summaries: last-writer-wins (exact-key replacement, no unioning)
self.ssa_by_key.extend(other.ssa_by_key);
// Cross-file bodies: last-writer-wins
self.bodies_by_key.extend(other.bodies_by_key);
// Auth summaries: last-writer-wins (exact-key replacement)
self.auth_by_key.extend(other.auth_by_key);
}
/// Insert an SSA summary.
///
/// Per-function refinement is expressed via last-writer-wins for
/// *compatible* summaries: re-analysing the same function body with
/// more precise seeds yields a strictly better summary, and the
/// caller genuinely wants the new one to replace the old.
///
/// When the existing entry is **incompatible** with the incoming
/// one — the key's `arity` disagrees with the new summary's referenced
/// parameter indices, or the two summaries would describe different
/// functions — we synthesize a disambig so both are kept. Silent
/// replacement in that case would drop one function's cross-file
/// taint signal entirely, which the caller cannot recover.
pub fn insert_ssa(&mut self, key: FuncKey, summary: SsaFuncSummary) {
let key = self.reconcile_ssa_summary_key(key, &summary);
self.ssa_by_key.insert(key, summary);
}
/// Exact lookup of an SSA summary by fully-qualified key.
pub fn get_ssa(&self, key: &FuncKey) -> Option<&SsaFuncSummary> {
self.ssa_by_key.get(key)
}
/// Insert an `AuthCheckSummary` for cross-file helper lifting.
///
/// Last-writer-wins: re-analysing a file produces a fresh summary
/// that fully replaces any earlier entry. No compatibility
/// reconciliation is needed because `AuthCheckSummary` carries no
/// identity-sensitive signal beyond the key itself.
pub fn insert_auth(
&mut self,
key: FuncKey,
summary: crate::auth_analysis::model::AuthCheckSummary,
) {
self.auth_by_key.insert(key, summary);
}
/// Exact lookup of an `AuthCheckSummary` by fully-qualified key.
pub fn get_auth(
&self,
key: &FuncKey,
) -> Option<&crate::auth_analysis::model::AuthCheckSummary> {
self.auth_by_key.get(key)
}
/// Direct access to the auth-summary map. `None` when empty so
/// callers can distinguish "no cross-file auth summaries loaded"
/// from "some were loaded but none matched the call site".
pub fn auth_by_key(
&self,
) -> Option<&HashMap<FuncKey, crate::auth_analysis::model::AuthCheckSummary>> {
if self.auth_by_key.is_empty() {
None
} else {
Some(&self.auth_by_key)
}
}
/// Count of cross-file auth summaries currently loaded.
pub fn auth_len(&self) -> usize {
self.auth_by_key.len()
}
/// Insert a cross-file callee body.
///
/// See [`insert_ssa`](Self::insert_ssa) for the identity-safety rule.
/// Bodies additionally carry `param_count`, giving a hard structural
/// signal: a collision between bodies with different `param_count`
/// cannot be the same function and is always rekeyed.
pub fn insert_body(&mut self, key: FuncKey, body: crate::taint::ssa_transfer::CalleeSsaBody) {
let key = self.reconcile_body_key(key, &body);
self.bodies_by_key.insert(key, body);
}
/// Exact lookup of a cross-file callee body by fully-qualified key.
pub fn get_body(&self, key: &FuncKey) -> Option<&crate::taint::ssa_transfer::CalleeSsaBody> {
self.bodies_by_key.get(key)
}
/// Direct access to the cross-file body map.
///
/// Returns `None` when no cross-file bodies were loaded (empty map).
/// The taint engine uses this to thread bodies through
/// [`crate::taint::ssa_transfer::SsaTaintTransfer::cross_file_bodies`]
/// and `resolve_callee` for context-sensitive cross-file inline
/// analysis.
pub fn bodies_by_key(
&self,
) -> Option<&HashMap<FuncKey, crate::taint::ssa_transfer::CalleeSsaBody>> {
if self.bodies_by_key.is_empty() {
None
} else {
Some(&self.bodies_by_key)
}
}
/// Count of cross-file bodies currently loaded. Exposed for
/// `tracing::debug!` observability — lets callers distinguish "no
/// bodies available" from "bodies available but inline didn't fire".
pub fn bodies_len(&self) -> usize {
self.bodies_by_key.len()
}
/// Resolve a bare callee name to a cross-file body.
///
/// Uses `resolve_callee_key()` for strict deterministic resolution,
/// then checks `bodies_by_key`. Returns `None` on `Ambiguous` or `NotFound`.
pub fn resolve_callee_body(
&self,
lang: Lang,
name: &str,
arity_hint: Option<usize>,
caller_namespace: &str,
) -> Option<&crate::taint::ssa_transfer::CalleeSsaBody> {
match self.resolve_callee_key(name, lang, caller_namespace, arity_hint) {
CalleeResolution::Resolved(key) => self.bodies_by_key.get(&key),
CalleeResolution::NotFound | CalleeResolution::Ambiguous(_) => None,
}
}
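/// `true` when no function, SSA, or auth summaries are loaded.
/// Cross-file bodies (`bodies_by_key`) are not consulted.
#[allow(dead_code)] // used by tests and future call-graph consumers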
pub fn is_empty(&self) -> bool {
self.by_key.is_empty() && self.ssa_by_key.is_empty() && self.auth_by_key.is_empty()
}
/// Iterate over all (key, summary) pairs.
pub fn iter(&self) -> impl Iterator<Item = (&FuncKey, &FuncSummary)> {
self.by_key.iter()
}
/// Snapshot the convergence-relevant fields of every summary.
///
/// Returns `(source_caps, sanitizer_caps, sink_caps, propagating_params)`
/// per key. Used by the SCC fixed-point loop to detect when an iteration
/// has not changed any summary — i.e. convergence.
pub fn snapshot_caps(&self) -> HashMap<FuncKey, (u16, u16, u16, Vec<usize>)> {
self.by_key
.iter()
.map(|(k, s)| {
(
k.clone(),
(
s.source_caps,
s.sanitizer_caps,
s.sink_caps,
s.propagating_params.clone(),
),
)
})
.collect()
}
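/// Borrow the SSA summaries for convergence detection.
///
/// Returns a reference to the live map rather than a copy, so a caller that
/// needs a before/after comparison has to clone it first. Used alongside
/// [`snapshot_caps`](Self::snapshot_caps) in the SCC fixed-point loop so that
/// SSA-only refinements (e.g. a `StripBits` transform appearing after a
/// cross-file sanitizer is resolved) are not invisible to convergence.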
pub fn snapshot_ssa(&self) -> &HashMap<FuncKey, SsaFuncSummary> {
&self.ssa_by_key
}
/// Rust-only resolution that consults the caller's `use` map before
/// falling back to generic resolution.
///
/// The caller passes the callee's leaf name plus the (optional)
/// structured qualifier that `CalleeSite.qualifier` carries for Rust
/// call sites (e.g. `"crate::auth::token"` for `crate::auth::token::validate()`).
/// The `use` map and wildcard list come from the caller's own
/// [`FuncSummary`].
///
/// Resolution order:
///
/// 1. If the caller has a `use_map` and (qualifier, name) resolves to a
/// fully qualified path, strip the leading `crate::` and look up
/// `(module_path, name)` in the Rust module index. If arity filtering
/// leaves exactly one candidate → resolved.
/// 2. Otherwise, for each wildcard prefix in scope, try
/// `(wildcard_prefix, name)` in the module index. If across all
/// wildcards exactly one arity-filtered candidate appears → resolved.
/// 3. Otherwise fall through to [`resolve_callee_key_with_container`](Self::resolve_callee_key_with_container)
/// with no `container_hint` — meaning only the existing namespace /
/// arity disambiguation applies.
///
/// A `None` use_map (non-Rust file or no `use` declarations) makes this
/// equivalent to the generic path.
pub fn resolve_callee_key_rust(
&self,
callee: &str,
qualifier: Option<&str>,
arity_hint: Option<usize>,
caller_namespace: &str,
use_map: Option<&crate::rust_resolve::RustUseMap>,
) -> CalleeResolution {
use crate::rust_resolve::{resolve_with_use_map, split_module_and_name};
// 1) Try direct use-map resolution.
if let Some(um) = use_map
&& let Some(full) = resolve_with_use_map(um, qualifier, callee)
{
let (module_path, name) = split_module_and_name(&full);
if !module_path.is_empty() {
let candidates = self.lookup_rust_module(&module_path, &name);
let filtered: Vec<&FuncKey> = match arity_hint {
Some(a) => candidates
.iter()
.filter(|(k, _)| k.arity == Some(a))
.map(|(k, _)| *k)
.collect(),
None => candidates.iter().map(|(k, _)| *k).collect(),
};
if filtered.len() == 1 {
return CalleeResolution::Resolved(filtered[0].clone());
}
}
}
// 2) Try wildcards. Each wildcard expands `use prefix::*;` into an
// implicit `(prefix, name)` candidate set; we union across all
// wildcards and only resolve when exactly one matches under the
// arity filter.
if let Some(um) = use_map
&& !um.wildcards.is_empty()
{
let mut collected: Vec<FuncKey> = Vec::new();
for w in &um.wildcards {
let prefix = w.strip_prefix("crate::").unwrap_or(w);
if prefix.is_empty() {
continue;
}
for (k, _) in self.lookup_rust_module(prefix, callee) {
if let Some(a) = arity_hint
&& k.arity != Some(a)
{
continue;
}
if !collected.contains(k) {
collected.push(k.clone());
}
}
}
if collected.len() == 1 {
return CalleeResolution::Resolved(collected.remove(0));
}
}
// 3) Fall back to generic same-language resolution.
self.resolve_callee_key_with_container(
callee,
Lang::Rust,
caller_namespace,
None,
arity_hint,
)
}
/// Resolve a bare (already-normalized) callee name to a [`FuncKey`].
///
/// Thin wrapper around [`resolve_callee`](Self::resolve_callee) that
/// constructs a minimal [`CalleeQuery`] with no qualified hints. Kept for
/// call sites that only hold a string callee and an arity; prefer
/// [`resolve_callee`](Self::resolve_callee) whenever receiver / qualifier /
/// container information is available.
pub fn resolve_callee_key(
&self,
callee: &str,
caller_lang: Lang,
caller_namespace: &str,
arity_hint: Option<usize>,
) -> CalleeResolution {
self.resolve_callee(&CalleeQuery {
name: callee,
caller_lang,
caller_namespace,
caller_container: None,
receiver_type: None,
namespace_qualifier: None,
receiver_var: None,
arity: arity_hint,
})
}
/// Resolve a callee name with an optional container hint.
///
/// Legacy entry point — kept so tests and older callers compile
/// unchanged. `container_hint` is interpreted as a syntactic
/// container qualifier (not an authoritative receiver type), so a
/// miss is allowed to fall through to leaf-name lookup. New
/// callers should route through [`resolve_callee`](Self::resolve_callee) and classify
/// their hint as `receiver_type` vs `namespace_qualifier` vs
/// `receiver_var` so the resolver can apply the correct policy.
pub fn resolve_callee_key_with_container(
&self,
callee: &str,
caller_lang: Lang,
caller_namespace: &str,
container_hint: Option<&str>,
arity_hint: Option<usize>,
) -> CalleeResolution {
self.resolve_callee(&CalleeQuery {
name: callee,
caller_lang,
caller_namespace,
caller_container: None,
receiver_type: None,
namespace_qualifier: container_hint,
receiver_var: None,
arity: arity_hint,
})
}
/// Resolve a callee with full structured hints.
///
/// **New resolution order** (qualified identity primary, leaf name
/// fallback):
///
/// 1. **Receiver-type qualified** — if `receiver_type` is set,
/// consult `by_lang_qualified[{receiver_type}::{name}]` with the
/// arity filter. Exactly-one → resolved; same-namespace
/// tie-breaker if multiple. *Receiver types are authoritative*:
/// a miss does not fall back to bare leaf lookup (that would be
/// a silent reinterpretation).
/// 2. **Namespace-qualifier qualified** — if `namespace_qualifier`
/// is set, try the qualified index with that container.
/// Non-authoritative: a miss falls through.
/// 3. **Caller-self-container** — when the caller lives inside a
/// container (method body), try the qualified index against the
/// caller's own container. Resolves bare `foo()` self-calls
/// inside a class without collapsing into an unrelated same-leaf
/// definition in another file.
/// 4. **Same-namespace unique leaf** — intra-file bare-leaf call:
/// if the caller's namespace contains exactly one arity-matched
/// candidate with this leaf, resolve to it.
/// 5. **Receiver-variable tie-break** — if the same-namespace
/// lookup misses but the raw call came with a receiver variable,
/// try `{receiver_var}::{name}` as a last qualified attempt.
///
/// 5.5. **Bare-call free-function preference** — for a truly bare
/// call (no receiver type, no namespace qualifier, no receiver
/// variable), if exactly one same-namespace arity-matched
/// candidate has an empty container, resolve to it. A class
/// method cannot be invoked with bare-call syntax from outside
/// its class, so this disambiguation is safe even when same-name
/// methods exist elsewhere in the file.
/// 6. **Leaf-name fallback** — arity-filtered same-language lookup.
/// Unique → resolved. Multiple + we had any qualified hint →
/// Ambiguous (refuse to guess when a qualifier exists but
/// missed). Multiple + no qualified hint → Ambiguous, narrowed to
/// the same-namespace candidates when any exist.
pub fn resolve_callee(&self, q: &CalleeQuery<'_>) -> CalleeResolution {
// ── Helpers ─────────────────────────────────────────────────
let arity_matches = |k: &FuncKey| match q.arity {
Some(a) => k.arity == Some(a),
None => true,
};
// Look up `{container}::{name}` and return a single arity-matched
// candidate if one exists (using same-namespace to break ties).
let try_qualified = |container: &str| -> Option<FuncKey> {
if container.is_empty() {
return None;
}
let qual = format!("{container}::{}", q.name);
let candidates: Vec<&FuncKey> = self
.lookup_qualified(q.caller_lang, &qual)
.into_iter()
.map(|(k, _)| k)
.filter(|k| arity_matches(k))
.collect();
match candidates.len() {
0 => None,
1 => Some(candidates[0].clone()),
_ => {
let same_ns: Vec<&FuncKey> = candidates
.iter()
.copied()
.filter(|k| k.namespace == q.caller_namespace)
.collect();
if same_ns.len() == 1 {
Some(same_ns[0].clone())
} else {
None
}
}
}
};
// ── Step 1: receiver_type (authoritative) ───────────────────
if let Some(rt) = q.receiver_type {
if let Some(key) = try_qualified(rt) {
return CalleeResolution::Resolved(key);
}
// Authoritative miss: before returning, check whether any
// candidate exists at all for the leaf name. If there are
// some, report Ambiguous with the leaf candidates (so the
// caller knows we saw the name but refused to pick the
// wrong container). If there are none, return NotFound.
let bare: Vec<&FuncKey> = self
.lookup_same_lang(q.caller_lang, q.name)
.into_iter()
.map(|(k, _)| k)
.filter(|k| arity_matches(k))
.collect();
return if bare.is_empty() {
CalleeResolution::NotFound
} else {
CalleeResolution::Ambiguous(bare.into_iter().cloned().collect())
};
}
// ── Step 2: namespace_qualifier (non-authoritative) ─────────
if let Some(nq) = q.namespace_qualifier
&& let Some(key) = try_qualified(nq)
{
return CalleeResolution::Resolved(key);
}
// ── Step 3: caller self-container ───────────────────────────
if let Some(cc) = q.caller_container
&& let Some(key) = try_qualified(cc)
{
return CalleeResolution::Resolved(key);
}
// ── Step 4: same-namespace unique leaf ──────────────────────
let all_candidates: Vec<&FuncKey> = self
.lookup_same_lang(q.caller_lang, q.name)
.into_iter()
.map(|(k, _)| k)
.collect();
if all_candidates.is_empty() {
return CalleeResolution::NotFound;
}
let arity_filtered: Vec<&FuncKey> = all_candidates
.iter()
.copied()
.filter(|k| arity_matches(k))
.collect();
if arity_filtered.is_empty() {
return CalleeResolution::NotFound;
}
let same_ns: Vec<&FuncKey> = arity_filtered
.iter()
.copied()
.filter(|k| k.namespace == q.caller_namespace)
.collect();
if same_ns.len() == 1 {
return CalleeResolution::Resolved(same_ns[0].clone());
}
// ── Step 5: receiver_var tie-break (soft) ───────────────────
if let Some(rv) = q.receiver_var
&& let Some(key) = try_qualified(rv)
{
return CalleeResolution::Resolved(key);
}
// ── Step 5.5: bare-call free-function preference ────────────
// A call with no receiver, no namespace qualifier, and no
// authoritative receiver type is syntactically a free-function
// invocation: a class method cannot be invoked that way from
// outside its own class (intra-class self-calls were already
// resolved by step 3). When the same-namespace candidate set
// contains exactly one empty-container entry, it is the
// unambiguous target — returning Ambiguous here would be a
// silent false negative whenever a top-level helper happens to
// share a name with some method elsewhere in the file.
let syntactic_bare = q.receiver_type.is_none()
&& q.namespace_qualifier.is_none()
&& q.receiver_var.is_none();
if syntactic_bare {
let empty_container_same_ns: Vec<&FuncKey> = same_ns
.iter()
.copied()
.filter(|k| k.container.is_empty())
.collect();
if empty_container_same_ns.len() == 1 {
return CalleeResolution::Resolved(empty_container_same_ns[0].clone());
}
}
// ── Step 6: leaf fallback ───────────────────────────────────
if arity_filtered.len() == 1 {
return CalleeResolution::Resolved(arity_filtered[0].clone());
}
// Multiple arity-matched candidates remain. When a qualified
// hint was supplied but missed, refuse to guess — a silent
// leaf-name pick would defeat the point of qualified-first
// resolution. (`receiver_type` is handled in Step 1 and never
// reaches here; `namespace_qualifier` / `caller_container`
// missing their target flow through as a soft miss.)
if q.has_qualified_hint() {
return CalleeResolution::Ambiguous(arity_filtered.into_iter().cloned().collect());
}
// No qualified hints whatsoever — tolerate namespace narrowing.
match same_ns.len() {
// Defensive only: Step 4 already returned on a unique same-namespace match.
1 => CalleeResolution::Resolved(same_ns[0].clone()),
0 => CalleeResolution::Ambiguous(arity_filtered.into_iter().cloned().collect()),
_ => CalleeResolution::Ambiguous(same_ns.into_iter().cloned().collect()),
}
}
}
impl std::fmt::Debug for GlobalSummaries {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("GlobalSummaries")
.field("len", &self.by_key.len())
.field("ssa_len", &self.ssa_by_key.len())
.field("bodies_len", &self.bodies_by_key.len())
.field("auth_len", &self.auth_by_key.len())
.finish()
}
}
/// Return `true` iff two `FuncSummary`s can be safely union-merged at the
/// same `FuncKey`.
///
/// Only fields that a single function definition is guaranteed to agree on
/// are compared. Behaviour fields (`source_caps`, `propagating_params`,
/// `callees`, …) are deliberately ignored: merge is *allowed* to combine
/// those. The test is symmetric.
///
/// Comparison rules
/// ────────────────
/// * **`param_count` / `kind` / `container`**: must always agree.
/// Any mismatch is a hard collision between distinct functions.
/// * **`file_path`**: must agree when both sides are populated. A blank path
/// can come from synthetic summaries constructed in tests / interop
/// configs and should not force a split.
/// * **`param_names`**: must agree when both sides are populated. Legacy
/// summaries may persist with empty names; treating empty as "unknown"
/// avoids gratuitous splits while still catching real divergence.
/// * **`module_path`**: Rust-only. Must agree when both sides are `Some`.
/// A missing module path on one side is legacy-compatible; two *distinct*
/// `Some` values mean the two summaries belong to different crates'
/// module trees.
pub(crate) fn summaries_compatible(a: &FuncSummary, b: &FuncSummary) -> bool {
if a.param_count != b.param_count {
return false;
}
if a.kind != b.kind {
return false;
}
if a.container != b.container {
return false;
}
if !a.file_path.is_empty() && !b.file_path.is_empty() && a.file_path != b.file_path {
return false;
}
if !a.param_names.is_empty() && !b.param_names.is_empty() && a.param_names != b.param_names {
return false;
}
match (&a.module_path, &b.module_path) {
(Some(l), Some(r)) if l != r => return false,
_ => {}
}
true
}
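/// Derive a deterministic synthetic disambiguator from the
/// identity-relevant fields of a `FuncSummary`, plus its capability bits
/// (`source_caps`, `sanitizer_caps`, `sink_caps`).
///
/// "Deterministic" holds within one build: `DefaultHasher::new()` always
/// starts from the same state, but its algorithm is unspecified and may
/// change between Rust releases, so the value should not be relied on
/// across toolchains.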
///
/// The top bit is **not** set here — the caller composes the final value
/// via `SYNTHETIC_DISAMBIG_BIT | (hash & !SYNTHETIC_DISAMBIG_BIT)` so that
/// (a) the caller can safely bump the low bits to probe for a free slot,
/// and (b) the synthetic namespace stays disjoint from byte-offset
/// disambigs produced by `cfg.rs`.
pub(crate) fn synthesize_disambig(summary: &FuncSummary) -> u32 {
let mut h = std::collections::hash_map::DefaultHasher::new();
summary.param_count.hash(&mut h);
summary.param_names.hash(&mut h);
summary.container.hash(&mut h);
summary.kind.hash(&mut h);
summary.file_path.hash(&mut h);
summary.source_caps.hash(&mut h);
summary.sanitizer_caps.hash(&mut h);
summary.sink_caps.hash(&mut h);
summary.module_path.hash(&mut h);
h.finish() as u32
}
/// Return `true` iff the new `SsaFuncSummary` is consistent with the
/// existing one at the same `FuncKey`.
///
/// `SsaFuncSummary` carries no explicit `param_count`, so the check is
/// made against the key's `arity` instead: the two summaries are
/// compatible only when every parameter index each one references fits
/// within `key_arity`. A refined summary that merely adds flows for
/// previously-silent parameters therefore stays compatible, as long as
/// those parameters exist under the key's arity. An unknown arity
/// (`None`) accepts any pair.
fn ssa_summaries_compatible(
existing: &SsaFuncSummary,
new: &SsaFuncSummary,
key_arity: Option<usize>,
) -> bool {
if !ssa_summary_fits_arity(existing, key_arity) {
// Existing entry itself is inconsistent with the key; don't let
// that inconsistency mask a real collision with the new entry.
return false;
}
if !ssa_summary_fits_arity(new, key_arity) {
return false;
}
true
}
/// Every parameter index referenced by `summary` must fit inside
/// `key_arity` when it is known. `None` (unknown arity) accepts any
/// index.
fn ssa_summary_fits_arity(summary: &SsaFuncSummary, key_arity: Option<usize>) -> bool {
let arity = match key_arity {
Some(a) => a,
None => return true,
};
let refs = summary
.param_to_return
.iter()
.map(|(i, _)| *i)
.chain(summary.param_to_sink.iter().map(|(i, _)| *i))
.chain(summary.param_to_sink_param.iter().map(|(i, _, _)| *i))
.chain(summary.param_container_to_return.iter().copied())
.chain(
summary
.param_to_container_store
.iter()
.flat_map(|(a, b)| [*a, *b]),
)
.chain(summary.source_to_callback.iter().map(|(i, _)| *i))
.chain(summary.abstract_transfer.iter().map(|(i, _)| *i))
.chain(summary.param_return_paths.iter().map(|(i, _)| *i));
for i in refs {
if i >= arity {
return false;
}
}
// Every parameter referenced by a points-to edge must also fit the
// key's arity. An overflow-flagged summary is conservative by
// construction and can be kept as-is.
if let Some(max) = summary.points_to.max_param_index()
&& (max as usize) >= arity
{
return false;
}
true
}
/// Derive a deterministic synthetic disambiguator for an
/// `SsaFuncSummary`. Mirrors `synthesize_disambig` but restricted to
/// SSA-level structural signals.
fn synthesize_ssa_disambig(summary: &SsaFuncSummary) -> u32 {
let mut h = std::collections::hash_map::DefaultHasher::new();
summary.param_to_return.len().hash(&mut h);
summary.param_to_sink.len().hash(&mut h);
summary.source_caps.bits().hash(&mut h);
summary.param_to_sink_param.len().hash(&mut h);
summary.param_container_to_return.len().hash(&mut h);
summary.param_to_container_store.len().hash(&mut h);
summary.receiver_to_sink.bits().hash(&mut h);
summary.receiver_to_return.is_some().hash(&mut h);
summary.return_type.is_some().hash(&mut h);
summary.return_abstract.is_some().hash(&mut h);
summary.source_to_callback.len().hash(&mut h);
summary.abstract_transfer.len().hash(&mut h);
summary.param_return_paths.len().hash(&mut h);
summary.points_to.edges.len().hash(&mut h);
summary.points_to.overflow.hash(&mut h);
summary.points_to.returns_fresh_alloc.hash(&mut h);
h.finish() as u32
}
/// Merge a set of per-file summaries into a single `GlobalSummaries` map.
///
/// Merging only happens for exact `FuncKey` matches (same lang + namespace +
/// container + name + arity + kind + disambig). Functions with the same bare
/// name but different languages or namespaces are stored separately.
pub fn merge_summaries(
per_file: impl IntoIterator<Item = FuncSummary>,
scan_root: Option<&str>,
) -> GlobalSummaries {
let mut map = GlobalSummaries::new();
for fs in per_file {
let key = fs.func_key(scan_root);
map.insert(key, fs);
}
map
}
#[cfg(test)]
mod tests;