diff --git a/CHANGELOG.md b/CHANGELOG.md
index 6e994990..d042bc9d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,185 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [Unreleased]
+
+## [0.4.0] - 2025-02-25
+
+### Added
+- **Low-noise prioritization system** — post-analysis pipeline that reduces noise from high-frequency LOW/Quality findings without hiding security signal. Three-stage process: category filtering, rollup grouping, and LOW budgets.
+  - **`FindingCategory` enum** (`Security`, `Reliability`, `Quality`) — every `Diag` now carries a `category` field. AST pattern findings derive their category from `PatternCategory` metadata (`CodeQuality` → `Quality`, all others → `Security`). Taint, CFG, and state findings are always `Security`.
+  - **Category filtering** — Quality-category findings (e.g. `rs.quality.unwrap`, `rs.quality.expect`) are excluded by default. Use `--include-quality` to include them.
+  - **Rollup grouping** — eligible high-frequency rules (`rs.quality.unwrap`, `rs.quality.expect`, `rs.quality.panic_macro`) are grouped by `(file, rule)` into a single rollup finding with occurrence count and example locations. Canonical location is the first sorted occurrence. Example count controlled by `--rollup-examples` (default 5).
+  - **LOW budgets** — three configurable limits enforce noise caps: `--max-low` (default 20, total), `--max-low-per-file` (default 1), `--max-low-per-rule` (default 10). Rollups count as one finding for all budgets. High/Medium findings are never dropped.
+  - **`--all` CLI flag** — disables all prioritization (no category filtering, no rollups, no budgets).
+  - **`--show-instances `** — bypasses rollup for a specific rule, expanding all individual occurrences.
+  - **Console suppression footer** — when findings are suppressed, a footer displays the count and active filter values with adjustment hints.
+  - **`rollup` field on `Diag`** — optional `RollupData` with `count` and `occurrences` (example `Location`s). Serializes to JSON automatically; omitted when not a rollup.
+  - **SARIF rollup support** — `category` in result properties, rollup count in `properties.rollup.count`, example locations in `relatedLocations`.
+  - **`max_results` severity stability** — when `max_results` truncation is needed, High findings are kept first, then Medium, then Low. Low findings never displace higher-severity ones.
+  - New config fields in `[output]`: `include_quality`, `show_all`, `max_low`, `max_low_per_file`, `max_low_per_rule`, `rollup_examples`.
+  - 14 new unit tests covering category filtering, rollup grouping/examples/canonical, LOW budgets (per-file/per-rule/total), High/Medium immunity, rollup-counts-as-one, show_instances bypass, JSON serialization, and determinism.
+- **Pattern-level confidence for AST rules** — each AST pattern in `src/patterns/` now carries an explicit `confidence: Confidence` field (High, Medium, or Low). Confidence is set at the pattern definition site and flows directly into emitted `Diag`s, replacing the old heuristic that inferred AST confidence from severity alone. `compute_confidence()` is retained as a fallback for detectors that don't set confidence (taint, state, legacy).
+  - Tier A patterns with High/Medium severity → `Confidence::High` (deterministic structural match).
+  - Tier A patterns with Low severity → `Confidence::Medium` (quality/crypto signals).
+  - Tier B patterns (heuristic-guarded) → `Confidence::Medium`.
+  - Example: `rs.quality.expect` now produces `Confidence: High` regardless of its Low severity.
+- **Inline per-finding suppressions** — suppress specific findings directly in source code using `nyx:ignore` comments. Two directive forms: `nyx:ignore ` (same line) and `nyx:ignore-next-line ` (next line). Supports comma-separated IDs, wildcard suffixes (`rs.quality.*`), and automatic canonicalization of taint rule IDs (parenthetical suffixes stripped). Comment detection covers all 10 languages with string/raw-string/template-literal guards to avoid false positives.
+  - **`--show-suppressed` CLI flag** — reveal suppressed findings in output, dimmed with `[SUPPRESSED]` tag. Summary shows `"N issues (M suppressed)"`. In JSON/SARIF mode, suppressed findings include `"suppressed": true` and `"suppression": {...}` metadata fields.
+  - **`suppressed` and `suppression` fields on `Diag`** — conditionally serialized; JSON output is unchanged when no suppressions are active.
+  - Suppressed findings are excluded from `--fail-on` exit-code checks and severity counts.
+  - New module `src/suppress/mod.rs` with 22 unit tests covering all comment styles, string guards, wildcard matching, canonicalization, CRLF, and edge cases.
+- **`--min-score ` CLI flag and `output.min_score` config option** — filter out findings whose attack-surface rank score falls below the given threshold. Applied after ranking and severity filtering, before `max_results` truncation. Has no effect when `--no-rank` is used. CLI value overrides config.
+- **Attack surface ranking** — deterministic post-analysis scoring layer that prioritizes findings by exploitability. Each `Diag` receives an `f64` score computed from five components: severity base (High=60, Medium=30, Low=10), analysis kind bonus (taint +10 > state +8 > cfg +3/5 > ast 0), evidence strength (+1 per item, +2–6 for source-kind priority), state rule type bonus (+1–6), and a path-validation penalty (−5 for guarded paths). Findings are sorted by descending score before truncation so `max_results` keeps the most important results. Tie-breaking is deterministic by severity, rule ID, file path, line, column, and message hash.
+  - **`rank_score` and `rank_reason` fields on `Diag`** — optional fields with `#[serde(skip_serializing_if = "Option::is_none")]`; JSON output is unchanged when ranking is disabled.
+  - **`--no-rank` CLI flag** — disables attack-surface ranking (enabled by default).
+  - **`output.attack_surface_ranking` config key** — boolean (default `true`) to control ranking via config file.
+  - **Console score display** — dim `Score: N` appended to each finding's header line when ranking is enabled.
+  - **New module `src/rank.rs`** — `compute_attack_rank()`, `rank_diags()`, and `sort_key()` functions. Scoring uses only in-memory data; no extra file I/O or graph recomputation.
+  - 10 new unit tests: ordering correctness (high taint > medium file-io, must-leak > may-leak, taint > cfg-only, state rules, AST lowest at same severity), determinism (input-order-independent), path-validation penalty, and JSON serialization (rank fields omitted when None, present when set).
+- **State-model dataflow analysis** — new `src/state/` module implementing a forward worklist dataflow engine over the existing CFG. Tracks per-variable resource lifecycle (`UNINIT`, `OPEN`, `CLOSED`, `MOVED`) via bitset lattice and per-path authentication level (`Unauthed`, `Authed`, `Admin`) as a composable product domain. Detects:
+  - **Use-after-close** (`state-use-after-close`, High) — variable read/written after its resource handle was closed.
+  - **Double-close** (`state-double-close`, Medium) — resource handle closed more than once.
+  - **Must-leak** (`state-resource-leak`, High) — resource acquired but never closed on any exit path.
+  - **May-leak** (`state-resource-leak-possible`, Medium) — resource open on some but not all exit paths (branch-aware via lattice join).
+  - **Unauthenticated access** (`state-unauthed-access`, High) — sensitive sink reached without a preceding auth/admin check.
+- **State analysis architecture** — six-module design:
+  - `lattice.rs` — `Lattice` trait (`bot`, `join`, `leq`) for generic fixed-point computation.
+  - `domain.rs` — `ResourceLifecycle` (bitflag), `ResourceDomainState`, `AuthLevel`, `AuthDomainState`, `ProductState` with lattice impls.
+  - `symbol.rs` — `SymbolInterner` that builds a string-interning table from CFG node defines/uses; `SymbolId` newtype.
+  - `transfer.rs` — `DefaultTransfer` function: maps CFG node kinds (Call, Assignment, If, Return) to state transitions using the existing `ResourcePair` definitions from `cfg_analysis::rules`. Emits `TransferEvent` for illegal transitions.
+  - `engine.rs` — two-phase forward worklist solver: Phase 1 iterates to a fixed point (no events collected to avoid spurious reports from intermediate states); Phase 2 re-applies transfer once over converged states to collect events. Bounded by `MAX_TRACKED_VARS` (64) with guarded degradation.
+  - `facts.rs` — post-analysis pass: extracts `StateFinding`s from transfer events (use-after-close, double-close) and exit-node state inspection (must-leak, may-leak, unauthed access).
+- **`scanner.enable_state_analysis` config option** — opt-in boolean (default `false`) in `ScannerConfig` and `default-nyx.conf`. Requires CFG mode (`full` or `taint`).
+- **`Diag.message` field** — optional human-readable message on diagnostic output. State findings carry variable-specific context (e.g. "variable `f` used after close"). Surfaced in console output (dimmed line below the finding), JSON, and SARIF (`message.text` prefers per-finding message over generic rule description).
+- **State finding dedup** — when state analysis produces findings on a line, overlapping `cfg-resource-leak` and `cfg-auth-gap` findings on the same line are suppressed (state analysis is more precise).
+- **SARIF rule descriptions** for all five state rule IDs.
+- 21 integration tests (`tests/state_tests.rs`) with 19 C fixture files covering: use-after-close, double-close, resource leak, clean usage, opt-in gating, may-leak vs must-leak branch semantics, early return, nested branches, both-branches-close, loop convergence, loop use-after-close, handle overwrite, reopen-after-close, multiple handles, conservative join masking, chain operations, malloc/free pairs, straight-line double-close, and message field population.
+- 30+ unit tests across state modules: lattice properties, lifecycle join/leq, domain merging, auth-level join, product state composition, may/must leak semantics, symbol interning, and transfer event generation.
+- **`--severity ` filter** — replaces `--high-only` with a flexible severity expression supporting single levels (`HIGH`), comma lists (`HIGH,MEDIUM`), and thresholds (`>=MEDIUM`). Parsing is case-insensitive with whitespace tolerance. `SeverityFilter` type with `parse()` and `matches()` in `patterns/mod.rs`.
+- **`--mode `** — replaces `--ast-only` and `--cfg-only` with a single canonical analysis mode flag. Enforces mutual exclusivity via clap `ValueEnum`.
+- **`--index `** — replaces `--no-index` and `--rebuild-index` with a single flag (default `auto`).
+- **`--fail-on `** — CI ergonomics: exit code 1 if any emitted finding meets or exceeds the threshold severity. Example: `--fail-on HIGH`.
+- **`--quiet`** — CLI flag to suppress all human-readable status output (equivalent to `output.quiet = true` in config).
+- **`--keep-nonprod-severity`** — renamed from `--include-nonprod` for clarity; old name kept as hidden alias.
+- **`OutputFormat` enum** — `--format` now uses clap `ValueEnum` with typed `Console`, `Json`, `Sarif` variants (default `Console`). No more empty-string default.
+- 10 new unit tests: `SeverityFilter` parsing (single, comma list, threshold, case-insensitive, whitespace, empty rejection, invalid level rejection), `Severity::from_str` rejection of unknown values, and `severity_filter_applied_at_output_stage` integration test verifying that downgraded findings are correctly filtered.
+- **AST pattern overhaul** -- all 10 language pattern files (`src/patterns/*.rs`) rewritten with consistent conventions, structured metadata, and validated tree-sitter queries.
+  - **Pattern schema extensions** -- `PatternTier` (A = structural, B = heuristic-guarded), `PatternCategory` (13 vulnerability classes), and `Hash` on `Severity`. Module-level docs explain conventions and how to add new patterns.
+  - **Namespaced IDs** -- all pattern IDs follow `..` format (e.g. `java.deser.readobject`, `py.cmdi.os_system`, `js.xss.document_write`).
+  - **New vulnerability coverage** -- 30+ new patterns across languages: Python deserialization (`pickle.loads`, `yaml.load`, `shelve.open`), Python command injection (`os.system`, `os.popen`), Python weak crypto (`hashlib.md5/sha1`), Java reflection (`Method.invoke`), Java weak digest (`MessageDigest.getInstance("MD5")`), Java XSS (`getWriter().println`), Go TLS misconfiguration (`InsecureSkipVerify: true`), Go SQL concat, Go hardcoded secrets, Go gob deserialization, PHP `assert()` code exec, PHP `include $var` path traversal, PHP weak crypto (`md5`/`sha1`/`rand`), C/C++ `popen()`, C/C++ format-string with variable first arg, C++ `const_cast`, Ruby `Digest::MD5`.
+  - **Query fixes** -- fixed 11 broken tree-sitter queries: Java `object_creation_expression` used wrong type node (`identifier` → `type_identifier`), C++ `reinterpret_cast`/`const_cast` used non-existent node types (→ `template_function` match), Ruby backtick used `shell_command` (→ `subshell`), Python SQL used `binary_expression` (→ `binary_operator`), TypeScript `as any` used inaccessible field (→ positional child), PHP patterns missing `argument` wrapper nodes, Rust `unsafe fn` regex used unsupported `\b`.
+  - **No-duplicate rule** -- patterns that overlap with taint sinks use distinct ID namespaces and are documented; dedup in `ast.rs` prevents duplicate findings at the same location.
+  - **Severity recalibration** -- `unwrap`/`expect`/`panic!`/`todo!` moved to Low (filtered by default `min_severity`). Security patterns remain High/Medium.
+- **Pattern test suite** (`tests/pattern_tests.rs`, 26 tests) -- sanity checks (unique IDs, query compilation, non-empty descriptions, naming convention, severity distribution), positive fixture tests (10 languages), and negative fixture tests (10 languages verifying no false positives on safe code).
+- **Pattern test fixtures** -- positive and negative fixture files for all 10 languages under `tests/fixtures/patterns//`.
+- **Real world test suite** — comprehensive fixture-based test suite (`tests/real_world_tests.rs`) with ~180 test fixtures across all 10 supported languages (C, C++, Go, Java, JavaScript, PHP, Python, Ruby, Rust, TypeScript). Each fixture has an `.expect.json` file declaring expected findings (with `must_match` for hard requirements and soft expectations for aspirational coverage). Fixtures are organized by analysis type (`taint/`, `state/`, `cfg/`, `mixed/`) under `tests/fixtures/real_world//`. A single parameterized test runner validates all fixtures in both `full` and `ast` modes, with verbose output via `NYX_TEST_VERBOSE=1`.
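The `--severity` expression forms listed under Added (single level, comma list, `>=` threshold, case-insensitive with whitespace tolerance, empty and unknown input rejected) could be parsed roughly as follows. This is a minimal sketch rather than the code in `patterns/mod.rs`: the `AnyOf`/`AtLeast` variants and the `String` error type are assumptions made for illustration; only the `SeverityFilter`, `parse()`, `matches()`, and `Severity::from_str` names come from the changelog.

```rust
use std::str::FromStr;

/// Severity levels, declared in ascending order so that `Low < Medium < High`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Low,
    Medium,
    High,
}

impl FromStr for Severity {
    type Err = String; // assumption: the real error type is not given in the changelog

    /// Case-insensitive and whitespace-tolerant; unknown values are an error
    /// instead of silently becoming `Low`.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_uppercase().as_str() {
            "LOW" => Ok(Severity::Low),
            "MEDIUM" => Ok(Severity::Medium),
            "HIGH" => Ok(Severity::High),
            other => Err(format!("unknown severity: {other:?}")),
        }
    }
}

/// Hypothetical internal shape: an explicit set of levels or a lower bound.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SeverityFilter {
    AnyOf(Vec<Severity>),
    AtLeast(Severity),
}

impl SeverityFilter {
    /// Accepts `HIGH`, `HIGH,MEDIUM`, or `>=MEDIUM`.
    pub fn parse(expr: &str) -> Result<Self, String> {
        let expr = expr.trim();
        if expr.is_empty() {
            return Err("empty severity expression".to_string());
        }
        if let Some(rest) = expr.strip_prefix(">=") {
            return rest.parse::<Severity>().map(SeverityFilter::AtLeast);
        }
        // Any invalid element makes the whole comma list invalid.
        expr.split(',')
            .map(|part| part.parse::<Severity>())
            .collect::<Result<Vec<_>, _>>()
            .map(SeverityFilter::AnyOf)
    }

    /// True when a finding of severity `sev` passes the filter.
    pub fn matches(&self, sev: Severity) -> bool {
        match self {
            SeverityFilter::AnyOf(levels) => levels.contains(&sev),
            SeverityFilter::AtLeast(min) => sev >= *min,
        }
    }
}
```

Under these assumptions, `SeverityFilter::parse(">= medium")` yields a filter that matches `High` and `Medium` but not `Low`. Applying such a filter once at the output stage, after severity downgrades, is what keeps `--severity HIGH` from emitting downgraded findings.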
+
+
+### Changed
+- **Console header line now includes confidence** — the finding header shows score and confidence together as a parenthesized suffix: `(Score: 36, Confidence: Medium)`. The previous standalone `Confidence: ...` body line is removed. All four combinations are handled (both, score-only, confidence-only, neither).
+- **Confidence display uses Title Case** — `Confidence::Display` now renders as `Low`, `Medium`, `High` (previously lowercase).
+- **Breaking**: Config and data directory changed from `dev.ecpeter23.nyx` to `nyx` (e.g. `~/Library/Application Support/nyx/` on macOS). Existing config files (`nyx.conf`, `nyx.local`) and SQLite indexes at the old path will not be picked up automatically — copy them to the new location or re-run `nyx scan` to regenerate.
+- **Improved diagnostic output formatting** — overhauled console renderer for a professional, security-tool-grade look:
+  - Severity is now the strongest visual anchor: HIGH (bold red with ✖), MEDIUM (bold orange ⚠), LOW (muted blue-gray ●). Fewer colors, clearer hierarchy.
+  - File paths rendered dim blue (never brighter than severity).
+  - Taint flow messages now use `→` arrow between shortened source/sink instead of backtick-wrapped text.
+  - Evidence values (Source, Sink) no longer wrapped in backticks — cleaner rendering with no risk of broken backtick spans across wrapped lines.
+- **Fixed taint expression rendering** — multi-line sink/source call chains are now normalised before display:
+  - Whitespace collapsed (`foo() .bar()` → `foo().bar()`).
+  - Newlines joined into single-line canonical form.
+  - Spacing artefacts between `)` and `.` in method chains cleaned up.
+  - Long chains truncated with `…` ellipsis.
+- Added `terminal_size` dependency for terminal-width-aware line wrapping.
+- **Monotone forward dataflow taint analysis** — replaced the BFS taint engine in `taint/mod.rs` with a proper worklist-based forward dataflow analysis where termination is guaranteed by lattice finiteness. The generic `Transfer` trait in `state/engine.rs` now powers both the resource lifecycle/auth analysis and taint analysis.
+  - **`TaintState` lattice** (`taint/domain.rs`) — bounded abstract state with per-variable `VarTaint` (Cap bitflags + multi-origin tracking via `SmallVec<[TaintOrigin; 2]>`), dual validation bitsets (`validated_must` for intersection/all-paths, `validated_may` for union/any-path), and monotone `PredicateSummary` for contradiction pruning. Variables stored in sorted `SmallVec` keyed by `SymbolId` for O(n) merge-join. Lattice height bounded at ~8700 (7-bit Cap × 64 vars + validation bits + predicate bits).
+  - **`TaintTransfer`** (`taint/transfer.rs`) — implements `Transfer` with identical taint logic to the old BFS (source → propagation → sanitization → sink check). Callee resolution unchanged (local → global same-lang → interop edges). Emits `TaintEvent::SinkReached` events during Phase 2 of the engine.
+  - **JS/TS two-level solve** — prevents cross-function taint leakage (the main source of state explosion in the old BFS) while preserving global-to-function flows. Level 1 solves top-level code; Level 2 solves each function seeded with read-only top-level taint via `global_seed`.
+  - **Monotone predicate tracking** — path-sensitivity predicates moved from per-BFS-item `PathState` (which duplicated state exponentially) to monotone `PredicateSummary` in the lattice. Contradiction pruning uses `known_true & known_false` bit intersection (NullCheck/EmptyCheck/ErrorCheck only), which is both more precise and guaranteed monotone.
+  - **Multi-origin tracking** — each tainted variable tracks up to 4 `TaintOrigin` (node + `SourceKind`), enabling multiple findings when distinct sources flow to the same sink.
+  - **Guaranteed termination** — no more `MAX_BFS_ITERATIONS`/`MAX_SEEN_STATES` safety nets needed (though a 100K worklist iteration budget remains as defense-in-depth). Convergence follows from finite lattice height × finite CFG edges.
+  - **`analyse_file()` signature unchanged** — `Finding` struct, `Diag` conversion, and all callers are unaffected.
+- **Generic dataflow engine** (`state/engine.rs`) — `run_forward()` and `DataflowResult` are now generic over any `S: Lattice` + `T: Transfer`. `DefaultTransfer` (resource lifecycle) implements `Transfer`; `TaintTransfer` implements `Transfer`. Per-domain iteration budget and `on_budget_exceeded` hooks added.
+- **`path_state.rs` simplified** — removed `PathState`, `Predicate`, `MAX_PATH_PREDICATES`, `state_hash()`, `priority()` structs/methods. Kept `PredicateKind` enum and `classify_condition()` function (used by the new transfer for predicate classification).
+- **Removed BFS infrastructure** — `taint_hash()`, BFS `Item` struct, `pred` predecessor map, two-tier seen-state map, and all bail-out constants (`MAX_BFS_ITERATIONS=200K`, `MAX_SEEN_STATES=100K`, `PATH_SENSITIVITY_NODE_LIMIT=500`, `PATH_SENSITIVITY_QUEUE_LIMIT=10K`, `MAX_PATH_VARIANTS_PER_KEY=4`) are no longer needed and have been removed.
+- **Severity filtering applied at output stage** — `--severity` (and legacy `--high-only`) filtering is now applied ONCE in `scan::handle()` after all severity normalization (nonprod downgrades, dedup, truncation). Previously `--high-only` only filtered AST patterns during analysis; taint and CFG findings bypassed the filter entirely.
+- **`--format` default is `console`** — previously defaulted to empty string, requiring fallback logic.
+- **All status/progress output goes to stderr** — "Checking...", "Finished in...", config notes, and progress bars now use `eprintln!`/stderr exclusively. JSON and SARIF output is stdout-only.
+- **`Severity::from_str` returns `Err` for unknown values** — previously returned `Ok(Severity::Low)` for any unrecognized input.
+- **Deprecated CLI flags preserved as hidden aliases** — `--high-only`, `--no-index`, `--rebuild-index`, `--ast-only`, `--cfg-only`, and `--include-nonprod` are hidden from help but still functional, mapping to their canonical replacements.
+- **Path-sensitive taint analysis** -- the BFS taint engine now carries a `PathState` (bounded set of branch predicates) alongside the taint map. When the BFS traverses a True or False edge from an `If` node, it records a `Predicate` with the condition's variables, kind, and polarity. This enables two new capabilities:
+  - **Infeasible path pruning** -- paths with contradictory predicates (e.g. `if x.is_none() { return; } if x.is_none() { sink }`) are detected and pruned, eliminating false positives on code guarded by redundant null/empty/error checks. Contradiction detection is conservative: only whitelisted kinds (`NullCheck`, `EmptyCheck`, `ErrorCheck`) with single-variable predicates are pruned.
+  - **Validation guard annotation** -- when all tainted variables reaching a sink are guarded by a `ValidationCall` predicate (e.g. `if validate(&x) { sink }` or `if !validate(&x) { return; } sink`), the finding is annotated with `path_validated: true` and `guard_kind: ValidationCall`. This metadata is surfaced in JSON and console output without changing severity.
+- **Condition metadata on CFG nodes** -- `NodeInfo` now carries `condition_text`, `condition_vars`, and `condition_negated` for `If` nodes, extracted during CFG construction. Negation detection handles `!expr`, `not expr`, and Ruby `unless`. Classification of condition text into `PredicateKind` (NullCheck, EmptyCheck, ErrorCheck, ValidationCall, SanitizerCall, Comparison, Unknown) is conservative: call-based kinds require `(` in the text and a matching callee token.
+- **`path_validated` and `guard_kind` fields on `Diag`** -- taint findings carry path-sensitivity metadata in JSON output (fields omitted when not set) and console output (suffix line `Path guard: ValidationCall` when present). Finding IDs are unchanged for dedup stability.
+- **`smallvec` dependency** -- used for inline-allocated predicate storage in `PathState` (avoids heap allocation for the common case of ≤4 predicates per path).
+- **Interprocedural call graph** -- a whole-program `CallGraph` (`petgraph::DiGraph`) is now built between Pass 1 and Pass 2 of every taint-enabled scan. Each function definition is a node; resolved callee relationships are edges. The graph is constructed from the merged `GlobalSummaries` and is available in both the filesystem and indexed scan paths.
+- **Three-valued callee resolution** -- `CalleeResolution` enum distinguishes `Resolved(FuncKey)`, `NotFound`, and `Ambiguous(Vec)`. Ambiguous callees (same name in multiple namespaces, caller in a third namespace) are tracked separately from missing callees for diagnostics.
+- **Shared resolution helper** -- `GlobalSummaries::resolve_callee_key()` centralizes same-language callee resolution with arity-aware filtering and namespace disambiguation. Both the call graph builder and the taint engine now use the same resolution logic.
+- **Callee-name normalization** -- `normalize_callee_name()` extracts the last segment from qualified callee text (`"env::var"` → `"var"`, `"obj.method"` → `"method"`) before resolution. The raw call-site text is preserved on graph edges for diagnostics.
+- **SCC / topological analysis** -- `CallGraphAnalysis` computes strongly connected components via Tarjan's algorithm and exposes a callee-first (leaves-first) topological ordering of SCC indices, ready for future bottom-up taint propagation.
+- **Call graph tracing** -- `tracing::info!` log with node count, edge count, unresolved-not-found count, unresolved-ambiguous count, and SCC count is emitted after every call graph build.
+- 8 new path-sensitivity integration tests: early-return validation guard, failed-validation branch, contradictory null-check pruning, if/else validation annotation, sanitize-one-branch regression, path-state budget graceful degradation, unknown-predicate non-pruning, multi-var non-pruning.
+- 35 new unit tests in `taint::path_state`: classify_condition variants, PathState push/truncation, contradiction detection (whitelisted kinds, single-var only), has_validation_for semantics, state_hash determinism, priority ordering.
+- 11 new unit tests: callee normalization, same-name-different-namespaces resolution, cross-language isolation, arity separation, recursive SCC detection, not-found vs ambiguous diagnostics, diamond topo ordering, interop edge resolution, namespace normalization consistency, and raw call-site preservation.
+- **Edge-aware taint traversal** -- `analyse_file()` now uses `cfg.edges(node)` instead of `cfg.neighbors(node)`, inspecting `EdgeKind` on each edge. This is required for predicate recording but also makes the taint engine aware of the CFG's branch structure for the first time.
+- **Two-tier seen-state deduplication** -- the BFS seen-state map changed from `HashSet<(NodeIndex, u64)>` to a `HashMap` keyed by `(NodeIndex, taint_hash)` mapping to a bounded list of `(path_hash, priority)` pairs. At most `MAX_PATH_VARIANTS_PER_KEY` (4) path variants are tracked per taint state, with deterministic eviction preferring non-truncated states with fewer predicates.
+- **Finding deduplication** -- taint findings are now deduplicated by `(sink, source)` pair after analysis, preferring findings with `path_validated = true` (most informative metadata).
+- **`taint::Finding` struct** -- added `path_validated: bool` and `guard_kind: Option` fields. Code that constructs `Finding` directly must include these fields.
+- **`Diag` struct** -- added `path_validated: bool` and `guard_kind: Option` fields. Both use `#[serde(skip_serializing_if)]` to omit from JSON when not set.
+- **`taint::resolve_callee()` refactored** -- the global resolution step now delegates to `GlobalSummaries::resolve_callee_key()` and applies `normalize_callee_name()` before lookup, unifying resolution logic with the call graph builder.
+- **Label rules expanded across 8 languages:**
+  - **Go** — added `r.URL.Query`, `r.URL.Query.Get`, `Request.FormValue`, `Request.URL` sources; `filepath.Clean`/`filepath.Base` sanitizers; `fmt.Fprintf`/`fmt.Sprintf`/`fmt.Printf` format-string sinks; `os.Open`/`os.OpenFile`/`os.Create`/`ioutil.ReadFile`/`os.ReadFile` FILE_IO sinks; `template.HTML` HTML sink; `db.QueryRow`/`db.Prepare` SQL sinks.
+  - **PHP** — sources now match both `$_GET` and `_GET` (without `$` prefix, matching collect_idents stripping); added `$_FILES`/`_FILES`, `$_SERVER`/`_SERVER`, `$_ENV`/`_ENV` sources; `eval`/`assert` shell sinks; `include`/`include_once`/`require`/`require_once` FILE_IO sinks; `unserialize` sink; `move_uploaded_file`/`copy`/`file_put_contents`/`fwrite` FILE_IO sinks; `basename` FILE_IO sanitizer; `query` SQL sink.
+  - **Java** — added `readObject`/`readLine` sources; `ProcessBuilder` shell sink; `Class.forName` reflection sink; `println`/`print`/`write` HTML sinks.
+  - **Python** — added `send_file`/`send_from_directory` FILE_IO sinks; `os.path.realpath` FILE_IO sanitizer; `open` changed from source to FILE_IO sink (fixes source/sink conflict for path traversal detection).
+  - **Ruby** — `params` source detection now works via subscript handling.
+  - **Rust** — added `fs::read_to_string`/`fs::write`/`fs::read`/`File::open`/`File::create` as FILE_IO sinks; `fs::read_to_string` removed from sources (was source/sink conflict).
+  - **C/C++** — added `fopen`/`open` as FILE_IO sinks.
+- **Ruby `rb.cmdi.system_interp` pattern broadened** — no longer requires string interpolation in arguments; now matches any `system`/`exec` call, promoted from Tier B to Tier A.
+- **C++ `cpp.cmdi.popen` pattern added** — `popen()` command execution detection for C++, using the language-namespaced ID (the C pattern retains `c.cmdi.popen`).
+- **Test config enables state analysis** — `test_config()` now sets `enable_state_analysis = true`.
+
+
+### Fixed
+- **Taint source kind misclassified as "unknown" for non-call sources** — source-bearing nodes with `CallWrapper` or `Assignment` kind (e.g. `userInput = req.query.data`) had their `callee` field set to `None` because the CFG builder only populated `callee` for `StmtKind::Call` nodes. This caused `infer_source_kind()` to receive an empty string, failing to match any keyword pattern and defaulting to `SourceKind::Unknown`. Fixed by also setting `callee` when a label (Source/Sink/Sanitizer) is detected, so the extracted member text (e.g. "req.query") flows through to source kind inference. Affects severity classification and diagnostic output for property-access sources across all languages.
+- **Full KINDS map audit across all 10 languages** — 89 missing tree-sitter node types added to KINDS maps so the CFG builder no longer silently drops code inside switch/case, try/catch/finally, class bodies, closures/lambdas, and other container nodes. Previously, any node not in a language's KINDS map hit the `build_sub` fallback which created a terminal Seq node without recursing into children, effectively making all wrapped code invisible to analysis.
+ - **C** (+3): `switch_statement`, `case_statement`, `labeled_statement` + - **C++** (+7, 1 fix): `switch_statement`, `case_statement`, `labeled_statement`, `throw_statement` (Return), `try_statement`, `catch_clause`, `lambda_expression`; **critical fix**: `namespace_definition` changed from `Trivia` to `Block` (all function definitions inside namespaces were silently dropped) + - **Java** (+11): `do_statement` (While), `throw_statement` (Return), `switch_expression`, `switch_block`, `switch_block_statement_group`, `try_statement`, `catch_clause`, `finally_clause`, `lambda_expression`, `constructor_body`, `static_initializer` + - **JavaScript** (+11): `switch_statement`, `switch_body`, `switch_case`, `switch_default`, `try_statement`, `catch_clause`, `finally_clause`, `class_declaration`, `class` (expression), `class_body`, `export_statement` + - **TypeScript** (+13): all JS switch/try/class entries plus `abstract_class_declaration`, `export_statement`, `enum_declaration` (Trivia) + - **PHP** (+11): `do_statement` (While), `throw_expression` (Return), `switch_statement`, `switch_block`, `case_statement`, `default_statement`, `try_statement`, `catch_clause`, `finally_clause`, `colon_block`, `class_declaration` + - **Python** (+7): `try_statement`, `except_clause`, `finally_clause`, `class_definition`, `decorated_definition`, `match_statement`, `case_clause` + - **Ruby** (+11): `until` (While), `begin`, `rescue`, `ensure`, `case`, `when`, `class`, `module`, `singleton_method` (Function), `do`, `block` + - **Go** (+10): `expression_switch_statement`, `type_switch_statement`, `expression_case`, `type_case`, `default_case`, `select_statement`, `communication_case`, `go_statement`, `defer_statement`, `func_literal` (Function) + - **Rust** (+5, 1 removal): `closure_expression`, `async_block`, `impl_item`, `trait_item`, `declaration_list`; removed dead `loop_statement` entry (node doesn't exist in tree-sitter-rust 0.24.0) +- Removed unused `Kind::LoopBody` enum variant 
from `labels/mod.rs` (no arm in `build_sub`, last reference was the dead Rust `loop_statement` entry) +- **CFG: `else_clause` not recursed into for C/C++** — tree-sitter's C and C++ grammars wrap else bodies in an `else_clause` node. This node was missing from both languages' `KINDS` maps, so the CFG builder's fallback arm treated it as a terminal `Seq` node without descending into children. All statements inside else blocks (e.g. `fclose(f)`) were silently dropped from the CFG, causing false-positive resource leak and incorrect branch analysis. Fixed by mapping `"else_clause" => Kind::Block` in `src/labels/c.rs` and `src/labels/cpp.rs`. +- **CFG: `else_clause` missing from Rust, JavaScript, TypeScript, Python, PHP KINDS maps** — same bug class as C/C++: tree-sitter wraps else bodies in an `else_clause` node that was not in KINDS, silently dropping all code inside else blocks from the CFG. Fixed by mapping `"else_clause" => Kind::Block` in all five languages. Also added `"elif_clause" => Kind::Block` (Python), `"else_if_clause" => Kind::Block` (PHP), and `"elsif" => Kind::If` (Ruby) to handle chained elif/elsif nodes. +- **Rust KINDS using wrong tree-sitter node names** — tree-sitter-rust uses `_expression` suffixes (not `_statement`) for `while`, `for`, and `return` nodes. The existing `while_statement`, `for_statement`, and `return_statement` entries were dead code (0 grammar matches). Added `while_expression`, `for_expression`, and `return_expression` mappings. +- **Rust `match_expression`, `match_block`, `match_arm`, `unsafe_block` missing from KINDS** — these wrapper nodes were not mapped, causing all code inside match arms and unsafe blocks to be silently dropped from the CFG. Mapped to `Kind::Block` for sequential traversal. +- **TypeScript missing `throw_statement` and `do_statement`** — `throw` was mapped in JavaScript but not TypeScript; `do_statement` (do-while loops) was missing from both JS and TS. 
Added `"throw_statement" => Kind::Return` and `"do_statement" => Kind::While` to both languages. +- **Python `raise_statement` and `with_statement` missing from KINDS** — `raise` terminates the current path (mapped to `Kind::Return`); `with` wraps code in a context manager (mapped to `Kind::Block`). Both were silently dropping enclosed code. +- **Dead KINDS entries removed** — `"for_of_statement"` in TypeScript (0 grammar matches; TS inherits `for_in_statement` from JS) and `"method_call"` in Ruby (0 grammar matches; Ruby only has `call`). +- **`--high-only` emitting Low/Medium taint and CFG findings** — severity filter was only applied to AST pattern queries during analysis. Taint findings (whose severity derives from `SourceKind`) and CFG structural findings passed through unfiltered. The filter is now applied at the final output stage after all severity normalization, ensuring `--severity HIGH` never emits downgraded Medium/Low findings. +- **JSON/SARIF output contaminated with status messages on stdout** — status messages ("Checking...", "Finished in...") used `println!` and appeared in stdout alongside machine output. Now all status goes to stderr. +- **CFG: False edge to then-block exits in no-else if statements** -- previously, `if (cond) { body }` without an else block created a `False` edge from the condition node directly to the then-block's exit nodes. This made the false path appear to traverse the then-block, causing incorrect predicate polarity in path-sensitive analysis and duplicate taint findings with contradictory metadata. The CFG now creates a synthetic pass-through `Seq` node for the false path with an explicit `False` edge from the condition, correctly modeling "skip the then-block." This also fixes the frontier: previously, the no-else non-terminating case duplicated `then_exits` in the frontier (`then_exits ++ then_exits.clone()`); it now correctly produces `then_exits ∪ [pass_through]`. 
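The no-else `if` fix described in the entry above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Nyx's actual CFG builder: `Cfg`, `NodeKind`, `EdgeKind`, and `wire_if_without_else` are hypothetical names invented for the example. It shows only the two points the entry makes: the `False` edge targets a synthetic pass-through node rather than the then-block's exits, and the frontier is `then_exits` plus that pass-through node.

```rust
// Hypothetical, simplified CFG model; the type and function names here are
// illustrative assumptions and do not mirror Nyx's real data structures.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EdgeKind {
    True,
    False,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeKind {
    If,
    Seq,
}

#[derive(Debug, Default)]
struct Cfg {
    nodes: Vec<NodeKind>,
    edges: Vec<(usize, usize, EdgeKind)>,
}

impl Cfg {
    /// Adds a node and returns its index.
    fn add_node(&mut self, kind: NodeKind) -> usize {
        self.nodes.push(kind);
        self.nodes.len() - 1
    }

    /// Adds a directed, labeled edge.
    fn add_edge(&mut self, from: usize, to: usize, kind: EdgeKind) {
        self.edges.push((from, to, kind));
    }
}

/// Wires `if (cond) { then }` with no else branch and returns the frontier,
/// i.e. the nodes from which the statement after the `if` is reached.
fn wire_if_without_else(
    cfg: &mut Cfg,
    cond: usize,
    then_entry: usize,
    then_exits: &[usize],
) -> Vec<usize> {
    // True path: condition -> then-block entry.
    cfg.add_edge(cond, then_entry, EdgeKind::True);

    // False path: condition -> synthetic pass-through node, so the false
    // path never appears to traverse the then-block.
    let pass_through = cfg.add_node(NodeKind::Seq);
    cfg.add_edge(cond, pass_through, EdgeKind::False);

    // Frontier is then_exits plus the pass-through node, with no duplicates
    // of then_exits.
    let mut frontier = then_exits.to_vec();
    frontier.push(pass_through);
    frontier
}

fn main() {
    let mut cfg = Cfg::default();
    let cond = cfg.add_node(NodeKind::If);
    let body = cfg.add_node(NodeKind::Seq);
    let frontier = wire_if_without_else(&mut cfg, cond, body, &[body]);
    println!("frontier = {:?}", frontier);
    println!("edges = {:?}", cfg.edges);
}
```

In this toy model a path-sensitive pass that follows the `False` edge lands on the pass-through node only, which is why the predicate polarity recorded for the then-block stays unambiguous.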
+- **Taint BFS non-termination on large JS files** — the BFS taint engine in `taint/mod.rs` had no global iteration bound. The seen-state deduplication keyed on `(node, taint_hash)`, so every distinct taint map at a CFG node was treated as a novel state. In files with loops and many tainted variables (e.g. a 2,200-line JS file with 18+ top-level variables tainted via `window.location.search`), each loop iteration produced a slightly different taint map, causing the BFS to revisit loop bodies indefinitely. Both `--no-index` and `--rebuild-index` scans hung near completion (progress showed e.g. 87/88 files). Fixed by adding two hard bounds: `MAX_BFS_ITERATIONS` (200,000 queue pops) and `MAX_SEEN_STATES` (100,000 unique `(node, taint_hash)` entries in the seen-state map). When either limit is reached the analysis bails out gracefully and returns all findings collected so far. A `tracing::warn!` is emitted on iteration-limit bail-out. Normal files are unaffected (typical BFS uses <1,000 iterations). +- **Rust `if let` / `while let` taint propagation** — the CFG builder now extracts pattern bindings from `let_condition` nodes as variable definitions in `def_use()`, and classifies the value expression (e.g. `env::var("CMD")`) for source/sink labels in `push_node()`. Previously, `if let Ok(cmd) = env::var("CMD") { Command::new("sh").arg(&cmd) }` produced no taint finding because `cmd` was never recognized as a tainted definition. Now correctly detects taint flow through `if let` and `while let` bindings. +- **C++ `popen` pattern ID collision** — renamed `c.cmdi.popen` to `cpp.cmdi.popen` in C++ patterns to fix a cross-language duplicate ID that caused `all_pattern_ids_are_globally_unique` test failure. +- **State analysis early-return leak duplication** — `extract_findings` in `state/facts.rs` now skips early-return nodes when checking for resource leaks, only inspecting the synthesized function exit node. 
Previously, early-return nodes with path-specific state (OPEN only) emitted `state-resource-leak` alongside the correct `state-resource-leak-possible` from the merged exit state. +- **Severity filter bug** — `min_severity` comparison in `ast.rs` was inverted (`<=` instead of `>`), causing all AST patterns at the minimum severity level to be silently dropped. With the default `min_severity = Low`, all Low-severity patterns (`.unwrap()`, `.expect()`, `panic!`, `todo!`, `mem::forget`, Go crypto patterns, narrow casts) were never reported. Fixed 29 test cases. +- **Nested function analysis** — CFG builder now recurses into function expressions passed as call arguments (e.g., Express `app.get('/path', function(req, res) { ... })`, Sinatra `get '/path' do...end`). Added `collect_nested_function_nodes()` to discover `Kind::Function` nodes inside `CallWrapper`/`CallFn` AST subtrees. Also added `function_expression` to JS/TS KINDS maps, and `do_block`/`block` as `Kind::Function` in Ruby for Sinatra/Rails blocks. Anonymous functions now get unique names (``) to prevent scope collisions in JS two-level taint solve. +- **Chained method call classification** — `classify()` now normalizes chained calls like `r.URL.Query().Get` by stripping internal `()` between `.` segments, producing `r.URL.Query.Get`. Suffix matching is attempted against both the original head and the normalized form, fixing Go HTTP handler source detection and similar patterns. +- **Subscript access source detection** — `first_member_label` and `first_member_text` now handle `subscript_expression`, `subscript`, and `element_reference` nodes, enabling source classification for PHP `$_GET['cmd']`, Ruby `params[:cmd]`, and Python `os.environ['KEY']`. +- **Return-statement call extraction** — `Kind::Return` added to the node types that extract inner call identifiers via `first_call_ident`, fixing cases like `return send_file(path)` where the sink was not classified. 
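The chained method call normalization mentioned in the entries above might look something like the sketch below. It is an assumption-laden illustration, not the real `classify()` code: `normalize_chained_call` and `matches_suffix` are hypothetical helper names, and the sketch only strips empty `()` that sit directly before a `.` segment, then tries a case-insensitive suffix match against both the original and the normalized text.

```rust
/// Hypothetical helper: strips empty `()` that appear between `.` segments,
/// so `r.URL.Query().Get` becomes `r.URL.Query.Get`. Trailing parentheses
/// (not followed by `.`) are left untouched.
fn normalize_chained_call(head: &str) -> String {
    head.replace("().", ".")
}

/// Hypothetical helper: case-insensitive suffix match, attempted against
/// both the original call text and its normalized form.
fn matches_suffix(head: &str, matcher: &str) -> bool {
    let normalized = normalize_chained_call(head);
    let needle = matcher.to_ascii_lowercase();
    [head, normalized.as_str()]
        .iter()
        .any(|candidate| candidate.to_ascii_lowercase().ends_with(&needle))
}

fn main() {
    let call = "r.URL.Query().Get";
    println!("normalized: {}", normalize_chained_call(call));
    println!("matches URL.Query.Get: {}", matches_suffix(call, "URL.Query.Get"));
}
```

Matching against both forms keeps existing matchers that already include `()` working, while allowing matchers written in the plain dotted form to hit chained calls.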
+- **Nested call classification** — new `find_classifiable_inner_call()` tries all nested calls when the outermost one doesn't classify, fixing `str(eval(expr))` where `eval` is a sink wrapped in a non-sink call. +- **Java `new` expression text extraction** — added `type` field fallback in `push_node` and `first_call_ident` for `CallFn` nodes, fixing `new ProcessBuilder(...)` not matching as a sink. +- **Function body lookup for anonymous functions** — `Kind::Function` handler now falls back to finding a `Kind::Block` child when `child_by_field_name("body")` returns None, supporting JS/TS anonymous function expressions and Ruby blocks. +- **Function-level resource leak detection** — `extract_findings` in `state/facts.rs` now inspects per-function Return nodes for leaked resources, not just the file-level Exit node. Previously, variables from one function could be overwritten by same-named variables in subsequent functions, masking leaks. +- **Use-after-free for memory functions** — added `strcpy`, `strncpy`, `memcpy`, `memmove`, `memset`, `memcmp`, `strcmp`, `strncmp`, `strlen`, `sprintf`, `snprintf` to `RESOURCE_USE_PATTERNS` in state analysis, enabling use-after-free detection for common C/C++ string and memory functions. + ## [0.3.0] - 2026-02-25 ### Added diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md index 3191af89..9baa64c7 100644 --- a/CODE_OF_CONDUCT.md +++ b/CODE_OF_CONDUCT.md @@ -61,7 +61,7 @@ representative at an online or offline event. Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at -**opening a private issue** at [https://github.com/ecpeter23/nyx/issues/new/choose](). +**opening a private issue** at [https://github.com/elicpeter/nyx/issues/new/choose](). All complaints will be reviewed and investigated promptly and fairly. 
All community leaders are obligated to respect the privacy and security of the diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 527dd6da..0c6b2cf0 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,142 +1,351 @@ # Contributing to Nyx -First off, **thank you for taking the time to contribute!** By participating in this project, you agree to abide by the community values and expectations described in our [Code of Conduct](CODE_OF_CONDUCT.md). +Thank you for your interest in improving Nyx. This guide covers everything you need to contribute effectively. -Nyx is dual‑licensed under **MIT** and **Apache‑2.0**. By submitting code, documentation, or any other material, you agree to license your contribution under these same terms. +Please read our [Code of Conduct](CODE_OF_CONDUCT.md) before participating. --- ## Table of Contents -1. [Getting Started](#getting-started) -2. [How to Contribute](#how-to-contribute) - - * [Bug Reports](#bug-reports) - * [Feature Requests](#feature-requests) - * [Pull Requests](#pull-requests) -3. [Development Workflow](#development-workflow) -4. [Commit & Branching Conventions](#commit--branching-conventions) -5. [Style Guide](#style-guide) -6. [Security Policy](#security-policy) -7. [Community Standards](#community-standards) +1. [Development Setup](#development-setup) +2. [Project Layout](#project-layout) +3. [How to Add a New AST Pattern](#how-to-add-a-new-ast-pattern) +4. [How to Add a New Taint Rule](#how-to-add-a-new-taint-rule) +5. [How to Add a New Language](#how-to-add-a-new-language) +6. [Testing](#testing) +7. [Pull Request Guidelines](#pull-request-guidelines) +8. [Bug Reports](#bug-reports) +9. [Feature Requests](#feature-requests) +10. 
[Release Process](#release-process) --- -## Getting Started +## Development Setup -Clone the repository and build Nyx in release mode: +### Prerequisites + +- **Rust 1.85+** (edition 2024) +- Git + +### Building ```bash -git clone https://github.com//nyx.git +git clone https://github.com/elicpeter/nyx.git cd nyx -cargo build --release + +cargo build # Debug build +cargo build --release # Release build +cargo install --path . # Install as `nyx` binary ``` -Run the test‑suite: +### Running Quality Checks ```bash -cargo test +cargo test --bin nyx # Unit tests (inline in modules) +cargo clippy --all -- -D warnings # Lint — treats warnings as errors +cargo fmt # Format code +cargo fmt -- --check # Check formatting without modifying ``` -> **Tip**: The first build downloads and compiles several `tree‑sitter` grammars. Later builds will be faster. +> **Note**: The first build downloads and compiles tree-sitter grammars for all 10 languages. Subsequent builds are faster. + +### Benchmarks + +```bash +cargo bench --bench scan_bench +``` + +Benchmark fixtures live in `benches/fixtures/`. Criterion produces HTML reports in `target/criterion/`. --- -## How to Contribute +## Project Layout -### Bug Reports - -* Search existing [issues](https://github.com//nyx/issues) to ensure the bug has not already been reported. -* Include **steps to reproduce**, expected vs. actual behaviour, and your environment details (`nyx --version`, `rustc --version`). -* Attach a minimal code sample if possible. - -### Feature Requests - -We welcome well‑motivated feature proposals. Please describe: - -1. **Problem statement** – what pain point does this solve? -2. **Proposed solution** – high‑level description, optionally with pseudo‑code. -3. **Alternatives considered** – why existing functionality is not enough. - -### Pull Requests - -Every PR should: - -1. Target the `main` branch. -2. Contain a single, focused change (small orthogonal fixes are okay). -3. 
Pass `cargo test`, `cargo fmt --check`, and `cargo clippy -- -D warnings`. -4. Update documentation and, when relevant, add tests. -5. Reference related issue numbers in the description (`Fixes #123`). - -A reviewer will provide feedback within **3 business days**. Squash‑merge is the default strategy; maintainers may edit commit messages for clarity. +``` +src/ + main.rs CLI entry point + lib.rs Library re-exports (benchmarks, integration tests) + cli.rs Clap command definitions + commands/ + mod.rs Command dispatch + scan.rs Two-pass scan orchestration, Diag struct + ast.rs Entry points for both passes; tree-sitter parsing + cfg.rs CFG construction from AST + cfg_analysis/ CFG structural detectors + guards.rs Unguarded sink detection (dominator analysis) + auth.rs Auth gap detection + resources.rs Resource leak detection + error_handling.rs Error fallthrough detection + unreachable.rs Unreachable security code detection + rules.rs Guard rules, auth rules, resource pairs + taint/ + mod.rs Taint analysis facade + JS two-level solve + domain.rs TaintState lattice (VarTaint, Cap, TaintOrigin) + transfer.rs TaintTransfer function (source/sanitizer/sink/call) + path_state.rs Predicate tracking and contradiction pruning + state/ + engine.rs Generic monotone dataflow engine (Transfer) + transfer.rs DefaultTransfer — resource lifecycle + auth state + summary.rs FuncSummary, GlobalSummaries, conservative merge + labels/ Per-language label rules + mod.rs classify() dispatch, Cap bitflags, DataLabel, LabelRule + rust.rs Rust sources, sinks, sanitizers + javascript.rs JS sources, sinks, sanitizers + ... (one file per language) + patterns/ Per-language AST pattern queries + mod.rs Pattern struct, Severity, SeverityFilter, registry + rust.rs Rust patterns + javascript.rs JS patterns + ... 
(one file per language) + callgraph.rs Call graph construction (petgraph), SCC, topo sort + database.rs SQLite indexing via r2d2 pool + rank.rs Attack-surface ranking + fmt.rs Output formatting and evidence normalization + output.rs SARIF 2.1 builder + walk.rs Parallel file walker (ignore crate, respects .gitignore) + symbol.rs Symbol interning (SymbolId) + interop.rs Cross-language interop edges + errors.rs NyxError, NyxResult types + utils/ + config.rs TOML config loading, merging, Config struct +``` --- -## Development Workflow +## How to Add a New AST Pattern -1. **Fork** the repo and create your feature branch: +AST patterns are the simplest detector to add. Each pattern is a tree-sitter query that matches a structural code construct. - ```bash - git checkout -b feature/my‑feature +### Step-by-step + +1. **Pick the language file** under `src/patterns/.rs`. + +2. **Choose the metadata**: + + | Field | Options | Guidelines | + |-------|---------|------------| + | **ID** | `..` | e.g. `py.cmdi.os_popen` | + | **Tier** | `A` or `B` | `A` = presence alone is high-signal; `B` = query includes a heuristic guard | + | **Severity** | `High`, `Medium`, `Low` | High: command exec, deser, banned functions. Medium: SQL concat, reflection, XSS. Low: weak crypto, code quality. | + | **Category** | See `PatternCategory` enum | `CommandExec`, `CodeExec`, `Deserialization`, `SqlInjection`, `PathTraversal`, `Xss`, `Crypto`, `Secrets`, `InsecureTransport`, `Reflection`, `MemorySafety`, `Prototype`, `CodeQuality` | + +3. **Write the tree-sitter query**: + + ```rust + Pattern { + id: "py.cmdi.os_popen", + description: "os.popen() — shell command execution", + query: r#"(call + function: (attribute + object: (identifier) @pkg (#eq? @pkg "os") + attribute: (identifier) @fn (#eq? @fn "popen"))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CommandExec, + }, ``` -2. Make your changes, then run: + The query **must** capture a `@vuln` node. 
That node's span determines the reported location. + +4. **Test it**: ```bash - cargo fmt - cargo clippy --all-targets --all-features -- -D warnings - cargo test + cargo test --bin nyx ``` -3. **Sign‑off** your commits if your employer requires a Developer Certificate of Origin (DCO): +5. **Update docs**: Add the new rule to `docs/rules/.md`. +### Tips + +- Use the [tree-sitter playground](https://tree-sitter.github.io/tree-sitter/playground) to develop and test queries. +- Avoid duplicating taint coverage. If the same function is already a labeled sink in `src/labels/.rs`, the AST pattern is still useful for `--mode ast`, but use a distinct ID namespace. The dedup pass prevents exact-duplicate findings at the same location. +- Test with real-world code to check false positive rates before choosing a tier. + +--- + +## How to Add a New Taint Rule + +Taint rules define sources (where untrusted data enters), sinks (where dangerous operations happen), and sanitizers (where data is made safe). + +### Step-by-step + +1. **Open the language file** in `src/labels/.rs`. + +2. **Add an entry** to the `RULES` slice: + + ```rust + LabelRule { + matchers: &["dangerouslySetInnerHTML"], + label: DataLabel::Sink(Cap::HTML_ESCAPE), + }, + ``` + +3. **Choose the right label type**: + + | Type | Purpose | Example | + |------|---------|---------| + | `DataLabel::Source(cap)` | Introduces tainted data | `env::var`, `req.body` | + | `DataLabel::Sanitizer(cap)` | Strips matching capability bits | `html_escape`, `encodeURIComponent` | + | `DataLabel::Sink(cap)` | Dangerous operation requiring sanitization | `eval`, `innerHTML`, `Command::new` | + +4. 
**Choose capabilities**: + + | Capability | When to use | + |-----------|-------------| + | `Cap::all()` | Sources that produce universally dangerous data | + | `Cap::SHELL_ESCAPE` | Shell command injection sinks/sanitizers | + | `Cap::HTML_ESCAPE` | XSS sinks/sanitizers | + | `Cap::URL_ENCODE` | URL injection sinks/sanitizers | + | `Cap::JSON_PARSE` | JSON parsing sanitizers | + | `Cap::FILE_IO` | File I/O sinks | + | `Cap::FMT_STRING` | Format string sinks | + | `Cap::ENV_VAR` | Environment/config data sources | + +5. **Matcher semantics**: + - Case-insensitive suffix matching by default. + - If a matcher ends with `_`, it acts as a prefix match. + - Multiple matchers in one rule are alternatives (any match triggers the rule). + +### User-defined rules (no code change needed) + +Users can add taint rules via config: + +```toml +[[analysis.languages.javascript.rules]] +matchers = ["dangerouslySetInnerHTML"] +kind = "sink" +cap = "html_escape" +``` + +Or via CLI: + +```bash +nyx config add-rule --lang javascript --matcher dangerouslySetInnerHTML --kind sink --cap html_escape +``` + +--- + +## How to Add a New Language + +Adding a new language requires changes across several modules. Use an existing language (e.g. Go or Python) as a template. + +### Checklist + +1. **Tree-sitter parser**: Add `tree-sitter-` to `Cargo.toml`. + +2. **Language registration**: Register the parser in `ast.rs` (language detection from file extension, parser initialization). + +3. **CFG node kinds**: Create `src/labels/.rs` with a `KINDS` map that maps tree-sitter node types to the internal `Kind` enum (`Block`, `If`, `While`, `For`, `Return`, `CallFn`, `CallMethod`, `Assignment`, etc.). + +4. **Parameter extraction**: Add a `PARAM_CONFIG` constant specifying how to extract function parameters from the AST (field name for parameter list, node type for individual parameters, extraction field for parameter names). + +5. 
**Label rules**: Add `RULES` (sources, sinks, sanitizers) and `TERMINATORS` to the labels file. + +6. **AST patterns**: Create `src/patterns/.rs` with a `PATTERNS` constant. + +7. **Registry updates**: + - `src/patterns/mod.rs` — add to the `REGISTRY` HashMap + - `src/labels/mod.rs` — add to the `classify()` dispatch + +8. **File extension mapping**: Add the extension in `ast.rs`. + +9. **Tests**: Write unit tests and add test fixtures. + +--- + +## Testing + +### Unit Tests + +All tests are inline `#[test]` blocks inside source modules. Run them with: + +```bash +cargo test --bin nyx +``` + +### What to Test + +- **New AST patterns**: Ensure the tree-sitter query matches the intended construct and does not match safe alternatives. +- **New taint rules**: Verify that source-to-sink flows are detected and that sanitizers properly neutralize findings. +- **New CFG rules**: Test that guard dominance logic correctly suppresses findings when guards are present. +- **Edge cases**: Empty files, files with syntax errors (tree-sitter is error-tolerant), deeply nested structures. + +### Linting + +CI runs Clippy with strict settings. Before submitting: + +```bash +cargo clippy --all -- -D warnings +``` + +--- + +## Pull Request Guidelines + +1. **Branch from `master`**. Use descriptive branch names: `feat/add-kotlin-support`, `fix/false-positive-sql-concat`, `docs/update-rule-reference`. + +2. **Keep PRs focused**. One logical change per PR. + +3. **Ensure CI passes**: ```bash - git commit -s -m "feat: add XYZ" + cargo test --bin nyx + cargo clippy --all -- -D warnings + cargo fmt -- --check ``` -4. Push the branch and open a PR against `main`. +4. **Commit style**: Use [Conventional Commits](https://www.conventionalcommits.org/). + ``` + feat(patterns): add Python subprocess.Popen pattern + fix(taint): prevent false positive on sanitized innerHTML + docs(rules): update JavaScript rule reference + ``` + +5. **Document new rules**. 
If you add patterns or taint rules, update the corresponding `docs/rules/.md` page. + +6. **Include test cases** for any new detection rules. --- -## Commit & Branching Conventions +## Bug Reports -* **Branch names**: `feature/`, `fix/`, `docs/` -* **Commit style** – Conventional Commits (simplified): +Please [open an issue](https://github.com/elicpeter/nyx/issues) for: - ```text - type(scope): subject - - body (optional) - ``` - - | Type | Use for | - |------------|--------------------------------------| - | `feat` | New functionality | - | `fix` | Bug fixes | - | `docs` | Documentation only | - | `refactor` | Code change without behaviour change | - | `test` | Adding or changing tests | - | `chore` | Build process, tooling | +- **Crashes or panics** — include the backtrace (`RUST_BACKTRACE=1 nyx scan .`) +- **False positives** — include the minimal code snippet, rule ID, and Nyx version +- **False negatives** — describe what you expected Nyx to find and why +- **Documentation errors** — point to the specific page and what's wrong --- -## Style Guide +## Feature Requests -* **Formatting**: run `cargo fmt` before committing. -* **Linting**: CI runs Clippy with `-D warnings`; keep the tree warning‑free. -* **Unsafe Rust**: prohibited unless absolutely necessary. Justify with in‑code comments. -* **Public API stability**: avoid breaking changes on exported types and functions without prior discussion. +We welcome well-motivated feature proposals. Please describe: + +1. **Problem statement** — what pain point does this solve? +2. **Proposed solution** — high-level description, optionally with pseudo-code. +3. **Alternatives considered** — why existing functionality is not enough. --- -## Security Policy +## Release Process -Please do **not** open public issues for security‑sensitive bugs. Instead, email the maintainers at `` with the details and a proof of concept. We aim to acknowledge reports within **48 hours**. +1. Update version in `Cargo.toml`. +2. 
Update `CHANGELOG.md` with the new version section. +3. Run full test suite: `cargo test --bin nyx && cargo clippy --all -- -D warnings`. +4. Create a git tag: `git tag v0.x.y`. +5. Push tag: `git push origin v0.x.y`. +6. CI builds release binaries and publishes to crates.io. --- -## Community Standards +## Security Issues -We strive to maintain a welcoming and inclusive community. Harassment, discrimination, or other forms of unacceptable behavior will be addressed per the [Code of Conduct](CODE_OF_CONDUCT.md). +Please do **not** open public issues for security-sensitive bugs. See [SECURITY.md](SECURITY.md) for our responsible disclosure process. -Thank you for helping to make Nyx better! +--- + +## License + +By contributing to Nyx, you agree that your contributions will be licensed under the [GPL-3.0](./LICENSE). diff --git a/Cargo.lock b/Cargo.lock index 6e03cb29..0ddb894b 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -71,7 +71,7 @@ version = "1.1.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "40c48f72fd53cd289104fc64099abca73db4166ad86ea0b4341abe65af83dadc" dependencies = [ - "windows-sys", + "windows-sys 0.61.2", ] [[package]] @@ -82,7 +82,7 @@ checksum = "291e6a250ff86cd4a820112fb8898808a366d8f9f58ce16d1f538353ad55747d" dependencies = [ "anstyle", "once_cell_polyfill", - "windows-sys", + "windows-sys 0.61.2", ] [[package]] @@ -283,7 +283,7 @@ dependencies = [ "libc", "once_cell", "unicode-width", - "windows-sys", + "windows-sys 0.61.2", ] [[package]] @@ -429,7 +429,7 @@ dependencies = [ "libc", "option-ext", "redox_users", - "windows-sys", + "windows-sys 0.61.2", ] [[package]] @@ -457,7 +457,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "39cab71617ae0d63f51a36d69f866391735b51691dbda63cf6f96d042b63efeb" dependencies = [ "libc", - "windows-sys", + "windows-sys 0.61.2", ] [[package]] @@ -805,7 +805,7 @@ version = "0.50.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"7957b9740744892f114936ab4a57b3f487491bbeafaf8083688b16841a4240e5" dependencies = [ - "windows-sys", + "windows-sys 0.61.2", ] [[package]] @@ -835,7 +835,7 @@ dependencies = [ [[package]] name = "nyx-scanner" -version = "0.3.0" +version = "0.4.0" dependencies = [ "assert_cmd", "bitflags", @@ -862,7 +862,9 @@ dependencies = [ "rusqlite", "serde", "serde_json", + "smallvec", "tempfile", + "terminal_size", "thiserror", "toml", "tracing", @@ -1272,7 +1274,7 @@ dependencies = [ "errno", "libc", "linux-raw-sys", - "windows-sys", + "windows-sys 0.61.2", ] [[package]] @@ -1436,7 +1438,17 @@ dependencies = [ "getrandom 0.4.1", "once_cell", "rustix", - "windows-sys", + "windows-sys 0.61.2", +] + +[[package]] +name = "terminal_size" +version = "0.4.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "60b8cb979cb11c32ce1603f8137b22262a9d131aaa5c37b5678025f22b8becd0" +dependencies = [ + "rustix", + "windows-sys 0.60.2", ] [[package]] @@ -1631,9 +1643,9 @@ dependencies = [ [[package]] name = "tree-sitter" -version = "0.26.5" +version = "0.26.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "12987371f54efc9b9306a20dc87ed5aaee9f320c8a8b115e28515c412b2efe39" +checksum = "13f456d2108c3fef07342ba4689a8503ec1fb5beed245e2b9be93096ef394848" dependencies = [ "cc", "regex", @@ -1967,7 +1979,7 @@ version = "0.1.11" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c2a7b1c03c876122aa43f3020e6c3c3ee5c05081c9a00739faf7503aeba10d22" dependencies = [ - "windows-sys", + "windows-sys 0.61.2", ] [[package]] @@ -2035,6 +2047,15 @@ dependencies = [ "windows-link", ] +[[package]] +name = "windows-sys" +version = "0.60.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f2f500e4d28234f72040990ec9d39e3a6b950f9f22d3dba18416c35882612bcb" +dependencies = [ + "windows-targets", +] + [[package]] name = "windows-sys" version = "0.61.2" @@ -2044,6 +2065,71 @@ dependencies = [ "windows-link", ] 
+[[package]] +name = "windows-targets" +version = "0.53.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4945f9f551b88e0d65f3db0bc25c33b8acea4d9e41163edf90dcd0b19f9069f3" +dependencies = [ + "windows-link", + "windows_aarch64_gnullvm", + "windows_aarch64_msvc", + "windows_i686_gnu", + "windows_i686_gnullvm", + "windows_i686_msvc", + "windows_x86_64_gnu", + "windows_x86_64_gnullvm", + "windows_x86_64_msvc", +] + +[[package]] +name = "windows_aarch64_gnullvm" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a9d8416fa8b42f5c947f8482c43e7d89e73a173cead56d044f6a56104a6d1b53" + +[[package]] +name = "windows_aarch64_msvc" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b9d782e804c2f632e395708e99a94275910eb9100b2114651e04744e9b125006" + +[[package]] +name = "windows_i686_gnu" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "960e6da069d81e09becb0ca57a65220ddff016ff2d6af6a223cf372a506593a3" + +[[package]] +name = "windows_i686_gnullvm" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fa7359d10048f68ab8b09fa71c3daccfb0e9b559aed648a8f95469c27057180c" + +[[package]] +name = "windows_i686_msvc" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1e7ac75179f18232fe9c285163565a57ef8d3c89254a30685b57d83a38d326c2" + +[[package]] +name = "windows_x86_64_gnu" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9c3842cdd74a865a8066ab39c8a7a473c0778a3f29370b5fd6b4b9aa7df4a499" + +[[package]] +name = "windows_x86_64_gnullvm" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0ffa179e2d07eee8ad8f57493436566c7cc30ac536a3379fdf008f47f6bb7ae1" + +[[package]] +name = "windows_x86_64_msvc" +version = "0.53.1" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "d6bbff5f0aada427a1e5a6da5f1f98158182f26556f345ac9e04d36d0ebed650" + [[package]] name = "winnow" version = "0.7.14" diff --git a/Cargo.toml b/Cargo.toml index d4ed73bc..1cc967b1 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -1,13 +1,13 @@ [package] name = "nyx-scanner" -version = "0.3.0" +version = "0.4.0" edition = "2024" description = "A CLI security scanner for automating vulnerability checks" license = "GPL-3.0" authors = ["Eli Peter "] homepage = "https://github.com/elicpeter/nyx" repository = "https://github.com/elicpeter/nyx" -documentation = "https://github.com/elicpeter/nyx#readme" +documentation = "https://github.com/elicpeter/nyx/tree/master/docs" keywords = ["security", "vulnerability", "scanner", "static-analysis", "cli"] categories = ["security", "command-line-utilities", "development-tools", "parser-implementations", "text-processing"] readme = "README.md" @@ -56,7 +56,7 @@ num_cpus = "1.17.0" rusqlite = { version = "0.38.0", features = ["bundled"] } r2d2_sqlite = { version = "0.32.0", features = ["bundled"] } ignore = "0.4.25" -tree-sitter = "0.26.5" +tree-sitter = "0.26.6" tree-sitter-rust = "0.24.0" tree-sitter-c = "0.24.1" tree-sitter-cpp = "0.23.4" @@ -71,6 +71,7 @@ crossbeam-channel = "0.5.15" blake3 = "1.8.3" once_cell = "1.21.3" console = "0.16.2" +terminal_size = "0.4" rayon = "1.11.0" r2d2 = "0.8.10" bytesize = "2.3.1" @@ -81,3 +82,4 @@ petgraph = "0.8.3" bitflags = "2.11.0" phf = { version = "0.13.1", features = ["macros"] } indicatif = "0.18.4" +smallvec = "1.15" diff --git a/README.md b/README.md index f797366f..062f42d9 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ [![crates.io](https://img.shields.io/crates/v/nyx-scanner.svg)](https://crates.io/crates/nyx-scanner) [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) [![Rust 
1.85+](https://img.shields.io/badge/rust-1.85%2B-orange)](https://www.rust-lang.org) -[![CI](https://img.shields.io/github/actions/workflow/status/ecpeter23/nyx/ci.yml?branch=master)](https://github.com/ecpeter23/nyx/actions) +[![CI](https://img.shields.io/github/actions/workflow/status/elicpeter/nyx/ci.yml?branch=master)](https://github.com/elicpeter/nyx/actions) --- @@ -24,7 +24,7 @@ | Multi-language support | Rust, C, C++, Java, Go, PHP, Python, Ruby, TypeScript, JavaScript | | AST-level pattern matching | Language-specific queries written against precise parse trees | | Control-flow graph analysis | Auth gaps, unguarded sinks, unreachable security code, resource leaks, error fallthrough | -| Cross-file taint tracking | BFS taint propagation from sources through sanitizers to sinks with function summaries | +| Cross-file taint tracking | Monotone forward dataflow taint analysis from sources through sanitizers to sinks with function summaries | | Cross-language interop | Taint flows across language boundaries via explicit interop edges | | Two-pass architecture | Pass 1 extracts function summaries; Pass 2 runs taint with full cross-file context | | Incremental indexing | SQLite database stores file hashes, summaries, and findings to skip unchanged files | @@ -42,7 +42,7 @@ |---|---| | **Pure-Rust, single binary** | No JVM, Python, or server to install; drop the `nyx` executable into your `$PATH` and go. | | **Massively parallel** | Uses Rayon and a thread-pool walker; scales to all CPU cores. Scanning the entire **rust-lang/rust** codebase (~53,000 files) on an M2 MacBook Pro takes **~1 s**. | -| **Deep analysis** | Real CFG construction and taint propagation, not just regex matching. Cross-file function summaries, capability-based sanitizer tracking, and scored findings. | +| **Deep analysis** | Real CFG construction and monotone dataflow taint analysis with guaranteed termination, not just regex matching. 
Cross-file function summaries, capability-based sanitizer tracking, and scored findings. | | **Index-aware** | An optional SQLite index stores file hashes and findings; subsequent scans touch *only* changed files, slashing CI times. | | **Offline & privacy-friendly** | Requires no login, cloud account, or telemetry. Perfect for air-gapped environments and strict compliance policies. | | **Tree-sitter precision** | Parses real language grammars, not regexes, giving far fewer false positives than line-based scanners. | @@ -58,7 +58,7 @@ $ cargo install nyx-scanner ``` ### Install Github release -1. Navigate to the [Releases](https://github.com/ecpeter23/nyx/releases) page of the repository. +1. Navigate to the [Releases](https://github.com/elicpeter/nyx/releases) page of the repository. 2. Download the appropriate binary for your system: ```nyx-x86_64-unknown-linux-gnu.zip``` for Linux @@ -87,7 +87,7 @@ $ cargo install nyx-scanner ### Build from source ```bash -$ git clone https://github.com/ecpeter23/nyx.git +$ git clone https://github.com/elicpeter/nyx.git $ cd nyx $ cargo build --release # optional – copy the binary into PATH @@ -111,20 +111,29 @@ $ nyx scan ./server --format json $ nyx scan --format sarif > results.sarif # Perform an ad-hoc scan without touching the index -$ nyx scan --no-index +$ nyx scan --index off # Restrict results to high-severity findings -$ nyx scan --high-only +$ nyx scan --severity HIGH + +# Filter by severity expression (high and medium) +$ nyx scan --severity ">=MEDIUM" # AST pattern matching only (fastest, no CFG/taint) -$ nyx scan --ast-only +$ nyx scan --mode ast # CFG + taint analysis only (skip AST pattern rules) -$ nyx scan --cfg-only +$ nyx scan --mode cfg + +# CI gate: fail on medium+, SARIF output +$ nyx scan --format sarif --fail-on MEDIUM > results.sarif + +# Suppress status messages (for CI/scripting) +$ nyx scan --quiet --format json # Include test/vendor/benchmark paths at original severity # (by default these are 
downgraded one tier) -$ nyx scan --include-nonprod +$ nyx scan --keep-nonprod-severity ``` ### Index Management @@ -164,13 +173,14 @@ $ nyx config add-terminator --lang javascript --name process.exit ## Analysis Modes -Nyx supports three analysis modes, selectable via the `scanner.mode` config option or CLI flags: +Nyx supports four analysis modes, selectable via `--mode` or the `scanner.mode` config option: | Mode | CLI flag | What runs | |---|---|---| -| **Full** (default) | — | AST pattern matching + CFG construction + taint analysis | -| **AST-only** | `--ast-only` | AST pattern matching only; skips CFG and taint entirely | -| **Taint-only** | `--cfg-only` | CFG + taint analysis only; filters out AST pattern findings | +| **Full** (default) | `--mode full` | AST pattern matching + CFG construction + taint analysis | +| **AST-only** | `--mode ast` | AST pattern matching only; skips CFG and taint entirely | +| **CFG** | `--mode cfg` | CFG + taint analysis only; filters out AST pattern findings | +| **Taint** | `--mode taint` | Alias for `cfg` (CFG + taint analysis) | ### What the CFG + taint engine detects @@ -182,8 +192,40 @@ Nyx supports three analysis modes, selectable via the `scanner.mode` config opti | Unreachable security code | `cfg-unreachable-*` | Sanitizers, guards, or sinks in dead code branches | | Error fallthrough | `cfg-error-fallthrough` | Error-handling branches that don't terminate, allowing execution to fall through to dangerous operations | | Resource leak | `cfg-resource-leak` | Resources acquired but not released on all exit paths (malloc/free, fopen/fclose, Lock/Unlock) | +| Use-after-close | `state-use-after-close` | Variable read/written after its resource handle was closed | +| Double-close | `state-double-close` | Resource handle closed more than once | +| Must-leak | `state-resource-leak` | Resource acquired but never closed on any exit path | +| May-leak | `state-resource-leak-possible` | Resource open on some but not all exit paths 
| +| Unauthenticated access | `state-unauthed-access` | Sensitive sink reached without a preceding auth/admin check | -Findings are scored and ranked by severity, proximity to entry point, path complexity, and taint confirmation. +### Attack Surface Ranking + +Every finding is assigned a deterministic **attack-surface score** that estimates exploitability using only information already in memory — no extra source passes are needed. Findings are sorted by descending score before truncation, so `max_results` always keeps the most important results. + +The score is the sum of five components: + +| Component | Weight | Description | +|---|---|---| +| **Severity base** | High = 60, Medium = 30, Low = 10 | Primary ordering signal. Severity reflects source-kind exploitability and rule confidence. | +| **Analysis kind** | taint = +10, state = +8, cfg = +3/+5, ast = 0 | Taint-confirmed flows are the strongest signal; AST-only pattern matches rank lowest at equal severity. CFG findings with evidence get +5, without get +3. | +| **Evidence strength** | +1 per evidence item (max 4), +2–6 for source kind | More evidence increases confidence. Source-kind priority: user input (+6) > env/config (+5) > unknown (+4) > file system (+3) > database (+2). | +| **State rule type** | +1 to +6 | Use-after-close and unauthenticated access (+6) rank above double-close (+3), must-leak (+2), and may-leak (+1). | +| **Path validation** | −5 | Findings on paths guarded by a validation predicate receive a small exploitability penalty — the guard may prevent triggering. | + +**Score ranges** (approximate): + +| Finding type | Score | +|---|---| +| High taint + user input | ~78 | +| High state (use-after-close) | ~74 | +| High CFG structural | ~63 | +| Medium taint + env source | ~47 | +| Medium state (resource leak) | ~40 | +| Low AST-only pattern | ~10 | + +Tie-breaking is deterministic: severity → rule ID → file path → line → column → message hash. 
The same set of findings always produces the same ordering regardless of parallelism or input order. + +Ranking is enabled by default. Disable it with `--no-rank` or `output.attack_surface_ranking = false` in config. When disabled, `rank_score` is omitted from JSON/SARIF output. --- @@ -213,8 +255,8 @@ Nyx merges a default configuration file (`nyx.conf`) with user overrides (`nyx.l | Platform | Directory | |---|---| | Linux | `~/.config/nyx/` | -| macOS | `~/Library/Application Support/dev.ecpeter23.nyx/` | -| Windows | `%APPDATA%\ecpeter23\nyx\config\` | +| macOS | `~/Library/Application Support/nyx/` | +| Windows | `%APPDATA%\elicpeter\nyx\config\` | Minimal example (`nyx.local`): @@ -270,7 +312,7 @@ Nyx uses a **two-pass architecture** to enable cross-file analysis without sacri 1. **File enumeration** -- A parallel walker (Rayon + `ignore` crate) applies gitignore rules, size limits, and user exclusions. 2. **Pass 1 -- Summary extraction** -- Each file is parsed via tree-sitter, an intra-procedural CFG is built (petgraph), and a `FuncSummary` is exported per function capturing source/sanitizer/sink capabilities (bitflags), taint propagation behavior, and callee lists. Summaries are persisted to SQLite. 3. **Summary merge** -- All per-file summaries are merged into a `GlobalSummaries` map with conservative conflict resolution (union caps, OR booleans). -4. **Pass 2 -- Analysis** -- Files are re-parsed and analyzed with the full cross-file context: BFS taint propagation resolves callees against local and global summaries, CFG analysis checks for auth gaps, unguarded sinks, resource leaks, and more. +4. **Pass 2 -- Analysis** -- Files are re-parsed and analyzed with the full cross-file context: a monotone forward dataflow engine resolves callees against local and global summaries and propagates taint through a bounded lattice with guaranteed convergence. CFG analysis checks for auth gaps, unguarded sinks, resource leaks, and more. 5. 
**Reporting** -- Findings are scored, ranked, deduplicated, and emitted to the console or serialized as JSON. With indexing enabled, Pass 1 skips files whose blake3 content hash is unchanged, and cached findings are served directly for AST-only results. @@ -279,14 +321,19 @@ With indexing enabled, Pass 1 skips files whose blake3 content hash is unchanged ## Roadmap -### Phase 1 -- Deep Static Engine +### Phase 1 -- Deep Static Engine (Complete) -| Feature | Description | -|---|---| -| Interprocedural call graph | Precise symbol resolution via `FuncKey`, language-scoped namespaces, cross-module linking. No name-collision merging -- full call graph with topological analysis. | -| Path-sensitive analysis | Track path predicates and conditional constraints. Detect infeasible paths and validation-only-in-one-branch patterns. Dramatically reduces false positives. | -| Dataflow & state modeling | Resource state machines (init -> use -> close), auth state transitions, privilege level tracking. Semantic analysis beyond pattern matching. | -| Attack surface ranking | Score entry points by distance-to-sink, guard strength, path complexity, and privilege escalation potential. Deterministic attack surface scoring. | +| Feature | Status | Description | +|---|--------|---| +| Interprocedural call graph | Done | Precise symbol resolution via `FuncKey`, language-scoped namespaces, cross-module linking. Full call graph with SCC and topological analysis. | +| Path-sensitive analysis | Done | Track path predicates and conditional constraints. Detect infeasible paths and validation-only-in-one-branch patterns. Monotone predicate summaries with contradiction pruning. | +| Dataflow & state modeling | Done | Resource state machines (init -> use -> close), auth state transitions, privilege level tracking. Generic `Transfer` trait over bounded lattices with guaranteed convergence. 
| +| Monotone taint analysis | Done | Replaced BFS taint engine with a forward worklist dataflow analysis over a finite `TaintState` lattice. Multi-origin tracking, dual validated-must/may sets, JS/TS two-level solve. Guaranteed termination via lattice finiteness. | +| Attack surface ranking | Done | Deterministic post-analysis scoring of findings by severity, analysis kind, evidence strength, source-kind exploitability, and validation state. Findings sorted by score before truncation so `max_results` keeps the most important results. | +| Inline suppressions | Done | `nyx:ignore` and `nyx:ignore-next-line` comments with wildcard matching, all 10 languages supported. `--show-suppressed` flag for visibility. | +| Low-noise prioritization | Done | Category filtering, rollup grouping for high-frequency rules, configurable LOW budgets. Quality-category findings hidden by default. | +| Pattern-level confidence | Done | Explicit High/Medium/Low confidence on every AST pattern. Confidence flows into output alongside severity and rank score. | +| AST pattern overhaul | Done | 30+ new patterns across all languages, 11 broken query fixes, namespaced IDs, severity recalibration. | ### Phase 2 -- Dynamic Capability @@ -312,7 +359,25 @@ With indexing enabled, Pass 1 skips files whose blake3 content hash is unchanged | Rule updates | Remote rule feed with signature verification | | UX | Smart file-watch re-scan | -Community feedback shapes priorities -- please [open an issue](https://github.com/ecpeter23/nyx/issues) to discuss proposed changes. +Community feedback shapes priorities -- please [open an issue](https://github.com/elicpeter/nyx/issues) to discuss proposed changes. 
+ +--- + +## Documentation + +Full documentation is available in the [`docs/`](docs/index.md) directory: + +- [Installation](docs/installation.md) — cargo, binaries, CI tips +- [Quick Start](docs/quickstart.md) — Your first scan in 60 seconds +- [CLI Reference](docs/cli.md) — Every flag and subcommand +- [Configuration](docs/configuration.md) — Config file schema, custom rules +- [Output Formats](docs/output.md) — Console, JSON, SARIF; exit codes +- [Detector Overview](docs/detectors.md) — How the four detector families work + - [Taint Analysis](docs/detectors/taint.md) — Cross-file source-to-sink dataflow + - [CFG Structural](docs/detectors/cfg.md) — Auth gaps, unguarded sinks, resource leaks + - [State Model](docs/detectors/state.md) — Resource lifecycle, authentication state + - [AST Patterns](docs/detectors/patterns.md) — Tree-sitter structural matching +- [Rule Reference](docs/rules/index.md) — Per-language rule listings with examples --- @@ -327,7 +392,7 @@ Pull requests are welcome. To contribute: Please open an issue for any crash, panic, or suspicious result -- attach the minimal code snippet and mention the Nyx version. -See `CONTRIBUTING.md` for full guidelines. +See [`CONTRIBUTING.md`](CONTRIBUTING.md) for full guidelines, including how to add new rules and support new languages. --- diff --git a/SECURITY.md b/SECURITY.md index 001c35e1..06c4ecc8 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -4,9 +4,9 @@ | Version | Supported | Notes | |---------|-----------|----------------------| -| 0.3.x | ✅ | Latest stable line | -| 0.2.x | ✅ | Critical fixes only | -| < 0.2 | ❌ | End-of-life | +| 0.4.x | ✅ | Latest stable line | +| 0.3.x | ✅ | Critical fixes only | +| < 0.3 | ❌ | End-of-life | We follow [Semantic Versioning] as soon as we hit **1.0.0**. Before that, breaking changes may land in any minor release. 
diff --git a/default-nyx.conf b/default-nyx.conf index e2416692..097890f1 100644 --- a/default-nyx.conf +++ b/default-nyx.conf @@ -52,6 +52,11 @@ follow_symlinks = false ## Scan hidden files (dot-files) scan_hidden_files = false +## Enable state-model dataflow analysis (resource lifecycle + auth state). +## Detects use-after-close, double-close, resource leaks, and unauthed access. +## Requires mode = "full" or "taint" (needs CFG). Default: off. +enable_state_analysis = false + [database] @@ -70,15 +75,48 @@ vacuum_on_startup = false [output] -## Output format — only "console" exists for now +## Output format: console | json | sarif default_format = "console" -## Suppress all console output (UNIMPLEMENTED) +## Suppress all human-readable status output (stderr) quiet = false +## Enable attack-surface ranking (sort findings by exploitability score) +attack_surface_ranking = true + ## Cap the number of issues shown; null = unlimited max_results = null +## Minimum attack-surface score to include; null = no minimum +## Findings below this threshold are dropped after ranking. +## Requires attack_surface_ranking to be enabled. +min_score = null + +## Minimum confidence level to include in output; null = no minimum +## Values: "low", "medium", "high" +# min_confidence = "medium" + +## Include Quality-category findings (excluded by default). +## Quality findings (e.g. unwrap, expect, panic) are noise-heavy and hidden +## unless this is set to true or --include-quality is passed. +include_quality = false + +## Show all findings: disables category filtering, rollups, and LOW budgets. +## Equivalent to --all on the command line. +show_all = false + +## Maximum total LOW findings to show (rollups count as 1). +max_low = 20 + +## Maximum LOW findings per file (rollups count as 1). +max_low_per_file = 1 + +## Maximum LOW findings per rule (rollups count as 1). +max_low_per_rule = 10 + +## Number of example locations stored in rollup findings. 
+rollup_examples = 5 + [performance] diff --git a/docs/cli.md b/docs/cli.md new file mode 100644 index 00000000..82e24715 --- /dev/null +++ b/docs/cli.md @@ -0,0 +1,234 @@ +# CLI Reference + +## Global + +``` +nyx [COMMAND] +nyx --version +nyx --help +``` + +--- + +## `nyx scan` + +Run a security scan on a directory. + +``` +nyx scan [PATH] [OPTIONS] +``` + +**PATH** defaults to `.` (current directory). + +### Analysis Mode + +| Flag | Default | Description | +|------|---------|-------------| +| `--mode ` | `full` | Analysis mode: `full`, `ast`, `cfg`, or `taint` | + +| Mode | What runs | +|------|-----------| +| `full` | AST patterns + CFG structural analysis + taint analysis | +| `ast` | AST patterns only (fastest, no CFG or taint) | +| `cfg` / `taint` | CFG + taint analysis only (no AST patterns) | + +**Deprecated aliases**: `--ast-only` (use `--mode ast`), `--cfg-only` (use `--mode cfg`), `--all-targets` (use `--mode full`). + +### Index Control + +| Flag | Default | Description | +|------|---------|-------------| +| `--index ` | `auto` | Index behavior: `auto`, `off`, or `rebuild` | + +| Index Mode | Behavior | +|------------|----------| +| `auto` | Use existing index if available; build if missing | +| `off` | Skip indexing, scan filesystem directly | +| `rebuild` | Force rebuild index before scanning | + +**Deprecated aliases**: `--no-index` (use `--index off`), `--rebuild-index` (use `--index rebuild`). 
+ +### Output + +| Flag | Default | Description | +|------|---------|-------------| +| `-f, --format ` | `console` | Output format: `console`, `json`, or `sarif` | +| `--quiet` | off | Suppress status messages (stderr); stdout stays clean | +| `--no-rank` | off | Disable attack-surface ranking | + +### Filtering + +| Flag | Default | Description | +|------|---------|-------------| +| `--severity ` | *(none)* | Filter findings by severity | +| `--min-score ` | *(none)* | Drop findings with rank score below N | +| `--min-confidence ` | *(none)* | Drop findings below this confidence level (`low`, `medium`, `high`) | +| `--fail-on ` | *(none)* | Exit code 1 if any finding >= this severity | +| `--show-suppressed` | off | Show inline-suppressed findings (dimmed, tagged `[SUPPRESSED]`) | +| `--keep-nonprod-severity` | off | Don't downgrade severity for test/vendor paths | +| `--all` | off | Disable category filtering, rollups, and LOW budgets — show everything | +| `--include-quality` | off | Include Quality-category findings (hidden by default) | +| `--max-low ` | `20` | Maximum total LOW findings to show | +| `--max-low-per-file ` | `1` | Maximum LOW findings per file | +| `--max-low-per-rule ` | `10` | Maximum LOW findings per rule | +| `--rollup-examples ` | `5` | Number of example locations in rollup findings | +| `--show-instances ` | *(none)* | Expand all instances of a specific rule (bypass rollup) | + +**Severity expression formats**: + +```bash +--severity HIGH # Only high +--severity "HIGH,MEDIUM" # High or medium +--severity ">=MEDIUM" # Medium and above (high + medium) +--severity ">= low" # All severities (case-insensitive) +``` + +**Deprecated aliases**: `--high-only` (use `--severity HIGH`), `--include-nonprod` (use `--keep-nonprod-severity`). + +### Examples + +```bash +# Basic scan +nyx scan + +# Scan specific path, JSON output +nyx scan ./server --format json + +# CI gate: fail on medium+, SARIF output +nyx scan . 
--format sarif --fail-on medium > results.sarif + +# Fast AST-only scan, no index +nyx scan . --mode ast --index off + +# High-severity only, quiet mode +nyx scan . --severity HIGH --quiet + +# Only findings scoring 50 or above +nyx scan . --min-score 50 + +# Only medium+ confidence findings +nyx scan . --min-confidence medium + +# Show everything (no filtering, no rollups) +nyx scan . --all + +# Include quality findings but keep rollups and budgets +nyx scan . --include-quality + +# See all unwrap findings expanded +nyx scan . --include-quality --show-instances rs.quality.unwrap + +# Allow more LOW findings +nyx scan . --max-low 50 --max-low-per-file 5 +``` + +--- + +## `nyx index` + +Manage the SQLite file index. + +### `nyx index build` + +``` +nyx index build [PATH] [--force] +``` + +Build or update the index for the given path (default: `.`). + +| Flag | Description | +|------|-------------| +| `-f, --force` | Force full rebuild, ignoring cached file hashes | + +### `nyx index status` + +``` +nyx index status [PATH] +``` + +Display index statistics (file count, size, last modified) for the given path. + +--- + +## `nyx list` + +``` +nyx list [-v] +``` + +List all indexed projects. + +| Flag | Description | +|------|-------------| +| `-v, --verbose` | Show detailed information per project | + +--- + +## `nyx clean` + +``` +nyx clean [PROJECT] [--all] +``` + +Remove index data. + +| Argument/Flag | Description | +|---------------|-------------| +| `PROJECT` | Project name or path to clean | +| `--all` | Clean all indexed projects | + +--- + +## `nyx config` + +Manage configuration. + +### `nyx config show` + +Print the effective merged configuration as TOML. + +### `nyx config path` + +Print the configuration directory path. + +### `nyx config add-rule` + +``` +nyx config add-rule --lang --matcher --kind --cap +``` + +Add a custom taint rule. Written to `nyx.local`. 
+ +| Flag | Values | +|------|--------| +| `--lang` | `rust`, `javascript`, `typescript`, `python`, `go`, `java`, `c`, `cpp`, `php`, `ruby` | +| `--matcher` | Function or property name to match | +| `--kind` | `source`, `sanitizer`, `sink` | +| `--cap` | `env_var`, `html_escape`, `shell_escape`, `url_encode`, `json_parse`, `file_io`, `all` | + +### `nyx config add-terminator` + +``` +nyx config add-terminator --lang --name +``` + +Add a terminator function (e.g. `process.exit`). Written to `nyx.local`. + +--- + +## Exit Codes + +| Code | Meaning | +|------|---------| +| `0` | Scan completed; no findings matched `--fail-on` threshold (or no `--fail-on` specified) | +| `1` | Scan completed but at least one finding met or exceeded the `--fail-on` severity | +| Non-zero | Error during scan (I/O error, config parse error, database error, etc.) | + +--- + +## Environment Variables + +| Variable | Description | +|----------|-------------| +| `RUST_LOG` | Set tracing verbosity (e.g. `RUST_LOG=debug nyx scan .`) | +| `NO_COLOR` | Disable ANSI color output | diff --git a/docs/configuration.md b/docs/configuration.md new file mode 100644 index 00000000..2d884b01 --- /dev/null +++ b/docs/configuration.md @@ -0,0 +1,183 @@ +# Configuration + +Nyx uses TOML configuration files. A default config is auto-generated on first run. + +## File Locations + +| Platform | Directory | +|----------|-----------| +| Linux | `~/.config/nyx/` | +| macOS | `~/Library/Application Support/nyx/` | +| Windows | `%APPDATA%\elicpeter\nyx\config\` | + +Run `nyx config path` to see the exact directory on your system. + +## File Precedence + +1. **`nyx.conf`** — Default config (auto-created from built-in template on first run) +2. **`nyx.local`** — User overrides (loaded on top of defaults) + +Both files are optional. CLI flags take precedence over both. 
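
The precedence order for scalar settings can be sketched as a simple fallback chain. This is a minimal illustration only: the `resolve` helper is hypothetical and is not taken from the Nyx source; it assumes each layer either supplies a value or leaves it unset.

```rust
/// Hypothetical helper (not Nyx's actual API): pick the effective value of
/// one scalar setting. A CLI flag beats `nyx.local`, which beats the
/// built-in default from `nyx.conf`.
fn resolve<T>(default: T, local: Option<T>, cli: Option<T>) -> T {
    // `or` keeps the CLI value when present, otherwise falls back to the
    // user override; `unwrap_or` falls back to the default last.
    cli.or(local).unwrap_or(default)
}
```

Array-valued settings are merged rather than overridden, so this sketch applies to scalars only.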
+ +## Merge Strategy + +| Type | Behavior | +|------|----------| +| Scalars (`mode`, `min_severity`, booleans) | User value wins | +| Arrays (`excluded_extensions`, `excluded_directories`) | Union + deduplicate | +| Analysis rules | Per-language union with deduplication | + +Example: +```toml +# nyx.conf (default): +excluded_extensions = ["jpg", "png", "exe"] + +# nyx.local (user): +excluded_extensions = ["foo", "jpg"] + +# Effective result: +# ["exe", "foo", "jpg", "png"] — sorted, deduped union +``` + +--- + +## Full Schema + +### `[scanner]` + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `mode` | `"full"` \| `"ast"` \| `"cfg"` \| `"taint"` | `"full"` | Analysis mode | +| `min_severity` | `"Low"` \| `"Medium"` \| `"High"` | `"Low"` | Minimum severity to report | +| `max_file_size_mb` | int \| null | null | Max file size in MiB; null = unlimited | +| `excluded_extensions` | [string] | `["jpg", "png", "gif", "mp4", ...]` | File extensions to skip | +| `excluded_directories` | [string] | `["node_modules", ".git", "target", ...]` | Directories to skip | +| `excluded_files` | [string] | `[]` | Specific files to skip | +| `read_global_ignore` | bool | `false` | Honor global ignore file | +| `read_vcsignore` | bool | `true` | Honor `.gitignore` / `.hgignore` | +| `require_git_to_read_vcsignore` | bool | `true` | Require `.git` dir to apply gitignore | +| `one_file_system` | bool | `false` | Don't cross filesystem boundaries | +| `follow_symlinks` | bool | `false` | Follow symbolic links | +| `scan_hidden_files` | bool | `false` | Scan dot-files | +| `include_nonprod` | bool | `false` | Keep original severity for test/vendor paths | +| `enable_state_analysis` | bool | `false` | Enable resource lifecycle + auth state analysis. Detects use-after-close, double-close, resource leaks (per-function scope), and unauthenticated access. Requires `mode = "full"` or `mode = "cfg"`. 
| + +### `[database]` + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `path` | string | `""` | Custom SQLite DB path; empty = platform default | + +### `[output]` + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `default_format` | `"console"` \| `"json"` \| `"sarif"` | `"console"` | Default output format | +| `quiet` | bool | `false` | Suppress status messages | +| `max_results` | int \| null | null | Cap number of findings; null = unlimited | +| `attack_surface_ranking` | bool | `true` | Enable attack-surface ranking | +| `min_score` | int \| null | null | Minimum rank score to include; null = no minimum | +| `min_confidence` | string \| null | null | Minimum confidence level (`"low"`, `"medium"`, `"high"`); null = no minimum | +| `include_quality` | bool | `false` | Include Quality-category findings (hidden by default) | +| `show_all` | bool | `false` | Disable category filtering, rollups, and LOW budgets | +| `max_low` | int | `20` | Maximum total LOW findings to show (rollups count as 1) | +| `max_low_per_file` | int | `1` | Maximum LOW findings per file (rollups count as 1) | +| `max_low_per_rule` | int | `10` | Maximum LOW findings per rule (rollups count as 1) | +| `rollup_examples` | int | `5` | Number of example locations stored in rollup findings | + +### `[performance]` + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `worker_threads` | int \| null | null | Worker thread count; null/0 = auto-detect | +| `batch_size` | int | `100` | Files per index batch | +| `channel_multiplier` | int | `4` | Channel capacity = threads x multiplier | +| `rayon_thread_stack_size` | int | `8388608` | Rayon thread stack size in bytes (8 MiB) | +| `prune` | bool | `false` | Stop traversing into matching directories | + +### `[analysis.languages.]` + +Per-language custom rules. 
`` is one of: `rust`, `javascript`, `typescript`, `python`, `go`, `java`, `c`, `cpp`, `php`, `ruby`. + +| Field | Type | Description | +|-------|------|-------------| +| `rules` | array of rule objects | Custom label rules | +| `terminators` | [string] | Functions that terminate execution | +| `event_handlers` | [string] | Event handler function names | + +**Rule object**: + +```toml +[[analysis.languages.javascript.rules]] +matchers = ["escapeHtml"] +kind = "sanitizer" # "source" | "sanitizer" | "sink" +cap = "html_escape" # "env_var" | "html_escape" | "shell_escape" | + # "url_encode" | "json_parse" | "file_io" | "all" +``` + +--- + +## Example Configurations + +### Minimal override (`nyx.local`) + +```toml +[scanner] +min_severity = "Medium" + +[output] +default_format = "json" +max_results = 100 +``` + +### CI-optimized + +```toml +[scanner] +mode = "full" +min_severity = "Medium" +excluded_directories = ["node_modules", ".git", "target", "vendor", "dist"] + +[output] +quiet = true +default_format = "sarif" + +[performance] +worker_threads = 4 +``` + +### Custom rules for a Node.js project + +```toml +[analysis.languages.javascript] +terminators = ["process.exit", "abort"] +event_handlers = ["addEventListener"] + +[[analysis.languages.javascript.rules]] +matchers = ["escapeHtml", "sanitizeInput"] +kind = "sanitizer" +cap = "html_escape" + +[[analysis.languages.javascript.rules]] +matchers = ["dangerouslySetInnerHTML"] +kind = "sink" +cap = "html_escape" + +[[analysis.languages.javascript.rules]] +matchers = ["getRequestBody", "readUserInput"] +kind = "source" +cap = "all" +``` + +### Adding rules via CLI + +```bash +# Add a sanitizer +nyx config add-rule --lang javascript --matcher escapeHtml --kind sanitizer --cap html_escape + +# Add a terminator +nyx config add-terminator --lang javascript --name process.exit + +# Verify +nyx config show +``` diff --git a/docs/detectors.md b/docs/detectors.md new file mode 100644 index 00000000..32070724 --- /dev/null +++ 
b/docs/detectors.md @@ -0,0 +1,81 @@ +# Detector Overview + +Nyx uses four independent detector families. Each targets different vulnerability classes and operates at a different level of analysis depth. Findings from all active detectors are merged, deduplicated, ranked, and presented in a single result set. + +## The Four Detector Families + +| Family | Rule prefix | Analysis depth | What it finds | +|--------|------------|----------------|---------------| +| [**Taint Analysis**](detectors/taint.md) | `taint-*` | Cross-file dataflow | Unsanitized data flowing from sources to sinks | +| [**CFG Structural**](detectors/cfg.md) | `cfg-*` | Intra-procedural CFG | Auth gaps, unguarded sinks, resource leaks, error fallthrough | +| [**State Model**](detectors/state.md) | `state-*` | Intra-procedural lattice | Use-after-close, double-close, resource leaks, unauthenticated access | +| [**AST Patterns**](detectors/patterns.md) | `.*.*` | Structural (no flow) | Dangerous function calls, banned APIs, weak crypto | + +## How They Combine + +In `--mode full` (default), all four families run. Overlapping findings are then reconciled as follows: + +1. **Taint outranks AST**: If a taint finding and an AST pattern both fire at the same location (e.g. both flag `eval(userInput)`), both are kept with distinct rule IDs. The taint finding ranks higher due to the analysis-kind bonus. + +2. **State supersedes CFG**: If a state-model finding (e.g. `state-resource-leak`) fires at the same location as a CFG finding (e.g. `cfg-resource-leak`), the CFG finding is suppressed. + +3. **Location-level dedup**: Exact duplicates (same line, column, rule ID, severity) are removed. 
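
The three rules above can be sketched as a single pass over the merged findings. This is a rough illustration under stated assumptions, not the project's implementation: the `Finding` struct and `reconcile` function are hypothetical, rule families are recognised purely by the `state-` and `cfg-` prefixes, and "same location" is taken to mean the same file, line, and column.

```rust
use std::collections::HashSet;

// Hypothetical, simplified finding record (not Nyx's real `Diag` type).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Finding {
    file: &'static str,
    line: u32,
    col: u32,
    rule_id: &'static str,
    severity: &'static str,
}

/// Apply the combination rules: taint and AST findings coexist, `cfg-*`
/// findings are dropped where a `state-*` finding shares the location,
/// and exact duplicates are removed. Input order is preserved.
fn reconcile(findings: &[Finding]) -> Vec<Finding> {
    // Locations that already carry a state-model finding.
    let state_locations: HashSet<(&str, u32, u32)> = findings
        .iter()
        .filter(|f| f.rule_id.starts_with("state-"))
        .map(|f| (f.file, f.line, f.col))
        .collect();

    let mut seen: HashSet<&Finding> = HashSet::new();
    let mut kept = Vec::new();
    for f in findings {
        let superseded = f.rule_id.starts_with("cfg-")
            && state_locations.contains(&(f.file, f.line, f.col));
        // `insert` returns false for an exact duplicate already kept.
        if !superseded && seen.insert(f) {
            kept.push(f.clone());
        }
    }
    kept
}
```

Because nothing in the sketch compares taint and AST rule IDs, both survive at a shared location, which matches rule 1; their relative order is left to the ranking step.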
+ +## Analysis Modes + +| Mode | CLI flag | Active detectors | +|------|----------|-----------------| +| Full | `--mode full` | All four | +| AST-only | `--mode ast` | AST patterns only | +| CFG/Taint | `--mode cfg` | Taint + CFG + State | + +## Attack-Surface Ranking + +Every finding receives a deterministic **attack-surface score** estimating exploitability. Findings are sorted by descending score. + +### Scoring Formula + +``` +score = severity_base + analysis_kind + evidence_strength + state_bonus - validation_penalty +``` + +| Component | Values | Purpose | +|-----------|--------|---------| +| **Severity base** | High=60, Medium=30, Low=10 | Primary signal | +| **Analysis kind** | taint=+10, state=+8, cfg(with evidence)=+5, cfg(no evidence)=+3, ast=+0 | Confidence of analysis | +| **Evidence strength** | +1 per evidence item (max 4), +2-6 for source kind | Specificity of finding | +| **State bonus** | use-after-close/unauthed=+6, double-close=+3, must-leak=+2, may-leak=+1 | State rule severity | +| **Validation penalty** | -5 if path-validated | Guard reduces exploitability | + +### Source-kind priority + +| Source type | Bonus | Examples | +|-------------|-------|---------| +| User input | +6 | `req.body`, `argv`, `stdin`, `form`, `query`, `params` | +| Environment | +5 | `env::var`, `getenv`, `process.env` | +| Unknown | +4 | Conservative default | +| File system | +3 | `fs::read_to_string`, `fgets` | +| Database | +2 | Query results | + +### Score ranges (approximate) + +| Finding type | Score range | +|-------------|------------| +| High taint + user input | ~76-80 | +| High state (use-after-close) | ~74 | +| High CFG structural | ~63-68 | +| Medium taint + env source | ~45-50 | +| Medium state (resource leak) | ~40 | +| Low AST-only pattern | ~10 | + +Ranking is enabled by default. Disable with `--no-rank` or `output.attack_surface_ranking = false`. 
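
The scoring formula and component tables above can be written out as a small function. The sketch below is an assumption-laden illustration rather than Nyx's actual code: all type and field names are hypothetical, findings without a recorded source kind are assumed to receive no source bonus, and CFG findings are treated as "with evidence" when they carry at least one evidence item.

```rust
// Illustrative types only; Nyx's real data model may differ.
#[derive(Clone, Copy)]
enum Severity { High, Medium, Low }

#[derive(Clone, Copy)]
enum StateRule { UseAfterClose, UnauthedAccess, DoubleClose, MustLeak, MayLeak }

#[derive(Clone, Copy)]
enum Kind { Taint, State(StateRule), Cfg, Ast }

#[derive(Clone, Copy)]
enum SourceKind { UserInput, Environment, Unknown, FileSystem, Database }

struct Finding {
    severity: Severity,
    kind: Kind,
    evidence_items: u32,
    source_kind: Option<SourceKind>, // assumption: None means no source bonus
    path_validated: bool,
}

/// score = severity_base + analysis_kind + evidence_strength
///         + state_bonus - validation_penalty
fn attack_surface_score(f: &Finding) -> i32 {
    let severity_base = match f.severity {
        Severity::High => 60,
        Severity::Medium => 30,
        Severity::Low => 10,
    };
    let analysis_kind = match f.kind {
        Kind::Taint => 10,
        Kind::State(_) => 8,
        Kind::Cfg if f.evidence_items > 0 => 5,
        Kind::Cfg => 3,
        Kind::Ast => 0,
    };
    let source_bonus = match f.source_kind {
        Some(SourceKind::UserInput) => 6,
        Some(SourceKind::Environment) => 5,
        Some(SourceKind::Unknown) => 4,
        Some(SourceKind::FileSystem) => 3,
        Some(SourceKind::Database) => 2,
        None => 0,
    };
    // +1 per evidence item, capped at 4, plus the source-kind bonus.
    let evidence_strength = f.evidence_items.min(4) as i32 + source_bonus;
    let state_bonus = match f.kind {
        Kind::State(StateRule::UseAfterClose | StateRule::UnauthedAccess) => 6,
        Kind::State(StateRule::DoubleClose) => 3,
        Kind::State(StateRule::MustLeak) => 2,
        Kind::State(StateRule::MayLeak) => 1,
        _ => 0,
    };
    let validation_penalty = if f.path_validated { 5 } else { 0 };
    severity_base + analysis_kind + evidence_strength + state_bonus - validation_penalty
}
```

Under these assumptions a High taint finding from user input scores 76 with no evidence items and 80 once the evidence cap is reached, which lines up with the approximate ranges in the table.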
+ +## Two-Pass Architecture + +Nyx's taint analysis requires cross-file context, achieved via two passes: + +1. **Pass 1 — Summary extraction**: Each file is parsed, a CFG is built, and a `FuncSummary` is extracted per function. Summaries capture source/sanitizer/sink capabilities (bitflags), taint propagation behavior, and callee lists. Summaries are persisted to SQLite. + +2. **Pass 2 — Analysis**: All summaries are merged into a global map. Files are re-parsed and analyzed with full cross-file context. The taint engine resolves callees against local summaries (more precise) first, then falls back to global summaries. + +With indexing enabled, Pass 1 skips files whose content hash hasn't changed since the last scan. diff --git a/docs/detectors/cfg.md b/docs/detectors/cfg.md new file mode 100644 index 00000000..5f9415ce --- /dev/null +++ b/docs/detectors/cfg.md @@ -0,0 +1,161 @@ +# CFG Structural Analysis + +## Summary + +Nyx builds an intra-procedural control-flow graph (CFG) for each function and analyzes structural properties: whether sinks are guarded by sanitizers or validators, whether web handlers check authentication, whether resources are released on all exit paths, and whether error-handling code terminates properly. + +These detectors use **dominator analysis** — they check whether a guard node dominates (must execute before) a sink node on the CFG. 
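
To make the dominance check concrete, here is a minimal sketch of the classic iterative dominator computation on a tiny CFG. It is not Nyx's implementation: the function names are hypothetical, nodes are plain indices with node 0 as the entry, the graph is limited to 64 nodes so that dominator sets fit in a `u64` bitmask, and every non-entry node is assumed to be reachable.

```rust
/// Compute dominator sets for a CFG with at most 64 nodes.
/// `preds[n]` lists the predecessors of node `n`; node 0 is the entry.
/// Bit `d` of the result for node `n` is set when `d` dominates `n`.
fn dominators(preds: &[Vec<usize>]) -> Vec<u64> {
    let n = preds.len();
    assert!(n > 0 && n <= 64, "sketch supports 1..=64 nodes");
    let all: u64 = if n == 64 { u64::MAX } else { (1u64 << n) - 1 };

    // Start from "everything dominates everything" and shrink to a fixpoint.
    let mut dom = vec![all; n];
    dom[0] = 1; // the entry node is dominated only by itself

    let mut changed = true;
    while changed {
        changed = false;
        for node in 1..n {
            // A dominator of `node` must dominate every predecessor.
            let mut meet = all;
            for &p in &preds[node] {
                meet &= dom[p];
            }
            let updated = meet | (1u64 << node);
            if updated != dom[node] {
                dom[node] = updated;
                changed = true;
            }
        }
    }
    dom
}

/// True when `guard` must execute before `sink` on every path from entry.
fn dominates(dom: &[u64], guard: usize, sink: usize) -> bool {
    (dom[sink] & (1u64 << guard)) != 0
}
```

In a diamond-shaped function where the guard sits on only one branch, `dominates` returns false for the sink at the join point, which is the situation an unguarded-sink rule would flag.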
+ +## Rule IDs + +| Rule ID | Severity | Description | +|---------|----------|-------------| +| `cfg-unguarded-sink` | High/Medium | Sink reachable without a dominating guard or sanitizer | +| `cfg-auth-gap` | High | Web handler reaches privileged sink without auth check | +| `cfg-unreachable-sink` | Medium | Dangerous function in unreachable code | +| `cfg-unreachable-sanitizer` | Low | Sanitizer in unreachable code | +| `cfg-unreachable-source` | Low | Source in unreachable code | +| `cfg-error-fallthrough` | High/Medium | Error check doesn't terminate; dangerous code follows | +| `cfg-resource-leak` | Medium | Resource acquired but not released on all exit paths | +| `cfg-lock-not-released` | Medium | Lock acquired but not released on all exit paths | + +## What It Detects + +### Unguarded sinks (`cfg-unguarded-sink`) +A sink call (e.g. `system()`, `eval()`, `Command::new()`) is reachable from the function entry without passing through a guard or sanitizer that matches the sink's capability. + +### Auth gaps (`cfg-auth-gap`) +A function identified as a web handler (by parameter naming conventions like `req`, `res`, `ctx`, `request`) reaches a privileged sink (shell execution, file I/O) without a prior call to an authentication function (`is_authenticated`, `require_auth`, `check_permission`, etc.). + +### Unreachable security code (`cfg-unreachable-*`) +Sinks, sanitizers, or sources in dead code branches. This often indicates a refactoring error where security-critical code was accidentally made unreachable. + +### Error fallthrough (`cfg-error-fallthrough`) +An error check (null check, error return check) does not terminate the function or loop back. Execution continues to a dangerous operation on the error path. + +### Resource leaks (`cfg-resource-leak`, `cfg-lock-not-released`) +A resource acquisition call (e.g. `File::open`, `fopen`, `socket`, `Lock`) is not matched by a release call (e.g. `close`, `fclose`, `unlock`) on all exit paths from the function. 
+ +## What It Cannot Detect + +- **Inter-procedural guards**: If authentication is checked in a middleware function that calls this handler, the CFG detector cannot see it. It only analyzes one function at a time. +- **Dynamic dispatch**: Virtual method calls, function pointers, and closures are opaque to the CFG. +- **Complex guard patterns**: Only recognized guard function names are checked. Custom validation logic (e.g. `if password == expected`) is not recognized as a guard. +- **Correct sanitization**: The detector checks that *some* guard dominates the sink, not that the guard is *correct*. A guard that always passes would suppress the finding. +- **Cross-function resource flows**: If a file handle is opened in one function and closed in another, the detector will report a leak in the first function. + +## Common False Positives + +| Scenario | Why it fires | Mitigation | +|----------|-------------|------------| +| Framework-level auth middleware | Handler doesn't call auth directly | Document as expected; suppress with severity filter | +| Resource closed via RAII/defer | Implicit cleanup not visible to CFG | Currently not detected; known limitation | +| Custom guard function name | Function not in the recognized guard list | Add the function name as a sanitizer in config | +| Test handlers | Intentionally skip auth in tests | Default non-prod downgrade reduces severity; or exclude test dirs | + +## Common False Negatives + +| Scenario | Why it's missed | +|----------|----------------| +| Auth in called function | Cross-function guards not tracked | +| Guard via type system | Type-level guarantees (e.g. 
Rust's `AuthenticatedUser` wrapper) not analyzed | +| Resource closed in finally/defer | Some cleanup patterns not recognized | + +## Confidence Signals + +| Signal | Meaning | +|--------|---------| +| **Evidence lists guard nodes** | Shows which guards were checked and found missing | +| **Sink has high capability** | Shell execution or file I/O sinks are higher risk | +| **Handler detection matched** | Web handler identification is based on conventional parameter names | + +## Tuning and Noise Controls + +### Add custom guards/sanitizers + +```toml +[[analysis.languages.python.rules]] +matchers = ["validate_request", "check_csrf"] +kind = "sanitizer" +cap = "all" +``` + +### Add auth rules + +Auth checks are recognized by function name. If your codebase uses non-standard names: + +```toml +[[analysis.languages.javascript.rules]] +matchers = ["ensureLoggedIn", "requirePermission"] +kind = "sanitizer" +cap = "all" +``` + +### Filter results + +```bash +# Skip low-severity unreachable findings +nyx scan . --severity ">=MEDIUM" +``` + +### Disable CFG analysis + +```bash +nyx scan . 
--mode ast # AST patterns only +``` + +## Examples + +### Unguarded sink + +```go +func handler(w http.ResponseWriter, r *http.Request) { + cmd := r.URL.Query().Get("cmd") + exec.Command("sh", "-c", cmd).Run() // cfg-unguarded-sink: no guard dominates +} +``` + +### Auth gap + +```javascript +app.get('/admin/delete', (req, res) => { + // No is_authenticated() call + db.execute("DELETE FROM users WHERE id = " + req.params.id); + // cfg-auth-gap: web handler reaches privileged sink without auth +}); +``` + +### Resource leak + +```c +void process() { + FILE *f = fopen("data.txt", "r"); // acquire + if (error) { + return; // cfg-resource-leak: f not closed on this path + } + fclose(f); +} +``` + +## Guard Rules + +Nyx recognizes these function name patterns as guards: + +| Pattern | Applies to | +|---------|-----------| +| `validate*`, `sanitize*` | All sinks | +| `check_*`, `verify_*`, `assert_*` | All sinks | +| `shell_escape` | Shell execution sinks | +| `html_escape` | HTML/XSS sinks | +| `url_encode` | URL sinks | +| `which` | Shell execution (binary lookup) | + +### Auth rules + +| Pattern | Category | +|---------|----------| +| `is_authenticated`, `require_auth`, `check_permission` | Common | +| `authorize`, `authenticate`, `require_login` | Common | +| `check_auth`, `verify_token`, `validate_token` | Common | +| `middleware.auth`, `auth.required` | Go | +| `isAuthenticated`, `checkPermission`, `hasAuthority`, `hasRole` | Java | diff --git a/docs/detectors/patterns.md b/docs/detectors/patterns.md new file mode 100644 index 00000000..4b4c99f4 --- /dev/null +++ b/docs/detectors/patterns.md @@ -0,0 +1,149 @@ +# AST Pattern Matching + +## Summary + +AST patterns are tree-sitter queries that match specific structural code constructs. They are the simplest and fastest detector family — no dataflow, no CFG, just structural presence. A match means the dangerous construct exists in the code; it does not prove the code is exploitable. 
+ +AST patterns run in `--mode full` and `--mode ast` (where they are the only active detector); they are not active in `--mode cfg`. + +## Rule IDs + +Pattern rule IDs follow the format `<lang>.<category>.<name>`: + +``` +rs.memory.transmute +js.code_exec.eval +py.deser.pickle_loads +c.memory.gets +java.sqli.execute_concat +``` + +See the [Rule Reference](../rules/index.md) for a complete listing per language. + +## Pattern Tiers + +| Tier | Meaning | Examples | +|------|---------|---------| +| **A** | Structural presence alone is high-signal | `gets()`, `eval()`, `pickle.loads()`, `mem::transmute` | +| **B** | Query includes a heuristic guard | SQL `execute` with concatenated arg, `printf(var)` with non-literal format | + +Tier B patterns use additional tree-sitter predicates to reduce false positives. For example, `java.sqli.execute_concat` only fires when `executeQuery()` receives a `binary_expression` (string concatenation) as its argument, not when it receives a literal or parameter placeholder. + +## What It Detects + +### By category + +| Category | What it matches | Example languages | +|----------|----------------|-------------------| +| **CommandExec** | Shell command execution functions | C (`system`), Python (`os.system`), Ruby (backticks) | +| **CodeExec** | Dynamic code evaluation | JS (`eval`, `new Function()`), Python (`exec`), PHP (`eval`) | +| **Deserialization** | Unsafe object deserialization | Java (`readObject`), Python (`pickle.loads`), Ruby (`Marshal.load`) | +| **SqlInjection** | SQL with string concatenation | Java, Go, Python, PHP (Tier B heuristic) | +| **PathTraversal** | File inclusion with variable path | PHP (`include $var`) | +| **Xss** | XSS sink functions | JS (`document.write`, `outerHTML`), Java (`getWriter().print`) | +| **Crypto** | Weak cryptographic algorithms | All languages (`md5`, `sha1`, `Math.random()`) | +| **Secrets** | Hardcoded credentials | Go (variable name matching) | +| **InsecureTransport** | Unencrypted communication | Go (`InsecureSkipVerify`), JS
(`fetch("http://")`) | +| **Reflection** | Dynamic class/method dispatch | Java (`Class.forName`, `Method.invoke`), Ruby (`send`, `constantize`) | +| **MemorySafety** | Memory safety violations | Rust (`transmute`, `unsafe`), C (`gets`, `strcpy`, `sprintf`) | +| **Prototype** | Prototype pollution | JS/TS (`__proto__` assignment) | +| **CodeQuality** | Panic/abort/type-safety issues | Rust (`unwrap`, `panic!`), TS (`as any`) | + +## What It Cannot Detect + +- **Dataflow**: Patterns don't track whether the dangerous function receives tainted input. `eval("hello")` (safe) and `eval(userInput)` (dangerous) both match `js.code_exec.eval`. +- **Context**: Patterns don't understand whether the code is reachable, guarded, or inside a test. +- **Semantics**: `strcpy(dst, src)` always matches — it cannot determine buffer sizes. +- **Indirect calls**: Function pointers, dynamic dispatch, and aliased references are invisible. + +## Common False Positives + +| Scenario | Why it fires | Mitigation | +|----------|-------------|------------| +| `eval()` with a hardcoded string literal | Pattern matches structural presence | Taint analysis won't flag this — use `--mode cfg` for fewer false positives | +| `unsafe` block in Rust with sound justification | All unsafe blocks match | Filter with `--severity ">=MEDIUM"` (unsafe_block is Medium) | +| `.unwrap()` in test code | Acceptable in tests | Default non-prod downgrade reduces severity | +| `md5()` used for checksums (not security) | Pattern doesn't know usage intent | Filter Low severity or add to exclusions | +| SQL concatenation with trusted data | Tier B heuristic can't verify data source | Taint analysis is more precise here | + +## Common False Negatives + +| Scenario | Why it's missed | +|----------|----------------| +| `eval` called via alias (`let e = eval; e(input)`) | Pattern matches the identifier `eval`, not the resolved function | +| Dangerous function in a macro expansion | Tree-sitter parses the macro call, not the 
expansion | +| SQL injection via ORM query builder | No pattern for ORM-specific query building | +| Imported function under different name | `from os import system as s; s(cmd)` — pattern looks for `system` | + +## Confidence Signals + +| Signal | Meaning | +|--------|---------| +| **Tier A** | High confidence — the function itself is dangerous | +| **Tier B** | Moderate confidence — heuristic guard reduces false positives | +| **High severity** | Critical vulnerability class (command exec, deserialization) | +| **Low severity** | Informational (weak crypto, code quality) | +| **Non-prod path** | Finding in test/vendor code — downgraded by default | + +## Tuning and Noise Controls + +### Severity filtering + +```bash +# Skip code-quality and weak-crypto findings +nyx scan . --severity ">=MEDIUM" + +# Only critical findings +nyx scan . --severity HIGH +``` + +### Use taint for precision + +```bash +# Taint-only mode: only report findings with confirmed dataflow +nyx scan . --mode cfg +``` + +### Exclude directories + +```toml +[scanner] +excluded_directories = ["node_modules", "vendor", "generated"] +``` + +## Examples + +### Tier A — structural presence + +**C: Banned function** +```c +char buf[64]; +gets(buf); // c.memory.gets — always dangerous, no safe usage +``` + +**Python: Unsafe deserialization** +```python +import pickle +data = pickle.loads(user_input) # py.deser.pickle_loads +``` + +### Tier B — heuristic-guarded + +**Java: SQL concatenation** +```java +// Fires: concatenated argument +stmt.executeQuery("SELECT * FROM users WHERE id=" + userId); +// java.sqli.execute_concat + +// Does NOT fire: parameterized query +stmt.executeQuery(preparedSql); +``` + +**C: Format string** +```c +// Fires: variable as first argument +printf(user_input); // c.memory.printf_no_fmt + +// Does NOT fire: literal format string +printf("%s", user_input); +``` diff --git a/docs/detectors/state.md b/docs/detectors/state.md new file mode 100644 index 00000000..52500e67 --- 
/dev/null +++ b/docs/detectors/state.md @@ -0,0 +1,204 @@ +# State Model Analysis + +## Summary + +Nyx's state model analysis tracks **resource lifecycle** and **authentication state** through a function using monotone dataflow over bounded lattices. It detects use-after-close bugs, double-close bugs, resource leaks, and unauthenticated access to privileged operations. + +State analysis is **opt-in** — enable it with `scanner.enable_state_analysis = true` in config. It requires `mode = "full"` or `mode = "cfg"`. + +## Rule IDs + +| Rule ID | Severity | Description | +|---------|----------|-------------| +| `state-use-after-close` | High | Variable used after being closed/released | +| `state-double-close` | Medium | Resource closed twice | +| `state-resource-leak` | Medium | Resource opened but never closed (definite) | +| `state-resource-leak-possible` | Low | Resource may not be closed on all paths | +| `state-unauthed-access` | High | Privileged operation reached without authentication | + +## What It Detects + +### Use-after-close (`state-use-after-close`) + +A resource transitions to the CLOSED state (via `close()`, `fclose()`, `disconnect()`, etc.), then a use operation (`read`, `write`, `send`, `recv`, `query`, etc.) is performed on it. + +```c +FILE *f = fopen("data.txt", "r"); +fclose(f); +fread(buf, 1, 100, f); // state-use-after-close +``` + +### Double-close (`state-double-close`) + +A resource is closed twice. This can cause crashes or undefined behavior. + +```python +f = open("data.txt") +f.close() +f.close() # state-double-close +``` + +### Resource leak (`state-resource-leak`) + +A resource is opened but never closed on any path through the function. This is a definite leak. + +```java +FileInputStream fis = new FileInputStream("data.txt"); +process(fis); +// function exits without fis.close() — state-resource-leak +``` + +### Possible resource leak (`state-resource-leak-possible`) + +A resource is closed on some paths but not others. 
+ +```go +f, err := os.Open("data.txt") +if err != nil { + return // f not closed here +} +f.Close() // closed here +// state-resource-leak-possible on the error path +``` + +### Unauthenticated access (`state-unauthed-access`) + +A function identified as a web handler reaches a privileged sink (shell execution, file I/O) without any authentication check on the path. + +A function is identified as a web handler if: +1. Its name starts with `handle_`, `route_`, or `api_` (strong match — sufficient on its own), OR +2. Its name starts with `serve_` or `process_` AND any function in the file has web-like parameter names (`request`, `req`, `ctx`, `res`, `response`, `w`, `writer`, etc., varying by language). + +The function name `main` is explicitly excluded. + +```javascript +app.post('/admin/exec', (req, res) => { + // No auth check + exec(req.body.command); // state-unauthed-access +}); +``` + +## What It Cannot Detect + +- **Cross-function resource management**: Resources opened in one function and closed in another are not tracked. This is the most common source of false positives for leak detection. +- **RAII / defer / try-with-resources**: Implicit cleanup via language-level constructs (Rust's `Drop`, Go's `defer`, Java's try-with-resources, Python's `with`) is not recognized. These patterns will produce false-positive leak findings. +- **Dynamic dispatch**: If `close()` is called through a trait object or interface, it may not be recognized. +- **Authentication via type system**: Rust's type-state pattern (e.g. `AuthenticatedRequest`) is not recognized as an auth check. +- **Complex authorization logic**: Only recognized function name patterns are checked. 
+ +## Common False Positives + +| Scenario | Why it fires | Mitigation | +|----------|-------------|------------| +| RAII / Drop / defer cleanup | Implicit cleanup not visible | Known limitation; filter by severity | +| Resource returned to caller | Ownership transferred, not leaked | Known limitation | +| Framework-managed resources | Web framework manages connection lifecycle | Exclude framework-generated handlers | +| Try-with-resources (Java) | Language construct not parsed | Known limitation | +| Context manager (Python `with`) | Block construct not tracked | Known limitation | + +## Common False Negatives + +| Scenario | Why it's missed | +|----------|----------------| +| Resource closed in helper function | Cross-function tracking not implemented | +| Auth in middleware | Auth check happens before handler is called | +| Double-close via aliased reference | Alias analysis not performed | + +## Confidence Signals + +| Signal | Meaning | +|--------|---------| +| **Definite leak (state-resource-leak)** | Resource is never closed on any path — high confidence | +| **Use-after-close** | Read/write operation after explicit close — high confidence | +| **Web handler detected** | Entry point matched by parameter naming convention | +| **Possible leak (state-resource-leak-possible)** | Resource closed on some but not all paths — lower confidence | + +## Tuning and Noise Controls + +### Enable state analysis + +```toml +[scanner] +enable_state_analysis = true +``` + +### Severity filtering + +```bash +# Skip possible-leak findings (Low severity) +nyx scan . 
--severity ">=MEDIUM" +``` + +### Exclude test files + +```toml +[scanner] +excluded_directories = ["tests", "test", "spec"] +``` + +## Resource Pairs + +The state engine recognizes these acquire/release pairs per language: + +### C/C++ +| Acquire | Release | Resource | +|---------|---------|----------| +| `fopen` | `fclose` | File handle | +| `open` | `close` | File descriptor | +| `socket` | `close` | Socket | +| `malloc`, `calloc`, `realloc` | `free` | Heap memory | +| `pthread_mutex_lock` | `pthread_mutex_unlock` | Mutex | + +### Rust +| Acquire | Release | Resource | +|---------|---------|----------| +| `File::open`, `File::create` | `drop`, `close` | File handle | +| `TcpStream::connect` | `shutdown` | TCP connection | +| `lock`, `read`, `write` (on Mutex/RwLock) | `drop` | Lock guard | + +### Java +| Acquire | Release | Resource | +|---------|---------|----------| +| `new FileInputStream` | `close` | File stream | +| `getConnection` | `close` | DB connection | +| `new Socket` | `close` | Socket | + +### Go, Python, JavaScript, Ruby, PHP +Similar patterns with language-specific function names. + +## Use Patterns (Trigger use-after-close) + +The following operations on a closed resource trigger `state-use-after-close`: + +``` +read, write, send, recv, fread, fwrite, fgets, fputs, fprintf, fscanf, +fflush, fseek, ftell, rewind, feof, ferror, fgetc, fputc, getc, putc, +ungetc, query, execute, fetch, sendto, recvfrom, ioctl, fcntl, +strcpy, strncpy, strcat, strncat, memcpy, memmove, memset, memcmp, +strcmp, strncmp, strlen, sprintf, snprintf +``` + +## Technical Details + +### Resource Lifecycle Lattice + +``` +UNINIT → OPEN → CLOSED + → MOVED +``` + +States are tracked as bitflags, allowing the lattice to represent uncertainty (e.g. OPEN|CLOSED means the resource is open on some paths and closed on others). 
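
A small sketch of how a bitflag encoding can represent merged path states. The flag values and the helpers `join` and `classify_at_exit` are hypothetical; only the state names come from the lattice above, and the mapping to rule IDs follows the definite versus possible leak descriptions earlier in this document rather than Nyx's source.

```python
from enum import Flag, auto


class ResState(Flag):
    UNINIT = auto()
    OPEN = auto()
    CLOSED = auto()
    MOVED = auto()


def join(a, b):
    """At a CFG merge point, keep every state either incoming path allows."""
    return a | b


def classify_at_exit(state):
    """Map the merged state at function exit to a leak rule ID (assumed)."""
    if state == ResState.OPEN:
        return "state-resource-leak"            # open on every path
    if ResState.OPEN in state:
        return "state-resource-leak-possible"   # open on some paths only
    return None                                 # closed, moved, or never opened


# One branch closes the resource, the other returns early with it still open.
merged = join(ResState.OPEN, ResState.CLOSED)
```

Because the join is a bitwise OR over a fixed set of flags, repeated merging can only add bits, which is what keeps a dataflow analysis of this shape monotone.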
+ +### Leak Detection Scope + +Resource leaks are checked at the file-level exit node and the **synthesized** function exit node (a single Return node that all early returns feed into). Early-return nodes are **not** checked individually; only the merged state at the function's synthesized exit is inspected. This prevents duplicate findings where an early-return path reports a definite leak while the merged exit correctly reports a possible leak. + +This per-function exit inspection ensures that a variable leaked inside one function is not masked by a same-named variable that is properly closed in a subsequent function. + +### Auth Level Lattice + +``` +Unauthed < Authed < Admin +``` + +Join semantics: take the minimum (conservative). If any path is unauthenticated, the result is unauthenticated. diff --git a/docs/detectors/taint.md b/docs/detectors/taint.md new file mode 100644 index 00000000..ffbf5043 --- /dev/null +++ b/docs/detectors/taint.md @@ -0,0 +1,202 @@ +# Taint Analysis + +## Summary + +Nyx's taint analysis tracks the flow of untrusted data from **sources** (where data enters the program) through **assignments and function calls** to **sinks** (where dangerous operations happen). If the data reaches a sink without passing through a **sanitizer** with matching capabilities, a finding is emitted. + +The engine uses a monotone forward dataflow analysis over a finite lattice with guaranteed termination. Analysis is **intra-procedural with cross-file function summaries**: it does not follow calls into other functions but uses pre-computed summaries of their behavior. + +## Rule ID + +``` +taint-unsanitised-flow (source <line>:<col>) +``` + +One rule ID covers all taint findings. The parenthetical identifies the source location as line and column, e.g. `(source 5:15)`.
+ +## What It Detects + +- Environment variables flowing to shell execution (`env::var` → `Command::new`) +- User input flowing to code evaluation (`req.body` → `eval()`) +- File contents flowing to SQL queries (`fs::read_to_string` → `db.execute()`) +- Request parameters flowing to HTML output (`req.query` → `innerHTML`) +- Any source-to-sink flow where the sink's required capability is not stripped by a sanitizer + +## What It Cannot Detect + +- **Inter-procedural flows without summaries**: If a function isn't summarized (e.g. from a third-party library without source), the taint engine cannot track data through it. It conservatively treats unknown callees as neither propagating nor sanitizing. +- **Flows through data structures**: Taint is tracked per-variable, not per-field. `obj.field = tainted; sink(obj.other_field)` may produce a false positive because taint attaches to `obj` as a whole. +- **Aliasing**: `let y = &x; sink(*y)` — the engine tracks `y` as a fresh variable, not an alias of `x`. This can cause false negatives. +- **Complex control flow**: The analysis is flow-sensitive (respects control flow within a function) but does not track taint through arbitrary loops with complex exit conditions. +- **Implicit flows**: Taint only follows explicit data flow, not information flow through branching (e.g. `if (secret) { x = 1 } else { x = 0 }` does not taint `x`). 
+ +## Common False Positives + +| Scenario | Why it happens | Mitigation | +|----------|---------------|------------| +| Custom sanitizer not recognized | Nyx only knows built-in and configured sanitizers | Add a custom sanitizer rule in config | +| Taint through struct fields | Variable-level (not field-level) tracking | No current mitigation; field sensitivity is planned | +| Dead code paths | The engine is path-insensitive within a function (it considers all paths) | Contradiction pruning catches some cases; path-validated findings score lower | +| Library wrappers | A wrapper around a dangerous function may re-introduce taint that was sanitized by the wrapper | Summarize the wrapper function or add it as a sanitizer | + +## Common False Negatives + +| Scenario | Why it's missed | +|----------|----------------| +| Third-party library calls | No summary available; callee treated as opaque | +| Taint through global/static variables | Not tracked across function boundaries | +| Taint through closures/callbacks in some languages | Closure capture analysis is limited (JS/TS/Ruby/Go anonymous functions ARE analyzed) | +| Flows spanning more than two files | Summary approximation loses precision at depth | + +## Confidence Signals + +These signals in the output indicate higher-confidence findings: + +| Signal | What it means | +|--------|--------------| +| **Evidence: Source + Sink** | Both endpoints identified with specific function names and locations | +| **Source kind = user input** | Source is directly controllable by an attacker (req.body, argv, etc.) | +| **path_validated = false** | No validation guard on the path — higher exploitability | +| **No guard_kind** | No dominating predicate check (null check, error check, etc.) 
| +| **High rank_score** | Multiple confidence signals combined | + +Lower-confidence: + +| Signal | What it means | +|--------|--------------| +| **path_validated = true** | A validation predicate guards the path — may not be exploitable | +| **guard_kind = "ValidationCall"** | An explicit validation function was called before the sink | +| **Source kind = database** | Data from DB — may already be validated at insertion time | + +## Tuning and Noise Controls + +### Add custom sanitizers + +If your codebase has a custom sanitizer that Nyx doesn't recognize: + +```toml +# nyx.local +[[analysis.languages.javascript.rules]] +matchers = ["escapeHtml", "sanitizeInput"] +kind = "sanitizer" +cap = "html_escape" +``` + +Or via CLI: +```bash +nyx config add-rule --lang javascript --matcher escapeHtml --kind sanitizer --cap html_escape +``` + +### Filter by severity + +```bash +nyx scan . --severity HIGH # Only high-severity taint findings +nyx scan . --severity ">=MEDIUM" # Skip low-severity +``` + +### Skip non-production code + +By default, findings in `tests/`, `vendor/`, `build/` paths are downgraded one severity tier. To exclude them entirely, add to config: + +```toml +[scanner] +excluded_directories = ["tests", "vendor", "build", "examples"] +``` + +### Disable taint (AST-only mode) + +```bash +nyx scan . 
--mode ast +``` + +## Example + +**Vulnerable code** (Rust): +```rust +use std::env; +use std::process::Command; + +fn main() { + let cmd = env::var("USER_CMD").unwrap(); // line 5: source + Command::new("sh").arg("-c").arg(&cmd).output(); // line 6: sink +} +``` + +**Finding**: +``` +[HIGH] taint-unsanitised-flow (source 5:15) src/main.rs:6:5 + Source: env::var("USER_CMD") at 5:15 + Sink: Command::new("sh").arg("-c") + Score: 76 +``` + +**Safe alternative**: +```rust +use std::env; +use std::process::Command; + +fn main() { + let cmd = env::var("USER_CMD").unwrap(); + // Use the value as a direct argument, not a shell command + Command::new(&cmd).output(); + // Or validate against an allowlist +} +``` + +## Technical Details + +### Capability System + +Taint uses a bitflag capability system to match sources with appropriate sanitizers and sinks: + +| Capability | Bit | Sources | Sanitizers | Sinks | +|-----------|-----|---------|------------|-------| +| `ENV_VAR` | 0x01 | `env::var`, `getenv` | — | — | +| `HTML_ESCAPE` | 0x02 | — | `html_escape`, `DOMPurify.sanitize` | `innerHTML`, `document.write` | +| `SHELL_ESCAPE` | 0x04 | — | `shell_escape` | `Command::new`, `system()`, `eval()` | +| `URL_ENCODE` | 0x08 | — | `encodeURIComponent` | `location.href` | +| `JSON_PARSE` | 0x10 | — | `JSON.parse` | — | +| `FILE_IO` | 0x20 | — | `filepath.Clean`, `basename`, `os.path.realpath` | `fopen`, `open`, `send_file`, `fs::read_to_string` | +| `FMT_STRING` | 0x40 | — | — | `printf(var)` | + +Sources typically use `Cap::all()` to match any sink. A sanitizer strips specific capability bits. A finding fires when a tainted variable reaches a sink and the taint still has the matching capability bit set. 
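
The capability matching described above can be sketched with plain bit operations. The bit values are copied from the table; the `Cap` class, `ALL_CAPS`, and the helper names are illustrative assumptions, not Nyx's Rust types.

```python
from enum import IntFlag


class Cap(IntFlag):
    ENV_VAR = 0x01
    HTML_ESCAPE = 0x02
    SHELL_ESCAPE = 0x04
    URL_ENCODE = 0x08
    JSON_PARSE = 0x10
    FILE_IO = 0x20
    FMT_STRING = 0x40


# Stand-in for the documented Cap::all(): every bit set.
ALL_CAPS = Cap(0x7F)


def sanitize(taint, sanitizer_cap):
    """A sanitizer strips its own capability bits from the taint."""
    return taint & ~sanitizer_cap


def sink_fires(taint, sink_cap):
    """A finding fires if the taint still carries the sink's capability bit."""
    return bool(taint & sink_cap)


# A source taints a value with all capabilities, then html_escape is applied.
after_html_escape = sanitize(ALL_CAPS, Cap.HTML_ESCAPE)
```

In this sketch the HTML-escaped value is safe for an `innerHTML`-style sink but would still fire at a shell sink, since only the `HTML_ESCAPE` bit was stripped.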
+ +### Nested Function Analysis + +The CFG builder recursively discovers function expressions nested inside call arguments: + +- **JavaScript/TypeScript**: `function_expression`, `arrow_function` inside call arguments (e.g., Express route handlers) +- **Ruby**: `do_block` and `block` nodes (e.g., Sinatra `get '/path' do...end`) +- **Go**: `func_literal` (anonymous function literals) + +Each nested function is walked as a separate scope and receives a unique identifier (``) to prevent collisions when multiple anonymous functions exist in the same file. + +### Chained Call Classification + +Method chains like `r.URL.Query().Get("host")` are normalized by stripping internal `()` segments between `.` separators. The classifier matches against both the original text and the normalized form, enabling rules like `r.URL` to match within `r.URL.Query.Get`. + +### Nested Call Fallback + +When the outermost call in an expression doesn't classify as a source/sink, the engine tries all nested inner calls. This handles patterns like `str(eval(expr))` where `str` is not a sink but the inner `eval` is. + +### Rust `if let` / `while let` Pattern Bindings + +The CFG builder recognizes Rust `let_condition` nodes inside `if` and `while` expressions. The value expression is classified for source/sink labels, and the pattern binding is extracted as a variable definition: + +```rust +if let Ok(cmd) = env::var("CMD") { + // cmd is tainted — env::var is a source, cmd is the binding + Command::new("sh").arg("-c").arg(&cmd).output(); // taint-unsanitised-flow +} +``` + +This also works for `while let` patterns. + +### JS/TS Two-Level Solve + +For JavaScript and TypeScript, taint analysis uses a two-level approach: + +1. **Level 1**: Solve top-level code (module scope) +2. **Level 2**: Solve each function seeded with the converged top-level state + +This prevents false positives from cross-function taint leakage while preserving global-to-function flows. 
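
Returning to the chain normalization described under Chained Call Classification, the sketch below shows one way such stripping could work. It assumes that every parenthesized segment is dropped (including the trailing argument list, consistent with the `r.URL.Query.Get` example) and that rules match by substring; both are assumptions, and the function names are hypothetical.

```python
def normalize_chain(text):
    """Drop every parenthesized segment: 'a.b().c("x")' becomes 'a.b.c'.

    Simplification: parentheses inside string literals are not handled.
    """
    out = []
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def rule_matches(rule, call_text):
    """Check a rule against both the original and the normalized call text."""
    return rule in call_text or rule in normalize_chain(call_text)


normalized = normalize_chain('r.URL.Query().Get("host")')
```

Matching against both forms means a rule written without call parentheses, such as `Query.Get`, can still hit a chain that contains `Query().Get(...)` in the source.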
diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 00000000..fff14f10 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,32 @@ +# Nyx Documentation + +Welcome to the Nyx documentation. Nyx is a multi-language static vulnerability scanner built in Rust. + +## User Guide + +- [Installation](installation.md) — Install via cargo, prebuilt binaries, or from source +- [Quick Start](quickstart.md) — Your first scan in 60 seconds +- [CLI Reference](cli.md) — Every flag, subcommand, and option +- [Configuration](configuration.md) — Config file schema, precedence, custom rules +- [Output Formats](output.md) — Console, JSON, SARIF; exit codes; evidence fields + +## Detector Reference + +- [Detector Overview](detectors.md) — How the four detector families work together +- [Taint Analysis](detectors/taint.md) — Cross-file source-to-sink dataflow tracking +- [CFG Structural Analysis](detectors/cfg.md) — Auth gaps, unguarded sinks, resource leaks +- [State Model Analysis](detectors/state.md) — Resource lifecycle and authentication state +- [AST Patterns](detectors/patterns.md) — Tree-sitter structural pattern matching + +## Rule Reference + +- [Rule Index](rules/index.md) — How rules are organized +- [Rust](rules/rust.md) | [C](rules/c.md) | [C++](rules/cpp.md) | [Java](rules/java.md) | [Go](rules/go.md) +- [JavaScript](rules/javascript.md) | [TypeScript](rules/typescript.md) | [Python](rules/python.md) +- [PHP](rules/php.md) | [Ruby](rules/ruby.md) + +## Contributing + +- [Contributing Guide](../CONTRIBUTING.md) — Development setup, adding rules, PR guidelines +- [Security Policy](../SECURITY.md) — Responsible disclosure +- [Code of Conduct](../CODE_OF_CONDUCT.md) diff --git a/docs/installation.md b/docs/installation.md new file mode 100644 index 00000000..b73eb73a --- /dev/null +++ b/docs/installation.md @@ -0,0 +1,76 @@ +# Installation + +## Install from crates.io + +```bash +cargo install nyx-scanner +``` + +This installs the `nyx` binary into `~/.cargo/bin/`. 
+ +## Install from GitHub releases + +1. Go to the [Releases](https://github.com/elicpeter/nyx/releases) page. +2. Download the binary for your platform: + + | Platform | Archive | + |----------|---------| + | Linux x86_64 | `nyx-x86_64-unknown-linux-gnu.zip` | + | macOS Intel | `nyx-x86_64-apple-darwin.zip` | + | macOS Apple Silicon | `nyx-aarch64-apple-darwin.zip` | + | Windows x86_64 | `nyx-x86_64-pc-windows-msvc.zip` | + +3. Extract and install: + + ```bash + # Linux / macOS + unzip nyx-*.zip + chmod +x nyx + sudo mv nyx /usr/local/bin/ + + # Windows (PowerShell) + Expand-Archive -Path nyx-*.zip -DestinationPath . + Move-Item -Path .\nyx.exe -Destination "C:\Program Files\Nyx\" + ``` + +4. Verify: + ```bash + nyx --version + ``` + +## Build from source + +```bash +git clone https://github.com/elicpeter/nyx.git +cd nyx +cargo build --release +cargo install --path . +``` + +Requires **Rust 1.85+** (edition 2024). + +## CI Integration + +### GitHub Actions + +```yaml +- name: Install Nyx + run: cargo install nyx-scanner + +- name: Run security scan + run: nyx scan . --format sarif --fail-on medium > results.sarif + +- name: Upload SARIF + uses: github/codeql-action/upload-sarif@v3 + with: + sarif_file: results.sarif +``` + +### Generic CI + +```bash +# Fail the build if any High or Medium finding is detected +nyx scan . --severity ">=MEDIUM" --fail-on medium --quiet --format json +``` + +The `--fail-on` flag causes Nyx to exit with code **1** if any finding meets or exceeds the given severity. Exit code **0** means no findings matched. diff --git a/docs/output.md b/docs/output.md new file mode 100644 index 00000000..091432e5 --- /dev/null +++ b/docs/output.md @@ -0,0 +1,315 @@ +# Output Formats + +Nyx supports three output formats, selected with `--format` or `output.default_format` in config. + +## Console (default) + +Human-readable, color-coded output to stdout. Status messages go to stderr. 
+ +``` +[HIGH] taint-unsanitised-flow (source 5:11) src/handler.rs:12:5 (Score: 76, Confidence: High) + Source: env::var("CMD") → Command::new("sh").arg("-c") + +[MEDIUM] cfg-unguarded-sink src/handler.rs:12:5 (Score: 35, Confidence: Medium) + +[LOW] rs.quality.unwrap src/lib.rs:88:5 (Score: 10, Confidence: High) +``` + +### Severity indicators + +| Tag | Color | Meaning | +|-----|-------|---------| +| `[HIGH]` | Red, bold | Critical — likely exploitable | +| `[MEDIUM]` | Orange, bold | Important — may be exploitable | +| `[LOW]` | Muted blue-gray | Informational — code quality or weak signal | + +### Evidence fields + +Taint and state findings include structured evidence: + +| Label | Meaning | +|-------|---------| +| **Source** | Where tainted data originated (function name + location) | +| **Sink** | Where the dangerous operation happens | +| **Path guard** | Type of validation predicate protecting the path | + +### Score + +When attack-surface ranking is enabled (default), each finding shows a `Score` value. Higher scores indicate greater exploitability. See [Detector Overview](detectors.md) for the scoring formula. + +### Rollup findings + +High-frequency LOW Quality findings (e.g. `rs.quality.unwrap`) are grouped into rollup findings by `(file, rule)`: + +``` + 21:10 ● [LOW] rs.quality.unwrap + rs.quality.unwrap (38 occurrences) + Examples: 21:10, 50:10, 79:10, 105:10, 134:10 + Run: nyx scan --show-instances rs.quality.unwrap +``` + +Rollups count as **one finding** for LOW budget enforcement. Use `--show-instances ` to expand a specific rule or `--all` to disable rollups entirely. + +### Suppression footer + +When findings are suppressed by the prioritization pipeline, a footer is shown: + +``` +Suppressed 195 LOW/Quality findings. +Active filters: + include_quality = false + max_low = 20 + max_low_per_file = 1 + max_low_per_rule = 10 + +Use --include-quality, --max-low, or --all to adjust. +``` + +--- + +## JSON + +Machine-readable JSON array. 
Each finding is an object: + +```json +[ + { + "path": "src/handler.rs", + "line": 12, + "col": 5, + "severity": "High", + "id": "taint-unsanitised-flow (source 5:11)", + "path_validated": false, + "labels": [ + ["Source", "env::var(\"CMD\") at 5:11"], + ["Sink", "Command::new(\"sh\").arg(\"-c\")"] + ], + "confidence": "High", + "evidence": { + "source": { + "path": "src/handler.rs", + "line": 5, + "col": 11, + "kind": "source", + "snippet": "env::var(\"CMD\")" + }, + "sink": { + "path": "src/handler.rs", + "line": 12, + "col": 5, + "kind": "sink", + "snippet": "Command::new(\"sh\")" + }, + "notes": ["source_kind:EnvironmentConfig"] + }, + "rank_score": 76.0, + "rank_reason": [ + ["severity_base", "60"], + ["analysis_kind", "10"], + ["source_kind", "5"], + ["evidence_count", "1"] + ] + } +] +``` + +### Field descriptions + +| Field | Type | Always present | Description | +|-------|------|----------------|-------------| +| `path` | string | yes | File path relative to scan root | +| `line` | int | yes | 1-indexed line number | +| `col` | int | yes | 1-indexed column number | +| `severity` | string | yes | `"High"`, `"Medium"`, or `"Low"` | +| `id` | string | yes | Rule ID | +| `category` | string | yes | Finding category: `"Security"`, `"Reliability"`, or `"Quality"` | +| `path_validated` | bool | no | True if guarded by validation predicate | +| `guard_kind` | string | no | Predicate type (e.g. 
`"NullCheck"`, `"ValidationCall"`) | +| `message` | string | no | Human-readable context (state analysis findings) | +| `labels` | array | no | Array of `[label, value]` pairs for console display | +| `confidence` | string | no | Confidence level: `"Low"`, `"Medium"`, or `"High"` | +| `evidence` | object | no | Structured evidence (source/sink spans, state, notes) | +| `rank_score` | float | no | Attack-surface score (omitted when ranking disabled) | +| `rank_reason` | array | no | Score breakdown (omitted when ranking disabled) | +| `rollup` | object | no | Rollup data when findings are grouped (see below) | + +Fields marked "no" are omitted when empty/null/false to keep output compact. + +### Confidence levels + +| Level | Meaning | +|-------|---------| +| `High` | Strong signal — taint-confirmed flow, definite state violation | +| `Medium` | Moderate signal — resource leak, path-validated taint, CFG structural | +| `Low` | Weak signal — AST pattern match, possible resource leak, degraded analysis | + +### Evidence object + +The `evidence` field provides structured provenance data: + +| Field | Type | Description | +|-------|------|-------------| +| `source` | object | Source span (path, line, col, kind, snippet) | +| `sink` | object | Sink span (path, line, col, kind, snippet) | +| `guards` | array | Validation guard spans | +| `sanitizers` | array | Sanitizer spans | +| `state` | object | State-machine evidence (machine, subject, from_state, to_state) | +| `notes` | array | Free-form notes (e.g. `"source_kind:UserInput"`, `"path_validated"`) | + +All fields are omitted when empty/null. 
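Because optional fields are omitted rather than emitted as `null` or `false`, a consumer should read them with defaults instead of indexing directly. The following is a minimal Python sketch of that pattern, not part of Nyx itself; the helper name `summarize` and the `sample` data are hypothetical and only shaped like the documented output.

```python
import json
from collections import Counter


def summarize(findings):
    """Count findings per severity and pick out a few optional fields.

    Optional fields (path_validated, rank_score, ...) may be absent,
    so they are read with .get() and an explicit default.
    """
    by_severity = Counter(f["severity"] for f in findings)
    validated = [f["id"] for f in findings if f.get("path_validated", False)]
    # Findings without a rank_score sort as 0.0 (assumption for this sketch).
    top = max(findings, key=lambda f: f.get("rank_score", 0.0), default=None)
    return by_severity, validated, top


# Hypothetical sample shaped like the documented output, not real scan data.
sample = json.loads("""
[
  {"path": "src/handler.rs", "line": 12, "col": 5, "severity": "High",
   "id": "taint-unsanitised-flow (source 5:11)", "category": "Security",
   "confidence": "High", "rank_score": 76.0},
  {"path": "src/lib.rs", "line": 88, "col": 5, "severity": "Low",
   "id": "rs.quality.unwrap", "category": "Quality"}
]
""")

by_severity, validated, top = summarize(sample)
print(dict(by_severity), validated, top["id"])
```

Reading with `.get()` keeps the script working whether or not ranking is enabled, since `rank_score` and `rank_reason` disappear from the output when it is off.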
+ +### Rollup object + +When a finding is a rollup (grouped from multiple occurrences), the `rollup` field is present: + +```json +{ + "rollup": { + "count": 38, + "occurrences": [ + { "line": 21, "col": 10 }, + { "line": 50, "col": 10 }, + { "line": 79, "col": 10 } + ] + } +} +``` + +| Field | Type | Description | +|-------|------|-------------| +| `count` | int | Total number of occurrences | +| `occurrences` | array | First N example locations (controlled by `rollup_examples`) | + +--- + +## SARIF (Static Analysis Results Interchange Format) + +SARIF 2.1.0 JSON, suitable for GitHub Code Scanning and other SARIF-compatible tools. + +```bash +nyx scan . --format sarif > results.sarif +``` + +The SARIF output includes: + +- **Tool metadata** — Nyx name and version +- **Rules** — Rule ID, description, severity mapping +- **Results** — One result per finding with location, message, and properties +- **Properties** — Each result includes `category` and optionally `confidence` and `rollup.count` +- **Related locations** — Rollup findings include example locations in `relatedLocations` +- **Artifacts** — File paths referenced by findings + +### GitHub Code Scanning integration + +```yaml +- name: Run Nyx + run: nyx scan . --format sarif > results.sarif + +- name: Upload SARIF + uses: github/codeql-action/upload-sarif@v3 + with: + sarif_file: results.sarif +``` + +--- + +## Exit Codes + +| Code | Meaning | +|------|---------| +| `0` | Scan completed successfully; no findings matched `--fail-on` threshold | +| `1` | `--fail-on` threshold breached (at least one finding meets or exceeds the specified severity) | +| Non-zero | Error (I/O, config, database, parse error) | + +Without `--fail-on`, Nyx always exits `0` on a successful scan regardless of findings count. 
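The exit-code table can be folded into a small CI helper. The sketch below is illustrative only: `classify_exit` is a hypothetical name, and it assumes that any non-zero code other than a `--fail-on` breach signals an error, which the table implies but does not spell out.

```python
def classify_exit(code: int, fail_on_set: bool) -> str:
    """Map a Nyx exit code to a coarse outcome, following the table above."""
    if code == 0:
        return "ok"
    if code == 1 and fail_on_set:
        return "threshold-breached"
    # Assumption: every other non-zero code (or 1 without --fail-on) is an error.
    return "error"


for code, fail_on in [(0, True), (1, True), (1, False), (2, True)]:
    print(code, fail_on, classify_exit(code, fail_on))
```

A wrapper like this lets a pipeline distinguish "findings above the threshold" from "the scan itself failed" before deciding whether to retry or block a merge.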
+ +--- + +## Severity Levels + +| Level | Description | Typical rules | +|-------|-------------|---------------| +| **High** | Critical vulnerabilities — likely exploitable | Command injection, unsafe deserialization, banned C functions, taint-confirmed flows with user input sources | +| **Medium** | Important issues — may be exploitable with additional context | SQL concatenation, XSS sinks, reflection, unguarded sinks, resource leaks | +| **Low** | Informational — code quality or weak signals | Weak crypto algorithms, insecure randomness, `unwrap()`/`panic!()`, type-safety escapes | + +### Non-production severity downgrade + +By default, findings in paths matching common non-production patterns (`tests/`, `test/`, `vendor/`, `build/`, `examples/`, `benchmarks/`) are downgraded by one tier: + +- High → Medium +- Medium → Low +- Low → Low (unchanged) + +Use `--keep-nonprod-severity` to disable this behavior. + +--- + +## Inline Suppressions + +Suppress specific findings directly in source code using `nyx:ignore` comments. Suppressed findings are excluded from output, severity counts, and `--fail-on` checks by default. + +### Comment syntax + +| Language | Comment styles | +|----------|---------------| +| Rust, C, C++, Java, Go, JS, TS | `// nyx:ignore ...` or `/* nyx:ignore ... */` | +| Python, Ruby | `# nyx:ignore ...` | +| PHP | `// nyx:ignore ...`, `# nyx:ignore ...`, or `/* nyx:ignore ... */` | + +### Directive forms + +```python +x = dangerous() # nyx:ignore taint-unsanitised-flow ← suppresses this line +# nyx:ignore-next-line taint-unsanitised-flow +x = dangerous() ← suppresses this line +``` + +- `nyx:ignore ` — suppresses findings on the **same line** as the comment. +- `nyx:ignore-next-line ` — suppresses findings on the **next line**. +- For taint findings, the primary line is the **sink line** (the `line` field in output). + +### Rule ID matching + +- **Case-sensitive**, exact match after canonicalization. 
+- Comma-separated: `nyx:ignore rule-a, rule-b` +- Wildcard suffix: `nyx:ignore rs.quality.*` matches any ID starting with `rs.quality.` +- Taint IDs are canonicalized: `nyx:ignore taint-unsanitised-flow` matches `taint-unsanitised-flow (source 5:1)` (parenthetical suffix stripped). + +### Console behavior + +- **Default**: suppressed findings are hidden entirely. +- **`--show-suppressed`**: suppressed findings appear dimmed with `[SUPPRESSED]` tag. Summary shows `"N issues (M suppressed)"`. + +### JSON / SARIF behavior + +- **Default**: suppressed findings are excluded from JSON/SARIF output. +- **`--show-suppressed`**: suppressed findings are included with additional fields: + +```json +{ + "suppressed": true, + "suppression": { + "kind": "SameLine", + "matched_pattern": "taint-unsanitised-flow", + "directive_line": 42 + } +} +``` + +### Exit code + +Suppressed findings do **not** trigger `--fail-on`. A scan with only suppressed findings exits `0`. + +--- + +## Rule ID Format + +| Prefix | Detector | Example | +|--------|----------|---------| +| `taint-*` | Taint analysis | `taint-unsanitised-flow (source 5:11)` | +| `cfg-*` | CFG structural | `cfg-unguarded-sink`, `cfg-auth-gap` | +| `state-*` | State model | `state-use-after-close`, `state-resource-leak` | +| `.*.*` | AST patterns | `rs.memory.transmute`, `js.code_exec.eval` | + +See the [Rule Reference](rules/index.md) for a complete listing. diff --git a/docs/quickstart.md b/docs/quickstart.md new file mode 100644 index 00000000..69ddf371 --- /dev/null +++ b/docs/quickstart.md @@ -0,0 +1,103 @@ +# Quick Start + +## Your first scan + +```bash +# Scan the current directory +nyx scan + +# Scan a specific path +nyx scan ./my-project +``` + +Nyx automatically creates an SQLite index on first run. Subsequent scans skip unchanged files. 
+ +## Understanding the output + +A typical console output looks like: + +``` +[HIGH] taint-unsanitised-flow (source 5:11) src/handler.rs:12:5 + Source: env::var("CMD") at 5:11 + Sink: Command::new("sh").arg("-c") + Score: 76 + +[MEDIUM] cfg-unguarded-sink src/handler.rs:12:5 + Score: 35 + +[MEDIUM] rs.quality.unsafe_block src/lib.rs:44:5 + Score: 30 +``` + +Each finding shows: + +| Field | Meaning | +|-------|---------| +| **Severity tag** | `[HIGH]`, `[MEDIUM]`, or `[LOW]` | +| **Rule ID** | Identifies the detector and specific rule | +| **Location** | `file:line:col` | +| **Evidence** | Source, Sink, and guard details (taint findings only) | +| **Score** | Attack-surface ranking score (higher = more exploitable) | + +## Common workflows + +### CI gate — fail on high-severity findings + +```bash +nyx scan . --fail-on high --quiet +# Exit code 1 if any HIGH finding exists, 0 otherwise +``` + +### Export for tooling + +```bash +# JSON for scripting +nyx scan . --format json > findings.json + +# SARIF for GitHub Code Scanning +nyx scan . --format sarif > results.sarif +``` + +### Fast structural scan (no dataflow) + +```bash +nyx scan . --mode ast +``` + +AST-only mode runs tree-sitter pattern queries without building CFGs or running taint analysis. Much faster, but misses dataflow vulnerabilities. + +### Filter by severity + +```bash +# Only high-severity +nyx scan . --severity HIGH + +# High and medium +nyx scan . --severity ">=MEDIUM" + +# Specific set +nyx scan . --severity "HIGH,MEDIUM" +``` + +### Skip the index + +```bash +nyx scan . --index off +``` + +Useful for one-off scans or when you don't want to write to disk. + +### Scan without non-production noise + +By default, findings in test/vendor/build paths are downgraded one severity tier. To keep original severity: + +```bash +nyx scan . 
--keep-nonprod-severity +``` + +## Next steps + +- [CLI Reference](cli.md) — All flags and options +- [Configuration](configuration.md) — Customize rules, exclusions, and behavior +- [Detector Overview](detectors.md) — How the analysis engines work +- [Rule Reference](rules/index.md) — Browse all rules by language diff --git a/docs/rules/c.md b/docs/rules/c.md new file mode 100644 index 00000000..e66f706b --- /dev/null +++ b/docs/rules/c.md @@ -0,0 +1,89 @@ +# C Rules + +Nyx detects C vulnerabilities through AST patterns (banned functions, format strings) and taint analysis (user input → shell execution, buffer overflow sinks). + +## Taint Sources + +| Function | Capability | Source Kind | +|----------|-----------|-------------| +| `getenv` | `all` | EnvironmentConfig | +| `fgets`, `scanf`, `fscanf`, `gets`, `read` | `all` | UserInput | + +## Taint Sinks + +| Function | Required Capability | +|----------|-------------------| +| `system`, `popen`, `exec*` family | `SHELL_ESCAPE` | +| `sprintf`, `strcpy`, `strcat` | `HTML_ESCAPE` | +| `printf`, `fprintf` | `FMT_STRING` | +| `fopen`, `open` | `FILE_IO` | + +--- + +## AST Pattern Rules + +### Memory Safety (Banned Functions) + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `c.memory.gets` | High | A | `gets()` — no bounds checking, always exploitable | +| `c.memory.strcpy` | High | A | `strcpy()` — no bounds checking on destination buffer | +| `c.memory.strcat` | High | A | `strcat()` — no bounds checking on destination buffer | +| `c.memory.sprintf` | High | A | `sprintf()` — no length limit on output buffer | +| `c.memory.scanf_percent_s` | High | A | `scanf("%s")` — unbounded string read | +| `c.memory.printf_no_fmt` | High | B | `printf(var)` — format-string vulnerability (non-literal first arg) | + +### Command Execution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `c.cmdi.system` | High | A | `system()` — shell command 
execution | +| `c.cmdi.popen` | Medium | A | `popen()` — shell command execution with pipe | + +--- + +## Examples + +### `c.memory.gets` — Banned function + +**Vulnerable:** +```c +char buf[64]; +gets(buf); // No bounds checking — buffer overflow +``` + +**Safe alternative:** +```c +char buf[64]; +fgets(buf, sizeof(buf), stdin); +``` + +### `c.memory.printf_no_fmt` — Format string + +**Vulnerable:** +```c +char *user_input = get_input(); +printf(user_input); // Format string vulnerability +``` + +**Safe alternative:** +```c +char *user_input = get_input(); +printf("%s", user_input); +``` + +### `c.cmdi.system` — Shell execution + +**Vulnerable:** +```c +char cmd[256]; +snprintf(cmd, sizeof(cmd), "ls %s", user_dir); +system(cmd); // Command injection if user_dir contains shell metacharacters +``` + +**Safe alternative:** +```c +// Use execvp with explicit argument array +char *args[] = {"ls", user_dir, NULL}; +execvp("ls", args); +``` diff --git a/docs/rules/cpp.md b/docs/rules/cpp.md new file mode 100644 index 00000000..5178a920 --- /dev/null +++ b/docs/rules/cpp.md @@ -0,0 +1,66 @@ +# C++ Rules + +C++ rules inherit C banned-function concerns and add C++-specific patterns like dangerous casts. + +## Taint Labels + +C++ shares taint labels with C. See [C Rules](c.md) for the full source/sink/sanitizer listing. 
+ +--- + +## AST Pattern Rules + +### Memory Safety + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `cpp.memory.gets` | High | A | `gets()` — no bounds checking, always exploitable | +| `cpp.memory.strcpy` | High | A | `strcpy()` — no bounds checking on destination | +| `cpp.memory.strcat` | High | A | `strcat()` — no bounds checking on destination | +| `cpp.memory.sprintf` | High | A | `sprintf()` — no length limit on output | +| `cpp.memory.reinterpret_cast` | Medium | A | `reinterpret_cast` — type-punning cast | +| `cpp.memory.const_cast` | Medium | A | `const_cast` — removes const/volatile qualifier | +| `cpp.memory.printf_no_fmt` | High | B | `printf(var)` — format-string vulnerability | + +### Command Execution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `cpp.cmdi.system` | High | A | `system()` — shell command execution | +| `cpp.cmdi.popen` | High | A | `popen()` — shell command execution | + +--- + +## Examples + +### `cpp.memory.reinterpret_cast` — Type-punning cast + +**Flagged:** +```cpp +int x = 42; +float* fp = reinterpret_cast<float*>(&x); // Type-punning, may violate strict aliasing +``` + +**Safe alternative:** +```cpp +int x = 42; +float f; +std::memcpy(&f, &x, sizeof(f)); // Well-defined type punning +``` + +### `cpp.memory.const_cast` — Removing const + +**Flagged:** +```cpp +void process(const std::string& s) { + char* p = const_cast<char*>(s.c_str()); // Removes const + p[0] = 'X'; // Undefined behavior +} +``` + +**Safe alternative:** +```cpp +void process(std::string s) { // Take by value + s[0] = 'X'; +} +``` diff --git a/docs/rules/go.md b/docs/rules/go.md new file mode 100644 index 00000000..391d763b --- /dev/null +++ b/docs/rules/go.md @@ -0,0 +1,148 @@ +# Go Rules + +Nyx detects Go vulnerabilities through AST patterns and taint analysis, covering command execution, unsafe pointer usage, TLS misconfiguration, weak crypto, SQL injection, hardcoded secrets, 
and deserialization. + +## Taint Labels + +Go has moderate taint label coverage. Sources, sinks, and sanitizers are defined in `src/labels/go.rs`. + +### Sources + +| Matcher | Cap | +|---------|-----| +| `os.Getenv` | all | +| `http.Request`, `r.FormValue`, `r.URL`, `r.Body`, `r.Header` | all | +| `r.URL.Query`, `r.URL.Query.Get`, `Request.FormValue`, `Request.URL` | all | + +### Sanitizers + +| Matcher | Cap | +|---------|-----| +| `html.EscapeString`, `template.HTMLEscapeString` | HTML_ESCAPE | +| `url.QueryEscape`, `url.PathEscape` | URL_ENCODE | +| `filepath.Clean`, `filepath.Base` | FILE_IO | + +### Sinks + +| Matcher | Cap | +|---------|-----| +| `exec.Command` | SHELL_ESCAPE | +| `db.Query`, `db.Exec`, `db.QueryRow`, `db.Prepare` | SHELL_ESCAPE | +| `fmt.Fprintf`, `fmt.Sprintf`, `fmt.Printf` | FMT_STRING | +| `os.Open`, `os.OpenFile`, `os.Create`, `ioutil.ReadFile`, `os.ReadFile` | FILE_IO | +| `template.HTML` | HTML_ESCAPE | + +> **Note:** Chained calls like `r.URL.Query().Get("host")` are normalized by stripping internal `()` segments before matching, so `r.URL.Query.Get` matches the source rule. 
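The normalization described in the note can be approximated as follows. This is a Python sketch of the idea, not Nyx's actual implementation (which lives in the Rust codebase); `normalize_callee`, `is_source`, and `GO_SOURCES` are hypothetical names, and the matcher set is a subset copied from the Sources table above.

```python
import re

# Matches one innermost "(...)" group that contains no nested parentheses.
_CALL_ARGS = re.compile(r"\([^()]*\)")

# Subset of the documented Go source matchers, for illustration only.
GO_SOURCES = {"os.Getenv", "r.FormValue", "r.URL.Query", "r.URL.Query.Get"}


def normalize_callee(call_expr: str) -> str:
    """Strip every argument list so a chained call collapses to a dotted path.

    Limitation of this sketch: parentheses inside string literals are not
    handled, which a real AST-based implementation would not trip over.
    """
    prev = None
    while prev != call_expr:  # repeat until nested groups are all removed
        prev = call_expr
        call_expr = _CALL_ARGS.sub("", call_expr)
    return call_expr


def is_source(call_expr: str) -> bool:
    """Check a call expression against the source matcher set."""
    return normalize_callee(call_expr) in GO_SOURCES


print(normalize_callee('r.URL.Query().Get("host")'))
print(is_source('strings.ToUpper(s)'))
```

Collapsing the chain to a plain dotted path means a single matcher string per source is enough, regardless of how many intermediate calls appear in the original expression.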
+ +--- + +## AST Pattern Rules + +### Command Execution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `go.cmdi.exec_command` | High | A | `exec.Command()` — arbitrary process execution | + +### Memory Safety + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `go.memory.unsafe_pointer` | Medium | A | `unsafe.Pointer` — bypasses Go type system | + +### Insecure Transport + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `go.transport.insecure_skip_verify` | High | A | `InsecureSkipVerify: true` — disables TLS certificate validation | + +### Weak Crypto + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `go.crypto.md5` | Low | A | `md5.New()` / `md5.Sum()` — weak hash algorithm | +| `go.crypto.sha1` | Low | A | `sha1.New()` / `sha1.Sum()` — weak hash algorithm | + +### SQL Injection + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `go.sqli.query_concat` | Medium | B | `db.Query`/`Exec`/`QueryRow` with concatenated string | + +### Secrets + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `go.secrets.hardcoded_key` | Medium | A | Variable with secret-like name assigned a string literal | + +### Deserialization + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `go.deser.gob_decode` | Medium | A | `gob.NewDecoder` — Go binary deserialization | + +--- + +## Examples + +### `go.transport.insecure_skip_verify` — TLS misconfiguration + +**Vulnerable:** +```go +tr := &http.Transport{ + TLSClientConfig: &tls.Config{ + InsecureSkipVerify: true, // Disables certificate verification + }, +} +``` + +**Safe alternative:** +```go +tr := &http.Transport{ + TLSClientConfig: &tls.Config{ + // Use proper CA certificates + RootCAs: certPool, + }, +} +``` + +### 
`go.sqli.query_concat` — SQL concatenation + +**Vulnerable:** +```go +rows, err := db.Query("SELECT * FROM users WHERE id=" + userID) +``` + +**Safe alternative:** +```go +rows, err := db.Query("SELECT * FROM users WHERE id=$1", userID) +``` + +### `go.secrets.hardcoded_key` — Hardcoded secret + +**Flagged:** +```go +apiKey := "sk-1234567890abcdef" +password := "hunter2" +``` + +**Safe alternative:** +```go +apiKey := os.Getenv("API_KEY") +password := os.Getenv("DB_PASSWORD") +``` + +### `go.cmdi.exec_command` — Command execution + +**Vulnerable:** +```go +cmd := exec.Command("sh", "-c", userInput) +cmd.Run() +``` + +**Safe alternative:** +```go +// Use explicit command and arguments, not shell +cmd := exec.Command("ls", "-la", safeDir) +cmd.Run() +``` diff --git a/docs/rules/index.md b/docs/rules/index.md new file mode 100644 index 00000000..52e08e82 --- /dev/null +++ b/docs/rules/index.md @@ -0,0 +1,79 @@ +# Rule Reference + +This section lists every detection rule in Nyx, organized by language. 
+ +## Rule ID Format + +| Prefix | Detector Family | Example | +|--------|----------------|---------| +| `taint-*` | [Taint analysis](../detectors/taint.md) | `taint-unsanitised-flow (source 5:11)` | +| `cfg-*` | [CFG structural](../detectors/cfg.md) | `cfg-unguarded-sink`, `cfg-auth-gap` | +| `state-*` | [State model](../detectors/state.md) | `state-use-after-close`, `state-resource-leak` | +| `.*.*` | [AST patterns](../detectors/patterns.md) | `rs.memory.transmute`, `js.code_exec.eval` | + +## Cross-Language Rules + +These rules apply to all supported languages: + +### Taint Rules + +| Rule ID | Severity | Description | +|---------|----------|-------------| +| `taint-unsanitised-flow (source L:C)` | Varies by source kind | Unsanitized data flows from source to sink | + +### CFG Structural Rules + +| Rule ID | Severity | Description | +|---------|----------|-------------| +| `cfg-unguarded-sink` | High/Medium | Sink without dominating guard | +| `cfg-auth-gap` | High | Web handler reaches privileged sink without auth | +| `cfg-unreachable-sink` | Medium | Dangerous function in unreachable code | +| `cfg-unreachable-sanitizer` | Low | Sanitizer in unreachable code | +| `cfg-unreachable-source` | Low | Source in unreachable code | +| `cfg-error-fallthrough` | High/Medium | Error path doesn't terminate before dangerous code | +| `cfg-resource-leak` | Medium | Resource not released on all exit paths | +| `cfg-lock-not-released` | Medium | Lock not released on all exit paths | + +### State Model Rules + +| Rule ID | Severity | Description | +|---------|----------|-------------| +| `state-use-after-close` | High | Variable used after being closed | +| `state-double-close` | Medium | Resource closed twice | +| `state-resource-leak` | Medium | Resource never closed (definite) | +| `state-resource-leak-possible` | Low | Resource may not close on all paths | +| `state-unauthed-access` | High | Privileged operation without authentication | + +## Per-Language AST Pattern 
Rules + +Each language page lists all AST pattern rules with examples: + +- [Rust](rust.md) — 12 rules (memory safety, code quality) +- [C](c.md) — 8 rules (banned functions, command execution, format strings) +- [C++](cpp.md) — 9 rules (banned functions, dangerous casts, command execution) +- [Java](java.md) — 8 rules (deserialization, command execution, reflection, SQL, crypto, XSS) +- [Go](go.md) — 8 rules (command execution, unsafe pointer, TLS, crypto, SQL, secrets, deserialization) +- [JavaScript](javascript.md) — 12 rules (code execution, XSS, prototype pollution, crypto, transport) +- [TypeScript](typescript.md) — 10 rules (mirrors JS + type-safety escapes) +- [Python](python.md) — 12 rules (code execution, command execution, deserialization, SQL, crypto, XSS) +- [PHP](php.md) — 11 rules (code execution, command execution, deserialization, SQL, path traversal, crypto) +- [Ruby](ruby.md) — 10 rules (code execution, command execution, deserialization, reflection, SSRF, crypto) + +## Taint Label Coverage + +Taint analysis uses language-specific source/sink/sanitizer labels. Coverage varies by language: + +| Language | Sources | Sinks | Sanitizers | Coverage | +|----------|---------|-------|------------|----------| +| Rust | Complete | Complete | Complete | Full | +| JavaScript | Complete | Complete | Partial | Full | +| TypeScript | Partial | Partial | Partial | Moderate | +| Python | Partial | Complete | Partial | Moderate | +| C | Partial | Complete | Minimal | Moderate | +| C++ | Partial | Complete | Minimal | Moderate | +| Java | Partial | Partial | Partial | Moderate | +| Go | Complete | Complete | Partial | Full | +| PHP | Complete | Complete | Partial | Full | +| Ruby | Partial | Partial | Partial | Moderate | + +"Starter" coverage means basic rules exist but many common library functions are not yet labeled. Contributions welcome. 
diff --git a/docs/rules/java.md b/docs/rules/java.md new file mode 100644 index 00000000..99ab5a10 --- /dev/null +++ b/docs/rules/java.md @@ -0,0 +1,135 @@ +# Java Rules + +Nyx detects Java vulnerabilities through AST patterns and taint analysis, covering deserialization, command execution, reflection, SQL injection, weak crypto, and XSS. + +## Taint Labels + +Java has moderate taint label coverage. Sources, sinks, and sanitizers are defined in `src/labels/java.rs`. + +### Sources + +| Matcher | Cap | +|---------|-----| +| `System.getenv` | all | +| `getParameter`, `getInputStream`, `getHeader`, `getCookies`, `getReader`, `getQueryString`, `getPathInfo` | all | +| `readObject`, `readLine` | all | + +### Sanitizers + +| Matcher | Cap | +|---------|-----| +| `HtmlUtils.htmlEscape`, `StringEscapeUtils.escapeHtml4` | HTML_ESCAPE | + +### Sinks + +| Matcher | Cap | +|---------|-----| +| `Runtime.exec`, `ProcessBuilder` | SHELL_ESCAPE | +| `executeQuery`, `executeUpdate`, `prepareStatement` | SHELL_ESCAPE | +| `Class.forName` | SHELL_ESCAPE | +| `println`, `print`, `write` | HTML_ESCAPE | + +--- + +## AST Pattern Rules + +### Deserialization + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `java.deser.readobject` | High | A | `ObjectInputStream.readObject()` — unsafe deserialization | + +### Command Execution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `java.cmdi.runtime_exec` | High | A | `Runtime.getRuntime().exec()` — shell command execution | + +### Reflection + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `java.reflection.class_forname` | Medium | A | `Class.forName()` — dynamic class loading | +| `java.reflection.method_invoke` | Medium | A | `Method.invoke()` — reflective method invocation | + +### SQL Injection + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| 
`java.sqli.execute_concat` | Medium | B | SQL `execute*()` with concatenated string argument | + +### Weak Crypto + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `java.crypto.insecure_random` | Low | A | `new Random()` — `java.util.Random` is not cryptographically secure | +| `java.crypto.weak_digest` | Low | A | `MessageDigest.getInstance("MD5"/"SHA1")` | + +### XSS + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `java.xss.getwriter_print` | Medium | A | `response.getWriter().print/println/write` — direct output | + +--- + +## Examples + +### `java.deser.readobject` — Unsafe deserialization + +**Vulnerable:** +```java +ObjectInputStream ois = new ObjectInputStream(request.getInputStream()); +Object obj = ois.readObject(); // Arbitrary object instantiation +``` + +**Safe alternative:** +```java +// Use a safe format like JSON +ObjectMapper mapper = new ObjectMapper(); +MyType obj = mapper.readValue(request.getInputStream(), MyType.class); +``` + +### `java.sqli.execute_concat` — SQL concatenation + +**Vulnerable:** +```java +String query = "SELECT * FROM users WHERE id=" + userId; +stmt.executeQuery(query); // SQL injection +``` + +**Safe alternative:** +```java +PreparedStatement ps = conn.prepareStatement("SELECT * FROM users WHERE id=?"); +ps.setString(1, userId); +ResultSet rs = ps.executeQuery(); +``` + +### `java.cmdi.runtime_exec` — Command execution + +**Vulnerable:** +```java +Runtime.getRuntime().exec("cmd /c " + userCommand); +``` + +**Safe alternative:** +```java +ProcessBuilder pb = new ProcessBuilder("cmd", "/c", "dir"); +// Use explicit argument list, never concatenate user input +``` + +### `java.reflection.class_forname` — Dynamic class loading + +**Flagged:** +```java +Class<?> cls = Class.forName(className); +Object obj = cls.getDeclaredConstructor().newInstance(); +``` + +**Safe alternative:** +```java +// Use an allowlist of permitted class names +Map<String, Class<?>>
allowed = Map.of("User", User.class, "Order", Order.class); +Class cls = allowed.get(className); +if (cls != null) { /* ... */ } +``` diff --git a/docs/rules/javascript.md b/docs/rules/javascript.md new file mode 100644 index 00000000..9971ed6d --- /dev/null +++ b/docs/rules/javascript.md @@ -0,0 +1,138 @@ +# JavaScript Rules + +JavaScript has the most complete taint label coverage alongside Rust. Nyx detects code execution, XSS, prototype pollution, command injection, and weak crypto. + +## Taint Sources + +| Function | Capability | Source Kind | +|----------|-----------|-------------| +| `document.location`, `window.location` | `all` | UserInput | +| `req.body`, `req.query`, `req.params` | `all` | UserInput | +| `req.headers`, `req.cookies` | `all` | UserInput | +| `process.env` | `all` | EnvironmentConfig | + +## Taint Sinks + +| Function | Required Capability | +|----------|-------------------| +| `eval` | `SHELL_ESCAPE` | +| `innerHTML` | `HTML_ESCAPE` | +| `location.href`, `window.location.href` | `URL_ENCODE` | +| `child_process.exec`, `child_process.execSync` | `SHELL_ESCAPE` | +| `child_process.spawn` | `SHELL_ESCAPE` | + +## Taint Sanitizers + +| Function | Strips Capability | +|----------|------------------| +| `JSON.parse` | `JSON_PARSE` | +| `encodeURIComponent`, `encodeURI` | `URL_ENCODE` | +| `DOMPurify.sanitize` | `HTML_ESCAPE` | + +> **Note:** Anonymous function expressions and arrow functions passed as callback arguments (e.g., Express `app.get('/path', function(req, res) { ... })`) are automatically walked as separate function scopes for taint analysis. Each anonymous function gets a unique scope identifier to prevent cross-function taint leakage. 
+ +--- + +## AST Pattern Rules + +### Code Execution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `js.code_exec.eval` | High | A | `eval()` — dynamic code execution | +| `js.code_exec.new_function` | High | A | `new Function()` — eval equivalent | +| `js.code_exec.settimeout_string` | Medium | A | `setTimeout`/`setInterval` with string argument | + +### XSS Sinks + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `js.xss.document_write` | Medium | A | `document.write()` / `document.writeln()` | +| `js.xss.outer_html` | Medium | A | Assignment to `.outerHTML` | +| `js.xss.insert_adjacent_html` | Medium | A | `insertAdjacentHTML()` | +| `js.xss.location_assign` | Medium | A | Assignment to `location`/`location.href` — open redirect | +| `js.xss.cookie_write` | Medium | A | Write to `document.cookie` | + +### Prototype Pollution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `js.prototype.proto_assignment` | Medium | A | Assignment to `__proto__` | +| `js.prototype.extend_object` | Medium | A | Assignment to `Object.prototype.*` | + +### Weak Crypto + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `js.crypto.weak_hash` | Low | A | `crypto.createHash("md5"/"sha1")` | +| `js.crypto.math_random` | Low | A | `Math.random()` — not cryptographically secure | + +### Insecure Transport + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `js.transport.fetch_http` | Low | A | `fetch("http://...")` — plaintext HTTP | + +--- + +## Examples + +### `js.code_exec.eval` — Dynamic code execution + +**Vulnerable:** +```javascript +const code = req.query.code; +eval(code); // Remote code execution +``` + +**Safe alternative:** +```javascript +// Use a sandboxed interpreter or avoid eval entirely +const allowed = { add: (a, b) => a + b }; +const result = 
allowed[req.query.operation]?.(req.query.a, req.query.b); +``` + +### `js.xss.document_write` — XSS sink + +**Vulnerable:** +```javascript +document.write("<h1>" + userName + "</h1>"); +``` + +**Safe alternative:** +```javascript +const el = document.createElement("h1"); +el.textContent = userName; +document.body.appendChild(el); +``` + +### `js.prototype.proto_assignment` — Prototype pollution + +**Vulnerable:** +```javascript +function merge(target, source) { + for (let key in source) { + target[key] = source[key]; // If key is "__proto__", pollutes prototype + } +} +``` + +**Safe alternative:** +```javascript +function merge(target, source) { + for (let key in source) { + if (key === "__proto__" || key === "constructor") continue; + target[key] = source[key]; + } +} +``` + +### Taint: `req.body` → `eval()` + +**Finding:** +``` +[HIGH] taint-unsanitised-flow (source 2:18) src/handler.js:3:5 + Source: req.body at 2:18 + Sink: eval() + Score: 78 +``` diff --git a/docs/rules/php.md b/docs/rules/php.md new file mode 100644 index 00000000..50bfe659 --- /dev/null +++ b/docs/rules/php.md @@ -0,0 +1,138 @@ +# PHP Rules + +Nyx detects PHP vulnerabilities through AST patterns and taint analysis, covering code execution, command injection, deserialization, SQL injection, path traversal, and weak crypto. + +## Taint Labels + +PHP has moderate taint label coverage. Sources, sinks, and sanitizers are defined in `src/labels/php.rs`. + +### Sources + +| Matcher | Cap | +|---------|-----| +| `$_GET` / `_GET`, `$_POST` / `_POST`, `$_REQUEST` / `_REQUEST`, `$_COOKIE` / `_COOKIE`, `$_FILES` / `_FILES`, `$_SERVER` / `_SERVER`, `$_ENV` / `_ENV` | all | +| `file_get_contents`, `fread` | all | + +> **Note:** PHP superglobal names are matched both with and without the `$` prefix because the CFG's `collect_idents` strips the leading `$` from variable names. Subscript access like `$_GET['cmd']` is handled via `element_reference` / `subscript_expression` node detection. 
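The note above says superglobal names match with or without the `$` prefix. A minimal sketch of that idea follows, assuming a hypothetical `normalize_ident` helper and a hard-coded name list copied from the Sources table. Neither is Nyx's real `collect_idents` or its label table, and subscript handling such as `$_GET['cmd']` is not modeled here.

```rust
/// Hypothetical helper: drops a single leading `$`, if present.
fn normalize_ident(name: &str) -> &str {
    name.strip_prefix('$').unwrap_or(name)
}

/// Superglobal names from the Sources table, stored without the `$` prefix.
const SUPERGLOBALS: [&str; 7] = [
    "_GET", "_POST", "_REQUEST", "_COOKIE", "_FILES", "_SERVER", "_ENV",
];

/// Returns true when `name` matches a superglobal with or without the `$`.
fn is_superglobal_source(name: &str) -> bool {
    let bare = normalize_ident(name);
    SUPERGLOBALS.iter().any(|s| *s == bare)
}

fn main() {
    // Both spellings resolve to the same source label.
    assert!(is_superglobal_source("$_GET"));
    assert!(is_superglobal_source("_GET"));
    // Ordinary variables are not sources.
    assert!(!is_superglobal_source("$name"));
}
```

Storing the names once, without the prefix, and normalizing at lookup time avoids keeping two spellings of every entry in sync.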
+ +### Sanitizers + +| Matcher | Cap | +|---------|-----| +| `htmlspecialchars`, `htmlentities` | HTML_ESCAPE | +| `escapeshellarg`, `escapeshellcmd` | SHELL_ESCAPE | +| `basename` | FILE_IO | + +### Sinks + +| Matcher | Cap | +|---------|-----| +| `system`, `exec`, `passthru`, `shell_exec`, `proc_open`, `popen` | SHELL_ESCAPE | +| `eval`, `assert` | SHELL_ESCAPE | +| `include`, `include_once`, `require`, `require_once` | FILE_IO | +| `unserialize` | SHELL_ESCAPE | +| `move_uploaded_file`, `copy`, `file_put_contents`, `fwrite` | FILE_IO | +| `echo`, `print` | HTML_ESCAPE | +| `mysqli_query`, `pg_query`, `query` | SHELL_ESCAPE | + +--- + +## AST Pattern Rules + +### Code Execution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `php.code_exec.eval` | High | A | `eval()` — dynamic code execution | +| `php.code_exec.create_function` | High | A | `create_function()` — deprecated eval-like constructor | +| `php.code_exec.preg_replace_e` | High | A | `preg_replace` with `/e` modifier — code execution via regex | +| `php.code_exec.assert_string` | High | A | `assert()` with string argument — evaluates PHP code | + +### Command Execution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `php.cmdi.system` | High | A | `system`/`shell_exec`/`exec`/`passthru`/`proc_open`/`popen` | + +### Deserialization + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `php.deser.unserialize` | High | A | `unserialize()` — PHP object injection | + +### SQL Injection + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `php.sqli.query_concat` | Medium | B | `mysql_query`/`mysqli_query` with concatenated SQL | + +### Path Traversal + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `php.path.include_variable` | High | B | `include`/`require` with variable path — file 
inclusion | + +### Weak Crypto + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `php.crypto.md5` | Low | A | `md5()` — weak hash function | +| `php.crypto.sha1` | Low | A | `sha1()` — weak hash function | +| `php.crypto.rand` | Low | A | `rand()`/`mt_rand()` — not cryptographically secure | + +--- + +## Examples + +### `php.code_exec.eval` — Dynamic code execution + +**Vulnerable:** +```php +eval($_GET['code']); +``` + +**Safe alternative:** +```php +// Never use eval with user input +// Use a template engine or allowlisted operations +``` + +### `php.deser.unserialize` — Object injection + +**Vulnerable:** +```php +$obj = unserialize($_COOKIE['data']); +``` + +**Safe alternative:** +```php +$data = json_decode($_COOKIE['data'], true); +``` + +### `php.path.include_variable` — File inclusion + +**Vulnerable:** +```php +include($_GET['page']); // Local/remote file inclusion +``` + +**Safe alternative:** +```php +$allowed = ['home', 'about', 'contact']; +$page = in_array($_GET['page'], $allowed) ? $_GET['page'] : 'home'; +include("pages/{$page}.php"); +``` + +### `php.sqli.query_concat` — SQL concatenation + +**Vulnerable:** +```php +mysqli_query($conn, "SELECT * FROM users WHERE id=" . $_GET['id']); +``` + +**Safe alternative:** +```php +$stmt = $conn->prepare("SELECT * FROM users WHERE id=?"); +$stmt->bind_param("i", $_GET['id']); +$stmt->execute(); +``` diff --git a/docs/rules/python.md b/docs/rules/python.md new file mode 100644 index 00000000..cfc1c556 --- /dev/null +++ b/docs/rules/python.md @@ -0,0 +1,142 @@ +# Python Rules + +Nyx detects Python vulnerabilities through AST patterns and taint analysis, covering code execution, command injection, deserialization, SQL injection, and weak crypto. + +## Taint Labels + +Python has moderate taint label coverage. Sources, sinks, and sanitizers are defined in `src/labels/python.rs`. 
+ +### Sources + +| Matcher | Cap | +|---------|-----| +| `os.getenv`, `os.environ` | all | +| `request.args`, `request.form`, `request.json`, `request.headers`, `request.cookies`, `input` | all | +| `sys.argv` | all | +| `argparse.parse_args`, `urllib.request.urlopen`, `requests.get`, `requests.post` | all | + +### Sanitizers + +| Matcher | Cap | +|---------|-----| +| `html.escape` | HTML_ESCAPE | +| `shlex.quote` | SHELL_ESCAPE | +| `os.path.realpath` | FILE_IO | + +### Sinks + +| Matcher | Cap | +|---------|-----| +| `eval`, `exec` | SHELL_ESCAPE | +| `os.system`, `os.popen`, `subprocess.call`, `subprocess.run`, `subprocess.Popen`, `subprocess.check_output`, `subprocess.check_call` | SHELL_ESCAPE | +| `cursor.execute`, `cursor.executemany` | SHELL_ESCAPE | +| `send_file`, `send_from_directory` | FILE_IO | +| `open` | FILE_IO | + +--- + +## AST Pattern Rules + +### Code Execution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `py.code_exec.eval` | High | A | `eval()` — dynamic code execution | +| `py.code_exec.exec` | High | A | `exec()` — dynamic code execution | +| `py.code_exec.compile` | Medium | A | `compile()` with exec/eval mode | + +### Command Execution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `py.cmdi.os_system` | High | A | `os.system()` — shell command execution | +| `py.cmdi.os_popen` | High | A | `os.popen()` — shell command execution | +| `py.cmdi.subprocess_shell` | High | B | `subprocess.*` with `shell=True` | + +### Deserialization + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `py.deser.pickle_loads` | High | A | `pickle.loads()` / `pickle.load()` — arbitrary object deserialization | +| `py.deser.yaml_load` | High | A | `yaml.load()` without SafeLoader | +| `py.deser.shelve_open` | Medium | A | `shelve.open()` — pickle-backed deserialization | + +### SQL Injection + +| Rule ID | Severity | Tier | 
Description | +|---------|----------|------|-------------| +| `py.sqli.execute_format` | Medium | B | `cursor.execute()` with string concatenation | + +### Weak Crypto + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `py.crypto.md5` | Low | A | `hashlib.md5()` — weak hash algorithm | +| `py.crypto.sha1` | Low | A | `hashlib.sha1()` — weak hash algorithm | + +### Template Injection + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `py.xss.jinja_from_string` | Medium | A | `jinja2.Template.from_string()` — template injection | + +--- + +## Examples + +### `py.deser.pickle_loads` — Unsafe deserialization + +**Vulnerable:** +```python +import pickle +data = pickle.loads(request.body) # Arbitrary code execution +``` + +**Safe alternative:** +```python +import json +data = json.loads(request.body) # JSON is safe +``` + +### `py.cmdi.subprocess_shell` — Shell execution + +**Vulnerable:** +```python +import subprocess +subprocess.call(user_input, shell=True) # Command injection +``` + +**Safe alternative:** +```python +import subprocess +import shlex +subprocess.call(shlex.split(user_input), shell=False) +# Or better: use an explicit command list +subprocess.call(["ls", "-la", user_dir]) +``` + +### `py.deser.yaml_load` — Unsafe YAML + +**Vulnerable:** +```python +import yaml +config = yaml.load(user_data) # Can instantiate arbitrary objects +``` + +**Safe alternative:** +```python +import yaml +config = yaml.safe_load(user_data) # Only basic Python types +``` + +### `py.sqli.execute_format` — SQL concatenation + +**Vulnerable:** +```python +cursor.execute("SELECT * FROM users WHERE id=" + user_id) +``` + +**Safe alternative:** +```python +cursor.execute("SELECT * FROM users WHERE id=?", (user_id,)) +``` diff --git a/docs/rules/ruby.md b/docs/rules/ruby.md new file mode 100644 index 00000000..0b9a6f19 --- /dev/null +++ b/docs/rules/ruby.md @@ -0,0 +1,132 @@ +# Ruby Rules + +Nyx 
detects Ruby vulnerabilities through AST patterns and taint analysis, covering code execution, command injection, deserialization, reflection, SSRF, and weak crypto. + +## Taint Labels + +Ruby has moderate taint label coverage. Sources, sinks, and sanitizers are defined in `src/labels/ruby.rs`. + +### Sources + +| Matcher | Cap | +|---------|-----| +| `ENV`, `gets` | all | +| `params` | all | + +> **Note:** Ruby's `params[:cmd]` subscript access is detected via `element_reference` node handling in the CFG. Sinatra/Rails `do...end` blocks are walked as function scopes. + +### Sanitizers + +| Matcher | Cap | +|---------|-----| +| `CGI.escapeHTML`, `ERB::Util.html_escape` | HTML_ESCAPE | +| `Shellwords.escape`, `Shellwords.shellescape` | SHELL_ESCAPE | + +### Sinks + +| Matcher | Cap | +|---------|-----| +| `system`, `exec` | SHELL_ESCAPE | +| `eval` | SHELL_ESCAPE | +| `puts`, `print` | HTML_ESCAPE | + +--- + +## AST Pattern Rules + +### Code Execution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `rb.code_exec.eval` | High | A | `Kernel#eval` — dynamic code execution | +| `rb.code_exec.instance_eval` | High | A | `instance_eval` — evaluates string in object context | +| `rb.code_exec.class_eval` | High | A | `class_eval` / `module_eval` — evaluates string in class context | + +### Command Execution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `rb.cmdi.backtick` | High | A | Backtick shell execution (`` `cmd` ``) | +| `rb.cmdi.system_interp` | High | A | `system`/`exec` call — command execution risk | + +### Deserialization + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `rb.deser.yaml_load` | High | A | `YAML.load` — arbitrary object deserialization | +| `rb.deser.marshal_load` | High | A | `Marshal.load` — arbitrary Ruby object deserialization | + +### Reflection + +| Rule ID | Severity | Tier | Description | 
+|---------|----------|------|-------------| +| `rb.reflection.send_dynamic` | Medium | B | `send()` with non-symbol argument — arbitrary method dispatch | +| `rb.reflection.constantize` | Medium | A | `constantize` / `safe_constantize` — dynamic class resolution | + +### SSRF + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `rb.ssrf.open_uri` | Medium | A | `Kernel#open` with HTTP URL — SSRF via open-uri | + +### Weak Crypto + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `rb.crypto.md5` | Low | A | `Digest::MD5` — weak hash algorithm | + +--- + +## Examples + +### `rb.deser.yaml_load` — Unsafe YAML deserialization + +**Vulnerable:** +```ruby +data = YAML.load(params[:config]) # Arbitrary object instantiation +``` + +**Safe alternative:** +```ruby +data = YAML.safe_load(params[:config]) # Only basic Ruby types +``` + +### `rb.cmdi.backtick` — Backtick shell execution + +**Vulnerable:** +```ruby +output = `ls #{user_dir}` # Command injection via interpolation +``` + +**Safe alternative:** +```ruby +require 'open3' +output, status = Open3.capture2('ls', user_dir) +``` + +### `rb.reflection.send_dynamic` — Dynamic method dispatch + +**Vulnerable:** +```ruby +obj.send(params[:method], params[:arg]) # Arbitrary method invocation +``` + +**Safe alternative:** +```ruby +allowed = %w[name email phone] +if allowed.include?(params[:method]) + obj.send(params[:method]) +end +``` + +### `rb.deser.marshal_load` — Marshal deserialization + +**Vulnerable:** +```ruby +obj = Marshal.load(request.body.read) +``` + +**Safe alternative:** +```ruby +data = JSON.parse(request.body.read) +``` diff --git a/docs/rules/rust.md b/docs/rules/rust.md new file mode 100644 index 00000000..03ef12af --- /dev/null +++ b/docs/rules/rust.md @@ -0,0 +1,105 @@ +# Rust Rules + +Nyx detects Rust vulnerabilities through AST patterns (memory safety, code quality) and taint analysis (command injection via 
`env::var` → `Command::new`). + +## Taint Sources + +| Function | Capability | Source Kind | +|----------|-----------|-------------| +| `std::env::var`, `env::var` | `all` | EnvironmentConfig | + +## Taint Sinks + +| Function | Required Capability | +|----------|-------------------| +| `Command::new`, `Command::arg`, `Command::args` | `SHELL_ESCAPE` | +| `Command::status`, `Command::output` | `SHELL_ESCAPE` | +| `fs::read_to_string`, `fs::write`, `fs::read`, `File::open`, `File::create` | `FILE_IO` | + +## Taint Sanitizers + +| Function | Strips Capability | +|----------|------------------| +| `html_escape::encode_safe`, `sanitize_html` | `HTML_ESCAPE` | +| `shell_escape::unix::escape`, `sanitize_shell` | `SHELL_ESCAPE` | + +> **Note:** `fs::read_to_string` was moved from taint sources to sinks to support path traversal detection (`env::var` → `fs::read_to_string`). + +--- + +## AST Pattern Rules + +### Memory Safety + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `rs.memory.transmute` | High | A | `std::mem::transmute` — unchecked type reinterpretation | +| `rs.memory.copy_nonoverlapping` | High | A | `ptr::copy_nonoverlapping` — raw pointer memcpy | +| `rs.memory.get_unchecked` | High | A | `get_unchecked` / `get_unchecked_mut` — unchecked indexing | +| `rs.memory.mem_zeroed` | High | A | `std::mem::zeroed` — may be UB for non-POD types | +| `rs.memory.ptr_read` | High | A | `ptr::read` / `ptr::read_volatile` — raw pointer dereference | +| `rs.memory.narrow_cast` | Low | A | `as u8`/`i8`/`u16`/`i16` — possible truncation | +| `rs.memory.mem_forget` | Low | A | `std::mem::forget` — may leak resources | + +### Code Quality + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `rs.quality.unsafe_block` | Medium | A | `unsafe { }` block — manual memory safety obligation | +| `rs.quality.unsafe_fn` | Medium | A | `unsafe fn` declaration | +| `rs.quality.unwrap` | Low | A | 
`.unwrap()` — panics on `None`/`Err` | +| `rs.quality.expect` | Low | A | `.expect()` — panics on `None`/`Err` | +| `rs.quality.panic_macro` | Low | A | `panic!()` macro invocation | +| `rs.quality.todo` | Low | A | `todo!()` / `unimplemented!()` placeholder | + +--- + +## Examples + +### `rs.memory.transmute` — Unchecked type reinterpretation + +**Vulnerable:** +```rust +let x: u32 = 42; +let y: f32 = unsafe { std::mem::transmute(x) }; +``` + +**Safe alternative:** +```rust +let x: u32 = 42; +let y: f32 = f32::from_bits(x); +``` + +### `rs.quality.unsafe_block` — Unsafe block + +**Flagged:** +```rust +unsafe { + let ptr = &x as *const i32; + println!("{}", *ptr); +} +``` + +**Safe alternative:** +```rust +// Use safe abstractions when possible +println!("{}", x); +``` + +### Taint: `env::var` → `Command::new` + +**Vulnerable:** +```rust +let cmd = std::env::var("USER_CMD").unwrap(); +Command::new("sh").arg("-c").arg(&cmd).output()?; +``` + +**Safe alternative:** +```rust +let cmd = std::env::var("USER_CMD").unwrap(); +// Validate against allowlist +let allowed = ["ls", "whoami", "date"]; +if allowed.contains(&cmd.as_str()) { + Command::new(&cmd).output()?; +} +``` diff --git a/docs/rules/typescript.md b/docs/rules/typescript.md new file mode 100644 index 00000000..b86427b8 --- /dev/null +++ b/docs/rules/typescript.md @@ -0,0 +1,81 @@ +# TypeScript Rules + +TypeScript rules mirror JavaScript patterns plus TypeScript-specific type-safety escape detectors. Taint labels are shared with JavaScript (see [JavaScript Rules](javascript.md)). 
+ +--- + +## AST Pattern Rules + +### Code Execution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `ts.code_exec.eval` | High | A | `eval()` — dynamic code execution | +| `ts.code_exec.new_function` | High | A | `new Function()` — eval equivalent | +| `ts.code_exec.settimeout_string` | Medium | A | `setTimeout`/`setInterval` with string argument | + +### XSS Sinks + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `ts.xss.document_write` | Medium | A | `document.write()` / `document.writeln()` | +| `ts.xss.outer_html` | Medium | A | Assignment to `.outerHTML` | +| `ts.xss.insert_adjacent_html` | Medium | A | `insertAdjacentHTML()` | +| `ts.xss.location_assign` | Medium | A | Assignment to `location`/`location.href` | +| `ts.xss.cookie_write` | Low | A | Write to `document.cookie` | + +### Prototype Pollution + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `ts.prototype.proto_assignment` | Medium | A | Assignment to `__proto__` | + +### Weak Crypto + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `ts.crypto.math_random` | Low | A | `Math.random()` — not cryptographically secure | + +### Code Quality (TypeScript-specific) + +| Rule ID | Severity | Tier | Description | +|---------|----------|------|-------------| +| `ts.quality.any_annotation` | Low | A | Type annotation of `any` — disables type checking | +| `ts.quality.as_any` | Low | A | Type assertion `as any` — type-safety escape hatch | + +--- + +## Examples + +### `ts.quality.any_annotation` — `any` type + +**Flagged:** +```typescript +function process(data: any) { // ts.quality.any_annotation + data.whatever(); // No type checking +} +``` + +**Safe alternative:** +```typescript +interface UserData { name: string; email: string; } +function process(data: UserData) { + console.log(data.name); +} +``` + +### `ts.quality.as_any` — 
Type assertion escape + +**Flagged:** +```typescript +const result = someValue as any; // ts.quality.as_any +result.nonexistentMethod(); +``` + +**Safe alternative:** +```typescript +if (isValidType(someValue)) { + const result = someValue as KnownType; + result.knownMethod(); +} +``` diff --git a/src/ast.rs b/src/ast.rs index 8916dde7..cdb83551 100644 --- a/src/ast.rs +++ b/src/ast.rs @@ -2,8 +2,10 @@ use crate::cfg::{build_cfg, export_summaries}; use crate::cfg_analysis; use crate::commands::scan::Diag; use crate::errors::{NyxError, NyxResult}; +use crate::evidence::{Evidence, SpanEvidence, StateEvidence}; use crate::labels::{build_lang_rules, severity_for_source_kind}; -use crate::patterns::Severity; +use crate::patterns::{FindingCategory, Severity}; +use crate::state; use crate::summary::{FuncSummary, GlobalSummaries}; use crate::symbol::{Lang, normalize_namespace}; use crate::taint::analyse_file; @@ -92,6 +94,23 @@ fn is_nonprod_path(path: &Path) -> bool { false } +/// Normalize a callee description for display. +fn sanitize_desc(s: &str) -> String { + crate::fmt::normalize_snippet(s) +} + +/// Human-readable label for a `SourceKind`. +fn source_kind_label(sk: crate::labels::SourceKind) -> &'static str { + use crate::labels::SourceKind; + match sk { + SourceKind::UserInput => "user input", + SourceKind::EnvironmentConfig => "environment config", + SourceKind::FileSystem => "file system data", + SourceKind::Database => "database result", + SourceKind::Unknown => "tainted data", + } +} + /// Downgrade severity by one tier: High→Medium, Medium→Low, Low→Low. 
fn downgrade_severity(s: Severity) -> Severity { match s { @@ -239,8 +258,45 @@ pub fn run_rules_on_bytes( let source_byte = cfg_graph[finding.source].span.0; let source_point = byte_offset_to_point(&_tree, source_byte); + let source_callee = cfg_graph[finding.source] + .callee + .as_deref() + .map(sanitize_desc) + .unwrap_or_else(|| "(unknown)".into()); + let sink_callee = cfg_graph[finding.sink] + .callee + .as_deref() + .map(sanitize_desc) + .unwrap_or_else(|| "(unknown)".into()); + let kind_label = source_kind_label(finding.source_kind); + + let short_source = crate::fmt::shorten_callee(&source_callee); + let short_sink = crate::fmt::shorten_callee(&sink_callee); + + let mut labels = vec![ + ( + "Source".into(), + format!( + "{source_callee} ({}:{})", + source_point.row + 1, + source_point.column + 1 + ), + ), + ("Sink".into(), sink_callee.to_string()), + ]; + if let Some(guard) = finding.guard_kind { + labels.push(("Path guard".into(), format!("{guard:?}"))); + } + + let file_path_owned = path.to_string_lossy().into_owned(); + let mut evidence_notes = Vec::new(); + if finding.path_validated { + evidence_notes.push("path_validated".into()); + } + evidence_notes.push(format!("source_kind:{:?}", finding.source_kind)); + out.push(Diag { - path: path.to_string_lossy().into_owned(), + path: file_path_owned.clone(), line: sink_point.row + 1, col: sink_point.column + 1, severity: severity_for_source_kind(finding.source_kind), @@ -249,6 +305,50 @@ pub fn run_rules_on_bytes( source_point.row + 1, source_point.column + 1 ), + category: FindingCategory::Security, + path_validated: finding.path_validated, + guard_kind: finding.guard_kind.map(|k| format!("{k:?}")), + message: Some(format!( + "unsanitised {kind_label} flows from {short_source} \u{2192} {short_sink}" + )), + labels, + confidence: None, + evidence: Some(Evidence { + source: Some(SpanEvidence { + path: file_path_owned.clone(), + line: (source_point.row + 1) as u32, + col: (source_point.column + 1) as u32, + 
kind: "source".into(), + snippet: Some(short_source.clone()), + }), + sink: Some(SpanEvidence { + path: file_path_owned, + line: (sink_point.row + 1) as u32, + col: (sink_point.column + 1) as u32, + kind: "sink".into(), + snippet: Some(short_sink.clone()), + }), + guards: finding + .guard_kind + .map(|g| { + vec![SpanEvidence { + path: path.to_string_lossy().into_owned(), + line: (sink_point.row + 1) as u32, + col: 0, + kind: "guard".into(), + snippet: Some(format!("{g:?}")), + }] + }) + .unwrap_or_default(), + sanitizers: vec![], + state: None, + notes: evidence_notes, + }), + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, }); } @@ -268,14 +368,111 @@ pub fn run_rules_on_bytes( }; for cf in cfg_analysis::run_all(&cfg_ctx) { let point = byte_offset_to_point(&_tree, cf.span.0); + let cfg_confidence = Some(match cf.confidence { + cfg_analysis::Confidence::High => crate::evidence::Confidence::High, + cfg_analysis::Confidence::Medium => crate::evidence::Confidence::Medium, + cfg_analysis::Confidence::Low => crate::evidence::Confidence::Low, + }); out.push(Diag { path: path.to_string_lossy().into_owned(), line: point.row + 1, col: point.column + 1, severity: cf.severity, id: cf.rule_id, + category: FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: Some(cf.message), + labels: vec![], + confidence: cfg_confidence, + evidence: Some(Evidence { + source: None, + sink: Some(SpanEvidence { + path: path.to_string_lossy().into_owned(), + line: (point.row + 1) as u32, + col: (point.column + 1) as u32, + kind: "sink".into(), + snippet: None, + }), + guards: vec![], + sanitizers: vec![], + state: None, + notes: vec![], + }), + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, }); } + + // ── State-model dataflow analysis ──────────────────────────────── + if cfg.scanner.enable_state_analysis { + let state_findings = state::run_state_analysis( + 
&cfg_graph, + entry, + caller_lang, + bytes, + &summaries, + global_summaries, + ); + // Collect state finding lines to dedup overlapping CFG findings. + let state_lines: std::collections::HashSet<usize> = state_findings + .iter() + .map(|sf| byte_offset_to_point(&_tree, sf.span.0).row + 1) + .collect(); + + for sf in &state_findings { + let point = byte_offset_to_point(&_tree, sf.span.0); + out.push(Diag { + path: path.to_string_lossy().into_owned(), + line: point.row + 1, + col: point.column + 1, + severity: sf.severity, + id: sf.rule_id.clone(), + category: FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: Some(sf.message.clone()), + labels: vec![], + confidence: None, + evidence: Some(Evidence { + source: None, + sink: Some(SpanEvidence { + path: path.to_string_lossy().into_owned(), + line: (point.row + 1) as u32, + col: (point.column + 1) as u32, + kind: "sink".into(), + snippet: None, + }), + guards: vec![], + sanitizers: vec![], + state: Some(StateEvidence { + machine: sf.machine.into(), + subject: sf.subject.clone(), + from_state: sf.from_state.into(), + to_state: sf.to_state.into(), + }), + notes: vec![], + }), + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }); + } + + // Suppress cfg-resource-leak / cfg-auth-gap when state analysis + // already covers the same line (state analysis is more precise). 
+ if !state_findings.is_empty() { + out.retain(|d| { + !((d.id == "cfg-resource-leak" || d.id == "cfg-auth-gap") + && state_lines.contains(&d.line)) + }); + } + } } if cfg.scanner.mode == AnalysisMode::Full || cfg.scanner.mode == AnalysisMode::Ast { @@ -285,7 +482,7 @@ pub fn run_rules_on_bytes( let mut cursor = QueryCursor::new(); for cq in compiled.iter() { - if cfg.scanner.min_severity <= cq.meta.severity { + if cq.meta.severity > cfg.scanner.min_severity { continue; } let mut matches = cursor.matches(&cq.query, root, bytes); @@ -298,6 +495,31 @@ pub fn run_rules_on_bytes( col: point.column + 1, severity: cq.meta.severity, id: cq.meta.id.to_owned(), + category: cq.meta.category.finding_category(), + path_validated: false, + guard_kind: None, + message: Some(cq.meta.description.to_owned()), + labels: vec![], + confidence: Some(cq.meta.confidence), + evidence: Some(Evidence { + source: None, + sink: Some(SpanEvidence { + path: path.to_string_lossy().into_owned(), + line: (point.row + 1) as u32, + col: (point.column + 1) as u32, + kind: "sink".into(), + snippet: None, + }), + guards: vec![], + sanitizers: vec![], + state: None, + notes: vec![], + }), + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, }); } } @@ -427,8 +649,45 @@ pub fn analyse_file_fused( let source_byte = cfg_graph[finding.source].span.0; let source_point = byte_offset_to_point(&tree, source_byte); + let source_callee = cfg_graph[finding.source] + .callee + .as_deref() + .map(sanitize_desc) + .unwrap_or_else(|| "(unknown)".into()); + let sink_callee = cfg_graph[finding.sink] + .callee + .as_deref() + .map(sanitize_desc) + .unwrap_or_else(|| "(unknown)".into()); + let kind_label = source_kind_label(finding.source_kind); + + let short_source = crate::fmt::shorten_callee(&source_callee); + let short_sink = crate::fmt::shorten_callee(&sink_callee); + + let mut labels = vec![ + ( + "Source".into(), + format!( + "{source_callee} ({}:{})", + 
source_point.row + 1, + source_point.column + 1 + ), + ), + ("Sink".into(), sink_callee.to_string()), + ]; + if let Some(guard) = finding.guard_kind { + labels.push(("Path guard".into(), format!("{guard:?}"))); + } + + let fused_file_path = path.to_string_lossy().into_owned(); + let mut fused_evidence_notes = Vec::new(); + if finding.path_validated { + fused_evidence_notes.push("path_validated".into()); + } + fused_evidence_notes.push(format!("source_kind:{:?}", finding.source_kind)); + out.push(Diag { - path: path.to_string_lossy().into_owned(), + path: fused_file_path.clone(), line: sink_point.row + 1, col: sink_point.column + 1, severity: severity_for_source_kind(finding.source_kind), @@ -437,6 +696,50 @@ pub fn analyse_file_fused( source_point.row + 1, source_point.column + 1 ), + category: FindingCategory::Security, + path_validated: finding.path_validated, + guard_kind: finding.guard_kind.map(|k| format!("{k:?}")), + message: Some(format!( + "unsanitised {kind_label} flows from {short_source} \u{2192} {short_sink}" + )), + labels, + confidence: None, + evidence: Some(Evidence { + source: Some(SpanEvidence { + path: fused_file_path.clone(), + line: (source_point.row + 1) as u32, + col: (source_point.column + 1) as u32, + kind: "source".into(), + snippet: Some(short_source.clone()), + }), + sink: Some(SpanEvidence { + path: fused_file_path.clone(), + line: (sink_point.row + 1) as u32, + col: (sink_point.column + 1) as u32, + kind: "sink".into(), + snippet: Some(short_sink.clone()), + }), + guards: finding + .guard_kind + .map(|g| { + vec![SpanEvidence { + path: fused_file_path, + line: (sink_point.row + 1) as u32, + col: 0, + kind: "guard".into(), + snippet: Some(format!("{g:?}")), + }] + }) + .unwrap_or_default(), + sanitizers: vec![], + state: None, + notes: fused_evidence_notes, + }), + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, }); } @@ -455,14 +758,108 @@ pub fn analyse_file_fused( }; for cf in 
cfg_analysis::run_all(&cfg_ctx) { let point = byte_offset_to_point(&tree, cf.span.0); + let fused_cfg_confidence = Some(match cf.confidence { + cfg_analysis::Confidence::High => crate::evidence::Confidence::High, + cfg_analysis::Confidence::Medium => crate::evidence::Confidence::Medium, + cfg_analysis::Confidence::Low => crate::evidence::Confidence::Low, + }); out.push(Diag { path: path.to_string_lossy().into_owned(), line: point.row + 1, col: point.column + 1, severity: cf.severity, id: cf.rule_id, + category: FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: Some(cf.message), + labels: vec![], + confidence: fused_cfg_confidence, + evidence: Some(Evidence { + source: None, + sink: Some(SpanEvidence { + path: path.to_string_lossy().into_owned(), + line: (point.row + 1) as u32, + col: (point.column + 1) as u32, + kind: "sink".into(), + snippet: None, + }), + guards: vec![], + sanitizers: vec![], + state: None, + notes: vec![], + }), + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, }); } + + // ── State-model dataflow analysis ──────────────────────────────── + if cfg.scanner.enable_state_analysis { + let state_findings = state::run_state_analysis( + &cfg_graph, + entry, + caller_lang, + bytes, + &local_summaries, + global_summaries, + ); + let state_lines: std::collections::HashSet<usize> = state_findings + .iter() + .map(|sf| byte_offset_to_point(&tree, sf.span.0).row + 1) + .collect(); + + for sf in &state_findings { + let point = byte_offset_to_point(&tree, sf.span.0); + out.push(Diag { + path: path.to_string_lossy().into_owned(), + line: point.row + 1, + col: point.column + 1, + severity: sf.severity, + id: sf.rule_id.clone(), + category: FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: Some(sf.message.clone()), + labels: vec![], + confidence: None, + evidence: Some(Evidence { + source: None, + sink: Some(SpanEvidence { + path: 
path.to_string_lossy().into_owned(), + line: (point.row + 1) as u32, + col: (point.column + 1) as u32, + kind: "sink".into(), + snippet: None, + }), + guards: vec![], + sanitizers: vec![], + state: Some(StateEvidence { + machine: sf.machine.into(), + subject: sf.subject.clone(), + from_state: sf.from_state.into(), + to_state: sf.to_state.into(), + }), + notes: vec![], + }), + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }); + } + + if !state_findings.is_empty() { + out.retain(|d| { + !((d.id == "cfg-resource-leak" || d.id == "cfg-auth-gap") + && state_lines.contains(&d.line)) + }); + } + } } // AST pattern queries @@ -472,7 +869,7 @@ pub fn analyse_file_fused( let mut cursor = QueryCursor::new(); for cq in compiled.iter() { - if cfg.scanner.min_severity <= cq.meta.severity { + if cq.meta.severity > cfg.scanner.min_severity { continue; } let mut matches = cursor.matches(&cq.query, root, bytes); @@ -485,6 +882,31 @@ pub fn analyse_file_fused( col: point.column + 1, severity: cq.meta.severity, id: cq.meta.id.to_owned(), + category: cq.meta.category.finding_category(), + path_validated: false, + guard_kind: None, + message: Some(cq.meta.description.to_owned()), + labels: vec![], + confidence: Some(cq.meta.confidence), + evidence: Some(Evidence { + source: None, + sink: Some(SpanEvidence { + path: path.to_string_lossy().into_owned(), + line: (point.row + 1) as u32, + col: (point.column + 1) as u32, + kind: "sink".into(), + snippet: None, + }), + guards: vec![], + sanitizers: vec![], + state: None, + notes: vec![], + }), + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, }); } } diff --git a/src/callgraph.rs b/src/callgraph.rs new file mode 100644 index 00000000..442417b5 --- /dev/null +++ b/src/callgraph.rs @@ -0,0 +1,599 @@ +use crate::interop::InteropEdge; +use crate::summary::{CalleeResolution, GlobalSummaries}; +use crate::symbol::FuncKey; +use 
petgraph::graph::NodeIndex; +use petgraph::prelude::*; +use std::collections::HashMap; + +// ───────────────────────────────────────────────────────────────────────────── +// Types +// ───────────────────────────────────────────────────────────────────────────── + +/// Metadata attached to each call-graph edge. +#[derive(Debug, Clone)] +pub struct CallEdge { + /// The raw callee string as it appeared in source (e.g. `"env::var"`). + /// Preserved for diagnostics — **not** the normalized form used for resolution. + #[allow(dead_code)] // used for future diagnostics and path display + pub call_site: String, +} + +/// A callee that could not be resolved to any known function definition. +#[derive(Debug, Clone)] +#[allow(dead_code)] // fields used for future diagnostics reporting +pub struct UnresolvedCallee { + pub caller: FuncKey, + pub callee_name: String, +} + +/// A callee that matched multiple function definitions — ambiguous. +#[derive(Debug, Clone)] +#[allow(dead_code)] // fields used for future diagnostics reporting +pub struct AmbiguousCallee { + pub caller: FuncKey, + pub callee_name: String, + pub candidates: Vec<FuncKey>, +} + +/// The whole-program call graph. +/// +/// Nodes are [`FuncKey`]s (one per function definition across all files). +/// Edges represent call-site relationships resolved after pass 1. +pub struct CallGraph { + pub graph: DiGraph<FuncKey, CallEdge>, + /// `FuncKey → NodeIndex` for quick lookup. + #[allow(dead_code)] // used for future topo-ordered analysis and call-graph queries + pub index: HashMap<FuncKey, NodeIndex>, + /// Callee strings that could not be resolved to any [`FuncKey`]. + pub unresolved_not_found: Vec<UnresolvedCallee>, + /// Callee strings that matched multiple candidates. + pub unresolved_ambiguous: Vec<AmbiguousCallee>, +} + +/// Result of SCC / topological analysis on the call graph. +pub struct CallGraphAnalysis { + /// Strongly connected components. + pub sccs: Vec<Vec<NodeIndex>>, + /// Maps each `NodeIndex` to its SCC index in [`sccs`].
+ #[allow(dead_code)] // used for future topo-ordered taint propagation + pub node_to_scc: HashMap<NodeIndex, usize>, + /// SCC indices in **callee-first** (leaves-first) order. + /// + /// Functions with no callees appear first; callers appear later. + /// Suitable for bottom-up taint propagation. + #[allow(dead_code)] // used for future topo-ordered taint propagation + pub topo_scc_callee_first: Vec<usize>, +} + +// ───────────────────────────────────────────────────────────────────────────── +// Callee-name normalization +// ───────────────────────────────────────────────────────────────────────────── + +/// Extract the last segment of a qualified callee name for resolution. +/// +/// ```text +/// "env::var" → "var" +/// "std::process::Command" → "Command" +/// "obj.method" → "method" +/// "pkg.mod.func" → "func" +/// "foo" → "foo" (unchanged) +/// "" → "" (edge case) +/// ``` +/// +/// The original raw text is preserved on [`CallEdge::call_site`] for +/// diagnostics; this function only produces the lookup key. +pub(crate) fn normalize_callee_name(raw: &str) -> &str { + // Split on "::" first (Rust-style qualification), take last segment. + let after_colons = raw.rsplit("::").next().unwrap_or(raw); + // Then split on "." (method calls, Python/JS dotted paths), take last segment. + after_colons.rsplit('.').next().unwrap_or(after_colons) +} + +// ───────────────────────────────────────────────────────────────────────────── +// Call-graph construction +// ───────────────────────────────────────────────────────────────────────────── + +/// Build the whole-program call graph from merged summaries. +/// +/// Resolution mirrors `GlobalSummaries::resolve_callee_key`: +/// 1. Normalize callee name (last segment after `::` or `.`) +/// 2. Same-language, arity-filtered, namespace-disambiguated lookup +/// 3. Interop edges (explicit cross-language bridges) +/// +/// Unresolved and ambiguous callees are recorded for diagnostics but +/// do **not** create edges.
+pub fn build_call_graph(summaries: &GlobalSummaries, interop_edges: &[InteropEdge]) -> CallGraph { + let mut graph = DiGraph::new(); + let mut index = HashMap::new(); + + // 1. Create one node per FuncKey. + for (key, _) in summaries.iter() { + let idx = graph.add_node(key.clone()); + index.insert(key.clone(), idx); + } + + let mut unresolved_not_found = Vec::new(); + let mut unresolved_ambiguous = Vec::new(); + + // 2. Resolve callees and add edges. + for (caller_key, summary) in summaries.iter() { + let caller_node = index[caller_key]; + + for raw_callee in &summary.callees { + let normalized = normalize_callee_name(raw_callee); + + match summaries.resolve_callee_key( + normalized, + caller_key.lang, + &caller_key.namespace, + None, + ) { + CalleeResolution::Resolved(target_key) => { + if let Some(&target_node) = index.get(&target_key) { + graph.add_edge( + caller_node, + target_node, + CallEdge { + call_site: raw_callee.clone(), + }, + ); + } + } + CalleeResolution::NotFound => { + // Try interop edges before recording as not-found. + if let Some(target_key) = + resolve_via_interop(raw_callee, caller_key, interop_edges) + && let Some(&target_node) = index.get(&target_key) + { + graph.add_edge( + caller_node, + target_node, + CallEdge { + call_site: raw_callee.clone(), + }, + ); + continue; + } + unresolved_not_found.push(UnresolvedCallee { + caller: caller_key.clone(), + callee_name: raw_callee.clone(), + }); + } + CalleeResolution::Ambiguous(candidates) => { + unresolved_ambiguous.push(AmbiguousCallee { + caller: caller_key.clone(), + callee_name: raw_callee.clone(), + candidates, + }); + } + } + } + } + + CallGraph { + graph, + index, + unresolved_not_found, + unresolved_ambiguous, + } +} + +/// Check interop edges for a matching cross-language bridge. 
+fn resolve_via_interop( + raw_callee: &str, + caller_key: &FuncKey, + interop_edges: &[InteropEdge], +) -> Option<FuncKey> { + for edge in interop_edges { + if edge.from.caller_lang == caller_key.lang + && edge.from.caller_namespace == caller_key.namespace + && edge.from.callee_symbol == raw_callee + && (edge.from.caller_func.is_empty() || edge.from.caller_func == caller_key.name) + { + return Some(edge.to.clone()); + } + } + None +} + +// ───────────────────────────────────────────────────────────────────────────── +// SCC / topological analysis +// ───────────────────────────────────────────────────────────────────────────── + +/// Compute SCC decomposition and topological ordering of the call graph. +/// +/// `petgraph::algo::tarjan_scc` returns SCCs in *reverse* topological order +/// of the condensation DAG — i.e. leaf SCCs (no outgoing cross-SCC edges) +/// come **first**. That is exactly the **callee-first** order suitable for +/// bottom-up taint propagation. +pub fn analyse(cg: &CallGraph) -> CallGraphAnalysis { + let sccs = petgraph::algo::tarjan_scc(&cg.graph); + + let mut node_to_scc = HashMap::with_capacity(cg.graph.node_count()); + for (scc_idx, scc) in sccs.iter().enumerate() { + for &node in scc { + node_to_scc.insert(node, scc_idx); + } + } + + // tarjan_scc already gives callee-first ordering. + let topo_scc_callee_first: Vec<usize> = (0..sccs.len()).collect(); + + CallGraphAnalysis { + sccs, + node_to_scc, + topo_scc_callee_first, + } +} + +// ───────────────────────────────────────────────────────────────────────────── +// Tests +// ───────────────────────────────────────────────────────────────────────────── + +#[cfg(test)] +mod tests { + use super::*; + use crate::interop::CallSiteKey; + use crate::summary::{FuncSummary, merge_summaries}; + use crate::symbol::Lang; + + /// Helper to create a minimal FuncSummary.
+ fn make_summary( + name: &str, + file_path: &str, + lang: &str, + param_count: usize, + callees: Vec<&str>, + ) -> FuncSummary { + FuncSummary { + name: name.into(), + file_path: file_path.into(), + lang: lang.into(), + param_count, + param_names: vec![], + source_caps: 0, + sanitizer_caps: 0, + sink_caps: 0, + propagates_taint: false, + tainted_sink_params: vec![], + callees: callees.into_iter().map(String::from).collect(), + } + } + + // ── normalize_callee_name ──────────────────────────────────────────── + + #[test] + fn normalize_callee_basic() { + assert_eq!(normalize_callee_name("env::var"), "var"); + assert_eq!(normalize_callee_name("std::process::Command"), "Command"); + assert_eq!(normalize_callee_name("obj.method"), "method"); + assert_eq!(normalize_callee_name("pkg.mod.func"), "func"); + assert_eq!(normalize_callee_name("foo"), "foo"); + assert_eq!(normalize_callee_name(""), ""); + } + + // ── same name, different Rust modules ──────────────────────────────── + + #[test] + fn same_name_different_rust_modules() { + let helper_a = make_summary("helper", "src/a.rs", "rust", 0, vec![]); + let helper_b = make_summary("helper", "src/b.rs", "rust", 0, vec![]); + let caller = make_summary("caller", "src/a.rs", "rust", 0, vec!["helper"]); + + let gs = merge_summaries(vec![helper_a, helper_b, caller], None); + let cg = build_call_graph(&gs, &[]); + + // Two helper nodes + one caller node = 3 nodes + assert_eq!(cg.graph.node_count(), 3); + + // Caller is in src/a.rs, so "helper" resolves to src/a.rs::helper + let caller_key = FuncKey { + lang: Lang::Rust, + namespace: "src/a.rs".into(), + name: "caller".into(), + arity: Some(0), + }; + let helper_a_key = FuncKey { + lang: Lang::Rust, + namespace: "src/a.rs".into(), + name: "helper".into(), + arity: Some(0), + }; + + let caller_node = cg.index[&caller_key]; + let helper_a_node = cg.index[&helper_a_key]; + + // Exactly one edge: caller → helper_a + let edges: Vec<_> = cg + .graph + .edges(caller_node) + 
.filter(|e| e.target() == helper_a_node) + .collect(); + assert_eq!(edges.len(), 1); + assert!(cg.unresolved_not_found.is_empty()); + assert!(cg.unresolved_ambiguous.is_empty()); + } + + // ── same name, Python vs Rust ──────────────────────────────────────── + + #[test] + fn same_name_python_and_rust() { + let py_foo = make_summary("foo", "handler.py", "python", 0, vec![]); + let rs_foo = make_summary("foo", "handler.rs", "rust", 0, vec![]); + // Python caller calls "foo" — should only see the Python one + let py_caller = make_summary("main", "app.py", "python", 0, vec!["foo"]); + + let gs = merge_summaries(vec![py_foo, rs_foo, py_caller], None); + let cg = build_call_graph(&gs, &[]); + + assert_eq!(cg.graph.node_count(), 3); + + let py_foo_key = FuncKey { + lang: Lang::Python, + namespace: "handler.py".into(), + name: "foo".into(), + arity: Some(0), + }; + let caller_key = FuncKey { + lang: Lang::Python, + namespace: "app.py".into(), + name: "main".into(), + arity: Some(0), + }; + + let caller_node = cg.index[&caller_key]; + let py_foo_node = cg.index[&py_foo_key]; + + // Edge goes to Python foo, not Rust foo + let edges: Vec<_> = cg.graph.edges(caller_node).collect(); + assert_eq!(edges.len(), 1); + assert_eq!(edges[0].target(), py_foo_node); + } + + // ── arity differences → separate nodes ─────────────────────────────── + + #[test] + fn arity_differences_separate_nodes() { + let helper1 = make_summary("helper", "lib.rs", "rust", 1, vec![]); + let helper2 = make_summary("helper", "lib.rs", "rust", 2, vec![]); + + let gs = merge_summaries(vec![helper1, helper2], None); + let cg = build_call_graph(&gs, &[]); + + // Two separate nodes (different arity → different FuncKey) + assert_eq!(cg.graph.node_count(), 2); + + let key1 = FuncKey { + lang: Lang::Rust, + namespace: "lib.rs".into(), + name: "helper".into(), + arity: Some(1), + }; + let key2 = FuncKey { + lang: Lang::Rust, + namespace: "lib.rs".into(), + name: "helper".into(), + arity: Some(2), + }; + 
assert!(cg.index.contains_key(&key1)); + assert!(cg.index.contains_key(&key2)); + } + + // ── recursive SCC detection ────────────────────────────────────────── + + #[test] + fn recursive_scc_detection() { + let a = make_summary("a", "lib.rs", "rust", 0, vec!["b"]); + let b = make_summary("b", "lib.rs", "rust", 0, vec!["a"]); + + let gs = merge_summaries(vec![a, b], None); + let cg = build_call_graph(&gs, &[]); + + assert_eq!(cg.graph.edge_count(), 2); // a→b and b→a + + let analysis = analyse(&cg); + + // Both nodes should be in the same SCC + let key_a = FuncKey { + lang: Lang::Rust, + namespace: "lib.rs".into(), + name: "a".into(), + arity: Some(0), + }; + let key_b = FuncKey { + lang: Lang::Rust, + namespace: "lib.rs".into(), + name: "b".into(), + arity: Some(0), + }; + + let scc_a = analysis.node_to_scc[&cg.index[&key_a]]; + let scc_b = analysis.node_to_scc[&cg.index[&key_b]]; + assert_eq!(scc_a, scc_b); + assert_eq!(analysis.sccs[scc_a].len(), 2); + } + + // ── unresolved callee → recorded as not found ──────────────────────── + + #[test] + fn unresolved_callee_recorded_as_not_found() { + let caller = make_summary("caller", "lib.rs", "rust", 0, vec!["nonexistent"]); + + let gs = merge_summaries(vec![caller], None); + let cg = build_call_graph(&gs, &[]); + + assert_eq!(cg.graph.edge_count(), 0); + assert_eq!(cg.unresolved_not_found.len(), 1); + assert_eq!(cg.unresolved_not_found[0].callee_name, "nonexistent"); + assert!(cg.unresolved_ambiguous.is_empty()); + } + + // ── ambiguous callee → recorded as ambiguous ───────────────────────── + + #[test] + fn ambiguous_callee_recorded() { + // Two "helper" functions in different namespaces. + let helper_a = make_summary("helper", "a.rs", "rust", 0, vec![]); + let helper_b = make_summary("helper", "b.rs", "rust", 0, vec![]); + // Caller is in a THIRD namespace, so neither is preferred. 
+ let caller = make_summary("caller", "c.rs", "rust", 0, vec!["helper"]); + + let gs = merge_summaries(vec![helper_a, helper_b, caller], None); + let cg = build_call_graph(&gs, &[]); + + assert_eq!(cg.graph.edge_count(), 0); // no edge — ambiguous + assert!(cg.unresolved_not_found.is_empty()); + assert_eq!(cg.unresolved_ambiguous.len(), 1); + assert_eq!(cg.unresolved_ambiguous[0].callee_name, "helper"); + assert_eq!(cg.unresolved_ambiguous[0].candidates.len(), 2); + } + + // ── diamond topo order (callee-first) ──────────────────────────────── + + #[test] + fn diamond_topo_callee_first() { + // A → B, A → C, B → D, C → D + let d = make_summary("d", "lib.rs", "rust", 0, vec![]); + let b = make_summary("b", "lib.rs", "rust", 0, vec!["d"]); + let c = make_summary("c", "lib.rs", "rust", 0, vec!["d"]); + let a = make_summary("a", "lib.rs", "rust", 0, vec!["b", "c"]); + + let gs = merge_summaries(vec![a, b, c, d], None); + let cg = build_call_graph(&gs, &[]); + + assert_eq!(cg.graph.node_count(), 4); + + let analysis = analyse(&cg); + + let key = |name: &str| FuncKey { + lang: Lang::Rust, + namespace: "lib.rs".into(), + name: name.into(), + arity: Some(0), + }; + + let scc_of = |name: &str| analysis.node_to_scc[&cg.index[&key(name)]]; + let topo_pos = |name: &str| { + analysis + .topo_scc_callee_first + .iter() + .position(|&s| s == scc_of(name)) + .unwrap() + }; + + // D (leaf) must come before B and C, which must come before A (root). 
+ assert!(topo_pos("d") < topo_pos("b")); + assert!(topo_pos("d") < topo_pos("c")); + assert!(topo_pos("b") < topo_pos("a")); + assert!(topo_pos("c") < topo_pos("a")); + } + + // ── interop edge resolution ────────────────────────────────────────── + + #[test] + fn interop_edge_resolution() { + let py_caller = make_summary("process", "handler.py", "python", 0, vec!["js_func"]); + let js_target = make_summary("js_func", "util.js", "javascript", 1, vec![]); + + let gs = merge_summaries(vec![py_caller, js_target], None); + + let interop = vec![InteropEdge { + from: CallSiteKey { + caller_lang: Lang::Python, + caller_namespace: "handler.py".into(), + caller_func: String::new(), // wildcard + callee_symbol: "js_func".into(), + ordinal: 0, + }, + to: FuncKey { + lang: Lang::JavaScript, + namespace: "util.js".into(), + name: "js_func".into(), + arity: Some(1), + }, + arg_map: vec![], + ret_taints: false, + }]; + + let cg = build_call_graph(&gs, &interop); + + let caller_key = FuncKey { + lang: Lang::Python, + namespace: "handler.py".into(), + name: "process".into(), + arity: Some(0), + }; + let target_key = FuncKey { + lang: Lang::JavaScript, + namespace: "util.js".into(), + name: "js_func".into(), + arity: Some(1), + }; + + let caller_node = cg.index[&caller_key]; + let target_node = cg.index[&target_key]; + + let edges: Vec<_> = cg + .graph + .edges(caller_node) + .filter(|e| e.target() == target_node) + .collect(); + assert_eq!(edges.len(), 1); + assert!(cg.unresolved_not_found.is_empty()); + } + + // ── namespace normalization consistency ─────────────────────────────── + + #[test] + fn namespace_normalization_consistency() { + // FuncSummary::func_key with a scan root produces the same namespace + // string that would be used as caller_namespace in resolution. 
+ let summary = FuncSummary { + name: "my_func".into(), + file_path: "/home/user/proj/src/lib.rs".into(), + lang: "rust".into(), + param_count: 0, + param_names: vec![], + source_caps: 0, + sanitizer_caps: 0, + sink_caps: 0, + propagates_taint: false, + tainted_sink_params: vec![], + callees: vec![], + }; + + let root = "/home/user/proj"; + let key = summary.func_key(Some(root)); + + // The namespace in the key must be the same as what normalize_namespace produces + let expected_ns = crate::symbol::normalize_namespace(&summary.file_path, Some(root)); + assert_eq!(key.namespace, expected_ns); + assert_eq!(key.namespace, "src/lib.rs"); + } + + // ── raw call_site preserved on edge ────────────────────────────────── + + #[test] + fn raw_call_site_preserved_on_edge() { + // Callee "env::var" normalizes to "var" for resolution, but + // the edge should retain the original raw text. + let source = make_summary("var", "util.rs", "rust", 0, vec![]); + let caller = make_summary("main", "util.rs", "rust", 0, vec!["env::var"]); + + let gs = merge_summaries(vec![source, caller], None); + let cg = build_call_graph(&gs, &[]); + + let caller_key = FuncKey { + lang: Lang::Rust, + namespace: "util.rs".into(), + name: "main".into(), + arity: Some(0), + }; + let caller_node = cg.index[&caller_key]; + + let edges: Vec<_> = cg.graph.edges(caller_node).collect(); + assert_eq!(edges.len(), 1); + // Raw call_site preserved, not the normalized "var" + assert_eq!(edges[0].weight().call_site, "env::var"); + } +} diff --git a/src/cfg.rs b/src/cfg.rs index 329cf94e..72e4f0b2 100644 --- a/src/cfg.rs +++ b/src/cfg.rs @@ -32,6 +32,9 @@ pub enum EdgeKind { Back, // back‑edge that closes a loop } +/// Maximum number of identifiers to store from a condition expression. 
+const MAX_COND_VARS: usize = 8; + #[derive(Debug, Clone)] pub struct NodeInfo { pub kind: StmtKind, @@ -44,6 +47,12 @@ pub struct NodeInfo { pub enclosing_func: Option, /// Per-function call ordinal (0-based, only meaningful for Call nodes). pub call_ordinal: u32, + /// For If nodes: raw condition text (truncated to 128 chars). None for non-If nodes. + pub condition_text: Option, + /// For If nodes: identifiers referenced in the condition (sorted, deduped, max 8). + pub condition_vars: Vec, + /// For If nodes: whether the condition has a leading negation (`!` / `not`). + pub condition_negated: bool, } /// Intra‑file function summary with graph‑local node indices. @@ -122,6 +131,7 @@ fn first_call_ident<'a>(n: Node<'a>, lang: &str, code: &'a [u8]) -> Option { let func = c @@ -155,6 +165,65 @@ fn first_call_ident<'a>(n: Node<'a>, lang: &str, code: &'a [u8]) -> Option( + n: Node<'a>, + lang: &str, + code: &'a [u8], + extra: Option<&[crate::labels::RuntimeLabelRule]>, +) -> Option<(String, DataLabel)> { + let mut cursor = n.walk(); + for c in n.children(&mut cursor) { + match lookup(lang, c.kind()) { + Kind::CallFn | Kind::CallMethod | Kind::CallMacro => { + let ident = match lookup(lang, c.kind()) { + Kind::CallFn => c + .child_by_field_name("function") + .or_else(|| c.child_by_field_name("method")) + .or_else(|| c.child_by_field_name("name")) + .or_else(|| c.child_by_field_name("type")) + .and_then(|f| text_of(f, code)), + Kind::CallMethod => { + let func = c + .child_by_field_name("method") + .or_else(|| c.child_by_field_name("name")) + .and_then(|f| text_of(f, code)); + let recv = c + .child_by_field_name("object") + .or_else(|| c.child_by_field_name("receiver")) + .and_then(|f| root_receiver_text(f, lang, code)); + match (recv, func) { + (Some(r), Some(f)) => Some(format!("{r}.{f}")), + (_, Some(f)) => Some(f), + _ => None, + } + } + Kind::CallMacro => c + .child_by_field_name("macro") + .and_then(|f| text_of(f, code)), + _ => None, + }; + if let Some(ref id) = 
ident + && let Some(lbl) = classify(lang, id, extra) + { + return Some((id.clone(), lbl)); + } + // Recurse into arguments of this call + if let Some(found) = find_classifiable_inner_call(c, lang, code, extra) { + return Some(found); + } + } + _ => { + if let Some(found) = find_classifiable_inner_call(c, lang, code, extra) { + return Some(found); + } + } + } + } + None +} + /// Build the dot-joined text of a member_expression / attribute / selector_expression. /// E.g. for `process.env.CMD` this returns `"process.env.CMD"`. fn member_expr_text(n: Node, code: &[u8]) -> Option { @@ -209,6 +278,25 @@ fn first_member_label( } } } + // PHP/Python/Ruby subscript access: `$_GET['cmd']`, `os.environ['KEY']`, `params[:cmd]` + // Try to classify the object (before the `[`) as a source. + "subscript_expression" | "subscript" | "element_reference" => { + if let Some(obj) = n + .child_by_field_name("object") + .or_else(|| n.child_by_field_name("value")) + .or_else(|| n.child(0)) + { + if let Some(txt) = text_of(obj, code) + && let Some(lbl) = classify(lang, &txt, extra_labels) + { + return Some(lbl); + } + // Recurse into the object for nested member accesses + if let Some(lbl) = first_member_label(obj, lang, code, extra_labels) { + return Some(lbl); + } + } + } _ => {} } let mut cursor = n.walk(); @@ -224,6 +312,11 @@ fn first_member_label( fn first_member_text(n: Node, code: &[u8]) -> Option { match n.kind() { "member_expression" | "attribute" | "selector_expression" => member_expr_text(n, code), + "subscript_expression" | "subscript" | "element_reference" => n + .child_by_field_name("object") + .or_else(|| n.child_by_field_name("value")) + .or_else(|| n.child(0)) + .and_then(|obj| text_of(obj, code)), _ => { let mut cursor = n.walk(); for child in n.children(&mut cursor) { @@ -237,6 +330,42 @@ fn first_member_text(n: Node, code: &[u8]) -> Option { } /// Check whether any descendant of `n` is a call expression. 
+/// Collect function-expression nodes nested inside a call's arguments. +/// +/// This finds anonymous functions / arrow functions / closures that are +/// passed as arguments to a call and should be analysed as separate +/// function scopes. Only direct function-argument children are collected +/// (not functions nested inside other functions — those get handled when +/// the outer function is recursed into). +fn collect_nested_function_nodes<'a>(n: Node<'a>, lang: &str) -> Vec> { + let mut funcs = Vec::new(); + collect_nested_functions_rec(n, lang, &mut funcs, false); + funcs +} + +fn collect_nested_functions_rec<'a>( + n: Node<'a>, + lang: &str, + out: &mut Vec>, + inside_function: bool, +) { + let kind = lookup(lang, n.kind()); + // Only treat as a function if it's a real function node (has children), + // not a keyword token like `function` in JS which shares the same kind name. + if kind == Kind::Function && n.child_count() > 0 { + if inside_function { + // Don't recurse into nested functions of nested functions + return; + } + out.push(n); + return; + } + let mut cursor = n.walk(); + for c in n.children(&mut cursor) { + collect_nested_functions_rec(c, lang, out, inside_function); + } +} + fn has_call_descendant(n: Node, lang: &str) -> bool { let mut cursor = n.walk(); for c in n.children(&mut cursor) { @@ -361,6 +490,36 @@ fn def_use(ast: Node, lang: &str, code: &[u8]) -> (Option, Vec) (defs, uses) } + // if‑let / while‑let — the `let_condition` binds a variable from + // the value expression. E.g. `if let Ok(cmd) = env::var("CMD")` + // defines `cmd` and uses `env`, `var`, `CMD`. + Kind::If | Kind::While => { + let cond = ast.child_by_field_name("condition"); + if let Some(c) = cond + && c.kind() == "let_condition" + { + let mut defs = None; + let mut uses = Vec::new(); + + if let Some(pat) = c.child_by_field_name("pattern") { + let mut tmp = Vec::::new(); + collect_idents(pat, code, &mut tmp); + // The first plain identifier in the pattern is the binding. 
+ // Skip type identifiers (e.g. "Ok" in Ok(cmd)) — take the + // last ident which is the inner binding name. + defs = tmp.into_iter().last(); + } + if let Some(val) = c.child_by_field_name("value") { + collect_idents(val, code, &mut uses); + } + return (defs, uses); + } + + let mut uses = Vec::new(); + collect_idents(ast, code, &mut uses); + (None, uses) + } + // everything else – no definition, but may read vars _ => { let mut uses = Vec::new(); @@ -370,6 +529,109 @@ fn def_use(ast: Node, lang: &str, code: &[u8]) -> (Option<String>, Vec<String>) } } +/// Extract raw condition metadata from an If AST node. +/// +/// Returns `(condition_text, condition_vars, condition_negated)`. +/// The condition subtree is located via `child_by_field_name("condition")` +/// for most languages, with a positional fallback for Rust `if_expression`. +/// +/// Negation is detected by checking for a leading unary `!` operator or +/// `not` keyword. Variables are sorted, deduped, and capped at +/// [`MAX_COND_VARS`]. +fn extract_condition_raw<'a>( + ast: Node<'a>, + lang: &str, + code: &'a [u8], +) -> (Option<String>, Vec<String>, bool) { + // 1. Find the condition subtree. + let cond_node = ast.child_by_field_name("condition").or_else(|| { + // Rust `if_expression` uses positional children: the condition is + // the first child that is not a keyword, block, or `let` pattern. + let mut cursor = ast.walk(); + ast.children(&mut cursor).find(|c| { + let k = c.kind(); + !matches!(lookup(lang, k), Kind::Block | Kind::Trivia) + && k != "if" + && k != "else" + && k != "let" + && k != "{" + && k != "}" + && k != "(" + && k != ")" + }) + }); + + let Some(cond) = cond_node else { + return (None, Vec::new(), false); + }; + + // 2. Detect leading negation (`!expr`, `not expr`, Ruby `unless`). + let (inner, negated) = detect_negation(cond, ast, lang); + + // 3. Collect identifiers from the (inner) condition subtree.
+ let mut vars = Vec::new(); + collect_idents(inner, code, &mut vars); + vars.sort(); + vars.dedup(); + vars.truncate(MAX_COND_VARS); + + // 4. Extract text, truncated. + let text = text_of(cond, code).map(|t| { + if t.len() > 128 { + t[..128].to_string() + } else { + t + } + }); + + (text, vars, negated) +} + +/// Detect leading negation and return the inner expression. +/// +/// Handles: +/// - `!expr` (unary_expression / prefix_unary_expression with `!` operator) +/// - `not expr` (Python `not_operator`, Ruby) +/// - Ruby `unless` (the whole If node kind is `unless`) +fn detect_negation<'a>(cond: Node<'a>, if_ast: Node<'a>, _lang: &str) -> (Node<'a>, bool) { + // Ruby `unless` is mapped to Kind::If but is semantically negated. + if if_ast.kind() == "unless" { + return (cond, true); + } + + // `!expr` appears as unary_expression, not_operator, or prefix_unary_expression + // with a `!` or `not` operator child. + let is_negation_wrapper = matches!( + cond.kind(), + "unary_expression" | "not_operator" | "prefix_unary_expression" | "unary_not" + ); + + if is_negation_wrapper { + // Check if the first child is a `!` or `not` operator. + let has_not = cond + .child(0) + .is_some_and(|c| c.kind() == "!" || c.kind() == "not"); + + if has_not { + // Return the operand (inner expression after the `!` / `not`). + let inner = cond + .child_by_field_name("argument") + .or_else(|| cond.child_by_field_name("operand")) + .or_else(|| { + // Last non-operator child. + let mut cursor = cond.walk(); + cond.children(&mut cursor) + .filter(|c| c.kind() != "!" && c.kind() != "not") + .last() + }) + .unwrap_or(cond); + return (inner, true); + } + } + + (cond, false) +} + /// Create a node in one short borrow and optionally attach a taint label. 
#[allow(clippy::too_many_arguments)] fn push_node<'a>( @@ -391,6 +653,7 @@ fn push_node<'a>( .child_by_field_name("function") .or_else(|| ast.child_by_field_name("method")) .or_else(|| ast.child_by_field_name("name")) + .or_else(|| ast.child_by_field_name("type")) .and_then(|n| text_of(n, code)) .unwrap_or_default(), @@ -426,7 +689,7 @@ fn push_node<'a>( // the whole line. if matches!( lookup(lang, ast.kind()), - Kind::CallWrapper | Kind::Assignment + Kind::CallWrapper | Kind::Assignment | Kind::Return ) && let Some(inner) = first_call_ident(ast, lang, code) { text = inner; @@ -437,6 +700,20 @@ fn push_node<'a>( let extra = analysis_rules.map(|r| r.extra_labels.as_slice()); let mut label = classify(lang, &text, extra); + // If the outermost call didn't classify, try inner/nested calls. + // E.g. `str(eval(expr))` — `str` is not a sink, but `eval` is. + if label.is_none() + && matches!( + lookup(lang, ast.kind()), + Kind::CallWrapper | Kind::Assignment | Kind::Return + ) + && let Some((inner_text, inner_label)) = + find_classifiable_inner_call(ast, lang, code, extra) + { + label = Some(inner_label); + text = inner_text; + } + // For assignments like `element.innerHTML = value`, the inner-call heuristic // above may have overridden `text` with a call on the RHS (e.g. getElementById). // If that didn't produce a label, check the LHS property name — it may be a @@ -493,18 +770,49 @@ fn push_node<'a>( } } + // For `if let` / `while let` patterns: try to classify the value expression + // in the let-condition as a source/sink. E.g. `if let Ok(cmd) = env::var("CMD")` + // should recognise `env::var` as a taint source and label this node accordingly. 
+ if label.is_none() + && matches!(lookup(lang, ast.kind()), Kind::If | Kind::While) + && let Some(cond) = ast.child_by_field_name("condition") + && cond.kind() == "let_condition" + && let Some(val) = cond.child_by_field_name("value") + { + if let Some(ident) = first_call_ident(val, lang, code) + && let Some(l) = classify(lang, &ident, extra) + { + label = Some(l); + text = ident; + } + if label.is_none() + && let Some(ident_text) = text_of(val, code) + && let Some(l) = classify(lang, &ident_text, extra) + { + label = Some(l); + text = ident_text; + } + } + let span = (ast.start_byte(), ast.end_byte()); /* ── 3. GRAPH INSERTION + DEBUG ──────────────────────────────────── */ let (defines, uses) = def_use(ast, lang, code); - let callee = if kind == StmtKind::Call { + let callee = if kind == StmtKind::Call || label.is_some() { Some(text.clone()) } else { None }; + // Extract condition metadata for If nodes. + let (condition_text, condition_vars, condition_negated) = if kind == StmtKind::If { + extract_condition_raw(ast, lang, code) + } else { + (None, Vec::new(), false) + }; + let idx = g.add_node(NodeInfo { kind, span, @@ -514,6 +822,9 @@ fn push_node<'a>( callee, enclosing_func: enclosing_func.map(|s| s.to_string()), call_ordinal, + condition_text, + condition_vars, + condition_negated, }); debug!( @@ -717,19 +1028,27 @@ fn build_sub<'a>( } exits } else { - // No explicit else → if the then-branch falls through - // (non-empty exits), the false branch merges with those exits. - // If the then-branch terminates (break/return/continue → - // empty exits), the false branch flows from the condition - // to whatever comes next. - if then_exits.is_empty() { - vec![cond] - } else { - if let Some(&first) = then_exits.first() { - connect_all(g, &[cond], first, EdgeKind::False); - } - then_exits.clone() - } + // No explicit else → create a synthetic pass-through node + // for the false path. 
This avoids routing the False edge + // to a then-block exit (which would make it appear that the + // false path goes *through* the then-block) and gives + // path-sensitive analysis an explicit False edge to record + // predicates on. + let pass = g.add_node(NodeInfo { + kind: StmtKind::Seq, + span: (ast.end_byte(), ast.end_byte()), + label: None, + defines: None, + uses: Vec::new(), + callee: None, + enclosing_func: enclosing_func.map(|s| s.to_string()), + call_ordinal: 0, + condition_text: None, + condition_vars: Vec::new(), + condition_negated: false, + }); + connect_all(g, &[cond], pass, EdgeKind::False); + vec![pass] }; // Frontier = union of both branches @@ -995,7 +1314,7 @@ fn build_sub<'a>( collect_idents(n, code, &mut tmp); tmp.into_iter().next() }) - .unwrap_or_else(|| "".to_string()); + .unwrap_or_else(|| format!("", ast.start_byte())); let entry_idx = push_node( g, StmtKind::Seq, @@ -1016,7 +1335,20 @@ fn build_sub<'a>( // Snapshot the current node count so we can iterate only over nodes // created within this function (avoids O(N²) scan of the full graph). let fn_first_node: NodeIndex = NodeIndex::new(g.node_count()); - let body = ast.child_by_field_name("body").expect("fn w/o body"); + let body = ast.child_by_field_name("body").unwrap_or_else(|| { + // Some function expressions (e.g. JS anonymous `function(…) { … }`) + // don't have a named "body" field — find the first block child. 
+ let mut c = ast.walk(); + ast.children(&mut c) + .find(|n| matches!(lookup(lang, n.kind()), Kind::Block | Kind::SourceFile)) + .unwrap_or_else(|| { + panic!( + "fn w/o body: kind={} text='{}'", + ast.kind(), + text_of(ast, code).unwrap_or_default() + ) + }) + }); let mut fn_call_ordinal: u32 = 0; let mut fn_breaks = Vec::new(); let mut fn_continues = Vec::new(); @@ -1191,6 +1523,9 @@ fn build_sub<'a>( callee: None, enclosing_func: Some(fn_name.clone()), call_ordinal: 0, + condition_text: None, + condition_vars: Vec::new(), + condition_negated: false, }); // Wire body exits (fall-through) to the exit node. for &b in &body_exits { @@ -1300,6 +1635,28 @@ fn build_sub<'a>( { return Vec::new(); } + + // Recurse into any function expressions nested in arguments + // (e.g. `app.get('/path', function(req, res) { ... })`) + // so that they get proper function summaries. + let nested = collect_nested_function_nodes(ast, lang); + for func_node in nested { + build_sub( + func_node, + &[node], + g, + lang, + code, + summaries, + file_path, + enclosing_func, + call_ordinal, + analysis_rules, + break_targets, + continue_targets, + ); + } + vec![node] } @@ -1326,6 +1683,26 @@ fn build_sub<'a>( { return Vec::new(); } + + // Recurse into any function expressions nested in arguments + let nested = collect_nested_function_nodes(ast, lang); + for func_node in nested { + build_sub( + func_node, + &[n], + g, + lang, + code, + summaries, + file_path, + enclosing_func, + call_ordinal, + analysis_rules, + break_targets, + continue_targets, + ); + } + vec![n] } @@ -1412,6 +1789,9 @@ pub(crate) fn build_cfg<'a>( callee: None, enclosing_func: None, call_ordinal: 0, + condition_text: None, + condition_vars: Vec::new(), + condition_negated: false, }); let exit = g.add_node(NodeInfo { kind: StmtKind::Exit, @@ -1422,6 +1802,9 @@ pub(crate) fn build_cfg<'a>( callee: None, enclosing_func: None, call_ordinal: 0, + condition_text: None, + condition_vars: Vec::new(), + condition_negated: false, }); 
// Build the body below the synthetic ENTRY. diff --git a/src/cfg_analysis/mod.rs b/src/cfg_analysis/mod.rs index 2e513790..eff804d2 100644 --- a/src/cfg_analysis/mod.rs +++ b/src/cfg_analysis/mod.rs @@ -33,7 +33,6 @@ pub struct CfgFinding { pub severity: Severity, pub confidence: Confidence, pub span: (usize, usize), - #[allow(dead_code)] pub message: String, pub evidence: Vec, pub score: Option, diff --git a/src/cfg_analysis/tests.rs b/src/cfg_analysis/tests.rs index ae223b03..35f8952a 100644 --- a/src/cfg_analysis/tests.rs +++ b/src/cfg_analysis/tests.rs @@ -681,6 +681,8 @@ fn taint_and_unguarded_sink_deduped() { source: entry, path: vec![entry, sink_node], source_kind: crate::labels::SourceKind::UserInput, + path_validated: false, + guard_kind: None, }]; let findings = parse_and_run_all_with_taint( diff --git a/src/cli.rs b/src/cli.rs index a8957911..467671c4 100644 --- a/src/cli.rs +++ b/src/cli.rs @@ -1,4 +1,4 @@ -use clap::{Parser, Subcommand}; +use clap::{Parser, Subcommand, ValueEnum}; #[derive(Parser)] #[command(name = "nyx")] @@ -13,10 +13,55 @@ impl Commands { /// Whether this command produces structured (machine-readable) output on /// stdout, meaning human status messages must be suppressed entirely. pub fn is_structured_output(&self) -> bool { - matches!(self, Commands::Scan { format, .. } if format == "json" || format == "sarif") + matches!(self, Commands::Scan { format, .. } if *format == OutputFormat::Json || *format == OutputFormat::Sarif) } } +/// Output format for scan results. +#[derive(Debug, Copy, Clone, PartialEq, Eq, ValueEnum, Default)] +pub enum OutputFormat { + #[default] + Console, + Json, + Sarif, +} + +impl std::fmt::Display for OutputFormat { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + match self { + OutputFormat::Console => write!(f, "console"), + OutputFormat::Json => write!(f, "json"), + OutputFormat::Sarif => write!(f, "sarif"), + } + } +} + +/// Index mode for scan operations. 
+#[derive(Debug, Copy, Clone, PartialEq, Eq, ValueEnum, Default)] +pub enum IndexMode { + /// Use index if available, build if missing (default) + #[default] + Auto, + /// Skip indexing entirely, scan filesystem directly + Off, + /// Force rebuild index before scanning + Rebuild, +} + +/// Analysis mode for scan operations. +#[derive(Debug, Copy, Clone, PartialEq, Eq, ValueEnum, Default)] +pub enum ScanMode { + /// Run all analyses: AST patterns + CFG + taint (default) + #[default] + Full, + /// Run AST pattern queries only (no CFG/taint) + Ast, + /// Run CFG structural analyses + taint only (no AST patterns) + Cfg, + /// Alias for cfg (CFG + taint analysis) + Taint, +} + #[derive(Subcommand)] pub enum Commands { /// Scan project for vulnerabilities @@ -25,35 +70,118 @@ pub enum Commands { #[arg(default_value = ".")] path: String, - /// Skip using/building index, scan directly - #[arg(long)] - no_index: bool, + /// Index mode: auto (default), off (no index), rebuild (force rebuild) + #[arg(long, value_enum, default_value_t = IndexMode::Auto)] + index: IndexMode, - /// Force rebuild index before scanning - #[arg(long)] - rebuild_index: bool, + /// Output format + #[arg(short, long, value_enum, default_value_t = OutputFormat::Console)] + format: OutputFormat, - /// Output format (console, json, sarif) - #[arg(short, long, default_value = "")] - format: String, - - /// Show only high severity issues + /// Severity filter expression: HIGH, HIGH,MEDIUM, or >=MEDIUM + /// + /// Filters findings AFTER all severity normalization (e.g. nonprod + /// downgrades). Only findings matching the expression are emitted. + /// Case-insensitive. Shell-quote expressions containing ">". 
#[arg(long)] - high_only: bool, + severity: Option, - #[arg(long)] - ast_only: bool, + /// Analysis mode: full (default), ast, cfg, taint + #[arg(long, value_enum, default_value_t = ScanMode::Full)] + mode: ScanMode, - #[arg(long)] - cfg_only: bool, - - #[arg(long)] + /// Scan all targets (alias for --mode full) + #[arg(long, hide = true)] all_targets: bool, - /// Include findings from test/vendor/build paths at original severity - /// (by default these are downgraded) + /// Preserve original severity for test/vendor/build paths + /// + /// By default, findings in non-production paths are downgraded by one + /// severity tier. This flag preserves original severity. + #[arg(long, alias = "include-nonprod")] + keep_nonprod_severity: bool, + + /// Suppress all human-readable status output #[arg(long)] - include_nonprod: bool, + quiet: bool, + + /// Exit with code 1 if any finding meets or exceeds this severity + /// + /// Useful for CI gating. Example: --fail-on HIGH + #[arg(long)] + fail_on: Option, + + /// Disable attack-surface ranking (findings are sorted by exploitability by default) + #[arg(long)] + no_rank: bool, + + /// Show inline-suppressed findings (dimmed, tagged [SUPPRESSED]) + #[arg(long)] + show_suppressed: bool, + + /// Show all findings: disables category filtering, rollups, and LOW budgets + #[arg(long = "all")] + show_all: bool, + + /// Include Quality findings (excluded by default) + #[arg(long)] + include_quality: bool, + + /// Maximum total LOW findings to show + #[arg(long, default_value_t = 20)] + max_low: u32, + + /// Maximum LOW findings per file + #[arg(long, default_value_t = 1)] + max_low_per_file: u32, + + /// Maximum LOW findings per rule + #[arg(long, default_value_t = 10)] + max_low_per_rule: u32, + + /// Number of example locations in rollup findings + #[arg(long, default_value_t = 5)] + rollup_examples: u32, + + /// Show all instances for a specific rule (bypasses rollup for that rule) + #[arg(long)] + show_instances: Option, + + /// 
Minimum attack-surface score to include in output + /// + /// Findings with a rank score below this threshold are suppressed. + /// Requires ranking to be enabled (has no effect with --no-rank). + /// Example: --min-score 50 + #[arg(long)] + min_score: Option, + + /// Minimum confidence level to include in output + /// + /// Values: low, medium, high. Findings below this level are dropped. + /// JSON/SARIF include all unless filtered. + #[arg(long)] + min_confidence: Option, + + // ── Deprecated aliases (hidden) ───────────────────────────────── + /// Deprecated: use --index off + #[arg(long, hide = true)] + no_index: bool, + + /// Deprecated: use --index rebuild + #[arg(long, hide = true)] + rebuild_index: bool, + + /// Deprecated: use --severity HIGH + #[arg(long, hide = true)] + high_only: bool, + + /// Deprecated: use --mode ast + #[arg(long, hide = true)] + ast_only: bool, + + /// Deprecated: use --mode cfg + #[arg(long, hide = true)] + cfg_only: bool, }, /// Manage project indexes diff --git a/src/commands/mod.rs b/src/commands/mod.rs index 5c35c0e1..67dcbd8c 100644 --- a/src/commands/mod.rs +++ b/src/commands/mod.rs @@ -4,9 +4,9 @@ pub mod index; pub mod list; pub mod scan; -use crate::cli::Commands; +use crate::cli::{Commands, IndexMode, ScanMode}; use crate::errors::NyxResult; -use crate::patterns::Severity; +use crate::patterns::{Severity, SeverityFilter}; use crate::utils::config::{AnalysisMode, Config}; use std::path::Path; @@ -19,36 +19,130 @@ pub fn handle_command( match command { Commands::Scan { path, + index, + format, + severity, + mode, + all_targets, + keep_nonprod_severity, + quiet, + fail_on, + no_rank, + show_suppressed, + show_all, + include_quality, + max_low, + max_low_per_file, + max_low_per_rule, + rollup_examples, + show_instances, + min_score, + min_confidence, + // Deprecated aliases no_index, rebuild_index, - format, high_only, ast_only, cfg_only, - all_targets, - include_nonprod, } => { - if high_only { - config.scanner.min_severity 
= Severity::High + // ── Resolve deprecated aliases ────────────────────────────── + + // Index mode: explicit --index wins, then deprecated flags + let effective_index = if no_index { + IndexMode::Off + } else if rebuild_index { + IndexMode::Rebuild + } else { + index }; - if ast_only { - config.scanner.mode = AnalysisMode::Ast + // Analysis mode: explicit --mode wins, then deprecated flags + let effective_mode = if ast_only { + ScanMode::Ast + } else if cfg_only { + ScanMode::Cfg + } else if all_targets { + ScanMode::Full + } else { + mode }; - if cfg_only { - config.scanner.mode = AnalysisMode::Taint + // Severity filter: explicit --severity wins, then --high-only + let severity_filter = if let Some(ref expr) = severity { + Some(SeverityFilter::parse(expr).map_err(|e| { + crate::errors::NyxError::Msg(format!("invalid --severity expression: {e}")) + })?) + } else if high_only { + Some(SeverityFilter::parse("HIGH").unwrap()) + } else { + None }; - if all_targets { - config.scanner.mode = AnalysisMode::Full + // Fail-on threshold + let fail_on_sev = if let Some(ref expr) = fail_on { + Some(expr.trim().parse::().map_err(|e| { + crate::errors::NyxError::Msg(format!("invalid --fail-on value: {e}")) + })?) 
+ } else { + None }; - if include_nonprod { - config.scanner.include_nonprod = true - }; + // ── Apply to config ───────────────────────────────────────── - scan::handle(&path, no_index, rebuild_index, format, database_dir, config)?; + match effective_mode { + ScanMode::Full => config.scanner.mode = AnalysisMode::Full, + ScanMode::Ast => config.scanner.mode = AnalysisMode::Ast, + ScanMode::Cfg | ScanMode::Taint => config.scanner.mode = AnalysisMode::Taint, + } + + if keep_nonprod_severity { + config.scanner.include_nonprod = true; + } + + if quiet { + config.output.quiet = true; + } + + if no_rank { + config.output.attack_surface_ranking = false; + } + + // Min-score: CLI wins, then config + if let Some(s) = min_score { + config.output.min_score = Some(s); + } + + // Min-confidence: CLI wins, then config + if let Some(ref expr) = min_confidence { + config.output.min_confidence = + Some(expr.parse::().map_err(|e| { + crate::errors::NyxError::Msg(format!("invalid --min-confidence value: {e}")) + })?); + } + + if show_all { + config.output.show_all = true; + } + if include_quality { + config.output.include_quality = true; + } + // CLI values override config defaults (clap provides defaults) + config.output.max_low = max_low; + config.output.max_low_per_file = max_low_per_file; + config.output.max_low_per_rule = max_low_per_rule; + config.output.rollup_examples = rollup_examples; + + scan::handle( + &path, + effective_index, + format, + severity_filter, + fail_on_sev, + show_suppressed, + show_instances.as_deref(), + database_dir, + config, + )?; } Commands::Index { action } => { index::handle(action, database_dir, config)?; diff --git a/src/commands/scan.rs b/src/commands/scan.rs index 210b5056..182d79a7 100644 --- a/src/commands/scan.rs +++ b/src/commands/scan.rs @@ -1,9 +1,10 @@ pub(crate) use crate::ast::{ analyse_file_fused, extract_summaries_from_bytes, run_rules_on_bytes, run_rules_on_file, }; +use crate::cli::{IndexMode, OutputFormat}; use 
crate::database::index::{Indexer, IssueRow}; use crate::errors::NyxResult; -use crate::patterns::Severity; +use crate::patterns::{FindingCategory, Severity, SeverityFilter}; use crate::summary::{self, GlobalSummaries}; use crate::utils::config::Config; use crate::utils::project::get_project_info; @@ -14,7 +15,6 @@ use indicatif::{ProgressBar, ProgressStyle}; use r2d2::Pool; use r2d2_sqlite::SqliteConnectionManager; use rayon::prelude::*; -use std::collections::BTreeMap; use std::path::{Path, PathBuf}; use std::sync::Arc; @@ -41,35 +41,116 @@ pub struct Diag { pub col: usize, pub severity: Severity, pub id: String, + /// High-level finding category (Security, Reliability, Quality). + pub category: FindingCategory, + /// Whether the finding is guarded by a path validation predicate. + /// Only set for taint findings; `false` for AST/CFG structural findings. + #[serde(skip_serializing_if = "std::ops::Not::not")] + pub path_validated: bool, + /// The kind of validation guard protecting this path, if any. + #[serde(skip_serializing_if = "Option::is_none")] + pub guard_kind: Option, + /// Optional human-readable message with additional context (e.g. state analysis details). + #[serde(skip_serializing_if = "Option::is_none")] + pub message: Option, + /// Structured evidence labels (e.g. Source, Sink) for console display. + #[serde(skip_serializing_if = "Vec::is_empty")] + pub labels: Vec<(String, String)>, + /// Confidence level (Low / Medium / High). + #[serde(skip_serializing_if = "Option::is_none")] + pub confidence: Option, + /// Structured evidence (source/sink spans, state transitions, notes). + #[serde(skip_serializing_if = "Option::is_none")] + pub evidence: Option, + /// Attack-surface ranking score (higher = more exploitable / important). + #[serde(skip_serializing_if = "Option::is_none")] + pub rank_score: Option, + /// Breakdown of how the ranking score was computed. 
+ #[serde(skip_serializing_if = "Option::is_none")] + pub rank_reason: Option>, + /// Whether this finding was suppressed by an inline `nyx:ignore` directive. + #[serde(skip_serializing_if = "is_false")] + pub suppressed: bool, + /// Metadata about the suppression directive, if suppressed. + #[serde(skip_serializing_if = "Option::is_none")] + pub suppression: Option, + /// Rollup data when multiple occurrences are grouped into one finding. + #[serde(skip_serializing_if = "Option::is_none")] + pub rollup: Option, +} + +/// Rollup data for grouped findings (e.g. 38 occurrences of `rs.quality.unwrap`). +#[derive(Debug, Clone, serde::Serialize)] +pub struct RollupData { + /// Total number of occurrences. + pub count: usize, + /// First N example locations (controlled by `rollup_examples`). + pub occurrences: Vec, +} + +/// A source location within a file. +#[derive(Debug, Clone, serde::Serialize)] +pub struct Location { + pub line: usize, + pub col: usize, +} + +/// Statistics about findings suppressed by the prioritization pipeline. +pub struct SuppressionStats { + pub quality_dropped: usize, + pub low_budget_dropped: usize, + pub max_results_dropped: usize, + pub include_quality: bool, + #[allow(dead_code)] + pub show_all: bool, + pub max_low: u32, + pub max_low_per_file: u32, + pub max_low_per_rule: u32, +} + +impl SuppressionStats { + pub fn total_suppressed(&self) -> usize { + self.quality_dropped + self.low_budget_dropped + self.max_results_dropped + } +} + +fn is_false(b: &bool) -> bool { + !*b } /// Entry point called by the CLI. 
+#[allow(clippy::too_many_arguments)] pub fn handle( path: &str, - no_index: bool, - rebuild_index: bool, - format: String, + index_mode: IndexMode, + format: OutputFormat, + severity_filter: Option, + fail_on: Option, + show_suppressed: bool, + show_instances: Option<&str>, database_dir: &Path, config: &Config, ) -> NyxResult<()> { let scan_path = Path::new(path).canonicalize()?; let (project_name, db_path) = get_project_info(&scan_path, database_dir)?; - let suppress_status = config.output.quiet || format == "json" || format == "sarif"; + let is_machine = format == OutputFormat::Json || format == OutputFormat::Sarif; + let suppress_status = config.output.quiet || is_machine; if !suppress_status { - println!( + // Status messages go to stderr so stdout stays clean + eprintln!( "{} {}...\n", style("Checking").green().bold(), &project_name ); } - let show_progress = format != "json" && format != "sarif" && !config.output.quiet; + let show_progress = !is_machine && !config.output.quiet; - let diags: Vec = if no_index { + let mut diags: Vec = if index_mode == IndexMode::Off { scan_filesystem(&scan_path, config, show_progress)? } else { - if rebuild_index || !db_path.exists() { + if index_mode == IndexMode::Rebuild || !db_path.exists() { tracing::debug!("Scanning filesystem index filesystem"); crate::commands::index::build_index( &project_name, @@ -88,52 +169,68 @@ pub fn handle( scan_with_index_parallel(&project_name, pool, config, show_progress)? 
}; - tracing::debug!("Found {:?} issues.", diags.len()); + tracing::debug!("Found {:?} issues (pre-filter).", diags.len()); - if format == "json" { - let json = serde_json::to_string(&diags) - .map_err(|e| crate::errors::NyxError::Msg(e.to_string()))?; - println!("{json}"); - return Ok(()); + // ── Apply severity filter AFTER all downgrades/dedup ──────────────── + if let Some(ref filter) = severity_filter { + diags.retain(|d| filter.matches(d.severity)); } - if format == "sarif" { - let sarif = crate::output::build_sarif(&diags, &scan_path); - let json = serde_json::to_string_pretty(&sarif) - .map_err(|e| crate::errors::NyxError::Msg(e.to_string()))?; - println!("{json}"); - return Ok(()); + // ── Apply minimum-score filter AFTER ranking ───────────────────── + if let Some(min) = config.output.min_score { + let threshold = f64::from(min); + diags.retain(|d| d.rank_score.unwrap_or(0.0) >= threshold); } - if format == "console" || (format.is_empty() && config.output.default_format == "console") { - tracing::debug!("Printing to console"); - let mut grouped: BTreeMap<&str, Vec<&Diag>> = BTreeMap::new(); - for d in &diags { - grouped.entry(&d.path).or_default().push(d); + // ── Apply minimum-confidence filter AFTER confidence assignment ── + if let Some(min_conf) = config.output.min_confidence { + diags.retain(|d| d.confidence.is_none_or(|c| c >= min_conf)); + } + + // ── Apply inline suppressions ─────────────────────────────────── + apply_suppressions(&mut diags); + if !show_suppressed { + diags.retain(|d| !d.suppressed); + } + + // ── Prioritization: category filter, rollup, LOW budgets ───────── + let stats = prioritize(&mut diags, &config.output, show_instances); + + tracing::debug!("Emitting {:?} issues (post-filter).", diags.len()); + + // ── Output ────────────────────────────────────────────────────────── + match format { + OutputFormat::Json => { + let json = serde_json::to_string(&diags) + .map_err(|e| crate::errors::NyxError::Msg(e.to_string()))?; + 
println!("{json}"); } - - for (path, issues) in &grouped { - println!("{}", style(path).blue().underlined()); - for d in issues { - println!( - " {:>4}:{:<4} {} {}", - d.line, - d.col, - d.severity.colored_tag(), - style(&d.id).bold() - ); - } - println!(); + OutputFormat::Sarif => { + let sarif = crate::output::build_sarif(&diags, &scan_path); + let json = serde_json::to_string_pretty(&sarif) + .map_err(|e| crate::errors::NyxError::Msg(e.to_string()))?; + println!("{json}"); + } + OutputFormat::Console => { + tracing::debug!("Printing to console"); + print!( + "{}", + crate::fmt::render_console(&diags, &project_name, Some(&stats)) + ); } - - println!( - "{} '{}' generated {} issues.", - style("warning").yellow().bold(), - style(project_name).white().bold(), - style(diags.len()).bold() - ); - println!("\t"); } + + // ── --fail-on: exit non-zero if threshold breached ────────────────── + // Suppressed findings do not count toward the threshold. + if let Some(threshold) = fail_on { + let breached = diags + .iter() + .any(|d| !d.suppressed && d.severity <= threshold); + if breached { + std::process::exit(1); + } + } + Ok(()) } @@ -198,6 +295,14 @@ pub(crate) fn scan_filesystem( .collect(); pb.finish_and_clear(); + if cfg.output.attack_surface_ranking { + crate::rank::rank_diags(&mut diags); + } + for d in &mut diags { + if d.confidence.is_none() { + d.confidence = Some(crate::evidence::compute_confidence(d)); + } + } if let Some(max) = cfg.output.max_results { diags.truncate(max as usize); } @@ -260,6 +365,22 @@ pub(crate) fn scan_filesystem( gs }; + // ── Build call graph ──────────────────────────────────────────────── + { + let _span = tracing::info_span!("build_call_graph").entered(); + // TODO: wire interop_edges from config/index when InteropEdge sources are implemented + let call_graph = crate::callgraph::build_call_graph(&global_summaries, &[]); + let cg_analysis = crate::callgraph::analyse(&call_graph); + tracing::info!( + nodes = 
call_graph.graph.node_count(), + edges = call_graph.graph.edge_count(), + unresolved_not_found = call_graph.unresolved_not_found.len(), + unresolved_ambiguous = call_graph.unresolved_ambiguous.len(), + sccs = cg_analysis.sccs.len(), + "call graph built" + ); + } + // ── Pass 2: re-run with cross-file global summaries ────────────────── let mut diags: Vec = { let _span = tracing::info_span!("pass2_analysis", files = all_paths.len()).entered(); @@ -289,6 +410,14 @@ pub(crate) fn scan_filesystem( }; tracing::info!(diags = diags.len(), "pass 2 complete"); + if cfg.output.attack_surface_ranking { + crate::rank::rank_diags(&mut diags); + } + for d in &mut diags { + if d.confidence.is_none() { + d.confidence = Some(crate::evidence::compute_confidence(d)); + } + } if let Some(max) = cfg.output.max_results { diags.truncate(max as usize); } @@ -372,6 +501,22 @@ pub fn scan_with_index_parallel( None }; + // ── Build call graph ──────────────────────────────────────────────── + if let Some(ref gs) = global_summaries { + let _span = tracing::info_span!("build_call_graph").entered(); + // TODO: wire interop_edges from config/index when InteropEdge sources are implemented + let call_graph = crate::callgraph::build_call_graph(gs, &[]); + let cg_analysis = crate::callgraph::analyse(&call_graph); + tracing::info!( + nodes = call_graph.graph.node_count(), + edges = call_graph.graph.edge_count(), + unresolved_not_found = call_graph.unresolved_not_found.len(), + unresolved_ambiguous = call_graph.unresolved_ambiguous.len(), + sccs = cg_analysis.sccs.len(), + "call graph built" + ); + } + // ── Pass 2: full analysis ──────────────────────────────────────────── let _span = tracing::info_span!("pass2_indexed").entered(); let pb2 = make_progress_bar( @@ -453,6 +598,14 @@ pub fn scan_with_index_parallel( let mut diags: Vec = diag_map.into_iter().flat_map(|(_, v)| v).collect(); + if cfg.output.attack_surface_ranking { + crate::rank::rank_diags(&mut diags); + } + for d in &mut diags { + if 
d.confidence.is_none() { + d.confidence = Some(crate::evidence::compute_confidence(d)); + } + } if let Some(max) = cfg.output.max_results { diags.truncate(max as usize); } @@ -460,6 +613,297 @@ pub fn scan_with_index_parallel( Ok(diags) } +// ───────────────────────────────────────────────────────────────────────────── +// Low-noise prioritization pipeline +// ───────────────────────────────────────────────────────────────────────────── + +/// Rules eligible for rollup grouping (high-frequency, low-signal patterns). +const ROLLUP_RULES: &[&str] = &[ + "rs.quality.unwrap", + "rs.quality.expect", + "rs.quality.panic_macro", +]; + +/// Apply category filtering, rollup grouping, and LOW budgets to reduce noise. +/// +/// Modifies `diags` in place and returns suppression statistics for the footer. +pub(crate) fn prioritize( + diags: &mut Vec, + config: &crate::utils::config::OutputConfig, + show_instances: Option<&str>, +) -> SuppressionStats { + let mut stats = SuppressionStats { + quality_dropped: 0, + low_budget_dropped: 0, + max_results_dropped: 0, + include_quality: config.include_quality, + show_all: config.show_all, + max_low: config.max_low, + max_low_per_file: config.max_low_per_file, + max_low_per_rule: config.max_low_per_rule, + }; + + if config.show_all { + return stats; + } + + // ── 1. Category filter: drop Quality unless include_quality ──────── + if !config.include_quality { + let before = diags.len(); + diags.retain(|d| d.category != FindingCategory::Quality); + stats.quality_dropped = before - diags.len(); + } + + // ── 2. Rollup: group high-frequency LOW Quality findings ────────── + rollup_findings(diags, config, show_instances); + + // ── 3. LOW budgets ──────────────────────────────────────────────── + apply_low_budgets(diags, config, &mut stats); + + // ── 4. 
Global max_results with severity stability ───────────────── + if let Some(max) = config.max_results { + let max = max as usize; + if diags.len() > max { + // Partition by severity priority: High first, then Medium, then Low + let high_count = diags + .iter() + .filter(|d| d.severity == Severity::High) + .count(); + let med_count = diags + .iter() + .filter(|d| d.severity == Severity::Medium) + .count(); + + let take = if high_count >= max { + // Only High fits + diags.retain(|d| d.severity == Severity::High); + diags.truncate(max); + max + } else if high_count + med_count >= max { + // High + some Medium + let med_slots = max - high_count; + let mut med_seen = 0usize; + diags.retain(|d| { + if d.severity == Severity::High { + true + } else if d.severity == Severity::Medium && med_seen < med_slots { + med_seen += 1; + true + } else { + false + } + }); + max + } else { + // High + Medium + some Low + let low_slots = max - high_count - med_count; + let mut low_seen = 0usize; + diags.retain(|d| { + if d.severity == Severity::High || d.severity == Severity::Medium { + true + } else if low_seen < low_slots { + low_seen += 1; + true + } else { + false + } + }); + max + }; + let original_total = high_count + med_count + diags.len(); // approximate + stats.max_results_dropped = original_total.saturating_sub(take); + } + } + + stats +} + +/// Group eligible LOW Quality findings into rollup Diags. 
+fn rollup_findings( + diags: &mut Vec, + config: &crate::utils::config::OutputConfig, + show_instances: Option<&str>, +) { + use std::collections::HashMap; + + // Identify which diags are eligible for rollup + let mut groups: HashMap<(String, String), Vec> = HashMap::new(); + for (i, d) in diags.iter().enumerate() { + if d.severity != Severity::Low { + continue; + } + if d.category != FindingCategory::Quality { + continue; + } + if !ROLLUP_RULES.contains(&d.id.as_str()) { + continue; + } + if show_instances == Some(d.id.as_str()) { + continue; + } + groups + .entry((d.path.clone(), d.id.clone())) + .or_default() + .push(i); + } + + // Only rollup groups with more than 1 occurrence + let mut to_remove: Vec = Vec::new(); + let mut rollups: Vec = Vec::new(); + + for ((_path, _rule_id), mut indices) in groups { + if indices.len() <= 1 { + continue; + } + + // Sort by (line, col) for deterministic canonical location + indices.sort_by_key(|&i| (diags[i].line, diags[i].col)); + + let canonical_idx = indices[0]; + let total = indices.len(); + + // Collect example locations (first N) + let examples: Vec = indices + .iter() + .take(config.rollup_examples as usize) + .map(|&i| Location { + line: diags[i].line, + col: diags[i].col, + }) + .collect(); + + // Build rollup Diag from canonical + let canonical = &diags[canonical_idx]; + let rollup_diag = Diag { + path: canonical.path.clone(), + line: canonical.line, + col: canonical.col, + severity: canonical.severity, + id: canonical.id.clone(), + category: canonical.category, + path_validated: false, + guard_kind: None, + message: canonical.message.clone(), + labels: vec![], + confidence: canonical.confidence, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: Some(RollupData { + count: total, + occurrences: examples, + }), + }; + + rollups.push(rollup_diag); + to_remove.extend(indices); + } + + if to_remove.is_empty() { + return; + } + + // Remove originals (in reverse 
order to preserve indices) + to_remove.sort_unstable(); + to_remove.dedup(); + for &i in to_remove.iter().rev() { + diags.remove(i); + } + + // Sort rollups for deterministic output: by (path, id, line) + rollups.sort_by(|a, b| { + a.path + .cmp(&b.path) + .then(a.id.cmp(&b.id)) + .then(a.line.cmp(&b.line)) + }); + + // Add rollup diags + diags.extend(rollups); +} + +/// Enforce per-file, per-rule, and total LOW budgets. +fn apply_low_budgets( + diags: &mut Vec, + config: &crate::utils::config::OutputConfig, + stats: &mut SuppressionStats, +) { + use std::collections::HashMap; + + let mut per_file: HashMap = HashMap::new(); + let mut per_rule: HashMap = HashMap::new(); + let mut total_low: u32 = 0; + + let before = diags.len(); + diags.retain(|d| { + // High/Medium always kept + if d.severity != Severity::Low { + return true; + } + + // Check per-file budget + let file_count = per_file.entry(d.path.clone()).or_insert(0); + if *file_count >= config.max_low_per_file { + return false; + } + + // Check per-rule budget + let rule_count = per_rule.entry(d.id.clone()).or_insert(0); + if *rule_count >= config.max_low_per_rule { + return false; + } + + // Check total budget + if total_low >= config.max_low { + return false; + } + + *file_count += 1; + *rule_count += 1; + total_low += 1; + true + }); + stats.low_budget_dropped = before - diags.len(); +} + +// ───────────────────────────────────────────────────────────────────────────── +// Inline suppression application +// ───────────────────────────────────────────────────────────────────────────── + +/// Apply inline `nyx:ignore` / `nyx:ignore-next-line` suppressions to `diags`. +/// +/// For each unique file path in the diagnostics, the source file is read once, +/// suppression directives are parsed, and matching findings are marked as +/// suppressed. +fn apply_suppressions(diags: &mut [Diag]) { + use std::collections::HashMap; + + // Group diag indices by path (clone path strings to avoid borrowing diags). 
+ let mut by_path: HashMap<String, Vec<usize>> = HashMap::new(); + for (i, d) in diags.iter().enumerate() { + by_path.entry(d.path.clone()).or_default().push(i); + } + + for (path, indices) in &by_path { + let Ok(source) = std::fs::read_to_string(path) else { + continue; + }; + let file_path = Path::new(path.as_str()); + let index = crate::suppress::parse_inline_suppressions(file_path, &source); + if index.is_empty() { + continue; + } + for &i in indices { + if let Some(meta) = index.check(diags[i].line, &diags[i].id) { + diags[i].suppressed = true; + diags[i].suppression = Some(meta); + } + } + } +} + #[test] fn scan_with_index_parallel_uses_existing_index_without_rescanning() { let mut cfg = Config::default(); @@ -492,3 +936,579 @@ fn scan_with_index_parallel_uses_existing_index_without_rescanning() { assert!(diags.is_empty()); } + +#[test] +fn severity_filter_applied_at_output_stage() { + // Simulate: findings start as High, get downgraded to Medium by nonprod logic, + // then --severity HIGH should filter them out.
+ let diags = vec![ + Diag { + path: "tests/test.py".into(), + line: 1, + col: 1, + severity: Severity::Medium, // was High, downgraded + id: "taint-unsanitised-flow".into(), + category: FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }, + Diag { + path: "src/main.rs".into(), + line: 10, + col: 5, + severity: Severity::High, + id: "taint-unsanitised-flow".into(), + category: FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }, + ]; + + let filter = SeverityFilter::parse("HIGH").unwrap(); + let filtered: Vec<_> = diags + .into_iter() + .filter(|d| filter.matches(d.severity)) + .collect(); + + assert_eq!(filtered.len(), 1); + assert_eq!(filtered[0].severity, Severity::High); + assert_eq!(filtered[0].path, "src/main.rs"); +} + +// ───────────────────────────────────────────────────────────────────────────── +// Prioritization pipeline tests +// ───────────────────────────────────────────────────────────────────────────── + +#[cfg(test)] +mod prioritize_tests { + use super::*; + use crate::utils::config::OutputConfig; + + fn make_diag( + path: &str, + line: usize, + severity: Severity, + id: &str, + cat: FindingCategory, + ) -> Diag { + Diag { + path: path.into(), + line, + col: 1, + severity, + id: id.into(), + category: cat, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + } + } + + fn default_config() -> OutputConfig { + OutputConfig::default() + } + + #[test] + fn quality_dropped_by_default() { + let mut diags = 
vec![ + make_diag( + "a.rs", + 1, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 2, + Severity::High, + "taint-flow", + FindingCategory::Security, + ), + ]; + let stats = prioritize(&mut diags, &default_config(), None); + assert_eq!(diags.len(), 1); + assert_eq!(diags[0].id, "taint-flow"); + assert_eq!(stats.quality_dropped, 1); + } + + #[test] + fn quality_kept_with_include_quality() { + let mut diags = vec![ + make_diag( + "a.rs", + 1, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 2, + Severity::High, + "taint-flow", + FindingCategory::Security, + ), + ]; + let mut cfg = default_config(); + cfg.include_quality = true; + let stats = prioritize(&mut diags, &cfg, None); + assert_eq!(diags.len(), 2); + assert_eq!(stats.quality_dropped, 0); + } + + #[test] + fn show_all_disables_everything() { + let mut diags = vec![ + make_diag( + "a.rs", + 1, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 2, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 3, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + ]; + let mut cfg = default_config(); + cfg.show_all = true; + let stats = prioritize(&mut diags, &cfg, None); + assert_eq!(diags.len(), 3); // no filtering, no rollup + assert_eq!(stats.quality_dropped, 0); + assert_eq!(stats.low_budget_dropped, 0); + assert!(diags.iter().all(|d| d.rollup.is_none())); + } + + #[test] + fn rollup_groups_by_file_and_rule() { + let mut diags = vec![ + make_diag( + "a.rs", + 10, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 20, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 30, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "b.rs", + 5, + Severity::Low, + "rs.quality.unwrap", + 
FindingCategory::Quality, + ), + make_diag( + "b.rs", + 15, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + ]; + let mut cfg = default_config(); + cfg.include_quality = true; + let _stats = prioritize(&mut diags, &cfg, None); + + // Should have 2 rollup diags (one per file) + let rollups: Vec<_> = diags.iter().filter(|d| d.rollup.is_some()).collect(); + assert_eq!(rollups.len(), 2); + + let a_rollup = rollups.iter().find(|d| d.path == "a.rs").unwrap(); + assert_eq!(a_rollup.rollup.as_ref().unwrap().count, 3); + + let b_rollup = rollups.iter().find(|d| d.path == "b.rs").unwrap(); + assert_eq!(b_rollup.rollup.as_ref().unwrap().count, 2); + } + + #[test] + fn rollup_examples_limited() { + let mut diags: Vec = (1..=20) + .map(|i| { + make_diag( + "a.rs", + i, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ) + }) + .collect(); + let mut cfg = default_config(); + cfg.include_quality = true; + cfg.rollup_examples = 3; + let _stats = prioritize(&mut diags, &cfg, None); + + let rollup = diags.iter().find(|d| d.rollup.is_some()).unwrap(); + assert_eq!(rollup.rollup.as_ref().unwrap().count, 20); + assert_eq!(rollup.rollup.as_ref().unwrap().occurrences.len(), 3); + } + + #[test] + fn rollup_canonical_is_first_sorted() { + let mut diags = vec![ + make_diag( + "a.rs", + 50, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 10, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 30, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + ]; + let mut cfg = default_config(); + cfg.include_quality = true; + let _stats = prioritize(&mut diags, &cfg, None); + + let rollup = diags.iter().find(|d| d.rollup.is_some()).unwrap(); + assert_eq!(rollup.line, 10); // canonical = first sorted + } + + #[test] + fn low_budget_per_file() { + let mut diags = vec![ + make_diag( + "a.rs", + 1, + Severity::Low, + "some-rule", + 
FindingCategory::Security, + ), + make_diag( + "a.rs", + 2, + Severity::Low, + "some-rule-2", + FindingCategory::Security, + ), + make_diag( + "b.rs", + 1, + Severity::Low, + "some-rule", + FindingCategory::Security, + ), + ]; + let mut cfg = default_config(); + cfg.max_low_per_file = 1; + cfg.max_low = 100; + cfg.max_low_per_rule = 100; + let stats = prioritize(&mut diags, &cfg, None); + // a.rs: only 1 LOW kept, b.rs: 1 LOW kept + assert_eq!(diags.len(), 2); + assert_eq!(stats.low_budget_dropped, 1); + } + + #[test] + fn low_budget_per_rule() { + let mut diags = vec![ + make_diag( + "a.rs", + 1, + Severity::Low, + "rule-x", + FindingCategory::Security, + ), + make_diag( + "b.rs", + 1, + Severity::Low, + "rule-x", + FindingCategory::Security, + ), + make_diag( + "c.rs", + 1, + Severity::Low, + "rule-x", + FindingCategory::Security, + ), + ]; + let mut cfg = default_config(); + cfg.max_low_per_file = 100; + cfg.max_low = 100; + cfg.max_low_per_rule = 2; + let stats = prioritize(&mut diags, &cfg, None); + assert_eq!(diags.len(), 2); + assert_eq!(stats.low_budget_dropped, 1); + } + + #[test] + fn low_budget_total() { + let mut diags: Vec = (1..=5) + .map(|i| { + make_diag( + &format!("f{i}.rs"), + 1, + Severity::Low, + &format!("rule-{i}"), + FindingCategory::Security, + ) + }) + .collect(); + let mut cfg = default_config(); + cfg.max_low_per_file = 100; + cfg.max_low_per_rule = 100; + cfg.max_low = 3; + let stats = prioritize(&mut diags, &cfg, None); + assert_eq!(diags.len(), 3); + assert_eq!(stats.low_budget_dropped, 2); + } + + #[test] + fn high_medium_never_dropped_by_low_budget() { + let mut diags = vec![ + make_diag( + "a.rs", + 1, + Severity::High, + "vuln-1", + FindingCategory::Security, + ), + make_diag( + "a.rs", + 2, + Severity::Medium, + "vuln-2", + FindingCategory::Security, + ), + make_diag( + "a.rs", + 3, + Severity::Low, + "vuln-3", + FindingCategory::Security, + ), + ]; + let mut cfg = default_config(); + cfg.max_low = 0; + cfg.max_low_per_file = 0; 
+ cfg.max_low_per_rule = 0; + let stats = prioritize(&mut diags, &cfg, None); + assert_eq!(diags.len(), 2); // High + Medium kept + assert!(diags.iter().all(|d| d.severity != Severity::Low)); + assert_eq!(stats.low_budget_dropped, 1); + } + + #[test] + fn rollup_counts_as_one_for_budget() { + // 10 unwrap findings in same file → 1 rollup → counts as 1 LOW + let mut diags: Vec = (1..=10) + .map(|i| { + make_diag( + "a.rs", + i, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ) + }) + .collect(); + // Add another LOW finding from a different rule + diags.push(make_diag( + "a.rs", + 100, + Severity::Low, + "other-rule", + FindingCategory::Security, + )); + + let mut cfg = default_config(); + cfg.include_quality = true; + cfg.max_low_per_file = 2; // allow 2 per file + cfg.max_low = 100; + cfg.max_low_per_rule = 100; + let _stats = prioritize(&mut diags, &cfg, None); + + // Should have rollup (1) + other-rule (1) = 2 + assert_eq!(diags.len(), 2); + } + + #[test] + fn show_instances_bypasses_rollup_for_rule() { + let mut diags = vec![ + make_diag( + "a.rs", + 1, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 2, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 3, + Severity::Low, + "rs.quality.expect", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 4, + Severity::Low, + "rs.quality.expect", + FindingCategory::Quality, + ), + ]; + let mut cfg = default_config(); + cfg.include_quality = true; + cfg.max_low = 100; + cfg.max_low_per_file = 100; + cfg.max_low_per_rule = 100; + let _stats = prioritize(&mut diags, &cfg, Some("rs.quality.unwrap")); + + // unwrap not rolled up (2 individual), expect rolled up (1 rollup) + let unwrap_count = diags.iter().filter(|d| d.id == "rs.quality.unwrap").count(); + let expect_rollup = diags + .iter() + .find(|d| d.id == "rs.quality.expect" && d.rollup.is_some()); + assert_eq!(unwrap_count, 2); + 
assert!(expect_rollup.is_some()); + } + + #[test] + fn json_includes_rollup_data() { + let d = Diag { + path: "a.rs".into(), + line: 10, + col: 1, + severity: Severity::Low, + id: "rs.quality.unwrap".into(), + category: FindingCategory::Quality, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: Some(RollupData { + count: 38, + occurrences: vec![Location { line: 10, col: 1 }, Location { line: 20, col: 5 }], + }), + }; + let json = serde_json::to_string(&d).unwrap(); + assert!(json.contains("\"rollup\"")); + assert!(json.contains("\"count\":38")); + assert!(json.contains("\"occurrences\"")); + } + + #[test] + fn deterministic_output() { + let make_diags = || { + vec![ + make_diag( + "b.rs", + 5, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 10, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "a.rs", + 3, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + make_diag( + "b.rs", + 1, + Severity::Low, + "rs.quality.unwrap", + FindingCategory::Quality, + ), + ] + }; + let mut cfg = default_config(); + cfg.include_quality = true; + + let mut d1 = make_diags(); + let mut d2 = make_diags(); + let _s1 = prioritize(&mut d1, &cfg, None); + let _s2 = prioritize(&mut d2, &cfg, None); + + let j1 = serde_json::to_string(&d1).unwrap(); + let j2 = serde_json::to_string(&d2).unwrap(); + assert_eq!(j1, j2, "same input should produce same output"); + } +} diff --git a/src/database.rs b/src/database.rs index ac10552d..8073d769 100644 --- a/src/database.rs +++ b/src/database.rs @@ -272,6 +272,18 @@ pub mod index { line: row.get::<_, i64>(2)? as usize, col: row.get::<_, i64>(3)? 
as usize, severity: Severity::from_str(&sev_str).unwrap(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, }) })?; diff --git a/src/evidence.rs b/src/evidence.rs new file mode 100644 index 00000000..c63c47a0 --- /dev/null +++ b/src/evidence.rs @@ -0,0 +1,396 @@ +//! Structured evidence and confidence types for scan diagnostics. +//! +//! These types capture the provenance of findings (source locations, +//! sanitizer/guard info, state-machine transitions) in a structured form +//! that can be serialized to JSON and consumed by ranking, filtering, +//! and downstream tooling. + +use crate::commands::scan::Diag; +use crate::patterns::Severity; +use serde::{Deserialize, Serialize}; +use std::fmt; +use std::str::FromStr; + +// ───────────────────────────────────────────────────────────────────────────── +// Confidence +// ───────────────────────────────────────────────────────────────────────────── + +/// Confidence level for a diagnostic finding. +/// +/// Ordered Low < Medium < High so that `>=` comparisons work naturally +/// for filtering (e.g. `--min-confidence medium` keeps Medium and High). 
+#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)] +pub enum Confidence { + Low, + Medium, + High, +} + +impl fmt::Display for Confidence { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + match self { + Self::Low => write!(f, "Low"), + Self::Medium => write!(f, "Medium"), + Self::High => write!(f, "High"), + } + } +} + +impl FromStr for Confidence { + type Err = String; + + fn from_str(s: &str) -> Result<Self, Self::Err> { + match s.to_ascii_lowercase().as_str() { + "low" => Ok(Self::Low), + "medium" | "med" => Ok(Self::Medium), + "high" => Ok(Self::High), + _ => Err(format!( + "unknown confidence level: {s:?} (expected low, medium, high)" + )), + } + } +} + +// ───────────────────────────────────────────────────────────────────────────── +// Evidence +// ───────────────────────────────────────────────────────────────────────────── + +/// Structured evidence for a diagnostic finding. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct Evidence { + /// Where tainted data originated. + #[serde(skip_serializing_if = "Option::is_none")] + pub source: Option<SpanEvidence>, + + /// Where the dangerous operation happens. + #[serde(skip_serializing_if = "Option::is_none")] + pub sink: Option<SpanEvidence>, + + /// Validation guards protecting this path. + #[serde(skip_serializing_if = "Vec::is_empty")] + pub guards: Vec<SpanEvidence>, + + /// Sanitizers applied to this path. + #[serde(skip_serializing_if = "Vec::is_empty")] + pub sanitizers: Vec<SpanEvidence>, + + /// State-machine evidence (resource lifecycle / auth). + #[serde(skip_serializing_if = "Option::is_none")] + pub state: Option<StateEvidence>, + + /// Free-form notes for ranking and display. + #[serde(skip_serializing_if = "Vec::is_empty")] + pub notes: Vec<String>, +} + +impl Evidence { + /// Returns `true` if the evidence contains no useful data.
+ pub fn is_empty(&self) -> bool { + self.source.is_none() + && self.sink.is_none() + && self.guards.is_empty() + && self.sanitizers.is_empty() + && self.state.is_none() + && self.notes.is_empty() + } +} + +/// A source-location evidence span. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct SpanEvidence { + pub path: String, + pub line: u32, + pub col: u32, + /// One of: `"source"`, `"sink"`, `"guard"`, `"sanitizer"`. + pub kind: String, + #[serde(skip_serializing_if = "Option::is_none")] + pub snippet: Option<String>, +} + +/// Evidence from a state-machine analysis (resource lifecycle / auth). +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct StateEvidence { + /// The state machine: `"resource"` or `"auth"`. + pub machine: String, + /// Variable name if available. + #[serde(skip_serializing_if = "Option::is_none")] + pub subject: Option<String>, + /// State before the event. + pub from_state: String, + /// State after the event. + pub to_state: String, +} + +// ───────────────────────────────────────────────────────────────────────────── +// compute_confidence +// ───────────────────────────────────────────────────────────────────────────── + +/// Derive a confidence level for `diag` based on its rule ID, severity, +/// evidence, and analysis kind. +/// +/// This is called as a post-pass after all findings are collected; findings +/// that already have a confidence set (e.g. from CFG analysis) are preserved.
+pub fn compute_confidence(diag: &Diag) -> Confidence { + // Degraded analysis caps confidence + if let Some(ev) = &diag.evidence + && ev.notes.iter().any(|n| n.starts_with("degraded:")) + { + return Confidence::Low; + } + + let id = &diag.id; + + if id.starts_with("taint-") { + if let Some(ev) = &diag.evidence + && ev.notes.iter().any(|n| n == "path_validated") + { + return Confidence::Medium; + } + // source+sink present = High + if let Some(ev) = &diag.evidence + && ev.source.is_some() + && ev.sink.is_some() + { + return Confidence::High; + } + return Confidence::High; // default for taint + } + + if id.starts_with("state-") { + return match id.as_str() { + "state-use-after-close" => Confidence::High, + "state-double-close" => Confidence::High, + "state-unauthed-access" => Confidence::High, + "state-resource-leak" => Confidence::Medium, + "state-resource-leak-possible" => Confidence::Low, + _ => Confidence::Medium, + }; + } + + if id.starts_with("cfg-") { + // If CFG conversion already set confidence, preserve it + return diag.confidence.unwrap_or(Confidence::Medium); + } + + // AST patterns: High severity → Medium confidence, else Low + if diag.severity == Severity::High { + Confidence::Medium + } else { + Confidence::Low + } +} + +// ───────────────────────────────────────────────────────────────────────────── +// Tests +// ───────────────────────────────────────────────────────────────────────────── + +#[cfg(test)] +mod tests { + use super::*; + + fn make_diag(id: &str, severity: Severity) -> Diag { + Diag { + path: "test.rs".into(), + line: 1, + col: 1, + severity, + id: id.into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + } + } + + #[test] + fn compute_confidence_taint_high() { + let mut d = make_diag("taint-unsanitised-flow 
(source 1:1)", Severity::High); + d.evidence = Some(Evidence { + source: Some(SpanEvidence { + path: "test.rs".into(), + line: 1, + col: 1, + kind: "source".into(), + snippet: Some("env::var(\"X\")".into()), + }), + sink: Some(SpanEvidence { + path: "test.rs".into(), + line: 10, + col: 5, + kind: "sink".into(), + snippet: Some("exec()".into()), + }), + guards: vec![], + sanitizers: vec![], + state: None, + notes: vec![], + }); + assert_eq!(compute_confidence(&d), Confidence::High); + } + + #[test] + fn compute_confidence_taint_validated() { + let mut d = make_diag("taint-unsanitised-flow (source 1:1)", Severity::High); + d.evidence = Some(Evidence { + source: Some(SpanEvidence { + path: "test.rs".into(), + line: 1, + col: 1, + kind: "source".into(), + snippet: None, + }), + sink: Some(SpanEvidence { + path: "test.rs".into(), + line: 10, + col: 5, + kind: "sink".into(), + snippet: None, + }), + guards: vec![], + sanitizers: vec![], + state: None, + notes: vec!["path_validated".into()], + }); + assert_eq!(compute_confidence(&d), Confidence::Medium); + } + + #[test] + fn compute_confidence_degraded_caps_to_low() { + let mut d = make_diag("taint-unsanitised-flow (source 1:1)", Severity::High); + d.evidence = Some(Evidence { + source: None, + sink: None, + guards: vec![], + sanitizers: vec![], + state: None, + notes: vec!["degraded:budget_exceeded".into()], + }); + assert_eq!(compute_confidence(&d), Confidence::Low); + } + + #[test] + fn compute_confidence_state_rules() { + assert_eq!( + compute_confidence(&make_diag("state-use-after-close", Severity::High)), + Confidence::High, + ); + assert_eq!( + compute_confidence(&make_diag("state-double-close", Severity::Medium)), + Confidence::High, + ); + assert_eq!( + compute_confidence(&make_diag("state-unauthed-access", Severity::High)), + Confidence::High, + ); + assert_eq!( + compute_confidence(&make_diag("state-resource-leak", Severity::Medium)), + Confidence::Medium, + ); + assert_eq!( + 
compute_confidence(&make_diag("state-resource-leak-possible", Severity::Low)), + Confidence::Low, + ); + } + + #[test] + fn compute_confidence_cfg_preserves_existing() { + let mut d = make_diag("cfg-unguarded-sink", Severity::High); + d.confidence = Some(Confidence::Low); + assert_eq!(compute_confidence(&d), Confidence::Low); + } + + #[test] + fn compute_confidence_ast_low() { + let d = make_diag("rs.code_exec.eval", Severity::Medium); + assert_eq!(compute_confidence(&d), Confidence::Low); + } + + #[test] + fn compute_confidence_ast_high_severity_medium() { + let d = make_diag("rs.code_exec.eval", Severity::High); + assert_eq!(compute_confidence(&d), Confidence::Medium); + } + + #[test] + fn evidence_is_empty() { + let ev = Evidence { + source: None, + sink: None, + guards: vec![], + sanitizers: vec![], + state: None, + notes: vec![], + }; + assert!(ev.is_empty()); + + let ev2 = Evidence { + source: Some(SpanEvidence { + path: "x.rs".into(), + line: 1, + col: 1, + kind: "source".into(), + snippet: None, + }), + sink: None, + guards: vec![], + sanitizers: vec![], + state: None, + notes: vec![], + }; + assert!(!ev2.is_empty()); + } + + #[test] + fn confidence_ord() { + assert!(Confidence::Low < Confidence::Medium); + assert!(Confidence::Medium < Confidence::High); + assert!(Confidence::Low < Confidence::High); + } + + #[test] + fn confidence_display_and_parse() { + assert_eq!(Confidence::Low.to_string(), "Low"); + assert_eq!(Confidence::Medium.to_string(), "Medium"); + assert_eq!(Confidence::High.to_string(), "High"); + + assert_eq!("low".parse::<Confidence>().unwrap(), Confidence::Low); + assert_eq!("MEDIUM".parse::<Confidence>().unwrap(), Confidence::Medium); + assert_eq!("High".parse::<Confidence>().unwrap(), Confidence::High); + assert!("invalid".parse::<Confidence>().is_err()); + } + + #[test] + fn compute_confidence_does_not_override_preset() { + // AST patterns set confidence directly; compute_confidence must not overwrite.
+ let mut d = make_diag("rs.quality.expect", Severity::Low); + d.confidence = Some(Confidence::High); + // The post-pass only runs when confidence is None, but verify compute_confidence + // itself would return something different (Low for AST + Low severity), proving + // the guard in scan.rs is necessary. + assert_eq!(compute_confidence(&d), Confidence::Low); + // The actual guard: confidence is already Some, so scan.rs skips compute_confidence. + assert_eq!(d.confidence, Some(Confidence::High)); + } + + #[test] + fn json_omits_none_fields() { + let ev = Evidence { + source: None, + sink: None, + guards: vec![], + sanitizers: vec![], + state: None, + notes: vec![], + }; + let json = serde_json::to_string(&ev).unwrap(); + assert_eq!(json, "{}"); + } +} diff --git a/src/fmt.rs b/src/fmt.rs new file mode 100644 index 00000000..fcd92321 --- /dev/null +++ b/src/fmt.rs @@ -0,0 +1,984 @@ +//! Console output formatting for scan diagnostics. +//! +//! Produces professional, security-tool-grade aligned output with a clear +//! severity hierarchy, normalised taint flow rendering, and stable wrapping. + +use crate::commands::scan::{Diag, SuppressionStats}; +use crate::patterns::Severity; +use console::style; +use std::collections::BTreeMap; + +/// Default maximum line width when terminal size is unknown. +const DEFAULT_WIDTH: usize = 100; + +// ───────────────────────────────────────────────────────────────────────────── +// Public API +// ───────────────────────────────────────────────────────────────────────────── + +/// Render all diagnostics as grouped, formatted console output with a summary. 
+pub fn render_console( + diags: &[Diag], + project_name: &str, + suppression_stats: Option<&SuppressionStats>, +) -> String { + let width = terminal_width(); + let mut out = String::new(); + + let mut grouped: BTreeMap<&str, Vec<&Diag>> = BTreeMap::new(); + for d in diags { + grouped.entry(&d.path).or_default().push(d); + } + + for (path, issues) in &grouped { + // File path header — dim blue, never brighter than severity. + out.push_str(&format!("{}\n", style(path).blue().dim().underlined())); + for d in issues { + out.push_str(&render_diag(d, width)); + out.push('\n'); // blank line between findings + } + } + + let suppressed_count = diags.iter().filter(|d| d.suppressed).count(); + let active_count = diags.len() - suppressed_count; + + if suppressed_count > 0 { + out.push_str(&format!( + "{} '{}' generated {} {} ({} suppressed).\n\n", + style("warning").yellow().bold(), + style(project_name).white().bold(), + style(active_count).bold(), + if active_count == 1 { "issue" } else { "issues" }, + suppressed_count, + )); + } else { + out.push_str(&format!( + "{} '{}' generated {} {}.\n\n", + style("warning").yellow().bold(), + style(project_name).white().bold(), + style(diags.len()).bold(), + if diags.len() == 1 { "issue" } else { "issues" }, + )); + } + + // ── Suppression footer ───────────────────────────────────────────── + if let Some(stats) = suppression_stats { + let total = stats.total_suppressed(); + if total > 0 { + out.push_str(&format!( + "{}\n", + style(format!("Suppressed {total} LOW/Quality findings.")).dim() + )); + out.push_str(&format!("{}\n", style("Active filters:").dim())); + if !stats.include_quality { + out.push_str(&format!( + " {} {}\n", + style("include_quality =").dim(), + style("false").dim() + )); + } + out.push_str(&format!( + " {} {}\n", + style("max_low =").dim(), + style(stats.max_low).dim() + )); + out.push_str(&format!( + " {} {}\n", + style("max_low_per_file =").dim(), + style(stats.max_low_per_file).dim() + )); + 
out.push_str(&format!( + " {} {}\n", + style("max_low_per_rule =").dim(), + style(stats.max_low_per_rule).dim() + )); + out.push_str(&format!( + "\n{}\n", + style("Use --include-quality, --max-low, or --all to adjust.").dim() + )); + } + } + + out +} + +/// Normalise a code snippet for display: collapse whitespace, join lines, +/// clean up method-chain spacing, trim, and truncate. +pub fn normalize_snippet(s: &str) -> String { + // Strip newlines/carriage returns with no replacement, then collapse + // runs of spaces into a single space. + let no_newlines: String = s.chars().filter(|c| *c != '\n' && *c != '\r').collect(); + let collapsed: String = no_newlines.split_whitespace().collect::<Vec<&str>>().join(" "); + // Clean up `) .foo(` → `).foo(` and similar spacing around dots in chains. + let cleaned = collapse_chain_spacing(&collapsed); + let trimmed = cleaned.trim(); + if trimmed.chars().count() > 120 { + format!("{}…", trimmed.chars().take(120).collect::<String>()) + } else { + trimmed.to_string() + } +} + +/// Truncate method chains: keep constructor + first balanced `(...)`, then `…`. +/// +/// E.g.
`Command::new("sh").arg("-c").arg(&cmd)` → `Command::new("sh")…` +#[allow(dead_code)] // public API, used by consumers +pub fn shorten_callee(s: &str) -> String { + let s = s.trim(); + if s.is_empty() { + return String::new(); + } + + let Some(open) = s.find('(') else { + return s.to_string(); + }; + + let mut depth = 0u32; + let mut close = None; + for (i, ch) in s[open..].char_indices() { + match ch { + '(' => depth += 1, + ')' => { + depth -= 1; + if depth == 0 { + close = Some(open + i); + break; + } + } + _ => {} + } + } + + let Some(close_idx) = close else { + return s.to_string(); + }; + + let end = close_idx + 1; + if end < s.len() { + format!("{}…", &s[..end]) + } else { + s.to_string() + } +} + +// ───────────────────────────────────────────────────────────────────────────── +// Internal rendering +// ───────────────────────────────────────────────────────────────────────────── + +/// Indentation for body/evidence lines (spaces). +const BODY_INDENT: usize = 6; + +/// Render a single diagnostic block. 
+fn render_diag(d: &Diag, width: usize) -> String { + let mut out = String::new(); + + // ── Header line ────────────────────────────────────────────────────── + // Format: ` 98:5 ⚠ [MEDIUM] taint-unsanitised-flow (Score: 87, Confidence: Medium)` + let loc = format!("{}:{}", d.line, d.col); + let sev = if d.suppressed { + format!("{} {}", style("○").dim(), style("[SUPPRESSED]").dim(),) + } else { + severity_tag(d.severity) + }; + let meta_suffix = match (d.rank_score, d.confidence) { + (Some(s), Some(c)) => format!( + " {}", + style(format!("(Score: {}, Confidence: {c})", s as u32)).dim() + ), + (Some(s), None) => format!(" {}", style(format!("(Score: {})", s as u32)).dim()), + (None, Some(c)) => format!(" {}", style(format!("(Confidence: {c})")).dim()), + (None, None) => String::new(), + }; + out.push_str(&format!( + " {} {} {}{}\n", + style(&loc).dim(), + sev, + style(&d.id).dim(), + meta_suffix, + )); + + // ── Rollup body ───────────────────────────────────────────────────── + let indent_str = " ".repeat(BODY_INDENT); + if let Some(ref rollup) = d.rollup { + out.push_str(&format!( + "{indent_str}{} ({} occurrences)\n", + style(&d.id).dim(), + rollup.count + )); + if !rollup.occurrences.is_empty() { + let examples: Vec<String> = rollup + .occurrences + .iter() + .map(|loc| format!("{}:{}", loc.line, loc.col)) + .collect(); + out.push_str(&format!( + "{indent_str}{} {}\n", + style("Examples:").dim(), + style(examples.join(", ")).dim() + )); + } + out.push_str(&format!( + "{indent_str}{}\n", + style(format!("Run: nyx scan --show-instances {}", d.id)).dim() + )); + return out; + } + + // ── Message body ───────────────────────────────────────────────────── + if let Some(msg) = &d.message { + let capitalized = capitalize_first(msg); + let wrapped = wrap_text(&capitalized, width, BODY_INDENT); + out.push_str(&format!("{indent_str}{wrapped}\n")); + } + + // ── Evidence labels (Source, Sink, Path guard) ─────────────────────── + if !d.labels.is_empty() { + out.push('\n'); +
let max_label = d.labels.iter().map(|(k, _)| k.len()).max().unwrap_or(0); + let key_width = max_label + 1; // +1 for ':' + for (label, value) in &d.labels { + let key_str = format!("{label}:"); + let value_indent = BODY_INDENT + key_width + 1; // key + space + let wrapped_val = wrap_text(value, width, value_indent); + if label == "Path guard" { + out.push_str(&format!( + "{indent_str}{: String { + match sev { + Severity::High => format!( + "{} [{}]", + style("✖").red().bold(), + style("HIGH").red().bold(), + ), + Severity::Medium => format!( + "{} [{}]", + style("⚠").color256(208).bold(), + style("MEDIUM").color256(208).bold(), + ), + Severity::Low => format!( + "{} [{}]", + style("●").color256(67), + style("LOW").color256(67), + ), + } +} + +// ───────────────────────────────────────────────────────────────────────────── +// Text utilities +// ───────────────────────────────────────────────────────────────────────────── + +/// Collapse spacing artefacts in method chains. +/// +/// - `") .foo("` → `").foo("` (space between `)` and `.`) +/// - Multiple spaces → single space +fn collapse_chain_spacing(s: &str) -> String { + let mut out = String::with_capacity(s.len()); + let chars: Vec = s.chars().collect(); + let len = chars.len(); + let mut i = 0; + + while i < len { + // Pattern: `)` followed by whitespace then `.` + if chars[i] == ')' { + out.push(')'); + i += 1; + // Skip whitespace between `)` and `.` + let ws_start = i; + while i < len && chars[i] == ' ' { + i += 1; + } + if i < len && chars[i] == '.' { + // Collapse: emit `.` directly after `)` + continue; + } else { + // Not a chain continuation — emit the whitespace we skipped + for c in &chars[ws_start..i] { + out.push(*c); + } + } + } else { + out.push(chars[i]); + i += 1; + } + } + out +} + +/// Word-wrap text to fit within `max_width`, with continuation lines indented +/// to `indent` spaces. The first line is NOT indented (caller handles that). 
+fn wrap_text(text: &str, max_width: usize, indent: usize) -> String { + let available_first = max_width.saturating_sub(indent); + let available_cont = max_width.saturating_sub(indent); + if available_first == 0 || text.len() <= available_first { + return text.to_string(); + } + + let indent_str = " ".repeat(indent); + let mut result = String::new(); + let mut line_len = 0usize; + let mut first_line = true; + + for word in text.split_whitespace() { + let wlen = word.len(); + let avail = if first_line { + available_first + } else { + available_cont + }; + + if line_len == 0 { + result.push_str(word); + line_len = wlen; + } else if line_len + 1 + wlen > avail { + result.push('\n'); + result.push_str(&indent_str); + result.push_str(word); + line_len = wlen; + first_line = false; + } else { + result.push(' '); + result.push_str(word); + line_len += 1 + wlen; + } + } + + result +} + +/// Get terminal width, falling back to DEFAULT_WIDTH. +fn terminal_width() -> usize { + terminal_size::terminal_size() + .map(|(w, _)| w.0 as usize) + .unwrap_or(DEFAULT_WIDTH) +} + +/// Capitalise the first character of a string. +fn capitalize_first(s: &str) -> String { + let mut chars = s.chars(); + match chars.next() { + None => String::new(), + Some(c) => { + let mut out = String::with_capacity(s.len()); + for upper in c.to_uppercase() { + out.push(upper); + } + out.push_str(chars.as_str()); + out + } + } +} + +// ───────────────────────────────────────────────────────────────────────────── +// Tests +// ───────────────────────────────────────────────────────────────────────────── + +#[cfg(test)] +mod tests { + use super::*; + + // ── Helpers ────────────────────────────────────────────────────────── + + /// Strip ANSI escape codes for testing visible content. 
+ fn strip_ansi(s: &str) -> String { + let mut result = String::new(); + let mut in_escape = false; + for ch in s.chars() { + if ch == '\x1b' { + in_escape = true; + } else if in_escape { + if ch == 'm' { + in_escape = false; + } + } else { + result.push(ch); + } + } + result + } + + // ── normalize_snippet ──────────────────────────────────────────────── + + #[test] + fn normalize_snippet_strips_newlines_no_space() { + // Newlines are removed with no whitespace inserted in their place. + assert_eq!(normalize_snippet("foo\nbar\rbaz"), "foobarbaz"); + } + + #[test] + fn normalize_snippet_collapses_whitespace() { + assert_eq!( + normalize_snippet("Command::new(\"tar\") .arg(\"-czf\")"), + "Command::new(\"tar\").arg(\"-czf\")" + ); + } + + #[test] + fn normalize_snippet_trims() { + assert_eq!(normalize_snippet(" hello "), "hello"); + } + + #[test] + fn normalize_snippet_truncates_at_120() { + let long = "a".repeat(200); + let result = normalize_snippet(&long); + // 120 chars + '…' (3 bytes UTF-8) + assert!(result.len() > 120); + assert!(result.ends_with('…')); + } + + #[test] + fn normalize_snippet_short_unchanged() { + assert_eq!(normalize_snippet("short"), "short"); + } + + // ── collapse_chain_spacing ─────────────────────────────────────────── + + #[test] + fn collapse_chain_removes_space_before_dot() { + assert_eq!( + collapse_chain_spacing("foo() .bar() .baz()"), + "foo().bar().baz()" + ); + } + + #[test] + fn collapse_chain_preserves_non_chain_spacing() { + assert_eq!(collapse_chain_spacing("foo() + bar()"), "foo() + bar()"); + } + + #[test] + fn collapse_chain_multiple_spaces() { + assert_eq!( + collapse_chain_spacing("cmd() .arg(\"-c\")"), + "cmd().arg(\"-c\")" + ); + } + + // ── shorten_callee ─────────────────────────────────────────────────── + + #[test] + fn shorten_callee_truncates_chain() { + assert_eq!( + shorten_callee("Command::new(\"sh\").arg(\"-c\").arg(&cmd)"), + "Command::new(\"sh\")…" + ); + } + + #[test] + fn shorten_callee_no_chain_unchanged() 
{ + assert_eq!(shorten_callee("env::var(\"HOME\")"), "env::var(\"HOME\")"); + } + + #[test] + fn shorten_callee_nested_parens() { + assert_eq!(shorten_callee("foo(bar(1, 2)).baz()"), "foo(bar(1, 2))…"); + } + + #[test] + fn shorten_callee_no_parens() { + assert_eq!(shorten_callee("simple_name"), "simple_name"); + } + + #[test] + fn shorten_callee_empty() { + assert_eq!(shorten_callee(""), ""); + } + + // ── wrap_text ──────────────────────────────────────────────────────── + + #[test] + fn wrap_short_text_unchanged() { + assert_eq!(wrap_text("short text", 80, 4), "short text"); + } + + #[test] + fn wrap_breaks_at_boundary() { + let text = "word1 word2 word3 word4 word5"; + let result = wrap_text(text, 20, 4); + assert!(result.contains('\n')); + for line in result.lines().skip(1) { + assert!(line.starts_with(" ")); + } + } + + // ── severity_tag ───────────────────────────────────────────────────── + + #[test] + fn severity_tags_contain_level_name() { + let h = strip_ansi(&severity_tag(Severity::High)); + let m = strip_ansi(&severity_tag(Severity::Medium)); + let l = strip_ansi(&severity_tag(Severity::Low)); + assert!(h.contains("HIGH"), "got: {h}"); + assert!(m.contains("MEDIUM"), "got: {m}"); + assert!(l.contains("LOW"), "got: {l}"); + } + + #[test] + fn severity_tags_have_icons() { + let h = strip_ansi(&severity_tag(Severity::High)); + let m = strip_ansi(&severity_tag(Severity::Medium)); + let l = strip_ansi(&severity_tag(Severity::Low)); + assert!(h.contains('✖'), "HIGH should have ✖"); + assert!(m.contains('⚠'), "MEDIUM should have ⚠"); + assert!(l.contains('●'), "LOW should have ●"); + } + + // ── render_console ─────────────────────────────────────────────────── + + #[test] + fn render_console_groups_by_file() { + let diags = vec![ + Diag { + path: "src/a.rs".into(), + line: 10, + col: 5, + severity: Severity::High, + id: "test-rule".into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: 
Some("test message".into()), + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }, + Diag { + path: "src/b.rs".into(), + line: 20, + col: 1, + severity: Severity::Low, + id: "another-rule".into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }, + ]; + let output = render_console(&diags, "test-project", None); + let stripped = strip_ansi(&output); + assert!(stripped.contains("src/a.rs")); + assert!(stripped.contains("src/b.rs")); + assert!(stripped.contains("2 issues")); + assert!(stripped.contains("test-project")); + } + + #[test] + fn render_console_evidence_displayed() { + let diags = vec![Diag { + path: "src/main.rs".into(), + line: 42, + col: 5, + severity: Severity::High, + id: "taint-unsanitised-flow (source 12:3)".into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: Some("unsanitised input".into()), + labels: vec![ + ("Source".into(), "env::var(\"HOME\") at 12:3".into()), + ("Sink".into(), "Command::new(\"sh\")".into()), + ], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }]; + let output = render_console(&diags, "proj", None); + let stripped = strip_ansi(&output); + assert!(stripped.contains("Source:"), "should contain Source label"); + assert!(stripped.contains("Sink:"), "should contain Sink label"); + // No backticks in output + assert!( + !stripped.contains('`'), + "should not contain backticks in evidence" + ); + } + + #[test] + fn render_console_blank_line_between_findings() { + let diags = vec![ + Diag { + path: "src/a.rs".into(), + line: 1, + col: 1, + severity: 
Severity::High, + id: "rule-a".into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: Some("first".into()), + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }, + Diag { + path: "src/a.rs".into(), + line: 10, + col: 1, + severity: Severity::Medium, + id: "rule-b".into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: Some("second".into()), + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }, + ]; + let output = render_console(&diags, "proj", None); + let stripped = strip_ansi(&output); + // There should be a blank line between the two findings + assert!( + stripped.contains("First\n\n"), + "blank line between findings: {stripped}" + ); + } + + #[test] + fn json_omits_empty_labels() { + let d = Diag { + path: "x.rs".into(), + line: 1, + col: 1, + severity: Severity::Low, + id: "test".into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }; + let json = serde_json::to_string(&d).unwrap(); + assert!( + !json.contains("labels"), + "empty labels should be omitted from JSON" + ); + } + + #[test] + fn json_omits_rank_fields_when_none() { + let d = Diag { + path: "x.rs".into(), + line: 1, + col: 1, + severity: Severity::Low, + id: "test".into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: 
None, + }; + let json = serde_json::to_string(&d).unwrap(); + assert!( + !json.contains("rank_score"), + "rank_score should be omitted when None" + ); + assert!( + !json.contains("rank_reason"), + "rank_reason should be omitted when None" + ); + } + + #[test] + fn json_includes_rank_score_when_set() { + let d = Diag { + path: "x.rs".into(), + line: 1, + col: 1, + severity: Severity::High, + id: "taint-unsanitised-flow".into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: None, + evidence: None, + rank_score: Some(120.0), + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }; + let json = serde_json::to_string(&d).unwrap(); + assert!( + json.contains("rank_score"), + "rank_score should be present when set" + ); + assert!(json.contains("120"), "rank_score value should appear"); + } + + // ── capitalize_first ───────────────────────────────────────────────── + + #[test] + fn capitalize_first_works() { + assert_eq!(capitalize_first("hello"), "Hello"); + assert_eq!(capitalize_first(""), ""); + assert_eq!(capitalize_first("A"), "A"); + assert_eq!(capitalize_first("unsanitised"), "Unsanitised"); + } + + // ── taint flow rendering (integration-style) ───────────────────────── + + #[test] + fn taint_flow_no_broken_backticks_or_weird_spacing() { + let raw_sink = "Command::new(\"tar\") .arg(\"-czf\") .arg(\"/backups/nightly.tar.gz\") .arg(\"/var/data\") .output()"; + let normalised = normalize_snippet(raw_sink); + // Chain spacing should be collapsed + assert!( + !normalised.contains(") ."), + "chain spacing should be collapsed: {normalised}" + ); + assert!(!normalised.contains(" "), "no double-spaces: {normalised}"); + // Should not contain backticks + assert!(!normalised.contains('`'), "no backticks: {normalised}"); + } + + #[test] + fn multiline_sink_joined_and_normalised() { + let raw = "Command::new(\"tar\")\n .arg(\"-czf\")\n 
.arg(\"/backups/nightly.tar.gz\")\n .arg(\"/var/data\")\n .output()"; + let normalised = normalize_snippet(raw); + assert_eq!( + normalised, + "Command::new(\"tar\").arg(\"-czf\").arg(\"/backups/nightly.tar.gz\").arg(\"/var/data\").output()" + ); + } + + // ── confidence display ────────────────────────────────────────────── + + #[test] + fn confidence_after_score_on_header_line() { + let d = Diag { + path: "src/a.rs".into(), + line: 510, + col: 5, + severity: Severity::Medium, + id: "cfg-unguarded-sink".into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: Some("dangerous sink".into()), + labels: vec![], + confidence: Some(crate::evidence::Confidence::Medium), + evidence: None, + rank_score: Some(36.0), + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }; + let output = render_diag(&d, 120); + let stripped = strip_ansi(&output); + // Header line should contain score and confidence together + let header = stripped.lines().next().unwrap(); + assert!( + header.contains("(Score: 36, Confidence: Medium)"), + "header should contain '(Score: 36, Confidence: Medium)': {header}" + ); + // No standalone Confidence line + let non_header_lines: Vec<&str> = stripped.lines().skip(1).collect(); + assert!( + !non_header_lines + .iter() + .any(|l| l.trim().starts_with("Confidence:")), + "should not have standalone Confidence line" + ); + } + + #[test] + fn confidence_title_case() { + for (conf, expected) in [ + (crate::evidence::Confidence::Low, "Confidence: Low"), + (crate::evidence::Confidence::Medium, "Confidence: Medium"), + (crate::evidence::Confidence::High, "Confidence: High"), + ] { + let d = Diag { + path: "x.rs".into(), + line: 1, + col: 1, + severity: Severity::Low, + id: "test".into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: Some(conf), + evidence: None, + 
rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }; + let output = render_diag(&d, 100); + let stripped = strip_ansi(&output); + assert!( + stripped.contains(expected), + "expected '{expected}' in: {stripped}" + ); + } + } + + #[test] + fn confidence_none_only_score() { + let d = Diag { + path: "src/a.rs".into(), + line: 10, + col: 5, + severity: Severity::High, + id: "test-rule".into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: Some("test message".into()), + labels: vec![], + confidence: None, + evidence: None, + rank_score: Some(42.0), + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }; + let output = render_diag(&d, 100); + let stripped = strip_ansi(&output); + let header = stripped.lines().next().unwrap(); + assert!( + header.contains("(Score: 42)"), + "should show score without confidence: {header}" + ); + assert!( + !header.contains("Confidence"), + "should not mention confidence when None: {header}" + ); + } + + #[test] + fn confidence_only_no_score() { + let d = Diag { + path: "src/a.rs".into(), + line: 10, + col: 5, + severity: Severity::High, + id: "test-rule".into(), + category: crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: Some(crate::evidence::Confidence::High), + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }; + let output = render_diag(&d, 100); + let stripped = strip_ansi(&output); + let header = stripped.lines().next().unwrap(); + assert!( + header.contains("(Confidence: High)"), + "should show confidence without score: {header}" + ); + } + + #[test] + fn json_omits_confidence_when_none() { + let d = Diag { + path: "x.rs".into(), + line: 1, + col: 1, + severity: Severity::Low, + id: "test".into(), + category: 
crate::patterns::FindingCategory::Security, + path_validated: false, + guard_kind: None, + message: None, + labels: vec![], + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + }; + let json = serde_json::to_string(&d).unwrap(); + assert!( + !json.contains("confidence"), + "confidence should be omitted when None: {json}" + ); + } +} diff --git a/src/labels/c.rs b/src/labels/c.rs index 3e1b7c28..3052f4b9 100644 --- a/src/labels/c.rs +++ b/src/labels/c.rs @@ -31,6 +31,10 @@ pub static RULES: &[LabelRule] = &[ matchers: &["printf", "fprintf"], label: DataLabel::Sink(Cap::FMT_STRING), }, + LabelRule { + matchers: &["fopen", "open"], + label: DataLabel::Sink(Cap::FILE_IO), + }, ]; pub static KINDS: Map<&'static str, Kind> = phf_map! { @@ -39,6 +43,9 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "while_statement" => Kind::While, "for_statement" => Kind::For, "do_statement" => Kind::While, + "switch_statement" => Kind::Block, + "case_statement" => Kind::Block, + "labeled_statement" => Kind::Block, "return_statement" => Kind::Return, "break_statement" => Kind::Break, @@ -47,6 +54,7 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { // structure "translation_unit" => Kind::SourceFile, "compound_statement" => Kind::Block, + "else_clause" => Kind::Block, "function_definition" => Kind::Function, // data-flow diff --git a/src/labels/cpp.rs b/src/labels/cpp.rs index 02d49a9c..c37f9372 100644 --- a/src/labels/cpp.rs +++ b/src/labels/cpp.rs @@ -29,6 +29,10 @@ pub static RULES: &[LabelRule] = &[ matchers: &["printf", "fprintf"], label: DataLabel::Sink(Cap::FMT_STRING), }, + LabelRule { + matchers: &["fopen", "open"], + label: DataLabel::Sink(Cap::FILE_IO), + }, ]; pub static KINDS: Map<&'static str, Kind> = phf_map! { @@ -38,15 +42,23 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! 
{ "for_statement" => Kind::For, "for_range_loop" => Kind::For, "do_statement" => Kind::While, + "switch_statement" => Kind::Block, + "case_statement" => Kind::Block, + "labeled_statement" => Kind::Block, "return_statement" => Kind::Return, + "throw_statement" => Kind::Return, "break_statement" => Kind::Break, "continue_statement" => Kind::Continue, // structure "translation_unit" => Kind::SourceFile, "compound_statement" => Kind::Block, + "else_clause" => Kind::Block, "function_definition" => Kind::Function, + "try_statement" => Kind::Block, + "catch_clause" => Kind::Block, + "lambda_expression" => Kind::Block, // data-flow "call_expression" => Kind::CallFn, @@ -63,7 +75,7 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "preproc_include" => Kind::Trivia, "preproc_def" => Kind::Trivia, "using_declaration" => Kind::Trivia, - "namespace_definition" => Kind::Trivia, + "namespace_definition" => Kind::Block, }; pub static PARAM_CONFIG: ParamConfig = ParamConfig { diff --git a/src/labels/go.rs b/src/labels/go.rs index d70cdf8e..1c348b7b 100644 --- a/src/labels/go.rs +++ b/src/labels/go.rs @@ -8,7 +8,17 @@ pub static RULES: &[LabelRule] = &[ label: DataLabel::Source(Cap::all()), }, LabelRule { - matchers: &["http.Request", "r.FormValue", "r.URL"], + matchers: &[ + "http.Request", + "r.FormValue", + "r.URL", + "r.Body", + "r.Header", + "r.URL.Query", + "r.URL.Query.Get", + "Request.FormValue", + "Request.URL", + ], label: DataLabel::Source(Cap::all()), }, // ───────── Sanitizers ────────── @@ -17,18 +27,40 @@ pub static RULES: &[LabelRule] = &[ label: DataLabel::Sanitizer(Cap::HTML_ESCAPE), }, LabelRule { - matchers: &["url.QueryEscape"], + matchers: &["url.QueryEscape", "url.PathEscape"], label: DataLabel::Sanitizer(Cap::URL_ENCODE), }, + LabelRule { + matchers: &["filepath.Clean", "filepath.Base"], + label: DataLabel::Sanitizer(Cap::FILE_IO), + }, // ─────────── Sinks ───────────── LabelRule { matchers: &["exec.Command"], label: 
DataLabel::Sink(Cap::SHELL_ESCAPE), }, LabelRule { - matchers: &["db.Query", "db.Exec"], + matchers: &["db.Query", "db.Exec", "db.QueryRow", "db.Prepare"], label: DataLabel::Sink(Cap::SHELL_ESCAPE), }, + LabelRule { + matchers: &["fmt.Fprintf", "fmt.Sprintf", "fmt.Printf"], + label: DataLabel::Sink(Cap::FMT_STRING), + }, + LabelRule { + matchers: &[ + "os.Open", + "os.OpenFile", + "os.Create", + "ioutil.ReadFile", + "os.ReadFile", + ], + label: DataLabel::Sink(Cap::FILE_IO), + }, + LabelRule { + matchers: &["template.HTML"], + label: DataLabel::Sink(Cap::HTML_ESCAPE), + }, ]; pub static KINDS: Map<&'static str, Kind> = phf_map! { @@ -46,6 +78,16 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "statement_list" => Kind::Block, "function_declaration" => Kind::Function, "method_declaration" => Kind::Function, + "func_literal" => Kind::Function, + "expression_switch_statement" => Kind::Block, + "type_switch_statement" => Kind::Block, + "expression_case" => Kind::Block, + "type_case" => Kind::Block, + "default_case" => Kind::Block, + "select_statement" => Kind::Block, + "communication_case" => Kind::Block, + "go_statement" => Kind::Block, + "defer_statement" => Kind::Block, // data-flow "call_expression" => Kind::CallFn, diff --git a/src/labels/java.rs b/src/labels/java.rs index 02a36ee1..078aecf3 100644 --- a/src/labels/java.rs +++ b/src/labels/java.rs @@ -8,7 +8,19 @@ pub static RULES: &[LabelRule] = &[ label: DataLabel::Source(Cap::all()), }, LabelRule { - matchers: &["getParameter", "getInputStream", "getHeader", "getCookies"], + matchers: &[ + "getParameter", + "getInputStream", + "getHeader", + "getCookies", + "getReader", + "getQueryString", + "getPathInfo", + ], + label: DataLabel::Source(Cap::all()), + }, + LabelRule { + matchers: &["readObject", "readLine"], label: DataLabel::Source(Cap::all()), }, // ───────── Sanitizers ────────── @@ -18,13 +30,21 @@ pub static RULES: &[LabelRule] = &[ }, // ─────────── Sinks ───────────── LabelRule { - matchers: 
&["Runtime.exec"], + matchers: &["Runtime.exec", "ProcessBuilder"], label: DataLabel::Sink(Cap::SHELL_ESCAPE), }, LabelRule { matchers: &["executeQuery", "executeUpdate", "prepareStatement"], label: DataLabel::Sink(Cap::SHELL_ESCAPE), }, + LabelRule { + matchers: &["Class.forName"], + label: DataLabel::Sink(Cap::SHELL_ESCAPE), + }, + LabelRule { + matchers: &["println", "print", "write"], + label: DataLabel::Sink(Cap::HTML_ESCAPE), + }, ]; pub static KINDS: Map<&'static str, Kind> = phf_map! { @@ -33,8 +53,10 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "while_statement" => Kind::While, "for_statement" => Kind::For, "enhanced_for_statement" => Kind::For, + "do_statement" => Kind::While, "return_statement" => Kind::Return, + "throw_statement" => Kind::Return, "break_statement" => Kind::Break, "continue_statement" => Kind::Continue, @@ -46,6 +68,15 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "interface_body" => Kind::Block, "method_declaration" => Kind::Function, "constructor_declaration" => Kind::Function, + "switch_expression" => Kind::Block, + "switch_block" => Kind::Block, + "switch_block_statement_group" => Kind::Block, + "try_statement" => Kind::Block, + "catch_clause" => Kind::Block, + "finally_clause" => Kind::Block, + "lambda_expression" => Kind::Block, + "constructor_body" => Kind::Block, + "static_initializer" => Kind::Block, // data-flow "method_invocation" => Kind::CallMethod, diff --git a/src/labels/javascript.rs b/src/labels/javascript.rs index 60665099..d726c25b 100644 --- a/src/labels/javascript.rs +++ b/src/labels/javascript.rs @@ -62,6 +62,7 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "while_statement" => Kind::While, "for_statement" => Kind::For, "for_in_statement" => Kind::For, + "do_statement" => Kind::While, "return_statement" => Kind::Return, "throw_statement" => Kind::Return, @@ -71,9 +72,24 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! 
{ // structure "program" => Kind::SourceFile, "statement_block" => Kind::Block, + "else_clause" => Kind::Block, "function_declaration" => Kind::Function, + "function_expression" => Kind::Function, "arrow_function" => Kind::Function, "method_definition" => Kind::Function, + "generator_function_declaration" => Kind::Function, + "generator_function" => Kind::Function, + "switch_statement" => Kind::Block, + "switch_body" => Kind::Block, + "switch_case" => Kind::Block, + "switch_default" => Kind::Block, + "try_statement" => Kind::Block, + "catch_clause" => Kind::Block, + "finally_clause" => Kind::Block, + "class_declaration" => Kind::Block, + "class" => Kind::Block, + "class_body" => Kind::Block, + "export_statement" => Kind::Block, // data-flow "call_expression" => Kind::CallFn, diff --git a/src/labels/mod.rs b/src/labels/mod.rs index 8f5e623c..7d5623e9 100644 --- a/src/labels/mod.rs +++ b/src/labels/mod.rs @@ -41,7 +41,6 @@ pub enum Kind { InfiniteLoop, While, For, - LoopBody, CallFn, CallMethod, CallMacro, @@ -196,7 +195,7 @@ pub fn lookup(lang: &str, raw: &str) -> Kind { } /// The kind of taint source, used to refine finding severity. -#[derive(Debug, Clone, Copy, PartialEq, Eq)] +#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)] pub enum SourceKind { /// Direct user input (request params, argv, stdin, form data) UserInput, @@ -375,6 +374,11 @@ pub fn classify(lang: &str, text: &str, extra: Option<&[RuntimeLabelRule]>) -> O let head = text.split(['(', '<']).next().unwrap_or(""); let trimmed = head.trim().as_bytes(); + // For chained calls like `r.URL.Query().Get`, also strip internal + // `().` segments to produce a normalized form like `r.URL.Query.Get`. 
+ let full_normalized = normalize_chained_call(text); + let full_norm_bytes = full_normalized.as_bytes(); + // ── Check runtime (config) rules first — they take priority ────── if let Some(extras) = extra { // Pass 1: exact / suffix @@ -384,12 +388,8 @@ pub fn classify(lang: &str, text: &str, extra: Option<&[RuntimeLabelRule]>) -> O if m.last() == Some(&b'_') { continue; } - if ends_with_ignore_case(trimmed, m) { - let start = trimmed.len() - m.len(); - let ok = start == 0 || matches!(trimmed[start - 1], b'.' | b':'); - if ok { - return Some(rule.label); - } + if match_suffix(trimmed, m) || match_suffix(full_norm_bytes, m) { + return Some(rule.label); } } } @@ -397,7 +397,10 @@ pub fn classify(lang: &str, text: &str, extra: Option<&[RuntimeLabelRule]>) -> O for rule in extras { for raw in &rule.matchers { let m = raw.as_bytes(); - if m.last() == Some(&b'_') && starts_with_ignore_case(trimmed, m) { + if m.last() == Some(&b'_') + && (starts_with_ignore_case(trimmed, m) + || starts_with_ignore_case(full_norm_bytes, m)) + { return Some(rule.label); } } @@ -417,12 +420,8 @@ pub fn classify(lang: &str, text: &str, extra: Option<&[RuntimeLabelRule]>) -> O if m.last() == Some(&b'_') { continue; } - if ends_with_ignore_case(trimmed, m) { - let start = trimmed.len() - m.len(); - let ok = start == 0 || matches!(trimmed[start - 1], b'.' 
| b':'); - if ok { - return Some(rule.label); - } + if match_suffix(trimmed, m) || match_suffix(full_norm_bytes, m) { + return Some(rule.label); } } } @@ -431,7 +430,10 @@ pub fn classify(lang: &str, text: &str, extra: Option<&[RuntimeLabelRule]>) -> O for rule in *rules { for raw in rule.matchers { let m = raw.as_bytes(); - if m.last() == Some(&b'_') && starts_with_ignore_case(trimmed, m) { + if m.last() == Some(&b'_') + && (starts_with_ignore_case(trimmed, m) + || starts_with_ignore_case(full_norm_bytes, m)) + { return Some(rule.label); } } @@ -440,6 +442,58 @@ pub fn classify(lang: &str, text: &str, extra: Option<&[RuntimeLabelRule]>) -> O None } +/// Check if `text` ends with `matcher` at a word boundary (`.` or `:`). +#[inline] +fn match_suffix(text: &[u8], matcher: &[u8]) -> bool { + if ends_with_ignore_case(text, matcher) { + let start = text.len() - matcher.len(); + start == 0 || matches!(text[start - 1], b'.' | b':') + } else { + false + } +} + +/// Normalize a chained method call: strip `()` between `.` segments. +/// e.g. `r.URL.Query().Get` → `r.URL.Query.Get` +/// e.g. `r.URL.Query().Get("host")` → `r.URL.Query.Get` +fn normalize_chained_call(text: &str) -> String { + let mut result = String::with_capacity(text.len()); + let bytes = text.as_bytes(); + let mut i = 0; + while i < bytes.len() { + match bytes[i] { + b'(' => { + // Skip from `(` to matching `)`, but only if followed by `.` + // This handles `Query().Get` → `Query.Get` + let mut depth = 1u32; + let mut j = i + 1; + while j < bytes.len() && depth > 0 { + if bytes[j] == b'(' { + depth += 1; + } else if bytes[j] == b')' { + depth -= 1; + } + j += 1; + } + // If we're at end or next char is `.`, skip the parens + if j >= bytes.len() || bytes[j] == b'.' 
{ + i = j; + } else { + // Keep the paren content (unusual case) + result.push('('); + i += 1; + } + } + b'<' => break, // Stop at generic args + _ => { + result.push(bytes[i] as char); + i += 1; + } + } + } + result +} + #[cfg(test)] mod tests { use super::*; diff --git a/src/labels/php.rs b/src/labels/php.rs index 5a4837f9..2777cd18 100644 --- a/src/labels/php.rs +++ b/src/labels/php.rs @@ -3,8 +3,24 @@ use phf::{Map, phf_map}; pub static RULES: &[LabelRule] = &[ // ─────────── Sources ─────────── + // Note: PHP `$` prefix is stripped by collect_idents, so match without `$`. LabelRule { - matchers: &["$_GET", "$_POST", "$_REQUEST", "$_COOKIE"], + matchers: &[ + "$_GET", + "_GET", + "$_POST", + "_POST", + "$_REQUEST", + "_REQUEST", + "$_COOKIE", + "_COOKIE", + "$_FILES", + "_FILES", + "$_SERVER", + "_SERVER", + "$_ENV", + "_ENV", + ], label: DataLabel::Source(Cap::all()), }, LabelRule { @@ -20,17 +36,44 @@ pub static RULES: &[LabelRule] = &[ matchers: &["escapeshellarg", "escapeshellcmd"], label: DataLabel::Sanitizer(Cap::SHELL_ESCAPE), }, + LabelRule { + matchers: &["basename"], + label: DataLabel::Sanitizer(Cap::FILE_IO), + }, // ─────────── Sinks ───────────── LabelRule { - matchers: &["system", "exec", "passthru", "shell_exec"], + matchers: &[ + "system", + "exec", + "passthru", + "shell_exec", + "proc_open", + "popen", + ], label: DataLabel::Sink(Cap::SHELL_ESCAPE), }, + LabelRule { + matchers: &["eval", "assert"], + label: DataLabel::Sink(Cap::SHELL_ESCAPE), + }, + LabelRule { + matchers: &["include", "include_once", "require", "require_once"], + label: DataLabel::Sink(Cap::FILE_IO), + }, + LabelRule { + matchers: &["unserialize"], + label: DataLabel::Sink(Cap::SHELL_ESCAPE), + }, + LabelRule { + matchers: &["move_uploaded_file", "copy", "file_put_contents", "fwrite"], + label: DataLabel::Sink(Cap::FILE_IO), + }, LabelRule { matchers: &["echo", "print"], label: DataLabel::Sink(Cap::HTML_ESCAPE), }, LabelRule { - matchers: &["mysqli_query", "pg_query"], + 
matchers: &["mysqli_query", "pg_query", "query"], label: DataLabel::Sink(Cap::SHELL_ESCAPE), }, ]; @@ -41,16 +84,29 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "while_statement" => Kind::While, "for_statement" => Kind::For, "foreach_statement" => Kind::For, + "do_statement" => Kind::While, "return_statement" => Kind::Return, + "throw_expression" => Kind::Return, "break_statement" => Kind::Break, "continue_statement" => Kind::Continue, // structure "program" => Kind::SourceFile, "compound_statement" => Kind::Block, + "else_clause" => Kind::Block, + "else_if_clause" => Kind::Block, "function_definition" => Kind::Function, "method_declaration" => Kind::Function, + "switch_statement" => Kind::Block, + "switch_block" => Kind::Block, + "case_statement" => Kind::Block, + "default_statement" => Kind::Block, + "try_statement" => Kind::Block, + "catch_clause" => Kind::Block, + "finally_clause" => Kind::Block, + "colon_block" => Kind::Block, + "class_declaration" => Kind::Block, // data-flow "function_call_expression" => Kind::CallFn, diff --git a/src/labels/python.rs b/src/labels/python.rs index e5dede2f..df9634da 100644 --- a/src/labels/python.rs +++ b/src/labels/python.rs @@ -24,7 +24,7 @@ pub static RULES: &[LabelRule] = &[ }, LabelRule { matchers: &["open"], - label: DataLabel::Source(Cap::all()), + label: DataLabel::Sink(Cap::FILE_IO), }, LabelRule { matchers: &[ @@ -65,6 +65,14 @@ pub static RULES: &[LabelRule] = &[ matchers: &["cursor.execute", "cursor.executemany"], label: DataLabel::Sink(Cap::SHELL_ESCAPE), }, + LabelRule { + matchers: &["send_file", "send_from_directory"], + label: DataLabel::Sink(Cap::FILE_IO), + }, + LabelRule { + matchers: &["os.path.realpath"], + label: DataLabel::Sanitizer(Cap::FILE_IO), + }, ]; pub static KINDS: Map<&'static str, Kind> = phf_map! { @@ -74,13 +82,24 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! 
{ "for_statement" => Kind::For, "return_statement" => Kind::Return, + "raise_statement" => Kind::Return, "break_statement" => Kind::Break, "continue_statement" => Kind::Continue, // structure "module" => Kind::SourceFile, "block" => Kind::Block, + "else_clause" => Kind::Block, + "elif_clause" => Kind::Block, + "with_statement" => Kind::Block, "function_definition" => Kind::Function, + "try_statement" => Kind::Block, + "except_clause" => Kind::Block, + "finally_clause" => Kind::Block, + "class_definition" => Kind::Block, + "decorated_definition" => Kind::Block, + "match_statement" => Kind::Block, + "case_clause" => Kind::Block, // data-flow "call" => Kind::CallFn, diff --git a/src/labels/ruby.rs b/src/labels/ruby.rs index 2a8a731e..e4a2def9 100644 --- a/src/labels/ruby.rs +++ b/src/labels/ruby.rs @@ -40,6 +40,7 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "if" => Kind::If, "unless" => Kind::If, "while" => Kind::While, + "until" => Kind::While, "for" => Kind::For, "return" => Kind::Return, @@ -49,15 +50,26 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! 
{ // structure "program" => Kind::SourceFile, "body_statement" => Kind::Block, - "do_block" => Kind::Block, + "do_block" => Kind::Function, "then" => Kind::Block, "else" => Kind::Block, + "elsif" => Kind::If, + + "begin" => Kind::Block, + "rescue" => Kind::Block, + "ensure" => Kind::Block, + "case" => Kind::Block, + "when" => Kind::Block, + "class" => Kind::Block, + "module" => Kind::Block, + "do" => Kind::Block, + "block" => Kind::Function, // data-flow "call" => Kind::CallFn, - "method_call" => Kind::CallFn, "assignment" => Kind::Assignment, "method" => Kind::Function, + "singleton_method" => Kind::Function, // trivia "comment" => Kind::Trivia, diff --git a/src/labels/rust.rs b/src/labels/rust.rs index 889a8b5a..bf403c67 100644 --- a/src/labels/rust.rs +++ b/src/labels/rust.rs @@ -8,7 +8,7 @@ pub static RULES: &[LabelRule] = &[ label: DataLabel::Source(Cap::all()), }, LabelRule { - matchers: &["fs::read_to_string", "source_file"], + matchers: &["source_file"], label: DataLabel::Source(Cap::all()), }, // ───────── Sanitizers ────────── @@ -36,17 +36,29 @@ pub static RULES: &[LabelRule] = &[ matchers: &["sink_html"], label: DataLabel::Sink(Cap::HTML_ESCAPE), }, + LabelRule { + matchers: &[ + "fs::read_to_string", + "fs::write", + "fs::read", + "File::open", + "File::create", + ], + label: DataLabel::Sink(Cap::FILE_IO), + }, ]; pub static KINDS: Map<&'static str, Kind> = phf_map! { // control-flow "if_expression" => Kind::If, "loop_expression" => Kind::InfiniteLoop, - "loop_statement" => Kind::LoopBody, "while_statement" => Kind::While, + "while_expression" => Kind::While, "for_statement" => Kind::For, + "for_expression" => Kind::For, "return_statement" => Kind::Return, + "return_expression" => Kind::Return, "break_expression" => Kind::Break, "break_statement" => Kind::Break, "continue_expression" => Kind::Continue, @@ -55,7 +67,17 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! 
{ // structure "source_file" => Kind::SourceFile, "block" => Kind::Block, + "else_clause" => Kind::Block, + "match_expression" => Kind::Block, + "match_block" => Kind::Block, + "match_arm" => Kind::Block, + "unsafe_block" => Kind::Block, "function_item" => Kind::Function, + "closure_expression" => Kind::Block, + "async_block" => Kind::Block, + "impl_item" => Kind::Block, + "trait_item" => Kind::Block, + "declaration_list" => Kind::Block, // data-flow "call_expression" => Kind::CallFn, diff --git a/src/labels/typescript.rs b/src/labels/typescript.rs index fcae2dec..6ce552c5 100644 --- a/src/labels/typescript.rs +++ b/src/labels/typescript.rs @@ -50,18 +50,36 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "while_statement" => Kind::While, "for_statement" => Kind::For, "for_in_statement" => Kind::For, - "for_of_statement" => Kind::For, + "do_statement" => Kind::While, "return_statement" => Kind::Return, + "throw_statement" => Kind::Return, "break_statement" => Kind::Break, "continue_statement" => Kind::Continue, // structure "program" => Kind::SourceFile, "statement_block" => Kind::Block, + "else_clause" => Kind::Block, "function_declaration" => Kind::Function, + "function_expression" => Kind::Function, "arrow_function" => Kind::Function, "method_definition" => Kind::Function, + "generator_function_declaration" => Kind::Function, + "generator_function" => Kind::Function, + "switch_statement" => Kind::Block, + "switch_body" => Kind::Block, + "switch_case" => Kind::Block, + "switch_default" => Kind::Block, + "try_statement" => Kind::Block, + "catch_clause" => Kind::Block, + "finally_clause" => Kind::Block, + "class_declaration" => Kind::Block, + "class" => Kind::Block, + "class_body" => Kind::Block, + "abstract_class_declaration" => Kind::Block, + "export_statement" => Kind::Block, + "enum_declaration" => Kind::Trivia, // data-flow "call_expression" => Kind::CallFn, diff --git a/src/lib.rs b/src/lib.rs index a49d45ce..500ffa61 100644 --- a/src/lib.rs +++ 
b/src/lib.rs @@ -1,19 +1,62 @@ -// Re-exports for benchmarks and integration tests. -// The binary crate (main.rs) is the primary entry point; this lib target -// exposes internals for criterion and other tooling. +//! # Nyx Scanner +//! +//! A multi-language static vulnerability scanner. Nyx parses source files with +//! [tree-sitter](https://tree-sitter.github.io/), builds intra-procedural +//! control-flow graphs ([petgraph](https://docs.rs/petgraph)), and runs +//! cross-file taint analysis with a capability-based sanitizer system. +//! +//! ## Architecture +//! +//! Nyx uses a **two-pass architecture**: +//! +//! 1. **Pass 1 — Summary extraction**: Parse each file, build a CFG per function, +//! and export a [`summary::FuncSummary`] capturing source/sanitizer/sink capabilities, +//! taint propagation behavior, and callee lists. Summaries are persisted to SQLite. +//! +//! 2. **Pass 2 — Analysis**: Load all summaries into a [`summary::GlobalSummaries`] map, +//! re-parse files, and run taint analysis with cross-file callee resolution. CFG +//! structural analysis checks for auth gaps, unguarded sinks, and resource leaks. +//! +//! ## Four Detector Families +//! +//! - **Taint** ([`taint`]) — Monotone forward dataflow tracking source-to-sink flows +//! - **CFG Structural** ([`cfg_analysis`]) — Dominator-based guard and auth-gap detection +//! - **State Model** ([`state`]) — Resource lifecycle and authentication state lattices +//! - **AST Patterns** ([`patterns`]) — Tree-sitter structural queries per language +//! +//! ## Supported Languages +//! +//! Rust, C, C++, Java, Go, PHP, Python, Ruby, TypeScript, JavaScript. +//! +//! ## Entry Points +//! +//! - [`scan_no_index`] — Run a two-pass scan without indexing (for tests) +//! - [`commands::scan::scan_filesystem`] — Filesystem scan with optional indexing +//! - [`commands::scan::scan_with_index_parallel`] — Index-backed parallel scan +//! +//! ## Documentation +//! +//! 
See the [`docs/`](https://github.com/elicpeter/nyx/tree/master/docs) directory +//! for user and contributor documentation. pub mod ast; +pub mod callgraph; pub mod cfg; pub mod cfg_analysis; pub(crate) mod cli; pub mod commands; pub mod database; pub mod errors; +pub mod evidence; +pub mod fmt; pub mod interop; pub mod labels; pub mod output; pub mod patterns; +pub mod rank; +pub mod state; pub mod summary; +pub mod suppress; pub mod symbol; pub mod taint; pub mod utils; diff --git a/src/main.rs b/src/main.rs index afc5536a..24e7f610 100644 --- a/src/main.rs +++ b/src/main.rs @@ -1,15 +1,21 @@ mod ast; +mod callgraph; mod cfg; mod cfg_analysis; mod cli; mod commands; mod database; mod errors; +mod evidence; +mod fmt; mod interop; mod labels; mod output; mod patterns; +mod rank; +mod state; mod summary; +mod suppress; mod symbol; mod taint; mod utils; @@ -25,7 +31,7 @@ use std::fs; use std::time::Instant; use tracing_subscriber::fmt::time; use tracing_subscriber::prelude::*; -use tracing_subscriber::{EnvFilter, Registry, fmt}; +use tracing_subscriber::{EnvFilter, Registry, fmt as tracing_fmt}; // use tracing_appender::rolling::{RollingFileAppender, Rotation}; // use tracing_appender::non_blocking; @@ -33,7 +39,7 @@ fn init_tracing() { // let file_appender = RollingFileAppender::new(Rotation::HOURLY, "logs", "nyx-scanner.log"); // let (file_writer, guard) = non_blocking(file_appender); - let fmt_layer = fmt::layer() + let fmt_layer = tracing_fmt::layer() .pretty() .with_thread_ids(true) .with_timer(time::UtcTime::rfc_3339()); @@ -56,8 +62,8 @@ fn main() -> NyxResult<()> { tracing::debug!("CLI starting up"); let cli = Cli::parse(); - let proj_dirs = ProjectDirs::from("dev", "ecpeter23", "nyx") - .ok_or("Unable to determine project directories")?; + let proj_dirs = + ProjectDirs::from("", "", "nyx").ok_or("Unable to determine project directories")?; // todo: check if we want to actually build a config file, maybe some environments will not want to have anything 
written let config_dir = proj_dirs.config_dir(); @@ -83,7 +89,7 @@ fn main() -> NyxResult<()> { commands::handle_command(cli.command, database_dir, config_dir, &mut config)?; if !quiet { - println!( + eprintln!( "{} in {:.3}s.", style("Finished").green().bold(), now.elapsed().as_secs_f32() diff --git a/src/output.rs b/src/output.rs index c4b3300a..243aae22 100644 --- a/src/output.rs +++ b/src/output.rs @@ -38,6 +38,11 @@ fn cfg_rule_description(id: &str) -> Option<&'static str> { } "cfg-resource-leak" => Some("Resource acquired but not released on all exit paths"), "cfg-lock-not-released" => Some("Lock acquired but not released on all exit paths"), + "state-use-after-close" => Some("Variable used after its resource handle was closed"), + "state-double-close" => Some("Resource handle closed more than once"), + "state-resource-leak" => Some("Resource acquired but never closed"), + "state-resource-leak-possible" => Some("Resource may not be closed on all paths"), + "state-unauthed-access" => Some("Sensitive operation reached without authentication"), _ => None, } } @@ -116,11 +121,17 @@ pub fn build_sarif(diags: &[Diag], scan_root: &Path) -> Value { .map(|p| p.to_string_lossy().to_string()) .unwrap_or_else(|_| d.path.clone()); - json!({ + // Prefer the per-finding message (e.g. from state analysis) over the generic rule description. 
+ let msg_text = d + .message + .as_deref() + .unwrap_or_else(|| rule_description(base)); + + let mut result = json!({ "ruleId": base, "ruleIndex": rule_index, "level": severity_to_level(d.severity), - "message": { "text": rule_description(base) }, + "message": { "text": msg_text }, "locations": [{ "physicalLocation": { "artifactLocation": { "uri": uri }, @@ -130,7 +141,50 @@ pub fn build_sarif(diags: &[Diag], scan_root: &Path) -> Value { } } }] - }) + }); + + // Build properties object + let mut props = serde_json::Map::new(); + props.insert("category".into(), json!(d.category.to_string())); + if let Some(conf) = d.confidence { + props.insert("confidence".into(), json!(conf.to_string())); + } + + // Add rollup data if present + if let Some(ref rollup) = d.rollup { + props.insert( + "rollup".into(), + json!({ + "count": rollup.count, + }), + ); + + // Add rollup occurrences as relatedLocations + let related: Vec<Value> = rollup + .occurrences + .iter() + .enumerate() + .map(|(idx, loc)| { + json!({ + "id": idx, + "physicalLocation": { + "artifactLocation": { "uri": &uri }, + "region": { + "startLine": loc.line, + "startColumn": loc.col + } + } + }) + }) + .collect(); + if !related.is_empty() { + result["relatedLocations"] = json!(related); + } + } + + result["properties"] = Value::Object(props); + + result }) .collect(); diff --git a/src/patterns/c.rs b/src/patterns/c.rs index 4ee38477..c7d1f8b8 100644 --- a/src/patterns/c.rs +++ b/src/patterns/c.rs @@ -1,40 +1,95 @@ -use crate::patterns::{Pattern, Severity}; +use crate::evidence::Confidence; +use crate::patterns::{Pattern, PatternCategory, PatternTier, Severity}; +/// C AST patterns. +/// +/// Taint rules cover `system`/`popen`/`exec*` (command injection), +/// `sprintf`/`strcpy`/`strcat` (buffer overflow sinks), and `printf`/`fprintf` +/// (format-string sinks). 
AST patterns here focus on **banned-by-default +/// functions** (`gets`, `scanf %s`) and **format-string** variants not covered +/// by taint, since these are dangerous regardless of data origin. pub const PATTERNS: &[Pattern] = &[ + // ── Tier A: Banned functions (always dangerous) ──────────────────── Pattern { - id: "strcpy_call", - description: "strcpy() usage", - query: "(call_expression function: (identifier) @id (#eq? @id \"strcpy\")) @vuln", + id: "c.memory.gets", + description: "gets() — no bounds checking, always exploitable", + query: r#"(call_expression function: (identifier) @id (#eq? @id "gets")) @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, }, Pattern { - id: "strcat_call", - description: "strcat() usage", - query: "(call_expression function: (identifier) @id (#eq? @id \"strcat\")) @vuln", + id: "c.memory.strcpy", + description: "strcpy() — no bounds checking on destination buffer", + query: r#"(call_expression function: (identifier) @id (#eq? @id "strcpy")) @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, }, Pattern { - id: "sprintf_call", - description: "sprintf() (no length limit)", - query: "(call_expression function: (identifier) @id (#eq? @id \"sprintf\")) @vuln", + id: "c.memory.strcat", + description: "strcat() — no bounds checking on destination buffer", + query: r#"(call_expression function: (identifier) @id (#eq? @id "strcat")) @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, }, Pattern { - id: "gets_call", - description: "gets() usage", - query: "(call_expression function: (identifier) @id (#eq? @id \"gets\")) @vuln", + id: "c.memory.sprintf", + description: "sprintf() — no length limit on output buffer", + query: r#"(call_expression function: (identifier) @id (#eq? 
@id "sprintf")) @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, }, Pattern { - id: "scanf_with_percent_s", - description: "scanf(\"%s\") without length specifier", - query: "(call_expression function: (identifier) @id (#eq? @id \"scanf\") arguments: (argument_list (string_literal) @fmt (#match? @fmt \".*%s.*\"))) @vuln", + id: "c.memory.scanf_percent_s", + description: "scanf(\"%s\") — unbounded string read", + query: r#"(call_expression + function: (identifier) @id (#eq? @id "scanf") + arguments: (argument_list + (string_literal) @fmt (#match? @fmt "%s"))) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, + }, + // ── Tier A: Command execution ────────────────────────────────────── + Pattern { + id: "c.cmdi.system", + description: "system() — shell command execution", + query: r#"(call_expression function: (identifier) @id (#eq? @id "system")) @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CommandExec, + confidence: Confidence::High, }, Pattern { - id: "system_call", - description: "system() shell execution", - query: "(call_expression function: (identifier) @id (#eq? @id \"system\")) @vuln", + id: "c.cmdi.popen", + description: "popen() — shell command execution with pipe", + query: r#"(call_expression function: (identifier) @id (#eq? @id "popen")) @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::CommandExec, + confidence: Confidence::High, + }, + // ── Tier A: Format-string ────────────────────────────────────────── + Pattern { + id: "c.memory.printf_no_fmt", + description: "printf(var) — format-string vulnerability when first arg is not literal", + query: r#"(call_expression + function: (identifier) @id (#eq? @id "printf") + arguments: (argument_list + . 
(identifier) @arg)) + @vuln"#, + severity: Severity::High, + tier: PatternTier::B, + category: PatternCategory::MemorySafety, + confidence: Confidence::Medium, }, ]; diff --git a/src/patterns/cpp.rs b/src/patterns/cpp.rs index 85ed7f60..be53b01b 100644 --- a/src/patterns/cpp.rs +++ b/src/patterns/cpp.rs @@ -1,40 +1,106 @@ -use crate::patterns::{Pattern, Severity}; +use crate::evidence::Confidence; +use crate::patterns::{Pattern, PatternCategory, PatternTier, Severity}; +/// C++ AST patterns. +/// +/// Inherits C banned-function concerns plus C++-specific patterns like +/// `reinterpret_cast` and `const_cast`. Taint rules overlap with C rules +/// for `system`/`sprintf`/`strcpy`/`strcat`. pub const PATTERNS: &[Pattern] = &[ + // ── Tier A: Banned C functions (inherited) ───────────────────────── Pattern { - id: "strcpy_call", - description: "strcpy() usage", - query: "(call_expression function: (identifier) @id (#eq? @id \"strcpy\")) @vuln", + id: "cpp.memory.gets", + description: "gets() — no bounds checking, always exploitable", + query: r#"(call_expression function: (identifier) @id (#eq? @id "gets")) @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, }, Pattern { - id: "strcat_call", - description: "strcat() usage", - query: "(call_expression function: (identifier) @id (#eq? @id \"strcat\")) @vuln", + id: "cpp.memory.strcpy", + description: "strcpy() — no bounds checking on destination buffer", + query: r#"(call_expression function: (identifier) @id (#eq? @id "strcpy")) @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, }, Pattern { - id: "sprintf_call", - description: "sprintf() (no length limit)", - query: "(call_expression function: (identifier) @id (#eq? 
@id \"sprintf\")) @vuln", + id: "cpp.memory.strcat", + description: "strcat() — no bounds checking on destination buffer", + query: r#"(call_expression function: (identifier) @id (#eq? @id "strcat")) @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, }, Pattern { - id: "gets_call", - description: "gets() usage", - query: "(call_expression function: (identifier) @id (#eq? @id \"gets\")) @vuln", + id: "cpp.memory.sprintf", + description: "sprintf() — no length limit on output buffer", + query: r#"(call_expression function: (identifier) @id (#eq? @id "sprintf")) @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, + }, + // ── Tier A: Command execution ────────────────────────────────────── + Pattern { + id: "cpp.cmdi.system", + description: "system() — shell command execution", + query: r#"(call_expression function: (identifier) @id (#eq? @id "system")) @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CommandExec, + confidence: Confidence::High, }, Pattern { - id: "system_call", - description: "system() shell execution", - query: "(call_expression function: (identifier) @id (#eq? @id \"system\")) @vuln", + id: "cpp.cmdi.popen", + description: "popen() — shell command execution", + query: r#"(call_expression function: (identifier) @id (#eq? @id "popen")) @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CommandExec, + confidence: Confidence::High, + }, + // ── Tier A: Dangerous casts ──────────────────────────────────────── + // C++ casts are parsed as call_expression with template_function + Pattern { + id: "cpp.memory.reinterpret_cast", + description: "reinterpret_cast — type-punning cast", + query: r#"(call_expression + function: (template_function + name: (identifier) @n (#eq? 
@n "reinterpret_cast"))) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, }, Pattern { - id: "reinterpret_cast", - description: "reinterpret_cast usage", - query: "(reinterpret_cast_expression) @vuln", + id: "cpp.memory.const_cast", + description: "const_cast — removes const/volatile qualifier", + query: r#"(call_expression + function: (template_function + name: (identifier) @n (#eq? @n "const_cast"))) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, + }, + // ── Tier B: Format-string (variable first arg) ───────────────────── + Pattern { + id: "cpp.memory.printf_no_fmt", + description: "printf(var) — format-string vulnerability when first arg is not literal", + query: r#"(call_expression + function: (identifier) @id (#eq? @id "printf") + arguments: (argument_list + . (identifier) @arg)) + @vuln"#, + severity: Severity::High, + tier: PatternTier::B, + category: PatternCategory::MemorySafety, + confidence: Confidence::Medium, }, ]; diff --git a/src/patterns/go.rs b/src/patterns/go.rs index 2da7f831..b123dcad 100644 --- a/src/patterns/go.rs +++ b/src/patterns/go.rs @@ -1,34 +1,120 @@ -use crate::patterns::{Pattern, Severity}; +use crate::evidence::Confidence; +use crate::patterns::{Pattern, PatternCategory, PatternTier, Severity}; +/// Go AST patterns. +/// +/// Taint rules cover `exec.Command` (command injection), `db.Query`/`db.Exec` +/// (SQL sinks). AST patterns here focus on **TLS misconfiguration**, +/// **weak crypto**, **unsafe.Pointer**, and **hardcoded secrets**. pub const PATTERNS: &[Pattern] = &[ + // ── Tier A: Command execution ────────────────────────────────────── Pattern { - id: "exec_command", - description: "os/exec Command construction", - query: "(call_expression function: (selector_expression field: (field_identifier) @f (#eq? 
@f \"Command\"))) @vuln", - severity: Severity::Medium, - }, - Pattern { - id: "http_insecure_tls", - description: "&http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}", - query: "(composite_literal type: (selector_expression field: (field_identifier) @t (#eq? @t \"Transport\")) body: (literal_value (keyed_element key: (identifier) @k (#eq? @k \"TLSClientConfig\") value: (composite_literal body: (literal_value (keyed_element key: (identifier) @ik (#eq? @ik \"InsecureSkipVerify\") value: (true)))))) @vuln", + id: "go.cmdi.exec_command", + description: "exec.Command() — arbitrary process execution", + query: r#"(call_expression + function: (selector_expression + field: (field_identifier) @f (#eq? @f "Command"))) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CommandExec, + confidence: Confidence::High, }, + // ── Tier A: Unsafe pointer ───────────────────────────────────────── Pattern { - id: "unsafe_pointer", - description: "Use of unsafe.Pointer", - query: "(qualified_type type: (selector_expression field: (field_identifier) @f (#eq? @f \"Pointer\"))) @vuln", - severity: Severity::High, - }, - Pattern { - id: "md5_sha1", - description: "crypto/md5 or crypto/sha1 usage", - query: "(call_expression function: (selector_expression object: (identifier) @pkg (#match? @pkg \"md5|sha1\"))) @vuln", + id: "go.memory.unsafe_pointer", + description: "unsafe.Pointer — bypasses Go type system", + query: r#"(call_expression + function: (selector_expression + operand: (identifier) @pkg (#eq? @pkg "unsafe") + field: (field_identifier) @f (#eq? 
@f "Pointer"))) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, }, + // ── Tier A: TLS misconfiguration ─────────────────────────────────── Pattern { - id: "hardcoded_secret", - description: "Hard-coded string that looks like an API key/token", - query: "(interpreted_string_literal) @s (#match? @s \"(?i)(api|secret|token|password)[=:]?[ \\t]*[A-Za-z0-9_\\-]{8,}\")", + id: "go.transport.insecure_skip_verify", + description: "InsecureSkipVerify: true — disables TLS certificate validation", + query: r#"(keyed_element + (literal_element + (identifier) @k (#eq? @k "InsecureSkipVerify")) + (literal_element (true))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::InsecureTransport, + confidence: Confidence::High, + }, + // ── Tier A: Weak crypto ──────────────────────────────────────────── + Pattern { + id: "go.crypto.md5", + description: "md5.New() / md5.Sum() — weak hash algorithm", + query: r#"(call_expression + function: (selector_expression + operand: (identifier) @pkg (#eq? @pkg "md5"))) + @vuln"#, severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, + }, + Pattern { + id: "go.crypto.sha1", + description: "sha1.New() / sha1.Sum() — weak hash algorithm", + query: r#"(call_expression + function: (selector_expression + operand: (identifier) @pkg (#eq? @pkg "sha1"))) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, + }, + // ── Tier B: SQL injection (concatenation heuristic) ──────────────── + Pattern { + id: "go.sqli.query_concat", + description: "db.Query/Exec with concatenated string argument", + query: r#"(call_expression + function: (selector_expression + field: (field_identifier) @f (#match? 
@f "^(Query|Exec|QueryRow)$")) + arguments: (argument_list + (binary_expression) @concat)) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::B, + category: PatternCategory::SqlInjection, + confidence: Confidence::Medium, + }, + // ── Tier A: Hardcoded secrets ────────────────────────────────────── + Pattern { + id: "go.secrets.hardcoded_key", + description: "Variable with secret-like name assigned a string literal", + query: r#"(short_var_declaration + left: (expression_list + (identifier) @name (#match? @name "(?i)(password|secret|api_?key|token|private_?key)")) + right: (expression_list + (interpreted_string_literal) @val)) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Secrets, + confidence: Confidence::High, + }, + // ── Tier A: Deserialization ──────────────────────────────────────── + Pattern { + id: "go.deser.gob_decode", + description: "gob.NewDecoder — Go binary deserialization", + query: r#"(call_expression + function: (selector_expression + operand: (identifier) @pkg (#eq? @pkg "gob") + field: (field_identifier) @f (#eq? @f "NewDecoder"))) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Deserialization, + confidence: Confidence::High, }, ]; diff --git a/src/patterns/java.rs b/src/patterns/java.rs index d6fb3451..7eddf72a 100644 --- a/src/patterns/java.rs +++ b/src/patterns/java.rs @@ -1,40 +1,116 @@ -use crate::patterns::{Pattern, Severity}; +use crate::evidence::Confidence; +use crate::patterns::{Pattern, PatternCategory, PatternTier, Severity}; +/// Java AST patterns. +/// +/// Taint rules cover `Runtime.exec` (command injection) and +/// `executeQuery`/`executeUpdate`/`prepareStatement` (SQL sinks). +/// AST patterns here focus on **deserialization**, **reflection**, +/// **SQL with concatenation** (Tier B heuristic), and **weak crypto**. 
pub const PATTERNS: &[Pattern] = &[ + // ── Tier A: Deserialization ──────────────────────────────────────── Pattern { - id: "runtime_exec", - description: "Runtime.getRuntime().exec(...) – arbitrary-command execution", - query: "(method_invocation object: (method_invocation name: (identifier) @n (#eq? @n \"getRuntime\")) name: (identifier) @id (#eq? @id \"exec\")) @vuln", + id: "java.deser.readobject", + description: "ObjectInputStream.readObject() — unsafe deserialization", + // Match any .readObject() call — the method name is specific enough. + query: r#"(method_invocation + name: (identifier) @id (#eq? @id "readObject")) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::Deserialization, + confidence: Confidence::High, }, + // ── Tier A: Command execution ────────────────────────────────────── Pattern { - id: "class_for_name", - description: "Dynamic reflection via Class.forName(...)", - query: "(method_invocation object: (identifier) @c (#eq? @c \"Class\") name: (identifier) @id (#eq? @id \"forName\")) @vuln", - severity: Severity::Medium, - }, - Pattern { - id: "object_deserialization", - description: "java.io.ObjectInputStream#readObject() deserialization", - query: "(method_invocation object: (identifier) @o (#eq? @o \"ObjectInputStream\") name: (identifier) @id (#eq? @id \"readObject\")) @vuln", + id: "java.cmdi.runtime_exec", + description: "Runtime.getRuntime().exec() — shell command execution", + query: r#"(method_invocation + object: (method_invocation + name: (identifier) @n (#eq? @n "getRuntime")) + name: (identifier) @id (#eq? 
@id "exec")) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CommandExec, + confidence: Confidence::High, }, + // ── Tier A: Reflection ───────────────────────────────────────────── Pattern { - id: "insecure_random", - description: "java.util.Random used where SecureRandom is expected", - query: "(object_creation_expression type: (identifier) @t (#eq? @t \"Random\")) @vuln", + id: "java.reflection.class_forname", + description: "Class.forName() — dynamic class loading", + query: r#"(method_invocation + object: (identifier) @c (#eq? @c "Class") + name: (identifier) @id (#eq? @id "forName")) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Reflection, + confidence: Confidence::High, }, Pattern { - id: "thread_stop", - description: "Deprecated Thread.stop() invocation", - query: "(method_invocation name: (identifier) @id (#eq? @id \"stop\") object: (identifier) @obj (#eq? @obj \"Thread\")) @vuln", + id: "java.reflection.method_invoke", + description: "Method.invoke() — reflective method invocation", + query: r#"(method_invocation + name: (identifier) @id (#eq? @id "invoke")) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Reflection, + confidence: Confidence::High, + }, + // ── Tier B: SQL injection (concatenation heuristic) ──────────────── + Pattern { + id: "java.sqli.execute_concat", + description: "SQL execute with concatenated string argument", + query: r#"(method_invocation + name: (identifier) @id (#match? 
@id "^execute(Query|Update)?$") + arguments: (argument_list + (binary_expression) @concat)) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::B, + category: PatternCategory::SqlInjection, + confidence: Confidence::Medium, + }, + // ── Tier A: Weak crypto ──────────────────────────────────────────── + Pattern { + id: "java.crypto.insecure_random", + description: "new Random() — java.util.Random is not cryptographically secure", + query: r#"(object_creation_expression + type: (type_identifier) @t (#eq? @t "Random")) + @vuln"#, severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, }, Pattern { - id: "sql_concat", - description: "SQL built with string concatenation", - query: "(method_invocation name: (identifier) @id (#match? @id \"execute(Query|Update)?\") arguments: (argument_list (binary_expression) @concat)) @vuln", + id: "java.crypto.weak_digest", + description: "MessageDigest.getInstance(\"MD5\"/\"SHA1\") — weak hash algorithm", + query: r#"(method_invocation + object: (identifier) @c (#eq? @c "MessageDigest") + name: (identifier) @id (#eq? @id "getInstance") + arguments: (argument_list + (string_literal) @alg (#match? @alg "(?i)(md5|sha-?1)"))) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, + }, + // ── Tier A: XSS (servlet) ────────────────────────────────────────── + Pattern { + id: "java.xss.getwriter_print", + description: "response.getWriter().print/println — direct output without encoding", + query: r#"(method_invocation + object: (method_invocation + name: (identifier) @gw (#eq? @gw "getWriter")) + name: (identifier) @id (#match? 
@id "^(print|println|write)$")) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Xss, + confidence: Confidence::High, }, ]; diff --git a/src/patterns/javascript.rs b/src/patterns/javascript.rs index b4d6e816..0c124e21 100644 --- a/src/patterns/javascript.rs +++ b/src/patterns/javascript.rs @@ -1,117 +1,182 @@ -use crate::patterns::{Pattern, Severity}; +use crate::evidence::Confidence; +use crate::patterns::{Pattern, PatternCategory, PatternTier, Severity}; +/// JavaScript AST patterns. +/// +/// Taint rules cover `eval` (code injection), `innerHTML` (XSS), +/// `location.href` (open redirect), and `child_process.exec/spawn` (command +/// injection). AST patterns here add **new Function()**, **document.write**, +/// **setTimeout with string**, **deserialization**, **prototype pollution**, +/// **XSS sinks** not covered by taint, and **weak crypto**. pub const PATTERNS: &[Pattern] = &[ + // ── Tier A: Code execution ───────────────────────────────────────── Pattern { - id: "eval_call", - description: "Use of eval()", - query: "(call_expression function: (identifier) @id (#eq? @id \"eval\")) @vuln", + id: "js.code_exec.eval", + description: "eval() — dynamic code execution", + query: r#"(call_expression + function: (identifier) @id (#eq? @id "eval")) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, Pattern { - id: "new_function", - description: "new Function() constructor", - query: "(new_expression constructor: (identifier) @id (#eq? @id \"Function\")) @vuln", + id: "js.code_exec.new_function", + description: "new Function() constructor — eval equivalent", + query: r#"(new_expression + constructor: (identifier) @id (#eq? 
@id "Function")) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, Pattern { - id: "document_write", - description: "document.write() call", - query: "(call_expression function: (member_expression object: (identifier) @obj (#eq? @obj \"document\") property: (property_identifier) @prop (#eq? @prop \"write\"))) @vuln", + id: "js.code_exec.settimeout_string", + description: "setTimeout/setInterval with string argument — implicit eval", + query: r#"(call_expression + function: (identifier) @id (#match? @id "^(setTimeout|setInterval)$") + arguments: (arguments (string) @code)) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, + // ── Tier A: XSS sinks ────────────────────────────────────────────── Pattern { - id: "settimeout_string", - description: "setTimeout / setInterval with a string argument", - query: "(call_expression function: (identifier) @id (#match? @id \"setTimeout|setInterval\") arguments: (arguments (string) @code . _)) @vuln", + id: "js.xss.document_write", + description: "document.write() — XSS sink", + query: r#"(call_expression + function: (member_expression + object: (identifier) @obj (#eq? @obj "document") + property: (property_identifier) @prop (#match? @prop "^(write|writeln)$"))) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Xss, + confidence: Confidence::High, }, Pattern { - id: "json_parse", - description: "JSON.parse on dynamic string", - query: "(call_expression function: (member_expression object: (identifier) @obj (#eq? @obj \"JSON\") property: (property_identifier) @prop (#eq? @prop \"parse\"))) @vuln", + id: "js.xss.outer_html", + description: "Assignment to .outerHTML — XSS sink", + query: r#"(assignment_expression + left: (member_expression + property: (property_identifier) @prop (#eq? 
@prop "outerHTML"))) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Xss, + confidence: Confidence::High, + }, + Pattern { + id: "js.xss.insert_adjacent_html", + description: "insertAdjacentHTML() — XSS sink", + query: r#"(call_expression + function: (member_expression + property: (property_identifier) @prop (#eq? @prop "insertAdjacentHTML"))) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Xss, + confidence: Confidence::High, + }, + // ── Tier A: Prototype pollution ──────────────────────────────────── + Pattern { + id: "js.prototype.proto_assignment", + description: "Assignment to __proto__ — prototype pollution", + query: r#"(assignment_expression + left: (member_expression + property: (property_identifier) @prop (#eq? @prop "__proto__"))) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Prototype, + confidence: Confidence::High, + }, + Pattern { + id: "js.prototype.extend_object", + description: "Assignment to Object.prototype — prototype mutation", + query: r#"(assignment_expression + left: (member_expression + object: (member_expression + object: (identifier) @obj (#eq? @obj "Object") + property: (property_identifier) @mid (#eq? @mid "prototype")))) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Prototype, + confidence: Confidence::High, + }, + // ── Tier A: Weak crypto ──────────────────────────────────────────── + Pattern { + id: "js.crypto.weak_hash", + description: "crypto.createHash with weak algorithm (md5/sha1)", + query: r#"(call_expression + function: (member_expression + property: (property_identifier) @prop (#eq? @prop "createHash")) + arguments: (arguments + (string) @alg (#match? 
@alg "\"(md5|sha1)\""))) + @vuln"#, severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, }, Pattern { - id: "outer_html_assignment", - description: "Assignment to element.outerHTML", - query: "(assignment_expression - left: (member_expression - property: (property_identifier) @prop - (#eq? @prop \"outerHTML\"))) @vuln", + id: "js.crypto.math_random", + description: "Math.random() — not cryptographically secure", + query: r#"(call_expression + function: (member_expression + object: (identifier) @obj (#eq? @obj "Math") + property: (property_identifier) @prop (#eq? @prop "random"))) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, + }, + // ── Tier A: Open redirect ────────────────────────────────────────── + Pattern { + id: "js.xss.location_assign", + description: "Assignment to location/location.href — open redirect", + query: r#"(assignment_expression + left: (member_expression + object: (identifier) @obj (#match? @obj "^(window|location|document)$") + property: (property_identifier) @prop (#match? @prop "^(location|href)$"))) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Xss, + confidence: Confidence::High, }, + // ── Tier A: Insecure transport ───────────────────────────────────── Pattern { - id: "insert_adjacent_html", - description: "insertAdjacentHTML() call", - query: "(call_expression - function: (member_expression - property: (property_identifier) @prop - (#eq? @prop \"insertAdjacentHTML\"))) @vuln", - severity: Severity::Medium, + id: "js.transport.fetch_http", + description: "fetch() over plain HTTP", + query: r#"(call_expression + function: (identifier) @id (#eq? @id "fetch") + arguments: (arguments + (string) @url (#match? 
@url "^\"http://"))) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::InsecureTransport, + confidence: Confidence::Medium, }, + // ── Tier A: Cookie manipulation ──────────────────────────────────── Pattern { - id: "location_href_assignment", - description: "Assignment to window.location / location.href", - query: "(assignment_expression - left: (member_expression - object: (identifier) @obj - (#match? @obj \"^(window|location|document|self|top|parent|frames)$\") - property: (property_identifier) @prop - (#match? @prop \"^(location|href)$\"))) @vuln", - severity: Severity::High, - }, - Pattern { - id: "cookie_assignment", + id: "js.xss.cookie_write", description: "Write to document.cookie", - query: "(assignment_expression - left: (member_expression - object: (identifier) @obj - (#eq? @obj \"document\") - property: (property_identifier) @prop - (#eq? @prop \"cookie\"))) @vuln", - severity: Severity::Medium, - }, - Pattern { - id: "proto_pollution", - description: "Assignment to __proto__ (prototype pollution)", - query: "(assignment_expression - left: (member_expression - property: (property_identifier) @prop - (#eq? @prop \"__proto__\"))) @vuln", - severity: Severity::Low, - }, - Pattern { - id: "weak_hash_md5", - description: "crypto.createHash(\"md5\")", - query: "(call_expression - function: (member_expression - object: (identifier) @obj - (#eq? @obj \"crypto\") - property: (property_identifier) @prop - (#eq? @prop \"createHash\")) - arguments: (arguments - (string) @alg - (#eq? @alg \"md5\"))) @vuln", - severity: Severity::Low, - }, - Pattern { - id: "regexp_constructor_string", - description: "new RegExp() with a dynamic string", - query: "(new_expression - constructor: (identifier) @id - (#eq? 
@id \"RegExp\") - arguments: (arguments (string) @pattern)) @vuln", - severity: Severity::Low, - }, - Pattern { - id: "dangerous_extend_builtin", - description: "Extending Object.prototype (may lead to collisions/pollution)", - query: "(assignment_expression - left: (member_expression - object: (identifier) @obj - (#eq? @obj \"Object\") - property: (property_identifier) @prop - (#eq? @prop \"prototype\"))) @vuln", + query: r#"(assignment_expression + left: (member_expression + object: (identifier) @obj (#eq? @obj "document") + property: (property_identifier) @prop (#eq? @prop "cookie"))) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Xss, + confidence: Confidence::High, }, ]; diff --git a/src/patterns/mod.rs b/src/patterns/mod.rs index ae7c09b8..cc129eb3 100644 --- a/src/patterns/mod.rs +++ b/src/patterns/mod.rs @@ -1,3 +1,43 @@ +//! # AST Pattern Conventions +//! +//! Each language file exports a `PATTERNS` slice of [`Pattern`] structs. +//! +//! ## ID format +//! +//! `..` — e.g. `java.deser.readobject`, `py.cmdi.os_system`. +//! +//! Language prefixes: `rs`, `java`, `py`, `js`, `ts`, `c`, `cpp`, `go`, `php`, `rb`. +//! +//! ## Tiers +//! +//! * **Tier A** — structural presence is high-signal (e.g. `gets()`, `eval()`). +//! * **Tier B** — requires a heuristic guard in the query (e.g. SQL with concatenated +//! arg, format-string with variable first arg). +//! +//! ## Severity +//! +//! * **High** — command exec, deserialization, banned C functions. +//! * **Medium** — SQL concat, reflection, XSS sinks, casts. +//! * **Low** — weak crypto, insecure randomness, code-quality (`unwrap`/`expect`/`panic`). +//! +//! Note: the default `min_severity` filter skips Low patterns; they only appear when +//! the user explicitly lowers the threshold. +//! +//! ## No-duplicate rule +//! +//! If a vulnerability class is already detected by taint analysis (e.g. `eval` as a +//! 
sink, `system` as a sink), the AST pattern is still kept for `--ast-only` mode but +//! uses a distinct ID namespace (`js.code_exec.eval` vs `taint-unsanitised-flow`). +//! The dedup pass in `ast.rs` prevents exact-duplicate findings at the same location. +//! +//! ## Adding a new pattern +//! +//! 1. Pick the language file under `src/patterns/.rs`. +//! 2. Choose tier, category, severity per the rules above. +//! 3. Write the tree-sitter query — test with `cargo test --test pattern_tests`. +//! 4. Add a snippet to `tests/fixtures/patterns//positive.`. +//! 5. Add the ID to the positive test assertion in `tests/pattern_tests.rs`. + pub mod c; pub mod cpp; mod go; @@ -9,6 +49,7 @@ mod ruby; pub mod rust; pub mod typescript; +use crate::evidence::Confidence; use console::style; use once_cell::sync::Lazy; use serde::{Deserialize, Serialize}; @@ -16,7 +57,7 @@ use std::collections::HashMap; use std::fmt; use std::str::FromStr; -#[derive(Debug, Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Serialize, Deserialize)] +#[derive(Debug, Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash, Serialize, Deserialize)] pub enum Severity { High, Medium, @@ -28,13 +69,14 @@ impl Severity { /// /// Returns e.g. `"[HIGH] "` or `"[MEDIUM]"` — always 8 visible characters /// so the column after the tag lines up regardless of severity. + #[allow(dead_code)] // public API for lib consumers pub fn colored_tag(self) -> String { // Visible widths: "[HIGH]" = 6, "[MEDIUM]" = 8, "[LOW]" = 5. // Pad the *whole* tag to 8 visible chars (the longest, "[MEDIUM]"). 
let (label, styled_fn): (&str, fn(&str) -> String) = match self { Severity::High => ("HIGH", |s| style(s).red().bold().to_string()), - Severity::Medium => ("MEDIUM", |s| style(s).yellow().bold().to_string()), - Severity::Low => ("LOW", |s| style(s).cyan().bold().to_string()), + Severity::Medium => ("MEDIUM", |s| style(s).color256(208).bold().to_string()), + Severity::Low => ("LOW", |s| style(s).color256(67).to_string()), }; let bracket_len = label.len() + 2; // "[" + label + "]" let pad = 8usize.saturating_sub(bracket_len); @@ -46,8 +88,8 @@ impl fmt::Display for Severity { fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { let styled = match *self { Severity::High => style("HIGH").red().bold().to_string(), - Severity::Medium => style("MEDIUM").yellow().bold().to_string(), - Severity::Low => style("LOW").cyan().bold().to_string(), + Severity::Medium => style("MEDIUM").color256(208).bold().to_string(), + Severity::Low => style("LOW").color256(67).to_string(), }; f.write_str(&styled) } @@ -65,14 +107,132 @@ impl Severity { } impl FromStr for Severity { - // TODO: FIX - type Err = (); + type Err = String; fn from_str(input: &str) -> Result<Self, Self::Err> { - match input.to_lowercase().as_str() { - "medium" => Ok(Severity::Medium), - "high" => Ok(Severity::High), - _ => Ok(Severity::Low), + match input.trim().to_ascii_uppercase().as_str() { + "HIGH" => Ok(Severity::High), + "MEDIUM" | "MED" => Ok(Severity::Medium), + "LOW" => Ok(Severity::Low), + other => Err(format!("unknown severity: '{other}'")), + } + } +} + +/// A parsed severity filter expression. +/// +/// Supports three forms: +/// - Single level: `"HIGH"` — matches only that level +/// - Comma list: `"HIGH,MEDIUM"` — matches any listed level +/// - Threshold: `">=MEDIUM"` — matches that level and above +/// +/// Parsing is case-insensitive and tolerates whitespace around tokens. +#[derive(Debug, Clone, PartialEq, Eq)] +pub enum SeverityFilter { + /// Match findings at or above this level (High >= Medium >= Low).
+ AtLeast(Severity), + /// Match findings whose severity is in this exact set. + AnyOf(Vec<Severity>), +} + +impl SeverityFilter { + /// Parse a severity filter expression. + /// + /// Examples: `"HIGH"`, `"high,medium"`, `">=MEDIUM"`, `">= low"`. + pub fn parse(expr: &str) -> Result<Self, String> { + let trimmed = expr.trim(); + if trimmed.is_empty() { + return Err("empty severity expression".into()); + } + + // Threshold form: >=LEVEL + if let Some(rest) = trimmed.strip_prefix(">=") { + let level: Severity = rest.parse()?; + return Ok(SeverityFilter::AtLeast(level)); + } + + // Comma-separated list (also handles single value) + let levels: Result<Vec<Severity>, String> = trimmed + .split(',') + .map(|tok| tok.trim().parse::<Severity>()) + .collect(); + let levels = levels?; + if levels.is_empty() { + return Err("empty severity expression".into()); + } + // Optimise single-value list + if levels.len() == 1 { + return Ok(SeverityFilter::AnyOf(levels)); + } + Ok(SeverityFilter::AnyOf(levels)) + } + + /// Returns `true` if the given severity passes this filter. + pub fn matches(&self, sev: Severity) -> bool { + match self { + SeverityFilter::AtLeast(threshold) => { + // Severity ordering: High < Medium < Low (derived Ord). + // "at least Medium" means sev <= Medium in Ord terms. + sev <= *threshold + } + SeverityFilter::AnyOf(set) => set.contains(&sev), + } + } +} + +/// Pattern confidence tier. +/// +/// * **A** – Structural presence alone is high-signal (e.g. `gets()`, `eval()`). +/// * **B** – Requires a simple heuristic guard in the query (e.g. SQL with +/// concatenated arg, file-open with non-literal path). +#[derive(Debug, Copy, Clone, Eq, PartialEq, Serialize, Deserialize)] +pub enum PatternTier { + A, + B, +} + +/// High-level finding category for noise reduction and prioritization.
+#[derive(Debug, Copy, Clone, Eq, PartialEq, Hash, Serialize, Deserialize)] +pub enum FindingCategory { + Security, + Reliability, + Quality, +} + +impl std::fmt::Display for FindingCategory { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + match self { + FindingCategory::Security => write!(f, "Security"), + FindingCategory::Reliability => write!(f, "Reliability"), + FindingCategory::Quality => write!(f, "Quality"), + } + } +} + +/// Vulnerability class that a pattern detects. +#[derive(Debug, Copy, Clone, Eq, PartialEq, Serialize, Deserialize)] +pub enum PatternCategory { + CommandExec, + CodeExec, + Deserialization, + SqlInjection, + PathTraversal, + Xss, + Crypto, + Secrets, + InsecureTransport, + Reflection, + MemorySafety, + Prototype, + CodeQuality, +} + +impl PatternCategory { + /// Map this vulnerability class to a high-level finding category. + pub fn finding_category(self) -> FindingCategory { + match self { + PatternCategory::CodeQuality => FindingCategory::Quality, + _ => FindingCategory::Security, } } } @@ -80,7 +240,7 @@ impl FromStr for Severity { /// One AST pattern with a tree-sitter query and meta-data. #[derive(Debug, Clone, Serialize, PartialEq)] pub struct Pattern { - /// Unique identifier (snake-case preferred). + /// Unique identifier — `..` preferred. pub id: &'static str, /// Human-readable explanation. pub description: &'static str, @@ -88,6 +248,12 @@ pub struct Pattern { pub query: &'static str, /// Rough severity bucket. pub severity: Severity, + /// Confidence tier (A = structural, B = heuristic-guarded). + pub tier: PatternTier, + /// Vulnerability class. + pub category: PatternCategory, + /// Confidence level for findings produced by this pattern. 
+ pub confidence: Confidence, } /// Global, lazily-initialised registry: lang-name → pattern slice @@ -164,3 +330,66 @@ fn load_returns_correct_pattern_slices() { assert!(load("brainfuck").is_empty()); } + +#[test] +fn severity_from_str_rejects_unknown() { + assert!("garbage".parse::<Severity>().is_err()); +} + +#[test] +fn severity_filter_single() { + let f = SeverityFilter::parse("HIGH").unwrap(); + assert!(f.matches(Severity::High)); + assert!(!f.matches(Severity::Medium)); + assert!(!f.matches(Severity::Low)); +} + +#[test] +fn severity_filter_comma_list() { + let f = SeverityFilter::parse("HIGH,MEDIUM").unwrap(); + assert!(f.matches(Severity::High)); + assert!(f.matches(Severity::Medium)); + assert!(!f.matches(Severity::Low)); +} + +#[test] +fn severity_filter_threshold() { + let f = SeverityFilter::parse(">=MEDIUM").unwrap(); + assert!(f.matches(Severity::High)); + assert!(f.matches(Severity::Medium)); + assert!(!f.matches(Severity::Low)); + + let f2 = SeverityFilter::parse(">=LOW").unwrap(); + assert!(f2.matches(Severity::High)); + assert!(f2.matches(Severity::Medium)); + assert!(f2.matches(Severity::Low)); + + let f3 = SeverityFilter::parse(">=HIGH").unwrap(); + assert!(f3.matches(Severity::High)); + assert!(!f3.matches(Severity::Medium)); +} + +#[test] +fn severity_filter_case_insensitive_and_whitespace() { + let f = SeverityFilter::parse(" high , medium ").unwrap(); + assert!(f.matches(Severity::High)); + assert!(f.matches(Severity::Medium)); + assert!(!f.matches(Severity::Low)); + + let f2 = SeverityFilter::parse(">= medium").unwrap(); + assert!(f2.matches(Severity::High)); + assert!(f2.matches(Severity::Medium)); +} + +#[test] +fn severity_filter_rejects_empty() { + assert!(SeverityFilter::parse("").is_err()); + assert!(SeverityFilter::parse(" ").is_err()); +} + +#[test] +fn severity_filter_rejects_invalid_level() { + assert!(SeverityFilter::parse("CRITICAL").is_err()); + assert!(SeverityFilter::parse("HIGH,CRITICAL").is_err()); +
assert!(SeverityFilter::parse(">=BOGUS").is_err()); +} diff --git a/src/patterns/php.rs b/src/patterns/php.rs index 3cbe16af..afaa2c8e 100644 --- a/src/patterns/php.rs +++ b/src/patterns/php.rs @@ -1,40 +1,144 @@ -use crate::patterns::{Pattern, Severity}; +use crate::evidence::Confidence; +use crate::patterns::{Pattern, PatternCategory, PatternTier, Severity}; +/// PHP AST patterns. +/// +/// Taint rules cover `system`/`exec`/`passthru`/`shell_exec` (command +/// injection), `echo`/`print` (XSS sinks), and `mysqli_query`/`pg_query` +/// (SQL sinks). AST patterns here focus on **eval**, **deserialization**, +/// **deprecated dangerous functions**, **include with variable**, and +/// **SQL concatenation** (Tier B). pub const PATTERNS: &[Pattern] = &[ + // ── Tier A: Code execution ───────────────────────────────────────── Pattern { - id: "eval_call", - description: "eval($code) execution", - query: "(function_call_expression function: (name) @n (#eq? @n \"eval\")) @vuln", + id: "php.code_exec.eval", + description: "eval() — dynamic code execution", + query: r#"(function_call_expression + function: (name) @n (#eq? @n "eval")) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, Pattern { - id: "preg_replace_e", - description: "preg_replace with deprecated /e modifier", - query: "(function_call_expression function: (name) @n (#eq? @n \"preg_replace\") arguments: (arguments (string) @pat (#match? @pat \"/.*e.*$/\"))) @vuln", + id: "php.code_exec.create_function", + description: "create_function() — deprecated eval-like constructor", + query: r#"(function_call_expression + function: (name) @n (#eq? @n "create_function")) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, Pattern { - id: "create_function", - description: "create_function(...) 
anonymous eval-like", - query: "(function_call_expression function: (name) @n (#eq? @n \"create_function\")) @vuln", - severity: Severity::Medium, - }, - Pattern { - id: "unserialize_call", - description: "unserialize(...) on user input", - query: "(function_call_expression function: (name) @n (#eq? @n \"unserialize\")) @vuln", + id: "php.code_exec.preg_replace_e", + description: "preg_replace with /e modifier — code execution via regex", + query: r#"(function_call_expression + function: (name) @n (#eq? @n "preg_replace") + arguments: (arguments + (argument + (string) @pat (#match? @pat "/[^/]*/[a-zA-Z]*e")))) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, Pattern { - id: "mysql_query_concat", - description: "mysql_query with concatenated SQL", - query: "(function_call_expression function: (name) @n (#eq? @n \"mysql_query\") arguments: (arguments (binary_expression) @concat)) @vuln", + id: "php.code_exec.assert_string", + description: "assert() with string argument — evaluates PHP code", + query: r#"(function_call_expression + function: (name) @n (#eq? @n "assert") + arguments: (arguments + (argument (string) @code))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, + }, + // ── Tier A: Command execution ────────────────────────────────────── + Pattern { + id: "php.cmdi.system", + description: "system/shell_exec/exec/passthru — shell command execution", + query: r#"(function_call_expression + function: (name) @n (#match? 
@n "^(system|shell_exec|exec|passthru|proc_open|popen)$")) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CommandExec, + confidence: Confidence::High, + }, + // ── Tier A: Deserialization ──────────────────────────────────────── + Pattern { + id: "php.deser.unserialize", + description: "unserialize() — PHP object injection", + query: r#"(function_call_expression + function: (name) @n (#eq? @n "unserialize")) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::Deserialization, + confidence: Confidence::High, + }, + // ── Tier B: SQL injection (concatenation heuristic) ──────────────── + Pattern { + id: "php.sqli.query_concat", + description: "mysql_query/mysqli_query with concatenated SQL string", + query: r#"(function_call_expression + function: (name) @n (#match? @n "^(mysql_query|mysqli_query)$") + arguments: (arguments + (argument (binary_expression) @concat))) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::B, + category: PatternCategory::SqlInjection, + confidence: Confidence::Medium, + }, + // ── Tier B: Path traversal (include with variable) ───────────────── + Pattern { + id: "php.path.include_variable", + description: "include/require with variable path — file inclusion vulnerability", + query: r#"(include_expression (variable_name)) @vuln"#, + severity: Severity::High, + tier: PatternTier::B, + category: PatternCategory::PathTraversal, + confidence: Confidence::Medium, + }, + // ── Tier A: Crypto ───────────────────────────────────────────────── + Pattern { + id: "php.crypto.md5", + description: "md5() — weak hash function", + query: r#"(function_call_expression + function: (name) @n (#eq? 
@n "md5")) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, }, Pattern { - id: "system_call", - description: "system()/shell_exec()/exec() command execution", - query: "(function_call_expression function: (name) @n (#match? @n \"system|shell_exec|exec|passthru\")) @vuln", - severity: Severity::Medium, + id: "php.crypto.sha1", + description: "sha1() — weak hash function", + query: r#"(function_call_expression + function: (name) @n (#eq? @n "sha1")) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, + }, + Pattern { + id: "php.crypto.rand", + description: "rand()/mt_rand() — not cryptographically secure", + query: r#"(function_call_expression + function: (name) @n (#match? @n "^(rand|mt_rand)$")) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, }, ]; diff --git a/src/patterns/python.rs b/src/patterns/python.rs index 884af560..49a4f174 100644 --- a/src/patterns/python.rs +++ b/src/patterns/python.rs @@ -1,22 +1,178 @@ -use crate::patterns::{Pattern, Severity}; +use crate::evidence::Confidence; +use crate::patterns::{Pattern, PatternCategory, PatternTier, Severity}; +/// Python AST patterns. +/// +/// Taint rules cover `eval`/`exec`, `os.system`/`os.popen`/`subprocess.*`, +/// and `cursor.execute`. AST patterns here add coverage for **deserialization**, +/// **subprocess shell=True** (Tier B — taint doesn't check keyword args), and +/// **code execution** sinks that taint cannot structurally verify. pub const PATTERNS: &[Pattern] = &[ + // ── Tier A: Code execution ───────────────────────────────────────── Pattern { - id: "eval_call", - description: "eval() on dynamic input", - query: "(call function: (identifier) @id (#eq? 
@id \"eval\")) @vuln", + id: "py.code_exec.eval", + description: "eval() — dynamic code execution", + query: r#"(call function: (identifier) @id (#eq? @id "eval")) @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, Pattern { - id: "exec_call", - description: "exec(...) execution of dynamic code", - query: "(call function: (identifier) @id (#eq? @id \"exec\")) @vuln", + id: "py.code_exec.exec", + description: "exec() — dynamic code execution", + query: r#"(call function: (identifier) @id (#eq? @id "exec")) @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, Pattern { - id: "subprocess_shell_true", - description: "subprocess.* with shell=True", - query: "(call function: (attribute object: (identifier) @pkg (#eq? @pkg \"subprocess\")) arguments: (argument_list . (keyword_argument name: (identifier) @k (#eq? @k \"shell\")) (true) @val)) @vuln", + id: "py.code_exec.compile", + description: "compile() with exec/eval mode — code compilation from string", + query: r#"(call function: (identifier) @id (#eq? @id "compile")) @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, + }, + // ── Tier A: Command execution ────────────────────────────────────── + Pattern { + id: "py.cmdi.os_system", + description: "os.system() — shell command execution", + query: r#"(call + function: (attribute + object: (identifier) @pkg (#eq? @pkg "os") + attribute: (identifier) @fn (#eq? @fn "system"))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CommandExec, + confidence: Confidence::High, + }, + Pattern { + id: "py.cmdi.os_popen", + description: "os.popen() — shell command execution", + query: r#"(call + function: (attribute + object: (identifier) @pkg (#eq? @pkg "os") + attribute: (identifier) @fn (#eq? 
@fn "popen"))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CommandExec, + confidence: Confidence::High, + }, + // ── Tier B: subprocess with shell=True ───────────────────────────── + Pattern { + id: "py.cmdi.subprocess_shell", + description: "subprocess call with shell=True", + query: r#"(call + function: (attribute + object: (identifier) @pkg (#eq? @pkg "subprocess")) + arguments: (argument_list + (keyword_argument + name: (identifier) @k (#eq? @k "shell") + value: (true)))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::B, + category: PatternCategory::CommandExec, + confidence: Confidence::Medium, + }, + // ── Tier A: Deserialization ──────────────────────────────────────── + Pattern { + id: "py.deser.pickle_loads", + description: "pickle.loads/load — arbitrary object deserialization", + query: r#"(call + function: (attribute + object: (identifier) @pkg (#eq? @pkg "pickle") + attribute: (identifier) @fn (#match? @fn "^loads?$"))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::Deserialization, + confidence: Confidence::High, + }, + Pattern { + id: "py.deser.yaml_load", + description: "yaml.load() without SafeLoader — arbitrary object instantiation", + query: r#"(call + function: (attribute + object: (identifier) @pkg (#eq? @pkg "yaml") + attribute: (identifier) @fn (#eq? @fn "load"))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::Deserialization, + confidence: Confidence::High, + }, + Pattern { + id: "py.deser.shelve_open", + description: "shelve.open() — pickle-backed deserialization", + query: r#"(call + function: (attribute + object: (identifier) @pkg (#eq? @pkg "shelve") + attribute: (identifier) @fn (#eq? 
@fn "open"))) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Deserialization, + confidence: Confidence::High, + }, + // ── Tier B: SQL injection (format/concat heuristic) ──────────────── + Pattern { + id: "py.sqli.execute_format", + description: "cursor.execute with string concatenation — SQL injection risk", + query: r#"(call + function: (attribute + attribute: (identifier) @fn (#eq? @fn "execute")) + arguments: (argument_list + (binary_operator) @arg)) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::B, + category: PatternCategory::SqlInjection, + confidence: Confidence::Medium, + }, + // ── Tier A: Weak crypto ──────────────────────────────────────────── + Pattern { + id: "py.crypto.md5", + description: "hashlib.md5() — weak hash algorithm", + query: r#"(call + function: (attribute + object: (identifier) @pkg (#eq? @pkg "hashlib") + attribute: (identifier) @fn (#eq? @fn "md5"))) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, + }, + Pattern { + id: "py.crypto.sha1", + description: "hashlib.sha1() — weak hash algorithm", + query: r#"(call + function: (attribute + object: (identifier) @pkg (#eq? @pkg "hashlib") + attribute: (identifier) @fn (#eq? @fn "sha1"))) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, + }, + // ── Tier A: Template injection ───────────────────────────────────── + Pattern { + id: "py.xss.jinja_from_string", + description: "jinja2.Template from string — potential template injection", + query: r#"(call + function: (attribute + attribute: (identifier) @fn (#eq? 
@fn "from_string"))) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Xss, + confidence: Confidence::High, }, ]; diff --git a/src/patterns/ruby.rs b/src/patterns/ruby.rs index 47e80a9f..5381de31 100644 --- a/src/patterns/ruby.rs +++ b/src/patterns/ruby.rs @@ -1,133 +1,141 @@ -use crate::patterns::{Pattern, Severity}; +use crate::evidence::Confidence; +use crate::patterns::{Pattern, PatternCategory, PatternTier, Severity}; + +/// Ruby AST patterns. +/// +/// Taint rules cover `system`/`exec` (command injection), `eval` (code +/// execution), and `puts`/`print` (output sinks). AST patterns here focus on +/// **deserialization** (YAML.load, Marshal.load), **instance_eval/class_eval**, +/// **backtick shell**, **send with dynamic arg**, and **constantize**. pub const PATTERNS: &[Pattern] = &[ - // ---------- Runtime code-execution primitives ---------- + // ── Tier A: Code execution ───────────────────────────────────────── Pattern { - id: "eval_call", - description: "Kernel#eval usage", - query: r#" - (call - (identifier) @id - (#eq? @id "eval") - ) @vuln - "#, + id: "rb.code_exec.eval", + description: "Kernel#eval — dynamic code execution", + query: r#"(call (identifier) @id (#eq? @id "eval")) @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, Pattern { - id: "instance_eval_call", - description: "Object#instance_eval usage", - query: r#" - (call - (identifier) @id - (#eq? @id "instance_eval") - ) @vuln - "#, + id: "rb.code_exec.instance_eval", + description: "instance_eval — evaluates string in object context", + query: r#"(call + method: (identifier) @id (#eq? 
@id "instance_eval")) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, Pattern { - id: "class_eval_call", - description: "Module#class_eval / module_eval usage", - query: r#" - (call - (identifier) @id - (#match? @id "^(class_eval|module_eval)$") - ) @vuln - "#, + id: "rb.code_exec.class_eval", + description: "class_eval / module_eval — evaluates string in class context", + query: r#"(call + method: (identifier) @id (#match? @id "^(class_eval|module_eval)$")) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, - // ---------- Shell execution ---------- + // ── Tier A: Command execution ────────────────────────────────────── Pattern { - id: "system_exec_interp", - description: "system/exec with string interpolation", - query: r#" - (call - method: (identifier) @m - (#match? @m "^(system|exec)$") - arguments: (argument_list - (string - (interpolation)+ @vuln - ) - ) - ) - "#, + id: "rb.cmdi.backtick", + description: "Backtick shell execution", + query: r#"(subshell) @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CommandExec, + confidence: Confidence::High, + }, + // ── Tier A: Shell execution ───────────────────────────────────────── + Pattern { + id: "rb.cmdi.system_interp", + description: "system/exec call — command execution risk", + query: r#"(call + method: (identifier) @m (#match? @m "^(system|exec)$")) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CommandExec, + confidence: Confidence::High, + }, + // ── Tier A: Deserialization ──────────────────────────────────────── + Pattern { + id: "rb.deser.yaml_load", + description: "YAML.load — arbitrary object deserialization (use safe_load instead)", + query: r#"(call + receiver: (constant) @recv (#match? @recv "^(YAML|Psych)$") + method: (identifier) @m (#eq? 
@m "load")) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::Deserialization, + confidence: Confidence::High, }, Pattern { - id: "backtick_command", - description: "Back-tick shell execution", - // `uname -a` - query: r#"(shell_command) @vuln"#, + id: "rb.deser.marshal_load", + description: "Marshal.load — arbitrary Ruby object deserialization", + query: r#"(call + receiver: (constant) @recv (#eq? @recv "Marshal") + method: (identifier) @m (#eq? @m "load")) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::Deserialization, + confidence: Confidence::High, }, - // ---------- Dangerous deserialisation ---------- + // ── Tier A: Reflection ───────────────────────────────────────────── Pattern { - id: "yaml_load", - description: "YAML.load / Psych.load (arbitrary object deserialisation)", - query: r#" - (call - receiver: (constant) @recv - (#match? @recv "^(YAML|Psych)$") - method: (identifier) @m - (#eq? @m "load") - ) @vuln - "#, - severity: Severity::High, - }, - Pattern { - id: "marshal_load", - description: "Marshal.load usage", - query: r#" - (call - receiver: (constant) @recv - (#eq? @recv "Marshal") - method: (identifier) @m - (#eq? @m "load") - ) @vuln - "#, - severity: Severity::High, - }, - // ---------- Reflection / meta-programming ---------- - Pattern { - id: "send_dynamic", - description: "send() with dynamic first argument (not a literal symbol)", - query: r#" - (call - method: (identifier) @m - (#eq? @m "send") - arguments: (argument_list - [ - (identifier) ; send(method_name_var, …) - (string (interpolation)+) ; send("user_#{role}", …) - ] @vuln - ) - ) + id: "rb.reflection.send_dynamic", + description: "send() with non-symbol argument — arbitrary method dispatch", + query: r#"(call + method: (identifier) @m (#eq? 
@m "send") + arguments: (argument_list + [(identifier) (string (interpolation)+)] @vuln)) "#, severity: Severity::Medium, + tier: PatternTier::B, + category: PatternCategory::Reflection, + confidence: Confidence::Medium, }, Pattern { - id: "constantize_call", - description: "ActiveSupport constantize / safe_constantize on tainted data", - query: r#" - (call - method: (identifier) @m - (#match? @m "^(constantize|safe_constantize)$") - ) @vuln - "#, + id: "rb.reflection.constantize", + description: "constantize / safe_constantize — dynamic class resolution", + query: r#"(call + method: (identifier) @m (#match? @m "^(constantize|safe_constantize)$")) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Reflection, + confidence: Confidence::High, }, - // ---------- Insecure resource access ---------- + // ── Tier A: SSRF ─────────────────────────────────────────────────── Pattern { - id: "open_uri_http", - description: "Kernel#open with HTTP(S) URL (open-uri auto-follow)", - query: r#" - (call - method: (identifier) @m - (#eq? @m "open") - arguments: (argument_list - (string) @url - (#match? @url "^\"https?://") - ) - ) @vuln - "#, + id: "rb.ssrf.open_uri", + description: "Kernel#open with HTTP URL — SSRF via open-uri", + query: r#"(call + method: (identifier) @m (#eq? @m "open") + arguments: (argument_list + (string) @url (#match? @url "^\"https?://"))) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::InsecureTransport, + confidence: Confidence::High, + }, + // ── Tier A: Crypto ───────────────────────────────────────────────── + Pattern { + id: "rb.crypto.md5", + description: "Digest::MD5 — weak hash algorithm", + query: r#"(scope_resolution + name: (constant) @c (#eq? 
@c "MD5")) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, }, ]; diff --git a/src/patterns/rust.rs b/src/patterns/rust.rs index 3ef4a3db..c3a06ae3 100644 --- a/src/patterns/rust.rs +++ b/src/patterns/rust.rs @@ -1,118 +1,170 @@ -use crate::patterns::{Pattern, Severity}; +use crate::evidence::Confidence; +use crate::patterns::{Pattern, PatternCategory, PatternTier, Severity}; +/// Rust AST patterns. +/// +/// Rust taint rules already cover `Command::new`/`arg`/`status`/`output` sinks +/// and `env::var` / `fs::read_to_string` sources, so we do NOT duplicate those. +/// Patterns here focus on **unsafe memory**, **panicking APIs**, and structural +/// code-quality signals specific to Rust. pub const PATTERNS: &[Pattern] = &[ + // ── Tier A: Memory Safety (unsafe) ───────────────────────────────── Pattern { - id: "unsafe_block", - description: "Use of an `unsafe` block", + id: "rs.memory.transmute", + description: "std::mem::transmute — unchecked type reinterpretation", + query: r#"(call_expression + function: (scoped_identifier + path: (identifier) @p (#eq? @p "mem") + name: (identifier) @f (#eq? @f "transmute"))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, + }, + Pattern { + id: "rs.memory.copy_nonoverlapping", + description: "ptr::copy_nonoverlapping — raw pointer memcpy", + query: r#"(call_expression + function: (scoped_identifier + path: (identifier) @p (#eq? @p "ptr") + name: (identifier) @f (#eq? @f "copy_nonoverlapping"))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, + }, + Pattern { + id: "rs.memory.get_unchecked", + description: "get_unchecked / get_unchecked_mut — unchecked indexing", + query: r#"(call_expression + function: (field_expression + field: (field_identifier) @m + (#match? 
@m "^get_unchecked(_mut)?$"))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, + }, + Pattern { + id: "rs.memory.mem_zeroed", + description: "std::mem::zeroed — zero-initialised memory may be UB for non-POD types", + query: r#"(call_expression + function: (scoped_identifier + path: (identifier) @p (#eq? @p "mem") + name: (identifier) @n (#eq? @n "zeroed"))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, + }, + Pattern { + id: "rs.memory.ptr_read", + description: "ptr::read / ptr::read_volatile — raw pointer dereference", + query: r#"(call_expression + function: (scoped_identifier + path: (identifier) @p (#eq? @p "ptr") + name: (identifier) @n (#match? @n "^read(_volatile)?$"))) + @vuln"#, + severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, + }, + // ── Tier A: Code quality / robustness ────────────────────────────── + Pattern { + id: "rs.quality.unsafe_block", + description: "unsafe block — manual memory safety obligation", query: "(unsafe_block) @vuln", - severity: Severity::High, - }, - Pattern { - id: "unsafe_fn", - description: "`unsafe fn` declaration", - query: "(function_item - (function_modifiers) @mods - (#match? @mods \"^unsafe\\b\")) @vuln", - severity: Severity::High, - }, - Pattern { - id: "transmute_call", - description: "`std::mem::transmute` call", - query: "(call_expression - function: (scoped_identifier - path: (identifier) @p (#eq? @p \"mem\") - name: (identifier) @f (#eq? @f \"transmute\"))) - @vuln", - severity: Severity::High, - }, - Pattern { - id: "copy_nonoverlapping", - description: "Raw pointer `copy_nonoverlapping`", - query: "(call_expression - function: (scoped_identifier - path: (identifier) @p (#eq? @p \"ptr\") - name: (identifier) @f (#eq? 
@f \"copy_nonoverlapping\"))) - @vuln", - severity: Severity::High, - }, - Pattern { - id: "get_unchecked", - description: "`get_unchecked` / `get_unchecked_mut` slice access", - query: "(call_expression - function: (field_expression - field: (field_identifier) @m - (#match? @m \"get_unchecked(_mut)?\"))) @vuln", - severity: Severity::High, - }, - Pattern { - id: "unwrap_call", - description: "`.unwrap()` call (may panic)", - query: "(call_expression - function: (field_expression - field: (field_identifier) @name - (#eq? @name \"unwrap\"))) ; exact match - @vuln", severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, }, Pattern { - id: "expect_call", - description: "`.expect()` call (may panic)", - query: "(call_expression - function: (field_expression - field: (field_identifier) @name - (#eq? @name \"expect\"))) @vuln", + id: "rs.quality.unsafe_fn", + description: "unsafe fn declaration", + query: r#"(function_item + (function_modifiers) @mods + (#match? @mods "^unsafe")) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, }, Pattern { - id: "panic_macro", - description: "`panic!` macro invocation", - query: "(macro_invocation (identifier) @id (#eq? @id \"panic\")) @vuln", - severity: Severity::Medium, - }, - Pattern { - id: "todo_or_unimplemented", - description: "`todo!()` / `unimplemented!()` placeholder", - query: "(macro_invocation - (identifier) @id - (#match? @id \"todo|unimplemented\")) @vuln", + id: "rs.quality.unwrap", + description: ".unwrap() — panics on None/Err", + query: r#"(call_expression + function: (field_expression + field: (field_identifier) @name (#eq? 
@name "unwrap"))) + @vuln"#, severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::CodeQuality, + confidence: Confidence::High, }, Pattern { - id: "narrow_cast_with_as", - description: "`as` cast to an 8-/16-bit integer (possible truncation)", - query: "(type_cast_expression - type: (primitive_type) @to - (#match? @to \"^u?i(8|16)$\")) @vuln", + id: "rs.quality.expect", + description: ".expect() — panics on None/Err", + query: r#"(call_expression + function: (field_expression + field: (field_identifier) @name (#eq? @name "expect"))) + @vuln"#, severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::CodeQuality, + confidence: Confidence::High, }, Pattern { - id: "mem_zeroed", - description: "`std::mem::zeroed()`", - query: "(call_expression function:(scoped_identifier path:(identifier)@p (#eq? @p \"mem\") name:(identifier)@n (#eq? @n \"zeroed\")))@vuln", - severity: Severity::High, - }, - Pattern { - id: "mem_forget", - description: "`std::mem::forget()`", - query: "(call_expression function:(scoped_identifier path:(identifier)@p (#eq? @p \"mem\") name:(identifier)@n (#eq? @n \"forget\")))@vuln", - severity: Severity::Medium, - }, - Pattern { - id: "ptr_read", - description: "`ptr::read_*` raw-ptr read", - query: "(call_expression function:(scoped_identifier path:(identifier)@p (#eq? @p \"ptr\") name:(identifier)@n (#match? @n \"read(_volatile)?\")))@vuln", - severity: Severity::High, - }, - Pattern { - id: "arc_unwrap", - description: "`Arc::unwrap_or_else_unchecked`", - query: "(call_expression function:(scoped_identifier name:(identifier)@n (#eq? @n \"unwrap_or_else_unchecked\")))@vuln", - severity: Severity::High, - }, - Pattern { - id: "dbg_macro", - description: "`dbg!()` left in code", - query: "(macro_invocation (identifier)@id (#eq? @id \"dbg\"))@vuln", + id: "rs.quality.panic_macro", + description: "panic! macro invocation", + query: r#"(macro_invocation (identifier) @id (#eq? 
@id "panic")) @vuln"#, severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::CodeQuality, + confidence: Confidence::High, + }, + Pattern { + id: "rs.quality.todo", + description: "todo!() / unimplemented!() placeholder left in code", + query: r#"(macro_invocation + (identifier) @id + (#match? @id "^(todo|unimplemented)$")) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::CodeQuality, + confidence: Confidence::High, + }, + // ── Tier A: Narrowing cast ───────────────────────────────────────── + Pattern { + id: "rs.memory.narrow_cast", + description: "`as` cast to 8/16-bit integer — possible truncation", + query: r#"(type_cast_expression + type: (primitive_type) @to + (#match? @to "^(u8|i8|u16|i16)$")) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::Medium, + }, + Pattern { + id: "rs.memory.mem_forget", + description: "std::mem::forget — may leak resources", + query: r#"(call_expression + function: (scoped_identifier + path: (identifier) @p (#eq? @p "mem") + name: (identifier) @n (#eq? @n "forget"))) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::MemorySafety, + confidence: Confidence::High, }, ]; diff --git a/src/patterns/typescript.rs b/src/patterns/typescript.rs index 3f16d356..b6c676c3 100644 --- a/src/patterns/typescript.rs +++ b/src/patterns/typescript.rs @@ -1,100 +1,157 @@ -use crate::patterns::{Pattern, Severity}; +use crate::evidence::Confidence; +use crate::patterns::{Pattern, PatternCategory, PatternTier, Severity}; +/// TypeScript AST patterns. +/// +/// TypeScript shares most patterns with JavaScript. Taint rules cover `eval`, +/// `innerHTML`, and `child_process.*` sinks. AST patterns here mirror JS +/// patterns plus TS-specific `any` type-safety escapes. 
pub const PATTERNS: &[Pattern] = &[ + // ── Tier A: Code execution ───────────────────────────────────────── Pattern { - id: "eval_call", - description: "Use of eval()", - query: "(call_expression function: (identifier) @id (#eq? @id \"eval\")) @vuln", + id: "ts.code_exec.eval", + description: "eval() — dynamic code execution", + query: r#"(call_expression + function: (identifier) @id (#eq? @id "eval")) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, Pattern { - id: "new_function", - description: "new Function() constructor", - query: "(new_expression constructor: (identifier) @id (#eq? @id \"Function\")) @vuln", + id: "ts.code_exec.new_function", + description: "new Function() constructor — eval equivalent", + query: r#"(new_expression + constructor: (identifier) @id (#eq? @id "Function")) + @vuln"#, severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, Pattern { - id: "document_write", - description: "document.write() call", - query: "(call_expression function: (member_expression object: (identifier) @obj (#eq? @obj \"document\") property: (property_identifier) @prop (#eq? @prop \"write\"))) @vuln", + id: "ts.code_exec.settimeout_string", + description: "setTimeout/setInterval with string argument — implicit eval", + query: r#"(call_expression + function: (identifier) @id (#match? @id "^(setTimeout|setInterval)$") + arguments: (arguments (string) @code)) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::CodeExec, + confidence: Confidence::High, }, + // ── Tier A: XSS sinks ────────────────────────────────────────────── Pattern { - id: "settimeout_string", - description: "setTimeout / setInterval with a string argument", - query: "(call_expression function: (identifier) @id (#match? @id \"setTimeout|setInterval\") arguments: (arguments (string) @code . 
_)) @vuln", + id: "ts.xss.document_write", + description: "document.write() — XSS sink", + query: r#"(call_expression + function: (member_expression + object: (identifier) @obj (#eq? @obj "document") + property: (property_identifier) @prop (#match? @prop "^(write|writeln)$"))) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Xss, + confidence: Confidence::High, }, Pattern { - id: "any_type", - description: "Type annotation of `any`", - query: "(type_annotation (predefined_type) @t (#eq? @t \"any\")) @vuln", - severity: Severity::Low, - }, - Pattern { - id: "json_parse", - description: "JSON.parse on dynamic string", - query: "(call_expression function: (member_expression object: (identifier) @obj (#eq? @obj \"JSON\") property: (property_identifier) @prop (#eq? @prop \"parse\"))) @vuln", - severity: Severity::Low, - }, - Pattern { - id: "as_any_assertion", - description: "Type assertion to `any` using `as any`", - query: "(as_expression type: (predefined_type) @t (#eq? @t \"any\")) @vuln", - severity: Severity::Low, - }, - Pattern { - id: "type_assertion_any", - description: "Type assertion to `any` using `<any>` syntax", - query: "(type_assertion type: (predefined_type) @t (#eq? @t \"any\")) @vuln", - severity: Severity::Low, - }, - Pattern { - id: "outer_html_assignment", - description: "Assignment to element.outerHTML", - query: "(assignment_expression left: (member_expression property: (property_identifier) @prop (#eq? @prop \"outerHTML\"))) @vuln", + id: "ts.xss.outer_html", + description: "Assignment to .outerHTML — XSS sink", + query: r#"(assignment_expression + left: (member_expression + property: (property_identifier) @prop (#eq? 
@prop "outerHTML"))) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Xss, + confidence: Confidence::High, }, Pattern { - id: "insert_adjacent_html", - description: "insertAdjacentHTML() call", - query: "(call_expression function: (member_expression property: (property_identifier) @prop (#eq? @prop \"insertAdjacentHTML\"))) @vuln", + id: "ts.xss.insert_adjacent_html", + description: "insertAdjacentHTML() — XSS sink", + query: r#"(call_expression + function: (member_expression + property: (property_identifier) @prop (#eq? @prop "insertAdjacentHTML"))) + @vuln"#, severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Xss, + confidence: Confidence::High, + }, + // ── Tier A: Weak crypto ──────────────────────────────────────────── + Pattern { + id: "ts.crypto.math_random", + description: "Math.random() — not cryptographically secure", + query: r#"(call_expression + function: (member_expression + object: (identifier) @obj (#eq? @obj "Math") + property: (property_identifier) @prop (#eq? @prop "random"))) + @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::Crypto, + confidence: Confidence::Medium, + }, + // ── Tier A: TypeScript-specific type-safety escapes ──────────────── + Pattern { + id: "ts.quality.any_annotation", + description: "Type annotation of `any` — disables type checking", + query: r#"(type_annotation (predefined_type) @t (#eq? @t "any")) @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::CodeQuality, + confidence: Confidence::Medium, }, Pattern { - id: "document_cookie_write", + id: "ts.quality.as_any", + description: "Type assertion `as any` — type-safety escape hatch", + query: r#"(as_expression (predefined_type) @t (#eq? 
@t "any")) @vuln"#, + severity: Severity::Low, + tier: PatternTier::A, + category: PatternCategory::CodeQuality, + confidence: Confidence::Medium, + }, + // ── Tier A: Prototype pollution ──────────────────────────────────── + Pattern { + id: "ts.prototype.proto_assignment", + description: "Assignment to __proto__ — prototype pollution", + query: r#"(assignment_expression + left: (member_expression + property: (property_identifier) @prop (#eq? @prop "__proto__"))) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Prototype, + confidence: Confidence::High, + }, + // ── Tier A: Open redirect ────────────────────────────────────────── + Pattern { + id: "ts.xss.location_assign", + description: "Assignment to location/location.href — open redirect", + query: r#"(assignment_expression + left: (member_expression + object: (identifier) @obj (#match? @obj "^(window|location|document)$") + property: (property_identifier) @prop (#match? @prop "^(location|href)$"))) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::A, + category: PatternCategory::Xss, + confidence: Confidence::High, + }, + // ── Tier A: Cookie manipulation ──────────────────────────────────── + Pattern { + id: "ts.xss.cookie_write", description: "Write to document.cookie", - query: "(assignment_expression left: (member_expression object: (identifier) @obj (#eq? @obj \"document\") property: (property_identifier) @prop (#eq? @prop \"cookie\"))) @vuln", + query: r#"(assignment_expression + left: (member_expression + object: (identifier) @obj (#eq? @obj "document") + property: (property_identifier) @prop (#eq? @prop "cookie"))) + @vuln"#, severity: Severity::Low, - }, - Pattern { - id: "onclick_setattribute", - description: "Element.setAttribute('onclick', …)", - query: "(call_expression function: (member_expression property: (property_identifier) @prop (#eq? @prop \"setAttribute\")) arguments: (arguments (string) @name (#eq? @name \"\\\"onclick\\\"\") . 
(string) @handler)) @vuln", - severity: Severity::Medium, - }, - Pattern { - id: "math_random_call", - description: "Use of Math.random() for security-sensitive randomness", - query: "(call_expression function: (member_expression object: (identifier) @obj (#eq? @obj \"Math\") property: (property_identifier) @prop (#eq? @prop \"random\"))) @vuln", - severity: Severity::Low, - }, - Pattern { - id: "crypto_createhash_md5", - description: "Insecure hash algorithm: crypto.createHash('md5')", - query: "(call_expression function: (member_expression object: (identifier) @obj (#eq? @obj \"crypto\") property: (property_identifier) @prop (#eq? @prop \"createHash\")) arguments: (arguments (string) @alg (#match? @alg \"(?i)\\\"md5\\\"\"))) @vuln", - severity: Severity::Medium, - }, - Pattern { - id: "fetch_http_url", - description: "fetch() over plain HTTP", - query: "(call_expression function: (identifier) @id (#eq? @id \"fetch\") arguments: (arguments (string) @url (#match? @url \"^\\\"http://\"))) @vuln", - severity: Severity::Low, - }, - Pattern { - id: "xhr_eval_response", - description: "eval() of XMLHttpRequest.responseText", - query: "(call_expression function: (identifier) @id (#eq? @id \"eval\") arguments: (arguments (member_expression property: (property_identifier) @prop (#eq? @prop \"responseText\")))) @vuln", - severity: Severity::High, + tier: PatternTier::A, + category: PatternCategory::Xss, + confidence: Confidence::Medium, }, ]; diff --git a/src/rank.rs b/src/rank.rs new file mode 100644 index 00000000..56794632 --- /dev/null +++ b/src/rank.rs @@ -0,0 +1,646 @@ +//! Attack surface ranking for scan diagnostics. +//! +//! Computes a deterministic score for each [`Diag`] using only in-memory +//! information (severity, evidence, source kind, rule ID, validation state). +//! The score is used to sort findings so that truncation keeps the most +//! exploitable / important results. 
+ +use crate::commands::scan::Diag; +use crate::evidence::Evidence; +use crate::patterns::Severity; +use std::hash::{DefaultHasher, Hash, Hasher}; + +/// Computed attack-surface ranking for a single diagnostic. +#[derive(Debug, Clone)] +pub struct AttackRank { + pub score: f64, + /// Breakdown of score components (for debug/display purposes). + #[allow(dead_code)] + pub components: Vec<(String, String)>, +} + +/// Compute an attack-surface score for `diag`. +/// +/// The score is a positive `f64`; higher means more exploitable / important. +/// Components are returned for optional debug/display. +pub fn compute_attack_rank(diag: &Diag) -> AttackRank { + let mut score = 0.0_f64; + let mut components: Vec<(String, String)> = Vec::new(); + + // ── 1. Severity base ──────────────────────────────────────────────── + let sev_score = match diag.severity { + Severity::High => 60.0, + Severity::Medium => 30.0, + Severity::Low => 10.0, + }; + score += sev_score; + components.push(("severity".into(), format!("{sev_score}"))); + + // ── 2. Analysis kind bonus ────────────────────────────────────────── + // + // Taint-confirmed findings are the strongest signal. State findings + // (resource lifecycle / auth) are next. CFG-structural findings + // without taint evidence rank lower. AST-only pattern matches are + // the weakest. + let kind_bonus = analysis_kind_bonus(&diag.id, diag.evidence.as_ref()); + score += kind_bonus; + if kind_bonus != 0.0 { + components.push(("analysis_kind".into(), format!("{kind_bonus}"))); + } + + // ── 3. Evidence strength / source-kind priority ───────────────────── + let evidence_bonus = evidence_strength(diag); + score += evidence_bonus; + if evidence_bonus != 0.0 { + components.push(("evidence".into(), format!("{evidence_bonus}"))); + } + + // ── 4. 
State finding sub-ranking ──────────────────────────────────── + let state_bonus = state_finding_bonus(&diag.id); + score += state_bonus; + if state_bonus != 0.0 { + components.push(("state_rule".into(), format!("{state_bonus}"))); + } + + // ── 5. Path validation penalty ────────────────────────────────────── + // + // If a taint path is guarded by a validation predicate, the finding + // has higher informational value but lower exploitability because the + // guard may prevent the vulnerability from being triggered. Apply a + // small penalty (–5) to push validated paths below otherwise-equal + // unvalidated ones without changing the overall ranking tier. + let path_validated = diag.evidence.as_ref().map_or(diag.path_validated, |ev| { + ev.notes.iter().any(|n| n == "path_validated") + }); + if path_validated { + score -= 5.0; + components.push(("path_validated_penalty".into(), "-5".into())); + } + + AttackRank { score, components } +} + +/// Deterministic sort key for a diagnostic. +/// +/// Two diags with identical scores are tie-broken by: +/// severity (High < Medium < Low in the `Ord` impl, so we negate) +/// → rule ID → file path → line → col → message hash +/// +/// Returns a tuple suitable for `sort_by`. +pub fn sort_key(diag: &Diag) -> impl Ord { + let sev_ord: u8 = match diag.severity { + Severity::High => 0, + Severity::Medium => 1, + Severity::Low => 2, + }; + let msg_hash = { + let mut h = DefaultHasher::new(); + diag.message.hash(&mut h); + h.finish() + }; + ( + sev_ord, + diag.id.clone(), + diag.path.clone(), + diag.line, + diag.col, + msg_hash, + ) +} + +/// Sort diagnostics in-place by descending attack-surface score, then by +/// deterministic tie-breaker. Populates `rank_score` on each `Diag`. 
+pub fn rank_diags(diags: &mut [Diag]) { + // Compute scores + let scores: Vec<f64> = diags.iter().map(|d| compute_attack_rank(d).score).collect(); + + // Attach scores to diags + for (d, s) in diags.iter_mut().zip(scores.iter()) { + d.rank_score = Some(*s); + } + + // Sort descending by score, then ascending by tie-breaker + diags.sort_by(|a, b| { + let sa = a.rank_score.unwrap_or(0.0); + let sb = b.rank_score.unwrap_or(0.0); + // Descending score (higher first) + sb.partial_cmp(&sa) + .unwrap_or(std::cmp::Ordering::Equal) + .then_with(|| sort_key(a).cmp(&sort_key(b))) + }); +} + +// ───────────────────────────────────────────────────────────────────────────── +// Scoring helpers +// ───────────────────────────────────────────────────────────────────────────── + +/// Bonus based on analysis kind inferred from rule ID + evidence. +fn analysis_kind_bonus(rule_id: &str, evidence: Option<&Evidence>) -> f64 { + if rule_id.starts_with("taint-") { + // Taint-confirmed flow is the strongest signal + 10.0 + } else if rule_id.starts_with("state-") { + // State-model findings (resource / auth) are strong + 8.0 + } else if rule_id.starts_with("cfg-") { + // CFG-structural findings: boost if evidence exists + if evidence.is_some_and(|e| !e.is_empty()) { + 5.0 + } else { + 3.0 + } + } else { + // AST-only pattern match + 0.0 + } +} + +/// Bonus from evidence strength: number of evidence items and source-kind +/// priority. 
+fn evidence_strength(diag: &Diag) -> f64 { + let mut bonus = 0.0; + + if let Some(ev) = &diag.evidence { + // Count structured evidence items (capped at 4) + let item_count = ev.source.is_some() as usize + + ev.sink.is_some() as usize + + (ev.guards.len() + ev.sanitizers.len()).min(2); + bonus += item_count.min(4) as f64; + + // Source-kind priority from evidence notes + for note in &ev.notes { + if let Some(kind) = note.strip_prefix("source_kind:") { + bonus += source_kind_priority(kind); + break; + } + } + } else { + // Fallback for DB-cached diags without structured evidence + bonus += (diag.labels.len() as f64).min(4.0); + for (label, value) in &diag.labels { + if label == "Source" { + bonus += source_kind_priority(value); + } + } + } + + bonus +} + +/// Priority bonus based on the source kind string found in evidence. +/// +/// UserInput / EnvironmentConfig / Unknown are most exploitable. +/// FileSystem / Database are lower because the attacker needs a more +/// indirect vector. +fn source_kind_priority(source_value: &str) -> f64 { + // Structured SourceKind enum values (from evidence.notes "source_kind:X") + match source_value { + "UserInput" => return 6.0, + "EnvironmentConfig" => return 5.0, + "FileSystem" => return 3.0, + "Database" => return 2.0, + "Unknown" => return 4.0, + _ => {} + } + + // Fallback: substring matching for legacy labels + let lower = source_value.to_ascii_lowercase(); + if lower.contains("stdin") + || lower.contains("argv") + || lower.contains("request") + || lower.contains("form") + || lower.contains("query") + || lower.contains("param") + || lower.contains("header") + || lower.contains("body") + || lower.contains("read_line") + { + // Strong user-input signals + 6.0 + } else if lower.contains("env") || lower.contains("var(") || lower.contains("getenv") { + // Environment / config — still attacker-controllable in many deployments + 5.0 + } else if lower.contains("read") || lower.contains("file") || lower.contains("open") { + // File 
system — needs indirect vector + 3.0 + } else if lower.contains("query") || lower.contains("fetch") || lower.contains("select") { + // Database — needs prior injection + 2.0 + } else { + // Unknown / unrecognised — treat as moderately exploitable + 4.0 + } +} + +/// Bonus for specific state-analysis rule IDs. +fn state_finding_bonus(rule_id: &str) -> f64 { + match rule_id { + "state-use-after-close" => 6.0, + "state-unauthed-access" => 6.0, + "state-double-close" => 3.0, + "state-resource-leak" => 2.0, // must-leak + "state-resource-leak-possible" => 1.0, // may-leak + _ => 0.0, + } +} + +// ───────────────────────────────────────────────────────────────────────────── +// Tests +// ───────────────────────────────────────────────────────────────────────────── + +#[cfg(test)] +mod tests { + use super::*; + + fn make_diag( + severity: Severity, + id: &str, + path: &str, + line: usize, + labels: Vec<(String, String)>, + path_validated: bool, + ) -> Diag { + Diag { + path: path.into(), + line, + col: 1, + severity, + id: id.into(), + category: crate::patterns::FindingCategory::Security, + path_validated, + guard_kind: None, + message: None, + labels, + confidence: None, + evidence: None, + rank_score: None, + rank_reason: None, + suppressed: false, + suppression: None, + rollup: None, + } + } + + // ── Ordering tests ────────────────────────────────────────────────── + + #[test] + fn high_taint_user_input_ranks_above_medium_file_io() { + let high_taint = make_diag( + Severity::High, + "taint-unsanitised-flow (source 1:1)", + "src/main.rs", + 10, + vec![ + ("Source".into(), "read_line() at 1:1".into()), + ("Sink".into(), "exec()".into()), + ], + false, + ); + let med_file = make_diag( + Severity::Medium, + "taint-unsanitised-flow (source 5:1)", + "src/lib.rs", + 20, + vec![ + ("Source".into(), "File::open() at 5:1".into()), + ("Sink".into(), "write()".into()), + ], + false, + ); + + let score_high = compute_attack_rank(&high_taint).score; + let score_med = 
compute_attack_rank(&med_file).score; + assert!( + score_high > score_med, + "high taint user-input ({score_high}) should rank above medium file-io ({score_med})" + ); + } + + #[test] + fn must_leak_ranks_above_may_leak() { + let must = make_diag( + Severity::Medium, + "state-resource-leak", + "src/db.rs", + 30, + vec![], + false, + ); + let may = make_diag( + Severity::Low, + "state-resource-leak-possible", + "src/db.rs", + 35, + vec![], + false, + ); + + let score_must = compute_attack_rank(&must).score; + let score_may = compute_attack_rank(&may).score; + assert!( + score_must > score_may, + "must-leak ({score_must}) should rank above may-leak ({score_may})" + ); + } + + #[test] + fn cfg_without_evidence_ranks_below_taint_confirmed() { + let taint = make_diag( + Severity::High, + "taint-unsanitised-flow (source 1:1)", + "src/main.rs", + 10, + vec![ + ("Source".into(), "env::var(\"CMD\") at 1:1".into()), + ("Sink".into(), "exec()".into()), + ], + false, + ); + let cfg_only = make_diag( + Severity::High, + "cfg-unguarded-sink", + "src/main.rs", + 10, + vec![], + false, + ); + + let score_taint = compute_attack_rank(&taint).score; + let score_cfg = compute_attack_rank(&cfg_only).score; + assert!( + score_taint > score_cfg, + "taint-confirmed ({score_taint}) should rank above cfg-only ({score_cfg})" + ); + } + + #[test] + fn determinism_input_order_independent() { + let d1 = make_diag( + Severity::High, + "taint-unsanitised-flow (source 1:1)", + "a.rs", + 1, + vec![("Source".into(), "stdin at 1:1".into())], + false, + ); + let d2 = make_diag( + Severity::Medium, + "cfg-unguarded-sink", + "b.rs", + 2, + vec![], + false, + ); + let d3 = make_diag(Severity::Low, "rs.code_exec.eval", "c.rs", 3, vec![], false); + + let mut order_a = vec![d1.clone(), d2.clone(), d3.clone()]; + let mut order_b = vec![d3, d1, d2]; + + rank_diags(&mut order_a); + rank_diags(&mut order_b); + + let ids_a: Vec<_> = order_a.iter().map(|d| (&d.id, d.line)).collect(); + let ids_b: Vec<_> = 
order_b.iter().map(|d| (&d.id, d.line)).collect(); + assert_eq!( + ids_a, ids_b, + "ranking must be deterministic regardless of input order" + ); + } + + #[test] + fn path_validated_penalty_applied() { + let unvalidated = make_diag( + Severity::High, + "taint-unsanitised-flow (source 1:1)", + "src/main.rs", + 10, + vec![("Source".into(), "env::var(\"X\") at 1:1".into())], + false, + ); + let validated = make_diag( + Severity::High, + "taint-unsanitised-flow (source 1:1)", + "src/main.rs", + 10, + vec![("Source".into(), "env::var(\"X\") at 1:1".into())], + true, + ); + + let score_unval = compute_attack_rank(&unvalidated).score; + let score_val = compute_attack_rank(&validated).score; + assert!( + score_unval > score_val, + "unvalidated ({score_unval}) should rank above validated ({score_val})" + ); + } + + #[test] + fn state_use_after_close_ranks_above_may_leak() { + let uac = make_diag( + Severity::High, + "state-use-after-close", + "x.rs", + 1, + vec![], + false, + ); + let may = make_diag( + Severity::Low, + "state-resource-leak-possible", + "x.rs", + 2, + vec![], + false, + ); + + let score_uac = compute_attack_rank(&uac).score; + let score_may = compute_attack_rank(&may).score; + assert!(score_uac > score_may); + } + + #[test] + fn unauthed_access_ranks_above_resource_leak() { + let unauth = make_diag( + Severity::High, + "state-unauthed-access", + "x.rs", + 1, + vec![], + false, + ); + let leak = make_diag( + Severity::Medium, + "state-resource-leak", + "x.rs", + 2, + vec![], + false, + ); + + let score_ua = compute_attack_rank(&unauth).score; + let score_lk = compute_attack_rank(&leak).score; + assert!(score_ua > score_lk); + } + + #[test] + fn ast_only_ranks_below_all_others_at_same_severity() { + let ast = make_diag( + Severity::High, + "rs.code_exec.eval", + "x.rs", + 1, + vec![], + false, + ); + let cfg = make_diag( + Severity::High, + "cfg-unguarded-sink", + "x.rs", + 2, + vec![], + false, + ); + let taint = make_diag( + Severity::High, + 
"taint-unsanitised-flow (source 1:1)", + "x.rs", + 3, + vec![("Source".into(), "env::var(\"X\") at 1:1".into())], + false, + ); + let state = make_diag( + Severity::High, + "state-use-after-close", + "x.rs", + 4, + vec![], + false, + ); + + let s_ast = compute_attack_rank(&ast).score; + let s_cfg = compute_attack_rank(&cfg).score; + let s_taint = compute_attack_rank(&taint).score; + let s_state = compute_attack_rank(&state).score; + + assert!(s_ast < s_cfg, "AST ({s_ast}) < CFG ({s_cfg})"); + assert!(s_ast < s_taint, "AST ({s_ast}) < taint ({s_taint})"); + assert!(s_ast < s_state, "AST ({s_ast}) < state ({s_state})"); + } + + #[test] + fn structured_evidence_source_kind_matches_legacy() { + // Structured evidence with source_kind:UserInput note should give + // the same source-kind bonus as a legacy "Source" label with user input. + let mut structured = make_diag( + Severity::High, + "taint-unsanitised-flow (source 1:1)", + "src/main.rs", + 10, + vec![], + false, + ); + structured.evidence = Some(crate::evidence::Evidence { + source: Some(crate::evidence::SpanEvidence { + path: "src/main.rs".into(), + line: 1, + col: 1, + kind: "source".into(), + snippet: Some("read_line()".into()), + }), + sink: Some(crate::evidence::SpanEvidence { + path: "src/main.rs".into(), + line: 10, + col: 5, + kind: "sink".into(), + snippet: Some("exec()".into()), + }), + guards: vec![], + sanitizers: vec![], + state: None, + notes: vec!["source_kind:UserInput".into()], + }); + + let legacy = make_diag( + Severity::High, + "taint-unsanitised-flow (source 1:1)", + "src/main.rs", + 10, + vec![ + ("Source".into(), "read_line() at 1:1".into()), + ("Sink".into(), "exec()".into()), + ], + false, + ); + + let score_structured = compute_attack_rank(&structured).score; + let score_legacy = compute_attack_rank(&legacy).score; + assert_eq!( + score_structured, score_legacy, + "structured ({score_structured}) should equal legacy ({score_legacy})" + ); + } + + #[test] + fn 
evidence_item_count_capped_at_4() { + let mut d = make_diag( + Severity::High, + "taint-unsanitised-flow (source 1:1)", + "src/main.rs", + 10, + vec![], + false, + ); + let span = || crate::evidence::SpanEvidence { + path: "x.rs".into(), + line: 1, + col: 1, + kind: "guard".into(), + snippet: None, + }; + d.evidence = Some(crate::evidence::Evidence { + source: Some(span()), + sink: Some(span()), + guards: vec![span(), span(), span()], // 3 guards + sanitizers: vec![span()], // 1 sanitizer + state: None, + notes: vec![], + }); + + // item_count = 1 (source) + 1 (sink) + min(2, 3+1) = 4 + // No source_kind note is present, so no source-priority bonus applies: + // the evidence bonus is exactly 4.0 (from items only) + let score = evidence_strength(&d); + assert!( + (score - 4.0).abs() < f64::EPSILON, + "evidence item count should be capped at 4, got {score}" + ); + } + + #[test] + fn path_validated_from_evidence_notes() { + let mut d = make_diag( + Severity::High, + "taint-unsanitised-flow (source 1:1)", + "src/main.rs", + 10, + vec![], + false, // path_validated is false on Diag + ); + d.evidence = Some(crate::evidence::Evidence { + source: None, + sink: None, + guards: vec![], + sanitizers: vec![], + state: None, + notes: vec!["path_validated".into()], + }); + + let rank = compute_attack_rank(&d); + assert!( + rank.components + .iter() + .any(|(k, _)| k == "path_validated_penalty"), + "path_validated note in evidence should trigger penalty" + ); + } +} diff --git a/src/state/domain.rs b/src/state/domain.rs new file mode 100644 index 00000000..b02b5b62 --- /dev/null +++ b/src/state/domain.rs @@ -0,0 +1,313 @@ +use super::lattice::Lattice; +use super::symbol::SymbolId; +use bitflags::bitflags; +use std::collections::{HashMap, HashSet}; + +// ── ResourceLifecycle ──────────────────────────────────────────────────── + +bitflags! { + /// Bitset of possible lifecycle states for a single resource handle. 
+ /// + /// Join = bitwise OR (a variable may be in multiple states across paths). + #[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)] + pub struct ResourceLifecycle: u8 { + const UNINIT = 0b0001; + const OPEN = 0b0010; + const CLOSED = 0b0100; + const MOVED = 0b1000; + } +} + +impl Lattice for ResourceLifecycle { + fn bot() -> Self { + ResourceLifecycle::empty() + } + + fn join(&self, other: &Self) -> Self { + *self | *other + } + + fn leq(&self, other: &Self) -> bool { + self.intersection(*other) == *self + } +} + +// ── ResourceDomainState ────────────────────────────────────────────────── + +/// Maps interned variable IDs to their lifecycle bitsets. +#[derive(Clone, Debug, Default, PartialEq, Eq)] +pub struct ResourceDomainState { + pub vars: HashMap<SymbolId, ResourceLifecycle>, +} + +impl ResourceDomainState { + pub fn new() -> Self { + Self::default() + } + + pub fn get(&self, sym: SymbolId) -> ResourceLifecycle { + self.vars + .get(&sym) + .copied() + .unwrap_or(ResourceLifecycle::empty()) + } + + pub fn set(&mut self, sym: SymbolId, state: ResourceLifecycle) { + self.vars.insert(sym, state); + } +} + +impl Lattice for ResourceDomainState { + fn bot() -> Self { + Self::new() + } + + fn join(&self, other: &Self) -> Self { + let mut merged = self.clone(); + for (&sym, &other_lc) in &other.vars { + let entry = merged.vars.entry(sym).or_insert(ResourceLifecycle::empty()); + *entry = entry.join(&other_lc); + } + merged + } + + fn leq(&self, other: &Self) -> bool { + for (&sym, &self_lc) in &self.vars { + let other_lc = other.get(sym); + if !self_lc.leq(&other_lc) { + return false; + } + } + true + } +} + +// ── AuthLevel ──────────────────────────────────────────────────────────── + +/// Simple ordered lattice for path authentication state. +/// +/// Bot = `Unauthed`. Join = `min` (conservative: if any path is unauthed, +/// the joined state is unauthed). 
+#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)] +pub enum AuthLevel { + Unauthed, + Authed, + Admin, +} + +impl Lattice for AuthLevel { + fn bot() -> Self { + AuthLevel::Unauthed + } + + fn join(&self, other: &Self) -> Self { + // Conservative: take the minimum (least privileged) + (*self).min(*other) + } + + fn leq(&self, other: &Self) -> bool { + // Join = min, so the lattice order is the reverse of the privilege + // order (Admin ⊑ Authed ⊑ Unauthed): self ⊑ other iff self >= other + *self >= *other + } +} + +// ── AuthDomainState ────────────────────────────────────────────────────── + +/// Path auth level + per-variable validation bit. +#[derive(Clone, Debug, PartialEq, Eq)] +pub struct AuthDomainState { + pub auth_level: AuthLevel, + pub validated: HashSet<SymbolId>, +} + +impl Default for AuthDomainState { + fn default() -> Self { + Self { + auth_level: AuthLevel::Unauthed, + validated: HashSet::new(), + } + } +} + +impl AuthDomainState { + pub fn new() -> Self { + Self::default() + } +} + +impl Lattice for AuthDomainState { + fn bot() -> Self { + Self::new() + } + + fn join(&self, other: &Self) -> Self { + Self { + auth_level: self.auth_level.join(&other.auth_level), + // Only validated on ALL paths counts + validated: self + .validated + .intersection(&other.validated) + .copied() + .collect(), + } + } + + fn leq(&self, other: &Self) -> bool { + self.auth_level.leq(&other.auth_level) && self.validated.is_superset(&other.validated) + } +} + +// ── ProductState ───────────────────────────────────────────────────────── + +/// Composable product of resource and auth domains. 
+#[derive(Clone, Debug, PartialEq, Eq)] +pub struct ProductState { + pub resource: ResourceDomainState, + pub auth: AuthDomainState, +} + +impl ProductState { + pub fn initial() -> Self { + Self { + resource: ResourceDomainState::new(), + auth: AuthDomainState::new(), + } + } +} + +impl Lattice for ProductState { + fn bot() -> Self { + Self { + resource: ResourceDomainState::bot(), + auth: AuthDomainState::bot(), + } + } + + fn join(&self, other: &Self) -> Self { + Self { + resource: self.resource.join(&other.resource), + auth: self.auth.join(&other.auth), + } + } + + fn leq(&self, other: &Self) -> bool { + self.resource.leq(&other.resource) && self.auth.leq(&other.auth) + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn resource_lifecycle_join_is_or() { + let a = ResourceLifecycle::OPEN; + let b = ResourceLifecycle::CLOSED; + assert_eq!( + a.join(&b), + ResourceLifecycle::OPEN | ResourceLifecycle::CLOSED + ); + } + + #[test] + fn resource_lifecycle_bot_identity() { + let a = ResourceLifecycle::OPEN; + assert_eq!(a.join(&ResourceLifecycle::bot()), a); + } + + #[test] + fn resource_lifecycle_leq() { + let a = ResourceLifecycle::OPEN; + let b = ResourceLifecycle::OPEN | ResourceLifecycle::CLOSED; + assert!(a.leq(&b)); + assert!(!b.leq(&a)); + } + + #[test] + fn resource_domain_join_merges_keys() { + let mut a = ResourceDomainState::new(); + let mut b = ResourceDomainState::new(); + let sym_x = SymbolId(0); + let sym_y = SymbolId(1); + + a.set(sym_x, ResourceLifecycle::OPEN); + b.set(sym_x, ResourceLifecycle::CLOSED); + b.set(sym_y, ResourceLifecycle::OPEN); + + let joined = a.join(&b); + assert_eq!( + joined.get(sym_x), + ResourceLifecycle::OPEN | ResourceLifecycle::CLOSED + ); + assert_eq!(joined.get(sym_y), ResourceLifecycle::OPEN); + } + + #[test] + fn auth_level_join_is_min() { + assert_eq!( + AuthLevel::Admin.join(&AuthLevel::Unauthed), + AuthLevel::Unauthed + ); + assert_eq!(AuthLevel::Authed.join(&AuthLevel::Admin), AuthLevel::Authed); + 
assert_eq!( + AuthLevel::Authed.join(&AuthLevel::Authed), + AuthLevel::Authed + ); + } + + #[test] + fn auth_domain_join_intersects_validated() { + let sym_a = SymbolId(0); + let sym_b = SymbolId(1); + let sym_c = SymbolId(2); + + let a = AuthDomainState { + auth_level: AuthLevel::Authed, + validated: [sym_a, sym_b].into_iter().collect(), + }; + let b = AuthDomainState { + auth_level: AuthLevel::Admin, + validated: [sym_b, sym_c].into_iter().collect(), + }; + + let joined = a.join(&b); + assert_eq!(joined.auth_level, AuthLevel::Authed); + assert_eq!(joined.validated, [sym_b].into_iter().collect()); + } + + #[test] + fn product_state_join() { + let a = ProductState::initial(); + let b = ProductState::initial(); + let joined = a.join(&b); + assert_eq!(joined, ProductState::initial()); + } + + #[test] + fn may_must_leak_semantics() { + // Must-leak: OPEN only + let must_leak = ResourceLifecycle::OPEN; + assert!(must_leak.contains(ResourceLifecycle::OPEN)); + assert!(!must_leak.contains(ResourceLifecycle::CLOSED)); + assert!(!must_leak.contains(ResourceLifecycle::MOVED)); + + // May-leak: OPEN | CLOSED (some paths close, some don't) + let may_leak = ResourceLifecycle::OPEN | ResourceLifecycle::CLOSED; + assert!(may_leak.contains(ResourceLifecycle::OPEN)); + assert!(may_leak.contains(ResourceLifecycle::CLOSED)); + + // No leak: CLOSED only + let no_leak = ResourceLifecycle::CLOSED; + assert!(!no_leak.contains(ResourceLifecycle::OPEN)); + assert!(no_leak.contains(ResourceLifecycle::CLOSED)); + } + + // SymbolId is a newtype used in domain tests; ensure it's Copy + #[test] + fn symbol_id_is_copy() { + let s = SymbolId(0); + let s2 = s; + assert_eq!(s, s2); + } +} diff --git a/src/state/engine.rs b/src/state/engine.rs new file mode 100644 index 00000000..3252fabc --- /dev/null +++ b/src/state/engine.rs @@ -0,0 +1,288 @@ +use super::lattice::Lattice; +use crate::cfg::{Cfg, EdgeKind, NodeInfo}; +use petgraph::graph::NodeIndex; +use petgraph::visit::EdgeRef; +use 
std::collections::{HashMap, VecDeque}; + +/// Maximum tracked variables per function (guarded degradation). +pub const MAX_TRACKED_VARS: usize = 64; + +/// Default worklist iteration budget. +pub const MAX_WORKLIST_ITERATIONS: usize = 100_000; + +/// Generic transfer function trait for forward dataflow analysis. +/// +/// Domains implement this to define how abstract state flows through +/// CFG nodes and what events (findings) are emitted. +pub trait Transfer<S: Lattice> { + /// Side-channel events emitted during transfer (e.g., findings, violations). + type Event: Clone; + + /// Apply the transfer function to a node, returning the output state + /// and any events. + fn apply( + &self, + node: NodeIndex, + info: &NodeInfo, + edge: Option<EdgeKind>, + state: S, + ) -> (S, Vec<Self::Event>); + + /// Per-domain iteration budget. Defaults to [`MAX_WORKLIST_ITERATIONS`]. + fn iteration_budget(&self) -> usize { + MAX_WORKLIST_ITERATIONS + } + + /// Called when the budget is exhausted. Returns true if the engine + /// should continue with the current (non-converged) state, false to bail. + fn on_budget_exceeded(&self) -> bool { + false + } +} + +/// Result of running the forward dataflow engine. +pub struct DataflowResult<S, E> { + /// Converged state at the entry of each node. + pub states: HashMap<NodeIndex, S>, + /// Events emitted during Phase 2 transfer over converged states. + pub events: Vec<E>, + /// Whether the analysis converged (false if budget was hit). + #[allow(dead_code)] + pub converged: bool, +} + +/// Run a forward worklist dataflow analysis over the CFG. +/// +/// Two-phase design: +/// - Phase 1: fixed-point iteration to converge states (no event collection). +/// - Phase 2: single pass over converged states to collect events. +/// +/// Termination is guaranteed by lattice finiteness + iteration budget. 
+pub fn run_forward<S: Lattice, T: Transfer<S>>( + cfg: &Cfg, + entry: NodeIndex, + transfer: &T, + initial: S, +) -> DataflowResult<S, T::Event> { + let mut states: HashMap<NodeIndex, S> = HashMap::new(); + let budget = transfer.iteration_budget(); + + // Initialize entry node + states.insert(entry, initial); + + // ── Phase 1: fixed-point iteration (compute converged states) ───── + let mut worklist: VecDeque<NodeIndex> = VecDeque::new(); + worklist.push_back(entry); + + let mut iterations: usize = 0; + let mut converged = true; + + while let Some(node) = worklist.pop_front() { + iterations += 1; + if iterations > budget { + converged = !transfer.on_budget_exceeded(); + if !converged { + break; + } + } + + let node_state = match states.get(&node) { + Some(s) => s.clone(), + None => continue, + }; + + let edges: Vec<_> = cfg.edges(node).map(|e| (*e.weight(), e.target())).collect(); + + // No outgoing edges — nothing to propagate (exit/dead end). + if edges.is_empty() { + continue; + } + + for (edge_kind, target) in edges { + let info = &cfg[node]; + let (out_state, _events) = + transfer.apply(node, info, Some(edge_kind), node_state.clone()); + + // Join into target's state + let target_state = states.get(&target); + let new_target = match target_state { + Some(existing) => existing.join(&out_state), + None => out_state, + }; + + let changed = target_state.is_none_or(|existing| *existing != new_target); + if changed { + states.insert(target, new_target); + if !worklist.contains(&target) { + worklist.push_back(target); + } + } + } + } + + // ── Phase 2: single pass over converged states to collect events ── + let mut events: Vec<T::Event> = Vec::new(); + let mut seen_edges: std::collections::HashSet<(NodeIndex, NodeIndex)> = + std::collections::HashSet::new(); + + for node in states.keys().copied().collect::<Vec<_>>() { + let node_state = match states.get(&node) { + Some(s) => s.clone(), + None => continue, + }; + + let edges: Vec<_> = cfg.edges(node).map(|e| (*e.weight(), e.target())).collect(); + + if edges.is_empty() { + // Exit / dead end — 
apply transfer for event collection. + let info = &cfg[node]; + let (_out_state, new_events) = transfer.apply(node, info, None, node_state); + events.extend(new_events); + continue; + } + + for (edge_kind, target) in edges { + if !seen_edges.insert((node, target)) { + continue; + } + let info = &cfg[node]; + let (_out_state, new_events) = + transfer.apply(node, info, Some(edge_kind), node_state.clone()); + events.extend(new_events); + } + } + + DataflowResult { + states, + events, + converged, + } +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::cfg::{EdgeKind, NodeInfo, StmtKind}; + use crate::cfg_analysis::rules; + use crate::state::domain::ResourceLifecycle; + use crate::state::symbol::SymbolInterner; + use crate::state::transfer::DefaultTransfer; + use crate::symbol::Lang; + use petgraph::Graph; + + fn make_node(kind: StmtKind) -> NodeInfo { + NodeInfo { + kind, + span: (0, 0), + label: None, + defines: None, + uses: vec![], + callee: None, + enclosing_func: None, + call_ordinal: 0, + condition_text: None, + condition_vars: vec![], + condition_negated: false, + } + } + + #[test] + fn linear_cfg_converges() { + use crate::state::domain::ProductState; + + // Entry → fopen(f) → fclose(f) → Exit + let mut cfg: Cfg = Graph::new(); + let entry = cfg.add_node(make_node(StmtKind::Entry)); + let open_node = cfg.add_node(NodeInfo { + kind: StmtKind::Call, + defines: Some("f".into()), + callee: Some("fopen".into()), + ..make_node(StmtKind::Call) + }); + let close_node = cfg.add_node(NodeInfo { + kind: StmtKind::Call, + uses: vec!["f".into()], + callee: Some("fclose".into()), + ..make_node(StmtKind::Call) + }); + let exit = cfg.add_node(make_node(StmtKind::Exit)); + + cfg.add_edge(entry, open_node, EdgeKind::Seq); + cfg.add_edge(open_node, close_node, EdgeKind::Seq); + cfg.add_edge(close_node, exit, EdgeKind::Seq); + + let interner = SymbolInterner::from_cfg(&cfg); + let transfer = DefaultTransfer { + lang: Lang::C, + resource_pairs: 
rules::resource_pairs(Lang::C), + interner: &interner, + }; + + let result = run_forward(&cfg, entry, &transfer, ProductState::initial()); + + // No events (clean open→close) + assert!(result.events.is_empty()); + assert!(result.converged); + + // At exit, f should be CLOSED + let sym_f = interner.get("f").unwrap(); + let exit_state = result.states.get(&exit).unwrap(); + assert_eq!(exit_state.resource.get(sym_f), ResourceLifecycle::CLOSED); + } + + #[test] + fn diamond_cfg_joins_states() { + use crate::state::domain::ProductState; + + // Entry + // | + // fopen(f) + // | + // If + // / \ + // fclose(f) (no close) + // \ / + // Exit + let mut cfg: Cfg = Graph::new(); + let entry = cfg.add_node(make_node(StmtKind::Entry)); + let open_node = cfg.add_node(NodeInfo { + kind: StmtKind::Call, + defines: Some("f".into()), + callee: Some("fopen".into()), + ..make_node(StmtKind::Call) + }); + let if_node = cfg.add_node(make_node(StmtKind::If)); + let close_node = cfg.add_node(NodeInfo { + kind: StmtKind::Call, + uses: vec!["f".into()], + callee: Some("fclose".into()), + ..make_node(StmtKind::Call) + }); + let no_close = cfg.add_node(make_node(StmtKind::Seq)); + let exit = cfg.add_node(make_node(StmtKind::Exit)); + + cfg.add_edge(entry, open_node, EdgeKind::Seq); + cfg.add_edge(open_node, if_node, EdgeKind::Seq); + cfg.add_edge(if_node, close_node, EdgeKind::True); + cfg.add_edge(if_node, no_close, EdgeKind::False); + cfg.add_edge(close_node, exit, EdgeKind::Seq); + cfg.add_edge(no_close, exit, EdgeKind::Seq); + + let interner = SymbolInterner::from_cfg(&cfg); + let transfer = DefaultTransfer { + lang: Lang::C, + resource_pairs: rules::resource_pairs(Lang::C), + interner: &interner, + }; + + let result = run_forward(&cfg, entry, &transfer, ProductState::initial()); + + // At exit, f should be OPEN | CLOSED (may-leak) + let sym_f = interner.get("f").unwrap(); + let exit_state = result.states.get(&exit).unwrap(); + assert_eq!( + exit_state.resource.get(sym_f), + 
ResourceLifecycle::OPEN | ResourceLifecycle::CLOSED + ); + } +} diff --git a/src/state/facts.rs b/src/state/facts.rs new file mode 100644 index 00000000..23fee253 --- /dev/null +++ b/src/state/facts.rs @@ -0,0 +1,355 @@ +use super::domain::{AuthLevel, ProductState, ResourceLifecycle}; +use super::engine::DataflowResult; +use super::symbol::SymbolInterner; +use super::transfer::{TransferEvent, TransferEventKind}; +use crate::cfg::{Cfg, StmtKind}; +use crate::labels::{Cap, DataLabel}; +use crate::patterns::Severity; +use crate::symbol::Lang; +use petgraph::visit::IntoNodeReferences; + +/// Normalize a callee description for display. +fn sanitize_desc(s: &str) -> String { + crate::fmt::normalize_snippet(s) +} + +/// A finding produced by state analysis. +#[derive(Debug, Clone)] +pub struct StateFinding { + pub rule_id: String, + pub severity: Severity, + pub span: (usize, usize), + pub message: String, + /// State machine that produced this finding: `"resource"` or `"auth"`. + pub machine: &'static str, + /// Variable name involved, if available. + pub subject: Option<String>, + /// State before the event (e.g. `"closed"`, `"open"`, `"unauthed"`). + pub from_state: &'static str, + /// State after the event (e.g. `"used"`, `"closed"`, `"leaked"`, `"access"`). + pub to_state: &'static str, +} + +/// Extract findings from converged dataflow state + transfer events. +pub fn extract_findings( + result: &DataflowResult<ProductState, TransferEvent>, + cfg: &Cfg, + interner: &SymbolInterner, + lang: Lang, + func_summaries: &crate::cfg::FuncSummaries, +) -> Vec<StateFinding> { + let mut findings = Vec::new(); + + // ── 1. 
Use-after-close from transfer events ────────────────────────── + for event in &result.events { + let info = &cfg[event.node]; + let var_name = interner.resolve(event.var); + match event.kind { + TransferEventKind::UseAfterClose => { + findings.push(StateFinding { + rule_id: "state-use-after-close".into(), + severity: Severity::High, + span: info.span, + message: format!("variable `{var_name}` used after close"), + machine: "resource", + subject: Some(var_name.to_string()), + from_state: "closed", + to_state: "used", + }); + } + TransferEventKind::DoubleClose => { + findings.push(StateFinding { + rule_id: "state-double-close".into(), + severity: Severity::Medium, + span: info.span, + message: format!("variable `{var_name}` closed twice"), + machine: "resource", + subject: Some(var_name.to_string()), + from_state: "closed", + to_state: "closed", + }); + } + } + } + + // ── 2. Resource leaks at Exit and function-Return nodes ────────────── + for (idx, info) in cfg.node_references() { + // Check both the file-level Exit node and the *synthesised* function + // exit node (a Return node). Skip early-return nodes — they flow + // into the synthesised exit and carry only path-specific state. + // The synthesised exit is the one Return node that does NOT have an + // outgoing edge to another Return in the same function. 
+ let is_exit = info.kind == StmtKind::Exit; + let is_func_exit = info.kind == StmtKind::Return && info.enclosing_func.is_some(); + if !is_exit && !is_func_exit { + continue; + } + if is_func_exit { + use petgraph::Direction; + let is_early_return = cfg + .neighbors_directed(idx, Direction::Outgoing) + .any(|succ| { + let s = &cfg[succ]; + s.kind == StmtKind::Return && s.enclosing_func == info.enclosing_func + }); + if is_early_return { + continue; + } + } + let Some(state) = result.states.get(&idx) else { + continue; + }; + + for (&sym, &lifecycle) in &state.resource.vars { + if !lifecycle.contains(ResourceLifecycle::OPEN) { + continue; + } + let var_name = interner.resolve(sym); + + if !lifecycle.contains(ResourceLifecycle::CLOSED) + && !lifecycle.contains(ResourceLifecycle::MOVED) + { + // Definite leak: open on all paths, never closed + // Find the acquire span by scanning backwards for this variable's define + let acquire_span = find_acquire_span(cfg, sym, interner); + findings.push(StateFinding { + rule_id: "state-resource-leak".into(), + severity: Severity::Medium, + span: acquire_span.unwrap_or(info.span), + message: format!("resource `{var_name}` is never closed"), + machine: "resource", + subject: Some(var_name.to_string()), + from_state: "open", + to_state: "leaked", + }); + } else if lifecycle.contains(ResourceLifecycle::CLOSED) { + // May-leak: open on some paths, closed on others + let acquire_span = find_acquire_span(cfg, sym, interner); + findings.push(StateFinding { + rule_id: "state-resource-leak-possible".into(), + severity: Severity::Low, + span: acquire_span.unwrap_or(info.span), + message: format!("resource `{var_name}` may not be closed on all paths"), + machine: "resource", + subject: Some(var_name.to_string()), + from_state: "open", + to_state: "possibly_leaked", + }); + } + } + } + + // ── 3. 
Auth-required sinks ─────────────────────────────────────────── + // Check if any function is a web entrypoint + let has_web_entrypoint = cfg.node_references().any(|(_, info)| { + if let Some(ref func_name) = info.enclosing_func { + is_web_entrypoint_simple(func_name, lang, func_summaries, cfg) + } else { + false + } + }); + + if has_web_entrypoint { + for (idx, info) in cfg.node_references() { + if !is_privileged_sink(info) { + continue; + } + let Some(state) = result.states.get(&idx) else { + continue; + }; + if state.auth.auth_level == AuthLevel::Unauthed { + let callee_desc = sanitize_desc(info.callee.as_deref().unwrap_or("(sensitive op)")); + findings.push(StateFinding { + rule_id: "state-unauthed-access".into(), + severity: Severity::High, + span: info.span, + message: format!( + "sensitive operation `{callee_desc}` reached without authentication" + ), + machine: "auth", + subject: None, + from_state: "unauthed", + to_state: "access", + }); + } + } + } + + // Dedup + findings.sort_by(|a, b| a.span.cmp(&b.span).then_with(|| a.rule_id.cmp(&b.rule_id))); + findings.dedup_by(|a, b| a.span == b.span && a.rule_id == b.rule_id); + + findings +} + +/// Find the span where a variable was acquired (defined via Call node). +fn find_acquire_span( + cfg: &Cfg, + sym: super::symbol::SymbolId, + interner: &SymbolInterner, +) -> Option<(usize, usize)> { + let var_name = interner.resolve(sym); + for (_idx, info) in cfg.node_references() { + if info.kind == StmtKind::Call + && let Some(ref def) = info.defines + && def == var_name + { + return Some(info.span); + } + } + None +} + +/// Check if a node is a privileged sink (shell execution or file I/O). +fn is_privileged_sink(info: &crate::cfg::NodeInfo) -> bool { + match info.label { + Some(DataLabel::Sink(caps)) => caps.intersects(Cap::SHELL_ESCAPE | Cap::FILE_IO), + _ => false, + } +} + +/// Simplified web entrypoint check (avoids AnalysisContext dependency). 
+fn is_web_entrypoint_simple( + func_name: &str, + lang: Lang, + func_summaries: &crate::cfg::FuncSummaries, + _cfg: &Cfg, +) -> bool { + let name_lower = func_name.to_ascii_lowercase(); + + // Skip bare "main" — it's typically a CLI entry + if name_lower == "main" { + return false; + } + + let is_handler_name = name_lower.starts_with("handle_") + || name_lower.starts_with("route_") + || name_lower.starts_with("api_") + || name_lower.starts_with("serve_") + || name_lower.starts_with("process_") + || name_lower == "handler"; + + if !is_handler_name { + return false; + } + + // Check for web-like parameters + let web_params: &[&str] = match lang { + Lang::Rust => &["request", "req", "json", "query", "form", "payload", "body"], + Lang::JavaScript | Lang::TypeScript => &["req", "request", "ctx", "res", "response"], + Lang::Python => &["request", "req"], + Lang::Go => &["w", "writer", "r", "req", "request"], + Lang::Java => &["request", "req"], + _ => &["request", "req"], + }; + + let has_web_params = func_summaries.values().any(|s| { + s.param_names + .iter() + .any(|p| web_params.contains(&p.to_ascii_lowercase().as_str())) + }); + + // Strong handler names are enough even without web params + let strong_name = name_lower.starts_with("handle_") + || name_lower.starts_with("route_") + || name_lower.starts_with("api_"); + + has_web_params || strong_name +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::cfg::{EdgeKind, NodeInfo}; + use crate::cfg_analysis::rules; + use crate::state::domain::ProductState; + use crate::state::engine; + use crate::state::symbol::SymbolInterner; + use crate::state::transfer::DefaultTransfer; + use petgraph::Graph; + use std::collections::HashMap; + + fn make_node(kind: StmtKind) -> NodeInfo { + NodeInfo { + kind, + span: (0, 0), + label: None, + defines: None, + uses: vec![], + callee: None, + enclosing_func: None, + call_ordinal: 0, + condition_text: None, + condition_vars: vec![], + condition_negated: false, + } + } + + #[test] + 
fn detects_resource_leak() { + // Entry → fopen(f) → Exit (no close) + let mut cfg: Cfg = Graph::new(); + let entry = cfg.add_node(make_node(StmtKind::Entry)); + let open_node = cfg.add_node(NodeInfo { + kind: StmtKind::Call, + span: (10, 20), + defines: Some("f".into()), + callee: Some("fopen".into()), + ..make_node(StmtKind::Call) + }); + let exit = cfg.add_node(make_node(StmtKind::Exit)); + + cfg.add_edge(entry, open_node, EdgeKind::Seq); + cfg.add_edge(open_node, exit, EdgeKind::Seq); + + let interner = SymbolInterner::from_cfg(&cfg); + let transfer = DefaultTransfer { + lang: Lang::C, + resource_pairs: rules::resource_pairs(Lang::C), + interner: &interner, + }; + + let result = engine::run_forward(&cfg, entry, &transfer, ProductState::initial()); + let findings = extract_findings(&result, &cfg, &interner, Lang::C, &HashMap::new()); + + assert_eq!(findings.len(), 1); + assert_eq!(findings[0].rule_id, "state-resource-leak"); + assert!(findings[0].message.contains("f")); + } + + #[test] + fn clean_open_close_no_findings() { + // Entry → fopen(f) → fclose(f) → Exit + let mut cfg: Cfg = Graph::new(); + let entry = cfg.add_node(make_node(StmtKind::Entry)); + let open_node = cfg.add_node(NodeInfo { + kind: StmtKind::Call, + defines: Some("f".into()), + callee: Some("fopen".into()), + ..make_node(StmtKind::Call) + }); + let close_node = cfg.add_node(NodeInfo { + kind: StmtKind::Call, + uses: vec!["f".into()], + callee: Some("fclose".into()), + ..make_node(StmtKind::Call) + }); + let exit = cfg.add_node(make_node(StmtKind::Exit)); + + cfg.add_edge(entry, open_node, EdgeKind::Seq); + cfg.add_edge(open_node, close_node, EdgeKind::Seq); + cfg.add_edge(close_node, exit, EdgeKind::Seq); + + let interner = SymbolInterner::from_cfg(&cfg); + let transfer = DefaultTransfer { + lang: Lang::C, + resource_pairs: rules::resource_pairs(Lang::C), + interner: &interner, + }; + + let result = engine::run_forward(&cfg, entry, &transfer, ProductState::initial()); + let findings = 
extract_findings(&result, &cfg, &interner, Lang::C, &HashMap::new()); + + assert!(findings.is_empty()); + } +} diff --git a/src/state/lattice.rs b/src/state/lattice.rs new file mode 100644 index 00000000..1269ea83 --- /dev/null +++ b/src/state/lattice.rs @@ -0,0 +1,91 @@ +/// A bounded semi-lattice with bottom element and monotone join. +/// +/// Implementations must satisfy: +/// - `join` is commutative, associative, and idempotent +/// - `bot()` is the identity for `join` +/// - `leq(a, b)` iff `join(a, b) == b` +#[allow(dead_code)] +pub trait Lattice: Clone + Eq + Sized { + /// Bottom element (least information / unreachable). + fn bot() -> Self; + + /// Least upper bound: merge two abstract values. + fn join(&self, other: &Self) -> Self; + + /// Partial order: `self ⊑ other`. + fn leq(&self, other: &Self) -> bool; +} + +#[cfg(test)] +mod tests { + use super::*; + + /// A trivial 3-element lattice for testing the trait contract. + #[derive(Clone, Debug, PartialEq, Eq)] + struct Three(u8); // 0=bot, 1, 2=top-ish + + impl Lattice for Three { + fn bot() -> Self { + Three(0) + } + fn join(&self, other: &Self) -> Self { + Three(self.0.max(other.0)) + } + fn leq(&self, other: &Self) -> bool { + self.0 <= other.0 + } + } + + #[test] + fn bot_identity() { + let a = Three(1); + assert_eq!(a.join(&Three::bot()), a); + assert_eq!(Three::bot().join(&a), a); + } + + #[test] + fn join_commutative() { + let a = Three(1); + let b = Three(2); + assert_eq!(a.join(&b), b.join(&a)); + } + + #[test] + fn join_associative() { + let a = Three(0); + let b = Three(1); + let c = Three(2); + assert_eq!(a.join(&b).join(&c), a.join(&b.join(&c))); + } + + #[test] + fn join_idempotent() { + let a = Three(1); + assert_eq!(a.join(&a), a); + } + + #[test] + fn leq_reflexive() { + let a = Three(1); + assert!(a.leq(&a)); + } + + #[test] + fn leq_transitive() { + let a = Three(0); + let b = Three(1); + let c = Three(2); + assert!(a.leq(&b)); + assert!(b.leq(&c)); + assert!(a.leq(&c)); + } + + 
#[test] + fn leq_consistent_with_join() { + let a = Three(1); + let b = Three(2); + // a ⊑ b iff join(a, b) == b + assert!(a.leq(&b)); + assert_eq!(a.join(&b), b); + } +} diff --git a/src/state/mod.rs b/src/state/mod.rs new file mode 100644 index 00000000..a5c35562 --- /dev/null +++ b/src/state/mod.rs @@ -0,0 +1,62 @@ +pub mod domain; +pub mod engine; +pub mod facts; +pub mod lattice; +pub mod symbol; +pub mod transfer; + +use crate::cfg::{Cfg, FuncSummaries}; +use crate::cfg_analysis::rules; +use crate::summary::GlobalSummaries; +use crate::symbol::Lang; +use domain::ProductState; +use engine::MAX_TRACKED_VARS; +use facts::StateFinding; +use petgraph::graph::NodeIndex; +use symbol::SymbolInterner; +use transfer::DefaultTransfer; + +/// Run state-model dataflow analysis on a single function's CFG. +/// +/// Returns findings for use-after-close, double-close, resource leaks, +/// and unauthenticated access to sensitive sinks. +pub fn run_state_analysis( + cfg: &Cfg, + entry: NodeIndex, + lang: Lang, + _source_bytes: &[u8], + func_summaries: &FuncSummaries, + _global_summaries: Option<&GlobalSummaries>, +) -> Vec<StateFinding> { + let _span = tracing::debug_span!("run_state_analysis").entered(); + + // 1. Build symbol interner from CFG + let interner = SymbolInterner::from_cfg(cfg); + + // Guarded degradation: cap tracked variables + if interner.len() > MAX_TRACKED_VARS { + tracing::warn!( + symbols = interner.len(), + max = MAX_TRACKED_VARS, + "state analysis: too many variables, capping tracking" + ); + // Still run — the interner has all symbols, but transfer will only + // track the first MAX_TRACKED_VARS due to HashMap insertion order. + // This is conservative but safe. + } + + // 2. Construct transfer function + let resource_pairs = rules::resource_pairs(lang); + let transfer = DefaultTransfer { + lang, + resource_pairs, + interner: &interner, + }; + + // 3. 
Run forward dataflow engine + let initial = ProductState::initial(); + let result = engine::run_forward(cfg, entry, &transfer, initial); + + // 4. Extract findings + facts::extract_findings(&result, cfg, &interner, lang, func_summaries) +} diff --git a/src/state/symbol.rs b/src/state/symbol.rs new file mode 100644 index 00000000..03f13b4b --- /dev/null +++ b/src/state/symbol.rs @@ -0,0 +1,101 @@ +use crate::cfg::Cfg; +use petgraph::visit::IntoNodeReferences; +use std::collections::HashMap; + +/// Cheap `Copy` handle into a [`SymbolInterner`]. +#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)] +pub struct SymbolId(pub(crate) u32); + +/// Per-function interner: maps `String` ↔ [`SymbolId`]. +/// +/// Built once from CFG node `defines`/`uses`, reused throughout analysis. +#[derive(Default)] +pub struct SymbolInterner { + to_id: HashMap<String, SymbolId>, + to_str: Vec<String>, +} + +impl SymbolInterner { + pub fn new() -> Self { + Self::default() + } + + /// Intern a name, returning its stable [`SymbolId`]. + pub fn intern(&mut self, name: &str) -> SymbolId { + if let Some(&id) = self.to_id.get(name) { + return id; + } + let id = SymbolId(self.to_str.len() as u32); + self.to_str.push(name.to_owned()); + self.to_id.insert(name.to_owned(), id); + id + } + + /// Look up a name without interning it. + pub fn get(&self, name: &str) -> Option<SymbolId> { + self.to_id.get(name).copied() + } + + /// Resolve an id back to its string. + pub fn resolve(&self, id: SymbolId) -> &str { + &self.to_str[id.0 as usize] + } + + /// Number of interned symbols. + pub fn len(&self) -> usize { + self.to_str.len() + } + + /// Whether the interner is empty. + #[allow(dead_code)] + pub fn is_empty(&self) -> bool { + self.to_str.is_empty() + } + + /// Build from a CFG: walk all nodes, intern every `defines`/`uses` string. 
+ pub fn from_cfg(cfg: &Cfg) -> Self { + let mut interner = Self::new(); + for (_idx, info) in cfg.node_references() { + if let Some(ref d) = info.defines { + interner.intern(d); + } + for u in &info.uses { + interner.intern(u); + } + } + interner + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn intern_resolve_roundtrip() { + let mut interner = SymbolInterner::new(); + let a = interner.intern("foo"); + let b = interner.intern("bar"); + let a2 = interner.intern("foo"); + + assert_eq!(a, a2); + assert_ne!(a, b); + assert_eq!(interner.resolve(a), "foo"); + assert_eq!(interner.resolve(b), "bar"); + } + + #[test] + fn get_returns_none_for_unknown() { + let interner = SymbolInterner::new(); + assert!(interner.get("missing").is_none()); + } + + #[test] + fn len_tracks_unique_symbols() { + let mut interner = SymbolInterner::new(); + interner.intern("a"); + interner.intern("b"); + interner.intern("a"); // duplicate + assert_eq!(interner.len(), 2); + } +} diff --git a/src/state/transfer.rs b/src/state/transfer.rs new file mode 100644 index 00000000..e439aa36 --- /dev/null +++ b/src/state/transfer.rs @@ -0,0 +1,426 @@ +use super::domain::{AuthLevel, ProductState, ResourceLifecycle}; +use super::engine::Transfer; +use super::symbol::{SymbolId, SymbolInterner}; +use crate::cfg::{EdgeKind, NodeInfo, StmtKind}; +use crate::cfg_analysis::rules::{self, ResourcePair}; +use crate::symbol::Lang; +use petgraph::graph::NodeIndex; + +/// Events emitted during transfer for illegal state transitions. +/// These are NOT lattice values — they become findings in `facts.rs`. +#[derive(Debug, Clone)] +pub struct TransferEvent { + pub kind: TransferEventKind, + pub node: NodeIndex, + pub var: SymbolId, +} + +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum TransferEventKind { + UseAfterClose, + DoubleClose, +} + +/// Resource-use patterns: callees that read/write/operate on a resource handle +/// (triggering use-after-close if the handle is closed). 
+static RESOURCE_USE_PATTERNS: &[&str] = &[ + "read", "write", "send", "recv", "fread", "fwrite", "fgets", "fputs", "fprintf", "fscanf", + "fflush", "fseek", "ftell", "rewind", "feof", "ferror", "fgetc", "fputc", "getc", "putc", + "ungetc", "query", "execute", "fetch", "sendto", "recvfrom", "ioctl", "fcntl", + // Memory access functions (for malloc/free use-after-free detection) + "strcpy", "strncpy", "strcat", "strncat", "memcpy", "memmove", "memset", "memcmp", "strcmp", + "strncmp", "strlen", "sprintf", "snprintf", +]; + +/// Auth-call matchers for admin-level privilege. +static ADMIN_PATTERNS: &[&str] = &[ + "is_admin", + "hasrole", + "has_role", + "check_admin", + "require_admin", +]; + +pub struct DefaultTransfer<'a> { + pub lang: Lang, + pub resource_pairs: &'a [ResourcePair], + pub interner: &'a SymbolInterner, +} + +impl Transfer for DefaultTransfer<'_> { + type Event = TransferEvent; + + fn apply( + &self, + node_idx: NodeIndex, + info: &NodeInfo, + edge: Option<EdgeKind>, + mut state: ProductState, + ) -> (ProductState, Vec<TransferEvent>) { + let mut events = Vec::new(); + + match info.kind { + StmtKind::Call => { + self.apply_call(node_idx, info, &mut state, &mut events); + } + StmtKind::If => { + self.apply_if(info, edge, &mut state); + } + StmtKind::Seq => { + self.apply_assignment(node_idx, info, &mut state); + } + _ => {} + } + + (state, events) + } +} + +impl DefaultTransfer<'_> { + fn apply_call( + &self, + node_idx: NodeIndex, + info: &NodeInfo, + state: &mut ProductState, + events: &mut Vec<TransferEvent>, + ) { + let callee = match &info.callee { + Some(c) => c.to_ascii_lowercase(), + None => return, + }; + + // ── Resource acquire ───────────────────────────────────────────── + for pair in self.resource_pairs { + let is_acquire = pair.acquire.iter().any(|a| callee_matches(&callee, a)); + let is_excluded = pair + .exclude_acquire + .iter() + .any(|e| callee_matches(&callee, e)); + + if is_acquire + && !is_excluded + && let Some(ref def) = info.defines + && let Some(sym) = 
self.interner.get(def) + { + state.resource.set(sym, ResourceLifecycle::OPEN); + } + } + + // ── Resource release ───────────────────────────────────────────── + // Track which variables have already been released to avoid double- + // matching across multiple resource pair definitions. + let mut released: smallvec::SmallVec<[SymbolId; 4]> = smallvec::SmallVec::new(); + for pair in self.resource_pairs { + let is_release = pair.release.iter().any(|r| callee_matches(&callee, r)); + if is_release { + for used in &info.uses { + if let Some(sym) = self.interner.get(used) { + if released.contains(&sym) { + continue; + } + let current = state.resource.get(sym); + if current == ResourceLifecycle::CLOSED { + // Double close + events.push(TransferEvent { + kind: TransferEventKind::DoubleClose, + node: node_idx, + var: sym, + }); + } else if current.contains(ResourceLifecycle::OPEN) { + state.resource.set(sym, ResourceLifecycle::CLOSED); + } + released.push(sym); + } + } + } + } + + // ── Resource use (read/write/etc.) 
─────────────────────────────── + let is_use = RESOURCE_USE_PATTERNS + .iter() + .any(|p| callee_matches(&callee, p)); + if is_use { + for used in &info.uses { + if let Some(sym) = self.interner.get(used) { + let current = state.resource.get(sym); + if current == ResourceLifecycle::CLOSED { + events.push(TransferEvent { + kind: TransferEventKind::UseAfterClose, + node: node_idx, + var: sym, + }); + } + } + } + } + + // ── Auth call ──────────────────────────────────────────────────── + let auth_rules = rules::auth_rules(self.lang); + let is_auth = auth_rules.iter().any(|rule| { + rule.matchers + .iter() + .any(|m| callee_matches(&callee, &m.to_ascii_lowercase())) + }); + if is_auth { + let is_admin = ADMIN_PATTERNS.iter().any(|p| callee_matches(&callee, p)); + let new_level = if is_admin { + AuthLevel::Admin + } else { + AuthLevel::Authed + }; + if new_level > state.auth.auth_level { + state.auth.auth_level = new_level; + } + } + + // ── Validation call (guard) ────────────────────────────────────── + if is_guard_like(&callee) { + for used in &info.uses { + if let Some(sym) = self.interner.get(used) { + state.auth.validated.insert(sym); + } + } + } + } + + fn apply_if(&self, info: &NodeInfo, edge: Option<EdgeKind>, state: &mut ProductState) { + // On the True edge of an If node whose condition is an auth check, + // refine auth level. 
+ let is_true_edge = matches!(edge, Some(EdgeKind::True)); + if !is_true_edge { + return; + } + + if let Some(ref cond) = info.condition_text { + let cond_lower = cond.to_ascii_lowercase(); + + // Auth-related condition + let auth_rules = rules::auth_rules(self.lang); + let is_auth_cond = auth_rules.iter().any(|rule| { + rule.matchers + .iter() + .any(|m| cond_lower.contains(&m.to_ascii_lowercase())) + }); + if is_auth_cond && !info.condition_negated { + let is_admin = ADMIN_PATTERNS.iter().any(|p| cond_lower.contains(p)); + let new_level = if is_admin { + AuthLevel::Admin + } else { + AuthLevel::Authed + }; + if new_level > state.auth.auth_level { + state.auth.auth_level = new_level; + } + } + + // Validation-related condition + if is_guard_like(&cond_lower) && !info.condition_negated { + for var in &info.condition_vars { + if let Some(sym) = self.interner.get(var) { + state.auth.validated.insert(sym); + } + } + } + } + } + + fn apply_assignment(&self, _node_idx: NodeIndex, info: &NodeInfo, state: &mut ProductState) { + // Ownership transfer: if `defines` reassigns a tracked resource + // variable from a `uses` variable, transfer the lifecycle. + if let Some(ref def) = info.defines + && let Some(def_sym) = self.interner.get(def) + { + // If the RHS is a tracked resource, transfer its state + for used in &info.uses { + if let Some(use_sym) = self.interner.get(used) { + let lc = state.resource.get(use_sym); + if lc.contains(ResourceLifecycle::OPEN) { + state.resource.set(def_sym, lc); + state.resource.set(use_sym, ResourceLifecycle::MOVED); + return; + } + } + } + } + } +} + +/// Check if a callee matches a pattern. +/// Supports suffix matching (e.g., "fclose" matches callee "my_fclose") +/// and dot-prefix matching (e.g., ".close" matches "file.close"). 
+fn callee_matches(callee: &str, pattern: &str) -> bool { + let pattern_lower = pattern.to_ascii_lowercase(); + if pattern_lower.starts_with('.') { + // Method pattern: ".close" matches "x.close", "file.close", etc. + callee.ends_with(&pattern_lower) + } else { + // Exact or suffix match + callee == pattern_lower || callee.ends_with(&pattern_lower) + } +} + +/// Check if a callee looks like a guard/validation function. +fn is_guard_like(callee: &str) -> bool { + static GUARD_PREFIXES: &[&str] = &["validate", "sanitize", "check_", "verify_", "assert_"]; + GUARD_PREFIXES.iter().any(|p| callee.starts_with(p)) +} + +#[cfg(test)] +mod tests { + use super::*; + #[test] + fn callee_matches_exact() { + assert!(callee_matches("fopen", "fopen")); + assert!(!callee_matches("fopen", "fclose")); + } + + #[test] + fn callee_matches_suffix() { + assert!(callee_matches("curlx_fclose", "fclose")); + } + + #[test] + fn callee_matches_dot_prefix() { + assert!(callee_matches("file.close", ".close")); + assert!(!callee_matches("file.close", ".open")); + } + + #[test] + fn acquire_sets_open() { + let mut interner = SymbolInterner::new(); + let sym_f = interner.intern("f"); + + let transfer = DefaultTransfer { + lang: Lang::C, + resource_pairs: rules::resource_pairs(Lang::C), + interner: &interner, + }; + + let info = NodeInfo { + kind: StmtKind::Call, + span: (0, 10), + label: None, + defines: Some("f".into()), + uses: vec![], + callee: Some("fopen".into()), + enclosing_func: None, + call_ordinal: 0, + condition_text: None, + condition_vars: vec![], + condition_negated: false, + }; + + let (state, events) = + transfer.apply(NodeIndex::new(0), &info, None, ProductState::initial()); + assert!(events.is_empty()); + assert_eq!(state.resource.get(sym_f), ResourceLifecycle::OPEN); + } + + #[test] + fn close_after_open_sets_closed() { + let mut interner = SymbolInterner::new(); + let sym_f = interner.intern("f"); + + let transfer = DefaultTransfer { + lang: Lang::C, + resource_pairs: 
rules::resource_pairs(Lang::C), + interner: &interner, + }; + + let mut state = ProductState::initial(); + state.resource.set(sym_f, ResourceLifecycle::OPEN); + + let info = NodeInfo { + kind: StmtKind::Call, + span: (10, 20), + label: None, + defines: None, + uses: vec!["f".into()], + callee: Some("fclose".into()), + enclosing_func: None, + call_ordinal: 0, + condition_text: None, + condition_vars: vec![], + condition_negated: false, + }; + + let (state, events) = transfer.apply(NodeIndex::new(1), &info, None, state); + assert!(events.is_empty()); + assert_eq!(state.resource.get(sym_f), ResourceLifecycle::CLOSED); + } + + #[test] + fn double_close_emits_event() { + let mut interner = SymbolInterner::new(); + let sym_f = interner.intern("f"); + + let transfer = DefaultTransfer { + lang: Lang::C, + resource_pairs: rules::resource_pairs(Lang::C), + interner: &interner, + }; + + let mut state = ProductState::initial(); + state.resource.set(sym_f, ResourceLifecycle::CLOSED); + + let info = NodeInfo { + kind: StmtKind::Call, + span: (20, 30), + label: None, + defines: None, + uses: vec!["f".into()], + callee: Some("fclose".into()), + enclosing_func: None, + call_ordinal: 0, + condition_text: None, + condition_vars: vec![], + condition_negated: false, + }; + + let (_state, events) = transfer.apply(NodeIndex::new(2), &info, None, state); + assert_eq!(events.len(), 1); + assert_eq!(events[0].kind, TransferEventKind::DoubleClose); + assert_eq!(events[0].var, sym_f); + } + + #[test] + fn use_after_close_emits_event() { + let mut interner = SymbolInterner::new(); + let sym_f = interner.intern("f"); + + let transfer = DefaultTransfer { + lang: Lang::C, + resource_pairs: rules::resource_pairs(Lang::C), + interner: &interner, + }; + + let mut state = ProductState::initial(); + state.resource.set(sym_f, ResourceLifecycle::CLOSED); + + let info = NodeInfo { + kind: StmtKind::Call, + span: (30, 40), + label: None, + defines: None, + uses: vec!["f".into()], + callee: 
Some("fread".into()), + enclosing_func: None, + call_ordinal: 0, + condition_text: None, + condition_vars: vec![], + condition_negated: false, + }; + + let (_state, events) = transfer.apply(NodeIndex::new(3), &info, None, state); + assert_eq!(events.len(), 1); + assert_eq!(events[0].kind, TransferEventKind::UseAfterClose); + } + + #[test] + fn is_guard_like_check() { + assert!(is_guard_like("validate_input")); + assert!(is_guard_like("sanitize_html")); + assert!(is_guard_like("check_permission")); + assert!(!is_guard_like("open_file")); + } +} diff --git a/src/summary/mod.rs b/src/summary/mod.rs index 9ccdd94d..2a0ae4bc 100644 --- a/src/summary/mod.rs +++ b/src/summary/mod.rs @@ -139,6 +139,22 @@ impl FuncSummary { } } +// ── Callee resolution ──────────────────────────────────────────────────── + +/// Result of resolving a bare callee name to a [`FuncKey`]. +/// +/// Three-valued: the call graph builder and taint engine need to distinguish +/// "no candidates at all" from "multiple candidates, can't pick one". +#[derive(Debug, Clone, PartialEq, Eq)] +pub enum CalleeResolution { + /// Exactly one candidate matched. + Resolved(FuncKey), + /// No candidates found at all. + NotFound, + /// Multiple candidates — ambiguous, cannot pick one. + Ambiguous(Vec<FuncKey>), +} + // ── Lookup map used by the taint engine ───────────────────────────────── /// A merged view of all function summaries keyed by qualified [`FuncKey`]. @@ -216,16 +232,66 @@ impl GlobalSummaries { } } - #[allow(dead_code)] + #[allow(dead_code)] // used by tests and future call-graph consumers pub fn is_empty(&self) -> bool { self.by_key.is_empty() } /// Iterate over all (key, summary) pairs. - #[allow(dead_code)] pub fn iter(&self) -> impl Iterator<Item = (&FuncKey, &FuncSummary)> { self.by_key.iter() } + + /// Resolve a bare (already-normalized) callee name to a [`FuncKey`]. + /// + /// Resolution order: + /// 1. Collect all same-language candidates matching the name. + /// 2. If `arity_hint` is `Some`, filter candidates by matching arity. 
+ /// 3. If exactly one candidate → [`CalleeResolution::Resolved`]. + /// 4. If multiple, filter by `caller_namespace`; if exactly one → `Resolved`. + /// 5. If still multiple → [`CalleeResolution::Ambiguous`]. + /// 6. If zero candidates → [`CalleeResolution::NotFound`]. + pub fn resolve_callee_key( + &self, + callee: &str, + caller_lang: Lang, + caller_namespace: &str, + arity_hint: Option, + ) -> CalleeResolution { + let candidates = self.lookup_same_lang(caller_lang, callee); + if candidates.is_empty() { + return CalleeResolution::NotFound; + } + + // Apply arity filter if hint provided. + let filtered: Vec<&FuncKey> = if let Some(arity) = arity_hint { + candidates + .iter() + .filter(|(k, _)| k.arity == Some(arity)) + .map(|(k, _)| *k) + .collect() + } else { + candidates.iter().map(|(k, _)| *k).collect() + }; + + match filtered.len() { + 0 => CalleeResolution::NotFound, + 1 => CalleeResolution::Resolved(filtered[0].clone()), + _ => { + // Namespace disambiguation: prefer same-namespace match. + let same_ns: Vec<&FuncKey> = filtered + .iter() + .filter(|k| k.namespace == caller_namespace) + .copied() + .collect(); + match same_ns.len() { + 1 => CalleeResolution::Resolved(same_ns[0].clone()), + 0 => CalleeResolution::Ambiguous(filtered.into_iter().cloned().collect()), + _ => CalleeResolution::Ambiguous(same_ns.into_iter().cloned().collect()), + } + } + } + } } impl std::fmt::Debug for GlobalSummaries { diff --git a/src/suppress/mod.rs b/src/suppress/mod.rs new file mode 100644 index 00000000..7427f41a --- /dev/null +++ b/src/suppress/mod.rs @@ -0,0 +1,715 @@ +//! Inline per-finding suppression via source-code comments. +//! +//! Supports two directive forms: +//! - `nyx:ignore [, …]` — suppress findings on the same line +//! - `nyx:ignore-next-line [, …]` — suppress findings on the next line +//! +//! Comments are detected for all supported languages without tree-sitter, +//! using a lightweight string/comment state machine. 
+ +use std::collections::HashMap; + +// ───────────────────────────────────────────────────────────────────────────── +// Public types +// ───────────────────────────────────────────────────────────────────────────── + +/// Whether the directive suppresses on its own line or the next line. +#[derive(Debug, Clone, serde::Serialize)] +pub enum SuppressionKind { + SameLine, + NextLine, +} + +/// Metadata attached to a suppressed finding. +#[derive(Debug, Clone, serde::Serialize)] +pub struct SuppressionMeta { + pub kind: SuppressionKind, + /// The pattern that matched the finding's rule ID. + pub matched_pattern: String, + /// 1-indexed line where the suppression directive appears. + pub directive_line: usize, +} + +// ───────────────────────────────────────────────────────────────────────────── +// Internal types +// ───────────────────────────────────────────────────────────────────────────── + +/// A single rule matcher — either exact or wildcard-suffix (`foo.*`). +#[derive(Debug)] +enum RuleMatcher { + Exact(String), + /// `prefix` stores everything before the trailing `.*`. + WildcardSuffix(String), +} + +impl RuleMatcher { + fn matches(&self, rule_id: &str) -> bool { + match self { + RuleMatcher::Exact(s) => s == rule_id, + RuleMatcher::WildcardSuffix(prefix) => { + rule_id.starts_with(prefix.as_str()) + && rule_id.len() > prefix.len() + && rule_id.as_bytes()[prefix.len()] == b'.' + } + } + } +} + +/// A parsed directive from a single comment. +#[derive(Debug)] +struct LineDirective { + kind: SuppressionKind, + /// 1-indexed line where the directive comment appears. + directive_line: usize, + matchers: Vec<RuleMatcher>, +} + +/// Pre-built index of suppression directives keyed by **target line** (the +/// line whose findings should be suppressed, 1-indexed). +pub struct SuppressionIndex { + directives: HashMap<usize, Vec<LineDirective>>, +} + +impl SuppressionIndex { + /// Check whether a finding at `line` (1-indexed) with `rule_id` is suppressed. 
+ pub fn check(&self, line: usize, rule_id: &str) -> Option<SuppressionMeta> { + let canon = canonical_rule_id(rule_id); + let dirs = self.directives.get(&line)?; + for dir in dirs { + for m in &dir.matchers { + if m.matches(canon) { + let display_pattern = match m { + RuleMatcher::Exact(s) => s.clone(), + RuleMatcher::WildcardSuffix(s) => format!("{s}.*"), + }; + return Some(SuppressionMeta { + kind: dir.kind.clone(), + matched_pattern: display_pattern, + directive_line: dir.directive_line, + }); + } + } + } + None + } + + /// Returns `true` if no directives were found. + pub fn is_empty(&self) -> bool { + self.directives.is_empty() + } +} + +// ───────────────────────────────────────────────────────────────────────────── +// Canonical rule ID +// ───────────────────────────────────────────────────────────────────────────── + +/// Strip parenthetical suffix from a rule ID: +/// `"taint-unsanitised-flow (source 5:1)"` → `"taint-unsanitised-flow"`. +pub fn canonical_rule_id(id: &str) -> &str { + let trimmed = id.trim(); + if let Some(idx) = trimmed.find(" (") { + trimmed[..idx].trim_end() + } else { + trimmed + } +} + +// ───────────────────────────────────────────────────────────────────────────── +// Comment style per language +// ───────────────────────────────────────────────────────────────────────────── + +#[derive(Clone, Copy)] +enum CommentStyle { + /// `//` and `/* */` — Rust, C, C++, Java, Go, JS, TS + CStyle, + /// `#` only — Python, Ruby + Hash, + /// `//`, `#`, and `/* */` — PHP + PhpStyle, +} + +/// Map a file extension to the comment style for that language. +fn comment_style_for_ext(ext: &str) -> Option<CommentStyle> { + match ext { + "rs" | "c" | "cpp" | "java" | "go" | "ts" | "js" => Some(CommentStyle::CStyle), + "py" | "rb" => Some(CommentStyle::Hash), + "php" => Some(CommentStyle::PhpStyle), + _ => None, + } +} + +/// Map a file path to its comment style by inspecting the extension. 
+fn comment_style_for_path(path: &std::path::Path) -> Option<CommentStyle> { + let ext = path.extension().and_then(|s| s.to_str())?; + // Normalise common variant extensions + let norm = match ext { + "RS" => "rs", + "c++" => "cpp", + "PY" => "py", + "TSX" | "tsx" => "ts", + other => other, + }; + comment_style_for_ext(norm) +} + +// ───────────────────────────────────────────────────────────────────────────── +// Parser +// ───────────────────────────────────────────────────────────────────────────── + +/// Parse inline suppression directives from `source`, using comment syntax +/// appropriate for the given file path. +/// +/// Returns an empty index if the source doesn't contain `nyx:ignore` or the +/// language is unsupported. +pub fn parse_inline_suppressions(path: &std::path::Path, source: &str) -> SuppressionIndex { + // Fast path: no directives possible. + if !source.as_bytes().windows(10).any(|w| w == b"nyx:ignore") { + return SuppressionIndex { + directives: HashMap::new(), + }; + } + + let Some(style) = comment_style_for_path(path) else { + return SuppressionIndex { + directives: HashMap::new(), + }; + }; + + let mut index: HashMap<usize, Vec<LineDirective>> = HashMap::new(); + let total_lines = source.lines().count(); + + // State machine for string/comment tracking. + let mut in_block_comment = false; + let mut block_comment_start_line: usize = 0; + + for (line_idx, raw_line) in source.lines().enumerate() { + let line_num = line_idx + 1; // 1-indexed + let line = raw_line.trim_end_matches('\r'); + + if in_block_comment { + // Check for block comment end. + if let Some(end_pos) = line.find("*/") { + // Extract text before `*/` — may contain a directive. 
+ let block_text = &line[..end_pos]; + if let Some(dir) = try_parse_directive(block_text, line_num) { + let target = target_line(&dir, line_num, total_lines); + if let Some(t) = target { + index.entry(t).or_default().push(dir); + } + } + in_block_comment = false; + // After the block comment ends, check the rest of the line + // for a line comment. + let rest = &line[end_pos + 2..]; + if let Some(dir) = extract_from_line_rest(rest, line_num, style) { + let target = target_line(&dir, line_num, total_lines); + if let Some(t) = target { + index.entry(t).or_default().push(dir); + } + } + } else { + // Still inside block comment — check for directive. + if let Some(dir) = try_parse_directive(line, line_num) { + let target = target_line(&dir, line_num, total_lines); + if let Some(t) = target { + index.entry(t).or_default().push(dir); + } + } + } + let _ = block_comment_start_line; // suppress unused warning + continue; + } + + // Not in a block comment — scan the line character by character + // tracking string state. + if let Some(dir) = scan_line_for_directive(line, line_num, style, &mut in_block_comment) { + let target = target_line(&dir, line_num, total_lines); + if let Some(t) = target { + index.entry(t).or_default().push(dir); + } + } + if in_block_comment { + block_comment_start_line = line_num; + } + } + + SuppressionIndex { directives: index } +} + +/// Compute the target line for a directive. Returns `None` if the directive +/// is `NextLine` but on the last line (EOF — no-op). +fn target_line(dir: &LineDirective, line_num: usize, total_lines: usize) -> Option<usize> { + match dir.kind { + SuppressionKind::SameLine => Some(line_num), + SuppressionKind::NextLine => { + if line_num < total_lines { + Some(line_num + 1) + } else { + None // EOF — no next line + } + } + } +} + +/// Scan a single line (not inside a block comment) for a suppression directive. +/// Tracks string literals to avoid false positives. 
+/// +/// Sets `in_block_comment` to `true` if the line opens a `/* */` block that +/// doesn't close on the same line. +fn scan_line_for_directive( + line: &str, + line_num: usize, + style: CommentStyle, + in_block_comment: &mut bool, +) -> Option<LineDirective> { + let bytes = line.as_bytes(); + let len = bytes.len(); + let mut i = 0; + + // String state + let mut in_string: Option<u8> = None; // quote char: b'"', b'\'', b'`' + + while i < len { + let ch = bytes[i]; + + // ── Inside a string literal ───────────────────────────────────── + if let Some(quote) = in_string { + if ch == b'\\' { + i += 2; // skip escaped char + continue; + } + // Python triple quotes + if (quote == b'"' || quote == b'\'') + && i + 2 < len + && bytes[i] == quote + && bytes[i + 1] == quote + && bytes[i + 2] == quote + { + // Check if this is a triple-quote close + // (we entered via triple-quote open, but we track single quote char) + in_string = None; + i += 3; + continue; + } + if ch == quote { + in_string = None; + } + i += 1; + continue; + } + + // ── Not in a string ───────────────────────────────────────────── + + // Rust raw strings: r"..." or r#"..."# + if ch == b'r' && i + 1 < len { + let next = bytes[i + 1]; + if next == b'"' { + // r"..." 
— skip to closing " + i += 2; + while i < len && bytes[i] != b'"' { + i += 1; + } + i += 1; // skip closing " + continue; + } + if next == b'#' { + // Count hashes + let hash_start = i + 1; + let mut j = i + 1; + while j < len && bytes[j] == b'#' { + j += 1; + } + let hash_count = j - hash_start; + if j < len && bytes[j] == b'"' { + // Skip to closing "### + let close_pat_len = 1 + hash_count; // " + hashes + i = j + 1; + 'raw: while i < len { + if bytes[i] == b'"' { + // Check for matching hashes + let mut k = 1; + while k <= hash_count && i + k < len && bytes[i + k] == b'#' { + k += 1; + } + if k > hash_count { + i += close_pat_len; + break 'raw; + } + } + i += 1; + } + continue; + } + } + } + + // Python triple quotes: """ or ''' + if (ch == b'"' || ch == b'\'') && i + 2 < len && bytes[i + 1] == ch && bytes[i + 2] == ch { + in_string = Some(ch); + i += 3; + continue; + } + + // Regular string literals + if ch == b'"' || ch == b'\'' || ch == b'`' { + in_string = Some(ch); + i += 1; + continue; + } + + // ── Comment detection ─────────────────────────────────────────── + + // C-style line comment: // + let has_slash_slash = matches!(style, CommentStyle::CStyle | CommentStyle::PhpStyle); + if has_slash_slash && ch == b'/' && i + 1 < len && bytes[i + 1] == b'/' { + let comment_body = &line[i + 2..]; + return try_parse_directive(comment_body, line_num); + } + + // Block comment: /* + let has_block = matches!(style, CommentStyle::CStyle | CommentStyle::PhpStyle); + if has_block && ch == b'/' && i + 1 < len && bytes[i + 1] == b'*' { + // Look for closing */ on the same line + let rest = &line[i + 2..]; + if let Some(end) = rest.find("*/") { + let block_body = &rest[..end]; + // Check directive in block body + if let Some(dir) = try_parse_directive(block_body, line_num) { + return Some(dir); + } + // Continue scanning after the block + i = i + 2 + end + 2; + continue; + } else { + // Block comment extends to next line(s) + *in_block_comment = true; + let block_body = 
rest; + return try_parse_directive(block_body, line_num); + } + } + + // Hash comment: # + let has_hash = matches!(style, CommentStyle::Hash | CommentStyle::PhpStyle); + if has_hash && ch == b'#' { + let comment_body = &line[i + 1..]; + return try_parse_directive(comment_body, line_num); + } + + i += 1; + } + + None +} + +/// Try to extract a directive from a line rest (after a block comment closes). +fn extract_from_line_rest( + rest: &str, + line_num: usize, + style: CommentStyle, +) -> Option<LineDirective> { + let mut in_block = false; + scan_line_for_directive(rest, line_num, style, &mut in_block) +} + +/// Try to parse a `nyx:ignore` or `nyx:ignore-next-line` directive from +/// comment body text. Returns `None` if no directive is found. +fn try_parse_directive(text: &str, line_num: usize) -> Option<LineDirective> { + let trimmed = text.trim(); + // Strip leading `*` or `* ` common in block comments (e.g. ` * nyx:ignore ...`). + let trimmed = trimmed + .strip_prefix("* ") + .or(trimmed.strip_prefix('*')) + .unwrap_or(trimmed) + .trim(); + + // Check for `nyx:ignore-next-line` first (longer prefix wins). + if let Some(rest) = strip_directive_prefix(trimmed, "nyx:ignore-next-line") { + let matchers = parse_rule_ids(rest); + if matchers.is_empty() { + return None; + } + return Some(LineDirective { + kind: SuppressionKind::NextLine, + directive_line: line_num, + matchers, + }); + } + + if let Some(rest) = strip_directive_prefix(trimmed, "nyx:ignore") { + let matchers = parse_rule_ids(rest); + if matchers.is_empty() { + return None; + } + return Some(LineDirective { + kind: SuppressionKind::SameLine, + directive_line: line_num, + matchers, + }); + } + + None +} + +/// Strip a directive prefix, allowing optional whitespace or the rest of the +/// line to follow. +fn strip_directive_prefix<'a>(text: &'a str, prefix: &str) -> Option<&'a str> { + let rest = text.strip_prefix(prefix)?; + // Must be followed by whitespace, end of string, or nothing. 
+ // If prefix is "nyx:ignore" and rest starts with "-next-line", don't match + // (handled by checking the longer prefix first). + if rest.is_empty() || rest.starts_with(char::is_whitespace) { + Some(rest) + } else { + None + } +} + +/// Parse comma-separated rule IDs into matchers. +fn parse_rule_ids(text: &str) -> Vec<RuleMatcher> { + text.split(',') + .map(|s| s.trim()) + .filter(|s| !s.is_empty()) + .map(|s| { + if let Some(prefix) = s.strip_suffix(".*") { + RuleMatcher::WildcardSuffix(prefix.to_string()) + } else { + RuleMatcher::Exact(s.to_string()) + } + }) + .collect() +} + +// ───────────────────────────────────────────────────────────────────────────── +// Tests +// ───────────────────────────────────────────────────────────────────────────── + +#[cfg(test)] +mod tests { + use super::*; + use std::path::Path; + + fn rust_path() -> &'static Path { + Path::new("test.rs") + } + fn py_path() -> &'static Path { + Path::new("test.py") + } + fn rb_path() -> &'static Path { + Path::new("test.rb") + } + fn php_path() -> &'static Path { + Path::new("test.php") + } + fn js_path() -> &'static Path { + Path::new("test.js") + } + + // 1. `//` comment parsing + #[test] + fn slash_slash_comment_suppresses() { + let src = "let x = 1; // nyx:ignore rule.a\n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!(idx.check(1, "rule.a").is_some()); + assert!(idx.check(1, "rule.b").is_none()); + } + + // 2. `#` comment parsing + #[test] + fn hash_comment_suppresses() { + let src = "x = 1 # nyx:ignore rule.a\n"; + let idx = parse_inline_suppressions(py_path(), src); + assert!(idx.check(1, "rule.a").is_some()); + } + + // 3. `/* */` block comment + #[test] + fn block_comment_suppresses() { + let src = "let x = 1; /* nyx:ignore rule.a */\n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!(idx.check(1, "rule.a").is_some()); + } + + // 4. 
Same-line semantics + #[test] + fn same_line_only_suppresses_own_line() { + let src = "line1\nlet x = 1; // nyx:ignore rule.a\nline3\n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!(idx.check(1, "rule.a").is_none()); + assert!(idx.check(2, "rule.a").is_some()); + assert!(idx.check(3, "rule.a").is_none()); + } + + // 5. Next-line semantics + #[test] + fn next_line_suppresses_following_line() { + let src = "// nyx:ignore-next-line rule.a\nlet x = dangerous();\nline3\n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!(idx.check(1, "rule.a").is_none()); + assert!(idx.check(2, "rule.a").is_some()); + assert!(idx.check(3, "rule.a").is_none()); + } + + // 6. Multiple rule IDs + #[test] + fn multiple_rule_ids() { + let src = "let x = 1; // nyx:ignore a.b.c, x.y.z\n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!(idx.check(1, "a.b.c").is_some()); + assert!(idx.check(1, "x.y.z").is_some()); + assert!(idx.check(1, "other").is_none()); + } + + // 7. Wildcard suffix + #[test] + fn wildcard_suffix_matching() { + let src = "let x = 1; // nyx:ignore rs.quality.*\n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!(idx.check(1, "rs.quality.foo").is_some()); + assert!(idx.check(1, "rs.quality.bar").is_some()); + assert!(idx.check(1, "rs.other.foo").is_none()); + // Exact match of prefix without the dot should not match + assert!(idx.check(1, "rs.quality").is_none()); + } + + // 8. String literal guard + #[test] + fn string_literal_not_suppressed() { + let src = "let x = \"// nyx:ignore rule.a\";\n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!(idx.check(1, "rule.a").is_none()); + } + + // 9. Rust raw string guard + #[test] + fn rust_raw_string_not_suppressed() { + let src = "let x = r#\"// nyx:ignore rule.a\"#;\n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!(idx.check(1, "rule.a").is_none()); + } + + // 10. 
Rule ID mismatch + #[test] + fn rule_id_mismatch() { + let src = "let x = 1; // nyx:ignore rule-a\n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!(idx.check(1, "rule-a").is_some()); + assert!(idx.check(1, "rule-b").is_none()); + } + + // 11. Taint rule ID canonicalization + #[test] + fn taint_rule_id_canonicalization() { + let src = "let x = 1; // nyx:ignore taint-unsanitised-flow\n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!( + idx.check(1, "taint-unsanitised-flow (source 5:1)") + .is_some() + ); + assert!(idx.check(1, "taint-unsanitised-flow").is_some()); + } + + // 12. Multiple directives targeting the same line + #[test] + fn multiple_directives_same_target() { + let src = "// nyx:ignore-next-line rule-a\n// nyx:ignore-next-line rule-b\nlet x = dangerous();\n"; + let idx = parse_inline_suppressions(rust_path(), src); + // First ignore-next-line targets line 2, second targets line 3 + assert!(idx.check(2, "rule-a").is_some()); + assert!(idx.check(3, "rule-b").is_some()); + } + + // 13. Block comment with ignore-next-line + #[test] + fn block_comment_next_line() { + let src = "/* nyx:ignore-next-line rule.a */\nlet x = dangerous();\n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!(idx.check(2, "rule.a").is_some()); + } + + // 14. EOF ignore-next-line is a no-op + #[test] + fn eof_next_line_no_panic() { + let src = "// nyx:ignore-next-line rule.a"; + let idx = parse_inline_suppressions(rust_path(), src); + // Line 1 is the last line, so ignore-next-line targets line 2 which doesn't exist + assert!(idx.check(1, "rule.a").is_none()); + assert!(idx.check(2, "rule.a").is_none()); + } + + // 15. CRLF input + #[test] + fn crlf_line_endings() { + let src = "let x = 1; // nyx:ignore rule.a\r\nlet y = 2;\r\n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!(idx.check(1, "rule.a").is_some()); + assert!(idx.check(2, "rule.a").is_none()); + } + + // 16. 
Whitespace tolerance + #[test] + fn whitespace_tolerance() { + let src = "let x = 1; // nyx:ignore rule.a, rule.b \n"; + let idx = parse_inline_suppressions(rust_path(), src); + assert!(idx.check(1, "rule.a").is_some()); + assert!(idx.check(1, "rule.b").is_some()); + } + + // 17. PHP multi-style comments + #[test] + fn php_multi_style() { + let src_hash = ", +} + +/// A single taint origin — the node and classification of where taint came from. +#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)] +pub struct TaintOrigin { + pub node: NodeIndex, + pub source_kind: SourceKind, +} + +/// Compact bitset for up to 64 variables (indexed by SymbolId ordinal). +#[derive(Clone, Copy, Debug, PartialEq, Eq)] +pub struct SmallBitSet(u64); + +impl SmallBitSet { + pub fn empty() -> Self { + Self(0) + } + + pub fn insert(&mut self, id: SymbolId) { + let idx = id.0; + if idx < 64 { + self.0 |= 1u64 << idx; + } + } + + pub fn contains(&self, id: SymbolId) -> bool { + let idx = id.0; + if idx < 64 { + self.0 & (1u64 << idx) != 0 + } else { + false + } + } + + /// Union: self | other + pub fn union(self, other: Self) -> Self { + Self(self.0 | other.0) + } + + /// Intersection: self & other + pub fn intersection(self, other: Self) -> Self { + Self(self.0 & other.0) + } + + #[allow(dead_code)] + pub fn is_empty(self) -> bool { + self.0 == 0 + } + + /// Whether self is a subset of other. + #[allow(dead_code)] // used by Lattice::leq + pub fn is_subset_of(self, other: Self) -> bool { + self.0 & other.0 == self.0 + } + + /// Whether self is a superset of other. + #[allow(dead_code)] // used by Lattice::leq + pub fn is_superset_of(self, other: Self) -> bool { + other.is_subset_of(self) + } +} + +/// Monotone predicate summary per variable. +/// +/// Tracks which whitelisted predicate kinds are known true/false on ALL paths. +/// join = intersection of bits (must-hold semantics). 
+#[derive(Clone, Copy, Debug, PartialEq, Eq)] +pub struct PredicateSummary { + /// Bitmask: bit 0=NullCheck, 1=EmptyCheck, 2=ErrorCheck + pub known_true: u8, + pub known_false: u8, +} + +impl PredicateSummary { + pub fn empty() -> Self { + Self { + known_true: 0, + known_false: 0, + } + } + + /// Join = intersection (only predicates true on ALL paths). + pub fn join(self, other: Self) -> Self { + Self { + known_true: self.known_true & other.known_true, + known_false: self.known_false & other.known_false, + } + } + + /// Check for contradiction: same kind known both true and false. + pub fn has_contradiction(self) -> bool { + self.known_true & self.known_false != 0 + } + + pub fn is_empty(self) -> bool { + self.known_true == 0 && self.known_false == 0 + } +} + +/// Map a whitelisted PredicateKind to its bit index (0-2). +/// Returns None for non-whitelisted kinds. +pub fn predicate_kind_bit(kind: PredicateKind) -> Option { + match kind { + PredicateKind::NullCheck => Some(0), + PredicateKind::EmptyCheck => Some(1), + PredicateKind::ErrorCheck => Some(2), + _ => None, + } +} + +/// The abstract taint state at a program point. +/// +/// Uses sorted SmallVec keyed by SymbolId for O(n) merge-join. +/// Variables beyond the interner's capacity are naturally excluded. +#[derive(Clone, Debug, PartialEq, Eq)] +pub struct TaintState { + /// Per-variable taint, sorted by SymbolId. + pub vars: SmallVec<[(SymbolId, VarTaint); 16]>, + + /// Variables validated on ALL paths (intersection on join). + pub validated_must: SmallBitSet, + + /// Variables validated on ANY path (union on join). + pub validated_may: SmallBitSet, + + /// Per-variable predicate summary (sorted by SymbolId). + pub predicates: SmallVec<[(SymbolId, PredicateSummary); 4]>, +} + +impl TaintState { + /// Create the initial state (no taint, no validation, no predicates). 
+ pub fn initial() -> Self { + Self { + vars: SmallVec::new(), + validated_must: SmallBitSet::empty(), + validated_may: SmallBitSet::empty(), + predicates: SmallVec::new(), + } + } + + /// Look up taint for a variable. + pub fn get(&self, sym: SymbolId) -> Option<&VarTaint> { + self.vars + .binary_search_by_key(&sym, |(id, _)| *id) + .ok() + .map(|idx| &self.vars[idx].1) + } + + /// Insert or update taint for a variable. + pub fn set(&mut self, sym: SymbolId, taint: VarTaint) { + match self.vars.binary_search_by_key(&sym, |(id, _)| *id) { + Ok(idx) => self.vars[idx].1 = taint, + Err(idx) => self.vars.insert(idx, (sym, taint)), + } + } + + /// Remove taint for a variable. + pub fn remove(&mut self, sym: SymbolId) { + if let Ok(idx) = self.vars.binary_search_by_key(&sym, |(id, _)| *id) { + self.vars.remove(idx); + } + } + + /// Set a predicate summary for a variable. + pub fn set_predicate(&mut self, sym: SymbolId, summary: PredicateSummary) { + match self.predicates.binary_search_by_key(&sym, |(id, _)| *id) { + Ok(idx) => self.predicates[idx].1 = summary, + Err(idx) => self.predicates.insert(idx, (sym, summary)), + } + } + + /// Get predicate summary for a variable. + pub fn get_predicate(&self, sym: SymbolId) -> PredicateSummary { + self.predicates + .binary_search_by_key(&sym, |(id, _)| *id) + .ok() + .map(|idx| self.predicates[idx].1) + .unwrap_or_else(PredicateSummary::empty) + } + + /// Check if any variable has contradictory predicates. 
+ pub fn has_contradiction(&self) -> bool { + self.predicates.iter().any(|(_, s)| s.has_contradiction()) + } +} + +impl Lattice for TaintState { + fn bot() -> Self { + Self::initial() + } + + fn join(&self, other: &Self) -> Self { + // Merge-join vars (sorted by SymbolId) + let vars = merge_join_vars(&self.vars, &other.vars); + + // validated_must = intersection (must hold on ALL paths) + let validated_must = self.validated_must.intersection(other.validated_must); + + // validated_may = union (holds on ANY path) + let validated_may = self.validated_may.union(other.validated_may); + + // predicates = per-key intersection of known_true/known_false bits + let predicates = merge_join_predicates(&self.predicates, &other.predicates); + + TaintState { + vars, + validated_must, + validated_may, + predicates, + } + } + + fn leq(&self, other: &Self) -> bool { + // Per-key Cap subset + origins subset + if !vars_leq(&self.vars, &other.vars) { + return false; + } + + // validated_must: self ⊇ other (superset = less info = lower) + if !self.validated_must.is_superset_of(other.validated_must) { + return false; + } + + // validated_may: self ⊆ other + if !self.validated_may.is_subset_of(other.validated_may) { + return false; + } + + // predicates: self.known_true ⊇ other.known_true (more precise = lower) + predicates_leq(&self.predicates, &other.predicates) + } +} + +/// Merge-join two sorted var lists: per-key Cap OR + origins merge (bounded). 
+fn merge_join_vars( + a: &[(SymbolId, VarTaint)], + b: &[(SymbolId, VarTaint)], +) -> SmallVec<[(SymbolId, VarTaint); 16]> { + let mut result = SmallVec::with_capacity(a.len().max(b.len())); + let (mut i, mut j) = (0, 0); + + while i < a.len() && j < b.len() { + match a[i].0.cmp(&b[j].0) { + std::cmp::Ordering::Less => { + result.push(a[i].clone()); + i += 1; + } + std::cmp::Ordering::Greater => { + result.push(b[j].clone()); + j += 1; + } + std::cmp::Ordering::Equal => { + let caps = a[i].1.caps | b[j].1.caps; + let origins = merge_origins(&a[i].1.origins, &b[j].1.origins); + result.push((a[i].0, VarTaint { caps, origins })); + i += 1; + j += 1; + } + } + } + + // Remaining from either side + while i < a.len() { + result.push(a[i].clone()); + i += 1; + } + while j < b.len() { + result.push(b[j].clone()); + j += 1; + } + + result +} + +/// Merge two origin lists, deduplicating by node and bounding at MAX_ORIGINS_PER_VAR. +fn merge_origins( + a: &SmallVec<[TaintOrigin; 2]>, + b: &SmallVec<[TaintOrigin; 2]>, +) -> SmallVec<[TaintOrigin; 2]> { + let mut merged = a.clone(); + for origin in b { + if merged.len() >= MAX_ORIGINS_PER_VAR { + break; + } + if !merged.iter().any(|o| o.node == origin.node) { + merged.push(*origin); + } + } + merged +} + +/// Check if a.vars ⊑ b.vars (per-key Cap subset + origins subset). 
+#[allow(dead_code)] // called by Lattice::leq +fn vars_leq(a: &[(SymbolId, VarTaint)], b: &[(SymbolId, VarTaint)]) -> bool { + let (mut i, mut j) = (0, 0); + + while i < a.len() { + if j >= b.len() { + return false; // a has keys not in b → not ⊑ + } + match a[i].0.cmp(&b[j].0) { + std::cmp::Ordering::Less => return false, // key in a but not b + std::cmp::Ordering::Greater => { + j += 1; // key only in b, skip + } + std::cmp::Ordering::Equal => { + // Cap subset check + if a[i].1.caps & b[j].1.caps != a[i].1.caps { + return false; + } + // Origins subset check (by node) + for orig in &a[i].1.origins { + if !b[j].1.origins.iter().any(|o| o.node == orig.node) { + return false; + } + } + i += 1; + j += 1; + } + } + } + true +} + +/// Merge-join predicate summaries with intersection semantics. +fn merge_join_predicates( + a: &[(SymbolId, PredicateSummary)], + b: &[(SymbolId, PredicateSummary)], +) -> SmallVec<[(SymbolId, PredicateSummary); 4]> { + let mut result = SmallVec::new(); + let (mut i, mut j) = (0, 0); + + while i < a.len() && j < b.len() { + match a[i].0.cmp(&b[j].0) { + std::cmp::Ordering::Less => { + // Key only in a — intersection with empty = empty → drop + i += 1; + } + std::cmp::Ordering::Greater => { + j += 1; + } + std::cmp::Ordering::Equal => { + let joined = a[i].1.join(b[j].1); + if !joined.is_empty() { + result.push((a[i].0, joined)); + } + i += 1; + j += 1; + } + } + } + // Keys only in one side → intersection with empty = drop + + result +} + +/// Check if a.predicates ⊑ b.predicates. +/// More precise (more known_true bits) = lower in the lattice. +/// So a ⊑ b means a.known_true ⊇ b.known_true for each key. 
+#[allow(dead_code)] // called by Lattice::leq +fn predicates_leq(a: &[(SymbolId, PredicateSummary)], b: &[(SymbolId, PredicateSummary)]) -> bool { + let (mut i, mut j) = (0, 0); + + // For each key in b, a must have at least as many bits + while j < b.len() { + if i >= a.len() { + // b has keys that a doesn't — a is missing info = not lower + return false; + } + match a[i].0.cmp(&b[j].0) { + std::cmp::Ordering::Less => { + // a has extra keys (more info) — OK for leq + i += 1; + } + std::cmp::Ordering::Greater => { + // b has a key that a doesn't → a has fewer bits → not ⊑ + return false; + } + std::cmp::Ordering::Equal => { + // a.known_true must be a superset of b.known_true + if a[i].1.known_true & b[j].1.known_true != b[j].1.known_true { + return false; + } + if a[i].1.known_false & b[j].1.known_false != b[j].1.known_false { + return false; + } + i += 1; + j += 1; + } + } + } + true +} + +#[cfg(test)] +mod tests { + use super::*; + + fn make_taint(sym: u32, caps: Cap) -> (SymbolId, VarTaint) { + ( + SymbolId(sym), + VarTaint { + caps, + origins: SmallVec::new(), + }, + ) + } + + fn make_taint_with_origin(sym: u32, caps: Cap, node: usize) -> (SymbolId, VarTaint) { + ( + SymbolId(sym), + VarTaint { + caps, + origins: smallvec::smallvec![TaintOrigin { + node: NodeIndex::new(node), + source_kind: SourceKind::Unknown, + }], + }, + ) + } + + fn state_with_vars(vars: Vec<(SymbolId, VarTaint)>) -> TaintState { + let mut s = TaintState::initial(); + s.vars = SmallVec::from_vec(vars); + s + } + + // ── Lattice property tests ────────────────────────────────────────── + + #[test] + fn bot_identity() { + let a = state_with_vars(vec![make_taint(0, Cap::ENV_VAR)]); + assert_eq!(a.join(&TaintState::bot()), a); + assert_eq!(TaintState::bot().join(&a), a); + } + + #[test] + fn join_commutativity() { + let a = state_with_vars(vec![make_taint(0, Cap::ENV_VAR)]); + let b = state_with_vars(vec![make_taint(1, Cap::SHELL_ESCAPE)]); + assert_eq!(a.join(&b), b.join(&a)); + } + + 
#[test] + fn join_associativity() { + let a = state_with_vars(vec![make_taint(0, Cap::ENV_VAR)]); + let b = state_with_vars(vec![make_taint(0, Cap::SHELL_ESCAPE)]); + let c = state_with_vars(vec![make_taint(1, Cap::HTML_ESCAPE)]); + assert_eq!(a.join(&b).join(&c), a.join(&b.join(&c))); + } + + #[test] + fn join_idempotency() { + let a = state_with_vars(vec![make_taint(0, Cap::ENV_VAR | Cap::SHELL_ESCAPE)]); + assert_eq!(a.join(&a), a); + } + + #[test] + fn leq_reflexive() { + let a = state_with_vars(vec![make_taint(0, Cap::ENV_VAR)]); + assert!(a.leq(&a)); + } + + #[test] + fn leq_consistent_with_join() { + let a = state_with_vars(vec![make_taint(0, Cap::ENV_VAR)]); + let b = state_with_vars(vec![make_taint(0, Cap::ENV_VAR | Cap::SHELL_ESCAPE)]); + assert!(a.leq(&b)); + assert_eq!(a.join(&b), b); + } + + #[test] + fn join_merges_caps() { + let a = state_with_vars(vec![make_taint(0, Cap::ENV_VAR)]); + let b = state_with_vars(vec![make_taint(0, Cap::SHELL_ESCAPE)]); + let joined = a.join(&b); + assert_eq!( + joined.get(SymbolId(0)).unwrap().caps, + Cap::ENV_VAR | Cap::SHELL_ESCAPE + ); + } + + #[test] + fn join_merges_origins() { + let a = state_with_vars(vec![make_taint_with_origin(0, Cap::ENV_VAR, 1)]); + let b = state_with_vars(vec![make_taint_with_origin(0, Cap::ENV_VAR, 2)]); + let joined = a.join(&b); + assert_eq!(joined.get(SymbolId(0)).unwrap().origins.len(), 2); + } + + #[test] + fn validated_must_intersection() { + let mut a = TaintState::initial(); + a.validated_must.insert(SymbolId(0)); + a.validated_must.insert(SymbolId(1)); + + let mut b = TaintState::initial(); + b.validated_must.insert(SymbolId(1)); + b.validated_must.insert(SymbolId(2)); + + let joined = a.join(&b); + assert!(!joined.validated_must.contains(SymbolId(0))); + assert!(joined.validated_must.contains(SymbolId(1))); + assert!(!joined.validated_must.contains(SymbolId(2))); + } + + #[test] + fn validated_may_union() { + let mut a = TaintState::initial(); + 
a.validated_may.insert(SymbolId(0)); + + let mut b = TaintState::initial(); + b.validated_may.insert(SymbolId(1)); + + let joined = a.join(&b); + assert!(joined.validated_may.contains(SymbolId(0))); + assert!(joined.validated_may.contains(SymbolId(1))); + } + + #[test] + fn predicate_contradiction() { + let mut state = TaintState::initial(); + state.set_predicate( + SymbolId(0), + PredicateSummary { + known_true: 1, // NullCheck true + known_false: 1, // NullCheck false + }, + ); + assert!(state.has_contradiction()); + } + + #[test] + fn predicate_no_contradiction() { + let mut state = TaintState::initial(); + state.set_predicate( + SymbolId(0), + PredicateSummary { + known_true: 1, // NullCheck true + known_false: 2, // EmptyCheck false (different kind) + }, + ); + assert!(!state.has_contradiction()); + } + + #[test] + fn predicate_join_intersection() { + let mut a = TaintState::initial(); + a.set_predicate( + SymbolId(0), + PredicateSummary { + known_true: 0b011, // NullCheck + EmptyCheck + known_false: 0, + }, + ); + + let mut b = TaintState::initial(); + b.set_predicate( + SymbolId(0), + PredicateSummary { + known_true: 0b010, // EmptyCheck only + known_false: 0, + }, + ); + + let joined = a.join(&b); + let pred = joined.get_predicate(SymbolId(0)); + assert_eq!(pred.known_true, 0b010); // only EmptyCheck on both paths + } + + // ── SmallBitSet tests ─────────────────────────────────────────────── + + #[test] + fn small_bitset_basic() { + let mut bs = SmallBitSet::empty(); + assert!(bs.is_empty()); + + bs.insert(SymbolId(0)); + assert!(bs.contains(SymbolId(0))); + assert!(!bs.contains(SymbolId(1))); + assert!(!bs.is_empty()); + } + + #[test] + fn small_bitset_union_intersection() { + let mut a = SmallBitSet::empty(); + a.insert(SymbolId(0)); + a.insert(SymbolId(2)); + + let mut b = SmallBitSet::empty(); + b.insert(SymbolId(1)); + b.insert(SymbolId(2)); + + let u = a.union(b); + assert!(u.contains(SymbolId(0))); + assert!(u.contains(SymbolId(1))); + 
assert!(u.contains(SymbolId(2))); + + let i = a.intersection(b); + assert!(!i.contains(SymbolId(0))); + assert!(!i.contains(SymbolId(1))); + assert!(i.contains(SymbolId(2))); + } +} diff --git a/src/taint/mod.rs b/src/taint/mod.rs index 729cf233..ddc99e0d 100644 --- a/src/taint/mod.rs +++ b/src/taint/mod.rs @@ -1,11 +1,21 @@ -use crate::cfg::{Cfg, FuncSummaries, NodeInfo, StmtKind}; +pub mod domain; +pub mod path_state; +pub mod transfer; + +use crate::cfg::{Cfg, FuncSummaries}; use crate::interop::InteropEdge; -use crate::labels::{Cap, DataLabel, SourceKind}; +use crate::labels::SourceKind; +use crate::state::engine::{self, MAX_TRACKED_VARS}; +use crate::state::lattice::Lattice; +use crate::state::symbol::SymbolInterner; use crate::summary::GlobalSummaries; use crate::symbol::Lang; +use domain::TaintState; +use path_state::PredicateKind; use petgraph::graph::NodeIndex; -use std::collections::HashMap; -use tracing::debug; +use petgraph::visit::IntoNodeReferences; +use std::collections::HashSet; +use transfer::{TaintEvent, TaintTransfer}; /// A detected taint finding with both source and sink locations. #[derive(Debug, Clone)] @@ -20,269 +30,23 @@ pub struct Finding { pub path: Vec, /// The kind of source that originated the taint. pub source_kind: SourceKind, -} - -/// Order-independent hash of a taint map. -/// -/// Uses XOR of per-entry hashes so the result is the same regardless of -/// iteration order — no allocation or sorting required. -fn taint_hash(taint: &HashMap<String, Cap>) -> u64 { - let mut h: u64 = 0; - for (k, bits) in taint { - // Per-entry hash: FNV-1a-style mixing of key bytes + cap bits. 
- let mut entry_h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis - for b in k.as_bytes() { - entry_h ^= *b as u64; - entry_h = entry_h.wrapping_mul(0x0100_0000_01b3); // FNV prime - } - entry_h ^= bits.bits() as u64; - entry_h = entry_h.wrapping_mul(0x0100_0000_01b3); - h ^= entry_h; - } - h -} - -/// Resolved summary for a callee — a uniform view regardless of whether the -/// summary came from a local (same‑file) or global (cross‑file) source. -struct ResolvedSummary { - source_caps: Cap, - sanitizer_caps: Cap, - sink_caps: Cap, - propagates_taint: bool, -} - -/// Try to resolve a callee name using conservative same-language resolution. -/// -/// Resolution order: -/// 1. Local (same-file): exact name + same lang + same namespace -/// 2. Global same-language: via `lookup_same_lang`; must be unambiguous -/// 3. Interop edges: explicit cross-language bridges -/// 4. No cross-language fallback -#[allow(clippy::too_many_arguments)] -fn resolve_callee( - callee: &str, - caller_lang: Lang, - caller_namespace: &str, - caller_func: &str, - call_ordinal: u32, - local: &FuncSummaries, - global: Option<&GlobalSummaries>, - interop_edges: &[InteropEdge], -) -> Option<ResolvedSummary> { - // 1) Local (same-file): scan local summaries for matching name + lang + namespace - let local_matches: Vec<_> = local - .iter() - .filter(|(k, _)| { - k.name == callee && k.lang == caller_lang && k.namespace == caller_namespace - }) - .collect(); - - if local_matches.len() == 1 { - let (_, ls) = local_matches[0]; - return Some(ResolvedSummary { - source_caps: ls.source_caps, - sanitizer_caps: ls.sanitizer_caps, - sink_caps: ls.sink_caps, - propagates_taint: ls.propagates_taint, - }); - } - - // Multiple local matches — try arity disambiguation (future), for now return None - if local_matches.len() > 1 { - return None; - } - - // 2) Global same-language - if let Some(gs) = global { - let matches = gs.lookup_same_lang(caller_lang, callee); - if matches.len() == 1 { - let (_, fs) = 
matches[0]; - return 
Some(ResolvedSummary { - source_caps: fs.source_caps(), - sanitizer_caps: fs.sanitizer_caps(), - sink_caps: fs.sink_caps(), - propagates_taint: fs.propagates_taint, - }); - } - // Multiple matches — try namespace match first - if matches.len() > 1 { - let same_ns: Vec<_> = matches - .iter() - .filter(|(k, _)| k.namespace == caller_namespace) - .collect(); - if same_ns.len() == 1 { - let (_, fs) = same_ns[0]; - return Some(ResolvedSummary { - source_caps: fs.source_caps(), - sanitizer_caps: fs.sanitizer_caps(), - sink_caps: fs.sink_caps(), - propagates_taint: fs.propagates_taint, - }); - } - // Still ambiguous — return None (conservative) - return None; - } - } - - // 3) Interop edges: explicit cross-language bridges - for edge in interop_edges { - if edge.from.caller_lang == caller_lang - && edge.from.caller_namespace == caller_namespace - && edge.from.callee_symbol == callee - && (edge.from.caller_func.is_empty() || edge.from.caller_func == caller_func) - && (edge.from.ordinal == 0 || edge.from.ordinal == call_ordinal) - { - // Look up the target in global summaries by exact FuncKey - if let Some(gs) = global - && let Some(fs) = gs.get(&edge.to) - { - return Some(ResolvedSummary { - source_caps: fs.source_caps(), - sanitizer_caps: fs.sanitizer_caps(), - sink_caps: fs.sink_caps(), - propagates_taint: fs.propagates_taint, - }); - } - } - } - - // 4) No cross-language fallback - None -} - -/// Apply taint transfer for a single node, mutating `out` in place. -/// -/// Callers should clone the taint map before calling if they need -/// the original state preserved. 
-fn apply_taint( - node: &NodeInfo, - out: &mut HashMap<String, Cap>, - local_summaries: &FuncSummaries, - global_summaries: Option<&GlobalSummaries>, - caller_lang: Lang, - caller_namespace: &str, - interop_edges: &[InteropEdge], -) { - debug!(target: "taint", "Applying taint to node: {:?}", node); - debug!(target: "taint", "Taint: {:?}", out); - - let caller_func = node.enclosing_func.as_deref().unwrap_or(""); - - match node.label { - // A new untrusted value enters the program - Some(DataLabel::Source(bits)) => { - if let Some(v) = &node.defines { - out.insert(v.clone(), bits); - } - } - // Sanitizer: propagate input taint through the assignment FIRST, - // then strip the sanitizer's capability bits. This ensures that - // `let y = sanitize_html(&x)` gives y the taint of x minus the - // HTML_ESCAPE bit — rather than leaving y completely clean (which - // would hide "wrong sanitiser for this sink" bugs). - Some(DataLabel::Sanitizer(bits)) => { - if let Some(v) = &node.defines { - // 1. Propagate: union taint from all read variables - let mut combined = Cap::empty(); - for u in &node.uses { - if let Some(b) = out.get(u) { - combined |= *b; - } - } - // 2. Strip the sanitiser's bits - let new = combined & !bits; - if new.is_empty() { - out.remove(v); - } else { - out.insert(v.clone(), new); - } - } - } - - // A function call — resolve against local + global summaries - _ if node.kind == StmtKind::Call => { - if let Some(callee) = &node.callee - && let Some(resolved) = resolve_callee( - callee, - caller_lang, - caller_namespace, - caller_func, - node.call_ordinal, - local_summaries, - global_summaries, - interop_edges, - ) - { - // Build the return value's taint bits in stages, then - // write once at the end. Order matters: - // - // 1. Start with fresh source taint (if the callee is a source) - // 2. Union with propagated arg taint (if the callee propagates) - // 3. Strip sanitizer bits last (so sanitization always wins) - - let mut return_bits = Cap::empty(); - - // ── 1. 
Source behaviour ── - return_bits |= resolved.source_caps; - - // ── 2. Propagation ── - if resolved.propagates_taint { - for u in &node.uses { - if let Some(bits) = out.get(u) { - return_bits |= *bits; - } - } - } - - // ── 3. Sanitizer behaviour (applied last so it always wins) ── - return_bits &= !resolved.sanitizer_caps; - - // ── Write the result ── - if let Some(v) = &node.defines { - if return_bits.is_empty() { - out.remove(v); - } else { - out.insert(v.clone(), return_bits); - } - } - - // ── Sink behaviour: handled in the main analysis loop - // (checked via node.label or resolved summary) ── - - return; - } - - // Unresolved call — fall through to default gen/kill below - } - - // All other statements: classic gen/kill for assignments - _ => {} - } - - // Default gen/kill: propagate taint through variable assignments - if !matches!( - node.label, - Some(DataLabel::Source(_)) | Some(DataLabel::Sanitizer(_)) - ) && let Some(d) = &node.defines - { - let mut combined = Cap::empty(); - for u in &node.uses { - if let Some(bits) = out.get(u) { - combined |= *bits; - } - } - if combined.is_empty() { - out.remove(d); - } else { - out.insert(d.clone(), combined); - } - } + /// Whether all tainted sink variables are guarded by a validation + /// predicate on this path (metadata only — does not change severity). + #[allow(dead_code)] // surfaced in Diag output (task 4) + pub path_validated: bool, + /// The kind of validation guard protecting this path, if any. + #[allow(dead_code)] // surfaced in Diag output (task 4) + pub guard_kind: Option<PredicateKind>, } /// Run taint analysis on a single file's CFG. /// -/// `global_summaries` is `None` for pass‑1 / single‑file mode and -/// `Some(&map)` for pass‑2 cross‑file analysis. +/// Uses a monotone forward dataflow analysis via `state::engine::run_forward` +/// with the `TaintTransfer` function. Termination is guaranteed by lattice +/// finiteness (bounded `Cap` bits × bounded variable count). 
+/// +/// For JS/TS files: uses a two-level solve to prevent cross-function taint +/// leakage while preserving global-to-function flows. pub fn analyse_file( cfg: &Cfg, entry: NodeIndex, @@ -292,162 +56,155 @@ pub fn analyse_file( caller_namespace: &str, interop_edges: &[InteropEdge], ) -> Vec { - use std::collections::{HashMap, HashSet, VecDeque}; + let _span = tracing::debug_span!("taint_analyse_file").entered(); - /// Queue item: current CFG node + taint map that holds here - #[derive(Clone)] - struct Item { - node: NodeIndex, - taint: HashMap, + // 1. Build symbol interner from CFG + let interner = SymbolInterner::from_cfg(cfg); + + if interner.len() > MAX_TRACKED_VARS { + tracing::warn!( + symbols = interner.len(), + max = MAX_TRACKED_VARS, + "taint analysis: too many variables, some will be ignored" + ); } - // (node, taint_hash) → predecessor key (for path rebuild) - type Key = (NodeIndex, u64); - let mut pred: HashMap = HashMap::new(); + // 2. Build base transfer function + let base_transfer = TaintTransfer { + lang: caller_lang, + namespace: caller_namespace, + interner: &interner, // also used for events_to_findings below + local_summaries, + global_summaries, + interop_edges, + global_seed: None, + scope_filter: None, + }; - // Seen states so we do not revisit them infinitely - let mut seen: HashSet = HashSet::new(); + // 3. Run analysis (two-level for JS/TS, single-pass otherwise) + let events = if matches!(caller_lang, Lang::JavaScript | Lang::TypeScript) { + analyse_js_two_level(cfg, entry, &interner, &base_transfer) + } else { + let result = engine::run_forward(cfg, entry, &base_transfer, TaintState::initial()); + result.events + }; - // Resulting findings: (sink_node, source_node, full_path) - let mut findings: Vec = Vec::new(); + // 4. 
Convert events to findings + let mut findings = events_to_findings(&events, &interner); - let mut q = VecDeque::new(); - q.push_back(Item { - node: entry, - taint: HashMap::new(), - }); - seen.insert((entry, 0)); + // 5. Deduplicate findings by (sink, source), prefer path_validated=true + findings.sort_by_key(|f| (f.sink.index(), f.source.index(), !f.path_validated)); + findings.dedup_by_key(|f| (f.sink, f.source)); - while let Some(Item { node, taint }) = q.pop_front() { - let caller_func = cfg[node].enclosing_func.as_deref().unwrap_or(""); - let mut out = taint.clone(); - apply_taint( - &cfg[node], - &mut out, - local_summaries, - global_summaries, - caller_lang, - caller_namespace, - interop_edges, - ); + findings +} - // ── Sink check ────────────────────────────────────────────────── - // Two ways a node can be a sink: - // 1. Its AST label says Sink (existing inline labels) - // 2. Its callee resolves to a function with sink_caps (cross-file) - let sink_caps = match cfg[node].label { - Some(DataLabel::Sink(caps)) => caps, - _ => { - // check if callee resolves to a sink - cfg[node] - .callee - .as_ref() - .and_then(|c| { - resolve_callee( - c, - caller_lang, - caller_namespace, - caller_func, - cfg[node].call_ordinal, - local_summaries, - global_summaries, - interop_edges, - ) - }) - .filter(|r| !r.sink_caps.is_empty()) - .map(|r| r.sink_caps) - .unwrap_or(Cap::empty()) - } +/// JS/TS two-level solve to prevent cross-function taint leakage. +/// +/// Level 1: Solve top-level code (nodes where `enclosing_func.is_none()`). +/// Level 2: For each function, solve seeded with top-level taint. 
+fn analyse_js_two_level( + cfg: &Cfg, + entry: NodeIndex, + _interner: &SymbolInterner, + base_transfer: &TaintTransfer, +) -> Vec { + // Level 1: solve top-level only + let toplevel_transfer = TaintTransfer { + lang: base_transfer.lang, + namespace: base_transfer.namespace, + interner: base_transfer.interner, + local_summaries: base_transfer.local_summaries, + global_summaries: base_transfer.global_summaries, + interop_edges: base_transfer.interop_edges, + global_seed: None, + scope_filter: Some(None), // top-level only (enclosing_func == None) + }; + + let toplevel_result = + engine::run_forward(cfg, entry, &toplevel_transfer, TaintState::initial()); + + // Extract top-level taint state at the last converged point + let toplevel_state = extract_exit_state(&toplevel_result.states); + + // Level 2: solve each function seeded with top-level state + let mut all_events = toplevel_result.events; + + let func_entries = find_function_entries(cfg); + for (func_name, func_entry) in &func_entries { + let func_transfer = TaintTransfer { + lang: base_transfer.lang, + namespace: base_transfer.namespace, + interner: base_transfer.interner, + local_summaries: base_transfer.local_summaries, + global_summaries: base_transfer.global_summaries, + interop_edges: base_transfer.interop_edges, + global_seed: Some(&toplevel_state), + scope_filter: Some(Some(func_name.as_str())), }; - if !sink_caps.is_empty() { - let bad = cfg[node] - .uses - .iter() - .any(|u| out.get(u).is_some_and(|b| (*b & sink_caps) != Cap::empty())); - if bad { - // Reconstruct path backwards from sink to source. - // - // A node is considered a "source" if: - // 1. It has an inline DataLabel::Source (same-file), OR - // 2. It is a Call whose callee resolves to a source via - // local or global summaries (cross-file). 
- let sink_node = node; - let mut path = vec![node]; - let mut source_node = node; // fallback: sink itself - let mut key = (node, taint_hash(&taint)); + let func_result = + engine::run_forward(cfg, *func_entry, &func_transfer, TaintState::initial()); + all_events.extend(func_result.events); + } - while let Some(&(prev, prev_hash)) = pred.get(&key) { - path.push(prev); + all_events +} - // Check inline source label - if matches!(cfg[prev].label, Some(DataLabel::Source(_))) { - source_node = prev; - break; - } +/// Extract the "best" taint state from converged states (join all exit/reachable states). +fn extract_exit_state(states: &std::collections::HashMap) -> TaintState { + let mut result = TaintState::initial(); + for state in states.values() { + result = result.join(state); + } + result +} - // Check cross-file source via resolved callee summary - let prev_caller_func = cfg[prev].enclosing_func.as_deref().unwrap_or(""); - if cfg[prev].kind == StmtKind::Call - && let Some(callee) = &cfg[prev].callee - && let Some(resolved) = resolve_callee( - callee, - caller_lang, - caller_namespace, - prev_caller_func, - cfg[prev].call_ordinal, - local_summaries, - global_summaries, - interop_edges, - ) - && !resolved.source_caps.is_empty() - { - source_node = prev; - break; - } +/// Find function entry nodes: (func_name, entry_node) pairs. +/// +/// A function entry is the first node with a given `enclosing_func` value. 
+fn find_function_entries(cfg: &Cfg) -> Vec<(String, NodeIndex)> { + let mut seen = HashSet::new(); + let mut entries = Vec::new(); - key = (prev, prev_hash); - } - - path.reverse(); - - // Infer the source kind from the source node's label and callee - let source_kind = match cfg[source_node].label { - Some(DataLabel::Source(caps)) => { - let callee = cfg[source_node].callee.as_deref().unwrap_or(""); - crate::labels::infer_source_kind(caps, callee) - } - _ => SourceKind::Unknown, - }; - - findings.push(Finding { - sink: sink_node, - source: source_node, - path, - source_kind, - }); - } + for (idx, info) in cfg.node_references() { + if let Some(ref func_name) = info.enclosing_func + && seen.insert(func_name.clone()) + { + entries.push((func_name.clone(), idx)); } + } - // enqueue successors — cache hashes to avoid recomputation - let out_h = taint_hash(&out); - let in_h = taint_hash(&taint); - let succs: Vec<_> = cfg.neighbors(node).collect(); - for (i, succ) in succs.iter().enumerate() { - let key = (*succ, out_h); - if !seen.contains(&key) { - seen.insert(key); - pred.insert(key, (node, in_h)); - // Move the map into the last successor to avoid a clone - let taint_for_succ = if i + 1 == succs.len() { - std::mem::take(&mut out) - } else { - out.clone() - }; - q.push_back(Item { - node: *succ, - taint: taint_for_succ, - }); + entries +} + +/// Convert TaintEvents into Findings. +fn events_to_findings(events: &[TaintEvent], _interner: &SymbolInterner) -> Vec { + let mut findings = Vec::new(); + + for event in events { + let TaintEvent::SinkReached { + sink_node, + tainted_vars, + all_validated, + guard_kind, + .. 
+ } = event; + + // Collect unique origins across all tainted vars at this sink + let mut seen_origins: HashSet<(usize, usize)> = HashSet::new(); + for (_sym, _caps, origins) in tainted_vars { + for origin in origins { + if seen_origins.insert((origin.node.index(), sink_node.index())) { + findings.push(Finding { + sink: *sink_node, + source: origin.node, + path: vec![origin.node, *sink_node], + source_kind: origin.source_kind, + path_validated: *all_validated, + guard_kind: *guard_kind, + }); + } } } } diff --git a/src/taint/path_state.rs b/src/taint/path_state.rs new file mode 100644 index 00000000..389e005a --- /dev/null +++ b/src/taint/path_state.rs @@ -0,0 +1,234 @@ +// ─── PredicateKind ─────────────────────────────────────────────────────────── + +/// Classification of what an if-condition tests. +/// +/// Determined by heuristic analysis of the raw condition text. +/// Classification is conservative: prefer [`Unknown`](PredicateKind::Unknown) +/// over a wrong guess. +#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)] +pub enum PredicateKind { + /// `x.is_none()`, `x == null`, `x == nil`, `x is None` + NullCheck, + /// `x.is_empty()`, `x.len() == 0`, `x == ""` + EmptyCheck, + /// `x.is_err()`, `x.is_ok()`, `err != nil` + ErrorCheck, + /// Call to a validation/guard function: `validate(x)`, `is_safe(x)` + ValidationCall, + /// Call to a sanitizer function: `sanitize(x)`, `escape(x)` + SanitizerCall, + /// Comparison operators: `x == 5`, `x > threshold` + Comparison, + /// Generic boolean test — cannot classify further. + Unknown, +} + +/// Classify a raw condition text into a [`PredicateKind`]. +/// +/// # Rules +/// +/// - Empty/None text → [`Unknown`](PredicateKind::Unknown). +/// - `ValidationCall` / `SanitizerCall` require a `(` in the text **and** a +/// matching callee token. This avoids misclassifying comparisons like +/// `x_valid == true`. +/// - Prefers [`Unknown`](PredicateKind::Unknown) over false positives. 
+pub fn classify_condition(text: &str) -> PredicateKind { + if text.is_empty() { + return PredicateKind::Unknown; + } + + let lower = text.to_ascii_lowercase(); + + // ── Error checks (before null checks: `err != nil` is an error check, + // not a null check, even though it contains `!= nil`) ────────────── + if lower.contains("is_err") + || lower.contains("is_ok") + || lower.contains("err != nil") + || lower.contains("err == nil") + || lower.contains("error != nil") + || lower.contains("error == nil") + { + return PredicateKind::ErrorCheck; + } + + // ── Null checks ────────────────────────────────────────────────────── + if lower.contains("is_none") + || lower.contains("is_some") + || lower.contains("== none") + || lower.contains("!= none") + || lower.contains("is none") + || lower.contains("is not none") + || lower.contains("== null") + || lower.contains("!= null") + || lower.contains("=== null") + || lower.contains("!== null") + || lower.contains("== nil") + || lower.contains("!= nil") + { + return PredicateKind::NullCheck; + } + + // ── Empty checks ───────────────────────────────────────────────────── + if lower.contains("is_empty") + || lower.contains(".len() == 0") + || lower.contains(".len() != 0") + || lower.contains(".length == 0") + || lower.contains(".length === 0") + || lower.contains(".length != 0") + || lower.contains(".length !== 0") + || lower.contains("== \"\"") + || lower.contains("== ''") + { + return PredicateKind::EmptyCheck; + } + + // ── Call-based kinds (require `(` to be present) ───────────────────── + if lower.contains('(') { + // Extract a rough callee token: everything before the first `(` + // that looks like an identifier (letters, digits, underscores, dots). + let callee_part = lower.split('(').next().unwrap_or(""); + // Take the last segment (after `.` or `::`) as the bare name. 
+ let bare = callee_part + .rsplit(['.', ':']) + .next() + .unwrap_or(callee_part) + .trim(); + + // Validation + if bare.contains("valid") + || bare.contains("check") + || bare.contains("verify") + || bare.starts_with("is_safe") + || bare.starts_with("is_authorized") + || bare.starts_with("is_authenticated") + { + return PredicateKind::ValidationCall; + } + + // Sanitizer + if bare.contains("sanitiz") || bare.contains("escape") || bare.contains("encode") { + return PredicateKind::SanitizerCall; + } + } + + // ── Comparison operators ───────────────────────────────────────────── + if lower.contains("==") + || lower.contains("!=") + || lower.contains(">=") + || lower.contains("<=") + || lower.contains(" > ") + || lower.contains(" < ") + { + return PredicateKind::Comparison; + } + + PredicateKind::Unknown +} + +// ─── Tests ─────────────────────────────────────────────────────────────────── + +#[cfg(test)] +mod tests { + use super::*; + + // ── classify_condition ──────────────────────────────────────────────── + + #[test] + fn classify_empty_is_unknown() { + assert_eq!(classify_condition(""), PredicateKind::Unknown); + } + + #[test] + fn classify_null_checks() { + assert_eq!(classify_condition("x.is_none()"), PredicateKind::NullCheck); + assert_eq!(classify_condition("x == null"), PredicateKind::NullCheck); + assert_eq!(classify_condition("x != nil"), PredicateKind::NullCheck); + assert_eq!(classify_condition("x is None"), PredicateKind::NullCheck); + assert_eq!(classify_condition("x === null"), PredicateKind::NullCheck); + } + + #[test] + fn classify_error_checks() { + assert_eq!(classify_condition("x.is_err()"), PredicateKind::ErrorCheck); + assert_eq!(classify_condition("err != nil"), PredicateKind::ErrorCheck); + assert_eq!(classify_condition("x.is_ok()"), PredicateKind::ErrorCheck); + } + + #[test] + fn classify_empty_checks() { + assert_eq!( + classify_condition("x.is_empty()"), + PredicateKind::EmptyCheck + ); + assert_eq!( + classify_condition("x.len() == 
0"), + PredicateKind::EmptyCheck + ); + assert_eq!( + classify_condition("x.length === 0"), + PredicateKind::EmptyCheck + ); + } + + #[test] + fn classify_validation_call() { + assert_eq!( + classify_condition("validate(x)"), + PredicateKind::ValidationCall + ); + assert_eq!( + classify_condition("is_safe(input)"), + PredicateKind::ValidationCall + ); + assert_eq!( + classify_condition("check_auth(req)"), + PredicateKind::ValidationCall + ); + assert_eq!( + classify_condition("input.verify(sig)"), + PredicateKind::ValidationCall + ); + } + + #[test] + fn classify_validation_requires_paren() { + // `x_valid == true` should NOT be ValidationCall — no `(` call syntax. + assert_eq!( + classify_condition("x_valid == true"), + PredicateKind::Comparison + ); + assert_eq!( + classify_condition("is_valid && ready"), + PredicateKind::Unknown + ); + } + + #[test] + fn classify_sanitizer_call() { + assert_eq!( + classify_condition("sanitize(x)"), + PredicateKind::SanitizerCall + ); + assert_eq!( + classify_condition("html_escape(s)"), + PredicateKind::SanitizerCall + ); + assert_eq!( + classify_condition("url_encode(path)"), + PredicateKind::SanitizerCall + ); + } + + #[test] + fn classify_comparison() { + assert_eq!(classify_condition("x == 5"), PredicateKind::Comparison); + assert_eq!(classify_condition("x != y"), PredicateKind::Comparison); + assert_eq!(classify_condition("a >= b"), PredicateKind::Comparison); + } + + #[test] + fn classify_unknown_fallback() { + assert_eq!(classify_condition("flag"), PredicateKind::Unknown); + assert_eq!(classify_condition("a && b"), PredicateKind::Unknown); + } +} diff --git a/src/taint/tests.rs b/src/taint/tests.rs index 6e5a115b..74c19516 100644 --- a/src/taint/tests.rs +++ b/src/taint/tests.rs @@ -1,6 +1,7 @@ use super::*; use crate::cfg::FuncSummaries; use crate::interop::InteropEdge; +use crate::labels::Cap; use crate::symbol::FuncKey; #[test] @@ -52,8 +53,10 @@ fn taint_through_if_else() { let (cfg, entry, summaries) = 
build_cfg(&tree, src, "rust", "test.rs", None); let findings = analyse_file(&cfg, entry, &summaries, None, Lang::Rust, "test.rs", &[]); - // exactly one path (via the True branch) should be flagged - assert_eq!(findings.len(), 1); + // Both branches have findings: the true branch uses unsanitized `x`, + // the else branch uses `safe` which was sanitized with HTML_ESCAPE + // but the sink requires SHELL_ESCAPE (wrong sanitizer → still tainted). + assert_eq!(findings.len(), 2); } #[test] @@ -2218,3 +2221,318 @@ fn return_call_recognized_as_source() { "foo() should have source_caps set because env::var is called inside return" ); } + +// ─── Path-sensitive analysis tests ─────────────────────────────────────────── + +#[test] +fn validate_and_early_return() { + use crate::cfg::build_cfg; + use tree_sitter::Language; + + // Validate before use: if validation fails, early return. + // The sink after the guard is on the "validated" path. + // + // The CFG creates a synthetic pass-through node for the false path + // with an explicit False edge from the If node. BFS reaches the + // sink via: cond → (False) → pass-through → (Seq) → sink. + // The predicate on the False edge records that `!validate(&x)` was + // false (i.e. validation passed), so the sink is path-guarded. 
+ let src = br#" + use std::env; use std::process::Command; + fn main() { + let x = env::var("INPUT").unwrap(); + if !validate(&x) { return; } + Command::new("sh").arg(x).status().unwrap(); + }"#; + + let mut parser = tree_sitter::Parser::new(); + parser + .set_language(&Language::from(tree_sitter_rust::LANGUAGE)) + .unwrap(); + let tree = parser.parse(src as &[u8], None).unwrap(); + + let (cfg, entry, summaries) = build_cfg(&tree, src, "rust", "test.rs", None); + let findings = analyse_file(&cfg, entry, &summaries, None, Lang::Rust, "test.rs", &[]); + + // Taint still flows (validate doesn't kill taint), but the finding + // should be annotated as path_validated because the false path + // (validation passed) has a ValidationCall predicate with polarity=true. + assert_eq!(findings.len(), 1, "should still detect the taint flow"); + assert!( + findings[0].path_validated, + "finding should be marked as path_validated (early-return guard detected)" + ); + assert_eq!( + findings[0].guard_kind, + Some(PredicateKind::ValidationCall), + "guard_kind should be ValidationCall" + ); +} + +#[test] +fn validate_in_if_else_path_validated() { + use crate::cfg::build_cfg; + use tree_sitter::Language; + + // If/else where the True branch (validation passed) contains the sink. + // This IS detectable because the If node has genuine True/False branches. 
+ let src = br#" + use std::env; use std::process::Command; + fn main() { + let x = env::var("INPUT").unwrap(); + if validate(&x) { + Command::new("sh").arg(&x).status().unwrap(); + } else { + println!("invalid input"); + } + }"#; + + let mut parser = tree_sitter::Parser::new(); + parser + .set_language(&Language::from(tree_sitter_rust::LANGUAGE)) + .unwrap(); + let tree = parser.parse(src as &[u8], None).unwrap(); + + let (cfg, entry, summaries) = build_cfg(&tree, src, "rust", "test.rs", None); + let findings = analyse_file(&cfg, entry, &summaries, None, Lang::Rust, "test.rs", &[]); + + assert_eq!(findings.len(), 1, "should detect the taint flow"); + assert!( + findings[0].path_validated, + "finding should be path_validated (sink in validated branch)" + ); + assert_eq!( + findings[0].guard_kind, + Some(PredicateKind::ValidationCall), + "guard_kind should be ValidationCall" + ); +} + +#[test] +fn sink_on_failed_validation_branch() { + use crate::cfg::build_cfg; + use tree_sitter::Language; + + // Sink is in the failed-validation branch (negated condition, false edge). 
+ let src = br#" + use std::env; use std::process::Command; + fn main() { + let x = env::var("INPUT").unwrap(); + if !validate(&x) { + Command::new("sh").arg(&x).status().unwrap(); + } + }"#; + + let mut parser = tree_sitter::Parser::new(); + parser + .set_language(&Language::from(tree_sitter_rust::LANGUAGE)) + .unwrap(); + let tree = parser.parse(src as &[u8], None).unwrap(); + + let (cfg, entry, summaries) = build_cfg(&tree, src, "rust", "test.rs", None); + let findings = analyse_file(&cfg, entry, &summaries, None, Lang::Rust, "test.rs", &[]); + + assert_eq!(findings.len(), 1, "should detect taint flow to sink"); + assert!( + !findings[0].path_validated, + "finding should NOT be path_validated (sink is in failed-validation branch)" + ); +} + +#[test] +fn contradictory_null_check_pruned() { + use crate::cfg::build_cfg; + use tree_sitter::Language; + + // Inner branch is infeasible: if x.is_none() then x cannot also be is_none(). + // After early return on is_none(), the fall-through path has polarity=false + // for NullCheck. The inner `if x.is_none()` True branch has polarity=true — + // contradiction. + let src = br#" + use std::env; use std::process::Command; + fn main() { + let x = env::var("INPUT").ok(); + if x.is_none() { return; } + if x.is_none() { + Command::new("sh").arg("dangerous").status().unwrap(); + } + }"#; + + let mut parser = tree_sitter::Parser::new(); + parser + .set_language(&Language::from(tree_sitter_rust::LANGUAGE)) + .unwrap(); + let tree = parser.parse(src as &[u8], None).unwrap(); + + let (cfg, entry, summaries) = build_cfg(&tree, src, "rust", "test.rs", None); + let findings = analyse_file(&cfg, entry, &summaries, None, Lang::Rust, "test.rs", &[]); + + // The inner branch is infeasible, and the arg "dangerous" is a string + // literal (not tainted), so there should be no findings. 
+ assert!( + findings.is_empty(), + "inner branch is infeasible — should produce no findings (got {})", + findings.len() + ); +} + +#[test] +fn sanitize_one_branch_no_regression() { + use crate::cfg::build_cfg; + use tree_sitter::Language; + + // Same as existing taint_through_if_else: sanitized in one branch, not in the other. + // Verify that both branches are still reported (2 findings; no regression from path sensitivity). + let src = br#" + use std::env; use std::process::Command; + fn main() { + let x = env::var("DANGEROUS").unwrap(); + let safe = html_escape::encode_safe(&x); + + if x.len() > 5 { + Command::new("sh").arg(&x).status().unwrap(); // UNSAFE + } else { + Command::new("sh").arg(&safe).status().unwrap(); // SAFE + } + }"#; + + let mut parser = tree_sitter::Parser::new(); + parser + .set_language(&Language::from(tree_sitter_rust::LANGUAGE)) + .unwrap(); + let tree = parser.parse(src as &[u8], None).unwrap(); + + let (cfg, entry, summaries) = build_cfg(&tree, src, "rust", "test.rs", None); + let findings = analyse_file(&cfg, entry, &summaries, None, Lang::Rust, "test.rs", &[]); + + // Both branches produce findings: the true branch uses unsanitized `x`, + // the else branch uses `safe` (HTML_ESCAPE sanitizer vs SHELL_ESCAPE sink). + // Previously only 1 finding because else_clause was silently dropped from CFG. + assert_eq!( + findings.len(), + 2, + "two findings expected (both branches reach sink with wrong/no sanitizer)" + ); +} + +#[test] +fn path_state_budget_graceful() { + use crate::cfg::build_cfg; + use tree_sitter::Language; + + // Deeply nested ifs with a sink at the innermost level. + // PathState should truncate gracefully after MAX_PATH_PREDICATES. 
+ let src = br#" + use std::env; use std::process::Command; + fn main() { + let x = env::var("INPUT").unwrap(); + if x.len() > 1 { + if x.len() > 2 { + if x.len() > 3 { + if x.len() > 4 { + if x.len() > 5 { + if x.len() > 6 { + if x.len() > 7 { + if x.len() > 8 { + if x.len() > 9 { + Command::new("sh").arg(&x).status().unwrap(); + } + } + } + } + } + } + } + } + } + }"#; + + let mut parser = tree_sitter::Parser::new(); + parser + .set_language(&Language::from(tree_sitter_rust::LANGUAGE)) + .unwrap(); + let tree = parser.parse(src as &[u8], None).unwrap(); + + let (cfg, entry, summaries) = build_cfg(&tree, src, "rust", "test.rs", None); + let findings = analyse_file(&cfg, entry, &summaries, None, Lang::Rust, "test.rs", &[]); + + // Should still detect the flow — truncation shouldn't cause false negatives. + assert_eq!( + findings.len(), + 1, + "should detect taint flow even with truncated PathState" + ); +} + +#[test] +fn unknown_predicate_not_pruned() { + use crate::cfg::build_cfg; + use tree_sitter::Language; + + // Comparison predicates are NOT in the contradiction whitelist, so even + // seemingly contradictory comparisons should not be pruned. + let src = br#" + use std::env; use std::process::Command; + fn main() { + let x = env::var("INPUT").unwrap(); + if x.len() > 5 { return; } + if x.len() > 5 { + Command::new("sh").arg(&x).status().unwrap(); + } + }"#; + + let mut parser = tree_sitter::Parser::new(); + parser + .set_language(&Language::from(tree_sitter_rust::LANGUAGE)) + .unwrap(); + let tree = parser.parse(src as &[u8], None).unwrap(); + + let (cfg, entry, summaries) = build_cfg(&tree, src, "rust", "test.rs", None); + let findings = analyse_file(&cfg, entry, &summaries, None, Lang::Rust, "test.rs", &[]); + + // Comparison is not in the whitelist — the path should NOT be pruned. 
+ assert_eq!( + findings.len(), + 1, + "Comparison predicate should not cause contradiction pruning" + ); +} + +#[test] +fn multi_var_predicate_not_pruned() { + use crate::cfg::build_cfg; + use tree_sitter::Language; + + // Multi-variable conditions should never be pruned for contradiction, + // even if the kind is in the whitelist. + let src = br#" + use std::env; use std::process::Command; + fn main() { + let x = env::var("INPUT").unwrap(); + let y = env::var("OTHER").ok(); + if y.is_none() { return; } + if y.is_none() { + Command::new("sh").arg(&x).status().unwrap(); + } + }"#; + + let mut parser = tree_sitter::Parser::new(); + parser + .set_language(&Language::from(tree_sitter_rust::LANGUAGE)) + .unwrap(); + let tree = parser.parse(src as &[u8], None).unwrap(); + + let (cfg, entry, summaries) = build_cfg(&tree, src, "rust", "test.rs", None); + let findings = analyse_file(&cfg, entry, &summaries, None, Lang::Rust, "test.rs", &[]); + + // Note: the condition `y.is_none()` references two idents, `y` and `is_none`. + // `is_none` is a method name, but collect_idents is expected to report `y` + // and `is_none` as separate identifiers. That makes the predicate multi-var, + // so contradiction pruning should NOT fire. The actual behavior depends on + // how many idents collect_idents extracts from `y.is_none()`: if it returns + // ["y", "is_none"], the predicate has 2 vars → multi-var → not pruned → finding exists. 
+ assert!( + !findings.is_empty(), + "multi-var predicate should not be pruned; flow should be detected" + ); +} diff --git a/src/taint/transfer.rs b/src/taint/transfer.rs new file mode 100644 index 00000000..059cf029 --- /dev/null +++ b/src/taint/transfer.rs @@ -0,0 +1,458 @@ +use crate::callgraph::normalize_callee_name; +use crate::cfg::{EdgeKind, FuncSummaries, NodeInfo, StmtKind}; +use crate::interop::InteropEdge; +use crate::labels::{Cap, DataLabel}; +use crate::state::engine::Transfer; +use crate::state::lattice::Lattice; +use crate::state::symbol::{SymbolId, SymbolInterner}; +use crate::summary::{CalleeResolution, GlobalSummaries}; +use crate::symbol::Lang; +use crate::taint::domain::{TaintOrigin, TaintState, VarTaint, predicate_kind_bit}; +use crate::taint::path_state::{PredicateKind, classify_condition}; +use petgraph::graph::NodeIndex; +use smallvec::SmallVec; + +/// Events emitted by the taint transfer function during Phase 2. +#[derive(Clone, Debug)] +pub enum TaintEvent { + SinkReached { + sink_node: NodeIndex, + tainted_vars: Vec<(SymbolId, Cap, SmallVec<[TaintOrigin; 2]>)>, + #[allow(dead_code)] + sink_caps: Cap, + all_validated: bool, + guard_kind: Option<PredicateKind>, + }, +} + +/// Taint transfer function for forward dataflow analysis. +pub struct TaintTransfer<'a> { + pub lang: Lang, + pub namespace: &'a str, + pub interner: &'a SymbolInterner, + pub local_summaries: &'a FuncSummaries, + pub global_summaries: Option<&'a GlobalSummaries>, + pub interop_edges: &'a [InteropEdge], + /// For JS two-level solve: top-level taint state seeded into function solves. + pub global_seed: Option<&'a TaintState>, + /// Optional scope filter: if set, only process nodes whose enclosing_func matches. + /// None = process all nodes. Some(None) = top-level only. Some(Some(name)) = function only. 
+ pub scope_filter: Option>, +} + +impl Transfer for TaintTransfer<'_> { + type Event = TaintEvent; + + fn apply( + &self, + node: NodeIndex, + info: &NodeInfo, + edge: Option, + mut state: TaintState, + ) -> (TaintState, Vec) { + let mut events = Vec::new(); + + // Scope filter: skip nodes outside our scope (return state unchanged) + if let Some(ref filter) = self.scope_filter { + let node_func = info.enclosing_func.as_deref(); + if node_func != *filter { + return (state, events); + } + } + + let caller_func = info.enclosing_func.as_deref().unwrap_or(""); + + // ── Apply taint transfer ──────────────────────────────────────── + match info.label { + Some(DataLabel::Source(bits)) => { + self.apply_source(node, info, bits, &mut state); + } + Some(DataLabel::Sanitizer(bits)) => { + self.apply_sanitizer(info, bits, &mut state); + } + _ if info.kind == StmtKind::Call => { + self.apply_call(node, info, caller_func, &mut state); + } + _ => { + self.apply_assignment(info, &mut state); + } + } + + // ── If-node predicate handling (edge-aware) ───────────────────── + if info.kind == StmtKind::If + && !info.condition_vars.is_empty() + && matches!(edge, Some(EdgeKind::True) | Some(EdgeKind::False)) + { + let cond_text = info.condition_text.as_deref().unwrap_or(""); + let kind = classify_condition(cond_text); + let polarity = matches!(edge, Some(EdgeKind::True)) ^ info.condition_negated; + + // ValidationCall handling + if kind == PredicateKind::ValidationCall && polarity { + for var in &info.condition_vars { + if let Some(sym) = self.interner.get(var) { + state.validated_may.insert(sym); + state.validated_must.insert(sym); + } + } + } + + // Predicate summary for whitelisted kinds (contradiction pruning) + if let Some(bit_idx) = predicate_kind_bit(kind) { + for var in &info.condition_vars { + if let Some(sym) = self.interner.get(var) { + let mut summary = state.get_predicate(sym); + if polarity { + summary.known_true |= 1 << bit_idx; + } else { + summary.known_false |= 1 << 
bit_idx; + } + state.set_predicate(sym, summary); + } + } + } + + // Contradiction pruning: if any variable has contradictory predicates, + // this is an infeasible path → return bot (monotonically kills branch). + if state.has_contradiction() { + return (TaintState::bot(), events); + } + } + + // ── Sink check ────────────────────────────────────────────────── + let sink_caps = self.resolve_sink_caps(info, caller_func); + if !sink_caps.is_empty() { + let tainted_vars = self.collect_tainted_sink_vars(info, &state, sink_caps); + if !tainted_vars.is_empty() { + let all_validated = tainted_vars + .iter() + .all(|(sym, _, _)| state.validated_may.contains(*sym)); + + let guard_kind = if all_validated { + Some(PredicateKind::ValidationCall) + } else { + None + }; + + events.push(TaintEvent::SinkReached { + sink_node: node, + tainted_vars, + sink_caps, + all_validated, + guard_kind, + }); + } + } + + (state, events) + } + + fn iteration_budget(&self) -> usize { + 100_000 + } + + fn on_budget_exceeded(&self) -> bool { + tracing::warn!("taint analysis: worklist budget exceeded, returning partial results"); + false + } +} + +impl TaintTransfer<'_> { + /// Apply a Source label: insert taint for the defined variable. 
+ fn apply_source(&self, node: NodeIndex, info: &NodeInfo, bits: Cap, state: &mut TaintState) { + if let Some(ref v) = info.defines + && let Some(sym) = self.interner.get(v) + { + let callee = info.callee.as_deref().unwrap_or(""); + let source_kind = crate::labels::infer_source_kind(bits, callee); + let origin = TaintOrigin { node, source_kind }; + + match state.get(sym) { + Some(existing) => { + let mut new_taint = existing.clone(); + new_taint.caps |= bits; + if new_taint.origins.len() < 4 + && !new_taint.origins.iter().any(|o| o.node == node) + { + new_taint.origins.push(origin); + } + state.set(sym, new_taint); + } + None => { + state.set( + sym, + VarTaint { + caps: bits, + origins: SmallVec::from_elem(origin, 1), + }, + ); + } + } + } + } + + /// Apply a Sanitizer label: propagate input taint, then strip sanitizer bits. + fn apply_sanitizer(&self, info: &NodeInfo, bits: Cap, state: &mut TaintState) { + if let Some(ref v) = info.defines + && let Some(sym) = self.interner.get(v) + { + let (combined_caps, combined_origins) = self.collect_uses_taint(info, state); + let new_caps = combined_caps & !bits; + if new_caps.is_empty() { + state.remove(sym); + } else { + state.set( + sym, + VarTaint { + caps: new_caps, + origins: combined_origins, + }, + ); + } + } + } + + /// Apply a function call: resolve callee and compute return taint. + fn apply_call( + &self, + node: NodeIndex, + info: &NodeInfo, + caller_func: &str, + state: &mut TaintState, + ) { + if let Some(ref callee) = info.callee + && let Some(resolved) = self.resolve_callee(callee, caller_func, info.call_ordinal) + { + let mut return_bits = Cap::empty(); + let mut return_origins: SmallVec<[TaintOrigin; 2]> = SmallVec::new(); + + // 1. 
Source behaviour + if !resolved.source_caps.is_empty() { + return_bits |= resolved.source_caps; + let callee_str = info.callee.as_deref().unwrap_or(""); + let source_kind = + crate::labels::infer_source_kind(resolved.source_caps, callee_str); + let origin = TaintOrigin { node, source_kind }; + if !return_origins.iter().any(|o| o.node == node) { + return_origins.push(origin); + } + } + + // 2. Propagation + if resolved.propagates_taint { + let (use_caps, use_origins) = self.collect_uses_taint(info, state); + return_bits |= use_caps; + for orig in &use_origins { + if return_origins.len() < 4 + && !return_origins.iter().any(|o| o.node == orig.node) + { + return_origins.push(*orig); + } + } + } + + // 3. Sanitizer behaviour (applied last so it always wins) + return_bits &= !resolved.sanitizer_caps; + + // Write result + if let Some(ref v) = info.defines + && let Some(sym) = self.interner.get(v) + { + if return_bits.is_empty() { + state.remove(sym); + } else { + state.set( + sym, + VarTaint { + caps: return_bits, + origins: return_origins, + }, + ); + } + } + + return; + } + + // Unresolved call — fall through to default gen/kill + self.apply_assignment(info, state); + } + + /// Default gen/kill: propagate taint through variable assignments. + fn apply_assignment(&self, info: &NodeInfo, state: &mut TaintState) { + if matches!( + info.label, + Some(DataLabel::Source(_)) | Some(DataLabel::Sanitizer(_)) + ) { + return; + } + + if let Some(ref d) = info.defines + && let Some(sym) = self.interner.get(d) + { + let (combined_caps, combined_origins) = self.collect_uses_taint(info, state); + if combined_caps.is_empty() { + state.remove(sym); + } else { + state.set( + sym, + VarTaint { + caps: combined_caps, + origins: combined_origins, + }, + ); + } + } + } + + /// Collect taint from all `uses` variables (union of caps + merge origins). 
+ fn collect_uses_taint( + &self, + info: &NodeInfo, + state: &TaintState, + ) -> (Cap, SmallVec<[TaintOrigin; 2]>) { + let mut combined_caps = Cap::empty(); + let mut combined_origins: SmallVec<[TaintOrigin; 2]> = SmallVec::new(); + + for u in &info.uses { + let taint = self.lookup_var(u, state); + if let Some(t) = taint { + combined_caps |= t.caps; + for orig in &t.origins { + if combined_origins.len() < 4 + && !combined_origins.iter().any(|o| o.node == orig.node) + { + combined_origins.push(*orig); + } + } + } + } + + (combined_caps, combined_origins) + } + + /// Look up a variable's taint, falling back to global_seed for JS two-level solve. + fn lookup_var<'a>(&'a self, name: &str, state: &'a TaintState) -> Option<&'a VarTaint> { + if let Some(sym) = self.interner.get(name) { + if let Some(taint) = state.get(sym) { + return Some(taint); + } + // Fall back to global seed (JS two-level solve) + if let Some(seed) = self.global_seed { + return seed.get(sym); + } + } + None + } + + /// Resolve sink caps from label or callee summary. + fn resolve_sink_caps(&self, info: &NodeInfo, caller_func: &str) -> Cap { + match info.label { + Some(DataLabel::Sink(caps)) => caps, + _ => info + .callee + .as_ref() + .and_then(|c| self.resolve_callee(c, caller_func, info.call_ordinal)) + .filter(|r| !r.sink_caps.is_empty()) + .map(|r| r.sink_caps) + .unwrap_or(Cap::empty()), + } + } + + /// Collect tainted variables at a sink node. + fn collect_tainted_sink_vars( + &self, + info: &NodeInfo, + state: &TaintState, + sink_caps: Cap, + ) -> Vec<(SymbolId, Cap, SmallVec<[TaintOrigin; 2]>)> { + let mut result = Vec::new(); + for u in &info.uses { + if let Some(taint) = self.lookup_var(u, state) + && (taint.caps & sink_caps) != Cap::empty() + && let Some(sym) = self.interner.get(u) + { + result.push((sym, taint.caps, taint.origins.clone())); + } + } + result + } + + /// Resolve a callee name to its summary (local → global → interop). 
+ fn resolve_callee( + &self, + callee: &str, + caller_func: &str, + call_ordinal: u32, + ) -> Option<ResolvedSummary> { + let normalized = normalize_callee_name(callee); + + // 1) Local (same-file) + let local_matches: Vec<_> = self + .local_summaries + .iter() + .filter(|(k, _)| { + k.name == normalized && k.lang == self.lang && k.namespace == self.namespace + }) + .collect(); + + if local_matches.len() == 1 { + let (_, ls) = local_matches[0]; + return Some(ResolvedSummary { + source_caps: ls.source_caps, + sanitizer_caps: ls.sanitizer_caps, + sink_caps: ls.sink_caps, + propagates_taint: ls.propagates_taint, + }); + } + if local_matches.len() > 1 { + return None; + } + + // 2) Global same-language + if let Some(gs) = self.global_summaries { + match gs.resolve_callee_key(normalized, self.lang, self.namespace, None) { + CalleeResolution::Resolved(target_key) => { + if let Some(fs) = gs.get(&target_key) { + return Some(ResolvedSummary { + source_caps: fs.source_caps(), + sanitizer_caps: fs.sanitizer_caps(), + sink_caps: fs.sink_caps(), + propagates_taint: fs.propagates_taint, + }); + } + } + CalleeResolution::NotFound | CalleeResolution::Ambiguous(_) => {} + } + } + + // 3) Interop edges + for edge in self.interop_edges { + if edge.from.caller_lang == self.lang + && edge.from.caller_namespace == self.namespace + && edge.from.callee_symbol == callee + && (edge.from.caller_func.is_empty() || edge.from.caller_func == caller_func) + && (edge.from.ordinal == 0 || edge.from.ordinal == call_ordinal) + && let Some(gs) = self.global_summaries + && let Some(fs) = gs.get(&edge.to) + { + return Some(ResolvedSummary { + source_caps: fs.source_caps(), + sanitizer_caps: fs.sanitizer_caps(), + sink_caps: fs.sink_caps(), + propagates_taint: fs.propagates_taint, + }); + } + } + + None + } +} + +/// Resolved summary for a callee.
+struct ResolvedSummary { + source_caps: Cap, + sanitizer_caps: Cap, + sink_caps: Cap, + propagates_taint: bool, +} diff --git a/src/utils/config.rs b/src/utils/config.rs index 3c9e06a2..e887d56d 100644 --- a/src/utils/config.rs +++ b/src/utils/config.rs @@ -61,6 +61,10 @@ pub struct ScannerConfig { /// benchmarks, etc.) at their original severity. When false (default), /// findings in these paths are downgraded by one severity tier. pub include_nonprod: bool, + + /// Enable the state-model dataflow engine for resource lifecycle and + /// auth-state analysis. Default: false (opt-in). + pub enable_state_analysis: bool, } impl Default for ScannerConfig { fn default() -> Self { @@ -94,6 +98,7 @@ impl Default for ScannerConfig { follow_symlinks: false, scan_hidden_files: false, include_nonprod: false, + enable_state_analysis: false, } } } @@ -135,6 +140,60 @@ pub struct OutputConfig { /// The maximum number of results to show. pub max_results: Option, + + /// Enable attack-surface ranking to sort findings by exploitability. + pub attack_surface_ranking: bool, + + /// Minimum attack-surface score to include in output. + /// Findings below this threshold are dropped after ranking. + /// `None` means no minimum (all findings shown). + pub min_score: Option, + + /// Minimum confidence level to include in output. + /// `None` means no minimum (all findings shown). + #[serde( + default, + skip_serializing_if = "Option::is_none", + deserialize_with = "deserialize_confidence_opt" + )] + pub min_confidence: Option<Confidence>, + + /// Include Quality-category findings (excluded by default). + #[serde(default)] + pub include_quality: bool, + + /// Show all findings: disables category filtering, rollups, and LOW budgets. + #[serde(default)] + pub show_all: bool, + + /// Maximum total LOW findings to show. + #[serde(default = "default_max_low")] + pub max_low: u32, + + /// Maximum LOW findings per file.
+ #[serde(default = "default_max_low_per_file")] + pub max_low_per_file: u32, + + /// Maximum LOW findings per rule. + #[serde(default = "default_max_low_per_rule")] + pub max_low_per_rule: u32, + + /// Number of example locations to store in rollup findings. + #[serde(default = "default_rollup_examples")] + pub rollup_examples: u32, +} + +fn default_max_low() -> u32 { + 20 +} +fn default_max_low_per_file() -> u32 { + 1 +} +fn default_max_low_per_rule() -> u32 { + 10 +} +fn default_rollup_examples() -> u32 { + 5 } impl Default for OutputConfig { @@ -143,10 +202,36 @@ impl Default for OutputConfig { default_format: "console".into(), quiet: false, max_results: None, + attack_surface_ranking: true, + min_score: None, + min_confidence: None, + include_quality: false, + show_all: false, + max_low: 20, + max_low_per_file: 1, + max_low_per_rule: 10, + rollup_examples: 5, } } } +/// Deserialize an optional Confidence from a TOML string. +fn deserialize_confidence_opt<'de, D>( + deserializer: D, +) -> Result<Option<Confidence>, D::Error> +where + D: serde::Deserializer<'de>, +{ + let opt: Option<String> = Option::deserialize(deserializer)?; + match opt { + None => Ok(None), + Some(s) => s + .parse::<Confidence>() + .map(Some) + .map_err(serde::de::Error::custom), + } +} + #[derive(Debug, Serialize, Deserialize, Clone)] #[serde(default)] pub struct PerformanceConfig { @@ -303,6 +388,7 @@ fn merge_configs(mut default: Config, user: Config) -> Config { default.scanner.follow_symlinks = user.scanner.follow_symlinks; default.scanner.scan_hidden_files = user.scanner.scan_hidden_files; default.scanner.include_nonprod = user.scanner.include_nonprod; + default.scanner.enable_state_analysis = user.scanner.enable_state_analysis; // Merge exclusion lists (default ⊔ user), then sort & dedupe default @@ -328,6 +414,15 @@ fn merge_configs(mut default: Config, user: Config) -> Config { default.output.default_format = user.output.default_format; default.output.quiet = user.output.quiet; default.output.max_results =
user.output.max_results; + default.output.attack_surface_ranking = user.output.attack_surface_ranking; + default.output.min_score = user.output.min_score; + default.output.min_confidence = user.output.min_confidence; + default.output.include_quality = user.output.include_quality; + default.output.show_all = user.output.show_all; + default.output.max_low = user.output.max_low; + default.output.max_low_per_file = user.output.max_low_per_file; + default.output.max_low_per_rule = user.output.max_low_per_rule; + default.output.rollup_examples = user.output.rollup_examples; // --- PerformanceConfig --- default.performance.max_depth = user.performance.max_depth; diff --git a/src/walk.rs b/src/walk.rs index 6822a9da..25f4afd3 100644 --- a/src/walk.rs +++ b/src/walk.rs @@ -147,8 +147,8 @@ pub fn spawn_file_walker(root: &Path, cfg: &Config) -> (Receiver, JoinHan #[test] fn walker_respects_excluded_extensions() { let tmp = tempfile::tempdir().unwrap(); - std::fs::write(tmp.path().join("keep.rs"), "fn main(){}").unwrap(); - std::fs::write(tmp.path().join("skip.txt"), "ignored").unwrap(); + std::fs::write(tmp.path().join("keep.rs"), "fn main(){}").unwrap(); // nyx:ignore cfg-unguarded-sink + std::fs::write(tmp.path().join("skip.txt"), "ignored").unwrap(); // nyx:ignore cfg-unguarded-sink let mut cfg = Config::default(); cfg.scanner.excluded_extensions = vec!["txt".into()]; diff --git a/tests/common/mod.rs b/tests/common/mod.rs index 51d7eb8c..5109bf06 100644 --- a/tests/common/mod.rs +++ b/tests/common/mod.rs @@ -7,11 +7,13 @@ use std::path::Path; // ── Deterministic test config ────────────────────────────────────────────── +#[allow(dead_code)] pub fn test_config(mode: AnalysisMode) -> Config { let mut cfg = Config::default(); cfg.scanner.mode = mode; cfg.scanner.read_vcsignore = false; cfg.scanner.require_git_to_read_vcsignore = false; + cfg.scanner.enable_state_analysis = true; cfg.performance.worker_threads = Some(1); cfg.performance.batch_size = 64; 
cfg.performance.channel_multiplier = 1; @@ -21,6 +23,7 @@ pub fn test_config(mode: AnalysisMode) -> Config { // ── Scan helpers ─────────────────────────────────────────────────────────── /// Full two-pass scan of a directory (filesystem only, no index). +#[allow(dead_code)] pub fn scan_fixture_dir(path: &Path, mode: AnalysisMode) -> Vec<Diag> { let cfg = test_config(mode); nyx_scanner::scan_no_index(path, &cfg).expect("scan_no_index should succeed") @@ -28,10 +31,12 @@ pub fn scan_fixture_dir(path: &Path, mode: AnalysisMode) -> Vec<Diag> { // ── Counting / assertion helpers ─────────────────────────────────────────── +#[allow(dead_code)] pub fn count_by_prefix(diags: &[Diag], prefix: &str) -> usize { diags.iter().filter(|d| d.id.starts_with(prefix)).count() } +#[allow(dead_code)] pub fn assert_min_findings(diags: &[Diag], prefix: &str, min: usize) { let count = count_by_prefix(diags, prefix); assert!( @@ -52,6 +57,7 @@ pub fn assert_min_findings(diags: &[Diag], prefix: &str, min: usize) { ); } +#[allow(dead_code)] pub fn assert_no_findings(diags: &[Diag], prefix: &str) { let matching: Vec<_> = diags.iter().filter(|d| d.id.starts_with(prefix)).collect(); assert!( @@ -65,6 +71,7 @@ pub fn assert_no_findings(diags: &[Diag], prefix: &str) { ); } +#[allow(dead_code)] pub fn assert_max_findings(diags: &[Diag], max_total: usize, max_high: usize) { let high_count = diags .iter() @@ -130,6 +137,7 @@ pub struct PerformanceExpectations { } /// Load and parse `expectations.json` from a fixture directory. +#[allow(dead_code)] pub fn load_expectations(fixture_dir: &Path) -> Expectations { let path = fixture_dir.join("expectations.json"); let content = std::fs::read_to_string(&path) @@ -139,6 +147,7 @@ pub fn load_expectations(fixture_dir: &Path) -> Expectations { } /// Validate a set of diagnostics against a fixture's expectations.json.
+#[allow(dead_code)] pub fn validate_expectations(diags: &[Diag], fixture_dir: &Path) { let exp = load_expectations(fixture_dir); diff --git a/tests/fixtures/c_utils/expectations.json b/tests/fixtures/c_utils/expectations.json index 5e6e6ee4..fa2a46b1 100644 --- a/tests/fixtures/c_utils/expectations.json +++ b/tests/fixtures/c_utils/expectations.json @@ -1,12 +1,12 @@ { "required_findings": [ { "id_prefix": "taint-unsanitised-flow", "min_count": 4 }, - { "id_prefix": "strcpy_call", "min_count": 1 }, - { "id_prefix": "strcat_call", "min_count": 1 }, - { "id_prefix": "sprintf_call", "min_count": 4 }, - { "id_prefix": "gets_call", "min_count": 1 }, - { "id_prefix": "scanf_with_percent_s", "min_count": 1 }, - { "id_prefix": "system_call", "min_count": 3 }, + { "id_prefix": "c.memory.strcpy", "min_count": 1 }, + { "id_prefix": "c.memory.strcat", "min_count": 1 }, + { "id_prefix": "c.memory.sprintf", "min_count": 4 }, + { "id_prefix": "c.memory.gets", "min_count": 1 }, + { "id_prefix": "c.memory.scanf_percent_s", "min_count": 1 }, + { "id_prefix": "c.cmdi.system", "min_count": 3 }, { "id_prefix": "cfg-unguarded-sink", "min_count": 5 } ], "forbidden_findings": [], diff --git a/tests/fixtures/express_app/expectations.json b/tests/fixtures/express_app/expectations.json index 2ccd377c..7a1943a4 100644 --- a/tests/fixtures/express_app/expectations.json +++ b/tests/fixtures/express_app/expectations.json @@ -1,10 +1,10 @@ { "required_findings": [ { "id_prefix": "taint-unsanitised-flow", "min_count": 6 }, - { "id_prefix": "eval_call", "min_count": 1 }, - { "id_prefix": "document_write", "min_count": 1 }, - { "id_prefix": "settimeout_string", "min_count": 1 }, - { "id_prefix": "cookie_assignment", "min_count": 1 } + { "id_prefix": "js.code_exec.eval", "min_count": 1 }, + { "id_prefix": "js.xss.document_write", "min_count": 1 }, + { "id_prefix": "js.code_exec.settimeout_string", "min_count": 1 }, + { "id_prefix": "js.xss.cookie_write", "min_count": 1 } ], "forbidden_findings": [], 
"noise_budget": { diff --git a/tests/fixtures/flask_app/expectations.json b/tests/fixtures/flask_app/expectations.json index 218d5e95..1c5984bd 100644 --- a/tests/fixtures/flask_app/expectations.json +++ b/tests/fixtures/flask_app/expectations.json @@ -1,13 +1,13 @@ { "required_findings": [ { "id_prefix": "taint-unsanitised-flow", "min_count": 8 }, - { "id_prefix": "eval_call", "min_count": 1 }, - { "id_prefix": "exec_call", "min_count": 2 }, - { "id_prefix": "cfg-auth-gap", "min_count": 5 } + { "id_prefix": "py.code_exec.eval", "min_count": 1 }, + { "id_prefix": "py.code_exec.exec", "min_count": 2 }, + { "id_prefix": "state-unauthed-access", "min_count": 5 } ], "forbidden_findings": [], "noise_budget": { - "max_total_findings": 35, + "max_total_findings": 50, "max_high_findings": 25 }, "performance_expectations": { diff --git a/tests/fixtures/go_server/expectations.json b/tests/fixtures/go_server/expectations.json index f633b3e3..1c238440 100644 --- a/tests/fixtures/go_server/expectations.json +++ b/tests/fixtures/go_server/expectations.json @@ -1,7 +1,7 @@ { "required_findings": [ { "id_prefix": "taint-unsanitised-flow", "min_count": 4 }, - { "id_prefix": "exec_command", "min_count": 3 }, + { "id_prefix": "go.cmdi.exec_command", "min_count": 3 }, { "id_prefix": "cfg-unguarded-sink", "min_count": 1 } ], "forbidden_findings": [], diff --git a/tests/fixtures/java_service/expectations.json b/tests/fixtures/java_service/expectations.json index a4e245b1..ba7bca38 100644 --- a/tests/fixtures/java_service/expectations.json +++ b/tests/fixtures/java_service/expectations.json @@ -1,14 +1,14 @@ { "required_findings": [ { "id_prefix": "taint-unsanitised-flow", "min_count": 2 }, - { "id_prefix": "runtime_exec", "min_count": 2 }, - { "id_prefix": "class_for_name", "min_count": 1 }, - { "id_prefix": "cfg-unguarded-sink", "min_count": 2 } + { "id_prefix": "java.cmdi.runtime_exec", "min_count": 2 }, + { "id_prefix": "java.reflection.class_forname", "min_count": 1 }, + { 
"id_prefix": "cfg-unguarded-sink", "min_count": 1 } ], "forbidden_findings": [], "noise_budget": { - "max_total_findings": 15, - "max_high_findings": 8 + "max_total_findings": 20, + "max_high_findings": 12 }, "performance_expectations": { "max_ms_no_index": 1000, diff --git a/tests/fixtures/mixed_project/expectations.json b/tests/fixtures/mixed_project/expectations.json index 05d0bf4a..d12bd911 100644 --- a/tests/fixtures/mixed_project/expectations.json +++ b/tests/fixtures/mixed_project/expectations.json @@ -1,10 +1,7 @@ { "required_findings": [ { "id_prefix": "taint-unsanitised-flow", "min_count": 10 }, - { "id_prefix": "eval_call", "min_count": 2 }, - { "id_prefix": "unwrap_call", "min_count": 3 }, - { "id_prefix": "expect_call", "min_count": 1 }, - { "id_prefix": "panic_macro", "min_count": 1 }, + { "id_prefix": "js.code_exec.eval", "min_count": 1 }, { "id_prefix": "cfg-unguarded-sink", "min_count": 2 } ], "forbidden_findings": [], diff --git a/tests/fixtures/patterns/c/negative.c b/tests/fixtures/patterns/c/negative.c new file mode 100644 index 00000000..d6e98b60 --- /dev/null +++ b/tests/fixtures/patterns/c/negative.c @@ -0,0 +1,24 @@ +/* Negative fixture: none of these should trigger security patterns. */ +#include +#include +#include + +void safe_snprintf(const char *name) { + char buf[128]; + snprintf(buf, sizeof(buf), "Hello %s", name); +} + +void safe_strncpy(const char *src) { + char dst[32]; + strncpy(dst, src, sizeof(dst) - 1); + dst[sizeof(dst) - 1] = '\0'; +} + +void safe_fgets() { + char buf[64]; + fgets(buf, sizeof(buf), stdin); +} + +void safe_printf_literal() { + printf("Hello %s\n", "world"); +} diff --git a/tests/fixtures/patterns/c/positive.c b/tests/fixtures/patterns/c/positive.c new file mode 100644 index 00000000..33a2a669 --- /dev/null +++ b/tests/fixtures/patterns/c/positive.c @@ -0,0 +1,50 @@ +/* Positive fixture: each snippet should trigger the named pattern. 
*/ +#include +#include +#include + +/* c.memory.gets */ +void trigger_gets() { + char buf[64]; + gets(buf); +} + +/* c.memory.strcpy */ +void trigger_strcpy(char *src) { + char dst[32]; + strcpy(dst, src); +} + +/* c.memory.strcat */ +void trigger_strcat(char *extra) { + char buf[64] = "prefix"; + strcat(buf, extra); +} + +/* c.memory.sprintf */ +void trigger_sprintf(const char *name) { + char buf[128]; + sprintf(buf, "Hello %s", name); +} + +/* c.memory.scanf_percent_s */ +void trigger_scanf() { + char name[32]; + scanf("%s", name); +} + +/* c.cmdi.system */ +void trigger_system(const char *cmd) { + system(cmd); +} + +/* c.cmdi.popen */ +void trigger_popen(const char *cmd) { + FILE *f = popen(cmd, "r"); + pclose(f); +} + +/* c.memory.printf_no_fmt */ +void trigger_printf_no_fmt(char *user_data) { + printf(user_data); +} diff --git a/tests/fixtures/patterns/cpp/negative.cpp b/tests/fixtures/patterns/cpp/negative.cpp new file mode 100644 index 00000000..b27f8856 --- /dev/null +++ b/tests/fixtures/patterns/cpp/negative.cpp @@ -0,0 +1,24 @@ +// Negative fixture: none of these should trigger security patterns. +#include +#include +#include + +void safe_string_ops() { + std::string s = "hello"; + std::string copy = s; + auto len = s.length(); +} + +void safe_cast() { + double d = 3.14; + int i = static_cast<int>(d); +} + +void safe_snprintf(const char *name) { + char buf[128]; + snprintf(buf, sizeof(buf), "Hello %s", name); +} + +void safe_printf_literal() { + printf("Hello %s\n", "world"); +} diff --git a/tests/fixtures/patterns/cpp/positive.cpp b/tests/fixtures/patterns/cpp/positive.cpp new file mode 100644 index 00000000..3963e17e --- /dev/null +++ b/tests/fixtures/patterns/cpp/positive.cpp @@ -0,0 +1,49 @@ +// Positive fixture: each snippet should trigger the named pattern.
+#include +#include +#include + +// cpp.memory.gets +void trigger_gets() { + char buf[64]; + gets(buf); +} + +// cpp.memory.strcpy +void trigger_strcpy(const char *src) { + char dst[32]; + strcpy(dst, src); +} + +// cpp.memory.strcat +void trigger_strcat(const char *extra) { + char buf[64] = "prefix"; + strcat(buf, extra); +} + +// cpp.memory.sprintf +void trigger_sprintf(const char *name) { + char buf[128]; + sprintf(buf, "Hello %s", name); +} + +// cpp.cmdi.system +void trigger_system(const char *cmd) { + system(cmd); +} + +// cpp.memory.reinterpret_cast +void trigger_reinterpret_cast() { + int x = 42; + float *fp = reinterpret_cast<float *>(&x); +} + +// cpp.memory.const_cast +void trigger_const_cast(const int *p) { + int *q = const_cast<int *>(p); +} + +// cpp.memory.printf_no_fmt +void trigger_printf_no_fmt(char *user_data) { + printf(user_data); +} diff --git a/tests/fixtures/patterns/go/negative.go b/tests/fixtures/patterns/go/negative.go new file mode 100644 index 00000000..434780c1 --- /dev/null +++ b/tests/fixtures/patterns/go/negative.go @@ -0,0 +1,23 @@ +package main + +import ( + "crypto/sha256" + "database/sql" +) + +func safeHash(data []byte) { + sha256.Sum256(data) +} + +func safeParamQuery(db *sql.DB, user string) { + db.Query("SELECT * FROM users WHERE name = $1", user) +} + +func safeLiteralQuery(db *sql.DB) { + db.Query("SELECT COUNT(*) FROM users") +} + +func safeStringOps() { + x := "hello" + _ = len(x) +} diff --git a/tests/fixtures/patterns/go/positive.go b/tests/fixtures/patterns/go/positive.go new file mode 100644 index 00000000..97d93495 --- /dev/null +++ b/tests/fixtures/patterns/go/positive.go @@ -0,0 +1,55 @@ +package main + +import ( + "crypto/md5" + "crypto/sha1" + "database/sql" + "encoding/gob" + "os" + "os/exec" + "unsafe" +) + +// go.cmdi.exec_command +func triggerExecCommand(cmd string) { + exec.Command("bash", "-c", cmd) +} + +// go.memory.unsafe_pointer +func triggerUnsafePointer() { + x := 42 + p := unsafe.Pointer(&x) + _ = p +} + +//
go.transport.insecure_skip_verify +func triggerInsecureSkipVerify() { + _ = struct{ InsecureSkipVerify bool }{InsecureSkipVerify: true} +} + +// go.crypto.md5 +func triggerMD5(data []byte) { + md5.Sum(data) +} + +// go.crypto.sha1 +func triggerSHA1(data []byte) { + sha1.Sum(data) +} + +// go.sqli.query_concat +func triggerSQLConcat(db *sql.DB, user string) { + db.Query("SELECT * FROM users WHERE name = '" + user + "'") +} + +// go.secrets.hardcoded_key +func triggerHardcodedSecret() { + password := "super_secret_password_12345" + _ = password +} + +// go.deser.gob_decode +func triggerGobDecode(f *os.File) { + dec := gob.NewDecoder(f) + _ = dec +} diff --git a/tests/fixtures/patterns/java/negative.java b/tests/fixtures/patterns/java/negative.java new file mode 100644 index 00000000..21b8328c --- /dev/null +++ b/tests/fixtures/patterns/java/negative.java @@ -0,0 +1,22 @@ +import java.sql.*; +import java.security.SecureRandom; + +class Negative { + // Safe: parameterized query + void safeQuery(Connection conn, String user) throws Exception { + PreparedStatement ps = conn.prepareStatement("SELECT * FROM users WHERE name = ?"); + ps.setString(1, user); + ResultSet rs = ps.executeQuery(); + } + + // Safe: SecureRandom instead of Random + void safeRandom() { + SecureRandom sr = new SecureRandom(); + int token = sr.nextInt(); + } + + // Safe: no concatenation in SQL + void safeLiteralQuery(Statement stmt) throws Exception { + stmt.executeQuery("SELECT COUNT(*) FROM users"); + } +} diff --git a/tests/fixtures/patterns/java/positive.java b/tests/fixtures/patterns/java/positive.java new file mode 100644 index 00000000..03c83f9e --- /dev/null +++ b/tests/fixtures/patterns/java/positive.java @@ -0,0 +1,48 @@ +import java.io.*; +import java.util.Random; +import java.security.MessageDigest; + +class Positive { + // java.deser.readobject + void triggerDeser(InputStream is) throws Exception { + ObjectInputStream ois = new ObjectInputStream(is); + Object obj = ois.readObject(); + } 
+ + // java.cmdi.runtime_exec + void triggerRuntimeExec(String cmd) throws Exception { + Runtime.getRuntime().exec(cmd); + } + + // java.reflection.class_forname + void triggerClassForName(String name) throws Exception { + Class.forName(name); + } + + // java.reflection.method_invoke + void triggerMethodInvoke(Object target) throws Exception { + java.lang.reflect.Method m = target.getClass().getMethod("run"); + m.invoke(target); + } + + // java.sqli.execute_concat + void triggerSqlConcat(java.sql.Statement stmt, String user) throws Exception { + stmt.executeQuery("SELECT * FROM users WHERE name = '" + user + "'"); + } + + // java.crypto.insecure_random + void triggerInsecureRandom() { + Random r = new Random(); + int token = r.nextInt(); + } + + // java.crypto.weak_digest + void triggerWeakDigest() throws Exception { + MessageDigest md = MessageDigest.getInstance("MD5"); + } + + // java.xss.getwriter_print + void triggerGetWriterPrint(javax.servlet.http.HttpServletResponse resp) throws Exception { + resp.getWriter().println("" + "data" + ""); + } +} diff --git a/tests/fixtures/patterns/javascript/negative.js b/tests/fixtures/patterns/javascript/negative.js new file mode 100644 index 00000000..e1cd3623 --- /dev/null +++ b/tests/fixtures/patterns/javascript/negative.js @@ -0,0 +1,25 @@ +// Negative fixture: none of these should trigger security patterns. 
+ +function safeStringOps() { + var x = "hello"; + var y = x.toUpperCase(); + var z = JSON.stringify({ key: "value" }); +} + +function safeTimeout(fn) { + // Function reference, not string + setTimeout(fn, 1000); +} + +function safeDomManipulation(el) { + el.textContent = "safe text"; + el.setAttribute("class", "active"); +} + +function safeRandomness() { + var buf = crypto.getRandomValues(new Uint8Array(16)); +} + +function safeCopy(src) { + var copy = Object.assign({}, src); +} diff --git a/tests/fixtures/patterns/javascript/positive.js b/tests/fixtures/patterns/javascript/positive.js new file mode 100644 index 00000000..ac4adb64 --- /dev/null +++ b/tests/fixtures/patterns/javascript/positive.js @@ -0,0 +1,51 @@ +// Positive fixture: each snippet should trigger the named pattern. + +// js.code_exec.eval +function triggerEval(code) { + eval(code); +} + +// js.code_exec.new_function +function triggerNewFunction(body) { + var fn = new Function(body); +} + +// js.code_exec.settimeout_string +function triggerSetTimeout() { + setTimeout("alert(1)", 1000); +} + +// js.xss.document_write +function triggerDocumentWrite(data) { + document.write(data); +} + +// js.xss.outer_html +function triggerOuterHtml(el, data) { + el.outerHTML = data; +} + +// js.xss.insert_adjacent_html +function triggerInsertAdjacentHtml(el, data) { + el.insertAdjacentHTML("beforeend", data); +} + +// js.prototype.proto_assignment +function triggerProtoAssignment(obj) { + obj.__proto__ = { malicious: true }; +} + +// js.xss.location_assign +function triggerLocationAssign(url) { + window.location = url; +} + +// js.xss.cookie_write +function triggerCookieWrite(sid) { + document.cookie = "session=" + sid; +} + +// js.crypto.math_random +function triggerMathRandom() { + var token = Math.random(); +} diff --git a/tests/fixtures/patterns/php/negative.php b/tests/fixtures/patterns/php/negative.php new file mode 100644 index 00000000..018888d1 --- /dev/null +++ b/tests/fixtures/patterns/php/negative.php @@ 
-0,0 +1,25 @@ +prepare("SELECT * FROM users WHERE name = ?"); + $stmt->execute([$user]); +} + +function safe_hash($data) { + return hash("sha256", $data); +} + +function safe_random() { + return random_int(1, 100); +} + +function safe_include() { + include "config.php"; +} + +function safe_string_ops() { + $x = "hello"; + $y = strtoupper($x); + $z = strlen($y); +} diff --git a/tests/fixtures/patterns/php/positive.php b/tests/fixtures/patterns/php/positive.php new file mode 100644 index 00000000..0128d9e0 --- /dev/null +++ b/tests/fixtures/patterns/php/positive.php @@ -0,0 +1,57 @@ + 0"); +} + +// php.cmdi.system +function trigger_system($cmd) { + system($cmd); +} + +// php.deser.unserialize +function trigger_unserialize($data) { + unserialize($data); +} + +// php.sqli.query_concat +function trigger_sql_concat($user) { + mysql_query("SELECT * FROM users WHERE name = '" . $user . "'"); +} + +// php.path.include_variable +function trigger_include($path) { + include $path; +} + +// php.crypto.md5 +function trigger_md5($data) { + md5($data); +} + +// php.crypto.sha1 +function trigger_sha1($data) { + sha1($data); +} + +// php.crypto.rand +function trigger_rand() { + $r = rand(); +} diff --git a/tests/fixtures/patterns/python/negative.py b/tests/fixtures/patterns/python/negative.py new file mode 100644 index 00000000..3c8eace5 --- /dev/null +++ b/tests/fixtures/patterns/python/negative.py @@ -0,0 +1,23 @@ +# Negative fixture: none of these should trigger security patterns. 
+ +import subprocess +import hashlib + +def safe_subprocess(): + # No shell=True + subprocess.run(["ls", "-la"]) + +def safe_hash(): + hashlib.sha256(b"data") + +def safe_literal_query(cursor): + cursor.execute("SELECT COUNT(*) FROM users") + +def safe_yaml_load(data): + import yaml + yaml.safe_load(data) + +def safe_string_ops(): + x = "hello" + y = x.upper() + z = len(y) diff --git a/tests/fixtures/patterns/python/positive.py b/tests/fixtures/patterns/python/positive.py new file mode 100644 index 00000000..8063a7eb --- /dev/null +++ b/tests/fixtures/patterns/python/positive.py @@ -0,0 +1,51 @@ +# Positive fixture: each snippet should trigger the named pattern. + +import os +import subprocess +import pickle +import yaml +import hashlib + +# py.code_exec.eval +def trigger_eval(data): + result = eval(data) + +# py.code_exec.exec +def trigger_exec(code): + exec(code) + +# py.code_exec.compile +def trigger_compile(code): + co = compile(code, "", "exec") + +# py.cmdi.os_system +def trigger_os_system(cmd): + os.system(cmd) + +# py.cmdi.os_popen +def trigger_os_popen(cmd): + os.popen(cmd) + +# py.cmdi.subprocess_shell +def trigger_subprocess_shell(cmd): + subprocess.run(cmd, shell=True) + +# py.deser.pickle_loads +def trigger_pickle(data): + obj = pickle.loads(data) + +# py.deser.yaml_load +def trigger_yaml(data): + obj = yaml.load(data) + +# py.sqli.execute_format +def trigger_sql_concat(cursor, user): + cursor.execute("SELECT * FROM users WHERE name = '" + user + "'") + +# py.crypto.md5 +def trigger_md5(data): + hashlib.md5(data) + +# py.crypto.sha1 +def trigger_sha1(data): + hashlib.sha1(data) diff --git a/tests/fixtures/patterns/ruby/negative.rb b/tests/fixtures/patterns/ruby/negative.rb new file mode 100644 index 00000000..89cee092 --- /dev/null +++ b/tests/fixtures/patterns/ruby/negative.rb @@ -0,0 +1,25 @@ +# Negative fixture: none of these should trigger security patterns. 
+ +def safe_yaml(data) + YAML.safe_load(data) +end + +def safe_system + Dir.entries(".") +end + +def safe_send(obj) + obj.send(:to_s) +end + +def safe_open + File.open("config.yml", "r") do |f| + f.read + end +end + +def safe_string_ops + x = "hello" + y = x.upcase + z = y.length +end diff --git a/tests/fixtures/patterns/ruby/positive.rb b/tests/fixtures/patterns/ruby/positive.rb new file mode 100644 index 00000000..4f26c149 --- /dev/null +++ b/tests/fixtures/patterns/ruby/positive.rb @@ -0,0 +1,51 @@ +# Positive fixture: each snippet should trigger the named pattern. + +# rb.code_exec.eval +def trigger_eval(code) + eval(code) +end + +# rb.code_exec.instance_eval +def trigger_instance_eval(obj, code) + obj.instance_eval(code) +end + +# rb.code_exec.class_eval +def trigger_class_eval(klass, code) + klass.class_eval(code) +end + +# rb.cmdi.backtick +def trigger_backtick + `uname -a` +end + +# rb.cmdi.system_interp +def trigger_system_interp(cmd) + system("run #{cmd}") +end + +# rb.deser.yaml_load +def trigger_yaml_load(data) + YAML.load(data) +end + +# rb.deser.marshal_load +def trigger_marshal_load(data) + Marshal.load(data) +end + +# rb.reflection.send_dynamic +def trigger_send_dynamic(obj, method_name) + obj.send(method_name) +end + +# rb.reflection.constantize +def trigger_constantize(name) + name.constantize +end + +# rb.ssrf.open_uri +def trigger_open_uri + open("https://example.com/api") +end diff --git a/tests/fixtures/patterns/rust/negative.rs b/tests/fixtures/patterns/rust/negative.rs new file mode 100644 index 00000000..273cc41e --- /dev/null +++ b/tests/fixtures/patterns/rust/negative.rs @@ -0,0 +1,36 @@ +// Negative fixture: none of the security-relevant patterns should fire here. + +fn safe_option_handling() { + let x: Option = Some(1); + // Using match instead of unwrap + match x { + Some(v) => println!("{}", v), + None => println!("none"), + } +} + +fn safe_result_handling() -> Result<(), String> { + let x: Result = Ok(1); + // Using ? 
instead of unwrap + let _v = x?; + Ok(()) +} + +fn safe_copy() { + let src = vec![1, 2, 3]; + let mut dst = vec![0; 3]; + // Safe copy via clone + dst.clone_from(&src); +} + +fn safe_cast() { + let x: u32 = 42; + // Widening cast is fine + let _ = x as u64; +} + +fn safe_string_ops() { + let s = String::from("hello"); + let _ = s.len(); + let _ = s.is_empty(); +} diff --git a/tests/fixtures/patterns/rust/positive.rs b/tests/fixtures/patterns/rust/positive.rs new file mode 100644 index 00000000..1f95b386 --- /dev/null +++ b/tests/fixtures/patterns/rust/positive.rs @@ -0,0 +1,78 @@ +// Positive fixture: each snippet should trigger the named pattern. + +use std::mem; +use std::ptr; + +// rs.memory.transmute +fn trigger_transmute() { + let x: u32 = unsafe { mem::transmute(1.0f32) }; + let _ = x; +} + +// rs.memory.copy_nonoverlapping +fn trigger_copy_nonoverlapping() { + let src = [1u8; 4]; + let mut dst = [0u8; 4]; + unsafe { ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), 4) }; +} + +// rs.memory.get_unchecked +fn trigger_get_unchecked() { + let v = vec![1, 2, 3]; + let _ = unsafe { v.get_unchecked(0) }; +} + +// rs.memory.mem_zeroed +fn trigger_mem_zeroed() { + let _: u64 = unsafe { mem::zeroed() }; +} + +// rs.memory.ptr_read +fn trigger_ptr_read() { + let x = 42u32; + let _ = unsafe { ptr::read(&x) }; +} + +// rs.quality.unsafe_block +fn trigger_unsafe_block() { + unsafe { + let _ = 1; + } +} + +// rs.quality.unsafe_fn +unsafe fn trigger_unsafe_fn() {} + +// rs.quality.unwrap +fn trigger_unwrap() { + let x: Option = Some(1); + let _ = x.unwrap(); +} + +// rs.quality.expect +fn trigger_expect() { + let x: Option = Some(1); + let _ = x.expect("should exist"); +} + +// rs.quality.panic_macro +fn trigger_panic() { + panic!("boom"); +} + +// rs.quality.todo +fn trigger_todo() { + todo!(); +} + +// rs.memory.narrow_cast +fn trigger_narrow_cast() { + let big: u32 = 1000; + let _ = big as u8; +} + +// rs.memory.mem_forget +fn trigger_mem_forget() { + let v = 
vec![1, 2, 3]; + mem::forget(v); +} diff --git a/tests/fixtures/patterns/typescript/negative.ts b/tests/fixtures/patterns/typescript/negative.ts new file mode 100644 index 00000000..fff6602e --- /dev/null +++ b/tests/fixtures/patterns/typescript/negative.ts @@ -0,0 +1,25 @@ +// Negative fixture: none of the security-relevant patterns should fire here. + +function safeStringOps(): string { + const x: string = "hello"; + return x.toUpperCase(); +} + +function safeTimeout(fn: () => void): void { + setTimeout(fn, 1000); +} + +function safeDomManipulation(el: Element): void { + el.textContent = "safe text"; +} + +function safeTypedParam(x: number): number { + return x + 1; +} + +function safeUnknownHandling(x: unknown): string { + if (typeof x === "string") { + return x; + } + return String(x); +} diff --git a/tests/fixtures/patterns/typescript/positive.ts b/tests/fixtures/patterns/typescript/positive.ts new file mode 100644 index 00000000..88c18d1a --- /dev/null +++ b/tests/fixtures/patterns/typescript/positive.ts @@ -0,0 +1,56 @@ +// Positive fixture: each snippet should trigger the named pattern. 
+ +// ts.code_exec.eval +function triggerEval(code: string): void { + eval(code); +} + +// ts.code_exec.new_function +function triggerNewFunction(body: string): void { + const fn = new Function(body); +} + +// ts.code_exec.settimeout_string +function triggerSetTimeout(): void { + setTimeout("alert(1)", 1000); +} + +// ts.xss.document_write +function triggerDocumentWrite(data: string): void { + document.write(data); +} + +// ts.xss.outer_html +function triggerOuterHtml(el: Element, data: string): void { + el.outerHTML = data; +} + +// ts.xss.insert_adjacent_html +function triggerInsertAdjacentHtml(el: Element, data: string): void { + el.insertAdjacentHTML("beforeend", data); +} + +// ts.quality.any_annotation +function triggerAnyAnnotation(x: any): void { + console.log(x); +} + +// ts.quality.as_any +function triggerAsAny(x: unknown): void { + const y = x as any; +} + +// ts.prototype.proto_assignment +function triggerProtoAssignment(obj: Record): void { + obj.__proto__ = { malicious: true }; +} + +// ts.xss.location_assign +function triggerLocationAssign(url: string): void { + window.location = url; +} + +// ts.xss.cookie_write +function triggerCookieWrite(sid: string): void { + document.cookie = "session=" + sid; +} diff --git a/tests/fixtures/real_world/c/cfg/double_free.c b/tests/fixtures/real_world/c/cfg/double_free.c new file mode 100644 index 00000000..abf0e4b2 --- /dev/null +++ b/tests/fixtures/real_world/c/cfg/double_free.c @@ -0,0 +1,20 @@ +#include + +void double_free_bug(int flag) { + char *buf = malloc(256); + if (flag) { + free(buf); + } + free(buf); // double free if flag was true +} + +void conditional_free_safe(int flag) { + char *buf = malloc(256); + if (flag) { + free(buf); + buf = NULL; + } + if (buf != NULL) { + free(buf); + } +} diff --git a/tests/fixtures/real_world/c/cfg/double_free.expect.json b/tests/fixtures/real_world/c/cfg/double_free.expect.json new file mode 100644 index 00000000..e1d219e0 --- /dev/null +++ 
b/tests/fixtures/real_world/c/cfg/double_free.expect.json @@ -0,0 +1,24 @@ +{ + "description": "Double free when flag is true: free called twice on same pointer. Safe version nulls pointer after free.", + "tags": [ + "cfg", + "state", + "double-free" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-double-close", + "severity": null, + "must_match": false, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": "Conditional double free \u2014 if flag is true, buf freed at line 6 and again at line 8. Aspirational state analysis finding." + } + ] +} diff --git a/tests/fixtures/real_world/c/cfg/malloc_branches.c b/tests/fixtures/real_world/c/cfg/malloc_branches.c new file mode 100644 index 00000000..b44b7726 --- /dev/null +++ b/tests/fixtures/real_world/c/cfg/malloc_branches.c @@ -0,0 +1,21 @@ +#include +#include + +char *duplicate_string(const char *input) { + char *buf = malloc(strlen(input) + 1); + if (buf == NULL) return NULL; + strcpy(buf, input); + return buf; +} + +void process_data(const char *input) { + char *copy = malloc(strlen(input) + 1); + if (copy == NULL) return; + strcpy(copy, input); + + if (strlen(copy) > 100) { + return; // memory leak! 
+ } + + free(copy); +} diff --git a/tests/fixtures/real_world/c/cfg/malloc_branches.expect.json b/tests/fixtures/real_world/c/cfg/malloc_branches.expect.json new file mode 100644 index 00000000..3915892d --- /dev/null +++ b/tests/fixtures/real_world/c/cfg/malloc_branches.expect.json @@ -0,0 +1,45 @@ +{ + "description": "malloc leak on early return: process_data returns without free when strlen > 100", + "tags": [ + "cfg", + "resource-leak" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "c.memory.strcpy", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": "strcpy in duplicate_string \u2014 AST pattern match" + }, + { + "rule_id": "c.memory.strcpy", + "severity": null, + "must_match": true, + "line_range": [ + 12, + 16 + ], + "evidence_contains": [], + "notes": "strcpy in process_data \u2014 AST pattern match" + }, + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 15, + 19 + ], + "evidence_contains": [], + "notes": "malloc at line 12 not freed before return at line 17 \u2014 aspirational CFG finding" + } + ] +} diff --git a/tests/fixtures/real_world/c/cfg/resource_leak_branches.c b/tests/fixtures/real_world/c/cfg/resource_leak_branches.c new file mode 100644 index 00000000..f01298e4 --- /dev/null +++ b/tests/fixtures/real_world/c/cfg/resource_leak_branches.c @@ -0,0 +1,29 @@ +#include +#include + +int process_file(const char *path) { + FILE *f = fopen(path, "r"); + if (f == NULL) return -1; + + char buf[256]; + if (fgets(buf, sizeof(buf), f) == NULL) { + return -2; // f leaked! 
+ } + + fclose(f); + return 0; +} + +int process_file_safe(const char *path) { + FILE *f = fopen(path, "r"); + if (f == NULL) return -1; + + char buf[256]; + if (fgets(buf, sizeof(buf), f) == NULL) { + fclose(f); + return -2; + } + + fclose(f); + return 0; +} diff --git a/tests/fixtures/real_world/c/cfg/resource_leak_branches.expect.json b/tests/fixtures/real_world/c/cfg/resource_leak_branches.expect.json new file mode 100644 index 00000000..99e254d6 --- /dev/null +++ b/tests/fixtures/real_world/c/cfg/resource_leak_branches.expect.json @@ -0,0 +1,23 @@ +{ + "description": "File handle leak on early return in error branch vs safe version that closes before returning", + "tags": [ + "cfg", + "resource-leak" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 8, + 12 + ], + "evidence_contains": [], + "notes": "fopen at line 5 not closed before return -2 at line 10 \u2014 aspirational CFG finding" + } + ] +} diff --git a/tests/fixtures/real_world/c/cfg/switch_fallthrough.c b/tests/fixtures/real_world/c/cfg/switch_fallthrough.c new file mode 100644 index 00000000..f4b3c8f7 --- /dev/null +++ b/tests/fixtures/real_world/c/cfg/switch_fallthrough.c @@ -0,0 +1,20 @@ +#include +#include + +void handle_command(int cmd, char *arg) { + switch (cmd) { + case 1: + system(arg); + break; + case 2: + printf("%s\n", arg); + break; + case 3: + system(arg); // no break - falls through + case 4: + printf("Done\n"); + break; + default: + break; + } +} diff --git a/tests/fixtures/real_world/c/cfg/switch_fallthrough.expect.json b/tests/fixtures/real_world/c/cfg/switch_fallthrough.expect.json new file mode 100644 index 00000000..5c32235b --- /dev/null +++ b/tests/fixtures/real_world/c/cfg/switch_fallthrough.expect.json @@ -0,0 +1,34 @@ +{ + "description": "system() calls in switch statement \u2014 case 3 falls through without break", + "tags": [ + "cmdi", + "cfg" + ], + "modes": [ + "full" + ], + 
"expected": [ + { + "rule_id": "c.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": "system(arg) in case 1" + }, + { + "rule_id": "c.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 11, + 15 + ], + "evidence_contains": [], + "notes": "system(arg) in case 3 \u2014 falls through to case 4" + } + ] +} diff --git a/tests/fixtures/real_world/c/mixed/cmdi_and_leak.c b/tests/fixtures/real_world/c/mixed/cmdi_and_leak.c new file mode 100644 index 00000000..bd52ff5d --- /dev/null +++ b/tests/fixtures/real_world/c/mixed/cmdi_and_leak.c @@ -0,0 +1,12 @@ +#include +#include + +void dangerous_pipe(char *user_input) { + char cmd[256]; + sprintf(cmd, "cat %s", user_input); + FILE *fp = popen(cmd, "r"); + char buf[1024]; + fgets(buf, sizeof(buf), fp); + printf("%s", buf); + // pclose missing + command injection +} diff --git a/tests/fixtures/real_world/c/mixed/cmdi_and_leak.expect.json b/tests/fixtures/real_world/c/mixed/cmdi_and_leak.expect.json new file mode 100644 index 00000000..024feb3e --- /dev/null +++ b/tests/fixtures/real_world/c/mixed/cmdi_and_leak.expect.json @@ -0,0 +1,46 @@ +{ + "description": "Command injection via popen with sprintf, plus missing pclose resource leak", + "tags": [ + "cmdi", + "state", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "c.cmdi.popen", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": "popen executes shell command built from user input" + }, + { + "rule_id": "c.memory.sprintf", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": "sprintf used to build command \u2014 buffer overflow risk" + }, + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 5, + 13 + ], + "evidence_contains": [], + "notes": "popen at line 7 never pclose'd \u2014 
aspirational state finding" + } + ] +} diff --git a/tests/fixtures/real_world/c/mixed/taint_plus_leak.c b/tests/fixtures/real_world/c/mixed/taint_plus_leak.c new file mode 100644 index 00000000..c45a6cf4 --- /dev/null +++ b/tests/fixtures/real_world/c/mixed/taint_plus_leak.c @@ -0,0 +1,11 @@ +#include +#include + +void process_env() { + char *path = getenv("USER_PATH"); + FILE *f = fopen(path, "r"); + char buf[1024]; + fgets(buf, sizeof(buf), f); + printf("%s", buf); + // Both: taint (getenv -> fopen) and resource leak (f not closed) +} diff --git a/tests/fixtures/real_world/c/mixed/taint_plus_leak.expect.json b/tests/fixtures/real_world/c/mixed/taint_plus_leak.expect.json new file mode 100644 index 00000000..5c077e47 --- /dev/null +++ b/tests/fixtures/real_world/c/mixed/taint_plus_leak.expect.json @@ -0,0 +1,35 @@ +{ + "description": "Combined taint and resource leak: getenv flows to fopen path, and file handle is never closed", + "tags": [ + "taint", + "state", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 8 + ], + "evidence_contains": [], + "notes": "getenv(\"USER_PATH\") flows into fopen as file path \u2014 path traversal" + }, + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 4, + 12 + ], + "evidence_contains": [], + "notes": "fopen at line 6 never closed \u2014 aspirational state finding" + } + ] +} diff --git a/tests/fixtures/real_world/c/state/branch_state.c b/tests/fixtures/real_world/c/state/branch_state.c new file mode 100644 index 00000000..c2f36a38 --- /dev/null +++ b/tests/fixtures/real_world/c/state/branch_state.c @@ -0,0 +1,23 @@ +#include + +void branch_leak(const char *path, int flag) { + FILE *f = fopen(path, "r"); + if (flag) { + char buf[256]; + fgets(buf, sizeof(buf), f); + fclose(f); + } else { + // f leaked in else + } +} + +void both_close(const char *path, int flag) { 
+ FILE *f = fopen(path, "r"); + if (flag) { + char buf[256]; + fgets(buf, sizeof(buf), f); + fclose(f); + } else { + fclose(f); + } +} diff --git a/tests/fixtures/real_world/c/state/branch_state.expect.json b/tests/fixtures/real_world/c/state/branch_state.expect.json new file mode 100644 index 00000000..2cbe9b86 --- /dev/null +++ b/tests/fixtures/real_world/c/state/branch_state.expect.json @@ -0,0 +1,24 @@ +{ + "description": "Branching resource lifecycle: one branch closes file, other leaks. Safe version closes in both branches.", + "tags": [ + "state", + "resource-leak", + "branching" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak-possible", + "severity": null, + "must_match": true, + "line_range": [ + 2, + 13 + ], + "evidence_contains": [], + "notes": "fopen at line 4 only closed in if-branch, leaked in else-branch" + } + ] +} diff --git a/tests/fixtures/real_world/c/state/fopen_lifecycle.c b/tests/fixtures/real_world/c/state/fopen_lifecycle.c new file mode 100644 index 00000000..f4d48126 --- /dev/null +++ b/tests/fixtures/real_world/c/state/fopen_lifecycle.c @@ -0,0 +1,27 @@ +#include + +void read_leak(const char *path) { + FILE *f = fopen(path, "r"); + char buf[256]; + fgets(buf, sizeof(buf), f); +} + +void read_close(const char *path) { + FILE *f = fopen(path, "r"); + char buf[256]; + fgets(buf, sizeof(buf), f); + fclose(f); +} + +void double_close(const char *path) { + FILE *f = fopen(path, "r"); + fclose(f); + fclose(f); +} + +void use_after_close(const char *path) { + FILE *f = fopen(path, "r"); + fclose(f); + char buf[256]; + fgets(buf, sizeof(buf), f); +} diff --git a/tests/fixtures/real_world/c/state/fopen_lifecycle.expect.json b/tests/fixtures/real_world/c/state/fopen_lifecycle.expect.json new file mode 100644 index 00000000..163dff9f --- /dev/null +++ b/tests/fixtures/real_world/c/state/fopen_lifecycle.expect.json @@ -0,0 +1,45 @@ +{ + "description": "FILE* lifecycle: leak (no fclose), double close, and use after 
close patterns", + "tags": [ + "state", + "resource-lifecycle" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": true, + "line_range": [ + 2, + 8 + ], + "evidence_contains": [], + "notes": "fopen at line 4 never closed in read_leak \u2014 file handle leaked" + }, + { + "rule_id": "state-double-close", + "severity": null, + "must_match": true, + "line_range": [ + 16, + 21 + ], + "evidence_contains": [], + "notes": "fclose called twice on same FILE* in double_close" + }, + { + "rule_id": "state-use-after-close", + "severity": null, + "must_match": true, + "line_range": [ + 23, + 29 + ], + "evidence_contains": [], + "notes": "fgets called on f after fclose in use_after_close" + } + ] +} diff --git a/tests/fixtures/real_world/c/state/loop_state.c b/tests/fixtures/real_world/c/state/loop_state.c new file mode 100644 index 00000000..695ba296 --- /dev/null +++ b/tests/fixtures/real_world/c/state/loop_state.c @@ -0,0 +1,21 @@ +#include + +void loop_leak() { + int i; + for (i = 0; i < 10; i++) { + FILE *f = fopen("/tmp/test", "r"); + char buf[256]; + fgets(buf, sizeof(buf), f); + // f leaked each iteration! + } +} + +void loop_close() { + int i; + for (i = 0; i < 10; i++) { + FILE *f = fopen("/tmp/test", "r"); + char buf[256]; + fgets(buf, sizeof(buf), f); + fclose(f); + } +} diff --git a/tests/fixtures/real_world/c/state/loop_state.expect.json b/tests/fixtures/real_world/c/state/loop_state.expect.json new file mode 100644 index 00000000..433b9da9 --- /dev/null +++ b/tests/fixtures/real_world/c/state/loop_state.expect.json @@ -0,0 +1,24 @@ +{ + "description": "Resource leak in loop: file opened each iteration but never closed. 
Safe version closes each iteration.", + "tags": [ + "state", + "resource-leak", + "loop" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 4, + 11 + ], + "evidence_contains": [], + "notes": "fopen inside loop body without fclose \u2014 leaked each iteration. Aspirational: loop-scoped leak detection." + } + ] +} diff --git a/tests/fixtures/real_world/c/state/malloc_lifecycle.c b/tests/fixtures/real_world/c/state/malloc_lifecycle.c new file mode 100644 index 00000000..852e433a --- /dev/null +++ b/tests/fixtures/real_world/c/state/malloc_lifecycle.c @@ -0,0 +1,25 @@ +#include +#include + +void alloc_leak() { + char *buf = malloc(1024); + strcpy(buf, "hello"); +} + +void alloc_free() { + char *buf = malloc(1024); + strcpy(buf, "hello"); + free(buf); +} + +void double_free() { + char *buf = malloc(1024); + free(buf); + free(buf); +} + +void use_after_free() { + char *buf = malloc(1024); + free(buf); + strcpy(buf, "oops"); +} diff --git a/tests/fixtures/real_world/c/state/malloc_lifecycle.expect.json b/tests/fixtures/real_world/c/state/malloc_lifecycle.expect.json new file mode 100644 index 00000000..ffa7b95f --- /dev/null +++ b/tests/fixtures/real_world/c/state/malloc_lifecycle.expect.json @@ -0,0 +1,45 @@ +{ + "description": "malloc lifecycle: leak (no free), double free, and use after free patterns", + "tags": [ + "state", + "resource-lifecycle" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 8 + ], + "evidence_contains": [], + "notes": "malloc at line 5 never freed in alloc_leak" + }, + { + "rule_id": "state-double-close", + "severity": null, + "must_match": true, + "line_range": [ + 15, + 20 + ], + "evidence_contains": [], + "notes": "free called twice on same pointer in double_free" + }, + { + "rule_id": "state-use-after-close", + "severity": null, + 
"must_match": true, + "line_range": [ + 21, + 26 + ], + "evidence_contains": [], + "notes": "strcpy on buf after free in use_after_free" + } + ] +} diff --git a/tests/fixtures/real_world/c/taint/buffer_overflow.c b/tests/fixtures/real_world/c/taint/buffer_overflow.c new file mode 100644 index 00000000..232cfaa7 --- /dev/null +++ b/tests/fixtures/real_world/c/taint/buffer_overflow.c @@ -0,0 +1,28 @@ +#include +#include + +void copy_unsafe(char *input) { + char buf[64]; + strcpy(buf, input); + printf("%s\n", buf); +} + +void copy_safe(char *input) { + char buf[64]; + strncpy(buf, input, sizeof(buf) - 1); + buf[sizeof(buf) - 1] = '\0'; + printf("%s\n", buf); +} + +void gets_vuln() { + char buf[128]; + gets(buf); + printf("%s\n", buf); +} + +void concat_vuln(char *src1, char *src2) { + char buf[64]; + strcpy(buf, src1); + strcat(buf, src2); + printf("%s\n", buf); +} diff --git a/tests/fixtures/real_world/c/taint/buffer_overflow.expect.json b/tests/fixtures/real_world/c/taint/buffer_overflow.expect.json new file mode 100644 index 00000000..12922e52 --- /dev/null +++ b/tests/fixtures/real_world/c/taint/buffer_overflow.expect.json @@ -0,0 +1,56 @@ +{ + "description": "Classic C buffer overflow patterns: strcpy, gets, strcat without bounds checking", + "tags": [ + "mem", + "buffer-overflow" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "c.memory.strcpy", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": "strcpy without bounds check \u2014 potential buffer overflow" + }, + { + "rule_id": "c.memory.gets", + "severity": null, + "must_match": true, + "line_range": [ + 17, + 21 + ], + "evidence_contains": [], + "notes": "gets() is always unsafe \u2014 no way to limit input length" + }, + { + "rule_id": "c.memory.strcpy", + "severity": null, + "must_match": true, + "line_range": [ + 23, + 27 + ], + "evidence_contains": [], + "notes": "strcpy in concat_vuln \u2014 first unbounded copy" + }, + { + 
"rule_id": "c.memory.strcat", + "severity": null, + "must_match": true, + "line_range": [ + 24, + 28 + ], + "evidence_contains": [], + "notes": "strcat without bounds check \u2014 appends without size limit" + } + ] +} diff --git a/tests/fixtures/real_world/c/taint/cmdi_getenv.c b/tests/fixtures/real_world/c/taint/cmdi_getenv.c new file mode 100644 index 00000000..0ec00335 --- /dev/null +++ b/tests/fixtures/real_world/c/taint/cmdi_getenv.c @@ -0,0 +1,16 @@ +#include +#include +#include + +void run_from_env() { + char *cmd = getenv("USER_CMD"); + system(cmd); +} + +void run_safe() { + char *cmd = getenv("USER_CMD"); + if (cmd == NULL) return; + if (strcmp(cmd, "ls") == 0 || strcmp(cmd, "date") == 0) { + system(cmd); + } +} diff --git a/tests/fixtures/real_world/c/taint/cmdi_getenv.expect.json b/tests/fixtures/real_world/c/taint/cmdi_getenv.expect.json new file mode 100644 index 00000000..86171e6c --- /dev/null +++ b/tests/fixtures/real_world/c/taint/cmdi_getenv.expect.json @@ -0,0 +1,56 @@ +{ + "description": "getenv flows directly to system() call \u2014 classic command injection. 
Safe version uses allowlist comparison.", + "tags": [ + "taint", + "cmdi" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "c.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": "AST pattern detects system() call" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 9 + ], + "evidence_contains": [], + "notes": "getenv(\"USER_CMD\") flows directly into system(cmd) without sanitization" + }, + { + "rule_id": "c.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 12, + 16 + ], + "evidence_contains": [], + "notes": "AST pattern still fires on system() in safe version \u2014 pattern is syntactic" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 9, + 16 + ], + "evidence_contains": [], + "notes": "Safe version with strcmp allowlist \u2014 ideally suppressed but scanner may not model strcmp as validation" + } + ] +} diff --git a/tests/fixtures/real_world/c/taint/cmdi_popen.c b/tests/fixtures/real_world/c/taint/cmdi_popen.c new file mode 100644 index 00000000..bf2ad086 --- /dev/null +++ b/tests/fixtures/real_world/c/taint/cmdi_popen.c @@ -0,0 +1,14 @@ +#include +#include +#include + +void execute_cmd(char *user_input) { + char cmd[256]; + sprintf(cmd, "grep -r '%s' /var/log/", user_input); + FILE *fp = popen(cmd, "r"); + char buf[1024]; + while (fgets(buf, sizeof(buf), fp)) { + printf("%s", buf); + } + pclose(fp); +} diff --git a/tests/fixtures/real_world/c/taint/cmdi_popen.expect.json b/tests/fixtures/real_world/c/taint/cmdi_popen.expect.json new file mode 100644 index 00000000..5654ca6b --- /dev/null +++ b/tests/fixtures/real_world/c/taint/cmdi_popen.expect.json @@ -0,0 +1,45 @@ +{ + "description": "Command injection via popen: user input interpolated into shell command via sprintf then passed to popen", + "tags": [ + "taint", + "cmdi" + 
], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "c.cmdi.popen", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": "popen() executes shell command constructed from user input" + }, + { + "rule_id": "c.memory.sprintf", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": "sprintf used to build command string \u2014 buffer overflow risk" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 10 + ], + "evidence_contains": [], + "notes": "user_input flows through sprintf into popen \u2014 taint engine may not track through sprintf buffer" + } + ] +} diff --git a/tests/fixtures/real_world/c/taint/env_to_file.c b/tests/fixtures/real_world/c/taint/env_to_file.c new file mode 100644 index 00000000..c59a3082 --- /dev/null +++ b/tests/fixtures/real_world/c/taint/env_to_file.c @@ -0,0 +1,9 @@ +#include +#include + +void write_config() { + char *path = getenv("CONFIG_PATH"); + FILE *f = fopen(path, "w"); + fprintf(f, "config data\n"); + fclose(f); +} diff --git a/tests/fixtures/real_world/c/taint/env_to_file.expect.json b/tests/fixtures/real_world/c/taint/env_to_file.expect.json new file mode 100644 index 00000000..99c373bc --- /dev/null +++ b/tests/fixtures/real_world/c/taint/env_to_file.expect.json @@ -0,0 +1,23 @@ +{ + "description": "Environment variable flows to fopen path \u2014 path traversal / arbitrary file write", + "tags": [ + "taint", + "path-traversal" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 8 + ], + "evidence_contains": [], + "notes": "getenv(\"CONFIG_PATH\") flows directly into fopen as file path \u2014 arbitrary file write" + } + ] +} diff --git a/tests/fixtures/real_world/c/taint/format_string.c b/tests/fixtures/real_world/c/taint/format_string.c new 
file mode 100644 index 00000000..93e2b0fe --- /dev/null +++ b/tests/fixtures/real_world/c/taint/format_string.c @@ -0,0 +1,14 @@ +#include +#include + +void print_user_input(char *input) { + printf(input); +} + +void print_safe(char *input) { + printf("%s", input); +} + +void sprintf_vuln(char *buf, char *user_input) { + sprintf(buf, user_input); +} diff --git a/tests/fixtures/real_world/c/taint/format_string.expect.json b/tests/fixtures/real_world/c/taint/format_string.expect.json new file mode 100644 index 00000000..bb82dada --- /dev/null +++ b/tests/fixtures/real_world/c/taint/format_string.expect.json @@ -0,0 +1,34 @@ +{ + "description": "Format string vulnerabilities: printf with user-controlled format and sprintf with user-controlled format", + "tags": [ + "taint", + "fmt" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "c.memory.printf_no_fmt", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": "printf(input) \u2014 user-controlled format string" + }, + { + "rule_id": "c.memory.sprintf", + "severity": null, + "must_match": true, + "line_range": [ + 11, + 15 + ], + "evidence_contains": [], + "notes": "sprintf with user-controlled format string \u2014 both format vuln and buffer overflow risk" + } + ] +} diff --git a/tests/fixtures/real_world/c/taint/scanf_overflow.c b/tests/fixtures/real_world/c/taint/scanf_overflow.c new file mode 100644 index 00000000..5c1e2b9d --- /dev/null +++ b/tests/fixtures/real_world/c/taint/scanf_overflow.c @@ -0,0 +1,13 @@ +#include + +void read_input() { + char name[32]; + scanf("%s", name); + printf("Hello, %s\n", name); +} + +void read_safe() { + char name[32]; + scanf("%31s", name); + printf("Hello, %s\n", name); +} diff --git a/tests/fixtures/real_world/c/taint/scanf_overflow.expect.json b/tests/fixtures/real_world/c/taint/scanf_overflow.expect.json new file mode 100644 index 00000000..91c40491 --- /dev/null +++ 
b/tests/fixtures/real_world/c/taint/scanf_overflow.expect.json @@ -0,0 +1,34 @@ +{ + "description": "scanf with unbounded %s format specifier vs safe %31s with width limit", + "tags": [ + "mem", + "scanf" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "c.memory.scanf_percent_s", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": "scanf(\"%s\", name) \u2014 unbounded read into stack buffer" + }, + { + "rule_id": "c.memory.scanf_percent_s", + "severity": null, + "must_match": false, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": "scanf(\"%31s\", name) \u2014 bounded read, ideally no finding but AST pattern may still fire" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/cfg/lambda_capture.cpp b/tests/fixtures/real_world/cpp/cfg/lambda_capture.cpp new file mode 100644 index 00000000..05d791d8 --- /dev/null +++ b/tests/fixtures/real_world/cpp/cfg/lambda_capture.cpp @@ -0,0 +1,10 @@ +#include +#include +#include + +std::function<void()> create_dangerous_lambda(const char *user_input) { + std::string cmd = std::string("echo ") + user_input; + return [cmd]() { + system(cmd.c_str()); + }; +} diff --git a/tests/fixtures/real_world/cpp/cfg/lambda_capture.expect.json b/tests/fixtures/real_world/cpp/cfg/lambda_capture.expect.json new file mode 100644 index 00000000..7ef9f9f7 --- /dev/null +++ b/tests/fixtures/real_world/cpp/cfg/lambda_capture.expect.json @@ -0,0 +1,24 @@ +{ + "description": "system() call inside lambda capturing user input by value", + "tags": [ + "cmdi", + "cfg", + "lambda" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cpp.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": "system() called inside lambda with captured user-derived command string" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/cfg/namespace_scope.cpp
b/tests/fixtures/real_world/cpp/cfg/namespace_scope.cpp new file mode 100644 index 00000000..44955524 --- /dev/null +++ b/tests/fixtures/real_world/cpp/cfg/namespace_scope.cpp @@ -0,0 +1,19 @@ +#include +#include + +namespace security { + void validate(const char *input) { + if (input == nullptr) return; + } +} + +namespace execution { + void run(const char *cmd) { + system(cmd); + } +} + +void handler(const char *user_input) { + security::validate(user_input); + execution::run(user_input); +} diff --git a/tests/fixtures/real_world/cpp/cfg/namespace_scope.expect.json b/tests/fixtures/real_world/cpp/cfg/namespace_scope.expect.json new file mode 100644 index 00000000..f47a9a24 --- /dev/null +++ b/tests/fixtures/real_world/cpp/cfg/namespace_scope.expect.json @@ -0,0 +1,35 @@ +{ + "description": "system() in namespace \u2014 validation function does not actually sanitize input", + "tags": [ + "cmdi", + "cfg", + "namespace" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cpp.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 10, + 14 + ], + "evidence_contains": [], + "notes": "system(cmd) in execution::run \u2014 AST pattern detects system() call" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 15, + 20 + ], + "evidence_contains": [], + "notes": "user_input flows through validate (which only null-checks) to system \u2014 aspirational cross-function taint" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/cfg/raii_vs_manual.cpp b/tests/fixtures/real_world/cpp/cfg/raii_vs_manual.cpp new file mode 100644 index 00000000..d16f76ce --- /dev/null +++ b/tests/fixtures/real_world/cpp/cfg/raii_vs_manual.cpp @@ -0,0 +1,19 @@ +#include +#include +#include + +std::string read_raii(const char *path) { + std::ifstream file(path); + std::string content; + std::getline(file, content); + return content; + // RAII: ifstream destructor closes +} + +std::string read_manual(const 
char *path) { + FILE *f = fopen(path, "r"); + char buf[256]; + fgets(buf, sizeof(buf), f); + // f not closed -- manual leak + return std::string(buf); +} diff --git a/tests/fixtures/real_world/cpp/cfg/raii_vs_manual.expect.json b/tests/fixtures/real_world/cpp/cfg/raii_vs_manual.expect.json new file mode 100644 index 00000000..9aea24a2 --- /dev/null +++ b/tests/fixtures/real_world/cpp/cfg/raii_vs_manual.expect.json @@ -0,0 +1,24 @@ +{ + "description": "RAII ifstream vs manual FILE* \u2014 RAII auto-closes, manual version leaks", + "tags": [ + "cfg", + "resource-leak", + "raii" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 12, + 20 + ], + "evidence_contains": [], + "notes": "fopen at line 14 never fclose'd in read_manual \u2014 aspirational CFG finding" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/cfg/try_catch.cpp b/tests/fixtures/real_world/cpp/cfg/try_catch.cpp new file mode 100644 index 00000000..0c12c1b8 --- /dev/null +++ b/tests/fixtures/real_world/cpp/cfg/try_catch.cpp @@ -0,0 +1,30 @@ +#include +#include + +void process_file(const char *path) { + FILE *f = fopen(path, "r"); + try { + char buf[256]; + if (fgets(buf, sizeof(buf), f) == NULL) { + throw std::runtime_error("read failed"); + } + fclose(f); + } catch (...) { + // f leaked in catch + throw; + } +} + +void process_safe(const char *path) { + FILE *f = fopen(path, "r"); + try { + char buf[256]; + if (fgets(buf, sizeof(buf), f) == NULL) { + fclose(f); + throw std::runtime_error("read failed"); + } + fclose(f); + } catch (...) 
{ + throw; + } +} diff --git a/tests/fixtures/real_world/cpp/cfg/try_catch.expect.json b/tests/fixtures/real_world/cpp/cfg/try_catch.expect.json new file mode 100644 index 00000000..da2369f4 --- /dev/null +++ b/tests/fixtures/real_world/cpp/cfg/try_catch.expect.json @@ -0,0 +1,24 @@ +{ + "description": "Resource leak in exception path: fopen not closed when exception thrown. Safe version closes before throw.", + "tags": [ + "cfg", + "resource-leak", + "exception" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 17 + ], + "evidence_contains": [], + "notes": "fopen at line 5 leaked when exception thrown at line 9 \u2014 catch block re-throws without closing. Aspirational." + } + ] +} diff --git a/tests/fixtures/real_world/cpp/mixed/cmdi_format.cpp b/tests/fixtures/real_world/cpp/mixed/cmdi_format.cpp new file mode 100644 index 00000000..14dfed9c --- /dev/null +++ b/tests/fixtures/real_world/cpp/mixed/cmdi_format.cpp @@ -0,0 +1,10 @@ +#include +#include +#include + +void dangerous(const char *user_input) { + char cmd[256]; + sprintf(cmd, "cat %s", user_input); + system(cmd); + printf(user_input); // also format string vuln +} diff --git a/tests/fixtures/real_world/cpp/mixed/cmdi_format.expect.json b/tests/fixtures/real_world/cpp/mixed/cmdi_format.expect.json new file mode 100644 index 00000000..af510575 --- /dev/null +++ b/tests/fixtures/real_world/cpp/mixed/cmdi_format.expect.json @@ -0,0 +1,35 @@ +{ + "description": "Multiple vulnerabilities: command injection via system() and format string via printf(user_input)", + "tags": [ + "cmdi", + "fmt", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cpp.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": "system(cmd) where cmd built from user input via sprintf" + }, + { + "rule_id": "cpp.memory.printf_no_fmt", + 
"severity": null, + "must_match": true, + "line_range": [ + 7, + 11 + ], + "evidence_contains": [], + "notes": "printf(user_input) \u2014 user-controlled format string" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/mixed/taint_leak.cpp b/tests/fixtures/real_world/cpp/mixed/taint_leak.cpp new file mode 100644 index 00000000..ad3a858a --- /dev/null +++ b/tests/fixtures/real_world/cpp/mixed/taint_leak.cpp @@ -0,0 +1,10 @@ +#include +#include + +void env_leak() { + const char *path = std::getenv("USER_PATH"); + FILE *f = fopen(path, "r"); + char buf[1024]; + fgets(buf, sizeof(buf), f); + // taint (getenv -> fopen) + resource leak +} diff --git a/tests/fixtures/real_world/cpp/mixed/taint_leak.expect.json b/tests/fixtures/real_world/cpp/mixed/taint_leak.expect.json new file mode 100644 index 00000000..80c423d1 --- /dev/null +++ b/tests/fixtures/real_world/cpp/mixed/taint_leak.expect.json @@ -0,0 +1,35 @@ +{ + "description": "Combined taint and resource leak: std::getenv flows to fopen, file handle never closed", + "tags": [ + "taint", + "state", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 8 + ], + "evidence_contains": [], + "notes": "std::getenv(\"USER_PATH\") flows to fopen as file path \u2014 path traversal" + }, + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 4, + 11 + ], + "evidence_contains": [], + "notes": "fopen at line 6 never closed \u2014 aspirational state finding" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/state/fopen_lifecycle.cpp b/tests/fixtures/real_world/cpp/state/fopen_lifecycle.cpp new file mode 100644 index 00000000..a258af6f --- /dev/null +++ b/tests/fixtures/real_world/cpp/state/fopen_lifecycle.cpp @@ -0,0 +1,27 @@ +#include + +void leak() { + FILE *f = fopen("/tmp/test", "r"); + char buf[256]; + fgets(buf, sizeof(buf), f); +} + +void clean() { + FILE *f 
= fopen("/tmp/test", "r"); + char buf[256]; + fgets(buf, sizeof(buf), f); + fclose(f); +} + +void double_close() { + FILE *f = fopen("/tmp/test", "r"); + fclose(f); + fclose(f); +} + +void use_after_close() { + FILE *f = fopen("/tmp/test", "r"); + fclose(f); + char buf[256]; + fgets(buf, sizeof(buf), f); +} diff --git a/tests/fixtures/real_world/cpp/state/fopen_lifecycle.expect.json b/tests/fixtures/real_world/cpp/state/fopen_lifecycle.expect.json new file mode 100644 index 00000000..c7fbe8d9 --- /dev/null +++ b/tests/fixtures/real_world/cpp/state/fopen_lifecycle.expect.json @@ -0,0 +1,45 @@ +{ + "description": "C++ FILE* lifecycle patterns: leak, double close, use after close", + "tags": [ + "state", + "resource-lifecycle" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": true, + "line_range": [ + 2, + 8 + ], + "evidence_contains": [], + "notes": "fopen at line 4 never closed in leak()" + }, + { + "rule_id": "state-double-close", + "severity": null, + "must_match": true, + "line_range": [ + 16, + 21 + ], + "evidence_contains": [], + "notes": "fclose called twice on same FILE* in double_close()" + }, + { + "rule_id": "state-use-after-close", + "severity": null, + "must_match": true, + "line_range": [ + 23, + 29 + ], + "evidence_contains": [], + "notes": "fgets on f after fclose in use_after_close()" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/state/malloc_branches.cpp b/tests/fixtures/real_world/cpp/state/malloc_branches.cpp new file mode 100644 index 00000000..f990feb6 --- /dev/null +++ b/tests/fixtures/real_world/cpp/state/malloc_branches.cpp @@ -0,0 +1,11 @@ +#include +#include + +void branch_leak(int flag) { + char *buf = (char*)malloc(256); + if (flag) { + strcpy(buf, "hello"); + free(buf); + } + // buf leaked if !flag +} diff --git a/tests/fixtures/real_world/cpp/state/malloc_branches.expect.json b/tests/fixtures/real_world/cpp/state/malloc_branches.expect.json new file mode 
100644 index 00000000..1cfdd586 --- /dev/null +++ b/tests/fixtures/real_world/cpp/state/malloc_branches.expect.json @@ -0,0 +1,24 @@ +{ + "description": "C++ malloc branch leak: only freed in one branch of conditional", + "tags": [ + "state", + "resource-leak", + "branching" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak-possible", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 12 + ], + "evidence_contains": [], + "notes": "malloc at line 5 only freed when flag is true \u2014 aspirational branch-aware state analysis" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/state/new_delete.cpp b/tests/fixtures/real_world/cpp/state/new_delete.cpp new file mode 100644 index 00000000..99873aae --- /dev/null +++ b/tests/fixtures/real_world/cpp/state/new_delete.cpp @@ -0,0 +1,18 @@ +#include + +void leak() { + char *buf = new char[1024]; + strcpy(buf, "hello"); +} + +void clean() { + char *buf = new char[1024]; + strcpy(buf, "hello"); + delete[] buf; +} + +void double_delete() { + char *buf = new char[1024]; + delete[] buf; + delete[] buf; +} diff --git a/tests/fixtures/real_world/cpp/state/new_delete.expect.json b/tests/fixtures/real_world/cpp/state/new_delete.expect.json new file mode 100644 index 00000000..e11fc2d5 --- /dev/null +++ b/tests/fixtures/real_world/cpp/state/new_delete.expect.json @@ -0,0 +1,34 @@ +{ + "description": "C++ new[]/delete[] lifecycle: leak and double delete patterns", + "tags": [ + "state", + "resource-lifecycle" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 7 + ], + "evidence_contains": [], + "notes": "new char[1024] at line 4 never deleted \u2014 aspirational, requires new/delete tracking" + }, + { + "rule_id": "state-double-close", + "severity": null, + "must_match": false, + "line_range": [ + 14, + 19 + ], + "evidence_contains": [], + "notes": "delete[] called twice \u2014 
aspirational, requires new/delete tracking" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/state/smart_ptr_vs_raw.cpp b/tests/fixtures/real_world/cpp/state/smart_ptr_vs_raw.cpp new file mode 100644 index 00000000..8addcc37 --- /dev/null +++ b/tests/fixtures/real_world/cpp/state/smart_ptr_vs_raw.cpp @@ -0,0 +1,12 @@ +#include +#include + +void smart_clean() { + auto ptr = std::make_unique<int>(42); + // automatically cleaned up +} + +void raw_leak() { + int *ptr = new int(42); + // never deleted +} diff --git a/tests/fixtures/real_world/cpp/state/smart_ptr_vs_raw.expect.json b/tests/fixtures/real_world/cpp/state/smart_ptr_vs_raw.expect.json new file mode 100644 index 00000000..98f1c8c1 --- /dev/null +++ b/tests/fixtures/real_world/cpp/state/smart_ptr_vs_raw.expect.json @@ -0,0 +1,24 @@ +{ + "description": "Smart pointer vs raw new: unique_ptr auto-cleans, raw pointer leaks", + "tags": [ + "state", + "resource-leak", + "smart-pointer" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 8, + 13 + ], + "evidence_contains": [], + "notes": "new int(42) at line 10 never deleted \u2014 aspirational, requires new/delete tracking" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/taint/cmdi_system.cpp b/tests/fixtures/real_world/cpp/taint/cmdi_system.cpp new file mode 100644 index 00000000..a94c6cfb --- /dev/null +++ b/tests/fixtures/real_world/cpp/taint/cmdi_system.cpp @@ -0,0 +1,16 @@ +#include +#include + +void execute_user_cmd() { + const char *cmd = std::getenv("USER_CMD"); + system(cmd); +} + +void execute_safe() { + const char *cmd = std::getenv("USER_CMD"); + if (cmd == nullptr) return; + std::string s(cmd); + if (s == "ls" || s == "date") { + system(cmd); + } +} diff --git a/tests/fixtures/real_world/cpp/taint/cmdi_system.expect.json b/tests/fixtures/real_world/cpp/taint/cmdi_system.expect.json new file mode 100644 index 00000000..ee61f41b --- /dev/null +++
b/tests/fixtures/real_world/cpp/taint/cmdi_system.expect.json @@ -0,0 +1,45 @@ +{ + "description": "C++ command injection: std::getenv flows to system(). Safe version uses allowlist.", + "tags": [ + "taint", + "cmdi" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cpp.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": "system(cmd) where cmd comes from std::getenv" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 8 + ], + "evidence_contains": [], + "notes": "std::getenv flows directly to system() without sanitization" + }, + { + "rule_id": "cpp.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 12, + 16 + ], + "evidence_contains": [], + "notes": "AST pattern still matches system() in safe version" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/taint/env_to_system.cpp b/tests/fixtures/real_world/cpp/taint/env_to_system.cpp new file mode 100644 index 00000000..5f07f817 --- /dev/null +++ b/tests/fixtures/real_world/cpp/taint/env_to_system.cpp @@ -0,0 +1,9 @@ +#include +#include + +int main() { + char *home = std::getenv("HOME"); + std::string cmd = "ls " + std::string(home); + system(cmd.c_str()); + return 0; +} diff --git a/tests/fixtures/real_world/cpp/taint/env_to_system.expect.json b/tests/fixtures/real_world/cpp/taint/env_to_system.expect.json new file mode 100644 index 00000000..8c06753a --- /dev/null +++ b/tests/fixtures/real_world/cpp/taint/env_to_system.expect.json @@ -0,0 +1,34 @@ +{ + "description": "Environment variable concatenated into system() call \u2014 command injection via HOME", + "tags": [ + "taint", + "cmdi" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cpp.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": "system() called with command built from 
std::getenv(\"HOME\")" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 9 + ], + "evidence_contains": [], + "notes": "std::getenv flows through string concatenation into system()" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/taint/format_string.cpp b/tests/fixtures/real_world/cpp/taint/format_string.cpp new file mode 100644 index 00000000..2fa6a488 --- /dev/null +++ b/tests/fixtures/real_world/cpp/taint/format_string.cpp @@ -0,0 +1,10 @@ +#include +#include + +void print_unsafe(const char *user_input) { + printf(user_input); +} + +void print_safe(const char *user_input) { + printf("%s", user_input); +} diff --git a/tests/fixtures/real_world/cpp/taint/format_string.expect.json b/tests/fixtures/real_world/cpp/taint/format_string.expect.json new file mode 100644 index 00000000..bd1ce8e4 --- /dev/null +++ b/tests/fixtures/real_world/cpp/taint/format_string.expect.json @@ -0,0 +1,23 @@ +{ + "description": "C++ format string vulnerability: printf with user-controlled format argument", + "tags": [ + "taint", + "fmt" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cpp.memory.printf_no_fmt", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": "printf(user_input) \u2014 user-controlled format string" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/taint/gets_strcpy.cpp b/tests/fixtures/real_world/cpp/taint/gets_strcpy.cpp new file mode 100644 index 00000000..183aa496 --- /dev/null +++ b/tests/fixtures/real_world/cpp/taint/gets_strcpy.cpp @@ -0,0 +1,12 @@ +#include +#include + +void copy_unsafe(const char *input) { + char buf[64]; + strcpy(buf, input); +} + +void gets_input() { + char buf[128]; + gets(buf); +} diff --git a/tests/fixtures/real_world/cpp/taint/gets_strcpy.expect.json b/tests/fixtures/real_world/cpp/taint/gets_strcpy.expect.json new file mode 100644 index 00000000..ea1861a0 --- /dev/null +++ 
b/tests/fixtures/real_world/cpp/taint/gets_strcpy.expect.json @@ -0,0 +1,34 @@ +{ + "description": "C++ legacy C function usage: strcpy and gets without bounds checking", + "tags": [ + "mem", + "buffer-overflow" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cpp.memory.strcpy", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": "strcpy without bounds check in C++ code" + }, + { + "rule_id": "cpp.memory.gets", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": "gets() always unsafe \u2014 no bounds checking" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/taint/popen_cmd.cpp b/tests/fixtures/real_world/cpp/taint/popen_cmd.cpp new file mode 100644 index 00000000..b7e05b80 --- /dev/null +++ b/tests/fixtures/real_world/cpp/taint/popen_cmd.cpp @@ -0,0 +1,13 @@ +#include +#include +#include + +void run_command(const std::string &user_input) { + std::string cmd = "grep " + user_input + " /var/log/syslog"; + FILE *fp = popen(cmd.c_str(), "r"); + char buf[1024]; + while (fgets(buf, sizeof(buf), fp)) { + printf("%s", buf); + } + pclose(fp); +} diff --git a/tests/fixtures/real_world/cpp/taint/popen_cmd.expect.json b/tests/fixtures/real_world/cpp/taint/popen_cmd.expect.json new file mode 100644 index 00000000..07e3d292 --- /dev/null +++ b/tests/fixtures/real_world/cpp/taint/popen_cmd.expect.json @@ -0,0 +1,23 @@ +{ + "description": "Command injection via popen: user input concatenated into shell command string", + "tags": [ + "taint", + "cmdi" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cpp.cmdi.popen", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": "popen executes command string built from user input via string concatenation" + } + ] +} diff --git a/tests/fixtures/real_world/cpp/taint/reinterpret_cast.cpp 
b/tests/fixtures/real_world/cpp/taint/reinterpret_cast.cpp new file mode 100644 index 00000000..89e4bacc --- /dev/null +++ b/tests/fixtures/real_world/cpp/taint/reinterpret_cast.cpp @@ -0,0 +1,12 @@ +#include +#include + +struct Header { + int type; + int length; +}; + +void parse_packet(const char *data) { + Header *hdr = reinterpret_cast<Header*>(const_cast<char*>(data)); + printf("Type: %d, Length: %d\n", hdr->type, hdr->length); +} diff --git a/tests/fixtures/real_world/cpp/taint/reinterpret_cast.expect.json b/tests/fixtures/real_world/cpp/taint/reinterpret_cast.expect.json new file mode 100644 index 00000000..b833c22f --- /dev/null +++ b/tests/fixtures/real_world/cpp/taint/reinterpret_cast.expect.json @@ -0,0 +1,34 @@ +{ + "description": "Dangerous C++ casts: reinterpret_cast and const_cast used to parse raw data", + "tags": [ + "cast", + "unsafe" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cpp.memory.reinterpret_cast", + "severity": null, + "must_match": true, + "line_range": [ + 8, + 12 + ], + "evidence_contains": [], + "notes": "reinterpret_cast \u2014 type punning raw bytes to struct pointer" + }, + { + "rule_id": "cpp.memory.const_cast", + "severity": null, + "must_match": true, + "line_range": [ + 8, + 12 + ], + "evidence_contains": [], + "notes": "const_cast removes const qualifier from data pointer" + } + ] +} diff --git a/tests/fixtures/real_world/go/cfg/defer_close.expect.json b/tests/fixtures/real_world/go/cfg/defer_close.expect.json new file mode 100644 index 00000000..efc8d16f --- /dev/null +++ b/tests/fixtures/real_world/go/cfg/defer_close.expect.json @@ -0,0 +1,23 @@ +{ + "description": "Idiomatic Go defer close vs manual close with early-return leak path", + "tags": [ + "cfg", + "resource-leak" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 21, + 35 + ], + "evidence_contains": [], + "notes": "readLeaky returns on line 32 error path
without closing f \u2014 missing defer f.Close()" + } + ] +} diff --git a/tests/fixtures/real_world/go/cfg/defer_close.go b/tests/fixtures/real_world/go/cfg/defer_close.go new file mode 100644 index 00000000..3983be5e --- /dev/null +++ b/tests/fixtures/real_world/go/cfg/defer_close.go @@ -0,0 +1,36 @@ +package main + +import ( + "os" +) + +func readSafe(path string) ([]byte, error) { + f, err := os.Open(path) + if err != nil { + return nil, err + } + defer f.Close() + + buf := make([]byte, 1024) + n, err := f.Read(buf) + if err != nil { + return nil, err + } + return buf[:n], nil +} + +func readLeaky(path string) ([]byte, error) { + f, err := os.Open(path) + if err != nil { + return nil, err + } + // Missing defer f.Close() + + buf := make([]byte, 1024) + n, err := f.Read(buf) + if err != nil { + return nil, err // f leaked + } + f.Close() + return buf[:n], nil +} diff --git a/tests/fixtures/real_world/go/cfg/error_handling.expect.json b/tests/fixtures/real_world/go/cfg/error_handling.expect.json new file mode 100644 index 00000000..3c4642f2 --- /dev/null +++ b/tests/fixtures/real_world/go/cfg/error_handling.expect.json @@ -0,0 +1,47 @@ +{ + "description": "Error fallthrough in handleRequest allows exec with potentially zero-value command; safe version returns on error", + "tags": [ + "cfg", + "cmdi", + "error-handling" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "go.cmdi.exec_command", + "severity": null, + "must_match": true, + "line_range": [ + 19, + 23 + ], + "evidence_contains": [], + "notes": "AST pattern detects exec.Command in handleRequest" + }, + { + "rule_id": "go.cmdi.exec_command", + "severity": null, + "must_match": true, + "line_range": [ + 34, + 38 + ], + "evidence_contains": [], + "notes": "AST pattern detects exec.Command in handleRequestSafe" + }, + { + "rule_id": "cfg-error-fallthrough", + "severity": null, + "must_match": false, + "line_range": [ + 14, + 23 + ], + "evidence_contains": [], + "notes": "Error from 
json.Decode not returned \u2014 execution falls through to exec.Command" + } + ] +} diff --git a/tests/fixtures/real_world/go/cfg/error_handling.go b/tests/fixtures/real_world/go/cfg/error_handling.go new file mode 100644 index 00000000..79fcba9a --- /dev/null +++ b/tests/fixtures/real_world/go/cfg/error_handling.go @@ -0,0 +1,38 @@ +package main + +import ( + "encoding/json" + "fmt" + "net/http" + "os/exec" +) + +func handleRequest(w http.ResponseWriter, r *http.Request) { + var req struct { + Command string `json:"command"` + } + + err := json.NewDecoder(r.Body).Decode(&req) + if err != nil { + fmt.Println("bad request") + // falls through! + } + + cmd := exec.Command("sh", "-c", req.Command) + cmd.Run() +} + +func handleRequestSafe(w http.ResponseWriter, r *http.Request) { + var req struct { + Command string `json:"command"` + } + + err := json.NewDecoder(r.Body).Decode(&req) + if err != nil { + http.Error(w, "Bad request", 400) + return + } + + cmd := exec.Command("sh", "-c", req.Command) + cmd.Run() +} diff --git a/tests/fixtures/real_world/go/cfg/panic_recover.expect.json b/tests/fixtures/real_world/go/cfg/panic_recover.expect.json new file mode 100644 index 00000000..81cd3b0f --- /dev/null +++ b/tests/fixtures/real_world/go/cfg/panic_recover.expect.json @@ -0,0 +1,24 @@ +{ + "description": "Resource leak on panic path in riskyOperation; safeOperation uses defer; recoverWrapper catches panics", + "tags": [ + "cfg", + "resource-leak", + "panic" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 7, + 18 + ], + "evidence_contains": [], + "notes": "If code between os.Open and f.Close panics, f is leaked \u2014 no defer f.Close()" + } + ] +} diff --git a/tests/fixtures/real_world/go/cfg/panic_recover.go b/tests/fixtures/real_world/go/cfg/panic_recover.go new file mode 100644 index 00000000..d666eefb --- /dev/null +++ b/tests/fixtures/real_world/go/cfg/panic_recover.go 
@@ -0,0 +1,36 @@ +package main + +import ( + "fmt" + "os" +) + +func riskyOperation() { + f, err := os.Open("/tmp/test") + if err != nil { + panic(err) + } + // f leaked on panic path + buf := make([]byte, 1024) + f.Read(buf) + f.Close() +} + +func safeOperation() { + f, err := os.Open("/tmp/test") + if err != nil { + panic(err) + } + defer f.Close() + buf := make([]byte, 1024) + f.Read(buf) +} + +func recoverWrapper() { + defer func() { + if r := recover(); r != nil { + fmt.Println("Recovered:", r) + } + }() + riskyOperation() +} diff --git a/tests/fixtures/real_world/go/cfg/select_goroutine.expect.json b/tests/fixtures/real_world/go/cfg/select_goroutine.expect.json new file mode 100644 index 00000000..099b2891 --- /dev/null +++ b/tests/fixtures/real_world/go/cfg/select_goroutine.expect.json @@ -0,0 +1,36 @@ +{ + "description": "Goroutine with select-based timeout running user-controlled command via exec.Command", + "tags": [ + "cfg", + "cmdi", + "concurrency" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "go.cmdi.exec_command", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": "AST pattern detects exec.Command in runWithTimeout" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 27, + 32 + ], + "evidence_contains": [], + "notes": "os.Getenv(\"CMD\") flows through runWithTimeout to exec.Command \u2014 cross-function taint may not resolve" + } + ] +} diff --git a/tests/fixtures/real_world/go/cfg/select_goroutine.go b/tests/fixtures/real_world/go/cfg/select_goroutine.go new file mode 100644 index 00000000..162b59c4 --- /dev/null +++ b/tests/fixtures/real_world/go/cfg/select_goroutine.go @@ -0,0 +1,32 @@ +package main + +import ( + "fmt" + "os" + "os/exec" + "time" +) + +func runWithTimeout(command string, timeout time.Duration) (string, error) { + cmd := exec.Command("sh", "-c", command) + + done := make(chan string) 
+ go func() { + output, _ := cmd.Output() + done <- string(output) + }() + + select { + case result := <-done: + return result, nil + case <-time.After(timeout): + cmd.Process.Kill() + return "", fmt.Errorf("timeout") + } +} + +func main() { + cmd := os.Getenv("CMD") + result, _ := runWithTimeout(cmd, 5*time.Second) + fmt.Println(result) +} diff --git a/tests/fixtures/real_world/go/mixed/http_handler_full.expect.json b/tests/fixtures/real_world/go/mixed/http_handler_full.expect.json new file mode 100644 index 00000000..9b7eeae2 --- /dev/null +++ b/tests/fixtures/real_world/go/mixed/http_handler_full.expect.json @@ -0,0 +1,72 @@ +{ + "description": "HTTP handlers with command injection, path traversal, missing auth, and resource leak", + "tags": [ + "taint", + "state", + "cfg", + "cmdi", + "path-traversal", + "auth-gap", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "go.cmdi.exec_command", + "severity": null, + "must_match": true, + "line_range": [ + 11, + 15 + ], + "evidence_contains": [], + "notes": "AST pattern detects exec.Command in adminHandler" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 15 + ], + "evidence_contains": [], + "notes": "r.URL.Query().Get(\"cmd\") flows directly into exec.Command(\"sh\", \"-c\", cmd)" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 16, + 21 + ], + "evidence_contains": [], + "notes": "r.URL.Query().Get(\"path\") flows into os.Open \u2014 path traversal" + }, + { + "rule_id": "cfg-auth-gap", + "severity": null, + "must_match": false, + "line_range": [ + 8, + 16 + ], + "evidence_contains": [], + "notes": "adminHandler executes commands without any authentication or authorization check" + }, + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 17, + 25 + ], + "evidence_contains": [], + "notes": "os.Open on line 19 \u2014 f 
never closed in readHandler" + } + ] +} diff --git a/tests/fixtures/real_world/go/mixed/http_handler_full.go b/tests/fixtures/real_world/go/mixed/http_handler_full.go new file mode 100644 index 00000000..c5092231 --- /dev/null +++ b/tests/fixtures/real_world/go/mixed/http_handler_full.go @@ -0,0 +1,24 @@ +package main + +import ( + "fmt" + "net/http" + "os" + "os/exec" +) + +func adminHandler(w http.ResponseWriter, r *http.Request) { + cmd := r.URL.Query().Get("cmd") + // No auth check + out, _ := exec.Command("sh", "-c", cmd).Output() + fmt.Fprintf(w, "%s", out) +} + +func readHandler(w http.ResponseWriter, r *http.Request) { + path := r.URL.Query().Get("path") + f, _ := os.Open(path) + buf := make([]byte, 4096) + n, _ := f.Read(buf) + w.Write(buf[:n]) + // f leaked + path traversal +} diff --git a/tests/fixtures/real_world/go/mixed/taint_through_error.expect.json b/tests/fixtures/real_world/go/mixed/taint_through_error.expect.json new file mode 100644 index 00000000..1b6a351d --- /dev/null +++ b/tests/fixtures/real_world/go/mixed/taint_through_error.expect.json @@ -0,0 +1,48 @@ +{ + "description": "Command injection with error fallthrough \u2014 error path does not return, output still written", + "tags": [ + "taint", + "cfg", + "cmdi", + "error-handling", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "go.cmdi.exec_command", + "severity": null, + "must_match": true, + "line_range": [ + 8, + 12 + ], + "evidence_contains": [], + "notes": "AST pattern detects exec.Command call" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 7, + 12 + ], + "evidence_contains": [], + "notes": "r.URL.Query().Get(\"cmd\") flows directly into exec.Command(\"sh\", \"-c\", cmd)" + }, + { + "rule_id": "cfg-error-fallthrough", + "severity": null, + "must_match": false, + "line_range": [ + 9, + 17 + ], + "evidence_contains": [], + "notes": "Error check on line 11 does not return \u2014 execution 
continues to w.Write(out) on line 15" + } + ] +} diff --git a/tests/fixtures/real_world/go/mixed/taint_through_error.go b/tests/fixtures/real_world/go/mixed/taint_through_error.go new file mode 100644 index 00000000..05e39a9c --- /dev/null +++ b/tests/fixtures/real_world/go/mixed/taint_through_error.go @@ -0,0 +1,16 @@ +package main + +import ( + "net/http" + "os/exec" +) + +func handler(w http.ResponseWriter, r *http.Request) { + cmd := r.URL.Query().Get("cmd") + out, err := exec.Command("sh", "-c", cmd).Output() + if err != nil { + // Error is not properly handled + http.Error(w, err.Error(), 500) + } + w.Write(out) +} diff --git a/tests/fixtures/real_world/go/state/branch_close.expect.json b/tests/fixtures/real_world/go/state/branch_close.expect.json new file mode 100644 index 00000000..70808a83 --- /dev/null +++ b/tests/fixtures/real_world/go/state/branch_close.expect.json @@ -0,0 +1,23 @@ +{ + "description": "File handle only closed inside if branch \u2014 leaked when flag is false", + "tags": [ + "state", + "resource-leak" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak-possible", + "severity": null, + "must_match": false, + "line_range": [ + 4, + 14 + ], + "evidence_contains": [], + "notes": "f only closed inside if(flag) block \u2014 leaked on the else path" + } + ] +} diff --git a/tests/fixtures/real_world/go/state/branch_close.go b/tests/fixtures/real_world/go/state/branch_close.go new file mode 100644 index 00000000..875e3419 --- /dev/null +++ b/tests/fixtures/real_world/go/state/branch_close.go @@ -0,0 +1,13 @@ +package main + +import "os" + +func branchLeak(path string, flag bool) { + f, _ := os.Open(path) + if flag { + buf := make([]byte, 1024) + f.Read(buf) + f.Close() + } + // f leaked if !flag +} diff --git a/tests/fixtures/real_world/go/state/double_close.expect.json b/tests/fixtures/real_world/go/state/double_close.expect.json new file mode 100644 index 00000000..f399ef0c --- /dev/null +++ 
b/tests/fixtures/real_world/go/state/double_close.expect.json @@ -0,0 +1,35 @@ +{ + "description": "Double close and use-after-close on os.File handle", + "tags": [ + "state", + "double-close", + "use-after-close" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-double-close", + "severity": null, + "must_match": false, + "line_range": [ + 4, + 10 + ], + "evidence_contains": [], + "notes": "f.Close() called on line 7 and again on line 8" + }, + { + "rule_id": "state-use-after-close", + "severity": null, + "must_match": false, + "line_range": [ + 10, + 17 + ], + "evidence_contains": [], + "notes": "f.Read(buf) on line 15 after f.Close() on line 13" + } + ] +} diff --git a/tests/fixtures/real_world/go/state/double_close.go b/tests/fixtures/real_world/go/state/double_close.go new file mode 100644 index 00000000..a85a7925 --- /dev/null +++ b/tests/fixtures/real_world/go/state/double_close.go @@ -0,0 +1,16 @@ +package main + +import "os" + +func doubleClose(path string) { + f, _ := os.Open(path) + f.Close() + f.Close() +} + +func useAfterClose(path string) { + f, _ := os.Open(path) + f.Close() + buf := make([]byte, 1024) + f.Read(buf) +} diff --git a/tests/fixtures/real_world/go/state/file_lifecycle.expect.json b/tests/fixtures/real_world/go/state/file_lifecycle.expect.json new file mode 100644 index 00000000..cfd4bbc4 --- /dev/null +++ b/tests/fixtures/real_world/go/state/file_lifecycle.expect.json @@ -0,0 +1,23 @@ +{ + "description": "File handle lifecycle: readLeak never closes, readClose explicit close, readDefer uses defer", + "tags": [ + "state", + "resource-leak" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 4, + 11 + ], + "evidence_contains": [], + "notes": "os.Open on line 6 \u2014 f never closed before function returns" + } + ] +} diff --git a/tests/fixtures/real_world/go/state/file_lifecycle.go 
b/tests/fixtures/real_world/go/state/file_lifecycle.go new file mode 100644 index 00000000..692032ea --- /dev/null +++ b/tests/fixtures/real_world/go/state/file_lifecycle.go @@ -0,0 +1,24 @@ +package main + +import "os" + +func readLeak(path string) { + f, _ := os.Open(path) + buf := make([]byte, 1024) + f.Read(buf) + // f not closed +} + +func readClose(path string) { + f, _ := os.Open(path) + buf := make([]byte, 1024) + f.Read(buf) + f.Close() +} + +func readDefer(path string) { + f, _ := os.Open(path) + defer f.Close() + buf := make([]byte, 1024) + f.Read(buf) +} diff --git a/tests/fixtures/real_world/go/state/http_body.expect.json b/tests/fixtures/real_world/go/state/http_body.expect.json new file mode 100644 index 00000000..7606799a --- /dev/null +++ b/tests/fixtures/real_world/go/state/http_body.expect.json @@ -0,0 +1,24 @@ +{ + "description": "HTTP response body leak in fetchLeak; fetchSafe uses defer resp.Body.Close()", + "tags": [ + "state", + "resource-leak", + "http" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 7, + 14 + ], + "evidence_contains": [], + "notes": "resp.Body from http.Get never closed in fetchLeak" + } + ] +} diff --git a/tests/fixtures/real_world/go/state/http_body.go b/tests/fixtures/real_world/go/state/http_body.go new file mode 100644 index 00000000..f8b5c87e --- /dev/null +++ b/tests/fixtures/real_world/go/state/http_body.go @@ -0,0 +1,20 @@ +package main + +import ( + "io/ioutil" + "net/http" +) + +func fetchLeak(url string) string { + resp, _ := http.Get(url) + body, _ := ioutil.ReadAll(resp.Body) + // resp.Body not closed + return string(body) +} + +func fetchSafe(url string) string { + resp, _ := http.Get(url) + defer resp.Body.Close() + body, _ := ioutil.ReadAll(resp.Body) + return string(body) +} diff --git a/tests/fixtures/real_world/go/taint/cmdi_http.expect.json b/tests/fixtures/real_world/go/taint/cmdi_http.expect.json 
new file mode 100644 index 00000000..7235c0f7 --- /dev/null +++ b/tests/fixtures/real_world/go/taint/cmdi_http.expect.json @@ -0,0 +1,45 @@ +{ + "description": "HTTP query parameter flows into exec.Command \u2014 both direct args and shell concatenation variants", + "tags": [ + "taint", + "cmdi" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "go.cmdi.exec_command", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": "AST pattern detects exec.Command call in pingHandler" + }, + { + "rule_id": "go.cmdi.exec_command", + "severity": null, + "must_match": true, + "line_range": [ + 16, + 20 + ], + "evidence_contains": [], + "notes": "AST pattern detects exec.Command call in unsafePing" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 15, + 20 + ], + "evidence_contains": [], + "notes": "r.URL.Query().Get(\"host\") concatenated into shell command string in unsafePing" + } + ] +} diff --git a/tests/fixtures/real_world/go/taint/cmdi_http.go b/tests/fixtures/real_world/go/taint/cmdi_http.go new file mode 100644 index 00000000..5971cd9f --- /dev/null +++ b/tests/fixtures/real_world/go/taint/cmdi_http.go @@ -0,0 +1,27 @@ +package main + +import ( + "fmt" + "net/http" + "os/exec" +) + +func pingHandler(w http.ResponseWriter, r *http.Request) { + host := r.URL.Query().Get("host") + cmd := exec.Command("ping", "-c", "1", host) + output, _ := cmd.Output() + fmt.Fprintf(w, "%s", output) +} + +func unsafePing(w http.ResponseWriter, r *http.Request) { + host := r.URL.Query().Get("host") + cmd := exec.Command("sh", "-c", "ping -c 1 "+host) + output, _ := cmd.Output() + fmt.Fprintf(w, "%s", output) +} + +func main() { + http.HandleFunc("/ping", pingHandler) + http.HandleFunc("/unsafe-ping", unsafePing) + http.ListenAndServe(":8080", nil) +} diff --git a/tests/fixtures/real_world/go/taint/crypto_weak.expect.json 
b/tests/fixtures/real_world/go/taint/crypto_weak.expect.json new file mode 100644 index 00000000..7d0d559a --- /dev/null +++ b/tests/fixtures/real_world/go/taint/crypto_weak.expect.json @@ -0,0 +1,35 @@ +{ + "description": "Weak cryptographic hash usage: MD5 and SHA1 detected; SHA256 is safe", + "tags": [ + "crypto", + "weak-hash" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "go.crypto.md5", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": "AST pattern detects md5.New() call" + }, + { + "rule_id": "go.crypto.sha1", + "severity": null, + "must_match": true, + "line_range": [ + 15, + 19 + ], + "evidence_contains": [], + "notes": "AST pattern detects sha1.New() call" + } + ] +} diff --git a/tests/fixtures/real_world/go/taint/crypto_weak.go b/tests/fixtures/real_world/go/taint/crypto_weak.go new file mode 100644 index 00000000..73d78583 --- /dev/null +++ b/tests/fixtures/real_world/go/taint/crypto_weak.go @@ -0,0 +1,26 @@ +package main + +import ( + "crypto/md5" + "crypto/sha1" + "crypto/sha256" + "fmt" +) + +func hashMD5(data []byte) string { + h := md5.New() + h.Write(data) + return fmt.Sprintf("%x", h.Sum(nil)) +} + +func hashSHA1(data []byte) string { + h := sha1.New() + h.Write(data) + return fmt.Sprintf("%x", h.Sum(nil)) +} + +func hashSafe(data []byte) string { + h := sha256.New() + h.Write(data) + return fmt.Sprintf("%x", h.Sum(nil)) +} diff --git a/tests/fixtures/real_world/go/taint/env_exec.expect.json b/tests/fixtures/real_world/go/taint/env_exec.expect.json new file mode 100644 index 00000000..7f44f2bd --- /dev/null +++ b/tests/fixtures/real_world/go/taint/env_exec.expect.json @@ -0,0 +1,35 @@ +{ + "description": "Environment variable flows directly into shell exec \u2014 command injection via env", + "tags": [ + "taint", + "cmdi", + "env" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "go.cmdi.exec_command", + "severity": null, + "must_match": 
true, + "line_range": [ + 8, + 12 + ], + "evidence_contains": [], + "notes": "AST pattern detects exec.Command call" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 7, + 12 + ], + "evidence_contains": [], + "notes": "os.Getenv(\"USER_CMD\") flows directly into exec.Command(\"sh\", \"-c\", cmd)" + } + ] +} diff --git a/tests/fixtures/real_world/go/taint/env_exec.go b/tests/fixtures/real_world/go/taint/env_exec.go new file mode 100644 index 00000000..30ab3353 --- /dev/null +++ b/tests/fixtures/real_world/go/taint/env_exec.go @@ -0,0 +1,11 @@ +package main + +import ( + "os" + "os/exec" +) + +func main() { + cmd := os.Getenv("USER_CMD") + exec.Command("sh", "-c", cmd).Run() +} diff --git a/tests/fixtures/real_world/go/taint/file_path.expect.json b/tests/fixtures/real_world/go/taint/file_path.expect.json new file mode 100644 index 00000000..c1f327e5 --- /dev/null +++ b/tests/fixtures/real_world/go/taint/file_path.expect.json @@ -0,0 +1,23 @@ +{ + "description": "Path traversal via user-controlled file path in readFileHandler; safeReadHandler uses filepath.Clean and abs check", + "tags": [ + "taint", + "path-traversal" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 8, + 13 + ], + "evidence_contains": [], + "notes": "r.URL.Query().Get(\"path\") flows directly into ioutil.ReadFile without validation" + } + ] +} diff --git a/tests/fixtures/real_world/go/taint/file_path.go b/tests/fixtures/real_world/go/taint/file_path.go new file mode 100644 index 00000000..0616102a --- /dev/null +++ b/tests/fixtures/real_world/go/taint/file_path.go @@ -0,0 +1,28 @@ +package main + +import ( + "io/ioutil" + "net/http" + "path/filepath" +) + +func readFileHandler(w http.ResponseWriter, r *http.Request) { + path := r.URL.Query().Get("path") + data, err := ioutil.ReadFile(path) + if err != nil { + http.Error(w, "Not found", 
404) + return + } + w.Write(data) +} + +func safeReadHandler(w http.ResponseWriter, r *http.Request) { + path := r.URL.Query().Get("path") + clean := filepath.Clean(path) + if filepath.IsAbs(clean) { + http.Error(w, "Forbidden", 403) + return + } + data, _ := ioutil.ReadFile(filepath.Join("/safe/dir", clean)) + w.Write(data) +} diff --git a/tests/fixtures/real_world/go/taint/sqli_sprintf.expect.json b/tests/fixtures/real_world/go/taint/sqli_sprintf.expect.json new file mode 100644 index 00000000..5cf5f50e --- /dev/null +++ b/tests/fixtures/real_world/go/taint/sqli_sprintf.expect.json @@ -0,0 +1,34 @@ +{ + "description": "SQL injection via fmt.Sprintf query construction; safe version uses parameterized query", + "tags": [ + "taint", + "sqli" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 10, + 16 + ], + "evidence_contains": [], + "notes": "r.URL.Query().Get(\"id\") flows through fmt.Sprintf into db.Query in getUserUnsafe" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 22, + 27 + ], + "evidence_contains": [], + "notes": "getUserSafe uses parameterized query \u2014 should NOT produce a taint finding" + } + ] +} diff --git a/tests/fixtures/real_world/go/taint/sqli_sprintf.go b/tests/fixtures/real_world/go/taint/sqli_sprintf.go new file mode 100644 index 00000000..16e27139 --- /dev/null +++ b/tests/fixtures/real_world/go/taint/sqli_sprintf.go @@ -0,0 +1,32 @@ +package main + +import ( + "database/sql" + "fmt" + "net/http" + + _ "github.com/lib/pq" +) + +func getUserUnsafe(db *sql.DB, w http.ResponseWriter, r *http.Request) { + userId := r.URL.Query().Get("id") + query := fmt.Sprintf("SELECT name FROM users WHERE id = '%s'", userId) + rows, _ := db.Query(query) + defer rows.Close() + for rows.Next() { + var name string + rows.Scan(&name) + fmt.Fprintf(w, "%s\n", name) + } +} + +func getUserSafe(db 
*sql.DB, w http.ResponseWriter, r *http.Request) { + userId := r.URL.Query().Get("id") + rows, _ := db.Query("SELECT name FROM users WHERE id = $1", userId) + defer rows.Close() + for rows.Next() { + var name string + rows.Scan(&name) + fmt.Fprintf(w, "%s\n", name) + } +} diff --git a/tests/fixtures/real_world/go/taint/template_xss.expect.json b/tests/fixtures/real_world/go/taint/template_xss.expect.json new file mode 100644 index 00000000..48c0b196 --- /dev/null +++ b/tests/fixtures/real_world/go/taint/template_xss.expect.json @@ -0,0 +1,23 @@ +{ + "description": "XSS via fmt.Fprintf of user input into HTTP response; safe version uses html/template auto-escaping", + "tags": [ + "taint", + "xss" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 8, + 13 + ], + "evidence_contains": [], + "notes": "r.URL.Query().Get(\"name\") flows directly into fmt.Fprintf writing to http.ResponseWriter" + } + ] +} diff --git a/tests/fixtures/real_world/go/taint/template_xss.go b/tests/fixtures/real_world/go/taint/template_xss.go new file mode 100644 index 00000000..372d5d43 --- /dev/null +++ b/tests/fixtures/real_world/go/taint/template_xss.go @@ -0,0 +1,18 @@ +package main + +import ( + "fmt" + "html/template" + "net/http" +) + +func unsafeHandler(w http.ResponseWriter, r *http.Request) { + name := r.URL.Query().Get("name") + fmt.Fprintf(w, "Hello %s", name) +} + +func safeHandler(w http.ResponseWriter, r *http.Request) { + name := r.URL.Query().Get("name") + tmpl := template.Must(template.New("hello").Parse("Hello {{.}}")) + tmpl.Execute(w, name) +} diff --git a/tests/fixtures/real_world/java/cfg/catch_finally.expect.json b/tests/fixtures/real_world/java/cfg/catch_finally.expect.json new file mode 100644 index 00000000..6e0de842 --- /dev/null +++ b/tests/fixtures/real_world/java/cfg/catch_finally.expect.json @@ -0,0 +1,23 @@ +{ + "description": "Finally-based safe close pattern vs exception-path resource leak", + "tags": [ + "cfg", + "resource-leak" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 20, + 29 + ], + "evidence_contains": [], + "notes": "processLeaky throws on line 26 without closing fis \u2014 resource leaked on exception path" + } + ] +} diff --git a/tests/fixtures/real_world/java/cfg/catch_finally.java b/tests/fixtures/real_world/java/cfg/catch_finally.java new file mode 100644 index 00000000..84df5ff1 --- /dev/null +++ b/tests/fixtures/real_world/java/cfg/catch_finally.java @@ -0,0 +1,30 @@ +import java.io.*; + +public class FileProcessor { + public void processWithFinally(String path) { + FileInputStream fis = null; + try { + fis = new FileInputStream(path); + byte[] data = new byte[1024]; + fis.read(data); + } catch (IOException e) { + System.err.println("Error: " + e.getMessage()); + } finally { + try { + if (fis != null) fis.close(); + } catch (IOException e) { + // ignore + } + } + } + + public void processLeaky(String path) throws IOException { + FileInputStream fis = new FileInputStream(path); + byte[] data = new byte[1024]; + fis.read(data); + if (data[0] == 0) { + throw new IOException("bad data"); // fis leaked + } + fis.close(); + } +} diff --git a/tests/fixtures/real_world/java/cfg/lambda_streams.expect.json b/tests/fixtures/real_world/java/cfg/lambda_streams.expect.json new file mode 100644 index 00000000..20bf7dc5 --- /dev/null +++ b/tests/fixtures/real_world/java/cfg/lambda_streams.expect.json @@ -0,0 +1,37 @@ +{ + "description": "Stream operations
with SQL concat in map and exec in forEach lambda", + "tags": [ + "cfg", + "sqli", + "cmdi", + "lambda" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 10, + 17 + ], + "evidence_contains": [], + "notes": "cmd parameter inside lambda flows to Runtime.exec \u2014 scanner may not track taint through lambda bodies" + }, + { + "rule_id": "java.sqli.execute_concat", + "severity": null, + "must_match": false, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": "SQL string concatenation inside stream map lambda \u2014 AST pattern may not match inside lambda context" + } + ] +} diff --git a/tests/fixtures/real_world/java/cfg/lambda_streams.java b/tests/fixtures/real_world/java/cfg/lambda_streams.java new file mode 100644 index 00000000..ce32716c --- /dev/null +++ b/tests/fixtures/real_world/java/cfg/lambda_streams.java @@ -0,0 +1,21 @@ +import java.util.*; +import java.util.stream.*; + +public class StreamProcessor { + public List filterUnsafe(List inputs) { + return inputs.stream() + .filter(s -> !s.isEmpty()) + .map(s -> "SELECT * FROM users WHERE name = '" + s + "'") + .collect(Collectors.toList()); + } + + public void processCommands(List commands) { + commands.forEach(cmd -> { + try { + Runtime.getRuntime().exec(cmd); + } catch (Exception e) { + e.printStackTrace(); + } + }); + } +} diff --git a/tests/fixtures/real_world/java/cfg/switch_expressions.expect.json b/tests/fixtures/real_world/java/cfg/switch_expressions.expect.json new file mode 100644 index 00000000..a21abad9 --- /dev/null +++ b/tests/fixtures/real_world/java/cfg/switch_expressions.expect.json @@ -0,0 +1,34 @@ +{ + "description": "Switch dispatches to dangerous operations including exec and file write based on action string", + "tags": [ + "cfg", + "cmdi" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + 
"must_match": false, + "line_range": [ + 2, + 9 + ], + "evidence_contains": [], + "notes": "input parameter flows into Runtime.exec \u2014 depends on whether scanner models method params as taint sources" + }, + { + "rule_id": "cfg-unguarded-sink", + "severity": null, + "must_match": false, + "line_range": [ + 4, + 9 + ], + "evidence_contains": [], + "notes": "No validation of input before exec call in exec case" + } + ] +} diff --git a/tests/fixtures/real_world/java/cfg/switch_expressions.java b/tests/fixtures/real_world/java/cfg/switch_expressions.java new file mode 100644 index 00000000..420172c1 --- /dev/null +++ b/tests/fixtures/real_world/java/cfg/switch_expressions.java @@ -0,0 +1,19 @@ +import java.io.*; + +public class ActionHandler { + public void handle(String action, String input) throws IOException { + switch (action) { + case "exec": + Runtime.getRuntime().exec(input); + break; + case "write": + FileWriter fw = new FileWriter(input); + fw.write("data"); + fw.close(); + break; + case "log": + System.out.println(input); + break; + } + } +} diff --git a/tests/fixtures/real_world/java/cfg/try_with_resources.expect.json b/tests/fixtures/real_world/java/cfg/try_with_resources.expect.json new file mode 100644 index 00000000..1c792baf --- /dev/null +++ b/tests/fixtures/real_world/java/cfg/try_with_resources.expect.json @@ -0,0 +1,23 @@ +{ + "description": "Try-with-resources safe pattern vs manual close with early return leak", + "tags": [ + "cfg", + "resource-leak" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 9, + 17 + ], + "evidence_contains": [], + "notes": "readUnsafe returns early on line 14 without closing reader \u2014 resource leak on that path" + } + ] +} diff --git a/tests/fixtures/real_world/java/cfg/try_with_resources.java b/tests/fixtures/real_world/java/cfg/try_with_resources.java new file mode 100644 index 00000000..8fc04445 --- /dev/null +++ 
b/tests/fixtures/real_world/java/cfg/try_with_resources.java @@ -0,0 +1,19 @@ +import java.io.*; + +public class ResourceHandler { + public String readSafe(String path) throws IOException { + try (BufferedReader reader = new BufferedReader(new FileReader(path))) { + return reader.readLine(); + } + } + + public String readUnsafe(String path) throws IOException { + BufferedReader reader = new BufferedReader(new FileReader(path)); + String line = reader.readLine(); + if (line == null) { + return "empty"; // reader leaked + } + reader.close(); + return line; + } +} diff --git a/tests/fixtures/real_world/java/mixed/deser_cmdi.expect.json b/tests/fixtures/real_world/java/mixed/deser_cmdi.expect.json new file mode 100644 index 00000000..0ad4dd38 --- /dev/null +++ b/tests/fixtures/real_world/java/mixed/deser_cmdi.expect.json @@ -0,0 +1,37 @@ +{ + "description": "Deserialized object from HTTP input stream used directly as command argument to Runtime.exec", + "tags": [ + "taint", + "deser", + "cmdi", + "servlet", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "java.deser.readobject", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": "AST pattern detects new ObjectInputStream() construction" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 11 + ], + "evidence_contains": [], + "notes": "Deserialized command string from request input flows into Runtime.exec" + } + ] +} diff --git a/tests/fixtures/real_world/java/mixed/deser_cmdi.java b/tests/fixtures/real_world/java/mixed/deser_cmdi.java new file mode 100644 index 00000000..efd043c8 --- /dev/null +++ b/tests/fixtures/real_world/java/mixed/deser_cmdi.java @@ -0,0 +1,13 @@ +import java.io.*; +import javax.servlet.http.*; + +public class DeserCmdi extends HttpServlet { + protected void doPost(HttpServletRequest request, HttpServletResponse response) + throws IOException, 
ClassNotFoundException { + ObjectInputStream ois = new ObjectInputStream(request.getInputStream()); + String command = (String) ois.readObject(); + Runtime.getRuntime().exec(command); + + response.getWriter().println("Done"); + } +} diff --git a/tests/fixtures/real_world/java/mixed/servlet_full.expect.json b/tests/fixtures/real_world/java/mixed/servlet_full.expect.json new file mode 100644 index 00000000..6eaebcf9 --- /dev/null +++ b/tests/fixtures/real_world/java/mixed/servlet_full.expect.json @@ -0,0 +1,61 @@ +{ + "description": "Servlet with multiple vuln types: command injection, SQL injection via concat, and file resource leak", + "tags": [ + "taint", + "state", + "cmdi", + "sqli", + "resource-leak", + "servlet", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 8, + 16 + ], + "evidence_contains": [], + "notes": "request.getParameter(\"input\") flows into Runtime.getRuntime().exec(input)" + }, + { + "rule_id": "java.sqli.execute_concat", + "severity": null, + "must_match": true, + "line_range": [ + 15, + 19 + ], + "evidence_contains": [], + "notes": "AST pattern detects string concatenation inside executeQuery argument" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 8, + 19 + ], + "evidence_contains": [], + "notes": "request.getParameter(\"input\") concatenated into SQL query passed to executeQuery" + }, + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 21, + 29 + ], + "evidence_contains": [], + "notes": "FileInputStream opened on line 23 but never closed" + } + ] +} diff --git a/tests/fixtures/real_world/java/mixed/servlet_full.java b/tests/fixtures/real_world/java/mixed/servlet_full.java new file mode 100644 index 00000000..780a1f80 --- /dev/null +++ b/tests/fixtures/real_world/java/mixed/servlet_full.java @@ -0,0 +1,30 @@ +import 
java.io.*; +import javax.servlet.http.*; +import java.sql.*; + +public class FullServlet extends HttpServlet { + private Connection dbConn; + + protected void doGet(HttpServletRequest request, HttpServletResponse response) + throws IOException, SQLException { + String action = request.getParameter("action"); + String input = request.getParameter("input"); + + if ("exec".equals(action)) { + Runtime.getRuntime().exec(input); + } else if ("query".equals(action)) { + Statement stmt = dbConn.createStatement(); + ResultSet rs = stmt.executeQuery("SELECT * FROM data WHERE key = '" + input + "'"); + PrintWriter out = response.getWriter(); + while (rs.next()) { + out.println(rs.getString(1)); + } + } else if ("read".equals(action)) { + FileInputStream fis = new FileInputStream(input); + byte[] data = new byte[4096]; + fis.read(data); + response.getWriter().println(new String(data)); + // fis leaked + } + } +} diff --git a/tests/fixtures/real_world/java/state/branch_close.expect.json b/tests/fixtures/real_world/java/state/branch_close.expect.json new file mode 100644 index 00000000..ab7f95e0 --- /dev/null +++ b/tests/fixtures/real_world/java/state/branch_close.expect.json @@ -0,0 +1,23 @@ +{ + "description": "FileInputStream closed only in one branch of conditionalClose; both branches close in bothBranchesClose", + "tags": [ + "state", + "resource-leak" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak-possible", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 13 + ], + "evidence_contains": [], + "notes": "fis only closed inside if(flag) \u2014 leaked when flag is false" + } + ] +} diff --git a/tests/fixtures/real_world/java/state/branch_close.java b/tests/fixtures/real_world/java/state/branch_close.java new file mode 100644 index 00000000..74628d95 --- /dev/null +++ b/tests/fixtures/real_world/java/state/branch_close.java @@ -0,0 +1,24 @@ +import java.io.*; + +public class BranchClose { + public void 
conditionalClose(String path, boolean flag) throws IOException { + FileInputStream fis = new FileInputStream(path); + if (flag) { + byte[] data = new byte[1024]; + fis.read(data); + fis.close(); + } + // fis leaked if !flag + } + + public void bothBranchesClose(String path, boolean flag) throws IOException { + FileInputStream fis = new FileInputStream(path); + if (flag) { + byte[] data = new byte[1024]; + fis.read(data); + fis.close(); + } else { + fis.close(); + } + } +} diff --git a/tests/fixtures/real_world/java/state/connection_lifecycle.expect.json b/tests/fixtures/real_world/java/state/connection_lifecycle.expect.json new file mode 100644 index 00000000..77edb5d8 --- /dev/null +++ b/tests/fixtures/real_world/java/state/connection_lifecycle.expect.json @@ -0,0 +1,24 @@ +{ + "description": "Database connection and statement opened but never closed in queryAndLeak; properly nested finally in queryAndClose", + "tags": [ + "state", + "resource-leak", + "database" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 10 + ], + "evidence_contains": [], + "notes": "Connection and Statement opened but never closed in queryAndLeak" + } + ] +} diff --git a/tests/fixtures/real_world/java/state/connection_lifecycle.java b/tests/fixtures/real_world/java/state/connection_lifecycle.java new file mode 100644 index 00000000..cd210e82 --- /dev/null +++ b/tests/fixtures/real_world/java/state/connection_lifecycle.java @@ -0,0 +1,24 @@ +import java.sql.*; + +public class DatabaseManager { + public void queryAndLeak(String url) throws SQLException { + Connection conn = DriverManager.getConnection(url); + Statement stmt = conn.createStatement(); + stmt.executeQuery("SELECT 1"); + // conn and stmt never closed + } + + public void queryAndClose(String url) throws SQLException { + Connection conn = DriverManager.getConnection(url); + try { + Statement stmt = conn.createStatement(); + 
try { + stmt.executeQuery("SELECT 1"); + } finally { + stmt.close(); + } + } finally { + conn.close(); + } + } +} diff --git a/tests/fixtures/real_world/java/state/double_close.expect.json b/tests/fixtures/real_world/java/state/double_close.expect.json new file mode 100644 index 00000000..d9ad8053 --- /dev/null +++ b/tests/fixtures/real_world/java/state/double_close.expect.json @@ -0,0 +1,35 @@ +{ + "description": "Double close of FileInputStream and use-after-close read operation", + "tags": [ + "state", + "double-close", + "use-after-close" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-double-close", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 9 + ], + "evidence_contains": [], + "notes": "fis.close() called twice on lines 6 and 7" + }, + { + "rule_id": "state-use-after-close", + "severity": null, + "must_match": false, + "line_range": [ + 9, + 16 + ], + "evidence_contains": [], + "notes": "fis.read(data) on line 14 after fis.close() on line 12" + } + ] +} diff --git a/tests/fixtures/real_world/java/state/double_close.java b/tests/fixtures/real_world/java/state/double_close.java new file mode 100644 index 00000000..b9d7b531 --- /dev/null +++ b/tests/fixtures/real_world/java/state/double_close.java @@ -0,0 +1,16 @@ +import java.io.*; + +public class DoubleClose { + public void doubleCloseStream(String path) throws IOException { + FileInputStream fis = new FileInputStream(path); + fis.close(); + fis.close(); + } + + public void useAfterClose(String path) throws IOException { + FileInputStream fis = new FileInputStream(path); + fis.close(); + byte[] data = new byte[1024]; + fis.read(data); + } +} diff --git a/tests/fixtures/real_world/java/state/stream_lifecycle.expect.json b/tests/fixtures/real_world/java/state/stream_lifecycle.expect.json new file mode 100644 index 00000000..ea1942a0 --- /dev/null +++ b/tests/fixtures/real_world/java/state/stream_lifecycle.expect.json @@ -0,0 +1,23 @@ +{ + "description": 
"FileInputStream opened and never closed in readAndLeak; properly closed via finally in readAndClose", + "tags": [ + "state", + "resource-leak" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 11 + ], + "evidence_contains": [], + "notes": "FileInputStream fis opened on line 5 and never closed before return on line 8" + } + ] +} diff --git a/tests/fixtures/real_world/java/state/stream_lifecycle.java b/tests/fixtures/real_world/java/state/stream_lifecycle.java new file mode 100644 index 00000000..6ac12c88 --- /dev/null +++ b/tests/fixtures/real_world/java/state/stream_lifecycle.java @@ -0,0 +1,22 @@ +import java.io.*; + +public class StreamManager { + public String readAndLeak(String path) throws IOException { + FileInputStream fis = new FileInputStream(path); + byte[] data = new byte[1024]; + fis.read(data); + return new String(data); + // fis never closed + } + + public String readAndClose(String path) throws IOException { + FileInputStream fis = new FileInputStream(path); + try { + byte[] data = new byte[1024]; + fis.read(data); + return new String(data); + } finally { + fis.close(); + } + } +} diff --git a/tests/fixtures/real_world/java/taint/cmdi_processbuilder.expect.json b/tests/fixtures/real_world/java/taint/cmdi_processbuilder.expect.json new file mode 100644 index 00000000..e20a4831 --- /dev/null +++ b/tests/fixtures/real_world/java/taint/cmdi_processbuilder.expect.json @@ -0,0 +1,24 @@ +{ + "description": "HttpServletRequest parameters flow into ProcessBuilder constructor \u2014 command injection via user-controlled program and arguments", + "tags": [ + "taint", + "cmdi", + "servlet" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 13 + ], + "evidence_contains": [], + "notes": "request.getParameter(\"program\") and request.getParameter(\"arg\") 
flow into new ProcessBuilder(program, arg)" + } + ] +} diff --git a/tests/fixtures/real_world/java/taint/cmdi_processbuilder.java b/tests/fixtures/real_world/java/taint/cmdi_processbuilder.java new file mode 100644 index 00000000..56cf15c7 --- /dev/null +++ b/tests/fixtures/real_world/java/taint/cmdi_processbuilder.java @@ -0,0 +1,21 @@ +import java.io.*; +import javax.servlet.http.*; + +public class ProcessHandler extends HttpServlet { + protected void doPost(HttpServletRequest request, HttpServletResponse response) + throws IOException { + String program = request.getParameter("program"); + String arg = request.getParameter("arg"); + + ProcessBuilder pb = new ProcessBuilder(program, arg); + Process process = pb.start(); + + BufferedReader reader = new BufferedReader( + new InputStreamReader(process.getInputStream())); + String line; + PrintWriter out = response.getWriter(); + while ((line = reader.readLine()) != null) { + out.println(line); + } + } +} diff --git a/tests/fixtures/real_world/java/taint/cmdi_runtime.expect.json b/tests/fixtures/real_world/java/taint/cmdi_runtime.expect.json new file mode 100644 index 00000000..a11fae09 --- /dev/null +++ b/tests/fixtures/real_world/java/taint/cmdi_runtime.expect.json @@ -0,0 +1,35 @@ +{ + "description": "HttpServletRequest.getParameter flows directly to Runtime.exec \u2014 classic command injection", + "tags": [ + "taint", + "cmdi", + "servlet" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 10 + ], + "evidence_contains": [], + "notes": "request.getParameter(\"cmd\") flows directly into Runtime.getRuntime().exec(cmd)" + }, + { + "rule_id": "cfg-unguarded-sink", + "severity": null, + "must_match": false, + "line_range": [ + 5, + 10 + ], + "evidence_contains": [], + "notes": "No validation or auth check before exec call" + } + ] +} diff --git a/tests/fixtures/real_world/java/taint/cmdi_runtime.java 
b/tests/fixtures/real_world/java/taint/cmdi_runtime.java new file mode 100644 index 00000000..d5e42cb7 --- /dev/null +++ b/tests/fixtures/real_world/java/taint/cmdi_runtime.java @@ -0,0 +1,13 @@ +import java.io.*; +import javax.servlet.http.*; + +public class CommandHandler extends HttpServlet { + protected void doGet(HttpServletRequest request, HttpServletResponse response) + throws IOException { + String cmd = request.getParameter("cmd"); + Runtime.getRuntime().exec(cmd); + + PrintWriter out = response.getWriter(); + out.println("Command executed"); + } +} diff --git a/tests/fixtures/real_world/java/taint/deser_ois.expect.json b/tests/fixtures/real_world/java/taint/deser_ois.expect.json new file mode 100644 index 00000000..bf079201 --- /dev/null +++ b/tests/fixtures/real_world/java/taint/deser_ois.expect.json @@ -0,0 +1,36 @@ +{ + "description": "Unsafe deserialization of untrusted ObjectInputStream from HTTP request body", + "tags": [ + "taint", + "deser", + "servlet" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "java.deser.readobject", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": "AST pattern detects new ObjectInputStream() construction" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 5, + 10 + ], + "evidence_contains": [], + "notes": "request.getInputStream() flows into ObjectInputStream.readObject \u2014 taint may not track through constructor chain" + } + ] +} diff --git a/tests/fixtures/real_world/java/taint/deser_ois.java b/tests/fixtures/real_world/java/taint/deser_ois.java new file mode 100644 index 00000000..bd016d79 --- /dev/null +++ b/tests/fixtures/real_world/java/taint/deser_ois.java @@ -0,0 +1,13 @@ +import java.io.*; +import javax.servlet.http.*; + +public class DeserHandler extends HttpServlet { + protected void doPost(HttpServletRequest request, HttpServletResponse response) + throws 
IOException, ClassNotFoundException { + ObjectInputStream ois = new ObjectInputStream(request.getInputStream()); + Object obj = ois.readObject(); + + PrintWriter out = response.getWriter(); + out.println("Deserialized: " + obj.toString()); + } +} diff --git a/tests/fixtures/real_world/java/taint/reflection.expect.json b/tests/fixtures/real_world/java/taint/reflection.expect.json new file mode 100644 index 00000000..cfd62125 --- /dev/null +++ b/tests/fixtures/real_world/java/taint/reflection.expect.json @@ -0,0 +1,35 @@ +{ + "description": "User-controlled class name flows into Class.forName \u2014 arbitrary class instantiation", + "tags": [ + "taint", + "reflection", + "servlet" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "java.reflection.class_forname", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": "AST pattern detects Class.forName() call" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 10 + ], + "evidence_contains": [], + "notes": "request.getParameter(\"class\") flows directly into Class.forName(className)" + } + ] +} diff --git a/tests/fixtures/real_world/java/taint/reflection.java b/tests/fixtures/real_world/java/taint/reflection.java new file mode 100644 index 00000000..d58fa96a --- /dev/null +++ b/tests/fixtures/real_world/java/taint/reflection.java @@ -0,0 +1,14 @@ +import java.io.*; +import javax.servlet.http.*; + +public class ReflectionHandler extends HttpServlet { + protected void doGet(HttpServletRequest request, HttpServletResponse response) + throws Exception { + String className = request.getParameter("class"); + Class clazz = Class.forName(className); + Object instance = clazz.getDeclaredConstructor().newInstance(); + + PrintWriter out = response.getWriter(); + out.println("Created: " + instance.getClass().getName()); + } +} diff --git 
a/tests/fixtures/real_world/java/taint/sqli_concat.expect.json b/tests/fixtures/real_world/java/taint/sqli_concat.expect.json new file mode 100644 index 00000000..71e6b1dd --- /dev/null +++ b/tests/fixtures/real_world/java/taint/sqli_concat.expect.json @@ -0,0 +1,35 @@ +{ + "description": "SQL injection via string concatenation in doGet; safe PreparedStatement in doPost", + "tags": [ + "taint", + "sqli", + "servlet" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "java.sqli.execute_concat", + "severity": null, + "must_match": true, + "line_range": [ + 10, + 14 + ], + "evidence_contains": [], + "notes": "AST pattern detects string concatenation inside executeQuery argument" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 8, + 14 + ], + "evidence_contains": [], + "notes": "request.getParameter(\"id\") concatenated into SQL query string passed to executeQuery" + } + ] +} diff --git a/tests/fixtures/real_world/java/taint/sqli_concat.java b/tests/fixtures/real_world/java/taint/sqli_concat.java new file mode 100644 index 00000000..afd0077a --- /dev/null +++ b/tests/fixtures/real_world/java/taint/sqli_concat.java @@ -0,0 +1,32 @@ +import java.sql.*; +import javax.servlet.http.*; +import java.io.*; + +public class UserQuery extends HttpServlet { + private Connection conn; + + protected void doGet(HttpServletRequest request, HttpServletResponse response) + throws IOException, SQLException { + String userId = request.getParameter("id"); + Statement stmt = conn.createStatement(); + ResultSet rs = stmt.executeQuery("SELECT * FROM users WHERE id = " + userId); + + PrintWriter out = response.getWriter(); + while (rs.next()) { + out.println(rs.getString("name")); + } + } + + protected void doPost(HttpServletRequest request, HttpServletResponse response) + throws IOException, SQLException { + String userId = request.getParameter("id"); + PreparedStatement stmt = conn.prepareStatement("SELECT * FROM 
users WHERE id = ?"); + stmt.setString(1, userId); + ResultSet rs = stmt.executeQuery(); + + PrintWriter out = response.getWriter(); + while (rs.next()) { + out.println(rs.getString("name")); + } + } +} diff --git a/tests/fixtures/real_world/java/taint/xss_response.expect.json b/tests/fixtures/real_world/java/taint/xss_response.expect.json new file mode 100644 index 00000000..65f5f8ae --- /dev/null +++ b/tests/fixtures/real_world/java/taint/xss_response.expect.json @@ -0,0 +1,24 @@ +{ + "description": "XSS via reflected user input in doGet; doPost has manual HTML entity escaping", + "tags": [ + "taint", + "xss", + "servlet" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 11 + ], + "evidence_contains": [], + "notes": "request.getParameter(\"name\") flows directly into out.println without escaping in doGet" + } + ] +} diff --git a/tests/fixtures/real_world/java/taint/xss_response.java b/tests/fixtures/real_world/java/taint/xss_response.java new file mode 100644 index 00000000..0e446fa1 --- /dev/null +++ b/tests/fixtures/real_world/java/taint/xss_response.java @@ -0,0 +1,19 @@ +import java.io.*; +import javax.servlet.http.*; + +public class XssHandler extends HttpServlet { + protected void doGet(HttpServletRequest request, HttpServletResponse response) + throws IOException { + String name = request.getParameter("name"); + PrintWriter out = response.getWriter(); + out.println("

Hello " + name + ""); + } + + protected void doPost(HttpServletRequest request, HttpServletResponse response) + throws IOException { + String name = request.getParameter("name"); + String safe = name.replace("<", "&lt;").replace(">", "&gt;"); + PrintWriter out = response.getWriter(); + out.println("Hello " + safe + "
"); + } +} diff --git a/tests/fixtures/real_world/javascript/cfg/async_await_flow.expect.json b/tests/fixtures/real_world/javascript/cfg/async_await_flow.expect.json new file mode 100644 index 00000000..4a0a1bc2 --- /dev/null +++ b/tests/fixtures/real_world/javascript/cfg/async_await_flow.expect.json @@ -0,0 +1,47 @@ +{ + "description": "Async/await control flow with promisified exec and fetch-to-exec pipeline. Tests CFG handling of async patterns.", + "tags": [ + "cfg", + "async", + "cmdi", + "fetch" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": "execAsync is a promisified wrapper; scanner cannot trace through util.promisify to recognize it as a sink" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 13, + 19 + ], + "evidence_contains": [], + "notes": "fetch response flows to child_process.exec but fetch is not a recognized source" + }, + { + "rule_id": "cfg-unguarded-sink", + "severity": null, + "must_match": false, + "line_range": [ + 15, + 19 + ], + "evidence_contains": [], + "notes": "child_process.exec called without validation of data.command" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/cfg/async_await_flow.js b/tests/fixtures/real_world/javascript/cfg/async_await_flow.js new file mode 100644 index 00000000..fc11da4a --- /dev/null +++ b/tests/fixtures/real_world/javascript/cfg/async_await_flow.js @@ -0,0 +1,20 @@ +var child_process = require('child_process'); +var util = require('util'); +var execAsync = util.promisify(child_process.exec); + +async function runCommand(userCmd) { + try { + var result = await execAsync(userCmd); + return result.stdout; + } catch (err) { + return err.message; + } +} + +async function fetchAndExec(url) { + var response = await fetch(url); + var data = await response.json(); + 
child_process.exec(data.command, function(err, stdout) { + return stdout; + }); +} diff --git a/tests/fixtures/real_world/javascript/cfg/callback_nesting.expect.json b/tests/fixtures/real_world/javascript/cfg/callback_nesting.expect.json new file mode 100644 index 00000000..a6f903f2 --- /dev/null +++ b/tests/fixtures/real_world/javascript/cfg/callback_nesting.expect.json @@ -0,0 +1,36 @@ +{ + "description": "Nested callbacks where fs.readFile data flows into child_process.exec. Tests CFG handling of callback nesting and data flow across callback boundaries.", + "tags": [ + "cfg", + "callbacks", + "cmdi", + "fs" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 11 + ], + "evidence_contains": [], + "notes": "Data read from fs.readFile flows into child_process.exec but data is a callback param, not a recognized taint source" + }, + { + "rule_id": "cfg-unguarded-sink", + "severity": null, + "must_match": false, + "line_range": [ + 7, + 11 + ], + "evidence_contains": [], + "notes": "child_process.exec receives unchecked file contents" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/cfg/callback_nesting.js b/tests/fixtures/real_world/javascript/cfg/callback_nesting.js new file mode 100644 index 00000000..29c298fc --- /dev/null +++ b/tests/fixtures/real_world/javascript/cfg/callback_nesting.js @@ -0,0 +1,13 @@ +var child_process = require('child_process'); +var fs = require('fs'); + +function processInput(input, callback) { + fs.readFile(input.path, 'utf8', function(err, data) { + if (err) { + return callback(err); + } + child_process.exec(data, function(execErr, stdout) { + callback(execErr, stdout); + }); + }); +} diff --git a/tests/fixtures/real_world/javascript/cfg/switch_fallthrough.expect.json b/tests/fixtures/real_world/javascript/cfg/switch_fallthrough.expect.json new file mode 100644 index 00000000..0506a849 --- /dev/null +++ 
b/tests/fixtures/real_world/javascript/cfg/switch_fallthrough.expect.json @@ -0,0 +1,58 @@ +{ + "description": "Switch statement with fallthrough from exec case to safe case (missing break). eval and execSync with function parameter.", + "tags": [ + "cfg", + "switch", + "fallthrough", + "code-exec" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "js.code_exec.eval", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": "AST pattern matches eval() call" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": "userInput is a function param, not a recognized source; requires interprocedural analysis" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 10, + 14 + ], + "evidence_contains": [], + "notes": "userInput flows to child_process.execSync but userInput is a param, not a recognized source" + }, + { + "rule_id": "cfg-error-fallthrough", + "severity": null, + "must_match": false, + "line_range": [ + 9, + 16 + ], + "evidence_contains": [], + "notes": "Missing break after exec case causes fallthrough to safe case" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/cfg/switch_fallthrough.js b/tests/fixtures/real_world/javascript/cfg/switch_fallthrough.js new file mode 100644 index 00000000..334b70d4 --- /dev/null +++ b/tests/fixtures/real_world/javascript/cfg/switch_fallthrough.js @@ -0,0 +1,19 @@ +var child_process = require('child_process'); + +function handleAction(action, userInput) { + switch (action) { + case 'eval': + eval(userInput); + break; + case 'log': + console.log(userInput); + break; + case 'exec': + child_process.execSync(userInput); + case 'safe': + console.log('safe action'); + break; + default: + break; + } +} diff --git a/tests/fixtures/real_world/javascript/cfg/try_catch_finally.expect.json 
b/tests/fixtures/real_world/javascript/cfg/try_catch_finally.expect.json new file mode 100644 index 00000000..acf77f8f --- /dev/null +++ b/tests/fixtures/real_world/javascript/cfg/try_catch_finally.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Try-catch-finally resource handling. processFile properly closes fd in finally block. leakyProcess leaks fd when throw occurs before closeSync.", + "tags": [ + "cfg", + "resource-leak", + "try-catch", + "fs" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 17, + 25 + ], + "evidence_contains": [], + "notes": "fd leaks when throw fires at line 22 before closeSync at line 24; scanner may not track fd lifecycle in JS" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/cfg/try_catch_finally.js b/tests/fixtures/real_world/javascript/cfg/try_catch_finally.js new file mode 100644 index 00000000..70e02937 --- /dev/null +++ b/tests/fixtures/real_world/javascript/cfg/try_catch_finally.js @@ -0,0 +1,26 @@ +var fs = require('fs'); + +function processFile(path) { + var fd; + try { + fd = fs.openSync(path, 'r'); + var data = fs.readFileSync(fd, 'utf8'); + return data; + } catch (e) { + console.error(e); + } finally { + if (fd !== undefined) { + fs.closeSync(fd); + } + } +} + +function leakyProcess(path) { + var fd = fs.openSync(path, 'r'); + var data = fs.readFileSync(fd, 'utf8'); + if (data.length === 0) { + throw new Error('empty'); + } + fs.closeSync(fd); + return data; +} diff --git a/tests/fixtures/real_world/javascript/mixed/express_auth_cmdi.expect.json b/tests/fixtures/real_world/javascript/mixed/express_auth_cmdi.expect.json new file mode 100644 index 00000000..b4091cae --- /dev/null +++ b/tests/fixtures/real_world/javascript/mixed/express_auth_cmdi.expect.json @@ -0,0 +1,60 @@ +{ + "description": "Express route with command injection: one without auth check, one with auth check. 
Taint flows in both cases since auth does not sanitize the input.", + "tags": [ + "mixed", + "taint", + "cfg", + "cmdi", + "auth", + "express" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 14 + ], + "evidence_contains": [], + "notes": "req.query.branch flows into child_process.exec in unauthed /deploy route" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 21, + 26 + ], + "evidence_contains": [], + "notes": "req.query.branch flows into child_process.exec in authed /deploy-safe route; auth check does not sanitize" + }, + { + "rule_id": "cfg-auth-gap", + "severity": null, + "must_match": false, + "line_range": [ + 8, + 17 + ], + "evidence_contains": [], + "notes": "No auth check before child_process.exec in /deploy handler" + }, + { + "rule_id": "cfg-unguarded-sink", + "severity": null, + "must_match": false, + "line_range": [ + 10, + 14 + ], + "evidence_contains": [], + "notes": "child_process.exec called without input validation guard" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/mixed/express_auth_cmdi.js b/tests/fixtures/real_world/javascript/mixed/express_auth_cmdi.js new file mode 100644 index 00000000..95f3bbb1 --- /dev/null +++ b/tests/fixtures/real_world/javascript/mixed/express_auth_cmdi.js @@ -0,0 +1,26 @@ +var express = require('express'); +var child_process = require('child_process'); +var app = express(); + +function isAdmin(req) { + return req.headers['x-admin'] === 'true'; +} + +// Missing auth check before dangerous operation +app.get('/deploy', function(req, res) { + var branch = req.query.branch; + child_process.exec('git checkout ' + branch, function(err, stdout) { + res.send(stdout); + }); +}); + +// Has auth check but taint still flows +app.get('/deploy-safe', function(req, res) { + if (!isAdmin(req)) { + return res.status(403).send('Forbidden'); + } + var 
branch = req.query.branch; + child_process.exec('git checkout ' + branch, function(err, stdout) { + res.send(stdout); + }); +}); diff --git a/tests/fixtures/real_world/javascript/mixed/taint_through_state.expect.json b/tests/fixtures/real_world/javascript/mixed/taint_through_state.expect.json new file mode 100644 index 00000000..3d9dbc9e --- /dev/null +++ b/tests/fixtures/real_world/javascript/mixed/taint_through_state.expect.json @@ -0,0 +1,49 @@ +{ + "description": "Combined taint and state: req.body.name used in file path, fd leaks on early return, and a separate command injection via req.query.cmd.", + "tags": [ + "mixed", + "taint", + "state", + "resource-leak", + "cmdi", + "express" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 18, + 23 + ], + "evidence_contains": [], + "notes": "req.query.cmd flows directly into child_process.exec" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 5, + 11 + ], + "evidence_contains": [], + "notes": "req.body.name flows into fs.openSync path but fs.openSync is not a recognized taint sink" + }, + { + "rule_id": "state-resource-leak-possible", + "severity": null, + "must_match": false, + "line_range": [ + 7, + 15 + ], + "evidence_contains": [], + "notes": "fd from fs.openSync leaks when early return fires at line 13; scanner may not track JS fd lifecycle" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/mixed/taint_through_state.js b/tests/fixtures/real_world/javascript/mixed/taint_through_state.js new file mode 100644 index 00000000..989f72c2 --- /dev/null +++ b/tests/fixtures/real_world/javascript/mixed/taint_through_state.js @@ -0,0 +1,24 @@ +var express = require('express'); +var fs = require('fs'); +var child_process = require('child_process'); +var app = express(); + +app.post('/upload', function(req, res) { + var filename = req.body.name; + var 
content = req.body.data; + var fd = fs.openSync('/tmp/' + filename, 'w'); + fs.writeSync(fd, content); + // fd leaks on early return + if (content.length > 1000000) { + return res.status(413).send('Too large'); + } + fs.closeSync(fd); + res.send('OK'); +}); + +app.get('/run', function(req, res) { + var cmd = req.query.cmd; + child_process.exec(cmd, function(err, stdout) { + res.send(stdout); + }); +}); diff --git a/tests/fixtures/real_world/javascript/state/db_connection.expect.json b/tests/fixtures/real_world/javascript/state/db_connection.expect.json new file mode 100644 index 00000000..7b79ddb5 --- /dev/null +++ b/tests/fixtures/real_world/javascript/state/db_connection.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Database connection leak: queryUnsafe creates a mysql connection without calling conn.end(). querySafe properly closes.", + "tags": [ + "state", + "resource-leak", + "database", + "mysql" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 11 + ], + "evidence_contains": [], + "notes": "mysql.createConnection without conn.end(); scanner may not track mysql connection lifecycle" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/state/db_connection.js b/tests/fixtures/real_world/javascript/state/db_connection.js new file mode 100644 index 00000000..1ee7bea6 --- /dev/null +++ b/tests/fixtures/real_world/javascript/state/db_connection.js @@ -0,0 +1,19 @@ +var mysql = require('mysql'); + +function queryUnsafe() { + var conn = mysql.createConnection({ host: 'localhost' }); + conn.connect(); + conn.query('SELECT 1', function(err, results) { + console.log(results); + }); + // Missing conn.end() +} + +function querySafe() { + var conn = mysql.createConnection({ host: 'localhost' }); + conn.connect(); + conn.query('SELECT 1', function(err, results) { + console.log(results); + conn.end(); + }); +} diff --git 
a/tests/fixtures/real_world/javascript/state/event_listener_leak.expect.json b/tests/fixtures/real_world/javascript/state/event_listener_leak.expect.json new file mode 100644 index 00000000..f1011858 --- /dev/null +++ b/tests/fixtures/real_world/javascript/state/event_listener_leak.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Event listener leak: socket connections accumulate without cleanup handlers. Safe version registers close and error handlers.", + "tags": [ + "state", + "resource-leak", + "event-listener", + "net" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 14 + ], + "evidence_contains": [], + "notes": "Socket connections accumulate in array without close/error handlers; scanner likely cannot track event listener patterns" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/state/event_listener_leak.js b/tests/fixtures/real_world/javascript/state/event_listener_leak.js new file mode 100644 index 00000000..9bdbcdc5 --- /dev/null +++ b/tests/fixtures/real_world/javascript/state/event_listener_leak.js @@ -0,0 +1,39 @@ +var net = require('net'); + +function startServer() { + var connections = []; + var server = net.createServer(function(socket) { + connections.push(socket); + socket.on('data', function(data) { + handleData(socket, data); + }); + // Missing: socket.on('close', ...) cleanup + // Missing: socket.on('error', ...) 
cleanup + }); + server.listen(3000); +} + +function startServerSafe() { + var connections = []; + var server = net.createServer(function(socket) { + connections.push(socket); + socket.on('data', function(data) { + handleData(socket, data); + }); + socket.on('close', function() { + var idx = connections.indexOf(socket); + if (idx !== -1) { + connections.splice(idx, 1); + } + }); + socket.on('error', function(err) { + console.error('Socket error:', err); + socket.destroy(); + }); + }); + server.listen(3000); +} + +function handleData(socket, data) { + socket.write('echo: ' + data.toString()); +} diff --git a/tests/fixtures/real_world/javascript/state/fd_leak.expect.json b/tests/fixtures/real_world/javascript/state/fd_leak.expect.json new file mode 100644 index 00000000..15cecbc7 --- /dev/null +++ b/tests/fixtures/real_world/javascript/state/fd_leak.expect.json @@ -0,0 +1,25 @@ +{ + "description": "File descriptor leak in readAndProcess (missing closeSync). readAndClose properly closes the fd.", + "tags": [ + "state", + "resource-leak", + "fd", + "fs" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 10 + ], + "evidence_contains": [], + "notes": "fd from fs.openSync is never closed in readAndProcess; scanner may not track JS fd lifecycle" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/state/fd_leak.js b/tests/fixtures/real_world/javascript/state/fd_leak.js new file mode 100644 index 00000000..4d844293 --- /dev/null +++ b/tests/fixtures/real_world/javascript/state/fd_leak.js @@ -0,0 +1,17 @@ +var fs = require('fs'); + +function readAndProcess(path) { + var fd = fs.openSync(path, 'r'); + var buf = Buffer.alloc(1024); + fs.readSync(fd, buf); + // Missing: fs.closeSync(fd) + return buf.toString(); +} + +function readAndClose(path) { + var fd = fs.openSync(path, 'r'); + var buf = Buffer.alloc(1024); + fs.readSync(fd, buf); + fs.closeSync(fd); + return 
buf.toString(); +} diff --git a/tests/fixtures/real_world/javascript/state/handle_reuse.expect.json b/tests/fixtures/real_world/javascript/state/handle_reuse.expect.json new file mode 100644 index 00000000..038c46ed --- /dev/null +++ b/tests/fixtures/real_world/javascript/state/handle_reuse.expect.json @@ -0,0 +1,37 @@ +{ + "description": "Double-close and use-after-close patterns with file descriptors. Both are temporal safety violations.", + "tags": [ + "state", + "double-close", + "use-after-close", + "fd", + "fs" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-double-close", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 8 + ], + "evidence_contains": [], + "notes": "fd closed twice in doubleClose; scanner may not track JS fd state transitions" + }, + { + "rule_id": "state-use-after-close", + "severity": null, + "must_match": false, + "line_range": [ + 8, + 15 + ], + "evidence_contains": [], + "notes": "fd used in readSync after closeSync in useAfterClose; scanner may not track JS fd state transitions" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/state/handle_reuse.js b/tests/fixtures/real_world/javascript/state/handle_reuse.js new file mode 100644 index 00000000..50ce8d1e --- /dev/null +++ b/tests/fixtures/real_world/javascript/state/handle_reuse.js @@ -0,0 +1,14 @@ +var fs = require('fs'); + +function doubleClose(path) { + var fd = fs.openSync(path, 'r'); + fs.closeSync(fd); + fs.closeSync(fd); // double close! +} + +function useAfterClose(path) { + var fd = fs.openSync(path, 'r'); + fs.closeSync(fd); + var buf = Buffer.alloc(1024); + fs.readSync(fd, buf); // use after close! 
+} diff --git a/tests/fixtures/real_world/javascript/taint/cmdi_express.expect.json b/tests/fixtures/real_world/javascript/taint/cmdi_express.expect.json new file mode 100644 index 00000000..a0a4087a --- /dev/null +++ b/tests/fixtures/real_world/javascript/taint/cmdi_express.expect.json @@ -0,0 +1,35 @@ +{ + "description": "Express req.query flows into child_process.exec (command injection). Safe version uses regex replace but scanner lacks custom sanitizer recognition.", + "tags": [ + "taint", + "cmdi", + "express" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 10 + ], + "evidence_contains": [], + "notes": "req.query.host flows directly into child_process.exec via string concatenation" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 12, + 19 + ], + "evidence_contains": [], + "notes": "Safe version still fires because .replace is not a recognized sanitizer for SHELL_ESCAPE cap" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/taint/cmdi_express.js b/tests/fixtures/real_world/javascript/taint/cmdi_express.js new file mode 100644 index 00000000..8d55bb25 --- /dev/null +++ b/tests/fixtures/real_world/javascript/taint/cmdi_express.js @@ -0,0 +1,18 @@ +var child_process = require('child_process'); +var express = require('express'); +var app = express(); + +app.get('/ping', function(req, res) { + var host = req.query.host; + child_process.exec('ping -c 1 ' + host, function(err, stdout) { + res.send(stdout); + }); +}); + +app.get('/safe-ping', function(req, res) { + var host = req.query.host; + var sanitized = host.replace(/[^a-zA-Z0-9.]/g, ''); + child_process.exec('ping -c 1 ' + sanitized, function(err, stdout) { + res.send(stdout); + }); +}); diff --git a/tests/fixtures/real_world/javascript/taint/eval_user_input.expect.json 
b/tests/fixtures/real_world/javascript/taint/eval_user_input.expect.json new file mode 100644 index 00000000..33f525f1 --- /dev/null +++ b/tests/fixtures/real_world/javascript/taint/eval_user_input.expect.json @@ -0,0 +1,36 @@ +{ + "description": "eval() with user-controlled input from req.query. AST pattern detects eval call; taint detects source-to-sink flow.", + "tags": [ + "taint", + "code-exec", + "eval", + "express" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "js.code_exec.eval", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": "AST pattern matches eval() call regardless of arguments" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 8 + ], + "evidence_contains": [], + "notes": "req.query.expr flows directly into eval() which is a SHELL_ESCAPE sink" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/taint/eval_user_input.js b/tests/fixtures/real_world/javascript/taint/eval_user_input.js new file mode 100644 index 00000000..d6a16f2c --- /dev/null +++ b/tests/fixtures/real_world/javascript/taint/eval_user_input.js @@ -0,0 +1,17 @@ +var express = require('express'); +var app = express(); + +app.get('/calc', function(req, res) { + var expr = req.query.expr; + var result = eval(expr); + res.json({ result: result }); +}); + +app.get('/calc-safe', function(req, res) { + var expr = req.query.expr; + var num = parseFloat(expr); + if (isNaN(num)) { + return res.status(400).send('Invalid'); + } + res.json({ result: num }); +}); diff --git a/tests/fixtures/real_world/javascript/taint/path_traversal_fs.expect.json b/tests/fixtures/real_world/javascript/taint/path_traversal_fs.expect.json new file mode 100644 index 00000000..feb2bad0 --- /dev/null +++ b/tests/fixtures/real_world/javascript/taint/path_traversal_fs.expect.json @@ -0,0 +1,36 @@ +{ + "description": "Path traversal via req.query flowing into 
fs.readFileSync. Scanner lacks fs.readFileSync as a defined sink.", + "tags": [ + "taint", + "path-traversal", + "express", + "fs" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 5, + 10 + ], + "evidence_contains": [], + "notes": "req.query.path flows into fs.readFileSync but fs.readFileSync is not a recognized taint sink" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 11, + 20 + ], + "evidence_contains": [], + "notes": "Safe version uses path.resolve and startsWith guard; would require adding fs sinks to detect" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/taint/path_traversal_fs.js b/tests/fixtures/real_world/javascript/taint/path_traversal_fs.js new file mode 100644 index 00000000..b5964340 --- /dev/null +++ b/tests/fixtures/real_world/javascript/taint/path_traversal_fs.js @@ -0,0 +1,20 @@ +var express = require('express'); +var fs = require('fs'); +var path = require('path'); +var app = express(); + +app.get('/read', function(req, res) { + var filePath = req.query.path; + var content = fs.readFileSync(filePath, 'utf8'); + res.send(content); +}); + +app.get('/read-safe', function(req, res) { + var filePath = req.query.path; + var resolved = path.resolve('/safe/dir', filePath); + if (!resolved.startsWith('/safe/dir')) { + return res.status(403).send('Forbidden'); + } + var content = fs.readFileSync(resolved, 'utf8'); + res.send(content); +}); diff --git a/tests/fixtures/real_world/javascript/taint/proto_pollution.expect.json b/tests/fixtures/real_world/javascript/taint/proto_pollution.expect.json new file mode 100644 index 00000000..b0b3dc75 --- /dev/null +++ b/tests/fixtures/real_world/javascript/taint/proto_pollution.expect.json @@ -0,0 +1,35 @@ +{ + "description": "Prototype pollution via recursive merge of user-controlled req.body. 
The __proto__ assignment is indirect (dynamic key).", + "tags": [ + "taint", + "prototype-pollution", + "express" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "js.prototype.proto_assignment", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 8 + ], + "evidence_contains": [], + "notes": "Dynamic property assignment target[key] could pollute __proto__ but AST pattern only matches literal __proto__ property access" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 13, + 19 + ], + "evidence_contains": [], + "notes": "req.body flows through merge into config but no recognized sink is reached" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/taint/proto_pollution.js b/tests/fixtures/real_world/javascript/taint/proto_pollution.js new file mode 100644 index 00000000..9b81fad4 --- /dev/null +++ b/tests/fixtures/real_world/javascript/taint/proto_pollution.js @@ -0,0 +1,19 @@ +function merge(target, source) { + for (var key in source) { + if (typeof source[key] === 'object') { + target[key] = merge(target[key] || {}, source[key]); + } else { + target[key] = source[key]; + } + } + return target; +} + +var express = require('express'); +var app = express(); + +app.post('/config', function(req, res) { + var defaults = { theme: 'light', lang: 'en' }; + var config = merge(defaults, req.body); + res.json(config); +}); diff --git a/tests/fixtures/real_world/javascript/taint/sqli_concat.expect.json b/tests/fixtures/real_world/javascript/taint/sqli_concat.expect.json new file mode 100644 index 00000000..f675f088 --- /dev/null +++ b/tests/fixtures/real_world/javascript/taint/sqli_concat.expect.json @@ -0,0 +1,24 @@ +{ + "description": "SQL injection via string concatenation with userId parameter. 
connection.query is not a recognized taint sink.", + "tags": [ + "taint", + "sqli", + "mysql" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 7 + ], + "evidence_contains": [], + "notes": "userId flows into SQL string via concat, but connection.query is not a defined sink and userId as a function param is not auto-tainted" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/taint/sqli_concat.js b/tests/fixtures/real_world/javascript/taint/sqli_concat.js new file mode 100644 index 00000000..51fb7210 --- /dev/null +++ b/tests/fixtures/real_world/javascript/taint/sqli_concat.js @@ -0,0 +1,14 @@ +var mysql = require('mysql'); + +function getUser(connection, userId) { + var query = 'SELECT * FROM users WHERE id = ' + userId; + connection.query(query, function(err, results) { + return results; + }); +} + +function getUserSafe(connection, userId) { + connection.query('SELECT * FROM users WHERE id = ?', [userId], function(err, results) { + return results; + }); +} diff --git a/tests/fixtures/real_world/javascript/taint/xss_res_send.expect.json b/tests/fixtures/real_world/javascript/taint/xss_res_send.expect.json new file mode 100644 index 00000000..ca929109 --- /dev/null +++ b/tests/fixtures/real_world/javascript/taint/xss_res_send.expect.json @@ -0,0 +1,36 @@ +{ + "description": "XSS via req.query flowing into innerHTML. 
DOMPurify.sanitize is a recognized HTML_ESCAPE sanitizer.", + "tags": [ + "taint", + "xss", + "express", + "innerHTML" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 8 + ], + "evidence_contains": [], + "notes": "req.query.name flows into innerHTML via string concatenation" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 8, + 14 + ], + "evidence_contains": [], + "notes": "DOMPurify.sanitize strips HTML_ESCAPE cap so this should NOT fire; must_match=false means we expect absence" + } + ] +} diff --git a/tests/fixtures/real_world/javascript/taint/xss_res_send.js b/tests/fixtures/real_world/javascript/taint/xss_res_send.js new file mode 100644 index 00000000..666d0f7f --- /dev/null +++ b/tests/fixtures/real_world/javascript/taint/xss_res_send.js @@ -0,0 +1,13 @@ +var express = require('express'); +var app = express(); + +app.get('/greet', function(req, res) { + var name = req.query.name; + document.getElementById('header').innerHTML = '
Hello ' + name + '
'; +}); + +app.get('/greet-safe', function(req, res) { + var name = req.query.name; + var clean = DOMPurify.sanitize(name); + document.getElementById('header').innerHTML = '
Hello ' + clean + '
'; +}); diff --git a/tests/fixtures/real_world/php/cfg/curl_lifecycle.expect.json b/tests/fixtures/real_world/php/cfg/curl_lifecycle.expect.json new file mode 100644 index 00000000..24a850fb --- /dev/null +++ b/tests/fixtures/real_world/php/cfg/curl_lifecycle.expect.json @@ -0,0 +1,25 @@ +{ + "description": "cURL handle resource leak - missing curl_close vs properly closed", + "tags": [ + "cfg", + "resource-leak", + "curl", + "php" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 9 + ], + "evidence_contains": [], + "notes": "fetchUrl opens cURL handle but never calls curl_close" + } + ] +} diff --git a/tests/fixtures/real_world/php/cfg/curl_lifecycle.php b/tests/fixtures/real_world/php/cfg/curl_lifecycle.php new file mode 100644 index 00000000..e9bd31d4 --- /dev/null +++ b/tests/fixtures/real_world/php/cfg/curl_lifecycle.php @@ -0,0 +1,17 @@ + diff --git a/tests/fixtures/real_world/php/cfg/error_fallthrough.expect.json b/tests/fixtures/real_world/php/cfg/error_fallthrough.expect.json new file mode 100644 index 00000000..038c4fe6 --- /dev/null +++ b/tests/fixtures/real_world/php/cfg/error_fallthrough.expect.json @@ -0,0 +1,37 @@ +{ + "description": "Error check logs but falls through to dangerous system() call with null result", + "tags": [ + "cfg", + "error-fallthrough", + "cmdi", + "php" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "cfg-error-fallthrough", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 10 + ], + "evidence_contains": [], + "notes": "Error condition logs but does not return, falls through to system() call" + }, + { + "rule_id": "php.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": "system() called with potentially null array access" + } + ] +} diff --git a/tests/fixtures/real_world/php/cfg/error_fallthrough.php 
b/tests/fixtures/real_world/php/cfg/error_fallthrough.php new file mode 100644 index 00000000..d98bbd60 --- /dev/null +++ b/tests/fixtures/real_world/php/cfg/error_fallthrough.php @@ -0,0 +1,10 @@ + diff --git a/tests/fixtures/real_world/php/cfg/switch_case.expect.json b/tests/fixtures/real_world/php/cfg/switch_case.expect.json new file mode 100644 index 00000000..0baf5964 --- /dev/null +++ b/tests/fixtures/real_world/php/cfg/switch_case.expect.json @@ -0,0 +1,38 @@ +{ + "description": "Switch/case dispatching to dangerous functions eval and system", + "tags": [ + "cfg", + "code-exec", + "cmdi", + "switch", + "php" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "php.code_exec.eval", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": "eval() called in switch case with function parameter" + }, + { + "rule_id": "php.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": "system() called in switch case with function parameter" + } + ] +} diff --git a/tests/fixtures/real_world/php/cfg/switch_case.php b/tests/fixtures/real_world/php/cfg/switch_case.php new file mode 100644 index 00000000..335aba00 --- /dev/null +++ b/tests/fixtures/real_world/php/cfg/switch_case.php @@ -0,0 +1,17 @@ + diff --git a/tests/fixtures/real_world/php/cfg/try_catch_finally.expect.json b/tests/fixtures/real_world/php/cfg/try_catch_finally.expect.json new file mode 100644 index 00000000..138f67b2 --- /dev/null +++ b/tests/fixtures/real_world/php/cfg/try_catch_finally.expect.json @@ -0,0 +1,26 @@ +{ + "description": "File handle resource management with try/catch/finally vs early return leak", + "tags": [ + "cfg", + "resource-leak", + "try-finally", + "file-io", + "php" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 12, + 24 + ], + 
"evidence_contains": [], + "notes": "leakyProcess returns null on empty data without closing file handle" + } + ] +} diff --git a/tests/fixtures/real_world/php/cfg/try_catch_finally.php b/tests/fixtures/real_world/php/cfg/try_catch_finally.php new file mode 100644 index 00000000..0ec5acb0 --- /dev/null +++ b/tests/fixtures/real_world/php/cfg/try_catch_finally.php @@ -0,0 +1,23 @@ +getMessage(); + } finally { + fclose($fh); + } +} + +function leakyProcess($path) { + $fh = fopen($path, 'r'); + $data = fread($fh, filesize($path)); + if (empty($data)) { + return null; // fh leaked + } + fclose($fh); + return $data; +} +?> diff --git a/tests/fixtures/real_world/php/mixed/upload_cmdi.expect.json b/tests/fixtures/real_world/php/mixed/upload_cmdi.expect.json new file mode 100644 index 00000000..a007e5b5 --- /dev/null +++ b/tests/fixtures/real_world/php/mixed/upload_cmdi.expect.json @@ -0,0 +1,37 @@ +{ + "description": "File upload with command injection via unsanitized filename in system() call", + "tags": [ + "taint", + "cmdi", + "file-upload", + "php", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 1, + 8 + ], + "evidence_contains": [], + "notes": "$_FILES filename flows through concatenation into system() chmod call" + }, + { + "rule_id": "php.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": "system() with string concatenation containing user-controlled filename" + } + ] +} diff --git a/tests/fixtures/real_world/php/mixed/upload_cmdi.php b/tests/fixtures/real_world/php/mixed/upload_cmdi.php new file mode 100644 index 00000000..cac5b71c --- /dev/null +++ b/tests/fixtures/real_world/php/mixed/upload_cmdi.php @@ -0,0 +1,8 @@ + diff --git a/tests/fixtures/real_world/php/mixed/web_handler.expect.json b/tests/fixtures/real_world/php/mixed/web_handler.expect.json new file mode 100644 
index 00000000..1926e86d --- /dev/null +++ b/tests/fixtures/real_world/php/mixed/web_handler.expect.json @@ -0,0 +1,61 @@ +{ + "description": "Web handler dispatches to system, eval, and fopen based on user input", + "tags": [ + "taint", + "state", + "cmdi", + "code-exec", + "resource-leak", + "php", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "php.cmdi.system", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": "system() called with $_POST data" + }, + { + "rule_id": "php.code_exec.eval", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": "eval() called with $_POST data" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 1, + 8 + ], + "evidence_contains": [], + "notes": "$_POST['data'] flows into system()" + }, + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 8, + 14 + ], + "evidence_contains": [], + "notes": "File handle opened in read branch but never closed" + } + ] +} diff --git a/tests/fixtures/real_world/php/mixed/web_handler.php b/tests/fixtures/real_world/php/mixed/web_handler.php new file mode 100644 index 00000000..97f966b2 --- /dev/null +++ b/tests/fixtures/real_world/php/mixed/web_handler.php @@ -0,0 +1,14 @@ + diff --git a/tests/fixtures/real_world/php/state/branch_leak.expect.json b/tests/fixtures/real_world/php/state/branch_leak.expect.json new file mode 100644 index 00000000..59ab8831 --- /dev/null +++ b/tests/fixtures/real_world/php/state/branch_leak.expect.json @@ -0,0 +1,26 @@ +{ + "description": "File handle leaked in else branch of conditional", + "tags": [ + "state", + "resource-leak", + "branch", + "file-io", + "php" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak-possible", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 
14 + ], + "evidence_contains": [], + "notes": "File handle closed in if branch but leaked in else branch" + } + ] +} diff --git a/tests/fixtures/real_world/php/state/branch_leak.php b/tests/fixtures/real_world/php/state/branch_leak.php new file mode 100644 index 00000000..49abdfdc --- /dev/null +++ b/tests/fixtures/real_world/php/state/branch_leak.php @@ -0,0 +1,13 @@ + diff --git a/tests/fixtures/real_world/php/state/curl_state.expect.json b/tests/fixtures/real_world/php/state/curl_state.expect.json new file mode 100644 index 00000000..876c2fa3 --- /dev/null +++ b/tests/fixtures/real_world/php/state/curl_state.expect.json @@ -0,0 +1,25 @@ +{ + "description": "cURL handle used after being closed", + "tags": [ + "state", + "use-after-close", + "curl", + "php" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-use-after-close", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 9 + ], + "evidence_contains": [], + "notes": "curl_exec called after curl_close on same handle" + } + ] +} diff --git a/tests/fixtures/real_world/php/state/curl_state.php b/tests/fixtures/real_world/php/state/curl_state.php new file mode 100644 index 00000000..5dc8167e --- /dev/null +++ b/tests/fixtures/real_world/php/state/curl_state.php @@ -0,0 +1,9 @@ + diff --git a/tests/fixtures/real_world/php/state/db_connection.expect.json b/tests/fixtures/real_world/php/state/db_connection.expect.json new file mode 100644 index 00000000..026ea802 --- /dev/null +++ b/tests/fixtures/real_world/php/state/db_connection.expect.json @@ -0,0 +1,25 @@ +{ + "description": "mysqli connection resource leak - missing close vs properly closed", + "tags": [ + "state", + "resource-leak", + "mysqli", + "php" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 8 + ], + "evidence_contains": [], + "notes": "queryUnsafe creates mysqli connection but never calls close()" + } + ] +} 
diff --git a/tests/fixtures/real_world/php/state/db_connection.php b/tests/fixtures/real_world/php/state/db_connection.php new file mode 100644 index 00000000..4d28c7a3 --- /dev/null +++ b/tests/fixtures/real_world/php/state/db_connection.php @@ -0,0 +1,15 @@ +query("SELECT 1"); + return $result; + // $conn never closed +} + +function querySafe() { + $conn = new mysqli("localhost", "user", "pass", "db"); + $result = $conn->query("SELECT 1"); + $conn->close(); + return $result; +} +?> diff --git a/tests/fixtures/real_world/php/state/file_handle.expect.json b/tests/fixtures/real_world/php/state/file_handle.expect.json new file mode 100644 index 00000000..b80330de --- /dev/null +++ b/tests/fixtures/real_world/php/state/file_handle.expect.json @@ -0,0 +1,37 @@ +{ + "description": "PHP file handle lifecycle: leak, proper close, double close", + "tags": [ + "state", + "resource-leak", + "double-close", + "file-io", + "php" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 7 + ], + "evidence_contains": [], + "notes": "readAndLeak opens file handle but never calls fclose" + }, + { + "rule_id": "state-double-close", + "severity": null, + "must_match": false, + "line_range": [ + 13, + 21 + ], + "evidence_contains": [], + "notes": "doubleClose calls fclose twice on same handle" + } + ] +} diff --git a/tests/fixtures/real_world/php/state/file_handle.php b/tests/fixtures/real_world/php/state/file_handle.php new file mode 100644 index 00000000..b68f4952 --- /dev/null +++ b/tests/fixtures/real_world/php/state/file_handle.php @@ -0,0 +1,20 @@ + diff --git a/tests/fixtures/real_world/php/taint/cmdi_shell_exec.expect.json b/tests/fixtures/real_world/php/taint/cmdi_shell_exec.expect.json new file mode 100644 index 00000000..a4f59c72 --- /dev/null +++ b/tests/fixtures/real_world/php/taint/cmdi_shell_exec.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Command injection via 
shell_exec with unsanitized GET parameter", + "tags": [ + "taint", + "cmdi", + "shell_exec", + "php" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 1, + 6 + ], + "evidence_contains": [], + "notes": "$_GET['cmd'] flows directly into shell_exec without sanitization" + } + ] +} diff --git a/tests/fixtures/real_world/php/taint/cmdi_shell_exec.php b/tests/fixtures/real_world/php/taint/cmdi_shell_exec.php new file mode 100644 index 00000000..11b2bb41 --- /dev/null +++ b/tests/fixtures/real_world/php/taint/cmdi_shell_exec.php @@ -0,0 +1,10 @@ + diff --git a/tests/fixtures/real_world/php/taint/deser_unserialize.expect.json b/tests/fixtures/real_world/php/taint/deser_unserialize.expect.json new file mode 100644 index 00000000..961a5b94 --- /dev/null +++ b/tests/fixtures/real_world/php/taint/deser_unserialize.expect.json @@ -0,0 +1,36 @@ +{ + "description": "Unsafe deserialization of user-controlled cookie data via unserialize", + "tags": [ + "taint", + "deser", + "unserialize", + "php" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "php.deser.unserialize", + "severity": null, + "must_match": true, + "line_range": [ + 1, + 5 + ], + "evidence_contains": [], + "notes": "unserialize() on cookie data enables object injection" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 1, + 5 + ], + "evidence_contains": [], + "notes": "$_COOKIE['session_data'] flows into unserialize()" + } + ] +} diff --git a/tests/fixtures/real_world/php/taint/deser_unserialize.php b/tests/fixtures/real_world/php/taint/deser_unserialize.php new file mode 100644 index 00000000..3cd5380b --- /dev/null +++ b/tests/fixtures/real_world/php/taint/deser_unserialize.php @@ -0,0 +1,10 @@ +name; + +// Safe: JSON instead +$json_data = $_COOKIE['json_data']; +$safe_obj = json_decode($json_data); +echo $safe_obj->name; +?> diff --git 
a/tests/fixtures/real_world/php/taint/eval_input.expect.json b/tests/fixtures/real_world/php/taint/eval_input.expect.json new file mode 100644 index 00000000..bd170043 --- /dev/null +++ b/tests/fixtures/real_world/php/taint/eval_input.expect.json @@ -0,0 +1,58 @@ +{ + "description": "eval() called with user-controlled POST and GET input", + "tags": [ + "taint", + "code-exec", + "eval", + "php" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "php.code_exec.eval", + "severity": null, + "must_match": true, + "line_range": [ + 1, + 5 + ], + "evidence_contains": [], + "notes": "eval() with $_POST['code'] - AST pattern match" + }, + { + "rule_id": "php.code_exec.eval", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": "eval() with $_GET['expr'] concatenated - AST pattern match" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 1, + 5 + ], + "evidence_contains": [], + "notes": "$_POST['code'] flows directly into eval()" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 8 + ], + "evidence_contains": [], + "notes": "$_GET['expr'] concatenated and flows into eval()" + } + ] +} diff --git a/tests/fixtures/real_world/php/taint/eval_input.php b/tests/fixtures/real_world/php/taint/eval_input.php new file mode 100644 index 00000000..5e4a3195 --- /dev/null +++ b/tests/fixtures/real_world/php/taint/eval_input.php @@ -0,0 +1,8 @@ + diff --git a/tests/fixtures/real_world/php/taint/file_upload.expect.json b/tests/fixtures/real_world/php/taint/file_upload.expect.json new file mode 100644 index 00000000..02cb17fb --- /dev/null +++ b/tests/fixtures/real_world/php/taint/file_upload.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Unsafe file upload with user-controlled filename used in move_uploaded_file", + "tags": [ + "taint", + "file-upload", + "path-traversal", + "php" + ], + 
"modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 1, + 6 + ], + "evidence_contains": [], + "notes": "$_FILES['upload']['name'] flows into file path without sanitization" + } + ] +} diff --git a/tests/fixtures/real_world/php/taint/file_upload.php b/tests/fixtures/real_world/php/taint/file_upload.php new file mode 100644 index 00000000..a628f8da --- /dev/null +++ b/tests/fixtures/real_world/php/taint/file_upload.php @@ -0,0 +1,12 @@ + diff --git a/tests/fixtures/real_world/php/taint/include_rfi.expect.json b/tests/fixtures/real_world/php/taint/include_rfi.expect.json new file mode 100644 index 00000000..4f482d44 --- /dev/null +++ b/tests/fixtures/real_world/php/taint/include_rfi.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Remote file inclusion via user-controlled include path", + "tags": [ + "taint", + "rfi", + "include", + "php" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 1, + 5 + ], + "evidence_contains": [], + "notes": "$_GET['page'] flows directly into include() without whitelist check" + } + ] +} diff --git a/tests/fixtures/real_world/php/taint/include_rfi.php b/tests/fixtures/real_world/php/taint/include_rfi.php new file mode 100644 index 00000000..ce36e331 --- /dev/null +++ b/tests/fixtures/real_world/php/taint/include_rfi.php @@ -0,0 +1,11 @@ + diff --git a/tests/fixtures/real_world/php/taint/sqli_concat.expect.json b/tests/fixtures/real_world/php/taint/sqli_concat.expect.json new file mode 100644 index 00000000..6b6cd085 --- /dev/null +++ b/tests/fixtures/real_world/php/taint/sqli_concat.expect.json @@ -0,0 +1,25 @@ +{ + "description": "SQL injection via string concatenation with user input in mysqli query", + "tags": [ + "taint", + "sqli", + "mysqli", + "php" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + 
"severity": null, + "must_match": true, + "line_range": [ + 1, + 6 + ], + "evidence_contains": [], + "notes": "$_GET['id'] concatenated directly into SQL query string" + } + ] +} diff --git a/tests/fixtures/real_world/php/taint/sqli_concat.php b/tests/fixtures/real_world/php/taint/sqli_concat.php new file mode 100644 index 00000000..ee6b018e --- /dev/null +++ b/tests/fixtures/real_world/php/taint/sqli_concat.php @@ -0,0 +1,13 @@ +query("SELECT * FROM users WHERE id = " . $id); +while ($row = $result->fetch_assoc()) { + echo $row['name']; +} + +// Safe version +$stmt = $conn->prepare("SELECT * FROM users WHERE id = ?"); +$stmt->bind_param("s", $_GET['safe_id']); +$stmt->execute(); +?> diff --git a/tests/fixtures/real_world/python/cfg/context_manager.expect.json b/tests/fixtures/real_world/python/cfg/context_manager.expect.json new file mode 100644 index 00000000..7daa5bcc --- /dev/null +++ b/tests/fixtures/real_world/python/cfg/context_manager.expect.json @@ -0,0 +1,25 @@ +{ + "description": "File handle resource management comparing manual open vs context manager", + "tags": [ + "cfg", + "resource-leak", + "context-manager", + "file-io" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 7 + ], + "evidence_contains": [], + "notes": "read_file_unsafe opens file handle but never closes it" + } + ] +} diff --git a/tests/fixtures/real_world/python/cfg/context_manager.py b/tests/fixtures/real_world/python/cfg/context_manager.py new file mode 100644 index 00000000..38f71a69 --- /dev/null +++ b/tests/fixtures/real_world/python/cfg/context_manager.py @@ -0,0 +1,15 @@ +def read_file_unsafe(path): + f = open(path, 'r') + data = f.read() + return data + # f never closed + +def read_file_safe(path): + with open(path, 'r') as f: + data = f.read() + return data + +def nested_context(path1, path2): + with open(path1, 'r') as f1: + with open(path2, 'w') as f2: + 
f2.write(f1.read()) diff --git a/tests/fixtures/real_world/python/cfg/early_return.expect.json b/tests/fixtures/real_world/python/cfg/early_return.expect.json new file mode 100644 index 00000000..48fe0be8 --- /dev/null +++ b/tests/fixtures/real_world/python/cfg/early_return.expect.json @@ -0,0 +1,36 @@ +{ + "description": "Early return leaks file handle when header check fails", + "tags": [ + "cfg", + "resource-leak", + "early-return", + "file-io" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 12 + ], + "evidence_contains": [], + "notes": "process_file leaks file handle on early return when header does not start with #" + }, + { + "rule_id": "state-resource-leak-possible", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 9 + ], + "evidence_contains": [], + "notes": "File handle leaked on one branch of the conditional" + } + ] +} diff --git a/tests/fixtures/real_world/python/cfg/early_return.py b/tests/fixtures/real_world/python/cfg/early_return.py new file mode 100644 index 00000000..e70b10fd --- /dev/null +++ b/tests/fixtures/real_world/python/cfg/early_return.py @@ -0,0 +1,19 @@ +import os + +def process_file(path): + f = open(path, 'r') + header = f.readline() + if not header.startswith('#'): + return None # leak: f not closed + data = f.read() + f.close() + return data + +def process_with_guard(path): + if not os.path.exists(path): + return None + f = open(path, 'r') + try: + return f.read() + finally: + f.close() diff --git a/tests/fixtures/real_world/python/cfg/raise_terminator.expect.json b/tests/fixtures/real_world/python/cfg/raise_terminator.expect.json new file mode 100644 index 00000000..a541e701 --- /dev/null +++ b/tests/fixtures/real_world/python/cfg/raise_terminator.expect.json @@ -0,0 +1,37 @@ +{ + "description": "Validator raises exception on invalid input, acting as a guard before subprocess", + "tags": [ + "cfg", + 
"validation", + "raise", + "flask", + "subprocess" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 14, + 21 + ], + "evidence_contains": [], + "notes": "Validator raise acts as guard - ideally no taint finding since invalid input is rejected" + }, + { + "rule_id": "cfg-unguarded-sink", + "severity": null, + "must_match": false, + "line_range": [ + 15, + 20 + ], + "evidence_contains": [], + "notes": "Subprocess call is guarded by validate_cmd raise - should not trigger" + } + ] +} diff --git a/tests/fixtures/real_world/python/cfg/raise_terminator.py b/tests/fixtures/real_world/python/cfg/raise_terminator.py new file mode 100644 index 00000000..5543b380 --- /dev/null +++ b/tests/fixtures/real_world/python/cfg/raise_terminator.py @@ -0,0 +1,19 @@ +from flask import Flask, request +import subprocess + +app = Flask(__name__) + +class ValidationError(Exception): + pass + +def validate_cmd(cmd): + if not cmd.isalnum(): + raise ValidationError("Invalid command") + return cmd + +@app.route('/exec') +def exec_cmd(): + cmd = request.args.get('cmd') + validated = validate_cmd(cmd) + result = subprocess.run([validated], capture_output=True) + return result.stdout.decode() diff --git a/tests/fixtures/real_world/python/cfg/try_except_resource.expect.json b/tests/fixtures/real_world/python/cfg/try_except_resource.expect.json new file mode 100644 index 00000000..9ae53d18 --- /dev/null +++ b/tests/fixtures/real_world/python/cfg/try_except_resource.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Database connection resource management with try/except/finally vs missing close", + "tags": [ + "cfg", + "resource-leak", + "sqlite", + "try-finally" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 14, + 23 + ], + "evidence_contains": [], + "notes": "query_db_leak opens sqlite3 
connection but never closes it" + } + ] +} diff --git a/tests/fixtures/real_world/python/cfg/try_except_resource.py b/tests/fixtures/real_world/python/cfg/try_except_resource.py new file mode 100644 index 00000000..0132d2b6 --- /dev/null +++ b/tests/fixtures/real_world/python/cfg/try_except_resource.py @@ -0,0 +1,21 @@ +import sqlite3 + +def query_db(path, sql): + conn = sqlite3.connect(path) + try: + cursor = conn.cursor() + cursor.execute(sql) + results = cursor.fetchall() + return results + except Exception as e: + print(f"Error: {e}") + finally: + conn.close() + +def query_db_leak(path, sql): + conn = sqlite3.connect(path) + cursor = conn.cursor() + cursor.execute(sql) + results = cursor.fetchall() + return results + # conn never closed diff --git a/tests/fixtures/real_world/python/mixed/flask_full_stack.expect.json b/tests/fixtures/real_world/python/mixed/flask_full_stack.expect.json new file mode 100644 index 00000000..1ea1728c --- /dev/null +++ b/tests/fixtures/real_world/python/mixed/flask_full_stack.expect.json @@ -0,0 +1,72 @@ +{ + "description": "Flask app with multiple vulnerability types: cmdi, path traversal, eval, resource leak", + "tags": [ + "taint", + "state", + "cmdi", + "path-traversal", + "eval", + "flask", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 7, + 13 + ], + "evidence_contains": [], + "notes": "request.args.get('cmd') flows into subprocess.run with shell=True" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 13, + 20 + ], + "evidence_contains": [], + "notes": "request.args.get('path') flows into open() - path traversal" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 21, + 26 + ], + "evidence_contains": [], + "notes": "request.args.get('expr') flows into eval()" + }, + { + "rule_id": 
"py.code_exec.eval", + "severity": null, + "must_match": true, + "line_range": [ + 22, + 26 + ], + "evidence_contains": [], + "notes": "eval() is a dangerous function - AST pattern match" + }, + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 12, + 21 + ], + "evidence_contains": [], + "notes": "File handle opened in read_file but never closed" + } + ] +} diff --git a/tests/fixtures/real_world/python/mixed/flask_full_stack.py b/tests/fixtures/real_world/python/mixed/flask_full_stack.py new file mode 100644 index 00000000..3fb5f2b3 --- /dev/null +++ b/tests/fixtures/real_world/python/mixed/flask_full_stack.py @@ -0,0 +1,24 @@ +from flask import Flask, request +import subprocess +import os + +app = Flask(__name__) + +@app.route('/api/exec') +def execute(): + cmd = request.args.get('cmd') + result = subprocess.run(cmd, shell=True, capture_output=True) + return result.stdout.decode() + +@app.route('/api/read') +def read_file(): + path = request.args.get('path') + f = open(path, 'r') + data = f.read() + return data + # f leaked + path traversal taint + +@app.route('/api/eval') +def eval_expr(): + expr = request.args.get('expr') + return str(eval(expr)) diff --git a/tests/fixtures/real_world/python/mixed/taint_through_file.expect.json b/tests/fixtures/real_world/python/mixed/taint_through_file.expect.json new file mode 100644 index 00000000..fa7c0f9b --- /dev/null +++ b/tests/fixtures/real_world/python/mixed/taint_through_file.expect.json @@ -0,0 +1,38 @@ +{ + "description": "User-controlled filename in open() with resource leak on early return", + "tags": [ + "taint", + "state", + "path-traversal", + "resource-leak", + "flask", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 13 + ], + "evidence_contains": [], + "notes": "request.args.get('name') flows through os.path.join into open()" + }, + { + 
"rule_id": "state-resource-leak-possible", + "severity": null, + "must_match": false, + "line_range": [ + 9, + 18 + ], + "evidence_contains": [], + "notes": "File handle leaked when early return triggered by data length check" + } + ] +} diff --git a/tests/fixtures/real_world/python/mixed/taint_through_file.py b/tests/fixtures/real_world/python/mixed/taint_through_file.py new file mode 100644 index 00000000..240388d8 --- /dev/null +++ b/tests/fixtures/real_world/python/mixed/taint_through_file.py @@ -0,0 +1,17 @@ +from flask import Flask, request +import os + +app = Flask(__name__) + +@app.route('/save') +def save_data(): + filename = request.args.get('name') + data = request.args.get('data') + filepath = os.path.join('/tmp', filename) + f = open(filepath, 'w') + f.write(data) + if len(data) > 10000: + return 'Too large', 413 + # f leaks on early return + f.close() + return 'OK' diff --git a/tests/fixtures/real_world/python/state/branch_leak.expect.json b/tests/fixtures/real_world/python/state/branch_leak.expect.json new file mode 100644 index 00000000..7ffab32f --- /dev/null +++ b/tests/fixtures/real_world/python/state/branch_leak.expect.json @@ -0,0 +1,25 @@ +{ + "description": "File handle leaked in else branch of conditional", + "tags": [ + "state", + "resource-leak", + "branch", + "file-io" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak-possible", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 13 + ], + "evidence_contains": [], + "notes": "File handle closed in if branch but leaked in else branch" + } + ] +} diff --git a/tests/fixtures/real_world/python/state/branch_leak.py b/tests/fixtures/real_world/python/state/branch_leak.py new file mode 100644 index 00000000..e2a363fd --- /dev/null +++ b/tests/fixtures/real_world/python/state/branch_leak.py @@ -0,0 +1,11 @@ +import os + +def conditional_open(path, flag): + f = open(path, 'r') + if flag: + data = f.read() + f.close() + return data + else: + 
return "skipped" + # f leaked in else branch diff --git a/tests/fixtures/real_world/python/state/file_lifecycle.expect.json b/tests/fixtures/real_world/python/state/file_lifecycle.expect.json new file mode 100644 index 00000000..991e705e --- /dev/null +++ b/tests/fixtures/real_world/python/state/file_lifecycle.expect.json @@ -0,0 +1,48 @@ +{ + "description": "File handle lifecycle patterns: leak, proper close, double close, use after close", + "tags": [ + "state", + "resource-leak", + "double-close", + "use-after-close", + "file-io" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 6 + ], + "evidence_contains": [], + "notes": "read_and_leak opens file but never closes it" + }, + { + "rule_id": "state-double-close", + "severity": null, + "must_match": false, + "line_range": [ + 10, + 17 + ], + "evidence_contains": [], + "notes": "double_close calls f.close() twice" + }, + { + "rule_id": "state-use-after-close", + "severity": null, + "must_match": false, + "line_range": [ + 15, + 23 + ], + "evidence_contains": [], + "notes": "use_after_close reads from file handle after closing it" + } + ] +} diff --git a/tests/fixtures/real_world/python/state/file_lifecycle.py b/tests/fixtures/real_world/python/state/file_lifecycle.py new file mode 100644 index 00000000..fd07fb97 --- /dev/null +++ b/tests/fixtures/real_world/python/state/file_lifecycle.py @@ -0,0 +1,21 @@ +def read_and_leak(path): + f = open(path, 'r') + data = f.read() + return data + +def read_and_close(path): + f = open(path, 'r') + data = f.read() + f.close() + return data + +def double_close(path): + f = open(path, 'r') + f.close() + f.close() + +def use_after_close(path): + f = open(path, 'r') + f.close() + data = f.read() + return data diff --git a/tests/fixtures/real_world/python/state/socket_lifecycle.expect.json b/tests/fixtures/real_world/python/state/socket_lifecycle.expect.json new file mode 100644 
index 00000000..e217e74e --- /dev/null +++ b/tests/fixtures/real_world/python/state/socket_lifecycle.expect.json @@ -0,0 +1,24 @@ +{ + "description": "Socket resource lifecycle - leaked vs properly closed with try/finally", + "tags": [ + "state", + "resource-leak", + "socket" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 10 + ], + "evidence_contains": [], + "notes": "connect_and_leak creates socket but never closes it" + } + ] +} diff --git a/tests/fixtures/real_world/python/state/socket_lifecycle.py b/tests/fixtures/real_world/python/state/socket_lifecycle.py new file mode 100644 index 00000000..2a02a0f9 --- /dev/null +++ b/tests/fixtures/real_world/python/state/socket_lifecycle.py @@ -0,0 +1,18 @@ +import socket + +def connect_and_leak(host, port): + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + s.connect((host, port)) + s.send(b'hello') + data = s.recv(1024) + return data + +def connect_and_close(host, port): + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + s.connect((host, port)) + try: + s.send(b'hello') + data = s.recv(1024) + return data + finally: + s.close() diff --git a/tests/fixtures/real_world/python/state/with_statement.expect.json b/tests/fixtures/real_world/python/state/with_statement.expect.json new file mode 100644 index 00000000..71fdff61 --- /dev/null +++ b/tests/fixtures/real_world/python/state/with_statement.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Context manager vs manual open - else branch leaks file handle", + "tags": [ + "state", + "resource-leak", + "context-manager", + "file-io" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 12, + 19 + ], + "evidence_contains": [], + "notes": "else branch opens file manually and never closes it" + } + ] +} diff --git 
a/tests/fixtures/real_world/python/state/with_statement.py b/tests/fixtures/real_world/python/state/with_statement.py new file mode 100644 index 00000000..93cb2f8e --- /dev/null +++ b/tests/fixtures/real_world/python/state/with_statement.py @@ -0,0 +1,17 @@ +def safe_with(path): + with open(path, 'r') as f: + return f.read() + +def nested_with(src, dst): + with open(src, 'r') as reader: + with open(dst, 'w') as writer: + writer.write(reader.read()) + +def conditional_with(path, mode): + if mode == 'read': + with open(path, 'r') as f: + return f.read() + else: + f = open(path, 'w') + f.write('default') + # f not closed in else branch diff --git a/tests/fixtures/real_world/python/taint/cmdi_subprocess.expect.json b/tests/fixtures/real_world/python/taint/cmdi_subprocess.expect.json new file mode 100644 index 00000000..a46590e0 --- /dev/null +++ b/tests/fixtures/real_world/python/taint/cmdi_subprocess.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Flask handler passes user input directly to subprocess.run with shell=True", + "tags": [ + "taint", + "cmdi", + "flask", + "subprocess" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 12 + ], + "evidence_contains": [], + "notes": "request.args.get('cmd') flows directly into subprocess.run with shell=True" + } + ] +} diff --git a/tests/fixtures/real_world/python/taint/cmdi_subprocess.py b/tests/fixtures/real_world/python/taint/cmdi_subprocess.py new file mode 100644 index 00000000..a72b2566 --- /dev/null +++ b/tests/fixtures/real_world/python/taint/cmdi_subprocess.py @@ -0,0 +1,19 @@ +from flask import Flask, request +import subprocess + +app = Flask(__name__) + +@app.route('/run') +def run_cmd(): + cmd = request.args.get('cmd') + result = subprocess.run(cmd, shell=True, capture_output=True) + return result.stdout.decode() + +@app.route('/run-safe') +def run_cmd_safe(): + cmd = request.args.get('cmd') + allowed = 
['ls', 'date', 'whoami'] + if cmd not in allowed: + return 'Not allowed', 403 + result = subprocess.run([cmd], capture_output=True) + return result.stdout.decode() diff --git a/tests/fixtures/real_world/python/taint/eval_input.expect.json b/tests/fixtures/real_world/python/taint/eval_input.expect.json new file mode 100644 index 00000000..e7452d5e --- /dev/null +++ b/tests/fixtures/real_world/python/taint/eval_input.expect.json @@ -0,0 +1,36 @@ +{ + "description": "eval() called with user-controlled input from Flask request", + "tags": [ + "taint", + "code-exec", + "eval", + "flask" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "py.code_exec.eval", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": "eval() is an AST-level dangerous function pattern" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 11 + ], + "evidence_contains": [], + "notes": "request.args.get('expr') flows directly into eval()" + } + ] +} diff --git a/tests/fixtures/real_world/python/taint/eval_input.py b/tests/fixtures/real_world/python/taint/eval_input.py new file mode 100644 index 00000000..406b821a --- /dev/null +++ b/tests/fixtures/real_world/python/taint/eval_input.py @@ -0,0 +1,9 @@ +from flask import Flask, request + +app = Flask(__name__) + +@app.route('/calc') +def calculate(): + expr = request.args.get('expr') + result = eval(expr) + return str(result) diff --git a/tests/fixtures/real_world/python/taint/path_traversal.expect.json b/tests/fixtures/real_world/python/taint/path_traversal.expect.json new file mode 100644 index 00000000..a9970c65 --- /dev/null +++ b/tests/fixtures/real_world/python/taint/path_traversal.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Path traversal via user-controlled filename passed to send_file", + "tags": [ + "taint", + "path-traversal", + "flask", + "file-io" + ], + "modes": [ + "full" + ], + "expected": [ + { 
+ "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 12 + ], + "evidence_contains": [], + "notes": "request.args.get('file') flows into os.path.join then send_file without validation" + } + ] +} diff --git a/tests/fixtures/real_world/python/taint/path_traversal.py b/tests/fixtures/real_world/python/taint/path_traversal.py new file mode 100644 index 00000000..6ce2185f --- /dev/null +++ b/tests/fixtures/real_world/python/taint/path_traversal.py @@ -0,0 +1,19 @@ +from flask import Flask, request, send_file +import os + +app = Flask(__name__) + +@app.route('/download') +def download(): + filename = request.args.get('file') + filepath = os.path.join('/uploads', filename) + return send_file(filepath) + +@app.route('/download-safe') +def download_safe(): + filename = request.args.get('file') + filepath = os.path.join('/uploads', filename) + realpath = os.path.realpath(filepath) + if not realpath.startswith('/uploads'): + return 'Forbidden', 403 + return send_file(realpath) diff --git a/tests/fixtures/real_world/python/taint/pickle_deser.expect.json b/tests/fixtures/real_world/python/taint/pickle_deser.expect.json new file mode 100644 index 00000000..5593f971 --- /dev/null +++ b/tests/fixtures/real_world/python/taint/pickle_deser.expect.json @@ -0,0 +1,37 @@ +{ + "description": "Pickle deserialization of user-supplied base64 data", + "tags": [ + "taint", + "deser", + "pickle", + "flask" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "py.deser.pickle_loads", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": "pickle.loads on user-controlled data enables arbitrary code execution" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 7, + 14 + ], + "evidence_contains": [], + "notes": "User data flows through base64 decode into pickle.loads - aspirational taint finding" + } + ] +} 
diff --git a/tests/fixtures/real_world/python/taint/pickle_deser.py b/tests/fixtures/real_world/python/taint/pickle_deser.py new file mode 100644 index 00000000..4dff412b --- /dev/null +++ b/tests/fixtures/real_world/python/taint/pickle_deser.py @@ -0,0 +1,12 @@ +from flask import Flask, request +import pickle +import base64 + +app = Flask(__name__) + +@app.route('/load', methods=['POST']) +def load_object(): + data = request.get_data() + decoded = base64.b64decode(data) + obj = pickle.loads(decoded) + return str(obj) diff --git a/tests/fixtures/real_world/python/taint/sqli_concat.expect.json b/tests/fixtures/real_world/python/taint/sqli_concat.expect.json new file mode 100644 index 00000000..385b9ba3 --- /dev/null +++ b/tests/fixtures/real_world/python/taint/sqli_concat.expect.json @@ -0,0 +1,36 @@ +{ + "description": "SQL injection via string concatenation with user input in cursor.execute", + "tags": [ + "taint", + "sqli", + "flask", + "sqlite" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 14 + ], + "evidence_contains": [], + "notes": "request.args.get('id') concatenated directly into SQL query string" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 14, + 22 + ], + "evidence_contains": [], + "notes": "Safe version uses parameterized query - should not trigger" + } + ] +} diff --git a/tests/fixtures/real_world/python/taint/sqli_concat.py b/tests/fixtures/real_world/python/taint/sqli_concat.py new file mode 100644 index 00000000..584194ec --- /dev/null +++ b/tests/fixtures/real_world/python/taint/sqli_concat.py @@ -0,0 +1,20 @@ +from flask import Flask, request +import sqlite3 + +app = Flask(__name__) + +@app.route('/user') +def get_user(): + user_id = request.args.get('id') + conn = sqlite3.connect('app.db') + cursor = conn.cursor() + cursor.execute("SELECT * FROM users WHERE id = " + 
user_id) + return str(cursor.fetchall()) + +@app.route('/user-safe') +def get_user_safe(): + user_id = request.args.get('id') + conn = sqlite3.connect('app.db') + cursor = conn.cursor() + cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,)) + return str(cursor.fetchall()) diff --git a/tests/fixtures/real_world/python/taint/yaml_deser.expect.json b/tests/fixtures/real_world/python/taint/yaml_deser.expect.json new file mode 100644 index 00000000..5b7085a8 --- /dev/null +++ b/tests/fixtures/real_world/python/taint/yaml_deser.expect.json @@ -0,0 +1,37 @@ +{ + "description": "Unsafe YAML deserialization with yaml.load vs safe yaml.safe_load", + "tags": [ + "taint", + "deser", + "yaml", + "flask" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "py.deser.yaml_load", + "severity": null, + "must_match": true, + "line_range": [ + 7, + 11 + ], + "evidence_contains": [], + "notes": "yaml.load with FullLoader is unsafe with user-controlled data" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 6, + 12 + ], + "evidence_contains": [], + "notes": "User data flows into yaml.load - aspirational taint finding" + } + ] +} diff --git a/tests/fixtures/real_world/python/taint/yaml_deser.py b/tests/fixtures/real_world/python/taint/yaml_deser.py new file mode 100644 index 00000000..57e1ef52 --- /dev/null +++ b/tests/fixtures/real_world/python/taint/yaml_deser.py @@ -0,0 +1,16 @@ +from flask import Flask, request +import yaml + +app = Flask(__name__) + +@app.route('/parse', methods=['POST']) +def parse_config(): + data = request.get_data() + config = yaml.load(data, Loader=yaml.FullLoader) + return str(config) + +@app.route('/parse-safe', methods=['POST']) +def parse_config_safe(): + data = request.get_data() + config = yaml.safe_load(data) + return str(config) diff --git a/tests/fixtures/real_world/ruby/cfg/begin_rescue_ensure.expect.json 
b/tests/fixtures/real_world/ruby/cfg/begin_rescue_ensure.expect.json new file mode 100644 index 00000000..37402cb3 --- /dev/null +++ b/tests/fixtures/real_world/ruby/cfg/begin_rescue_ensure.expect.json @@ -0,0 +1,26 @@ +{ + "description": "File handle management with begin/rescue/ensure vs early return leak", + "tags": [ + "cfg", + "resource-leak", + "begin-rescue-ensure", + "file-io", + "ruby" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 11, + 23 + ], + "evidence_contains": [], + "notes": "leaky_process returns nil on empty data without closing file handle" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/cfg/begin_rescue_ensure.rb b/tests/fixtures/real_world/ruby/cfg/begin_rescue_ensure.rb new file mode 100644 index 00000000..84fbdf4d --- /dev/null +++ b/tests/fixtures/real_world/ruby/cfg/begin_rescue_ensure.rb @@ -0,0 +1,21 @@ +def process_file(path) + f = File.open(path, 'r') + begin + data = f.read + return data + rescue IOError => e + puts e.message + ensure + f.close + end +end + +def leaky_process(path) + f = File.open(path, 'r') + data = f.read + if data.empty? 
+ return nil # f leaked + end + f.close + data +end diff --git a/tests/fixtures/real_world/ruby/cfg/block_form.expect.json b/tests/fixtures/real_world/ruby/cfg/block_form.expect.json new file mode 100644 index 00000000..7da485ad --- /dev/null +++ b/tests/fixtures/real_world/ruby/cfg/block_form.expect.json @@ -0,0 +1,26 @@ +{ + "description": "Block form auto-closes file handle vs manual open without close", + "tags": [ + "cfg", + "resource-leak", + "block", + "file-io", + "ruby" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 5, + 13 + ], + "evidence_contains": [], + "notes": "unsafe_read opens file without block form and never closes it" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/cfg/block_form.rb b/tests/fixtures/real_world/ruby/cfg/block_form.rb new file mode 100644 index 00000000..c2134bc9 --- /dev/null +++ b/tests/fixtures/real_world/ruby/cfg/block_form.rb @@ -0,0 +1,11 @@ +def safe_read(path) + File.open(path, 'r') do |f| + f.read + end +end + +def unsafe_read(path) + f = File.open(path, 'r') + data = f.read + data +end diff --git a/tests/fixtures/real_world/ruby/cfg/case_when.expect.json b/tests/fixtures/real_world/ruby/cfg/case_when.expect.json new file mode 100644 index 00000000..05b02f8a --- /dev/null +++ b/tests/fixtures/real_world/ruby/cfg/case_when.expect.json @@ -0,0 +1,38 @@ +{ + "description": "Case/when dispatching to dangerous functions eval and system", + "tags": [ + "cfg", + "code-exec", + "cmdi", + "case-when", + "ruby" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "rb.code_exec.eval", + "severity": null, + "must_match": true, + "line_range": [ + 2, + 6 + ], + "evidence_contains": [], + "notes": "eval() called in case/when branch with function parameter" + }, + { + "rule_id": "rb.cmdi.system_interp", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + 
"notes": "system() called in case/when branch with function parameter" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/cfg/case_when.rb b/tests/fixtures/real_world/ruby/cfg/case_when.rb new file mode 100644 index 00000000..1e692be8 --- /dev/null +++ b/tests/fixtures/real_world/ruby/cfg/case_when.rb @@ -0,0 +1,10 @@ +def handle_action(action, input) + case action + when 'eval' + eval(input) + when 'exec' + system(input) + when 'log' + puts input + end +end diff --git a/tests/fixtures/real_world/ruby/cfg/unless_elsif.expect.json b/tests/fixtures/real_world/ruby/cfg/unless_elsif.expect.json new file mode 100644 index 00000000..29086a6a --- /dev/null +++ b/tests/fixtures/real_world/ruby/cfg/unless_elsif.expect.json @@ -0,0 +1,49 @@ +{ + "description": "Authorization check with unless guard vs unguarded system call", + "tags": [ + "cfg", + "cmdi", + "auth-gap", + "unless", + "ruby" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "rb.cmdi.system_interp", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": "system() in check_and_exec - guarded by admin check" + }, + { + "rule_id": "rb.cmdi.system_interp", + "severity": null, + "must_match": true, + "line_range": [ + 7, + 11 + ], + "evidence_contains": [], + "notes": "system() in unchecked_exec - no authorization check" + }, + { + "rule_id": "cfg-auth-gap", + "severity": null, + "must_match": false, + "line_range": [ + 6, + 12 + ], + "evidence_contains": [], + "notes": "unchecked_exec calls system without any authorization guard" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/cfg/unless_elsif.rb b/tests/fixtures/real_world/ruby/cfg/unless_elsif.rb new file mode 100644 index 00000000..74b0d6e4 --- /dev/null +++ b/tests/fixtures/real_world/ruby/cfg/unless_elsif.rb @@ -0,0 +1,10 @@ +def check_and_exec(cmd, user) + unless user.admin? 
+ return "Not authorized" + end + system(cmd) +end + +def unchecked_exec(cmd) + system(cmd) +end diff --git a/tests/fixtures/real_world/ruby/mixed/file_taint.expect.json b/tests/fixtures/real_world/ruby/mixed/file_taint.expect.json new file mode 100644 index 00000000..c68abefe --- /dev/null +++ b/tests/fixtures/real_world/ruby/mixed/file_taint.expect.json @@ -0,0 +1,39 @@ +{ + "description": "User-controlled path passed to File.open with leaked file handle", + "tags": [ + "taint", + "state", + "path-traversal", + "resource-leak", + "sinatra", + "ruby", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 8 + ], + "evidence_contains": [], + "notes": "params[:path] flows into File.open - path traversal" + }, + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 11 + ], + "evidence_contains": [], + "notes": "File handle opened but never closed" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/mixed/file_taint.rb b/tests/fixtures/real_world/ruby/mixed/file_taint.rb new file mode 100644 index 00000000..381b7d12 --- /dev/null +++ b/tests/fixtures/real_world/ruby/mixed/file_taint.rb @@ -0,0 +1,10 @@ +require 'sinatra' + +get '/read' do + path = params[:path] + f = File.open(path, 'r') + data = f.read + # taint: path is user input + # state: f leaked + data +end diff --git a/tests/fixtures/real_world/ruby/mixed/rails_controller.expect.json b/tests/fixtures/real_world/ruby/mixed/rails_controller.expect.json new file mode 100644 index 00000000..40b2f9e2 --- /dev/null +++ b/tests/fixtures/real_world/ruby/mixed/rails_controller.expect.json @@ -0,0 +1,51 @@ +{ + "description": "Rails controller with SQL injection, command injection, and missing auth check", + "tags": [ + "taint", + "cfg", + "cmdi", + "sqli", + "auth-gap", + "rails", + "ruby", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + 
"rule_id": "rb.cmdi.system_interp", + "severity": null, + "must_match": true, + "line_range": [ + 8, + 12 + ], + "evidence_contains": [], + "notes": "system() called with user-controlled params[:cmd]" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 7, + 12 + ], + "evidence_contains": [], + "notes": "params[:cmd] flows directly into system()" + }, + { + "rule_id": "cfg-auth-gap", + "severity": null, + "must_match": false, + "line_range": [ + 6, + 13 + ], + "evidence_contains": [], + "notes": "exec_cmd has no authorization check before system() call" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/mixed/rails_controller.rb b/tests/fixtures/real_world/ruby/mixed/rails_controller.rb new file mode 100644 index 00000000..d9d4154e --- /dev/null +++ b/tests/fixtures/real_world/ruby/mixed/rails_controller.rb @@ -0,0 +1,17 @@ +class UsersController + def show + user_id = params[:id] + query = "SELECT * FROM users WHERE id = #{user_id}" + @user = execute_query(query) + end + + def exec_cmd + cmd = params[:cmd] + system(cmd) + end + + def safe_show + user_id = params[:id] + @user = User.find(user_id) + end +end diff --git a/tests/fixtures/real_world/ruby/state/block_vs_manual.expect.json b/tests/fixtures/real_world/ruby/state/block_vs_manual.expect.json new file mode 100644 index 00000000..26d13516 --- /dev/null +++ b/tests/fixtures/real_world/ruby/state/block_vs_manual.expect.json @@ -0,0 +1,26 @@ +{ + "description": "Block form auto-close vs manual close vs forgotten close", + "tags": [ + "state", + "resource-leak", + "block", + "file-io", + "ruby" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 10, + 17 + ], + "evidence_contains": [], + "notes": "forgot_close opens file without block form and never closes it" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/state/block_vs_manual.rb 
b/tests/fixtures/real_world/ruby/state/block_vs_manual.rb new file mode 100644 index 00000000..023286ad --- /dev/null +++ b/tests/fixtures/real_world/ruby/state/block_vs_manual.rb @@ -0,0 +1,15 @@ +def block_auto_close(path) + File.open(path, 'r') { |f| f.read } +end + +def manual_close(path) + f = File.open(path, 'r') + data = f.read + f.close + data +end + +def forgot_close(path) + f = File.open(path, 'r') + f.read +end diff --git a/tests/fixtures/real_world/ruby/state/conditional_close.expect.json b/tests/fixtures/real_world/ruby/state/conditional_close.expect.json new file mode 100644 index 00000000..bcfa194b --- /dev/null +++ b/tests/fixtures/real_world/ruby/state/conditional_close.expect.json @@ -0,0 +1,26 @@ +{ + "description": "File handle closed in one branch but leaked in the other", + "tags": [ + "state", + "resource-leak", + "branch", + "file-io", + "ruby" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak-possible", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 12 + ], + "evidence_contains": [], + "notes": "File handle closed in if branch but leaked in else branch" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/state/conditional_close.rb b/tests/fixtures/real_world/ruby/state/conditional_close.rb new file mode 100644 index 00000000..fb1fc240 --- /dev/null +++ b/tests/fixtures/real_world/ruby/state/conditional_close.rb @@ -0,0 +1,10 @@ +def maybe_close(path, flag) + f = File.open(path, 'r') + if flag + data = f.read + f.close + data + else + "skipped" + end +end diff --git a/tests/fixtures/real_world/ruby/state/file_lifecycle.expect.json b/tests/fixtures/real_world/ruby/state/file_lifecycle.expect.json new file mode 100644 index 00000000..ac7f92c2 --- /dev/null +++ b/tests/fixtures/real_world/ruby/state/file_lifecycle.expect.json @@ -0,0 +1,49 @@ +{ + "description": "Ruby file handle lifecycle: leak, proper close, double close, use after close", + "tags": [ + "state", + "resource-leak", + 
"double-close", + "use-after-close", + "file-io", + "ruby" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 7 + ], + "evidence_contains": [], + "notes": "read_and_leak opens file but never closes it" + }, + { + "rule_id": "state-double-close", + "severity": null, + "must_match": false, + "line_range": [ + 12, + 20 + ], + "evidence_contains": [], + "notes": "double_close calls f.close twice" + }, + { + "rule_id": "state-use-after-close", + "severity": null, + "must_match": false, + "line_range": [ + 18, + 27 + ], + "evidence_contains": [], + "notes": "use_after_close reads from file handle after closing it" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/state/file_lifecycle.rb b/tests/fixtures/real_world/ruby/state/file_lifecycle.rb new file mode 100644 index 00000000..f94651fa --- /dev/null +++ b/tests/fixtures/real_world/ruby/state/file_lifecycle.rb @@ -0,0 +1,25 @@ +def read_and_leak(path) + f = File.open(path, 'r') + data = f.read + data +end + +def read_and_close(path) + f = File.open(path, 'r') + data = f.read + f.close + data +end + +def double_close(path) + f = File.open(path, 'r') + f.close + f.close +end + +def use_after_close(path) + f = File.open(path, 'r') + f.close + data = f.read + data +end diff --git a/tests/fixtures/real_world/ruby/state/socket_lifecycle.expect.json b/tests/fixtures/real_world/ruby/state/socket_lifecycle.expect.json new file mode 100644 index 00000000..b7ff93c2 --- /dev/null +++ b/tests/fixtures/real_world/ruby/state/socket_lifecycle.expect.json @@ -0,0 +1,25 @@ +{ + "description": "TCP socket resource lifecycle - leaked vs properly closed with ensure", + "tags": [ + "state", + "resource-leak", + "socket", + "ruby" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 1, + 10 + ], + "evidence_contains": [], + "notes": 
"connect_leak creates TCP socket but never closes it" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/state/socket_lifecycle.rb b/tests/fixtures/real_world/ruby/state/socket_lifecycle.rb new file mode 100644 index 00000000..0821fa09 --- /dev/null +++ b/tests/fixtures/real_world/ruby/state/socket_lifecycle.rb @@ -0,0 +1,18 @@ +require 'socket' + +def connect_leak(host, port) + s = TCPSocket.new(host, port) + s.puts('hello') + data = s.gets + data +end + +def connect_safe(host, port) + s = TCPSocket.new(host, port) + begin + s.puts('hello') + s.gets + ensure + s.close + end +end diff --git a/tests/fixtures/real_world/ruby/taint/cmdi_backticks.expect.json b/tests/fixtures/real_world/ruby/taint/cmdi_backticks.expect.json new file mode 100644 index 00000000..bf502c55 --- /dev/null +++ b/tests/fixtures/real_world/ruby/taint/cmdi_backticks.expect.json @@ -0,0 +1,38 @@ +{ + "description": "Command injection via backtick interpolation with user input", + "tags": [ + "taint", + "cmdi", + "backticks", + "sinatra", + "ruby" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "rb.cmdi.backtick", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": "Backtick command execution with interpolated user input" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 8 + ], + "evidence_contains": [], + "notes": "params[:command] flows into backtick execution - aspirational taint tracking through interpolation" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/taint/cmdi_backticks.rb b/tests/fixtures/real_world/ruby/taint/cmdi_backticks.rb new file mode 100644 index 00000000..47200a95 --- /dev/null +++ b/tests/fixtures/real_world/ruby/taint/cmdi_backticks.rb @@ -0,0 +1,7 @@ +require 'sinatra' + +get '/run' do + cmd = params[:command] + result = `#{cmd}` + result +end diff --git 
a/tests/fixtures/real_world/ruby/taint/cmdi_system.expect.json b/tests/fixtures/real_world/ruby/taint/cmdi_system.expect.json new file mode 100644 index 00000000..ca88935a --- /dev/null +++ b/tests/fixtures/real_world/ruby/taint/cmdi_system.expect.json @@ -0,0 +1,37 @@ +{ + "description": "Sinatra handler passes user input directly to system() for command execution", + "tags": [ + "taint", + "cmdi", + "sinatra", + "system", + "ruby" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "rb.cmdi.system_interp", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": "system() called with user-controlled params[:cmd]" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 2, + 8 + ], + "evidence_contains": [], + "notes": "params[:cmd] flows directly into system() without sanitization" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/taint/cmdi_system.rb b/tests/fixtures/real_world/ruby/taint/cmdi_system.rb new file mode 100644 index 00000000..6cf86ec1 --- /dev/null +++ b/tests/fixtures/real_world/ruby/taint/cmdi_system.rb @@ -0,0 +1,15 @@ +require 'sinatra' + +get '/exec' do + cmd = params[:cmd] + output = system(cmd) + output.to_s +end + +get '/exec-safe' do + cmd = params[:cmd] + allowed = %w[ls date whoami] + halt 403, 'Not allowed' unless allowed.include?(cmd) + output = system(cmd) + output.to_s +end diff --git a/tests/fixtures/real_world/ruby/taint/constantize.expect.json b/tests/fixtures/real_world/ruby/taint/constantize.expect.json new file mode 100644 index 00000000..3d0cb9e7 --- /dev/null +++ b/tests/fixtures/real_world/ruby/taint/constantize.expect.json @@ -0,0 +1,38 @@ +{ + "description": "Unsafe constantize on user-controlled class name enabling arbitrary class instantiation", + "tags": [ + "taint", + "reflect", + "constantize", + "sinatra", + "ruby" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": 
"rb.reflection.constantize", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": "constantize on user input allows instantiation of arbitrary classes" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 8 + ], + "evidence_contains": [], + "notes": "params[:type] flows into constantize - aspirational taint finding" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/taint/constantize.rb b/tests/fixtures/real_world/ruby/taint/constantize.rb new file mode 100644 index 00000000..9a93fcc9 --- /dev/null +++ b/tests/fixtures/real_world/ruby/taint/constantize.rb @@ -0,0 +1,8 @@ +require 'sinatra' + +get '/create' do + class_name = params[:type] + klass = class_name.constantize + instance = klass.new + instance.to_s +end diff --git a/tests/fixtures/real_world/ruby/taint/eval_input.expect.json b/tests/fixtures/real_world/ruby/taint/eval_input.expect.json new file mode 100644 index 00000000..0eb65722 --- /dev/null +++ b/tests/fixtures/real_world/ruby/taint/eval_input.expect.json @@ -0,0 +1,37 @@ +{ + "description": "eval() called with user-controlled input from Sinatra params", + "tags": [ + "taint", + "code-exec", + "eval", + "sinatra", + "ruby" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "rb.code_exec.eval", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": "eval() is a dangerous function - AST pattern match" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 2, + 8 + ], + "evidence_contains": [], + "notes": "params[:code] flows directly into eval()" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/taint/eval_input.rb b/tests/fixtures/real_world/ruby/taint/eval_input.rb new file mode 100644 index 00000000..1d484309 --- /dev/null +++ b/tests/fixtures/real_world/ruby/taint/eval_input.rb @@ -0,0 +1,7 @@ 
+require 'sinatra' + +post '/eval' do + code = params[:code] + result = eval(code) + result.to_s +end diff --git a/tests/fixtures/real_world/ruby/taint/marshal_deser.expect.json b/tests/fixtures/real_world/ruby/taint/marshal_deser.expect.json new file mode 100644 index 00000000..5afff3d1 --- /dev/null +++ b/tests/fixtures/real_world/ruby/taint/marshal_deser.expect.json @@ -0,0 +1,38 @@ +{ + "description": "Marshal.load deserialization of user-supplied base64 data", + "tags": [ + "taint", + "deser", + "marshal", + "sinatra", + "ruby" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": "rb.deser.marshal_load", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": "Marshal.load on user-controlled data enables arbitrary code execution" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 10 + ], + "evidence_contains": [], + "notes": "User data flows through Base64 decode into Marshal.load - aspirational taint finding" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/taint/marshal_deser.rb b/tests/fixtures/real_world/ruby/taint/marshal_deser.rb new file mode 100644 index 00000000..28c93f40 --- /dev/null +++ b/tests/fixtures/real_world/ruby/taint/marshal_deser.rb @@ -0,0 +1,9 @@ +require 'sinatra' +require 'base64' + +post '/load' do + data = params[:data] + decoded = Base64.decode64(data) + obj = Marshal.load(decoded) + obj.to_s +end diff --git a/tests/fixtures/real_world/ruby/taint/yaml_deser.expect.json b/tests/fixtures/real_world/ruby/taint/yaml_deser.expect.json new file mode 100644 index 00000000..ad34de42 --- /dev/null +++ b/tests/fixtures/real_world/ruby/taint/yaml_deser.expect.json @@ -0,0 +1,38 @@ +{ + "description": "Unsafe YAML.load vs safe YAML.safe_load with user input", + "tags": [ + "taint", + "deser", + "yaml", + "sinatra", + "ruby" + ], + "modes": [ + "full", + "ast" + ], + "expected": [ + { + "rule_id": 
"rb.deser.yaml_load", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": "YAML.load on user-controlled data enables arbitrary object instantiation" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 9 + ], + "evidence_contains": [], + "notes": "Request body flows into YAML.load - aspirational taint finding" + } + ] +} diff --git a/tests/fixtures/real_world/ruby/taint/yaml_deser.rb b/tests/fixtures/real_world/ruby/taint/yaml_deser.rb new file mode 100644 index 00000000..79683861 --- /dev/null +++ b/tests/fixtures/real_world/ruby/taint/yaml_deser.rb @@ -0,0 +1,14 @@ +require 'sinatra' +require 'yaml' + +post '/parse' do + data = request.body.read + config = YAML.load(data) + config.to_s +end + +post '/parse-safe' do + data = request.body.read + config = YAML.safe_load(data) + config.to_s +end diff --git a/tests/fixtures/real_world/rust/cfg/closure_async.expect.json b/tests/fixtures/real_world/rust/cfg/closure_async.expect.json new file mode 100644 index 00000000..848580be --- /dev/null +++ b/tests/fixtures/real_world/rust/cfg/closure_async.expect.json @@ -0,0 +1,35 @@ +{ + "description": "Command execution inside closure passed as higher-order function argument", + "tags": [ + "cmdi", + "cfg", + "closure" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": ".unwrap() on env::var Result in apply_command" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": ".unwrap() on Command output Result in closure" + } + ] +} diff --git a/tests/fixtures/real_world/rust/cfg/closure_async.rs b/tests/fixtures/real_world/rust/cfg/closure_async.rs new file mode 100644 index 00000000..c90421cb --- /dev/null +++ 
b/tests/fixtures/real_world/rust/cfg/closure_async.rs @@ -0,0 +1,13 @@ +use std::env; +use std::process::Command; + +fn apply_command(f: F) { + let cmd = env::var("CMD").unwrap(); + f(&cmd); +} + +fn main() { + apply_command(|cmd| { + Command::new("sh").arg("-c").arg(cmd).output().unwrap(); + }); +} diff --git a/tests/fixtures/real_world/rust/cfg/error_handling.expect.json b/tests/fixtures/real_world/rust/cfg/error_handling.expect.json new file mode 100644 index 00000000..fe2c000a --- /dev/null +++ b/tests/fixtures/real_world/rust/cfg/error_handling.expect.json @@ -0,0 +1,34 @@ +{ + "description": "Error handling patterns: Result with ? (safe), .unwrap() (panicky), .expect() (panicky with message)", + "tags": [ + "error-handling", + "cfg" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 8, + 12 + ], + "evidence_contains": [], + "notes": ".unwrap() in read_config_panicky \u2014 will panic on IO error" + }, + { + "rule_id": "rs.quality.expect", + "severity": null, + "must_match": true, + "line_range": [ + 12, + 16 + ], + "evidence_contains": [], + "notes": ".expect() in read_config_expect \u2014 will panic with message on IO error" + } + ] +} diff --git a/tests/fixtures/real_world/rust/cfg/error_handling.rs b/tests/fixtures/real_world/rust/cfg/error_handling.rs new file mode 100644 index 00000000..0a5a6ab7 --- /dev/null +++ b/tests/fixtures/real_world/rust/cfg/error_handling.rs @@ -0,0 +1,15 @@ +use std::fs; +use std::io; + +fn read_config(path: &str) -> Result<String, io::Error> { + let content = fs::read_to_string(path)?; + Ok(content) +} + +fn read_config_panicky(path: &str) -> String { + fs::read_to_string(path).unwrap() +} + +fn read_config_expect(path: &str) -> String { + fs::read_to_string(path).expect("config must exist") +} diff --git a/tests/fixtures/real_world/rust/cfg/if_let_while_let.expect.json b/tests/fixtures/real_world/rust/cfg/if_let_while_let.expect.json new file mode 
100644 index 00000000..c52c9932 --- /dev/null +++ b/tests/fixtures/real_world/rust/cfg/if_let_while_let.expect.json @@ -0,0 +1,35 @@ +{ + "description": "if-let binding of env::var flows to Command execution", + "tags": [ + "taint", + "cmdi", + "cfg" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 8 + ], + "evidence_contains": [], + "notes": "env::var(\"CMD\") bound via if-let flows to Command .arg(&cmd)" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": ".unwrap() on Command output Result" + } + ] +} diff --git a/tests/fixtures/real_world/rust/cfg/if_let_while_let.rs b/tests/fixtures/real_world/rust/cfg/if_let_while_let.rs new file mode 100644 index 00000000..7fc4b63a --- /dev/null +++ b/tests/fixtures/real_world/rust/cfg/if_let_while_let.rs @@ -0,0 +1,13 @@ +use std::env; +use std::process::Command; + +fn process_env_commands() { + if let Ok(cmd) = env::var("CMD") { + Command::new("sh").arg("-c").arg(&cmd).output().unwrap(); + } + + let mut items: Vec = vec![]; + while let Some(item) = items.pop() { + println!("{}", item); + } +} diff --git a/tests/fixtures/real_world/rust/cfg/match_arms.expect.json b/tests/fixtures/real_world/rust/cfg/match_arms.expect.json new file mode 100644 index 00000000..ece7d987 --- /dev/null +++ b/tests/fixtures/real_world/rust/cfg/match_arms.expect.json @@ -0,0 +1,23 @@ +{ + "description": "Command execution inside match arm \u2014 pattern matching dispatches to shell execution", + "tags": [ + "cmdi", + "cfg" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 11, + 15 + ], + "evidence_contains": [], + "notes": ".unwrap() on Command output Result" + } + ] +} diff --git a/tests/fixtures/real_world/rust/cfg/match_arms.rs 
b/tests/fixtures/real_world/rust/cfg/match_arms.rs new file mode 100644 index 00000000..bc4ffa72 --- /dev/null +++ b/tests/fixtures/real_world/rust/cfg/match_arms.rs @@ -0,0 +1,18 @@ +use std::env; +use std::process::Command; + +enum Action { + Run(String), + Log(String), + Quit, +} + +fn handle(action: Action) { + match action { + Action::Run(cmd) => { + Command::new("sh").arg("-c").arg(&cmd).output().unwrap(); + } + Action::Log(msg) => println!("{}", msg), + Action::Quit => std::process::exit(0), + } +} diff --git a/tests/fixtures/real_world/rust/mixed/unsafe_transmute_cmd.expect.json b/tests/fixtures/real_world/rust/mixed/unsafe_transmute_cmd.expect.json new file mode 100644 index 00000000..6b4e4021 --- /dev/null +++ b/tests/fixtures/real_world/rust/mixed/unsafe_transmute_cmd.expect.json @@ -0,0 +1,69 @@ +{ + "description": "Combined unsafe + taint: env var bytes transmuted to u32, then used in Command via format!", + "tags": [ + "unsafe", + "taint", + "cmdi", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "rs.memory.transmute", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": "mem::transmute to convert bytes to u32" + }, + { + "rule_id": "rs.quality.unsafe_block", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": "unsafe block wrapping mem::transmute" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": ".unwrap() on env::var Result" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": ".unwrap() on Command output Result" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 4, + 13 + ], + "evidence_contains": [], + "notes": "env::var flows through 
transmute and format! to Command \u2014 aspirational, complex data flow through unsafe" + } + ] +} diff --git a/tests/fixtures/real_world/rust/mixed/unsafe_transmute_cmd.rs b/tests/fixtures/real_world/rust/mixed/unsafe_transmute_cmd.rs new file mode 100644 index 00000000..990d5e08 --- /dev/null +++ b/tests/fixtures/real_world/rust/mixed/unsafe_transmute_cmd.rs @@ -0,0 +1,12 @@ +use std::env; +use std::mem; +use std::process::Command; + +fn main() { + let input = env::var("INPUT").unwrap(); + let bytes = input.as_bytes(); + let val: u32 = unsafe { mem::transmute([bytes[0], bytes[1], bytes[2], bytes[3]]) }; + + let cmd = format!("echo {}", val); + Command::new("sh").arg("-c").arg(&cmd).output().unwrap(); +} diff --git a/tests/fixtures/real_world/rust/mixed/web_handler.expect.json b/tests/fixtures/real_world/rust/mixed/web_handler.expect.json new file mode 100644 index 00000000..e7e15f0b --- /dev/null +++ b/tests/fixtures/real_world/rust/mixed/web_handler.expect.json @@ -0,0 +1,80 @@ +{ + "description": "Multiple taint flows: env::var to Command (cmdi) and env::var to read_to_string (path traversal)", + "tags": [ + "taint", + "cmdi", + "path-traversal", + "mixed" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 13 + ], + "evidence_contains": [], + "notes": "env::var(\"USER_CMD\") flows to Command .arg(&cmd) \u2014 command injection" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 11, + 16 + ], + "evidence_contains": [], + "notes": "env::var(\"USER_PATH\") flows to fs::read_to_string \u2014 path traversal" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": ".unwrap() on env::var for USER_CMD" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 9, 
+ 13 + ], + "evidence_contains": [], + "notes": ".unwrap() on Command output" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 11, + 15 + ], + "evidence_contains": [], + "notes": ".unwrap() on env::var for USER_PATH" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 12, + 16 + ], + "evidence_contains": [], + "notes": ".unwrap() on read_to_string" + } + ] +} diff --git a/tests/fixtures/real_world/rust/mixed/web_handler.rs b/tests/fixtures/real_world/rust/mixed/web_handler.rs new file mode 100644 index 00000000..5aa36fd4 --- /dev/null +++ b/tests/fixtures/real_world/rust/mixed/web_handler.rs @@ -0,0 +1,16 @@ +use std::env; +use std::process::Command; +use std::fs; + +fn handle_request() { + let cmd = env::var("USER_CMD").unwrap(); + let output = Command::new("sh") + .arg("-c") + .arg(&cmd) + .output() + .unwrap(); + + let path = env::var("USER_PATH").unwrap(); + let content = fs::read_to_string(&path).unwrap(); + println!("{}", content); +} diff --git a/tests/fixtures/real_world/rust/state/early_return.expect.json b/tests/fixtures/real_world/rust/state/early_return.expect.json new file mode 100644 index 00000000..7f9984a6 --- /dev/null +++ b/tests/fixtures/real_world/rust/state/early_return.expect.json @@ -0,0 +1,6 @@ +{ + "description": "Rust early return with RAII: File dropped automatically on all paths including early return", + "tags": ["state", "raii", "early-return"], + "modes": ["full"], + "expected": [] +} diff --git a/tests/fixtures/real_world/rust/state/early_return.rs b/tests/fixtures/real_world/rust/state/early_return.rs new file mode 100644 index 00000000..822004d9 --- /dev/null +++ b/tests/fixtures/real_world/rust/state/early_return.rs @@ -0,0 +1,12 @@ +use std::fs::File; +use std::io::Read; + +fn process(path: &str) -> Option<String> { + let mut f = File::open(path).ok()?; + let mut buf = String::new(); + f.read_to_string(&mut buf).ok()?; + if 
buf.is_empty() { + return None; // f dropped by RAII, safe + } + Some(buf) +} diff --git a/tests/fixtures/real_world/rust/state/file_lifecycle.expect.json b/tests/fixtures/real_world/rust/state/file_lifecycle.expect.json new file mode 100644 index 00000000..31460cc6 --- /dev/null +++ b/tests/fixtures/real_world/rust/state/file_lifecycle.expect.json @@ -0,0 +1,45 @@ +{ + "description": "Rust RAII file lifecycle: File::open auto-closes on drop, no resource leak", + "tags": [ + "state", + "raii" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": ".unwrap() on File::open Result in read_and_drop" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": ".unwrap() on read_to_string Result" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 11, + 15 + ], + "evidence_contains": [], + "notes": ".unwrap() on File::open Result in explicit_drop" + } + ] +} diff --git a/tests/fixtures/real_world/rust/state/file_lifecycle.rs b/tests/fixtures/real_world/rust/state/file_lifecycle.rs new file mode 100644 index 00000000..3f5e5fc2 --- /dev/null +++ b/tests/fixtures/real_world/rust/state/file_lifecycle.rs @@ -0,0 +1,15 @@ +use std::fs::File; +use std::io::Read; + +fn read_and_drop() -> String { + let mut f = File::open("/tmp/test").unwrap(); + let mut buf = String::new(); + f.read_to_string(&mut buf).unwrap(); + buf + // f dropped automatically by RAII +} + +fn explicit_drop() { + let f = File::open("/tmp/test").unwrap(); + drop(f); +} diff --git a/tests/fixtures/real_world/rust/state/mem_forget.expect.json b/tests/fixtures/real_world/rust/state/mem_forget.expect.json new file mode 100644 index 00000000..8bd81893 --- /dev/null +++ b/tests/fixtures/real_world/rust/state/mem_forget.expect.json 
@@ -0,0 +1,45 @@ +{ + "description": "mem::forget prevents RAII drop \u2014 intentional resource leak of File handle", + "tags": [ + "mem", + "resource-leak" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "rs.memory.mem_forget", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": "mem::forget(f) prevents File destructor from running \u2014 resource leak" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": ".unwrap() on File::open Result" + }, + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 8 + ], + "evidence_contains": [], + "notes": "mem::forget prevents RAII cleanup \u2014 aspirational, requires forget-aware state analysis" + } + ] +} diff --git a/tests/fixtures/real_world/rust/state/mem_forget.rs b/tests/fixtures/real_world/rust/state/mem_forget.rs new file mode 100644 index 00000000..c236a04b --- /dev/null +++ b/tests/fixtures/real_world/rust/state/mem_forget.rs @@ -0,0 +1,7 @@ +use std::mem; +use std::fs::File; + +fn forget_file() { + let f = File::open("/tmp/test").unwrap(); + mem::forget(f); // resource leak! 
+} diff --git a/tests/fixtures/real_world/rust/state/unsafe_resource.expect.json b/tests/fixtures/real_world/rust/state/unsafe_resource.expect.json new file mode 100644 index 00000000..8bd9242a --- /dev/null +++ b/tests/fixtures/real_world/rust/state/unsafe_resource.expect.json @@ -0,0 +1,24 @@ +{ + "description": "Unsafe Rust alloc/dealloc: alloc_leak never deallocates, alloc_clean does", + "tags": [ + "state", + "unsafe", + "resource-leak" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 8 + ], + "evidence_contains": [], + "notes": "alloc() at line 5 never dealloc'd in alloc_leak \u2014 aspirational, requires alloc/dealloc tracking" + } + ] +} diff --git a/tests/fixtures/real_world/rust/state/unsafe_resource.rs b/tests/fixtures/real_world/rust/state/unsafe_resource.rs new file mode 100644 index 00000000..acb0c8d4 --- /dev/null +++ b/tests/fixtures/real_world/rust/state/unsafe_resource.rs @@ -0,0 +1,13 @@ +use std::alloc::{alloc, dealloc, Layout}; + +unsafe fn alloc_leak() { + let layout = Layout::new::<[u8; 1024]>(); + let ptr = alloc(layout); + // never deallocated +} + +unsafe fn alloc_clean() { + let layout = Layout::new::<[u8; 1024]>(); + let ptr = alloc(layout); + dealloc(ptr, layout); +} diff --git a/tests/fixtures/real_world/rust/taint/command_env_args.expect.json b/tests/fixtures/real_world/rust/taint/command_env_args.expect.json new file mode 100644 index 00000000..d1ae001c --- /dev/null +++ b/tests/fixtures/real_world/rust/taint/command_env_args.expect.json @@ -0,0 +1,34 @@ +{ + "description": "Command-line argument passed directly to Command::new \u2014 command injection via CLI args", + "tags": [ + "taint", + "cmdi" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 12 + ], + "evidence_contains": [], + "notes": "env::args() flows to 
Command::new \u2014 aspirational, depends on args() being classified as source" + }, + { + "rule_id": "rs.quality.expect", + "severity": null, + "must_match": true, + "line_range": [ + 8, + 12 + ], + "evidence_contains": [], + "notes": ".expect() on Command output Result" + } + ] +} diff --git a/tests/fixtures/real_world/rust/taint/command_env_args.rs b/tests/fixtures/real_world/rust/taint/command_env_args.rs new file mode 100644 index 00000000..e5fa583d --- /dev/null +++ b/tests/fixtures/real_world/rust/taint/command_env_args.rs @@ -0,0 +1,12 @@ +use std::env; +use std::process::Command; + +fn main() { + let args: Vec<String> = env::args().collect(); + if args.len() > 1 { + let user_cmd = &args[1]; + Command::new(user_cmd) + .output() + .expect("failed to execute"); + } +} diff --git a/tests/fixtures/real_world/rust/taint/env_to_command.expect.json b/tests/fixtures/real_world/rust/taint/env_to_command.expect.json new file mode 100644 index 00000000..fad46d6f --- /dev/null +++ b/tests/fixtures/real_world/rust/taint/env_to_command.expect.json @@ -0,0 +1,45 @@ +{ + "description": "env::var flows to Command::new \u2014 command injection. 
Safe version uses allowlist.", + "tags": [ + "taint", + "cmdi" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 12 + ], + "evidence_contains": [], + "notes": "env::var(\"USER_CMD\") flows through .arg(&cmd) to Command execution" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": ".unwrap() on env::var Result" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 8, + 12 + ], + "evidence_contains": [], + "notes": ".unwrap() on Command output Result" + } + ] +} diff --git a/tests/fixtures/real_world/rust/taint/env_to_command.rs b/tests/fixtures/real_world/rust/taint/env_to_command.rs new file mode 100644 index 00000000..19fdacc4 --- /dev/null +++ b/tests/fixtures/real_world/rust/taint/env_to_command.rs @@ -0,0 +1,19 @@ +use std::env; +use std::process::Command; + +fn run_user_command() { + let cmd = env::var("USER_CMD").unwrap(); + Command::new("sh") + .arg("-c") + .arg(&cmd) + .output() + .unwrap(); +} + +fn run_safe_command() { + let cmd = env::var("USER_CMD").unwrap_or_default(); + let allowed = ["ls", "date"]; + if allowed.contains(&cmd.as_str()) { + Command::new(&cmd).output().unwrap(); + } +} diff --git a/tests/fixtures/real_world/rust/taint/env_to_file.expect.json b/tests/fixtures/real_world/rust/taint/env_to_file.expect.json new file mode 100644 index 00000000..d2d42cae --- /dev/null +++ b/tests/fixtures/real_world/rust/taint/env_to_file.expect.json @@ -0,0 +1,45 @@ +{ + "description": "env::var flows to fs::read_to_string \u2014 path traversal / arbitrary file read", + "tags": [ + "taint", + "path-traversal" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 8 + ], + "evidence_contains": [], + 
"notes": "env::var(\"FILE_PATH\") flows to fs::read_to_string \u2014 arbitrary file read" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": ".unwrap() on env::var Result" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 4, + 8 + ], + "evidence_contains": [], + "notes": ".unwrap() on read_to_string Result" + } + ] +} diff --git a/tests/fixtures/real_world/rust/taint/env_to_file.rs b/tests/fixtures/real_world/rust/taint/env_to_file.rs new file mode 100644 index 00000000..4f15f1f4 --- /dev/null +++ b/tests/fixtures/real_world/rust/taint/env_to_file.rs @@ -0,0 +1,7 @@ +use std::env; +use std::fs; + +fn read_user_file() -> String { + let path = env::var("FILE_PATH").unwrap(); + fs::read_to_string(&path).unwrap() +} diff --git a/tests/fixtures/real_world/rust/taint/serde_deser.expect.json b/tests/fixtures/real_world/rust/taint/serde_deser.expect.json new file mode 100644 index 00000000..4f5f6648 --- /dev/null +++ b/tests/fixtures/real_world/rust/taint/serde_deser.expect.json @@ -0,0 +1,45 @@ +{ + "description": "env::var flows to serde_json::from_str \u2014 deserialization of untrusted input", + "tags": [ + "taint", + "deserialization" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 7 + ], + "evidence_contains": [], + "notes": "Depends on serde_json::from_str being classified as a sink \u2014 aspirational" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 2, + 6 + ], + "evidence_contains": [], + "notes": ".unwrap() on env::var Result" + }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 3, + 7 + ], + "evidence_contains": [], + "notes": ".unwrap() on from_str Result" + } + ] +} diff --git 
a/tests/fixtures/real_world/rust/taint/serde_deser.rs b/tests/fixtures/real_world/rust/taint/serde_deser.rs new file mode 100644 index 00000000..f3a7b604 --- /dev/null +++ b/tests/fixtures/real_world/rust/taint/serde_deser.rs @@ -0,0 +1,6 @@ +use std::env; + +fn parse_user_json() { + let input = env::var("JSON_INPUT").unwrap(); + let _value: serde_json::Value = serde_json::from_str(&input).unwrap(); +} diff --git a/tests/fixtures/real_world/rust/taint/sql_format.expect.json b/tests/fixtures/real_world/rust/taint/sql_format.expect.json new file mode 100644 index 00000000..5eb3e458 --- /dev/null +++ b/tests/fixtures/real_world/rust/taint/sql_format.expect.json @@ -0,0 +1,34 @@ +{ + "description": "SQL injection via format! string interpolation \u2014 env var concatenated into SQL query", + "tags": [ + "taint", + "sqli" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 6, + 12 + ], + "evidence_contains": [], + "notes": "env::var flows through format! into println \u2014 benign sink, but SQL string construction is interesting. Aspirational." 
+ }, + { + "rule_id": "rs.quality.unwrap", + "severity": null, + "must_match": true, + "line_range": [ + 6, + 10 + ], + "evidence_contains": [], + "notes": ".unwrap() on env::var Result" + } + ] +} diff --git a/tests/fixtures/real_world/rust/taint/sql_format.rs b/tests/fixtures/real_world/rust/taint/sql_format.rs new file mode 100644 index 00000000..df338171 --- /dev/null +++ b/tests/fixtures/real_world/rust/taint/sql_format.rs @@ -0,0 +1,11 @@ +use std::env; + +fn query_user(user_id: &str) -> String { + format!("SELECT * FROM users WHERE id = {}", user_id) +} + +fn main() { + let id = env::var("USER_ID").unwrap(); + let query = query_user(&id); + println!("{}", query); +} diff --git a/tests/fixtures/real_world/rust/taint/transmute_unsafe.expect.json b/tests/fixtures/real_world/rust/taint/transmute_unsafe.expect.json new file mode 100644 index 00000000..3ce42a35 --- /dev/null +++ b/tests/fixtures/real_world/rust/taint/transmute_unsafe.expect.json @@ -0,0 +1,34 @@ +{ + "description": "Unsafe Rust: mem::transmute for type punning and raw pointer manipulation", + "tags": [ + "unsafe", + "mem" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "rs.memory.transmute", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": "mem::transmute used for u32 to [u8; 4] conversion" + }, + { + "rule_id": "rs.quality.unsafe_block", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": "unsafe block wrapping mem::transmute call" + } + ] +} diff --git a/tests/fixtures/real_world/rust/taint/transmute_unsafe.rs b/tests/fixtures/real_world/rust/taint/transmute_unsafe.rs new file mode 100644 index 00000000..5148c595 --- /dev/null +++ b/tests/fixtures/real_world/rust/taint/transmute_unsafe.rs @@ -0,0 +1,12 @@ +use std::mem; + +unsafe fn reinterpret(data: &[u8]) -> &[u32] { + let ptr = data.as_ptr() as *const u32; + let len = data.len() / 4; + 
std::slice::from_raw_parts(ptr, len) +} + +fn transmute_example() { + let val: u32 = 0x41414141; + let bytes: [u8; 4] = unsafe { mem::transmute(val) }; +} diff --git a/tests/fixtures/real_world/typescript/cfg/error_handling.expect.json b/tests/fixtures/real_world/typescript/cfg/error_handling.expect.json new file mode 100644 index 00000000..7c423974 --- /dev/null +++ b/tests/fixtures/real_world/typescript/cfg/error_handling.expect.json @@ -0,0 +1,24 @@ +{ + "description": "Error handling fallthrough: readConfigUnsafe logs error but falls through to return config.value. readConfigSafe properly throws.", + "tags": [ + "cfg", + "error-fallthrough", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-error-fallthrough", + "severity": null, + "must_match": false, + "line_range": [ + 4, + 11 + ], + "evidence_contains": [], + "notes": "Error condition detected but execution falls through to return config.value without return or throw" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/cfg/error_handling.ts b/tests/fixtures/real_world/typescript/cfg/error_handling.ts new file mode 100644 index 00000000..e1349f92 --- /dev/null +++ b/tests/fixtures/real_world/typescript/cfg/error_handling.ts @@ -0,0 +1,20 @@ +import * as fs from 'fs'; + +function readConfigUnsafe(path: string): string { + var content = fs.readFileSync(path, 'utf8'); + var config = JSON.parse(content); + if (config.error) { + console.log('Error in config'); + // falls through without returning! 
+ } + return config.value; +} + +function readConfigSafe(path: string): string { + var content = fs.readFileSync(path, 'utf8'); + var config = JSON.parse(content); + if (config.error) { + throw new Error('Invalid config: ' + config.error); + } + return config.value; +} diff --git a/tests/fixtures/real_world/typescript/cfg/interface_guard.expect.json b/tests/fixtures/real_world/typescript/cfg/interface_guard.expect.json new file mode 100644 index 00000000..beb82175 --- /dev/null +++ b/tests/fixtures/real_world/typescript/cfg/interface_guard.expect.json @@ -0,0 +1,37 @@ +{ + "description": "Interface-typed input with admin guard. runIfAdmin checks isAdmin before exec; runUnchecked does not.", + "tags": [ + "cfg", + "auth-guard", + "interface", + "cmdi", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-unguarded-sink", + "severity": null, + "must_match": false, + "line_range": [ + 14, + 18 + ], + "evidence_contains": [], + "notes": "child_process.exec in runUnchecked has no guard; input.command is a param not a recognized source" + }, + { + "rule_id": "cfg-auth-gap", + "severity": null, + "must_match": false, + "line_range": [ + 13, + 18 + ], + "evidence_contains": [], + "notes": "runUnchecked lacks the isAdmin check that runIfAdmin has" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/cfg/interface_guard.ts b/tests/fixtures/real_world/typescript/cfg/interface_guard.ts new file mode 100644 index 00000000..55329eb8 --- /dev/null +++ b/tests/fixtures/real_world/typescript/cfg/interface_guard.ts @@ -0,0 +1,17 @@ +import child_process from 'child_process'; + +interface UserInput { + command: string; + isAdmin: boolean; +} + +function runIfAdmin(input: UserInput): void { + if (!input.isAdmin) { + return; + } + child_process.exec(input.command); +} + +function runUnchecked(input: UserInput): void { + child_process.exec(input.command); +} diff --git a/tests/fixtures/real_world/typescript/cfg/promise_chain.expect.json 
b/tests/fixtures/real_world/typescript/cfg/promise_chain.expect.json new file mode 100644 index 00000000..420ce5ed --- /dev/null +++ b/tests/fixtures/real_world/typescript/cfg/promise_chain.expect.json @@ -0,0 +1,37 @@ +{ + "description": "Promise chain with promisified exec. Input flows through multiple await steps. Scanner cannot trace through promisify wrappers.", + "tags": [ + "cfg", + "async", + "promise", + "cmdi", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 5, + 10 + ], + "evidence_contains": [], + "notes": "input flows to execAsync but execAsync is a promisified wrapper not recognized as a taint sink" + }, + { + "rule_id": "cfg-unguarded-sink", + "severity": null, + "must_match": false, + "line_range": [ + 5, + 9 + ], + "evidence_contains": [], + "notes": "execAsync called without validation of input" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/cfg/promise_chain.ts b/tests/fixtures/real_world/typescript/cfg/promise_chain.ts new file mode 100644 index 00000000..f1d45046 --- /dev/null +++ b/tests/fixtures/real_world/typescript/cfg/promise_chain.ts @@ -0,0 +1,16 @@ +import child_process from 'child_process'; +import { promisify } from 'util'; + +var execAsync = promisify(child_process.exec); + +async function pipeline(input: string): Promise { + var step1 = await execAsync('echo ' + input); + var step2 = await execAsync('wc -c <<< "' + step1.stdout + '"'); + return step2.stdout; +} + +async function safePipeline(input: string): Promise { + var sanitized = input.replace(/[^a-zA-Z0-9]/g, ''); + var step1 = await execAsync('echo ' + sanitized); + return step1.stdout; +} diff --git a/tests/fixtures/real_world/typescript/cfg/try_catch_typed.expect.json b/tests/fixtures/real_world/typescript/cfg/try_catch_typed.expect.json new file mode 100644 index 00000000..26b7cc8b --- /dev/null +++ 
b/tests/fixtures/real_world/typescript/cfg/try_catch_typed.expect.json @@ -0,0 +1,26 @@ +{ + "description": "Class-based resource management. riskyUsage skips close() on throw; safeUsage uses try/finally.", + "tags": [ + "cfg", + "resource-leak", + "class", + "try-catch", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "cfg-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 24, + 32 + ], + "evidence_contains": [], + "notes": "fp.close() skipped when fp.process() throws in riskyUsage; scanner may not track class-based resource lifecycle" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/cfg/try_catch_typed.ts b/tests/fixtures/real_world/typescript/cfg/try_catch_typed.ts new file mode 100644 index 00000000..3cb1ecc3 --- /dev/null +++ b/tests/fixtures/real_world/typescript/cfg/try_catch_typed.ts @@ -0,0 +1,42 @@ +import * as fs from 'fs'; + +class FileProcessor { + private fd: number | null = null; + + open(path: string): void { + this.fd = fs.openSync(path, 'r'); + } + + process(): string { + if (this.fd === null) throw new Error('not opened'); + var buf = Buffer.alloc(1024); + fs.readSync(this.fd, buf); + return buf.toString(); + } + + close(): void { + if (this.fd !== null) { + fs.closeSync(this.fd); + this.fd = null; + } + } +} + +function riskyUsage(path: string): string { + var fp = new FileProcessor(); + fp.open(path); + var data = fp.process(); // may throw + fp.close(); // skipped on throw + return data; +} + +function safeUsage(path: string): string { + var fp = new FileProcessor(); + try { + fp.open(path); + var data = fp.process(); + return data; + } finally { + fp.close(); + } +} diff --git a/tests/fixtures/real_world/typescript/mixed/auth_taint_cfg.expect.json b/tests/fixtures/real_world/typescript/mixed/auth_taint_cfg.expect.json new file mode 100644 index 00000000..92c8d99e --- /dev/null +++ b/tests/fixtures/real_world/typescript/mixed/auth_taint_cfg.expect.json @@ -0,0 +1,50 @@ +{ 
+ "description": "Two Express routes: /run without auth, /run-authed with auth check. Both have command injection via req.query.cmd since auth does not sanitize input.", + "tags": [ + "mixed", + "taint", + "cfg", + "auth", + "cmdi", + "express", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 14 + ], + "evidence_contains": [], + "notes": "req.query.cmd flows into child_process.exec in unauthenticated /run route" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 20, + 25 + ], + "evidence_contains": [], + "notes": "req.query.cmd flows into child_process.exec in /run-authed; auth check does not sanitize the input" + }, + { + "rule_id": "cfg-auth-gap", + "severity": null, + "must_match": false, + "line_range": [ + 8, + 17 + ], + "evidence_contains": [], + "notes": "No auth check before child_process.exec in /run handler" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/mixed/auth_taint_cfg.ts b/tests/fixtures/real_world/typescript/mixed/auth_taint_cfg.ts new file mode 100644 index 00000000..6cd50456 --- /dev/null +++ b/tests/fixtures/real_world/typescript/mixed/auth_taint_cfg.ts @@ -0,0 +1,25 @@ +import child_process from 'child_process'; +import express from 'express'; + +var app = express(); + +function authenticate(req: any): boolean { + return req.headers.authorization === 'Bearer secret'; +} + +app.get('/run', function(req: any, res: any) { + var cmd = req.query.cmd; + child_process.exec(cmd, function(err: any, stdout: any) { + res.send(stdout); + }); +}); + +app.get('/run-authed', function(req: any, res: any) { + if (!authenticate(req)) { + return res.status(401).send('Unauthorized'); + } + var cmd = req.query.cmd; + child_process.exec(cmd, function(err: any, stdout: any) { + res.send(stdout); + }); +}); diff --git 
a/tests/fixtures/real_world/typescript/mixed/taint_state_interaction.expect.json b/tests/fixtures/real_world/typescript/mixed/taint_state_interaction.expect.json new file mode 100644 index 00000000..056ca138 --- /dev/null +++ b/tests/fixtures/real_world/typescript/mixed/taint_state_interaction.expect.json @@ -0,0 +1,37 @@ +{ + "description": "Combined taint and state: userPath param flows to fs.openSync, and fd leaks on early return when bytesRead is 0. Safe version uses try/finally.", + "tags": [ + "mixed", + "taint", + "state", + "resource-leak", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 6 + ], + "evidence_contains": [], + "notes": "userPath flows to fs.openSync but fs.openSync is not a recognized taint sink and userPath is a param not a recognized source" + }, + { + "rule_id": "state-resource-leak-possible", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 11 + ], + "evidence_contains": [], + "notes": "fd from fs.openSync leaks when early return fires at line 9; scanner may not track JS/TS fd lifecycle" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/mixed/taint_state_interaction.ts b/tests/fixtures/real_world/typescript/mixed/taint_state_interaction.ts new file mode 100644 index 00000000..b90f56f5 --- /dev/null +++ b/tests/fixtures/real_world/typescript/mixed/taint_state_interaction.ts @@ -0,0 +1,27 @@ +import * as fs from 'fs'; + +function processUserFile(userPath: string): string { + var fd = fs.openSync(userPath, 'r'); // taint: userPath is user-controlled + var buf = Buffer.alloc(4096); + var bytesRead = fs.readSync(fd, buf); + if (bytesRead === 0) { + // early return leaks fd + return 'empty'; + } + fs.closeSync(fd); + return buf.slice(0, bytesRead).toString(); +} + +function processUserFileSafe(userPath: string): string { + var fd = fs.openSync(userPath, 'r'); + try { + var buf = 
Buffer.alloc(4096); + var bytesRead = fs.readSync(fd, buf); + if (bytesRead === 0) { + return 'empty'; + } + return buf.slice(0, bytesRead).toString(); + } finally { + fs.closeSync(fd); + } +} diff --git a/tests/fixtures/real_world/typescript/state/double_close.expect.json b/tests/fixtures/real_world/typescript/state/double_close.expect.json new file mode 100644 index 00000000..89e54602 --- /dev/null +++ b/tests/fixtures/real_world/typescript/state/double_close.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Double-close risk in try-catch: fd closed in try block, then closed again in catch block. Safe version uses a flag to prevent double-close.", + "tags": [ + "state", + "double-close", + "try-catch", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-double-close", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 13 + ], + "evidence_contains": [], + "notes": "fd closed at line 8 in try block, then again at line 10 in catch block if readSync throws after partial execution" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/state/double_close.ts b/tests/fixtures/real_world/typescript/state/double_close.ts new file mode 100644 index 00000000..d586e47c --- /dev/null +++ b/tests/fixtures/real_world/typescript/state/double_close.ts @@ -0,0 +1,27 @@ +import * as fs from 'fs'; + +function doubleCloseRisk(path: string): void { + var fd = fs.openSync(path, 'r'); + try { + var buf = Buffer.alloc(1024); + fs.readSync(fd, buf); + fs.closeSync(fd); + } catch (e) { + fs.closeSync(fd); // might double-close if readSync succeeds then later code throws + } +} + +function safeClosePattern(path: string): void { + var fd = fs.openSync(path, 'r'); + var closed = false; + try { + var buf = Buffer.alloc(1024); + fs.readSync(fd, buf); + fs.closeSync(fd); + closed = true; + } catch (e) { + if (!closed) { + fs.closeSync(fd); + } + } +} diff --git a/tests/fixtures/real_world/typescript/state/map_cleanup.expect.json 
b/tests/fixtures/real_world/typescript/state/map_cleanup.expect.json new file mode 100644 index 00000000..2b92b239 --- /dev/null +++ b/tests/fixtures/real_world/typescript/state/map_cleanup.expect.json @@ -0,0 +1,26 @@ +{ + "description": "Batch file handle processing. processFiles opens handles in a loop but never closes them. processFilesSafe closes all handles in a separate loop.", + "tags": [ + "state", + "resource-leak", + "loop", + "batch", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 3, + 15 + ], + "evidence_contains": [], + "notes": "File handles opened in loop are never closed in processFiles; scanner may not track handles stored in arrays" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/state/map_cleanup.ts b/tests/fixtures/real_world/typescript/state/map_cleanup.ts new file mode 100644 index 00000000..ba91f2bc --- /dev/null +++ b/tests/fixtures/real_world/typescript/state/map_cleanup.ts @@ -0,0 +1,28 @@ +import * as fs from 'fs'; + +function processFiles(paths: string[]): void { + var handles: number[] = []; + for (var i = 0; i < paths.length; i++) { + handles.push(fs.openSync(paths[i], 'r')); + } + // Process all + for (var j = 0; j < handles.length; j++) { + var buf = Buffer.alloc(1024); + fs.readSync(handles[j], buf); + } + // Forgot to close any handles! 
+} + +function processFilesSafe(paths: string[]): void { + var handles: number[] = []; + for (var i = 0; i < paths.length; i++) { + handles.push(fs.openSync(paths[i], 'r')); + } + for (var j = 0; j < handles.length; j++) { + var buf = Buffer.alloc(1024); + fs.readSync(handles[j], buf); + } + for (var k = 0; k < handles.length; k++) { + fs.closeSync(handles[k]); + } +} diff --git a/tests/fixtures/real_world/typescript/state/resource_class.expect.json b/tests/fixtures/real_world/typescript/state/resource_class.expect.json new file mode 100644 index 00000000..fcdf92ed --- /dev/null +++ b/tests/fixtures/real_world/typescript/state/resource_class.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Class-based resource management. leak() creates Database without calling close(). clean() properly closes.", + "tags": [ + "state", + "resource-leak", + "class", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 20, + 26 + ], + "evidence_contains": [], + "notes": "Database created in leak() but close() never called; scanner may not track class-based resource lifecycle in TS" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/state/resource_class.ts b/tests/fixtures/real_world/typescript/state/resource_class.ts new file mode 100644 index 00000000..b2eb80e1 --- /dev/null +++ b/tests/fixtures/real_world/typescript/state/resource_class.ts @@ -0,0 +1,31 @@ +import * as fs from 'fs'; + +class Database { + private connection: number; + + constructor(path: string) { + this.connection = fs.openSync(path, 'r'); + } + + query(): string { + var buf = Buffer.alloc(1024); + fs.readSync(this.connection, buf); + return buf.toString(); + } + + close(): void { + fs.closeSync(this.connection); + } +} + +function leak(): void { + var db = new Database('/tmp/test.db'); + db.query(); + // Missing db.close() +} + +function clean(): void { + var db = new 
Database('/tmp/test.db'); + db.query(); + db.close(); +} diff --git a/tests/fixtures/real_world/typescript/state/stream_lifecycle.expect.json b/tests/fixtures/real_world/typescript/state/stream_lifecycle.expect.json new file mode 100644 index 00000000..2f43b51b --- /dev/null +++ b/tests/fixtures/real_world/typescript/state/stream_lifecycle.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Stream lifecycle: processStream pipes without error handlers (may leak). processStreamSafe handles errors and destroys streams.", + "tags": [ + "state", + "resource-leak", + "stream", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "state-resource-leak", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 9 + ], + "evidence_contains": [], + "notes": "Streams created without error handlers may leak on pipe failure; scanner likely cannot track stream lifecycle" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/state/stream_lifecycle.ts b/tests/fixtures/real_world/typescript/state/stream_lifecycle.ts new file mode 100644 index 00000000..cab632f3 --- /dev/null +++ b/tests/fixtures/real_world/typescript/state/stream_lifecycle.ts @@ -0,0 +1,27 @@ +import * as fs from 'fs'; + +function processStream(inputPath: string, outputPath: string): void { + var reader = fs.createReadStream(inputPath); + var writer = fs.createWriteStream(outputPath); + reader.pipe(writer); + // Streams may leak if error occurs before pipe completes +} + +function processStreamSafe(inputPath: string, outputPath: string): void { + var reader = fs.createReadStream(inputPath); + var writer = fs.createWriteStream(outputPath); + + reader.on('error', function(err: Error) { + console.error('Read error:', err); + writer.destroy(); + reader.destroy(); + }); + + writer.on('error', function(err: Error) { + console.error('Write error:', err); + reader.destroy(); + writer.destroy(); + }); + + reader.pipe(writer); +} diff --git 
a/tests/fixtures/real_world/typescript/taint/decorator_handler.expect.json b/tests/fixtures/real_world/typescript/taint/decorator_handler.expect.json new file mode 100644 index 00000000..c71586a2 --- /dev/null +++ b/tests/fixtures/real_world/typescript/taint/decorator_handler.expect.json @@ -0,0 +1,36 @@ +{ + "description": "Controller class with methods that call exec and eval. eval is detectable via AST pattern; exec requires taint source.", + "tags": [ + "taint", + "code-exec", + "class", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "ts.code_exec.eval", + "severity": null, + "must_match": true, + "line_range": [ + 13, + 17 + ], + "evidence_contains": [], + "notes": "AST pattern matches eval() call in handleEval method" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": "userInput flows to child_process.exec but is a method param, not a recognized source" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/taint/decorator_handler.ts b/tests/fixtures/real_world/typescript/taint/decorator_handler.ts new file mode 100644 index 00000000..37007ee8 --- /dev/null +++ b/tests/fixtures/real_world/typescript/taint/decorator_handler.ts @@ -0,0 +1,17 @@ +import child_process from 'child_process'; + +function Route(path: string) { + return function(target: any, key: string, descriptor: PropertyDescriptor) { + return descriptor; + }; +} + +class Controller { + handleExec(userInput: string) { + child_process.exec(userInput); + } + + handleEval(code: string) { + eval(code); + } +} diff --git a/tests/fixtures/real_world/typescript/taint/enum_switch.expect.json b/tests/fixtures/real_world/typescript/taint/enum_switch.expect.json new file mode 100644 index 00000000..6dfff267 --- /dev/null +++ b/tests/fixtures/real_world/typescript/taint/enum_switch.expect.json @@ -0,0 +1,37 @@ +{ + "description": "Enum-based switch dispatching 
payload to child_process.exec. payload is a function param, not a recognized taint source.", + "tags": [ + "taint", + "cmdi", + "enum", + "switch", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": "payload flows to child_process.exec but is a function param, not a recognized source" + }, + { + "rule_id": "cfg-unguarded-sink", + "severity": null, + "must_match": false, + "line_range": [ + 9, + 13 + ], + "evidence_contains": [], + "notes": "child_process.exec called without input validation; enum switch only controls dispatch, not payload safety" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/taint/enum_switch.ts b/tests/fixtures/real_world/typescript/taint/enum_switch.ts new file mode 100644 index 00000000..d6169141 --- /dev/null +++ b/tests/fixtures/real_world/typescript/taint/enum_switch.ts @@ -0,0 +1,17 @@ +import child_process from 'child_process'; + +enum Action { + Run = 'run', + Stop = 'stop', +} + +function handleAction(action: Action, payload: string): void { + switch (action) { + case Action.Run: + child_process.exec(payload); + break; + case Action.Stop: + console.log('stopping'); + break; + } +} diff --git a/tests/fixtures/real_world/typescript/taint/express_typed.expect.json b/tests/fixtures/real_world/typescript/taint/express_typed.expect.json new file mode 100644 index 00000000..3f02270e --- /dev/null +++ b/tests/fixtures/real_world/typescript/taint/express_typed.expect.json @@ -0,0 +1,36 @@ +{ + "description": "TypeScript Express handler with typed query params. 
req.query.host flows into child_process.exec.", + "tags": [ + "taint", + "cmdi", + "express", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 9, + 14 + ], + "evidence_contains": [], + "notes": "req.query.host flows into child_process.exec via string concatenation" + }, + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 17, + 23 + ], + "evidence_contains": [], + "notes": "Safe version still fires because .replace is not a recognized SHELL_ESCAPE sanitizer" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/taint/express_typed.ts b/tests/fixtures/real_world/typescript/taint/express_typed.ts new file mode 100644 index 00000000..a9b59bd9 --- /dev/null +++ b/tests/fixtures/real_world/typescript/taint/express_typed.ts @@ -0,0 +1,23 @@ +import child_process from 'child_process'; +import express from 'express'; + +interface QueryParams { + host: string; +} + +var app = express(); + +app.get('/ping', function(req: any, res: any) { + var host = req.query.host; + child_process.exec('ping -c 1 ' + host, function(err: any, stdout: any) { + res.send(stdout); + }); +}); + +app.get('/safe-ping', function(req: any, res: any) { + var host = req.query.host; + var sanitized = host.replace(/[^a-zA-Z0-9.]/g, ''); + child_process.exec('ping -c 1 ' + sanitized, function(err: any, stdout: any) { + res.send(stdout); + }); +}); diff --git a/tests/fixtures/real_world/typescript/taint/generic_handler.expect.json b/tests/fixtures/real_world/typescript/taint/generic_handler.expect.json new file mode 100644 index 00000000..85e07dfc --- /dev/null +++ b/tests/fixtures/real_world/typescript/taint/generic_handler.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Generic function passing typed input to child_process.exec. 
Requires tracking through generic type parameter and object property access.", + "tags": [ + "taint", + "cmdi", + "generics", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 2, + 6 + ], + "evidence_contains": [], + "notes": "input.cmd flows to child_process.exec but input is a function param, not a recognized source; requires interprocedural analysis" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/taint/generic_handler.ts b/tests/fixtures/real_world/typescript/taint/generic_handler.ts new file mode 100644 index 00000000..ad908358 --- /dev/null +++ b/tests/fixtures/real_world/typescript/taint/generic_handler.ts @@ -0,0 +1,9 @@ +import child_process from 'child_process'; + +function executeCommand(input: T): void { + child_process.exec(input.cmd); +} + +function processUserInput(userInput: string): void { + executeCommand({ cmd: userInput }); +} diff --git a/tests/fixtures/real_world/typescript/taint/optional_chain.expect.json b/tests/fixtures/real_world/typescript/taint/optional_chain.expect.json new file mode 100644 index 00000000..889d2326 --- /dev/null +++ b/tests/fixtures/real_world/typescript/taint/optional_chain.expect.json @@ -0,0 +1,26 @@ +{ + "description": "Optional chaining and nullish coalescing with user input flowing to child_process.exec. 
req.query.cmd can override the safe default.", + "tags": [ + "taint", + "cmdi", + "optional-chaining", + "express", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": true, + "line_range": [ + 11, + 18 + ], + "evidence_contains": [], + "notes": "req.query.cmd flows through nullish coalescing into child_process.exec" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/taint/optional_chain.ts b/tests/fixtures/real_world/typescript/taint/optional_chain.ts new file mode 100644 index 00000000..6755d71d --- /dev/null +++ b/tests/fixtures/real_world/typescript/taint/optional_chain.ts @@ -0,0 +1,19 @@ +import child_process from 'child_process'; +import express from 'express'; + +interface Config { + commands?: { + deploy?: string; + }; +} + +var app = express(); + +app.get('/deploy', function(req: any, res: any) { + var userOverride = req.query.cmd; + var config: Config = { commands: { deploy: 'echo noop' } }; + var cmd = userOverride ?? config.commands?.deploy ?? 'echo noop'; + child_process.exec(cmd, function(err: any, stdout: any) { + res.send(stdout); + }); +}); diff --git a/tests/fixtures/real_world/typescript/taint/type_assertion_bypass.expect.json b/tests/fixtures/real_world/typescript/taint/type_assertion_bypass.expect.json new file mode 100644 index 00000000..4888d47e --- /dev/null +++ b/tests/fixtures/real_world/typescript/taint/type_assertion_bypass.expect.json @@ -0,0 +1,25 @@ +{ + "description": "Type assertion (as SafeInput) does not sanitize tainted data. 
req.body flows into SQL string despite type cast.", + "tags": [ + "taint", + "sqli", + "type-assertion", + "typescript" + ], + "modes": [ + "full" + ], + "expected": [ + { + "rule_id": "taint-unsanitised-flow", + "severity": null, + "must_match": false, + "line_range": [ + 10, + 15 + ], + "evidence_contains": [], + "notes": "req.body flows through type assertion into SQL string concat; depends on scanner tracking taint through as-expressions and no SQL sink is defined" + } + ] +} diff --git a/tests/fixtures/real_world/typescript/taint/type_assertion_bypass.ts b/tests/fixtures/real_world/typescript/taint/type_assertion_bypass.ts new file mode 100644 index 00000000..e3dffc5e --- /dev/null +++ b/tests/fixtures/real_world/typescript/taint/type_assertion_bypass.ts @@ -0,0 +1,16 @@ +import express from 'express'; + +interface SafeInput { + name: string; + age: number; +} + +var app = express(); + +app.post('/update', function(req: any, res: any) { + // Type assertion does NOT sanitize + var input = req.body as SafeInput; + var query = 'UPDATE users SET name = \'' + input.name + '\' WHERE age = ' + input.age; + // SQL injection despite type assertion + res.json({ query: query }); +}); diff --git a/tests/fixtures/rust_web_app/expectations.json b/tests/fixtures/rust_web_app/expectations.json index 983c2d0a..fc32cef5 100644 --- a/tests/fixtures/rust_web_app/expectations.json +++ b/tests/fixtures/rust_web_app/expectations.json @@ -1,11 +1,8 @@ { "required_findings": [ { "id_prefix": "taint-unsanitised-flow", "min_count": 5 }, - { "id_prefix": "unwrap_call", "min_count": 10 }, - { "id_prefix": "expect_call", "min_count": 5 }, - { "id_prefix": "unsafe_block", "min_count": 1 }, - { "id_prefix": "panic_macro", "min_count": 1 }, - { "id_prefix": "cfg-auth-gap", "min_count": 3 } + { "id_prefix": "rs.quality.unsafe_block", "min_count": 1 }, + { "id_prefix": "state-unauthed-access", "min_count": 3 } ], "forbidden_findings": [], "noise_budget": { diff --git 
a/tests/fixtures/state/both_branches_close.c b/tests/fixtures/state/both_branches_close.c
new file mode 100644
index 00000000..8429bbc4
--- /dev/null
+++ b/tests/fixtures/state/both_branches_close.c
@@ -0,0 +1,12 @@
+#include <stdio.h>
+
+/* Both branches close f — no leak on any path.
+   Expected: NO state- findings. */
+void both_close(int cond) {
+    FILE *f = fopen("data.txt", "r");
+    if (cond) {
+        fclose(f);
+    } else {
+        fclose(f);
+    }
+}
diff --git a/tests/fixtures/state/chain_ops.c b/tests/fixtures/state/chain_ops.c
new file mode 100644
index 00000000..b320d260
--- /dev/null
+++ b/tests/fixtures/state/chain_ops.c
@@ -0,0 +1,13 @@
+#include <stdio.h>
+
+/* Multiple resource operations in sequence: open → read → write → close.
+   Tests that repeated uses do not corrupt lifecycle state.
+   Expected: NO state- findings. */
+void chain_ops(void) {
+    FILE *f = fopen("data.txt", "r");
+    char buf[256];
+    fread(buf, 1, sizeof(buf), f);
+    fwrite(buf, 1, sizeof(buf), f);
+    fread(buf, 1, sizeof(buf), f);
+    fclose(f);
+}
diff --git a/tests/fixtures/state/clean.c b/tests/fixtures/state/clean.c
new file mode 100644
index 00000000..43ddff70
--- /dev/null
+++ b/tests/fixtures/state/clean.c
@@ -0,0 +1,9 @@
+#include <stdio.h>
+
+void clean_usage() {
+    FILE *f = fopen("data.txt", "r");
+    char buf[256];
+    fread(buf, 1, sizeof(buf), f);
+    fclose(f);
+    // Clean: open, use, close — no bugs
+}
diff --git a/tests/fixtures/state/double_close.c b/tests/fixtures/state/double_close.c
new file mode 100644
index 00000000..b54af49f
--- /dev/null
+++ b/tests/fixtures/state/double_close.c
@@ -0,0 +1,7 @@
+#include <stdio.h>
+
+void double_close_bug() {
+    FILE *f = fopen("data.txt", "r");
+    fclose(f);
+    fclose(f); // BUG: double close
+}
diff --git a/tests/fixtures/state/double_close_branch.c b/tests/fixtures/state/double_close_branch.c
new file mode 100644
index 00000000..20a51fcd
--- /dev/null
+++ b/tests/fixtures/state/double_close_branch.c
@@ -0,0 +1,14 @@
+#include <stdio.h>
+
+/* fclose inside a branch, then unconditional fclose after.
+   True path: fclose(OPEN→CLOSED), then fclose(CLOSED) = double close.
+   False path: skip inner fclose, then fclose(OPEN→CLOSED) = fine.
+   Converged state at the second fclose: OPEN|CLOSED (join).
+   Expected: NO state-double-close (conservative: join masks the bug). */
+void double_close_branch(int cond) {
+    FILE *f = fopen("data.txt", "r");
+    if (cond) {
+        fclose(f);
+    }
+    fclose(f);
+}
diff --git a/tests/fixtures/state/double_close_straight.c b/tests/fixtures/state/double_close_straight.c
new file mode 100644
index 00000000..e1f53891
--- /dev/null
+++ b/tests/fixtures/state/double_close_straight.c
@@ -0,0 +1,10 @@
+#include <stdio.h>
+
+/* Straight-line double close — no branching ambiguity.
+   The converged state at the second fclose is definitely CLOSED.
+   Expected: state-double-close. */
+void double_close_straight(void) {
+    FILE *f = fopen("data.txt", "r");
+    fclose(f);
+    fclose(f);
+}
diff --git a/tests/fixtures/state/early_return_may_leak.c b/tests/fixtures/state/early_return_may_leak.c
new file mode 100644
index 00000000..3e854857
--- /dev/null
+++ b/tests/fixtures/state/early_return_may_leak.c
@@ -0,0 +1,11 @@
+#include <stdio.h>
+
+/* Early return leaks on the error path; normal path closes.
+   Expected: state-resource-leak-possible (may-leak). */
+void early_return_leak(int err) {
+    FILE *f = fopen("data.txt", "r");
+    if (err) {
+        return;
+    }
+    fclose(f);
+}
diff --git a/tests/fixtures/state/handle_overwrite.c b/tests/fixtures/state/handle_overwrite.c
new file mode 100644
index 00000000..b0533cb3
--- /dev/null
+++ b/tests/fixtures/state/handle_overwrite.c
@@ -0,0 +1,11 @@
+#include <stdio.h>
+
+/* The first fopen result is overwritten by the second fopen.
+   The first handle leaks silently because per-variable tracking
+   loses the old allocation. The second handle is properly closed.
+   Expected: NO state- findings (known per-variable-tracking limitation).
*/
+void overwrite_handle(void) {
+    FILE *f = fopen("a.txt", "r");
+    f = fopen("b.txt", "r");
+    fclose(f);
+}
diff --git a/tests/fixtures/state/loop_clean.c b/tests/fixtures/state/loop_clean.c
new file mode 100644
index 00000000..033c458a
--- /dev/null
+++ b/tests/fixtures/state/loop_clean.c
@@ -0,0 +1,14 @@
+#include <stdio.h>
+
+/* Open before loop, use inside loop, close after loop.
+   The back-edge should not prevent convergence.
+   Expected: NO state- findings. */
+void loop_clean(void) {
+    FILE *f = fopen("data.txt", "r");
+    char buf[256];
+    int i;
+    for (i = 0; i < 10; i++) {
+        fread(buf, 1, sizeof(buf), f);
+    }
+    fclose(f);
+}
diff --git a/tests/fixtures/state/loop_use_after_close.c b/tests/fixtures/state/loop_use_after_close.c
new file mode 100644
index 00000000..a58d21e8
--- /dev/null
+++ b/tests/fixtures/state/loop_use_after_close.c
@@ -0,0 +1,16 @@
+#include <stdio.h>
+
+/* Close before the loop, then use inside the loop body.
+   The back-edge means the use node joins CLOSED (first iter)
+   with CLOSED (back-edge, still CLOSED). The converged state
+   at the fread call is CLOSED → use-after-close.
+   Expected: state-use-after-close. */
+void loop_use_after_close(void) {
+    FILE *f = fopen("data.txt", "r");
+    fclose(f);
+    char buf[256];
+    int i;
+    for (i = 0; i < 10; i++) {
+        fread(buf, 1, sizeof(buf), f);
+    }
+}
diff --git a/tests/fixtures/state/malloc_free_clean.c b/tests/fixtures/state/malloc_free_clean.c
new file mode 100644
index 00000000..e3fbd24e
--- /dev/null
+++ b/tests/fixtures/state/malloc_free_clean.c
@@ -0,0 +1,10 @@
+#include <stdlib.h>
+
+/* malloc followed by free — clean.
+   Tests the memory resource pair.
+   Expected: NO state- findings.
*/ +void malloc_free_clean(void) { + void *p = malloc(100); + *(char *)p = 'x'; + free(p); +} diff --git a/tests/fixtures/state/malloc_leak.c b/tests/fixtures/state/malloc_leak.c new file mode 100644 index 00000000..b05f5c8d --- /dev/null +++ b/tests/fixtures/state/malloc_leak.c @@ -0,0 +1,9 @@ +#include <stdlib.h> + +/* malloc without free — resource leak. + Tests the memory resource pair (malloc → free). + Expected: state-resource-leak. */ +void malloc_leak(void) { + void *p = malloc(100); + *(char *)p = 'x'; +} diff --git a/tests/fixtures/state/may_leak_branch.c b/tests/fixtures/state/may_leak_branch.c new file mode 100644 index 00000000..7a7195c3 --- /dev/null +++ b/tests/fixtures/state/may_leak_branch.c @@ -0,0 +1,10 @@ +#include <stdio.h> + +/* Only the true branch closes f; the false branch leaks. + Expected: state-resource-leak-possible (NOT state-resource-leak). */ +void may_leak(int cond) { + FILE *f = fopen("data.txt", "r"); + if (cond) { + fclose(f); + } +} diff --git a/tests/fixtures/state/multiple_handles.c b/tests/fixtures/state/multiple_handles.c new file mode 100644 index 00000000..484295eb --- /dev/null +++ b/tests/fixtures/state/multiple_handles.c @@ -0,0 +1,11 @@ +#include <stdio.h> + +/* Two separate handles: f1 is closed, f2 is leaked. + Expected: state-resource-leak for f2, NO state-resource-leak for f1. + (The finding message should contain "f2".) */ +void multiple_handles(void) { + FILE *f1 = fopen("a.txt", "r"); + FILE *f2 = fopen("b.txt", "r"); + fclose(f1); + /* f2 never closed */ +} diff --git a/tests/fixtures/state/nested_branch_leak.c b/tests/fixtures/state/nested_branch_leak.c new file mode 100644 index 00000000..95995e78 --- /dev/null +++ b/tests/fixtures/state/nested_branch_leak.c @@ -0,0 +1,16 @@ +#include <stdio.h> + +/* Nested if — only the innermost branch closes. + Path true→true: fclose → CLOSED (clean) + Path true→false: no close → OPEN (leak) + Path false: no close → OPEN (leak) + Joined at exit: OPEN|CLOSED → may-leak. 
+ Expected: state-resource-leak-possible. 
*/ +void nested_branch_leak(int a, int b) { + FILE *f = fopen("data.txt", "r"); + if (a) { + if (b) { + fclose(f); + } + } +} diff --git a/tests/fixtures/state/reopen_after_close.c b/tests/fixtures/state/reopen_after_close.c new file mode 100644 index 00000000..a0038db0 --- /dev/null +++ b/tests/fixtures/state/reopen_after_close.c @@ -0,0 +1,12 @@ +#include <stdio.h> + +/* Open, close, then reopen the same variable and close again. + The second fopen overwrites CLOSED with OPEN; the second fclose + brings it back to CLOSED. Clean usage. + Expected: NO state- findings. */ +void reopen_after_close(void) { + FILE *f = fopen("a.txt", "r"); + fclose(f); + f = fopen("b.txt", "r"); + fclose(f); +} diff --git a/tests/fixtures/state/resource_leak.c b/tests/fixtures/state/resource_leak.c new file mode 100644 index 00000000..d502a888 --- /dev/null +++ b/tests/fixtures/state/resource_leak.c @@ -0,0 +1,9 @@ +#include <stdio.h> + +void resource_leak_bug() { + FILE *f = fopen("data.txt", "r"); + if (f == NULL) { + return; + } + // Missing fclose(f) — resource leak +} diff --git a/tests/fixtures/state/use_after_close.c b/tests/fixtures/state/use_after_close.c new file mode 100644 index 00000000..66e1a6b9 --- /dev/null +++ b/tests/fixtures/state/use_after_close.c @@ -0,0 +1,8 @@ +#include <stdio.h> + +void use_after_close_bug() { + FILE *f = fopen("data.txt", "r"); + fclose(f); + char buf[256]; + fread(buf, 1, sizeof(buf), f); // BUG: use after close +} diff --git a/tests/fixtures/state/use_closed_branch.c b/tests/fixtures/state/use_closed_branch.c new file mode 100644 index 00000000..c3a0690e --- /dev/null +++ b/tests/fixtures/state/use_closed_branch.c @@ -0,0 +1,16 @@ +#include <stdio.h> + +/* fclose in one branch, then unconditional fread after. + True path: fclose(f) → fread(CLOSED) = use-after-close. + False path: fread(OPEN) = fine. + Converged state at fread: OPEN|CLOSED (join). + Expected: NO state-use-after-close (conservative: join masks it). 
+ Expected: state-resource-leak-possible (false path never closes). 
*/ +void use_closed_branch(int cond) { + FILE *f = fopen("data.txt", "r"); + if (cond) { + fclose(f); + } + char buf[256]; + fread(buf, 1, sizeof(buf), f); +} diff --git a/tests/fixtures/taint_termination/heavy_loop.js b/tests/fixtures/taint_termination/heavy_loop.js new file mode 100644 index 00000000..6c3faee0 --- /dev/null +++ b/tests/fixtures/taint_termination/heavy_loop.js @@ -0,0 +1,64 @@ +// Synthetic fixture: many tainted variables in loops. +// Triggers divergent taint-map hashes on each loop iteration, +// exercising the BFS iteration limit in the taint engine. +// Without the limit the BFS would run forever. + +function heavyLoop(req) { + const userInput = req.query.data; // source + let a = userInput; + let b = a; + let c = b; + let d = c; + let e = d; + let f = e; + let g = f; + let h = g; + let i = h; + let j = i; + + // Loop with accumulating taint + for (let k = 0; k < 100; k++) { + a = b + c; + b = c + d; + c = d + e; + d = e + f; + e = f + g; + f = g + h; + g = h + i; + h = i + j; + i = j + a; + j = a + b; + } + + // Nested loop + for (let m = 0; m < 10; m++) { + for (let n = 0; n < 10; n++) { + a = b + c + d; + b = c + d + e; + c = d + e + f; + } + } + + // Sink: eval with tainted data + eval(a + b + c + d + e); +} + +function multiSource(req, res) { + const x1 = req.query.a; + const x2 = req.query.b; + const x3 = req.query.c; + const x4 = req.query.d; + const x5 = req.query.e; + const x6 = req.query.f; + const x7 = req.query.g; + const x8 = req.query.h; + + let result = x1; + for (let i = 0; i < 20; i++) { + result = result + x2 + x3; + const tmp = x4 + x5 + x6; + result = result + tmp + x7 + x8; + } + + eval(result); +} diff --git a/tests/integration_tests.rs b/tests/integration_tests.rs index 791b40ba..b2c3b455 100644 --- a/tests/integration_tests.rs +++ b/tests/integration_tests.rs @@ -80,8 +80,8 @@ fn taint_only_mode_excludes_ast() { let diags = scan_fixture_dir(&dir, AnalysisMode::Taint); // Taint mode should not produce AST-only pattern 
findings - assert_no_findings(&diags, "unwrap_call"); - assert_no_findings(&diags, "expect_call"); + assert_no_findings(&diags, "rs.quality.unwrap"); + assert_no_findings(&diags, "rs.quality.expect"); } #[test] @@ -160,13 +160,9 @@ fn binary_json_output() { ); let stdout = String::from_utf8_lossy(&cmd.stdout); - // Find the JSON array line in stdout (config notes and "Finished" surround it) + // Find the JSON array in stdout (config notes and "Finished" surround it) let json_start = stdout.find('[').expect("Expected JSON array in stdout"); - let json_end = stdout[json_start..] - .find(']') - .expect("Expected closing bracket in JSON") - + json_start - + 1; + let json_end = stdout.rfind(']').expect("Expected closing bracket in JSON") + 1; let json_str = &stdout[json_start..json_end]; let parsed: Vec = serde_json::from_str(json_str).expect("stdout should contain valid JSON array"); diff --git a/tests/pattern_tests.rs b/tests/pattern_tests.rs new file mode 100644 index 00000000..77709096 --- /dev/null +++ b/tests/pattern_tests.rs @@ -0,0 +1,500 @@ +//! Pattern sanity tests and positive/negative fixture validation. +//! +//! These tests verify that: +//! 1. All pattern IDs are globally unique. +//! 2. All tree-sitter queries compile without error. +//! 3. All patterns have non-empty descriptions and valid severity/tier/category. +//! 4. Positive fixtures trigger expected patterns. +//! 5. Negative fixtures do NOT trigger security patterns. 
+ +use nyx_scanner::patterns::{self, PatternTier, Severity}; +use std::collections::{HashMap, HashSet}; +use std::path::PathBuf; +use tree_sitter::{Language, Query, QueryCursor, StreamingIterator}; + +// ── Helpers ────────────────────────────────────────────────────────────────── + +fn fixture_path(lang: &str, kind: &str) -> PathBuf { + PathBuf::from(env!("CARGO_MANIFEST_DIR")) + .join("tests/fixtures/patterns") + .join(lang) + .join(kind) +} + +fn ts_lang_for(slug: &str) -> Language { + match slug { + "rust" => Language::from(tree_sitter_rust::LANGUAGE), + "java" => Language::from(tree_sitter_java::LANGUAGE), + "python" => Language::from(tree_sitter_python::LANGUAGE), + "javascript" => Language::from(tree_sitter_javascript::LANGUAGE), + "typescript" => Language::from(tree_sitter_typescript::LANGUAGE_TYPESCRIPT), + "c" => Language::from(tree_sitter_c::LANGUAGE), + "cpp" => Language::from(tree_sitter_cpp::LANGUAGE), + "go" => Language::from(tree_sitter_go::LANGUAGE), + "php" => Language::from(tree_sitter_php::LANGUAGE_PHP), + "ruby" => Language::from(tree_sitter_ruby::LANGUAGE), + _ => panic!("unknown language: {slug}"), + } +} + +/// Run all patterns for a language against source bytes. +/// Returns the set of pattern IDs that matched at least once. 
+fn run_patterns(slug: &str, source: &[u8]) -> HashSet<String> { + let ts_lang = ts_lang_for(slug); + let pats = patterns::load(slug); + let mut matched = HashSet::new(); + + let mut parser = tree_sitter::Parser::new(); + parser.set_language(&ts_lang).expect("set language"); + let tree = parser.parse(source, None).expect("parse"); + let root = tree.root_node(); + + for pat in &pats { + let query = match Query::new(&ts_lang, pat.query) { + Ok(q) => q, + Err(_) => continue, + }; + let mut cursor = QueryCursor::new(); + let mut matches = cursor.matches(&query, root, source); + if matches.next().is_some() { + matched.insert(pat.id.to_string()); + } + } + + matched +} + +// ── All languages for iteration ────────────────────────────────────────────── + +const ALL_LANGS: &[&str] = &[ + "rust", + "java", + "python", + "javascript", + "typescript", + "c", + "cpp", + "go", + "php", + "ruby", +]; + +// ── Sanity tests ───────────────────────────────────────────────────────────── + +#[test] +fn all_pattern_ids_are_globally_unique() { + let mut seen: HashMap<String, String> = HashMap::new(); + let mut dupes = Vec::new(); + + for &lang in ALL_LANGS { + for pat in patterns::load(lang) { + if let Some(prev_lang) = seen.insert(pat.id.to_string(), lang.to_string()) { + // Same lang alias is ok (e.g. 
"js" and "javascript" share patterns) + if prev_lang != lang { + dupes.push(format!("{} (in {} and {})", pat.id, prev_lang, lang)); + } + } + } + } + + assert!( + dupes.is_empty(), + "Duplicate pattern IDs across languages:\n {}", + dupes.join("\n ") + ); +} + +#[test] +fn all_queries_compile() { + let mut errors = Vec::new(); + + for &lang in ALL_LANGS { + let ts_lang = ts_lang_for(lang); + for pat in patterns::load(lang) { + if let Err(e) = Query::new(&ts_lang, pat.query) { + errors.push(format!("[{}] {}: {}", lang, pat.id, e)); + } + } + } + + assert!( + errors.is_empty(), + "Pattern query compilation errors:\n {}", + errors.join("\n ") + ); +} + +#[test] +fn all_descriptions_non_empty() { + for &lang in ALL_LANGS { + for pat in patterns::load(lang) { + assert!( + !pat.description.trim().is_empty(), + "Pattern {} has empty description", + pat.id + ); + } + } +} + +#[test] +fn all_ids_follow_naming_convention() { + // IDs should be <lang>.<category>.<specific> with dots + for &lang in ALL_LANGS { + for pat in patterns::load(lang) { + let parts: Vec<&str> = pat.id.split('.').collect(); + assert!( + parts.len() == 3, + "Pattern ID '{}' should have 3 dot-separated parts (lang.category.specific), got {}", + pat.id, + parts.len() + ); + // First part should be a short lang prefix + assert!( + parts[0].len() <= 4, + "Pattern ID '{}' language prefix '{}' too long (max 4 chars)", + pat.id, + parts[0] + ); + } + } +} + +#[test] +fn severity_distribution_reasonable() { + // Sanity: no language should have ALL patterns at the same severity + for &lang in ALL_LANGS { + let pats = patterns::load(lang); + if pats.len() < 3 { + continue; + } + let severities: HashSet<_> = pats.iter().map(|p| p.severity).collect(); + // At least 2 different severity levels if >= 5 patterns + if pats.len() >= 5 { + assert!( + severities.len() >= 2, + "{} has {} patterns but only 1 severity level", + lang, + pats.len() + ); + } + } +} + +#[test] +fn tier_a_patterns_have_no_heuristic_in_description() { + // 
Tier A patterns 
should not reference "concatenation" or "format" heuristics + // (that's Tier B territory). This is a soft check. + let heuristic_words = ["concatenat", "non-literal", "heuristic"]; + let mut violations = Vec::new(); + + for &lang in ALL_LANGS { + for pat in patterns::load(lang) { + if pat.tier == PatternTier::A { + let desc_lower = pat.description.to_lowercase(); + for word in &heuristic_words { + if desc_lower.contains(word) { + violations.push(format!( + "{}: Tier A but description mentions '{}'", + pat.id, word + )); + } + } + } + } + } + + // Warn but don't fail — descriptions are informational + if !violations.is_empty() { + eprintln!( + "WARNING: Tier A patterns with heuristic-like descriptions:\n {}", + violations.join("\n ") + ); + } +} + +// ── Positive fixture tests ─────────────────────────────────────────────────── +// Each test verifies that the positive fixture triggers at least the listed IDs. + +fn assert_positive_match(lang: &str, fixture_file: &str, expected_ids: &[&str]) { + let path = fixture_path(lang, fixture_file); + if !path.exists() { + eprintln!("SKIP: fixture not found: {}", path.display()); + return; + } + let source = std::fs::read(&path).expect("read fixture"); + let matched = run_patterns(lang, &source); + + let mut missing = Vec::new(); + for &id in expected_ids { + if !matched.contains(id) { + missing.push(id); + } + } + + assert!( + missing.is_empty(), + "[{}] Positive fixture '{}' did not trigger expected patterns:\n missing: {:?}\n matched: {:?}", + lang, + fixture_file, + missing, + matched + ); +} + +#[test] +fn positive_rust() { + assert_positive_match( + "rust", + "positive.rs", + &[ + "rs.memory.transmute", + "rs.memory.copy_nonoverlapping", + "rs.memory.get_unchecked", + "rs.memory.mem_zeroed", + "rs.memory.ptr_read", + "rs.quality.unsafe_block", + "rs.quality.unsafe_fn", + "rs.quality.unwrap", + "rs.quality.expect", + "rs.quality.panic_macro", + "rs.quality.todo", + "rs.memory.narrow_cast", + "rs.memory.mem_forget", + ], 
+ ); +} + +#[test] +fn positive_java() { + assert_positive_match( + "java", + "positive.java", + &[ + "java.deser.readobject", + "java.cmdi.runtime_exec", + "java.reflection.class_forname", + "java.reflection.method_invoke", + "java.sqli.execute_concat", + "java.crypto.insecure_random", + ], + ); +} + +#[test] +fn positive_python() { + assert_positive_match( + "python", + "positive.py", + &[ + "py.code_exec.eval", + "py.code_exec.exec", + "py.cmdi.os_system", + "py.cmdi.os_popen", + "py.deser.pickle_loads", + "py.deser.yaml_load", + ], + ); +} + +#[test] +fn positive_javascript() { + assert_positive_match( + "javascript", + "positive.js", + &[ + "js.code_exec.eval", + "js.code_exec.new_function", + "js.code_exec.settimeout_string", + "js.xss.document_write", + "js.xss.outer_html", + "js.xss.insert_adjacent_html", + "js.prototype.proto_assignment", + "js.xss.cookie_write", + ], + ); +} + +#[test] +fn positive_typescript() { + assert_positive_match( + "typescript", + "positive.ts", + &[ + "ts.code_exec.eval", + "ts.code_exec.new_function", + "ts.code_exec.settimeout_string", + "ts.xss.document_write", + "ts.xss.outer_html", + "ts.xss.insert_adjacent_html", + "ts.quality.any_annotation", + "ts.quality.as_any", + "ts.prototype.proto_assignment", + "ts.xss.cookie_write", + ], + ); +} + +#[test] +fn positive_c() { + assert_positive_match( + "c", + "positive.c", + &[ + "c.memory.gets", + "c.memory.strcpy", + "c.memory.strcat", + "c.memory.sprintf", + "c.memory.scanf_percent_s", + "c.cmdi.system", + "c.cmdi.popen", + "c.memory.printf_no_fmt", + ], + ); +} + +#[test] +fn positive_cpp() { + assert_positive_match( + "cpp", + "positive.cpp", + &[ + "cpp.memory.gets", + "cpp.memory.strcpy", + "cpp.memory.strcat", + "cpp.memory.sprintf", + "cpp.cmdi.system", + "cpp.memory.reinterpret_cast", + "cpp.memory.const_cast", + "cpp.memory.printf_no_fmt", + ], + ); +} + +#[test] +fn positive_go() { + assert_positive_match( + "go", + "positive.go", + &["go.cmdi.exec_command", 
"go.crypto.md5", "go.crypto.sha1"], + ); +} + +#[test] +fn positive_php() { + assert_positive_match( + "php", + "positive.php", + &[ + "php.code_exec.eval", + "php.code_exec.create_function", + "php.cmdi.system", + "php.deser.unserialize", + ], + ); +} + +#[test] +fn positive_ruby() { + assert_positive_match( + "ruby", + "positive.rb", + &[ + "rb.code_exec.eval", + "rb.code_exec.instance_eval", + "rb.code_exec.class_eval", + "rb.cmdi.backtick", + "rb.deser.yaml_load", + "rb.deser.marshal_load", + "rb.reflection.constantize", + ], + ); +} + +// ── Negative fixture tests ─────────────────────────────────────────────────── +// Negative fixtures should produce zero matches for High/Medium security patterns. + +fn get_security_pattern_ids(lang: &str) -> HashSet<String> { + patterns::load(lang) + .into_iter() + .filter(|p| { + p.severity != Severity::Low + && !matches!( + p.category, + nyx_scanner::patterns::PatternCategory::CodeQuality + ) + }) + .map(|p| p.id.to_string()) + .collect() +} + +fn assert_negative_no_security_match(lang: &str, fixture_file: &str) { + let path = fixture_path(lang, fixture_file); + if !path.exists() { + eprintln!("SKIP: fixture not found: {}", path.display()); + return; + } + let source = std::fs::read(&path).expect("read fixture"); + let matched = run_patterns(lang, &source); + let security_ids = get_security_pattern_ids(lang); + + let false_positives: Vec<_> = matched.intersection(&security_ids).collect(); + + assert!( + false_positives.is_empty(), + "[{}] Negative fixture '{}' triggered security patterns (false positives):\n {:?}", + lang, + fixture_file, + false_positives + ); +} + +#[test] +fn negative_rust() { + assert_negative_no_security_match("rust", "negative.rs"); +} + +#[test] +fn negative_java() { + assert_negative_no_security_match("java", "negative.java"); +} + +#[test] +fn negative_python() { + assert_negative_no_security_match("python", "negative.py"); +} + +#[test] +fn negative_javascript() { + 
assert_negative_no_security_match("javascript", "negative.js"); +} + +#[test] +fn negative_typescript() { + assert_negative_no_security_match("typescript", "negative.ts"); +} + +#[test] +fn negative_c() { + assert_negative_no_security_match("c", "negative.c"); +} + +#[test] +fn negative_cpp() { + assert_negative_no_security_match("cpp", "negative.cpp"); +} + +#[test] +fn negative_go() { + assert_negative_no_security_match("go", "negative.go"); +} + +#[test] +fn negative_php() { + assert_negative_no_security_match("php", "negative.php"); +} + +#[test] +fn negative_ruby() { + assert_negative_no_security_match("ruby", "negative.rb"); +} diff --git a/tests/real_world_tests.rs b/tests/real_world_tests.rs new file mode 100644 index 00000000..049125db --- /dev/null +++ b/tests/real_world_tests.rs @@ -0,0 +1,537 @@ +//! Real-world vulnerability fixture test suite. +//! +//! Scans realistic code snippets (20–120 lines) across all 10 supported languages +//! and compares findings against `.expect.json` expectation files. +//! +//! # Environment Variables +//! +//! - `NYX_TEST_LANG=python` — run only fixtures for one language +//! - `NYX_TEST_FIXTURE=cmdi_subprocess` — run only fixtures whose name contains this string +//! - `NYX_TEST_VERBOSE=1` — print full diff details for every fixture +//! - `NYX_TEST_CATEGORY=taint` — run only one category (taint/cfg/state/mixed) +//! +//! # Known-failure handling +//! +//! Expectations with `"must_match": false` are tracked but do not cause test failure. +//! A summary of soft misses is always printed at the end. 
+mod common; + +use common::test_config; +use nyx_scanner::commands::scan::Diag; +use nyx_scanner::utils::config::AnalysisMode; +use serde::Deserialize; +use std::collections::BTreeMap; +use std::path::{Path, PathBuf}; +use std::sync::OnceLock; + +// ── Expectation schema ─────────────────────────────────────────────────────── + +#[derive(Debug, Clone, Deserialize)] +#[allow(dead_code)] +struct RealWorldExpectations { + /// Human description of what this fixture tests. + #[serde(default)] + description: String, + /// Tags for coverage matrix (e.g. ["taint", "cmdi", "express"]). + #[serde(default)] + tags: Vec<String>, + /// Which analysis modes this fixture targets. + #[serde(default = "default_modes")] + modes: Vec<String>, + /// Expected findings. + expected: Vec<ExpectedFinding>, +} + +fn default_modes() -> Vec<String> { + vec!["full".to_string()] +} + +#[derive(Debug, Clone, Deserialize)] +struct ExpectedFinding { + /// Rule ID substring to match (e.g. "taint-" or "js.xss.innerhtml"). + rule_id: String, + /// Severity (optional, not checked if absent). + #[serde(default)] + severity: Option<String>, + /// If true, missing this finding is a hard failure. If false, it's a soft miss. + #[serde(default = "default_must_match")] + must_match: bool, + /// Line number or range [start, end] where finding should appear. + #[serde(default)] + line_range: Option<(usize, usize)>, + /// Substrings that must appear in message or evidence fields. + #[serde(default)] + evidence_contains: Vec<String>, + /// Human explanation of this expectation. + #[serde(default)] + notes: String, +} + +fn default_must_match() -> bool { + true +} + +// ── Fixture discovery ──────────────────────────────────────────────────────── + +#[derive(Debug, Clone)] +struct Fixture { + /// Language slug (rust, c, cpp, java, go, php, python, ruby, typescript, javascript). + lang: String, + /// Category (taint, cfg, state, mixed). + category: String, + /// Fixture name (stem of source file). 
+ name: String, + /// Path to the source fixture file. 
+ source_path: PathBuf, + /// Parsed expectations. + expectations: RealWorldExpectations, +} + +fn discover_fixtures() -> Vec<Fixture> { + let base = Path::new(env!("CARGO_MANIFEST_DIR")).join("tests/fixtures/real_world"); + let mut fixtures = Vec::new(); + + let langs = [ + "rust", + "c", + "cpp", + "java", + "go", + "php", + "python", + "ruby", + "typescript", + "javascript", + ]; + let categories = ["taint", "cfg", "state", "mixed"]; + + for lang in &langs { + for category in &categories { + let dir = base.join(lang).join(category); + if !dir.is_dir() { + continue; + } + + // Find all .expect.json files, derive source file from them. + let Ok(entries) = std::fs::read_dir(&dir) else { + continue; + }; + for entry in entries.flatten() { + let path = entry.path(); + let fname = path.file_name().unwrap().to_string_lossy().to_string(); + if !fname.ends_with(".expect.json") { + continue; + } + + let stem = fname.trim_end_matches(".expect.json"); + + // Find the corresponding source file (any extension). + let source_path = find_source_file(&dir, stem); + let Some(source_path) = source_path else { + eprintln!( + "WARN: no source file for {}/{}/{}/{}", + lang, category, stem, fname + ); + continue; + }; + + let expect_content = std::fs::read_to_string(&path).unwrap_or_else(|e| { + panic!("Failed to read {}: {e}", path.display()); + }); + let expectations: RealWorldExpectations = serde_json::from_str(&expect_content) + .unwrap_or_else(|e| { + panic!("Failed to parse {}: {e}", path.display()); + }); + + fixtures.push(Fixture { + lang: lang.to_string(), + category: category.to_string(), + name: stem.to_string(), + source_path, + expectations, + }); + } + } + } + + // Sort for deterministic ordering. 
+ fixtures.sort_by(|a, b| { + a.lang + .cmp(&b.lang) + .then(a.category.cmp(&b.category)) + .then(a.name.cmp(&b.name)) + }); + + fixtures +} + +fn find_source_file(dir: &Path, stem: &str) -> Option<PathBuf> { + let extensions = [ + "rs", "c", "cpp", "cc", "cxx", "java", "go", "php", "py", "rb", "ts", "tsx", "js", "jsx", + ]; + for ext in &extensions { + let candidate = dir.join(format!("{stem}.{ext}")); + if candidate.exists() { + return Some(candidate); + } + } + None +} + +// ── Scanning ───────────────────────────────────────────────────────────────── + +fn scan_fixture(fixture: &Fixture, mode: AnalysisMode) -> Vec<Diag> { + // We scan the parent directory containing just this fixture file. + // To isolate, we copy the fixture to a temp dir. + let tmp = tempfile::TempDir::with_prefix("nyx_rw_test_").expect("tempdir"); + let dest = tmp.path().join(fixture.source_path.file_name().unwrap()); + std::fs::copy(&fixture.source_path, &dest).expect("copy fixture"); + + let cfg = test_config(mode); + let mut diags = + nyx_scanner::scan_no_index(tmp.path(), &cfg).expect("scan_no_index should succeed"); + + // Normalize paths to just the filename for comparison. + for d in &mut diags { + if let Some(fname) = Path::new(&d.path).file_name() { + d.path = fname.to_string_lossy().to_string(); + } + } + + // Sort deterministically. 
+ diags.sort_by(|a, b| { + a.path + .cmp(&b.path) + .then(a.line.cmp(&b.line)) + .then(a.id.cmp(&b.id)) + .then(a.col.cmp(&b.col)) + }); + + diags +} + +// ── Matching ───────────────────────────────────────────────────────────────── + +#[derive(Debug)] +struct MatchResult { + hard_misses: Vec<(ExpectedFinding, String)>, + soft_misses: Vec<(ExpectedFinding, String)>, + unexpected: Vec<Diag>, + matched: usize, +} + +fn match_expectations( + diags: &[Diag], + expectations: &[ExpectedFinding], + fixture_file: &str, +) -> MatchResult { + let mut hard_misses = Vec::new(); + let mut soft_misses = Vec::new(); + let mut matched_indices: Vec<bool> = vec![false; diags.len()]; + let mut matched = 0; + + for exp in expectations { + let found = diags.iter().enumerate().any(|(i, d)| { + if matched_indices[i] { + return false; + } + if !d.id.contains(&exp.rule_id) { + return false; + } + // Check file + if !d.path.contains(fixture_file) && fixture_file != d.path { + return false; + } + // Check severity if specified + if let Some(ref sev) = exp.severity + && d.severity.as_db_str() != sev.to_uppercase() + { + return false; + } + // Check line range if specified + if let Some((start, end)) = exp.line_range + && (d.line < start || d.line > end) + { + return false; + } + // Check evidence substrings + for substr in &exp.evidence_contains { + let msg = d.message.as_deref().unwrap_or(""); + let ev_text = if let Some(ev) = &d.evidence { + let mut parts = Vec::new(); + if let Some(src) = &ev.source { + parts.push(format!( + "source: {}", + src.snippet.as_deref().unwrap_or(&src.kind) + )); + } + if let Some(snk) = &ev.sink { + parts.push(format!( + "sink: {}", + snk.snippet.as_deref().unwrap_or(&snk.kind) + )); + } + for note in &ev.notes { + parts.push(note.clone()); + } + parts.join(" ") + } else { + String::new() + }; + let combined = format!("{msg} {ev_text}"); + if !combined.to_lowercase().contains(&substr.to_lowercase()) { + return false; + } + } + matched_indices[i] = true; + true + 
}); + + if 
found { + matched += 1; + } else { + let reason = format!( + "rule_id='{}' severity={:?} line_range={:?}", + exp.rule_id, exp.severity, exp.line_range + ); + if exp.must_match { + hard_misses.push((exp.clone(), reason)); + } else { + soft_misses.push((exp.clone(), reason)); + } + } + } + + // Unexpected = diags not matched by any expectation (informational only). + let unexpected: Vec<Diag> = diags + .iter() + .enumerate() + .filter(|(i, _)| !matched_indices[*i]) + .map(|(_, d)| d.clone()) + .collect(); + + MatchResult { + hard_misses, + soft_misses, + unexpected, + matched, + } +} + +// ── Mode resolution ────────────────────────────────────────────────────────── + +fn resolve_mode(mode_str: &str) -> AnalysisMode { + match mode_str.to_lowercase().as_str() { + "ast" => AnalysisMode::Ast, + "taint" => AnalysisMode::Taint, + "full" => AnalysisMode::Full, + _ => AnalysisMode::Full, + } +} + +// ── Coverage matrix ────────────────────────────────────────────────────────── + +fn print_coverage_matrix(fixtures: &[Fixture]) { + let mut matrix: BTreeMap<String, BTreeMap<String, usize>> = BTreeMap::new(); + let mut tag_counts: BTreeMap<String, usize> = BTreeMap::new(); + + for f in fixtures { + *matrix + .entry(f.lang.clone()) + .or_default() + .entry(f.category.clone()) + .or_default() += 1; + for tag in &f.expectations.tags { + *tag_counts.entry(tag.clone()).or_default() += 1; + } + } + + eprintln!("\n╔══════════════════════════════════════════════════════════╗"); + eprintln!("║ REAL-WORLD TEST COVERAGE MATRIX ║"); + eprintln!("╠══════════════╦════════╦══════╦════════╦════════╦════════╣"); + eprintln!("║ Language ║ Taint ║ CFG ║ State ║ Mixed ║ Total ║"); + eprintln!("╠══════════════╬════════╬══════╬════════╬════════╬════════╣"); + + let mut grand_total = 0; + for (lang, cats) in &matrix { + let t = cats.get("taint").unwrap_or(&0); + let c = cats.get("cfg").unwrap_or(&0); + let s = cats.get("state").unwrap_or(&0); + let m = cats.get("mixed").unwrap_or(&0); + let 
total = t + c + s + m; + grand_total += total; + eprintln!( + 
"║ {:<12} ║ {:>4} ║ {:>3} ║ {:>4} ║ {:>4} ║ {:>4} ║", + lang, t, c, s, m, total + ); + } + eprintln!("╠══════════════╬════════╬══════╬════════╬════════╬════════╣"); + eprintln!( + "║ TOTAL ║ ║ ║ ║ ║ {:>4} ║", + grand_total + ); + eprintln!("╚══════════════╩════════╩══════╩════════╩════════╩════════╝"); + + if !tag_counts.is_empty() { + eprintln!("\nTag distribution:"); + for (tag, count) in &tag_counts { + eprintln!(" {tag}: {count}"); + } + } +} + +// ── Main test ──────────────────────────────────────────────────────────────── + +static ALL_FIXTURES: OnceLock<Vec<Fixture>> = OnceLock::new(); + +fn get_fixtures() -> &'static [Fixture] { + ALL_FIXTURES.get_or_init(discover_fixtures) +} + +fn should_run(fixture: &Fixture) -> bool { + if let Ok(lang) = std::env::var("NYX_TEST_LANG") + && !fixture.lang.eq_ignore_ascii_case(&lang) + { + return false; + } + if let Ok(name) = std::env::var("NYX_TEST_FIXTURE") + && !fixture.name.contains(&name) + { + return false; + } + if let Ok(cat) = std::env::var("NYX_TEST_CATEGORY") + && !fixture.category.eq_ignore_ascii_case(&cat) + { + return false; + } + true +} + +fn is_verbose() -> bool { + std::env::var("NYX_TEST_VERBOSE").is_ok() +} + +#[test] +fn real_world_fixture_suite() { + let fixtures = get_fixtures(); + let verbose = is_verbose(); + + let active: Vec<&Fixture> = fixtures.iter().filter(|f| should_run(f)).collect(); + + if active.is_empty() { + eprintln!( + "No fixtures matched filters. 
Total available: {}", + fixtures.len() + ); + print_coverage_matrix(fixtures); + return; + } + + eprintln!( + "\nRunning {} real-world fixtures (of {} total)\n", + active.len(), + fixtures.len() + ); + + let mut total_hard_fails = 0; + let mut total_soft_misses = 0; + let mut total_matched = 0; + let mut total_unexpected = 0; + let mut failure_details: Vec<String> = Vec::new(); + let mut soft_miss_details: Vec<String> = Vec::new(); + + for fixture in &active { + let fixture_label = format!("{}/{}/{}", fixture.lang, fixture.category, fixture.name); + + for mode_str in &fixture.expectations.modes { + let mode = resolve_mode(mode_str); + let diags = scan_fixture(fixture, mode); + let fixture_file = fixture + .source_path + .file_name() + .unwrap() + .to_string_lossy() + .to_string(); + + let result = match_expectations(&diags, &fixture.expectations.expected, &fixture_file); + + total_matched += result.matched; + total_unexpected += result.unexpected.len(); + + if !result.hard_misses.is_empty() { + let mut msg = format!("FAIL {fixture_label} [{mode_str}]:"); + for (exp, reason) in &result.hard_misses { + msg.push_str(&format!( + "\n MISSING (must_match): {} — {}", + reason, exp.notes + )); + } + failure_details.push(msg); + total_hard_fails += result.hard_misses.len(); + } + + if !result.soft_misses.is_empty() { + let mut msg = format!("SOFT {fixture_label} [{mode_str}]:"); + for (exp, reason) in &result.soft_misses { + msg.push_str(&format!("\n soft miss: {} — {}", reason, exp.notes)); + } + soft_miss_details.push(msg); + total_soft_misses += result.soft_misses.len(); + } + + if verbose { + eprintln!( + " {fixture_label} [{mode_str}]: {} matched, {} hard misses, {} soft misses, {} unexpected", + result.matched, + result.hard_misses.len(), + result.soft_misses.len(), + result.unexpected.len() + ); + if !result.unexpected.is_empty() { + for d in &result.unexpected { + eprintln!( + " EXTRA: {}:{} [{}] {}", + d.path, + d.line, + d.severity.as_db_str(), + d.id + ); + } + } + 
} + } + } + + 
// Print coverage matrix. + print_coverage_matrix(fixtures); + + // Print summary. + eprintln!("\n────────────────────────────────────────────────────"); + eprintln!( + "RESULTS: {} matched, {} hard failures, {} soft misses, {} unexpected", + total_matched, total_hard_fails, total_soft_misses, total_unexpected + ); + eprintln!("────────────────────────────────────────────────────"); + + if !failure_details.is_empty() { + eprintln!("\n=== HARD FAILURES (must_match=true) ==="); + for msg in &failure_details { + eprintln!("{msg}"); + } + } + + if !soft_miss_details.is_empty() { + eprintln!("\n=== SOFT MISSES (must_match=false, informational) ==="); + for msg in &soft_miss_details { + eprintln!("{msg}"); + } + } + + // Hard failures cause test failure. + assert_eq!( + total_hard_fails, 0, + "{total_hard_fails} expected findings not found (must_match=true). \ + Run with NYX_TEST_VERBOSE=1 for details." + ); +} diff --git a/tests/state_tests.rs b/tests/state_tests.rs new file mode 100644 index 00000000..a84e1c5b --- /dev/null +++ b/tests/state_tests.rs @@ -0,0 +1,306 @@ +mod common; + +use nyx_scanner::commands::scan::Diag; +use nyx_scanner::utils::config::{AnalysisMode, Config}; +use std::path::PathBuf; +use std::sync::OnceLock; + +fn state_fixture_dir() -> PathBuf { + PathBuf::from(env!("CARGO_MANIFEST_DIR")) + .join("tests") + .join("fixtures") + .join("state") +} + +fn state_config() -> Config { + let mut cfg = common::test_config(AnalysisMode::Full); + cfg.scanner.enable_state_analysis = true; + cfg +} + +/// Scan the fixtures directory once and cache the result for all tests. +/// Every test in this module filters the shared result by filename. 
+fn scan_all_fixtures() -> &'static Vec<Diag> { + static DIAGS: OnceLock<Vec<Diag>> = OnceLock::new(); + DIAGS.get_or_init(|| { + let cfg = state_config(); + nyx_scanner::scan_no_index(&state_fixture_dir(), &cfg).expect("scan should succeed") + }) +} + +// ── Helpers ────────────────────────────────────────────────────────────── + +fn state_diags_for(filename: &str) -> Vec<&'static Diag> { + scan_all_fixtures() + .iter() + .filter(|d| d.path.contains(filename) && d.id.starts_with("state-")) + .collect() +} + +fn state_ids_for(filename: &str) -> Vec<String> { + state_diags_for(filename) + .iter() + .map(|d| d.id.clone()) + .collect() +} + +fn has_rule(filename: &str, rule_id: &str) -> bool { + state_diags_for(filename).iter().any(|d| d.id == rule_id) +} + +fn has_rule_prefix(filename: &str, prefix: &str) -> bool { + state_diags_for(filename) + .iter() + .any(|d| d.id.starts_with(prefix)) +} + +fn assert_has(filename: &str, rule_id: &str) { + assert!( + has_rule(filename, rule_id), + "Expected {rule_id} in {filename}.\n Got: {:?}", + state_ids_for(filename) + ); +} + +fn assert_has_prefix(filename: &str, prefix: &str) { + assert!( + has_rule_prefix(filename, prefix), + "Expected finding starting with `{prefix}` in {filename}.\n Got: {:?}", + state_ids_for(filename) + ); +} + +fn assert_absent(filename: &str, rule_id: &str) { + assert!( + !has_rule(filename, rule_id), + "Did NOT expect {rule_id} in {filename}.\n Got: {:?}", + state_ids_for(filename) + ); +} + +fn assert_no_state_findings(filename: &str) { + let found = state_ids_for(filename); + assert!( + found.is_empty(), + "Expected zero state findings in {filename}.\n Got: {:?}", + found + ); +} + +fn assert_message_contains(filename: &str, rule_id: &str, substr: &str) { + let matching: Vec<_> = state_diags_for(filename) + .into_iter() + .filter(|d| d.id == rule_id) + .collect(); + assert!( + matching + .iter() + .any(|d| d.message.as_deref().unwrap_or("").contains(substr)), + "Expected {rule_id} in {filename} with message containing 
`{substr}`.\n Messages: {:?}", + matching + .iter() + .map(|d| d.message.as_deref().unwrap_or("(none)")) + .collect::<Vec<_>>() + ); +} + +// ═══════════════════════════════════════════════════════════════════════ +// Original basic tests +// ═══════════════════════════════════════════════════════════════════════ + +#[test] +fn detects_use_after_close() { + assert_has("use_after_close.c", "state-use-after-close"); +} + +#[test] +fn detects_double_close() { + assert_has("double_close.c", "state-double-close"); +} + +#[test] +fn detects_resource_leak() { + assert_has_prefix("resource_leak.c", "state-resource-leak"); +} + +#[test] +fn clean_usage_no_state_findings() { + assert_no_state_findings("clean.c"); +} + +#[test] +fn state_analysis_off_by_default() { + let mut cfg = common::test_config(AnalysisMode::Full); + cfg.scanner.enable_state_analysis = false; + let diags = + nyx_scanner::scan_no_index(&state_fixture_dir(), &cfg).expect("scan should succeed"); + let state: Vec<_> = diags + .iter() + .filter(|d| d.id.starts_with("state-")) + .collect(); + assert!( + state.is_empty(), + "State findings should not appear when enable_state_analysis is false.\n Got: {:?}", + state.iter().map(|d| &d.id).collect::<Vec<_>>() + ); +} + +// ═══════════════════════════════════════════════════════════════════════ +// (1) May-leak vs must-leak (branch semantics) +// ═══════════════════════════════════════════════════════════════════════ + +#[test] +fn may_leak_branch_emits_possible_not_definite() { + // Only the true branch closes → OPEN|CLOSED at exit → may-leak. + assert_has("may_leak_branch.c", "state-resource-leak-possible"); + assert_absent("may_leak_branch.c", "state-resource-leak"); +} + +#[test] +fn early_return_may_leak() { + // Early return leaks; normal path closes → OPEN|CLOSED at exit → may-leak. 
+ assert_has("early_return_may_leak.c", "state-resource-leak-possible"); + assert_absent("early_return_may_leak.c", "state-resource-leak"); +} + +#[test] +fn nested_branch_may_leak() { + // Only innermost branch closes → OPEN|CLOSED at exit → may-leak. + assert_has("nested_branch_leak.c", "state-resource-leak-possible"); + assert_absent("nested_branch_leak.c", "state-resource-leak"); +} + +#[test] +fn both_branches_close_no_leak() { + // Both branches close f → CLOSED at exit → no leak. + assert_no_state_findings("both_branches_close.c"); +} + +// ═══════════════════════════════════════════════════════════════════════ +// (2) Loop / back-edge convergence +// ═══════════════════════════════════════════════════════════════════════ + +#[test] +fn loop_clean_converges_no_findings() { + // Open → loop { read } → close. Back-edge should not prevent convergence. + assert_no_state_findings("loop_clean.c"); +} + +#[test] +fn loop_use_after_close() { + // Close before loop → read inside loop on converged CLOSED state. + assert_has("loop_use_after_close.c", "state-use-after-close"); +} + +// ═══════════════════════════════════════════════════════════════════════ +// (3) Handle reassignment / overwrite semantics +// ═══════════════════════════════════════════════════════════════════════ + +#[test] +fn handle_overwrite_silent_per_variable() { + // f = fopen("a"); f = fopen("b"); fclose(f). + // The first handle leaks silently because per-variable tracking + // overwrites the old state. No findings because at exit f = CLOSED. + assert_no_state_findings("handle_overwrite.c"); +} + +#[test] +fn reopen_after_close_is_clean() { + // fopen → fclose → fopen → fclose. Each lifecycle is independent. + assert_no_state_findings("reopen_after_close.c"); +} + +#[test] +fn multiple_handles_leaks_only_unclosed() { + // f1 closed, f2 leaked. 
+ assert_has("multiple_handles.c", "state-resource-leak"); + assert_message_contains("multiple_handles.c", "state-resource-leak", "f2"); + // Must NOT blame f1. + let f1_findings: Vec<_> = state_diags_for("multiple_handles.c") + .into_iter() + .filter(|d| { + d.id == "state-resource-leak" && d.message.as_deref().unwrap_or("").contains("f1") + }) + .collect(); + assert!( + f1_findings.is_empty(), + "f1 is properly closed — should not be reported as leaked.\n Got: {:?}", + f1_findings + .iter() + .map(|d| d.message.as_deref().unwrap_or("")) + .collect::<Vec<_>>() + ); +} + +// ═══════════════════════════════════════════════════════════════════════ +// (4) Conservative join behaviour (branch masks path-specific bugs) +// ═══════════════════════════════════════════════════════════════════════ + +#[test] +fn double_close_branch_conservative_no_event() { + // if (cond) fclose(f); fclose(f); + // True path is double-close, false path is single-close. + // Joined state at the second fclose is OPEN|CLOSED → NOT CLOSED-only. + // Engine correctly refuses to flag when it's ambiguous. + assert_absent("double_close_branch.c", "state-double-close"); +} + +#[test] +fn use_closed_branch_conservative_no_event() { + // if (cond) fclose(f); fread(f); + // True path is use-after-close, false path is clean use. + // Joined state at fread is OPEN|CLOSED → NOT CLOSED-only. + assert_absent("use_closed_branch.c", "state-use-after-close"); + // However, the false path never closes → may-leak at exit. + assert_has("use_closed_branch.c", "state-resource-leak-possible"); +} + +// ═══════════════════════════════════════════════════════════════════════ +// (5) Additional edge cases +// ═══════════════════════════════════════════════════════════════════════ + +#[test] +fn chain_ops_clean() { + // fopen → fread → fwrite → fread → fclose. Multiple uses do not + // corrupt lifecycle state. 
+ assert_no_state_findings("chain_ops.c"); +} + +#[test] +fn malloc_free_clean() { + // Tests the memory resource pair (malloc→free). + assert_no_state_findings("malloc_free_clean.c"); +} + +#[test] +fn malloc_leak() { + // malloc without free. + assert_has("malloc_leak.c", "state-resource-leak"); +} + +#[test] +fn double_close_straight_fires() { + // Straight-line fclose → fclose (no branching). Converged state is + // definitely CLOSED at the second fclose. + assert_has("double_close_straight.c", "state-double-close"); +} + +// ═══════════════════════════════════════════════════════════════════════ +// (6) Cross-cutting: message field populated +// ═══════════════════════════════════════════════════════════════════════ + +#[test] +fn findings_carry_messages() { + // Every state finding should have a non-empty message. + for d in scan_all_fixtures() { + if d.id.starts_with("state-") { + assert!( + d.message.as_ref().is_some_and(|m| !m.is_empty()), + "State finding {} at {}:{} has no message", + d.id, + d.path, + d.line + ); + } + } +} diff --git a/tests/taint_termination_test.rs b/tests/taint_termination_test.rs new file mode 100644 index 00000000..327b20f1 --- /dev/null +++ b/tests/taint_termination_test.rs @@ -0,0 +1,102 @@ +//! Regression tests for taint BFS termination. +//! +//! Before the fix in taint/mod.rs (MAX_BFS_ITERATIONS / MAX_SEEN_STATES), +//! files with many tainted variables and loops caused the BFS to run +//! forever because each loop iteration produced a distinct taint-map hash, +//! bypassing the `(node, taint_hash)` seen-state dedup. + +use nyx_scanner::commands::scan::Diag; +use nyx_scanner::utils::Config; +use std::path::Path; +use std::sync::OnceLock; +use std::time::{Duration, Instant}; + +/// Shared result so we only run the scan once across all assertions. 
+fn scan_fixture() -> &'static Vec<Diag> { + static DIAGS: OnceLock<Vec<Diag>> = OnceLock::new(); + DIAGS.get_or_init(|| { + let fixture = + Path::new(env!("CARGO_MANIFEST_DIR")).join("tests/fixtures/taint_termination"); + let cfg = Config::default(); + nyx_scanner::scan_no_index(&fixture, &cfg).expect("scan should succeed") + }) +} + +/// The scan must complete in a reasonable time. The old code hung forever +/// on this fixture; with the BFS limit it should finish in well under 10s. +#[test] +fn taint_bfs_terminates_within_timeout() { + let start = Instant::now(); + let _diags = scan_fixture(); + let elapsed = start.elapsed(); + + assert!( + elapsed < Duration::from_secs(10), + "Taint BFS took {:?} — should complete in <10s (was infinite before fix)", + elapsed + ); +} + +/// The scan should still produce meaningful findings even after bail-out. +#[test] +fn taint_bfs_produces_findings_after_bailout() { + let diags = scan_fixture(); + // We should get at least *some* findings (cfg-unguarded-sink at minimum, + // possibly taint findings depending on how far the BFS got). + assert!( + !diags.is_empty(), + "Expected at least some findings from heavy_loop.js fixture" + ); +} + +/// Scan a single-file fixture directory via --no-index path. This is the +/// exact code path that hung: `scan_filesystem` → `par_iter().fold().reduce()`. +#[test] +fn scan_no_index_completes() { + let fixture = Path::new(env!("CARGO_MANIFEST_DIR")).join("tests/fixtures/taint_termination"); + let cfg = Config::default(); + + let start = Instant::now(); + let result = nyx_scanner::scan_no_index(&fixture, &cfg); + let elapsed = start.elapsed(); + + assert!(result.is_ok(), "scan should not error"); + assert!( + elapsed < Duration::from_secs(10), + "scan took {:?} on small fixture", + elapsed + ); +} + +/// Indexed path: build_index + scan_with_index_parallel must also complete. 
+#[test] +fn scan_with_index_completes() { + use nyx_scanner::commands::scan::scan_with_index_parallel; + use nyx_scanner::database::index::Indexer; + use std::sync::Arc; + + let fixture = Path::new(env!("CARGO_MANIFEST_DIR")).join("tests/fixtures/taint_termination"); + let td = tempfile::tempdir().unwrap(); + let db_path = td.path().join("test.sqlite"); + let cfg = Config::default(); + + let start = Instant::now(); + + // Build index + nyx_scanner::commands::index::build_index("test", &fixture, &db_path, &cfg, false) + .expect("build_index should succeed"); + + // Scan with index + let pool = Indexer::init(&db_path).unwrap(); + let diags = scan_with_index_parallel("test", Arc::clone(&pool), &cfg, false) + .expect("indexed scan should succeed"); + + let elapsed = start.elapsed(); + assert!( + elapsed < Duration::from_secs(10), + "Indexed scan took {:?} on small fixture", + elapsed + ); + // Should produce findings just like the no-index path + assert!(!diags.is_empty(), "Expected findings from indexed scan"); +}