[pitboss/grind] cleanup session-0020 (20260521T201327Z-3848)

2026-07-24 21:41:02 +02:00 · 2026-05-21 20:38:05 -05:00 · 2026-05-21 20:38:05 -05:00 · de24d25e4f
commit de24d25e4f
parent dd9da4eef5
6 changed files with 54 additions and 68 deletions
--- a/docs/cli.md
+++ b/docs/cli.md
@@ -74,7 +74,7 @@ nyx scan [PATH] [OPTIONS]
 | `--fail-on <SEV>` | *(none)* | Exit code 1 if any finding >= this severity |
 | `--show-suppressed` | off | Show inline-suppressed findings (dimmed, tagged `[SUPPRESSED]`) |
 | `--keep-nonprod-severity` | off | Don't downgrade severity for test/vendor paths |
-| `--all` | off | Disable category filtering, rollups, and LOW budgets -- show everything |
+| `--all` | off | Disable category filtering, rollups, and LOW budgets. Shows everything |
 | `--include-quality` | off | Include Quality-category findings (hidden by default) |
 | `--max-low <N>` | `20` | Maximum total LOW findings to show |
 | `--max-low-per-file <N>` | `1` | Maximum LOW findings per file |
--- a/docs/language-maturity.md
+++ b/docs/language-maturity.md
@@ -9,9 +9,10 @@ The classifications here are grounded in three concrete signals:
 1. **Rule depth**: how many distinct source / sanitizer / sink matchers exist
   for the language in `src/labels/<lang>.rs`, and how many vulnerability
   classes (Cap bits) those matchers cover.
-2. **Benchmark results**: rule-level precision / recall / F1 on the 492-case
+2. **Benchmark results**: rule-level precision / recall / F1 on the synthetic
   corpus in
   [`tests/benchmark/RESULTS.md`](https://github.com/elicpeter/nyx/blob/master/tests/benchmark/RESULTS.md).
+   `RESULTS.md` is the authoritative source for case counts and per-language scores.
 3. **Known weak spots**: FPs and FNs the maintainers have deliberately left
   in the benchmark rather than suppressed, plus structural engine
   limitations the corpus does not stress, documented in
@@ -42,23 +43,25 @@ use tree-sitter and are stable; parsing is not a differentiator.

 ### Stable tier

-#### Python: 100% P / 100% R / 100% F1 *(46-case corpus)*
+#### Python

-- **Rule depth**: 5 source families, 7 sanitizer families, 21 sink matchers
+- **Rule depth**: deep source / sanitizer / sink coverage in
+  [`src/labels/python.rs`](https://github.com/elicpeter/nyx/blob/master/src/labels/python.rs)
  spanning HTML, URL, Shell, SQL, Code, SSRF, File I/O, and Deserialization.
 - **Framework context**: Flask, Django, argparse source matchers; `flask_request`
  import-alias support.
 - **Advanced analysis**: gated sinks (`Popen`, `subprocess.run/call` with
  activation-arg awareness), most SSA-equivalence and symbolic-execution
  fixtures target Python.
-- **Fixtures**: 125 under `tests/fixtures/` plus 42 benchmark cases.
+- **Fixtures**: extensive `.py` coverage under `tests/fixtures/` plus the benchmark cases.
 - **Blind spots**: f-string interpolation is not explicitly modeled as a
  distinct taint-producing construct; string-formatting flows are caught by
  the general concatenation path.

-#### JavaScript: 100% P / 100% R / 100% F1 *(42-case corpus)*
+#### JavaScript

-- **Rule depth**: 3 source families, 10 sanitizer families, 24 sink matchers
+- **Rule depth**: deep source / sanitizer / sink coverage in
+  [`src/labels/javascript.rs`](https://github.com/elicpeter/nyx/blob/master/src/labels/javascript.rs)
  spanning HTML, URL, JSON, Shell, SQL, Code, SSRF, and File I/O.
 - **Advanced analysis**: gated sinks (`setAttribute`, `parseFromString`),
  two-level SSA solve for top-level + per-function scopes
@@ -66,15 +69,16 @@ use tree-sitter and are stable; parsing is not a differentiator.
  StringFact, abstract-interpretation interval tracking.
 - **Framework context**: Express, Koa, Fastify (via in-file import scan when
  `package.json` is absent).
-- **Fixtures**: 238 under `tests/fixtures/`; the largest fixture set of any
+- **Fixtures**: the `.js` set under `tests/fixtures/` is the largest of any
  language.
 - **Blind spots**: template literals are lowered through concatenation rather
  than modeled as a first-class taint operator; dynamic property access
  (`obj[user]`) is conservatively treated.

-#### TypeScript: 100% P / 100% R / 100% F1 *(47-case corpus)*
+#### TypeScript

-- **Rule depth**: Shares the JS ruleset (3 sources, 10 sanitizers, 24 sinks)
+- **Rule depth**: shares the JS ruleset (see
+  [`src/labels/typescript.rs`](https://github.com/elicpeter/nyx/blob/master/src/labels/typescript.rs))
  plus TS-specific grammar handling.
 - **Advanced analysis**: TSX and JSX grammars wired;
  discriminated-union narrowing, generic erasure, decorator flow, and
@@ -82,15 +86,16 @@ use tree-sitter and are stable; parsing is not a differentiator.
  stressors.
 - **Framework context**: Fastify detection via `detect_in_file_frameworks`
  (import-driven, no `package.json` required).
-- **Fixtures**: 39 test fixtures plus 42 benchmark cases.
+- **Fixtures**: dedicated `.ts` / `.tsx` set under `tests/fixtures/` plus the benchmark cases.
 - **Blind spots**: `as any` casts and `any`-typed flows are handled
  conservatively (treated as tainted).

 ### Beta tier

-#### Go: 100% P / 100% R / 100% F1 *(56-case corpus)*
+#### Go

-- **Rule depth**: 4 source families, 4 sanitizer families, 9 sink matchers
+- **Rule depth**: mid-depth source / sanitizer / sink coverage in
+  [`src/labels/go.rs`](https://github.com/elicpeter/nyx/blob/master/src/labels/go.rs)
  covering HTML, URL, Shell, SQL, SSRF, Crypto, and File I/O.
 - **Framework context**: Gin, Echo source matchers.
 - **Recent fix**: `strings.ReplaceAll` is now recognised as a CMDi sanitiser
@@ -103,9 +108,10 @@ use tree-sitter and are stable; parsing is not a differentiator.
  so production CI gates may surface additional FPs the corpus does not
  exercise.

-#### Java: 100% P / 100% R / 100% F1 *(35-case corpus)*
+#### Java

-- **Rule depth**: 3 source families, 8 sanitizer families, 10 sink matchers
+- **Rule depth**: mid-depth source / sanitizer / sink coverage in
+  [`src/labels/java.rs`](https://github.com/elicpeter/nyx/blob/master/src/labels/java.rs)
  covering HTML, URL, Shell, SQL, Code, SSRF, and Deserialization.
 - **Framework context**: Spring, JPA, Hibernate ORM rules; JNDI injection
  sinks.
@@ -115,18 +121,20 @@ use tree-sitter and are stable; parsing is not a differentiator.
  cannot be inferred are conservatively over-tainted on unusual builder
  chains.

-#### PHP: 100% P / 100% R / 100% F1 *(37-case corpus)*
+#### PHP

-- **Rule depth**: 3 source families (`$_GET`, `$_POST`, `$_REQUEST`
-  superglobals), 7 sanitizer families, 10 sink matchers covering HTML, URL,
-  Shell, SQL, Code, SSRF, File I/O, and Deserialization.
+- **Rule depth**: sources include `$_GET`, `$_POST`, `$_REQUEST`
+  superglobals plus sanitizer / sink matchers in
+  [`src/labels/php.rs`](https://github.com/elicpeter/nyx/blob/master/src/labels/php.rs)
+  covering HTML, URL, Shell, SQL, Code, SSRF, File I/O, and Deserialization.
 - **Known gaps**: no gated sinks. Limited framework context (Laravel raw
  methods only). `echo` language-construct detection is wired but its
  inner-argument propagation is narrower than function-call sinks.

-#### Ruby: 100% P / 100% R / 100% F1 *(39-case corpus)*
+#### Ruby

-- **Rule depth**: 3 source families, 7 sanitizer families, 16 sink matchers
+- **Rule depth**: source / sanitizer / sink coverage in
+  [`src/labels/ruby.rs`](https://github.com/elicpeter/nyx/blob/master/src/labels/ruby.rs)
  covering HTML, Shell, SQL, Code, SSRF, File I/O, and Deserialization. SSRF
  coverage includes `URI.open` and the low-level `OpenURI.open_uri` it
  delegates to (the canonical CarrierWave CVE-2021-21288 sink).
@@ -140,18 +148,19 @@ use tree-sitter and are stable; parsing is not a differentiator.
  recognized structurally but not modeled as a distinct operator.
  `begin/rescue/ensure` exception-edge wiring is not implemented.

-#### Rust: 100% P / 100% R / 100% F1 *(70-case adversarial corpus)*
+#### Rust

 Rust holds the largest per-language adversarial corpus. PathFact-driven
 path-domain narrowing covers the `rs-safe-*` regression set.

-- **Rule depth**: 6 source families, **2** sanitizer families (prefix and
-  type-coercion), 11 sink matchers covering HTML, Shell, SQL, SSRF,
-  Deserialization, and File I/O. Extensive framework source coverage
-  (Axum, Actix, Rocket); the most of any language on the source side. The
-  narrow sanitizer count is the primary reason Rust is not in the Stable
-  tier. Engine-side path/typed sanitizer recognition (PathFact) compensates,
-  but the ruleset itself is shallow.
+- **Rule depth**: source / sanitizer / sink coverage in
+  [`src/labels/rust.rs`](https://github.com/elicpeter/nyx/blob/master/src/labels/rust.rs)
+  covering HTML, Shell, SQL, SSRF, Deserialization, and File I/O.
+  Extensive framework source coverage (Axum, Actix, Rocket); the most of
+  any language on the source side. The narrow sanitizer rule set (prefix
+  and type-coercion only) is the primary reason Rust is not in the Stable
+  tier. Engine-side path/typed sanitizer recognition (PathFact)
+  compensates, but the ruleset itself is shallow.
 - **Coverage**: SQL class (`rusqlite`, `sqlx`, `diesel`, `postgres`),
  Deserialization class (`serde_yaml`, `bincode`, `rmp_serde`, `ciborium`,
  `ron`, `toml`), file I/O (`fs::remove_file/dir/rename/copy`), and the
@@ -220,20 +229,22 @@ Clang Static Analyzer, or Infer for production use.
  doesn't make `buf` an alias for every element.
 - Nested classes beyond one level (C++ only).

-#### C: 100% P / 100% R / 100% F1 *(30-case corpus)*
+#### C

-- **Rule depth**: 3 source families, **2** sanitizer families (the
-  `sanitize_*` prefix and numeric-parse functions), 5 sink matchers spanning
-  Shell, File, SSRF, and Format-String.
+- **Rule depth**: source / sanitizer / sink coverage in
+  [`src/labels/c.rs`](https://github.com/elicpeter/nyx/blob/master/src/labels/c.rs).
+  Sanitizers are limited to the `sanitize_*` prefix and numeric-parse
+  functions; sinks span Shell, File, SSRF, and Format-String.
 - **Known gaps**: no framework rules, no gated sinks. The structural
  limitations listed above are the dominant concern; rule additions alone
  will not lift this language out of the Preview tier.

-#### C++: 100% P / 100% R / 100% F1 *(33-case corpus, plus 6 new fixtures for STL / builder / inline-method flows)*
+#### C++

-- **Rule depth**: Builds on the C ruleset with `std::cin` / `std::getline`
-  sources and a wider numeric-sanitizer set covering the full `std::sto*`
-  family (3 sources, 3 sanitizer families, 5 sinks).
+- **Rule depth**: builds on the C ruleset (see
+  [`src/labels/cpp.rs`](https://github.com/elicpeter/nyx/blob/master/src/labels/cpp.rs))
+  with `std::cin` / `std::getline` sources and a wider numeric-sanitizer
+  set covering the full `std::sto*` family.
 - **Known gaps**: still no framework rules and no gated sinks. The
  structural blind spots are now narrower than they were a release ago
  (see "What now works" above), but function pointers and the harder
--- a/src/database.rs
+++ b/src/database.rs
@@ -283,9 +283,6 @@ pub mod index {
    ///   footprint.
    pub const SCHEMA_VERSION: &str = "4";

-    // TODO: ADD CLEANS FOR EACH TABLE BASED ON PROJECT WHICH RUNS ON CLEAN
-    // TODO: ADD DROP AND GIVE A CLI PARAMETER FOR DROP
-
    /// A single issue row, ready for insertion.
    #[derive(Debug, Clone)]
    pub struct IssueRow<'a> {
--- a/src/taint/ssa_transfer/inline.rs
+++ b/src/taint/ssa_transfer/inline.rs
@@ -8,7 +8,6 @@

 use crate::labels::Cap;
 use crate::ssa::ir::{SsaBody, Terminator};
-use crate::summary::ssa_summary::PathFactReturnEntry;
 use crate::symbol::FuncKey;
 use crate::taint::domain::{TaintOrigin, VarTaint};
 use petgraph::graph::NodeIndex;
@@ -32,11 +31,6 @@ pub(crate) struct InlineResult {
    /// provably narrows it (e.g. a `sanitize_path` early-returning on
    /// `s.contains("..")`).
    pub(super) return_path_fact: crate::abstract_interp::PathFact,
-    /// Per-return-path decomposition of `return_path_fact`. Non-empty
-    /// when the callee has ≥2 return blocks with different predicate
-    /// gates.
-    #[allow(dead_code)]
-    pub(super) return_path_facts: SmallVec<[PathFactReturnEntry; 2]>,
 }

 /// Structural (callsite-agnostic) summary of an inline-analyzed
@@ -71,9 +65,6 @@ pub(crate) struct ReturnShape {
    /// state under Top-seeded Params. Describes the callee's intrinsic
    /// narrowing.
    pub(super) return_path_fact: crate::abstract_interp::PathFact,
-    /// Per-return-path decomposition of the return value. Populated
-    /// when the callee has ≥2 return blocks with different predicates.
-    pub(super) return_path_facts: SmallVec<[PathFactReturnEntry; 2]>,
 }

 impl CachedInlineShape {
--- a/src/taint/ssa_transfer/mod.rs
+++ b/src/taint/ssa_transfer/mod.rs
@@ -3114,20 +3114,13 @@ fn extract_inline_return_taint(
    let return_path_fact =
        return_path_fact_acc.unwrap_or_else(crate::abstract_interp::PathFact::top);

-    // Only keep per-return-path entries when at least one entry carries
-    // meaningful signal (non-Top path_fact or a variant_inner_fact).  A
-    // list of all-Top entries adds bytes on disk without helping a
-    // caller pick a path.  Additionally require ≥2 distinct entries ,
-    // a single-entry list is no finer than the joined `return_path_fact`.
-    let return_path_facts = if per_return_path_entries.len() >= 2
+    // Per-return-path signal for the gate below: requires at least two
+    // entries, one or more carrying a non-Top path_fact or a variant_inner_fact.
+    // A single-entry list is no finer than the joined `return_path_fact`.
+    let has_per_return_path_signal = per_return_path_entries.len() >= 2
        && per_return_path_entries
            .iter()
-            .any(|e| !e.path_fact.is_top() || e.variant_inner_fact.is_some())
-    {
-        per_return_path_entries
-    } else {
-        SmallVec::new()
-    };
+            .any(|e| !e.path_fact.is_top() || e.variant_inner_fact.is_some());

    // Even when the callee produces no return taint and no param/receiver
    // provenance, a non-Top PathFact on the return is still meaningful
@@ -3138,7 +3131,7 @@ fn extract_inline_return_taint(
        && !final_receiver
        && final_internal.is_empty()
        && return_path_fact.is_top()
-        && return_path_facts.is_empty()
+        && !has_per_return_path_signal
    {
        return CachedInlineShape(None);
    }
@@ -3150,7 +3143,6 @@ fn extract_inline_return_taint(
        receiver_provenance: final_receiver,
        uses_summary: true, // inline analysis is a form of summary
        return_path_fact,
-        return_path_facts,
    }))
 }

@@ -3325,7 +3317,6 @@ fn apply_cached_shape(
        return InlineResult {
            return_taint: None,
            return_path_fact: crate::abstract_interp::PathFact::top(),
-            return_path_facts: SmallVec::new(),
        };
    };

@@ -3407,7 +3398,6 @@ fn apply_cached_shape(
    InlineResult {
        return_taint,
        return_path_fact: ret.return_path_fact.clone(),
-        return_path_facts: ret.return_path_facts.clone(),
    }
 }

--- a/src/taint/ssa_transfer/tests.rs
+++ b/src/taint/ssa_transfer/tests.rs
@@ -263,7 +263,6 @@ mod inline_cache_epoch_tests {
            receiver_provenance: false,
            uses_summary: false,
            return_path_fact: crate::abstract_interp::PathFact::top(),
-            return_path_facts: SmallVec::new(),
        }))
    }

@@ -337,7 +336,6 @@ mod inline_cache_epoch_tests {
            receiver_provenance: false,
            uses_summary: true,
            return_path_fact: crate::abstract_interp::PathFact::top(),
-            return_path_facts: SmallVec::new(),
        }));

        // Caller A: argument carries an env-source origin.
@@ -404,7 +402,6 @@ mod inline_cache_epoch_tests {
            receiver_provenance: false,
            uses_summary: true,
            return_path_fact: crate::abstract_interp::PathFact::top(),
-            return_path_facts: SmallVec::new(),
        }));

        let state = SsaTaintState::initial();
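
Review note on the `extract_inline_return_taint` hunks: the change replaces a filtered list plus an `is_empty()` check with a directly computed boolean. The sketch below is a minimal, standalone illustration of why the two gates should agree. `PathFact`, `Entry`, `old_gate`, and `new_gate` are hypothetical stand-ins written for this note (the real code uses `crate::abstract_interp::PathFact`, `PathFactReturnEntry`, and `SmallVec`), so treat it as an assumption-laden sketch rather than the project's implementation.

```rust
// Hypothetical stand-in for the real PathFact; only `is_top` matters here.
#[derive(Clone, Debug)]
enum PathFact {
    Top,
    Narrowed,
}

impl PathFact {
    fn is_top(&self) -> bool {
        matches!(self, PathFact::Top)
    }
}

// Hypothetical stand-in for PathFactReturnEntry, reduced to the two
// fields the gate reads.
#[derive(Clone, Debug)]
struct Entry {
    path_fact: PathFact,
    variant_inner_fact: Option<u8>,
}

// Shape of the removed code: keep the list only when it carries signal,
// then test the kept list for emptiness. `Vec` stands in for `SmallVec`.
fn old_gate(entries: &[Entry]) -> bool {
    let kept: Vec<Entry> = if entries.len() >= 2
        && entries
            .iter()
            .any(|e| !e.path_fact.is_top() || e.variant_inner_fact.is_some())
    {
        entries.to_vec()
    } else {
        Vec::new()
    };
    !kept.is_empty()
}

// Shape of the added code: compute the boolean directly.
fn new_gate(entries: &[Entry]) -> bool {
    entries.len() >= 2
        && entries
            .iter()
            .any(|e| !e.path_fact.is_top() || e.variant_inner_fact.is_some())
}

fn main() {
    let top = Entry { path_fact: PathFact::Top, variant_inner_fact: None };
    let narrowed = Entry { path_fact: PathFact::Narrowed, variant_inner_fact: None };
    let variant = Entry { path_fact: PathFact::Top, variant_inner_fact: Some(1) };

    let cases: Vec<Vec<Entry>> = vec![
        vec![],
        vec![narrowed.clone()],
        vec![top.clone(), top.clone()],
        vec![top.clone(), narrowed.clone()],
        vec![top.clone(), variant.clone()],
    ];

    // Both gates must agree on every case for the refactor to be
    // behavior-preserving at the early-return check.
    for case in &cases {
        assert_eq!(old_gate(case), new_gate(case));
    }
    println!("gates agree on {} cases", cases.len());
}
```

Because the kept list is non-empty exactly when the length and signal conditions both hold, dropping the `return_path_facts` field should not change which callees hit the `CachedInlineShape(None)` early return.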