Added Cap::DATA_EXFIL and taint fp and fn fixes on real repos (#59)

* feat: Enhance data exfiltration detection with source sensitivity gating for cookies and headers
* feat: Implement cross-file data exfiltration detection with parameter-specific gate filters
* feat: Add calibration tests and refine DATA_EXFIL severity scoring logic
* feat: Introduce per-detector configuration for data exfiltration suppression
* feat: Enhance DATA_EXFIL findings with destination field tracking in diagnostics and SARIF output
* feat: Add tainted body and URL handling for data exfiltration detection
* feat: Add integration tests and fixtures for DATA_EXFIL and SSRF detection in Go
* feat: Add Java integration tests and fixtures for DATA_EXFIL detection across multiple HTTP clients
* feat: Add synthetic externals handling for closure-captured variables in SSA
* feat: Implement closure-based suppression for resource leak findings
* feat: Add regression guards for shell-injection and taint propagation in for-of destructure patterns
* feat: Implement constructor cap narrowing for data exfiltration detection in HTTP request builders
* feat: Add gated sinks for data exfiltration detection in C and C++ using curl_easy_setopt
* feat: Implement DATA_EXFIL cap parity for backwards analysis and add integration tests
* feat: Add data exfiltration sinks for various languages and enhance documentation
* refactor: Simplify formatting and improve readability in various files
* refactor: Improve readability by simplifying conditional statements and adding clippy linting
* docs: Update CHANGELOG and comments for data exfiltration features and configuration
* docs: Clarify configuration instructions for data exfiltration trusted destinations
* docs: Enhance comments for evidence routing logic in data exfiltration
2026-07-21 21:31:03 +02:00 · 2026-05-01 10:59:52 -04:00 · 2026-05-01 10:59:52 -04:00 · 58f1794a4e
commit 58f1794a4e
parent a438886217
189 changed files with 8421 additions and 383 deletions
--- a/docs/advanced-analysis.md
+++ b/docs/advanced-analysis.md
@@ -245,6 +245,19 @@ cross-function body expansion.  See `DEFAULT_BACKWARDS_DEPTH`,
 `BACKWARDS_VALUE_BUDGET`, and `MAX_BACKWARDS_CALLEE_BLOCKS` in
 `src/taint/backwards.rs` for the exact bounds.

+**Cap parity.** The walk treats `DemandState.caps` as opaque bitflags, so
+every cap defined in `src/labels/mod.rs` round-trips identically through
+the demand transfer.  This includes `Cap::DATA_EXFIL` (bit 13): a
+`taint-data-exfiltration` forward finding receives `backwards-confirmed`
+exactly like a `taint-unsanitised-flow` SQL/CMD/SSRF finding when its
+demand walk reaches a Sensitive source.  The cap-routing logic in
+`src/ast.rs` then surfaces the rule id correctly regardless of which
+direction confirmed the flow.  See
+`tests/backwards_analysis_tests.rs::demand_driven_suite` (the
+`data_exfil` sub-case) and
+`taint::backwards::tests::driver_walks_data_exfil_source_to_sink` for
+the regression guards.
+
 **Source**: [`src/taint/backwards.rs`](https://github.com/elicpeter/nyx/blob/master/src/taint/backwards.rs).

 ---
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -213,6 +213,26 @@ CLI flag map (each pair is `--enable / --no-enable`):

 **Explain effective engine**: pass `--explain-engine` to print the resolved engine configuration (profile + config + CLI overrides) and exit without scanning.

+### `[detectors.data_exfil]`
+
+Per-project tuning for the `taint-data-exfiltration` rule. All fields are optional.
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `enabled` | bool | `true` | Set `false` to strip `Cap::DATA_EXFIL` from sink caps before emission. No `taint-data-exfiltration` finding reaches the report. Other taint classes are not affected. |
+| `trusted_destinations` | [string] | `[]` | URL prefixes that drop `Cap::DATA_EXFIL` on the call site. Matched against the abstract-string domain prefix of the destination arg, so a literal URL or a template literal with a static prefix both work. Use full origins or origin-pinned paths and include the trailing `/`, otherwise `https://api.` matches `https://api.evil.example.com/` too. |
+
+```toml
+[detectors.data_exfil]
+enabled = true
+trusted_destinations = [
+  "https://api.internal/",
+  "https://telemetry.example.com/",
+]
+```
+
+For the sanitizer convention, source sensitivity gate, and per-language sink coverage, see [Detectors / Taint / DATA_EXFIL](detectors/taint.md#data_exfil-suppression-layers).
+
 ### `[analysis.languages.<slug>]`

 Per-language custom rules. `<slug>` is one of: `rust`, `javascript`, `typescript`, `python`, `go`, `java`, `c`, `cpp`, `php`, `ruby`.
@@ -232,7 +252,8 @@ kind = "sanitizer"        # "source" | "sanitizer" | "sink"
 cap = "html_escape"       # "env_var" | "html_escape" | "shell_escape" |
                          # "url_encode" | "json_parse" | "file_io" |
                          # "fmt_string" | "sql_query" | "deserialize" |
-                          # "ssrf" | "code_exec" | "crypto" | "all"
+                          # "ssrf" | "data_exfil" | "code_exec" | "crypto" |
+                          # "unauthorized_id" | "all"
 ```

 ---
--- a/docs/detectors.md
+++ b/docs/detectors.md
@@ -49,11 +49,13 @@ score = severity_base + analysis_kind + evidence_strength + state_bonus - valida
 | Component | Values |
 |---|---|
 | Severity base | High=60, Medium=30, Low=10 |
-| Analysis kind | taint=+10, state=+8, cfg with evidence=+5, cfg without evidence=+3, ast=+0 |
+| Analysis kind | taint=+10, taint-data-exfiltration=+7, state=+8, cfg with evidence=+5, cfg without evidence=+3, ast=+0 |
 | Evidence strength | +1 per evidence item up to 4; +2 to +6 for source kind |
 | State bonus | use-after-close / unauthed=+6, double-close=+3, must-leak=+2, may-leak=+1 |
 | Validation penalty | -5 if path-validated |

+DATA_EXFIL is calibrated below other taint classes by design. Severity is High only when the source carries credential / session material (cookies, env vars); other Sensitive sources (request headers, file system, database, caught exception) downgrade to Medium. Confidence is capped at Medium, and it reaches Medium only when the abstract / symbolic domain corroborates a concrete string body reaching the outbound payload; otherwise it falls to Low. A guarded flow (`path_validated`) drops a confidence tier. The intent is to seat data-exfiltration findings below SSRF / SQLi / command-injection but above informational AST patterns.
+
 Source-kind contributions (taint only):

 | Source | Bonus |
@@ -71,7 +73,9 @@ Approximate score ranges:
 | High taint with user input | 76 to 81 |
 | High state (use-after-close) | ~74 |
 | High CFG structural | 63 to 68 |
+| High DATA_EXFIL (cookie / env source, body confirmed) | ~76 |
 | Medium taint with env source | 45 to 50 |
+| Medium DATA_EXFIL (header / fs / db / caught-exception source) | 40 to 45 |
 | Medium state (resource leak) | ~40 |
 | Low AST-only pattern | ~10 |
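
The component table and the approximate ranges in `docs/detectors.md` can be cross-checked with a small calculation. The following is a minimal sketch, not Nyx's implementation: the function name, the dictionary keys, and the example inputs (four evidence items and a source-kind bonus of 5, picked from the documented +2 to +6 range) are assumptions made for illustration.

```python
# Sketch of the documented score formula:
#   score = severity_base + analysis_kind + evidence_strength
#           + state_bonus - validation_penalty
# Key names and the helper itself are hypothetical; the numbers come from the
# component table.
SEVERITY_BASE = {"High": 60, "Medium": 30, "Low": 10}
ANALYSIS_KIND = {
    "taint": 10,
    "taint-data-exfiltration": 7,
    "state": 8,
    "cfg-with-evidence": 5,
    "cfg-without-evidence": 3,
    "ast": 0,
}


def attack_surface_score(severity, analysis_kind, evidence_items=0,
                         source_bonus=0, state_bonus=0, path_validated=False):
    """Combine the documented components into one score."""
    evidence = min(evidence_items, 4)       # +1 per evidence item, up to 4
    penalty = 5 if path_validated else 0    # -5 if path-validated
    return (SEVERITY_BASE[severity] + ANALYSIS_KIND[analysis_kind]
            + evidence + source_bonus + state_bonus - penalty)


# Assumed inputs: 4 evidence items, source-kind bonus of 5.
high_exfil = attack_surface_score("High", "taint-data-exfiltration",
                                  evidence_items=4, source_bonus=5)
# Use-after-close state finding with no extra evidence.
high_state = attack_surface_score("High", "state", state_bonus=6)
print(high_exfil, high_state)
```

Under those assumed inputs the High DATA_EXFIL case works out to 60 + 7 + 4 + 5 = 76 and the use-after-close case to 60 + 8 + 6 = 74, which agrees with the approximate ranges listed in the table.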

--- a/docs/detectors/taint.md
+++ b/docs/detectors/taint.md
@@ -135,10 +135,130 @@ Sources, sanitizers, and sinks are linked by named capabilities. A sanitizer onl
 | `sql_query` | | parameterized query binders | `cursor.execute`, `db.query` with concatenation |
 | `deserialize` | | | `pickle.loads`, `yaml.load`, `Marshal.load` |
 | `ssrf` | | URL-prefix locks | `requests.get`, `fetch` URL arg, outbound HTTP destination |
-| `data_exfil` | | | `fetch` body / headers / json, `XMLHttpRequest.send` body |
+| `data_exfil` | cookies, headers, env, db rows, file reads (Sensitive-tier sources only) | | `fetch` body / headers / json, `XMLHttpRequest.send` body |
 | `code_exec` | | | `eval`, `exec`, `Function` |
 | `crypto` | | | weak-algorithm constructors |
 | `unauthorized_id` | request-bound scoped IDs (Rust auth analysis) | ownership check | row-level write |
 | `all` | Sources typically use `all` so they match any sink | | |

 Sources typically use `cap = "all"` so they match every sink. Sinks declare the specific cap they need. Sanitizers only clear the cap they name.
+
+## Source sensitivity
+
+Some detector classes need to know not just *that* a value is attacker-influenced but *what kind* of value it is. Each source carries a `SourceKind` (`UserInput`, `Cookie`, `Header`, `EnvironmentConfig`, `FileSystem`, `Database`, `CaughtException`, `Unknown`) and a derived sensitivity tier:
+
+| Tier | Source kinds | Meaning |
+|---|---|---|
+| `Plain` | `UserInput` (request bodies, query strings, form fields, argv, stdin) | Attacker-controlled but already in the attacker's hands. Echoing it back to them is not a disclosure. |
+| `Sensitive` | `Cookie`, `Header`, `EnvironmentConfig`, `FileSystem`, `Database`, `CaughtException`, `Unknown` | Operator-bound state that should not leak across boundaries. |
+| `Secret` | (reserved for explicit credential sources) | Highest tier; treated identically to `Sensitive` today. |
+
+`Cap::DATA_EXFIL` only fires when the contributing source is at least `Sensitive`. Plain user input flowing into an outbound `fetch` body is suppressed at finding-emission time; this is the canonical false-positive class for API gateways and telemetry forwarders that proxy `req.body`. SSRF and other classes are unaffected; the gate is scoped to `DATA_EXFIL`.
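
The tier table and the gate above can be sketched as a lookup followed by a comparison. This is an illustrative model only: the `Tier` enum, the `SOURCE_TIER` mapping, and `data_exfil_fires` are hypothetical names, and the real gate lives in Nyx's Rust code.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Sensitivity tiers, ordered so that comparisons express 'at least'."""
    PLAIN = 0
    SENSITIVE = 1
    SECRET = 2   # reserved; treated like SENSITIVE today


# SourceKind -> tier, copied from the table above.
SOURCE_TIER = {
    "UserInput": Tier.PLAIN,
    "Cookie": Tier.SENSITIVE,
    "Header": Tier.SENSITIVE,
    "EnvironmentConfig": Tier.SENSITIVE,
    "FileSystem": Tier.SENSITIVE,
    "Database": Tier.SENSITIVE,
    "CaughtException": Tier.SENSITIVE,
    "Unknown": Tier.SENSITIVE,
}


def data_exfil_fires(source_kind):
    """DATA_EXFIL is emitted only for sources at or above the Sensitive tier."""
    return SOURCE_TIER[source_kind] >= Tier.SENSITIVE


print(data_exfil_fires("UserInput"))  # proxied req.body: suppressed
print(data_exfil_fires("Cookie"))     # session cookie in a fetch body: fires
```

Note that `Unknown` maps to `Sensitive` in the table, so in this model an unclassified source still fires.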
+
+If a project legitimately classifies a request body as sensitive (e.g. an internal forwarder where `req.body` carries a pre-authenticated user token), override via custom rules in `nyx.conf`:
+
+```toml
+# Treat the forwarder's outbound payload as already-sanitized so the
+# DATA_EXFIL gate stops firing on it.
+[[analysis.languages.javascript.rules]]
+matchers = ["sanitizeOutbound"]
+kind     = "sanitizer"
+cap      = "data_exfil"
+```
+
+Or re-classify the source itself with a custom Source rule whose name matches one of the Sensitive substrings (`cookie`, `header`).
+
+## DATA_EXFIL suppression layers
+
+Three knobs ship out of the box so projects can match the cap to their architecture without per-call suppressions.
+
+### 1. Forwarding-wrapper sanitizer convention
+
+A named function that exists to *forward* a payload across a known boundary is the developer's explicit decision to send the data. The default sanitizer rules treat the following identifiers as `Sanitizer(data_exfil)` in JavaScript and TypeScript:
+
+```
+serializeForUpstream
+forwardPayload
+tracker.send
+analytics.track
+metrics.report
+logEvent
+```
+
+If your codebase follows this convention, the cap stops firing on these calls automatically. Extend the convention with your own forwarding wrappers via the standard custom-rule path:
+
+```toml
+[[analysis.languages.javascript.rules]]
+matchers = ["dispatchTelemetry", "sendToBus"]
+kind     = "sanitizer"
+cap      = "data_exfil"
+```
+
+The rule of thumb: a function that *only* exists to ship a payload to a known boundary belongs in this list. A function that *might* leak (a generic HTTP wrapper, a logging helper that writes to an arbitrary destination) does not.
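
The effect of the convention can be sketched with bitflags: a call to a listed forwarder clears `DATA_EXFIL` and nothing else. This is a simplified model under stated assumptions. Bit 13 for `DATA_EXFIL` is the position given in `docs/advanced-analysis.md`; the `SSRF` bit position, the exact-name matching, and the `caps_after_call` helper are hypothetical.

```python
from enum import IntFlag


class Cap(IntFlag):
    SSRF = 1 << 0         # hypothetical bit position, for illustration only
    DATA_EXFIL = 1 << 13  # bit 13, as documented for the backwards walk


# Default forwarding wrappers listed above (JavaScript and TypeScript).
DEFAULT_FORWARDERS = frozenset({
    "serializeForUpstream", "forwardPayload", "tracker.send",
    "analytics.track", "metrics.report", "logEvent",
})


def caps_after_call(callee, caps, extra_forwarders=frozenset()):
    """Clear only DATA_EXFIL when the callee is a known forwarding wrapper.

    Exact-name matching is an assumption; Nyx's matcher semantics may differ.
    """
    if callee in DEFAULT_FORWARDERS or callee in extra_forwarders:
        return caps & ~Cap.DATA_EXFIL
    return caps


tainted = Cap.SSRF | Cap.DATA_EXFIL
forwarded = caps_after_call("forwardPayload", tainted)
generic = caps_after_call("httpWrapper", tainted)
custom = caps_after_call("dispatchTelemetry", tainted,
                         extra_forwarders=frozenset({"dispatchTelemetry"}))
```

In this model `forwarded` and `custom` keep the `SSRF` bit but lose `DATA_EXFIL`, while the generic wrapper leaves both bits in place, matching the rule that a sanitizer only clears the cap it names.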
+
+### 2. Destination allowlist
+
+Configure a set of trusted outbound prefixes once and the cap is dropped on every site whose destination argument has a static prefix that begins with one of them:
+
+```toml
+[detectors.data_exfil]
+trusted_destinations = [
+  "https://api.internal/",
+  "https://telemetry.example.com/",
+]
+```
+
+Use full origins or origin-pinned paths so a partial-host match across unrelated origins cannot occur. `https://api.` would also match `https://api.evil.example.com/` — the entry must include the path separator (`/`) at the end of the host.
+
+The match consults the abstract string domain: a literal URL is a static prefix; a template literal `` `https://api.internal/${id}` `` exposes the prefix `https://api.internal/`; a fully dynamic URL has no prefix and the cap fires as usual.
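
Why the trailing `/` matters is easiest to see with plain string-prefix matching. The sketch below assumes the allowlist check reduces to "the destination's static prefix starts with a trusted entry"; `is_trusted` is a hypothetical helper, and `None` stands in for a fully dynamic URL with no static prefix.

```python
def is_trusted(static_prefix, trusted_destinations):
    """Return True when the destination's static prefix begins with an entry.

    static_prefix is None for a fully dynamic URL, which never matches.
    """
    if static_prefix is None:
        return False
    return any(static_prefix.startswith(entry) for entry in trusted_destinations)


loose = ["https://api."]             # missing host and trailing slash
pinned = ["https://api.internal/"]   # full origin with trailing slash

# The loose entry also trusts an unrelated origin.
print(is_trusted("https://api.evil.example.com/", loose))    # True
# The pinned entry does not.
print(is_trusted("https://api.evil.example.com/", pinned))   # False
# Prefix exposed by a template literal rooted at the trusted origin.
print(is_trusted("https://api.internal/", pinned))           # True
# Fully dynamic URL: no prefix, so the cap fires as usual.
print(is_trusted(None, pinned))                              # False
```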
+
+### 3. Detector-class disable
+
+Some projects forward user-bound payloads as a matter of architecture. Turn the entire detector class off when the noise is permanent:
+
+```toml
+[detectors.data_exfil]
+enabled = false
+```
+
+`enabled = false` strips `Cap::DATA_EXFIL` from sink caps before event emission, so no `taint-data-exfiltration` finding reaches the report. The decision is per-project — other projects loaded by the same `nyx serve` instance keep their own settings.
+
+## DATA_EXFIL sinks per language
+
+Nyx ships the following sinks for `Cap::DATA_EXFIL`. The body, headers, or json payload arg fires; the URL arg routes through the SSRF gate and emits `taint-unsanitised-flow` instead.
+
+| Language | Sinks | Example |
+|---|---|---|
+| JavaScript, TypeScript | `fetch(url, {body, headers, json})` body-bind, `XMLHttpRequest.prototype.send`, type-qualified `HttpClient.send` | `fetch('/upload', {method: 'POST', body: req.cookies.session})` |
+| Python | `requests.post / put / patch` body and json kwargs, `httpx.AsyncClient().post` json kwarg, `aiohttp.ClientSession().post` body, dict round-trip into json | `requests.post('https://api.internal/ingest', json={'k': os.environ.get('SECRET')})` |
+| Java | `HttpClient.send` with `BodyPublishers.ofString`, OkHttp `newCall(req).execute` body chain, Apache `HttpClient.execute(HttpPost)`, `RestTemplate.postForEntity / exchange`, `WebClient.post().bodyValue / body` | `client.send(HttpRequest.newBuilder().uri(...).POST(BodyPublishers.ofString(token)).build(), ...)` |
+| Go | `http.Post(url, ct, body)` body arg, `http.PostForm` form arg, `(*http.Client).Do(req)` after `http.NewRequest`, `(*http.Request).Body` assignment | `http.Post("https://analytics.internal/track", "text/plain", strings.NewReader(c.Value))` |
+| Rust | `reqwest::Client.post().body / json / form / multipart().send()`, `ureq::post().send_string / send_form / send_json`, `surf::post().body_string / body_json`, `hyper::Request::builder().body()` | `reqwest::Client::new().post(url).form(&secret).send()` |
+| Ruby | `Net::HTTP.post(uri, body)` body arg, `Net::HTTP::Post.new(uri).body=`, `RestClient.post / put`, `HTTParty.post(url, body: ...)` body | `Net::HTTP.post(URI('https://analytics.internal/track'), "session=#{request.cookies[:auth]}")` |
+| C, C++ | `curl_easy_setopt(handle, CURLOPT_POSTFIELDS, body)` and `CURLOPT_COPYPOSTFIELDS` gated sinks (macro-arg activation), `CURLOPT_POSTFIELDSIZE` body-bind | `curl_easy_setopt(curl, CURLOPT_POSTFIELDS, getenv("AUTH_TOKEN"));` |
+| PHP | `curl_setopt($ch, CURLOPT_POSTFIELDS, $body)`, `Guzzle\Client.post($url, ['body' => $tainted])`, `Symfony\HttpClient->request('POST', $url, ['body' => $tainted])` | `curl_setopt($ch, CURLOPT_POSTFIELDS, $_COOKIE['session']);` |
+
+Add project-specific sinks with `nyx config add-rule --kind sink --cap data_exfil --matcher <name>` or the equivalent TOML rule.
+
+## DATA_EXFIL calibration ranges
+
+`taint-data-exfiltration` is calibrated below the other taint classes on purpose.
+
+| Source kind | Severity | Confidence ceiling |
+|---|---|---|
+| Cookie, environment variable | High | Medium |
+| Header | Medium | Medium |
+| File system, database | Medium | Medium |
+| Caught exception | Medium | Low |
+
+Path-validated flows (`path_validated: true`) drop one severity tier. Confidence drops to Low when the abstract or symbolic domain cannot corroborate a concrete string reaching the outbound payload (for example, when the body comes from a callee with no summary).
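
The calibration table can be read as a lookup plus one corroboration rule. The sketch below models only those two pieces and deliberately leaves out the path-validated adjustment; the `CALIBRATION` mapping and the `calibrate` helper are hypothetical names, and the source-kind keys follow the `SourceKind` variants named earlier.

```python
# SourceKind -> (severity, confidence ceiling), copied from the table above.
CALIBRATION = {
    "Cookie": ("High", "Medium"),
    "EnvironmentConfig": ("High", "Medium"),
    "Header": ("Medium", "Medium"),
    "FileSystem": ("Medium", "Medium"),
    "Database": ("Medium", "Medium"),
    "CaughtException": ("Medium", "Low"),
}


def calibrate(source_kind, body_corroborated):
    """Return (severity, confidence) for a DATA_EXFIL finding.

    Confidence reaches the ceiling only when the abstract or symbolic domain
    corroborates a concrete string in the outbound payload; otherwise Low.
    """
    severity, ceiling = CALIBRATION[source_kind]
    confidence = ceiling if body_corroborated else "Low"
    return severity, confidence


print(calibrate("Cookie", body_corroborated=True))
print(calibrate("Header", body_corroborated=False))
print(calibrate("CaughtException", body_corroborated=True))
```

For a caught-exception source the ceiling is already Low, so in this model corroboration does not change the result.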
+
+Attack-surface score ranges:
+
+| Finding shape | Score |
+|---|---|
+| High DATA_EXFIL, cookie or env source, body confirmed | around 76 |
+| Medium DATA_EXFIL, header, fs, db, or caught-exception source | 40 to 45 |
+| Low DATA_EXFIL, no abstract corroboration, path-validated | 18 to 25 |
+
+For reference: High SSRF, SQLi, cmdi land at 76 to 81; Medium taint with env source lands at 45 to 50; AST-only patterns sit around 10. Data-exfil sits below the direct-compromise classes but above informational AST patterns.