diff --git a/CHANGELOG.md b/CHANGELOG.md
index f2fa9175..772df1be 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,69 @@ All notable changes to Nyx are documented here. The format is based on [Keep a C
 
 ## [Unreleased]
 
+A round of cross-file FastAPI auth, two new sink/validator classes, a ~957-FP Go DAO helper precision pass, four CVE corpus pairs, and a performance pass on the auth extractor pipeline plus SCCP and the global summaries hash map.
+
+### Added
+
+- FastAPI cross-file `include_router` dependency tracking. New `auth_analysis/router_facts.rs` captures per-file router declarations (` = X(deps=[…])`) and `.include_router(.)` edges in pass 1, persists them into `GlobalSummaries::router_facts_by_module`, and resolves them into the active file's `AuthorizationModel::cross_file_router_deps` at pass 2 entry. Transitive lifts (`grandparent → parent → child`) handled by iterative index walk. Module identity is the file basename without `.py` (approximate, but sufficient for airflow-style `task_instances.router` naming). Closes the airflow execution-API shape where a child router lives in `routes/task_instances.py` and its auth is declared on the parent in `routes/__init__.py`.
+- FastAPI router-level `dependencies=[...]` propagation. Module-level `router = APIRouter(dependencies=[Security(...)])` declarations are pre-walked once per file, then merged onto every `@.(...)` route attached in the same file. Closes airflow's execution-API routes that re-use a single `ti_id_router` declared once at module scope.
+- FastAPI `Security(callable, scopes=[...])` recognised distinctly from `Depends(callable)`. Scoped Security promotes the synthetic `AuthCheck` to `AuthCheckKind::Other` (route-level scope-checked authorization), not just Login. New scope-tracking boolean threaded through `expand_decorator_calls` and `extract_fastapi_dependencies`.
+- SQLAlchemy query-builder chained-call recognition. `select(X).filter_by(...)`, `query(X).filter(...)`, `select().join().where()` chains now anchor through the chain root primitive when the chain receiver type is opaque. New `db_query_builder_roots` config (Python defaults: `select`, `query`). Closes airflow `session.scalar(select(C).filter_by(conn_id=user_input))` shapes that previously dropped under the chained-call suppression in `classify_sink_class`.
+- Python non-sink container constructor recognition. Bare-callee form `set()` / `dict()` / `list()` / `tuple()` / `frozenset()` / `defaultdict(...)` is now treated as a non-sink constructor, so `verified_ids = set(); verified_ids.update(myteams)` does not classify the `.update` call as `DbMutation`. Type-annotation hint form `set[int]` / `dict[str, int]` recognised via PEP 585 generic suffix strip alongside the existing angle-bracket strip. Closes the sentry `api/helpers/teams.py` shape.
+- Python `request.match_info` source label (aiohttp path-parameter source).
+- Receiver-side validator registry. New `labels::lookup_receiver_validator(lang, callee)` clears `Cap` from the receiver value (and call equivalents) on success, distinct from `Sanitizer` which clears caps from the return value. Python registers `relative_to → Cap::FILE_IO` so `path.relative_to(base)` (raises `ValueError` when `path` escapes `base`) drops the file-IO cap on the path. Closes the CVE-2024-23334 patched aiohttp `static_root_path.joinpath(filename).resolve().relative_to(static_root_path)` shape.
+- JS/TS Array-method validator-callback narrowing. `arr.filter(isSafeIdentifier)`, `arr.find(isValidId)`, `arr.findLast(...)` with a `BooleanTrueIsValid` callback (`isValid…`, `isSafe…`, `hasValid…` and snake-case variants) propagate `validated_must` through the call's return value. Resolves callback name from both `info.arg_callees` (call-shape arguments) and SSA `value_defs[v].var_name` (bare-identifier callbacks, the dominant patched-CVE form). Strict-additive: anonymous arrows / opaque identifiers leave existing propagation untouched. `findIndex` / `every` / `some` excluded (scalar return shape). Motivated by CVE-2026-42353 (i18next-http-middleware path traversal).
+- JS/TS ternary-branch source classification. `let arr = cond ? req.query.lng : "";` previously lowered each branch to a labelless Assign with empty uses, the join phi saw no taint, and downstream sinks missed the flow. `lower_ternary_branch` now runs `first_member_label` (segment-strip-and-retry classifier) on the branch AST when no `Source` label is already attached. New `cfg/cfg_tests.rs` covers the lowering shape.
+- Java JPA / Hibernate Criteria API as structural SQL. New `TypeKind::JpaCriteriaQuery` for `CriteriaQuery`, `CriteriaUpdate`, `CriteriaDelete`, `Subquery`, `TypedQuery`. New `cfg-unguarded-sink` SQL_QUERY suppression `sink_args_jpa_criteria_query_safe` clears the finding when any positional argument to the sink call is JpaCriteriaQuery-typed (receiver excluded; receiver of `session.createQuery(cq)` is the Session/EntityManager channel, never the SQL payload). Closes the dominant FP cluster on openmrs (169 of 216 cfg-unguarded-sink), xwiki, keycloak Hibernate DAO methods that build `cb.createQuery(Foo.class)` + Root/Predicate API queries.
+- Java/Kotlin `cb.createQuery(...)`, `em.getCriteriaBuilder()`, and the JpaCriteriaQuery type chain inferred via constructor/factory return-type hints (extends the existing type-inference pipeline in `type_facts.rs`).
+- PHP `fopen` modeled as `Sink(Cap::SSRF)`. Same SSRF/LFI dual-vector shape as `file_get_contents`; fires only on tainted argument. Closes CVE-2026-33486 (roadiz/documents `DownloadedFile::fromUrl` static method wrapping `fopen($url, 'r')`).
+- PHP unary-op-expression negation recognition. tree-sitter-php emits `unary_op_expression` for unary `!` (and `-`/`+`/`~`); CFG `detect_negation` and condition-chain decomposition now match it. Without this, `if (!validate($x))` carried `condition_negated=false` and the True branch was treated as the validated path even though it is the rejection path. New PHP fixture `safe_camelcase_validator_negated.php` pins the lowering.
+- PHP `Serializable::unserialize($input)` magic-method passthrough recognition. The legacy `Serializable` interface contract (deprecated since PHP 8.1) requires the implementation to call `\unserialize($input)` on the formal parameter inside `public function unserialize($x) { ... }`. PHP itself invokes the method when restoring an instance, so the body's call cannot be removed without breaking the interface. `php.deser.unserialize` now suppresses inside this exact shape (method named `unserialize`, single formal, bare-parameter argument). Class-level `Serializable` implementation is the actionable signal (fix is migration to `__serialize` / `__unserialize`). Closes joomla / drupal Serializable-implementing class FPs.
+- PHP container kinds: `declaration_list`, `interface_declaration`, `trait_declaration`, `enum_declaration`, `enum_declaration_list` mapped to `Kind::Block` so methods inside them participate in CFG construction.
+- Go DAO-helper id-scalar precision pass. For non-route Go units, a parameter whose declared type is a bounded primitive scalar (`int64`, `uint32`, `string`, `bool`, `byte`, `rune`, `float64`, …) and whose name is id-shaped (`id`, `*Id`, `*_id`, `*ids`) is dropped from `unit.params` before ownership-check evaluation. Real Go HTTP handlers always carry a framework-request-typed param (`*http.Request`, `*gin.Context`, `echo.Context`, `*fiber.Ctx`); per-framework route extractors set `include_id_like_typed=true` so id-shaped path params survive on real routes. Mirrors the existing Python `is_python_id_like_typed_param` filter. Closes ~957 `go.auth.missing_ownership_check` findings on gitea backend DAO helpers (`func GetRunByRepoAndID(ctx, repoID, runID int64)`, `func DeleteRunner(ctx, id int64)`, the entire `models/...` layer where the ownership check sits in the calling route handler) and equivalent shapes in minio / Go ORM codebases.
+- Bare-callee verb-name fallback gate. `list(...)`, `filter(...)`, `update(...)`, `create_audit_entry(...)`, `update_coding_agent_state(...)` (no receiver dot at all) no longer classify as `DbMutation` / `DbCrossTenantRead` via the loose verb-name fallback. Real ORM/DB calls always carry a receiver (`User.find(id)`, `Model.objects.filter`, `repo.save(x)`); a bare `list(events)` is the Python builtin and `filter(fn, xs)` is `Iterable.filter`. The realtime / outbound / cache prefix dispatches still match by chain root. New helper `receiver_is_simple_chain(callee)` requires a non-chained receiver dot.
+- Go variadic `parameter_declaration` named-field handling for `collect_param_names`. `name` and `type` named fields read directly so type-segment identifiers no longer pollute the param-name set (`info *PackageInfo` no longer contributes `PackageInfo`).
+- Phase 1 caller-scope IPA: same-file route-handler-to-helper auth lift. New `apply_caller_scope_propagation` walks every non-route helper unit; if its in-file callers are non-empty AND every caller is itself an authorized route handler (route-level non-Login auth check) or already authorized via this same propagation, the caller's checks lift onto the helper as synthetic `is_route_level=true` `AuthCheck`s. Iterated to a small fixpoint so transitive helper chains (`route → mid_helper → leaf_helper`) are covered. Refuses to authorize helpers with no in-file caller, helpers called from a mix of authorized and unauthorized callers, and helpers called only from un-lifted helpers. Cross-file equivalent deferred (see `deep_engine_fixes.md`). Closes the dominant FastAPI / Django / Flask "route authenticates via decorator/dependency, then delegates to a private helper that performs the sink" FP shape on sentry / saleor / airflow.
+- New Python pattern `py.xss.make_response_format` (Tier B). Flask `make_response()` reflection. Recognises both bare `make_response(...)` and `flask.make_response(...)`. Closes CVE-2023-6568 (mlflow auth `create_user` reflected the attacker-controlled `Content-Type` header into the response body via `make_response(f"Invalid content type: '{content_type}'", 400)`).
+- C CVE corpus extended. CVE-2017-1000117 (git argv injection via `ssh://-oProxyCommand=…`) vulnerable + patched fixtures under `tests/benchmark/cve_corpus/c/CVE-2017-1000117/`. Three-layer engine gap deferred (array-element taint propagation, `c.cmdi.exec*` AST patterns, dash-prefix-byte sanitizer recognition).
+- Python CVE corpus extended. CVE-2023-6568 (mlflow XSS), CVE-2024-21513 (langchain SQL/JINJA), CVE-2024-23334 (aiohttp static-file path traversal) vulnerable + patched fixtures.
+- PHP CVE corpus extended. CVE-2026-33486 (roadiz/documents SSRF) vulnerable + patched fixtures.
+- JavaScript CVE corpus extended. CVE-2026-42353 (i18next-http-middleware path traversal) vulnerable + patched fixtures.
+- Cross-file FastAPI integration test `tests/fastapi_cross_file_include_router_tests.rs` with airflow-shaped fixture tree under `tests/fixtures/auth_cross_file/airflow_execution_api_includes/`.
+- Per-language safe / vuln Python auth fixtures: `safe_local_set_update_no_orm.py`, `vuln_local_set_with_user_id_query.py`, `vuln_fastapi_route_no_dependencies_sqla.py`, `vuln_fastapi_route_security_no_scopes.py`, `safe_fastapi_route_security_scopes.py`, `vuln_fastapi_router_no_dependencies.py`, `safe_fastapi_router_level_security_scopes.py`, `safe_bare_callee_no_receiver.py`, `vuln_caller_scope_helper_under_bare_route.py`, `safe_caller_scope_helper_under_authorized_route.py`, `safe_relative_to_validator.py`, `path_traversal_no_relative_to.py`. Java `SafeJpaCriteriaQuery.java`. Go `safe_dao_helper_id_scalar.go`, `vuln_repo_findbyid_no_auth.go`. PHP `ssrf_class_method_fopen.php`, `safe_camelcase_validator_negated.php`, `safe_serializable_magic_method_unserialize.php`, `vuln_serialize_method_named_unserialize_with_user_input.php`. JS `path_traversal_ternary_source.js`, `safe_ternary_const_branches.js`. TS `safe_session_user_id_copy.ts`, `vuln_target_user_id_no_check.ts`.
+
+### Performance
+
+- Hoisted `collect_top_level_units` out of the per-extractor loop in `extract_authorization_model`. Multi-extractor languages (Go gin+echo, JS/TS express+koa+fastify, Python flask+django, Rust axum+actix_web+rocket, Ruby sinatra) re-walked the entire AST and rebuilt the `Function`-kind unit set per extractor (then deduped by span). New `AuthExtractor::requires_top_level_units()` opt-out for Spring / Rails which build their own. Was 46% of `extract_authorization_model` wall-clock on the mattermost/server/channels/app subtree.
+- Single `AuthorizationModel` build per file in fused mode. Pre-fix the diag path and the per-file summary path each ran their own `extract_authorization_model`, duplicating the hoisted unit pass + every framework extractor's AST walk. Auth summaries extracted from the base model (pre var-types, pre-helper-lifting) so the persisted per-file summary matches the legacy `extract_auth_summaries_by_key` path bit-for-bit.
+- O(N) shallow value-ref emission in `collect_unit_state`. Previous per-node `extract_value_refs(node, bytes)` walked the entire subtree on every recursion level (O(N²) per body); the recursion below already visits every descendant once. New `append_shallow_value_ref` emits the node's own ref and lets recursion handle the descent. Public callers of `extract_value_refs` (`collect_call`, `collect_condition`, assignment-side extraction) keep the deep walk. Was ~17%+15%+11% of wall-clock split across `build_function_unit_with_meta`, `collect_unit_state`, and `extract_value_refs` on mattermost/server/channels/app.
+- Per-`ParsedFile` `body_const_facts_cache: OnceCell`. SSA + const-prop + type-fact build was running 2-3× per body across `run_cfg_analyses_with_lowered`, `run_auth_analyses`, and `collect_file_var_types`. Single-pass cache; gin profile dropped from 13.6% to ~4.5%.
+- Sparse Conditional Constant Propagation switched from `HashMap` and `HashSet<(BlockId, BlockId)>` to dense `Vec` per-value lattice and per-destination predecessor `SmallVec<[BlockId; 2]>`. The inner SCCP fixed-point loop no longer SipHashes a 64-bit pair for every operand of every phi. Public `ConstPropResult` shape unchanged (one final O(num_values) HashMap conversion).
+- `GlobalSummaries.by_key` switched from stdlib SipHash `HashMap` to `FxHashMap` (rustc-hash 2.1). `FuncKey` carries 3 String fields, so any HashMap operation hashes ≥30 bytes; FxHash is ~5× faster on this workload. Seed is fixed (no DoS hardening), fine for an in-process index keyed by program-derived names.
+- `large_go_module.go` perf fixture (1493 lines) added to `benches/perf_fixtures/`; `benches/scan_bench.rs` extended with auth-extractor, SCCP, and summary-resolution rows.
+
+### Fixed (false positives)
+
+- ~957 gitea backend DAO `go.auth.missing_ownership_check` findings (id-scalar precision pass, see Added).
+- 169 of 216 openmrs `cfg-unguarded-sink` findings (JpaCriteriaQuery type, see Added). Equivalent reductions on xwiki / keycloak Hibernate DAO clusters.
+- joomla and drupal `php.deser.unserialize` flagged inside `Serializable::unserialize($input)` magic-method bodies (passthrough recognition, see Added).
+- airflow execution-API routes flagged `missing_ownership_check` despite being authorized via cross-file `include_router` chains and module-level `APIRouter(dependencies=[…])` declarations (router_facts + router-level dep propagation, see Added).
+- sentry `verified_ids = set(); verified_ids.update(myteams)` flagged as `DbMutation` (Python container constructor recognition, see Added).
+- aiohttp `path.relative_to(static_root_path)` rejected as a path-traversal validator (receiver-side validator registry, see Added).
+- i18next-http-middleware `arr.filter(utils.isSafeIdentifier)` not narrowing taint on the result (Array-method validator-callback narrowing, see Added).
+- `cond ? req.query.lng : ""` ternary lost `Source` label on the truthy branch (ternary-branch source classification, see Added).
+- `if (!validate($x))` rejection-arm narrowing flipped on PHP unary `!` (unary_op_expression recognition, see Added).
+- mlflow `make_response(f"Invalid content type: '{content_type}'")` (Tier B pattern, see Added).
+- Bare-callee verb-name dispatch on Python builtins / locally-defined helpers (`list`, `filter`, `update`, `create_audit_entry`, `update_coding_agent_state`, see Added).
+- FastAPI `Depends(...)` / `Security(...)` deps declared on a module-level `APIRouter` no longer dropped on every attached route.
+- FastAPI `Security(callable, scopes=[...])` no longer downgraded to a Login-only check.
+
+### Other
+
+- New `cfg/cfg_tests.rs` covers ternary-branch CFG lowering shapes.
+- New `summary/tests.rs` covers cross-file `include_router` summary persistence and resolution.
+- Refactor passes across `auth_analysis`, `ssa/const_prop`, `ssa/type_facts`, `summary`, and the per-framework auth extractors (cleaner conditional checks, simpler function signatures, deduplicated assertions). No behaviour change.
+
 ## [0.6.1] - 2026-05-03
 
 A precision pass on auth and resource analysis plus three fresh CVE corpus pairs, plus a UTF-8 slice panic in the path abstract domain. Closes ~1900 Go auth FPs on gitea-shaped helpers, the mastodon/diaspora private-callback Ruby controller pattern, and a phantom-taint outbreak from JS/TS / Java lambda shorthand in jest-style nested test callbacks.
diff --git a/Cargo.lock b/Cargo.lock
index 7ad614bf..21687652 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1162,6 +1162,7 @@ dependencies = [
  "rayon",
  "rmp-serde",
  "rusqlite",
+ "rustc-hash",
  "serde",
  "serde_json",
  "smallvec",
@@ -1577,6 +1578,12 @@ dependencies = [
  "sqlite-wasm-rs",
 ]
 
+[[package]]
+name = "rustc-hash"
+version = "2.1.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "94300abf3f1ae2e2b8ffb7b58043de3d399c73fa6f4b73826402a5c457614dbe"
+
 [[package]]
 name = "rustix"
 version = "1.1.4"
diff --git a/Cargo.toml b/Cargo.toml
index e52cb24a..2b39957d 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -113,6 +113,7 @@ bitflags = "2.11.0"
 phf = { version = "0.13.1", features = ["macros"] }
 indicatif = "0.18.4"
 smallvec = { version = "1.15", features = ["serde"] }
+rustc-hash = "2.1"
 uuid = { version = "1", features = ["v4"] }
 axum = { version = "0.8", optional = true }
 tokio = { version = "1", features = ["rt-multi-thread", "macros", "signal", "sync"], optional = true }
diff --git a/THIRDPARTY-LICENSES.html b/THIRDPARTY-LICENSES.html
index 3602f5fa..f5067aad 100644
--- a/THIRDPARTY-LICENSES.html
+++ b/THIRDPARTY-LICENSES.html
@@ -44,7 +44,7 @@

Overview of licenses:

-  • Apache License 2.0 (159)
+  • Apache License 2.0 (160)
   • MIT License (71)
   • zlib License (2)
   • BSD 2-Clause "Simplified" License (1)
@@ -4352,6 +4352,7 @@ limitations under the License.
   • proc-macro2 1.0.106
   • quote 1.0.45
   • rand 0.10.1
+  • rustc-hash 2.1.2
   • ryu 1.0.23
   • serde 1.0.228
   • serde_core 1.0.228
diff --git a/benches/perf_fixtures/large_go_module.go b/benches/perf_fixtures/large_go_module.go
new file mode 100644
index 00000000..04dcb36c
--- /dev/null
+++ b/benches/perf_fixtures/large_go_module.go
@@ -0,0 +1,1493 @@
+// Copyright 2014 Manu Martinez-Almeida. All rights reserved.
+// Use of this source code is governed by a MIT style
+// license that can be found in the LICENSE file.
+//
+// Source: gin-gonic/gin context.go (MIT-licensed); copied verbatim as a
+// realistic perfhunt fixture covering many function bodies (~147 fns,
+// ~1.5k lines) so per-body analysis caching is exercised at scale.
+
+package gin
+
+import (
+	"errors"
+	"fmt"
+	"io"
+	"io/fs"
+	"log"
+	"maps"
+	"math"
+	"mime/multipart"
+	"net"
+	"net/http"
+	"net/url"
+	"os"
+	"path/filepath"
+	"strings"
+	"sync"
+	"time"
+
+	"github.com/gin-contrib/sse"
+	"github.com/gin-gonic/gin/binding"
+	"github.com/gin-gonic/gin/render"
+)
+
+// Content-Type MIME of the most common data formats.
+const (
+	MIMEJSON              = binding.MIMEJSON
+	MIMEHTML              = binding.MIMEHTML
+	MIMEXML               = binding.MIMEXML
+	MIMEXML2              = binding.MIMEXML2
+	MIMEPlain             = binding.MIMEPlain
+	MIMEPOSTForm          = binding.MIMEPOSTForm
+	MIMEMultipartPOSTForm = binding.MIMEMultipartPOSTForm
+	MIMEYAML              = binding.MIMEYAML
+	MIMEYAML2             = binding.MIMEYAML2
+	MIMETOML              = binding.MIMETOML
+	MIMEPROTOBUF          = binding.MIMEPROTOBUF
+	MIMEBSON              = binding.MIMEBSON
+)
+
+// BodyBytesKey indicates a default body bytes key.
+const BodyBytesKey = "_gin-gonic/gin/bodybyteskey"
+
+// ContextKey is the key that a Context returns itself for.
+const ContextKey = "_gin-gonic/gin/contextkey"
+
+type ContextKeyType int
+
+const ContextRequestKey ContextKeyType = 0
+
+// abortIndex represents a typical value used in abort functions.
+const abortIndex int8 = math.MaxInt8 >> 1
+
+// Context is the most important part of gin. It allows us to pass variables between middleware,
+// manage the flow, validate the JSON of a request and render a JSON response for example.
+type Context struct {
+	writermem responseWriter
+	Request   *http.Request
+	Writer    ResponseWriter
+
+	Params   Params
+	handlers HandlersChain
+	index    int8
+	fullPath string
+
+	engine       *Engine
+	params       *Params
+	skippedNodes *[]skippedNode
+
+	// This mutex protects Keys map.
+	mu sync.RWMutex
+
+	// Keys is a key/value pair exclusively for the context of each request.
+	Keys map[any]any
+
+	// Errors is a list of errors attached to all the handlers/middlewares who used this context.
+	Errors errorMsgs
+
+	// Accepted defines a list of manually accepted formats for content negotiation.
+	Accepted []string
+
+	// queryCache caches the query result from c.Request.URL.Query().
+	queryCache url.Values
+
+	// formCache caches c.Request.PostForm, which contains the parsed form data from POST, PATCH,
+	// or PUT body parameters.
+	formCache url.Values
+
+	// SameSite allows a server to define a cookie attribute making it impossible for
+	// the browser to send this cookie along with cross-site requests.
+	sameSite http.SameSite
+}
+
+/************************************/
+/********** CONTEXT CREATION ********/
+/************************************/
+
+func (c *Context) reset() {
+	c.Writer = &c.writermem
+	c.Params = c.Params[:0]
+	c.handlers = nil
+	c.index = -1
+
+	c.fullPath = ""
+	c.Keys = nil
+	c.Errors = c.Errors[:0]
+	c.Accepted = nil
+	c.queryCache = nil
+	c.formCache = nil
+	c.sameSite = 0
+	*c.params = (*c.params)[:0]
+	*c.skippedNodes = (*c.skippedNodes)[:0]
+}
+
+// Copy returns a copy of the current context that can be safely used outside the request's scope.
+// This has to be used when the context has to be passed to a goroutine.
+func (c *Context) Copy() *Context {
+	cp := Context{
+		writermem: c.writermem,
+		Request:   c.Request,
+		engine:    c.engine,
+	}
+
+	cp.writermem.ResponseWriter = nil
+	cp.Writer = &cp.writermem
+	cp.index = abortIndex
+	cp.handlers = nil
+	cp.fullPath = c.fullPath
+
+	cKeys := c.Keys
+	c.mu.RLock()
+	cp.Keys = maps.Clone(cKeys)
+	c.mu.RUnlock()
+
+	cParams := c.Params
+	cp.Params = make([]Param, len(cParams))
+	copy(cp.Params, cParams)
+
+	return &cp
+}
+
+// HandlerName returns the main handler's name. For example if the handler is "handleGetUsers()",
+// this function will return "main.handleGetUsers".
+func (c *Context) HandlerName() string {
+	return nameOfFunction(c.handlers.Last())
+}
+
+// HandlerNames returns a list of all registered handlers for this context in descending order,
+// following the semantics of HandlerName()
+func (c *Context) HandlerNames() []string {
+	hn := make([]string, 0, len(c.handlers))
+	for _, val := range c.handlers {
+		if val == nil {
+			continue
+		}
+		hn = append(hn, nameOfFunction(val))
+	}
+	return hn
+}
+
+// Handler returns the main handler.
+func (c *Context) Handler() HandlerFunc {
+	return c.handlers.Last()
+}
+
+// FullPath returns a matched route full path. For not found routes
+// returns an empty string.
+//
+//	router.GET("/user/:id", func(c *gin.Context) {
+//	    c.FullPath() == "/user/:id" // true
+//	})
+func (c *Context) FullPath() string {
+	return c.fullPath
+}
+
+/************************************/
+/*********** FLOW CONTROL ***********/
+/************************************/
+
+// Next should be used only inside middleware.
+// It executes the pending handlers in the chain inside the calling handler.
+// See example in GitHub.
+func (c *Context) Next() {
+	c.index++
+	for c.index < safeInt8(len(c.handlers)) {
+		if c.handlers[c.index] != nil {
+			c.handlers[c.index](c)
+		}
+		c.index++
+	}
+}
+
+// IsAborted returns true if the current context was aborted.
+func (c *Context) IsAborted() bool {
+	return c.index >= abortIndex
+}
+
+// Abort prevents pending handlers from being called. Note that this will not stop the current handler.
+// Let's say you have an authorization middleware that validates that the current request is authorized.
+// If the authorization fails (ex: the password does not match), call Abort to ensure the remaining handlers
+// for this request are not called.
+func (c *Context) Abort() {
+	c.index = abortIndex
+}
+
+// AbortWithStatus calls `Abort()` and writes the headers with the specified status code.
+// For example, a failed attempt to authenticate a request could use: context.AbortWithStatus(401).
+func (c *Context) AbortWithStatus(code int) {
+	c.Status(code)
+	c.Writer.WriteHeaderNow()
+	c.Abort()
+}
+
+// AbortWithStatusPureJSON calls `Abort()` and then `PureJSON` internally.
+// This method stops the chain, writes the status code and return a JSON body without escaping.
+// It also sets the Content-Type as "application/json".
+func (c *Context) AbortWithStatusPureJSON(code int, jsonObj any) {
+	c.Abort()
+	c.PureJSON(code, jsonObj)
+}
+
+// AbortWithStatusJSON calls `Abort()` and then `JSON` internally.
+// This method stops the chain, writes the status code and return a JSON body.
+// It also sets the Content-Type as "application/json".
+func (c *Context) AbortWithStatusJSON(code int, jsonObj any) {
+	c.Abort()
+	c.JSON(code, jsonObj)
+}
+
+// AbortWithError calls `AbortWithStatus()` and `Error()` internally.
+// This method stops the chain, writes the status code and pushes the specified error to `c.Errors`.
+// See Context.Error() for more details.
+func (c *Context) AbortWithError(code int, err error) *Error {
+	c.AbortWithStatus(code)
+	return c.Error(err)
+}
+
+/************************************/
+/********* ERROR MANAGEMENT *********/
+/************************************/
+
+// Error attaches an error to the current context. The error is pushed to a list of errors.
+// It's a good idea to call Error for each error that occurred during the resolution of a request.
+// A middleware can be used to collect all the errors and push them to a database together,
+// print a log, or append it in the HTTP response.
+// Error will panic if err is nil.
+func (c *Context) Error(err error) *Error {
+	if err == nil {
+		panic("err is nil")
+	}
+
+	var parsedError *Error
+	ok := errors.As(err, &parsedError)
+	if !ok {
+		parsedError = &Error{
+			Err:  err,
+			Type: ErrorTypePrivate,
+		}
+	}
+
+	c.Errors = append(c.Errors, parsedError)
+	return parsedError
+}
+
+/************************************/
+/******** METADATA MANAGEMENT********/
+/************************************/
+
+// Set is used to store a new key/value pair exclusively for this context.
+// It also lazy initializes c.Keys if it was not used previously.
+func (c *Context) Set(key any, value any) {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+	if c.Keys == nil {
+		c.Keys = make(map[any]any)
+	}
+
+	c.Keys[key] = value
+}
+
+// Get returns the value for the given key, ie: (value, true).
+// If the value does not exist it returns (nil, false)
+func (c *Context) Get(key any) (value any, exists bool) {
+	c.mu.RLock()
+	defer c.mu.RUnlock()
+	value, exists = c.Keys[key]
+	return
+}
+
+// MustGet returns the value for the given key if it exists, otherwise it panics.
+func (c *Context) MustGet(key any) any {
+	if value, exists := c.Get(key); exists {
+		return value
+	}
+	panic(fmt.Sprintf("key %v does not exist", key))
+}
+
+func getTyped[T any](c *Context, key any) (res T) {
+	if val, ok := c.Get(key); ok && val != nil {
+		res, _ = val.(T)
+	}
+	return
+}
+
+// GetString returns the value associated with the key as a string.
+func (c *Context) GetString(key any) string {
+	return getTyped[string](c, key)
+}
+
+// GetBool returns the value associated with the key as a boolean.
+func (c *Context) GetBool(key any) bool {
+	return getTyped[bool](c, key)
+}
+
+// GetInt returns the value associated with the key as an integer.
+func (c *Context) GetInt(key any) int {
+	return getTyped[int](c, key)
+}
+
+// GetInt8 returns the value associated with the key as an integer 8.
+func (c *Context) GetInt8(key any) int8 {
+	return getTyped[int8](c, key)
+}
+
+// GetInt16 returns the value associated with the key as an integer 16.
+func (c *Context) GetInt16(key any) int16 {
+	return getTyped[int16](c, key)
+}
+
+// GetInt32 returns the value associated with the key as an integer 32.
+func (c *Context) GetInt32(key any) int32 {
+	return getTyped[int32](c, key)
+}
+
+// GetInt64 returns the value associated with the key as an integer 64.
+func (c *Context) GetInt64(key any) int64 {
+	return getTyped[int64](c, key)
+}
+
+// GetUint returns the value associated with the key as an unsigned integer.
+func (c *Context) GetUint(key any) uint {
+	return getTyped[uint](c, key)
+}
+
+// GetUint8 returns the value associated with the key as an unsigned integer 8.
+func (c *Context) GetUint8(key any) uint8 {
+	return getTyped[uint8](c, key)
+}
+
+// GetUint16 returns the value associated with the key as an unsigned integer 16.
+func (c *Context) GetUint16(key any) uint16 {
+	return getTyped[uint16](c, key)
+}
+
+// GetUint32 returns the value associated with the key as an unsigned integer 32.
+func (c *Context) GetUint32(key any) uint32 {
+	return getTyped[uint32](c, key)
+}
+
+// GetUint64 returns the value associated with the key as an unsigned integer 64.
+func (c *Context) GetUint64(key any) uint64 {
+	return getTyped[uint64](c, key)
+}
+
+// GetFloat32 returns the value associated with the key as a float32.
+func (c *Context) GetFloat32(key any) float32 {
+	return getTyped[float32](c, key)
+}
+
+// GetFloat64 returns the value associated with the key as a float64.
+func (c *Context) GetFloat64(key any) float64 {
+	return getTyped[float64](c, key)
+}
+
+// GetTime returns the value associated with the key as time.
+func (c *Context) GetTime(key any) time.Time {
+	return getTyped[time.Time](c, key)
+}
+
+// GetDuration returns the value associated with the key as a duration.
+func (c *Context) GetDuration(key any) time.Duration {
+	return getTyped[time.Duration](c, key)
+}
+
+// GetError returns the value associated with the key as an error.
+func (c *Context) GetError(key any) error {
+	return getTyped[error](c, key)
+}
+
+// GetIntSlice returns the value associated with the key as a slice of integers.
+func (c *Context) GetIntSlice(key any) []int {
+	return getTyped[[]int](c, key)
+}
+
+// GetInt8Slice returns the value associated with the key as a slice of int8 integers.
+func (c *Context) GetInt8Slice(key any) []int8 {
+	return getTyped[[]int8](c, key)
+}
+
+// GetInt16Slice returns the value associated with the key as a slice of int16 integers.
+func (c *Context) GetInt16Slice(key any) []int16 {
+	return getTyped[[]int16](c, key)
+}
+
+// GetInt32Slice returns the value associated with the key as a slice of int32 integers.
+func (c *Context) GetInt32Slice(key any) []int32 {
+	return getTyped[[]int32](c, key)
+}
+
+// GetInt64Slice returns the value associated with the key as a slice of int64 integers.
+func (c *Context) GetInt64Slice(key any) []int64 {
+	return getTyped[[]int64](c, key)
+}
+
+// GetUintSlice returns the value associated with the key as a slice of unsigned integers.
+func (c *Context) GetUintSlice(key any) []uint {
+	return getTyped[[]uint](c, key)
+}
+
+// GetUint8Slice returns the value associated with the key as a slice of uint8 integers.
+func (c *Context) GetUint8Slice(key any) []uint8 {
+	return getTyped[[]uint8](c, key)
+}
+
+// GetUint16Slice returns the value associated with the key as a slice of uint16 integers.
+func (c *Context) GetUint16Slice(key any) []uint16 { + return getTyped[[]uint16](c, key) +} + +// GetUint32Slice returns the value associated with the key as a slice of uint32 integers. +func (c *Context) GetUint32Slice(key any) []uint32 { + return getTyped[[]uint32](c, key) +} + +// GetUint64Slice returns the value associated with the key as a slice of uint64 integers. +func (c *Context) GetUint64Slice(key any) []uint64 { + return getTyped[[]uint64](c, key) +} + +// GetFloat32Slice returns the value associated with the key as a slice of float32 numbers. +func (c *Context) GetFloat32Slice(key any) []float32 { + return getTyped[[]float32](c, key) +} + +// GetFloat64Slice returns the value associated with the key as a slice of float64 numbers. +func (c *Context) GetFloat64Slice(key any) []float64 { + return getTyped[[]float64](c, key) +} + +// GetStringSlice returns the value associated with the key as a slice of strings. +func (c *Context) GetStringSlice(key any) []string { + return getTyped[[]string](c, key) +} + +// GetErrorSlice returns the value associated with the key as a slice of errors. +func (c *Context) GetErrorSlice(key any) []error { + return getTyped[[]error](c, key) +} + +// GetStringMap returns the value associated with the key as a map of interfaces. +func (c *Context) GetStringMap(key any) map[string]any { + return getTyped[map[string]any](c, key) +} + +// GetStringMapString returns the value associated with the key as a map of strings. +func (c *Context) GetStringMapString(key any) map[string]string { + return getTyped[map[string]string](c, key) +} + +// GetStringMapStringSlice returns the value associated with the key as a map to a slice of strings. +func (c *Context) GetStringMapStringSlice(key any) map[string][]string { + return getTyped[map[string][]string](c, key) +} + +// Delete deletes the key from the Context's Key map, if it exists. 
+// This operation is safe to be used by concurrent go-routines +func (c *Context) Delete(key any) { + c.mu.Lock() + defer c.mu.Unlock() + if c.Keys != nil { + delete(c.Keys, key) + } +} + +/************************************/ +/************ INPUT DATA ************/ +/************************************/ + +// Param returns the value of the URL param. +// It is a shortcut for c.Params.ByName(key) +// +// router.GET("/user/:id", func(c *gin.Context) { +// // a GET request to /user/john +// id := c.Param("id") // id == "john" +// // a GET request to /user/john/ +// id := c.Param("id") // id == "/john/" +// }) +func (c *Context) Param(key string) string { + return c.Params.ByName(key) +} + +// AddParam adds param to context and +// replaces path param key with given value for e2e testing purposes +// Example Route: "/user/:id" +// AddParam("id", 1) +// Result: "/user/1" +func (c *Context) AddParam(key, value string) { + c.Params = append(c.Params, Param{Key: key, Value: value}) +} + +// Query returns the keyed url query value if it exists, +// otherwise it returns an empty string `("")`. +// It is shortcut for `c.Request.URL.Query().Get(key)` +// +// GET /path?id=1234&name=Manu&value= +// c.Query("id") == "1234" +// c.Query("name") == "Manu" +// c.Query("value") == "" +// c.Query("wtf") == "" +func (c *Context) Query(key string) (value string) { + value, _ = c.GetQuery(key) + return +} + +// DefaultQuery returns the keyed url query value if it exists, +// otherwise it returns the specified defaultValue string. +// See: Query() and GetQuery() for further information. 
+// +// GET /?name=Manu&lastname= +// c.DefaultQuery("name", "unknown") == "Manu" +// c.DefaultQuery("id", "none") == "none" +// c.DefaultQuery("lastname", "none") == "" +func (c *Context) DefaultQuery(key, defaultValue string) string { + if value, ok := c.GetQuery(key); ok { + return value + } + return defaultValue +} + +// GetQuery is like Query(), it returns the keyed url query value +// if it exists `(value, true)` (even when the value is an empty string), +// otherwise it returns `("", false)`. +// It is shortcut for `c.Request.URL.Query().Get(key)` +// +// GET /?name=Manu&lastname= +// ("Manu", true) == c.GetQuery("name") +// ("", false) == c.GetQuery("id") +// ("", true) == c.GetQuery("lastname") +func (c *Context) GetQuery(key string) (string, bool) { + if values, ok := c.GetQueryArray(key); ok { + return values[0], ok + } + return "", false +} + +// QueryArray returns a slice of strings for a given query key. +// The length of the slice depends on the number of params with the given key. +func (c *Context) QueryArray(key string) (values []string) { + values, _ = c.GetQueryArray(key) + return +} + +func (c *Context) initQueryCache() { + if c.queryCache == nil { + if c.Request != nil && c.Request.URL != nil { + c.queryCache = c.Request.URL.Query() + } else { + c.queryCache = url.Values{} + } + } +} + +// GetQueryArray returns a slice of strings for a given query key, plus +// a boolean value whether at least one value exists for the given key. +func (c *Context) GetQueryArray(key string) (values []string, ok bool) { + c.initQueryCache() + values, ok = c.queryCache[key] + return +} + +// QueryMap returns a map for a given query key. +func (c *Context) QueryMap(key string) (dicts map[string]string) { + dicts, _ = c.GetQueryMap(key) + return +} + +// GetQueryMap returns a map for a given query key, plus a boolean value +// whether at least one value exists for the given key. 
+func (c *Context) GetQueryMap(key string) (map[string]string, bool) { + c.initQueryCache() + return getMapFromFormData(c.queryCache, key) +} + +// PostForm returns the specified key from a POST urlencoded form or multipart form +// when it exists, otherwise it returns an empty string `("")`. +func (c *Context) PostForm(key string) (value string) { + value, _ = c.GetPostForm(key) + return +} + +// DefaultPostForm returns the specified key from a POST urlencoded form or multipart form +// when it exists, otherwise it returns the specified defaultValue string. +// See: PostForm() and GetPostForm() for further information. +func (c *Context) DefaultPostForm(key, defaultValue string) string { + if value, ok := c.GetPostForm(key); ok { + return value + } + return defaultValue +} + +// GetPostForm is like PostForm(key). It returns the specified key from a POST urlencoded +// form or multipart form when it exists `(value, true)` (even when the value is an empty string), +// otherwise it returns ("", false). +// For example, during a PATCH request to update the user's email: +// +// email=mail@example.com --> ("mail@example.com", true) := GetPostForm("email") // set email to "mail@example.com" +// email= --> ("", true) := GetPostForm("email") // set email to "" +// --> ("", false) := GetPostForm("email") // do nothing with email +func (c *Context) GetPostForm(key string) (string, bool) { + if values, ok := c.GetPostFormArray(key); ok { + return values[0], ok + } + return "", false +} + +// PostFormArray returns a slice of strings for a given form key. +// The length of the slice depends on the number of params with the given key. 
+func (c *Context) PostFormArray(key string) (values []string) { + values, _ = c.GetPostFormArray(key) + return +} + +func (c *Context) initFormCache() { + if c.formCache == nil { + c.formCache = make(url.Values) + req := c.Request + if err := req.ParseMultipartForm(c.engine.MaxMultipartMemory); err != nil { + if !errors.Is(err, http.ErrNotMultipart) { + debugPrint("error on parse multipart form array: %v", err) + } + } + c.formCache = req.PostForm + } +} + +// GetPostFormArray returns a slice of strings for a given form key, plus +// a boolean value whether at least one value exists for the given key. +func (c *Context) GetPostFormArray(key string) (values []string, ok bool) { + c.initFormCache() + values, ok = c.formCache[key] + return +} + +// PostFormMap returns a map for a given form key. +func (c *Context) PostFormMap(key string) (dicts map[string]string) { + dicts, _ = c.GetPostFormMap(key) + return +} + +// GetPostFormMap returns a map for a given form key, plus a boolean value +// whether at least one value exists for the given key. +func (c *Context) GetPostFormMap(key string) (map[string]string, bool) { + c.initFormCache() + return getMapFromFormData(c.formCache, key) +} + +// getMapFromFormData return a map which satisfies conditions. +// It parses from data with bracket notation like "key[subkey]=value" into a map. +func getMapFromFormData(m map[string][]string, key string) (map[string]string, bool) { + d := make(map[string]string) + found := false + keyLen := len(key) + + for k, v := range m { + if len(k) < keyLen+3 { // key + "[" + at least one char + "]" + continue + } + + if k[:keyLen] != key || k[keyLen] != '[' { + continue + } + + if j := strings.IndexByte(k[keyLen+1:], ']'); j > 0 { + found = true + d[k[keyLen+1:keyLen+1+j]] = v[0] + } + } + + return d, found +} + +// FormFile returns the first file for the provided form key. 
+func (c *Context) FormFile(name string) (*multipart.FileHeader, error) { + if c.Request.MultipartForm == nil { + if err := c.Request.ParseMultipartForm(c.engine.MaxMultipartMemory); err != nil { + return nil, err + } + } + f, fh, err := c.Request.FormFile(name) + if err != nil { + return nil, err + } + f.Close() + return fh, err +} + +// MultipartForm is the parsed multipart form, including file uploads. +func (c *Context) MultipartForm() (*multipart.Form, error) { + err := c.Request.ParseMultipartForm(c.engine.MaxMultipartMemory) + return c.Request.MultipartForm, err +} + +// SaveUploadedFile uploads the form file to specific dst. +func (c *Context) SaveUploadedFile(file *multipart.FileHeader, dst string, perm ...fs.FileMode) error { + src, err := file.Open() + if err != nil { + return err + } + defer src.Close() + + var mode os.FileMode = 0o750 + if len(perm) > 0 { + mode = perm[0] + } + dir := filepath.Dir(dst) + if err = os.MkdirAll(dir, mode); err != nil { + return err + } + if err = os.Chmod(dir, mode); err != nil { + return err + } + + out, err := os.Create(dst) + if err != nil { + return err + } + defer out.Close() + + _, err = io.Copy(out, src) + return err +} + +// Bind checks the Method and Content-Type to select a binding engine automatically, +// Depending on the "Content-Type" header different bindings are used, for example: +// +// "application/json" --> JSON binding +// "application/xml" --> XML binding +// +// It parses the request's body based on the Content-Type (e.g., JSON or XML). +// It decodes the payload into the struct specified as a pointer. +// It writes a 400 error and sets Content-Type header "text/plain" in the response if input is not valid. +func (c *Context) Bind(obj any) error { + b := binding.Default(c.Request.Method, c.ContentType()) + return c.MustBindWith(obj, b) +} + +// BindJSON is a shortcut for c.MustBindWith(obj, binding.JSON). 
+func (c *Context) BindJSON(obj any) error { + return c.MustBindWith(obj, binding.JSON) +} + +// BindXML is a shortcut for c.MustBindWith(obj, binding.XML). +func (c *Context) BindXML(obj any) error { + return c.MustBindWith(obj, binding.XML) +} + +// BindQuery is a shortcut for c.MustBindWith(obj, binding.Query). +func (c *Context) BindQuery(obj any) error { + return c.MustBindWith(obj, binding.Query) +} + +// BindYAML is a shortcut for c.MustBindWith(obj, binding.YAML). +func (c *Context) BindYAML(obj any) error { + return c.MustBindWith(obj, binding.YAML) +} + +// BindTOML is a shortcut for c.MustBindWith(obj, binding.TOML). +func (c *Context) BindTOML(obj any) error { + return c.MustBindWith(obj, binding.TOML) +} + +// BindPlain is a shortcut for c.MustBindWith(obj, binding.Plain). +func (c *Context) BindPlain(obj any) error { + return c.MustBindWith(obj, binding.Plain) +} + +// BindHeader is a shortcut for c.MustBindWith(obj, binding.Header). +func (c *Context) BindHeader(obj any) error { + return c.MustBindWith(obj, binding.Header) +} + +// BindUri binds the passed struct pointer using binding.Uri. +// It will abort the request with HTTP 400 if any error occurs. +func (c *Context) BindUri(obj any) error { + if err := c.ShouldBindUri(obj); err != nil { + c.AbortWithError(http.StatusBadRequest, err).SetType(ErrorTypeBind) //nolint: errcheck + return err + } + return nil +} + +// MustBindWith binds the passed struct pointer using the specified binding engine. +// It will abort the request with HTTP 400 if any error occurs. +// See the binding package.
+func (c *Context) MustBindWith(obj any, b binding.Binding) error { + err := c.ShouldBindWith(obj, b) + if err != nil { + var maxBytesErr *http.MaxBytesError + + // Note: When using sonic or go-json as JSON encoder, they do not propagate the http.MaxBytesError error + // https://github.com/goccy/go-json/issues/485 + // https://github.com/bytedance/sonic/issues/800 + switch { + case errors.As(err, &maxBytesErr): + c.AbortWithError(http.StatusRequestEntityTooLarge, err).SetType(ErrorTypeBind) //nolint: errcheck + default: + c.AbortWithError(http.StatusBadRequest, err).SetType(ErrorTypeBind) //nolint: errcheck + } + return err + } + return nil +} + +// ShouldBind checks the Method and Content-Type to select a binding engine automatically, +// Depending on the "Content-Type" header different bindings are used, for example: +// +// "application/json" --> JSON binding +// "application/xml" --> XML binding +// +// It parses the request's body based on the Content-Type (e.g., JSON or XML). +// It decodes the payload into the struct specified as a pointer. +// Like c.Bind() but this method does not set the response status code to 400 or abort if input is not valid. +func (c *Context) ShouldBind(obj any) error { + b := binding.Default(c.Request.Method, c.ContentType()) + return c.ShouldBindWith(obj, b) +} + +// ShouldBindJSON is a shortcut for c.ShouldBindWith(obj, binding.JSON). +// +// Example: +// +// POST /user +// Content-Type: application/json +// +// Request Body: +// { +// "name": "Manu", +// "age": 20 +// } +// +// type User struct { +// Name string `json:"name"` +// Age int `json:"age"` +// } +// +// var user User +// if err := c.ShouldBindJSON(&user); err != nil { +// c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()}) +// return +// } +// c.JSON(http.StatusOK, user) +func (c *Context) ShouldBindJSON(obj any) error { + return c.ShouldBindWith(obj, binding.JSON) +} + +// ShouldBindXML is a shortcut for c.ShouldBindWith(obj, binding.XML). 
+// It works like ShouldBindJSON but binds the request body as XML data. +func (c *Context) ShouldBindXML(obj any) error { + return c.ShouldBindWith(obj, binding.XML) +} + +// ShouldBindQuery is a shortcut for c.ShouldBindWith(obj, binding.Query). +// It works like ShouldBindJSON but binds query parameters from the URL. +func (c *Context) ShouldBindQuery(obj any) error { + return c.ShouldBindWith(obj, binding.Query) +} + +// ShouldBindYAML is a shortcut for c.ShouldBindWith(obj, binding.YAML). +// It works like ShouldBindJSON but binds the request body as YAML data. +func (c *Context) ShouldBindYAML(obj any) error { + return c.ShouldBindWith(obj, binding.YAML) +} + +// ShouldBindTOML is a shortcut for c.ShouldBindWith(obj, binding.TOML). +// It works like ShouldBindJSON but binds the request body as TOML data. +func (c *Context) ShouldBindTOML(obj any) error { + return c.ShouldBindWith(obj, binding.TOML) +} + +// ShouldBindPlain is a shortcut for c.ShouldBindWith(obj, binding.Plain). +// It works like ShouldBindJSON but binds plain text data from the request body. +func (c *Context) ShouldBindPlain(obj any) error { + return c.ShouldBindWith(obj, binding.Plain) +} + +// ShouldBindHeader is a shortcut for c.ShouldBindWith(obj, binding.Header). +// It works like ShouldBindJSON but binds values from HTTP headers. +func (c *Context) ShouldBindHeader(obj any) error { + return c.ShouldBindWith(obj, binding.Header) +} + +// ShouldBindUri binds the passed struct pointer using the specified binding engine. +// It works like ShouldBindJSON but binds parameters from the URI. +func (c *Context) ShouldBindUri(obj any) error { + m := make(map[string][]string, len(c.Params)) + for _, v := range c.Params { + m[v.Key] = []string{v.Value} + } + return binding.Uri.BindUri(m, obj) +} + +// ShouldBindWith binds the passed struct pointer using the specified binding engine. +// See the binding package. 
+func (c *Context) ShouldBindWith(obj any, b binding.Binding) error { + return b.Bind(c.Request, obj) +} + +// ShouldBindBodyWith is similar to ShouldBindWith, but it stores the request +// body in the context and reuses it when called again. +// +// NOTE: This method reads the body before binding. So you should use +// ShouldBindWith for better performance if you need to call only once. +func (c *Context) ShouldBindBodyWith(obj any, bb binding.BindingBody) (err error) { + var body []byte + if cb, ok := c.Get(BodyBytesKey); ok { + if cbb, ok := cb.([]byte); ok { + body = cbb + } + } + if body == nil { + body, err = io.ReadAll(c.Request.Body) + if err != nil { + return err + } + c.Set(BodyBytesKey, body) + } + return bb.BindBody(body, obj) +} + +// ShouldBindBodyWithJSON is a shortcut for c.ShouldBindBodyWith(obj, binding.JSON). +func (c *Context) ShouldBindBodyWithJSON(obj any) error { + return c.ShouldBindBodyWith(obj, binding.JSON) +} + +// ShouldBindBodyWithXML is a shortcut for c.ShouldBindBodyWith(obj, binding.XML). +func (c *Context) ShouldBindBodyWithXML(obj any) error { + return c.ShouldBindBodyWith(obj, binding.XML) +} + +// ShouldBindBodyWithYAML is a shortcut for c.ShouldBindBodyWith(obj, binding.YAML). +func (c *Context) ShouldBindBodyWithYAML(obj any) error { + return c.ShouldBindBodyWith(obj, binding.YAML) +} + +// ShouldBindBodyWithTOML is a shortcut for c.ShouldBindBodyWith(obj, binding.TOML). +func (c *Context) ShouldBindBodyWithTOML(obj any) error { + return c.ShouldBindBodyWith(obj, binding.TOML) +} + +// ShouldBindBodyWithPlain is a shortcut for c.ShouldBindBodyWith(obj, binding.Plain). +func (c *Context) ShouldBindBodyWithPlain(obj any) error { + return c.ShouldBindBodyWith(obj, binding.Plain) +} + +// ClientIP implements one best effort algorithm to return the real client IP. +// It calls c.RemoteIP() under the hood, to check if the remote IP is a trusted proxy or not.
+// If it is it will then try to parse the headers defined in Engine.RemoteIPHeaders (defaulting to [X-Forwarded-For, X-Real-IP]). +// If the headers are not syntactically valid OR the remote IP does not correspond to a trusted proxy, +// the remote IP (coming from Request.RemoteAddr) is returned. +func (c *Context) ClientIP() string { + // Check if we're running on a trusted platform, continue running backwards if error + if c.engine.TrustedPlatform != "" { + // Developers can define their own header of Trusted Platform or use predefined constants + if addr := c.requestHeader(c.engine.TrustedPlatform); addr != "" { + return addr + } + } + + // Legacy "AppEngine" flag + if c.engine.AppEngine { + log.Println(`The AppEngine flag is going to be deprecated. Please check issues #2723 and #2739 and use 'TrustedPlatform: gin.PlatformGoogleAppEngine' instead.`) + if addr := c.requestHeader("X-Appengine-Remote-Addr"); addr != "" { + return addr + } + } + + var ( + trusted bool + remoteIP net.IP + ) + // If gin is listening a unix socket, always trust it. + localAddr, ok := c.Request.Context().Value(http.LocalAddrContextKey).(net.Addr) + if ok && strings.HasPrefix(localAddr.Network(), "unix") { + trusted = true + } + + // Fallback + if !trusted { + // It also checks if the remoteIP is a trusted proxy or not. 
+ // In order to perform this validation, it will see if the IP is contained within at least one of the CIDR blocks + // defined by Engine.SetTrustedProxies() + remoteIP = net.ParseIP(c.RemoteIP()) + if remoteIP == nil { + return "" + } + trusted = c.engine.isTrustedProxy(remoteIP) + } + + if trusted && c.engine.ForwardedByClientIP && c.engine.RemoteIPHeaders != nil { + for _, headerName := range c.engine.RemoteIPHeaders { + headerValue := strings.Join(c.Request.Header.Values(headerName), ",") + ip, valid := c.engine.validateHeader(headerValue) + if valid { + return ip + } + } + } + return remoteIP.String() +} + +// RemoteIP parses the IP from Request.RemoteAddr, normalizes and returns the IP (without the port). +func (c *Context) RemoteIP() string { + ip, _, err := net.SplitHostPort(strings.TrimSpace(c.Request.RemoteAddr)) + if err != nil { + return "" + } + return ip +} + +// ContentType returns the Content-Type header of the request. +func (c *Context) ContentType() string { + return filterFlags(c.requestHeader("Content-Type")) +} + +// IsWebsocket returns true if the request headers indicate that a websocket +// handshake is being initiated by the client. +func (c *Context) IsWebsocket() bool { + if strings.Contains(strings.ToLower(c.requestHeader("Connection")), "upgrade") && + strings.EqualFold(c.requestHeader("Upgrade"), "websocket") { + return true + } + return false +} + +func (c *Context) requestHeader(key string) string { + return c.Request.Header.Get(key) +} + +/************************************/ +/******** RESPONSE RENDERING ********/ +/************************************/ + +// bodyAllowedForStatus is a copy of http.bodyAllowedForStatus non-exported function. +// Uses http.StatusContinue constant for better code clarity. 
+func bodyAllowedForStatus(status int) bool { + switch { + case status >= http.StatusContinue && status < http.StatusOK: + return false + case status == http.StatusNoContent: + return false + case status == http.StatusNotModified: + return false + } + return true +} + +// Status sets the HTTP response code. +func (c *Context) Status(code int) { + c.Writer.WriteHeader(code) +} + +// Header is an intelligent shortcut for c.Writer.Header().Set(key, value). +// It writes a header in the response. +// If value == "", this method removes the header `c.Writer.Header().Del(key)` +func (c *Context) Header(key, value string) { + if value == "" { + c.Writer.Header().Del(key) + return + } + c.Writer.Header().Set(key, value) +} + +// GetHeader returns value from request headers. +func (c *Context) GetHeader(key string) string { + return c.requestHeader(key) +} + +// GetRawData returns stream data. +func (c *Context) GetRawData() ([]byte, error) { + if c.Request.Body == nil { + return nil, errors.New("cannot read nil body") + } + return io.ReadAll(c.Request.Body) +} + +// SetSameSite with cookie +func (c *Context) SetSameSite(samesite http.SameSite) { + c.sameSite = samesite +} + +// SetCookie adds a Set-Cookie header to the ResponseWriter's headers. +// The provided cookie must have a valid Name. Invalid cookies may be +// silently dropped. +func (c *Context) SetCookie(name, value string, maxAge int, path, domain string, secure, httpOnly bool) { + if path == "" { + path = "/" + } + http.SetCookie(c.Writer, &http.Cookie{ + Name: name, + Value: url.QueryEscape(value), + MaxAge: maxAge, + Path: path, + Domain: domain, + SameSite: c.sameSite, + Secure: secure, + HttpOnly: httpOnly, + }) +} + +// SetCookieData adds a Set-Cookie header to the ResponseWriter's headers. +// It accepts a pointer to http.Cookie structure for more flexibility in setting cookie attributes. +// The provided cookie must have a valid Name. Invalid cookies may be silently dropped. 
+func (c *Context) SetCookieData(cookie *http.Cookie) { + if cookie.Path == "" { + cookie.Path = "/" + } + if cookie.SameSite == http.SameSiteDefaultMode { + cookie.SameSite = c.sameSite + } + http.SetCookie(c.Writer, cookie) +} + +// Cookie returns the named cookie provided in the request or +// ErrNoCookie if not found. And return the named cookie is unescaped. +// If multiple cookies match the given name, only one cookie will +// be returned. +func (c *Context) Cookie(name string) (string, error) { + cookie, err := c.Request.Cookie(name) + if err != nil { + return "", err + } + val, _ := url.QueryUnescape(cookie.Value) + return val, nil +} + +// Render writes the response headers and calls render.Render to render data. +func (c *Context) Render(code int, r render.Render) { + c.Status(code) + + if !bodyAllowedForStatus(code) { + r.WriteContentType(c.Writer) + c.Writer.WriteHeaderNow() + return + } + + if err := r.Render(c.Writer); err != nil { + // Pushing error to c.Errors + _ = c.Error(err) + c.Abort() + } +} + +// HTML renders the HTTP template specified by its file name. +// It also updates the HTTP code and sets the Content-Type as "text/html". +// See http://golang.org/doc/articles/wiki/ +func (c *Context) HTML(code int, name string, obj any) { + instance := c.engine.HTMLRender.Instance(name, obj) + c.Render(code, instance) +} + +// IndentedJSON serializes the given struct as pretty JSON (indented + endlines) into the response body. +// It also sets the Content-Type as "application/json". +// WARNING: we recommend using this only for development purposes since printing pretty JSON is +// more CPU and bandwidth consuming. Use Context.JSON() instead. +func (c *Context) IndentedJSON(code int, obj any) { + c.Render(code, render.IndentedJSON{Data: obj}) +} + +// SecureJSON serializes the given struct as Secure JSON into the response body. +// Default prepends "while(1)," to response body if the given struct is array values. 
+// It also sets the Content-Type as "application/json". +func (c *Context) SecureJSON(code int, obj any) { + c.Render(code, render.SecureJSON{Prefix: c.engine.secureJSONPrefix, Data: obj}) +} + +// JSONP serializes the given struct as JSON into the response body. +// It adds padding to response body to request data from a server residing in a different domain than the client. +// It also sets the Content-Type as "application/javascript". +func (c *Context) JSONP(code int, obj any) { + callback := c.DefaultQuery("callback", "") + if callback == "" { + c.Render(code, render.JSON{Data: obj}) + return + } + c.Render(code, render.JsonpJSON{Callback: callback, Data: obj}) +} + +// JSON serializes the given struct as JSON into the response body. +// It also sets the Content-Type as "application/json". +func (c *Context) JSON(code int, obj any) { + c.Render(code, render.JSON{Data: obj}) +} + +// AsciiJSON serializes the given struct as JSON into the response body with unicode to ASCII string. +// It also sets the Content-Type as "application/json". +func (c *Context) AsciiJSON(code int, obj any) { + c.Render(code, render.AsciiJSON{Data: obj}) +} + +// PureJSON serializes the given struct as JSON into the response body. +// PureJSON, unlike JSON, does not replace special html characters with their unicode entities. +func (c *Context) PureJSON(code int, obj any) { + c.Render(code, render.PureJSON{Data: obj}) +} + +// XML serializes the given struct as XML into the response body. +// It also sets the Content-Type as "application/xml". +func (c *Context) XML(code int, obj any) { + c.Render(code, render.XML{Data: obj}) +} + +// PDF writes the given PDF binary data into the response body. +// It also sets the Content-Type as "application/pdf". +func (c *Context) PDF(code int, data []byte) { + c.Render(code, render.PDF{Data: data}) +} + +// YAML serializes the given struct as YAML into the response body. 
+func (c *Context) YAML(code int, obj any) { + c.Render(code, render.YAML{Data: obj}) +} + +// TOML serializes the given struct as TOML into the response body. +func (c *Context) TOML(code int, obj any) { + c.Render(code, render.TOML{Data: obj}) +} + +// ProtoBuf serializes the given struct as ProtoBuf into the response body. +func (c *Context) ProtoBuf(code int, obj any) { + c.Render(code, render.ProtoBuf{Data: obj}) +} + +// BSON serializes the given struct as BSON into the response body. +func (c *Context) BSON(code int, obj any) { + c.Render(code, render.BSON{Data: obj}) +} + +// String writes the given string into the response body. +func (c *Context) String(code int, format string, values ...any) { + c.Render(code, render.String{Format: format, Data: values}) +} + +// Redirect returns an HTTP redirect to the specific location. +func (c *Context) Redirect(code int, location string) { + c.Render(-1, render.Redirect{ + Code: code, + Location: location, + Request: c.Request, + }) +} + +// Data writes some data into the body stream and updates the HTTP code. +func (c *Context) Data(code int, contentType string, data []byte) { + c.Render(code, render.Data{ + ContentType: contentType, + Data: data, + }) +} + +// DataFromReader writes the specified reader into the body stream and updates the HTTP code. +func (c *Context) DataFromReader(code int, contentLength int64, contentType string, reader io.Reader, extraHeaders map[string]string) { + c.Render(code, render.Reader{ + Headers: extraHeaders, + ContentType: contentType, + ContentLength: contentLength, + Reader: reader, + }) +} + +// File writes the specified file into the body stream in an efficient way. +func (c *Context) File(filepath string) { + http.ServeFile(c.Writer, c.Request, filepath) +} + +// FileFromFS writes the specified file from http.FileSystem into the body stream in an efficient way. 
+func (c *Context) FileFromFS(filepath string, fs http.FileSystem) { + defer func(old string) { + c.Request.URL.Path = old + }(c.Request.URL.Path) + + c.Request.URL.Path = filepath + + http.FileServer(fs).ServeHTTP(c.Writer, c.Request) +} + +var quoteEscaper = strings.NewReplacer("\\", "\\\\", `"`, "\\\"") + +func escapeQuotes(s string) string { + return quoteEscaper.Replace(s) +} + +// FileAttachment writes the specified file into the body stream in an efficient way +// On the client side, the file will typically be downloaded with the given filename +func (c *Context) FileAttachment(filepath, filename string) { + if isASCII(filename) { + c.Writer.Header().Set("Content-Disposition", `attachment; filename="`+escapeQuotes(filename)+`"`) + } else { + c.Writer.Header().Set("Content-Disposition", `attachment; filename*=UTF-8''`+url.QueryEscape(filename)) + } + http.ServeFile(c.Writer, c.Request, filepath) +} + +// SSEvent writes a Server-Sent Event into the body stream. +func (c *Context) SSEvent(name string, message any) { + c.Render(-1, sse.Event{ + Event: name, + Data: message, + }) +} + +// Stream sends a streaming response and returns a boolean +// indicates "Is client disconnected in middle of stream" +func (c *Context) Stream(step func(w io.Writer) bool) bool { + w := c.Writer + clientGone := w.CloseNotify() + for { + select { + case <-clientGone: + return true + default: + keepOpen := step(w) + w.Flush() + if !keepOpen { + return false + } + } + } +} + +/************************************/ +/******** CONTENT NEGOTIATION *******/ +/************************************/ + +// Negotiate contains all negotiations data. +type Negotiate struct { + Offered []string + HTMLName string + HTMLData any + JSONData any + XMLData any + YAMLData any + Data any + TOMLData any + PROTOBUFData any + BSONData any +} + +// Negotiate calls different Render according to acceptable Accept format. 
+func (c *Context) Negotiate(code int, config Negotiate) { + switch c.NegotiateFormat(config.Offered...) { + case binding.MIMEJSON: + data := chooseData(config.JSONData, config.Data) + c.JSON(code, data) + + case binding.MIMEHTML: + data := chooseData(config.HTMLData, config.Data) + c.HTML(code, config.HTMLName, data) + + case binding.MIMEXML: + data := chooseData(config.XMLData, config.Data) + c.XML(code, data) + + case binding.MIMEYAML, binding.MIMEYAML2: + data := chooseData(config.YAMLData, config.Data) + c.YAML(code, data) + + case binding.MIMETOML: + data := chooseData(config.TOMLData, config.Data) + c.TOML(code, data) + + case binding.MIMEPROTOBUF: + data := chooseData(config.PROTOBUFData, config.Data) + c.ProtoBuf(code, data) + + case binding.MIMEBSON: + data := chooseData(config.BSONData, config.Data) + c.BSON(code, data) + + default: + c.AbortWithError(http.StatusNotAcceptable, errors.New("the accepted formats are not offered by the server")) //nolint: errcheck + } +} + +// NegotiateFormat returns an acceptable Accept format. +func (c *Context) NegotiateFormat(offered ...string) string { + assert1(len(offered) > 0, "you must provide at least one offer") + + if c.Accepted == nil { + c.Accepted = parseAccept(c.requestHeader("Accept")) + } + if len(c.Accepted) == 0 { + return offered[0] + } + for _, accepted := range c.Accepted { + for _, offer := range offered { + // According to RFC 2616 and RFC 2396, non-ASCII characters are not allowed in headers, + // therefore we can just iterate over the string without casting it into []rune + i := 0 + for ; i < len(accepted) && i < len(offer); i++ { + if accepted[i] == '*' || offer[i] == '*' { + return offer + } + if accepted[i] != offer[i] { + break + } + } + if i == len(accepted) { + return offer + } + } + } + return "" +} + +// SetAccepted sets Accept header data. 
+func (c *Context) SetAccepted(formats ...string) { + c.Accepted = formats +} + +/************************************/ +/***** GOLANG.ORG/X/NET/CONTEXT *****/ +/************************************/ + +// hasRequestContext returns whether c.Request has Context and fallback. +func (c *Context) hasRequestContext() bool { + hasFallback := c.engine != nil && c.engine.ContextWithFallback + hasRequestContext := c.Request != nil && c.Request.Context() != nil + return hasFallback && hasRequestContext +} + +// Deadline returns that there is no deadline (ok==false) when c.Request has no Context. +func (c *Context) Deadline() (deadline time.Time, ok bool) { + if !c.hasRequestContext() { + return + } + return c.Request.Context().Deadline() +} + +// Done returns nil (chan which will wait forever) when c.Request has no Context. +func (c *Context) Done() <-chan struct{} { + if !c.hasRequestContext() { + return nil + } + return c.Request.Context().Done() +} + +// Err returns nil when c.Request has no Context. +func (c *Context) Err() error { + if !c.hasRequestContext() { + return nil + } + return c.Request.Context().Err() +} + +// Value returns the value associated with this context for key, or nil +// if no value is associated with key. Successive calls to Value with +// the same key returns the same result. +func (c *Context) Value(key any) any { + if key == ContextRequestKey { + return c.Request + } + if key == ContextKey { + return c + } + if keyAsString, ok := key.(string); ok { + if val, exists := c.Get(keyAsString); exists { + return val + } + } + if !c.hasRequestContext() { + return nil + } + return c.Request.Context().Value(key) +} diff --git a/benches/scan_bench.rs b/benches/scan_bench.rs index ad42c851..4c12c404 100644 --- a/benches/scan_bench.rs +++ b/benches/scan_bench.rs @@ -173,6 +173,266 @@ fn bench_classify(c: &mut Criterion) { }); } +/// Per-file fused analysis throughput on a realistic ~1.5k-line Go module +/// (gin context.go, ~147 fns). 
Guards the +/// `ParsedFile::body_const_facts_cache` optimization that collapses the +/// 2-3× per-body re-lowering that previously dominated `analyse_file_fused` +/// (~14% of wall-clock on the gin-scan profile). Regressions here mean +/// per-body work is being recomputed across passes again. +fn bench_analyse_file_fused_large_go(c: &mut Criterion) { + let fixture = Path::new("benches/perf_fixtures/large_go_module.go") + .canonicalize() + .expect("perf fixture"); + let bytes = std::fs::read(&fixture).expect("read fixture"); + let mut cfg = Config::default(); + cfg.scanner.mode = AnalysisMode::Full; + cfg.scanner.enable_state_analysis = true; + cfg.performance.worker_threads = Some(1); + + // One-shot diagnostic: count `build_body_const_facts` calls per fused + // analysis so a regression that removes the per-file cache surfaces here + // (expected ~148 calls on this fixture; pre-cache was ~444). + nyx_scanner::cfg_analysis::BUILD_BODY_CONST_FACTS_CALLS + .store(0, std::sync::atomic::Ordering::Relaxed); + let _ = nyx_scanner::ast::analyse_file_fused(&bytes, &fixture, &cfg, None, None) + .expect("warmup analyse"); + let calls = nyx_scanner::cfg_analysis::BUILD_BODY_CONST_FACTS_CALLS + .load(std::sync::atomic::Ordering::Relaxed); + eprintln!("[diag] build_body_const_facts calls per analyse_file_fused: {calls}"); + + c.bench_function("analyse_file_fused_large_go", |b| { + b.iter(|| { + nyx_scanner::ast::analyse_file_fused(&bytes, &fixture, &cfg, None, None) + .expect("analyse_file_fused") + }); + }); +} + +/// Per-file `extract_authorization_model` throughput on the realistic +/// ~1.5k-line Go fixture (gin context.go). Guards the +/// `extract_authorization_model` orchestrator hoist that pulled the +/// shared `collect_top_level_units` AST walk out of every supporting +/// extractor's `extract()` (one walk per file instead of one per +/// matching extractor). 
On Go files both `EchoExtractor` and +/// `GinExtractor` match by default — pre-hoist this bench measured the +/// AST being walked twice; regressions here mean the hoist has been +/// broken or a new Go extractor was added that re-walks the tree. +fn bench_extract_authorization_model_go(c: &mut Criterion) { + use tree_sitter::Parser; + + let fixture = Path::new("benches/perf_fixtures/large_go_module.go") + .canonicalize() + .expect("perf fixture"); + let bytes = std::fs::read(&fixture).expect("read fixture"); + + let mut parser = Parser::new(); + let go_lang: tree_sitter::Language = tree_sitter_go::LANGUAGE.into(); + parser.set_language(&go_lang).expect("set go grammar"); + let tree = parser.parse(&bytes, None).expect("parse fixture"); + + let cfg = Config::default(); + let rules = nyx_scanner::auth_analysis::config::build_auth_rules(&cfg, "go"); + + c.bench_function("extract_authorization_model_go", |b| { + b.iter(|| { + nyx_scanner::auth_analysis::extract::extract_authorization_model( + "go", + cfg.framework_ctx.as_ref(), + &tree, + &bytes, + &fixture, + &rules, + None, + ) + }); + }); +} + +/// Per-file shared-vs-double `extract_authorization_model` cost on a +/// realistic Go fixture (gin context.go). Pre-fix +/// `analyse_file_fused` called `extract_authorization_model` twice per +/// file (once for diagnostics via `run_auth_analysis`, once for +/// per-file summary keying via `extract_auth_summaries_by_key`). This +/// bench records the **shared-model path** only (extract once, derive +/// both summaries + diagnostics) so a regression that re-introduces +/// the double-call surfaces as a ≥1.7× slowdown here. 
+fn bench_extract_authorization_model_shared_go(c: &mut Criterion) { + use tree_sitter::Parser; + + let fixture = Path::new("benches/perf_fixtures/large_go_module.go") + .canonicalize() + .expect("perf fixture"); + let bytes = std::fs::read(&fixture).expect("read fixture"); + + let mut parser = Parser::new(); + let go_lang: tree_sitter::Language = tree_sitter_go::LANGUAGE.into(); + parser.set_language(&go_lang).expect("set go grammar"); + let tree = parser.parse(&bytes, None).expect("parse fixture"); + + let cfg = Config::default(); + let rules = nyx_scanner::auth_analysis::config::build_auth_rules(&cfg, "go"); + + c.bench_function("extract_authorization_model_shared_go", |b| { + b.iter(|| { + // Mirror `analyse_file_fused`: extract once, derive both + // per-file summaries (cheap iter over units) AND run the + // full diagnostic pipeline against the same model. + let model = nyx_scanner::auth_analysis::extract::extract_authorization_model( + "go", + cfg.framework_ctx.as_ref(), + &tree, + &bytes, + &fixture, + &rules, + None, + ); + let summaries = nyx_scanner::auth_analysis::extract_auth_summaries_from_model( + &model, "go", &fixture, None, + ); + let diags = nyx_scanner::auth_analysis::run_auth_analysis_with_model( + model, &tree, "go", &fixture, &rules, None, None, None, + ); + (summaries, diags) + }); + }); +} + +/// Per-file `collect_top_level_units` cost on a realistic Go fixture +/// (gin context.go, ~147 functions). Targets the inner per-function +/// AST-walk path: `collect_top_level_units` → +/// `build_function_unit_with_meta` → `collect_unit_state` (recursive +/// per-AST-node walk that emits per-node value-refs). +/// +/// Pre-fix (2026-05-04 perfhunt session-0009) `collect_unit_state` +/// called `extract_value_refs(node, bytes)` at every AST node, and that +/// helper recursively walked the node's full subtree. Combined with +/// the recursion below, every descendant got walked once for each of +/// its ancestors — total work O(N²) per function body. 
The fix +/// replaced that call with an O(1)-per-node `append_shallow_value_ref` +/// helper. A regression that re-introduces the deep walk surfaces +/// here as a ≥2× slowdown. +fn bench_collect_top_level_units_go(c: &mut Criterion) { + use tree_sitter::Parser; + + let fixture = Path::new("benches/perf_fixtures/large_go_module.go") + .canonicalize() + .expect("perf fixture"); + let bytes = std::fs::read(&fixture).expect("read fixture"); + + let mut parser = Parser::new(); + let go_lang: tree_sitter::Language = tree_sitter_go::LANGUAGE.into(); + parser.set_language(&go_lang).expect("set go grammar"); + let tree = parser.parse(&bytes, None).expect("parse fixture"); + + let cfg = Config::default(); + let rules = nyx_scanner::auth_analysis::config::build_auth_rules(&cfg, "go"); + + c.bench_function("collect_top_level_units_go", |b| { + b.iter(|| { + let mut model = nyx_scanner::auth_analysis::model::AuthorizationModel::default(); + nyx_scanner::auth_analysis::extract::common::collect_top_level_units( + tree.root_node(), + &bytes, + &rules, + &mut model, + ); + model + }); + }); +} + +/// SCCP throughput on every SSA body lowered from the gin context.go +/// fixture. Targets `nyx_scanner::ssa::const_prop::const_propagate` +/// directly, isolating it from the surrounding `optimize_ssa` pass and +/// the full-fused per-file analysis. +/// +/// Pre-fix (2026-05-04 perfhunt) `const_propagate` stored its lattice in +/// `HashMap` and walked +/// `inst_uses(inst).contains(&val)` for every block re-evaluation in the +/// SSA worklist — both shapes paid `SipHash` cost on every operand, and +/// the `inst_uses` factory allocated a fresh `Vec` on every +/// call. Switching the lattice + executable-edge maps to dense +/// `Vec`-indexed storage and the use-check to a zero-allocation +/// predicate cut `const_propagate` self-time roughly in half on the +/// large-Go fixture. A regression that re-introduces the hash-keyed +/// inner loop will surface here as a ≥1.4× slowdown. 
+fn bench_const_propagate_large_go(c: &mut Criterion) { + use nyx_scanner::ssa; + + let fixture = Path::new("benches/perf_fixtures/large_go_module.go") + .canonicalize() + .expect("perf fixture"); + let cfg_obj = Config::default(); + let (file_cfg, _lang) = nyx_scanner::ast::build_cfg_for_file(&fixture, &cfg_obj) + .expect("build cfg") + .expect("supported language"); + + // Lower every body once outside the bench loop so we measure only + // SCCP cost. The collected `(SsaBody, Cfg)` pairs are the input to + // the inner loop. + let mut bodies: Vec = Vec::new(); + for body in &file_cfg.bodies { + // Use `body.meta.name` as the scope filter so the SSA lowering + // pulls only this function's nodes; `scope_all=true` is reserved + // for the synthetic top-level body where `name` is None. + let scope = body.meta.name.as_deref(); + let scope_all = scope.is_none(); + match ssa::lower_to_ssa(&body.graph, body.entry, scope, scope_all) { + Ok(ssa_body) => bodies.push(ssa_body), + Err(_) => continue, + } + } + eprintln!( + "[diag] const_propagate bench: {} bodies lowered", + bodies.len() + ); + + c.bench_function("const_propagate_large_go", |b| { + b.iter(|| { + let mut total_values = 0usize; + for body in &bodies { + let result = ssa::const_prop::const_propagate(body); + total_values += result.values.len(); + } + total_values + }); + }); +} + +/// `GlobalSummaries::lookup_same_lang` cost on a populated index. The +/// inner loop hashes `(Lang, String)` once per call, then `FuncKey` once +/// per candidate via `by_key.get(k)`. Pre-fix the four secondary +/// indices used `std::collections::HashMap` (SipHash). Post-fix +/// (2026-05-04 perfhunt session-0015) they use `rustc_hash::FxHashMap`, +/// trading DoS hardening (irrelevant for in-process program-keyed +/// indices) for ~5x faster hashing on the 30+ byte 3-string `FuncKey` +/// hash workload. A regression that re-introduces SipHash would +/// surface here as a ≥3x slowdown. 
+fn bench_global_summaries_lookup_same_lang_go(c: &mut Criterion) { + let fixture = Path::new("benches/perf_fixtures/large_go_module.go") + .canonicalize() + .expect("perf fixture"); + let cfg = Config::default(); + + let summaries = + nyx_scanner::ast::extract_summaries_from_file(&fixture, &cfg).expect("extract summaries"); + let names: Vec = summaries.iter().map(|s| s.name.clone()).collect(); + let global = nyx_scanner::summary::merge_summaries(summaries, None); + let lang = nyx_scanner::symbol::Lang::Go; + + eprintln!("[diag] lookup_same_lang bench: {} names", names.len()); + + c.bench_function("global_summaries_lookup_same_lang_go", |b| { + b.iter(|| { + let mut total = 0usize; + for name in &names { + total += global.lookup_same_lang(lang, name).len(); + } + total + }); + }); +} + criterion_group!( benches, bench_ast_only_scan, @@ -181,5 +441,11 @@ criterion_group!( bench_single_file_parse_and_cfg, bench_state_analysis_only, bench_classify, + bench_analyse_file_fused_large_go, + bench_extract_authorization_model_go, + bench_extract_authorization_model_shared_go, + bench_collect_top_level_units_go, + bench_const_propagate_large_go, + bench_global_summaries_lookup_same_lang_go, ); criterion_main!(benches); diff --git a/docs/rules.md b/docs/rules.md index 9582277b..94482051 100644 --- a/docs/rules.md +++ b/docs/rules.md @@ -170,7 +170,7 @@ The tables below are generated from `src/patterns/.rs` by [`tools/docgen`] | `php.crypto.rand` | Low | A | Medium | | `php.crypto.sha1` | Low | A | Medium | -### Python: 14 patterns +### Python: 15 patterns | Rule ID | Severity | Tier | Confidence | |---|---|---|---| @@ -186,6 +186,7 @@ The tables below are generated from `src/patterns/.rs` by [`tools/docgen`] | `py.sqli.execute_format` | Medium | B | Medium | | `py.sqli.text_format` | Medium | B | Medium | | `py.xss.jinja_from_string` | Medium | A | High | +| `py.xss.make_response_format` | Medium | B | Medium | | `py.crypto.md5` | Low | A | Medium | | `py.crypto.sha1` | Low | 
A | Medium | diff --git a/src/ast.rs b/src/ast.rs index 2620f868..d6574e3f 100644 --- a/src/ast.rs +++ b/src/ast.rs @@ -40,7 +40,7 @@ use crate::utils::ext::lowercase_ext; use crate::utils::{Config, query_cache}; use petgraph::graph::NodeIndex; use std::borrow::Cow; -use std::cell::RefCell; +use std::cell::{OnceCell, RefCell}; use std::collections::{HashMap, HashSet}; use std::ops::ControlFlow; use std::path::Path; @@ -972,6 +972,27 @@ impl<'a> ParsedSource<'a> { { continue; } + // Layer C2: PHP `Serializable::unserialize($input)` magic + // method body — `public function unserialize($x) { ... + // unserialize($x) ... }`. This is the legacy + // `Serializable` interface contract (deprecated since PHP + // 8.1). PHP itself invokes the method when restoring an + // instance, so the body's `\unserialize($x)` call cannot + // be removed without breaking the interface. The + // actionable signal is at the class level (the class + // implements Serializable — fix is to migrate to + // `__serialize` / `__unserialize`), not at this call + // site. Genuine deserialization sinks (free-function + // `unserialize($_GET[..])`, helpers reading from session + // / cache, etc.) keep firing because they are not inside + // a method declaration named `unserialize` with a single + // formal parameter passed straight to the call. + if cq.meta.id == "php.deser.unserialize" + && self.lang_slug == "php" + && is_php_unserialize_magic_method_passthrough(cap.node, self.bytes) + { + continue; + } // Layer D: C/C++ buffer-overflow pattern rules // (`{c,cpp}.memory.strcpy`, `strcat`, `sprintf`) fire // syntactically on every call regardless of argument @@ -1102,6 +1123,13 @@ struct ParsedFile<'a> { file_cfg: FileCfg, lang_rules: LangAnalysisRules, has_lang_rules: bool, + /// Per-body SSA + const-prop + type-fact cache, lazily populated on first + /// request and indexed by `BodyId.0`. 
Was being recomputed 2-3× per body + /// across `run_cfg_analyses_with_lowered` (cfg analyses + state analyses) + /// and `run_auth_analyses` (`collect_file_var_types`); on the gin profile + /// `build_body_const_facts` accounted for 13.6% of wall-clock and a + /// single-pass cache collapses that to ~4.5%. + body_const_facts_cache: OnceCell>>, } impl<'a> ParsedFile<'a> { @@ -1153,9 +1181,33 @@ impl<'a> ParsedFile<'a> { file_cfg, lang_rules, has_lang_rules, + body_const_facts_cache: OnceCell::new(), } } + /// Per-body const-fact cache, computed once on first request and shared + /// across every per-body iteration in this file's analysis. Indexed by + /// `BodyId.0` so callers can look up by body identity. + fn body_const_facts_all(&self) -> &[Option] { + self.body_const_facts_cache.get_or_init(|| { + let lang = Lang::from_slug(self.source.lang_slug).unwrap_or(Lang::Rust); + self.file_cfg + .bodies + .iter() + .map(|b| cfg_analysis::build_body_const_facts(b, lang)) + .collect() + }) + } + + /// Look up the cached const facts for a specific body. + fn body_const_facts( + &self, + body: &crate::cfg::BodyCfg, + ) -> Option<&cfg_analysis::BodyConstFacts> { + let all = self.body_const_facts_all(); + all.get(body.meta.id.0 as usize).and_then(|f| f.as_ref()) + } + /// The top-level body's CFG graph (for backward-compatible access). 
fn cfg_graph(&self) -> &Cfg { &self.file_cfg.toplevel().graph @@ -1468,7 +1520,7 @@ impl<'a> ParsedFile<'a> { .filter(|f| f.body_id == body.meta.id) .cloned() .collect(); - let body_const_facts = cfg_analysis::build_body_const_facts(body, caller_lang); + let body_const_facts = self.body_const_facts(body); let cfg_ctx = cfg_analysis::AnalysisContext { cfg: &body.graph, entry: body.entry, @@ -1481,8 +1533,8 @@ impl<'a> ParsedFile<'a> { taint_findings: &body_taint, analysis_rules: self.rules_ref(), taint_active, - body_const_facts: body_const_facts.as_ref(), - type_facts: body_const_facts.as_ref().map(|f| &f.type_facts), + body_const_facts, + type_facts: body_const_facts.map(|f| &f.type_facts), auth_decorators: &body.meta.auth_decorators, closure_released_var_names: Some( closure_released_per_body @@ -1546,13 +1598,11 @@ impl<'a> ParsedFile<'a> { // points-to facts so the proxy-acquire transfer can // suppress SymbolId attribution on field-aliased // receivers (e.g. `m := c.mu; m.Lock()`). - let body_pointer_hints = cfg_analysis::build_body_const_facts(body, caller_lang) - .as_ref() - .and_then(|f| { - f.pointer_facts - .as_ref() - .map(|pf| pf.name_proxy_hints(&f.ssa)) - }); + let body_pointer_hints = self.body_const_facts(body).and_then(|f| { + f.pointer_facts + .as_ref() + .map(|pf| pf.name_proxy_hints(&f.ssa)) + }); let state_findings = state::run_state_analysis( &body.graph, body.entry, @@ -1666,12 +1716,11 @@ impl<'a> ParsedFile<'a> { /// syntactic heuristics. Returns `None` when no body produces a /// typed variable. 
fn collect_file_var_types(&self) -> Option { - let caller_lang = Lang::from_slug(self.source.lang_slug).unwrap_or(Lang::Rust); let mut merged: std::collections::HashMap = std::collections::HashMap::new(); let mut dropped: std::collections::HashSet = std::collections::HashSet::new(); for body in &self.file_cfg.bodies { - let Some(facts) = cfg_analysis::build_body_const_facts(body, caller_lang) else { + let Some(facts) = self.body_const_facts(body) else { continue; }; for (idx, def) in facts.ssa.value_defs.iter().enumerate() { @@ -1792,6 +1841,7 @@ pub fn extract_auth_model_for_debug( source.bytes, source.path, &rules, + None, ); Ok(Some(model)) } @@ -2401,6 +2451,165 @@ fn is_php_unserialize_allowed_classes_restricted( false } +/// PHP-only: returns `true` when the captured `function_call_expression` +/// is the canonical `Serializable::unserialize($input)` magic-method +/// pass-through — i.e. the call is inside a `method_declaration` named +/// exactly `unserialize` (PHP method names are case-insensitive) with +/// one formal parameter, and the call's single argument is the bare +/// parameter variable. +/// +/// **Why this is a non-actionable site for `php.deser.unserialize`:** +/// `Serializable::unserialize($input)` is an interface contract method +/// that PHP itself invokes when restoring an instance via the runtime +/// `\unserialize($bytes)` machinery. The implementation MUST decode +/// `$input` (the body's `\unserialize(...)` call) — there is no +/// "safer" rewrite that preserves the contract. The actionable signal +/// is at the class level (the class implements the deprecated +/// `Serializable` interface — fix is to migrate to `__serialize` / +/// `__unserialize`), not at this call site. 
+/// +/// Conservative recognition: +/// - method must be a `method_declaration` (NOT a free `function_definition` — +/// the magic semantics only apply to instance methods) +/// - method name == `unserialize` (case-insensitive) +/// - exactly 1 formal parameter +/// - call has exactly 1 argument +/// - argument's inner expression is a `variable_name` whose name equals the +/// formal parameter's name +/// +/// Genuine deserialization sinks (free `unserialize($_GET[...])`, helpers +/// reading from session/cache and passing through, etc.) keep firing +/// because they are not inside a method declaration named `unserialize`. +fn is_php_unserialize_magic_method_passthrough(cap_node: tree_sitter::Node, bytes: &[u8]) -> bool { + // The pattern captures `@n` (the function name); locate the enclosing + // function_call_expression. + let call_node = if cap_node.kind() == "function_call_expression" { + cap_node + } else { + let mut cur = cap_node; + let mut found = None; + for _ in 0..4 { + if cur.kind() == "function_call_expression" { + found = Some(cur); + break; + } + match cur.parent() { + Some(p) => cur = p, + None => break, + } + } + match found { + Some(c) => c, + None => return false, + } + }; + + // Walk up to the nearest method_declaration. Stop at any other + // function-introducing scope (free function, closure, arrow) — those + // are not the Serializable contract. + let mut cur = call_node; + let method = loop { + let Some(parent) = cur.parent() else { + return false; + }; + match parent.kind() { + "method_declaration" => break parent, + "function_definition" + | "anonymous_function" + | "anonymous_function_creation_expression" + | "arrow_function" + | "program" => return false, + _ => {} + } + cur = parent; + }; + + // Method name must be exactly `unserialize` (case-insensitive). 
+ let Some(name_node) = method + .child_by_field_name("name") + .or_else(|| find_named_child_of_kind(method, "name")) + else { + return false; + }; + let Ok(method_name) = std::str::from_utf8(&bytes[name_node.byte_range()]) else { + return false; + }; + if !method_name.eq_ignore_ascii_case("unserialize") { + return false; + } + + // Method must have exactly 1 formal parameter; capture its bare name. + let Some(params) = method + .child_by_field_name("parameters") + .or_else(|| find_named_child_of_kind(method, "formal_parameters")) + else { + return false; + }; + let mut formal_params: Vec = Vec::new(); + for i in 0..params.named_child_count() as u32 { + if let Some(p) = params.named_child(i) + && matches!( + p.kind(), + "simple_parameter" + | "variadic_parameter" + | "property_promotion_parameter" + | "promoted_constructor_parameter" + ) + { + formal_params.push(p); + } + } + if formal_params.len() != 1 { + return false; + } + let param = formal_params[0]; + let var_node = param + .child_by_field_name("name") + .or_else(|| find_named_child_of_kind(param, "variable_name")); + let Some(var_node) = var_node else { + return false; + }; + let inner_name_node = if var_node.kind() == "variable_name" { + var_node.named_child(0) + } else { + Some(var_node) + }; + let Some(inner_name_node) = inner_name_node else { + return false; + }; + let Ok(param_name) = std::str::from_utf8(&bytes[inner_name_node.byte_range()]) else { + return false; + }; + + // Call must have exactly 1 argument that is the bare parameter variable. 
+ let Some(arg_list) = find_named_child_of_kind(call_node, "arguments") else { + return false; + }; + let mut args: Vec = Vec::new(); + for i in 0..arg_list.named_child_count() as u32 { + if let Some(c) = arg_list.named_child(i) + && c.kind() == "argument" + { + args.push(c); + } + } + if args.len() != 1 { + return false; + } + let inner = args[0].named_child(0); + let Some(inner) = inner else { return false }; + if inner.kind() != "variable_name" { + return false; + } + let Some(arg_name_node) = inner.named_child(0) else { + return false; + }; + let Ok(arg_name) = std::str::from_utf8(&bytes[arg_name_node.byte_range()]) else { + return false; + }; + arg_name == param_name +} + /// C/C++-only Layer D: structural suppression of buffer-overflow pattern /// rules when the source / format-string argument is a literal whose /// contributed length is statically bounded. @@ -3999,6 +4208,15 @@ pub struct FusedResult { crate::symbol::FuncKey, auth_analysis::model::AuthCheckSummary, )>, + /// Per-Python-file router-level dep declarations + `include_router` + /// edges for cross-file FastAPI router-dep propagation. `None` for + /// non-Python files; `Some((module_id, facts))` for Python files + /// where `module_id` is the file's + /// [`auth_analysis::router_facts::module_id_for_storage`] key. + /// Pass 1 collects these into + /// `GlobalSummaries.router_facts_by_module`; pass 2 resolves them + /// per-file via `GlobalSummaries::resolve_cross_file_router_deps`. 
+ pub router_facts: Option<(String, auth_analysis::router_facts::PerFileRouterFacts)>, } /// Parse the file once, build the CFG once, and produce both function @@ -4034,6 +4252,7 @@ pub fn analyse_file_fused( cfg_nodes: 0, ssa_bodies: vec![], auth_summaries: vec![], + router_facts: None, }); }; @@ -4081,6 +4300,28 @@ pub fn analyse_file_fused( (vec![], vec![]) }; + let mut auth_summaries: Vec<( + crate::symbol::FuncKey, + auth_analysis::model::AuthCheckSummary, + )> = Vec::new(); + + // Per-file router-dep facts for cross-file FastAPI propagation. + // Extracted unconditionally for Python files so pass 1 can persist + // them into `GlobalSummaries.router_facts_by_module` even on Cfg / + // Taint modes (the auth analysis itself runs only under Full, but + // the index has to be populated by the time pass 2 launches). + let router_facts_for_this_file = if parsed.source.lang_slug == "python" { + auth_analysis::router_facts::module_id_for_storage(parsed.source.path).map(|module_id| { + let facts = auth_analysis::router_facts::extract_router_facts_for_python( + &parsed.source.tree, + parsed.source.bytes, + ); + (module_id, facts) + }) + } else { + None + }; + if cfg.scanner.mode == AnalysisMode::Full || cfg.scanner.mode == AnalysisMode::Ast { let ast_findings = parsed.source.run_ast_queries(cfg); // Layer B only applies when taint had the opportunity to evaluate @@ -4095,23 +4336,70 @@ pub fn analyse_file_fused( } else { out.extend(ast_findings); } - out.extend(parsed.run_auth_analyses(cfg, global_summaries, scan_root)); + // Build the AuthorizationModel exactly once per file when Full + // mode needs both diagnostics AND per-file summaries; pre-fix + // the diag path and the summary path each ran their own + // `extract::extract_authorization_model`, duplicating + // `collect_top_level_units` + every framework extractor's AST + // walk. See `auth_analysis::run_auth_analysis_with_model` for + // measured savings. 
+ let auth_rules = auth_analysis::config::build_auth_rules(cfg, parsed.source.lang_slug); + if auth_rules.enabled { + // Resolve cross-file router-deps for the current file (Python only). + // The resolved map lives on `AuthorizationModel.cross_file_router_deps` + // BEFORE `FlaskExtractor::extract` runs, so the in-extractor merge + // sees both inline router-deps and the cross-file lift in one pass. + let cross_file_router_deps = if parsed.source.lang_slug == "python" + && let Some(gs) = global_summaries + && let Some(child_module_id) = + auth_analysis::router_facts::module_id_for_path(parsed.source.path) + { + let resolved = gs.resolve_cross_file_router_deps(&child_module_id); + if resolved.is_empty() { + None + } else { + Some(resolved) + } + } else { + None + }; + let auth_model = auth_analysis::extract::extract_authorization_model( + parsed.source.lang_slug, + cfg.framework_ctx.as_ref(), + &parsed.source.tree, + parsed.source.bytes, + parsed.source.path, + &auth_rules, + cross_file_router_deps.as_ref(), + ); + // Extract summaries from the **base** model (pre var-types, + // pre-helper-lifting) so the persisted per-file summary + // carries only the helper's own intrinsic auth checks, + // matching the legacy `extract_auth_summaries_by_key` path + // bit-for-bit. 
+ if cfg.scanner.mode == AnalysisMode::Full { + auth_summaries = auth_analysis::extract_auth_summaries_from_model( + &auth_model, + parsed.source.lang_slug, + parsed.source.path, + scan_root, + ); + } + let var_types = parsed.collect_file_var_types(); + out.extend(auth_analysis::run_auth_analysis_with_model( + auth_model, + &parsed.source.tree, + parsed.source.lang_slug, + parsed.source.path, + &auth_rules, + var_types.as_ref(), + global_summaries, + scan_root, + )); + } } parsed.source.finalize_diags(&mut out, cfg); - let auth_summaries = if cfg.scanner.mode == AnalysisMode::Full { - auth_analysis::extract_auth_summaries_by_key( - &parsed.source.tree, - parsed.source.bytes, - parsed.source.lang_slug, - parsed.source.path, - cfg, - scan_root, - ) - } else { - Vec::new() - }; - Ok(FusedResult { summaries, diags: out, @@ -4119,6 +4407,7 @@ pub fn analyse_file_fused( cfg_nodes, ssa_bodies, auth_summaries, + router_facts: router_facts_for_this_file, }) } @@ -4441,6 +4730,100 @@ fn php_unserialize_allowed_classes_recognises_safe_forms() { ); } +#[test] +fn php_unserialize_magic_method_passthrough_recognises_serializable_contract() { + let mut parser = tree_sitter::Parser::new(); + let lang = tree_sitter::Language::from(tree_sitter_php::LANGUAGE_PHP); + parser.set_language(&lang).unwrap(); + let q = r#"(function_call_expression function: (name) @n (#eq? @n "unserialize")) @vuln"#; + + // Canonical Serializable::unserialize delegating to __unserialize. + let code = b"__unserialize(unserialize($serialized));\n }\n}\n"; + let tree = parser.parse(code, None).unwrap(); + let cap = first_php_capture(&tree, code, q); + assert!( + is_php_unserialize_magic_method_passthrough(cap, code), + "Serializable::unserialize($x) → unserialize($x) should be suppressed" + ); + + // Multi-target list-destructuring assignment shape (Joomla Cli/Input). 
+ let code = b"a, $this->b] = unserialize($input);\n }\n}\n"; + let tree = parser.parse(code, None).unwrap(); + let cap = first_php_capture(&tree, code, q); + assert!( + is_php_unserialize_magic_method_passthrough(cap, code), + "list-destructuring inside Serializable::unserialize should be suppressed" + ); + + // Case-insensitive method name (PHP semantics). + let code = b"= token_lookup.line + operation.kind == OperationKind::Mutation + && operation.line >= token_lookup.line + // Ignore `InMemoryLocal` mutations (HashSet/HashMap/Vec + // local bookkeeping like `verified_ids.update(myteams)`, + // `requested_teams.update(verified_ids)`). The verb is + // `update` so `OperationKind::Mutation` is set, but the + // sink_class encodes that the receiver is a non-sink + // local container — never a token-bound write. Mirrors + // the gate in `check_ownership_gaps`. + && operation + .sink_class + .is_none_or(|class| class.is_auth_relevant()) }) else { continue; }; diff --git a/src/auth_analysis/config.rs b/src/auth_analysis/config.rs index 8528b831..e1cf8aeb 100644 --- a/src/auth_analysis/config.rs +++ b/src/auth_analysis/config.rs @@ -55,6 +55,13 @@ pub struct AuthAnalysisRules { /// `WHERE .user_id = ?N`, make every returned row /// membership-gated. See `sql_semantics::classify_sql_query`. pub acl_tables: Vec, + /// Callee names that, when they appear as the chain root of a + /// chained-call shape (`select(X).filter_by(...)`, + /// `query(X).filter(...)`), anchor the trailing method as a DB + /// query-builder operation. Overrides the chained-call suppression + /// in `classify_sink_class` for SQLAlchemy / similar query-builder + /// idioms whose first call returns an opaque builder object. 
+ pub db_query_builder_roots: Vec, } impl AuthAnalysisRules { @@ -80,6 +87,7 @@ impl AuthAnalysisRules { outbound_network_receiver_prefixes: Vec::new(), cache_receiver_prefixes: Vec::new(), acl_tables: Vec::new(), + db_query_builder_roots: Vec::new(), } } @@ -96,11 +104,13 @@ impl AuthAnalysisRules { } /// Does `ty` (last path segment, case-sensitive) match a - /// non-sink receiver type? The angle-bracket generic suffix is - /// stripped first: `HashMap` → `HashMap`. + /// non-sink receiver type? Generic suffixes are stripped first: + /// `HashMap` → `HashMap` (Rust/Java/TS angle brackets), + /// `set[int]` / `dict[str, int]` → `set` / `dict` (Python PEP 585 + /// builtin generics + `typing` aliases). pub fn is_non_sink_receiver_type(&self, ty: &str) -> bool { let base = Self::type_last_segment(ty); - let base = base.split('<').next().unwrap_or(base).trim(); + let base = base.split(['<', '[']).next().unwrap_or(base).trim(); self.non_sink_receiver_types .iter() .any(|allowed| allowed == base) @@ -115,25 +125,35 @@ impl AuthAnalysisRules { /// The callee string may use either `::` or `.` as the path /// separator (nyx's `callee_name` normalizes both via /// `member_chain`). + /// + /// Bare-callee form: Python uses `set()` / `dict()` / `list()` / + /// `defaultdict()` / etc. as direct constructors with no method + /// segment. When `callee` has no `.` / `::` separator and matches + /// a registered non-sink receiver type, treat the call as a + /// non-sink constructor. Closes the + /// `verified_ids = set(); verified_ids.update(myteams)` shape in + /// sentry where the bare-call form was unrecognised so the bound + /// var was missing from `non_sink_vars` and the later + /// `.update(..)` classified as DbMutation. 
pub fn is_non_sink_constructor_callee(&self, callee: &str) -> bool { let normalized = callee.replace("::", "."); - let Some((ty, method)) = normalized.rsplit_once('.') else { - return false; - }; - if !self.is_non_sink_receiver_type(ty) { - return false; + if let Some((ty, method)) = normalized.rsplit_once('.') { + if !self.is_non_sink_receiver_type(ty) { + return false; + } + return matches!( + method, + "new" + | "with_capacity" + | "with_capacity_and_hasher" + | "with_hasher" + | "from" + | "from_iter" + | "new_in" + | "default" + ); } - matches!( - method, - "new" - | "with_capacity" - | "with_capacity_and_hasher" - | "with_hasher" - | "from" - | "from_iter" - | "new_in" - | "default" - ) + self.is_non_sink_receiver_type(&normalized) } /// Does the first segment of a callee receiver chain look like a @@ -260,20 +280,45 @@ impl AuthAnalysisRules { // Verb-name fallback (`is_mutation` / `is_read`) is the loosest // dispatch: it prefix-matches the bare method name against // generic verbs (`Get`, `Save`, `Find`, …) regardless of the - // receiver. When the receiver chain itself contains a call - // expression (`w.Header().Get(..)`, `r.URL.Query().Get(..)`, - // `db.Tx(..).Query(..)`), the receiver is the *return value of - // another call*, its type is opaque to the auth analyser and - // the bare verb match is too speculative to assume a data-layer - // sink. The realtime/outbound/cache prefix dispatches above - // already match by the chain root; if none of them claimed the - // receiver, dropping the verb-name fallback for chained-call - // shapes prevents the entire `w.Header().Get` / - // `r.URL.Query().Get` cluster from masquerading as a - // `DbCrossTenantRead`. A canonical data-layer call still has a - // bare-identifier receiver (`repo.Find(id)`, `db.Query(..)`) - // and is unaffected. - if !receiver_is_chained_call(callee) { + // receiver. 
Two structural shapes lack the receiver evidence + // needed to anchor a DB-sink classification and are excluded: + // + // 1. Chained-call receiver (`w.Header().Get(..)`, + // `r.URL.Query().Get(..)`, `db.Tx(..).Query(..)`) — the + // receiver is the *return value of another call*, its type + // is opaque to the auth analyser. + // 2. Bare-identifier callee with no receiver dot at all + // (`list(..)`, `filter(..)`, `create_audit_entry(..)`, + // `update_coding_agent_state(..)`) — Python / JS / Ruby + // builtins and locally-defined helpers routinely collide + // with the verb vocabulary. Real ORM / DB calls always + // carry a receiver (`User.find(id)`, `Model.objects.filter`, + // `repo.save(x)`); a bare `list(events)` is the Python + // builtin and `filter(fn, xs)` is `Iterable.filter`. + // + // The realtime / outbound / cache prefix dispatches above + // already match by the chain root; gating the verb fallback on + // a simple non-chained receiver dot prevents both shapes from + // masquerading as data-layer sinks while leaving canonical + // `repo.Find(id)` / `db.Query(..)` calls unaffected. + if receiver_is_simple_chain(callee) { + if self.is_mutation(callee) { + return Some(SinkClass::DbMutation); + } + if self.is_read(callee) { + return Some(SinkClass::DbCrossTenantRead); + } + } + // SQLAlchemy / query-builder chained shapes: + // `select(X).filter_by(...)`, `query(X).filter(...)`, + // `select().join().where()`. The chain receiver is the return + // value of an opaque builder primitive that the type tracker + // cannot follow, but the chain *root* segment is itself a known + // DB query-builder verb — strong enough evidence to anchor a + // DB-sink classification when paired with a mutation/read verb + // on the trailing method. Closes airflow-style + // `session.scalar(select(C).filter_by(conn_id=user_input))`. 
+ if receiver_is_chained_call(callee) && self.chain_root_is_db_query_builder(callee) { if self.is_mutation(callee) { return Some(SinkClass::DbMutation); } @@ -284,6 +329,42 @@ impl AuthAnalysisRules { None } + /// True when any non-final segment of the chain is an + /// intermediate-call (ends with `()`) whose verb matches a + /// configured `db_query_builder_roots` entry. Used to anchor + /// chained-call shapes like `select(X).filter_by(id=...)` (Python) + /// or `query(X).filter(...)` to a DB-sink classification despite + /// the opaque builder return value. + pub fn chain_root_is_db_query_builder(&self, callee: &str) -> bool { + if self.db_query_builder_roots.is_empty() { + return false; + } + let segments: Vec<&str> = callee.split('.').collect(); + if segments.len() < 2 { + return false; + } + for seg in &segments[..segments.len() - 1] { + if !seg.ends_with(')') { + continue; + } + let stripped = seg + .trim_end_matches(')') + .trim_end_matches('(') + .trim_end_matches(')'); + if stripped.is_empty() { + continue; + } + if self + .db_query_builder_roots + .iter() + .any(|root| matches_name(stripped, root)) + { + return true; + } + } + false + } + pub fn requires_admin_path(&self, path: &str) -> bool { let lower = path.to_ascii_lowercase(); let normalized = if lower.starts_with('/') { @@ -583,7 +664,29 @@ pub fn build_auth_rules(config: &Config, lang_slug: &str) -> AuthAnalysisRules { "invitedemail".into(), "recipient".into(), ], - non_sink_receiver_types: Vec::new(), + // Python builtin / `collections` non-sink container types. + // Recognised both as type-annotation hints (`x: set[int]`) + // and as bare-callee constructor forms (`x = set()`, + // `cache = collections.defaultdict(list)`, …). Method + // calls on bound vars (`x.update`, `x.add`, `cache.pop`) + // are then classified as `InMemoryLocal`, suppressing the + // false `DbMutation` / `DbCrossTenantRead` sink shape. 
+ // Closes sentry `api/helpers/teams.py:46` shape where + // `verified_ids = set(); verified_ids.update(myteams)` was + // flagged as cross-tenant mutation. + non_sink_receiver_types: vec![ + "set".into(), + "dict".into(), + "list".into(), + "tuple".into(), + "frozenset".into(), + "defaultdict".into(), + "OrderedDict".into(), + "Counter".into(), + "deque".into(), + "ChainMap".into(), + "namedtuple".into(), + ], non_sink_receiver_name_prefixes: Vec::new(), non_sink_global_receivers: Vec::new(), non_sink_method_names: Vec::new(), @@ -591,6 +694,12 @@ pub fn build_auth_rules(config: &Config, lang_slug: &str) -> AuthAnalysisRules { outbound_network_receiver_prefixes: Vec::new(), cache_receiver_prefixes: Vec::new(), acl_tables: Vec::new(), + // SQLAlchemy queryset builders. `select(X).filter_by(id=...)` + // / `query(X).filter(id=...)` chains return opaque builder + // objects whose type the auth analyser cannot follow; the + // chain *root* primitive itself is the DB-anchor evidence. + // Closes airflow-style `session.scalar(select(C).filter_by(...))`. 
+ db_query_builder_roots: vec!["select".into(), "query".into()], } } else if matches!(lang_slug, "ruby") { AuthAnalysisRules { @@ -766,6 +875,7 @@ pub fn build_auth_rules(config: &Config, lang_slug: &str) -> AuthAnalysisRules { outbound_network_receiver_prefixes: Vec::new(), cache_receiver_prefixes: Vec::new(), acl_tables: Vec::new(), + db_query_builder_roots: Vec::new(), } } else if matches!(lang_slug, "go") { AuthAnalysisRules { @@ -862,6 +972,7 @@ pub fn build_auth_rules(config: &Config, lang_slug: &str) -> AuthAnalysisRules { outbound_network_receiver_prefixes: Vec::new(), cache_receiver_prefixes: Vec::new(), acl_tables: Vec::new(), + db_query_builder_roots: Vec::new(), } } else if matches!(lang_slug, "java") { AuthAnalysisRules { @@ -954,6 +1065,7 @@ pub fn build_auth_rules(config: &Config, lang_slug: &str) -> AuthAnalysisRules { outbound_network_receiver_prefixes: Vec::new(), cache_receiver_prefixes: Vec::new(), acl_tables: Vec::new(), + db_query_builder_roots: Vec::new(), } } else if matches!(lang_slug, "rust") { AuthAnalysisRules { @@ -1137,6 +1249,7 @@ pub fn build_auth_rules(config: &Config, lang_slug: &str) -> AuthAnalysisRules { "members".into(), "share_grants".into(), ], + db_query_builder_roots: Vec::new(), } } else { AuthAnalysisRules { @@ -1290,6 +1403,7 @@ pub fn build_auth_rules(config: &Config, lang_slug: &str) -> AuthAnalysisRules { outbound_network_receiver_prefixes: Vec::new(), cache_receiver_prefixes: Vec::new(), acl_tables: Vec::new(), + db_query_builder_roots: Vec::new(), } }; @@ -1367,6 +1481,10 @@ pub fn build_auth_rules(config: &Config, lang_slug: &str) -> AuthAnalysisRules { &lang_cfg.auth.cache_receiver_prefixes, ); extend_unique(&mut rules.acl_tables, &lang_cfg.auth.acl_tables); + extend_unique( + &mut rules.db_query_builder_roots, + &lang_cfg.auth.db_query_builder_roots, + ); } rules @@ -1410,6 +1528,17 @@ pub fn receiver_is_chained_call(callee: &str) -> bool { receiver.contains('(') } +/// True when the callee has a non-chained 
receiver dot, i.e. an actual +/// receiver identifier or path (`User.find`, `repo.save`, +/// `Model.objects.filter`). Returns false for bare-identifier callees +/// (`list(..)`, `filter(..)`, `create_audit_entry(..)`) and for +/// chained-call receivers (`db.Tx(..).Query(..)`) — both lack the +/// receiver evidence needed to anchor a DB-sink classification, see +/// the comment in `classify_sink_class`. +pub fn receiver_is_simple_chain(callee: &str) -> bool { + callee.contains('.') && !receiver_is_chained_call(callee) +} + /// Recognise `require__` / `ensure__` /// shapes where `` is a closed-vocabulary authorization noun /// (`member`, `owner`, `admin`, `access`, `permission`, `manager`, @@ -1768,6 +1897,161 @@ mod tests { ); } + /// Pin the bare-identifier verb-fallback gate. Bare callees with + /// no receiver dot lack the receiver evidence needed to anchor a + /// DB-sink classification: `list(...)`, `filter(...)`, `update(...)`, + /// `create_audit_entry(...)`, `update_coding_agent_state(...)` are + /// Python builtins / JS Array methods / locally-defined helpers, + /// not ORM operations. Closes the sentry / saleor / netbox cluster + /// where bare-name callees inside route helpers (with `request: + /// Request` triggering the user-input precondition) fired + /// `py.auth.missing_ownership_check`. + #[test] + fn classify_sink_class_suppresses_bare_callee_verb_fallback() { + use crate::auth_analysis::model::SinkClass; + use std::collections::HashSet; + let empty: HashSet<String> = HashSet::new(); + + for lang in [ + "python", + "javascript", + "typescript", + "go", + "java", + "ruby", + "rust", + ] { + let cfg = Config::default(); + let rules = build_auth_rules(&cfg, lang); + // Bare callees that prefix-match a read / mutation indicator + // must NOT classify as DbCrossTenantRead / DbMutation.
+ assert_eq!( + rules.classify_sink_class("list", &empty), + None, + "lang={lang} bare list", + ); + assert_eq!( + rules.classify_sink_class("filter", &empty), + None, + "lang={lang} bare filter", + ); + assert_eq!( + rules.classify_sink_class("update", &empty), + None, + "lang={lang} bare update", + ); + assert_eq!( + rules.classify_sink_class("create_audit_entry", &empty), + None, + "lang={lang} bare create_audit_entry", + ); + assert_eq!( + rules.classify_sink_class("update_coding_agent_state", &empty), + None, + "lang={lang} bare update_coding_agent_state", + ); + } + + // Recall guard: qualified ORM / DB calls keep firing on every + // language that has the verb in its indicator vocabulary. + let py_rules = build_auth_rules(&Config::default(), "python"); + assert_eq!( + py_rules.classify_sink_class("Project.objects.filter", &empty), + Some(SinkClass::DbCrossTenantRead) + ); + assert_eq!( + py_rules.classify_sink_class("Project.objects.update", &empty), + Some(SinkClass::DbMutation) + ); + let go_rules = build_auth_rules(&Config::default(), "go"); + assert_eq!( + go_rules.classify_sink_class("repo.Find", &empty), + Some(SinkClass::DbCrossTenantRead) + ); + } + + /// Pin the SQLAlchemy queryset-builder chained-call recogniser. + /// `select(X).filter_by(id=user_input)` reduces (post `member_chain` + /// fix) to the chain-string `"select().filter_by"`. The chained-call + /// shape would otherwise be suppressed by `receiver_is_chained_call`, + /// blocking recall on the airflow `session.scalar(select(C).filter_by(...))` + /// shape. `chain_root_is_db_query_builder` overrides the suppression + /// when the chain root is a configured DB-builder verb. 
+ #[test] + fn chain_root_is_db_query_builder_recognises_sqlalchemy_chains() { + use crate::auth_analysis::model::SinkClass; + use std::collections::HashSet; + let cfg = Config::default(); + let py_rules = build_auth_rules(&cfg, "python"); + let empty: HashSet<String> = HashSet::new(); + + // Detection: chain root `select()` / `query()` matches the + // configured Python `db_query_builder_roots`. + assert!(py_rules.chain_root_is_db_query_builder("select().filter_by")); + assert!(py_rules.chain_root_is_db_query_builder("query().filter")); + assert!(py_rules.chain_root_is_db_query_builder("Session.query().filter")); + assert!(py_rules.chain_root_is_db_query_builder("select().join().where")); + // Non-builder chain roots: must not match. + assert!(!py_rules.chain_root_is_db_query_builder("w.Header().Get")); + assert!(!py_rules.chain_root_is_db_query_builder("obj.foo().bar")); + // Plain receiver chains (no intermediate call): not handled + // here — the simple-chain branch covers them. + assert!(!py_rules.chain_root_is_db_query_builder("repo.Find")); + assert!(!py_rules.chain_root_is_db_query_builder("Project.objects.filter")); + // Classification: chained-call DB-builder shapes anchor to + // DbCrossTenantRead / DbMutation when the trailing verb matches. + assert_eq!( + py_rules.classify_sink_class("select().filter_by", &empty), + Some(SinkClass::DbCrossTenantRead) + ); + assert_eq!( + py_rules.classify_sink_class("query().delete", &empty), + Some(SinkClass::DbMutation) + ); + assert_eq!( + py_rules.classify_sink_class("select().update", &empty), + Some(SinkClass::DbMutation) + ); + // Regression guard: chained-call shapes that are NOT DB + // builders (Go HTTP `w.Header().get`, generic `obj.foo().bar`) + // remain suppressed even when the trailing verb prefix-matches. + // Run on a Python-rules instance with the verb in its read + // indicator vocabulary to exercise the guard.
+ assert_eq!(py_rules.classify_sink_class("w.Header().get", &empty), None); + assert_eq!(py_rules.classify_sink_class("obj.foo().get", &empty), None); + + // Languages without `db_query_builder_roots` defaults must not + // false-positive on chained-call shapes. + for lang in ["javascript", "typescript", "go", "java", "ruby", "rust"] { + let rules = build_auth_rules(&Config::default(), lang); + assert!( + !rules.chain_root_is_db_query_builder("select().filter_by"), + "lang={lang} unexpectedly classified select().filter_by as DB-builder chain", + ); + assert_eq!( + rules.classify_sink_class("select().filter_by", &empty), + None, + "lang={lang} unexpectedly classified select().filter_by as DB sink", + ); + } + } + + #[test] + fn receiver_is_simple_chain_classifies_correctly() { + use super::receiver_is_simple_chain; + // Simple receiver chain (allowed for verb fallback). + assert!(receiver_is_simple_chain("repo.Find")); + assert!(receiver_is_simple_chain("Project.objects.filter")); + assert!(receiver_is_simple_chain("self.cache.insert")); + // Bare-identifier callee (rejected — no receiver evidence). + assert!(!receiver_is_simple_chain("list")); + assert!(!receiver_is_simple_chain("filter")); + assert!(!receiver_is_simple_chain("create_audit_entry")); + // Chained-call receiver (rejected — receiver type opaque). + assert!(!receiver_is_simple_chain("w.Header().Get")); + assert!(!receiver_is_simple_chain("db.Tx(opts).Query")); + } + #[test] fn sink_class_is_auth_relevant_only_for_non_local_classes() { use crate::auth_analysis::model::SinkClass; @@ -1836,6 +2120,97 @@ mod tests { ); } + /// Pin the Python non-sink container recogniser. Both type + /// annotations (`x: set[int]`, `m: dict[str, int]`) and + /// bare-callee constructor calls (`set()`, `dict()`, + /// `defaultdict()`) must register the bound variable as a + /// non-sink receiver, suppressing later `.update(..)` / + /// `.add(..)` calls from classifying as `DbMutation` / + /// `DbCrossTenantRead`. 
+ #[test] + fn python_non_sink_container_recognition() { + use crate::auth_analysis::model::SinkClass; + use std::collections::HashSet; + let cfg = Config::default(); + let rules = build_auth_rules(&cfg, "python"); + + // Type annotations: PEP 585 builtin generics + typing aliases. + assert!(rules.is_non_sink_receiver_type("set")); + assert!(rules.is_non_sink_receiver_type("set[int]")); + assert!(rules.is_non_sink_receiver_type("dict[str, int]")); + assert!(rules.is_non_sink_receiver_type("list[str]")); + assert!(rules.is_non_sink_receiver_type("defaultdict")); + assert!(rules.is_non_sink_receiver_type("Counter")); + assert!(rules.is_non_sink_receiver_type("OrderedDict")); + // Negative: arbitrary type names must not match. + assert!(!rules.is_non_sink_receiver_type("Project")); + assert!(!rules.is_non_sink_receiver_type("QuerySet")); + + // Bare-callee constructor form: `set()`, `dict()`, + // `defaultdict()`, `Counter()`. + assert!(rules.is_non_sink_constructor_callee("set")); + assert!(rules.is_non_sink_constructor_callee("dict")); + assert!(rules.is_non_sink_constructor_callee("list")); + assert!(rules.is_non_sink_constructor_callee("frozenset")); + assert!(rules.is_non_sink_constructor_callee("defaultdict")); + assert!(rules.is_non_sink_constructor_callee("Counter")); + // Negative: bare callees that are NOT non-sink types must not + // be treated as constructors. `update`, `filter`, `find` are + // verb names, not container types. + assert!(!rules.is_non_sink_constructor_callee("update")); + assert!(!rules.is_non_sink_constructor_callee("filter")); + assert!(!rules.is_non_sink_constructor_callee("find")); + assert!(!rules.is_non_sink_constructor_callee("Project")); + + // End-to-end classification: `verified_ids.update(..)` with + // `verified_ids` registered as a non-sink var classifies as + // `InMemoryLocal`, the precondition for suppressing the + // false `DbMutation` finding. 
+ let mut non_sink_vars: HashSet<String> = HashSet::new(); + non_sink_vars.insert("verified_ids".to_string()); + non_sink_vars.insert("requested_teams".to_string()); + assert_eq!( + rules.classify_sink_class("verified_ids.update", &non_sink_vars), + Some(SinkClass::InMemoryLocal) + ); + assert_eq!( + rules.classify_sink_class("requested_teams.add", &non_sink_vars), + Some(SinkClass::InMemoryLocal) + ); + // Recall guard: a real ORM mutation on the same verb still + // classifies as `DbMutation` when the receiver is qualified. + let empty: HashSet<String> = HashSet::new(); + assert_eq!( + rules.classify_sink_class("Project.objects.update", &empty), + Some(SinkClass::DbMutation) + ); + } + + /// Cross-language recall guard: only Python populates the new + /// container types by default. Other-language defaults must + /// not inadvertently inherit `set` / `dict` / `list` as non-sink + /// types via the merge path (those names overlap with verb + /// indicators in those languages). + #[test] + fn python_container_types_do_not_leak_to_other_languages() { + let cfg = Config::default(); + for lang in ["javascript", "typescript", "go", "java", "ruby", "rust"] { + let rules = build_auth_rules(&cfg, lang); + assert!( + !rules.is_non_sink_receiver_type("set"), + "lang={lang} unexpectedly recognises bare `set` as non-sink type", + ); + assert!( + !rules.is_non_sink_receiver_type("dict"), + "lang={lang} unexpectedly recognises bare `dict` as non-sink type", + ); + assert!( + !rules.is_non_sink_receiver_type("list"), + "lang={lang} unexpectedly recognises bare `list` as non-sink type", + ); + } + } + + /// `require__` structural recogniser for project /// helpers like `require_trip_member`, `require_doc_owner`.
#[test] diff --git a/src/auth_analysis/extract/actix_web.rs b/src/auth_analysis/extract/actix_web.rs index ea0ecb85..a02b13ad 100644 --- a/src/auth_analysis/extract/actix_web.rs +++ b/src/auth_analysis/extract/actix_web.rs @@ -4,8 +4,7 @@ use super::axum::{ expanded_guard_call_sites, guard_calls_for_handler, inject_guard_checks, rust_param_aliases, }; use super::common::{ - attach_route_handler, call_name, collect_top_level_units, named_children, resolve_handler_node, - string_literal_value, + attach_route_handler, call_name, named_children, resolve_handler_node, string_literal_value, }; use crate::auth_analysis::config::AuthAnalysisRules; use crate::auth_analysis::model::{ @@ -30,21 +29,11 @@ impl AuthExtractor for ActixWebExtractor { bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); - - collect_top_level_units(root, bytes, rules, &mut model); - collect_routes(root, root, bytes, path, rules, &mut model); - apply_typed_extractor_guards_to_units( - root, - bytes, - rules, - &mut model, - GuardFramework::ActixWeb, - ); - - model + collect_routes(root, root, bytes, path, rules, model); + apply_typed_extractor_guards_to_units(root, bytes, rules, model, GuardFramework::ActixWeb); } } diff --git a/src/auth_analysis/extract/axum.rs b/src/auth_analysis/extract/axum.rs index 8f6f614c..f793c04e 100644 --- a/src/auth_analysis/extract/axum.rs +++ b/src/auth_analysis/extract/axum.rs @@ -1,8 +1,7 @@ use super::AuthExtractor; use super::common::{ attach_route_handler, call_name, call_site_from_node, call_sites_from_value, - collect_top_level_units, function_definition_node, named_children, resolve_handler_node, - string_literal_value, text, + function_definition_node, named_children, resolve_handler_node, string_literal_value, text, }; use crate::auth_analysis::config::AuthAnalysisRules; use crate::auth_analysis::model::{ @@ -29,15 
+28,11 @@ impl AuthExtractor for AxumExtractor { bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); - - collect_top_level_units(root, bytes, rules, &mut model); - collect_routes(root, root, bytes, path, rules, &mut model); - apply_typed_extractor_guards_to_units(root, bytes, rules, &mut model, GuardFramework::Axum); - - model + collect_routes(root, root, bytes, path, rules, model); + apply_typed_extractor_guards_to_units(root, bytes, rules, model, GuardFramework::Axum); } } diff --git a/src/auth_analysis/extract/common.rs b/src/auth_analysis/extract/common.rs index dba9a100..5630c2aa 100644 --- a/src/auth_analysis/extract/common.rs +++ b/src/auth_analysis/extract/common.rs @@ -896,6 +896,13 @@ fn collect_unit_state( // `instance_variable`. if matches!(node.kind(), "assignment" | "assignment_expression") { collect_row_population(node, bytes, state); + // Python `verified_ids = set()` / + // `cache: dict[str,int] = {}` and JS analogues bind a + // local non-sink container. `collect_non_sink_binding` + // accepts both `pattern`/`value` and `left`/`right` + // field names so the same recognition path covers + // these assignment-node shapes. + collect_non_sink_binding(node, bytes, rules, state); } } "for_expression" => { @@ -915,9 +922,27 @@ fn collect_unit_state( _ => {} } - for value in extract_value_refs(node, bytes) { - state.value_refs.push(value); - } + // O(1) per-node shallow value-ref emission, then descend. + // + // Pre-fix this site called `extract_value_refs(node, bytes)` which walks + // node's entire subtree. Combined with the recursion below — which + // visits every descendant and re-runs the same call at each level — the + // total work was O(N * subtree_size) ≈ O(N²) per function body. 
On + // mm/channels/app the inner-walk dominated `build_function_unit_with_meta` + // and its descendants (~17%+15%+11% of total wall-clock split across + // `build_function_unit_with_meta`, `collect_unit_state`, and + // `extract_value_refs` in the post-shared-model profile, 2026-05-04). + // + // The recursion below already visits every descendant once. Emitting a + // shallow value-ref per node — only the ref the node itself represents — + // produces the same SET of value-refs after `dedup_value_refs` runs in + // `build_function_unit_with_meta`, because every ref-emitting kind + // (member chain, subscript, accessor call, identifier) is reachable as a + // single node visit. Public callers of `extract_value_refs` (e.g. + // `collect_call`, `collect_condition`, assignment-side extraction) keep + // the deep walk: they intentionally want refs from the full subtree + // rooted at the argument they pass. + append_shallow_value_ref(node, bytes, &mut state.value_refs); for idx in 0..node.named_child_count() { let Some(child) = node.named_child(idx as u32) else { @@ -927,6 +952,57 @@ fn collect_unit_state( } } +/// Per-node value-ref emission used inside `collect_unit_state`'s tree walk. +/// +/// Returns the value-ref the node itself represents (a member chain, a +/// subscript, an accessor call's chain, or an identifier-like leaf), without +/// descending into descendants. The caller's existing AST recursion handles +/// children; relying on that recursion turns the previously O(N²) per-body +/// walk into O(N). 
+fn append_shallow_value_ref(node: Node<'_>, bytes: &[u8], refs: &mut Vec<ValueRef>) { + match node.kind() { + "member_expression" + | "attribute" + | "selector_expression" + | "field_expression" + | "field_access" => { + if let Some(value) = member_value_ref(node, bytes) { + refs.push(value); + } + } + "subscript_expression" | "subscript" | "element_reference" | "index_expression" => { + if let Some(value) = subscript_value_ref(node, bytes) { + refs.push(value); + } + } + "call_expression" | "call" | "method_invocation" | "method_call_expression" => { + // Accessor-call chains (`cache.get(key)`, `req.params.id`) absorb + // into a single chain ValueRef; non-accessor calls return None + // here and rely on recursion to visit `function` + arg children + // so each leaf identifier emits its own ref. + if let Some(value) = call_value_ref(node, bytes) { + refs.push(value); + } + } + // Bare identifier and Ruby `@foo` / `@@foo` / `$foo` leaves: emit a + // single Identifier-kind ValueRef. Mirrors `extract_value_refs`'s + // identifier arm so `dedup_value_refs` collapses any cross-path + // duplicates against existing emissions from sibling deep walks + // (e.g. `collect_condition`'s `extract_value_refs(condition)`). + "identifier" | "instance_variable" | "class_variable" | "global_variable" => { + refs.push(ValueRef { + source_kind: ValueSourceKind::Identifier, + name: text(node, bytes), + base: None, + field: None, + index: None, + span: span(node), + }); + } + _ => {} + } +} + fn collect_call(node: Node<'_>, bytes: &[u8], rules: &AuthAnalysisRules, state: &mut UnitState) { let callee = call_name(node, bytes); if callee.is_empty() { @@ -1059,22 +1135,28 @@ fn collect_condition( } } -/// Detect `let` bindings that produce a known non-sink collection -/// (e.g. `HashMap::new()`, `Vec::with_capacity(_)`, `vec![]`, or an -/// explicit type annotation like `: HashMap<_, _>`).
Registered -/// variable names are consulted by `collect_call` so later method -/// calls on those bindings (`map.insert(..)`, `set.remove(..)`) -/// aren't treated as auth-relevant Read/Mutation operations. +/// Detect bindings that produce a known non-sink collection +/// (e.g. `HashMap::new()`, `Vec::with_capacity(_)`, `vec![]`, an +/// explicit type annotation like `: HashMap<_, _>`, or Python's +/// bare `set()` / `dict()` / `collections.defaultdict(list)`). +/// Registered variable names are consulted by `collect_call` so +/// later method calls on those bindings (`map.insert(..)`, +/// `set.remove(..)`, `verified_ids.update(..)`) aren't treated as +/// auth-relevant Read/Mutation operations. /// -/// Rust-oriented in practice; JS/TS/Python/etc. use different -/// declaration node kinds and are unaffected. +/// Field names accepted: Rust `let_declaration` uses `pattern` / +/// `value`; Python `assignment` and JS `assignment_expression` use +/// `left` / `right`. Both shapes share the same recognition path. 
fn collect_non_sink_binding( node: Node<'_>, bytes: &[u8], rules: &AuthAnalysisRules, state: &mut UnitState, ) { - let Some(pattern) = node.child_by_field_name("pattern") else { + let Some(pattern) = node + .child_by_field_name("pattern") + .or_else(|| node.child_by_field_name("left")) + else { return; }; let Some(var_name) = first_identifier_name(pattern, bytes) else { @@ -1092,7 +1174,9 @@ fn collect_non_sink_binding( } } - if let Some(value) = node.child_by_field_name("value") + if let Some(value) = node + .child_by_field_name("value") + .or_else(|| node.child_by_field_name("right")) && value_is_non_sink_constructor(value, bytes, rules) { state.non_sink_vars.insert(var_name); @@ -3457,18 +3541,53 @@ fn collect_param_names( "parameter_declaration" | "variadic_parameter_declaration" if node.child_by_field_name("name").is_some() => { - if let Some(type_node) = node.child_by_field_name("type") - && is_go_non_user_input_type(type_node, bytes) + let type_node = node.child_by_field_name("type"); + if let Some(t) = type_node + && is_go_non_user_input_type(t, bytes) { return; } + // Mirror of the Python `typed_parameter` filter (see + // `is_python_id_like_typed_param` arm above): for non-route + // units, an id-like Go param whose declared type is a + // bounded primitive scalar (`int64`, `uint32`, `string`, + // `bool`, `byte`, `rune`, `float64`, …) is a caller-passed + // scope identifier, not user-controlled HTTP input. Real + // Go HTTP handlers always carry a framework-request-typed + // param (`*http.Request`, `*gin.Context`, `echo.Context`, + // `*fiber.Ctx`, `*context.APIContext`, …) and are + // recognised by the per-framework route extractors which + // call `function_params_route_handler` + // (`include_id_like_typed = true`) — those bypass this + // filter so id-shaped path params survive on real routes. 
+ // + // Real-repo trigger: `/Users/elipeter/oss/gitea` ─ ~957 + // `go.auth.missing_ownership_check` findings on backend + // helpers like + // `func GetRunByRepoAndID(ctx context.Context, + // repoID, runID int64)`, + // `func DeleteRunner(ctx context.Context, id int64)`, + // and the entire `models/...` DAO layer where the + // ownership check sits in the calling route handler. + // Same shape over-fires on minio's `cmd/iam-*-store` + // helpers and would on every Go ORM/DAO codebase. + let type_is_bounded_scalar = type_node + .map(|t| is_go_bounded_scalar_type(t, bytes)) + .unwrap_or(false); let mut cursor = node.walk(); for child in node.children_by_field_name("name", &mut cursor) { if child.kind() == "identifier" { let name = text(child, bytes); - if !name.is_empty() && !out.contains(&name) { - out.push(name); + if name.is_empty() || out.contains(&name) { + continue; } + if !include_id_like_typed + && type_is_bounded_scalar + && is_go_id_like_typed_param(&name) + { + continue; + } + out.push(name); } } } @@ -3635,6 +3754,56 @@ fn is_python_id_like_typed_param(name: &str) -> bool { lower == "id" || lower.ends_with("id") || lower.ends_with("_id") || lower.ends_with("ids") } +/// Same shape predicate used by the Go typed-param fallback in +/// `collect_param_names`. Kept separate from the Python helper so the +/// two recognisers can diverge if/when language-specific spellings +/// emerge; the current vocabulary is the same canonical id-suffix +/// set as `auth_analysis::checks::is_id_like_name`. +fn is_go_id_like_typed_param(name: &str) -> bool { + let lower = name.to_ascii_lowercase(); + lower == "id" || lower.ends_with("id") || lower.ends_with("_id") || lower.ends_with("ids") +} + +/// True iff `type_node` names a Go bounded primitive scalar: +/// integer (`int*` / `uint*` / `byte` / `rune` / `uintptr`), floating +/// point (`float32` / `float64`), `bool`, or `string`. 
Used by the +/// Go arm of `collect_param_names` to recognise the +/// "id-like name + scalar type" DAO-helper shape and refuse to lift +/// such params into `unit.params` for non-route units. +/// +/// Conservative scope: only bare `type_identifier` matches. Pointer +/// types (`*Foo`), generic types (`Map[K, V]`), qualified types +/// (`pkg.Type`), and slice/array types (`[]T`) are framework or +/// payload shapes, NOT bounded primitives, so they're left alone and +/// the param keeps its name. This keeps real handler shapes that +/// happen to spell an id-like name on a complex type (`req +/// *RequestWithID`) from being silently dropped. +fn is_go_bounded_scalar_type(type_node: Node<'_>, bytes: &[u8]) -> bool { + if type_node.kind() != "type_identifier" { + return false; + } + matches!( + text(type_node, bytes).as_str(), + "int" + | "int8" + | "int16" + | "int32" + | "int64" + | "uint" + | "uint8" + | "uint16" + | "uint32" + | "uint64" + | "uintptr" + | "byte" + | "rune" + | "float32" + | "float64" + | "bool" + | "string" + ) +} + pub fn is_function_like(node: Node<'_>) -> bool { matches!( node.kind(), @@ -4080,20 +4249,41 @@ fn subscript_value_ref(node: Node<'_>, bytes: &[u8]) -> Option<ValueRef> { pub fn member_chain(node: Node<'_>, bytes: &[u8]) -> Vec<String> { if node.kind() == "call" { - let mut chain = if let Some(receiver) = node.child_by_field_name("receiver") { - member_chain(receiver, bytes) - } else { - Vec::new() - }; + // Ruby-style call: explicit receiver field + method/name field. + if let Some(receiver) = node.child_by_field_name("receiver") { + let mut chain = member_chain(receiver, bytes); + let method = node + .child_by_field_name("method") + .or_else(|| node.child_by_field_name("name")) + .map(|method| text(method, bytes)) + .unwrap_or_default(); + if !method.is_empty() { + chain.push(method); + } + return chain; + } + // Python-style call: callable expression in the `function` field.
+ // Recursing into it lets chained shapes like + // `select(X).filter_by(...)` produce `["select()", "filter_by"]` + // — the parent attribute branch appends `()` when its `object` + // is a call, marking the intermediate-call shape so that + // `receiver_is_chained_call` detects it. Closes airflow-style + // SQLAlchemy queryset-builder chains that previously reduced to + // bare `["filter_by"]`. + if let Some(function) = node.child_by_field_name("function") { + return member_chain(function, bytes); + } + // Bare-method fallback for parser shapes that expose method/name + // without a receiver (Ruby implicit-self calls, etc.). let method = node .child_by_field_name("method") .or_else(|| node.child_by_field_name("name")) .map(|method| text(method, bytes)) .unwrap_or_default(); if !method.is_empty() { - chain.push(method); + return vec![method]; } - return chain; + return Vec::new(); } if node.kind() == "method_invocation" || node.kind() == "method_call_expression" { @@ -4164,7 +4354,23 @@ pub fn member_chain(node: Node<'_>, bytes: &[u8]) -> Vec { .or_else(|| node.child_by_field_name("operand")) .or_else(|| node.child_by_field_name("argument")) { - chain.extend(member_chain(object, bytes)); + let object_is_call = matches!( + object.kind(), + "call" | "call_expression" | "method_invocation" | "method_call_expression" + ); + let mut sub = member_chain(object, bytes); + // Mark intermediate-call segments with `()` so a downstream + // chain like `select(X).filter_by(...)` becomes + // `["select()", "filter_by"]` rather than `["select", "filter_by"]`. + // `receiver_is_chained_call` consults the `(` to detect the + // opaque-builder receiver. 
+ if object_is_call + && sub.last().map(|s| !s.ends_with(')')).unwrap_or(false) + && let Some(last) = sub.last_mut() + { + last.push_str("()"); + } + chain.extend(sub); } if let Some(property) = node .child_by_field_name("property") @@ -4876,6 +5082,200 @@ mod tests { assert!(!params.contains(&"int".to_string()), "got {:?}", params); } + /// DAO-helper shape (`func GetRunByRepoAndID(ctx context.Context, + /// repoID, runID int64)`): id-like names with bounded primitive + /// scalar types are caller-passed scope identifiers, NOT user + /// input. For non-route units (`function_params`, + /// `include_id_like_typed = false`), they must NOT lift into + /// `unit.params` — that would gate `unit_has_user_input_evidence` + /// open on every internal Go ORM helper and over-fire + /// `go.auth.missing_ownership_check`. + /// + /// Real-repo trigger: + /// `/Users/elipeter/oss/gitea/models/actions/run_job.go:: + /// GetRunByRepoAndID` and ~957 sibling helpers across gitea's + /// `models/...` DAO layer. Same shape over-fires on minio's + /// `cmd/iam-*-store` and is the canonical Go ORM helper signature. 
+ #[test] + fn collect_param_names_go_drops_id_like_scalar_params_for_dao_helper() { + use super::function_params; + let mut parser = tree_sitter::Parser::new(); + parser + .set_language(&tree_sitter::Language::from(tree_sitter_go::LANGUAGE)) + .unwrap(); + let src = + b"package x\nfunc GetRunByRepoAndID(ctx context.Context, repoID, runID int64) {}\n"; + let tree = parser.parse(src.as_slice(), None).unwrap(); + let func = (0..tree.root_node().named_child_count()) + .filter_map(|i| tree.root_node().named_child(i as u32)) + .find(|n| n.kind() == "function_declaration") + .expect("file should have a function_declaration"); + let params = function_params(func, src); + assert!( + !params.contains(&"ctx".to_string()), + "context.Context dropped: got {:?}", + params + ); + assert!( + !params.contains(&"repoID".to_string()), + "id-like scalar param dropped for DAO helper: got {:?}", + params + ); + assert!( + !params.contains(&"runID".to_string()), + "id-like scalar param dropped for DAO helper: got {:?}", + params + ); + assert!( + params.is_empty(), + "no params survive on DAO-shape helper: got {:?}", + params + ); + } + + /// Conservative scope: only **bounded primitive scalar** types + /// trigger the id-like drop. Pointer / struct / slice types are + /// payload shapes that may or may not be user-controlled — leave + /// them alone so non-DAO helpers retain their evidence. + #[test] + fn collect_param_names_go_keeps_id_like_pointer_struct_param() { + use super::function_params; + let mut parser = tree_sitter::Parser::new(); + parser + .set_language(&tree_sitter::Language::from(tree_sitter_go::LANGUAGE)) + .unwrap(); + // `runnerID *Runner` — id-like name, but the type is a pointer + // (payload shape), so the param name must survive. 
+ let src = b"package x\nfunc UpdateRunner(ctx context.Context, runnerID *Runner) {}\n"; + let tree = parser.parse(src.as_slice(), None).unwrap(); + let func = (0..tree.root_node().named_child_count()) + .filter_map(|i| tree.root_node().named_child(i as u32)) + .find(|n| n.kind() == "function_declaration") + .expect("file should have a function_declaration"); + let params = function_params(func, src); + assert!( + params.contains(&"runnerID".to_string()), + "id-like pointer param survives: got {:?}", + params + ); + } + + /// Route handlers go through `function_params_route_handler` + /// (`include_id_like_typed = true`) — the id-like-scalar filter + /// must NOT trip there. Path-param-on-REST-route is *the* + /// primary user input and middleware-injected auth checks rely on + /// these names being present in `unit.params`. + #[test] + fn collect_param_names_go_route_handler_keeps_id_like_scalar_params() { + use super::function_params_route_handler; + let mut parser = tree_sitter::Parser::new(); + parser + .set_language(&tree_sitter::Language::from(tree_sitter_go::LANGUAGE)) + .unwrap(); + let src = b"package x\nfunc GetRepo(ctx context.Context, repoID int64) {}\n"; + let tree = parser.parse(src.as_slice(), None).unwrap(); + let func = (0..tree.root_node().named_child_count()) + .filter_map(|i| tree.root_node().named_child(i as u32)) + .find(|n| n.kind() == "function_declaration") + .expect("file should have a function_declaration"); + let params = function_params_route_handler(func, src); + assert!( + params.contains(&"repoID".to_string()), + "id-like scalar param kept for route handler: got {:?}", + params + ); + } + + /// Pin `member_chain` output for the SQLAlchemy queryset chain + /// `select(C).filter_by(id=x)`. Pre-fix, Python `call` nodes use a + /// `function` field (not `receiver`/`method`) so the recursive call + /// arm returned an empty Vec, reducing the chain to bare + /// `["filter_by"]`. 
The fix: (1) traverse `function` field in the + /// `call` arm; (2) the parent attribute branch appends `()` to last + /// segment when its `object` is a call. Together they produce + /// `["select()", "filter_by"]` so `receiver_is_chained_call` detects + /// the intermediate-call shape. + #[test] + fn member_chain_python_select_filter_by_chain_marks_intermediate_call() { + use super::{callee_name, member_chain}; + use tree_sitter::{Node, Parser}; + + let mut parser = Parser::new(); + parser + .set_language(&tree_sitter::Language::from(tree_sitter_python::LANGUAGE)) + .unwrap(); + let src = b"x = select(C).filter_by(id=u)\n"; + let tree = parser.parse(src.as_slice(), None).unwrap(); + + fn find_outer_call<'a>(node: Node<'a>) -> Option<Node<'a>> { + if node.kind() == "call" + && let Some(function) = node.child_by_field_name("function") + && function.kind() == "attribute" + { + return Some(node); + } + for i in 0..node.named_child_count() { + if let Some(child) = node.named_child(i as u32) + && let Some(found) = find_outer_call(child) + { + return Some(found); + } + } + None + } + + let outer_call = find_outer_call(tree.root_node()) + .expect("expected outer call node `select(C).filter_by(id=u)`"); + + assert_eq!( + member_chain(outer_call, src), + vec!["select()".to_string(), "filter_by".to_string()], + "Python chained call must produce `[select(), filter_by]` so receiver_is_chained_call detects the intermediate-call shape", + ); + assert_eq!( + callee_name(outer_call, src), + "select().filter_by".to_string(), + "callee_name joins the chain with `.`", + ); + } + + /// Regression guard: simple Python `obj.method(arg)` callees keep + /// their previous `member_chain` output (`["obj", "method"]`). The + /// `function`-field traversal must not pollute non-chained shapes.
+ #[test] + fn member_chain_python_simple_attribute_call_unchanged() { + use super::callee_name; + use tree_sitter::{Node, Parser}; + + let mut parser = Parser::new(); + parser + .set_language(&tree_sitter::Language::from(tree_sitter_python::LANGUAGE)) + .unwrap(); + let src = b"x = obj.method(a)\n"; + let tree = parser.parse(src.as_slice(), None).unwrap(); + + fn find_call<'a>(node: Node<'a>) -> Option<Node<'a>> { + if node.kind() == "call" { + return Some(node); + } + for i in 0..node.named_child_count() { + if let Some(child) = node.named_child(i as u32) + && let Some(found) = find_call(child) + { + return Some(found); + } + } + None + } + + let call_node = find_call(tree.root_node()).expect("expected `obj.method(a)` call"); + assert_eq!( + callee_name(call_node, src), + "obj.method".to_string(), + "simple attribute call must not pick up `()` markers", + ); + } + mod ruby_visibility_and_callbacks { use super::super::{ RubyVisibility, ruby_callback_target_names, ruby_method_is_callback_or_private, diff --git a/src/auth_analysis/extract/django.rs b/src/auth_analysis/extract/django.rs index 1b7dfcc2..99131c76 100644 --- a/src/auth_analysis/extract/django.rs +++ b/src/auth_analysis/extract/django.rs @@ -5,7 +5,7 @@ use super::common::{ string_literal_value, text, visit_named_nodes, }; use crate::auth_analysis::config::{AuthAnalysisRules, matches_name}; -use crate::auth_analysis::extract::common::{attach_route_handler, collect_top_level_units}; +use crate::auth_analysis::extract::common::attach_route_handler; use crate::auth_analysis::model::{ AnalysisUnitKind, AuthorizationModel, CallSite, Framework, HttpMethod, }; @@ -29,18 +29,14 @@ impl AuthExtractor for DjangoExtractor { bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); - - collect_top_level_units(root, bytes, rules, &mut model); visit_named_nodes(root, &mut |node| { if
node.kind() == "call" { - maybe_collect_django_path(root, node, bytes, path, rules, &mut model); + maybe_collect_django_path(root, node, bytes, path, rules, model); } }); - - model } } diff --git a/src/auth_analysis/extract/echo.rs b/src/auth_analysis/extract/echo.rs index cba582b9..89667ca9 100644 --- a/src/auth_analysis/extract/echo.rs +++ b/src/auth_analysis/extract/echo.rs @@ -1,8 +1,8 @@ use super::AuthExtractor; use super::common::{ - attach_route_handler, call_site_from_node, collect_top_level_units, http_method_from_name, - is_handler_reference, join_route_paths, member_target, named_children, push_route_registration, - string_literal_value, text, visit_named_nodes, + attach_route_handler, call_site_from_node, http_method_from_name, is_handler_reference, + join_route_paths, member_target, named_children, push_route_registration, string_literal_value, + text, visit_named_nodes, }; use crate::auth_analysis::config::AuthAnalysisRules; use crate::auth_analysis::model::{AuthorizationModel, CallSite, Framework}; @@ -26,24 +26,21 @@ impl AuthExtractor for EchoExtractor { bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); let mut groups = HashMap::new(); - collect_top_level_units(root, bytes, rules, &mut model); visit_named_nodes(root, &mut |node| match node.kind() { "short_var_declaration" | "assignment_statement" => { maybe_collect_group_binding(node, bytes, &mut groups) } "call_expression" => { maybe_collect_group_use(node, bytes, &mut groups); - maybe_collect_route(root, node, bytes, path, rules, &groups, &mut model); + maybe_collect_route(root, node, bytes, path, rules, &groups, model); } _ => {} }); - - model } } diff --git a/src/auth_analysis/extract/express.rs b/src/auth_analysis/extract/express.rs index 2203d900..b85fed2f 100644 --- a/src/auth_analysis/extract/express.rs +++ 
b/src/auth_analysis/extract/express.rs @@ -1,8 +1,8 @@ use super::AuthExtractor; use super::common::{ - attach_route_handler, call_site_from_node, collect_top_level_units, http_method_from_name, - is_handler_reference, member_target, named_children, push_route_registration, - string_literal_value, visit_named_nodes, + attach_route_handler, call_site_from_node, http_method_from_name, is_handler_reference, + member_target, named_children, push_route_registration, string_literal_value, + visit_named_nodes, }; use crate::auth_analysis::config::AuthAnalysisRules; use crate::auth_analysis::model::{AuthorizationModel, Framework}; @@ -25,18 +25,14 @@ impl AuthExtractor for ExpressExtractor { bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); - - collect_top_level_units(root, bytes, rules, &mut model); visit_named_nodes(root, &mut |node| { if node.kind() == "call_expression" { - maybe_collect_route(root, node, bytes, path, rules, &mut model); + maybe_collect_route(root, node, bytes, path, rules, model); } }); - - model } } diff --git a/src/auth_analysis/extract/fastify.rs b/src/auth_analysis/extract/fastify.rs index b6e67aad..05063f10 100644 --- a/src/auth_analysis/extract/fastify.rs +++ b/src/auth_analysis/extract/fastify.rs @@ -1,8 +1,8 @@ use super::AuthExtractor; use super::common::{ - attach_route_handler, call_sites_from_value, collect_top_level_units, http_method_from_name, - is_handler_reference, member_target, named_children, object_property_value, - push_route_registration, string_literal_value, visit_named_nodes, + attach_route_handler, call_sites_from_value, http_method_from_name, is_handler_reference, + member_target, named_children, object_property_value, push_route_registration, + string_literal_value, visit_named_nodes, }; use crate::auth_analysis::config::AuthAnalysisRules; use 
crate::auth_analysis::model::{AuthorizationModel, CallSite, Framework}; @@ -25,19 +25,15 @@ impl AuthExtractor for FastifyExtractor { bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); - - collect_top_level_units(root, bytes, rules, &mut model); visit_named_nodes(root, &mut |node| { if node.kind() == "call_expression" { - maybe_collect_shorthand_route(root, node, bytes, path, rules, &mut model); - maybe_collect_route_object(root, node, bytes, path, rules, &mut model); + maybe_collect_shorthand_route(root, node, bytes, path, rules, model); + maybe_collect_route_object(root, node, bytes, path, rules, model); } }); - - model } } diff --git a/src/auth_analysis/extract/flask.rs b/src/auth_analysis/extract/flask.rs index 1c64c42f..5fcdd8ac 100644 --- a/src/auth_analysis/extract/flask.rs +++ b/src/auth_analysis/extract/flask.rs @@ -4,15 +4,27 @@ use super::common::{ push_route_registration, string_literal_value, text, visit_named_nodes, }; use crate::auth_analysis::config::{AuthAnalysisRules, matches_name}; -use crate::auth_analysis::extract::common::{collect_top_level_units, decorated_definition_child}; -use crate::auth_analysis::model::{AuthorizationModel, CallSite, Framework, HttpMethod}; +use crate::auth_analysis::extract::common::decorated_definition_child; +use crate::auth_analysis::model::{ + AuthCheck, AuthCheckKind, AuthorizationModel, CallSite, Framework, HttpMethod, +}; use crate::labels::bare_method_name; use crate::utils::project::{DetectedFramework, FrameworkContext}; +use std::collections::HashMap; use std::path::Path; use tree_sitter::{Node, Tree}; pub struct FlaskExtractor; +/// Map from a module-level router/app variable name to the +/// `dependencies=[...]` deps declared on its constructor call. 
FastAPI +/// propagates these to every route attached via +/// `@.(...)`, so the route extractor must merge them in +/// before running ownership / membership checks. Each entry follows +/// the same shape as `extract_fastapi_dependencies` produces: +/// `(CallSite, is_scoped_security)`. See `collect_router_level_dependencies`. +type RouterLevelDepMap = HashMap>; + impl AuthExtractor for FlaskExtractor { fn supports(&self, lang: &str, framework_ctx: Option<&FrameworkContext>) -> bool { lang == "python" @@ -26,18 +38,52 @@ impl AuthExtractor for FlaskExtractor { bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); - - collect_top_level_units(root, bytes, rules, &mut model); + // Pass 1: pre-walk for module-level router/app assignments + // (`ti_id_router = VersionedAPIRouter(dependencies=[Security(...)])`). + // FastAPI applies router-level deps to every attached route, so + // every per-route `@.(...)` decorator must merge + // the router's deps before the ownership check fires. Without + // this, airflow's execution-API routes that re-use a single + // `ti_id_router` declared once at module scope inherit no auth + // and flag `missing_ownership_check` despite being authorized. + let mut router_deps = collect_router_level_dependencies(root, bytes); + // Merge in cross-file router-deps lifted via + // `.include_router(., ...)` calls in + // other project files — pre-resolved by the orchestrator at + // pass 2 entry from `GlobalSummaries.router_facts_by_module`. + // Cross-file deps are PREPENDED to mirror FastAPI's runtime + // ordering (parent router deps run before any in-file router + // deps and before per-route deps). Empty when global summaries + // are unavailable (single-file scan / unit-test paths). 
+ if !model.cross_file_router_deps.is_empty() { + for (router_var, cross_deps) in &model.cross_file_router_deps { + if cross_deps.is_empty() { + continue; + } + let entry = router_deps.entry(router_var.clone()).or_default(); + let mut merged: Vec<(CallSite, bool)> = cross_deps.clone(); + // Dedup so an inline `dependencies=[Security(...)]` and a + // cross-file lift of the same `Security(callee)` don't + // double-fire downstream auth checks. + for dep in entry.iter() { + let already = merged + .iter() + .any(|(call, scoped)| call.name == dep.0.name && *scoped == dep.1); + if !already { + merged.push(dep.clone()); + } + } + *entry = merged; + } + } visit_named_nodes(root, &mut |node| { if node.kind() == "decorated_definition" { - maybe_collect_flask_route(root, node, bytes, path, rules, &mut model); + maybe_collect_flask_route(root, node, bytes, path, rules, model, &router_deps); } }); - - model } } @@ -54,6 +100,7 @@ fn maybe_collect_flask_route( path: &Path, rules: &AuthAnalysisRules, model: &mut AuthorizationModel, + router_deps: &RouterLevelDepMap, ) { let Some(definition) = decorated_definition_child(node) else { return; @@ -63,21 +110,44 @@ fn maybe_collect_flask_route( } let mut route_specs = Vec::new(); - let mut middleware_calls = Vec::new(); + let mut middleware_calls: Vec<(CallSite, bool)> = Vec::new(); for decorator in decorator_expressions(node) { if let Some(mut specs) = parse_flask_route_decorator(decorator, bytes) { route_specs.append(&mut specs); + // FastAPI propagates router-level `dependencies=[...]` from + // ` = APIRouter(...)` to every attached + // `@.(...)` route. Look up the decorator's + // router prefix in the pre-built map and merge its deps + // BEFORE the route-level deps so the ordering matches + // FastAPI runtime semantics (router deps run before route + // deps). Without this, airflow execution-API routes that + // declare auth once at the router level fire spurious + // `missing_ownership_check` / `token_override` findings. 
+ if let Some(prefix) = router_prefix_from_decorator(decorator, bytes) + && let Some(deps) = router_deps.get(&prefix) + { + middleware_calls.extend(deps.iter().cloned()); + } // FastAPI puts route-level dependencies (auth checks + // logging hooks) inside the route decorator's // `dependencies=[Depends(...)]` keyword argument, instead // of as separate `@decorator` lines like Flask. Walk the // route decorator's keyword args for that shape and lift - // each `Depends(call(...))` element into the - // middleware_calls list, so the same `inject_middleware_auth` - // path that Flask uses also picks up FastAPI auth deps. + // each `Depends(call(...))` / `Security(call, scopes=[...])` + // element into the middleware_calls list, so the same + // `inject_middleware_auth` path that Flask uses also + // picks up FastAPI auth deps. The boolean tracks whether + // the wrapper was a scoped `Security(...)` — those are + // OAuth2-scope-checked authorization (not just login), + // so the AuthCheckKind is promoted in + // `inject_middleware_auth`. middleware_calls.extend(extract_fastapi_dependencies(decorator, bytes)); } else { - middleware_calls.extend(expand_decorator_calls(decorator, bytes)); + middleware_calls.extend( + expand_decorator_calls(decorator, bytes) + .into_iter() + .map(|c| (c, false)), + ); } } @@ -104,6 +174,10 @@ fn maybe_collect_flask_route( rules, ); + let registration_calls: Vec = middleware_calls + .iter() + .map(|(call, _)| call.clone()) + .collect(); push_route_registration( model, Framework::Flask, @@ -111,7 +185,7 @@ fn maybe_collect_flask_route( spec.path, path, handler, - middleware_calls.clone(), + registration_calls, ); } } @@ -272,19 +346,25 @@ fn expand_decorator_calls(node: Node<'_>, bytes: &[u8]) -> Vec { } /// Walk the route-decorator call's keyword args looking for the FastAPI -/// `dependencies=[Depends(call(...)), Depends(call), ...]` shape. 
For -/// each `Depends(...)` list element, extract the inner callable as a -/// `CallSite` so it can flow through `inject_middleware_auth` and be -/// matched against the per-language authorization-check / login-guard -/// name lists. Refuses non-call elements and `Depends(...)` without a -/// recognised inner call shape. +/// `dependencies=[Depends(call(...)), Security(call, scopes=[...]), ...]` +/// shape. For each `Depends(...)` / `Security(...)` list element, +/// extract the inner callable as a `CallSite` so it can flow through +/// `inject_middleware_auth` and be matched against the per-language +/// authorization-check / login-guard name lists. Refuses non-call +/// elements and markers without a recognised inner call shape. +/// +/// Returns `(CallSite, is_scoped_security)` pairs. The boolean is +/// `true` when the wrapper was `Security(...)` carrying a non-empty +/// `scopes=[...]` kwarg — those are OAuth2-scope-checked authorization +/// (FastAPI semantics), not bare login dependency, so +/// `inject_middleware_auth` promotes the `AuthCheckKind`. /// /// The function is decoupled from Flask semantics (Flask routes never /// use `dependencies=`); the lookup is purely structural and matches /// FastAPI's documented dependency-injection convention. Lives in the /// flask module because Flask's route-decorator parser already targets /// the `@.(, ...)` shape that FastAPI shares. -fn extract_fastapi_dependencies(decorator_expr: Node<'_>, bytes: &[u8]) -> Vec { +fn extract_fastapi_dependencies(decorator_expr: Node<'_>, bytes: &[u8]) -> Vec<(CallSite, bool)> { if decorator_expr.kind() != "call" { return Vec::new(); } @@ -296,47 +376,232 @@ fn extract_fastapi_dependencies(decorator_expr: Node<'_>, bytes: &[u8]) -> Vec = (..., dependencies=[Depends(...), Security(...)])` +/// and build a map from the router variable name to its router-level +/// dependency CallSites. 
FastAPI applies these to every attached +/// `@.(...)` route at runtime — the per-route extractor +/// merges them in before running ownership / membership checks. +/// +/// Recognised router/app constructors (callee-tail-name match, so +/// `fastapi.APIRouter(...)` and `routing.APIRouter(...)` both work): +/// * `APIRouter` (FastAPI canonical) +/// * `FastAPI` (FastAPI app object — `dependencies=[...]` on the app +/// applies to every route under it) +/// * `VersionedAPIRouter` (airflow-specific subclass) +/// * Any callee whose tail name ends with `Router` — covers +/// project-specific `APIRouter` subclasses without the airflow- +/// specific allowlist needing to grow per-codebase. Conservative: +/// the lookup only ever fires when the route decorator's prefix +/// matches a captured variable, so over-matching the constructor +/// doesn't produce false auth attribution unless the same name is +/// also used as a route decorator's receiver — extremely rare. +/// +/// The walk is restricted to module-root expression statements / typed +/// assignments — nested function-local routers aren't supported (and +/// don't appear in real-world FastAPI codebases — the router pattern is +/// always module-scoped so it can be imported into the app at startup). +fn collect_router_level_dependencies(root: Node<'_>, bytes: &[u8]) -> RouterLevelDepMap { + let mut out: RouterLevelDepMap = HashMap::new(); + for child in named_children(root) { + // Top-level shape: `expression_statement` wrapping an + // `assignment` (Python tree-sitter convention). Also accept + // bare `assignment` in case the grammar changes. 
+ let assign = match child.kind() { + "expression_statement" => named_children(child).into_iter().next(), + "assignment" => Some(child), + _ => None, + }; + let Some(assign) = assign else { continue }; + if assign.kind() != "assignment" { + continue; + } + let Some(left) = assign.child_by_field_name("left") else { + continue; + }; + if left.kind() != "identifier" { + continue; + } + let Some(right) = assign.child_by_field_name("right") else { + continue; + }; + if right.kind() != "call" { + continue; + } + let Some(function) = right.child_by_field_name("function") else { + continue; + }; + let function_text = text(function, bytes); + if !is_router_like_constructor(&function_text) { + continue; + } + let Some(arguments) = right.child_by_field_name("arguments") else { + continue; + }; + let Some(deps_value) = keyword_argument_value(arguments, bytes, "dependencies") else { + continue; + }; + let mut deps = Vec::new(); + for element in named_children(deps_value) { + if let Some(unwrapped) = unwrap_depends_call(element, bytes) { + deps.push(unwrapped); + } + } + if deps.is_empty() { + continue; + } + let var_name = text(left, bytes).trim().to_string(); + if var_name.is_empty() { + continue; + } + // First declaration wins. A ` = …` re-assignment + // would be unusual at module scope; if it happens, the first + // dependency declaration is conservatively the one that + // applies to most routes attached after it. + out.entry(var_name).or_insert(deps); + } + out +} + +/// True for callee text that looks like a FastAPI router or app +/// constructor. Tail-name match (after the last `.`) so +/// `fastapi.APIRouter` / `routing.APIRouter` / bare `APIRouter` all +/// hit, plus airflow's `VersionedAPIRouter` subclass and any project- +/// specific `*Router` callable. See `collect_router_level_dependencies` +/// for the wider rationale. 
+fn is_router_like_constructor(callee: &str) -> bool { + let trimmed = callee.trim(); + let tail = trimmed.rsplit('.').next().unwrap_or(trimmed); + if tail == "APIRouter" || tail == "FastAPI" || tail == "VersionedAPIRouter" { + return true; + } + // `*Router` suffix — covers project-specific subclasses without an + // exhaustive allowlist. Reject empty / single-char / lowercase + // tails to avoid catching arbitrary identifiers. + if tail.len() > "Router".len() + && tail.ends_with("Router") + && tail.chars().next().is_some_and(|c| c.is_ascii_uppercase()) + { + return true; + } + false +} + +/// Extract the router-receiver identifier from a route-decorator call +/// node. Decorator shape: `@.(, ...)` — the +/// callee is `.`, so the prefix is everything before the +/// last `.`. Returns `None` for decorators that don't match the +/// expected `attribute`-style shape (e.g. bare `@requires_auth` or +/// `@blueprint.route("/x")` where the attribute is the verb itself). +fn router_prefix_from_decorator(decorator_expr: Node<'_>, bytes: &[u8]) -> Option { + if decorator_expr.kind() != "call" { + return None; + } + let function = decorator_expr.child_by_field_name("function")?; + if function.kind() != "attribute" { + return None; + } + let object = function.child_by_field_name("object")?; + if !matches!(object.kind(), "identifier" | "attribute") { + return None; + } + let prefix = text(object, bytes).trim().to_string(); + if prefix.is_empty() { + None + } else { + Some(prefix) + } +} + +/// Unwrap one `Depends(...)` / `Security(...)` list element from a +/// FastAPI `dependencies` list and return the inner callable as a +/// `CallSite`. Four shapes are accepted: /// * `Depends(callee(arg1, arg2))`, most common, the inner call is /// the callable factory invocation; record `callee` as the auth /// check. /// * `Depends(callee)`, bare reference; record `callee` itself. -/// * `Depends()` / non-`Depends` items, skipped. 
-fn unwrap_depends_call(node: Node<'_>, bytes: &[u8]) -> Option { +/// * `Security(callee, scopes=[...])`, FastAPI's OAuth2-scope +/// variant of `Depends`; the first positional arg is the auth +/// callable, the `scopes=` kwarg is ignored. Real-world airflow +/// execution-API routes use this form +/// (`task_instances.py:104`). +/// * `Depends()` / non-marker items, skipped. +/// +/// Skips `keyword_argument` children when locating the first +/// positional, so kwargs ordering (`Security(scopes=..., callee)`) +/// does not hide the dependency. +fn unwrap_depends_call(node: Node<'_>, bytes: &[u8]) -> Option<(CallSite, bool)> { if node.kind() != "call" { return None; } let function = node.child_by_field_name("function")?; let function_text = text(function, bytes); - if !is_depends_callee(&function_text) { + if !is_dep_marker_callee(&function_text) { return None; } + let is_security = is_security_marker(&function_text); let arguments = node.child_by_field_name("arguments")?; - let first = named_children(arguments).into_iter().next()?; + let children = named_children(arguments); + let first = children + .iter() + .copied() + .find(|child| child.kind() != "keyword_argument")?; + let scoped_security = is_security + && keyword_argument_value(arguments, bytes, "scopes") + .map(|value| { + named_children(value) + .iter() + .any(|item| item.kind() != "comment") + }) + .unwrap_or(false); match first.kind() { - "call" => Some(call_site_from_node(first, bytes)), - "identifier" | "attribute" | "scoped_identifier" => Some(call_site_from_node(first, bytes)), + "call" => Some((call_site_from_node(first, bytes), scoped_security)), + "identifier" | "attribute" | "scoped_identifier" => { + Some((call_site_from_node(first, bytes), scoped_security)) + } _ => None, } } -/// True for the FastAPI `Depends` marker, including the -/// fully-qualified `fastapi.Depends` form. Conservative: only literal -/// matches, no canonicalisation. 
-fn is_depends_callee(callee: &str) -> bool { +/// Subset of `is_dep_marker_callee` that matches only the `Security` +/// variant (and its fully-qualified forms). `Security(callable, +/// scopes=[...])` is FastAPI's OAuth2-scope-checked dependency: the +/// inner callable is invoked with the merged `SecurityScopes` from +/// every parent `Security(...)` declaration, and the route is +/// rejected unless the bearer token carries one of the requested +/// scopes. Treating a scoped Security wrapper as authorization +/// (not just login) is the deeper semantic encoded by +/// `inject_middleware_auth`. +fn is_security_marker(callee: &str) -> bool { let trimmed = callee.trim(); matches!( trimmed, - "Depends" | "fastapi.Depends" | "fastapi.params.Depends" + "Security" | "fastapi.Security" | "fastapi.params.Security" + ) +} + +/// True for the FastAPI dependency markers `Depends` and `Security`, +/// including their fully-qualified forms. `Security(callable, +/// scopes=[...])` is the OAuth2-scope variant of `Depends(callable)`; +/// FastAPI treats the inner callable identically for dep-injection +/// purposes. Conservative: only literal matches, no canonicalisation. 
+fn is_dep_marker_callee(callee: &str) -> bool { + let trimmed = callee.trim(); + matches!( + trimmed, + "Depends" + | "fastapi.Depends" + | "fastapi.params.Depends" + | "Security" + | "fastapi.Security" + | "fastapi.params.Security" ) } @@ -344,31 +609,108 @@ fn inject_middleware_auth( model: &mut AuthorizationModel, unit_idx: usize, line: usize, - middleware_calls: &[CallSite], + middleware_calls: &[(CallSite, bool)], rules: &AuthAnalysisRules, ) { let Some(unit) = model.units.get_mut(unit_idx) else { return; }; - for call in middleware_calls { - if let Some(mut check) = auth_check_from_call_site(call, line, rules) { - // Mark as route-level: the check is declared at the route - // boundary (Flask `@requires_role(...)` decorator, FastAPI - // `dependencies=[Depends(...)]`, or any custom-router - // equivalent) and semantically authorizes every value the - // handler receives, path param, body, query, downstream - // row fetches, the lot. `auth_check_covers_subject` reads - // `is_route_level` and short-circuits `true` for any - // non-login-guard match, which is the correct shape for a - // decorator-level guard whose inner call carries no - // per-arg subject ref pointing back into the handler body. - // LoginGuard / TokenExpiry / TokenRecipient kinds are - // already excluded by `has_prior_subject_auth`'s filter - // before they reach `auth_check_covers_subject`, so the - // flag is safe to set unconditionally here, it has no - // effect on those kinds. - check.is_route_level = true; - unit.auth_checks.push(check); + for (call, scoped_security) in middleware_calls { + let mut check = match auth_check_from_call_site(call, line, rules) { + Some(check) => check, + None if *scoped_security => { + // FastAPI `Security(callable, scopes=[...])` always + // enforces authorization at the route boundary even + // when `callable` doesn't appear in any per-language + // login-guard / authorization-check name list. 
Synthesise + // an `Other`-kind check so the route is recognised as + // guarded; without this, every `Security(custom_dep, + // scopes=[...])` route fires `missing_ownership_check` + // FPs. + AuthCheck { + kind: AuthCheckKind::Other, + callee: call.name.clone(), + subjects: Vec::new(), + span: call.span, + line, + args: call.args.clone(), + condition_text: None, + is_route_level: false, + } + } + None => continue, + }; + // Mark as route-level: the check is declared at the route + // boundary (Flask `@requires_role(...)` decorator, FastAPI + // `dependencies=[Depends(...)]`, or any custom-router + // equivalent) and semantically authorizes every value the + // handler receives, path param, body, query, downstream + // row fetches, the lot. `auth_check_covers_subject` reads + // `is_route_level` and short-circuits `true` for any + // non-login-guard match, which is the correct shape for a + // decorator-level guard whose inner call carries no + // per-arg subject ref pointing back into the handler body. + // LoginGuard / TokenExpiry / TokenRecipient kinds are + // already excluded by `has_prior_subject_auth`'s filter + // before they reach `auth_check_covers_subject`, so the + // flag is safe to set unconditionally here, it has no + // effect on those kinds. + check.is_route_level = true; + // FastAPI `Security(callable, scopes=[...])` is OAuth2-scope- + // checked authorization (the JWT must carry one of the listed + // scopes); a `LoginGuard` classification would be wrong because + // `has_prior_subject_auth` filters LoginGuard out. Promote to + // `Other` so the route counts as authorized for ownership / + // membership / token-override checks. 
+ if *scoped_security + && matches!( + check.kind, + AuthCheckKind::LoginGuard + | AuthCheckKind::TokenExpiry + | AuthCheckKind::TokenRecipient + ) + { + check.kind = AuthCheckKind::Other; + } + let push_token_synth = *scoped_security; + unit.auth_checks.push(check); + if push_token_synth { + // FastAPI `Security(callable, scopes=[...])` validates the + // bearer JWT in two ways: signature verification (which + // includes expiry — a JWT past its `exp` claim fails the + // signature path) and scope checking (the requested scopes + // identify what the bearer is authorized to act on, which + // semantically encodes recipient binding for the route). + // Synthesise the matching `TokenExpiry` + `TokenRecipient` + // checks so the `token_override_without_validation` rule + // recognises the JWT-validated route. Without this, + // every FastAPI route declaring scoped Security at the + // route or router boundary fires token-override FPs on + // its `session.add` / `Model.save()` calls — the + // missing_ownership_check sibling of the same finding is + // already cleared by the kind-promotion above. Empty- or + // missing-scopes Security wrappers fall through this gate + // (scoped_security is false) and remain bare login deps. 
+ unit.auth_checks.push(AuthCheck { + kind: AuthCheckKind::TokenExpiry, + callee: call.name.clone(), + subjects: Vec::new(), + span: call.span, + line, + args: call.args.clone(), + condition_text: None, + is_route_level: true, + }); + unit.auth_checks.push(AuthCheck { + kind: AuthCheckKind::TokenRecipient, + callee: call.name.clone(), + subjects: Vec::new(), + span: call.span, + line, + args: call.args.clone(), + condition_text: None, + is_route_level: true, + }); } } } @@ -410,24 +752,318 @@ mod test_decorator_tests { #[cfg(test)] mod fastapi_dependencies_tests { - use super::is_depends_callee; + use super::{is_dep_marker_callee, is_security_marker, unwrap_depends_call}; + use tree_sitter::Parser; - /// `is_depends_callee` only matches the FastAPI `Depends` marker. - /// Any other wrapper call inside `dependencies=[...]` is ignored , - /// extracting an inner callee from the wrong wrapper would - /// misclassify logging hooks or filter callables as auth checks. + fn parse_python(source: &str) -> tree_sitter::Tree { + let mut parser = Parser::new(); + parser + .set_language(&tree_sitter::Language::from(tree_sitter_python::LANGUAGE)) + .expect("python language"); + parser.parse(source, None).expect("parse") + } + + /// Walk the parsed tree to find the first `call` node whose + /// callee name matches `marker`. Helper for the `unwrap_depends_call` + /// regression tests below — the production extractor traverses the + /// route-decorator's `dependencies=[...]` list and feeds each + /// element into `unwrap_depends_call`, so the test mirrors that + /// element shape directly without the surrounding boilerplate. 
+ fn find_first_marker_call<'a>( + node: tree_sitter::Node<'a>, + bytes: &[u8], + marker: &str, + ) -> Option<tree_sitter::Node<'a>> { + if node.kind() == "call" + && let Some(function) = node.child_by_field_name("function") + && function.utf8_text(bytes).unwrap_or("") == marker + { + return Some(node); + } + for idx in 0..node.named_child_count() { + if let Some(child) = node.named_child(idx as u32) + && let Some(found) = find_first_marker_call(child, bytes, marker) + { + return Some(found); + } + } + None + } + + /// `is_dep_marker_callee` matches only FastAPI's `Depends` / + /// `Security` markers. Any other wrapper call inside + /// `dependencies=[...]` is ignored, extracting an inner callee + /// from the wrong wrapper would misclassify logging hooks or + /// filter callables as auth checks. #[test] - fn is_depends_callee_recognises_canonical_forms() { - assert!(is_depends_callee("Depends")); - assert!(is_depends_callee("fastapi.Depends")); - assert!(is_depends_callee("fastapi.params.Depends")); + fn is_dep_marker_callee_recognises_canonical_forms() { + assert!(is_dep_marker_callee("Depends")); + assert!(is_dep_marker_callee("fastapi.Depends")); + assert!(is_dep_marker_callee("fastapi.params.Depends")); + // Security variant — OAuth2-scope-bearing equivalent. + assert!(is_dep_marker_callee("Security")); + assert!(is_dep_marker_callee("fastapi.Security")); + assert!(is_dep_marker_callee("fastapi.params.Security")); // Whitespace tolerance. - assert!(is_depends_callee(" Depends ")); + assert!(is_dep_marker_callee(" Depends ")); + assert!(is_dep_marker_callee(" Security ")); // Negatives.
- assert!(!is_depends_callee("Annotated")); - assert!(!is_depends_callee("Body")); - assert!(!is_depends_callee("Depends.something")); - assert!(!is_depends_callee("RequiresAuth")); - assert!(!is_depends_callee("")); + assert!(!is_dep_marker_callee("Annotated")); + assert!(!is_dep_marker_callee("Body")); + assert!(!is_dep_marker_callee("Depends.something")); + assert!(!is_dep_marker_callee("Security.something")); + assert!(!is_dep_marker_callee("RequiresAuth")); + assert!(!is_dep_marker_callee("")); + } + + /// `is_security_marker` is the strictly-Security subset. Used to + /// promote the wrapper's `is_scoped_security` flag without a + /// second string-match pass. + #[test] + fn is_security_marker_recognises_security_only() { + assert!(is_security_marker("Security")); + assert!(is_security_marker("fastapi.Security")); + assert!(is_security_marker("fastapi.params.Security")); + assert!(is_security_marker(" Security ")); + // Depends is NOT a Security marker. + assert!(!is_security_marker("Depends")); + assert!(!is_security_marker("fastapi.Depends")); + assert!(!is_security_marker("Annotated")); + assert!(!is_security_marker("")); + } + + /// `Security(callable, scopes=[...])` — the canonical airflow + /// execution-API auth-dep shape (`task_instances.py:104`). Must + /// extract `callable` as the inner CallSite AND flag the wrapper as + /// scoped-security so `inject_middleware_auth` promotes the + /// AuthCheckKind from LoginGuard to Other (OAuth2 scopes are + /// authorization, not just login). Without the promotion, the + /// route still fires `missing_ownership_check` despite carrying a + /// declared route-level dependency. 
+ #[test] + fn unwrap_depends_call_security_with_scopes_flags_scoped() { + let src = "x = Security(require_auth, scopes=[\"token:execution\"])\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let call = find_first_marker_call(tree.root_node(), bytes, "Security") + .expect("Security call node"); + let (site, scoped) = unwrap_depends_call(call, bytes).expect("Security recognised"); + assert_eq!(site.name, "require_auth"); + assert!( + scoped, + "non-empty scopes=[...] must mark the wrapper scoped" + ); + } + + /// `Depends(callable())` — pre-existing FastAPI shape. Inner call + /// extracts to the factory's outer name; wrapper is NOT + /// scoped-security. Regression guard: the Security extension must + /// not flip Depends's scoped flag on. + #[test] + fn unwrap_depends_call_depends_factory_not_scoped() { + let src = "x = Depends(requires_access_dag(method=\"GET\"))\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let call = + find_first_marker_call(tree.root_node(), bytes, "Depends").expect("Depends call node"); + let (site, scoped) = unwrap_depends_call(call, bytes).expect("Depends recognised"); + assert_eq!(site.name, "requires_access_dag"); + assert!(!scoped, "Depends wrapper never scoped-security"); + } + + /// `Security(callable)` without scopes (rare but legal) is NOT + /// scoped — the OAuth2-scope semantic only fires when scopes is + /// non-empty, so the wrapper falls back to the regular login-guard + /// classification. Conservative: don't over-promote. + #[test] + fn unwrap_depends_call_security_without_scopes_not_scoped() { + let src = "x = Security(require_auth)\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let call = find_first_marker_call(tree.root_node(), bytes, "Security") + .expect("Security call node"); + let (site, scoped) = unwrap_depends_call(call, bytes).expect("Security recognised"); + assert_eq!(site.name, "require_auth"); + assert!( + !scoped, + "missing scopes=[...] 
kwarg means not scoped-security" + ); + } + + /// `Security(callable, scopes=[])` with an empty scope list is NOT + /// scoped-security: an empty `scopes=[]` declaration accumulates + /// no required scopes onto the JWT check, so the route is + /// effectively a bare login dependency. Conservative — keeps the + /// promotion gate tight. + #[test] + fn unwrap_depends_call_security_empty_scopes_not_scoped() { + let src = "x = Security(require_auth, scopes=[])\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let call = find_first_marker_call(tree.root_node(), bytes, "Security") + .expect("Security call node"); + let (site, scoped) = unwrap_depends_call(call, bytes).expect("Security recognised"); + assert_eq!(site.name, "require_auth"); + assert!(!scoped, "scopes=[] is not scoped-security"); + } +} + +#[cfg(test)] +mod router_level_dependencies_tests { + use super::{ + collect_router_level_dependencies, is_router_like_constructor, router_prefix_from_decorator, + }; + use tree_sitter::Parser; + + fn parse_python(source: &str) -> tree_sitter::Tree { + let mut parser = Parser::new(); + parser + .set_language(&tree_sitter::Language::from(tree_sitter_python::LANGUAGE)) + .expect("python language"); + parser.parse(source, None).expect("parse") + } + + /// Tail-name match: `fastapi.APIRouter`, `routing.APIRouter`, bare + /// `APIRouter`, plus airflow's `VersionedAPIRouter` subclass. Suffix + /// rule covers project-specific `*Router` subclasses without an + /// exhaustive allowlist. Negatives must reject arbitrary lowercase + /// or non-router identifiers. + #[test] + fn is_router_like_constructor_matches_canonical_names() { + // Canonical FastAPI. + assert!(is_router_like_constructor("APIRouter")); + assert!(is_router_like_constructor("FastAPI")); + assert!(is_router_like_constructor("fastapi.APIRouter")); + assert!(is_router_like_constructor("fastapi.routing.APIRouter")); + assert!(is_router_like_constructor("fastapi.FastAPI")); + // Airflow. 
+ assert!(is_router_like_constructor("VersionedAPIRouter")); + // Project-specific *Router subclasses. + assert!(is_router_like_constructor("CustomRouter")); + assert!(is_router_like_constructor("api.v1.MyRouter")); + // Negatives. + assert!(!is_router_like_constructor("router")); + assert!(!is_router_like_constructor("Annotated")); + assert!(!is_router_like_constructor("Depends")); + assert!(!is_router_like_constructor("Security")); + assert!(!is_router_like_constructor("")); + // `Router` alone is too short / generic to match the suffix + // rule (would over-fire on any callable named exactly + // `Router`); we accept it explicitly nowhere. + assert!(!is_router_like_constructor("Router")); + // `flat_router` ends with `Router` but starts lowercase — + // suffix rule requires uppercase first char to avoid catching + // generic verbs. + assert!(!is_router_like_constructor("flat_router")); + } + + /// Airflow's `ti_id_router = VersionedAPIRouter(route_class=..., + /// dependencies=[Security(require_auth, scopes=["ti:self"])])` is + /// the canonical real-repo shape. The collector must capture the + /// `Security(require_auth, scopes=...)` dep keyed by + /// `ti_id_router`, and the wrapper must be flagged scoped-security + /// so `inject_middleware_auth` promotes the AuthCheckKind to Other. 
+ #[test] + fn collect_router_level_dependencies_picks_up_versioned_apirouter_security() { + let src = "ti_id_router = VersionedAPIRouter(\n route_class=ExecutionAPIRoute,\n dependencies=[\n Security(require_auth, scopes=[\"ti:self\"]),\n ],\n)\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let map = collect_router_level_dependencies(tree.root_node(), bytes); + let deps = map + .get("ti_id_router") + .expect("ti_id_router router-level deps captured"); + assert_eq!(deps.len(), 1); + let (site, scoped) = &deps[0]; + assert_eq!(site.name, "require_auth"); + assert!(*scoped, "scopes=[\"ti:self\"] must mark scoped-security"); + } + + /// Bare `Depends(...)` router-level dep (no scopes) — captured but + /// NOT scoped-security. Mirrors the per-route Depends test in the + /// sibling fastapi_dependencies_tests module. + #[test] + fn collect_router_level_dependencies_picks_up_apirouter_depends_not_scoped() { + let src = "v1 = APIRouter(\n prefix=\"/v1\",\n dependencies=[Depends(get_current_user)],\n)\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let map = collect_router_level_dependencies(tree.root_node(), bytes); + let deps = map.get("v1").expect("v1 router-level deps captured"); + assert_eq!(deps.len(), 1); + let (site, scoped) = &deps[0]; + assert_eq!(site.name, "get_current_user"); + assert!(!*scoped, "Depends never scoped-security"); + } + + /// Constructor without `dependencies=` kwarg → no entry in the + /// map. Routers without router-level deps must not produce a + /// fake key — the per-route extractor would then merge an empty + /// list and silently no-op, but absence is the cleaner signal. 
+ #[test] + fn collect_router_level_dependencies_skips_routers_without_deps() { + let src = "router = APIRouter(prefix=\"/x\")\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let map = collect_router_level_dependencies(tree.root_node(), bytes); + assert!(!map.contains_key("router")); + } + + /// Non-router constructor (`MyService(...)`) with a coincidental + /// `dependencies=` kwarg must NOT enter the router-dep map. + /// `MyService` doesn't end with `Router` and isn't on the explicit + /// allowlist, so the gate rejects it. + #[test] + fn collect_router_level_dependencies_skips_non_router_constructors() { + let src = "svc = MyService(dependencies=[Depends(get_db)])\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let map = collect_router_level_dependencies(tree.root_node(), bytes); + assert!(!map.contains_key("svc")); + } + + /// Helper: parse a single decorated function and pull out the + /// decorator call so `router_prefix_from_decorator` can be tested + /// in isolation. Mirrors the `find_first_marker_call` helper in + /// the sibling test module. + fn find_first_decorator<'a>(node: tree_sitter::Node<'a>) -> Option<tree_sitter::Node<'a>> { + if node.kind() == "decorator" + && let Some(child) = node.named_child(0) + { + return Some(child); + } + for idx in 0..node.named_child_count() { + if let Some(child) = node.named_child(idx as u32) + && let Some(found) = find_first_decorator(child) + { + return Some(found); + } + } + None + } + + /// `@ti_id_router.patch("/x")` → prefix `"ti_id_router"`. This is + /// the lookup key the per-route extractor uses to pull + /// router-level deps out of the map.
+ #[test] + fn router_prefix_from_decorator_extracts_simple_identifier() { + let src = "@ti_id_router.patch(\"/x\")\ndef f():\n pass\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let decorator = find_first_decorator(tree.root_node()).expect("decorator call node"); + let prefix = router_prefix_from_decorator(decorator, bytes).expect("prefix extracted"); + assert_eq!(prefix, "ti_id_router"); + } + + /// Bare-identifier decorators (`@requires_auth\ndef f(): ...`) and + /// non-attribute callees return None — there's no router prefix + /// to look up. + #[test] + fn router_prefix_from_decorator_returns_none_for_bare_decorator() { + let src = "@requires_auth\ndef f():\n pass\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let decorator = find_first_decorator(tree.root_node()).expect("decorator node"); + // `@requires_auth` produces an `identifier` child, not a + // `call`, so router_prefix should None out at the call gate. + assert!(router_prefix_from_decorator(decorator, bytes).is_none()); } } diff --git a/src/auth_analysis/extract/gin.rs b/src/auth_analysis/extract/gin.rs index d42e4055..18dbdfa9 100644 --- a/src/auth_analysis/extract/gin.rs +++ b/src/auth_analysis/extract/gin.rs @@ -1,8 +1,8 @@ use super::AuthExtractor; use super::common::{ - attach_route_handler, call_site_from_node, collect_top_level_units, http_method_from_name, - is_handler_reference, join_route_paths, member_target, named_children, push_route_registration, - string_literal_value, text, visit_named_nodes, + attach_route_handler, call_site_from_node, http_method_from_name, is_handler_reference, + join_route_paths, member_target, named_children, push_route_registration, string_literal_value, + text, visit_named_nodes, }; use crate::auth_analysis::config::AuthAnalysisRules; use crate::auth_analysis::model::{AuthorizationModel, CallSite, Framework}; @@ -26,24 +26,21 @@ impl AuthExtractor for GinExtractor { bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, 
- ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); let mut groups = HashMap::new(); - collect_top_level_units(root, bytes, rules, &mut model); visit_named_nodes(root, &mut |node| match node.kind() { "short_var_declaration" | "assignment_statement" => { maybe_collect_group_binding(node, bytes, &mut groups) } "call_expression" => { maybe_collect_group_use(node, bytes, &mut groups); - maybe_collect_route(root, node, bytes, path, rules, &groups, &mut model); + maybe_collect_route(root, node, bytes, path, rules, &groups, model); } _ => {} }); - - model } } diff --git a/src/auth_analysis/extract/koa.rs b/src/auth_analysis/extract/koa.rs index f86921cc..4c8cc009 100644 --- a/src/auth_analysis/extract/koa.rs +++ b/src/auth_analysis/extract/koa.rs @@ -1,8 +1,8 @@ use super::AuthExtractor; use super::common::{ - attach_route_handler, call_site_from_node, collect_top_level_units, http_method_from_name, - is_handler_reference, member_target, named_children, push_route_registration, - string_literal_value, visit_named_nodes, + attach_route_handler, call_site_from_node, http_method_from_name, is_handler_reference, + member_target, named_children, push_route_registration, string_literal_value, + visit_named_nodes, }; use crate::auth_analysis::config::AuthAnalysisRules; use crate::auth_analysis::model::{AuthorizationModel, Framework}; @@ -25,18 +25,14 @@ impl AuthExtractor for KoaExtractor { bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); - - collect_top_level_units(root, bytes, rules, &mut model); visit_named_nodes(root, &mut |node| { if node.kind() == "call_expression" { - maybe_collect_route(root, node, bytes, path, rules, &mut model); + maybe_collect_route(root, node, bytes, path, rules, model); } }); - - model } } diff --git 
a/src/auth_analysis/extract/mod.rs b/src/auth_analysis/extract/mod.rs index f037e037..5ea913ad 100644 --- a/src/auth_analysis/extract/mod.rs +++ b/src/auth_analysis/extract/mod.rs @@ -1,6 +1,7 @@ use super::config::AuthAnalysisRules; -use super::model::AuthorizationModel; +use super::model::{AuthorizationModel, CallSite}; use crate::utils::project::{FrameworkContext, rust_file_imports_web_framework}; +use std::collections::HashMap; use std::path::Path; use tree_sitter::Tree; @@ -21,13 +22,26 @@ pub mod spring; pub trait AuthExtractor { fn supports(&self, lang: &str, framework_ctx: Option<&FrameworkContext>) -> bool; + + /// Returns true when this extractor expects the orchestrator to + /// have already populated `model.units` with one + /// `AnalysisUnitKind::Function` entry per top-level function / + /// method via [`common::collect_top_level_units`]. Defaults to + /// `true`; framework extractors that build their own unit set + /// (Spring, Rails) override to `false` so the orchestrator skips + /// the shared collection pass when only those extractors match. + fn requires_top_level_units(&self) -> bool { + true + } + fn extract( &self, tree: &Tree, bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel; + model: &mut AuthorizationModel, + ); } pub fn extract_authorization_model( @@ -37,6 +51,7 @@ pub fn extract_authorization_model( bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, + cross_file_router_deps: Option<&HashMap<String, Vec<(CallSite, bool)>>>, ) -> AuthorizationModel { let extractors: [&dyn AuthExtractor; 13] = [ &express::ExpressExtractor, @@ -57,14 +72,47 @@ pub fn extract_authorization_model( lang: lang.to_string(), ..Default::default() }; + // Pre-populate the cross-file router-dep map BEFORE extractors run.
+ // FlaskExtractor reads `model.cross_file_router_deps` and merges the + // resolved deps into its local router-deps map at extraction time, + // so per-route auth attribution sees both the local-file + // `dependencies=[Security(...)]` declarations and the cross-file + // lift from `.include_router(., ...)` + // edges visible elsewhere in the project. Empty / `None` for every + // non-Python language and for files with no matching child edges. + if let Some(deps) = cross_file_router_deps { + model.cross_file_router_deps = deps.clone(); + } + + // **Hoist `collect_top_level_units` out of the per-extractor loop.** + // For multi-extractor languages (Go: gin+echo, JS/TS: express+koa+ + // fastify, Python: flask+django, Rust: axum+actix_web+rocket, Ruby: + // sinatra) the legacy code re-walked the entire AST and rebuilt the + // `Function`-kind unit set per extractor (then deduped by span). + // `collect_top_level_units` was the dominant cost in + // `extract_authorization_model` (46% of total wall-clock on the + // mattermost/server/channels/app subtree, 2026-05-04 profile). + // + // After the hoist each extractor receives a `&mut model` that + // already carries the shared unit set; framework-specific work + // (route detection, middleware injection, typed-extractor guards) + // augments and promotes those units in place via the existing + // `attach_route_handler` "promote-or-create" path. + // + // Spring + Rails build their own unit set (`maybe_collect_controller` + // / Rails' `collect_nodes`), so they opt out via + // `requires_top_level_units = false`; the shared pass runs only + // when at least one matching extractor needs it. 
+ let any_requires_units = extractors + .iter() + .any(|e| e.supports(lang, framework_ctx) && e.requires_top_level_units()); + if any_requires_units { + common::collect_top_level_units(tree.root_node(), bytes, rules, &mut model); + } for extractor in extractors { if extractor.supports(lang, framework_ctx) { - let mut other = extractor.extract(tree, bytes, path, rules); - // Preserve the canonical `lang` set above; sub-extractors - // build their own default-initialised models with empty lang. - other.lang = model.lang.clone(); - model.extend(other); + extractor.extract(tree, bytes, path, rules, &mut model); } } diff --git a/src/auth_analysis/extract/rails.rs b/src/auth_analysis/extract/rails.rs index 7ced2645..5f0a187b 100644 --- a/src/auth_analysis/extract/rails.rs +++ b/src/auth_analysis/extract/rails.rs @@ -22,17 +22,24 @@ impl AuthExtractor for RailsExtractor { .is_none_or(|ctx| ctx.frameworks.is_empty() || ctx.has(DetectedFramework::Rails)) } + fn requires_top_level_units(&self) -> bool { + // Rails builds its own RouteHandler unit set inside `collect_nodes` + // (controller actions inferred from `routes.rb` resource entries + // and conventional `resources :foo` mappings). It never relies on + // the orchestrator's shared `collect_top_level_units` pass. 
+ false + } + fn extract( &self, tree: &Tree, bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); - collect_nodes(root, &[], bytes, path, rules, &mut model); - model + collect_nodes(root, &[], bytes, path, rules, model); } } diff --git a/src/auth_analysis/extract/rocket.rs b/src/auth_analysis/extract/rocket.rs index 0e12baa7..741a4a35 100644 --- a/src/auth_analysis/extract/rocket.rs +++ b/src/auth_analysis/extract/rocket.rs @@ -4,8 +4,7 @@ use super::axum::{ rust_param_aliases, }; use super::common::{ - attach_route_handler, collect_top_level_units, function_definition_node, function_name, - named_children, text, + attach_route_handler, function_definition_node, function_name, named_children, text, }; use crate::auth_analysis::config::AuthAnalysisRules; use crate::auth_analysis::model::{AuthorizationModel, Framework, HttpMethod, RouteRegistration}; @@ -28,14 +27,10 @@ impl AuthExtractor for RocketExtractor { bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); - - collect_top_level_units(root, bytes, rules, &mut model); - collect_handlers(root, root, bytes, path, rules, &mut model); - - model + collect_handlers(root, root, bytes, path, rules, model); } } diff --git a/src/auth_analysis/extract/sinatra.rs b/src/auth_analysis/extract/sinatra.rs index 2cd82441..1fb6c48e 100644 --- a/src/auth_analysis/extract/sinatra.rs +++ b/src/auth_analysis/extract/sinatra.rs @@ -1,7 +1,7 @@ use super::AuthExtractor; use super::common::{ - auth_check_from_call_site, build_function_unit, call_name, call_site_from_node, - collect_top_level_units, named_children, span, string_literal_value, + auth_check_from_call_site, build_function_unit, call_name, call_site_from_node, named_children, + span, 
string_literal_value, }; use crate::auth_analysis::config::{AuthAnalysisRules, matches_name}; use crate::auth_analysis::model::{ @@ -27,13 +27,11 @@ impl AuthExtractor for SinatraExtractor { bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); - collect_top_level_units(root, bytes, rules, &mut model); let before_filters = collect_before_filters(root, bytes); - collect_routes(root, bytes, path, rules, &before_filters, &mut model); - model + collect_routes(root, bytes, path, rules, &before_filters, model); } } diff --git a/src/auth_analysis/extract/spring.rs b/src/auth_analysis/extract/spring.rs index e9e84c3b..3ea68cb8 100644 --- a/src/auth_analysis/extract/spring.rs +++ b/src/auth_analysis/extract/spring.rs @@ -20,19 +20,27 @@ impl AuthExtractor for SpringExtractor { .is_none_or(|ctx| ctx.frameworks.is_empty() || ctx.has(DetectedFramework::Spring)) } + fn requires_top_level_units(&self) -> bool { + // Spring synthesises its own units inside `maybe_collect_controller` + // (only `@Controller` / `@RestController`-annotated classes + // produce units; non-controller Java files contribute nothing). + // The orchestrator's shared `collect_top_level_units` pass would + // emit a `Function` unit per top-level method on every Java file + // including non-controller helpers, doubling work and broadening + // the analysis surface beyond what Spring needs. 
+ false + } + fn extract( &self, tree: &Tree, bytes: &[u8], path: &Path, rules: &AuthAnalysisRules, - ) -> AuthorizationModel { + model: &mut AuthorizationModel, + ) { let root = tree.root_node(); - let mut model = AuthorizationModel::default(); - - collect_classes(root, bytes, path, rules, &mut model); - - model + collect_classes(root, bytes, path, rules, model); } } diff --git a/src/auth_analysis/mod.rs b/src/auth_analysis/mod.rs index d4595762..ea56aeb0 100644 --- a/src/auth_analysis/mod.rs +++ b/src/auth_analysis/mod.rs @@ -60,6 +60,7 @@ pub mod checks; pub mod config; pub mod extract; pub mod model; +pub mod router_facts; pub mod sql_semantics; use crate::commands::scan::Diag; @@ -102,21 +103,98 @@ pub fn run_auth_analysis( if !rules.enabled { return Vec::new(); } - - let mut model = extract::extract_authorization_model( + // Resolve cross-file router-deps for the active file (Python only) + // before constructing the model, so the FlaskExtractor sees the + // full per-file dep map at extraction time. See `router_facts` + // module + `analyse_file_fused` for the wider pipeline. + let cross_file_router_deps = + resolve_cross_file_router_deps_for_file(lang, file_path, global_summaries); + let model = extract::extract_authorization_model( lang, cfg.framework_ctx.as_ref(), tree, source, file_path, &rules, + cross_file_router_deps.as_ref(), ); + run_auth_analysis_with_model( + model, + tree, + lang, + file_path, + &rules, + var_types, + global_summaries, + scan_root, + ) +} + +/// Look up `GlobalSummaries.router_facts_by_module` and resolve the +/// cross-file router-deps map for the file at `file_path`. Returns +/// `None` for non-Python files, files whose module_id has no matching +/// `.include_router(., ...)` edges anywhere in +/// the project, or callers that don't pass `global_summaries`. 
+pub(crate) fn resolve_cross_file_router_deps_for_file( + lang: &str, + file_path: &Path, + global_summaries: Option<&GlobalSummaries>, +) -> Option>> { + if lang != "python" { + return None; + } + let gs = global_summaries?; + let module_id = router_facts::module_id_for_path(file_path)?; + let resolved = gs.resolve_cross_file_router_deps(&module_id); + if resolved.is_empty() { + None + } else { + Some(resolved) + } +} + +/// Variant of [`run_auth_analysis`] that accepts a pre-built +/// [`model::AuthorizationModel`] instead of building one from the AST. +/// +/// Lets callers that need both diagnostics AND +/// `(FuncKey, AuthCheckSummary)` per-file summaries (the fused pass-2 +/// path in [`crate::ast::analyse_file_fused`]) construct the base +/// authorization model exactly once and route both consumers through +/// it. Pre-fix the fused path called +/// [`extract::extract_authorization_model`] twice per file (once via +/// [`run_auth_analysis`], once via [`extract_auth_summaries_by_key`]), +/// duplicating the AST walks for `collect_top_level_units` + +/// `build_function_unit_with_meta` + `collect_unit_state` + every +/// extractor's framework-detection scan. On the +/// `mattermost/server/channels/app` profile that double-extract +/// accounted for 35.3% of total wall-clock; sharing the base model +/// drops it to ~17.6%. +/// +/// The mutations applied here (`apply_var_types_to_model`, +/// `apply_typed_bounded_params`, `apply_helper_lifting`) only +/// affect diagnostic emission — `extract_auth_summaries_from_model` +/// reads the **base** model so callers must extract summaries before +/// passing the model in. 
+#[allow(clippy::too_many_arguments)] +pub fn run_auth_analysis_with_model( + mut model: model::AuthorizationModel, + tree: &Tree, + lang: &str, + file_path: &Path, + rules: &config::AuthAnalysisRules, + var_types: Option<&VarTypes>, + global_summaries: Option<&GlobalSummaries>, + scan_root: Option<&Path>, +) -> Vec { + if !rules.enabled { + return Vec::new(); + } // Refine `SensitiveOperation::sink_class` using SSA-derived // variable types. Runs only when the caller supplied `var_types` // (skipped for slug-lookup / unit-test call sites). if let Some(types) = var_types { - apply_var_types_to_model(&mut model, &rules, types); + apply_var_types_to_model(&mut model, rules, types); apply_typed_bounded_params(&mut model, types); } @@ -128,11 +206,16 @@ pub fn run_auth_analysis( // (when provided) for cross-file helpers that live in other files. apply_helper_lifting(&mut model, lang, file_path, scan_root, global_summaries); + // Phase 1 caller-scope IPA: propagate route-handler-level auth + // checks DOWN to callee helper units within the same file. See + // [`apply_caller_scope_propagation`] for the propagation rule. + apply_caller_scope_propagation(&mut model); + if model.routes.is_empty() && model.units.is_empty() { return Vec::new(); } - checks::run_checks(&model, &rules) + checks::run_checks(&model, rules) .into_iter() .map(|finding| auth_finding_to_diag(&finding, tree, file_path)) .collect() @@ -167,8 +250,28 @@ pub fn extract_auth_summaries_by_key( source, file_path, &rules, + None, ); - summaries_keyed_by_func(&model, lang, file_path, scan_root) + extract_auth_summaries_from_model(&model, lang, file_path, scan_root) +} + +/// Variant of [`extract_auth_summaries_by_key`] that consumes a +/// pre-built [`model::AuthorizationModel`]. 
+/// +/// Designed for callers that also need to run the diagnostic pipeline +/// (which mutates the model via [`run_auth_analysis_with_model`]): +/// extract summaries first against the base model, then hand the same +/// model to the diag pipeline so the second +/// [`extract::extract_authorization_model`] AST walk per file is +/// avoided. See [`run_auth_analysis_with_model`] for the wider +/// rationale and measured saving. +pub fn extract_auth_summaries_from_model( + model: &model::AuthorizationModel, + lang: &str, + file_path: &Path, + scan_root: Option<&Path>, +) -> Vec<(FuncKey, model::AuthCheckSummary)> { + summaries_keyed_by_func(model, lang, file_path, scan_root) } /// Convert an already-built [`model::AuthorizationModel`] into a @@ -444,6 +547,203 @@ fn apply_helper_lifting( } } +/// Phase 1 caller-scope IPA: propagate route-handler-level auth checks +/// DOWN to callee helper units within the same file. +/// +/// `apply_helper_lifting` walks UPWARD: a helper that internally +/// proves ownership / membership / etc. has its summary lifted onto +/// each call site in the caller. But the inverse direction — +/// route handler that authenticates via route-level decorator/ +/// dependency, then delegates to a private helper that performs the +/// actual sink — is the dominant FP shape on FastAPI / Django / Flask +/// codebases (sentry, saleor, airflow): the helper has no inline +/// auth_checks of its own, so `check_ownership_gaps` flags every +/// `session.add(...)` / `Model.objects.filter(id=...)` it contains. +/// +/// This pass closes that gap inside a single file. For each helper +/// unit, if **every** same-file caller (across the whole call graph) +/// is itself an authorized route handler (route-level non-Login auth +/// check) or has already been authorized via this same propagation +/// in a prior round, lift the caller's route-level checks onto the +/// helper. 
Iterated to a small fixpoint so transitive helper chains +/// `route → mid_helper → leaf_helper` are also covered. +/// +/// Synthetic checks carry `is_route_level=true` so +/// `auth_check_covers_subject` short-circuits coverage for any +/// subject the helper sees, mirroring the in-handler decorator-lift +/// semantics established by [`extract::flask::inject_middleware_auth`]. +/// +/// **Soundness rule**: a helper's `unit_callers` list must be +/// non-empty AND every caller must be authorized. This refuses to +/// authorize: +/// * helpers with no in-file caller (dead code or external +/// entry-point — could be CLI, cron, test harness, …), +/// * helpers called from a mix of authorized routes and unauthorized +/// callers (the unauthorized path is the real FP attack surface), +/// * helpers called only from another un-lifted helper (no +/// evidence the upstream chain authenticates). +/// +/// Cross-file caller-scope IPA — where the route handler lives in +/// file A and the helper in file B — is not yet implemented. +/// Requires plumbing per-file caller auth checks through +/// `GlobalSummaries`, not just the existing per-callee +/// `AuthCheckSummary`. See `deep_engine_fixes.md` for the deferred +/// follow-up. +fn apply_caller_scope_propagation(model: &mut model::AuthorizationModel) { + use model::{AnalysisUnitKind, AuthCheck, AuthCheckKind}; + use std::collections::{HashMap, HashSet}; + + // Build leaf-name → unit_idx map. Only non-route-handler units are + // lift TARGETS; route handlers don't need downward lift since they + // already carry their own route-level auth. 
+ let mut leaf_to_unit: HashMap<String, usize> = HashMap::new(); + for (idx, unit) in model.units.iter().enumerate() { + if unit.kind == AnalysisUnitKind::RouteHandler { + continue; + } + let Some(name) = unit.name.as_deref() else { + continue; + }; + let leaf = name.rsplit('.').next().unwrap_or(name); + if leaf.is_empty() { + continue; + } + leaf_to_unit.entry(leaf.to_string()).or_insert(idx); + } + + // For each callee unit, collect its same-file caller indices. + // Iterates every unit's `call_sites` once; a callee with no + // matching unit (calls into stdlib, framework, third-party) gets + // an empty `unit_callers[i]` and is excluded from propagation + // below. + let mut unit_callers: Vec<Vec<usize>> = vec![Vec::new(); model.units.len()]; + for (caller_idx, unit) in model.units.iter().enumerate() { + let mut seen_callees: HashSet<usize> = HashSet::new(); + for call in &unit.call_sites { + let leaf = call.name.rsplit('.').next().unwrap_or(&call.name); + if let Some(&callee_idx) = leaf_to_unit.get(leaf) + && callee_idx != caller_idx + && seen_callees.insert(callee_idx) + { + unit_callers[callee_idx].push(caller_idx); + } + } + } + + // Seed `authorized` only when a unit carries at least one + // route-level Other / Membership / Ownership / AdminGuard check. + // `LoginGuard` alone proves only identity, not authority, and + // `TokenExpiry` / `TokenRecipient` alone don't justify + // foreign-id mutations — `has_prior_subject_auth` already filters + // those kinds out. Seeding on those would silently authorize + // helpers reachable from a login-only route.
+ let is_seed_kind = |k: AuthCheckKind| { + !matches!( + k, + AuthCheckKind::LoginGuard | AuthCheckKind::TokenExpiry | AuthCheckKind::TokenRecipient + ) + }; + let mut authorized: HashSet<usize> = (0..model.units.len()) + .filter(|i| { + model.units[*i] + .auth_checks + .iter() + .any(|c| c.is_route_level && is_seed_kind(c.kind)) + }) + .collect(); + // Lift ALL route-level non-Login auth checks once a unit is + // authorized, including `TokenExpiry` / `TokenRecipient`. Those + // kinds are required by `check_token_override_without_validation` + // (which gates separately from `has_prior_subject_auth`); without + // them the callee fires `token_override_without_validation` even + // after `missing_ownership_check` is suppressed. `LoginGuard` is + // still excluded — it's too weak to count as a coverage proof for + // either downstream check. + let unit_route_level_checks: Vec<Vec<AuthCheck>> = model + .units + .iter() + .map(|unit| { + unit.auth_checks + .iter() + .filter(|c| c.is_route_level && c.kind != AuthCheckKind::LoginGuard) + .cloned() + .collect::<Vec<_>>() + }) + .collect(); + + // Per-callee aggregated lift checks, populated as we authorize. + // Stored separately so we can apply mutations after the fixpoint + // loop without invalidating immutable borrows above. + let mut helper_lift: HashMap<usize, Vec<AuthCheck>> = HashMap::new(); + + const MAX_ROUNDS: usize = 4; + for _ in 0..MAX_ROUNDS { + let mut grew = false; + for (callee_idx, callers) in unit_callers.iter().enumerate().take(model.units.len()) { + if authorized.contains(&callee_idx) { + continue; + } + if callers.is_empty() { + continue; + } + if !callers.iter().all(|c| authorized.contains(c)) { + continue; + } + // Aggregate the route-level checks from every authorized + // caller.
Non-route-handler callers contribute nothing + // (their `unit_route_level_checks[c]` is empty by + // construction) — only route handlers up the chain seed + // real route-level checks, and downstream helpers + // propagate those forward via the `is_route_level=true` + // flag on the synthetic checks. + let mut chosen: Vec<AuthCheck> = Vec::new(); + for &caller_idx in callers { + for check in &unit_route_level_checks[caller_idx] { + chosen.push(check.clone()); + } + if let Some(prior) = helper_lift.get(&caller_idx) { + for check in prior { + chosen.push(check.clone()); + } + } + } + if chosen.is_empty() { + continue; + } + authorized.insert(callee_idx); + helper_lift.insert(callee_idx, chosen); + grew = true; + } + if !grew { + break; + } + } + + for (callee_idx, checks) in helper_lift { + let unit = &mut model.units[callee_idx]; + let mut existing_keys: HashSet<((usize, usize), AuthCheckKind, String)> = unit + .auth_checks + .iter() + .map(|c| (c.span, c.kind, c.callee.clone())) + .collect(); + for check in checks { + let mut synth = check; + // Re-anchor at the callee's start line so the + // `check.line <= op.line` gate in `has_prior_subject_auth` + // covers every operation inside the callee. Without this + // re-anchor, the synthetic check carries the caller's line + // (which is greater than the callee's body lines) and + // doesn't gate any of the callee's sinks.
+ synth.line = unit.line; + synth.callee = format!("(caller-scope lift {})", synth.callee); + let key = (synth.span, synth.kind, synth.callee.clone()); + if existing_keys.insert(key) { + unit.auth_checks.push(synth); + } + } + } +} + /// Build a `name → AuthCheckSummary` map by walking each unit's auth /// checks and recording, for every check subject whose value-ref name /// matches a positional parameter name of the unit, that param index @@ -742,11 +1042,14 @@ fn auth_finding_to_diag(finding: &checks::AuthFinding, tree: &Tree, file_path: & #[cfg(test)] mod tests { - use super::{VarTypes, apply_var_types_to_model, receiver_root, sink_class_for_type}; + use super::{ + VarTypes, apply_caller_scope_propagation, apply_var_types_to_model, receiver_root, + sink_class_for_type, + }; use crate::auth_analysis::config::build_auth_rules; use crate::auth_analysis::model::{ - AnalysisUnit, AnalysisUnitKind, AuthorizationModel, OperationKind, SensitiveOperation, - SinkClass, + AnalysisUnit, AnalysisUnitKind, AuthCheck, AuthCheckKind, AuthorizationModel, CallSite, + OperationKind, SensitiveOperation, SinkClass, }; use crate::ssa::type_facts::TypeKind; use crate::utils::config::Config; @@ -868,6 +1171,239 @@ mod tests { ); } + /// Build a synthetic [`AnalysisUnit`] with the given kind, name, + /// and call_site leaf names. No operations or auth_checks; tests + /// add those explicitly. 
+ fn unit_with_calls(kind: AnalysisUnitKind, name: &str, callees: &[&str]) -> AnalysisUnit { + AnalysisUnit { + kind, + name: Some(name.into()), + span: (0, 0), + params: Vec::new(), + context_inputs: Vec::new(), + call_sites: callees + .iter() + .map(|c| CallSite { + name: (*c).to_string(), + args: Vec::new(), + span: (0, 0), + args_value_refs: Vec::new(), + }) + .collect(), + auth_checks: Vec::new(), + operations: Vec::new(), + value_refs: Vec::new(), + condition_texts: Vec::new(), + line: 1, + row_field_vars: HashMap::new(), + var_alias_chain: HashMap::new(), + row_population_data: HashMap::new(), + self_actor_vars: HashSet::new(), + self_actor_id_vars: HashSet::new(), + authorized_sql_vars: HashSet::new(), + const_bound_vars: HashSet::new(), + typed_bounded_vars: HashSet::new(), + typed_bounded_dto_fields: HashMap::new(), + self_scoped_session_bases: HashSet::new(), + } + } + + fn route_level_check(kind: AuthCheckKind) -> AuthCheck { + AuthCheck { + kind, + callee: "Security(require_auth)".into(), + subjects: Vec::new(), + span: (10, 11), + line: 1, + args: Vec::new(), + condition_text: None, + is_route_level: true, + } + } + + #[test] + fn caller_scope_propagation_lifts_route_level_other_to_callee_helper() { + // Mirrors the airflow shape: + // route handler `ti_update_state` carries route-level Other + // (from scoped Security dep), calls `_create_state_update` + // (helper); helper's body sinks should inherit the lift. 
+ let mut model = AuthorizationModel::default(); + let mut handler = unit_with_calls( + AnalysisUnitKind::RouteHandler, + "ti_update_state", + &["_create_state_update"], + ); + handler + .auth_checks + .push(route_level_check(AuthCheckKind::Other)); + handler + .auth_checks + .push(route_level_check(AuthCheckKind::TokenExpiry)); + handler + .auth_checks + .push(route_level_check(AuthCheckKind::TokenRecipient)); + let helper = unit_with_calls(AnalysisUnitKind::Function, "_create_state_update", &[]); + model.units.push(handler); + model.units.push(helper); + + apply_caller_scope_propagation(&mut model); + + // Helper now has 3 lifted auth checks (Other + TokenExpiry + + // TokenRecipient), each with `is_route_level=true` and line + // re-anchored to helper's start line. + let helper = &model.units[1]; + let kinds: HashSet = helper.auth_checks.iter().map(|c| c.kind).collect(); + assert!( + kinds.contains(&AuthCheckKind::Other), + "helper should inherit Other check from caller" + ); + assert!( + kinds.contains(&AuthCheckKind::TokenExpiry), + "helper should inherit TokenExpiry check (needed for token_override suppression)" + ); + assert!( + kinds.contains(&AuthCheckKind::TokenRecipient), + "helper should inherit TokenRecipient check" + ); + assert!( + helper.auth_checks.iter().all(|c| c.is_route_level), + "lifted checks must keep is_route_level=true" + ); + assert!( + helper.auth_checks.iter().all(|c| c.line == helper.line), + "lifted check.line must match callee unit start so check.line <= op.line holds" + ); + } + + #[test] + fn caller_scope_propagation_refuses_when_helper_has_unauthorized_caller() { + // Helper is called from BOTH an authorized route handler AND + // a bare (no-auth) route handler. Soundness rule: if any + // caller is unauthorized, do NOT propagate — the unauthorized + // path is the real attack surface. 
+ let mut model = AuthorizationModel::default(); + let mut authed = unit_with_calls( + AnalysisUnitKind::RouteHandler, + "ti_update_state", + &["_create_state_update"], + ); + authed + .auth_checks + .push(route_level_check(AuthCheckKind::Other)); + let bare = unit_with_calls( + AnalysisUnitKind::RouteHandler, + "ti_overwrite_state", + &["_create_state_update"], + ); + let helper = unit_with_calls(AnalysisUnitKind::Function, "_create_state_update", &[]); + model.units.push(authed); + model.units.push(bare); + model.units.push(helper); + + apply_caller_scope_propagation(&mut model); + + let helper = &model.units[2]; + assert!( + helper.auth_checks.is_empty(), + "helper must not be authorized when one caller has no route-level auth" + ); + } + + #[test] + fn caller_scope_propagation_refuses_when_helper_has_no_callers() { + // Dead helper — no in-file caller. Could be invoked via CLI + // / test / cron / external import. Stay conservative. + let mut model = AuthorizationModel::default(); + let helper = unit_with_calls(AnalysisUnitKind::Function, "_orphan_helper", &[]); + model.units.push(helper); + + apply_caller_scope_propagation(&mut model); + + let helper = &model.units[0]; + assert!( + helper.auth_checks.is_empty(), + "helper with no in-file callers must not be authorized" + ); + } + + #[test] + fn caller_scope_propagation_transitive_chain_route_to_mid_to_leaf() { + // route → mid_helper → leaf_helper. Both helpers should be + // authorized in two BFS rounds: round 1 lifts onto mid, round + // 2 sees mid as authorized and lifts onto leaf. 
+ let mut model = AuthorizationModel::default(); + let mut handler = unit_with_calls( + AnalysisUnitKind::RouteHandler, + "ti_update_state", + &["_mid_helper"], + ); + handler + .auth_checks + .push(route_level_check(AuthCheckKind::Other)); + let mid = unit_with_calls(AnalysisUnitKind::Function, "_mid_helper", &["_leaf_helper"]); + let leaf = unit_with_calls(AnalysisUnitKind::Function, "_leaf_helper", &[]); + model.units.push(handler); + model.units.push(mid); + model.units.push(leaf); + + apply_caller_scope_propagation(&mut model); + + let mid_kinds: HashSet = + model.units[1].auth_checks.iter().map(|c| c.kind).collect(); + let leaf_kinds: HashSet = + model.units[2].auth_checks.iter().map(|c| c.kind).collect(); + assert!( + mid_kinds.contains(&AuthCheckKind::Other), + "mid helper should be authorized in round 1" + ); + assert!( + leaf_kinds.contains(&AuthCheckKind::Other), + "leaf helper should be authorized in round 2 via the lifted mid" + ); + } + + #[test] + fn caller_scope_propagation_does_not_seed_on_loginguard_only_route() { + // Route handler with ONLY a LoginGuard route-level check. + // LoginGuard alone proves identity, not authority — must not + // seed the helper. + let mut model = AuthorizationModel::default(); + let mut handler = + unit_with_calls(AnalysisUnitKind::RouteHandler, "list_things", &["_helper"]); + handler + .auth_checks + .push(route_level_check(AuthCheckKind::LoginGuard)); + let helper = unit_with_calls(AnalysisUnitKind::Function, "_helper", &[]); + model.units.push(handler); + model.units.push(helper); + + apply_caller_scope_propagation(&mut model); + + let helper = &model.units[1]; + assert!( + helper.auth_checks.is_empty(), + "LoginGuard alone must not seed the helper" + ); + } + + #[test] + fn caller_scope_propagation_skips_self_recursive_call() { + // Recursive helper that calls itself. The self-edge is + // skipped in `unit_callers` construction so the helper has + // zero in-file callers and stays unauthorized. 
+ let mut model = AuthorizationModel::default(); + let helper = unit_with_calls(AnalysisUnitKind::Function, "recurse", &["recurse"]); + model.units.push(helper); + + apply_caller_scope_propagation(&mut model); + + let helper = &model.units[0]; + assert!( + helper.auth_checks.is_empty(), + "self-recursive helper with no other callers must not be authorized" + ); + } + #[test] fn apply_var_types_leaves_classification_untouched_when_receiver_unknown() { let cfg = Config::default(); diff --git a/src/auth_analysis/model.rs b/src/auth_analysis/model.rs index 366226f6..d7fccbb6 100644 --- a/src/auth_analysis/model.rs +++ b/src/auth_analysis/model.rs @@ -367,6 +367,17 @@ pub struct AuthorizationModel { /// of the framework-request-name allow-list. Empty string when no /// language was supplied (single-file unit-test paths). pub lang: String, + /// Cross-file router-dependency lift, keyed by **local** router + /// variable name. Pre-populated by the orchestrator before + /// extractors run, sourced from `GlobalSummaries.router_facts_by_module` + /// for every project file whose `.include_router(.)` + /// edge targets a router in the current file. FlaskExtractor merges + /// these in alongside locally-declared `dependencies=[...]` so routes + /// attached to a bare child router still inherit the parent's + /// `Security(...)` / `Depends(...)` deps. Empty when no cross-file + /// resolution applies (most files) or when global summaries are not + /// available (unit-test / single-file scan paths). + pub cross_file_router_deps: HashMap>, } impl AuthorizationModel { diff --git a/src/auth_analysis/router_facts.rs b/src/auth_analysis/router_facts.rs new file mode 100644 index 00000000..48cb7946 --- /dev/null +++ b/src/auth_analysis/router_facts.rs @@ -0,0 +1,516 @@ +//! Cross-file FastAPI router-dependency tracking. +//! +//! FastAPI propagates `dependencies=[Security(...), Depends(...)]` declared +//! at the router level onto every route attached to that router, including +//! 
routes attached via cross-file `.include_router(.router)` +//! lifts. The per-file router-dep collector in +//! `crate::auth_analysis::extract::flask::collect_router_level_dependencies` +//! sees only the file under analysis, so a bare child router whose auth is +//! declared on a parent router in `__init__.py` (canonical airflow shape) has +//! no visible deps. This module captures the cross-file edges + parent +//! declarations during pass 1 and resolves them into a per-child effective +//! dep map for pass 2's auth analysis. +//! +//! Storage shape: per-Python-file [`PerFileRouterFacts`] with +//! `local_router_deps` (the ` = X(deps=[…])` declarations +//! visible in the file) and `include_router_edges` (the +//! `.include_router(., …)` calls). +//! Persisted into `crate::summary::GlobalSummaries::router_facts_by_module` +//! during pass 1 and resolved into the active file's +//! [`crate::auth_analysis::model::AuthorizationModel::cross_file_router_deps`] +//! at pass 2 entry. +//! +//! Module identity: file basename without `.py`. This is approximate (two +//! files named `task_instances.py` in different packages would collide) but +//! covers airflow-style codebases where include_router targets reference the +//! child's module name directly (`task_instances.router`). Transitive lifts +//! (`grandparent.include_router(parent); parent.include_router(child)`) are +//! resolved by walking the index iteratively at lookup time. + +use crate::auth_analysis::extract::common::{ + call_site_from_node, named_children, string_literal_value, text, +}; +use crate::auth_analysis::model::CallSite; +use std::collections::HashMap; +use std::path::Path; +use tree_sitter::{Node, Tree}; + +/// Per-file extracted router declarations + include_router edges. +/// Persisted into `GlobalSummaries.router_facts_by_module` keyed by the +/// file's [`module_id_for_path`]. Single-purpose: drives the cross-file +/// router-dep resolution at pass 2 entry. 
+#[derive(Debug, Clone, Default)] +pub struct PerFileRouterFacts { + /// Local router var → declared inline `dependencies=[...]` deps. + /// Mirrors `flask::collect_router_level_dependencies` output. + pub local_router_deps: HashMap<String, Vec<(CallSite, bool)>>, + /// `.include_router(., ...)` edges + /// observed in this file. Each edge specifies a parent router var + /// (local to this file) and a child router identified by its + /// module_id + var name. Cross-file lookups walk these. + pub include_router_edges: Vec<RouterIncludeEdge>, +} + +/// A single `.include_router(., ...)` +/// edge. `parent_var` is the local variable that owns the deps to lift; +/// `child_module_id` + `child_var` together name the child router whose +/// routes inherit the parent's deps. +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct RouterIncludeEdge { + pub parent_var: String, + pub child_module_id: String, + pub child_var: String, +} + +/// Translate a file path into a stable cross-file module identifier. +/// +/// Currently the file's basename without the `.py` extension — sufficient +/// for the airflow shape (`from . import task_instances; … +/// authenticated_router.include_router(task_instances.router)`) where the +/// include_router target's module reference is the child file's own +/// basename. Returns `None` for files whose stem is `__init__` +/// (parent files don't need to be looked up; they emit edges only) or +/// for paths with no usable stem. +pub fn module_id_for_path(path: &Path) -> Option<String> { + let stem = path.file_stem()?.to_str()?; + if stem.is_empty() || stem == "__init__" { + return None; + } + Some(stem.to_string()) +} + +/// Stable storage key for the per-project router-facts index. +/// +/// Uses the file's **full filesystem path** (lossy-converted to UTF-8) +/// because the only goal of the storage key is uniqueness across files +/// in a single scan.
Collisions on shorter forms (file basename or +/// `::__init__`) are common in real codebases — airflow +/// alone has 17 `routes/__init__.py` files spread across providers and +/// test trees, and any keying scheme that drops the path prefix would +/// have one such file silently overwrite another's `include_router` +/// edges, breaking the cross-file lift on whichever parent lost the +/// race. +/// +/// The lookup side ([`crate::summary::GlobalSummaries::resolve_cross_file_router_deps`]) +/// iterates every stored entry and matches child references by the +/// **last segment** ([`module_id_for_path`]) — so duplicate-basename +/// children still get every parent's deps accumulated, which is the +/// FastAPI-runtime-correct behavior. Path-based storage keys plus +/// basename-based lookup keys is the right pairing. +pub fn module_id_for_storage(path: &Path) -> Option<String> { + let s = path.to_string_lossy(); + if s.is_empty() { + return None; + } + Some(s.into_owned()) +} + +/// Extract router-level deps + include_router edges from a Python AST. +/// Does not inspect the language itself; pass 1 callers must gate on the +/// file's language slug before invoking. Empty facts (no routers and no +/// edges) are returned as `PerFileRouterFacts::default()` so callers can +/// record an empty index entry without re-extracting. +pub fn extract_router_facts_for_python(tree: &Tree, bytes: &[u8]) -> PerFileRouterFacts { + let mut facts = PerFileRouterFacts::default(); + let root = tree.root_node(); + collect_local_router_deps(root, bytes, &mut facts.local_router_deps); + collect_include_router_edges(root, bytes, &mut facts.include_router_edges); + facts +} + +/// Walk the module root for top-level ` = (..., dependencies=[…])` +/// assignments, mirroring +/// [`crate::auth_analysis::extract::flask::collect_router_level_dependencies`].
+/// Reimplemented here to avoid an inter-module visibility tangle and +/// to keep this module self-contained — the router extractor is the +/// single source of truth at FlaskExtractor::extract time; this module +/// is a parallel collection path that runs in pass 1. +fn collect_local_router_deps( + root: Node<'_>, + bytes: &[u8], + out: &mut HashMap<String, Vec<(CallSite, bool)>>, +) { + for child in named_children(root) { + let assign = match child.kind() { + "expression_statement" => named_children(child).into_iter().next(), + "assignment" => Some(child), + _ => None, + }; + let Some(assign) = assign else { continue }; + if assign.kind() != "assignment" { + continue; + } + let Some(left) = assign.child_by_field_name("left") else { + continue; + }; + if left.kind() != "identifier" { + continue; + } + let Some(right) = assign.child_by_field_name("right") else { + continue; + }; + if right.kind() != "call" { + continue; + } + let Some(function) = right.child_by_field_name("function") else { + continue; + }; + let function_text = text(function, bytes); + if !is_router_like_constructor(&function_text) { + continue; + } + let Some(arguments) = right.child_by_field_name("arguments") else { + continue; + }; + let Some(deps_value) = keyword_argument_value(arguments, bytes, "dependencies") else { + continue; + }; + let mut deps = Vec::new(); + for element in named_children(deps_value) { + if let Some(unwrapped) = unwrap_depends_call(element, bytes) { + deps.push(unwrapped); + } + } + if deps.is_empty() { + continue; + } + let var_name = text(left, bytes).trim().to_string(); + if var_name.is_empty() { + continue; + } + out.entry(var_name).or_insert(deps); + } +} + +/// Walk every call expression in the file looking for +/// `.include_router(., ...)` shapes. +/// Records `(parent_var, child_module_id, child_var)` for each.
Skips +/// edges where the child reference is a bare identifier (no module +/// segment) — those would require Python import resolution to attach +/// to a specific file, beyond this single-hop basename matching. +fn collect_include_router_edges(root: Node<'_>, bytes: &[u8], out: &mut Vec<RouterIncludeEdge>) { + walk_for_include_router(root, bytes, out); +} + +fn walk_for_include_router(node: Node<'_>, bytes: &[u8], out: &mut Vec<RouterIncludeEdge>) { + if node.kind() == "call" + && let Some(edge) = parse_include_router_call(node, bytes) + { + out.push(edge); + } + for child in named_children(node) { + walk_for_include_router(child, bytes, out); + } +} + +fn parse_include_router_call(node: Node<'_>, bytes: &[u8]) -> Option<RouterIncludeEdge> { + let function = node.child_by_field_name("function")?; + if function.kind() != "attribute" { + return None; + } + let attr = function.child_by_field_name("attribute")?; + if text(attr, bytes) != "include_router" { + return None; + } + let object = function.child_by_field_name("object")?; + let parent_var = match object.kind() { + "identifier" => text(object, bytes).trim().to_string(), + _ => return None, + }; + if parent_var.is_empty() { + return None; + } + let arguments = node.child_by_field_name("arguments")?; + // First positional arg (skip keyword_argument children). + let first = named_children(arguments) + .into_iter() + .find(|child| child.kind() != "keyword_argument")?; + if first.kind() != "attribute" { + return None; + } + let child_attr = first.child_by_field_name("attribute")?; + let child_var = text(child_attr, bytes).trim().to_string(); + if child_var.is_empty() { + return None; + } + let child_object = first.child_by_field_name("object")?; + // Use the **last segment** of a possibly-dotted module reference as + // the cross-file module id. `task_instances.router` → + // module_id="task_instances"; `pkg.task_instances.router` → + // module_id="task_instances" (last attribute segment).
+ let child_module_id = match child_object.kind() { + "identifier" => text(child_object, bytes).trim().to_string(), + "attribute" => { + let inner_attr = child_object.child_by_field_name("attribute")?; + text(inner_attr, bytes).trim().to_string() + } + _ => return None, + }; + if child_module_id.is_empty() { + return None; + } + Some(RouterIncludeEdge { + parent_var, + child_module_id, + child_var, + }) +} + +fn keyword_argument_value<'tree>( + arguments: Node<'tree>, + bytes: &[u8], + name: &str, +) -> Option> { + for arg in named_children(arguments) { + if arg.kind() != "keyword_argument" { + continue; + } + let key = arg.child_by_field_name("name")?; + if text(key, bytes) == name { + return arg.child_by_field_name("value"); + } + } + None +} + +/// Local copy of the router-constructor recogniser (parallel to +/// [`crate::auth_analysis::extract::flask::is_router_like_constructor`] +/// to avoid the visibility tangle). +fn is_router_like_constructor(callee: &str) -> bool { + let trimmed = callee.trim(); + let tail = trimmed.rsplit('.').next().unwrap_or(trimmed); + if tail == "APIRouter" || tail == "FastAPI" || tail == "VersionedAPIRouter" { + return true; + } + if tail.len() > "Router".len() + && tail.ends_with("Router") + && tail.chars().next().is_some_and(|c| c.is_ascii_uppercase()) + { + return true; + } + false +} + +/// Cross-file dep-marker unwrapper. Differs from the in-file +/// [`crate::auth_analysis::extract::flask::unwrap_depends_call`] in +/// the *scoped-security* gating policy: +/// +/// * **In-file** (per-route or per-router declarations visible to +/// the active file's FlaskExtractor): only `Security(callable, +/// scopes=[non-empty])` flips `scoped_security = true`. A bare +/// `Security(callable)` stays as a LoginGuard — conservative because +/// per-route bare Security is often used for login-only deps. 
+/// +/// * **Cross-file via `include_router`** (this function, persisted +/// into the project-wide router-facts index for the cross-file lift): +/// ANY `Security(...)` marker at the parent-router level flips +/// `scoped_security = true`, regardless of explicit `scopes=[...]`. +/// Rationale: the FastAPI architectural pattern +/// `parent_router = APIRouter(dependencies=[Security(callable)])` +/// followed by `parent_router.include_router(child_router, ...)` is +/// structurally a declaration that **every route under the child +/// router is auth-protected**. Treating it as authorization (Other +/// AuthCheckKind, via the existing `inject_middleware_auth` scoped +/// promotion) is semantically correct — the developer's `Security` +/// marker placement IS the authorization signal. Bare `Depends(...)` +/// at the parent-router level is NOT promoted (it's a generic dep, +/// often a login fetcher). +fn unwrap_depends_call(node: Node<'_>, bytes: &[u8]) -> Option<(CallSite, bool)> { + if node.kind() != "call" { + return None; + } + let function = node.child_by_field_name("function")?; + let function_text = text(function, bytes); + if !is_dep_marker_callee(&function_text) { + return None; + } + let is_security = is_security_marker(&function_text); + let arguments = node.child_by_field_name("arguments")?; + let children = named_children(arguments); + let first = children + .iter() + .copied() + .find(|child| child.kind() != "keyword_argument")?; + // Cross-file scoped policy: any Security marker at parent-router + // level → scoped=true. See doc comment above for rationale. 
+ let scoped_security = is_security; + let _ = string_literal_value; + let _ = keyword_argument_value; + match first.kind() { + "call" => Some((call_site_from_node(first, bytes), scoped_security)), + "identifier" | "attribute" | "scoped_identifier" => { + Some((call_site_from_node(first, bytes), scoped_security)) + } + _ => None, + } +} + +fn is_dep_marker_callee(callee: &str) -> bool { + let trimmed = callee.trim(); + matches!( + trimmed, + "Depends" + | "fastapi.Depends" + | "fastapi.params.Depends" + | "Security" + | "fastapi.Security" + | "fastapi.params.Security" + ) +} + +fn is_security_marker(callee: &str) -> bool { + let trimmed = callee.trim(); + matches!( + trimmed, + "Security" | "fastapi.Security" | "fastapi.params.Security" + ) +} + +#[cfg(test)] +mod tests { + use super::*; + use tree_sitter::Parser; + + fn parse_python(source: &str) -> Tree { + let mut parser = Parser::new(); + parser + .set_language(&tree_sitter::Language::from(tree_sitter_python::LANGUAGE)) + .expect("python language"); + parser.parse(source, None).expect("parse") + } + + #[test] + fn module_id_for_path_strips_py_extension() { + assert_eq!( + module_id_for_path(Path::new("/x/y/task_instances.py")), + Some("task_instances".into()) + ); + // `__init__` returns None — parent files are storage-only, not + // lookup keys. + assert_eq!(module_id_for_path(Path::new("/x/y/__init__.py")), None); + } + + #[test] + fn module_id_for_storage_uses_full_path_to_avoid_basename_collisions() { + // Different `routes/__init__.py` files in different packages + // must produce DIFFERENT keys — basename / parent-dir keying + // would collide on real codebases (airflow alone has 17 + // `routes/__init__.py` files across its provider tree). 
+ let a = module_id_for_storage(Path::new( + "/x/airflow-core/src/airflow/api_fastapi/execution_api/routes/__init__.py", + )) + .unwrap(); + let b = module_id_for_storage(Path::new( + "/x/airflow-core/src/airflow/api_fastapi/core_api/routes/__init__.py", + )) + .unwrap(); + assert_ne!(a, b); + } + + /// Canonical airflow shape — `routes/__init__.py` declares + /// `authenticated_router = VersionedAPIRouter(dependencies=[Security(require_auth)])` + /// and lifts every per-file child router via `include_router(...)`. + /// Pass 1 must capture both the parent's local deps and the edges + /// targeting `task_instances.router`. Cross-file Security wrappers + /// (regardless of explicit `scopes=[...]`) are flagged scoped — the + /// architectural intent of + /// `parent_router = X(dependencies=[Security(callable)])` followed by + /// `parent_router.include_router(child_router)` is auth scoping over + /// every child route. See the `unwrap_depends_call` doc comment for + /// the policy rationale. + #[test] + fn extract_router_facts_captures_parent_and_edges() { + let src = "from cadwyn import VersionedAPIRouter\n\ + from fastapi import APIRouter, Security\n\ + from . 
import task_instances, dag_runs\n\ + from .security import require_auth\n\ + \n\ + execution_api_router = APIRouter()\n\ + authenticated_router = VersionedAPIRouter(dependencies=[Security(require_auth)])\n\ + \n\ + authenticated_router.include_router(task_instances.router, prefix=\"/task-instances\")\n\ + authenticated_router.include_router(dag_runs.router, prefix=\"/dag-runs\")\n\ + execution_api_router.include_router(authenticated_router)\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let facts = extract_router_facts_for_python(&tree, bytes); + + let parent_deps = facts + .local_router_deps + .get("authenticated_router") + .expect("authenticated_router deps captured"); + assert_eq!(parent_deps.len(), 1); + let (site, scoped) = &parent_deps[0]; + assert_eq!(site.name, "require_auth"); + assert!( + *scoped, + "cross-file: any Security marker is scoped-equivalent" + ); + + // execution_api_router has no deps → no entry. + assert!(!facts.local_router_deps.contains_key("execution_api_router")); + + // Two child include_router edges + one nested + // execution_api_router.include_router(authenticated_router) edge. + assert!(facts.include_router_edges.iter().any(|e| { + e.parent_var == "authenticated_router" + && e.child_module_id == "task_instances" + && e.child_var == "router" + })); + assert!(facts.include_router_edges.iter().any(|e| { + e.parent_var == "authenticated_router" + && e.child_module_id == "dag_runs" + && e.child_var == "router" + })); + } + + /// `.include_router()` — child reference is a bare + /// identifier, no module segment. Cannot resolve to a specific + /// file, so no edge is emitted. This includes the canonical + /// `execution_api_router.include_router(authenticated_router)` chain + /// where the child is a sibling router declared in the same file — + /// transitive in-file lifts are handled by the local-deps map, not + /// the cross-file edge list. 
+ #[test] + fn extract_router_facts_skips_bare_identifier_child_refs() { + let src = "outer = APIRouter()\nouter.include_router(authenticated_router)\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let facts = extract_router_facts_for_python(&tree, bytes); + assert!(facts.include_router_edges.is_empty()); + } + + /// Scoped Security at the parent level (real-world airflow + /// `ti_id_router` flavor). The `scoped` flag must round-trip. + #[test] + fn extract_router_facts_picks_up_scoped_security() { + let src = "ti_id_router = VersionedAPIRouter(\n route_class=ExecutionAPIRoute,\n dependencies=[\n Security(require_auth, scopes=[\"ti:self\"]),\n ],\n)\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let facts = extract_router_facts_for_python(&tree, bytes); + let deps = facts + .local_router_deps + .get("ti_id_router") + .expect("ti_id_router deps captured"); + let (_site, scoped) = &deps[0]; + assert!(*scoped, "scopes=[\"ti:self\"] must mark scoped"); + } + + /// Cross-file `Depends(callable)` at parent-router level is NOT + /// scoped — the policy promotes only Security markers (which + /// signal authorization intent), not generic Depends (which are + /// often login fetchers). Bare `Depends(get_current_user)` lifted + /// onto a child router via `include_router` stays as a LoginGuard + /// on the child's per-route auth checks. 
+ #[test] + fn extract_router_facts_does_not_promote_depends() { + let src = "from fastapi import APIRouter, Depends\n\ + v1 = APIRouter(dependencies=[Depends(get_current_user)])\n"; + let tree = parse_python(src); + let bytes = src.as_bytes(); + let facts = extract_router_facts_for_python(&tree, bytes); + let deps = facts.local_router_deps.get("v1").expect("v1 deps captured"); + let (_site, scoped) = &deps[0]; + assert!(!*scoped, "Depends never scoped-security at cross-file lift"); + } +} diff --git a/src/cfg/cfg_tests.rs b/src/cfg/cfg_tests.rs index f67347f8..92819c84 100644 --- a/src/cfg/cfg_tests.rs +++ b/src/cfg/cfg_tests.rs @@ -3116,3 +3116,97 @@ fn chained_method_call_rebinds_to_inner_gated_sink() { call inner-gate rebinding fired" ); } + +/// Ternary-RHS branches are lowered into a diamond CFG by +/// `build_ternary_diamond` so the condition is control-flow and the +/// branches are data-flow that joins at a phi. But push_node only does +/// suffix/prefix matching on the branch text, so a source-shaped member +/// expression like `req.query.lng` does not classify (the rule matcher +/// is `req.query`, which neither suffix-matches nor prefix-matches +/// `req.query.lng`). `lower_ternary_branch` runs the segment-strip- +/// and-retry classifier on the branch AST to recover the source label, +/// mirroring what `pre_emit_arg_source_nodes` does for call arguments. +/// +/// Without this, `let arr = cond ? req.query.lng : "";` lowers each +/// branch to a labelless Assign-with-empty-uses, the join phi sees no +/// taint, and downstream sinks miss the flow. Motivated by the +/// i18next-http-middleware advisory GHSA-jfgf-83c5-2c4m / CVE-2026-42353. +#[test] +fn js_ternary_branch_member_expression_classified_as_source() { + let src = b"function h(req) { const arr = req.query.lng ? 
req.query.lng : ''; }"; + let ts_lang = Language::from(tree_sitter_javascript::LANGUAGE); + let (cfg, _entry) = parse_and_build(src, "javascript", ts_lang); + let mut found_source_branch = false; + for n in cfg.node_indices() { + let info = &cfg[n]; + if info.taint.defines.as_deref() == Some("arr") + && info + .taint + .labels + .iter() + .any(|l| matches!(l, crate::labels::DataLabel::Source(_))) + { + found_source_branch = true; + break; + } + } + assert!( + found_source_branch, + "expected at least one ternary branch defining `arr` to carry a \ + Source label after segment-strip classification of `req.query.lng`" + ); +} + +#[test] +fn js_ternary_branch_const_strings_have_no_source() { + // Both branches are constant strings -> no Source label should be + // synthesised by the segment-strip pass. Pins precision: the fix + // only fires when first_member_label finds a real source-shaped + // expression in the branch AST. + let src = b"function h(cond) { const x = cond ? 'a' : 'b'; }"; + let ts_lang = Language::from(tree_sitter_javascript::LANGUAGE); + let (cfg, _entry) = parse_and_build(src, "javascript", ts_lang); + for n in cfg.node_indices() { + let info = &cfg[n]; + if info.taint.defines.as_deref() == Some("x") { + assert!( + !info + .taint + .labels + .iter() + .any(|l| matches!(l, crate::labels::DataLabel::Source(_))), + "constant-string ternary branch must not carry a Source label; \ + got labels = {:?}", + info.taint.labels + ); + } + } +} + +#[test] +fn js_ternary_branch_subscript_source_classified() { + // Subscript-form sources (`req.body['key']`) reach via the + // first_member_label subscript-expression arm. Pins the same fix + // for subscript-shaped source branches. + let src = b"function h(req) { const x = req.body ? 
req.body['k'] : ''; }"; + let ts_lang = Language::from(tree_sitter_javascript::LANGUAGE); + let (cfg, _entry) = parse_and_build(src, "javascript", ts_lang); + let mut found_source_branch = false; + for n in cfg.node_indices() { + let info = &cfg[n]; + if info.taint.defines.as_deref() == Some("x") + && info + .taint + .labels + .iter() + .any(|l| matches!(l, crate::labels::DataLabel::Source(_))) + { + found_source_branch = true; + break; + } + } + assert!( + found_source_branch, + "expected ternary subscript branch defining `x` to carry a Source label" + ); +} diff --git a/src/cfg/conditions.rs b/src/cfg/conditions.rs index eb06a935..9e999f74 100644 --- a/src/cfg/conditions.rs +++ b/src/cfg/conditions.rs @@ -1,3 +1,4 @@ +use super::helpers::first_member_label; use super::{ AstMeta, Cfg, EdgeKind, MAX_COND_VARS, MAX_CONDITION_TEXT_LEN, NodeInfo, StmtKind, collect_idents, connect_all, detect_eq_with_const, detect_negation, has_call_descendant, @@ -349,6 +350,33 @@ pub(super) fn lower_ternary_branch<'a>( } } + // Bridge source recognition to ternary branches. push_node only does + // suffix/prefix matching on the branch text, so a source-shaped member + // expression like `req.query.lng` doesn't classify (the rule matcher + // is `req.query`, which neither suffix-matches nor prefix-matches + // `req.query.lng`). Run the segment-strip-and-retry classifier on + // the branch AST to recover the source label, mirroring what + // `pre_emit_arg_source_nodes` does for call arguments and what the + // `Kind::CallWrapper | Kind::Assignment` gate at push_node:1827 does + // for whole declarations. Without this, `let arr = cond ? req.query.lng + // : "";` lowers each branch to a labelless Assign-with-empty-uses, the + // join phi sees no taint, and downstream sinks miss the flow. 
+ if !g[node] + .taint + .labels + .iter() + .any(|l| matches!(l, DataLabel::Source(_))) + { + let extra = analysis_rules + .map(|r| r.extra_labels.as_slice()) + .filter(|s| !s.is_empty()); + if let Some(found @ DataLabel::Source(_)) = + first_member_label(branch_ast, lang, code, extra) + { + g[node].taint.labels.push(found); + } + } + connect_all(g, preds, node, pred_edge); vec![node] } diff --git a/src/cfg/mod.rs b/src/cfg/mod.rs index 4783666d..34b79fa2 100644 --- a/src/cfg/mod.rs +++ b/src/cfg/mod.rs @@ -847,10 +847,18 @@ pub(super) fn detect_negation<'a>( }; // `!expr` appears as unary_expression, not_operator, or prefix_unary_expression - // with a `!` or `not` operator child. + // with a `!` or `not` operator child. PHP's tree-sitter grammar emits + // `unary_op_expression` for unary `!` (and `-`/`+`/`~`) — without it, + // `if (!validate($x))` carries `condition_negated=false` and the + // True branch is treated as the validated path even though it is the + // rejection path, leaving downstream sinks unsuppressed. 
let is_negation_wrapper = matches!( cond.kind(), - "unary_expression" | "not_operator" | "prefix_unary_expression" | "unary_not" + "unary_expression" + | "not_operator" + | "prefix_unary_expression" + | "unary_not" + | "unary_op_expression" ); if is_negation_wrapper { @@ -3233,6 +3241,7 @@ pub(super) fn build_sub<'a>( | "not_operator" | "prefix_unary_expression" | "unary_not" + | "unary_op_expression" ) }); @@ -3472,6 +3481,7 @@ pub(super) fn build_sub<'a>( | "not_operator" | "prefix_unary_expression" | "unary_not" + | "unary_op_expression" ) }) .unwrap_or(false); diff --git a/src/cfg_analysis/guards.rs b/src/cfg_analysis/guards.rs index d3c65f14..c177325b 100644 --- a/src/cfg_analysis/guards.rs +++ b/src/cfg_analysis/guards.rs @@ -463,6 +463,56 @@ fn sink_args_typed_safe(ctx: &AnalysisContext, sink: NodeIndex, sink_caps: Cap) type_facts_suppress(&values, sink_caps, type_facts) } +/// Suppress a `cfg-unguarded-sink` SQL_QUERY finding when any positional +/// argument to the sink Call is provably a JPA / Hibernate Criteria query +/// object ([`crate::ssa::type_facts::TypeKind::JpaCriteriaQuery`]). +/// +/// Receiver values are deliberately excluded, the receiver of a JPA +/// query method (`session.createQuery(cq)`, `em.createQuery(cq)`, +/// `session.executeUpdate(cq)`) is the connection / EntityManager +/// channel, never the SQL payload. Including the receiver in the type +/// check would make this suppression unreachable since `Session` / +/// `EntityManager` values are typed `Object` / `Unknown` and never +/// `JpaCriteriaQuery` themselves. +/// +/// Closes the dominant FP cluster across openmrs (169 of 216 +/// cfg-unguarded-sink), xwiki, and keycloak: Hibernate DAO methods +/// build a `CriteriaQuery` via `cb.createQuery(Foo.class)` + +/// `Root` / `Predicate` API, then hand the query object to +/// `session.createQuery(cq)` for execution. No string concatenation +/// happens, JPA emits parameterized SQL by construction. 
+fn sink_args_jpa_criteria_query_safe( + ctx: &AnalysisContext, + sink: NodeIndex, + sink_caps: Cap, +) -> bool { + if !sink_caps.intersects(Cap::SQL_QUERY) { + return false; + } + let Some(facts) = ctx.body_const_facts else { + return false; + }; + let Some(type_facts) = ctx.type_facts else { + return false; + }; + let Some(&sink_val) = facts.ssa.cfg_node_map.get(&sink) else { + return false; + }; + let Some(inst) = find_inst(&facts.ssa, sink_val) else { + return false; + }; + let SsaOp::Call { args, .. } = &inst.op else { + return false; + }; + let mut values: Vec<SsaValue> = Vec::new(); + for group in args { + for v in group.iter() { + values.push(*v); + } + } + crate::ssa::type_facts::is_safe_query_object_arg(&values, sink_caps, type_facts) +} + /// Walk the sink's Call SSA arguments and check whether every real argument /// resolves through a defining `SsaOp::Call` whose callee carries an SSA /// summary with `validated_params_to_return` covering every propagating @@ -1210,6 +1260,17 @@ impl CfgAnalysis for UnguardedSink { continue; } + // JPA / Hibernate Criteria-query suppression: receiver-call SQL + // sinks like `session.createQuery(cq)` / `em.executeUpdate(cq)` + // are safe by construction when arg 0 is a structural Criteria + // object built via `CriteriaBuilder` (returns parameterized + // SQL). Receiver excluded from the check, the receiver is + // never the payload. Closes openmrs / xwiki / keycloak + // Hibernate-DAO FP cluster. + if !has_taint && sink_args_jpa_criteria_query_safe(ctx, *sink, sink_caps) { + continue; + } + // Static-map suppression: the SSA value flowing into the sink is // proved by the static-HashMap-lookup idiom detector to be a // finite set of literals free of shell metacharacters.
Mirrors diff --git a/src/cfg_analysis/mod.rs b/src/cfg_analysis/mod.rs index 1f7402e0..48d16e5c 100644 --- a/src/cfg_analysis/mod.rs +++ b/src/cfg_analysis/mod.rs @@ -88,7 +88,21 @@ pub struct BodyConstFacts { /// Lower a body to SSA and run constant propagation. Returns `None` when /// lowering fails (empty CFG, invalid entry), callers treat absence as /// "no SSA facts available" and fall back to the syntactic path. +/// Perf-regression sentinel: total cumulative calls to +/// [`build_body_const_facts`] across the process lifetime. +/// +/// Used by the `analyse_file_fused_large_go` criterion bench in +/// `benches/scan_bench.rs` to assert the per-file +/// [`crate::ast`]`::ParsedFile::body_const_facts_cache` is collapsing the +/// per-body re-lowering (~149 calls per file expected; pre-cache was ~447). +/// The atomic increment is ~1 ns per call and disappears in the noise of +/// the multi-millisecond SSA lowering it gates. +#[doc(hidden)] +pub static BUILD_BODY_CONST_FACTS_CALLS: std::sync::atomic::AtomicU64 = + std::sync::atomic::AtomicU64::new(0); + pub fn build_body_const_facts(body: &crate::cfg::BodyCfg, lang: Lang) -> Option<BodyConstFacts> { + BUILD_BODY_CONST_FACTS_CALLS.fetch_add(1, std::sync::atomic::Ordering::Relaxed); let mut ssa = crate::ssa::lower_to_ssa_with_params( &body.graph, body.entry, diff --git a/src/commands/scan.rs b/src/commands/scan.rs index 30d5811f..0ccffcb1 100644 --- a/src/commands/scan.rs +++ b/src/commands/scan.rs @@ -1743,6 +1743,17 @@ pub(crate) fn scan_filesystem_with_observer( local_gs.insert_auth(key, auth_sum); } + // Insert per-Python-file router-dep facts so + // pass 2's auth analysis can lift FastAPI + // router-level `dependencies=[Security(...)]` + // declarations across the + // `.include_router(., + // ...)` boundary — the canonical airflow + // execution-API auth shape.
+ if let Some((module_id, facts)) = r.router_facts { + local_gs.insert_router_facts(module_id, facts); + } + // Record language for progress if let Some(p) = progress { if let Some(ref lang) = first_lang { diff --git a/src/constraint/domain.rs b/src/constraint/domain.rs index d3451a45..f17758df 100644 --- a/src/constraint/domain.rs +++ b/src/constraint/domain.rs @@ -185,6 +185,7 @@ fn type_kind_index(kind: &TypeKind) -> u32 { TypeKind::HttpClient => 11, TypeKind::LocalCollection => 12, TypeKind::RequestBuilder => 13, + TypeKind::JpaCriteriaQuery => 14, // the analysis DTO types carry per-field structural info that the // bitset domain can't represent. Collapse to Unknown so callers // still see "any type possible" rather than crashing on an @@ -210,6 +211,7 @@ fn type_kind_from_index(idx: u32) -> Option { 11 => Some(TypeKind::HttpClient), 12 => Some(TypeKind::LocalCollection), 13 => Some(TypeKind::RequestBuilder), + 14 => Some(TypeKind::JpaCriteriaQuery), _ => None, } } diff --git a/src/labels/mod.rs b/src/labels/mod.rs index 1f7381e6..a28b06b5 100644 --- a/src/labels/mod.rs +++ b/src/labels/mod.rs @@ -799,6 +799,33 @@ fn phase_c_auth_rules_for_lang(lang_slug: &str) -> Vec { } } +/// Look up a *receiver-side* validator for the given callee name. +/// +/// Returns `Some(cap)` when the callee is registered as a method-call +/// validator that strips `cap` from its receiver (and other call +/// equivalents) on success. Distinct from the `Sanitizer` label, +/// which clears caps from the *return value*. Used by the Call +/// transfer to model idioms like `path.relative_to(base)` whose +/// observable effect on data flow is "the receiver is validated" +/// rather than "the return value is sanitised". 
+pub fn lookup_receiver_validator(lang: &str, callee: &str) -> Option<Cap> { + let table: &[(&str, Cap)] = match lang { + "python" | "py" => python::RECEIVER_VALIDATORS, + _ => return None, + }; + let head = callee.split(['(', '<']).next().unwrap_or(callee); + let trimmed = head.trim().as_bytes(); + let normalized = normalize_chained_call(callee); + let norm = normalized.as_bytes(); + for (name, cap) in table { + let m = name.as_bytes(); + if match_suffix_cs(trimmed, m, false) || match_suffix_cs(norm, m, false) { + return Some(*cap); + } + } + None +} + /// Public re-export used by `ParsedFile::from_source` to /// augment per-file rule sets when imports reveal frameworks that the /// manifest-level detector missed. @@ -1471,6 +1498,26 @@ pub fn custom_rule_id(lang: &str, kind: &str, matchers: &[String]) -> String { mod tests { use super::*; + #[test] + fn receiver_validator_python_relative_to() { + // Bare method name fires. + assert_eq!( + lookup_receiver_validator("python", "relative_to"), + Some(Cap::FILE_IO) + ); + // Dotted-method-call form (chained receiver). + assert_eq!( + lookup_receiver_validator("python", "filepath.relative_to"), + Some(Cap::FILE_IO) + ); + // Other languages without a registry entry return None. + assert_eq!(lookup_receiver_validator("rust", "relative_to"), None); + assert_eq!(lookup_receiver_validator("javascript", "relative_to"), None); + // Unrelated callees return None. + assert_eq!(lookup_receiver_validator("python", "resolve"), None); + assert_eq!(lookup_receiver_validator("python", "joinpath"), None); + } + #[test] fn bare_method_name_strips_chain() { // No-dot input → returned as-is. diff --git a/src/labels/php.rs b/src/labels/php.rs index ed287806..cac0af90 100644 --- a/src/labels/php.rs +++ b/src/labels/php.rs @@ -133,10 +133,15 @@ pub static RULES: &[LabelRule] = &[ label: DataLabel::Sink(Cap::SQL_QUERY), case_sensitive: false, }, - // NOTE: `file_get_contents` can fetch URLs (SSRF vector) and local files (LFI vector).
- // As a Sink(SSRF) it only fires when the argument is tainted. + // NOTE: `file_get_contents` and `fopen` can fetch URLs (SSRF vector) and + // local files (LFI vector — `file://` scheme). As a Sink(SSRF) they only + // fire when the argument is tainted. `fopen` is the canonical low-level + // stream-opening API used by media-import / OEmbed / podcast pipelines + // (CVE-2026-33486 in roadiz/documents wraps `fopen($url, 'r')` in a + // public `DownloadedFile::fromUrl` static method that any authenticated + // backend caller can drive with attacker-controlled URLs). LabelRule { - matchers: &["file_get_contents", "curl_exec"], + matchers: &["file_get_contents", "curl_exec", "fopen"], label: DataLabel::Sink(Cap::SSRF), case_sensitive: false, }, @@ -232,6 +237,11 @@ pub static KINDS: Map<&'static str, Kind> = phf_map! { "anonymous_function_creation_expression" => Kind::Function, "arrow_function" => Kind::Function, "class_declaration" => Kind::Block, + "declaration_list" => Kind::Block, + "interface_declaration" => Kind::Block, + "trait_declaration" => Kind::Block, + "enum_declaration" => Kind::Block, + "enum_declaration_list" => Kind::Block, // data-flow "function_call_expression" => Kind::CallFn, diff --git a/src/labels/python.rs b/src/labels/python.rs index a955f7b8..96a192e6 100644 --- a/src/labels/python.rs +++ b/src/labels/python.rs @@ -25,6 +25,10 @@ pub static RULES: &[LabelRule] = &[ "request.url", "request.base_url", "request.host", + "request.match_info", + "request.rel_url", + "request.query", + "request.path", // Common alias: from flask import request as flask_request "flask_request.args", "flask_request.form", @@ -227,7 +231,15 @@ pub static RULES: &[LabelRule] = &[ case_sensitive: false, }, LabelRule { - matchers: &["send_file", "send_from_directory"], + matchers: &[ + "send_file", + "send_from_directory", + // aiohttp file response — sends file at the supplied path, + // semantically identical to Flask's send_file (CVE-2024-23334). 
+ "FileResponse", + "web.FileResponse", + "aiohttp.web.FileResponse", + ], label: DataLabel::Sink(Cap::FILE_IO), case_sensitive: false, }, @@ -274,6 +286,25 @@ pub static RULES: &[LabelRule] = &[ }, ]; +/// Method-call validators that strip caps from their *receiver* (and +/// any equivalence-class-shaped args) on success, instead of clearing +/// the return value. Distinct from `RULES`'s `Sanitizer` label, which +/// only clears the return — a poor fit for idioms whose effect is +/// raise-on-failure rather than value-replacement. +/// +/// Modeled idioms: +/// +/// * `path.relative_to(base)` (pathlib) — raises `ValueError` if `path` +/// is not under `base`. After a successful return, the receiver is +/// path-contained in `base`. Strips `Cap::FILE_IO`. Motivated by +/// CVE-2024-23334 (aiohttp StaticResource symlink-bypass) where the +/// patched code calls `filepath.relative_to(self._directory)` inside +/// a try/except and serves `filepath` afterwards. +pub static RECEIVER_VALIDATORS: &[(&str, Cap)] = &[ + ("relative_to", Cap::FILE_IO), + (".relative_to", Cap::FILE_IO), +]; + pub static GATED_SINKS: &[SinkGate] = &[ // Legacy single-kwarg gate retained for back-compat: Popen(cmd, shell=True). SinkGate { diff --git a/src/patterns/python.rs b/src/patterns/python.rs index 8364b58b..61112895 100644 --- a/src/patterns/python.rs +++ b/src/patterns/python.rs @@ -206,4 +206,26 @@ pub const PATTERNS: &[Pattern] = &[ category: PatternCategory::Xss, confidence: Confidence::High, }, + // Flask `make_response()` reflection — Tier B + // heuristic mirroring `py.sqli.execute_format` / `py.sqli.text_format`. + // Catches CVE-2023-6568 (mlflow auth `create_user` reflected the + // attacker-controlled `Content-Type` header into the response body + // via `make_response(f"Invalid content type: '{content_type}'", 400)`) + // and the equivalent `+`-concat shape. Recognises both bare + // `make_response(...)` and `flask.make_response(...)`. 
+ Pattern { + id: "py.xss.make_response_format", + description: "flask make_response with f-string or concat risks reflected XSS", + query: r#"(call + function: [(identifier) @fn (attribute attribute: (identifier) @fn)] + (#eq? @fn "make_response") + arguments: (argument_list + [(binary_operator) + (string (interpolation))] @arg)) + @vuln"#, + severity: Severity::Medium, + tier: PatternTier::B, + category: PatternCategory::Xss, + confidence: Confidence::Medium, + }, ]; diff --git a/src/server/debug.rs b/src/server/debug.rs index 63b66fae..6bf0868d 100644 --- a/src/server/debug.rs +++ b/src/server/debug.rs @@ -1180,6 +1180,7 @@ fn type_kind_tag(k: &TypeKind) -> String { TypeKind::HttpClient => "HttpClient".into(), TypeKind::LocalCollection => "LocalCollection".into(), TypeKind::RequestBuilder => "RequestBuilder".into(), + TypeKind::JpaCriteriaQuery => "JpaCriteriaQuery".into(), TypeKind::Dto(_) => "Dto".into(), } } diff --git a/src/ssa/const_prop.rs b/src/ssa/const_prop.rs index fb0c7e7d..f3d70b68 100644 --- a/src/ssa/const_prop.rs +++ b/src/ssa/const_prop.rs @@ -1,6 +1,7 @@ use std::collections::{HashMap, HashSet, VecDeque}; use serde::{Deserialize, Serialize}; +use smallvec::SmallVec; use super::ir::*; @@ -96,40 +97,56 @@ pub struct ConstPropResult { } /// Run Sparse Conditional Constant Propagation on an SSA body. +/// +/// Internal storage is dense `Vec`-indexed by [`SsaValue`] / [`BlockId`] to +/// avoid the per-lookup `SipHash` cost of `HashMap` / +/// `HashSet<(BlockId, BlockId)>` that previously dominated the inner +/// fixed-point loop. The public [`ConstPropResult`] still exposes the +/// `HashMap`-shaped contract; the conversion at the end of the function is +/// O(num_values) and runs once. 
pub fn const_propagate(body: &SsaBody) -> ConstPropResult { let num_blocks = body.blocks.len(); + let num_values = body.value_defs.len(); - // Per-value lattice: starts at Top - let mut values: HashMap = HashMap::new(); + // Dense per-value lattice (`Vec` indexed by `SsaValue.0`). All values + // are defined by exactly one inst (phi or body), so initialising the + // entire range to Top is equivalent to the previous per-inst insert + // pass at strictly lower cost (no hashing). + let mut values: Vec = vec![ConstLattice::Top; num_values]; - // Executable flags per CFG edge (from_block, to_block) - let mut executable_edges: HashSet<(BlockId, BlockId)> = HashSet::new(); - // Executable blocks - let mut executable_blocks: HashSet = HashSet::new(); + // Per-block executability and per-(dest, pred) executable-edge bitmap. + // Edges are stored as a per-destination list of executable predecessors + // — phi evaluation only ever asks "is `(pred, this_block)` executable?", + // so a tiny SmallVec scan over the dest's predecessors beats a + // `HashSet<(BlockId, BlockId)>::contains` (which hashes a 64-bit pair + // for every operand of every phi). + let mut executable_blocks: Vec = vec![false; num_blocks]; + let mut executable_preds: Vec> = vec![SmallVec::new(); num_blocks]; - // Two worklists + // Worklists let mut cfg_worklist: VecDeque = VecDeque::new(); let mut ssa_worklist: VecDeque = VecDeque::new(); // Mark entry executable - executable_blocks.insert(body.entry); + executable_blocks[body.entry.0 as usize] = true; cfg_worklist.push_back(body.entry); - // Build use-map: SsaValue → list of (BlockId, instruction index in block) - // so we can propagate SSA value changes efficiently. - let mut use_sites: HashMap> = HashMap::new(); + // Use-map: dense `Vec` indexed by `SsaValue.0`. Populated in a single + // pass via the closure-based [`inst_uses_each`] helper, which avoids + // the heap allocation of the prior `inst_uses() -> Vec` + // factory. 
+ let mut use_sites: Vec> = vec![SmallVec::new(); num_values]; for block in &body.blocks { for inst in block.phis.iter().chain(block.body.iter()) { - for used_val in inst_uses(inst) { - use_sites.entry(used_val).or_default().push(block.id); - } - } - } - - // Initialize all values to Top - for block in &body.blocks { - for inst in block.phis.iter().chain(block.body.iter()) { - values.insert(inst.value, ConstLattice::Top); + inst_uses_each(inst, |used_val| { + let idx = used_val.0 as usize; + if idx < use_sites.len() { + let bucket = &mut use_sites[idx]; + if bucket.last() != Some(&block.id) { + bucket.push(block.id); + } + } + }); } } @@ -144,10 +161,10 @@ pub fn const_propagate(body: &SsaBody) -> ConstPropResult { // Evaluate phis for phi in &block.phis { if let SsaOp::Phi(operands) = &phi.op { - let old = values.get(&phi.value).cloned().unwrap_or(ConstLattice::Top); - let new_val = eval_phi(operands, &values, &executable_edges, block_id); + let old = lookup(&values, phi.value); + let new_val = eval_phi(operands, &values, &executable_preds, block_id); if new_val != old { - values.insert(phi.value, new_val); + store(&mut values, phi.value, new_val); ssa_worklist.push_back(phi.value); changed = true; } @@ -156,13 +173,10 @@ pub fn const_propagate(body: &SsaBody) -> ConstPropResult { // Evaluate body instructions for inst in &block.body { - let old = values - .get(&inst.value) - .cloned() - .unwrap_or(ConstLattice::Top); + let old = lookup(&values, inst.value); let new_val = eval_inst(inst, &values); if new_val != old { - values.insert(inst.value, new_val); + store(&mut values, inst.value, new_val); ssa_worklist.push_back(inst.value); changed = true; } @@ -173,7 +187,7 @@ pub fn const_propagate(body: &SsaBody) -> ConstPropResult { block, body, &values, - &mut executable_edges, + &mut executable_preds, &mut executable_blocks, &mut cfg_worklist, ); @@ -181,54 +195,57 @@ pub fn const_propagate(body: &SsaBody) -> ConstPropResult { // Process SSA worklist while let 
Some(val) = ssa_worklist.pop_front() { - if let Some(blocks) = use_sites.get(&val) { - for &block_id in blocks { - if !executable_blocks.contains(&block_id) { - continue; - } - let block = body.block(block_id); - - // Re-evaluate phis using this value - for phi in &block.phis { - if let SsaOp::Phi(operands) = &phi.op - && operands.iter().any(|(_, v)| *v == val) - { - let old = values.get(&phi.value).cloned().unwrap_or(ConstLattice::Top); - let new_val = eval_phi(operands, &values, &executable_edges, block_id); - if new_val != old { - values.insert(phi.value, new_val); - ssa_worklist.push_back(phi.value); - changed = true; - } - } - } - - // Re-evaluate body instructions using this value - for inst in &block.body { - if inst_uses(inst).contains(&val) { - let old = values - .get(&inst.value) - .cloned() - .unwrap_or(ConstLattice::Top); - let new_val = eval_inst(inst, &values); - if new_val != old { - values.insert(inst.value, new_val); - ssa_worklist.push_back(inst.value); - changed = true; - } - } - } - - // Re-evaluate terminator if condition changed - process_terminator( - block, - body, - &values, - &mut executable_edges, - &mut executable_blocks, - &mut cfg_worklist, - ); + let val_idx = val.0 as usize; + if val_idx >= use_sites.len() { + continue; + } + // Snapshot the use-list so we can borrow `values` mutably + // while iterating block ids. The list is short (typically + // 1–3 blocks) so the clone is cheap. 
+ let use_blocks = use_sites[val_idx].clone(); + for block_id in use_blocks { + if !executable_blocks[block_id.0 as usize] { + continue; } + let block = body.block(block_id); + + // Re-evaluate phis using this value + for phi in &block.phis { + if let SsaOp::Phi(operands) = &phi.op + && operands.iter().any(|(_, v)| *v == val) + { + let old = lookup(&values, phi.value); + let new_val = eval_phi(operands, &values, &executable_preds, block_id); + if new_val != old { + store(&mut values, phi.value, new_val); + ssa_worklist.push_back(phi.value); + changed = true; + } + } + } + + // Re-evaluate body instructions using this value + for inst in &block.body { + if inst_has_use(inst, val) { + let old = lookup(&values, inst.value); + let new_val = eval_inst(inst, &values); + if new_val != old { + store(&mut values, inst.value, new_val); + ssa_worklist.push_back(inst.value); + changed = true; + } + } + } + + // Re-evaluate terminator if condition changed + process_terminator( + block, + body, + &values, + &mut executable_preds, + &mut executable_blocks, + &mut cfg_worklist, + ); } } @@ -237,44 +254,79 @@ pub fn const_propagate(body: &SsaBody) -> ConstPropResult { } } - // Compute unreachable blocks - let unreachable_blocks: HashSet = (0..num_blocks) - .map(|i| BlockId(i as u32)) - .filter(|bid| !executable_blocks.contains(bid)) - .collect(); + // Convert dense storage to the public `HashMap`-shaped result. Walks + // the value vector exactly once. The unreachable-blocks set is small + // (often empty), so building it from a linear scan is fine. 
+ let mut out_values: HashMap = HashMap::with_capacity(num_values); + for (i, v) in values.into_iter().enumerate() { + out_values.insert(SsaValue(i as u32), v); + } + let mut unreachable_blocks: HashSet = HashSet::new(); + for (i, exec) in executable_blocks.iter().enumerate() { + if !exec { + unreachable_blocks.insert(BlockId(i as u32)); + } + } ConstPropResult { - values, + values: out_values, unreachable_blocks, } } +/// Dense lattice lookup. Returns Top for out-of-range values to match the +/// pre-refactor `HashMap::get(&v).cloned().unwrap_or(Top)` semantics. +#[inline] +fn lookup(values: &[ConstLattice], v: SsaValue) -> ConstLattice { + values + .get(v.0 as usize) + .cloned() + .unwrap_or(ConstLattice::Top) +} + +/// Dense lattice store. Out-of-range writes are silently dropped to +/// preserve robustness against malformed SSA input — the prior HashMap +/// path would have inserted a stray entry; the dense path leaves it +/// implicit (Top). Either way the value is unobservable downstream +/// because no use-map entry would point at it. +#[inline] +fn store(values: &mut [ConstLattice], v: SsaValue, val: ConstLattice) { + let idx = v.0 as usize; + if idx < values.len() { + values[idx] = val; + } +} + /// Evaluate a phi: meet of operands from executable predecessors. 
fn eval_phi( operands: &[(BlockId, SsaValue)], - values: &HashMap, - executable_edges: &HashSet<(BlockId, BlockId)>, + values: &[ConstLattice], + executable_preds: &[SmallVec<[BlockId; 2]>], this_block: BlockId, ) -> ConstLattice { + let preds = executable_preds + .get(this_block.0 as usize) + .map(|p| p.as_slice()) + .unwrap_or(&[]); let mut result = ConstLattice::Top; for (pred_block, val) in operands { - if !executable_edges.contains(&(*pred_block, this_block)) { + if !preds.contains(pred_block) { continue; // skip non-executable predecessors } - let operand_val = values.get(val).cloned().unwrap_or(ConstLattice::Top); + let operand_val = lookup(values, *val); result = result.meet(&operand_val); } result } /// Evaluate a single instruction. -fn eval_inst(inst: &SsaInst, values: &HashMap) -> ConstLattice { +fn eval_inst(inst: &SsaInst, values: &[ConstLattice]) -> ConstLattice { match &inst.op { SsaOp::Const(Some(text)) => ConstLattice::parse(text), SsaOp::Const(None) => ConstLattice::Varying, // unknown constant SsaOp::Assign(uses) if uses.len() == 1 => { // Copy: propagate the source's value - values.get(&uses[0]).cloned().unwrap_or(ConstLattice::Top) + lookup(values, uses[0]) } SsaOp::Assign(_) => ConstLattice::Varying, // expression with multiple uses SsaOp::Call { .. } @@ -297,29 +349,69 @@ fn eval_inst(inst: &SsaInst, values: &HashMap) -> ConstL } } -/// Collect SSA values used by an instruction (for use-map building). -fn inst_uses(inst: &SsaInst) -> Vec { +/// Apply a closure to every SSA value used by an instruction. Avoids the +/// `Vec` heap allocation that the previous `inst_uses(inst)` +/// helper paid on every call (use-map build is O(num_insts), the prior +/// path bottle-necked there). 
+#[inline] +fn inst_uses_each(inst: &SsaInst, mut f: F) { match &inst.op { - SsaOp::Phi(operands) => operands.iter().map(|(_, v)| *v).collect(), - SsaOp::Assign(uses) => uses.to_vec(), + SsaOp::Phi(operands) => { + for (_, v) in operands { + f(*v); + } + } + SsaOp::Assign(uses) => { + for v in uses { + f(*v); + } + } SsaOp::Call { args, receiver, .. } => { - let mut vals = Vec::new(); if let Some(rv) = receiver { - vals.push(*rv); + f(*rv); } for arg in args { - vals.extend(arg.iter()); + for v in arg { + f(*v); + } } - vals } - SsaOp::FieldProj { receiver, .. } => vec![*receiver], + SsaOp::FieldProj { receiver, .. } => f(*receiver), SsaOp::Source | SsaOp::Const(_) | SsaOp::Param { .. } | SsaOp::SelfParam | SsaOp::CatchParam | SsaOp::Nop - | SsaOp::Undef => Vec::new(), + | SsaOp::Undef => {} + } +} + +/// Zero-allocation predicate: does `inst` use `target` as an operand? +/// Replaces the prior `inst_uses(inst).contains(&target)` shape, which +/// allocated a fresh `Vec` on every check inside the SCCP +/// re-evaluation worklist. +#[inline] +fn inst_has_use(inst: &SsaInst, target: SsaValue) -> bool { + match &inst.op { + SsaOp::Phi(operands) => operands.iter().any(|(_, v)| *v == target), + SsaOp::Assign(uses) => uses.contains(&target), + SsaOp::Call { args, receiver, .. } => { + if let Some(rv) = receiver + && *rv == target + { + return true; + } + args.iter().any(|arg| arg.contains(&target)) + } + SsaOp::FieldProj { receiver, .. } => *receiver == target, + SsaOp::Source + | SsaOp::Const(_) + | SsaOp::Param { .. 
} + | SsaOp::SelfParam + | SsaOp::CatchParam + | SsaOp::Nop + | SsaOp::Undef => false, } } @@ -327,9 +419,9 @@ fn inst_uses(inst: &SsaInst) -> Vec { fn process_terminator( block: &SsaBlock, body: &SsaBody, - values: &HashMap, - executable_edges: &mut HashSet<(BlockId, BlockId)>, - executable_blocks: &mut HashSet, + values: &[ConstLattice], + executable_preds: &mut [SmallVec<[BlockId; 2]>], + executable_blocks: &mut [bool], cfg_worklist: &mut VecDeque, ) { match &block.terminator { @@ -343,7 +435,7 @@ fn process_terminator( mark_edge_executable( block.id, target, - executable_edges, + executable_preds, executable_blocks, cfg_worklist, ); @@ -359,7 +451,7 @@ fn process_terminator( let cond_val = body .cfg_node_map .get(cond) - .and_then(|v| values.get(v)) + .map(|v| lookup(values, *v)) .and_then(|c| c.as_bool()); match cond_val { @@ -367,7 +459,7 @@ fn process_terminator( mark_edge_executable( block.id, *true_blk, - executable_edges, + executable_preds, executable_blocks, cfg_worklist, ); @@ -376,7 +468,7 @@ fn process_terminator( mark_edge_executable( block.id, *false_blk, - executable_edges, + executable_preds, executable_blocks, cfg_worklist, ); @@ -386,14 +478,14 @@ fn process_terminator( mark_edge_executable( block.id, *true_blk, - executable_edges, + executable_preds, executable_blocks, cfg_worklist, ); mark_edge_executable( block.id, *false_blk, - executable_edges, + executable_preds, executable_blocks, cfg_worklist, ); @@ -417,7 +509,7 @@ fn process_terminator( mark_edge_executable( block.id, target, - executable_edges, + executable_preds, executable_blocks, cfg_worklist, ); @@ -432,7 +524,7 @@ fn process_terminator( mark_edge_executable( block.id, target, - executable_edges, + executable_preds, executable_blocks, cfg_worklist, ); @@ -444,18 +536,27 @@ fn process_terminator( fn mark_edge_executable( from: BlockId, to: BlockId, - executable_edges: &mut HashSet<(BlockId, BlockId)>, - executable_blocks: &mut HashSet, + executable_preds: &mut [SmallVec<[BlockId; 
2]>], + executable_blocks: &mut [bool], cfg_worklist: &mut VecDeque, ) { - if executable_edges.insert((from, to)) { - if executable_blocks.insert(to) { - cfg_worklist.push_back(to); - } else { - // Block already executable but new edge, re-evaluate phis - cfg_worklist.push_back(to); - } + let to_idx = to.0 as usize; + if to_idx >= executable_preds.len() { + return; } + let preds = &mut executable_preds[to_idx]; + if preds.contains(&from) { + return; + } + preds.push(from); + let was_already_exec = executable_blocks[to_idx]; + if !was_already_exec { + executable_blocks[to_idx] = true; + } + // Always re-enqueue: either the block became newly reachable, or it + // already was but a new predecessor edge means phi operands need + // re-meeting against the now-executable predecessor. + cfg_worklist.push_back(to); } /// Apply constant propagation results: prune branches where condition is known constant. diff --git a/src/ssa/type_facts.rs b/src/ssa/type_facts.rs index 0bd61f76..d02f6a96 100644 --- a/src/ssa/type_facts.rs +++ b/src/ssa/type_facts.rs @@ -7,6 +7,7 @@ use super::ir::*; use crate::cfg::{BinOp, Cfg}; use crate::symbol::Lang; use serde::{Deserialize, Serialize}; +use smallvec::SmallVec; /// Inferred type kind for an SSA value. #[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize)] @@ -40,6 +41,17 @@ pub enum TypeKind { /// `label_prefix`, never participates in label-based callee /// resolution. LocalCollection, + /// A JPA / Hibernate Criteria API query object (`CriteriaQuery`, + /// `CriteriaUpdate`, `CriteriaDelete`, `Subquery`, + /// `TypedQuery`). These objects are produced by the + /// `CriteriaBuilder` and emit parameterized SQL when handed to + /// `Session.createQuery(cq)` / `EntityManager.createQuery(cq)`. The + /// argument is structural (predicate AST), not a string, so SQL + /// injection cannot flow through it. 
Used to suppress the + /// `cfg-unguarded-sink` finding on `session.createQuery(cq)` shapes + /// where openmrs / xwiki / keycloak Hibernate DAOs build queries + /// via `cb.createQuery(Foo.class)` + `Root` / `Predicate` API. + JpaCriteriaQuery, /// A framework-injected DTO body whose field types are known. /// Populated when a parameter is recognised as a typed extractor and /// the DTO class / struct / Pydantic model is resolvable in scope. @@ -86,6 +98,7 @@ impl TypeKind { Self::FileHandle => Some("FileHandle"), Self::Url => Some("URL"), Self::RequestBuilder => Some("RequestBuilder"), + Self::JpaCriteriaQuery => Some("JpaCriteriaQuery"), _ => None, } } @@ -222,6 +235,111 @@ pub fn is_type_safe_for_sink( }) } +/// Check whether any of the sink-arg SSA values is a structural query +/// object that emits parameterized SQL by construction (currently the +/// JPA / Hibernate Criteria API: `CriteriaQuery`, `CriteriaUpdate`, +/// `CriteriaDelete`, `Subquery`, `TypedQuery`). +/// +/// Used by both the SSA taint engine and the structural +/// `cfg-unguarded-sink` analysis to suppress the SQL-injection finding +/// on `session.createQuery(cq)` / `em.createQuery(cq)` / `executeUpdate` +/// shapes where the argument is a Criteria object built via +/// `CriteriaBuilder` rather than a string. +/// +/// Returns `false` when `sink_caps` does not include `SQL_QUERY`, when +/// `values` is empty, or when no value carries the +/// [`TypeKind::JpaCriteriaQuery`] tag. Receiver values should be +/// excluded by the caller, the receiver of a JPA query method is the +/// `Session` / `EntityManager` channel, never the payload. 
+pub fn is_safe_query_object_arg( + values: &[SsaValue], + sink_caps: crate::labels::Cap, + type_facts: &TypeFactResult, +) -> bool { + use crate::labels::Cap; + if !sink_caps.intersects(Cap::SQL_QUERY) { + return false; + } + if values.is_empty() { + return false; + } + values + .iter() + .any(|v| type_facts.is_type(*v, &TypeKind::JpaCriteriaQuery)) +} + +/// Receiver-text-aware return-type inference for methods whose +/// constructor mapping cannot be determined from the callee suffix +/// alone. +/// +/// The JPA `createQuery` suffix is overloaded between +/// `CriteriaBuilder.createQuery(Class)` (returns `CriteriaQuery`, our +/// safe-by-construction structural query object) and +/// `Session.createQuery(String|Query)` (the executable-query +/// constructor whose string overload IS a SQL sink). Class-literal +/// arg shape (e.g. `Foo.class`) doesn't surface in `arg_uses` at the +/// CFG layer, so we fall back to the receiver-text hint: if the +/// callee path includes a `CriteriaBuilder` cast or a receiver +/// variable named `cb` / `criteriaBuilder` / `builder`, treat the +/// call as the criteria-builder overload. +/// +/// Conservative: returns `None` for any other shape so +/// [`constructor_type`] / `is_int_producing_callee` stay +/// authoritative, and consumers see Unknown instead of a wrong +/// type tag. +/// +/// `_args` and `_consts` are kept on the signature so we can later +/// add arg-shape narrowing when class-literal lowering captures +/// `Foo.class` as an arg-use. +fn arg_aware_call_type( + lang: Lang, + callee: &str, + _args: &[SmallVec<[SsaValue; 2]>], + _consts: &HashMap, +) -> Option { + if !matches!(lang, Lang::Java) { + return None; + } + let after_colons = callee.rsplit("::").next().unwrap_or(callee); + let suffix = after_colons.rsplit('.').next().unwrap_or(after_colons); + if suffix != "createQuery" { + return None; + } + // Strip the trailing `.createQuery` segment and inspect the + // receiver text for the criteria-builder hints. 
Conservative + // text-level match, the SSA layer doesn't expose receiver-type + // facts here yet. + let prefix = callee.rsplit_once('.').map(|(p, _)| p).unwrap_or(callee); + if prefix.contains("CriteriaBuilder") || receiver_is_criteria_builder(prefix) { + Some(TypeKind::JpaCriteriaQuery) + } else { + None + } +} + +/// True when the receiver text identifies a CriteriaBuilder by +/// idiomatic naming (`cb`, `criteriaBuilder`, `builder`, +/// `getCriteriaBuilder()`), modulo casts and chained accesses. +fn receiver_is_criteria_builder(receiver_text: &str) -> bool { + // Drop trailing parenthesized portions and chained cast/syntax noise. + let cleaned = receiver_text + .rsplit_once(')') + .map(|(_, tail)| tail) + .unwrap_or(receiver_text) + .trim(); + let cleaned = cleaned.trim_start_matches('.'); + let last_segment = cleaned + .rsplit(['.', ':', ' ']) + .next() + .unwrap_or(cleaned) + .trim_matches(|c: char| c == '(' || c == ')'); + matches!( + last_segment, + "cb" | "criteriaBuilder" | "criteria_builder" | "builder" | "getCriteriaBuilder" + ) || receiver_text.contains("getCriteriaBuilder()") + || receiver_text.contains(".cb.") +} + /// Infer a type from a constructor, factory, or allocator call. /// /// Maps known constructor/factory/allocator patterns to security-relevant @@ -260,6 +378,20 @@ pub(crate) fn constructor_type(lang: Lang, callee: &str) -> Option { "FileInputStream" | "FileOutputStream" | "FileReader" | "FileWriter" | "BufferedReader" | "BufferedWriter" => Some(TypeKind::FileHandle), "getWriter" | "getOutputStream" => Some(TypeKind::HttpResponse), + // JPA / Hibernate Criteria API factory methods. These are + // unambiguous: `createCriteriaUpdate` / `createCriteriaDelete` + // / `createTupleQuery` / `subquery` exist only on + // `CriteriaBuilder` / `CriteriaQuery` and always return a + // structural query object. 
`createQuery` is overloaded + // (`CriteriaBuilder.createQuery(Class)` returns + // `CriteriaQuery`; `Session.createQuery(String)` returns + // `Query`), so it's gated below in + // [`infer_call_return_type_with_args`] on the arg-0 shape + // (a class literal) so we don't conflate the executable- + // query overload with the criteria builder. + "createCriteriaUpdate" | "createCriteriaDelete" | "createTupleQuery" | "subquery" => { + Some(TypeKind::JpaCriteriaQuery) + } _ => None, }, Lang::JavaScript | Lang::TypeScript => match suffix { @@ -687,9 +819,13 @@ pub fn analyze_types_with_param_types( } SsaOp::SelfParam => TypeFact::from_kind(TypeKind::Object), SsaOp::CatchParam => TypeFact::from_kind(TypeKind::Object), - SsaOp::Call { callee, .. } => { + SsaOp::Call { callee, args, .. } => { if let Some(ty) = lang.and_then(|l| constructor_type(l, callee)) { TypeFact::from_kind(ty) + } else if let Some(ty) = + lang.and_then(|l| arg_aware_call_type(l, callee, args, consts)) + { + TypeFact::from_kind(ty) } else if is_int_producing_callee(callee) { TypeFact::from_kind(TypeKind::Int) } else { @@ -2227,4 +2363,171 @@ mod tests { &result )); } + + // ── JPA Criteria query suppression (Phase: real-repo openmrs FP) ─── + // + // These tests pin the `TypeKind::JpaCriteriaQuery` variant + the + // `is_safe_query_object_arg` predicate + the + // `arg_aware_call_type` receiver-text recogniser. Together they + // close the openmrs HibernateDAO `session.createQuery(cq)` FP + // cluster (216 → 24 cfg-unguarded-sink in openmrs). + + /// `JpaCriteriaQuery` carries a label_prefix so type-qualified + /// callee resolution can attach future rules. + #[test] + fn jpa_criteria_query_label_prefix() { + assert_eq!( + TypeKind::JpaCriteriaQuery.label_prefix(), + Some("JpaCriteriaQuery") + ); + } + + /// `is_safe_query_object_arg` suppresses SQL_QUERY when any + /// supplied value is a `JpaCriteriaQuery`. Receiver inclusion is + /// the caller's responsibility, here we just verify the predicate. 
+ #[test] + fn safe_query_object_arg_suppresses_sql_query() { + use crate::labels::Cap; + let mut facts = HashMap::new(); + facts.insert(SsaValue(0), TypeFact::from_kind(TypeKind::JpaCriteriaQuery)); + let result = TypeFactResult { facts }; + assert!(is_safe_query_object_arg( + &[SsaValue(0)], + Cap::SQL_QUERY, + &result + )); + // Other caps stay untouched. + assert!(!is_safe_query_object_arg( + &[SsaValue(0)], + Cap::CODE_EXEC, + &result + )); + // Unknown-typed values do not trigger. + let mut facts2 = HashMap::new(); + facts2.insert(SsaValue(0), TypeFact::from_kind(TypeKind::Unknown)); + let result2 = TypeFactResult { facts: facts2 }; + assert!(!is_safe_query_object_arg( + &[SsaValue(0)], + Cap::SQL_QUERY, + &result2 + )); + // Empty slice never suppresses. + assert!(!is_safe_query_object_arg(&[], Cap::SQL_QUERY, &result)); + } + + /// `is_safe_query_object_arg` fires when a Criteria value is mixed + /// in with other types — the predicate is `any`, not `all`, since + /// the criteria-object arg is the only injection-bearing slot for a + /// `createQuery(cq)` sink. + #[test] + fn safe_query_object_arg_fires_with_mixed_args() { + use crate::labels::Cap; + let mut facts = HashMap::new(); + facts.insert(SsaValue(0), TypeFact::from_kind(TypeKind::JpaCriteriaQuery)); + facts.insert(SsaValue(1), TypeFact::from_kind(TypeKind::String)); + facts.insert(SsaValue(2), TypeFact::from_kind(TypeKind::Unknown)); + let result = TypeFactResult { facts }; + assert!(is_safe_query_object_arg( + &[SsaValue(0), SsaValue(1), SsaValue(2)], + Cap::SQL_QUERY, + &result + )); + } + + /// `arg_aware_call_type` maps the JPA `cb.createQuery(...)` / + /// `criteriaBuilder.createQuery(...)` / `((CriteriaBuilder) + /// x).createQuery(...)` shapes to `JpaCriteriaQuery`, distinct + /// from the overloaded `session.createQuery(...)` / + /// `em.createQuery(...)` which stays `None` (the + /// executable-query overload). 
+ #[test] + fn arg_aware_call_type_jpa_criteria_builder_recogniser() { + let no_args: Vec> = vec![]; + let consts: HashMap = HashMap::new(); + // Receiver hint: bare `cb` ident. + assert_eq!( + arg_aware_call_type(Lang::Java, "cb.createQuery", &no_args, &consts), + Some(TypeKind::JpaCriteriaQuery) + ); + // Receiver hint: bare `criteriaBuilder` ident. + assert_eq!( + arg_aware_call_type(Lang::Java, "criteriaBuilder.createQuery", &no_args, &consts), + Some(TypeKind::JpaCriteriaQuery) + ); + // Cast in receiver text. + assert_eq!( + arg_aware_call_type( + Lang::Java, + "((CriteriaBuilder) cb).createQuery", + &no_args, + &consts + ), + Some(TypeKind::JpaCriteriaQuery) + ); + // Chained accessor: getCriteriaBuilder().createQuery + assert_eq!( + arg_aware_call_type( + Lang::Java, + "session.getCriteriaBuilder().createQuery", + &no_args, + &consts + ), + Some(TypeKind::JpaCriteriaQuery) + ); + // The executable-query overload (`session.createQuery`) does + // NOT match — receiver-text doesn't carry a CriteriaBuilder + // hint, so we leave the type as Unknown and let the + // suppression decide based on the arg-0 type fact. + assert_eq!( + arg_aware_call_type(Lang::Java, "session.createQuery", &no_args, &consts), + None + ); + assert_eq!( + arg_aware_call_type(Lang::Java, "em.createQuery", &no_args, &consts), + None + ); + // Non-Java langs return None. + assert_eq!( + arg_aware_call_type(Lang::Python, "cb.createQuery", &no_args, &consts), + None + ); + // Other suffixes return None. + assert_eq!( + arg_aware_call_type(Lang::Java, "cb.createCriteriaUpdate", &no_args, &consts), + None + ); + } + + /// Unique-suffix Criteria API methods land on + /// `TypeKind::JpaCriteriaQuery` directly via [`constructor_type`] + /// without the receiver hint, since `createCriteriaUpdate` / + /// `createCriteriaDelete` / `createTupleQuery` / `subquery` exist + /// only on `CriteriaBuilder` / `CriteriaQuery` and have no + /// overload conflict. 
+ #[test] + fn constructor_type_unique_jpa_criteria_methods() { + for suffix in &[ + "createCriteriaUpdate", + "createCriteriaDelete", + "createTupleQuery", + "subquery", + ] { + assert_eq!( + constructor_type(Lang::Java, suffix), + Some(TypeKind::JpaCriteriaQuery), + "suffix `{suffix}` must map to JpaCriteriaQuery" + ); + // Same suffix prefixed by an arbitrary receiver still maps. + assert_eq!( + constructor_type(Lang::Java, &format!("cb.{suffix}")), + Some(TypeKind::JpaCriteriaQuery) + ); + } + // Non-criteria methods unaffected. + assert_eq!( + constructor_type(Lang::Java, "session.createQuery"), + None, + "createQuery is overloaded — must not map at constructor_type level" + ); + } } diff --git a/src/summary/mod.rs b/src/summary/mod.rs index 00082a12..a62e5dc6 100644 --- a/src/summary/mod.rs +++ b/src/summary/mod.rs @@ -21,6 +21,7 @@ pub mod ssa_summary; use crate::labels::Cap; use crate::summary::ssa_summary::SsaFuncSummary; use crate::symbol::{FuncKey, FuncKind, Lang, normalize_namespace}; +use rustc_hash::FxHashMap; use serde::{Deserialize, Deserializer, Serialize}; use smallvec::SmallVec; use std::collections::{BTreeMap, HashMap}; @@ -517,15 +518,20 @@ impl<'a> CalleeQuery<'a> { /// for same-language resolution in the taint engine. #[derive(Default)] pub struct GlobalSummaries { - by_key: HashMap, + /// FxHashMap (rustc_hash) replaces stdlib SipHash. FuncKey carries 3 + /// String fields, so any HashMap operation walks ≥30 bytes through the + /// hasher; FxHash is ~5x faster than SipHash on this workload. Seed + /// is fixed (no DoS hardening), which is fine for an in-process index + /// keyed by static program-derived names. + by_key: FxHashMap, /// Bare leaf-name index, kept for compatibility with callers that only /// see an unqualified call string. A single name may map to many keys /// across containers / files / arities. 
- by_lang_name: HashMap<(Lang, String), Vec>, + by_lang_name: FxHashMap<(Lang, String), Vec>, /// Container-qualified index: keyed on `"{container}::{name}"` (or just /// `name` for free functions). Used to resolve calls when the call-site /// can supply a receiver / container hint (e.g. `OrderService::process`). - by_lang_qualified: HashMap<(Lang, String), Vec>, + by_lang_qualified: FxHashMap<(Lang, String), Vec>, /// Rust-only secondary index keyed on `(module_path, name)`. /// /// Populated whenever a Rust [`FuncSummary`] is inserted with a @@ -533,7 +539,7 @@ pub struct GlobalSummaries { /// candidates by their crate-relative module rather than their /// filesystem path. Same name / module / arity overloads land on the /// same vector, arity narrowing happens at resolution time. - by_rust_module: HashMap<(String, String), Vec>, + by_rust_module: FxHashMap<(String, String), Vec>, /// Precise SSA-derived per-parameter summaries, keyed by `FuncKey`. /// These take precedence over `FuncSummary` during callee resolution. ssa_by_key: HashMap, @@ -546,6 +552,18 @@ pub struct GlobalSummaries { /// pass 1 and consumed by /// [`crate::auth_analysis::run_auth_analysis`] during pass 2. auth_by_key: HashMap, + /// Per-Python-file router declarations + `include_router` edges, + /// keyed by `module_id_for_storage(file_path)` (basename without + /// `.py`, or `parent_dir::__init__` for `__init__.py`). Populated + /// during pass 1 and consumed by + /// [`Self::resolve_cross_file_router_deps`] at pass 2 entry to lift + /// FastAPI router-level `dependencies=[Security(...)]` declared in a + /// parent file (`__init__.py` calling + /// `.include_router(.router, ...)`) onto the bare + /// child router declared in another file — closing the airflow + /// execution-API auth-recognition gap on routes attached to bare + /// child routers. + router_facts_by_module: HashMap, /// Type hierarchy index for runtime virtual-dispatch fan-out. 
/// /// Installed by [`Self::install_hierarchy`] after pass 1 from the @@ -856,6 +874,11 @@ impl GlobalSummaries { for (key, auth_sum) in other.auth_by_key { self.auth_by_key.insert(key, auth_sum); } + // Router facts: last-writer-wins per (module_id) key. Re-analysing + // a file produces a fresh snapshot of its router declarations + edges. + for (module_id, facts) in other.router_facts_by_module { + self.router_facts_by_module.insert(module_id, facts); + } // Hierarchy index: invalidate after a merge so the next consumer // sees a freshly-built view that includes `other`'s edges. The // alternative, point-merging two indexes, is racy when the @@ -991,6 +1014,80 @@ impl GlobalSummaries { self.auth_by_key.len() } + /// Insert a per-file `PerFileRouterFacts` snapshot. Last-writer-wins + /// per `module_id` key — re-analysing a file produces a fresh + /// snapshot of its router declarations and `include_router` edges. + pub fn insert_router_facts( + &mut self, + module_id: String, + facts: crate::auth_analysis::router_facts::PerFileRouterFacts, + ) { + self.router_facts_by_module.insert(module_id, facts); + } + + /// Resolve cross-file router-level deps for the file identified by + /// `child_module_id`. Walks every other file's persisted + /// `RouterIncludeEdge` list, finds edges whose `child_module_id` + /// matches, and accumulates the parent file's + /// `local_router_deps[parent_var]` against `child_var` — producing + /// a ` → Vec<(CallSite, scoped_security)>` map ready to + /// merge into the active file's + /// `AuthorizationModel.cross_file_router_deps`. + /// + /// Single-hop only. Transitive lifts (`grandparent.include_router(parent); + /// parent.include_router(child)`) are not currently resolved — the + /// airflow shape that motivated this fix is single-hop, and adding + /// transitive resolution is a follow-up that would also need to + /// model the bare-identifier `outer.include_router(inner_router)` + /// case which the extractor presently skips. 
+ /// + /// Returns an empty map when `child_module_id` matches no edges or + /// when the index is empty. + pub fn resolve_cross_file_router_deps( + &self, + child_module_id: &str, + ) -> HashMap> { + let mut out: HashMap> = + HashMap::new(); + if self.router_facts_by_module.is_empty() { + return out; + } + for facts in self.router_facts_by_module.values() { + for edge in &facts.include_router_edges { + if edge.child_module_id != child_module_id { + continue; + } + // Look up the parent's deps in the SAME file's + // local_router_deps map (parent declarations and the + // include_router edge live in the same file). + let Some(parent_deps) = facts.local_router_deps.get(&edge.parent_var) else { + continue; + }; + if parent_deps.is_empty() { + continue; + } + let entry = out.entry(edge.child_var.clone()).or_default(); + for dep in parent_deps { + // Dedup by (callee name, scoped flag) so multiple + // parents declaring the same dep don't double-fire. + let already = entry + .iter() + .any(|(call, scoped)| call.name == dep.0.name && *scoped == dep.1); + if !already { + entry.push(dep.clone()); + } + } + } + } + out + } + + /// Count of files that contributed router facts to the index. + /// Exposed for `tracing::debug!` observability. + pub fn router_facts_len(&self) -> usize { + self.router_facts_by_module.len() + } + /// Insert a cross-file callee body. /// /// See [`insert_ssa`](Self::insert_ssa) for the identity-safety rule. @@ -1050,7 +1147,10 @@ impl GlobalSummaries { #[allow(dead_code)] // used by tests and future call-graph consumers pub fn is_empty(&self) -> bool { - self.by_key.is_empty() && self.ssa_by_key.is_empty() && self.auth_by_key.is_empty() + self.by_key.is_empty() + && self.ssa_by_key.is_empty() + && self.auth_by_key.is_empty() + && self.router_facts_by_module.is_empty() } /// Iterate over all (key, summary) pairs. 
@@ -1582,6 +1682,7 @@ impl std::fmt::Debug for GlobalSummaries { .field("ssa_len", &self.ssa_by_key.len()) .field("bodies_len", &self.bodies_by_key.len()) .field("auth_len", &self.auth_by_key.len()) + .field("router_facts_len", &self.router_facts_by_module.len()) .finish() } } diff --git a/src/summary/tests.rs b/src/summary/tests.rs index 5e6deceb..e03037a5 100644 --- a/src/summary/tests.rs +++ b/src/summary/tests.rs @@ -3851,6 +3851,126 @@ fn cross_file_devirt_does_not_union_unrelated_findbyids() { assert_eq!(cache_sum.tainted_sink_params, vec![0]); } +/// Cross-file router-dep resolution: parent `__init__.py` declares +/// `Security(...)` deps on a router and lifts them onto a child via +/// `.include_router(., ...)`. The +/// resolution must produce a ` → Vec<(CallSite, scoped)>` +/// map for the child file's `module_id`, and absent edges must yield +/// empty. +#[test] +fn resolve_cross_file_router_deps_lifts_parent_security_dep_onto_child_router() { + use crate::auth_analysis::model::CallSite; + use crate::auth_analysis::router_facts::{PerFileRouterFacts, RouterIncludeEdge}; + + let mut gs = GlobalSummaries::new(); + // Parent (__init__.py) declares scoped Security on `authenticated_router` + // and emits two include_router edges (task_instances + dag_runs). 
+ let parent_callsite = CallSite { + name: "require_auth".into(), + args: Vec::new(), + span: (0, 0), + args_value_refs: Vec::new(), + }; + let mut parent_facts = PerFileRouterFacts::default(); + parent_facts.local_router_deps.insert( + "authenticated_router".into(), + vec![(parent_callsite.clone(), true)], + ); + parent_facts.include_router_edges.push(RouterIncludeEdge { + parent_var: "authenticated_router".into(), + child_module_id: "task_instances".into(), + child_var: "router".into(), + }); + parent_facts.include_router_edges.push(RouterIncludeEdge { + parent_var: "authenticated_router".into(), + child_module_id: "dag_runs".into(), + child_var: "router".into(), + }); + gs.insert_router_facts("routes::__init__".into(), parent_facts); + + // Child (task_instances.py) declares a bare router → expects to + // inherit the parent's deps via the cross-file resolution. + gs.insert_router_facts("task_instances".into(), PerFileRouterFacts::default()); + + // Resolve for task_instances → should get one entry under `router` + // carrying the require_auth (scoped=true) dep. + let resolved = gs.resolve_cross_file_router_deps("task_instances"); + let deps = resolved.get("router").expect("router child resolved"); + assert_eq!(deps.len(), 1); + assert_eq!(deps[0].0.name, "require_auth"); + assert!(deps[0].1, "scoped flag preserved"); + + // dag_runs has the same parent → same lift. + let resolved_dag = gs.resolve_cross_file_router_deps("dag_runs"); + assert_eq!(resolved_dag.get("router").map(|v| v.len()), Some(1)); + + // Unrelated module → no lift. + let resolved_other = gs.resolve_cross_file_router_deps("nonexistent"); + assert!(resolved_other.is_empty()); +} + +/// Edge: parent without local deps for the named var emits nothing — +/// the resolver requires both an edge AND a non-empty parent dep list. 
+#[test] +fn resolve_cross_file_router_deps_skips_edges_with_no_parent_deps() { + use crate::auth_analysis::router_facts::{PerFileRouterFacts, RouterIncludeEdge}; + + let mut gs = GlobalSummaries::new(); + let mut parent = PerFileRouterFacts::default(); + parent.include_router_edges.push(RouterIncludeEdge { + parent_var: "ghost_router".into(), + child_module_id: "child".into(), + child_var: "router".into(), + }); + gs.insert_router_facts("parent".into(), parent); + + let resolved = gs.resolve_cross_file_router_deps("child"); + assert!(resolved.is_empty()); +} + +/// Multiple parents declaring different deps for the same child +/// accumulate without duplication. Same dep declared twice (one +/// from each parent) must dedup by (callee.name, scoped). +#[test] +fn resolve_cross_file_router_deps_dedups_duplicate_parent_deps() { + use crate::auth_analysis::model::CallSite; + use crate::auth_analysis::router_facts::{PerFileRouterFacts, RouterIncludeEdge}; + + let cs = CallSite { + name: "require_auth".into(), + args: Vec::new(), + span: (0, 0), + args_value_refs: Vec::new(), + }; + let mut gs = GlobalSummaries::new(); + + // Parent A: include_router(child.router) with `require_auth` dep. + let mut p_a = PerFileRouterFacts::default(); + p_a.local_router_deps + .insert("router_a".into(), vec![(cs.clone(), true)]); + p_a.include_router_edges.push(RouterIncludeEdge { + parent_var: "router_a".into(), + child_module_id: "child".into(), + child_var: "router".into(), + }); + gs.insert_router_facts("parent_a".into(), p_a); + + // Parent B: SAME dep, different parent file. 
+ let mut p_b = PerFileRouterFacts::default(); + p_b.local_router_deps + .insert("router_b".into(), vec![(cs, true)]); + p_b.include_router_edges.push(RouterIncludeEdge { + parent_var: "router_b".into(), + child_module_id: "child".into(), + child_var: "router".into(), + }); + gs.insert_router_facts("parent_b".into(), p_b); + + let resolved = gs.resolve_cross_file_router_deps("child"); + let deps = resolved.get("router").expect("router resolved"); + assert_eq!(deps.len(), 1, "duplicate (callee, scoped) deduplicated"); +} + // ── the analysis ──────────────────── // // `GlobalSummaries::resolve_callee_widened` is the runtime counterpart of diff --git a/src/taint/path_state.rs b/src/taint/path_state.rs index 2f6efe5a..fbf7087e 100644 --- a/src/taint/path_state.rs +++ b/src/taint/path_state.rs @@ -211,6 +211,41 @@ fn is_bounded_length_check(lower: &str) -> bool { false } +/// Normalise an identifier to its snake-case lowercase form so that +/// camelCase / PascalCase / SCREAMING variants line up against snake-cased +/// prefix lists (`is_safe`, `is_authorized`, `is_authenticated`). +/// +/// Underscore is inserted at every case boundary: +/// - lowercase/digit → uppercase (`isSafe` → `is_safe`) +/// - uppercase → uppercase-then-lowercase (`HTTPClient` → `http_client`) +/// +/// Inputs already in snake_case round-trip unchanged: `is_safe` → `is_safe`. +/// Used by `classify_condition` so a sanitiser predicate authored in any +/// of the dominant identifier conventions classifies the same. 
+pub(crate) fn to_snake_lower(s: &str) -> String { + let chars: Vec<char> = s.chars().collect(); + let mut out = String::with_capacity(chars.len() + 4); + for i in 0..chars.len() { + let c = chars[i]; + if c.is_ascii_uppercase() { + if i > 0 { + let prev = chars[i - 1]; + let next = chars.get(i + 1).copied(); + let between_camel = prev.is_ascii_lowercase() || prev.is_ascii_digit(); + let acronym_end = + prev.is_ascii_uppercase() && next.is_some_and(|n| n.is_ascii_lowercase()); + if (between_camel || acronym_end) && !out.ends_with('_') { + out.push('_'); + } + } + out.push(c.to_ascii_lowercase()); + } else { + out.push(c.to_ascii_lowercase()); + } + } + out +} + /// Parse a leading non-negative integer literal (decimal only). fn parse_leading_uint(s: &str) -> Option { let mut n: u64 = 0; @@ -384,13 +419,35 @@ pub fn classify_condition(text: &str) -> PredicateKind { .unwrap_or(callee_part) .trim(); + // Derive a snake-cased form from the **original** text so that + // camelCase identifiers (`isSafeRemoteUrl`, `isAuthorized`, + // `isValidUUID`) classify against the snake-cased prefix list + // (`is_safe`, `is_authorized`, `is_authenticated`) the same as + // `is_safe_remote_url` would. Required to recognise CVE-2026-33486 + // (roadiz/documents `isSafeRemoteUrl` SSRF sanitiser) as a + // ValidationCall on the patched fixture. Mirrors the trim/strip + // pipeline above on case-preserved text so the snake form lines up + // with `bare`.
+ let orig_trimmed = text.trim_start_matches(['(', '!', ' ', '\t']); + let orig_trimmed = orig_trimmed + .strip_prefix("not ") + .unwrap_or(orig_trimmed) + .trim(); + let orig_callee_part = orig_trimmed.split('(').next().unwrap_or(""); + let orig_bare = orig_callee_part + .rsplit(['.', ':']) + .next() + .unwrap_or(orig_callee_part) + .trim(); + let bare_snake = to_snake_lower(orig_bare); + // Validation if bare.contains("valid") || bare.contains("check") || bare.contains("verify") - || bare.starts_with("is_safe") - || bare.starts_with("is_authorized") - || bare.starts_with("is_authenticated") + || bare_snake.starts_with("is_safe") + || bare_snake.starts_with("is_authorized") + || bare_snake.starts_with("is_authenticated") { return PredicateKind::ValidationCall; } @@ -734,8 +791,12 @@ fn extract_validation_target(text: &str) -> Option<String> { // not corrupt the argument substring. let first_arg = first_call_arg(args_part)?; - // Strip reference operators (e.g. `&x` → `x`) + // Strip reference operators (e.g. `&x` → `x`) and PHP variable sigil + // (`$url` → `url`) so the extracted target lines up with the var-name + // form used in branch-narrowing. Mirrors the `$` strip already done by + // `extract_allowlist_target` for `in_array($cmd, $allowed)`. let first_arg = first_arg.strip_prefix('&').unwrap_or(first_arg).trim(); + let first_arg = first_arg.strip_prefix('$').unwrap_or(first_arg); if !first_arg.is_empty() && is_identifier(first_arg) { Some(first_arg.to_string()) @@ -991,6 +1052,63 @@ mod tests { ); } + #[test] + fn classify_camelcase_safety_validators_are_validation_call() { + // Real-CVE shape: roadiz/documents `isSafeRemoteUrl($url)` (CVE-2026-33486). + // Without snake-case normalisation, the bare `issaferemoteurl` would + // not match the `is_safe` prefix and the predicate would silently + // fall into `Comparison`/`Unknown`, leaving `$url` un-validated past + // the early-return.
+ assert_eq!( + classify_condition("self::isSafeRemoteUrl($url)"), + PredicateKind::ValidationCall + ); + assert_eq!( + classify_condition("isAuthorized(user)"), + PredicateKind::ValidationCall + ); + assert_eq!( + classify_condition("isAuthenticated(req)"), + PredicateKind::ValidationCall + ); + // Acronym handling: `isValidUUID` → `is_valid_uuid` → contains "valid". + assert_eq!( + classify_condition("isValidUUID(id)"), + PredicateKind::ValidationCall + ); + // Snake-case round-trips unchanged. + assert_eq!( + classify_condition("is_safe_remote_url(x)"), + PredicateKind::ValidationCall + ); + } + + #[test] + fn extract_validation_target_strips_php_dollar_sigil() { + // PHP `$url` strips the sigil so the extracted target lines up with + // the var-name form used in branch narrowing. Required for + // CVE-2026-33486 patched fixture to silence on `fopen($url, 'r')`. + assert_eq!( + extract_validation_target("self::isSafeRemoteUrl($url)"), + Some("url".to_string()) + ); + assert_eq!( + extract_validation_target("validate($input)"), + Some("input".to_string()) + ); + } + + #[test] + fn to_snake_lower_handles_common_variants() { + assert_eq!(to_snake_lower("isSafeRemoteUrl"), "is_safe_remote_url"); + assert_eq!(to_snake_lower("isValidUUID"), "is_valid_uuid"); + assert_eq!(to_snake_lower("HTTPClient"), "http_client"); + assert_eq!(to_snake_lower("IsSafe"), "is_safe"); + assert_eq!(to_snake_lower("is_safe"), "is_safe"); + assert_eq!(to_snake_lower("validate"), "validate"); + assert_eq!(to_snake_lower(""), ""); + } + #[test] fn classify_validation_requires_paren() { // `x_valid == true` should NOT be ValidationCall, no `(` call syntax. diff --git a/src/taint/ssa_transfer/mod.rs b/src/taint/ssa_transfer/mod.rs index 48ca0db5..c3b62771 100644 --- a/src/taint/ssa_transfer/mod.rs +++ b/src/taint/ssa_transfer/mod.rs @@ -1523,6 +1523,121 @@ fn apply_input_validator_branch_narrowing( } } +/// JS/TS Array-method validator-callback narrowing. 
+/// +/// `arr.filter(isSafeIdentifier)`, `arr.find(isValidId)`, and the +/// `findLast` variant are gating array methods whose return value is +/// composed of elements that passed the callback. When the callback +/// argument resolves to a name `classify_input_validator_callee` tags +/// as `BooleanTrueIsValid` (`isValid…`, `isSafe…`, `hasValid…` and +/// snake-case variants), every element of the result satisfies the +/// validator, so the call's downstream sinks see the same flow as +/// validated taint. +/// +/// The companion `if (isValidX(x)) use(x)` narrowing already exists in +/// [`apply_input_validator_branch_narrowing`]; this is the same idea +/// lifted to the call site for filter/find chains so taint stops at +/// the gate rather than leaking through subsequent +/// `Array[index]`/template/sink reads. +/// +/// Strict-additive: if the callback's name does not match the +/// validator pattern (anonymous arrow, opaque identifier, etc.), the +/// helper is a no-op and the existing default propagation runs +/// unchanged. +/// +/// Motivated by CVE-2026-42353 (i18next-http-middleware path +/// traversal): the patched fix is `languages.filter(utils.isSafeIdentifier)` +/// before forwarding `languages` into the backend connector, and the +/// dual deferred TS-side gap CVE-2026-25544 (Payload sqli). +fn try_array_method_validator_callback_narrowing( + inst: &SsaInst, + info: &NodeInfo, + callee: &str, + args: &[SmallVec<[SsaValue; 2]>], + return_bits: &mut Cap, + return_origins: &mut SmallVec<[TaintOrigin; 2]>, + state: &mut SsaTaintState, + transfer: &SsaTaintTransfer, + ssa: &SsaBody, +) -> bool { + if !matches!(transfer.lang, Lang::JavaScript | Lang::TypeScript) { + return false; + } + // Method-call shape: callee text contains a `.` and the trailing + // segment is one of the gating array methods. 
`findIndex` / + // `every` / `some` return scalar shapes (index, boolean) rather + // than a filtered collection so they are excluded — element-level + // validation does not apply to a numeric/boolean result. + let dot = match callee.rfind('.') { + Some(p) => p, + None => return false, + }; + let method = &callee[dot + 1..]; + if !matches!(method, "filter" | "find" | "findLast") { + return false; + } + // The first positional argument's callable name. Two channels: + // 1. `info.arg_callees` — populated by `extract_arg_callees` + // (`call_ident_of` walks call shapes inside the arg). Catches + // `arr.filter(cb())` and dotted-callback shapes where the + // tree-sitter node kind reaches `Kind::CallFn` or + // `Kind::CallMethod`. + // 2. SSA `value_defs[v].var_name` for the arg's first SSA value + // — covers the bare-identifier shape (`arr.filter(cb)`) + // where the AST node is a plain identifier and + // `extract_arg_callees` pushes `None` because there is no + // call to recurse into. This is the shape every patched + // CVE fix uses, so it is the dominant source of validator + // callbacks in real code. + let arg0 = match args.first() { + Some(a) => a, + None => return false, + }; + let cb_from_arg_callees = info.arg_callees.first().and_then(|s| s.as_deref()); + let cb_from_ssa = arg0.iter().find_map(|&v| { + ssa.value_defs + .get(v.0 as usize) + .and_then(|vd| vd.var_name.as_deref()) + }); + let cb_name = match cb_from_arg_callees.or(cb_from_ssa) { + Some(n) => n, + None => return false, + }; + if crate::ssa::type_facts::classify_input_validator_callee(cb_name) + != Some(InputValidatorPolarity::BooleanTrueIsValid) + { + return false; + } + + // Strip every cap from the return value: the returned array (or + // single found element) is composed exclusively of elements the + // recognised validator approved. 
`Cap::all()` is the conservative + // ceiling because the validator's body is opaque to this layer; a + // future extension could narrow caps by inspecting the body's + // rejection patterns. + *return_bits = Cap::empty(); + return_origins.clear(); + + // Mark the result's var_name as validated, mirroring the + // [`apply_input_validator_branch_narrowing`] insertion. Useful + // for direct same-name reads of the rebound array (`arr = + // arr.filter(p)` then `arr.length`) but does not propagate + // through Assigns to differently-named bindings (`const lng = + // arr[0]`); the `return_bits` strip above is what gates those + // downstream flows. + if let Some(name) = ssa + .value_defs + .get(inst.value.0 as usize) + .and_then(|vd| vd.var_name.as_deref()) + { + if let Some(sym) = transfer.interner.get(name) { + state.validated_must.insert(sym); + state.validated_may.insert(sym); + } + } + true +} + /// Find the latest reaching SSA definition for `var_name` at the end of /// `block`. Mirrors `crate::constraint::lower::resolve_single_var` but /// avoids the cross-module privacy leak: callers in this module need it @@ -4081,6 +4196,24 @@ pub(super) fn transfer_inst( } } + // Receiver-side validator strip. Some method-call validators + // raise on failure rather than transforming a return value, + // so the canonical `Sanitizer` mechanism (which clears the + // return) is the wrong shape. After the call returns, the + // *receiver* (and any args carrying the same equivalence + // class) is proven to satisfy the validated property. Strip + // the registered cap from receiver+args here so that + // `path.relative_to(base)` clears `Cap::FILE_IO` from + // `path` for downstream uses. Motivated by CVE-2024-23334 + // (aiohttp StaticResource symlink-bypass): the patched code + // calls `filepath.relative_to(self._directory)` inside a + // try/except and serves `filepath` afterwards. 
+ if let Some(cap) = + crate::labels::lookup_receiver_validator(transfer.lang.as_str(), callee) + { + strip_cap_from_call_args(args, receiver, state, cap); + } + // Alias-aware sanitization: propagate through must-aliased field paths if !sanitizer_bits.is_empty() { if let Some(aliases) = transfer.base_aliases { @@ -4444,6 +4577,28 @@ pub(super) fn transfer_inst( } } + // JS/TS array-method validator-callback narrowing. When a + // call shape matches `.filter()` + // (or `find` / `findLast`), strip the caps that flowed into + // `return_bits` from the receiver — the result holds only + // elements the validator approved. Strict-additive: the + // helper is a no-op when the callback name does not match + // the BooleanTrueIsValid bucket, leaving the default + // propagation result unchanged. See + // [`try_array_method_validator_callback_narrowing`] for the + // motivating CVE pair. + try_array_method_validator_callback_narrowing( + inst, + info, + callee, + args, + &mut return_bits, + &mut return_origins, + state, + transfer, + ssa, + ); + // Constructor cap narrowing: a `new X(...)` call returns an object // instance, not a string. Caps that name a string-shaped sink // pattern (path argument, format string, URL component, JSON diff --git a/src/taint/tests.rs b/src/taint/tests.rs index e8f192e4..847a5af4 100644 --- a/src/taint/tests.rs +++ b/src/taint/tests.rs @@ -6779,3 +6779,83 @@ const handler = (req, res) => { "expected taint flow via double-call chain rebinding; got 0 findings", ); } + +/// CVE-2026-42353 i18next-http-middleware: the patched fix wraps a +/// tainted array in `arr.filter(isSafeIdentifier)` before forwarding. +/// `try_array_method_validator_callback_narrowing` recognises the +/// `.filter()` shape on JS/TS and strips +/// the receiver-derived caps from the call result, so a downstream +/// `arr[0]` → template-literal → `fs.readFileSync` chain no longer +/// flags. 
The bare-identifier callback case is the dominant patched +/// shape — `extract_arg_callees` returns `None` for plain +/// identifiers (no inner call to recurse into), so the helper falls +/// back to the SSA value's `var_name` channel. +#[test] +fn cve_2026_42353_filter_isvalid_callback_strips_taint() { + let src = br#" +const fs = require('fs'); +function isSafeIdentifier(v) { + return typeof v === 'string' && v.indexOf('..') === -1 && v.indexOf('/') === -1; +} +function handler(req, res) { + let languages = req.query.lng ? req.query.lng.split(' ') : []; + languages = languages.filter(isSafeIdentifier); + const lng = languages[0]; + const filename = `/locales/${lng}.json`; + fs.readFileSync(filename); +} +"#; + let lang = tree_sitter::Language::from(tree_sitter_javascript::LANGUAGE); + let file_cfg = parse_lang(src, "javascript", lang); + let summaries = &file_cfg.summaries; + let findings = analyse_file( + &file_cfg, + summaries, + None, + Lang::JavaScript, + "test.js", + &[], + None, + ); + assert!( + findings.is_empty(), + "expected no taint flow when filtered through isSafeIdentifier; got {} findings", + findings.len(), + ); +} + +/// Negative regression for the array-method validator-callback gate: +/// the same shape WITHOUT the `filter(isSafe…)` step keeps the path +/// traversal flow alive end-to-end. Pins the precision claim — the +/// strip is element-of-array-after-filter scoped, not a wholesale +/// kill on any `.filter` call regardless of callback identity. +#[test] +fn cve_2026_42353_filter_without_validator_callback_preserves_taint() { + let src = br#" +const fs = require('fs'); +function pickFirst(v) { return true; } +function handler(req, res) { + let languages = req.query.lng ? 
req.query.lng.split(' ') : []; + languages = languages.filter(pickFirst); + const lng = languages[0]; + const filename = `/locales/${lng}.json`; + fs.readFileSync(filename); +} +"#; + let lang = tree_sitter::Language::from(tree_sitter_javascript::LANGUAGE); + let file_cfg = parse_lang(src, "javascript", lang); + let summaries = &file_cfg.summaries; + let findings = analyse_file( + &file_cfg, + summaries, + None, + Lang::JavaScript, + "test.js", + &[], + None, + ); + assert!( + !findings.is_empty(), + "expected taint flow via filter(pickFirst) — pickFirst is not a recognised validator and must not strip taint; got 0 findings", + ); +} diff --git a/src/utils/config.rs b/src/utils/config.rs index e712144b..0f3ac35c 100644 --- a/src/utils/config.rs +++ b/src/utils/config.rs @@ -544,6 +544,16 @@ pub struct AuthAnalysisConfig { /// not need an ownership check. Defaults are set per-language in /// `auth_analysis::config::build_auth_rules`. pub acl_tables: Vec, + /// Callee names that, when they appear as the chain root of a + /// chained-call shape (`select(X).filter_by(...)`, + /// `query(X).filter(...)`), anchor the trailing method as a DB + /// query-builder operation. Used to override the chained-call + /// suppression in `classify_sink_class` for SQLAlchemy / similar + /// query-builder idioms whose first call returns an opaque builder + /// object the type tracker cannot resolve. Defaults set per + /// language in `auth_analysis::config::build_auth_rules`. 
+ #[serde(default)] + pub db_query_builder_roots: Vec<String>, } impl Default for AuthAnalysisConfig { @@ -568,6 +578,7 @@ impl Default for AuthAnalysisConfig { outbound_network_receiver_prefixes: Vec::new(), cache_receiver_prefixes: Vec::new(), acl_tables: Vec::new(), + db_query_builder_roots: Vec::new(), } } } @@ -1158,6 +1169,10 @@ pub(crate) fn merge_configs(mut default: Config, user: Config) -> Config { user_lang_cfg.auth.cache_receiver_prefixes, ); extend_dedup(&mut entry.auth.acl_tables, user_lang_cfg.auth.acl_tables); + extend_dedup( + &mut entry.auth.db_query_builder_roots, + user_lang_cfg.auth.db_query_builder_roots, + ); } default diff --git a/tests/benchmark/RESULTS.md b/tests/benchmark/RESULTS.md index 97cad0b6..ef2c8623 100644 --- a/tests/benchmark/RESULTS.md +++ b/tests/benchmark/RESULTS.md @@ -22,6 +22,9 @@ Real disclosed CVEs reduced to minimal reproducers, vulnerable + patched pair pe | CVE-2017-18342 | Python | PyYAML | MIT | Deserialization | detected | | CVE-2025-69662 | Python | geopandas | BSD-3-Clause | SQL Injection | detected | | CVE-2026-33626 | Python | LMDeploy | Apache-2.0 | SSRF | detected | +| CVE-2024-23334 | Python | aiohttp | Apache-2.0 | path_traversal | detected | +| CVE-2023-6568 | Python | MLflow | Apache-2.0 | XSS | detected | +| CVE-2024-21513 | Python | LangChain Experimental | MIT | code_exec | detected | | CVE-2019-14939 | JavaScript | mongo-express | MIT | code_exec | detected | | CVE-2025-64430 | JavaScript | Parse Server | Apache-2.0 | SSRF | detected | | CVE-2023-22621 | JavaScript | Strapi | MIT | code_exec (SSTI)| detected | @@ -42,6 +45,7 @@ Real disclosed CVEs reduced to minimal reproducers, vulnerable + patched pair pe | CVE-2023-38337 | Ruby | rswag | MIT | path_traversal | detected | | CVE-2017-9841 | PHP | PHPUnit | BSD-3-Clause | code_exec | detected | | CVE-2018-15133 | PHP | Laravel | MIT | Deserialization | detected | +| CVE-2026-33486 | PHP | Roadiz CMS | MIT | SSRF | detected | | CVE-2018-20997 | Rust | tar-rs |
MIT OR Apache-2.0 | path_traversal | detected | | CVE-2022-36113 | Rust | cargo | MIT OR Apache-2.0 | path_traversal | detected | | CVE-2023-42456 | Rust | sudo-rs | Apache-2.0 | path_traversal | detected | @@ -49,10 +53,12 @@ Real disclosed CVEs reduced to minimal reproducers, vulnerable + patched pair pe | CVE-2024-32884 | Rust | gitoxide | Apache-2.0 OR MIT | CMDI | detected | | CVE-2025-53549 | Rust | matrix-rust-sdk | Apache-2.0 | SQL Injection | detected | | CVE-2016-3714 | C | ImageMagick (ImageTragick) | ImageMagick License | CMDI | detected | +| CVE-2017-1000117 | C | git (ssh:// argv injection)| GPL-2.0 | cmdi (argv-inj) | deferred | | CVE-2019-18634 | C | sudo (pwfeedback) | ISC | memory_safety | detected | | CVE-2019-13132 | C++ | ZeroMQ libzmq | MPL-2.0 | memory_safety | detected | | CVE-2022-1941 | C++ | Protocol Buffers | BSD-3-Clause | memory_safety | detected | | CVE-2026-25544 | TypeScript | Payload (Drizzle adapter) | MIT | sql_injection | deferred | +| CVE-2026-42353 | JavaScript | i18next-http-middleware | MIT | path_traversal | detected | Deferred entries are real bugs Nyx can't yet detect. The fixture stays committed with `disabled: true` in ground truth so the gap remains visible. @@ -77,6 +83,11 @@ Most recent first. Metrics are rule-level on the corpus size at that point. 
| Date | Change | Corpus | P | R | F1 | |------------|------------------------------------------------------------------------------|--------|-------|-------|-------| +| 2026-05-04 | C cvehunt session-0014: CVE-2017-1000117 (git ssh:// hostname-as-argv injection) added in corpus disabled — three-layer C engine gap: (a) array-element taint propagation through `args[i] = ssh_host;` writes, (b) missing `c.cmdi.exec*` AST patterns in `src/patterns/c.rs`, (c) sanitizer recognition of the upstream `if (ssh_host[0] == '-') die(...)` dash-prefix guard | 565 | 1.000 | 1.000 | 1.000 | +| 2026-05-04 | JS/TS array-method validator-callback narrowing (`try_array_method_validator_callback_narrowing` in `src/taint/ssa_transfer/mod.rs`) — `.filter()` / `.find` / `.findLast` strips `Cap::all()` from the call result when the callback resolves to a `BooleanTrueIsValid` validator; CVE-2026-42353 (i18next-http-middleware path traversal) re-enabled in ground truth, deferred queue cleared | 563 | 1.000 | 1.000 | 1.000 | +| 2026-05-04 | JS/TS ternary-RHS source-classification fix in `src/cfg/conditions.rs::lower_ternary_branch` (segment-strip first_member_label on the branch AST) — `let arr = cond ? 
req.query.lng : "";` now propagates taint through the diamond's join phi instead of lowering both branches to labelless Assign-with-empty-uses; CVE-2026-42353 (i18next-http-middleware path traversal / SSRF) added in corpus disabled — needs Array.prototype.filter(known_validator_callback) precision bridge | 561 | 1.000 | 1.000 | 1.000 | +| 2026-05-04 | PHP class-method body taint analysis (`declaration_list` / `interface_declaration` / `trait_declaration` / `enum_declaration` mapped to `Kind::Block` in `src/labels/php.rs`); PHP `unary_op_expression` recognised as negation in `detect_negation`; camelCase normalisation in `classify_condition` so `isSafeRemoteUrl(x)` classifies as ValidationCall the same as `is_safe_remote_url(x)`; PHP `$`-sigil stripping in `extract_validation_target`; `fopen` added as PHP SSRF sink; CVE-2026-33486 (roadiz/documents `DownloadedFile::fromUrl(file://)` SSRF/LFI) added | 555 | 1.000 | 1.000 | 1.000 | +| 2026-05-04 | Python Tier B `py.xss.make_response_format` AST pattern (Flask `make_response()` / `make_response()`); CVE-2023-6568 (mlflow reflected XSS) and CVE-2024-21513 (langchain VectorSQLDatabaseChain `_try_eval` over DB rows) added | 550 | 1.000 | 1.000 | 1.000 | | 2026-05-03 | Go for-range loop binding now defined from `range_clause` child of `for_statement` (was: tree-sitter wraps the binding/iterable on a child node; only direct `left`/`right` fields were consulted, so taint never reached the loop binding). gin sources extended to `c.QueryArray` / `c.GetQueryArray` / `c.PostFormArray` / `c.GetPostFormArray`. goqu raw SQL literal builders `goqu.L` / `goqu.Lit` recognised as SQL_QUERY sinks. CVE-2026-41422 (daptin aggregate API) detected | 521 | 1.000 | 1.000 | 1.000 | | 2026-05-02 | TS regex-allowlist `<*regex*>.test(value)` / `<*pattern*>.test(value)` recognised as ValidationCall whose target is the first arg (overrides default receiver-as-target); conservative on receiver names so non-regex `*.test()` callees stay Unknown. 
CVE-2026-25544 (Payload drizzle SQL injection) lands in corpus disabled — needs validated-flow propagation through SSA derivation / helper-summary returns | 499 | 1.000 | 1.000 | 1.000 | | 2026-05-02 | JS arrow `assignment_pattern` default-param extraction + JS object-literal kwarg fallback for gated sinks + double-call (`f()(x)`) chained-inner rebinding; lodash `_.template` modeled as gated CODE_EXEC sink suppressed by `{ evaluate: false }`; CVE-2023-22621 (Strapi SSTI) detected | 494 | — | — | — | diff --git a/tests/benchmark/corpus/go/auth/vuln_apicontext_findbyid.go b/tests/benchmark/corpus/go/auth/vuln_apicontext_findbyid.go index a9f5c92f..cd82f8de 100644 --- a/tests/benchmark/corpus/go/auth/vuln_apicontext_findbyid.go +++ b/tests/benchmark/corpus/go/auth/vuln_apicontext_findbyid.go @@ -1,15 +1,18 @@ package main -// Real-repo precision (2026-05-03): recall guard for the 2026-05-03 -// type-aware Go param filter. +// Real-repo precision (2026-05-03): recall guard for the type-aware Go +// param filter (2026-05-03 + 2026-05-03 expansion). // -// Even after `ctx context.Context` is dropped from `unit.params`, an -// id-shaped param (`id string`) keeps the unit on the hook ─ -// `is_external_input_param_name` recognises id-shapes ahead of the -// framework-name allow-list. This fixture asserts that the type-aware -// filter doesn't over-suppress: a helper that takes the canonical -// `(ctx, id)` shape and consumes `id` at a bare-receiver data-layer -// sink must still fire `go.auth.missing_ownership_check`. +// 2026-05-03 update: the engine now drops id-like scalar params from +// `unit.params` for non-route units (gitea `models/...` DAO cluster, +// ~957 FPs). This fixture asserts that the route-aware path keeps +// firing on the real vulnerable shape: a gin route handler whose body +// passes an id-shaped path param straight into a bare-receiver +// data-layer call with no preceding ownership check. 
+// +// `function_params_route_handler` runs with `include_id_like_typed = +// true`, so even after the DAO-shape filter the id-like scalar param +// survives in `unit.params` for `RouteHandler` units, the rule fires. import "context" @@ -18,10 +21,16 @@ type Repo struct{} func (r *Repo) Find(id string) interface{} { return nil } func (r *Repo) Save(id string, val string) {} +type ginEngine struct{} + +func (g *ginEngine) GET(path string, handler interface{}) {} +func (g *ginEngine) POST(path string, handler interface{}) {} + // `ctx context.Context` is dropped by the type-aware Go param filter -// (stdlib non-user-input). `id string` survives ─ id-shape opens the -// gate. `repo.Find(id)` is a bare-identifier read indicator with no -// preceding ownership check. Rule must fire. +// (stdlib non-user-input). `id string` survives because the gin +// extractor promotes this unit to `RouteHandler` and route-aware param +// extraction keeps id-like names. `repo.Find(id)` is a bare-identifier +// read indicator with no preceding ownership check — rule fires. func GetByID(ctx context.Context, repo *Repo, id string) interface{} { _ = ctx return repo.Find(id) @@ -32,3 +41,9 @@ func UpdateByID(ctx context.Context, repo *Repo, id string, val string) { _ = ctx repo.Save(id, val) } + +// Gin route binding promotes both handlers to `RouteHandler` kind. +func registerRoutes(r *ginEngine) { + r.GET("/items/:id", GetByID) + r.POST("/items/:id", UpdateByID) +} diff --git a/tests/benchmark/corpus/go/auth/vuln_repo_findbyid_no_auth.go b/tests/benchmark/corpus/go/auth/vuln_repo_findbyid_no_auth.go index 6dbaaada..e040454d 100644 --- a/tests/benchmark/corpus/go/auth/vuln_repo_findbyid_no_auth.go +++ b/tests/benchmark/corpus/go/auth/vuln_repo_findbyid_no_auth.go @@ -10,12 +10,30 @@ package main // remain canonical data-layer sinks and must continue to fire // `go.auth.missing_ownership_check` when invoked with a scoped // identifier (`id` parameter) without a preceding ownership check. 
+// +// 2026-05-03 update: previously the helper signature alone +// (`func GetByID(ctx, repo, id string)`) was the recall guard. After +// the Go DAO-helper precision pass (id-like scalar params dropped from +// `unit.params` for non-route units) the helper-only shape no longer +// passes `unit_has_user_input_evidence` — which is correct, the gitea +// `models/...` cluster proved that internal DAO helpers should not +// flag. This fixture is now a real route-handler shape: the gin +// extractor recognises `r.GET(..., GetByID)` as a route registration, +// promotes the unit to `RouteHandler`, and `function_params_route_handler` +// keeps the id-like scalar param so the rule still fires on the actual +// vulnerable form (HTTP route binding directly to a DAO call with no +// preceding auth check). type Repo struct{} func (r *Repo) Find(id string) interface{} { return nil } func (r *Repo) Save(id string, val string) {} +type ginEngine struct{} + +func (g *ginEngine) GET(path string, handler interface{}) {} +func (g *ginEngine) POST(path string, handler interface{}) {} + // `repo.Find(id)` — bare-identifier receiver, name matches the `Find` // read indicator. Still classifies as `DbCrossTenantRead` and still // fires the ownership check because no auth check precedes it. @@ -27,3 +45,13 @@ func GetByID(ctx interface{}, repo *Repo, id string) interface{} { func UpdateByID(ctx interface{}, repo *Repo, id string, val string) { repo.Save(id, val) } + +// Route registration: gin extractor recognises `r.GET(...)` / +// `r.POST(...)`, attaches `GetByID` / `UpdateByID` as the route +// handlers, and promotes their units to `AnalysisUnitKind::RouteHandler`. +// The id-like scalar param `id string` survives into `unit.params` via +// `function_params_route_handler` (route-aware, `include_id_like_typed = true`). 
+func registerRoutes(r *ginEngine) { + r.GET("/items/:id", GetByID) + r.POST("/items/:id", UpdateByID) +} diff --git a/tests/benchmark/corpus/go/safe/safe_dao_helper_id_scalar.go b/tests/benchmark/corpus/go/safe/safe_dao_helper_id_scalar.go new file mode 100644 index 00000000..69fd9991 --- /dev/null +++ b/tests/benchmark/corpus/go/safe/safe_dao_helper_id_scalar.go @@ -0,0 +1,87 @@ +package main + +// Real-repo precision (2026-05-03): distilled from +// /Users/elipeter/oss/gitea/models/actions/{run,run_job,runner,artifact, +// run_attempt,task,variable}.go and ~957 sibling helpers across gitea's +// `models/...` data-access layer. Same shape over-fires on minio's +// `cmd/iam-*-store` and is the canonical Go ORM/DAO helper signature. +// +// Pattern: a model-layer helper takes the canonical Go first-param +// `ctx context.Context` (stdlib cancellation / deadline / value-bag, +// NOT an HTTP request) plus one or more id-like scalar parameters +// (`repoID, runID int64`, `id int64`, …). The helper itself is +// **never** registered as a route handler — gitea's HTTP routes live +// in `routers/`, and the bound route handler runs the auth check +// before calling into `models/`. The DAO helper inherits trust from +// its single caller surface and must not flag +// `go.auth.missing_ownership_check`. +// +// Engine fix (2026-05-03, src/auth_analysis/extract/common.rs:: +// collect_param_names Go arm): for non-route units (default +// `include_id_like_typed = false`), drop id-like param names whose +// declared type is a bounded primitive scalar (`int*` / `uint*` / +// `string` / `bool` / `byte` / `rune` / `float*`). 
Real Go HTTP +// handlers always carry a framework-request-typed param +// (`*http.Request`, `*gin.Context`, `echo.Context`, `*fiber.Ctx`, +// `*context.APIContext`, …) and are recognised by the per-framework +// route extractors which call `function_params_route_handler` +// (`include_id_like_typed = true`) — those bypass the filter so id-shaped +// path params survive on real routes (see +// `auth/vuln_apicontext_findbyid.go` and +// `auth/vuln_repo_findbyid_no_auth.go` for the recall guards). +// +// Conservative scope: only **bounded primitive scalar** types trigger +// the drop. Pointer types (`*Runner`), struct-by-value, slice (`[]T`), +// generic and qualified types are payload shapes whose injection +// surface is unknown — id-like names on those keep their place in +// `unit.params`. + +import "context" + +type ActionRun struct{ ID int64 } +type ActionRunJob struct{ ID int64 } +type ActionRunner struct{ ID int64 } + +type modelDB struct{} + +func (m *modelDB) Find(ctx context.Context, id int64) interface{} { return nil } +func (m *modelDB) DeleteByID(ctx context.Context, id int64) error { return nil } +func (m *modelDB) UpdateRunJob(ctx context.Context, j *ActionRunJob) {} + +var db = &modelDB{} + +// `(ctx context.Context, repoID, runID int64)` — multi-name single-type +// declaration with all bounded scalar params. After the fix: +// `unit.params` is empty; `unit_has_user_input_evidence` returns false; +// `check_ownership_gaps` skips the unit entirely. +func GetRunByRepoAndID(ctx context.Context, repoID, runID int64) (*ActionRun, error) { + _ = db.Find(ctx, runID) + _ = repoID + return &ActionRun{ID: runID}, nil +} + +// Single id-like scalar param. Same DAO-helper shape, must not flag +// even though `db.DeleteByID` and `GetRunnerByID` both look like +// canonical mutation/read indicators. 
+func DeleteRunner(ctx context.Context, id int64) error { + if _, err := GetRunnerByID(ctx, id); err != nil { + return err + } + return db.DeleteByID(ctx, id) +} + +func GetRunnerByID(ctx context.Context, id int64) (*ActionRunner, error) { + _ = db.Find(ctx, id) + return &ActionRunner{ID: id}, nil +} + +// Mixed-arity helper: `userID int64` (id-like + scalar, dropped) plus +// `cfg *ActionRun` (non-scalar payload, kept). `cfg` is not id-like +// and doesn't match the Go-narrowed framework-name allow-list, so the +// unit still has no evidence and the rule does not flag. +func SetOwnerActionsConfig(ctx context.Context, userID int64, cfg *ActionRun) error { + _ = userID + _ = cfg + _ = ctx + return nil +} diff --git a/tests/benchmark/corpus/java/safe/SafeJpaCriteriaQuery.java b/tests/benchmark/corpus/java/safe/SafeJpaCriteriaQuery.java new file mode 100644 index 00000000..47b17234 --- /dev/null +++ b/tests/benchmark/corpus/java/safe/SafeJpaCriteriaQuery.java @@ -0,0 +1,47 @@ +import jakarta.persistence.EntityManager; +import jakarta.persistence.criteria.CriteriaBuilder; +import jakarta.persistence.criteria.CriteriaQuery; +import jakarta.persistence.criteria.Root; +import org.hibernate.Session; +import java.util.List; + +// Distilled from openmrs's +// api/src/main/java/org/openmrs/api/db/hibernate/HibernateCohortDAO.java +// (`getCohorts` / `getCohort`) and HibernateAdministrationDAO. The JPA +// CriteriaBuilder pattern builds a structural `CriteriaQuery` via +// `cb.createQuery(Foo.class)` plus `Root` / `Predicate` / `cb.equal` / +// `cb.like` etc., then hands the structural query object to +// `session.createQuery(cq)` / `em.createQuery(cq)` for execution. No +// string concatenation occurs — JPA emits parameterized SQL by +// construction. 
Engine must propagate the +// `TypeKind::JpaCriteriaQuery` fact through the `cb.createQuery` +// receiver-text recogniser, then suppress the structural +// `cfg-unguarded-sink` finding at the `session.createQuery(cq)` / +// `em.createQuery(cq)` site via `sink_args_jpa_criteria_query_safe`. +public class SafeJpaCriteriaQuery { + private final Session session; + private final EntityManager em; + + public SafeJpaCriteriaQuery(Session session, EntityManager em) { + this.session = session; + this.em = em; + } + + public List<Cohort> getCohorts(String nameFragment) { + CriteriaBuilder cb = session.getCriteriaBuilder(); + CriteriaQuery<Cohort> cq = cb.createQuery(Cohort.class); + Root<Cohort> root = cq.from(Cohort.class); + cq.where(cb.like(cb.lower(root.get("name")), nameFragment)); + return session.createQuery(cq).getResultList(); + } + + public Cohort getCohortByName(String name) { + CriteriaBuilder cb = em.getCriteriaBuilder(); + CriteriaQuery<Cohort> cq = cb.createQuery(Cohort.class); + Root<Cohort> root = cq.from(Cohort.class); + cq.where(cb.equal(root.get("name"), name)); + return em.createQuery(cq).getSingleResult(); + } + + public static class Cohort {} +} diff --git a/tests/benchmark/corpus/javascript/path_traversal/path_traversal_ternary_source.js b/tests/benchmark/corpus/javascript/path_traversal/path_traversal_ternary_source.js new file mode 100644 index 00000000..fed859f2 --- /dev/null +++ b/tests/benchmark/corpus/javascript/path_traversal/path_traversal_ternary_source.js @@ -0,0 +1,17 @@ +// Regression guard for the ternary-RHS source-classification fix in +// `src/cfg/conditions.rs::lower_ternary_branch`. Pre-fix, push_node only +// did suffix/prefix matching on the branch text, so `req.query.lng` did +// not classify as a Source (rule matcher is `req.query`, neither matches +// `req.query.lng`). Both ternary branches lowered to labelless +// Assign-with-empty-uses, the join phi saw no taint, and downstream sinks +// missed the flow.
Motivated by GHSA-jfgf-83c5-2c4m / CVE-2026-42353 +// (i18next-http-middleware path traversal / SSRF via user-controlled +// language and namespace parameters). +const fs = require('fs'); +const express = require('express'); +const app = express(); + +app.get('/locales/resources.json', (req, res) => { + let lng = req.query.lng ? req.query.lng : 'en'; + fs.readFileSync(`/locales/${lng}/common.json`); +}); diff --git a/tests/benchmark/corpus/javascript/safe/safe_ternary_const_branches.js b/tests/benchmark/corpus/javascript/safe/safe_ternary_const_branches.js new file mode 100644 index 00000000..ff31c88a --- /dev/null +++ b/tests/benchmark/corpus/javascript/safe/safe_ternary_const_branches.js @@ -0,0 +1,13 @@ +// Companion precision guard to path_traversal_ternary_source.js. When +// both ternary branches are constant strings, the segment-strip +// classifier in `lower_ternary_branch` should not synthesise a Source +// label, so the assigned variable carries no taint and the downstream +// sink does not fire. +const fs = require('fs'); +const express = require('express'); +const app = express(); + +app.get('/page', (req, res) => { + const tier = req.query.premium ? 'premium' : 'standard'; + fs.readFileSync(`/static/${tier}/index.html`); +}); diff --git a/tests/benchmark/corpus/php/deser/deser_unserialize_method_named_unserialize_with_user_input.php b/tests/benchmark/corpus/php/deser/deser_unserialize_method_named_unserialize_with_user_input.php new file mode 100644 index 00000000..49ce03e8 --- /dev/null +++ b/tests/benchmark/corpus/php/deser/deser_unserialize_method_named_unserialize_with_user_input.php @@ -0,0 +1,25 @@ +payload = unserialize($_GET['blob']); + } +} + +class WrappedThenUnserialize { + // Wrapped argument inside magic method — conservative: still fires. + // Real-world cache / session pass-throughs surface here so the rule + // keeps its signal on `unserialize(trim($input))` / + // `unserialize(base64_decode($input))` shapes. 
+ public function unserialize($input): void { + $this->payload = unserialize(trim($input)); + } +} diff --git a/tests/benchmark/corpus/php/safe/safe_camelcase_validator_negated.php b/tests/benchmark/corpus/php/safe/safe_camelcase_validator_negated.php new file mode 100644 index 00000000..7263b11c --- /dev/null +++ b/tests/benchmark/corpus/php/safe/safe_camelcase_validator_negated.php @@ -0,0 +1,30 @@ +data = unserialize($serialized); + } +} + +class CliInput implements \Serializable { + public string $executable = ''; + public array $args = []; + public array $options = []; + + public function unserialize($input): void { + [$this->executable, $this->args, $this->options] = unserialize($input); + } +} + +class CaseFolded implements \Serializable { + private mixed $payload = null; + + public function UnSerialize($payload) { + $this->payload = unserialize($payload); + } +} diff --git a/tests/benchmark/corpus/php/ssrf/ssrf_class_method_fopen.php b/tests/benchmark/corpus/php/ssrf/ssrf_class_method_fopen.php new file mode 100644 index 00000000..20000715 --- /dev/null +++ b/tests/benchmark/corpus/php/ssrf/ssrf_class_method_fopen.php @@ -0,0 +1,17 @@ + None: + if payload.get("kind") == "reschedule": + session.add({"id": task_instance_id, "data": payload}) + + +@ti_id_router.patch("/{task_instance_id}/state") +def ti_update_state( + task_instance_id: UUID, + payload: Annotated[dict, Body()], + session, +) -> None: + _create_state_update( + task_instance_id=task_instance_id, + payload=payload, + session=session, + ) diff --git a/tests/benchmark/corpus/python/auth/vuln_fastapi_route_no_dependencies.py b/tests/benchmark/corpus/python/auth/vuln_fastapi_route_no_dependencies.py index 61c41837..9a358cee 100644 --- a/tests/benchmark/corpus/python/auth/vuln_fastapi_route_no_dependencies.py +++ b/tests/benchmark/corpus/python/auth/vuln_fastapi_route_no_dependencies.py @@ -1,19 +1,26 @@ """ Vulnerable counterpart to safe_fastapi_route_dependencies_auth.py: same -shape but with NO 
`dependencies=[Depends(...)]` keyword arg on the route -decorator. The FastAPI ownership-check rule must still fire — the -recognizer must not blanket-suppress every FastAPI route, only those -with an actual dependency-injected auth check. +FastAPI route shape but with NO `dependencies=[Depends(...)]` keyword +arg on the route decorator. The ownership-check rule must still fire +— the dependency-injection recogniser must not blanket-suppress every +FastAPI route, only those with an actual dependency-injected auth +check. + +Sink uses a qualified Django-style ORM call so the post-fix +classifier still recognises it (`receiver_is_simple_chain` requires a +non-chained receiver dot). """ from fastapi import FastAPI router = FastAPI() +class Connection: + objects = None + + @router.delete("/{connection_id}") -def delete_connection(connection_id: str, session): +def delete_connection(connection_id: str): """No auth — must still fire missing_ownership_check.""" - connection = session.scalar(select(Connection).filter_by(conn_id=connection_id)) - if connection is None: - raise HTTPException(404, "not found") - session.delete(connection) + Connection.objects.filter(id=connection_id).delete() + return {"ok": True} diff --git a/tests/benchmark/corpus/python/auth/vuln_fastapi_route_no_dependencies_sqla.py b/tests/benchmark/corpus/python/auth/vuln_fastapi_route_no_dependencies_sqla.py new file mode 100644 index 00000000..bd01fcc9 --- /dev/null +++ b/tests/benchmark/corpus/python/auth/vuln_fastapi_route_no_dependencies_sqla.py @@ -0,0 +1,27 @@ +"""SQLAlchemy variant of vuln_fastapi_route_no_dependencies.py: same FastAPI +route shape with NO `dependencies=[Depends(...)]` keyword arg, but the sink +is a real-world airflow-style SQLAlchemy queryset chain +`session.scalar(select(C).filter_by(conn_id=user_input))`. + +Pre-fix the chain reduced to bare `["filter_by"]` and was suppressed by +`receiver_is_simple_chain`, blocking recall on this real-repo airflow shape. 
+The member_chain Python `function`-field traversal + `db_query_builder_roots` +extension restores recall. + +Recall guard: ownership-check rule must fire on the chained query — the +caller has no auth check. +""" +from fastapi import FastAPI +from sqlalchemy import select + +router = FastAPI() + + +class Connection: + pass + + +@router.delete("/{connection_id}") +def delete_connection(connection_id: str, session): + """No auth — must fire missing_ownership_check on the chained query.""" + return session.scalar(select(Connection).filter_by(conn_id=connection_id)) diff --git a/tests/benchmark/corpus/python/auth/vuln_fastapi_route_security_no_scopes.py b/tests/benchmark/corpus/python/auth/vuln_fastapi_route_security_no_scopes.py new file mode 100644 index 00000000..c3fa1906 --- /dev/null +++ b/tests/benchmark/corpus/python/auth/vuln_fastapi_route_security_no_scopes.py @@ -0,0 +1,38 @@ +"""Recall counterpart to safe_fastapi_route_security_scopes.py. + +Precision guard for the Security-without-scopes path: a bare +`Security(callable)` with no `scopes=[...]` kwarg, or with an empty +`scopes=[]`, is NOT promoted from LoginGuard to AuthorizationCheck — +the OAuth2 scope semantic only fires when scopes is non-empty. Without +scope enforcement the wrapper is functionally equivalent to +`Depends(callable)` plus a bare login check, so `missing_ownership_check` +must still fire on a downstream id-targeted ORM filter. + +Recall guard: ownership-check rule must fire — Security with no scopes +is conservative (treated as login-only), so the route is not promoted +to authorized. 
+""" +from fastapi import FastAPI, Security + + +def require_auth(): + pass + + +router = FastAPI() + + +class TaskInstance: + pass + + +@router.patch( + "/{task_instance_id}/run", + dependencies=[Security(require_auth, scopes=[])], +) +def ti_run(task_instance_id: str, session): + return session.scalar(select(TaskInstance).filter_by(id=task_instance_id)) + + +def select(_): + pass diff --git a/tests/benchmark/corpus/python/auth/vuln_fastapi_router_no_dependencies.py b/tests/benchmark/corpus/python/auth/vuln_fastapi_router_no_dependencies.py new file mode 100644 index 00000000..ad77eefe --- /dev/null +++ b/tests/benchmark/corpus/python/auth/vuln_fastapi_router_no_dependencies.py @@ -0,0 +1,46 @@ +"""Recall guard for the router-level Security-prop fix. When a router +is declared with NO `dependencies=` kwarg (`router = APIRouter(...)`), +attached routes that don't supply inline deps are genuinely +unauthorized — the engine must still flag id-targeted writes as +`missing_ownership_check`. Without the gate the router-level extractor +would over-fire by treating every router as auth-providing. + +Distilled from airflow +`task_instances.py:1036-1082` where `router = VersionedAPIRouter()` +(bare, no deps) attaches `@router.get("/states", ...)` — the route is +auth-attached only via the cross-file `include_router` chain in +`routes/__init__.py`, which is a separate gap (see deep_engine_fixes.md). +For the per-file case where the router has no router-level deps +declared, the route is correctly an un-guarded ownership-check FN. +""" +from cadwyn import VersionedAPIRouter + + +# Bare router — no router-level dependencies declared. 
+router = VersionedAPIRouter() + + +class TaskInstance: + pass + + +@router.get("/states/{run_id}/{task_id}") +def get_task_instance_states(run_id: str, task_id: str, session): + rows = session.scalars( + select(TaskInstance) + .where(TaskInstance.run_id == run_id) + .where(TaskInstance.task_id == task_id) + ).all() + [ + run_id_task_state_map[task.run_id].update( + {task.task_id: task.state} + ) + for task in rows + ] + + +def select(_): + pass + + +run_id_task_state_map = {} diff --git a/tests/benchmark/corpus/python/auth/vuln_local_set_with_user_id_query.py b/tests/benchmark/corpus/python/auth/vuln_local_set_with_user_id_query.py new file mode 100644 index 00000000..21775994 --- /dev/null +++ b/tests/benchmark/corpus/python/auth/vuln_local_set_with_user_id_query.py @@ -0,0 +1,28 @@ +# py-auth-realrepo-XXX (vuln pair): same bare-`set()` / `dict()` / +# `defaultdict()` local collection shape as +# safe_local_set_update_no_orm.py, but the helper *also* runs an +# id-targeted ORM query whose filter argument is a user-supplied id +# (`team_id` in the function signature, no caller-scope-entity +# exemption applies). +# +# Recall guard: the bare-callee constructor recogniser must only +# suppress the InMemoryLocal `.update` / `.add` calls — the +# id-targeted ORM `.filter(id=team_id)` must still fire +# `py.auth.missing_ownership_check`. 
+class Team: + pass + + +def get_team_with_history(request, team_id): + seen_ids = set() + audit = dict() + seen_ids.add(team_id) + audit["team"] = team_id + + return Team.objects.filter(id=team_id).first() + + +def archive_team(request, team_id): + pending = set() + pending.add(team_id) + Team.objects.filter(id=team_id).delete() diff --git a/tests/benchmark/corpus/python/path_traversal/path_traversal_no_relative_to.py b/tests/benchmark/corpus/python/path_traversal/path_traversal_no_relative_to.py new file mode 100644 index 00000000..d659cf37 --- /dev/null +++ b/tests/benchmark/corpus/python/path_traversal/path_traversal_no_relative_to.py @@ -0,0 +1,13 @@ +# py-path-traversal-no-relative-to: regression guard companion to +# safe_relative_to_validator.py. Same source/sink shape but without +# the `filepath.relative_to(base)` validator — taint must propagate. +from pathlib import Path + +from flask import request, send_file + + +def download() -> None: + base = Path("/var/www/static") + rel_url = request.args.get("path") + filepath = base.joinpath(rel_url).resolve() + send_file(str(filepath)) diff --git a/tests/benchmark/corpus/python/safe/safe_bare_callee_no_receiver.py b/tests/benchmark/corpus/python/safe/safe_bare_callee_no_receiver.py new file mode 100644 index 00000000..2b79b107 --- /dev/null +++ b/tests/benchmark/corpus/python/safe/safe_bare_callee_no_receiver.py @@ -0,0 +1,76 @@ +# py-auth-realrepo-011: bare-identifier callees without a receiver dot +# are never DB / ORM operations. Distilled from sentry +# `src/sentry/tasks/statistical_detectors.py` (line 743: +# `org_ids = list({p.organization_id for p in projects})`), +# `src/sentry/utils/query.py:90` (`events = list(method(...))`), +# `src/sentry/api/helpers/group_index/delete.py` (bare `delete_group_list`, +# `create_audit_entry`, `create_audit_entries` helper calls), and +# `src/sentry/seer/autofix/coding_agent.py` (bare `update_coding_agent_state`). 
+# +# Before the fix, the verb-name fallback in `classify_sink_class` +# matched bare callees `list`, `filter`, `update`, `create`, `add`, +# `delete` against the Python read/mutation indicator vocabulary and +# classified them as `DbCrossTenantRead` / `DbMutation`. Combined with +# the user-input-evidence precondition (`request: Request` triggers it), +# every internal helper firing one of these builtins / locally-defined +# helpers produced a `py.auth.missing_ownership_check` finding. +# +# A real ORM / DB call always carries a receiver +# (`Model.objects.filter(...)`, `repo.find(id)`, `db.query(...)`); a +# bare-identifier callee is a Python builtin or a locally-defined +# helper, neither of which has the cross-tenant read / mutation +# semantics the rule is checking for. The fix gates the verb fallback +# on `receiver_is_simple_chain(callee)` (callee contains a dot AND the +# receiver isn't itself a call expression). +from typing import Any, Iterable + + +def fetch_continuous_examples(raw_examples): + # `list(...)` is a Python builtin — no DB op happens here. + project_ids = list({pid for pid, _ in raw_examples.keys()}) + return project_ids + + +def detect_function_change_points(projects, start, transactions_per_project=None): + # Bare `list({...})` set-comprehension materialisation; both args + # come from internally-supplied `projects` collection iteration. + org_ids = list({p.organization_id for p in projects}) + project_ids = list({p.id for p in projects}) + return org_ids, project_ids + + +def delete_group_list(request, project, group_list, delete_type): + # Bare-name local helper invocation — `create_audit_entries` is a + # function defined in the same module, not a DB write. Used to + # fire `py.auth.missing_ownership_check`. 
+ transaction_id = "tx" + create_audit_entries(request, project, group_list, delete_type, transaction_id) + + +def create_audit_entries(request, project, group_list, delete_type, transaction_id): + for group in group_list: + # Bare `create_audit_entry` is a helper, not a DB INSERT. + create_audit_entry( + request=request, + target_object=group.id, + event="ISSUE_DELETE", + data={"issue_id": group.id, "project_slug": project.slug}, + ) + + +def create_audit_entry(**kwargs): + pass + + +def update_coding_agent_state(state, action): + # Bare `update_*` helper called inside an outer task — Python lets + # you name local helpers freely, the verb prefix does not imply a + # DB mutation. + pass + + +def materialise_filter_chain(events: Iterable[Any]): + # `filter(...)` is the Python builtin (`builtins.filter`), and the + # bare-name local helper pattern below is endemic in real-repo + # Python code. + return list(filter(lambda e: e is not None, events)) diff --git a/tests/benchmark/corpus/python/safe/safe_caller_scope_helper_under_authorized_route.py b/tests/benchmark/corpus/python/safe/safe_caller_scope_helper_under_authorized_route.py new file mode 100644 index 00000000..4b2bd860 --- /dev/null +++ b/tests/benchmark/corpus/python/safe/safe_caller_scope_helper_under_authorized_route.py @@ -0,0 +1,63 @@ +"""Distilled from airflow +`airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py:516-628`: +The route handler `ti_update_state` is route-level authorized via the +`ti_id_router = VersionedAPIRouter(dependencies=[Security(require_auth, +scopes=["ti:self"])])` declaration (closed by the session-0010 fix). +The handler then delegates the actual `session.add(TaskReschedule(...))` +sink to a private helper `_create_ti_state_update_query_and_update_state` +that has no inline auth check of its own.
+ +Pre-fix the helper fired `missing_ownership_check` + +`token_override_without_validation` at the helper's body sink because +`check_ownership_gaps` is scoped per AnalysisUnit — the caller's +route-level auth check did not propagate to the callee. + +The Phase 1 caller-scope IPA fix (`apply_caller_scope_propagation` in +`src/auth_analysis/mod.rs`) walks the call graph DOWN: when every +in-file caller of a helper carries route-level non-Login auth +(Other / Membership / Ownership / AdminGuard), the helper inherits the +caller's checks via synthetic `is_route_level=true` AuthChecks. This +lifts the airflow shape exactly, both findings cleared post-fix. + +Precision guard: helper must NOT fire `missing_ownership_check` or +`token_override_without_validation` despite holding the auth-relevant +sinks (`session.add` with caller-passed scoped id). +""" +from typing import Annotated +from uuid import UUID +from fastapi import APIRouter, Body, Security + + +def require_auth(): + pass + + +# Router-level Security carries the JWT scope check on every attached +# route at runtime. Closes the prior session-0010 gap. 
+ti_id_router = APIRouter( + dependencies=[Security(require_auth, scopes=["ti:self"])], +) + + +def _create_state_update( + *, + task_instance_id: UUID, + payload: dict, + session, +) -> None: + """Helper: caller-scope IPA must propagate route-level auth into here.""" + if payload.get("kind") == "reschedule": + session.add({"id": task_instance_id, "data": payload}) + + +@ti_id_router.patch("/{task_instance_id}/state") +def ti_update_state( + task_instance_id: UUID, + payload: Annotated[dict, Body()], + session, +) -> None: + _create_state_update( + task_instance_id=task_instance_id, + payload=payload, + session=session, + ) diff --git a/tests/benchmark/corpus/python/safe/safe_fastapi_route_security_scopes.py b/tests/benchmark/corpus/python/safe/safe_fastapi_route_security_scopes.py new file mode 100644 index 00000000..2090231e --- /dev/null +++ b/tests/benchmark/corpus/python/safe/safe_fastapi_route_security_scopes.py @@ -0,0 +1,44 @@ +"""Distilled from airflow +`airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py:101-117`: +FastAPI route declares its auth dependency as +`dependencies=[Security(require_auth, scopes=["token:execution"])]`. +`Security(...)` is FastAPI's OAuth2-scope-checked variant of `Depends(...)` +— the JWT must carry one of the listed scopes, so the route is fully +authorized at the boundary. + +Pre-fix `is_depends_callee` only matched `Depends`; `Security(...)` was +ignored, leaving the route as if no auth dep were declared. Even after +recognising the marker, `require_auth` is a registered login-guard, and a +`LoginGuard` AuthCheckKind would have been filtered by +`has_prior_subject_auth` — the route would still fire +`missing_ownership_check`. The deeper fix promotes a scoped Security +wrapper to `AuthCheckKind::Other` so the route counts as authorized for +ownership / membership checks at any sink the handler reaches. 
+ +Precision guard: route must NOT fire `missing_ownership_check` even +though the handler does an id-targeted ORM filter. +""" +from fastapi import FastAPI, Security + + +def require_auth(scopes): + pass + + +router = FastAPI() + + +class TaskInstance: + pass + + +@router.patch( + "/{task_instance_id}/run", + dependencies=[Security(require_auth, scopes=["token:execution", "token:workload"])], +) +def ti_run(task_instance_id: str, session): + return session.scalar(select(TaskInstance).filter_by(id=task_instance_id)) + + +def select(_): + pass diff --git a/tests/benchmark/corpus/python/safe/safe_fastapi_router_level_security_scopes.py b/tests/benchmark/corpus/python/safe/safe_fastapi_router_level_security_scopes.py new file mode 100644 index 00000000..b68491b6 --- /dev/null +++ b/tests/benchmark/corpus/python/safe/safe_fastapi_router_level_security_scopes.py @@ -0,0 +1,61 @@ +"""Distilled from airflow +`airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py:89-318`: +FastAPI declares its auth dependency once at the router constructor — +`ti_id_router = VersionedAPIRouter(dependencies=[Security(require_auth, +scopes=["ti:self"])])` — and every per-task route attaches via +`@ti_id_router.(...)` with no inline deps. FastAPI propagates +router-level dependencies to every attached route at runtime, so the +JWT-validated scope check guards every `session.add` / row-update sink +the handler body reaches. + +Pre-fix the FastAPI dep extractor only walked the per-route decorator's +`dependencies=[...]` kwarg; router-constructor `dependencies=` was +dropped, so every `@ti_id_router.` route without inline deps fired +`missing_ownership_check` + `token_override_without_validation` despite +being authorized. 
+ +The fix walks module-level ` = APIRouter(...)` / +`VersionedAPIRouter(...)` / `FastAPI(...)` assignments, captures the +router's `dependencies=[...]` into a per-router map, and merges them +into the per-route middleware list when the decorator's prefix matches. +A scoped Security wrapper synthesises matching TokenExpiry + +TokenRecipient checks (the JWT-validation semantics) so the +token-override rule recognises the route too. + +Precision guard: route must NOT fire `missing_ownership_check` / +`token_override_without_validation` even though the handler writes +through an id-targeted state update. +""" +from fastapi import Security +from cadwyn import VersionedAPIRouter + + +def require_auth(scopes): + pass + + +# Router-level Security with non-empty scopes. Every route attached to +# this router inherits the dep; no inline declaration needed. +ti_id_router = VersionedAPIRouter( + dependencies=[ + Security(require_auth, scopes=["ti:self"]), + ], +) + + +class Log: + pass + + +class TaskInstance: + pass + + +@ti_id_router.patch("/{task_instance_id}/state") +def ti_update_state(task_instance_id: str, session): + session.add( + Log( + task_instance_id=task_instance_id, + event="state_update", + ) + ) diff --git a/tests/benchmark/corpus/python/safe/safe_local_set_update_no_orm.py b/tests/benchmark/corpus/python/safe/safe_local_set_update_no_orm.py new file mode 100644 index 00000000..79b5b065 --- /dev/null +++ b/tests/benchmark/corpus/python/safe/safe_local_set_update_no_orm.py @@ -0,0 +1,58 @@ +# py-auth-realrepo-XXX: bare-callee Python container constructors +# (`set()` / `dict()` / `defaultdict()`) bind a non-sink local +# collection. Subsequent method calls on the bound var +# (`verified_ids.update(..)`, `cache[k] = v`, `requested_teams.add(..)`) +# are in-memory mutations, not ORM/DB writes, so the route handler +# must NOT fire `py.auth.missing_ownership_check`. 
+# +# Distilled from sentry `src/sentry/api/helpers/teams.py::get_teams`: +# +# def get_teams(request, organization, teams=None): +# requested_teams = set(request.GET.getlist("team", []) ...) +# verified_ids: set[int] = set() +# ... +# verified_ids.update(myteams) # <-- LOCAL set update +# requested_teams.update(verified_ids) +# teams_query = Team.objects.filter( +# id__in=requested_teams, organization_id=organization.id +# ) +# +# Without the bare-callee constructor recogniser, `set()` / `dict()` +# go untracked, the bound vars miss `non_sink_vars`, and the +# `.update(..)` / `.add(..)` calls classify as `DbMutation` — +# triggering the false missing-ownership-check finding. See +# `AuthAnalysisRules::is_non_sink_constructor_callee` and the +# `assignment` arm in `collect_unit_state`. +from collections import Counter, defaultdict +from collections.abc import Iterable + + +class Organization: + pass + + +class Team: + pass + + +def get_teams(request, organization: Organization, teams: Iterable[int] | None = None): + requested_teams = set(request.GET.getlist("team", []) if teams is None else teams) + verified_ids: set[int] = set() + seen_counter = Counter() + cache = defaultdict(list) + metadata = dict() + pending = list() + + if "myteams" in requested_teams: + requested_teams.remove("myteams") + myteams = request.access.team_ids_with_membership + verified_ids.update(myteams) + requested_teams.update(verified_ids) + seen_counter.update(myteams) + cache["my"].append(myteams) + metadata["count"] = len(myteams) + pending.append(myteams) + + return Team.objects.filter( + id__in=requested_teams, organization_id=organization.id + ) diff --git a/tests/benchmark/corpus/python/safe/safe_relative_to_validator.py b/tests/benchmark/corpus/python/safe/safe_relative_to_validator.py new file mode 100644 index 00000000..af83b1b3 --- /dev/null +++ b/tests/benchmark/corpus/python/safe/safe_relative_to_validator.py @@ -0,0 +1,19 @@ +# py-safe-relative-to: pathlib `relative_to(base)` 
raise-on-escape +# pattern recognised as a receiver-side FILE_IO validator. Captures +# the canonical Python path-containment idiom — the receiver is proven +# contained in `base` if execution reaches the next statement. +# Motivated by CVE-2024-23334 patched fixture. +from pathlib import Path + +from flask import request, send_file + + +def download() -> None: + base = Path("/var/www/static") + rel_url = request.args.get("path") + filepath = base.joinpath(rel_url).resolve() + try: + filepath.relative_to(base) + except ValueError: + return + send_file(str(filepath)) diff --git a/tests/benchmark/corpus/typescript/auth/safe_session_user_id_copy.ts b/tests/benchmark/corpus/typescript/auth/safe_session_user_id_copy.ts index d7b37ab7..99e568e1 100644 --- a/tests/benchmark/corpus/typescript/auth/safe_session_user_id_copy.ts +++ b/tests/benchmark/corpus/typescript/auth/safe_session_user_id_copy.ts @@ -9,9 +9,8 @@ // `src/auth_analysis/extract/common.rs::value_is_self_scoped_session_id_chain` // which extends `collect_self_actor_id_binding` to recognise // session-scoped chains beyond the existing `actor_var.id` shape. 
-async function getCachedApiKeys(_userId: number) { - return []; -} +declare const prisma: any; +declare function getServerSession(): Promise; export const Page = async () => { const session = await getServerSession(); @@ -21,6 +20,6 @@ export const Page = async () => { } const userId = session.user.id; - const apiKeys = await getCachedApiKeys(userId); + const apiKeys = await prisma.apiKey.findMany({ where: { userId } }); return apiKeys; }; diff --git a/tests/benchmark/corpus/typescript/auth/vuln_target_user_id_no_check.ts b/tests/benchmark/corpus/typescript/auth/vuln_target_user_id_no_check.ts index 2fa7d5d7..e675f80e 100644 --- a/tests/benchmark/corpus/typescript/auth/vuln_target_user_id_no_check.ts +++ b/tests/benchmark/corpus/typescript/auth/vuln_target_user_id_no_check.ts @@ -1,13 +1,13 @@ // Vulnerable counterpart to `safe_session_user_id_copy.ts`: the -// `userId` is bound from a route param (`req.params.targetUserId`, -// not from the session), so the rule must still flag the missing -// ownership check on the downstream prisma call. -async function deleteApiKeysFromUserId(_userId: number) {} - -export const Handler = async (req: any, _res: any) => { - const session = await getServerSession(); - if (!session) return; - - const userId = req.params.targetUserId; - await deleteApiKeysFromUserId(userId); +// `targetUserId` is a foreign id parameter (route param, not the +// caller's session-id copy), so the rule must still flag the missing +// ownership check on the downstream qualified prisma call. 
+declare const prisma: { + apiKey: { + deleteMany(args: { where: { userId: string } }): Promise; + }; }; + +export async function deleteApiKeysFromUserId(targetUserId: string) { + await prisma.apiKey.deleteMany({ where: { userId: targetUserId } }); +} diff --git a/tests/benchmark/cve_corpus/c/CVE-2017-1000117/patched.c b/tests/benchmark/cve_corpus/c/CVE-2017-1000117/patched.c new file mode 100644 index 00000000..64d651e8 --- /dev/null +++ b/tests/benchmark/cve_corpus/c/CVE-2017-1000117/patched.c @@ -0,0 +1,84 @@ +// Nyx CVE benchmark fixture. +// +// CVE: CVE-2017-1000117 +// Project: git (git/git) +// License: GPL-2.0-only (https://github.com/git/git/blob/v2.7.6/COPYING) +// Advisory: https://nvd.nist.gov/vuln/detail/CVE-2017-1000117 +// Patched: commit 820d7650cc6705fbb73c8caf9aef47394be5ed72 (in v2.7.6) +// "connect: reject ssh hostname that begins with a dash", +// connect.c:757-758 of the post-fix tree. +// +// Same trims as vulnerable.c (see that header). The patch under test is +// the verbatim 2-line gate added immediately after `get_host_and_port` / +// `get_port` in upstream: +// +// if (ssh_host[0] == '-') +// die("strange hostname '%s' blocked", ssh_host); +// +// In this fixture the upstream `die()` is replaced with `exit(1)` (a +// `noreturn` libc primitive) — the patched-fix simplification. The +// taint-flow consequence is identical: the dash-prefixed `ssh_host` is +// rejected before any argv assembly or exec call, so no user-tainted +// value reaches `execvp`. +// +// Patched-fix simplification: +// - `die("strange hostname '%s' blocked", ssh_host)` rendered as +// `fprintf(stderr, "strange hostname '%s' blocked\n", ssh_host); +// exit(1);`. Upstream `die()` is a vararg wrapper that ultimately +// calls `exit(128)`. The flow-killing property (the function never +// returns when the gate fires) is preserved.
+ +#include +#include +#include +#include + +static void get_host_and_port_min(char **host, const char **port) { + char *colon, *end; + if (*host == NULL) return; + end = strchr(*host, '/'); + if (end) *end = '\0'; + colon = strchr(*host, ':'); + if (colon) { *colon = '\0'; *port = colon + 1; } +} + +int do_ssh_connect(char *url) { + const char *ssh; + char *ssh_host = url; + const char *port = NULL; + get_host_and_port_min(&ssh_host, &port); + + if (!port) port = "22"; + + if (ssh_host[0] == '-') { + fprintf(stderr, "strange hostname '%s' blocked\n", ssh_host); + exit(1); + } + + ssh = getenv("GIT_SSH_COMMAND"); + if (!ssh) { + ssh = getenv("GIT_SSH"); + if (!ssh) ssh = "ssh"; + } + + const char *args[8]; + int nargs = 0; + args[nargs++] = ssh; + if (port) { + args[nargs++] = "-p"; + args[nargs++] = port; + } + args[nargs++] = ssh_host; + args[nargs++] = "git-upload-pack"; + args[nargs++] = NULL; + + return execvp(args[0], (char *const *)args); +} + +int main(void) { + char url_buf[1024]; + if (!fgets(url_buf, sizeof url_buf, stdin)) return 1; + size_t len = strlen(url_buf); + if (len && url_buf[len - 1] == '\n') url_buf[len - 1] = '\0'; + return do_ssh_connect(url_buf); +} diff --git a/tests/benchmark/cve_corpus/c/CVE-2017-1000117/vulnerable.c b/tests/benchmark/cve_corpus/c/CVE-2017-1000117/vulnerable.c new file mode 100644 index 00000000..ad3b6f83 --- /dev/null +++ b/tests/benchmark/cve_corpus/c/CVE-2017-1000117/vulnerable.c @@ -0,0 +1,96 @@ +// Nyx CVE benchmark fixture. +// +// CVE: CVE-2017-1000117 +// Project: git (git/git) +// License: GPL-2.0-only (https://github.com/git/git/blob/v2.7.6/COPYING) +// Advisory: https://nvd.nist.gov/vuln/detail/CVE-2017-1000117 +// Vulnerable: tag v2.7.5 (parent c8dd1e3bb115), connect.c:733-793 of +// git_connect() — pre-fix tip before commit +// 820d7650cc6705fbb73c8caf9aef47394be5ed72 ("connect: reject +// ssh hostname that begins with a dash") landed. 
+// +// Pre-2.7.6 git accepted `ssh://-oProxyCommand=...@host/repo` URLs and +// passed the unsanitised `ssh_host` (derived from the URL host part) as +// an argv element to ssh. When `ssh_host` started with a dash, ssh +// interpreted it as an option (`-oProxyCommand=…`), giving the attacker +// a code-execution primitive whenever a user cloned an attacker-supplied +// URL or fetched an attacker-controlled submodule. The fix added a +// hostname-starts-with-dash rejection in connect.c immediately before +// the args were assembled. +// +// Trims: +// - Removed PROTO_LOCAL / git-daemon arms (connect.c:721 else-branch +// and the upstream `else { transport_check_allowed("file"); }` after +// the SSH block) — not on the disclosed flow path. +// - Removed `flags & CONNECT_DIAG_URL` early-exit (lines 744-755) and +// `tortoiseplink` / `putty` shell detection (lines 776-783) — they +// do not influence whether the dash-prefixed `ssh_host` reaches argv. +// - Inlined upstream's `start_command(conn)` (which fork+execvp's +// `conn->args.argv`) directly as `execvp(args[0], (char *const *)args)` +// against `conn->args.argv`. start_command is the heavyweight +// run-command helper; the disclosed sink behavior is identical. +// - Inlined upstream's `argv_array_push(&conn->args, …)` as plain +// pointer assignment into a fixed-size `args[8]` buffer. argv_array +// is a strvec wrapper; the dispatched argv shape is unchanged. +// - Replaced `parse_connect_url(url, ...)` + `get_host_and_port()` URL +// parser with a minimal `get_host_and_port_min()` that does the +// classic "skip user@, NUL-terminate at /" — the disclosed flow only +// requires that `ssh_host` be a substring of the source URL, which +// this preserves byte-for-byte. +// - Source statement: upstream takes the URL from `argv[1]` of the git +// binary; the fixture uses `fgets(url_buf, ..., stdin)` — a recognised +// C taint source — so the file scans standalone without depending on +// argv-source modeling.
+ +#include +#include +#include +#include + +static void get_host_and_port_min(char **host, const char **port) { + char *colon, *end; + if (*host == NULL) return; + end = strchr(*host, '/'); + if (end) *end = '\0'; + colon = strchr(*host, ':'); + if (colon) { *colon = '\0'; *port = colon + 1; } +} + +int do_ssh_connect(char *url) { + // Load-bearing block copied verbatim from connect.c:733-793 of the + // pre-fix git tree (tag v2.7.5 / parent c8dd1e3bb115). The + // dash-prefix check that landed in the fix is intentionally absent. + const char *ssh; + char *ssh_host = url; + const char *port = NULL; + get_host_and_port_min(&ssh_host, &port); + + if (!port) port = "22"; + + ssh = getenv("GIT_SSH_COMMAND"); + if (!ssh) { + ssh = getenv("GIT_SSH"); + if (!ssh) ssh = "ssh"; + } + + const char *args[8]; + int nargs = 0; + args[nargs++] = ssh; + if (port) { + args[nargs++] = "-p"; + args[nargs++] = port; + } + args[nargs++] = ssh_host; + args[nargs++] = "git-upload-pack"; + args[nargs++] = NULL; + + return execvp(args[0], (char *const *)args); +} + +int main(void) { + char url_buf[1024]; + if (!fgets(url_buf, sizeof url_buf, stdin)) return 1; + size_t len = strlen(url_buf); + if (len && url_buf[len - 1] == '\n') url_buf[len - 1] = '\0'; + return do_ssh_connect(url_buf); +} diff --git a/tests/benchmark/cve_corpus/javascript/CVE-2026-42353/patched.js b/tests/benchmark/cve_corpus/javascript/CVE-2026-42353/patched.js new file mode 100644 index 00000000..ae874deb --- /dev/null +++ b/tests/benchmark/cve_corpus/javascript/CVE-2026-42353/patched.js @@ -0,0 +1,60 @@ +// Nyx CVE benchmark fixture (patched). 
+// +// CVE: CVE-2026-42353 +// GHSA: GHSA-jfgf-83c5-2c4m +// Project: i18next-http-middleware (i18next/i18next-http-middleware) +// License: MIT (https://github.com/i18next/i18next-http-middleware/blob/master/licence) +// Patched: 65301c194593d46a84623b64e5fde2f51d3550f6 lib/utils.js:1-22, lib/index.js:243-250 +// Release: v3.9.3 +// +// Patch adds `utils.isSafeIdentifier` (denylist allowing any legitimate +// i18next language code shape, rejecting `..`, path separators, control +// chars, prototype keys, empty strings, and values longer than 128) and +// inserts `languages = languages.filter(utils.isSafeIdentifier)` and the +// equivalent for `namespaces` before they reach the backend connector. +// +// Trims: same scaffolding trims as the vulnerable counterpart. +// +// Patched-form simplification: same template-literal inline of the +// backend's interpolator + readFileSync as the vulnerable side. The +// `utils.isSafeIdentifier` body is copied verbatim from +// `lib/utils.js:13-22` of the patched commit; the prototype-pollution +// denylist (UNSAFE_KEYS check) and length / control-char / `..` / +// separator rejections are all load-bearing for the precision-side +// claim. +const fs = require('fs'); +const express = require('express'); +const app = express(); + +const UNSAFE_KEYS = ['__proto__', 'constructor', 'prototype']; + +function isSafeIdentifier (v) { + if (typeof v !== 'string') return false; + if (v.length === 0 || v.length > 128) return false; + if (UNSAFE_KEYS.indexOf(v) > -1) return false; + if (v.indexOf('..') > -1) return false; + if (v.indexOf('/') > -1 || v.indexOf('\\') > -1) return false; + // eslint-disable-next-line no-control-regex + if (/[\x00-\x1F\x7F]/.test(v)) return false; + return true; +} + +app.get('/locales/resources.json', (req, res) => { + let languages = req.query.lng + ? req.query.lng.split(' ') + : []; + let namespaces = req.query.ns + ? 
req.query.ns.split(' ') + : []; + + // Drop user-supplied values containing patterns that could trigger + // path traversal / SSRF / prototype pollution when forwarded to the + // backend connector. See: https://www.i18next.com/how-to/faq#how-should-the-language-codes-be-formatted + languages = languages.filter(isSafeIdentifier); + namespaces = namespaces.filter(isSafeIdentifier); + + const lng = languages[0]; + const ns = namespaces[0]; + const filename = `/locales/${lng}/${ns}.json`; + fs.readFileSync(filename); +}); diff --git a/tests/benchmark/cve_corpus/javascript/CVE-2026-42353/vulnerable.js b/tests/benchmark/cve_corpus/javascript/CVE-2026-42353/vulnerable.js new file mode 100644 index 00000000..d48d2509 --- /dev/null +++ b/tests/benchmark/cve_corpus/javascript/CVE-2026-42353/vulnerable.js @@ -0,0 +1,56 @@ +// Nyx CVE benchmark fixture. +// +// CVE: CVE-2026-42353 +// GHSA: GHSA-jfgf-83c5-2c4m +// Project: i18next-http-middleware (i18next/i18next-http-middleware) +// License: MIT (https://github.com/i18next/i18next-http-middleware/blob/master/licence) +// Advisory: https://github.com/i18next/i18next-http-middleware/security/advisories/GHSA-jfgf-83c5-2c4m +// Vulnerable: a1d92a8f03292644d1c6fa83f1b77121d39daf4d lib/index.js:229-234,246-261 +// +// Pre-3.9.3 `getResourcesHandler` pulled `lng` and `ns` directly from +// `options.getQuery(req)` (default: `req => req.query`) and forwarded the +// split values into `i18next.services.backendConnector.load(...)` with no +// sanitisation. Paired with `i18next-fs-backend`, the backend's +// `Backend.read` calls `interpolator.interpolate(loadPath, { lng, ns })` +// which substitutes the unsanitised values into a path template and then +// `readFileSync(filename)`, so a request like +// `GET /locales/resources.json?lng=../../etc/passwd&ns=root` reads +// attacker-chosen files off disk. 
The advisory also flags the SSRF +// variant when paired with `i18next-http-backend`; we model the +// fs-backend path here because it is the more direct sink-flow shape. +// +// Trims: getResourcesHandler's caching headers (lib/index.js:213-227), +// route-params fallback (L237-244), Response/JSON envelope branch +// (L264-268), the full Backend class wrapper (read/save/create/queue/ +// debounce — only the inline interpolation + readFileSync are +// load-bearing), `extendOptionsWithDefaults`, the Backend constructor +// path, `loadPath` typeof-function escape hatch, getResourceBundle +// roundtrip, and the express-router/middleware mount glue. +// +// Patched-form simplification: the upstream interpolator is +// `i18next.services.interpolator.interpolate(loadPath, { lng, ns })`; +// here it is inlined as a template literal because the interpolator +// just substitutes `{{lng}}` and `{{ns}}` placeholders into `loadPath` +// (the default loadPath is `/locales/{{lng}}/{{ns}}.json`). The +// substitution is character-for-character equivalent for the load- +// bearing flow path (lng/ns into the string). +const fs = require('fs'); +const express = require('express'); +const app = express(); + +app.get('/locales/resources.json', (req, res) => { + let languages = req.query.lng + ? req.query.lng.split(' ') + : []; + let namespaces = req.query.ns + ? req.query.ns.split(' ') + : []; + + // Inline the backend's read() and forEach loop's body verbatim, + // collapsing the call into the array-index access used by the + // recall test (see disabled_reason in ground_truth.json). 
+ const lng = languages[0]; + const ns = namespaces[0]; + const filename = `/locales/${lng}/${ns}.json`; + fs.readFileSync(filename); +}); diff --git a/tests/benchmark/cve_corpus/php/CVE-2026-33486/patched.php b/tests/benchmark/cve_corpus/php/CVE-2026-33486/patched.php new file mode 100644 index 00000000..b7213db3 --- /dev/null +++ b/tests/benchmark/cve_corpus/php/CVE-2026-33486/patched.php @@ -0,0 +1,72 @@ + Any: + ... + + +def get_result_from_sqldb( + db: SQLDatabase, cmd: str +) -> Union[str, List[Dict[str, Any]], Dict[str, Any]]: + result = db._execute(cmd, fetch="all") # type: ignore + return result diff --git a/tests/benchmark/cve_corpus/python/CVE-2024-21513/vulnerable.py b/tests/benchmark/cve_corpus/python/CVE-2024-21513/vulnerable.py new file mode 100644 index 00000000..8c7a936c --- /dev/null +++ b/tests/benchmark/cve_corpus/python/CVE-2024-21513/vulnerable.py @@ -0,0 +1,56 @@ +# Nyx CVE benchmark fixture. +# +# CVE: CVE-2024-21513 +# Project: LangChain Experimental (langchain-ai/langchain) +# License: MIT (https://github.com/langchain-ai/langchain/blob/master/LICENSE) +# Advisory: https://nvd.nist.gov/vuln/detail/CVE-2024-21513 +# Vulnerable: 7b13292e3544b2f5f2bfb8a27a062ea2b0c34561~1 +# libs/experimental/langchain_experimental/sql/vector_sql.py:79-98 +# +# `langchain_experimental.sql.vector_sql.VectorSQLDatabaseChain` ran +# every value returned from a SQL query through Python's built-in +# `eval(...)` so that string-shaped numbers / lists were converted into +# Python objects. An attacker who could control the database content +# (for example by writing into a vector store backing the chain) could +# return a value such as `__import__("os").system("rm -rf /")` and the +# chain would `eval` it, achieving arbitrary code execution on the +# server hosting the chain. +# +# Trims: +# - imports / non-load-bearing module decls (L1-30 of upstream). 
+# - `parse(self, text: str)` output-parser method (L70-77) and the +# `VectorSQLDatabaseChain` class body (L101-200) — neither is on +# the disclosed source→sink path. +# - SQLAlchemy / SQLDatabase type hints simplified to `Any` to avoid +# pulling the upstream type chain into the fixture. +# +# Verbatim load-bearing lines: the `_try_eval` helper definition and +# the two dict / list comprehensions inside `get_result_from_sqldb` +# that call `_try_eval(v)` on each query-result value are +# byte-for-byte from vector_sql.py at the pre-fix SHA. + +from typing import Any, Dict, List, Union + + +class SQLDatabase: + def _execute(self, cmd: str, fetch: str = "all") -> Any: + ... + + +def _try_eval(x: Any) -> Any: + try: + return eval(x) + except Exception: + return x + + +def get_result_from_sqldb( + db: SQLDatabase, cmd: str +) -> Union[str, List[Dict[str, Any]], Dict[str, Any]]: + result = db._execute(cmd, fetch="all") # type: ignore + if isinstance(result, list): + return [{k: _try_eval(v) for k, v in dict(d._asdict()).items()} for d in result] + else: + return { + k: _try_eval(v) for k, v in dict(result._asdict()).items() # type: ignore + } diff --git a/tests/benchmark/cve_corpus/python/CVE-2024-23334/patched.py b/tests/benchmark/cve_corpus/python/CVE-2024-23334/patched.py new file mode 100644 index 00000000..8766497f --- /dev/null +++ b/tests/benchmark/cve_corpus/python/CVE-2024-23334/patched.py @@ -0,0 +1,57 @@ +# Nyx CVE benchmark fixture. +# +# CVE: CVE-2024-23334 +# Project: aiohttp (aio-libs/aiohttp) +# License: Apache-2.0 (https://github.com/aio-libs/aiohttp/blob/master/LICENSE.txt) +# Advisory: https://github.com/aio-libs/aiohttp/security/advisories/GHSA-5h86-8mv2-jq9f +# Patched: 1c335944d6a8b1298baf179b7c0b3069f10c514b aiohttp/web_urldispatcher.py:644-668 +# +# The fix splits the previously-unified resolve+containment check so +# that ``relative_to(self._directory)`` is run on *both* arms of the +# ``follow_symlinks`` branch. 
In the follow-symlinks arm the path is +# normalised pre-resolve so a symlink target that lives outside the +# static directory still raises ``ValueError`` from ``relative_to`` and +# is converted to ``HTTPNotFound``. +# +# Trims: same as vulnerable.py. +# +# Verbatim load-bearing lines: the rebuilt ``follow_symlinks`` branch +# in ``_handle`` (L644-660), the new ``unresolved_path = self._directory +# .joinpath(filename)`` step, and the ``normalized_path.relative_to( +# self._directory)`` guard are byte-for-byte from +# web_urldispatcher.py:644-660 of the fix commit. + +import os +from pathlib import Path + +from aiohttp import web +from aiohttp.web import FileResponse, HTTPForbidden, HTTPNotFound, Request, StreamResponse + + +class StaticResource: + def __init__(self, directory: str, follow_symlinks: bool = True) -> None: + self._directory = Path(directory) + self._follow_symlinks = follow_symlinks + self._chunk_size = 256 * 1024 + + async def _handle(self, request: Request) -> StreamResponse: + rel_url = request.match_info["filename"] + try: + filename = Path(rel_url) + if filename.anchor: + raise HTTPForbidden() + unresolved_path = self._directory.joinpath(filename) + if self._follow_symlinks: + normalized_path = Path(os.path.normpath(unresolved_path)) + normalized_path.relative_to(self._directory) + filepath = normalized_path.resolve() + else: + filepath = unresolved_path.resolve() + filepath.relative_to(self._directory) + except (ValueError, FileNotFoundError) as error: + raise HTTPNotFound() from error + except HTTPForbidden: + raise + if filepath.is_file(): + return FileResponse(filepath, chunk_size=self._chunk_size) + raise HTTPNotFound diff --git a/tests/benchmark/cve_corpus/python/CVE-2024-23334/vulnerable.py b/tests/benchmark/cve_corpus/python/CVE-2024-23334/vulnerable.py new file mode 100644 index 00000000..5f79714f --- /dev/null +++ b/tests/benchmark/cve_corpus/python/CVE-2024-23334/vulnerable.py @@ -0,0 +1,62 @@ +# Nyx CVE benchmark fixture. 
+# +# CVE: CVE-2024-23334 +# Project: aiohttp (aio-libs/aiohttp) +# License: Apache-2.0 (https://github.com/aio-libs/aiohttp/blob/master/LICENSE.txt) +# Advisory: https://github.com/aio-libs/aiohttp/security/advisories/GHSA-5h86-8mv2-jq9f +# Vulnerable: 33ccdfb0a12690af5bb49bda2319ec0907fa7827 aiohttp/web_urldispatcher.py:633-648 +# +# aiohttp's StaticResource._handle resolved the requested filename +# under the configured static directory and then verified containment +# only when ``follow_symlinks`` was False. When ``follow_symlinks=True`` +# the ``filepath.relative_to(self._directory)`` check was skipped, so a +# symlink (or absolute path slip past the anchor check) under the +# static directory could escape it and serve files from anywhere on +# the filesystem the worker process could read. +# +# Trims: +# - ``append_version`` branch (L575-588) — separate code path that +# does not feed FileResponse on the disclosed flow. +# - ``HTTPNotFound`` / ``Exception`` handling fall-through after the +# try block (L646-654 of upstream) — irrelevant to source→sink. +# - ``_directory_as_html`` directory-listing branch (L658-708) — +# only ``FileResponse`` is the disclosed sink path. +# +# Verbatim load-bearing lines: the ``rel_url = request.match_info[ +# "filename"]`` source, the ``filepath = self._directory.joinpath( +# filename).resolve()`` path composition, the missing ``relative_to`` +# guard inside the ``if not self._follow_symlinks`` branch, and the +# ``return FileResponse(filepath, chunk_size=self._chunk_size)`` sink +# are byte-for-byte from web_urldispatcher.py:633-648 and L666-668. 
+ +from pathlib import Path + +from aiohttp import web +from aiohttp.web import FileResponse, HTTPForbidden, HTTPNotFound, Request, StreamResponse + + +class StaticResource: + def __init__(self, directory: str, follow_symlinks: bool = True) -> None: + self._directory = Path(directory) + self._follow_symlinks = follow_symlinks + self._chunk_size = 256 * 1024 + + async def _handle(self, request: Request) -> StreamResponse: + rel_url = request.match_info["filename"] + try: + filename = Path(rel_url) + if filename.anchor: + # rel_url is an absolute name like + # /static/\\machine_name\c$ or /static/D:\path + # where the static dir is totally different + raise HTTPForbidden() + filepath = self._directory.joinpath(filename).resolve() + if not self._follow_symlinks: + filepath.relative_to(self._directory) + except (ValueError, FileNotFoundError) as error: + raise HTTPNotFound() from error + except HTTPForbidden: + raise + if filepath.is_file(): + return FileResponse(filepath, chunk_size=self._chunk_size) + raise HTTPNotFound diff --git a/tests/benchmark/ground_truth.json b/tests/benchmark/ground_truth.json index 558ec1fb..f6ba57c7 100644 --- a/tests/benchmark/ground_truth.json +++ b/tests/benchmark/ground_truth.json @@ -3,7 +3,7 @@ "metadata": { "description": "Nyx benchmark ground truth", "created": "2026-03-20", - "corpus_size": 533 + "corpus_size": 567 }, "cases": [ { @@ -912,6 +912,42 @@ "disabled": false, "notes": "Path traversal via send_file() with user-controlled path" }, + { + "case_id": "py-path_traversal-no-relative-to", + "file": "python/path_traversal/path_traversal_no_relative_to.py", + "language": "python", + "is_vulnerable": true, + "vuln_class": "path_traversal", + "cwe": "CWE-22", + "provenance": "synthetic", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "taint-unsanitised-flow" + ], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [], + "expected_severity": "MEDIUM", + "expected_category": 
"Security", + "expected_sink_lines": [ + [ + 11, + 11 + ] + ], + "expected_source_lines": [ + [ + 9, + 9 + ] + ], + "tags": [ + "path-traversal", + "regression-guard" + ], + "disabled": false, + "notes": "Negative companion to safe_relative_to_validator: no relative_to() validator on filepath, taint must propagate to send_file" + }, { "case_id": "py-deser-001", "file": "python/deser/deser_pickle.py", @@ -2816,6 +2852,43 @@ "disabled": false, "notes": "SSRF via file_get_contents() with user-controlled URL" }, + { + "case_id": "php-ssrf-002", + "file": "php/ssrf/ssrf_class_method_fopen.php", + "language": "php", + "is_vulnerable": true, + "vuln_class": "ssrf", + "cwe": "CWE-918", + "provenance": "synthetic", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "taint-unsanitised-flow" + ], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [ + [ + 15, + 15 + ] + ], + "expected_source_lines": [ + [ + 14, + 14 + ] + ], + "tags": [ + "fopen", + "ssrf", + "class-method" + ], + "disabled": false, + "notes": "Regression: PHP class-method body taint analysis (declaration_list mapped to Kind::Block) + fopen as PHP SSRF sink" + }, { "case_id": "php-path_traversal-001", "file": "php/path_traversal/path_traversal.php", @@ -3384,6 +3457,62 @@ "disabled": false, "notes": "Precision guard: chained-call inner-gate fix must NOT fire on http.get('http://internal-health.localhost/...').on(...) with a hardcoded literal URL." 
}, + { + "case_id": "js-path_traversal-ternary-source-001", + "file": "javascript/path_traversal/path_traversal_ternary_source.js", + "language": "javascript", + "is_vulnerable": true, + "vuln_class": "path_traversal", + "cwe": "CWE-22", + "provenance": "synthetic", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "taint-unsanitised-flow" + ], + "allowed_alternative_rule_ids": [ + "cfg-unguarded-sink" + ], + "forbidden_rule_ids": [], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "path_traversal", + "ternary-as-value", + "source-classification" + ], + "disabled": false, + "notes": "Regression guard for the ternary-RHS source-classification fix in src/cfg/conditions.rs::lower_ternary_branch (2026-05-04). Pre-fix, push_node only did suffix/prefix matching on the branch text, so req.query.lng did not classify as a Source (rule matcher is req.query, neither matches req.query.lng). Both ternary branches lowered to labelless Assign-with-empty-uses, the join phi saw no taint, and downstream sinks missed the flow. Motivated by GHSA-jfgf-83c5-2c4m / CVE-2026-42353." 
+ }, + { + "case_id": "js-safe-ternary-const-branches", + "file": "javascript/safe/safe_ternary_const_branches.js", + "language": "javascript", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "N/A", + "provenance": "synthetic", + "equivalence_tier": "exact", + "match_mode": "file_presence", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "taint-unsanitised-flow" + ], + "expected_severity": null, + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "ternary-as-value", + "source-classification", + "negative" + ], + "disabled": false, + "notes": "Precision guard: ternary-RHS source-classification fix must NOT synthesise a Source label when both branches are constant strings. Pins the conservative gate inside lower_ternary_branch." + }, { "case_id": "py-ssrf-002", "file": "python/ssrf/ssrf_httpx_post.py", @@ -4730,6 +4859,34 @@ "disabled": false, "notes": "filter_input sanitizes user input \u2014 should produce no taint finding" }, + { + "case_id": "php-safe-camelcase-validator-001", + "file": "php/safe/safe_camelcase_validator_negated.php", + "language": "php", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "", + "provenance": "synthetic", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "taint-unsanitised-flow" + ], + "expected_severity": null, + "expected_category": null, + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "validator", + "camelcase", + "negated", + "class-method" + ], + "disabled": false, + "notes": "Regression: camelCase `isSafeRemoteUrl` validator + `if (!validator($x))` early-return narrowing in PHP class method" + }, { "case_id": "php-sqli-pdo-001", "file": "php/sqli/sqli_pdo_raw.php", @@ -9922,6 +10079,63 @@ ], "notes": "CVE-2023-22621 patched counterpart: _.template called with { interpolate: , 
evaluate: false, escape: false } so the lodash evaluate block compiler is disabled." }, + { + "case_id": "cve-js-2026-42353-vulnerable", + "file": "cve_corpus/javascript/CVE-2026-42353/vulnerable.js", + "language": "javascript", + "is_vulnerable": true, + "vuln_class": "path_traversal", + "cwe": "CWE-22", + "provenance": "real_cve", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "taint-unsanitised-flow" + ], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "cve", + "i18next-http-middleware", + "path_traversal", + "ssrf", + "ternary-as-value" + ], + "disabled": false, + "notes": "CVE-2026-42353 / GHSA-jfgf-83c5-2c4m: i18next-http-middleware <3.9.3 getResourcesHandler forwards user-controlled lng/ns into i18next.services.backendConnector.load(...) without sanitisation. Paired with i18next-fs-backend the unsanitised values reach readFileSync(filename) (path traversal); paired with i18next-http-backend they reach an outgoing HTTP request URL (SSRF). MIT. Enabled 2026-05-04 after the array-method validator-callback narrowing (`try_array_method_validator_callback_narrowing` in src/taint/ssa_transfer/mod.rs) closed the dual gap that previously made the patched counterpart fire." 
+ }, + { + "case_id": "cve-js-2026-42353-patched", + "file": "cve_corpus/javascript/CVE-2026-42353/patched.js", + "language": "javascript", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "N/A", + "provenance": "real_cve", + "equivalence_tier": "exact", + "match_mode": "file_presence", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "taint-unsanitised-flow" + ], + "expected_severity": null, + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "cve", + "i18next-http-middleware", + "patched", + "negative" + ], + "disabled": false, + "notes": "CVE-2026-42353 patched counterpart: utils.isSafeIdentifier denylist applied via languages.filter(isSafeIdentifier) before forwarding to backend. Enabled 2026-05-04: array-method validator-callback narrowing recognises `.filter()` shapes and strips the receiver-derived caps from the call result." + }, { "case_id": "cve-ts-2023-26159-vulnerable", "file": "cve_corpus/typescript/CVE-2023-26159/vulnerable.ts", @@ -10262,6 +10476,202 @@ "disabled": false, "notes": "CVE-2026-33626 patched counterpart: _is_safe_url private-IP allowlist gate replaces scheme-only check; regression guard that Nyx does not refire on the fix" }, + { + "case_id": "cve-py-2024-23334-vulnerable", + "file": "cve_corpus/python/CVE-2024-23334/vulnerable.py", + "language": "python", + "is_vulnerable": true, + "vuln_class": "path_traversal", + "cwe": "CWE-22", + "provenance": "real_cve", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "taint-unsanitised-flow" + ], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [ + [ + 61, + 61 + ] + ], + "expected_source_lines": [ + [ + 45, + 45 + ] + ], + "tags": [ + "cve", + "aiohttp", + "path-traversal", + "positive" + ], + "disabled": false, + "notes": "CVE-2024-23334: 
aiohttp StaticResource symlink-bypass — relative_to(self._directory) check gated to the non-follow_symlinks arm; FileResponse(filepath) reachable from request.match_info on follow_symlinks=True. Apache-2.0" + }, + { + "case_id": "cve-py-2024-23334-patched", + "file": "cve_corpus/python/CVE-2024-23334/patched.py", + "language": "python", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "N/A", + "provenance": "real_cve", + "equivalence_tier": "exact", + "match_mode": "file_presence", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "taint-unsanitised-flow" + ], + "expected_severity": null, + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "cve", + "aiohttp", + "patched", + "negative" + ], + "disabled": false, + "notes": "CVE-2024-23334 patched counterpart: relative_to(self._directory) recognised as a receiver-side FILE_IO validator on both follow_symlinks arms; regression guard that Nyx does not refire on the fix" + }, + { + "case_id": "cve-py-2023-6568-vulnerable", + "file": "cve_corpus/python/CVE-2023-6568/vulnerable.py", + "language": "python", + "is_vulnerable": true, + "vuln_class": "xss", + "cwe": "CWE-79", + "provenance": "real_cve", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "py.xss.make_response_format" + ], + "allowed_alternative_rule_ids": [ + "taint-unsanitised-flow" + ], + "forbidden_rule_ids": [], + "expected_severity": "LOW", + "expected_category": "Security", + "expected_sink_lines": [ + [ + 45, + 45 + ] + ], + "expected_source_lines": [ + [ + 41, + 41 + ] + ], + "tags": [ + "cve", + "mlflow", + "xss", + "positive" + ], + "disabled": false, + "notes": "CVE-2023-6568: mlflow auth create_user reflected attacker-controlled Content-Type request header into make_response f-string — reflected XSS. 
Apache-2.0" + }, + { + "case_id": "cve-py-2023-6568-patched", + "file": "cve_corpus/python/CVE-2023-6568/patched.py", + "language": "python", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "N/A", + "provenance": "real_cve", + "equivalence_tier": "exact", + "match_mode": "file_presence", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "py.xss.make_response_format", + "taint-unsanitised-flow" + ], + "expected_severity": null, + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "cve", + "mlflow", + "patched", + "negative" + ], + "disabled": false, + "notes": "CVE-2023-6568 patched counterpart: f-string interpolation replaced with static string; no tainted source reaches make_response and the AST f-string trigger is gone, so neither rule fires" + }, + { + "case_id": "cve-py-2024-21513-vulnerable", + "file": "cve_corpus/python/CVE-2024-21513/vulnerable.py", + "language": "python", + "is_vulnerable": true, + "vuln_class": "code_exec", + "cwe": "CWE-94", + "provenance": "real_cve", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "py.code_exec.eval" + ], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [ + [ + 42, + 42 + ] + ], + "expected_source_lines": [], + "tags": [ + "cve", + "langchain", + "code-exec", + "positive" + ], + "disabled": false, + "notes": "CVE-2024-21513: langchain_experimental VectorSQLDatabaseChain ran every SQL query result through eval() via _try_eval; attacker-controlled DB rows -> RCE. 
MIT" + }, + { + "case_id": "cve-py-2024-21513-patched", + "file": "cve_corpus/python/CVE-2024-21513/patched.py", + "language": "python", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "N/A", + "provenance": "real_cve", + "equivalence_tier": "exact", + "match_mode": "file_presence", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "py.code_exec.eval" + ], + "expected_severity": null, + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "cve", + "langchain", + "patched", + "negative" + ], + "disabled": false, + "notes": "CVE-2024-21513 patched counterpart: _try_eval helper deleted and get_result_from_sqldb returns raw result with no eval; py.code_exec.eval silent" + }, { "case_id": "cve-php-2017-9841-vulnerable", "file": "cve_corpus/php/CVE-2017-9841/vulnerable.php", @@ -10401,6 +10811,72 @@ "disabled": false, "notes": "CVE-2018-15133 patched counterpart: HMAC-verified JSON cookie replaces PHP-serialized payload; regression guard that Nyx does not refire on the fix" }, + { + "case_id": "cve-php-2026-33486-vulnerable", + "file": "cve_corpus/php/CVE-2026-33486/vulnerable.php", + "language": "php", + "is_vulnerable": true, + "vuln_class": "ssrf", + "cwe": "CWE-918", + "provenance": "real_cve", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "taint-unsanitised-flow" + ], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [ + [ + 44, + 44 + ] + ], + "expected_source_lines": [ + [ + 40, + 40 + ] + ], + "tags": [ + "cve", + "roadiz", + "ssrf", + "lfi" + ], + "disabled": false, + "notes": "CVE-2026-33486: roadiz/documents DownloadedFile::fromUrl passes the URL parameter to fopen() without scheme allowlist or host validation; file:// payloads read host filesystem. 
MIT" + }, + { + "case_id": "cve-php-2026-33486-patched", + "file": "cve_corpus/php/CVE-2026-33486/patched.php", + "language": "php", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "N/A", + "provenance": "real_cve", + "equivalence_tier": "exact", + "match_mode": "file_presence", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "taint-unsanitised-flow" + ], + "expected_severity": null, + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "cve", + "roadiz", + "patched", + "negative" + ], + "disabled": false, + "notes": "CVE-2026-33486 patched counterpart: isSafeRemoteUrl early-return validates scheme/host before fopen; regression guard that Nyx does not refire on the fix" + }, { "case_id": "cve-rb-2013-0156-vulnerable", "file": "cve_corpus/ruby/CVE-2013-0156/vulnerable.rb", @@ -11325,6 +11801,79 @@ "disabled": false, "notes": "CVE-2019-18634 patched counterpart: bounded copy routine replaces the strcpy sink; regression guard that Nyx does not refire on the fix" }, + { + "case_id": "cve-c-2017-1000117-vulnerable", + "file": "cve_corpus/c/CVE-2017-1000117/vulnerable.c", + "language": "c", + "is_vulnerable": true, + "vuln_class": "cmdi", + "cwe": "CWE-88", + "provenance": "real_cve", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "taint-unsanitised-flow" + ], + "allowed_alternative_rule_ids": [ + "c.cmdi.execvp" + ], + "forbidden_rule_ids": [], + "expected_severity": "HIGH", + "expected_category": "Security", + "expected_sink_lines": [ + [ + 87, + 87 + ] + ], + "expected_source_lines": [ + [ + 92, + 92 + ] + ], + "tags": [ + "cve", + "git", + "argv-injection", + "cmdi" + ], + "disabled": true, + "disabled_reason": "C taint engine does not propagate taint through C array-element writes (`args[i] = ssh_host;`) and has no `c.cmdi.exec*` AST pattern; even if such a pattern were added it would also fire on the patched fixture 
(precision miss) because the CVE is sanitised by a pre-call dash-prefix guard the engine does not classify as a validator. Three-layer deep fix tracked in CVE_DEFERRED.md.", + "notes": "CVE-2017-1000117 (git ssh:// argv injection): pre-2.7.6 git accepted `ssh://-oProxyCommand=...@host/repo` URLs and pushed the URL host as an argv element to ssh, where a leading dash was treated as an option flag. GPL-2.0" + }, + { + "case_id": "cve-c-2017-1000117-patched", + "file": "cve_corpus/c/CVE-2017-1000117/patched.c", + "language": "c", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "N/A", + "provenance": "real_cve", + "equivalence_tier": "exact", + "match_mode": "file_presence", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "c.cmdi.system", + "c.cmdi.popen", + "c.cmdi.execvp", + "taint-unsanitised-flow" + ], + "expected_severity": null, + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "cve", + "git", + "patched", + "negative" + ], + "disabled": true, + "disabled_reason": "Paired with cve-c-2017-1000117-vulnerable; precision side requires sanitizer recognition of the upstream `if (ssh_host[0] == '-') die(...)` guard so that adding any `c.cmdi.execvp` AST pattern would not also fire on the patched fixture.", + "notes": "CVE-2017-1000117 patched counterpart: dash-prefix gate added before argv assembly; regression guard that Nyx does not refire on the fix once the deferral lands" + }, { "case_id": "cve-cpp-2019-13132-vulnerable", "file": "cve_corpus/cpp/CVE-2019-13132/vulnerable.cpp", @@ -12626,6 +13175,33 @@ "disabled": false, "notes": "Python equivalent of rs-safe-014: direct-return sanitiser with `\"..\" in s` / `s.startswith(...)` rejection chain returning empty string." 
}, + { + "case_id": "py-safe-relative-to-validator", + "file": "python/safe/safe_relative_to_validator.py", + "language": "python", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "CWE-22", + "provenance": "synthetic", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "taint-unsanitised-flow" + ], + "expected_severity": null, + "expected_category": null, + "expected_sink_lines": null, + "expected_source_lines": null, + "tags": [ + "path-validator", + "relative-to", + "receiver-side" + ], + "disabled": false, + "notes": "Pathlib relative_to(base) raise-on-escape pattern recognised as a receiver-side FILE_IO validator; canonical Python path-containment idiom (CVE-2024-23334 patched fixture)" + }, { "case_id": "py-safe-022", "file": "python/safe/safe_canonicalise_rooted_startswith.py", @@ -13232,6 +13808,65 @@ "disabled": false, "notes": "Vulnerable counterpart to php-safe-019: md5 / sha1 used to store / sign / digest credentials, tokens, signatures. Consumer names contain crypto-keyword substrings (`password`, `token`, `signature`, `pw_hash`, `digest`) so Layer F suppression refuses to fire." 
}, + { + "case_id": "php-safe-020", + "file": "php/safe/safe_serializable_magic_method_unserialize.php", + "language": "php", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "CWE-502", + "provenance": "real-repo-precision-2026-05-03", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "php.deser.unserialize", + "taint-unsanitised-flow" + ], + "expected_severity": null, + "expected_category": null, + "expected_sink_lines": null, + "expected_source_lines": null, + "tags": [ + "real-repo-precision-2026-05-03", + "unserialize", + "serializable", + "magic-method" + ], + "disabled": false, + "notes": "Serializable::unserialize($input) magic-method body — the legacy PHP Serializable interface contract (deprecated since PHP 8.1). PHP itself drives invocation; the body's \\unserialize($x) call is part of the deserialization machinery and cannot be removed without breaking the interface. Distilled from joomla/administrator/components/com_finder/src/Indexer/Result.php:488 + libraries/src/Input/{Cli,Input}.php." + }, + { + "case_id": "php-deser-003", + "file": "php/deser/deser_unserialize_method_named_unserialize_with_user_input.php", + "language": "php", + "is_vulnerable": true, + "vuln_class": "deser", + "cwe": "CWE-502", + "provenance": "real-repo-precision-2026-05-03", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "php.deser.unserialize" + ], + "allowed_alternative_rule_ids": [ + "taint-unsanitised-flow" + ], + "forbidden_rule_ids": [], + "expected_severity": null, + "expected_category": "Security", + "expected_sink_lines": null, + "expected_source_lines": null, + "tags": [ + "real-repo-precision-2026-05-03", + "unserialize", + "magic-method-name-only", + "recall-guard" + ], + "disabled": false, + "notes": "Recall guard for the Serializable magic-method recogniser. 
Method is named `unserialize` but (a) calls \\unserialize on user input from $_GET, not the formal parameter, OR (b) wraps the parameter in trim() before passing it through. The recogniser deliberately requires bare-parameter pass-through, so both shapes must keep firing." + }, { "case_id": "c-safe-014", "file": "c/safe/safe_direct_path_sanitizer.c", @@ -14942,7 +15577,7 @@ ], "allowed_alternative_rule_ids": [], "forbidden_rule_ids": [], - "expected_severity": "HIGH", + "expected_severity": "MEDIUM", "expected_category": "Security", "expected_sink_lines": [], "expected_source_lines": [], @@ -15006,7 +15641,34 @@ "real-repo-precision-2026-05-03" ], "disabled": false, - "notes": "Recall guard for the 2026-05-03 type-aware Go param filter. Even after ctx context.Context is dropped from unit.params, an id-shaped param keeps the unit on the hook (id-shape recognised before the framework-name allow-list)." + "notes": "Recall guard for the 2026-05-03 Go DAO-helper precision pass. After id-like scalar params are dropped from unit.params for non-route units, this fixture pins recall via gin route registration: r.GET(/items/:id, GetByID) promotes the unit to RouteHandler; function_params_route_handler keeps id-like scalar params and the rule fires on the bare-receiver DAO call." 
+ }, + { + "case_id": "go-safe-realrepo-019", + "file": "go/safe/safe_dao_helper_id_scalar.go", + "language": "go", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "N/A", + "provenance": "real-repo", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "go.auth.missing_ownership_check" + ], + "expected_severity": "NONE", + "expected_category": "N/A", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "auth", + "negative", + "real-repo-precision-2026-05-03" + ], + "disabled": false, + "notes": "Distilled from gitea models/actions/{run,run_job,runner,...}.go DAO helpers (~957 FPs). Pattern: (ctx context.Context, repoID, runID int64) signatures with bounded scalar id-like params calling internal DB helpers. Engine fix in src/auth_analysis/extract/common.rs::collect_param_names Go arm drops id-like scalar names from unit.params for non-route units (mirrors Python typed_parameter filter)." }, { "case_id": "py-auth-realrepo-001", @@ -15175,7 +15837,7 @@ "real-repo-precision-2026-04-27" ], "disabled": false, - "notes": "Vulnerable counterpart to ts-auth-realrepo-001: `userId = req.params.targetUserId` is a foreign id (URL param), not a session copy \u2014 the rule must still fire." + "notes": "Vulnerable counterpart to ts-auth-realrepo-001: `targetUserId` is a foreign id parameter (route-handed, not a session copy) \u2014 the rule must still fire on the qualified prisma.apiKey.deleteMany call. Updated 2026-05-03: pre-fix used bare `deleteApiKeysFromUserId(userId)` whose `delete` verb-prefix match fired despite no receiver evidence; post `receiver_is_simple_chain` gating, the fixture uses a qualified ORM call to test the canonical detection path." }, { "case_id": "js-auth-realrepo-001", @@ -15874,6 +16536,34 @@ "disabled": false, "notes": "Concatenated SQL passed to em.createQuery(...) 
\u2014 receiver-chain walk sees binary_expression at arg 0, refuses to synthesise sanitizer, structural sink fires. Regression guard for the JPA parameterised-execute fix." }, + { + "case_id": "java-safe-realrepo-openmrs-001", + "file": "java/safe/SafeJpaCriteriaQuery.java", + "language": "java", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "N/A", + "provenance": "real-repo-precision-2026-05-03", + "equivalence_tier": "exact", + "match_mode": "file_presence", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "cfg-unguarded-sink", + "taint-unsanitised-flow" + ], + "expected_severity": null, + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "jpa", + "criteria-api", + "real-repo-precision-2026-05-03" + ], + "disabled": false, + "notes": "JPA Criteria API: cb.createQuery(Foo.class) returns CriteriaQuery; session.createQuery(cq)/em.createQuery(cq) is safe by construction (parameterized SQL emitted). Engine maps CriteriaBuilder.createQuery via receiver-text recogniser to TypeKind::JpaCriteriaQuery, then sink_args_jpa_criteria_query_safe suppresses cfg-unguarded-sink at the executor site. Distilled from openmrs HibernateCohortDAO." + }, { "case_id": "py-auth-realrepo-005", "file": "python/safe/safe_fastapi_route_dependencies_auth.py", @@ -15975,8 +16665,8 @@ "expected_category": "Security", "expected_sink_lines": [ [ - 15, - 15 + 25, + 25 ] ], "expected_source_lines": [], @@ -15986,7 +16676,7 @@ "real-repo-precision-2026-04-29" ], "disabled": false, - "notes": "Vulnerable counterpart to py-auth-realrepo-005: same FastAPI route shape but no `dependencies=[Depends(...)]` keyword arg. Regression guard: the dependency-injection recogniser must not blanket-suppress every FastAPI route." + "notes": "Vulnerable counterpart to py-auth-realrepo-005: same FastAPI route shape but no `dependencies=[Depends(...)]` keyword arg. 
Regression guard: the dependency-injection recogniser must not blanket-suppress every FastAPI route. Updated 2026-05-03: pre-fix recall came from a member_chain quirk where `select(Connection).filter_by(...)` reduced to bare callee `filter_by` and prefix-matched the `filter` read indicator. Post `receiver_is_simple_chain` gating, the fixture uses a qualified `Connection.objects.filter(id=connection_id).delete()` shape — SQLAlchemy `select().filter_by(...)` chained-call detection is a deferred deep fix." }, { "case_id": "js-data_exfil-001", @@ -16727,6 +17417,315 @@ "disabled": false, "notes": "Distilled from airflow providers/google/tests/unit/google/cloud/hooks/test_dlp.py: pytest test method decorated with `@mock.patch(\"...\")` was being attached as a Flask `PATCH` route handler because bare_method_name(\"mock.patch\") == \"patch\". Fix: parse_flask_route_decorator short-circuits on known test-framework decorator vocabulary (mock.patch, unittest.mock.patch, monkeypatch.setattr, pytest.mark.parametrize)." }, + { + "case_id": "py-auth-realrepo-011", + "file": "python/safe/safe_bare_callee_no_receiver.py", + "language": "python", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "N/A", + "provenance": "real-repo", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "py.auth.missing_ownership_check" + ], + "expected_severity": null, + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "auth", + "django", + "real-repo-precision-2026-05-03" + ], + "disabled": false, + "notes": "Distilled from sentry tasks/statistical_detectors.py:743 (`org_ids = list({p.organization_id for p in projects})`), utils/query.py:90 (`events = list(method(...))`), api/helpers/group_index/delete.py (bare `delete_group_list`, `create_audit_entry`), seer/autofix/coding_agent.py (bare `update_coding_agent_state`). 
Bare-identifier callees `list`, `filter`, `update`, `create`, `add` are Python builtins or locally-defined helpers, not DB / ORM operations. Fix: classify_sink_class verb-name fallback now requires `receiver_is_simple_chain` (callee contains a non-chained `.`). Regression guard for: bare callees must not classify as DbCrossTenantRead / DbMutation." + }, + { + "case_id": "py-auth-realrepo-012", + "file": "python/safe/safe_local_set_update_no_orm.py", + "language": "python", + "is_vulnerable": false, + "vuln_class": "safe", + "cwe": "N/A", + "provenance": "real-repo", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "py.auth.missing_ownership_check" + ], + "expected_severity": null, + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "auth", + "django", + "real-repo-precision-2026-05-04" + ], + "disabled": false, + "notes": "Distilled from sentry api/helpers/teams.py::get_teams: bare-callee Python container constructors (`set()`, `dict()`, `defaultdict()`, `Counter()`, `list()`) bind a non-sink local collection. Subsequent `.update(..)` / `.add(..)` / item assignment must classify as InMemoryLocal, suppressing the false `py.auth.missing_ownership_check` finding. Fix: AuthAnalysisRules::is_non_sink_constructor_callee accepts bare callees matching non_sink_receiver_types; Python defaults populated with `set`/`dict`/`list`/`tuple`/`frozenset`/`defaultdict`/`OrderedDict`/`Counter`/`deque`/`ChainMap`/`namedtuple`; collect_non_sink_binding falls through to `left`/`right` field names; assignment / assignment_expression arm in collect_unit_state now wires the recogniser." 
+ }, + { + "case_id": "py-auth-realrepo-013", + "file": "python/auth/vuln_local_set_with_user_id_query.py", + "language": "python", + "is_vulnerable": true, + "vuln_class": "auth", + "cwe": "CWE-862", + "provenance": "real-repo", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "py.auth.missing_ownership_check" + ], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [ + [ + 22, + 22 + ], + [ + 28, + 28 + ] + ], + "expected_source_lines": [], + "tags": [ + "auth", + "django", + "real-repo-precision-2026-05-04" + ], + "disabled": false, + "notes": "Vulnerable counterpart to py-auth-realrepo-012: same bare-`set()` / `dict()` local container shape, but the helper *also* runs an id-targeted ORM `Project.objects.filter(id=team_id)` query whose filter param is a user-supplied id (no caller-scope-entity exemption applies). Recall guard: bare-callee constructor recogniser must only suppress the InMemoryLocal `.add` / `.update` calls — the id-targeted ORM filter must still fire `py.auth.missing_ownership_check`." 
+ }, + { + "case_id": "py-auth-realrepo-014", + "file": "python/auth/vuln_fastapi_route_no_dependencies_sqla.py", + "language": "python", + "is_vulnerable": true, + "vuln_class": "auth", + "cwe": "CWE-862", + "provenance": "real-repo", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "py.auth.missing_ownership_check" + ], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [ + [ + 27, + 27 + ] + ], + "expected_source_lines": [], + "tags": [ + "auth", + "fastapi", + "sqlalchemy", + "real-repo-precision-2026-05-04" + ], + "disabled": false, + "notes": "Distilled from airflow `airflow-core/src/airflow/api_fastapi/core_api/routes/public/connections.py`: `session.scalar(select(Connection).filter_by(conn_id=connection_id))` queryset chain. Pre-fix the chain reduced via `member_chain` to bare `[\"filter_by\"]` (Python tree-sitter `call` nodes use a `function` field not traversed by the Ruby/JS-style logic) and was suppressed by `receiver_is_simple_chain`'s bare-callee guard, blocking recall. Fix: (1) `member_chain` now traverses Python `call`'s `function` field; (2) the parent attribute branch appends `()` to last segment when its `object` is a call so `select(X).filter_by(...)` produces `[\"select()\", \"filter_by\"]`; (3) `AuthAnalysisRules::chain_root_is_db_query_builder` + per-language `db_query_builder_roots` (Python: `select`, `query`) anchors the chained-call shape to `DbCrossTenantRead`. Recall guard: missing_ownership_check must still fire on this airflow-style queryset chain when no `Depends(...)` auth dependency is declared." 
+ }, + { + "case_id": "py-auth-realrepo-015", + "file": "python/safe/safe_fastapi_route_security_scopes.py", + "language": "python", + "is_vulnerable": false, + "vuln_class": "auth", + "cwe": "CWE-862", + "provenance": "real-repo", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "py.auth.missing_ownership_check", + "py.auth.token_override_without_validation" + ], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "auth", + "fastapi", + "security", + "real-repo-precision-2026-05-04" + ], + "disabled": false, + "notes": "Distilled from airflow `airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py:101-117`: `dependencies=[Security(require_auth, scopes=[\"token:execution\", \"token:workload\"])]`. Pre-fix `is_depends_callee` only matched `Depends`, so `Security(...)` was ignored and the route fired `missing_ownership_check` even with the auth dep declared. Even after recognising the marker, `require_auth` is registered as a `LoginGuard`, which `has_prior_subject_auth` filters out. Fix: (1) `is_dep_marker_callee` recognises `Security` / `fastapi.Security` / `fastapi.params.Security`; (2) `unwrap_depends_call` returns `(CallSite, is_scoped_security)` and skips `keyword_argument` children when finding the first positional; (3) `inject_middleware_auth` promotes a scoped Security wrapper from `LoginGuard` to `AuthCheckKind::Other` so the route counts as authorized. Precision guard: route must NOT fire ownership / token-override findings when carrying a scoped `Security(...)` route-level dep." 
+ }, + { + "case_id": "py-auth-realrepo-016", + "file": "python/auth/vuln_fastapi_route_security_no_scopes.py", + "language": "python", + "is_vulnerable": true, + "vuln_class": "auth", + "cwe": "CWE-862", + "provenance": "real-repo", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "py.auth.missing_ownership_check" + ], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [ + [ + 34, + 34 + ] + ], + "expected_source_lines": [], + "tags": [ + "auth", + "fastapi", + "security", + "real-repo-precision-2026-05-04" + ], + "disabled": false, + "notes": "Recall counterpart to py-auth-realrepo-015. `Security(require_auth, scopes=[])` with an empty scope list is NOT promoted to `Other` — the OAuth2 scope semantic only fires when scopes is non-empty, so the wrapper falls back to bare login classification. Recall guard: `missing_ownership_check` must still fire on this id-targeted ORM filter; without conservative scope-emptiness gating, every empty-scopes route would over-suppress." 
+ }, + { + "case_id": "py-auth-realrepo-017", + "file": "python/safe/safe_fastapi_router_level_security_scopes.py", + "language": "python", + "is_vulnerable": false, + "vuln_class": "auth", + "cwe": "CWE-862", + "provenance": "real-repo", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "py.auth.missing_ownership_check", + "py.auth.token_override_without_validation" + ], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "auth", + "fastapi", + "router-level-security", + "real-repo-precision-2026-05-04" + ], + "disabled": false, + "notes": "Distilled from airflow `airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py:89-318`: `ti_id_router = VersionedAPIRouter(dependencies=[Security(require_auth, scopes=[\"ti:self\"])])` declares the auth dep once at the router constructor; every `@ti_id_router.(...)` route inherits it at runtime. Pre-fix the FastAPI dep extractor only walked the per-route decorator's `dependencies=[...]` kwarg; router-constructor `dependencies=` was dropped, so every attached route without inline deps fired `missing_ownership_check` + `token_override_without_validation` despite being authorized. Fix: `collect_router_level_dependencies` walks module-level ` = APIRouter(...)` / `VersionedAPIRouter(...)` / `FastAPI(...)` assignments and captures `dependencies=[...]` keyed by the router var name; `router_prefix_from_decorator` extracts the receiver from `@.(...)` and merges router-level deps into the per-route middleware list. A scoped Security wrapper additionally synthesises matching `TokenExpiry` + `TokenRecipient` checks (the JWT-validation semantics — JWT signature includes expiry, scopes encode recipient binding) so the token-override rule recognises the route too. 
Precision guard: route must NOT fire ownership / token-override findings." + }, + { + "case_id": "py-auth-realrepo-018", + "file": "python/auth/vuln_fastapi_router_no_dependencies.py", + "language": "python", + "is_vulnerable": true, + "vuln_class": "auth", + "cwe": "CWE-862", + "provenance": "real-repo", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "py.auth.missing_ownership_check" + ], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [ + [ + 35, + 35 + ] + ], + "expected_source_lines": [], + "tags": [ + "auth", + "fastapi", + "router-level-security", + "real-repo-precision-2026-05-04" + ], + "disabled": false, + "notes": "Recall counterpart to py-auth-realrepo-017. Bare `router = VersionedAPIRouter()` with no `dependencies=` kwarg — attached routes that do not supply inline deps are genuinely unauthorized. The router-level extractor must NOT enter a fake key for routers without router-level deps; the gate (`if deps.is_empty() { continue; }` in `collect_router_level_dependencies`) ensures absence is preserved. Recall guard: `missing_ownership_check` must still fire on the id-targeted write." 
+ }, + { + "case_id": "py-auth-realrepo-019", + "file": "python/safe/safe_caller_scope_helper_under_authorized_route.py", + "language": "python", + "is_vulnerable": false, + "vuln_class": "auth", + "cwe": "CWE-862", + "provenance": "real-repo", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [ + "py.auth.missing_ownership_check", + "py.auth.token_override_without_validation" + ], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [], + "expected_source_lines": [], + "tags": [ + "auth", + "fastapi", + "caller-scope-ipa", + "real-repo-precision-2026-05-04" + ], + "disabled": false, + "notes": "Distilled from airflow `airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py:516-628`: the route handler `ti_update_state` is route-level authorized via the `ti_id_router = APIRouter(dependencies=[Security(require_auth, scopes=[\"ti:self\"])])` declaration (closed by session-0010), then delegates the `session.add(...)` sink to a private helper `_create_state_update`. Pre-fix the helper fired both `missing_ownership_check` and `token_override_without_validation` because `check_ownership_gaps` is scoped per AnalysisUnit — the caller's route-level auth check did not propagate to the callee. Phase 1 caller-scope IPA fix (`apply_caller_scope_propagation` in `src/auth_analysis/mod.rs`) walks the call graph DOWN: when every in-file caller of a helper carries route-level non-Login auth, the helper inherits those checks via synthetic `is_route_level=true` AuthChecks anchored at the callee's start line. Soundness: requires every caller authorized; refuses on dead helpers, mixed-caller helpers, login-only routes." 
+ }, + { + "case_id": "py-auth-realrepo-020", + "file": "python/auth/vuln_caller_scope_helper_under_bare_route.py", + "language": "python", + "is_vulnerable": true, + "vuln_class": "auth", + "cwe": "CWE-862", + "provenance": "real-repo", + "equivalence_tier": "exact", + "match_mode": "rule_match", + "expected_rule_ids": [ + "py.auth.missing_ownership_check" + ], + "allowed_alternative_rule_ids": [], + "forbidden_rule_ids": [], + "expected_severity": "MEDIUM", + "expected_category": "Security", + "expected_sink_lines": [ + [ + 31, + 31 + ] + ], + "expected_source_lines": [], + "tags": [ + "auth", + "fastapi", + "caller-scope-ipa", + "real-repo-precision-2026-05-04" + ], + "disabled": false, + "notes": "Recall counterpart to py-auth-realrepo-019. Same shape but bare `router = APIRouter()` (no Security dep at the boundary). The helper `_create_state_update` is reached from a route handler with no authorization; `apply_caller_scope_propagation`'s soundness rule refuses to authorize the helper because no caller carries route-level non-Login auth. Recall guard: `missing_ownership_check` must still fire on the helper's `session.add` sink." 
+ }, { "case_id": "cve-ts-2026-25544-vulnerable", "file": "cve_corpus/typescript/CVE-2026-25544/vulnerable.ts", diff --git a/tests/benchmark/results/latest.json b/tests/benchmark/results/latest.json index 7ef09275..acb77cee 100644 --- a/tests/benchmark/results/latest.json +++ b/tests/benchmark/results/latest.json @@ -1,6 +1,6 @@ { "benchmark_version": "1.0", - "timestamp": "2026-05-03T17:00:35Z", + "timestamp": "2026-05-04T17:11:50Z", "scanner_version": "0.6.1", "scanner_config": { "analysis_mode": "Full", @@ -9,10 +9,10 @@ "state_analysis_enabled": true, "worker_threads": 1 }, - "ground_truth_hash": "sha256:1d6ed97196d3ff0844320a79ac607983245dd73af5455bcf77f6ac6a212c5e45", - "corpus_size": 533, - "cases_run": 532, - "cases_skipped": 1, + "ground_truth_hash": "sha256:414494ab1b6881a9b78eca38e26561231f78767480399fda73a477e23a9fcbaa", + "corpus_size": 565, + "cases_run": 562, + "cases_skipped": 3, "outcomes": [ { "case_id": "c-buf-001", @@ -1656,6 +1656,40 @@ "security_finding_count": 1, "non_security_finding_count": 0 }, + { + "case_id": "cve-js-2026-42353-patched", + "file": "cve_corpus/javascript/CVE-2026-42353/patched.js", + "language": "javascript", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, + { + "case_id": "cve-js-2026-42353-vulnerable", + "file": "cve_corpus/javascript/CVE-2026-42353/vulnerable.js", + "language": "javascript", + "vuln_class": "path_traversal", + "is_vulnerable": true, + "outcome_file_level": "TP", + "outcome_rule_level": "TP", + "outcome_location_level": null, + "matched_rule_ids": [ + "taint-unsanitised-flow (source 44:9)" + ], + "unexpected_rule_ids": [], + "all_finding_ids": [ + "taint-unsanitised-flow (source 44:9)" + ], + "security_finding_count": 1, + "non_security_finding_count": 0 + }, { 
"case_id": "cve-php-2017-9841-patched", "file": "cve_corpus/php/CVE-2017-9841/patched.php", @@ -1728,6 +1762,43 @@ "security_finding_count": 2, "non_security_finding_count": 0 }, + { + "case_id": "cve-php-2026-33486-patched", + "file": "cve_corpus/php/CVE-2026-33486/patched.php", + "language": "php", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, + { + "case_id": "cve-php-2026-33486-vulnerable", + "file": "cve_corpus/php/CVE-2026-33486/vulnerable.php", + "language": "php", + "vuln_class": "ssrf", + "is_vulnerable": true, + "outcome_file_level": "TP", + "outcome_rule_level": "TP", + "outcome_location_level": "TP", + "matched_rule_ids": [ + "taint-unsanitised-flow (source 40:9)" + ], + "unexpected_rule_ids": [ + "state-resource-leak" + ], + "all_finding_ids": [ + "state-resource-leak", + "taint-unsanitised-flow (source 40:9)" + ], + "security_finding_count": 2, + "non_security_finding_count": 0 + }, { "case_id": "cve-py-2017-18342-patched", "file": "cve_corpus/python/CVE-2017-18342/patched.py", @@ -1800,6 +1871,113 @@ "security_finding_count": 2, "non_security_finding_count": 0 }, + { + "case_id": "cve-py-2023-6568-patched", + "file": "cve_corpus/python/CVE-2023-6568/patched.py", + "language": "python", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, + { + "case_id": "cve-py-2023-6568-vulnerable", + "file": "cve_corpus/python/CVE-2023-6568/vulnerable.py", + "language": "python", + "vuln_class": "xss", + "is_vulnerable": true, + "outcome_file_level": "TP", + "outcome_rule_level": 
"TP", + "outcome_location_level": "TP", + "matched_rule_ids": [ + "py.xss.make_response_format", + "taint-unsanitised-flow (source 41:20)" + ], + "unexpected_rule_ids": [], + "all_finding_ids": [ + "py.xss.make_response_format", + "taint-unsanitised-flow (source 41:20)" + ], + "security_finding_count": 2, + "non_security_finding_count": 0 + }, + { + "case_id": "cve-py-2024-21513-patched", + "file": "cve_corpus/python/CVE-2024-21513/patched.py", + "language": "python", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, + { + "case_id": "cve-py-2024-21513-vulnerable", + "file": "cve_corpus/python/CVE-2024-21513/vulnerable.py", + "language": "python", + "vuln_class": "code_exec", + "is_vulnerable": true, + "outcome_file_level": "TP", + "outcome_rule_level": "TP", + "outcome_location_level": "TP", + "matched_rule_ids": [ + "py.code_exec.eval" + ], + "unexpected_rule_ids": [ + "cfg-unguarded-sink" + ], + "all_finding_ids": [ + "cfg-unguarded-sink", + "py.code_exec.eval" + ], + "security_finding_count": 2, + "non_security_finding_count": 0 + }, + { + "case_id": "cve-py-2024-23334-patched", + "file": "cve_corpus/python/CVE-2024-23334/patched.py", + "language": "python", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, + { + "case_id": "cve-py-2024-23334-vulnerable", + "file": "cve_corpus/python/CVE-2024-23334/vulnerable.py", + "language": "python", + "vuln_class": "path_traversal", + "is_vulnerable": true, + "outcome_file_level": "TP", + "outcome_rule_level": "TP", + "outcome_location_level": 
"TP", + "matched_rule_ids": [ + "taint-unsanitised-flow (source 45:9)" + ], + "unexpected_rule_ids": [], + "all_finding_ids": [ + "taint-unsanitised-flow (source 45:9)" + ], + "security_finding_count": 1, + "non_security_finding_count": 0 + }, { "case_id": "cve-py-2025-69662-patched", "file": "cve_corpus/python/CVE-2025-69662/patched.py", @@ -3084,6 +3262,21 @@ "security_finding_count": 0, "non_security_finding_count": 0 }, + { + "case_id": "go-safe-realrepo-019", + "file": "go/safe/safe_dao_helper_id_scalar.go", + "language": "go", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, { "case_id": "go-sqli-001", "file": "go/sqli/sqli_concat.go", @@ -3811,6 +4004,21 @@ "security_finding_count": 0, "non_security_finding_count": 0 }, + { + "case_id": "java-safe-realrepo-openmrs-001", + "file": "java/safe/SafeJpaCriteriaQuery.java", + "language": "java", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, { "case_id": "java-safe-stmt-execute-validated", "file": "java/safe/safe_statement_execute_pattern_validated.java", @@ -4260,6 +4468,25 @@ "security_finding_count": 1, "non_security_finding_count": 0 }, + { + "case_id": "js-path_traversal-ternary-source-001", + "file": "javascript/path_traversal/path_traversal_ternary_source.js", + "language": "javascript", + "vuln_class": "path_traversal", + "is_vulnerable": true, + "outcome_file_level": "TP", + "outcome_rule_level": "TP", + "outcome_location_level": null, + "matched_rule_ids": [ + "taint-unsanitised-flow (source 15:29)" + ], + 
"unexpected_rule_ids": [], + "all_finding_ids": [ + "taint-unsanitised-flow (source 15:29)" + ], + "security_finding_count": 1, + "non_security_finding_count": 0 + }, { "case_id": "js-pathprune-safe-001", "file": "javascript/path_pruning/safe_early_return.js", @@ -4575,6 +4802,21 @@ "security_finding_count": 0, "non_security_finding_count": 0 }, + { + "case_id": "js-safe-ternary-const-branches", + "file": "javascript/safe/safe_ternary_const_branches.js", + "language": "javascript", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, { "case_id": "js-sqli-001", "file": "javascript/sqli/sqli_concat.js", @@ -4960,6 +5202,29 @@ "security_finding_count": 3, "non_security_finding_count": 0 }, + { + "case_id": "php-deser-003", + "file": "php/deser/deser_unserialize_method_named_unserialize_with_user_input.php", + "language": "php", + "vuln_class": "deser", + "is_vulnerable": true, + "outcome_file_level": "TP", + "outcome_rule_level": "TP", + "outcome_location_level": null, + "matched_rule_ids": [ + "php.deser.unserialize", + "taint-unsanitised-flow (source 13:38)", + "php.deser.unserialize" + ], + "unexpected_rule_ids": [], + "all_finding_ids": [ + "php.deser.unserialize", + "taint-unsanitised-flow (source 13:38)", + "php.deser.unserialize" + ], + "security_finding_count": 3, + "non_security_finding_count": 0 + }, { "case_id": "php-interproc-001", "file": "php/interprocedural/interproc_taint_propagation.php", @@ -5295,6 +5560,36 @@ "security_finding_count": 0, "non_security_finding_count": 0 }, + { + "case_id": "php-safe-020", + "file": "php/safe/safe_serializable_magic_method_unserialize.php", + "language": "php", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + 
"outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, + { + "case_id": "php-safe-camelcase-validator-001", + "file": "php/safe/safe_camelcase_validator_negated.php", + "language": "php", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, { "case_id": "php-safe-filter-001", "file": "php/safe/safe_filter_input.php", @@ -5392,6 +5687,28 @@ "security_finding_count": 2, "non_security_finding_count": 0 }, + { + "case_id": "php-ssrf-002", + "file": "php/ssrf/ssrf_class_method_fopen.php", + "language": "php", + "vuln_class": "ssrf", + "is_vulnerable": true, + "outcome_file_level": "TP", + "outcome_rule_level": "TP", + "outcome_location_level": "TP", + "matched_rule_ids": [ + "taint-unsanitised-flow (source 14:9)" + ], + "unexpected_rule_ids": [ + "cfg-resource-leak" + ], + "all_finding_ids": [ + "cfg-resource-leak", + "taint-unsanitised-flow (source 14:9)" + ], + "security_finding_count": 2, + "non_security_finding_count": 0 + }, { "case_id": "php-ssrf-safe-001", "file": "php/ssrf/safe_ssrf_hardcoded.php", @@ -5639,6 +5956,184 @@ "security_finding_count": 0, "non_security_finding_count": 0 }, + { + "case_id": "py-auth-realrepo-011", + "file": "python/safe/safe_bare_callee_no_receiver.py", + "language": "python", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, + { + "case_id": "py-auth-realrepo-012", + "file": "python/safe/safe_local_set_update_no_orm.py", + 
"language": "python", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, + { + "case_id": "py-auth-realrepo-013", + "file": "python/auth/vuln_local_set_with_user_id_query.py", + "language": "python", + "vuln_class": "auth", + "is_vulnerable": true, + "outcome_file_level": "TP", + "outcome_rule_level": "TP", + "outcome_location_level": "TP", + "matched_rule_ids": [ + "py.auth.missing_ownership_check", + "py.auth.missing_ownership_check" + ], + "unexpected_rule_ids": [], + "all_finding_ids": [ + "py.auth.missing_ownership_check", + "py.auth.missing_ownership_check" + ], + "security_finding_count": 2, + "non_security_finding_count": 0 + }, + { + "case_id": "py-auth-realrepo-014", + "file": "python/auth/vuln_fastapi_route_no_dependencies_sqla.py", + "language": "python", + "vuln_class": "auth", + "is_vulnerable": true, + "outcome_file_level": "TP", + "outcome_rule_level": "TP", + "outcome_location_level": "TP", + "matched_rule_ids": [ + "py.auth.missing_ownership_check" + ], + "unexpected_rule_ids": [], + "all_finding_ids": [ + "py.auth.missing_ownership_check" + ], + "security_finding_count": 1, + "non_security_finding_count": 0 + }, + { + "case_id": "py-auth-realrepo-015", + "file": "python/safe/safe_fastapi_route_security_scopes.py", + "language": "python", + "vuln_class": "auth", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, + { + "case_id": "py-auth-realrepo-016", + "file": "python/auth/vuln_fastapi_route_security_no_scopes.py", + "language": "python", + "vuln_class": "auth", + "is_vulnerable": true, + 
"outcome_file_level": "TP", + "outcome_rule_level": "TP", + "outcome_location_level": "TP", + "matched_rule_ids": [ + "py.auth.missing_ownership_check" + ], + "unexpected_rule_ids": [], + "all_finding_ids": [ + "py.auth.missing_ownership_check" + ], + "security_finding_count": 1, + "non_security_finding_count": 0 + }, + { + "case_id": "py-auth-realrepo-017", + "file": "python/safe/safe_fastapi_router_level_security_scopes.py", + "language": "python", + "vuln_class": "auth", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, + { + "case_id": "py-auth-realrepo-018", + "file": "python/auth/vuln_fastapi_router_no_dependencies.py", + "language": "python", + "vuln_class": "auth", + "is_vulnerable": true, + "outcome_file_level": "TP", + "outcome_rule_level": "TP", + "outcome_location_level": "TP", + "matched_rule_ids": [ + "py.auth.missing_ownership_check" + ], + "unexpected_rule_ids": [ + "py.auth.token_override_without_validation" + ], + "all_finding_ids": [ + "py.auth.missing_ownership_check", + "py.auth.token_override_without_validation" + ], + "security_finding_count": 2, + "non_security_finding_count": 0 + }, + { + "case_id": "py-auth-realrepo-019", + "file": "python/safe/safe_caller_scope_helper_under_authorized_route.py", + "language": "python", + "vuln_class": "auth", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, + { + "case_id": "py-auth-realrepo-020", + "file": "python/auth/vuln_caller_scope_helper_under_bare_route.py", + "language": "python", + "vuln_class": "auth", + "is_vulnerable": true, + "outcome_file_level": "TP", + 
"outcome_rule_level": "TP", + "outcome_location_level": "TP", + "matched_rule_ids": [ + "py.auth.missing_ownership_check" + ], + "unexpected_rule_ids": [ + "py.auth.token_override_without_validation" + ], + "all_finding_ids": [ + "py.auth.missing_ownership_check", + "py.auth.token_override_without_validation" + ], + "security_finding_count": 2, + "non_security_finding_count": 0 + }, { "case_id": "py-cmdi-001", "file": "python/cmdi/cmdi_direct.py", @@ -5962,6 +6457,25 @@ "security_finding_count": 1, "non_security_finding_count": 0 }, + { + "case_id": "py-path_traversal-no-relative-to", + "file": "python/path_traversal/path_traversal_no_relative_to.py", + "language": "python", + "vuln_class": "path_traversal", + "is_vulnerable": true, + "outcome_file_level": "TP", + "outcome_rule_level": "TP", + "outcome_location_level": "TP", + "matched_rule_ids": [ + "taint-unsanitised-flow (source 11:15)" + ], + "unexpected_rule_ids": [], + "all_finding_ids": [ + "taint-unsanitised-flow (source 11:15)" + ], + "security_finding_count": 1, + "non_security_finding_count": 0 + }, { "case_id": "py-pathprune-safe-001", "file": "python/path_pruning/safe_early_return.py", @@ -6187,6 +6701,21 @@ "security_finding_count": 0, "non_security_finding_count": 0 }, + { + "case_id": "py-safe-relative-to-validator", + "file": "python/safe/safe_relative_to_validator.py", + "language": "python", + "vuln_class": "safe", + "is_vulnerable": false, + "outcome_file_level": "TN", + "outcome_rule_level": "TN", + "outcome_location_level": null, + "matched_rule_ids": [], + "unexpected_rule_ids": [], + "all_finding_ids": [], + "security_finding_count": 0, + "non_security_finding_count": 0 + }, { "case_id": "py-sqli-001", "file": "python/sqli/sqli_concat.py", @@ -6343,11 +6872,14 @@ "matched_rule_ids": [ "taint-unsanitised-flow (source 4:12)" ], - "unexpected_rule_ids": [], + "unexpected_rule_ids": [ + "py.xss.make_response_format" + ], "all_finding_ids": [ + "py.xss.make_response_format", 
"taint-unsanitised-flow (source 4:12)" ], - "security_finding_count": 1, + "security_finding_count": 2, "non_security_finding_count": 0 }, { @@ -8494,9 +9026,11 @@ "outcome_location_level": null, "matched_rule_ids": [], "unexpected_rule_ids": [], - "all_finding_ids": [], + "all_finding_ids": [ + "ts.quality.any_annotation" + ], "security_finding_count": 0, - "non_security_finding_count": 0 + "non_security_finding_count": 1 }, { "case_id": "ts-auth-realrepo-002", @@ -8512,12 +9046,10 @@ ], "unexpected_rule_ids": [], "all_finding_ids": [ - "ts.quality.any_annotation", - "ts.quality.any_annotation", "js.auth.missing_ownership_check" ], "security_finding_count": 1, - "non_security_finding_count": 2 + "non_security_finding_count": 0 }, { "case_id": "ts-auth-realrepo-003", @@ -9512,19 +10044,19 @@ } ], "aggregate_file_level": { - "tp": 261, + "tp": 275, "fp": 0, "fn_": 0, - "tn": 271, + "tn": 287, "precision": 1.0, "recall": 1.0, "f1": 1.0 }, "aggregate_rule_level": { - "tp": 261, + "tp": 275, "fp": 0, "fn_": 0, - "tn": 271, + "tn": 287, "precision": 1.0, "recall": 1.0, "f1": 1.0 @@ -9552,7 +10084,7 @@ "tp": 30, "fp": 0, "fn_": 0, - "tn": 35, + "tn": 36, "precision": 1.0, "recall": 1.0, "f1": 1.0 @@ -9561,34 +10093,34 @@ "tp": 23, "fp": 0, "fn_": 0, - "tn": 22, + "tn": 23, "precision": 1.0, "recall": 1.0, "f1": 1.0 }, "javascript": { - "tp": 23, + "tp": 25, "fp": 0, "fn_": 0, - "tn": 30, + "tn": 32, "precision": 1.0, "recall": 1.0, "f1": 1.0 }, "php": { - "tp": 19, + "tp": 22, "fp": 0, "fn_": 0, - "tn": 20, + "tn": 23, "precision": 1.0, "recall": 1.0, "f1": 1.0 }, "python": { - "tp": 29, + "tp": 38, "fp": 0, "fn_": 0, - "tn": 32, + "tn": 41, "precision": 1.0, "recall": 1.0, "f1": 1.0 @@ -9623,10 +10155,10 @@ }, "by_vuln_class": { "auth": { - "tp": 20, + "tp": 25, "fp": 0, "fn_": 0, - "tn": 0, + "tn": 3, "precision": 1.0, "recall": 1.0, "f1": 1.0 @@ -9650,7 +10182,7 @@ "f1": 1.0 }, "code_exec": { - "tp": 4, + "tp": 5, "fp": 0, "fn_": 0, "tn": 0, @@ -9686,7 +10218,7 @@ 
"f1": 1.0 }, "deser": { - "tp": 8, + "tp": 9, "fp": 0, "fn_": 0, "tn": 0, @@ -9731,7 +10263,7 @@ "f1": 1.0 }, "path_traversal": { - "tp": 28, + "tp": 32, "fp": 0, "fn_": 0, "tn": 0, @@ -9761,7 +10293,7 @@ "tp": 0, "fp": 0, "fn_": 0, - "tn": 271, + "tn": 284, "precision": 1.0, "recall": 1.0, "f1": 1.0 @@ -9794,7 +10326,7 @@ "f1": 1.0 }, "ssrf": { - "tp": 30, + "tp": 32, "fp": 0, "fn_": 0, "tn": 0, @@ -9803,7 +10335,7 @@ "f1": 1.0 }, "xss": { - "tp": 23, + "tp": 24, "fp": 0, "fn_": 0, "tn": 0, @@ -9814,31 +10346,31 @@ }, "by_confidence": { ">=High": { - "tp": 88, - "fp": 100, - "fn_": 173, - "tn": 171, - "precision": 0.46808510638297873, - "recall": 0.3371647509578544, - "f1": 0.3919821826280624 + "tp": 85, + "fp": 114, + "fn_": 190, + "tn": 173, + "precision": 0.4271356783919598, + "recall": 0.3090909090909091, + "f1": 0.3586497890295359 }, ">=Low": { - "tp": 90, - "fp": 120, - "fn_": 171, - "tn": 151, - "precision": 0.42857142857142855, - "recall": 0.3448275862068966, - "f1": 0.3821656050955414 + "tp": 86, + "fp": 142, + "fn_": 189, + "tn": 145, + "precision": 0.37719298245614036, + "recall": 0.31272727272727274, + "f1": 0.341948310139165 }, ">=Medium": { - "tp": 90, - "fp": 116, - "fn_": 171, - "tn": 155, - "precision": 0.4368932038834951, - "recall": 0.3448275862068966, - "f1": 0.38543897216274087 + "tp": 86, + "fp": 133, + "fn_": 189, + "tn": 154, + "precision": 0.3926940639269406, + "recall": 0.31272727272727274, + "f1": 0.3481781376518218 } } } \ No newline at end of file diff --git a/tests/fastapi_cross_file_include_router_tests.rs b/tests/fastapi_cross_file_include_router_tests.rs new file mode 100644 index 00000000..2267ee6c --- /dev/null +++ b/tests/fastapi_cross_file_include_router_tests.rs @@ -0,0 +1,53 @@ +//! Cross-file FastAPI `include_router(child)` parent-dep propagation. +//! +//! Distilled from airflow +//! `airflow-core/src/airflow/api_fastapi/execution_api/routes/`: +//! `__init__.py` declares +//! 
`authenticated_router = VersionedAPIRouter(dependencies=[Security(require_auth)])` +//! and lifts every per-file child router via +//! `authenticated_router.include_router(.router, ...)`. FastAPI's +//! runtime propagates the parent's `dependencies=[...]` onto every route +//! attached to the child router, including bare child routers declared +//! without inline deps. +//! +//! Pre-fix: per-file router-dep extractor only saw inline declarations, +//! so bare child routers (`router = VersionedAPIRouter()`) fired +//! `missing_ownership_check` / `token_override_without_validation` +//! despite being authorized via the cross-file `include_router` chain. +//! +//! Post-fix: pass 1 persists per-file `PerFileRouterFacts` (router-level +//! deps + include_router edges) into +//! `GlobalSummaries.router_facts_by_module`; pass 2 resolves the +//! cross-file lift via `resolve_cross_file_router_deps_for_file` and +//! pre-populates `AuthorizationModel.cross_file_router_deps` before the +//! FlaskExtractor runs. Cross-file `Security(...)` markers are flagged +//! scoped-equivalent (architectural intent of include_router auth +//! scoping), so `inject_middleware_auth` promotes the kind to `Other` +//! and ownership checks see the route as authorized. +//! +//! Recall guard: `public_health.py` is attached to `execution_api_router` +//! which has NO `dependencies=[...]` kwarg. Routes there are genuinely +//! unauthorized — `missing_ownership_check` must still fire. Without +//! this guard, an over-broad cross-file lift (e.g. blanket "every +//! include_router target inherits any parent's auth") would silently +//! suppress real findings. 
+ +mod common; + +use common::{scan_fixture_dir, validate_expectations}; +use nyx_scanner::utils::config::AnalysisMode; +use std::path::{Path, PathBuf}; + +fn fixture_path(name: &str) -> PathBuf { + Path::new(env!("CARGO_MANIFEST_DIR")) + .join("tests") + .join("fixtures") + .join(name) +} + +#[test] +fn fastapi_cross_file_include_router_lifts_parent_security_onto_child_router() { + let dir = fixture_path("auth_analysis_fastapi_cross_file_include_router"); + let diags = scan_fixture_dir(&dir, AnalysisMode::Full); + validate_expectations(&diags, &dir); +} diff --git a/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/expectations.json b/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/expectations.json new file mode 100644 index 00000000..45d65f7b --- /dev/null +++ b/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/expectations.json @@ -0,0 +1,33 @@ +{ + "required_findings": [ + { "id_prefix": "py.auth.missing_ownership_check", "min_count": 1 } + ], + "forbidden_findings": [ + { + "id_prefix": "py.auth.missing_ownership_check", + "file_glob": "**/task_instances.py" + }, + { + "id_prefix": "py.auth.missing_ownership_check", + "file_glob": "**/dag_runs.py" + }, + { + "id_prefix": "py.auth.token_override_without_validation", + "file_glob": "**/task_instances.py" + }, + { + "id_prefix": "py.auth.token_override_without_validation", + "file_glob": "**/dag_runs.py" + } + ], + "noise_budget": { + "max_total_findings": 8, + "max_high_findings": 4 + }, + "performance_expectations": { + "max_ms_no_index": 1500, + "max_ms_index_cold": 2000, + "max_ms_index_warm": 800, + "ci_mode": "lenient" + } +} diff --git a/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/__init__.py b/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/__init__.py new file mode 100644 index 00000000..1404833d --- /dev/null +++ b/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/__init__.py @@ -0,0 +1,30 @@ +# 
Distilled from airflow `airflow-core/src/airflow/api_fastapi/execution_api/routes/__init__.py`. +# Parent file declares an authorized router carrying scoped Security deps, +# then attaches every per-file child router via `include_router(...)`. +# FastAPI runtime lifts the parent's `dependencies=[...]` onto every route +# attached to the child router — including bare child routers declared +# without inline deps — so routes inside child files inherit the auth +# automatically. +# +# Pre-fix the per-file router-dep extractor only saw inline declarations; +# bare child routers fired `missing_ownership_check` / +# `token_override_without_validation` despite being authorized via the +# `include_router` parent. The cross-file router-fact index resolves the +# parent-child lift at pass 2 entry. +from cadwyn import VersionedAPIRouter +from fastapi import APIRouter, Security + +from . import task_instances, dag_runs, public_health +from .security import require_auth + +execution_api_router = APIRouter() +execution_api_router.include_router(public_health.router, prefix="/health", tags=["Health"]) + +# All routes attached to this router are authenticated via Security(require_auth). 
+authenticated_router = VersionedAPIRouter(dependencies=[Security(require_auth)]) +authenticated_router.include_router( + task_instances.router, prefix="/task-instances", tags=["Task Instances"] +) +authenticated_router.include_router(dag_runs.router, prefix="/dag-runs", tags=["Dag Runs"]) + +execution_api_router.include_router(authenticated_router) diff --git a/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/dag_runs.py b/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/dag_runs.py new file mode 100644 index 00000000..d241e834 --- /dev/null +++ b/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/dag_runs.py @@ -0,0 +1,30 @@ +"""Second bare child router — same shape as task_instances.py.""" +from typing import Annotated + +from fastapi import Body +from cadwyn import VersionedAPIRouter + +router = VersionedAPIRouter() + + +@router.put("/{dag_run_id}/clear") +def clear_dag_run( + dag_run_id: str, + body: Annotated[dict, Body()], +): + """Bare-child route — auth via parent's include_router lift.""" + session = _get_session() + session.add( + DagRunRow(dag_run_id=dag_run_id, cleared=body.get("clear", False)) + ) + session.commit() + + +def _get_session(): + raise NotImplementedError + + +class DagRunRow: + def __init__(self, dag_run_id: str, cleared: bool) -> None: + self.dag_run_id = dag_run_id + self.cleared = cleared diff --git a/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/public_health.py b/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/public_health.py new file mode 100644 index 00000000..4e989ed3 --- /dev/null +++ b/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/public_health.py @@ -0,0 +1,52 @@ +"""Public router — NOT attached via authenticated_router, no auth lift. 
+ +The parent file declares +`execution_api_router.include_router(public_health.router, prefix="/health")` +where `execution_api_router = APIRouter()` has NO dependencies. Every +route here is genuinely public — no inline auth, no cross-file lift. + +The vulnerability counterpart in this fixture: the route below writes +a row keyed by an id-like path param, with no auth covering it. +The auth analysis must still fire `missing_ownership_check` here — +recall guard for the cross-file resolution. If the cross-file lift +over-applies (e.g. blanket "any router covered by include_router gets +the parent's deps" without checking that the parent itself has deps), +this finding would silently disappear and we would lose the vuln +detection.""" +from typing import Annotated + +from fastapi import Body +from cadwyn import VersionedAPIRouter + +router = VersionedAPIRouter() + + +@router.put("/{log_id}/payload") +def public_update_log( + log_id: str, + body: Annotated[dict, Body()], +): + """Public route — no auth covers this id-targeted write. + + `log_id` is a path param the route accepted from the URL. The + write is keyed by that id with no ownership check — exactly the + shape `py.auth.missing_ownership_check` is designed to flag. 
+ """ + session = _get_session() + session.add( + HealthLogRow( + log_id=log_id, + payload=body.get("payload", ""), + ) + ) + session.commit() + + +def _get_session(): + raise NotImplementedError + + +class HealthLogRow: + def __init__(self, log_id: str, payload: str) -> None: + self.log_id = log_id + self.payload = payload diff --git a/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/security.py b/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/security.py new file mode 100644 index 00000000..ddcb850a --- /dev/null +++ b/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/security.py @@ -0,0 +1,13 @@ +"""Stub for the auth dependency callable referenced by the parent router.""" +from typing import Annotated + + +def require_auth(): + """Validates a bearer JWT, raises HTTPException(401) on failure. + + Real airflow uses a more elaborate version that talks to a JWT + validator and the token-recipient table; for this fixture the + declaration-only stub is enough — the auth analysis cares about + the route-level wrapper, not the body. + """ + return None diff --git a/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/task_instances.py b/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/task_instances.py new file mode 100644 index 00000000..fe6d2cbb --- /dev/null +++ b/tests/fixtures/auth_analysis_fastapi_cross_file_include_router/routes/task_instances.py @@ -0,0 +1,60 @@ +"""Bare child router — auth comes from `__init__.py` via include_router. + +Pre-fix: every `@router.(...)` route in this file fired +`missing_ownership_check` because `router = VersionedAPIRouter()` +declares no inline `dependencies=[...]`. The auth declaration lives +on `__init__.py`'s `authenticated_router = VersionedAPIRouter( +dependencies=[Security(require_auth)])` and is lifted onto this file +via `authenticated_router.include_router(task_instances.router)`. 
+ +Post-fix: cross-file router-fact resolution at pass 2 entry detects +the include_router edge targeting this file's `router` var, looks up +`authenticated_router`'s deps in the parent's `local_router_deps` +map, and folds them into this file's per-route auth attribution. +The route below must NOT fire `missing_ownership_check` / +`token_override_without_validation`.""" +from typing import Annotated + +from fastapi import Body +from cadwyn import VersionedAPIRouter + +from .security import require_auth as _require_auth_unused # noqa: F401 (parity with airflow) + +router = VersionedAPIRouter() + + +@router.patch("/{task_instance_id}/state") +def patch_task_instance_state( + task_instance_id: str, + body: Annotated[dict, Body()], +): + """Bare-child route — relies on parent router's Security(require_auth). + + Operations: writes a row keyed by user-supplied `task_instance_id`. + Without cross-file router-dep resolution this is the canonical FP + shape — the auth check lives in `__init__.py`, the sink lives here. + """ + new_state = body.get("state", "") + # Simulated session.add — write keyed by an id-like param the route + # accepted from the URL path. A bare in-file scan would mark this + # as missing_ownership_check on the assumption that `task_instance_id` + # is unauthorized user input. 
+ session = _get_session() + session.add( + TaskInstanceRow( + task_instance_id=task_instance_id, + state=new_state, + ) + ) + session.commit() + + +def _get_session(): + """Stub — supplies the session object for the write below.""" + raise NotImplementedError + + +class TaskInstanceRow: + def __init__(self, task_instance_id: str, state: str) -> None: + self.task_instance_id = task_instance_id + self.state = state diff --git a/tests/fixtures/patterns/python/positive.py b/tests/fixtures/patterns/python/positive.py index ba9d8380..15ba5890 100644 --- a/tests/fixtures/patterns/python/positive.py +++ b/tests/fixtures/patterns/python/positive.py @@ -50,6 +50,16 @@ def trigger_sql_fstring(cursor, user): def trigger_sqlalchemy_text_fstring(connection, user): connection.execute(text(f"SELECT * FROM users WHERE name = '{user}'")) +# py.xss.make_response_format +def trigger_make_response_fstring(request, make_response): + content_type = request.headers.get("Content-Type") + return make_response(f"Invalid content type: '{content_type}'", 400) + +# py.xss.make_response_format (concat variant) +def trigger_make_response_concat(request, make_response): + name = request.args.get("name") + return make_response("Hello " + name + "") + # py.crypto.md5 def trigger_md5(data): hashlib.md5(data) diff --git a/tests/pattern_tests.rs b/tests/pattern_tests.rs index f320d965..70737143 100644 --- a/tests/pattern_tests.rs +++ b/tests/pattern_tests.rs @@ -301,6 +301,9 @@ fn positive_python() { // py.sqli.text_format must fire on the SQLAlchemy text() shape. "py.sqli.execute_format", "py.sqli.text_format", + // CVE-2023-6568 (mlflow) reflected XSS via make_response f-string; + // also catches the `+`-concat shape in xss_reflected.py. + "py.xss.make_response_format", ], ); }