mirror of
https://github.com/dograh-hq/dograh.git
synced 2026-06-07 07:55:16 +02:00
test(mcp): guard instructions.py against tool drift
The MCP `instructions` hint is static and baked into the client prompt, while tool names, signatures, and error codes are discovered dynamically via tools/list. The two had drifted: instructions restated stale signatures and an error-code enum that omitted schema_validation and trigger_path_conflict. - Trim instructions.py to tool names + call order; stop restating signatures and error codes the dynamic surface already carries. - Document each tool's full error_code contract in the save_workflow and create_workflow docstrings (the descriptions shipped via tools/list). - Add test_mcp_instructions_drift.py: every tool named in the guide must be registered, and every error_code a tool returns must appear in its description. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
5762095edf
commit
8484e4bfaf
4 changed files with 170 additions and 30 deletions
|
|
@ -7,6 +7,14 @@ handling, hard constraints). Design-level per-field guidance belongs in
|
|||
each `PropertySpec.llm_hint`; it flows out through `get_node_type` and
|
||||
doesn't need to be repeated here.
|
||||
|
||||
Tool names, parameters, and per-tool `error_code` values are NOT
|
||||
authoritative here — they reach the model dynamically via `tools/list`
|
||||
from each tool's own signature and docstring. Reference tools by bare
|
||||
name and describe orchestration; do not restate signatures (they drift)
|
||||
or re-enumerate error codes (document those on the tool itself).
|
||||
`test_mcp_instructions_drift.py` fails if this guide names a tool that
|
||||
is not registered, or if a tool's error codes aren't in its docstring.
|
||||
|
||||
Extend based on real LLM failures — every bullet below ideally maps to a
|
||||
mistake the system has seen at least once.
|
||||
"""
|
||||
|
|
@ -17,22 +25,22 @@ You build and edit Dograh voice-AI workflows by emitting TypeScript that uses th
|
|||
## Call order
|
||||
|
||||
### Reading documentation
|
||||
1. `search_docs(query)` — use first for keyword or acronym lookup when the user is asking how Dograh works or how to configure something.
|
||||
2. `read_doc(path)` — fetch the full page once one result looks likely. Prefer this over reasoning from search summaries alone.
|
||||
3. `list_docs(path=None, depth=1)` — use when the user wants to browse a topic area or when search terms are too vague. Returned section paths feed back into `list_docs`; returned page paths feed into `read_doc`.
|
||||
1. `search_docs` — use first for keyword or acronym lookup when the user is asking how Dograh works or how to configure something.
|
||||
2. `read_doc` — fetch the full page once one result looks likely. Prefer this over reasoning from search summaries alone.
|
||||
3. `list_docs` — use when the user wants to browse a topic area or when search terms are too vague. Call it with no arguments for the top-level sections; returned section paths feed back into `list_docs`, returned page paths feed into `read_doc`.
|
||||
|
||||
### Editing an existing workflow
|
||||
1. `list_workflows` — locate the target workflow.
|
||||
2. `get_workflow_code(workflow_id)` — fetch the current source.
|
||||
3. (optional) `list_node_types` / `get_node_type(name)` — consult before adding or editing a node type whose fields aren't already visible in the current code.
|
||||
2. `get_workflow_code` — fetch the current source for that workflow.
|
||||
3. (optional) `list_node_types` / `get_node_type` — consult before adding or editing a node type whose fields aren't already visible in the current code.
|
||||
4. Mutate the code in place. Preserve existing nodes, edges, and variable names unless the task requires removing or renaming them.
|
||||
5. `save_workflow(workflow_id, code)` — persist as a new draft. The published version is untouched.
|
||||
5. `save_workflow` — persist as a new draft. The published version is untouched.
|
||||
|
||||
### Creating a new workflow
|
||||
1. Create a simple 1-node workflow with only `startCall`. The user can iteratively add complexity by editing it.
|
||||
2. `list_node_types` / `get_node_type(name)` — consult to learn the fields available on the node types you intend to use.
|
||||
2. `list_node_types` / `get_node_type` — consult to learn the fields available on the node types you intend to use.
|
||||
3. Author SDK TypeScript from scratch. The `new Workflow({ name: "..." })` call is required — `name` becomes the workflow's display name.
|
||||
4. `create_workflow(code)` — persists a new workflow as version 1 (published). Returns the new `workflow_id`. For subsequent edits use `save_workflow(workflow_id, code)` (which writes a draft).
|
||||
4. `create_workflow` — persists a new workflow as version 1 (published). Returns the new `workflow_id`. For subsequent edits use `save_workflow` (which writes a draft).
|
||||
|
||||
## Allowed source shape
|
||||
|
||||
|
|
@ -73,14 +81,7 @@ Example:
|
|||
|
||||
## Iterating on errors
|
||||
|
||||
`save_workflow` and `create_workflow` return one of:
|
||||
- `parse_error` — Disallowed construct (see grammar above) or malformed TypeScript.
|
||||
- `validation_error` — Node data failed spec validation (unknown field, missing required, wrong type, bad `options` value).
|
||||
- `graph_validation` — Structural rule broken (missing startCall, unreachable node, edge to/from wrong node type).
|
||||
- `missing_name` — (`create_workflow` only) `new Workflow({ name })` is absent or empty.
|
||||
- `bridge_error` — Internal; retry once, then surface to the user.
|
||||
|
||||
Every error carries `line` and `column`. Fix at that location and resubmit the **complete source** — this tool does not accept patches.
|
||||
A failed `save_workflow` / `create_workflow` returns a result with `saved`/`created` set to false, a machine-readable `error_code`, and a human-readable `error` message — carrying `line` and `column` when the problem is locatable in your source. The full set of `error_code` values and their meanings is documented on each tool (visible in its description). Read the `error` message, fix at the reported location, and resubmit the **complete source** — these tools do not accept patches. If a failure looks internal or transient rather than a problem with your code, retry once before surfacing it to the user.
|
||||
|
||||
## Field conventions
|
||||
|
||||
|
|
|
|||
|
|
@ -12,10 +12,10 @@ Execution flow mirrors `save_workflow`:
|
|||
4. Persist via `db_client.create_workflow` — workflow row + v1
|
||||
published definition in a single transaction.
|
||||
|
||||
Error codes surfaced to the LLM match `save_workflow`. An additional
|
||||
`missing_name` error is returned when the source omits
|
||||
`new Workflow({ name: "..." })` — the name is required and there is no
|
||||
prior workflow to fall back to.
|
||||
Each failure path returns an `error_code` via `_error_result`. Those
|
||||
codes and their meanings are documented in the `create_workflow`
|
||||
docstring (the description shipped to the LLM via `tools/list`); keep the
|
||||
two in sync — `test_mcp_instructions_drift.py` enforces it.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
|
@ -86,6 +86,22 @@ async def create_workflow(code: str) -> dict[str, Any]:
|
|||
On success the new workflow is published as version 1. Use
|
||||
`save_workflow(workflow_id, code)` for subsequent edits — those go to
|
||||
a draft.
|
||||
|
||||
On failure the result has `created: false`, a machine-readable
|
||||
`error_code`, and a human-readable `error` (with file:line:column
|
||||
where the problem is locatable). Resubmit the full corrected source —
|
||||
patches are not accepted. Possible `error_code` values:
|
||||
- `parse_error` — disallowed construct or malformed TypeScript.
|
||||
- `validation_error` — node data failed spec validation (unknown
|
||||
field, missing required, wrong type, option out of range).
|
||||
- `schema_validation` — wire-format (DTO) rejection; rare.
|
||||
- `graph_validation` — structural rule broken (e.g. no start node,
|
||||
unreachable node, edge to/from the wrong node type).
|
||||
- `missing_name` — `new Workflow({ name })` is absent or empty; the
|
||||
name is required and there is no prior workflow to fall back to.
|
||||
- `trigger_path_conflict` — a trigger node's path is already used by
|
||||
another workflow in this organization; rename it and resubmit.
|
||||
- `bridge_error` — internal/transient; retry once, then surface it.
|
||||
"""
|
||||
user = await authenticate_mcp_request()
|
||||
|
||||
|
|
|
|||
|
|
@ -10,16 +10,12 @@ Execution flow:
|
|||
4. Save as a new draft via `db_client.save_workflow_draft` — the
|
||||
published version stays intact, so edits are rollback-safe.
|
||||
|
||||
Error codes surfaced to the LLM:
|
||||
parse_error — TS parse failed or a disallowed construct was used
|
||||
validation_error — node data failed spec validation (unknown field,
|
||||
missing required, wrong type, option out of range)
|
||||
schema_validation — ReactFlowDTO Pydantic rejection (rare; parser bug)
|
||||
graph_validation — semantic graph rule broken (e.g. no start node)
|
||||
bridge_error — Node subprocess failed before returning JSON
|
||||
|
||||
All LLM-facing errors include file:line:column where available so the
|
||||
LLM can correct its code directly.
|
||||
Each failure path returns an `error_code` via `_error_result`. Those
|
||||
codes and their meanings are documented in the `save_workflow` docstring
|
||||
(the description shipped to the LLM via `tools/list`); keep the two in
|
||||
sync — `test_mcp_instructions_drift.py` enforces it. All LLM-facing
|
||||
errors include file:line:column where available so the LLM can correct
|
||||
its code directly.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
|
@ -91,6 +87,18 @@ async def save_workflow(workflow_id: int, code: str) -> dict[str, Any]:
|
|||
|
||||
On success the draft version is saved; the published version is
|
||||
untouched.
|
||||
|
||||
On failure the result has `saved: false`, a machine-readable
|
||||
`error_code`, and a human-readable `error` (with file:line:column
|
||||
where the problem is locatable). Resubmit the full corrected source —
|
||||
patches are not accepted. Possible `error_code` values:
|
||||
- `parse_error` — disallowed construct or malformed TypeScript.
|
||||
- `validation_error` — node data failed spec validation (unknown
|
||||
field, missing required, wrong type, option out of range).
|
||||
- `schema_validation` — wire-format (DTO) rejection; rare.
|
||||
- `graph_validation` — structural rule broken (e.g. no start node,
|
||||
unreachable node, edge to/from the wrong node type).
|
||||
- `bridge_error` — internal/transient; retry once, then surface it.
|
||||
"""
|
||||
user = await authenticate_mcp_request()
|
||||
|
||||
|
|
|
|||
115
api/tests/test_mcp_instructions_drift.py
Normal file
115
api/tests/test_mcp_instructions_drift.py
Normal file
|
|
@ -0,0 +1,115 @@
|
|||
"""Drift guards between the static MCP guide and the live tool surface.
|
||||
|
||||
`api/mcp_server/instructions.py` is free text baked into the client
|
||||
system prompt. It is *not* the authoritative description of the tools —
|
||||
names, signatures, and per-tool error codes reach the model dynamically
|
||||
via `tools/list`, derived from each tool's own function signature and
|
||||
docstring. These tests fail on the two classic drift modes:
|
||||
|
||||
1. The guide references a tool that is no longer registered (renamed or
|
||||
removed) — the model would be told to call something that 404s.
|
||||
2. A tool returns an `error_code` that is absent from the description it
|
||||
ships via `tools/list` — the model can't learn to recover from it.
|
||||
|
||||
Keep the guide about orchestration (call order, hard constraints) and let
|
||||
the tools describe themselves.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import re
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from api.mcp_server import instructions as instructions_module
|
||||
from api.mcp_server.server import mcp
|
||||
from api.mcp_server.tools import create_workflow as create_workflow_module
|
||||
from api.mcp_server.tools import save_workflow as save_workflow_module
|
||||
|
||||
# Every registered MCP tool name starts with one of these verbs. A
|
||||
# backticked snake_case token in the guide whose leading word is a verb is
|
||||
# treated as a tool reference; field/reference names like `tool_refs`,
|
||||
# `credential_ref`, or `pre_call_fetch` don't start with a verb and are
|
||||
# correctly ignored. Extend this only when a new tool introduces a new
|
||||
# leading verb (a missing verb under-checks, it never false-fails).
|
||||
_TOOL_VERB_PREFIXES = frozenset(
|
||||
{
|
||||
"search",
|
||||
"read",
|
||||
"list",
|
||||
"get",
|
||||
"save",
|
||||
"create",
|
||||
"update",
|
||||
"delete",
|
||||
"add",
|
||||
"remove",
|
||||
"set",
|
||||
}
|
||||
)
|
||||
|
||||
# A backtick immediately followed by a snake_case identifier (>= 1
|
||||
# underscore). Anchoring on the opening backtick captures the leading
|
||||
# identifier of a code span whether it is bare (`read_doc`) or a call
|
||||
# (`read_doc(path)`), while skipping DSL constructs like `wf.edge` or
|
||||
# `new Workflow` whose first char after the backtick isn't `[a-z_]`.
|
||||
_BACKTICKED_SNAKE_RE = re.compile(r"`([a-z][a-z0-9]*(?:_[a-z0-9]+)+)")
|
||||
|
||||
# Error codes are emitted as the first string arg to `_error_result(...)`.
|
||||
_ERROR_RESULT_LITERAL_RE = re.compile(r'_error_result\(\s*"([a-z_]+)"')
|
||||
# `parse_error` / `validation_error` are picked by a `code_key` ternary
|
||||
# rather than passed as a literal to `_error_result`, so match them too.
|
||||
_CODE_KEY_LITERAL_RE = re.compile(r'"(parse_error|validation_error)"')
|
||||
|
||||
|
||||
def _referenced_tool_names(text: str) -> set[str]:
|
||||
return {
|
||||
token
|
||||
for token in _BACKTICKED_SNAKE_RE.findall(text)
|
||||
if token.split("_", 1)[0] in _TOOL_VERB_PREFIXES
|
||||
}
|
||||
|
||||
|
||||
def _returned_error_codes(module) -> set[str]:
|
||||
source = Path(module.__file__).read_text(encoding="utf-8")
|
||||
return set(_ERROR_RESULT_LITERAL_RE.findall(source)) | set(
|
||||
_CODE_KEY_LITERAL_RE.findall(source)
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_guide_only_references_registered_tools():
|
||||
registered = {tool.name for tool in await mcp.list_tools()}
|
||||
referenced = _referenced_tool_names(instructions_module.DOGRAH_MCP_INSTRUCTIONS)
|
||||
|
||||
assert referenced, "no tool references extracted — the regex likely broke"
|
||||
unknown = sorted(referenced - registered)
|
||||
assert not unknown, (
|
||||
f"instructions.py references tools that are not registered: {unknown}. "
|
||||
f"Rename/remove the reference or register the tool. "
|
||||
f"Registered tools: {sorted(registered)}."
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@pytest.mark.parametrize(
|
||||
"tool_name, module",
|
||||
[
|
||||
("save_workflow", save_workflow_module),
|
||||
("create_workflow", create_workflow_module),
|
||||
],
|
||||
)
|
||||
async def test_tool_documents_every_error_code_it_returns(tool_name, module):
|
||||
descriptions = {
|
||||
tool.name: tool.description or "" for tool in await mcp.list_tools()
|
||||
}
|
||||
description = descriptions[tool_name]
|
||||
returned = _returned_error_codes(module)
|
||||
|
||||
assert returned, f"no error codes detected in {tool_name} source — regex broke"
|
||||
undocumented = sorted(code for code in returned if code not in description)
|
||||
assert not undocumented, (
|
||||
f"{tool_name} returns error_code(s) {undocumented} absent from the description "
|
||||
f"shipped via tools/list. Document them in the {tool_name} docstring."
|
||||
)
|
||||
Loading…
Add table
Add a link
Reference in a new issue