test(mcp): guard instructions.py against tool drift

The MCP `instructions` hint is static and baked into the client prompt,
while tool names, signatures, and error codes are discovered dynamically
via tools/list. The two had drifted: instructions restated stale
signatures and an error-code enum that omitted schema_validation and
trigger_path_conflict.

- Trim instructions.py to tool names + call order; stop restating
  signatures and error codes the dynamic surface already carries.
- Document each tool's full error_code contract in the save_workflow and
  create_workflow docstrings (the descriptions shipped via tools/list).
- Add test_mcp_instructions_drift.py: every tool named in the guide must
  be registered, and every error_code a tool returns must appear in its
  description.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Abhishek Kumar 2026-05-20 18:43:18 +05:30
parent 5762095edf
commit 8484e4bfaf
4 changed files with 170 additions and 30 deletions

View file

@ -7,6 +7,14 @@ handling, hard constraints). Design-level per-field guidance belongs in
each `PropertySpec.llm_hint`; it flows out through `get_node_type` and
doesn't need to be repeated here.
Tool names, parameters, and per-tool `error_code` values are NOT
authoritative here they reach the model dynamically via `tools/list`
from each tool's own signature and docstring. Reference tools by bare
name and describe orchestration; do not restate signatures (they drift)
or re-enumerate error codes (document those on the tool itself).
`test_mcp_instructions_drift.py` fails if this guide names a tool that
is not registered, or if a tool's error codes aren't in its docstring.
Extend based on real LLM failures every bullet below ideally maps to a
mistake the system has seen at least once.
"""
@ -17,22 +25,22 @@ You build and edit Dograh voice-AI workflows by emitting TypeScript that uses th
## Call order
### Reading documentation
1. `search_docs(query)` use first for keyword or acronym lookup when the user is asking how Dograh works or how to configure something.
2. `read_doc(path)` fetch the full page once one result looks likely. Prefer this over reasoning from search summaries alone.
3. `list_docs(path=None, depth=1)` use when the user wants to browse a topic area or when search terms are too vague. Returned section paths feed back into `list_docs`; returned page paths feed into `read_doc`.
1. `search_docs` use first for keyword or acronym lookup when the user is asking how Dograh works or how to configure something.
2. `read_doc` fetch the full page once one result looks likely. Prefer this over reasoning from search summaries alone.
3. `list_docs` use when the user wants to browse a topic area or when search terms are too vague. Call it with no arguments for the top-level sections; returned section paths feed back into `list_docs`, returned page paths feed into `read_doc`.
### Editing an existing workflow
1. `list_workflows` locate the target workflow.
2. `get_workflow_code(workflow_id)` fetch the current source.
3. (optional) `list_node_types` / `get_node_type(name)` consult before adding or editing a node type whose fields aren't already visible in the current code.
2. `get_workflow_code` fetch the current source for that workflow.
3. (optional) `list_node_types` / `get_node_type` consult before adding or editing a node type whose fields aren't already visible in the current code.
4. Mutate the code in place. Preserve existing nodes, edges, and variable names unless the task requires removing or renaming them.
5. `save_workflow(workflow_id, code)` persist as a new draft. The published version is untouched.
5. `save_workflow` persist as a new draft. The published version is untouched.
### Creating a new workflow
1. Create a simple 1-node workflow with only `startCall`. The user can iteratively add complexity by editing it.
2. `list_node_types` / `get_node_type(name)` consult to learn the fields available on the node types you intend to use.
2. `list_node_types` / `get_node_type` consult to learn the fields available on the node types you intend to use.
3. Author SDK TypeScript from scratch. The `new Workflow({ name: "..." })` call is required `name` becomes the workflow's display name.
4. `create_workflow(code)` persists a new workflow as version 1 (published). Returns the new `workflow_id`. For subsequent edits use `save_workflow(workflow_id, code)` (which writes a draft).
4. `create_workflow` persists a new workflow as version 1 (published). Returns the new `workflow_id`. For subsequent edits use `save_workflow` (which writes a draft).
## Allowed source shape
@ -73,14 +81,7 @@ Example:
## Iterating on errors
`save_workflow` and `create_workflow` return one of:
- `parse_error` Disallowed construct (see grammar above) or malformed TypeScript.
- `validation_error` Node data failed spec validation (unknown field, missing required, wrong type, bad `options` value).
- `graph_validation` Structural rule broken (missing startCall, unreachable node, edge to/from wrong node type).
- `missing_name` (`create_workflow` only) `new Workflow({ name })` is absent or empty.
- `bridge_error` Internal; retry once, then surface to the user.
Every error carries `line` and `column`. Fix at that location and resubmit the **complete source** this tool does not accept patches.
A failed `save_workflow` / `create_workflow` returns a result with `saved`/`created` set to false, a machine-readable `error_code`, and a human-readable `error` message carrying `line` and `column` when the problem is locatable in your source. The full set of `error_code` values and their meanings is documented on each tool (visible in its description). Read the `error` message, fix at the reported location, and resubmit the **complete source** these tools do not accept patches. If a failure looks internal or transient rather than a problem with your code, retry once before surfacing it to the user.
## Field conventions

View file

@ -12,10 +12,10 @@ Execution flow mirrors `save_workflow`:
4. Persist via `db_client.create_workflow` workflow row + v1
published definition in a single transaction.
Error codes surfaced to the LLM match `save_workflow`. An additional
`missing_name` error is returned when the source omits
`new Workflow({ name: "..." })` the name is required and there is no
prior workflow to fall back to.
Each failure path returns an `error_code` via `_error_result`. Those
codes and their meanings are documented in the `create_workflow`
docstring (the description shipped to the LLM via `tools/list`); keep the
two in sync `test_mcp_instructions_drift.py` enforces it.
"""
from __future__ import annotations
@ -86,6 +86,22 @@ async def create_workflow(code: str) -> dict[str, Any]:
On success the new workflow is published as version 1. Use
`save_workflow(workflow_id, code)` for subsequent edits those go to
a draft.
On failure the result has `created: false`, a machine-readable
`error_code`, and a human-readable `error` (with file:line:column
where the problem is locatable). Resubmit the full corrected source
patches are not accepted. Possible `error_code` values:
- `parse_error` disallowed construct or malformed TypeScript.
- `validation_error` node data failed spec validation (unknown
field, missing required, wrong type, option out of range).
- `schema_validation` wire-format (DTO) rejection; rare.
- `graph_validation` structural rule broken (e.g. no start node,
unreachable node, edge to/from the wrong node type).
- `missing_name` `new Workflow({ name })` is absent or empty; the
name is required and there is no prior workflow to fall back to.
- `trigger_path_conflict` a trigger node's path is already used by
another workflow in this organization; rename it and resubmit.
- `bridge_error` internal/transient; retry once, then surface it.
"""
user = await authenticate_mcp_request()

View file

@ -10,16 +10,12 @@ Execution flow:
4. Save as a new draft via `db_client.save_workflow_draft` the
published version stays intact, so edits are rollback-safe.
Error codes surfaced to the LLM:
parse_error TS parse failed or a disallowed construct was used
validation_error node data failed spec validation (unknown field,
missing required, wrong type, option out of range)
schema_validation ReactFlowDTO Pydantic rejection (rare; parser bug)
graph_validation semantic graph rule broken (e.g. no start node)
bridge_error Node subprocess failed before returning JSON
All LLM-facing errors include file:line:column where available so the
LLM can correct its code directly.
Each failure path returns an `error_code` via `_error_result`. Those
codes and their meanings are documented in the `save_workflow` docstring
(the description shipped to the LLM via `tools/list`); keep the two in
sync `test_mcp_instructions_drift.py` enforces it. All LLM-facing
errors include file:line:column where available so the LLM can correct
its code directly.
"""
from __future__ import annotations
@ -91,6 +87,18 @@ async def save_workflow(workflow_id: int, code: str) -> dict[str, Any]:
On success the draft version is saved; the published version is
untouched.
On failure the result has `saved: false`, a machine-readable
`error_code`, and a human-readable `error` (with file:line:column
where the problem is locatable). Resubmit the full corrected source
patches are not accepted. Possible `error_code` values:
- `parse_error` disallowed construct or malformed TypeScript.
- `validation_error` node data failed spec validation (unknown
field, missing required, wrong type, option out of range).
- `schema_validation` wire-format (DTO) rejection; rare.
- `graph_validation` structural rule broken (e.g. no start node,
unreachable node, edge to/from the wrong node type).
- `bridge_error` internal/transient; retry once, then surface it.
"""
user = await authenticate_mcp_request()

View file

@ -0,0 +1,115 @@
"""Drift guards between the static MCP guide and the live tool surface.
`api/mcp_server/instructions.py` is free text baked into the client
system prompt. It is *not* the authoritative description of the tools
names, signatures, and per-tool error codes reach the model dynamically
via `tools/list`, derived from each tool's own function signature and
docstring. These tests fail on the two classic drift modes:
1. The guide references a tool that is no longer registered (renamed or
removed) the model would be told to call something that 404s.
2. A tool returns an `error_code` that is absent from the description it
ships via `tools/list` the model can't learn to recover from it.
Keep the guide about orchestration (call order, hard constraints) and let
the tools describe themselves.
"""
from __future__ import annotations
import re
from pathlib import Path
import pytest
from api.mcp_server import instructions as instructions_module
from api.mcp_server.server import mcp
from api.mcp_server.tools import create_workflow as create_workflow_module
from api.mcp_server.tools import save_workflow as save_workflow_module
# Every registered MCP tool name starts with one of these verbs. A
# backticked snake_case token in the guide whose leading word is a verb is
# treated as a tool reference; field/reference names like `tool_refs`,
# `credential_ref`, or `pre_call_fetch` don't start with a verb and are
# correctly ignored. Extend this only when a new tool introduces a new
# leading verb (a missing verb under-checks, it never false-fails).
_TOOL_VERB_PREFIXES = frozenset(
{
"search",
"read",
"list",
"get",
"save",
"create",
"update",
"delete",
"add",
"remove",
"set",
}
)
# A backtick immediately followed by a snake_case identifier (>= 1
# underscore). Anchoring on the opening backtick captures the leading
# identifier of a code span whether it is bare (`read_doc`) or a call
# (`read_doc(path)`), while skipping DSL constructs like `wf.edge` or
# `new Workflow` whose first char after the backtick isn't `[a-z_]`.
_BACKTICKED_SNAKE_RE = re.compile(r"`([a-z][a-z0-9]*(?:_[a-z0-9]+)+)")
# Error codes are emitted as the first string arg to `_error_result(...)`.
_ERROR_RESULT_LITERAL_RE = re.compile(r'_error_result\(\s*"([a-z_]+)"')
# `parse_error` / `validation_error` are picked by a `code_key` ternary
# rather than passed as a literal to `_error_result`, so match them too.
_CODE_KEY_LITERAL_RE = re.compile(r'"(parse_error|validation_error)"')
def _referenced_tool_names(text: str) -> set[str]:
return {
token
for token in _BACKTICKED_SNAKE_RE.findall(text)
if token.split("_", 1)[0] in _TOOL_VERB_PREFIXES
}
def _returned_error_codes(module) -> set[str]:
source = Path(module.__file__).read_text(encoding="utf-8")
return set(_ERROR_RESULT_LITERAL_RE.findall(source)) | set(
_CODE_KEY_LITERAL_RE.findall(source)
)
@pytest.mark.asyncio
async def test_guide_only_references_registered_tools():
registered = {tool.name for tool in await mcp.list_tools()}
referenced = _referenced_tool_names(instructions_module.DOGRAH_MCP_INSTRUCTIONS)
assert referenced, "no tool references extracted — the regex likely broke"
unknown = sorted(referenced - registered)
assert not unknown, (
f"instructions.py references tools that are not registered: {unknown}. "
f"Rename/remove the reference or register the tool. "
f"Registered tools: {sorted(registered)}."
)
@pytest.mark.asyncio
@pytest.mark.parametrize(
"tool_name, module",
[
("save_workflow", save_workflow_module),
("create_workflow", create_workflow_module),
],
)
async def test_tool_documents_every_error_code_it_returns(tool_name, module):
descriptions = {
tool.name: tool.description or "" for tool in await mcp.list_tools()
}
description = descriptions[tool_name]
returned = _returned_error_codes(module)
assert returned, f"no error codes detected in {tool_name} source — regex broke"
undocumented = sorted(code for code in returned if code not in description)
assert not undocumented, (
f"{tool_name} returns error_code(s) {undocumented} absent from the description "
f"shipped via tools/list. Document them in the {tool_name} docstring."
)