fix(filesystem): remove cat node reads

Return nested PageIndex structure JSON from cat --structure and keep content reads page-based only. Remove the cat --node command surface, related limits, prompts, and structure-text fallback.
This commit is contained in:
BukeLy 2026-06-01 00:05:19 +08:00
parent e368562e03
commit d0c0c67a39
6 changed files with 56 additions and 277 deletions

View file

@ -5,7 +5,7 @@ This mirrors examples/agentic_vectorless_rag_demo.py, but exposes a corpus
through the PageIndex FileSystem shell instead of direct PageIndex document
tools. The agent receives one read-only bash-like PIFS tool and must retrieve
evidence through commands such as ls, tree, browse, find, grep, cat <path>
--structure, cat <path> --page, and cat <path> --node.
--structure, and cat <path> --page.
The demo registers supported files under examples/documents. When a matching
examples/documents/results/*_structure.json file exists, it is loaded into the
@ -81,7 +81,7 @@ Retrieval strategy:
browse -R /documents "Federal Reserve supervision regulation"
- browse returns file candidates only; it is not folder semantic recall.
- After browse returns candidates, verify evidence with grep, cat <path>
--structure, cat <path> --node, or cat <path> --page before answering.
--structure, or cat <path> --page before answering.
- Use find --where only with JSON metadata DSL, for example:
find /documents --where '{"file_format":"pdf"}'
- Use grep -R only for lexical evidence; do not treat semantic candidates as

View file

@ -70,22 +70,19 @@ likely browse candidate, verify the relevant claim with cat or grep before
answering. Errors are returned as text prefixed with ERROR. Do not call commands
that are not listed as available. When evidence is required, inspect it with cat
or grep before answering. Prefer shell-like target-first cat syntax with stable
targets: cat <path> --structure, cat <path> --page 31-59, and cat <path> --node
0009. You may also use file_ref or document_id when a path is ambiguous. Do not reconstruct paths from document titles; use exact targets returned by PIFS
commands and quote paths containing spaces. After structure identifies a
relevant section node, prefer
cat <path> --node <node_id>; use cat <path> --page <range> when the user asks
for page-level evidence, no suitable node exists, or exact page text is needed.
cat <path> --structure is paginated; request more with --offset if needed. Page
reads are limited to five pages at once, node reads to at most ten node ids,
and text cat --all returns only the first page of text lines. If a cat limit
error requires a smaller call, stop when the evidence is sufficient; otherwise
continue with another chunk before answering.
targets: cat <path> --structure and cat <path> --page 31-59. You may also use
file_ref or document_id when a path is ambiguous. Do not reconstruct paths from
document titles; use exact targets returned by PIFS commands and quote paths
containing spaces. Use cat <path> --structure to inspect the document structure
JSON, then cat <path> --page <range> for exact page text evidence. Page reads
are limited to five pages at once, and text cat --all returns only the first
page of text lines. If a cat limit error requires a smaller call, stop when the
evidence is sufficient; otherwise continue with another chunk before answering.
For questions about metadata fields, available summaries, or whether metadata
was provided, inspect stat --schema and stat <target> before making claims.
Do not use stat as a general content/topic discovery step. For document Q&A,
prefer ls/tree for folder selection, browse for file candidates, then cat
--structure and cat --node or cat --page for evidence.
--structure and cat --page for evidence.
"""
AGENT_TOOL_POLICY = """
@ -110,15 +107,14 @@ Tool policy:
- Do not reconstruct a file path from a title. Use exact paths returned by PIFS commands, or use file_ref/document_id when available; quote paths that contain spaces.
- For broad topic, method, or "what solution" questions that are likely about the workspace, search for candidate documents before asking the user to choose a document.
- Use stat only for metadata/schema/status questions or to resolve ambiguous target identity. Do not run stat merely to understand what a document says.
- Prefer target-first cat syntax with stable targets: cat <path> --structure, cat <path> --page 31-59, cat <path> --node <node_id>.
- cat <target> --structure returns at most 25 nodes; use --offset and --limit for more structure pages.
- cat <target> --page accepts at most 5 pages at once. If a larger range is needed, first inspect cat <target> --structure and then read a smaller page range or node.
- cat <target> --node accepts at most 10 node ids at once. Prefer relevant nodes from structure when possible.
- When recovering from cat page/node/text limit errors, stop if the evidence is sufficient; if it is not sufficient, make another smaller call before answering.
- Prefer target-first cat syntax with stable targets: cat <path> --structure, cat <path> --page 31-59.
- cat <target> --structure returns the cached PageIndex structure JSON without text fields.
- cat <target> --page accepts at most 5 pages at once. If a larger range is needed, first inspect cat <target> --structure and then read a smaller page range.
- When recovering from cat page/text limit errors, stop if the evidence is sufficient; if it is not sufficient, make another smaller call before answering.
- cat <target> --all returns at most 100 text lines; use cat <target> --range <start>-<end> for the next page.
- After cat <target> --structure finds a relevant section/subsection with a node_id, prefer cat <target> --node <node_id> for content from that semantic unit.
- Use cat <target> --page <start>-<end> when the user explicitly asks for pages/page ranges, when no suitable node_id exists, or when you need exact page text to verify page-level evidence.
- Avoid fetching a broad page span after a matching node is available unless page-level citation or verification is required.
- After cat <target> --structure identifies a relevant section/subsection, use cat <target> --page <start>-<end> for exact evidence.
- Use cat <target> --page <start>-<end> when the user explicitly asks for pages/page ranges or when you need exact page text to verify evidence.
- Avoid fetching a broad page span unless page-level citation or verification is required.
- Do not call cat --page <target> <start> <end>; if you need a page span, use cat <target> --page <start>-<end>.
- For metadata or summary-field questions, run stat --schema and stat <target> for relevant files before answering; do not infer metadata presence or absence from ls/find output alone.
- Distinguish default/register metadata from caller-provided custom metadata when the evidence supports it.

View file

@ -40,10 +40,6 @@ class PIFSCommandExecutor:
BROWSE_PAGE_SIZE = 10
MAX_TEXT_LINES = 100
MAX_PAGE_SPAN = 5
MAX_STRUCTURE_NODES = 25
MAX_NODE_IDS = 10
MAX_NODE_TEXT_LINES = 100
MAX_NODE_TEXT_CHARS = 12_000
MAX_STAT_FIELD_TARGETS = 20
MAX_TREE_DEPTH = 4
MAX_LS_RENDER_FILES = 25
@ -85,9 +81,8 @@ class PIFSCommandExecutor:
"- find --where: exact/canonical metadata DSL filtering using stat --schema fields only",
"- find <folder> -maxdepth N -type f|d: bounded folder traversal for find",
"- grep -R: recursive lexical/FTS search only; semantic vector prefilter is disabled",
"- cat <path|file_ref|document_id> --structure: cached PageIndex node list, paginated at 25 nodes",
"- cat <path|file_ref|document_id> --structure: cached PageIndex structure JSON without text fields",
"- cat <path|file_ref|document_id> --page: cached PageIndex page reads, limited to 5 pages",
"- cat <path|file_ref|document_id> --node: cached PageIndex node reads, limited to 10 node ids",
"- cat <path|file_ref|document_id> --all: text artifact reads for txt/text files, paginated at 100 lines",
"- stat --field <metadata_field> <target...>: one metadata field across up to 20 documents",
]
@ -495,15 +490,11 @@ class PIFSCommandExecutor:
if target.startswith("-"):
raise PIFSCommandError(
"cat syntax is target-first: cat <path|file_ref|document_id> --structure, "
"cat <path|file_ref|document_id> --page 31-59, or "
"cat <path|file_ref|document_id> --node 0009"
"or cat <path|file_ref|document_id> --page 31-59"
)
location = "all"
structural_mode: str | None = None
node_ids: list[str] = []
page_range: str | None = None
structure_offset = 0
structure_limit = self.MAX_STRUCTURE_NODES
i = 1
while i < len(args):
arg = args[i]
@ -516,29 +507,6 @@ class PIFSCommandExecutor:
location = "all"
elif arg == "--structure":
structural_mode = "structure"
elif arg == "--offset":
i += 1
if i >= len(args):
raise PIFSCommandError("cat --structure --offset requires a value")
structure_offset = self._parse_non_negative_int(args[i], "cat --structure --offset")
elif arg == "--limit":
i += 1
if i >= len(args):
raise PIFSCommandError("cat --structure --limit requires a value")
structure_limit = self._parse_bounded_int(
args[i],
"cat --structure --limit",
max_value=self.MAX_STRUCTURE_NODES,
)
elif arg == "--node":
i += 1
if i >= len(args):
raise PIFSCommandError("cat --node requires a node id")
structural_mode = "node"
while i < len(args) and not args[i].startswith("-"):
node_ids.extend(self._parse_node_ids(args[i]))
i += 1
i -= 1
elif arg == "--page":
i += 1
if i >= len(args):
@ -551,8 +519,7 @@ class PIFSCommandExecutor:
raise PIFSCommandError(
"cat accepts one file target. Use target-first syntax: "
"cat <path|file_ref|document_id> --structure, "
"cat <path|file_ref|document_id> --node 0002 0004, or "
"cat <path|file_ref|document_id> --page 31-33. "
"or cat <path|file_ref|document_id> --page 31-33. "
f"Unexpected extra argument: {arg!r}. If the target path or title contains "
"spaces, quote the whole target, for example: cat \"/documents/report name.pdf\" "
"--structure. If a title-derived path is ambiguous, use the file_ref or "
@ -560,47 +527,7 @@ class PIFSCommandExecutor:
)
i += 1
if structural_mode == "structure":
if structure_limit < 1:
raise PIFSCommandError(
"cat --structure --limit must be at least 1 and at most "
f"{self.MAX_STRUCTURE_NODES}."
)
data = self.filesystem.pageindex_structure(
target,
offset=structure_offset,
limit=structure_limit,
)
self._attach_structure_next_command(data, target)
return data
if structural_mode == "node":
self._require_at_most(
len(node_ids),
"cat --node node count",
self.MAX_NODE_IDS,
)
if not node_ids:
raise PIFSCommandError("cat --node requires a node id")
node_results = [
self._bounded_node_result(
self.filesystem.pageindex_node(target, node_id),
target=target,
node_id=node_id,
)
for node_id in node_ids
]
if len(node_results) == 1:
return node_results[0]
return {
"mode": "nodes",
"target": target,
"available": all(result.get("available") is not False for result in node_results),
"node_ids": node_ids,
"nodes": node_results,
"text": "\n\n".join(
f"[node {result.get('node_id') or node_id}]\n{result.get('text', '')}"
for node_id, result in zip(node_ids, node_results)
),
}
return self.filesystem.pageindex_structure(target)
if structural_mode == "page":
if not page_range or not re.fullmatch(r"\d+(?:-\d+)?", page_range):
raise PIFSCommandError(
@ -752,66 +679,6 @@ class PIFSCommandExecutor:
data["pagination"] = pagination
return data
def _bounded_node_result(
self,
data: dict[str, Any],
*,
target: str,
node_id: str,
) -> dict[str, Any]:
if not isinstance(data, dict) or data.get("available") is False:
return data
text = str(data.get("text") or "")
lines = text.splitlines()
truncated_by_lines = len(lines) > self.MAX_NODE_TEXT_LINES
truncated_by_chars = len(text) > self.MAX_NODE_TEXT_CHARS
if not truncated_by_lines and not truncated_by_chars:
data["node_pagination"] = {
"limit_nodes": self.MAX_NODE_IDS,
"text_truncated": False,
}
return data
selected = "\n".join(lines[: self.MAX_NODE_TEXT_LINES])
if len(selected) > self.MAX_NODE_TEXT_CHARS:
selected = selected[: self.MAX_NODE_TEXT_CHARS].rstrip()
data["text"] = (
selected.rstrip()
+ "\n"
+ self._pagination_footer(
"cat --node",
(
f"node text limited to {self.MAX_NODE_TEXT_LINES} lines/"
f"{self.MAX_NODE_TEXT_CHARS} chars"
),
f"cat {shlex.quote(target)} --structure",
)
).strip()
data["node_pagination"] = {
"limit_nodes": self.MAX_NODE_IDS,
"line_limit": self.MAX_NODE_TEXT_LINES,
"char_limit": self.MAX_NODE_TEXT_CHARS,
"original_lines": len(lines),
"original_chars": len(text),
"text_truncated": True,
"suggested_command": f"cat {shlex.quote(target)} --structure",
"node_id": node_id,
}
return data
def _attach_structure_next_command(self, data: dict[str, Any], target: str) -> None:
pagination = data.get("structure_pagination")
if not isinstance(pagination, dict):
return
if pagination.get("has_more") and pagination.get("next_offset") is not None:
next_command = (
f"cat {shlex.quote(target)} --structure "
f"--offset {pagination['next_offset']} --limit {pagination['limit']}"
)
pagination["next_command"] = next_command
else:
pagination["next_command"] = None
def _attach_page_next_command(
self,
data: dict[str, Any],
@ -841,10 +708,6 @@ class PIFSCommandExecutor:
f"Next: {next_command}. If unsure, use cat <target> --structure."
)
@staticmethod
def _parse_node_ids(value: str) -> list[str]:
return [part.strip() for part in value.split(",") if part.strip()]
@staticmethod
def _reject_regex_alternation_query(query: str, command_name: str) -> None:
if "|" not in str(query):
@ -939,7 +802,6 @@ class PIFSCommandExecutor:
return json.dumps(
{
"structure": data.get("structure", []),
"pagination": data.get("structure_pagination", {}),
},
ensure_ascii=False,
indent=2,

View file

@ -1278,10 +1278,7 @@ class PageIndexFileSystem:
return ""
client._ensure_doc_loaded(doc_id)
doc = client.documents.get(doc_id) or {}
page_text = self._pageindex_pages_text(doc.get("pages"))
if page_text:
return page_text
return self._pageindex_structure_text(doc.get("structure", []))
return self._pageindex_pages_text(doc.get("pages"))
@staticmethod
def _pageindex_pages_text(pages: Any) -> str:
@ -1296,25 +1293,6 @@ class PageIndexFileSystem:
parts.append(content)
return "\n\n".join(parts)
@classmethod
def _pageindex_structure_text(cls, structure: Any) -> str:
parts: list[str] = []
cls._collect_pageindex_node_text(structure, parts)
return "\n\n".join(parts)
@classmethod
def _collect_pageindex_node_text(cls, node: Any, parts: list[str]) -> None:
if isinstance(node, list):
for item in node:
cls._collect_pageindex_node_text(item, parts)
return
if not isinstance(node, dict):
return
text = str(node.get("text") or "").strip()
if text:
parts.append(text)
cls._collect_pageindex_node_text(node.get("nodes", []), parts)
@staticmethod
def _raw_artifact_payload(
*,

View file

@ -68,7 +68,6 @@ def test_pageindex_structure_options_report_failed_register_build(monkeypatch):
executor = PIFSCommandExecutor(filesystem, json_output=True)
structure = json.loads(executor.execute("cat dsid_structural_missing --structure"))
node = json.loads(executor.execute("cat dsid_structural_missing --node 0001"))
pages = json.loads(executor.execute("cat dsid_structural_missing --page 1-2"))
stat = json.loads(executor.execute("stat dsid_structural_missing"))
@ -85,10 +84,6 @@ def test_pageindex_structure_options_report_failed_register_build(monkeypatch):
"message": "index failed: extractor unavailable",
}
assert node["data"]["mode"] == "node"
assert node["data"]["available"] is False
assert node["data"]["node_id"] == "0001"
assert pages["data"]["mode"] == "page"
assert pages["data"]["available"] is False
assert pages["data"]["pages"] == "1-2"
@ -135,6 +130,9 @@ def test_register_pdf_markdown_uses_pageindex_extracted_text_for_metadata_and_ft
"nodes": [],
}
],
"pages": [
{"page": 1, "content": "PageIndex Markdown extracted gamma text."}
],
}
write_pageindex_client_doc(self.workspace, doc_id, doc)
self.documents[doc_id] = doc
@ -348,10 +346,10 @@ def test_cat_structure_page_reuses_pageindex_client_cache_without_indexing(monke
assert structure["data"]["available"] is True
assert structure["data"]["pageindex_doc_id"] == "doc_cached_pdf"
assert structure["data"]["structure"][0]["title"] == "Introduction"
assert structure["data"]["structure"][1]["title"] == "Findings"
assert structure["data"]["structure_pagination"]["limit"] == 25
assert structure["data"]["structure"][0]["nodes"][0]["title"] == "Findings"
assert "structure_pagination" not in structure["data"]
assert "text" not in structure["data"]["structure"][0]
assert "text" not in structure["data"]["structure"][1]
assert "text" not in structure["data"]["structure"][0]["nodes"][0]
assert pages["data"]["available"] is True
assert pages["data"]["text"] == "Page one text\n\nPage two text"
@ -364,53 +362,26 @@ def test_cat_structure_page_reuses_pageindex_client_cache_without_indexing(monke
assert stat["data"]["pageindex_tree_status"] == "built"
def test_cat_node_reads_pageindex_client_structure_without_custom_pifs_artifact():
def test_cat_node_is_not_supported():
from pageindex.filesystem import PIFSCommandExecutor, PageIndexFileSystem
from pageindex.filesystem.commands import PIFSCommandError
with tempfile.TemporaryDirectory() as tmp:
source = Path(tmp) / "notes.md"
source.write_text("# Notes\n\nBody", encoding="utf-8")
filesystem = PageIndexFileSystem(workspace=Path(tmp) / "workspace")
write_pageindex_client_doc(
filesystem.pageindex_client_workspace,
"doc_cached_md",
{
"id": "doc_cached_md",
"type": "md",
"path": str(source.resolve()),
"doc_name": "notes",
"doc_description": "",
"line_count": 3,
"structure": [
{
"title": "Notes",
"node_id": "0001",
"line_num": 1,
"text": "# Notes\n\nBody",
"nodes": [],
}
],
},
)
filesystem.register_file(
storage_uri=source.as_uri(),
storage_uri="file:///tmp/notes.md",
source_path="docs/notes.md",
external_id="dsid_md_cached",
title="Cached markdown notes",
content=source.read_text(encoding="utf-8"),
content="# Notes\n\nBody",
)
executor = PIFSCommandExecutor(filesystem, json_output=True)
node = json.loads(executor.execute("cat dsid_md_cached --node 0001"))
assert node["data"]["available"] is True
assert node["data"]["pageindex_doc_id"] == "doc_cached_md"
assert node["data"]["node"]["title"] == "Notes"
assert node["data"]["text"] == "# Notes\n\nBody"
assert "text" not in node["data"]["node"]
with pytest.raises(PIFSCommandError, match="Unsupported cat option: --node"):
executor.execute("cat dsid_md_cached --node 0001")
def test_cat_structure_page_node_and_text_outputs_are_hard_limited():
def test_cat_structure_page_and_text_outputs_are_hard_limited():
from pageindex.filesystem import PIFSCommandExecutor, PageIndexFileSystem
from pageindex.filesystem.commands import PIFSCommandError
@ -463,16 +434,13 @@ def test_cat_structure_page_node_and_text_outputs_are_hard_limited():
)
executor = PIFSCommandExecutor(filesystem, json_output=True)
first_structure = json.loads(executor.execute("cat dsid_limited_pdf --structure"))
assert len(first_structure["data"]["structure"]) == 25
assert first_structure["data"]["structure_pagination"]["has_more"] is True
assert first_structure["data"]["structure_pagination"]["next_offset"] == 25
second_structure = json.loads(
structure = json.loads(executor.execute("cat dsid_limited_pdf --structure"))
assert len(structure["data"]["structure"]) == 30
assert structure["data"]["structure"][25]["node_id"] == "0026"
assert "text" not in structure["data"]["structure"][0]
assert "structure_pagination" not in structure["data"]
with pytest.raises(PIFSCommandError, match="Unsupported cat option: --offset"):
executor.execute("cat dsid_limited_pdf --structure --offset 25")
)
assert len(second_structure["data"]["structure"]) == 5
assert second_structure["data"]["structure"][0]["node_id"] == "0026"
pages = json.loads(executor.execute("cat dsid_limited_pdf --page 1-5"))
assert pages["data"]["text"] == (
@ -484,38 +452,8 @@ def test_cat_structure_page_node_and_text_outputs_are_hard_limited():
with pytest.raises(PIFSCommandError, match="evidence is sufficient"):
executor.execute("cat dsid_limited_pdf --page 1-6")
nodes = json.loads(
executor.execute(
"cat dsid_limited_pdf --node 0001 0002 0003 0004 0005 "
"0006 0007 0008 0009 0010"
)
)
assert nodes["data"]["node_ids"] == [
"0001",
"0002",
"0003",
"0004",
"0005",
"0006",
"0007",
"0008",
"0009",
"0010",
]
comma_nodes = json.loads(
executor.execute("cat dsid_limited_pdf --node 0001,0002")
)
assert comma_nodes["data"]["node_ids"] == ["0001", "0002"]
with pytest.raises(PIFSCommandError, match="at most 10"):
executor.execute(
"cat dsid_limited_pdf --node 0001 0002 0003 0004 0005 "
"0006 0007 0008 0009 0010 0011"
)
with pytest.raises(PIFSCommandError, match="continue with additional chunks"):
executor.execute(
"cat dsid_limited_pdf --node 0001 0002 0003 0004 0005 "
"0006 0007 0008 0009 0010 0011"
)
with pytest.raises(PIFSCommandError, match="Unsupported cat option: --node"):
executor.execute("cat dsid_limited_pdf --node 0001")
with pytest.raises(PIFSCommandError, match="quote the whole target"):
executor.execute("cat dsid_limited_pdf 0001")
@ -672,11 +610,13 @@ def test_pageindex_structure_commands_are_limited_to_pdf_and_markdown():
for command in (
"cat dsid_text_only --structure",
"cat dsid_text_only --page 1",
"cat dsid_text_only --node 0001",
):
with pytest.raises(PIFSCommandError, match="only supported for PDF/Markdown"):
executor.execute(command)
with pytest.raises(PIFSCommandError, match="Unsupported cat option: --node"):
executor.execute("cat dsid_text_only --node 0001")
def test_existing_pageindex_status_allows_legacy_record_without_format_suffix():
from pageindex.filesystem import PIFSCommandExecutor, PageIndexFileSystem

View file

@ -218,13 +218,17 @@ class PIFSAgentStreamTest(unittest.TestCase):
self.assertEqual(output, '{"answer":"done","document_ids":["dsid_1"]}')
def test_prompt_tells_agent_when_to_choose_node_or_page(self):
self.assertIn("prefer cat <target> --node <node_id>", AGENT_TOOL_POLICY)
self.assertIn("page-level evidence", AGENT_TOOL_POLICY)
self.assertIn("prefer\ncat <path> --node <node_id>", BASH_TOOL_DESCRIPTION)
def test_prompt_tells_agent_to_use_structure_then_page(self):
self.assertIn(
"cat <target> --structure returns the cached PageIndex structure JSON",
AGENT_TOOL_POLICY,
)
self.assertIn("exact page text", BASH_TOOL_DESCRIPTION)
self.assertIn("cat <path> --structure and cat <path> --page", BASH_TOOL_DESCRIPTION)
self.assertIn("stop if the evidence is sufficient", AGENT_TOOL_POLICY)
self.assertIn("continue with another chunk before answering", BASH_TOOL_DESCRIPTION)
self.assertIn("Do not reconstruct paths from document titles", BASH_TOOL_DESCRIPTION)
self.assertIn("Do not reconstruct paths from", BASH_TOOL_DESCRIPTION)
self.assertIn("document titles", BASH_TOOL_DESCRIPTION)
self.assertIn("file_ref/document_id", AGENT_TOOL_POLICY)
def test_prompt_requires_stat_for_metadata_questions(self):
@ -244,7 +248,6 @@ class PIFSAgentStreamTest(unittest.TestCase):
self.assertIn("browse returns file candidates only", AGENT_TOOL_POLICY)
self.assertIn("verify the relevant facts with cat or grep", AGENT_TOOL_POLICY)
self.assertIn("cat <target> --structure", AGENT_TOOL_POLICY)
self.assertIn("cat <target> --node <node_id>", AGENT_TOOL_POLICY)
self.assertIn("cat <target> --page", AGENT_TOOL_POLICY)
self.assertIn("Do not use browse as folder semantic recall", AGENT_TOOL_POLICY)