mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 00:16:23 +02:00
Feature/prompts jsonl (#619)
* Tech spec * JSONL implementation complete * Updated prompt client users * Fix tests
parent
e4f0013841
commit
e214eb4e02
8 changed files with 1292 additions and 463 deletions
455
docs/tech-specs/jsonl-prompt-output.md
Normal file
@ -0,0 +1,455 @@
# JSONL Prompt Output Technical Specification

## Overview

This specification describes the implementation of JSONL (JSON Lines) output
format for prompt responses in TrustGraph. JSONL enables truncation-resilient
extraction of structured data from LLM responses, addressing critical issues
with JSON array outputs being corrupted when LLM responses hit output token
limits.

This implementation supports the following use cases:

1. **Truncation-Resilient Extraction**: Extract valid partial results even when
   LLM output is truncated mid-response
2. **Large-Scale Extraction**: Handle extraction of many items without risk of
   complete failure due to token limits
3. **Mixed-Type Extraction**: Support extraction of multiple entity types
   (definitions, relationships, entities, attributes) in a single prompt
4. **Streaming-Compatible Output**: Enable future streaming/incremental
   processing of extraction results

## Goals

- **Backward Compatibility**: Existing prompts using `response-type: "text"` and
  `response-type: "json"` continue to work without modification
- **Truncation Resilience**: Partial LLM outputs yield partial valid results
  rather than complete failure
- **Schema Validation**: Support JSON Schema validation for individual objects
- **Discriminated Unions**: Support mixed-type outputs using a `type` field
  discriminator
- **Minimal API Changes**: Extend existing prompt configuration with new
  response type and schema key

## Background

### Current Architecture

The prompt service supports two response types:

1. `response-type: "text"` - Raw text response returned as-is
2. `response-type: "json"` - JSON parsed from response, validated against
   optional `schema`

Current implementation in `trustgraph-flow/trustgraph/template/prompt_manager.py`:

```python
class Prompt:
    def __init__(self, template, response_type = "text", terms=None, schema=None):
        self.template = template
        self.response_type = response_type
        self.terms = terms
        self.schema = schema
```

### Current Limitations

When extraction prompts request output as JSON arrays (`[{...}, {...}, ...]`):

- **Truncation corruption**: If the LLM hits output token limits mid-array, the
  entire response becomes invalid JSON and cannot be parsed
- **All-or-nothing parsing**: Must receive complete output before parsing
- **No partial results**: A truncated response yields zero usable data
- **Unreliable for large extractions**: More extracted items = higher failure risk

This specification addresses these limitations by introducing JSONL format for
extraction prompts, where each extracted item is a complete JSON object on its
own line.

## Technical Design

### Response Type Extension

Add a new response type `"jsonl"` alongside existing `"text"` and `"json"` types.

#### Configuration Changes

**New response type value:**

```
"response-type": "jsonl"
```

**Schema interpretation:**

The existing `"schema"` key is used for both `"json"` and `"jsonl"` response
types. The interpretation depends on the response type:

- `"json"`: Schema describes the entire response (typically an array or object)
- `"jsonl"`: Schema describes each individual line/object

```json
{
  "response-type": "jsonl",
  "schema": {
    "type": "object",
    "properties": {
      "entity": { "type": "string" },
      "definition": { "type": "string" }
    },
    "required": ["entity", "definition"]
  }
}
```

This avoids changes to prompt configuration tooling and editors.

### JSONL Format Specification

#### Simple Extraction

For prompts extracting a single type of object (definitions, relationships,
topics, rows), the output is one JSON object per line with no wrapper:

**Prompt output format:**
```
{"entity": "photosynthesis", "definition": "Process by which plants convert sunlight"}
{"entity": "chlorophyll", "definition": "Green pigment in plants"}
{"entity": "mitochondria", "definition": "Powerhouse of the cell"}
```

**Contrast with previous JSON array format:**
```json
[
  {"entity": "photosynthesis", "definition": "Process by which plants convert sunlight"},
  {"entity": "chlorophyll", "definition": "Green pigment in plants"},
  {"entity": "mitochondria", "definition": "Powerhouse of the cell"}
]
```

If the LLM truncates after line 2, the JSON array format yields invalid JSON,
while JSONL yields two valid objects.
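
The contrast can be checked with a small standalone sketch. The sample strings and the helper names `parse_array` and `parse_lines` are illustrative assumptions, not part of the TrustGraph code; both inputs are cut off partway through the third item.

```python
import json

# Hypothetical LLM outputs, both truncated partway through the third item.
truncated_array = (
    '[\n'
    '  {"entity": "photosynthesis", "definition": "Process by which plants convert sunlight"},\n'
    '  {"entity": "chlorophyll", "definition": "Green pigment in plants"},\n'
    '  {"entity": "mitochondria", "defin'
)
truncated_jsonl = (
    '{"entity": "photosynthesis", "definition": "Process by which plants convert sunlight"}\n'
    '{"entity": "chlorophyll", "definition": "Green pigment in plants"}\n'
    '{"entity": "mitochondria", "defin'
)


def parse_array(text):
    """Parse a JSON array in one call; return None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def parse_lines(text):
    """Parse one JSON object per line, keeping only the lines that decode."""
    items = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            items.append(json.loads(line))
        except json.JSONDecodeError:
            pass  # the truncated final line is dropped
    return items


array_result = parse_array(truncated_array)
jsonl_result = parse_lines(truncated_jsonl)
print(array_result)                                # None
print([item["entity"] for item in jsonl_result])   # ['photosynthesis', 'chlorophyll']
```

The array variant loses everything because the closing bracket never arrives, whereas the line-based variant loses only the item that was being written when the output stopped.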

#### Mixed-Type Extraction (Discriminated Unions)

For prompts extracting multiple types of objects (e.g., both definitions and
relationships, or entities, relationships, and attributes), use a `"type"`
field as discriminator:

**Prompt output format:**
```
{"type": "definition", "entity": "DNA", "definition": "Molecule carrying genetic instructions"}
{"type": "relationship", "subject": "DNA", "predicate": "located_in", "object": "cell nucleus", "object-entity": true}
{"type": "definition", "entity": "RNA", "definition": "Molecule that carries genetic information"}
{"type": "relationship", "subject": "RNA", "predicate": "transcribed_from", "object": "DNA", "object-entity": true}
```

**Schema for discriminated unions uses `oneOf`:**
```json
{
  "response-type": "jsonl",
  "schema": {
    "oneOf": [
      {
        "type": "object",
        "properties": {
          "type": { "const": "definition" },
          "entity": { "type": "string" },
          "definition": { "type": "string" }
        },
        "required": ["type", "entity", "definition"]
      },
      {
        "type": "object",
        "properties": {
          "type": { "const": "relationship" },
          "subject": { "type": "string" },
          "predicate": { "type": "string" },
          "object": { "type": "string" },
          "object-entity": { "type": "boolean" }
        },
        "required": ["type", "subject", "predicate", "object", "object-entity"]
      }
    ]
  }
}
```
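
Once a mixed-type response has been parsed, a consumer typically needs to separate the objects by their `"type"` value. The following is a minimal sketch of that step; `group_by_type` and its `known_types` parameter are hypothetical names chosen for illustration and do not appear in the TrustGraph code.

```python
import json
from collections import defaultdict

# Sample response built from the mixed-type example above.
response = "\n".join(json.dumps(obj) for obj in [
    {"type": "definition", "entity": "DNA",
     "definition": "Molecule carrying genetic instructions"},
    {"type": "relationship", "subject": "DNA", "predicate": "located_in",
     "object": "cell nucleus", "object-entity": True},
    {"type": "definition", "entity": "RNA",
     "definition": "Molecule that carries genetic information"},
    {"type": "relationship", "subject": "RNA", "predicate": "transcribed_from",
     "object": "DNA", "object-entity": True},
])


def group_by_type(objects, known_types=("definition", "relationship")):
    """Bucket parsed objects by their "type" discriminator.

    Objects whose type is missing or not in known_types are collected
    under the key None so the caller can log or discard them.
    """
    groups = defaultdict(list)
    for obj in objects:
        kind = obj.get("type")
        groups[kind if kind in known_types else None].append(obj)
    return dict(groups)


objects = [json.loads(line) for line in response.splitlines() if line.strip()]
groups = group_by_type(objects)
definitions = groups.get("definition", [])
relationships = groups.get("relationship", [])
print(len(definitions), len(relationships))  # 2 2
```

Keeping unrecognized types in a separate bucket, rather than raising, matches the lenient handling this specification uses elsewhere for JSONL output.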

#### Ontology Extraction

For ontology-based extraction with entities, relationships, and attributes:

**Prompt output format:**
```
{"type": "entity", "entity": "Cornish pasty", "entity_type": "fo/Recipe"}
{"type": "entity", "entity": "beef", "entity_type": "fo/Food"}
{"type": "relationship", "subject": "Cornish pasty", "subject_type": "fo/Recipe", "relation": "fo/has_ingredient", "object": "beef", "object_type": "fo/Food"}
{"type": "attribute", "entity": "Cornish pasty", "entity_type": "fo/Recipe", "attribute": "fo/serves", "value": "4 people"}
```

### Implementation Details

#### Prompt Class

The existing `Prompt` class requires no changes. The `schema` field is reused
for JSONL, with its interpretation determined by `response_type`:

```python
class Prompt:
    def __init__(self, template, response_type="text", terms=None, schema=None):
        self.template = template
        self.response_type = response_type
        self.terms = terms
        self.schema = schema  # Interpretation depends on response_type
```

#### PromptManager.load_config

No changes required - existing configuration loading already handles the
`schema` key.

#### JSONL Parsing

Add a new parsing method for JSONL responses:

```python
# Assumes the module already has `import json` and a module-level `logger`.
def parse_jsonl(self, text):
    """
    Parse JSONL response, returning list of valid objects.

    Empty lines and markdown code fence markers are skipped silently;
    lines containing malformed JSON are skipped with a warning.
    This provides truncation resilience - partial output yields partial results.
    """
    results = []

    for line_num, line in enumerate(text.strip().split('\n'), 1):
        line = line.strip()

        # Skip empty lines
        if not line:
            continue

        # Skip markdown code fence markers if present
        if line.startswith('```'):
            continue

        try:
            obj = json.loads(line)
            results.append(obj)
        except json.JSONDecodeError as e:
            # Log warning but continue - this provides truncation resilience
            logger.warning(f"JSONL parse error on line {line_num}: {e}")

    return results
```
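
To see how this parser treats code fences, blank lines, stray prose and a truncated final line, the method can be exercised outside the class. The sketch below is a standalone copy without `self`; the sample input is made up for illustration, and the fence marker is built with string multiplication only to keep the example readable inside a fenced block.

```python
import json
import logging

logger = logging.getLogger(__name__)

FENCE = "`" * 3  # markdown code fence marker


def parse_jsonl(text):
    """Standalone copy of the method above (no `self`), for experimentation."""
    results = []
    for line_num, line in enumerate(text.strip().split("\n"), 1):
        line = line.strip()
        # Skip empty lines and markdown code fence markers
        if not line or line.startswith(FENCE):
            continue
        try:
            results.append(json.loads(line))
        except json.JSONDecodeError as e:
            logger.warning(f"JSONL parse error on line {line_num}: {e}")
    return results


# Hypothetical LLM response: fenced, with a blank line, a stray prose line,
# and a final object cut off by the output token limit.
sample = "\n".join([
    FENCE + "json",
    '{"entity": "DNA", "definition": "Molecule carrying genetic instructions"}',
    "",
    "not json at all",
    '{"entity": "RNA", "definition": "Molecule that carries genetic information"}',
    '{"entity": "ribosome", "definition": "Site of prot',
])

parsed = parse_jsonl(sample)
print([obj["entity"] for obj in parsed])  # ['DNA', 'RNA']
```

Two of the six lines decode; the prose line and the truncated line each produce a warning, and the fence and blank line are ignored.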

#### PromptManager.invoke Changes

Extend the invoke method to handle the new response type:

```python
async def invoke(self, id, input, llm):
    logger.debug("Invoking prompt template...")

    terms = self.terms | self.prompts[id].terms | input
    resp_type = self.prompts[id].response_type

    prompt = {
        "system": self.system_template.render(terms),
        "prompt": self.render(id, input)
    }

    resp = await llm(**prompt)

    if resp_type == "text":
        return resp

    if resp_type == "json":
        try:
            obj = self.parse_json(resp)
        except Exception:
            # Catch Exception rather than using a bare except, so task
            # cancellation and KeyboardInterrupt still propagate
            logger.error(f"JSON parse failed: {resp}")
            raise RuntimeError("JSON parse fail")

        if self.prompts[id].schema:
            try:
                validate(instance=obj, schema=self.prompts[id].schema)
                logger.debug("Schema validation successful")
            except Exception as e:
                raise RuntimeError(f"Schema validation fail: {e}")

        return obj

    if resp_type == "jsonl":
        objects = self.parse_jsonl(resp)

        if not objects:
            logger.warning("JSONL parse returned no valid objects")
            return []

        # Validate each object against schema if provided
        if self.prompts[id].schema:
            validated = []
            for i, obj in enumerate(objects):
                try:
                    validate(instance=obj, schema=self.prompts[id].schema)
                    validated.append(obj)
                except Exception as e:
                    logger.warning(f"Object {i} failed schema validation: {e}")
            return validated

        return objects

    raise RuntimeError(f"Response type {resp_type} not known")
```
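
The per-object validation step in the `"jsonl"` branch can be illustrated in isolation. The sketch below is an assumption-laden stand-in: `check_object` is a deliberately tiny checker written for this example (it only looks at `required` keys and string-typed properties) and is not `jsonschema.validate()`; `filter_valid` is likewise a hypothetical helper name.

```python
def check_object(obj, schema):
    """Tiny stand-in for a real JSON Schema validator (NOT jsonschema).

    Checks only that the value is an object, that `required` keys exist,
    and that properties declared as "string" hold strings.
    Raises ValueError on the first problem found.
    """
    if not isinstance(obj, dict):
        raise ValueError("not an object")
    for key in schema.get("required", []):
        if key not in obj:
            raise ValueError(f"missing required field {key!r}")
    for key, spec in schema.get("properties", {}).items():
        if key in obj and spec.get("type") == "string" and not isinstance(obj[key], str):
            raise ValueError(f"field {key!r} is not a string")


def filter_valid(objects, schema, validator=check_object):
    """Keep objects that pass validation; collect (index, error) for the rest."""
    valid, rejected = [], []
    for i, obj in enumerate(objects):
        try:
            validator(obj, schema)
            valid.append(obj)
        except Exception as e:
            rejected.append((i, str(e)))
    return valid, rejected


schema = {
    "type": "object",
    "properties": {
        "entity": {"type": "string"},
        "definition": {"type": "string"},
    },
    "required": ["entity", "definition"],
}
objects = [
    {"entity": "DNA", "definition": "Molecule carrying genetic instructions"},
    {"entity": "RNA"},                          # missing "definition"
    {"entity": 42, "definition": "a number"},   # wrong type for "entity"
]

valid, rejected = filter_valid(objects, schema)
print(len(valid), [index for index, _ in rejected])  # 1 [1, 2]
```

With the `jsonschema` package installed, the same loop should work with a real validator by passing `validator=lambda obj, schema: jsonschema.validate(instance=obj, schema=schema)`. One invalid object then costs only that object, not the whole response.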

### Affected Prompts

The following prompts should be migrated to JSONL format:

| Prompt ID | Description | Type Field |
|-----------|-------------|------------|
| `extract-definitions` | Entity/definition extraction | No (single type) |
| `extract-relationships` | Relationship extraction | No (single type) |
| `extract-topics` | Topic/definition extraction | No (single type) |
| `extract-rows` | Structured row extraction | No (single type) |
| `agent-kg-extract` | Combined definition + relationship extraction | Yes: `"definition"`, `"relationship"` |
| `extract-with-ontologies` / `ontology-extract` | Ontology-based extraction | Yes: `"entity"`, `"relationship"`, `"attribute"` |

### API Changes

#### Client Perspective

JSONL parsing is transparent to prompt service API callers. The parsing occurs
server-side in the prompt service, and the response is returned via the standard
`PromptResponse.object` field as a serialized JSON array.

When clients call the prompt service (via `PromptClient.prompt()` or similar):

- **`response-type: "json"`** with array schema → client receives Python `list`
- **`response-type: "jsonl"`** → client receives Python `list`

From the client's perspective, both return identical data structures. The
difference is entirely in how the LLM output is parsed server-side:

- JSON array format: Single `json.loads()` call; fails completely if truncated
- JSONL format: Line-by-line parsing; yields partial results if truncated

This means existing client code expecting a list from extraction prompts
requires no changes when migrating prompts from JSON to JSONL format.

#### Server Return Value

For `response-type: "jsonl"`, the `PromptManager.invoke()` method returns a
`list[dict]` containing all successfully parsed and validated objects. This
list is then serialized to JSON for the `PromptResponse.object` field.

#### Error Handling

- Empty results: Returns empty list `[]` with warning log
- Partial parse failure: Returns list of successfully parsed objects with
  warning logs for failures
- Complete parse failure: Returns empty list `[]` with warning logs

This differs from `response-type: "json"` which raises `RuntimeError` on
parse failure. The lenient behavior for JSONL is intentional to provide
truncation resilience.
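
The difference between the two error-handling policies can be sketched side by side for the three cases listed above. The function names `strict_json`, `lenient_jsonl` and `describe` are illustrative assumptions that mirror the described behavior; they are not the service's actual code paths.

```python
import json


def strict_json(text):
    """Mirror of the "json" policy: any parse failure raises RuntimeError."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        raise RuntimeError("JSON parse fail") from None


def lenient_jsonl(text):
    """Mirror of the "jsonl" policy: undecodable lines are skipped, never raised."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            results.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return results


def describe(text):
    """Return (strict outcome, lenient outcome) for one response text."""
    try:
        strict = strict_json(text)
    except RuntimeError as e:
        strict = f"raised RuntimeError: {e}"
    return strict, lenient_jsonl(text)


cases = [
    ("empty", ""),
    ("partial", '{"a": 1}\n{"a": 2'),
    ("unparseable", "no JSON here\nnone here either"),
]
for label, text in cases:
    print(label, describe(text))
```

In all three cases the strict path raises, while the lenient path returns `[]`, `[{"a": 1}]` and `[]` respectively, so callers of JSONL prompts need to treat an empty list as a possible outcome rather than relying on an exception.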

### Configuration Example

Complete prompt configuration example:

```json
{
  "prompt": "Extract all entities and their definitions from the following text. Output one JSON object per line.\n\nText:\n{{text}}\n\nOutput format per line:\n{\"entity\": \"<name>\", \"definition\": \"<definition>\"}",
  "response-type": "jsonl",
  "schema": {
    "type": "object",
    "properties": {
      "entity": {
        "type": "string",
        "description": "The entity name"
      },
      "definition": {
        "type": "string",
        "description": "A clear definition of the entity"
      }
    },
    "required": ["entity", "definition"]
  }
}
```

## Security Considerations

- **Input Validation**: JSON parsing uses standard `json.loads()`, which only
  decodes data and does not evaluate code, so it is safe against injection
  attacks
- **Schema Validation**: Uses `jsonschema.validate()` for schema enforcement
- **No New Attack Surface**: JSONL parsing applies the same `json.loads()` call
  to each line, so it adds no parsing path beyond JSON array parsing

## Performance Considerations

- **Memory**: Each `json.loads()` call handles a single line rather than one
  large array; the full response text and result list are still held in memory
- **Latency**: Parsing performance is comparable to JSON array parsing
- **Validation**: Schema validation runs per-object, which adds overhead but
  enables partial results on validation failure

## Testing Strategy

### Unit Tests

- JSONL parsing with valid input
- JSONL parsing with empty lines
- JSONL parsing with markdown code fences
- JSONL parsing with truncated final line
- JSONL parsing with invalid JSON lines interspersed
- Schema validation with `oneOf` discriminated unions
- Backward compatibility: existing `"text"` and `"json"` prompts unchanged

### Integration Tests

- End-to-end extraction with JSONL prompts
- Extraction with simulated truncation (artificially limited response)
- Mixed-type extraction with type discriminator
- Ontology extraction with all three types

### Extraction Quality Tests

- Compare extraction results: JSONL vs JSON array format
- Verify truncation resilience: JSONL yields partial results where JSON fails

## Migration Plan

### Phase 1: Implementation

1. Implement `parse_jsonl()` method in `PromptManager`
2. Extend `invoke()` to handle `response-type: "jsonl"`
3. Add unit tests

### Phase 2: Prompt Migration

1. Update `extract-definitions` prompt and configuration
2. Update `extract-relationships` prompt and configuration
3. Update `extract-topics` prompt and configuration
4. Update `extract-rows` prompt and configuration
5. Update `agent-kg-extract` prompt and configuration
6. Update `extract-with-ontologies` prompt and configuration

### Phase 3: Downstream Updates

1. Update any code consuming extraction results to handle list return type
2. Update code that categorizes mixed-type extractions by `type` field
3. Update tests that assert on extraction output format

## Open Questions

None at this time.

## References

- Current implementation: `trustgraph-flow/trustgraph/template/prompt_manager.py`
- JSON Lines specification: https://jsonlines.org/
- JSON Schema `oneOf`: https://json-schema.org/understanding-json-schema/reference/combining.html#oneof
- Related specification: Streaming LLM Responses (`docs/tech-specs/streaming-llm-responses.md`)