Feat: TrustGraph i18n & Documentation Translation Updates (#781)
Native CLI i18n: The TrustGraph CLI has built-in translation support
that dynamically loads language strings. You can test and use
different languages by simply passing the --lang flag (e.g., --lang
es for Spanish, --lang ru for Russian) or by configuring your
environment's LANG variable.
Automated Docs Translations: This PR introduces autonomously
translated Markdown documentation into several target languages,
including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew,
Arabic, Simplified Chinese, and Russian.
2026-04-14 07:07:58 -04:00
|
|
|
---
|
|
|
|
|
layout: default
|
|
|
|
|
title: "Universal Document Decoder"
|
|
|
|
|
parent: "Tech Specs"
|
|
|
|
|
---
|
|
|
|
|
|
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.
New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable. PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.
Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.
Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
|
|
|
# Universal Document Decoder
|
|
|
|
|
|
|
|
|
|
## Headline
|
|
|
|
|
|
|
|
|
|
Universal document decoder powered by `unstructured` — ingest any common
|
|
|
|
|
document format through a single service with full provenance and librarian
|
|
|
|
|
integration, recording source positions as knowledge graph metadata for
|
|
|
|
|
end-to-end traceability.
|
|
|
|
|
|
|
|
|
|
## Problem
|
|
|
|
|
|
|
|
|
|
TrustGraph currently has a PDF-specific decoder. Supporting additional
|
|
|
|
|
formats (DOCX, XLSX, HTML, Markdown, plain text, PPTX, etc.) requires
|
|
|
|
|
either writing a new decoder per format or adopting a universal extraction
|
|
|
|
|
library. Each format has different structure — some are page-based, some
|
|
|
|
|
aren't — and the provenance chain must record where in the original
|
|
|
|
|
document each piece of extracted text originated.
|
|
|
|
|
|
|
|
|
|
## Approach
|
|
|
|
|
|
|
|
|
|
### Library: `unstructured`
|
|
|
|
|
|
|
|
|
|
Use `unstructured.partition.auto.partition()` which auto-detects format
|
|
|
|
|
from mime type or file extension and extracts structured elements
|
|
|
|
|
(Title, NarrativeText, Table, ListItem, etc.). Each element carries
|
|
|
|
|
metadata including:
|
|
|
|
|
|
|
|
|
|
- `page_number` (for page-based formats like PDF, PPTX)
|
|
|
|
|
- `element_id` (unique per element)
|
|
|
|
|
- `coordinates` (bounding box for PDFs)
|
|
|
|
|
- `text` (the extracted text content)
|
|
|
|
|
- `category` (element type: Title, NarrativeText, Table, etc.)
|
|
|
|
|
|
|
|
|
|
### Element Types
|
|
|
|
|
|
|
|
|
|
`unstructured` extracts typed elements from documents. Each element has
|
|
|
|
|
a category and associated metadata:
|
|
|
|
|
|
|
|
|
|
**Text elements:**
|
|
|
|
|
- `Title` — section headings
|
|
|
|
|
- `NarrativeText` — body paragraphs
|
|
|
|
|
- `ListItem` — bullet/numbered list items
|
|
|
|
|
- `Header`, `Footer` — page headers/footers
|
|
|
|
|
- `FigureCaption` — captions for figures/images
|
|
|
|
|
- `Formula` — mathematical expressions
|
|
|
|
|
- `Address`, `EmailAddress` — contact information
|
|
|
|
|
- `CodeSnippet` — code blocks (from markdown)
|
|
|
|
|
|
|
|
|
|
**Tables:**
|
|
|
|
|
- `Table` — structured tabular data. `unstructured` provides both
|
|
|
|
|
`element.text` (plain text) and `element.metadata.text_as_html`
|
|
|
|
|
(full HTML `<table>` with rows, columns, and headers preserved).
|
|
|
|
|
For formats with explicit table structure (DOCX, XLSX, HTML), the
|
|
|
|
|
extraction is highly reliable. For PDFs, table detection depends on
|
|
|
|
|
the `hi_res` strategy with layout analysis.
|
|
|
|
|
|
|
|
|
|
**Images:**
|
|
|
|
|
- `Image` — embedded images detected via layout analysis (requires
|
|
|
|
|
`hi_res` strategy). With `extract_image_block_to_payload=True`,
|
|
|
|
|
returns the image data as base64 in `element.metadata.image_base64`.
|
|
|
|
|
OCR text from the image is available in `element.text`.
|
|
|
|
|
|
|
|
|
|
### Table Handling
|
|
|
|
|
|
|
|
|
|
Tables are a first-class output. When the decoder encounters a `Table`
|
|
|
|
|
element, it preserves the HTML structure rather than flattening to
|
|
|
|
|
plain text. This gives the downstream LLM extractor much better input
|
|
|
|
|
for pulling structured knowledge out of tabular data.
|
|
|
|
|
|
|
|
|
|
The page/section text is assembled as follows:
|
|
|
|
|
- Text elements: plain text, joined with newlines
|
|
|
|
|
- Table elements: HTML table markup from `text_as_html`, wrapped in a
|
|
|
|
|
`<table>` marker so the LLM can distinguish tables from narrative
|
|
|
|
|
|
|
|
|
|
For example, a page with a heading, paragraph, and table produces:
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
Financial Overview
|
|
|
|
|
|
|
|
|
|
Revenue grew 15% year-over-year driven by enterprise adoption.
|
|
|
|
|
|
|
|
|
|
<table>
|
|
|
|
|
<tr><th>Quarter</th><th>Revenue</th><th>Growth</th></tr>
|
|
|
|
|
<tr><td>Q1</td><td>$12M</td><td>12%</td></tr>
|
|
|
|
|
<tr><td>Q2</td><td>$14M</td><td>17%</td></tr>
|
|
|
|
|
</table>
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
This preserves table structure through chunking and into the extraction
|
|
|
|
|
pipeline, where the LLM can extract relationships directly from
|
|
|
|
|
structured cells rather than guessing at column alignment from
|
|
|
|
|
whitespace.
|
|
|
|
|
|
|
|
|
|
### Image Handling
|
|
|
|
|
|
|
|
|
|
Images are extracted and stored in the librarian as child documents
|
|
|
|
|
with `document_type="image"` and a `urn:image:{uuid}` ID. They get
|
|
|
|
|
provenance triples with type `tg:Image`, linked to their parent
|
|
|
|
|
page/section via `prov:wasDerivedFrom`. Image metadata (coordinates,
|
|
|
|
|
dimensions, element_id) is recorded in provenance.
|
|
|
|
|
|
|
|
|
|
**Crucially, images are NOT emitted as TextDocument outputs.** They are
|
|
|
|
|
stored only — not sent downstream to the chunker or any text processing
|
|
|
|
|
pipeline. This is intentional:
|
|
|
|
|
|
|
|
|
|
1. There is no image processing pipeline yet (vision model integration
|
|
|
|
|
is future work)
|
|
|
|
|
2. Feeding base64 image data or OCR fragments into the text extraction
|
|
|
|
|
pipeline would produce garbage KG triples
|
|
|
|
|
|
|
|
|
|
Images are also excluded from the assembled page text — any `Image`
|
|
|
|
|
elements are silently skipped when concatenating element texts for a
|
|
|
|
|
page/section. The provenance chain records that images exist and where
|
|
|
|
|
they appeared in the document, so they can be picked up by a future
|
|
|
|
|
image processing pipeline without re-ingesting the document.
|
|
|
|
|
|
|
|
|
|
#### Future work
|
|
|
|
|
|
|
|
|
|
- Route `tg:Image` entities to a vision model for description,
|
|
|
|
|
diagram interpretation, or chart data extraction
|
|
|
|
|
- Store image descriptions as text child documents that feed into
|
|
|
|
|
the standard chunking/extraction pipeline
|
|
|
|
|
- Link extracted knowledge back to source images via provenance
|
|
|
|
|
|
|
|
|
|
### Section Strategies
|
|
|
|
|
|
|
|
|
|
For page-based formats (PDF, PPTX, XLSX), elements are always grouped
|
|
|
|
|
by page/slide/sheet first. For non-page formats (DOCX, HTML, Markdown,
|
|
|
|
|
etc.), the decoder needs a strategy for splitting the document into
|
|
|
|
|
sections. This is runtime-configurable via `--section-strategy`.
|
|
|
|
|
|
|
|
|
|
Each strategy is a grouping function over the list of `unstructured`
|
|
|
|
|
elements. The output is a list of element groups; the rest of the
|
|
|
|
|
pipeline (text assembly, librarian storage, provenance, TextDocument
|
|
|
|
|
emission) is identical regardless of strategy.
|
|
|
|
|
|
|
|
|
|
#### `whole-document` (default)
|
|
|
|
|
|
|
|
|
|
Emit the entire document as a single section. Let the downstream
|
|
|
|
|
chunker handle all splitting.
|
|
|
|
|
|
|
|
|
|
- Simplest approach, good baseline
|
|
|
|
|
- May produce very large TextDocument for big files, but the chunker
|
|
|
|
|
handles that
|
|
|
|
|
- Best when you want maximum context per section
|
|
|
|
|
|
|
|
|
|
#### `heading`
|
|
|
|
|
|
|
|
|
|
Split at heading elements (`Title`). Each section is a heading plus
|
|
|
|
|
all content until the next heading of equal or higher level. Nested
|
|
|
|
|
headings create nested sections.
|
|
|
|
|
|
|
|
|
|
- Produces topically coherent units
|
|
|
|
|
- Works well for structured documents (reports, manuals, specs)
|
|
|
|
|
- Gives the extraction LLM heading context alongside content
|
|
|
|
|
- Falls back to `whole-document` if no headings are found
|
|
|
|
|
|
|
|
|
|
#### `element-type`
|
|
|
|
|
|
|
|
|
|
Split when the element type changes significantly — specifically,
|
|
|
|
|
start a new section at transitions between narrative text and tables.
|
|
|
|
|
Consecutive elements of the same broad category (text, text, text or
|
|
|
|
|
table, table) stay grouped.
|
|
|
|
|
|
|
|
|
|
- Keeps tables as standalone sections
|
|
|
|
|
- Good for documents with mixed content (reports with data tables)
|
|
|
|
|
- Tables get dedicated extraction attention
|
|
|
|
|
|
|
|
|
|
#### `count`
|
|
|
|
|
|
|
|
|
|
Group a fixed number of elements per section. Configurable via
|
|
|
|
|
`--section-element-count` (default: 20).
|
|
|
|
|
|
|
|
|
|
- Simple and predictable
|
|
|
|
|
- Doesn't respect document structure
|
|
|
|
|
- Useful as a fallback or for experimentation
|
|
|
|
|
|
|
|
|
|
#### `size`
|
|
|
|
|
|
|
|
|
|
Accumulate elements until a character limit is reached, then start a
|
|
|
|
|
new section. Respects element boundaries — never splits mid-element.
|
|
|
|
|
Configurable via `--section-max-size` (default: 4000 characters).
|
|
|
|
|
|
|
|
|
|
- Produces roughly uniform section sizes
|
|
|
|
|
- Respects element boundaries (unlike the downstream chunker)
|
|
|
|
|
- Good compromise between structure and size control
|
|
|
|
|
- If a single element exceeds the limit, it becomes its own section
|
|
|
|
|
|
|
|
|
|
#### Page-based format interaction
|
|
|
|
|
|
|
|
|
|
For page-based formats, the page grouping always takes priority.
|
|
|
|
|
Section strategies can optionally apply *within* a page if it's very
|
|
|
|
|
large (e.g. a PDF page with an enormous table), controlled by
|
|
|
|
|
`--section-within-pages` (default: false). When false, each page is
|
|
|
|
|
always one section regardless of size.
|
|
|
|
|
|
|
|
|
|
### Format Detection
|
|
|
|
|
|
|
|
|
|
The decoder needs to know the document's mime type to pass to
|
|
|
|
|
`unstructured`'s `partition()`. Two paths:
|
|
|
|
|
|
|
|
|
|
- **Librarian path** (`document_id` set): fetch document metadata
|
|
|
|
|
from the librarian first — this gives us the `kind` (mime type)
|
|
|
|
|
that was recorded at upload time. Then fetch document content.
|
|
|
|
|
Two librarian calls, but the metadata fetch is lightweight.
|
|
|
|
|
- **Inline path** (backward compat, `data` set): no metadata
|
|
|
|
|
available on the message. Use `python-magic` to detect format
|
|
|
|
|
from content bytes as a fallback.
|
|
|
|
|
|
|
|
|
|
No changes to the `Document` schema are needed — the librarian
|
|
|
|
|
already stores the mime type.
|
|
|
|
|
|
|
|
|
|
### Architecture
|
|
|
|
|
|
|
|
|
|
A single `universal-decoder` service that:
|
|
|
|
|
|
|
|
|
|
1. Receives a `Document` message (inline or via librarian reference)
|
|
|
|
|
2. If librarian path: fetch document metadata (get mime type), then
|
|
|
|
|
fetch content. If inline path: detect format from content bytes.
|
|
|
|
|
3. Calls `partition()` to extract elements
|
|
|
|
|
4. Groups elements: by page for page-based formats, by configured
|
|
|
|
|
section strategy for non-page formats
|
|
|
|
|
5. For each page/section:
|
|
|
|
|
- Generates a `urn:page:{uuid}` or `urn:section:{uuid}` ID
|
|
|
|
|
- Assembles page text: narrative as plain text, tables as HTML,
|
|
|
|
|
images skipped
|
|
|
|
|
- Computes character offsets for each element within the page text
|
|
|
|
|
- Saves to librarian as child document
|
|
|
|
|
- Emits provenance triples with positional metadata
|
|
|
|
|
- Sends `TextDocument` downstream for chunking
|
|
|
|
|
6. For each image element:
|
|
|
|
|
- Generates a `urn:image:{uuid}` ID
|
|
|
|
|
- Saves image data to librarian as child document
|
|
|
|
|
- Emits provenance triples (stored only, not sent downstream)
|
|
|
|
|
|
|
|
|
|
### Format Handling
|
|
|
|
|
|
|
|
|
|
| Format | Mime Type | Page-based | Notes |
|
|
|
|
|
|----------|------------------------------------|------------|--------------------------------|
|
|
|
|
|
| PDF | application/pdf | Yes | Per-page grouping |
|
|
|
|
|
| DOCX | application/vnd.openxmlformats... | No | Uses section strategy |
|
|
|
|
|
| PPTX | application/vnd.openxmlformats... | Yes | Per-slide grouping |
|
|
|
|
|
| XLSX/XLS | application/vnd.openxmlformats... | Yes | Per-sheet grouping |
|
|
|
|
|
| HTML | text/html | No | Uses section strategy |
|
|
|
|
|
| Markdown | text/markdown | No | Uses section strategy |
|
|
|
|
|
| Plain | text/plain | No | Uses section strategy |
|
|
|
|
|
| CSV | text/csv | No | Uses section strategy |
|
|
|
|
|
| RST | text/x-rst | No | Uses section strategy |
|
|
|
|
|
| RTF | application/rtf | No | Uses section strategy |
|
|
|
|
|
| ODT | application/vnd.oasis... | No | Uses section strategy |
|
|
|
|
|
| TSV | text/tab-separated-values | No | Uses section strategy |
|
|
|
|
|
|
|
|
|
|
### Provenance Metadata
|
|
|
|
|
|
|
|
|
|
Each page/section entity records positional metadata as provenance
|
|
|
|
|
triples in `GRAPH_SOURCE`, enabling full traceability from KG triples
|
|
|
|
|
back to source document positions.
|
|
|
|
|
|
|
|
|
|
#### Existing fields (already in `derived_entity_triples`)
|
|
|
|
|
|
|
|
|
|
- `page_number` — page/sheet/slide number (1-indexed, page-based only)
|
|
|
|
|
- `char_offset` — character offset of this page/section within the
|
|
|
|
|
full document text
|
|
|
|
|
- `char_length` — character length of this page/section's text
|
|
|
|
|
|
|
|
|
|
#### New fields (extend `derived_entity_triples`)
|
|
|
|
|
|
|
|
|
|
- `mime_type` — original document format (e.g. `application/pdf`)
|
|
|
|
|
- `element_types` — comma-separated list of `unstructured` element
|
|
|
|
|
categories found in this page/section (e.g. "Title,NarrativeText,Table")
|
|
|
|
|
- `table_count` — number of tables in this page/section
|
|
|
|
|
- `image_count` — number of images in this page/section
|
|
|
|
|
|
|
|
|
|
These require new TG namespace predicates:
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
TG_SECTION_TYPE = "https://trustgraph.ai/ns/Section"
|
|
|
|
|
TG_IMAGE_TYPE = "https://trustgraph.ai/ns/Image"
|
|
|
|
|
TG_ELEMENT_TYPES = "https://trustgraph.ai/ns/elementTypes"
|
|
|
|
|
TG_TABLE_COUNT = "https://trustgraph.ai/ns/tableCount"
|
|
|
|
|
TG_IMAGE_COUNT = "https://trustgraph.ai/ns/imageCount"
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
Image URN scheme: `urn:image:{uuid}`
|
|
|
|
|
|
|
|
|
|
(`TG_MIME_TYPE` already exists.)
|
|
|
|
|
|
|
|
|
|
#### New entity type
|
|
|
|
|
|
|
|
|
|
For non-page formats (DOCX, HTML, Markdown, etc.) where the decoder
|
|
|
|
|
emits the whole document as a single unit rather than splitting by
|
|
|
|
|
page, the entity gets a new type:
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
TG_SECTION_TYPE = "https://trustgraph.ai/ns/Section"
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
This distinguishes sections from pages when querying provenance:
|
|
|
|
|
|
|
|
|
|
| Entity | Type | When used |
|
|
|
|
|
|----------|-----------------------------|----------------------------------------|
|
|
|
|
|
| Document | `tg:Document` | Original uploaded file |
|
|
|
|
|
| Page | `tg:Page` | Page-based formats (PDF, PPTX, XLSX) |
|
|
|
|
|
| Section | `tg:Section` | Non-page formats (DOCX, HTML, MD, etc) |
|
|
|
|
|
| Image | `tg:Image` | Embedded images (stored, not processed)|
|
|
|
|
|
| Chunk | `tg:Chunk` | Output of chunker |
|
|
|
|
|
| Subgraph | `tg:Subgraph` | KG extraction output |
|
|
|
|
|
|
|
|
|
|
The type is set by the decoder based on whether it's grouping by page
|
|
|
|
|
or emitting a whole-document section. `derived_entity_triples` gains
|
|
|
|
|
an optional `section` boolean parameter — when true, the entity is
|
|
|
|
|
typed as `tg:Section` instead of `tg:Page`.
|
|
|
|
|
|
|
|
|
|
#### Full provenance chain
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
KG triple
|
|
|
|
|
→ subgraph (extraction provenance)
|
|
|
|
|
→ chunk (char_offset, char_length within page)
|
|
|
|
|
→ page/section (page_number, char_offset, char_length within doc, mime_type, element_types)
|
|
|
|
|
→ document (original file in librarian)
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
Every link is a set of triples in the `GRAPH_SOURCE` named graph.
|
|
|
|
|
|
|
|
|
|
### Service Configuration
|
|
|
|
|
|
|
|
|
|
Command-line arguments:
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
--strategy Partitioning strategy: auto, hi_res, fast (default: auto)
|
|
|
|
|
--languages Comma-separated OCR language codes (default: eng)
|
|
|
|
|
--section-strategy Section grouping: whole-document, heading, element-type,
|
|
|
|
|
count, size (default: whole-document)
|
|
|
|
|
--section-element-count Elements per section for 'count' strategy (default: 20)
|
|
|
|
|
--section-max-size Max chars per section for 'size' strategy (default: 4000)
|
|
|
|
|
--section-within-pages Apply section strategy within pages too (default: false)
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
Plus the standard `FlowProcessor` and librarian queue arguments.
|
|
|
|
|
|
|
|
|
|
### Flow Integration
|
|
|
|
|
|
|
|
|
|
The universal decoder occupies the same position in the processing flow
|
|
|
|
|
as the current PDF decoder:
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
Document → [universal-decoder] → TextDocument → [chunker] → Chunk → ...
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
It registers:
|
|
|
|
|
- `input` consumer (Document schema)
|
|
|
|
|
- `output` producer (TextDocument schema)
|
|
|
|
|
- `triples` producer (Triples schema)
|
|
|
|
|
- Librarian request/response (for fetch and child document storage)
|
|
|
|
|
|
|
|
|
|
### Deployment
|
|
|
|
|
|
|
|
|
|
- New container: `trustgraph-flow-universal-decoder`
|
|
|
|
|
- Dependency: `unstructured[all-docs]` (includes PDF, DOCX, PPTX, etc.)
|
|
|
|
|
- Can run alongside or replace the existing PDF decoder depending on
|
|
|
|
|
flow configuration
|
|
|
|
|
- The existing PDF decoder remains available for environments where
|
|
|
|
|
`unstructured` dependencies are too heavy
|
|
|
|
|
|
|
|
|
|
### What Changes
|
|
|
|
|
|
|
|
|
|
| Component | Change |
|
|
|
|
|
|------------------------------|-------------------------------------------------|
|
|
|
|
|
| `provenance/namespaces.py` | Add `TG_SECTION_TYPE`, `TG_IMAGE_TYPE`, `TG_ELEMENT_TYPES`, `TG_TABLE_COUNT`, `TG_IMAGE_COUNT` |
|
|
|
|
|
| `provenance/triples.py` | Add `mime_type`, `element_types`, `table_count`, `image_count` kwargs |
|
|
|
|
|
| `provenance/__init__.py` | Export new constants |
|
|
|
|
|
| New: `decoding/universal/` | New decoder service module |
|
|
|
|
|
| `setup.cfg` / `pyproject` | Add `unstructured[all-docs]` dependency |
|
|
|
|
|
| Docker | New container image |
|
|
|
|
|
| Flow definitions | Wire universal-decoder as document input |
|
|
|
|
|
|
|
|
|
|
### What Doesn't Change
|
|
|
|
|
|
|
|
|
|
- Chunker (receives TextDocument, works as before)
|
|
|
|
|
- Downstream extractors (receive Chunk, unchanged)
|
|
|
|
|
- Librarian (stores child documents, unchanged)
|
|
|
|
|
- Schema (Document, TextDocument, Chunk unchanged)
|
|
|
|
|
- Query-time provenance (unchanged)
|
|
|
|
|
|
|
|
|
|
## Risks
|
|
|
|
|
|
|
|
|
|
- `unstructured[all-docs]` has heavy dependencies (poppler, tesseract,
|
|
|
|
|
libreoffice for some formats). Container image will be larger.
|
|
|
|
|
Mitigation: offer a `[light]` variant without OCR/office deps.
|
|
|
|
|
- Some formats may produce poor text extraction (scanned PDFs without
|
|
|
|
|
OCR, complex XLSX layouts). Mitigation: configurable `strategy`
|
|
|
|
|
parameter, and the existing Mistral OCR decoder remains available
|
|
|
|
|
for high-quality PDF OCR.
|
|
|
|
|
- `unstructured` version updates may change element metadata.
|
|
|
|
|
Mitigation: pin version, test extraction quality per format.
|