trustgraph/trustgraph-flow/trustgraph/decoding/pdf/pdf_decoder.py


"""
Simple decoder: accepts PDF documents on input and outputs each page of
the PDF as text in a separate output object.
"""

import tempfile
import base64
from langchain_community.document_loaders import PyPDFLoader

from ... schema import Document, TextDocument
from ... base import FlowProcessor, ConsumerSpec, ProducerSpec

default_ident = "pdf-decoder"

class Processor(FlowProcessor):

    def __init__(self, **params):

        id = params.get("id", default_ident)

        super(Processor, self).__init__(
            **params | {
                "id": id,
            }
        )

        self.register_specification(
            ConsumerSpec(
                name = "input",
                schema = Document,
                handler = self.on_message,
            )
        )

        self.register_specification(
            ProducerSpec(
                name = "output",
                schema = TextDocument,
            )
        )

        print("PDF inited", flush=True)

    async def on_message(self, msg, consumer, flow):

        print("PDF message received", flush=True)

        v = msg.value()

        print(f"Decoding {v.metadata.id}...", flush=True)

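        # Write the decoded PDF to a temporary file so PyPDFLoader can read
        # it by path. delete_on_close=False (Python 3.12+) keeps the file
        # on disk after fp.close() until the with-block exits.
        with tempfile.NamedTemporaryFile(delete_on_close=False) as fp:

            fp.write(base64.b64decode(v.data))
            fp.close()

            loader = PyPDFLoader(fp.name)
            pages = loader.load()

            for ix, page in enumerate(pages):

                print("page", ix, flush=True)

                # Emit each page as its own TextDocument, reusing the
                # source document's metadata.
                r = TextDocument(
                    metadata=v.metadata,
                    text=page.page_content.encode("utf-8"),
                )

                await flow("output").send(r)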

        print("Done.", flush=True)

    @staticmethod
    def add_args(parser):
        FlowProcessor.add_args(parser)

def run():

    Processor.launch(default_ident, __doc__)
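
The decode step in `on_message` writes a base64 payload to a named temporary file and hands the path to a reader. The sketch below isolates that pattern using only the standard library. It is an illustration, not part of the module: `decode_to_tempfile` is a hypothetical helper name, the payload is dummy bytes rather than a real PDF, and it uses `delete=False` with explicit cleanup (an assumption made so the sketch also runs on Python versions before 3.12, where `delete_on_close` is not available).

```python
import base64
import os
import tempfile


def decode_to_tempfile(b64_data):
    """Decode a base64 payload into a closed temporary file and return its path.

    Hypothetical helper for illustration only.
    """
    # delete=False keeps the file on disk after it is closed, so that a
    # path-based reader (such as a PDF loader) can open it by name.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as fp:
        fp.write(base64.b64decode(b64_data))
    return fp.name


# Dummy payload standing in for the base64-encoded document data.
payload = base64.b64encode(b"%PDF-1.4 example bytes")

path = decode_to_tempfile(payload)
try:
    # A real decoder would pass `path` to its PDF loader here.
    with open(path, "rb") as f:
        data = f.read()
finally:
    # The caller owns cleanup because delete=False was used.
    os.remove(path)
```

With `delete=False` the caller is responsible for removing the file, which is why the read is wrapped in `try`/`finally`. The module's `delete_on_close=False` form leaves that cleanup to the context manager instead.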