Fixed bad rename

2026-04-26 08:56:21 +02:00 · 2024-07-16 17:00:56 +01:00 · 2024-07-16 17:00:56 +01:00 · bdf9ba8b5d
commit bdf9ba8b5d
parent 476d45eff7
3 changed files with 459 additions and 459 deletions
--- a/README.development.md
+++ b/README.development.md
@ -1,404 +1,82 @@

-# TrustGraph
+# Contributing

-## Introduction
+## Generally

-TrustGraph is a true end-to-end (e2e) knowledge pipeline that performs a `naive extraction` on a text corpus
-to build a RDF style knowledge graph coupled with a `RAG` service compatible with cloud LLMs and open-source
-SLMs (Small Language Models).
+Branching is good discipline to get into with multiple people working
+on the same repo for different reasons.

-The pipeline processing components are interconnected with a pub/sub engine to
-maximize modularity and enable new knowledge processing functions. The core processing components decode documents, 
-chunk text, perform embeddings, apply a local SLM/LLM, call a LLM API, and generate LM predictions.
+To create a branch...

-The processing showcases the reliability and efficiences of Graph RAG algorithms which can capture
-contextual language flags that are missed in conventional RAG approaches. Graph querying algorithms enable retrieving
-not just relevant knowledge but language cues essential to understanding semantic uses unique to a text corpus.
+- `git checkout -b etl` # to create the branch and check it out
+- `git push` # to push the branch head to the upstream repo.  You get an error and a command to run.  You don't have to do this straight away, but I like to get the BS admin out the way.  At this stage your branch HEAD points to the head of main.

-Processing modules are executed in containers.  Processing can be scaled-up
-by deploying multiple containers.
+## Adding a new module

-### Features
+So, to add a new module...

- PDF decoding
- Text chunking
- Inference of LMs deployed with [Ollama](https://ollama.com)
- Inference of LLMs: Claude, VertexAI and AzureAI serverless endpoints
- Application of a [HuggingFace](https://hf.co) embeddings models
- [RDF](https://www.w3.org/TR/rdf12-schema/)-aligned Knowledge Graph extraction
- Graph edge loading into [Apache Cassandra](https://github.com/apache/cassandra)
- Storing embeddings in [Milvus](https://github.com/milvus-io/milvus)
- Embedding query service
- Graph RAG query service
- All procesing integrates with [Apache Pulsar](https://github.com/apache/pulsar/)
- Containers, so can be deployed using Docker Compose or Kubernetes
- Plug'n'play architecture: switch different LLM modules to suit your needs
+- It needs a name. Say `kg-mymodule` but you can call it what you like.
+- It also needs a place in the Python package hierarchy, because it's
+  basically going to be its own loadable module.  We have a `trustgraph.kg`
+  module it can be a child of.  So, you need a directory
+  `trustgraph/kg/mymodule`
+- You need three files:
+  - `__init__.py` which defines the module entry point.
+  - Then, `__main__.py` means the module is executable.
+  - Finally a module to contain the code, let's call it `extract.py`.
+    The name doesn't matter but it has to match what's in `__init__.py` and
+    `__main__.py`.
+- The easiest way to get start is maybe make a copy of an existing module.
+- `cp -r trustgraph/kg/extract_relationships trustgraph/kg/mymodule/`
+- Finally you need a script entry point, in `scripts`.  Copy
+  `scripts/kg-extract-relationships` to `scripts/kg-mymodule`
+- In that `kg-mymodule` file, change the import line to import your module,
+  `trustgraph.kg.mymodule`.

-## Architecture
+## Development testing

-![architecture](architecture.png)
+To run your module, you don't need to have it running in a container.
+It can connect to Pulsar.

-TrustGraph is designed to be modular to support as many Language Models and environments as possible. A natural
-fit for a modular architecture is to decompose functions into a set modules connected through a pub/sub backbone.
-[Apache Pulsar](https://github.com/apache/pulsar/) serves as this pub/sub backbone. Pulsar acts as the data broker
-managing inputs and outputs between modules.
+The plumbing for your new module pretty needs to be right.  Look at the
+input_queue, output_queue and subscriber settings near the top of your
+new module code.

-**Pulsar Workflows**:
- For processing flows, Pulsar accepts the output of a processing module
-  and queues it for input to the next subscribed module.
- For services such as LLMs and embeddings, Pulsar provides a client/server
-  model.  A Pulsar queue is used as the input to the service.  When
-  processed, the output is then delivered to a separate queue where a client
-  subscriber can request that output.
+So, before changing the code any more, if you copied an existing module,
+check the plumbing works with your renamed module.

-The entire architecture, the pub/sub backbone and set of modules, is bundled into a single Python package. A container image with the
-package installed can also run the entire architecture.
+To run standalone, it is recommended to take an existing docker-compose
+file, run everything you need except the module you're developing.

-## Core Modules
+Then when you launch with docker compose, you'll get everything running
+except your module.

- `chunker-recursive` - Accepts text documents and uses LangChain recursive
-  chunking algorithm to produce smaller text chunks.
- `embeddings-hf` - A service which analyses text and returns a vector
-  embedding using one of the HuggingFace embeddings models.
- `embeddings-vectorize` - Uses an embeddings service to get a vector
-  embedding which is added to the processor payload.
- `graph-rag` - A query service which applies a Graph RAG algorithm to
-  provide a response to a text prompt.
- `graph-write-cassandra` - Takes knowledge graph edges and writes them to
-  a Cassandra store.
- `kg-extract-definitions` - knowledge extractor - examines text and
-  produces graph edges.
-  describing discovered terms and also their defintions.  Definitions are
-  derived using the input  documents.
- `kg-extract-relationships` - knowledge extractor - examines text and
-  produces graph edges describing the relationships between discovered
-  terms.
- `loader` - Takes a document and loads into the processing pipeline.  Used
-  e.g. to add PDF documents.
- `pdf-decoder` - Takes a PDF doc and emits text extracted from the document.
-  Text extraction from PDF is not a perfect science as PDF is a printable
-  format.  For instance, the wrapping of text between lines in a PDF document
-  is not semantically encoded, so the decoder will see wrapped lines as
-  space-separated.
- `vector-write-milvus` - Takes vector-entity mappings and records them
-  in the vector embeddings store.
+To run your module, you need to set up the Python environment as you did
+in the quickstart e.g. run `. env/bin/activate` and `export PYTHONPATH=.`

-## LM Specific Modules
+You're not running kg-mymodule in a container, so it can't use docker
+internal DNS to get to the containers, but the docker compose file
+exposes everything to the host anyway.  You should be able to access Pulsar
+on localhost port 6650, for instance.

- `llm-azure-text` - Sends request to AzureAI serverless endpoint
- `llm-claude-text` - Sends request to Anthropic's API
- `llm-ollama-text` -  Sends request to LM running using Ollama
- `llm-vertexai-text` -  Sends request to model available through VertexAI API
+You should be able to run your module on the host and point at Pulsar thus:

-## Getting Started
-
-The `Docker Compose` files have been tested on `Linux` and `MacOS`. There are currently
-no plans for `Windows` support in the immediate future.
-
-There are 4 `Docker Compose` files depending on the desired LM deployment:
- `VertexAI` through Google Cloud
- `Claude` through Anthropic's API
- `AzureAI` serverless endpoint
- Local LM deployment through `Ollama`
-
-Docker Compose enables the following functions:
- Run the required components for full e2e `Graph RAG` knowledge pipeline
- Check processing logs
- Load test text corpus and begin knowledge extraction
- Verify extracted graph edges and number of edges
- Run a query against the vector and graph stores to generate a response
-  using the chosen LM
-
-### Clone the Repo
-
-```
-git clone https://github.com/trustgraph-ai/trustgraph trustgraph
-cd trustgraph
-```
-
-### Install requirements
-
-```
-python3 -m venv env
-. env/bin/activate
-pip3 install pulsar-client
-pip3 install cassandra-driver
-export PYTHON_PATH=.
-```
-
-### Docker Compose files
-
-Depending on your desired LM deployment, you will choose from one of the 
-following `Docker Compose` files:
-
- `docker-compose-azure.yaml`: AzureAI endpoint. Set `AZURE_TOKEN` to the secret token and
-  `AZURE_ENDPOINT` to the URL endpoint address for the deployed model.
- `docker-compose-claude.yaml`: Anthropic's API. Set `CLAUDE_KEY` to your API key.
- `docker-compose-ollama.yaml`: Local LM (currently using [Gemma2](https://ollama.com/library/gemma2) deployed through Ollama.  Set `OLLAMA_HOST` to the machine running Ollama (e.g. `localhost` for Ollama running locally on your machine)
- `docker-compose-vertexai.yaml`: VertexAI API. Requires a `private.json` authentication file to authenticate with your GCP project. Filed should stored be at path `vertexai/private.json`.
-
-**NOTE**: All tokens, paths, and authentication files must be set **PRIOR** to launching a `Docker Compose` file.
-
-
-#### AzureAI Serverless Model Deployment
-
-```
-export AZURE_ENDPOINT=https://ENDPOINT.HOST.GOES.HERE/
-export AZURE_TOKEN=TOKEN-GOES-HERE
-docker-compose -f docker-compose-azure.yaml up -d
-```
-
-#### Claude through Anthropic API
-
-```
-export CLAUDE_KEY=TOKEN-GOES-HERE
-docker-compose -f docker-compose-claude.yaml up -d
-```
-
-#### Ollama Hosted Model Deployment
-
-```
-export OLLAMA_HOST=localhost # Set to hostname of Ollama host
-docker-compose -f docker-compose-ollama.yaml up -d
-```
-
-#### VertexAI through GCP
-
-```
-mkdir -p vertexai
-cp {whatever} vertexai/private.json
-docker-compose -f docker-compose-vertexai.yaml up -d
-```
-
-If you're running `SELinux` on Linux you may need to set the permissions on the
-VertexAI directory so that the key file can be mounted on a Docker container using
-the following command:
-
-```
-chcon -Rt svirt_sandbox_file_t vertexai/
-```
-
-### Verify Docker Containers
-
-On first running a `Docker Compose` file, it may take a while (depending on your network connection) to pull all the necessary components. Once all of the components have been pulled, check that the TrustGraph containers are running:
-
-```
-docker ps
-```
-
-Any containers that have exited unexpectedly can be found by checking the `STATUS` field
-using the following:
-
-```
-docker ps -a
-```
-
-### Warm-Up
-
-Before proceeding, allow the system to enter a stable a working state. In general
-`30 seconds` should be enough time for Pulsar to stablize.
-
-The system uses Cassandra for a Graph store. Cassandra can take `60-70 seconds`
-to achieve a working state.
-
-### Load a Text Corpus
-
-Create a sources directory and get a test PDF file. To demonstrate the power of TrustGraph, we're using a PDF of the public [Roger's Commision Report](https://sma.nasa.gov/SignificantIncidents/assets/rogers_commission_report.pdf) from the NASA Challenger disaster. This PDF includes complex formatting, unique terms, complex concepts, unique concepts, and information not commonly found in public knowledge sources.
-
-```
-mkdir sources
-curl -o sources/Challenger-Report-Vol1.pdf https://sma.nasa.gov/SignificantIncidents/assets/rogers_commission_report.pdf
-```
-
-Load the file for knowledge extraction:
-
-```
-scripts/loader -f sources/Challenger-Report-Vol1.pdf
-```
-
-`File loaded.` indicates the PDF has been sucessfully loaded to the processing queues and extraction will begin.
-
-### Processing Logs
-
-At this point, many processing services are running concurrently. You can check the status of these processes with the following logs:
-
-`PDF Decoder`:
-```
-docker logs trustgraph-pdf-decoder-1
-```
-
-Output should look:
+```bash
+scripts/kg-mymodule -p pulsar://localhost:6650
 ```
-Decoding 1f7b7055...
-Done.
-```
-
-`Chunker`:
-```
-docker logs trustgraph-chunker-1
-```
-
-The output should be similiar to the output of the `Decode`, except it should be a sequence of many entries.
-
-`Vectorizer`:
-```
-docker logs trustgraph-vectorize-1
-```
-
-Similar output to above processes, except many entries instead.
-
-
-`Language Model Inference`:
-```
-docker logs trustgraph-llm-1
-```
-
-Output should be a sequence of entries:
-```
-Handling prompt fa1b98ae-70ef-452b-bcbe-21a867c5e8e2...
-Send response...
-Done.
-```
-
-`Knowledge Graph Definitions`:
-```
-docker logs trustgraph-kg-extract-definitions-1
-```
-
-Output should be an array of JSON objects with keys `entity` and `definition`:
-
-```
-Indexing 1f7b7055-p11-c1...
-[
-    {
-        "entity": "Orbiter",
-        "definition": "A spacecraft designed for spaceflight."
-    },
-    {
-        "entity": "flight deck",
-        "definition": "The top level of the crew compartment, typically where flight controls are located."
-    },
-    {
-        "entity": "middeck",
-        "definition": "The lower level of the crew compartment, used for sleeping, working, and storing equipment."
-    }
-]
-Done.
-```
-
-`Knowledge Graph Relationshps`:
-```
-docker logs trustgraph-kg-extract-relationships-1
-```
-
-Output should be an array of JSON objects with keys `subject`, `predicate`, `object`, and `object-entity`:
-```
-Indexing 1f7b7055-p11-c3...
-[
-    {
-        "subject": "Space Shuttle",
-        "predicate": "carry",
-        "object": "16 tons of cargo",
-        "object-entity": false
-    },
-    {
-        "subject": "friction",
-        "predicate": "generated by",
-        "object": "atmosphere",
-        "object-entity": true
-    }
-]
-Done.
-```
-
-### Graph Parsing
-
-To check that the knowledge graph is successfully parsing data:
-
-```
-scripts/graph-show
-```
-
-The output should be a set of semantic triples in [N-Triples](https://www.w3.org/TR/rdf12-n-triples/) format.
-
-```
-http://trustgraph.ai/e/enterprise http://trustgraph.ai/e/was-carried to altitude and released for a gliding approach and landing at the Mojave Desert test center.
-http://trustgraph.ai/e/enterprise http://www.w3.org/2000/01/rdf-schema#label Enterprise.
-http://trustgraph.ai/e/enterprise http://www.w3.org/2004/02/skos/core#definition A prototype space shuttle orbiter used for atmospheric flight testing.
-```
-
-### Number of Graph Edges
-
-N-Triples format is not particularly human readable. It's more useful to know how many graph edges have successfully been extracted from the text corpus:
-```
-scripts/graph-show  | wc -l
-```
-
-The Challenger report has a long introduction with quite a bit of adminstrative text commonly found in official reports. The first few hundred graph edges mostly capture this document formatting knowledge. To fully test the ability to extract complex knowledge, wait until at least `1000` graph edges have been extracted. The full extraction for this PDF will extract many thousand graph edges.
-
-### RAG Test Script
-```
-tests/test-graph-rag
-```
-This script forms a LM prompt asking for 20 facts regarding the Challenger disaster. Depending on how many graph edges have been extracted, the response will be similar to:
-
-```
-Here are 20 facts from the provided knowledge graph about the Space Shuttle disaster:

-1. **Space Shuttle Challenger was a Space Shuttle spacecraft.**
-2. **The third Spacelab mission was carried by Orbiter Challenger.**
-3. **Francis R. Scobee was the Commander of the Challenger crew.**
-4. **Earth-to-orbit systems are designed to transport payloads and humans from Earth's surface into orbit.**
-5. **The Space Shuttle program involved the Space Shuttle.**
-6. **Orbiter Challenger flew on mission 41-B.**
-7. **Orbiter Challenger was used on STS-7 and STS-8 missions.**
-8. **Columbia completed the orbital test.**
-9. **The Space Shuttle flew 24 successful missions.**
-10. **One possibility for the Space Shuttle was a winged but unmanned recoverable liquid-fuel vehicle based on the Saturn 5 rocket.**
-11. **A Commission was established to investigate the space shuttle Challenger accident.**
-12. **Judit h Arlene Resnik was Mission Specialist Two.**
-13. **Mission 51-L was originally scheduled for December 1985 but was delayed until January 1986.**
-14. **The Corporation's Space Transportation Systems Division was responsible for the design and development of the Space Shuttle Orbiter.**
-15. **Michael John Smith was the Pilot of the Challenger crew.**
-16. **The Space Shuttle is composed of two recoverable Solid Rocket Boosters.**
-17. **The Space Shuttle provides for the broadest possible spectrum of civil/military missions.**
-18. **Mission 51-L consisted of placing one satellite in orbit, deploying and retrieving Spartan, and conducting six experiments.**
-19. **The Space Shuttle became the focus of NASA's near-term future.**
-20. **The Commission focused its attention on safety aspects of future flights.** 
-```
-
-For any errors with the `RAG` proces, check the following log:
-```
-docker logs -f trustgraph-graph-rag-1
-```
-### More RAG Test Queries
+You could try loading data, and check some stuff ends up in the graph.  If you get that far you're ready to hack the contents of extract.py to
+do what you want.

-If you want to try different RAG queries, modify the `query` in the [test script](https://github.com/trustgraph-ai/trustgraph/blob/master/tests/test-graph-rag).
+## Structure of the code

-### Shutting Down
-
-When shutting down the pipeline, it's best to shut down all Docker containers and volumes. Run the `docker compose down` command that corresponds to your model deployment:
-
-```
-docker-compose -f docker-compose-azure.yaml down --volumes
-```
-```
-docker-compose -f docker-compose-claude.yaml down --volumes
-```
-```
-docker-compose -f docker-compose-ollama.yaml down --volumes
-```
-```
-docker-compose -f docker-compose-vertexai.yaml down --volumes
-```
+The Processor class, `run` method is where all the fun takes place.

-To confirm all Docker containers have been shut down, check that the following list is empty:
 ```
-docker ps
+        while True:
+            msg = self.consumer.receive()
 ```

-To confirm all Docker volumes have been removed, check that the following list is empty:
-```
-docker volume ls
-```
+That bit :point_up: is a loop which is executed every time a new message
+arrives.