Trustgraph, first drop of code

2026-04-25 16:36:21 +02:00 · 2024-07-10 17:04:24 +01:00 · 2024-07-10 17:04:24 +01:00 · 299332dd4e
commit 299332dd4e
120 changed files with 12493 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -0,0 +1,151 @@
+
+# For AMD...
+# pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1/
+
+----------------------------------------------------------------------------
+
+docker network create trustgraph --driver bridge
+
+docker volume create cassandra
+docker volume create pulsar-conf
+docker volume create pulsar-data
+docker volume create etcd
+docker volume create minio-data
+docker volume create milvus
+
+----------------------------------------------------------------------------
+
+Cassandra:
+docker run -i -t \
+    --name cassandra \
+    --network trustgraph \
+    -p 9042:9042 \
+    -v cassandra:/var/lib/cassandra \
+    docker.io/cassandra:4.1.5
+
+----------------------------------------------------------------------------
+
+Pulsar:
+
+docker run -it \
+    --name pulsar \
+    --network trustgraph \
+    -p 6650:6650 \
+    -p 8080:8080 \
+    -v pulsar-conf:/pulsar/conf \
+    -v pulsar-data:/pulsar/data \
+    docker.io/apachepulsar/pulsar:3.3.0 \
+        bin/pulsar standalone
+
+----------------------------------------------------------------------------
+
+Milvus:
+
+docker run -i -t \
+    --name etcd \
+    --network trustgraph \
+    -p 2379:2379 \
+    -e ETCD_AUTO_COMPACTION_MODE=revision \
+    -e ETCD_AUTO_COMPACTION_RETENTION=1000 \
+    -e ETCD_QUOTA_BACKEND_BYTES=4294967296 \
+    -e ETCD_SNAPSHOT_COUNT=50000 \
+    -v etcd:/etcd \
+    quay.io/coreos/etcd:v3.5.5 \
+        etcd -advertise-client-urls=http://127.0.0.1:2379 \
+        -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
+
+docker run -i -t \
+    --name minio \
+    --network trustgraph \
+    -p 9000:9000 -p 9001:9001 \
+    -e MINIO_ACCESS_KEY=minioadmin \
+    -e MINIO_SECRET_KEY=minioadmin \
+    -v minio-data:/minio_data \
+    docker.io/minio/minio:RELEASE.2024-07-04T14-25-45Z \
+        minio server /minio_data --console-address :9001
+
+docker run -i -t \
+    --name milvus \
+    --network trustgraph \
+    -p 19530:19530 -p 9091:9091 \
+    -e ETCD_ENDPOINTS=etcd:2379 \
+    -e MINIO_ADDRESS=minio:9000 \
+    -v milvus:/var/lib/milvus \
+    docker.io/milvusdb/milvus:v2.4.5 \
+        milvus run standalone
+
+----------------------------------------------------------------------------
+
+. env/bin/activate
+
+scripts/pdf-decoder -p pulsar://localhost:6650/
+scripts/chunker -p pulsar://localhost:6650/
+scripts/vectorize-minilm -p pulsar://localhost:6650/
+scripts/kg-extract-definitions -p pulsar://localhost:6650/
+scripts/kg-extract-relationships -p pulsar://localhost:6650/
+scripts/vector-write-milvus -p pulsar://localhost:6650/
+scripts/graph-write-cassandra -p pulsar://localhost:6650/
+scripts/llm-ollama-text  -p pulsar://localhost:6650/ -r http://monster:11434/
+
+----------------------------------------------------------------------------
+
+The pub/sub infra means individual processors can be switched out for other
+processors which do the same thing e.g.
+
+- graph-write writes to Cassandra but could be switched out for something else
+  which uses a different store
+- vector-store writes to Milvus but could be switched out for something which
+  writes to a different store
+- llm-ollama invokes Ollama, could be switched out for a different LLM with a
+  different API
+
+The vectorizer is a bit hard to switch out ATM, different embedding algorithms
+have different vector dimensions, but can probably make that work.
+
+kg-extractor turns chunks into edges, so maybe that could be switched out for
+different algorithms & different schemas. Potentially a good place for 'value
+add' processing to be added.
+
+It would also be possible to have more than one processor running where
+kg-extractor sits to have different kinds of semantic extraction: other
+proceessors could take as input the output of vectorizer, and drop edges and
+edge associations into graph-write and vector-store
+
+Actually, as it stands, kg-extractor does 2 different kinds of processing
+itself:
+
+- Creates edges for relationships, and
+- Creates edges for definitions
+
+Each of those uses its own LLM invocation and mashes the output into RDF. It
+would be useful to split kg-extractor into 2 pieces to demonstrate this bit of
+the architecture.
+
+----------------------------------------------------------------------------
+
+- pdf-extractor turns a PDF doc into plain text.  For each PDF page, a
+  separate blob of text is emitted currently.
+- chunker applies chunking to turn 1 PDF page into several chunks
+- vectorizer invokes a sentence embedding model to create vectors.  Doesn't
+  store anything.  Output of vectorizer is chunk plus embedding.
+- kg-extract turns chunks into edges using LLM prompting. There are 2 outputs:
+  - RDF triples
+  - An association between the vector embedding and RDF entities. For every
+    triple, there's an association between the S, P and O entiites and the
+    embedding.
+- graph-write loads the triples into a store
+- vector-store store the association between the vector and the RDF URI in a
+  vector store
+
+----------------------------------------------------------------------------
+
+The demo query algorithm is...
+
+- Take the question, extract embeddings using a sentence model
+- Use the vector store to lookup RDF URIs
+- Use the RDF URIs to find relevant RDF triples in Cassandra
+- Also use Cassandra to turn RDF URIs into plain-text labels if they are known
+- Construct the prompt, turn RDF triples into a knowledge graph format (used
+  Cypher, seems to work but not sure that's the best choice)
+- LLM prompt and show the answer
+