mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-04-25 00:16:23 +02:00

The context development platform. Store, enrich, and retrieve structured knowledge with graph-native infrastructure, semantic retrieval, and portable context cores. https://trustgraph.ai

Find a file

Cyber MacGeddon 299332dd4e Trustgraph, first drop of code		2024-07-10 17:04:24 +01:00
scripts	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
sources	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
tests	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
trustgraph	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
.gitignore	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
architecture.png	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
architecture.svg	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
collate-graph	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
Containerfile	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
do-query	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
docker-compose-azure.yaml	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
docker-compose-claude.yaml	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
docker-compose-ollama.yaml	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
docker-compose-vertexai.yaml	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
etl	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
etl2	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
graph-clear	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
graph-dump	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
graph-out	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
LICENSE	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
Makefile	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
query	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
README.md	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
receive	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
receive2	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
requirements.txt	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
send	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
send2	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
setup.py	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
test-cassandra	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
test-trustgraph	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
trustgraphETL.py	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00
trustgraphETL2.py	Trustgraph, first drop of code	2024-07-10 17:04:24 +01:00

README.md

For AMD...

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1/

docker network create trustgraph --driver bridge

docker volume create cassandra docker volume create pulsar-conf docker volume create pulsar-data docker volume create etcd docker volume create minio-data docker volume create milvus

Cassandra: docker run -i -t
--name cassandra
--network trustgraph
-p 9042:9042
-v cassandra:/var/lib/cassandra
docker.io/cassandra:4.1.5

Pulsar:

docker run -it
--name pulsar
--network trustgraph
-p 6650:6650
-p 8080:8080
-v pulsar-conf:/pulsar/conf
-v pulsar-data:/pulsar/data
docker.io/apachepulsar/pulsar:3.3.0
bin/pulsar standalone

Milvus:

docker run -i -t
--name etcd
--network trustgraph
-p 2379:2379
-e ETCD_AUTO_COMPACTION_MODE=revision
-e ETCD_AUTO_COMPACTION_RETENTION=1000
-e ETCD_QUOTA_BACKEND_BYTES=4294967296
-e ETCD_SNAPSHOT_COUNT=50000
-v etcd:/etcd
quay.io/coreos/etcd:v3.5.5
etcd -advertise-client-urls=http://127.0.0.1:2379
-listen-client-urls http://0.0.0.0:2379 --data-dir /etcd

docker run -i -t
--name minio
--network trustgraph
-p 9000:9000 -p 9001:9001
-e MINIO_ACCESS_KEY=minioadmin
-e MINIO_SECRET_KEY=minioadmin
-v minio-data:/minio_data
docker.io/minio/minio:RELEASE.2024-07-04T14-25-45Z
minio server /minio_data --console-address :9001

docker run -i -t
--name milvus
--network trustgraph
-p 19530:19530 -p 9091:9091
-e ETCD_ENDPOINTS=etcd:2379
-e MINIO_ADDRESS=minio:9000
-v milvus:/var/lib/milvus
docker.io/milvusdb/milvus:v2.4.5
milvus run standalone

. env/bin/activate

scripts/pdf-decoder -p pulsar://localhost:6650/ scripts/chunker -p pulsar://localhost:6650/ scripts/vectorize-minilm -p pulsar://localhost:6650/ scripts/kg-extract-definitions -p pulsar://localhost:6650/ scripts/kg-extract-relationships -p pulsar://localhost:6650/ scripts/vector-write-milvus -p pulsar://localhost:6650/ scripts/graph-write-cassandra -p pulsar://localhost:6650/ scripts/llm-ollama-text -p pulsar://localhost:6650/ -r http://monster:11434/

The pub/sub infra means individual processors can be switched out for other processors which do the same thing e.g.

graph-write writes to Cassandra but could be switched out for something else which uses a different store
vector-store writes to Milvus but could be switched out for something which writes to a different store
llm-ollama invokes Ollama, could be switched out for a different LLM with a different API

The vectorizer is a bit hard to switch out ATM, different embedding algorithms have different vector dimensions, but can probably make that work.

kg-extractor turns chunks into edges, so maybe that could be switched out for different algorithms & different schemas. Potentially a good place for 'value add' processing to be added.

It would also be possible to have more than one processor running where kg-extractor sits to have different kinds of semantic extraction: other proceessors could take as input the output of vectorizer, and drop edges and edge associations into graph-write and vector-store

Actually, as it stands, kg-extractor does 2 different kinds of processing itself:

Creates edges for relationships, and
Creates edges for definitions

Each of those uses its own LLM invocation and mashes the output into RDF. It would be useful to split kg-extractor into 2 pieces to demonstrate this bit of the architecture.

pdf-extractor turns a PDF doc into plain text. For each PDF page, a separate blob of text is emitted currently.
chunker applies chunking to turn 1 PDF page into several chunks
vectorizer invokes a sentence embedding model to create vectors. Doesn't store anything. Output of vectorizer is chunk plus embedding.
kg-extract turns chunks into edges using LLM prompting. There are 2 outputs:
- RDF triples
- An association between the vector embedding and RDF entities. For every triple, there's an association between the S, P and O entiites and the embedding.
graph-write loads the triples into a store
vector-store store the association between the vector and the RDF URI in a vector store

The demo query algorithm is...

Take the question, extract embeddings using a sentence model
Use the vector store to lookup RDF URIs
Use the RDF URIs to find relevant RDF triples in Cassandra
Also use Cassandra to turn RDF URIs into plain-text labels if they are known
Construct the prompt, turn RDF triples into a knowledge graph format (used Cypher, seems to work but not sure that's the best choice)
LLM prompt and show the answer