# Contributing

## Generally

Branching is good discipline to get into when multiple people are working
on the same repo for different reasons.

To create a branch:

- `git checkout -b etl` creates the branch and checks it out.
- `git push` pushes the branch head to the upstream repo. The first time, you get an error and a command to run (typically `git push --set-upstream origin etl`). You don't have to do this straight away, but I like to get the BS admin out of the way. At this stage your branch HEAD points to the head of main.
## Adding a new module

So, to add a new module:

- It needs a name. Say `kg-mymodule`, but you can call it what you like.
- It also needs a place in the Python package hierarchy, because it's
  basically going to be its own loadable module. We have a `trustgraph.kg`
  module it can be a child of, so you need a directory
  `trustgraph/kg/mymodule`.
- You need three files:
  - `__init__.py`, which defines the module entry point.
  - `__main__.py`, which makes the module executable.
  - A module to contain the code; let's call it `extract.py`.
    The name doesn't matter, but it has to match what's in `__init__.py` and
    `__main__.py`.
- The easiest way to get started is probably to copy an existing module:
  - `cp -r trustgraph/kg/extract_relationships trustgraph/kg/mymodule`
  - Run this before `trustgraph/kg/mymodule` exists; if the directory is
    already there, `cp -r` nests the copy inside it.
- Finally, you need a script entry point in `scripts`. Copy
  `scripts/kg-extract-relationships` to `scripts/kg-mymodule`.
- In that `kg-mymodule` file, change the import line to import your module,
  `trustgraph.kg.mymodule`.
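As a rough illustration of how the three files refer to each other, the sketch below writes a throwaway skeleton into a directory and imports it. Everything in it is an assumption made for the example: the file contents, the `run` function, the helper names `make_skeleton` and `load_skeleton`, and the package name `demo_trustgraph` (chosen so it cannot clash with the real `trustgraph` package). For real work, copy an existing module as described above.

```python
import importlib
import sys
from pathlib import Path

# Hypothetical file contents for illustration only.
MODULE_FILES = {
    # Entry point: re-export what the code module provides.
    "__init__.py": "from .extract import run\n",
    # Executed by `python -m <package>`, which makes the module runnable.
    "__main__.py": "from .extract import run\n\nrun()\n",
    # The code itself; the file name must match the imports above.
    "extract.py": "def run():\n    print('mymodule running')\n    return 'ok'\n",
}


def make_skeleton(root, package=("demo_trustgraph", "kg", "mymodule")):
    """Create the package directories and the three module files under root."""
    path = Path(root)
    for part in package:
        path = path / part
        path.mkdir()
        # An __init__.py at each level makes it a regular package.
        (path / "__init__.py").write_text("")
    # The leaf package gets the real file contents.
    for name, body in MODULE_FILES.items():
        (path / name).write_text(body)
    return path


def load_skeleton(root, dotted="demo_trustgraph.kg.mymodule"):
    """Import the generated module to confirm the files agree with each other."""
    sys.path.insert(0, str(root))
    try:
        importlib.invalidate_caches()
        return importlib.import_module(dotted)
    finally:
        sys.path.remove(str(root))
```

If `extract.py` were renamed without updating the two import lines, `load_skeleton` would fail with an `ImportError`, which is the mismatch the list above warns about.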
## Development testing

To run your module, you don't need to have it running in a container.
It can connect to Pulsar from the host.

The plumbing for your new module pretty much needs to be right. Look at the
`input_queue`, `output_queue` and `subscriber` settings near the top of your
new module code.

So, before changing the code any more, if you copied an existing module,
check the plumbing works with your renamed module.

To run standalone, the recommended approach is to take a copy of an existing
docker-compose file and leave out the module you're developing.
Then when you launch with docker compose, you'll get everything running
except your module.

To run your module, you need to set up the Python environment as you did
in the quickstart, e.g. run `. env/bin/activate` and `export PYTHONPATH=.`

You're not running `kg-mymodule` in a container, so it can't use Docker's
internal DNS to reach the containers, but the docker compose file
exposes everything to the host anyway. You should be able to access Pulsar
on localhost port 6650, for instance.

You should be able to run your module on the host and point it at Pulsar like this:
```bash
scripts/kg-mymodule -p pulsar://localhost:6650
```
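The `-p` option above overrides the Pulsar address. As a rough sketch of how the plumbing settings and their command-line overrides could fit together, the example below uses `argparse`. Only `-p` comes from this document; the other option names, the long option names, the `parse_args` helper and all default values are placeholders for illustration, so check the real settings near the top of the module you copied.

```python
import argparse

# Placeholder defaults: the real values live near the top of your module.
DEFAULT_PULSAR_HOST = "pulsar://localhost:6650"
DEFAULT_INPUT_QUEUE = "my-input-queue"
DEFAULT_OUTPUT_QUEUE = "my-output-queue"
DEFAULT_SUBSCRIBER = "kg-mymodule"


def parse_args(argv=None):
    """Parse plumbing settings, falling back to the defaults above."""
    parser = argparse.ArgumentParser(prog="kg-mymodule")
    parser.add_argument("-p", "--pulsar-host", default=DEFAULT_PULSAR_HOST,
                        help="Pulsar URL to connect to")
    parser.add_argument("-i", "--input-queue", default=DEFAULT_INPUT_QUEUE,
                        help="queue the module consumes from")
    parser.add_argument("-o", "--output-queue", default=DEFAULT_OUTPUT_QUEUE,
                        help="queue the module publishes to")
    parser.add_argument("-s", "--subscriber", default=DEFAULT_SUBSCRIBER,
                        help="subscription name used by the consumer")
    # argv=None makes argparse read the real command line.
    return parser.parse_args(argv)
```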
You could try loading data and checking that some stuff ends up in the graph.
If you get that far, you're ready to hack the contents of `extract.py` to
do what you want.
## Structure of the code

The `run` method of the `Processor` class is where all the fun takes place.

```python
while True:
    # receive() blocks until the next message arrives
    msg = self.consumer.receive()
```

That bit :point_up: is a loop. `receive()` waits for a message, so the loop
body runs once for every new message that arrives.
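To make the loop more concrete, here is a minimal sketch of a fuller `run` method. `FakeMessage`, `FakeConsumer`, `QueueDrained` and `handle` are hypothetical stand-ins so the example runs without a broker. The `receive`, `acknowledge` and `negative_acknowledge` calls mirror the consumer methods of the Pulsar Python client, but the real `Processor` may well handle messages differently.

```python
class QueueDrained(Exception):
    """Raised by the fake consumer when it has no more messages."""


class FakeMessage:
    """Stand-in for a Pulsar message."""

    def __init__(self, payload):
        self._payload = payload

    def value(self):
        return self._payload


class FakeConsumer:
    """Stand-in for a Pulsar consumer; a real receive() would block instead."""

    def __init__(self, payloads):
        self._pending = [FakeMessage(p) for p in payloads]
        self.acked = []
        self.nacked = []

    def receive(self):
        if not self._pending:
            raise QueueDrained()
        return self._pending.pop(0)

    def acknowledge(self, msg):
        self.acked.append(msg.value())

    def negative_acknowledge(self, msg):
        self.nacked.append(msg.value())


class Processor:
    def __init__(self, consumer):
        self.consumer = consumer
        self.results = []

    def handle(self, text):
        # Hypothetical processing step; real modules do their extraction here.
        return text.upper()

    def run(self):
        while True:
            msg = self.consumer.receive()
            try:
                self.results.append(self.handle(msg.value()))
                # Tell the broker the message was processed successfully.
                self.consumer.acknowledge(msg)
            except Exception:
                # Ask for redelivery if processing failed.
                self.consumer.negative_acknowledge(msg)
```

With the fake consumer, `run` ends by raising `QueueDrained` once the messages are used up; against a real broker the loop would simply wait for the next message.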