mirror of
https://github.com/MODSetter/SurfSense.git
synced 2026-05-27 19:25:15 +02:00
feat(docs): add observability documentation
This commit is contained in:
parent
4c8d47617d
commit
6302939a72
3 changed files with 170 additions and 0 deletions
167
surfsense_web/content/docs/observability.mdx
Normal file
@@ -0,0 +1,167 @@
---
title: Observability
description: Configure backend traces and metrics for SurfSense
icon: Radar
---

SurfSense uses OpenTelemetry for backend traces and metrics. Application logs
include trace and span IDs so you can correlate logs with traces, but logs stay
on the normal container stderr path.

## Enable Locally

The development compose file reads backend settings from
`surfsense_backend/.env`. Add these values there:

```dotenv
SURFSENSE_ENABLE_OTEL=true
SURFSENSE_ENV=dev
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-lgtm:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_RESOURCE_ATTRIBUTES=service.namespace=surfsense
OTEL_METRIC_EXPORT_INTERVAL=300000
```

Then start the development stack with the local LGTM backend:

```bash
docker compose -f docker/docker-compose.dev.yml up --build
```

Grafana is exposed on `http://localhost:3001` by default.

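As a rough illustration of what these variables mean, the sketch below parses them the way a backend might. `OtelSettings` and `load_otel_settings` are hypothetical names, not SurfSense's actual code. The only facts it relies on are that `OTEL_RESOURCE_ATTRIBUTES` is a comma-separated list of `key=value` pairs and that `OTEL_METRIC_EXPORT_INTERVAL` is expressed in milliseconds.

```python
from dataclasses import dataclass, field
from typing import Mapping


@dataclass(frozen=True)
class OtelSettings:
    """Hypothetical container for the values in the dotenv block above."""

    enabled: bool
    endpoint: str
    protocol: str
    resource_attributes: dict[str, str] = field(default_factory=dict)
    metric_export_interval_s: float = 60.0


def load_otel_settings(env: Mapping[str, str]) -> OtelSettings:
    # Accepting several truthy spellings is an assumption of this sketch.
    raw_flag = env.get("SURFSENSE_ENABLE_OTEL", "false")
    enabled = raw_flag.strip().lower() in {"1", "true", "yes"}

    # OTEL_RESOURCE_ATTRIBUTES is a comma-separated list of key=value pairs.
    attributes: dict[str, str] = {}
    for pair in env.get("OTEL_RESOURCE_ATTRIBUTES", "").split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attributes[key.strip()] = value.strip()

    # The interval is given in milliseconds; 60000 is the specification default.
    interval_ms = int(env.get("OTEL_METRIC_EXPORT_INTERVAL", "60000"))

    return OtelSettings(
        enabled=enabled,
        endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT", ""),
        protocol=env.get("OTEL_EXPORTER_OTLP_PROTOCOL", ""),
        resource_attributes=attributes,
        metric_export_interval_s=interval_ms / 1000,
    )
```

With the development values above, `metric_export_interval_s` works out to 300, so metrics are exported every five minutes rather than continuously.
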
## Enable in Production Docker Compose

Production Docker Compose reads backend and collector settings from
`docker/.env`. The API and Celery worker export telemetry to the bundled
collector at `otel-collector:4317`; the collector is the only service that uses
the Grafana Cloud credentials.

Add these values to `docker/.env`:

```dotenv
SURFSENSE_ENV=production
SURFSENSE_ENABLE_OTEL=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_RESOURCE_ATTRIBUTES=service.namespace=surfsense
OTEL_METRIC_EXPORT_INTERVAL=300000

GRAFANA_CLOUD_OTLP_ENDPOINT=https://otlp-gateway-<region>.grafana.net/otlp
GRAFANA_CLOUD_INSTANCE_ID=<stack instance id>
GRAFANA_CLOUD_API_KEY=<cloud access policy token>
```

Then start the stack:

```bash
docker compose -f docker/docker-compose.yml --profile observability up -d
```

The collector receives OTLP on `otel-collector:4317`, scrubs sensitive span
attributes, applies the configured tail-sampling policy, batches exports,
retries failures, and forwards traces and metrics to Grafana Cloud over OTLP
HTTP.

When deploying `surfsense_backend/Dockerfile` directly instead of production
compose, use the same split: SurfSense containers export to a collector, and
the collector owns the Grafana Cloud credentials.

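The collector's exporter configuration is not shown on this page. As an assumption about how the `GRAFANA_CLOUD_*` values fit together: Grafana Cloud's OTLP gateway accepts HTTP Basic authentication, with the stack instance ID as the username and the access policy token as the password. The hypothetical helper below only shows how such a header would be derived. It is not part of SurfSense, and in the setup described above the collector, not the application, would hold these credentials.

```python
import base64


def grafana_cloud_auth_header(instance_id: str, api_key: str) -> dict[str, str]:
    """Build an HTTP Basic auth header from Grafana Cloud credentials.

    Hypothetical helper for illustration only.
    """
    # Basic auth encodes "username:password" as base64.
    credentials = f"{instance_id}:{api_key}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Keeping this step inside the collector is what allows the API and Celery worker containers to run without any Grafana Cloud secret in their environment.
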
## Automatic Traces

When OpenTelemetry is enabled, the backend instruments:

- FastAPI inbound requests.
- SQLAlchemy queries from the main async engine and Celery task engine.
- Raw psycopg calls used by the LangGraph checkpointer.
- Redis commands.
- HTTPX outbound requests.
- Celery producer and worker execution.

## Manual Spans

SurfSense keeps project-specific spans behind `app.observability.otel`:

- `model.call`
- `tool.call`
- `chat.request`
- `kb.search`
- `kb.persist`
- `connector.sync`
- `subagent.invoke`
- `etl.extract`
- `etl.parse`
- `etl.ocr`
- `etl.picture.describe`
- `etl.picture.ocr`
- `compaction.run`
- `permission.asked`
- `interrupt.raised`

Keep span names and attributes low-cardinality. Do not attach user content,
prompts, document titles, file paths, user-specific URLs, secrets, or raw
queries as span attributes.

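One way to enforce the attribute rule is to pass candidate attributes through an allowlist before they reach a span. The sketch below is a minimal illustration of that idea. `ALLOWED_SPAN_ATTRIBUTES`, `safe_span_attributes`, and the attribute keys are hypothetical and are not taken from `app.observability.otel`.

```python
from typing import Any, Mapping

# Hypothetical allowlist: keys whose values come from small, fixed sets.
ALLOWED_SPAN_ATTRIBUTES = frozenset({"connector.type", "tool.name", "outcome"})


def safe_span_attributes(candidates: Mapping[str, Any]) -> dict[str, Any]:
    """Keep only allowlisted keys that carry primitive values."""
    return {
        key: value
        for key, value in candidates.items()
        if key in ALLOWED_SPAN_ATTRIBUTES
        and isinstance(value, (str, bool, int, float))
    }
```

For example, calling `safe_span_attributes({"tool.name": "web_search", "query": "raw user text", "outcome": "success"})` keeps `tool.name` and `outcome` and drops `query`, so raw user input never becomes a span attribute. An allowlist fails closed: a new attribute is excluded until someone decides it is safe.
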
## Metrics

The OpenTelemetry instrumentors provide HTTP, HTTPX, and Celery runtime
metrics. SurfSense adds these project metrics from `app.observability.metrics`:

- `surfsense.model.call.duration`
- `gen_ai.client.token.usage`
- `surfsense.tool.call.duration`
- `surfsense.tool.call.errors`
- `surfsense.chat.request.duration`
- `surfsense.chat.request.outcome`
- `surfsense.kb.search.duration`
- `surfsense.compaction.runs`
- `surfsense.permission.asks`
- `surfsense.interrupt.raised`
- `surfsense.indexing.document.duration`
- `surfsense.indexing.document.outcome`
- `surfsense.connector.sync.duration`
- `surfsense.connector.sync.outcome`
- `surfsense.subagent.invoke.duration`
- `surfsense.subagent.invoke.outcome`
- `surfsense.etl.extract.duration`
- `surfsense.etl.extract.outcome`
- `surfsense.celery.heartbeat.refreshes`
- `surfsense.celery.heartbeat.failures`
- `surfsense.celery.queue.latency`
- `surfsense.auth.failures`
- `surfsense.rate_limit.rejections`
- `surfsense.perf.elapsed_ms`

Runtime gauges include process RSS, CPU utilization, threads, open file
descriptors, asyncio tasks, and CPython GC counters.

## Logs

`LoggingInstrumentor().instrument()` injects `otelTraceID` and `otelSpanID` into
standard Python `LogRecord`s. The root log format writes them as
`trace_id=... span_id=...`.

SurfSense intentionally does not create an OpenTelemetry `LoggerProvider`,
`LoggingHandler`, or `OTLPLogExporter`. Container stderr remains the log
transport.

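To illustrate the resulting line format without requiring the OpenTelemetry packages, the sketch below uses a standard-library `logging.Filter` as a stand-in for `LoggingInstrumentor`. The filter, the logger name, and the exact format string are assumptions made for this example. SurfSense's real root format may contain other fields around `trace_id=... span_id=...`.

```python
import io
import logging


class DefaultTraceFields(logging.Filter):
    """Stand-in for LoggingInstrumentor: make sure both fields exist on a record."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Use "0" as a placeholder when no trace context was attached.
        if not hasattr(record, "otelTraceID"):
            record.otelTraceID = "0"
        if not hasattr(record, "otelSpanID"):
            record.otelSpanID = "0"
        return True


def build_logger(stream: io.TextIOBase) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.addFilter(DefaultTraceFields())
    handler.setFormatter(
        logging.Formatter(
            "%(levelname)s %(name)s "
            "trace_id=%(otelTraceID)s span_id=%(otelSpanID)s %(message)s"
        )
    )
    logger = logging.getLogger("surfsense.example")  # hypothetical logger name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
# `extra` simulates the values an instrumentor would inject inside an active span.
log.info(
    "indexing started",
    extra={
        "otelTraceID": "4bf92f3577b34da6a3ce929d0e0e4736",
        "otelSpanID": "00f067aa0ba902b7",
    },
)
log.info("outside any span")
```

Because the IDs travel inside the ordinary log line, any tool that reads container stderr can join a log entry to its trace by searching for the `trace_id` value.
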
## Verification

1. Hit a FastAPI endpoint and confirm an inbound server span appears in Grafana.
2. Run a chat request and confirm `model.call` and `tool.call` child spans.
3. Run a knowledge-base search and confirm `kb.search` spans and SQL child spans.
4. Run connector indexing and confirm Celery producer/worker spans share a trace
   ID and connector sync metrics increment.
5. Confirm `gen_ai.client.token.usage`, model/tool durations, request duration,
   Celery runtime, and runtime gauges appear within one export interval
   (`OTEL_METRIC_EXPORT_INTERVAL` is in milliseconds, so `300000` means five
   minutes).
6. Confirm logs emitted inside a traced request show non-zero trace and span IDs.

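The last check can be done mechanically on captured log output. The helper below is a hypothetical sketch, not a SurfSense tool. It assumes log lines follow the `trace_id=... span_id=...` format described in the Logs section and that both IDs are hexadecimal strings.

```python
import re

# Matches the "trace_id=<hex> span_id=<hex>" fragment of a log line.
_TRACE_FIELDS = re.compile(
    r"trace_id=(?P<trace>[0-9a-fA-F]+) span_id=(?P<span>[0-9a-fA-F]+)"
)


def has_trace_context(log_line: str) -> bool:
    """Return True if the line carries non-zero trace and span IDs."""
    match = _TRACE_FIELDS.search(log_line)
    if match is None:
        return False
    # An ID made only of zeros means the record was emitted outside any span.
    trace_ok = any(char != "0" for char in match["trace"])
    span_ok = any(char != "0" for char in match["span"])
    return trace_ok and span_ok
```

Running this over the lines logged during a single request should return `True` for every line emitted inside the request handler. Lines logged at startup or from background threads without an active span are expected to return `False`.
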
## Out of Scope

- Frontend/browser OpenTelemetry.
- OpenTelemetry log export.
- Profiling.
- Production backend selection.
- Tail-sampling collector configuration.
- Replacing LangSmith.
- Vendor SDKs.