make tiktoken token counting optional via enable_token_counting override

By default, use a cheap len/4 estimate for input token counting (metrics
and rate limiting). When enable_token_counting is set to true in
overrides, use tiktoken BPE for exact counts. This eliminates ~80ms of
per-request tiktoken latency in the WASM filter while keeping metrics
and rate limiting functional.
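The trade-off described above can be sketched in Python (a minimal illustration only — the actual filter is Rust/WASM, and the function name and the cl100k_base encoding are assumptions, not taken from this codebase):

```python
def count_input_tokens(text: str, enable_token_counting: bool = False) -> int:
    """Count input tokens for metrics and rate limiting.

    Default: cheap len/4 heuristic (roughly 4 characters per token).
    Override: exact tiktoken BPE count, at ~80ms extra cost per request.
    """
    if enable_token_counting:
        import tiktoken  # imported lazily so the cheap path pays no cost
        enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
        return len(enc.encode(text))
    return len(text) // 4  # heuristic: ~4 characters per token
```

The estimate over-charges short non-ASCII inputs and under-charges dense code, but for rate limiting an approximate count is usually acceptable.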

Made-with: Cursor
Adil Hafeez 2026-03-22 21:45:02 -07:00
parent 406fa92802
commit e5f3039924
3 changed files with 19 additions and 8 deletions


@@ -285,6 +285,9 @@ properties:
   agent_orchestration_model:
     type: string
     description: "Model name for the agent orchestrator (e.g., 'Plano-Orchestrator'). Must match a model in model_providers."
+  enable_token_counting:
+    type: boolean
+    description: "Enable tiktoken-based input token counting for metrics and rate limiting. Default is false."
   system_prompt:
     type: string
 prompt_targets:
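Given this schema addition, opting in to exact counts would look roughly like the fragment below (a hypothetical config sketch — the placement and surrounding keys are assumptions based on the commit message's mention of "overrides"):

```yaml
# Hypothetical overrides fragment; key placement is an assumption.
overrides:
  enable_token_counting: true  # opt in to exact tiktoken BPE counts
```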