docs: address signals flywheel review feedback

Addresses review comments on #910:

- Shorten the paper citation to (Chen et al., 2026) per common cite
  practice (replacing the full author list form).
- Replace the Why Signals Matter section with the review-suggested
  rewrite verbatim: more formal intro framing, renumbered steps to
  Instrument / Sample & triage / Data Construction / Model Optimization
  / Deploy, removes 'routing decisions' from the data-construction
  step, and adds DPO/RLHF/SFT as model-optimization examples.
- Renders tau and O(messages) as proper math glyphs via the sphinx
  built-in :math: role (enabled by adding sphinx.ext.mathjax to
  conf.py). Using the RST role form rather than raw $...$ inline so
  sphinx only injects MathJax on pages that actually have math,
  instead of loading ~1MB of JS on every page.

Build verified locally: sphinx-build produces no warnings on the
changed files and the rendered HTML wraps tau and O(messages) in
MathJax-ready <span class="math">\(\tau\)</span> containers.

Made-with: Cursor
This commit is contained in:
Syed Hashmi 2026-04-24 12:05:48 -07:00
parent ae629d3635
commit cea43c5da5
No known key found for this signature in database
2 changed files with 45 additions and 36 deletions

View file

@ -12,49 +12,56 @@ prioritized data that can drive prompt, routing, and model updates without
running an LLM-as-judge on every session.
The framework implemented here follows the taxonomy and detector design in
*Signals: Trajectory Sampling and Triage for Agentic Interactions* (Chen,
Hafeez, Paracha, 2026; `arXiv:2604.00356
<https://arxiv.org/abs/2604.00356>`_). All detectors are computed without
model calls; the entire pipeline attaches structured attributes and span
events to existing spans so your dashboards and alerts work unmodified.
*Signals: Trajectory Sampling and Triage for Agentic Interactions*
(`Chen et al., 2026 <https://arxiv.org/abs/2604.00356>`_). All detectors
are computed without model calls; the entire pipeline attaches structured
attributes and span events to existing spans so your dashboards and alerts
work unmodified.
Why Signals Matter: The Improvement Flywheel
============================================
Agentic applications are now deployed at scale, but improving them
post-deployment is still hard. Trajectories are voluminous and
non-deterministic; reviewing each one with humans or auxiliary LLMs is slow
and cost-prohibitive. You can't score every response (too expensive) or
eyeball every trace (doesn't scale). Without a triage layer, the loop from
production back to model or policy updates stays broken.
Agentic applications are increasingly deployed at scale, yet improving them
after deployment remains difficult. Production trajectories are long,
numerous, and non-deterministic, making exhaustive human review infeasible
and auxiliary LLM evaluation expensive. As a result, teams face a
bottleneck: they cannot score every response, inspect every trace, or
reliably identify which failures and successes should inform the next model
update. Without a low-cost triage layer, the feedback loop from production
behavior to model improvement remains incomplete.
Signals close that loop by making it cheap to find out which of the
millions of interactions are actually worth looking at:
Signals close this loop by cheaply identifying which interactions among
millions are worth inspecting:
1. **Instrument.** Every live interaction is scored along a fixed
taxonomy (interaction / execution / environment) and tagged as
structured attributes on its existing OTel span. No model calls, no
extra infrastructure.
2. **Sample & triage.** Signal attributes act as filters — surface
severe sessions, sample exemplars, exclude the boring middle. Per the
paper, signal-based sampling reaches 82% informativeness on
:math:`\tau`-bench versus 54% for random sampling, a 1.52× efficiency
gain per informative trajectory.
3. **Construct data.** The triaged subset becomes the input to
preference-data construction, prompt-ablation studies, routing
decisions, or fine-tuning corpora — whichever optimization pathway
you're running.
4. **Optimize the model.** Whatever artifact drives your agent —
system prompts, router rules, LoRA adapters, full fine-tunes — is
updated against that targeted data, not against noise.
5. **Deploy and repeat.** New versions ship behind Plano and are
immediately re-instrumented with the same signals, so you can
measure whether your change actually moved the needle and feed the
next iteration.
1. **Instrument.** Live trajectories are scored with model-free signals
attached as structured attributes on existing OpenTelemetry spans,
organized under a fixed taxonomy of interaction, execution, and
environment signals. This requires no additional model calls,
infrastructure, or changes to online agent behavior.
2. **Sample & triage.** Signal attributes act as filters: they surface
severe failures, retrieve representative exemplars, and exclude the
uninformative middle. In our experiments, signal-based sampling
achieves 82% informativeness on :math:`\tau`-bench, compared with 54%
for random sampling, yielding a 1.52× efficiency gain per informative
trajectory.
3. **Data Construction.** The triaged subset becomes targeted input for
constructing preference datasets or supervised fine-tuning datasets
from production trajectories.
4. **Model Optimization.** The resulting preference or supervised
fine-tuning data is used to update the model through methods such as
DPO, RLHF, or supervised fine-tuning, so optimization is driven by
targeted production behavior rather than undifferentiated trace noise.
5. **Deploy.** The improved model is deployed and immediately
re-instrumented with the same signals, enabling teams to measure
whether the change improved production behavior and to feed the next
iteration.
The loop only works if step 1 is nearly free. That's the design
constraint this framework is built around: model-free detectors, fixed
taxonomy, O(messages) cost, no online behavior change.
This loop depends on the first step being nearly free. The framework is
therefore designed around fixed-taxonomy, model-free detectors with
:math:`O(\text{messages})` cost, no online behavior change, and no
dependence on expensive evaluator models. By making production traces
searchable and sampleable at scale, signals turn raw agent telemetry into a
practical model-optimization flywheel.
What Are Behavioral Signals?
============================

View file

@ -33,6 +33,7 @@ extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.intersphinx",
"sphinx.ext.extlinks",
"sphinx.ext.mathjax",
"sphinx.ext.viewcode",
"sphinx_sitemap",
"sphinx_design",
@ -41,6 +42,7 @@ extensions = [
"provider_models",
]
# Paths that contain templates, relative to this directory.
templates_path = ["_templates"]