mirror of
https://github.com/katanemo/plano.git
synced 2026-04-25 00:36:34 +02:00
Introduce signals change (#655)
* adding support for signals * reducing false positives for signals like positive interaction * adding docs. Still need to fix the messages list, but waiting on PR #621 * Improve frustration detection: normalize contractions and refine punctuation * Further refine test cases with longer messages * minor doc changes * fixing echo statement for build * fixing the messages construction and using the trait for signals * update signals docs * fixed some minor doc changes * added more tests and fixed docuemtnation. PR 100% ready * made fixes based on PR comments * Optimize latency 1. replace sliding window approach with trigram containment check 2. add code to pre-compute ngrams for patterns * removed some debug statements to make tests easier to read * PR comments to make ObservableStreamProcessor accept optonal Vec<Messagges> * fixed PR comments --------- Co-authored-by: Salman Paracha <salmanparacha@MacBook-Pro-342.local> Co-authored-by: MeiyuZhong <mariazhong9612@gmail.com> Co-authored-by: nehcgs <54548843+nehcgs@users.noreply.github.com>
This commit is contained in:
parent
57327ba667
commit
b4543ba56c
17 changed files with 3972 additions and 191 deletions
|
|
@ -81,7 +81,7 @@ def main(ctx, version):
|
|||
|
||||
@click.command()
|
||||
def build():
|
||||
"""Build Arch from source. Works from any directory within the repo."""
|
||||
"""Build Plano from source. Works from any directory within the repo."""
|
||||
|
||||
# Find the repo root
|
||||
repo_root = find_repo_root()
|
||||
|
|
@ -112,7 +112,7 @@ def build():
|
|||
],
|
||||
check=True,
|
||||
)
|
||||
click.echo("archgw image built successfully.")
|
||||
click.echo("plano image built successfully.")
|
||||
except subprocess.CalledProcessError as e:
|
||||
click.echo(f"Error building plano image: {e}")
|
||||
sys.exit(1)
|
||||
|
|
|
|||
1
crates/Cargo.lock
generated
1
crates/Cargo.lock
generated
|
|
@ -335,6 +335,7 @@ dependencies = [
|
|||
"serde_json",
|
||||
"serde_with",
|
||||
"serde_yaml",
|
||||
"strsim",
|
||||
"thiserror 2.0.12",
|
||||
"time",
|
||||
"tokio",
|
||||
|
|
|
|||
|
|
@ -30,6 +30,7 @@ reqwest = { version = "0.12.15", features = ["stream"] }
|
|||
serde = { version = "1.0.219", features = ["derive"] }
|
||||
serde_json = "1.0.140"
|
||||
serde_with = "3.13.0"
|
||||
strsim = "0.11"
|
||||
serde_yaml = "0.9.34"
|
||||
thiserror = "2.0.12"
|
||||
tokio = { version = "1.44.2", features = ["full"] }
|
||||
|
|
|
|||
|
|
@ -111,6 +111,9 @@ pub async fn llm_chat(
|
|||
.get_recent_user_message()
|
||||
.map(|msg| truncate_message(&msg, 50));
|
||||
|
||||
// Extract messages for signal analysis (clone before moving client_request)
|
||||
let messages_for_signals = client_request.get_messages();
|
||||
|
||||
client_request.set_model(resolved_model.clone());
|
||||
if client_request.remove_metadata_key("archgw_preference_config") {
|
||||
debug!(
|
||||
|
|
@ -292,6 +295,7 @@ pub async fn llm_chat(
|
|||
operation_component::LLM,
|
||||
llm_span,
|
||||
request_start_time,
|
||||
Some(messages_for_signals),
|
||||
);
|
||||
|
||||
// === v1/responses state management: Wrap with ResponsesStateProcessor ===
|
||||
|
|
|
|||
|
|
@ -10,8 +10,10 @@ use tokio_stream::wrappers::ReceiverStream;
|
|||
use tokio_stream::StreamExt;
|
||||
use tracing::warn;
|
||||
|
||||
// Import tracing constants
|
||||
use crate::tracing::{error, llm};
|
||||
// Import tracing constants and signals
|
||||
use crate::signals::{InteractionQuality, SignalAnalyzer, TextBasedSignalAnalyzer, FLAG_MARKER};
|
||||
use crate::tracing::{error, llm, signals as signal_constants};
|
||||
use hermesllm::apis::openai::Message;
|
||||
|
||||
/// Trait for processing streaming chunks
|
||||
/// Implementors can inject custom logic during streaming (e.g., hallucination detection, logging)
|
||||
|
|
@ -38,6 +40,7 @@ pub struct ObservableStreamProcessor {
|
|||
chunk_count: usize,
|
||||
start_time: Instant,
|
||||
time_to_first_token: Option<u128>,
|
||||
messages: Option<Vec<Message>>,
|
||||
}
|
||||
|
||||
impl ObservableStreamProcessor {
|
||||
|
|
@ -48,11 +51,13 @@ impl ObservableStreamProcessor {
|
|||
/// * `service_name` - The service name for this span (e.g., "archgw(llm)")
|
||||
/// * `span` - The span to finalize after streaming completes
|
||||
/// * `start_time` - When the request started (for duration calculation)
|
||||
/// * `messages` - Optional conversation messages for signal analysis
|
||||
pub fn new(
|
||||
collector: Arc<TraceCollector>,
|
||||
service_name: impl Into<String>,
|
||||
span: Span,
|
||||
start_time: Instant,
|
||||
messages: Option<Vec<Message>>,
|
||||
) -> Self {
|
||||
Self {
|
||||
collector,
|
||||
|
|
@ -62,6 +67,7 @@ impl ObservableStreamProcessor {
|
|||
chunk_count: 0,
|
||||
start_time,
|
||||
time_to_first_token: None,
|
||||
messages,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@ -133,6 +139,94 @@ impl StreamProcessor for ObservableStreamProcessor {
|
|||
}
|
||||
}
|
||||
|
||||
// Analyze signals if messages are available and add to span attributes
|
||||
if let Some(ref messages) = self.messages {
|
||||
let analyzer: Box<dyn SignalAnalyzer> = Box::new(TextBasedSignalAnalyzer::new());
|
||||
let report = analyzer.analyze(messages);
|
||||
|
||||
// Add overall quality
|
||||
self.span.attributes.push(Attribute {
|
||||
key: signal_constants::QUALITY.to_string(),
|
||||
value: AttributeValue {
|
||||
string_value: Some(format!("{:?}", report.overall_quality)),
|
||||
},
|
||||
});
|
||||
|
||||
// Add repair/follow-up metrics if concerning
|
||||
if report.follow_up.is_concerning || report.follow_up.repair_count > 0 {
|
||||
self.span.attributes.push(Attribute {
|
||||
key: signal_constants::REPAIR_COUNT.to_string(),
|
||||
value: AttributeValue {
|
||||
string_value: Some(report.follow_up.repair_count.to_string()),
|
||||
},
|
||||
});
|
||||
|
||||
self.span.attributes.push(Attribute {
|
||||
key: signal_constants::REPAIR_RATIO.to_string(),
|
||||
value: AttributeValue {
|
||||
string_value: Some(format!("{:.3}", report.follow_up.repair_ratio)),
|
||||
},
|
||||
});
|
||||
}
|
||||
|
||||
// Add flag marker to operation name if any concerning signal is detected
|
||||
let should_flag = report.frustration.has_frustration
|
||||
|| report.repetition.has_looping
|
||||
|| report.escalation.escalation_requested
|
||||
|| matches!(
|
||||
report.overall_quality,
|
||||
InteractionQuality::Poor | InteractionQuality::Severe
|
||||
);
|
||||
|
||||
if should_flag {
|
||||
// Prepend flag marker to the operation name
|
||||
self.span.name = format!("{} {}", self.span.name, FLAG_MARKER);
|
||||
}
|
||||
|
||||
// Add key signal metrics
|
||||
if report.frustration.has_frustration {
|
||||
self.span.attributes.push(Attribute {
|
||||
key: signal_constants::FRUSTRATION_COUNT.to_string(),
|
||||
value: AttributeValue {
|
||||
string_value: Some(report.frustration.frustration_count.to_string()),
|
||||
},
|
||||
});
|
||||
self.span.attributes.push(Attribute {
|
||||
key: signal_constants::FRUSTRATION_SEVERITY.to_string(),
|
||||
value: AttributeValue {
|
||||
string_value: Some(report.frustration.severity.to_string()),
|
||||
},
|
||||
});
|
||||
}
|
||||
|
||||
if report.repetition.has_looping {
|
||||
self.span.attributes.push(Attribute {
|
||||
key: signal_constants::REPETITION_COUNT.to_string(),
|
||||
value: AttributeValue {
|
||||
string_value: Some(report.repetition.repetition_count.to_string()),
|
||||
},
|
||||
});
|
||||
}
|
||||
|
||||
if report.escalation.escalation_requested {
|
||||
self.span.attributes.push(Attribute {
|
||||
key: signal_constants::ESCALATION_REQUESTED.to_string(),
|
||||
value: AttributeValue {
|
||||
string_value: Some("true".to_string()),
|
||||
},
|
||||
});
|
||||
}
|
||||
|
||||
if report.positive_feedback.has_positive_feedback {
|
||||
self.span.attributes.push(Attribute {
|
||||
key: signal_constants::POSITIVE_FEEDBACK_COUNT.to_string(),
|
||||
value: AttributeValue {
|
||||
string_value: Some(report.positive_feedback.positive_count.to_string()),
|
||||
},
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
// Record the finalized span
|
||||
self.collector
|
||||
.record_span(&self.service_name, self.span.clone());
|
||||
|
|
|
|||
|
|
@ -1,5 +1,6 @@
|
|||
pub mod handlers;
|
||||
pub mod router;
|
||||
pub mod signals;
|
||||
pub mod state;
|
||||
pub mod tracing;
|
||||
pub mod utils;
|
||||
|
|
|
|||
3189
crates/brightstaff/src/signals/analyzer.rs
Normal file
3189
crates/brightstaff/src/signals/analyzer.rs
Normal file
File diff suppressed because it is too large
Load diff
3
crates/brightstaff/src/signals/mod.rs
Normal file
3
crates/brightstaff/src/signals/mod.rs
Normal file
|
|
@ -0,0 +1,3 @@
|
|||
mod analyzer;
|
||||
|
||||
pub use analyzer::*;
|
||||
|
|
@ -139,6 +139,45 @@ pub mod error {
|
|||
pub const STACK_TRACE: &str = "error.stack_trace";
|
||||
}
|
||||
|
||||
// =============================================================================
|
||||
// Span Attributes - Agentic Signals
|
||||
// =============================================================================
|
||||
|
||||
/// Behavioral quality indicators for agent interactions
|
||||
/// These signals are computed automatically from conversation patterns
|
||||
pub mod signals {
|
||||
/// Overall quality assessment
|
||||
/// Values: "Excellent", "Good", "Neutral", "Poor", "Severe"
|
||||
pub const QUALITY: &str = "signals.quality";
|
||||
|
||||
/// Total number of turns in the conversation
|
||||
pub const TURN_COUNT: &str = "signals.turn_count";
|
||||
|
||||
/// Efficiency score (0.0-1.0)
|
||||
pub const EFFICIENCY_SCORE: &str = "signals.efficiency_score";
|
||||
|
||||
/// Number of repair attempts detected
|
||||
pub const REPAIR_COUNT: &str = "signals.follow_up.repair.count";
|
||||
|
||||
/// Ratio of repairs to user turns
|
||||
pub const REPAIR_RATIO: &str = "signals.follow_up.repair.ratio";
|
||||
|
||||
/// Number of frustration indicators detected
|
||||
pub const FRUSTRATION_COUNT: &str = "signals.frustration.count";
|
||||
|
||||
/// Frustration severity level (0-3)
|
||||
pub const FRUSTRATION_SEVERITY: &str = "signals.frustration.severity";
|
||||
|
||||
/// Number of repetition instances detected
|
||||
pub const REPETITION_COUNT: &str = "signals.repetition.count";
|
||||
|
||||
/// Whether escalation was requested (user asked for human help)
|
||||
pub const ESCALATION_REQUESTED: &str = "signals.escalation.requested";
|
||||
|
||||
/// Number of positive feedback indicators detected
|
||||
pub const POSITIVE_FEEDBACK_COUNT: &str = "signals.positive_feedback.count";
|
||||
}
|
||||
|
||||
// =============================================================================
|
||||
// Operation Names
|
||||
// =============================================================================
|
||||
|
|
|
|||
|
|
@ -1,3 +1,5 @@
|
|||
mod constants;
|
||||
|
||||
pub use constants::{error, http, llm, operation_component, routing, OperationNameBuilder};
|
||||
pub use constants::{
|
||||
error, http, llm, operation_component, routing, signals, OperationNameBuilder,
|
||||
};
|
||||
|
|
|
|||
|
|
@ -1127,82 +1127,16 @@ impl ProviderRequest for ResponsesAPIRequest {
|
|||
}
|
||||
|
||||
fn get_messages(&self) -> Vec<crate::apis::openai::Message> {
|
||||
use crate::apis::openai::{Message, MessageContent, Role};
|
||||
use crate::transforms::request::from_openai::ResponsesInputConverter;
|
||||
|
||||
let mut openai_messages = Vec::new();
|
||||
// Use the shared converter to get the full conversion with image support
|
||||
let converter = ResponsesInputConverter {
|
||||
input: self.input.clone(),
|
||||
instructions: self.instructions.clone(),
|
||||
};
|
||||
|
||||
// Add instructions as system message if present
|
||||
if let Some(instructions) = &self.instructions {
|
||||
openai_messages.push(Message {
|
||||
role: Role::System,
|
||||
content: MessageContent::Text(instructions.clone()),
|
||||
name: None,
|
||||
tool_calls: None,
|
||||
tool_call_id: None,
|
||||
});
|
||||
}
|
||||
|
||||
// Convert input to messages
|
||||
match &self.input {
|
||||
InputParam::Text(text) => {
|
||||
openai_messages.push(Message {
|
||||
role: Role::User,
|
||||
content: MessageContent::Text(text.clone()),
|
||||
name: None,
|
||||
tool_calls: None,
|
||||
tool_call_id: None,
|
||||
});
|
||||
}
|
||||
InputParam::Items(items) => {
|
||||
for item in items {
|
||||
match item {
|
||||
InputItem::Message(msg) => {
|
||||
// Convert message role
|
||||
let role = match msg.role {
|
||||
MessageRole::User => Role::User,
|
||||
MessageRole::Assistant => Role::Assistant,
|
||||
MessageRole::System => Role::System,
|
||||
MessageRole::Developer => Role::System, // Map developer to system
|
||||
};
|
||||
|
||||
// Extract text from message content
|
||||
let content = match &msg.content {
|
||||
crate::apis::openai_responses::MessageContent::Text(text) => {
|
||||
text.clone()
|
||||
}
|
||||
crate::apis::openai_responses::MessageContent::Items(items) => {
|
||||
items
|
||||
.iter()
|
||||
.filter_map(|c| {
|
||||
if let InputContent::InputText { text } = c {
|
||||
Some(text.clone())
|
||||
} else {
|
||||
None
|
||||
}
|
||||
})
|
||||
.collect::<Vec<_>>()
|
||||
.join("\n")
|
||||
}
|
||||
};
|
||||
|
||||
openai_messages.push(Message {
|
||||
role,
|
||||
content: MessageContent::Text(content),
|
||||
name: None,
|
||||
tool_calls: None,
|
||||
tool_call_id: None,
|
||||
});
|
||||
}
|
||||
// Skip other input item types for now
|
||||
InputItem::ItemReference { .. } | InputItem::FunctionCallOutput { .. } => {
|
||||
// These are not yet supported in agent framework
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
openai_messages
|
||||
// Convert and return, falling back to empty vec on error
|
||||
converter.try_into().unwrap_or_else(|_| Vec::new())
|
||||
}
|
||||
|
||||
fn set_messages(&mut self, messages: &[crate::apis::openai::Message]) {
|
||||
|
|
|
|||
|
|
@ -24,6 +24,150 @@ use crate::transforms::*;
|
|||
|
||||
type AnthropicMessagesRequest = MessagesRequest;
|
||||
|
||||
// ============================================================================
|
||||
// RESPONSES API INPUT CONVERSION
|
||||
// ============================================================================
|
||||
|
||||
/// Helper struct for converting ResponsesAPI input to OpenAI messages
|
||||
pub struct ResponsesInputConverter {
|
||||
pub input: InputParam,
|
||||
pub instructions: Option<String>,
|
||||
}
|
||||
|
||||
impl TryFrom<ResponsesInputConverter> for Vec<Message> {
|
||||
type Error = TransformError;
|
||||
|
||||
fn try_from(converter: ResponsesInputConverter) -> Result<Self, Self::Error> {
|
||||
// Convert input to messages
|
||||
match converter.input {
|
||||
InputParam::Text(text) => {
|
||||
// Simple text input becomes a user message
|
||||
let mut messages = Vec::new();
|
||||
|
||||
// Add instructions as system message if present
|
||||
if let Some(instructions) = converter.instructions {
|
||||
messages.push(Message {
|
||||
role: Role::System,
|
||||
content: MessageContent::Text(instructions),
|
||||
name: None,
|
||||
tool_call_id: None,
|
||||
tool_calls: None,
|
||||
});
|
||||
}
|
||||
|
||||
// Add the user message
|
||||
messages.push(Message {
|
||||
role: Role::User,
|
||||
content: MessageContent::Text(text),
|
||||
name: None,
|
||||
tool_call_id: None,
|
||||
tool_calls: None,
|
||||
});
|
||||
|
||||
Ok(messages)
|
||||
}
|
||||
InputParam::Items(items) => {
|
||||
// Convert input items to messages
|
||||
let mut converted_messages = Vec::new();
|
||||
|
||||
// Add instructions as system message if present
|
||||
if let Some(instructions) = converter.instructions {
|
||||
converted_messages.push(Message {
|
||||
role: Role::System,
|
||||
content: MessageContent::Text(instructions),
|
||||
name: None,
|
||||
tool_call_id: None,
|
||||
tool_calls: None,
|
||||
});
|
||||
}
|
||||
|
||||
// Convert each input item
|
||||
for item in items {
|
||||
if let InputItem::Message(input_msg) = item {
|
||||
let role = match input_msg.role {
|
||||
MessageRole::User => Role::User,
|
||||
MessageRole::Assistant => Role::Assistant,
|
||||
MessageRole::System => Role::System,
|
||||
MessageRole::Developer => Role::System, // Map developer to system
|
||||
};
|
||||
|
||||
// Convert content based on MessageContent type
|
||||
let content = match &input_msg.content {
|
||||
crate::apis::openai_responses::MessageContent::Text(text) => {
|
||||
// Simple text content
|
||||
MessageContent::Text(text.clone())
|
||||
}
|
||||
crate::apis::openai_responses::MessageContent::Items(content_items) => {
|
||||
// Check if it's a single text item (can use simple text format)
|
||||
if content_items.len() == 1 {
|
||||
if let InputContent::InputText { text } = &content_items[0] {
|
||||
MessageContent::Text(text.clone())
|
||||
} else {
|
||||
// Single non-text item - use parts format
|
||||
MessageContent::Parts(
|
||||
content_items.iter()
|
||||
.filter_map(|c| match c {
|
||||
InputContent::InputText { text } => {
|
||||
Some(crate::apis::openai::ContentPart::Text { text: text.clone() })
|
||||
}
|
||||
InputContent::InputImage { image_url, .. } => {
|
||||
Some(crate::apis::openai::ContentPart::ImageUrl {
|
||||
image_url: crate::apis::openai::ImageUrl {
|
||||
url: image_url.clone(),
|
||||
detail: None,
|
||||
}
|
||||
})
|
||||
}
|
||||
InputContent::InputFile { .. } => None, // Skip files for now
|
||||
InputContent::InputAudio { .. } => None, // Skip audio for now
|
||||
})
|
||||
.collect()
|
||||
)
|
||||
}
|
||||
} else {
|
||||
// Multiple content items - convert to parts
|
||||
MessageContent::Parts(
|
||||
content_items
|
||||
.iter()
|
||||
.filter_map(|c| match c {
|
||||
InputContent::InputText { text } => {
|
||||
Some(crate::apis::openai::ContentPart::Text {
|
||||
text: text.clone(),
|
||||
})
|
||||
}
|
||||
InputContent::InputImage { image_url, .. } => Some(
|
||||
crate::apis::openai::ContentPart::ImageUrl {
|
||||
image_url: crate::apis::openai::ImageUrl {
|
||||
url: image_url.clone(),
|
||||
detail: None,
|
||||
},
|
||||
},
|
||||
),
|
||||
InputContent::InputFile { .. } => None, // Skip files for now
|
||||
InputContent::InputAudio { .. } => None, // Skip audio for now
|
||||
})
|
||||
.collect(),
|
||||
)
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
converted_messages.push(Message {
|
||||
role,
|
||||
content,
|
||||
name: None,
|
||||
tool_call_id: None,
|
||||
tool_calls: None,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
Ok(converted_messages)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// MAIN REQUEST TRANSFORMATIONS
|
||||
// ============================================================================
|
||||
|
|
@ -253,117 +397,12 @@ impl TryFrom<ResponsesAPIRequest> for ChatCompletionsRequest {
|
|||
type Error = TransformError;
|
||||
|
||||
fn try_from(req: ResponsesAPIRequest) -> Result<Self, Self::Error> {
|
||||
// Convert input to messages
|
||||
let messages = match req.input {
|
||||
InputParam::Text(text) => {
|
||||
// Simple text input becomes a user message
|
||||
vec![Message {
|
||||
role: Role::User,
|
||||
content: MessageContent::Text(text),
|
||||
name: None,
|
||||
tool_call_id: None,
|
||||
tool_calls: None,
|
||||
}]
|
||||
}
|
||||
InputParam::Items(items) => {
|
||||
// Convert input items to messages
|
||||
let mut converted_messages = Vec::new();
|
||||
|
||||
// Add instructions as system message if present
|
||||
if let Some(instructions) = &req.instructions {
|
||||
converted_messages.push(Message {
|
||||
role: Role::System,
|
||||
content: MessageContent::Text(instructions.clone()),
|
||||
name: None,
|
||||
tool_call_id: None,
|
||||
tool_calls: None,
|
||||
});
|
||||
}
|
||||
|
||||
// Convert each input item
|
||||
for item in items {
|
||||
if let InputItem::Message(input_msg) = item {
|
||||
let role = match input_msg.role {
|
||||
MessageRole::User => Role::User,
|
||||
MessageRole::Assistant => Role::Assistant,
|
||||
MessageRole::System => Role::System,
|
||||
MessageRole::Developer => Role::System, // Map developer to system
|
||||
};
|
||||
|
||||
// Convert content based on MessageContent type
|
||||
let content = match &input_msg.content {
|
||||
crate::apis::openai_responses::MessageContent::Text(text) => {
|
||||
// Simple text content
|
||||
MessageContent::Text(text.clone())
|
||||
}
|
||||
crate::apis::openai_responses::MessageContent::Items(content_items) => {
|
||||
// Check if it's a single text item (can use simple text format)
|
||||
if content_items.len() == 1 {
|
||||
if let InputContent::InputText { text } = &content_items[0] {
|
||||
MessageContent::Text(text.clone())
|
||||
} else {
|
||||
// Single non-text item - use parts format
|
||||
MessageContent::Parts(
|
||||
content_items.iter()
|
||||
.filter_map(|c| match c {
|
||||
InputContent::InputText { text } => {
|
||||
Some(crate::apis::openai::ContentPart::Text { text: text.clone() })
|
||||
}
|
||||
InputContent::InputImage { image_url, .. } => {
|
||||
Some(crate::apis::openai::ContentPart::ImageUrl {
|
||||
image_url: crate::apis::openai::ImageUrl {
|
||||
url: image_url.clone(),
|
||||
detail: None,
|
||||
}
|
||||
})
|
||||
}
|
||||
InputContent::InputFile { .. } => None, // Skip files for now
|
||||
InputContent::InputAudio { .. } => None, // Skip audio for now
|
||||
})
|
||||
.collect()
|
||||
)
|
||||
}
|
||||
} else {
|
||||
// Multiple content items - convert to parts
|
||||
MessageContent::Parts(
|
||||
content_items
|
||||
.iter()
|
||||
.filter_map(|c| match c {
|
||||
InputContent::InputText { text } => {
|
||||
Some(crate::apis::openai::ContentPart::Text {
|
||||
text: text.clone(),
|
||||
})
|
||||
}
|
||||
InputContent::InputImage { image_url, .. } => Some(
|
||||
crate::apis::openai::ContentPart::ImageUrl {
|
||||
image_url: crate::apis::openai::ImageUrl {
|
||||
url: image_url.clone(),
|
||||
detail: None,
|
||||
},
|
||||
},
|
||||
),
|
||||
InputContent::InputFile { .. } => None, // Skip files for now
|
||||
InputContent::InputAudio { .. } => None, // Skip audio for now
|
||||
})
|
||||
.collect(),
|
||||
)
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
converted_messages.push(Message {
|
||||
role,
|
||||
content,
|
||||
name: None,
|
||||
tool_call_id: None,
|
||||
tool_calls: None,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
converted_messages
|
||||
}
|
||||
// Convert input to messages using the shared converter
|
||||
let converter = ResponsesInputConverter {
|
||||
input: req.input,
|
||||
instructions: req.instructions.clone(),
|
||||
};
|
||||
let messages: Vec<Message> = converter.try_into()?;
|
||||
|
||||
// Build the ChatCompletionsRequest
|
||||
Ok(ChatCompletionsRequest {
|
||||
|
|
|
|||
BIN
docs/source/_static/img/signals_trace.png
Normal file
BIN
docs/source/_static/img/signals_trace.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 2.8 MiB |
359
docs/source/concepts/signals.rst
Normal file
359
docs/source/concepts/signals.rst
Normal file
|
|
@ -0,0 +1,359 @@
|
|||
.. -*- coding: utf-8 -*-
|
||||
|
||||
========
|
||||
Signals™
|
||||
========
|
||||
|
||||
Agentic Signals are behavioral and execution-quality indicators that act as early warning signs of agent performance—highlighting both brilliant successes and **severe failures**. These signals are computed directly from conversation traces without requiring manual labeling or domain expertise, making them practical for production observability at scale.
|
||||
|
||||
The Problem: Knowing What's "Good"
|
||||
==================================
|
||||
|
||||
One of the hardest parts of building agents is measuring how well they perform in the real world.
|
||||
|
||||
**Offline testing** relies on hand-picked examples and happy-path scenarios, missing the messy diversity of real usage. Developers manually prompt models, evaluate responses, and tune prompts by guesswork—a slow, incomplete feedback loop.
|
||||
|
||||
**Production debugging** floods developers with traces and logs but provides little guidance on which interactions actually matter. Finding failures means painstakingly reconstructing sessions and manually labeling quality issues.
|
||||
|
||||
You can't score every response with an LLM-as-judge (too expensive, too slow) or manually review every trace (doesn't scale). What you need are **behavioral signals**—fast, economical proxies that don’t label quality outright but dramatically shrink the search space, pointing to sessions most likely to be broken or brilliant.
|
||||
|
||||
What Are Behavioral Signals?
|
||||
============================
|
||||
|
||||
Behavioral signals are canaries in the coal mine—early, objective indicators that something may have gone wrong (or gone exceptionally well). They don’t explain *why* an agent failed, but they reliably signal *where* attention is needed.
|
||||
|
||||
These signals emerge naturally from the rhythm of interaction:
|
||||
|
||||
- A user rephrasing the same request
|
||||
- Sharp increases in conversation length
|
||||
- Frustrated follow-up messages (ALL CAPS, "this doesn’t work", excessive !!!/???)
|
||||
- Agent repetition / looping
|
||||
- Expressions of gratitude or satisfaction
|
||||
- Requests to speak to a human / contact support
|
||||
|
||||
Individually, these clues are shallow; together, they form a fingerprint of agent performance. Embedded directly into traces, they make it easy to spot friction as it happens: where users struggle, where agents loop, and where escalations occur.
|
||||
|
||||
Signals vs Response Quality
|
||||
===========================
|
||||
|
||||
Behavioral signals and response quality are complementary.
|
||||
|
||||
**Response Quality**
|
||||
Domain-specific correctness: did the agent do the right thing given business rules, user intent, and operational context? This often requires subject-matter experts or outcome instrumentation and is time-intensive but irreplaceable.
|
||||
|
||||
**Behavioral Signals**
|
||||
Observable patterns that correlate with quality: high repair frequency, excessive turns, frustration markers, repetition, escalation, and positive feedback. Fast to compute and valuable for prioritizing which traces deserve inspection.
|
||||
|
||||
Used together, signals tell you *where to look*, and quality evaluation tells you *what went wrong (or right)*.
|
||||
|
||||
How It Works
|
||||
============
|
||||
|
||||
Signals are computed automatically by the gateway and emitted as **OpenTelemetry trace attributes** to your existing observability stack (Jaeger, Honeycomb, Grafana Tempo, etc.). No additional libraries or instrumentation required—just configure your OTEL collector endpoint.
|
||||
|
||||
Each conversation trace is enriched with signal attributes that you can query, filter, and visualize in your observability platform. The gateway analyzes message content (performing text normalization, Unicode handling, and pattern matching) to compute behavioral signals in real-time.
|
||||
|
||||
**OTEL Trace Attributes**
|
||||
|
||||
Signal data is exported as structured span attributes:
|
||||
|
||||
- ``signals.quality`` - Overall assessment (Excellent/Good/Neutral/Poor/Severe)
|
||||
- ``signals.turn_count`` - Total number of turns in the conversation
|
||||
- ``signals.efficiency_score`` - Efficiency metric (0.0-1.0)
|
||||
- ``signals.repair.count`` - Number of repair attempts detected (when present)
|
||||
- ``signals.repair.ratio`` - Ratio of repairs to user turns (when present)
|
||||
- ``signals.frustration.count`` - Number of frustration indicators detected
|
||||
- ``signals.frustration.severity`` - Frustration level (0-3)
|
||||
- ``signals.repetition.count`` - Number of repetition instances detected
|
||||
- ``signals.escalation.requested`` - Boolean escalation flag ("true" when present)
|
||||
- ``signals.positive_feedback.count`` - Number of positive feedback indicators
|
||||
|
||||
**Visual Flag Marker**
|
||||
|
||||
When concerning signals are detected (frustration, looping, escalation, or poor/severe quality), the flag marker **🚩** is automatically appended to the span's operation name, making problematic traces easy to spot in your trace visualizations.
|
||||
|
||||
**Querying in Your Observability Platform**
|
||||
|
||||
Example queries:
|
||||
|
||||
- Find all severe interactions: ``signals.quality = "Severe"``
|
||||
- Find flagged traces: search for **🚩** in span names
|
||||
- Find long conversations: ``signals.turn_count > 10``
|
||||
- Find inefficient interactions: ``signals.efficiency_score < 0.5``
|
||||
- Find high repair rates: ``signals.repair.ratio > 0.3``
|
||||
- Find frustrated users: ``signals.frustration.severity >= 2``
|
||||
- Find looping agents: ``signals.repetition.count >= 3``
|
||||
- Find positive interactions: ``signals.positive_feedback.count >= 2``
|
||||
- Find escalations: ``signals.escalation.requested = "true"``
|
||||
|
||||
.. image:: /_static/img/signals_trace.png
|
||||
:width: 100%
|
||||
:align: center
|
||||
|
||||
|
||||
Core Signal Types
|
||||
=================
|
||||
|
||||
The signals system tracks six categories of behavioral indicators.
|
||||
|
||||
Turn Count & Efficiency
|
||||
-----------------------
|
||||
|
||||
**What it measures**
|
||||
Number of user–assistant exchanges.
|
||||
|
||||
**Why it matters**
|
||||
Long conversations often indicate unclear intent resolution, confusion, or inefficiency. Very short conversations can correlate with crisp resolution.
|
||||
|
||||
**Key metrics**
|
||||
|
||||
- Total turn count
|
||||
- Warning thresholds (concerning: >7 turns, excessive: >12 turns)
|
||||
- Efficiency score (0.0–1.0)
|
||||
|
||||
**Efficiency scoring**
|
||||
Baseline expectation is ~5 turns (tunable). Efficiency stays at 1.0 up to the baseline, then declines with an inverse penalty as turns exceed baseline::
|
||||
|
||||
efficiency = 1 / (1 + 0.3 * (turns - baseline))
|
||||
|
||||
Follow-Up & Repair Frequency
|
||||
----------------------------
|
||||
|
||||
**What it measures**
|
||||
How often users clarify, correct, or rephrase requests. This is a **user signal** tracking query reformulation behavior—when users must repair or rephrase their requests because the agent didn't understand or respond appropriately.
|
||||
|
||||
**Why it matters**
|
||||
High repair frequency is a proxy for misunderstanding or intent drift. When users repeatedly rephrase the same request, it indicates the agent is failing to grasp or act on the user's intent.
|
||||
|
||||
**Key metrics**
|
||||
|
||||
- Repair count and ratio (repairs / user turns)
|
||||
- Concerning threshold: >30% repair ratio
|
||||
- Detected repair phrases (exact or fuzzy)
|
||||
|
||||
**Common patterns detected**
|
||||
|
||||
- Explicit corrections: "I meant", "correction"
|
||||
- Negations: "No, I...", "that's not"
|
||||
- Rephrasing: "let me rephrase", "to clarify"
|
||||
- Mistake acknowledgment: "my mistake", "I was wrong"
|
||||
- "Similar rephrase" heuristic based on token overlap (with stopwords downweighted)
|
||||
|
||||
User Frustration
|
||||
----------------
|
||||
|
||||
**What it measures**
|
||||
Observable frustration indicators and emotional escalation.
|
||||
|
||||
**Why it matters**
|
||||
Catching frustration early enables intervention before users abandon or escalate.
|
||||
|
||||
**Detection patterns**
|
||||
|
||||
- **Complaints**: "this doesn't work", "not helpful", "waste of time"
|
||||
- **Confusion**: "I don't understand", "makes no sense", "I'm confused"
|
||||
- **Tone markers**:
|
||||
|
||||
- ALL CAPS (>=10 alphabetic chars and >=80% uppercase)
|
||||
- Excessive punctuation (>=3 exclamation marks or >=3 question marks)
|
||||
|
||||
- **Profanity**: token-based (avoids substring false positives like "absolute" -> "bs")
|
||||
|
||||
**Severity levels**
|
||||
|
||||
- **None (0)**: no indicators
|
||||
- **Mild (1)**: 1–2 indicators
|
||||
- **Moderate (2)**: 3–4 indicators
|
||||
- **Severe (3)**: 5+ indicators
|
||||
|
||||
Repetition & Looping
|
||||
--------------------
|
||||
|
||||
**What it measures**
|
||||
Assistant repetition / degenerative loops. This is an **assistant signal** tracking when the agent repeats itself, fails to follow instructions, or gets stuck in loops—indicating the agent is not making progress or adapting its responses.
|
||||
|
||||
**Why it matters**
|
||||
Often indicates missing state tracking, broken tool integration, prompt issues, or the agent ignoring user corrections. High repetition means the agent is not learning from the conversation context.
|
||||
|
||||
**Detection method**
|
||||
|
||||
- Compare assistant messages using **bigram Jaccard similarity**
|
||||
- Classify:
|
||||
|
||||
- **Exact**: similarity >= 0.85
|
||||
- **Near-duplicate**: similarity >= 0.50
|
||||
|
||||
- Looping is flagged when repetition instances exceed 2 in a session.
|
||||
|
||||
**Severity levels**
|
||||
|
||||
- **None (0)**: 0 instances
|
||||
- **Mild (1)**: 1–2 instances
|
||||
- **Moderate (2)**: 3–4 instances
|
||||
- **Severe (3)**: 5+ instances
|
||||
|
||||
Positive Feedback
|
||||
-----------------
|
||||
|
||||
**What it measures**
|
||||
User expressions of satisfaction, gratitude, and success.
|
||||
|
||||
**Why it matters**
|
||||
Strong positive signals identify exemplar traces for prompt engineering and evaluation.
|
||||
|
||||
**Detection patterns**
|
||||
|
||||
- Gratitude: "thank you", "appreciate it"
|
||||
- Satisfaction: "that's great", "awesome", "love it"
|
||||
- Success confirmation: "got it", "that worked", "perfect"
|
||||
|
||||
**Confidence scoring**
|
||||
|
||||
- 1 indicator: 0.6
|
||||
- 2 indicators: 0.8
|
||||
- 3+ indicators: 0.95
|
||||
|
||||
Escalation Requests
|
||||
-------------------
|
||||
|
||||
**What it measures**
|
||||
Requests for human help/support or threats to quit.
|
||||
|
||||
**Why it matters**
|
||||
Escalation is a strong signal that the agent failed to resolve the interaction.
|
||||
|
||||
**Detection patterns**
|
||||
|
||||
- Human requests: "speak to a human", "real person", "live agent"
|
||||
- Support: "contact support", "customer service", "help desk"
|
||||
- Quit threats: "I'm done", "forget it", "I give up"
|
||||
|
||||
Overall Quality Assessment
|
||||
==========================
|
||||
|
||||
Signals are aggregated into an overall interaction quality on a 5-point scale.
|
||||
|
||||
**Excellent**
|
||||
Strong positive signals, efficient resolution, low friction.
|
||||
|
||||
**Good**
|
||||
Mostly positive with minor clarifications; some back-and-forth but successful.
|
||||
|
||||
**Neutral**
|
||||
Mixed signals; neither clearly good nor bad.
|
||||
|
||||
**Poor**
|
||||
Concerning negative patterns (high friction, multiple repairs, moderate frustration). High abandonment risk.
|
||||
|
||||
**Severe**
|
||||
Critical issues—escalation requested, severe frustration, severe looping, or excessive turns (>12). Requires immediate attention.
|
||||
|
||||
This assessment uses a scoring model that weighs positive factors (efficiency, positive feedback) against negative ones (frustration, repairs, repetition, escalation).
|
||||
|
||||
Sampling and Prioritization
|
||||
===========================
|
||||
|
||||
In production, trace data is overwhelming. Signals provide a lightweight first layer of analysis to prioritize which sessions deserve review.
|
||||
|
||||
Workflow:
|
||||
|
||||
1. Gateway captures conversation messages and computes signals
|
||||
2. Signal attributes are emitted to OTEL spans automatically
|
||||
3. Your observability platform ingests and indexes the attributes
|
||||
4. Query/filter by signal attributes to surface outliers (poor/severe and exemplars)
|
||||
5. Review high-information traces to identify improvement opportunities
|
||||
6. Update prompts, routing, or policies based on findings
|
||||
7. Redeploy and monitor signal metrics to validate improvements
|
||||
|
||||
This creates a reinforcement loop where traces become both diagnostic data and training signal.
|
||||
|
||||
Trace Filtering and Telemetry
|
||||
=============================
|
||||
|
||||
Signal attributes are automatically added to OpenTelemetry spans, making them immediately queryable in your observability platform.
|
||||
|
||||
**Visual Filtering**
|
||||
|
||||
When concerning signals are detected, the flag marker **🚩** (U+1F6A9) is automatically appended to the span's operation name. This makes flagged sessions immediately visible in trace visualizations without requiring attribute filtering.
|
||||
|
||||
**Example Span Attributes**::
|
||||
|
||||
# Span name: "POST /v1/chat/completions gpt-4 🚩"
|
||||
signals.quality = "Severe"
|
||||
signals.turn_count = 15
|
||||
signals.efficiency_score = 0.234
|
||||
signals.repair.count = 4
|
||||
signals.repair.ratio = 0.571
|
||||
signals.frustration.severity = 3
|
||||
signals.frustration.count = 5
|
||||
signals.escalation.requested = "true"
|
||||
signals.repetition.count = 4
|
||||
|
||||
**Building Dashboards**
|
||||
|
||||
Use signal attributes to build monitoring dashboards in Grafana, Honeycomb, Datadog, etc.:
|
||||
|
||||
- **Quality distribution**: Count of traces by ``signals.quality``
|
||||
- **P95 turn count**: 95th percentile of ``signals.turn_count``
|
||||
- **Average efficiency**: Mean of ``signals.efficiency_score``
|
||||
- **High repair rate**: Percentage where ``signals.repair.ratio > 0.3``
|
||||
- **Frustration rate**: Percentage where ``signals.frustration.severity >= 2``
|
||||
- **Escalation rate**: Percentage where ``signals.escalation.requested = "true"``
|
||||
- **Looping rate**: Percentage where ``signals.repetition.count >= 3``
|
||||
- **Positive feedback rate**: Percentage where ``signals.positive_feedback.count >= 1``
|
||||
|
||||
**Creating Alerts**
|
||||
|
||||
Set up alerts based on signal thresholds:
|
||||
|
||||
- Alert when severe interaction count exceeds threshold in 1-hour window
|
||||
- Alert on sudden spike in frustration rate (>2x baseline)
|
||||
- Alert when escalation rate exceeds 5% of total conversations
|
||||
- Alert on degraded efficiency (P95 turn count increases >50%)
|
||||
|
||||
Best Practices
|
||||
==============
|
||||
|
||||
Start simple:
|
||||
|
||||
- Alert or page on **Severe** sessions (or on spikes in Severe rate)
|
||||
- Review **Poor** sessions within 24 hours
|
||||
- Sample **Excellent** sessions as exemplars
|
||||
|
||||
Combine multiple signals to infer failure modes:
|
||||
|
||||
- Looping: repetition severity >= 2 + excessive turns
|
||||
- User giving up: frustration severity >= 2 + escalation requested
|
||||
- Misunderstood intent: repair ratio > 30% + excessive turns
|
||||
- Working well: positive feedback + high efficiency + no frustration
|
||||
|
||||
Limitations and Considerations
|
||||
==============================
|
||||
|
||||
Signals don’t capture:
|
||||
|
||||
- Task completion / real outcomes
|
||||
- Factual or domain correctness
|
||||
- Silent abandonment (user leaves without expressing frustration)
|
||||
- Non-English nuance (pattern libraries are English-oriented)
|
||||
|
||||
Mitigation strategies:
|
||||
|
||||
- Periodically sample flagged sessions and measure false positives/negatives
|
||||
- Tune baselines per use case and user population
|
||||
- Add domain-specific phrase libraries where needed
|
||||
- Combine signals with non-text metrics (tool failures, disconnects, latency)
|
||||
|
||||
.. note::
|
||||
Behavioral signals complement—but do not replace—domain-specific response quality evaluation. Use signals to prioritize which traces to inspect, then apply domain expertise and outcome checks to diagnose root causes.
|
||||
|
||||
.. tip::
|
||||
The flag marker in the span name provides instant visual feedback in trace UIs, while the structured attributes (``signals.quality``, ``signals.frustration.severity``, etc.) enable powerful querying and aggregation in your observability platform.
|
||||
|
||||
See Also
|
||||
========
|
||||
|
||||
- :doc:`../guides/observability/tracing` - Distributed tracing for agent systems
|
||||
- :doc:`../guides/observability/monitoring` - Metrics and dashboards
|
||||
- :doc:`../guides/observability/access_logging` - Request/response logging
|
||||
- :doc:`../guides/observability/observability` - Complete observability guide
|
||||
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
Overview
|
||||
========
|
||||
`Plano <https://github.com/katanemo/plano>`_ is delivery infrastructure for agentic apps. A models-native proxy server and data plane designed to help you build agents faster, and deliver them reliably to production.
|
||||
`Plano <https://github.com/katanemo/plano>`_ is delivery infrastructure for agentic apps. A smart proxy server and data plane designed to help you build agents faster, and deliver them reliably to production.
|
||||
|
||||
Plano pulls out the rote plumbing work (the “hidden AI middleware”) and decouples you from brittle, ever‑changing framework abstractions. It centralizes what shouldn’t be bespoke in every codebase like agent routing and orchestration, rich agentic signals and traces for continuous improvement, guardrail filters for safety and moderation, and smart LLM routing APIs for UX and DX agility. Use any language or AI framework, and ship agents to production faster with Plano.
|
||||
|
||||
|
|
|
|||
|
|
@ -28,6 +28,120 @@ tools.
|
|||
:align: center
|
||||
|
||||
|
||||
Understanding Plano Traces
|
||||
--------------------------
|
||||
|
||||
Plano creates structured traces that capture the complete flow of requests through your AI system. Each trace consists of multiple spans representing different stages of processing.
|
||||
|
||||
Inbound Request Handling
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
When a request enters Plano, it creates an **inbound span** (``plano(inbound)``) that represents the initial request reception and processing. This span captures:
|
||||
|
||||
- HTTP request details (method, path, headers)
|
||||
- Request payload size
|
||||
- Initial validation and authentication
|
||||
|
||||
Orchestration & Routing
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
For agent systems, Plano performs intelligent routing through orchestration spans:
|
||||
|
||||
- **Agent Orchestration** (``plano(orchestrator)``): When multiple agents are available, Plano uses an LLM to analyze the user's intent and select the most appropriate agent. This span captures the orchestration decision-making process.
|
||||
|
||||
- **LLM Routing** (``plano(routing)``): For direct LLM requests, Plano determines the optimal endpoint based on your routing strategy (round-robin, least-latency, cost-optimized). This span includes:
|
||||
|
||||
- Routing strategy used
|
||||
- Selected upstream endpoint
|
||||
- Route determination time
|
||||
- Fallback indicators (if applicable)
|
||||
|
||||
Agent Processing
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
When requests are routed to agents, Plano creates spans for agent execution:
|
||||
|
||||
- **Agent Filter Chains** (``plano(filter)``): If filters are configured (guardrails, context enrichment, query rewriting), each filter execution is captured in its own span, showing the transformation pipeline.
|
||||
|
||||
- **Agent Execution** (``plano(agent)``): The main agent processing span that captures the agent's work, including any tools invoked and intermediate reasoning steps.
|
||||
|
||||
Outbound LLM Calls
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
All LLM calls—whether from Plano's routing layer or from agents—are traced with **LLM spans** (``plano(llm)``) that capture:
|
||||
|
||||
- Model name and provider (e.g., ``gpt-4``, ``claude-3-sonnet``)
|
||||
- Request parameters (temperature, max_tokens, top_p)
|
||||
- Token usage (prompt_tokens, completion_tokens)
|
||||
- Streaming indicators and time-to-first-token
|
||||
- Response metadata
|
||||
|
||||
**Example Span Attributes**::
|
||||
|
||||
# LLM call span
|
||||
llm.model = "gpt-4"
|
||||
llm.provider = "openai"
|
||||
llm.usage.prompt_tokens = 150
|
||||
llm.usage.completion_tokens = 75
|
||||
llm.duration_ms = 1250
|
||||
llm.time_to_first_token = 320
|
||||
|
||||
Handoff to Upstream Services
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
When Plano forwards requests to upstream services (agents, APIs, or LLM providers), it creates **handoff spans** (``plano(handoff)``) that capture:
|
||||
|
||||
- Upstream endpoint URL
|
||||
- Request/response sizes
|
||||
- HTTP status codes
|
||||
- Upstream response times
|
||||
|
||||
This creates a complete end-to-end trace showing the full request lifecycle through all system components.
|
||||
|
||||
Behavioral Signals in Traces
|
||||
----------------------------
|
||||
|
||||
Plano automatically enriches OpenTelemetry traces with :doc:`../../concepts/signals` — behavioral quality indicators computed from conversation patterns. These signals are attached as span attributes, providing immediate visibility into interaction quality.
|
||||
|
||||
**What Signals Provide**
|
||||
|
||||
Signals act as early warning indicators embedded in your traces:
|
||||
|
||||
- **Quality Assessment**: Overall interaction quality (Excellent/Good/Neutral/Poor/Severe)
|
||||
- **Efficiency Metrics**: Turn count, efficiency scores, repair frequency
|
||||
- **User Sentiment**: Frustration indicators, positive feedback, escalation requests
|
||||
- **Agent Behavior**: Repetition detection, looping patterns
|
||||
|
||||
**Visual Flag Markers**
|
||||
|
||||
When concerning signals are detected (frustration, looping, escalation, or poor/severe quality), Plano automatically appends a flag marker **🚩** to the span's operation name. This makes problematic traces immediately visible in your tracing UI without requiring additional queries.
|
||||
|
||||
**Example Span with Signals**::
|
||||
|
||||
# Span name: "POST /v1/chat/completions gpt-4 🚩"
|
||||
# Standard LLM attributes:
|
||||
llm.model = "gpt-4"
|
||||
llm.usage.total_tokens = 225
|
||||
|
||||
# Behavioral signal attributes:
|
||||
signals.quality = "Severe"
|
||||
signals.turn_count = 15
|
||||
signals.efficiency_score = 0.234
|
||||
signals.frustration.severity = 3
|
||||
signals.escalation.requested = "true"
|
||||
|
||||
**Querying Signal Data**
|
||||
|
||||
In your observability platform (Jaeger, Grafana Tempo, Datadog, etc.), filter traces by signal attributes:
|
||||
|
||||
- Find severe interactions: ``signals.quality = "Severe"``
|
||||
- Find frustrated users: ``signals.frustration.severity >= 2``
|
||||
- Find inefficient flows: ``signals.efficiency_score < 0.5``
|
||||
- Find escalations: ``signals.escalation.requested = "true"``
|
||||
|
||||
For complete details on all available signals, detection methods, and best practices, see the :doc:`../../concepts/signals` guide.
|
||||
|
||||
|
||||
Benefits of Using ``Traceparent`` Headers
|
||||
-----------------------------------------
|
||||
|
||||
|
|
|
|||
|
|
@ -5,7 +5,7 @@ Welcome to Plano!
|
|||
:width: 100%
|
||||
:align: center
|
||||
|
||||
`Plano <https://github.com/katanemo/plano>`_ is delivery infrastructure for agentic apps. A models-native proxy server and data plane designed to help you build agents faster, and deliver them reliably to production.
|
||||
`Plano <https://github.com/katanemo/plano>`_ is delivery infrastructure for agentic apps. A smart proxy server and data plane designed to help you build agents faster, and deliver them reliably to production.
|
||||
|
||||
Plano pulls out the rote plumbing work (aka “hidden AI middleware”) and decouples you from brittle, ever‑changing framework abstractions. It centralizes what shouldn’t be bespoke in every codebase like agent routing and orchestration, rich agentic signals and traces for continuous improvement, guardrail filters for safety and moderation, and smart LLM routing APIs for UX and DX agility. Use any language or AI framework, and ship agents to production faster with Plano.
|
||||
|
||||
|
|
@ -36,6 +36,7 @@ Built by contributors to the widely adopted `Envoy Proxy <https://www.envoyproxy
|
|||
concepts/filter_chain
|
||||
concepts/llm_providers/llm_providers
|
||||
concepts/prompt_target
|
||||
concepts/signals
|
||||
|
||||
.. tab-item:: Guides
|
||||
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue