diff --git a/demos/hr_agent/README.md b/demos/hr_agent/README.md
new file mode 100644
index 00000000..cdc74d77
--- /dev/null
+++ b/demos/hr_agent/README.md
@@ -0,0 +1,31 @@
+# HR Agent Demo
+
+This demo shows how **Arch** can be used to build an HR agent that handles workforce-related inquiries, workforce planning, and communication via Slack. Arch routes incoming prompts to the correct targets and returns concise responses tailored for HR and workforce decision-making.
+
+## Available Functions:
+
+- **HR Q/A**: Handles general Q&A related to insurance policies.
+  - **Endpoint**: `/agent/hr_qa`
+
+- **Workforce Data Retrieval**: Retrieves data related to workforce metrics like headcount, satisfaction, and staffing.
+  - **Endpoint**: `/agent/workforce`
+  - Parameters:
+    - `staffing_type` (str, required): Type of staffing (e.g., `contract`, `fte`, `agency`).
+    - `region` (str, required): Region for which the data is requested (e.g., `asia`, `europe`, `americas`).
+    - `point_in_time` (int, optional): How many days back to retrieve data for (e.g., `0` for today, `30` for 30 days ago).
+
+- **Initiate Policy**: Sends messages to a Slack channel.
+  - **Endpoint**: `/agent/slack_message`
+  - Parameters:
+    - `slack_message` (str, required): The message content to be sent.
+
+## Starting the demo
+1. Make sure the [prerequisites](https://github.com/katanemo/arch/?tab=readme-ov-file#prerequisites) are installed correctly.
+2. Start Arch
+   ```sh
+   sh run_demo.sh
+   ```
+3. Navigate to http://localhost:18080/
+4. Ask a question, for example: "Can you give me workforce data for asia?"
+
+![HR agent demo](image.png)
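
Reviewer sketch (not part of this patch): the README lists the parameters for `/agent/workforce` but does not show what a request looks like. The snippet below is a minimal illustration, assuming the endpoint accepts a JSON object keyed by the documented parameter names. The helper name `build_workforce_request` is hypothetical, and the README does not confirm the HTTP method or body format.

```python
import json
from typing import Any, Dict, Optional


def build_workforce_request(
    staffing_type: str, region: str, point_in_time: Optional[int] = None
) -> Dict[str, Any]:
    """Build a parameter payload for /agent/workforce (shape is an assumption)."""
    # staffing_type and region are documented as required strings.
    if not staffing_type or not region:
        raise ValueError("staffing_type and region are required")
    payload: Dict[str, Any] = {"staffing_type": staffing_type, "region": region}
    # point_in_time is documented as an optional int, so only send it when given.
    if point_in_time is not None:
        payload["point_in_time"] = int(point_in_time)
    return payload


if __name__ == "__main__":
    # Serialize the payload the way a JSON client would send it.
    print(json.dumps(build_workforce_request("fte", "asia", 30)))
```

Leaving `point_in_time` out of the payload when it is not supplied lets the backend apply its own default rather than having the client guess one.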
diff --git a/demos/hr_agent/image.png b/demos/hr_agent/image.png
new file mode 100644
index 00000000..e36855b2
Binary files /dev/null and b/demos/hr_agent/image.png differ
diff --git a/docs/source/get_started/intro_to_arch.rst b/docs/source/get_started/intro_to_arch.rst
index 7f61b4f9..9bd78f1e 100644
--- a/docs/source/get_started/intro_to_arch.rst
+++ b/docs/source/get_started/intro_to_arch.rst
@@ -3,8 +3,10 @@
 Intro to Arch
 =============
 
-Arch is an intelligent `(Layer 7) <https://www.cloudflare.com/learning/ddos/what-is-layer-7/>`_ gateway designed for generative AI apps, AI agents, and AI copilots that work with prompts.
-Engineered with purpose-built large language models (LLMs), Arch handles all the critical but undifferentiated tasks related to the handling and processing of prompts, including detecting and rejecting jailbreak attempts, intelligently calling “backend” APIs to fulfill the user's request represented in a prompt, routing to and offering disaster recovery between upstream LLMs, and managing the observability of prompts and LLM interactions in a centralized way.
+Arch is an intelligent `(Layer 7) <https://www.cloudflare.com/learning/ddos/what-is-layer-7/>`_ gateway designed for generative AI apps, agents, and copilots that work with prompts.
+Engineered with purpose-built large language models (LLMs), Arch handles all the critical but undifferentiated tasks related to the handling and processing of prompts, including
+detecting and rejecting jailbreak attempts, intelligently calling “backend” APIs to fulfill the user's request represented in a prompt, routing to and offering disaster recovery
+between upstream LLMs, and managing the observability of prompts and LLM interactions in a centralized way.
 
 .. image:: /_static/img/arch-logo.png
    :width: 100%
@@ -16,22 +18,21 @@ Engineered with purpose-built large language models (LLMs), Arch handles all the
   including secure handling, intelligent routing, robust observability, and integration with backend (API)
   systems for personalization - all outside business logic.*
 
-In practice, achieving the above goal is incredibly difficult.
-Arch attempts to do so by providing the following high level features:
+In practice, achieving the above goal is incredibly difficult. Arch attempts to do so by providing the following high-level features:
 
 **Out-of-process architecture, built on** `Envoy <http://envoyproxy.io/>`_:
-Arch is takes a dependency on Envoy and is a self-contained process that is designed to run alongside your application servers.
+Arch takes a dependency on Envoy and is a self-contained process that is designed to run alongside your application servers.
 Arch uses Envoy's HTTP connection management subsystem, HTTP L7 filtering and telemetry capabilities to extend the functionality exclusively for prompts and LLMs.
 This gives Arch several advantages:
 
-* Arch builds on Envoy's proven success. Envoy is used at masssive sacle by the leading technology companies of our time including `AirBnB <https://www.airbnb.com>`_, `Dropbox <https://www.dropbox.com>`_, `Google <https://www.google.com>`_, `Reddit <https://www.reddit.com>`_, `Stripe <https://www.stripe.com>`_, etc. Its battle tested and scales linearly with usage and enables developers to focus on what really matters: application features and business logic.
+* Arch builds on Envoy's proven success. Envoy is used at massive scale by the leading technology companies of our time including `AirBnB <https://www.airbnb.com>`_, `Dropbox <https://www.dropbox.com>`_, `Google <https://www.google.com>`_, `Reddit <https://www.reddit.com>`_, `Stripe <https://www.stripe.com>`_, etc. It is battle-tested, scales linearly with usage, and enables developers to focus on what really matters: application features and business logic.
 
 * Arch works with any application language. A single Arch deployment can act as gateway for AI applications written in Python, Java, C++, Go, Php, etc.
 
 * Arch can be deployed and upgraded quickly across your infrastructure transparently without the horrid pain of deploying library upgrades in your applications.
 
-**Engineered with Fast LLMs:** Arch is engineered with specialized tiny LLMs that are desgined for fast, cost-effective and acurrate handling of prompts.
-These LLMs are designed to be best-in-class for critcal prompt-related tasks like:
+**Engineered with Fast LLMs:** Arch is engineered with specialized small LLMs that are designed for fast, cost-effective and accurate handling of prompts.
+These LLMs are designed to be best-in-class for critical prompt-related tasks like:
 
 * **Function Calling:** Arch helps you easily personalize your applications by enabling calls to application-specific (API) operations via user prompts.
   This involves any predefined functions or APIs you want to expose to users to perform tasks, gather information, or manipulate data.
@@ -43,17 +44,12 @@ These LLMs are designed to be best-in-class for critcal prompt-related tasks lik
   With prompt guardrails you can prevent ``jailbreak attempts`` present in user's prompts without having to write a single line of code.
   To learn more about how to configure guardrails available in Arch, read :ref:`Prompt Guard <prompt_guard>`.
 
-* **[Coming Soon] Intent-Markers:** Developers struggle to handle ``follow-up`` or ``clarifying`` questions.
-  Specifically, when users ask for modifications or additions to previous responses their AI applications often generate entirely new responses instead of adjusting the previous ones.
-  Arch offers intent-markers as a feature so that developers know when the user has shifted away from the previous intent so that they can improve their retrieval, lower overall token cost and dramatically improve the speed and accuracy of their responses back to users.
-  For more details :ref:`intent markers <arch_rag_guide>`.
-
 **Traffic Management:** Arch offers several capabilities for LLM calls originating from your applications, including smart retries on errors from upstream LLMs, and automatic cutover to other LLMs configured in Arch for continuous availability and disaster recovery scenarios.
 Arch extends Envoy's `cluster subsystem <https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/cluster_manager>`_ to manage upstream connections to LLMs so that you can build resilient AI applications.
 
-**Front/edge Gateway:** There is substantial benefit in using the same software at the edge (observability, traffic shaping alogirithms, applying guardrails, etc.) as for outbound LLM inference use cases.
+**Front/edge Gateway:** There is substantial benefit in using the same software at the edge (observability, traffic shaping algorithms, applying guardrails, etc.) as for outbound LLM inference use cases.
 Arch has the feature set that makes it exceptionally well suited as an edge gateway for AI applications.
-This includes TLS termination, applying guardrail early in the pricess, intelligent parameter gathering from prompts, and prompt-based routing to backend APIs.
+This includes TLS termination, applying guardrails early in the process, intelligent parameter gathering from prompts, and prompt-based routing to backend APIs.
 
 **Best-In Class Monitoring:** Arch offers several monitoring metrics that help you understand three critical aspects of
 your application: latency, token usage, and error rates by an upstream LLM provider. Latency measures the speed at which
diff --git a/model_server/app/cli.py b/model_server/app/cli.py
index 6a3f81ea..dd6a5679 100644
--- a/model_server/app/cli.py
+++ b/model_server/app/cli.py
@@ -15,11 +15,8 @@ logging.basicConfig(
 log = logging.getLogger("model_server.cli")
 log.setLevel(logging.INFO)
 
-# Path to the file where the server process ID will be stored
-PID_FILE = os.path.join(tempfile.gettempdir(), "model_server.pid")
 
-
-def run_server():
+def run_server(port=51000):
     """Start, stop, or restart the Uvicorn server based on command-line arguments."""
     if len(sys.argv) > 1:
         action = sys.argv[1]
@@ -27,22 +24,18 @@ def run_server():
         action = "start"
 
     if action == "start":
-        start_server()
+        start_server(port)
     elif action == "stop":
-        stop_server()
+        stop_server(port)
     elif action == "restart":
-        restart_server()
+        restart_server(port)
     else:
         log.info(f"Unknown action: {action}")
         sys.exit(1)
 
 
-def start_server():
-    """Start the Uvicorn server and save the process ID."""
-    if os.path.exists(PID_FILE):
-        log.info("Server is already running. Use 'model_server restart' to restart it.")
-        sys.exit(1)
-
+def start_server(port=51000):
+    """Start the Uvicorn server"""
     log.info(
         "Starting model server - loading some awesomeness, this may take some time :)"
     )
@@ -55,7 +48,7 @@ def start_server():
             "--host",
             "0.0.0.0",
             "--port",
-            "51000",
+            f"{port}",
         ],
         start_new_session=True,
         bufsize=1,
@@ -64,10 +57,7 @@ def start_server():
         stderr=subprocess.PIPE,  # Suppress standard error. There is a logger that model_server prints to
     )
 
-    if wait_for_health_check("http://0.0.0.0:51000/healthz"):
-        # Write the process ID to the PID file
-        with open(PID_FILE, "w") as f:
-            f.write(str(process.pid))
+    if wait_for_health_check(f"http://0.0.0.0:{port}/healthz"):
         log.info(f"Model server started with PID {process.pid}")
     else:
         # Add model_server boot-up logs
@@ -89,40 +79,88 @@ def wait_for_health_check(url, timeout=180):
     return False
 
 
-def stop_server():
+def check_and_install_lsof():
+    """Check if lsof is installed, and if not, install it using apt-get."""
+    try:
+        # Check if lsof is installed by running "lsof -v"
+        subprocess.run(
+            ["lsof", "-v"], check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
+        )
+        print("lsof is already installed.")
+    except (subprocess.CalledProcessError, FileNotFoundError):
+        print("lsof not found, installing...")
+        try:
+            # Update package list and install lsof
+            subprocess.run(["sudo", "apt-get", "update"], check=True)
+            subprocess.run(["sudo", "apt-get", "install", "-y", "lsof"], check=True)
+            print("lsof installed successfully.")
+        except subprocess.CalledProcessError as install_error:
+            print(f"Failed to install lsof: {install_error}")
+
+
+def kill_process(port=51000, wait=True, timeout=10):
     """Stop the running Uvicorn server."""
     log.info("Stopping model server")
-    if not os.path.exists(PID_FILE):
-        log.info("Process id file not found, seems like model server was not running")
-        return
-
-    # Read the process ID from the PID file
-    with open(PID_FILE, "r") as f:
-        pid = int(f.read())
-
     try:
-        # Get process by PID
-        process = psutil.Process(pid)
-
-        # Gracefully terminate the process
-        process.terminate()  # Sends SIGTERM by default
-        process.wait(timeout=10)  # Wait for up to 10 seconds for the process to exit
-
-        log.info(f"Model server with PID {pid} stopped.")
-        os.remove(PID_FILE)
-
-    except psutil.NoSuchProcess:
-        log.info(f"Model server with PID {pid} not found. Cleaning up PID file.")
-        os.remove(PID_FILE)
-    except psutil.TimeoutExpired:
-        log.info(
-            f"Model server with PID {pid} did not terminate in time. Forcing shutdown."
+        # Step 1: Ask lsof for sockets listening on exactly this TCP port.
+        # The trailing grep drops the lsof header row and exits non-zero when nothing matches.
+        lsof_command = f"lsof -nP -iTCP:{port} -sTCP:LISTEN | grep -i LISTEN"
+        result = subprocess.run(
+            lsof_command, shell=True, capture_output=True, text=True
         )
-        process.kill()  # Forcefully kill the process
-        os.remove(PID_FILE)
+
+        if result.returncode != 0:
+            print(f"No process found listening on port {port}.")
+            return
+
+        # Step 2: Parse the process IDs from the output
+        process_ids = [line.split()[1] for line in result.stdout.splitlines()]
+
+        if not process_ids:
+            print(f"No process found listening on port {port}.")
+            return
+
+        # Step 3: Kill each process using its PID
+        for pid in process_ids:
+            print(f"Killing model server process with PID {pid}")
+            subprocess.run(f"kill {pid}", shell=True)
+
+            if wait:
+                # Step 4: Wait for the process to be killed by checking if it's still running
+                start_time = time.time()
+
+                while True:
+                    check_process = subprocess.run(
+                        f"ps -p {pid}", shell=True, capture_output=True, text=True
+                    )
+                    if check_process.returncode != 0:
+                        print(f"Process {pid} has been killed.")
+                        break
+
+                    elapsed_time = time.time() - start_time
+                    if elapsed_time > timeout:
+                        print(
+                            f"Process {pid} did not terminate within {timeout} seconds."
+                        )
+                        print(f"Attempting to force kill process {pid}...")
+                        subprocess.run(f"kill -9 {pid}", shell=True)  # SIGKILL
+                        break
+
+                    print(
+                        f"Waiting for process {pid} to be killed... ({elapsed_time:.2f} seconds)"
+                    )
+                    time.sleep(0.5)
+
+    except Exception as e:
+        print(f"Error occurred: {e}")
 
 
-def restart_server():
+def stop_server(port=51000, wait=True, timeout=10):
+    check_and_install_lsof()
+    kill_process(port, wait, timeout)
+
+
+def restart_server(port=51000):
     """Restart the Uvicorn server."""
-    stop_server()
-    start_server()
+    stop_server(port)
+    start_server(port)
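
Reviewer sketch (not part of this patch): Step 2 of `kill_process` takes the second whitespace-separated column of every matching `lsof` line as a PID. A stricter variant is sketched below, assuming the usual `lsof -nP` column layout where a listening TCP row ends in `<address>:<port> (LISTEN)`. The helper name `parse_listening_pids` and the `SAMPLE` text are hypothetical. The helper matches the port exactly and removes duplicate PIDs, which appear when one process listens on both IPv4 and IPv6.

```python
from typing import List

# Hypothetical lsof -nP output, used only to exercise the parser below.
SAMPLE = """COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python3  1234 user    7u  IPv4 0x1      0t0  TCP *:51000 (LISTEN)
python3  1234 user    8u  IPv6 0x2      0t0  TCP *:51000 (LISTEN)
node    51000 user   20u  IPv4 0x3      0t0  TCP 127.0.0.1:3000 (LISTEN)
curl     4321 user    5u  IPv4 0x4      0t0  TCP 127.0.0.1:51000->127.0.0.1:40000 (ESTABLISHED)
"""


def parse_listening_pids(lsof_output: str, port: int) -> List[int]:
    """Return PIDs of processes listening on exactly `port`, in first-seen order."""
    pids: List[int] = []
    for line in lsof_output.splitlines():
        fields = line.split()
        # Keep only rows whose last column marks a listening socket.
        if len(fields) < 3 or fields[-1] != "(LISTEN)":
            continue
        # The address column must end with ":<port>" so 51000 does not match 151000.
        if not fields[-2].endswith(f":{port}"):
            continue
        # The second column is the PID; skip anything that is not numeric.
        if not fields[1].isdigit():
            continue
        pid = int(fields[1])
        if pid not in pids:
            pids.append(pid)
    return pids


if __name__ == "__main__":
    print(parse_listening_pids(SAMPLE, 51000))  # prints [1234]
```

Keeping the parsing in a pure function like this would also let the tests cover it without mocking `subprocess.run`.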
diff --git a/model_server/app/tests/test_cli_stop_server.py b/model_server/app/tests/test_cli_stop_server.py
new file mode 100644
index 00000000..4f3955a7
--- /dev/null
+++ b/model_server/app/tests/test_cli_stop_server.py
@@ -0,0 +1,55 @@
+import unittest
+from unittest.mock import patch, MagicMock
+import subprocess
+import time
+from app.cli import kill_process
+
+
+class TestStopServer(unittest.TestCase):
+    @patch("subprocess.run")
+    def test_stop_server_no_process(self, mock_run):
+        # Mock subprocess.run to simulate no process listening on the port
+        mock_run.return_value.returncode = 1
+        with patch("builtins.print") as mock_print:
+            kill_process(port=51000)
+            mock_print.assert_called_with("No process found listening on port 51000.")
+
+    @patch("subprocess.run")
+    def test_stop_server_process_killed(self, mock_run):
+        # Simulate lsof returning a process id
+        mock_run.side_effect = [
+            MagicMock(returncode=0, stdout="uvicorn 1234 user LISTEN\n"),
+            MagicMock(returncode=0),  # for killing the process
+            MagicMock(returncode=1),  # for checking the process after it is killed
+        ]
+        with patch("builtins.print") as mock_print:
+            kill_process(port=51000, wait=True, timeout=5)
+            mock_print.assert_any_call("Killing model server process with PID 1234")
+            mock_print.assert_any_call("Process 1234 has been killed.")
+
+    @patch("subprocess.run")
+    def test_stop_server_multiple_pids(self, mock_run):
+        # Simulate lsof returning multiple process ids (e.g., 1234 and 5678)
+        mock_run.side_effect = [
+            MagicMock(
+                returncode=0,
+                stdout="uvicorn 1234 user LISTEN\nuvicorn 5678 user LISTEN\n",
+            ),  # lsof output
+            MagicMock(returncode=0),  # first kill command for PID 1234
+            MagicMock(returncode=1),  # PID 1234 is successfully terminated
+            MagicMock(returncode=0),  # second kill command for PID 5678
+            MagicMock(returncode=1),  # PID 5678 is successfully terminated
+        ]
+
+        with patch("builtins.print") as mock_print:
+            kill_process(port=51000, wait=True, timeout=5)
+
+            # Assert that the function tried to kill both PIDs
+            mock_print.assert_any_call("Killing model server process with PID 1234")
+            mock_print.assert_any_call("Process 1234 has been killed.")
+            mock_print.assert_any_call("Killing model server process with PID 5678")
+            mock_print.assert_any_call("Process 5678 has been killed.")
+
+
+if __name__ == "__main__":
+    unittest.main()