From b94f3734cb6b026e18032b6d04b171074c4bddc7 Mon Sep 17 00:00:00 2001
From: alainnothere <164234422+alainnothere@users.noreply.github.com>
Date: Fri, 20 Mar 2026 01:49:36 +0000
Subject: [PATCH] Update README.md

---
 README.md | 47 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 9d5a5dd..bd582d2 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@ I replicated Ng's RYS method and found that duplicating 3 specific layers in Qwe
 # llm-circuit-finder
 
-**Duplicate 3 layers. No training. Logical deduction goes from 0.22 → 0.76.**
+**Duplicate 3 layers. No training. Logical deduction goes from ~~0.22 → 0.76~~.**
 
 This toolkit finds and exploits "reasoning circuits" hidden inside transformer models. The idea: certain contiguous blocks of layers act as indivisible cognitive units. Duplicate them in the forward pass — same weights, no training, no merging — and the model gets measurably smarter on specific capabilities.
 
 Built on [David Ng's RYS method](https://dnhkng.github.io/posts/rys/) and extend
 
 ### Devstral-Small-2-24B: Layers 12, 13, 14 duplicated once
 
-Validated on standard benchmarks via [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) at n=50:
+I ran the full evals on an H200 instance on Vast.ai, comparing the Devstral base model against the surgery model. The surgery is doing something real and specific: it boosts mathematical reasoning and causal reasoning, but at the cost of instruction following and code generation. The model thinks harder but follows directions less precisely.
-| Benchmark | Base | +3 layers | Change |
-|-----------|------|-----------|--------|
-| BBH Logical Deduction | 0.22 | **0.76** | **+245%** |
-| GSM8K (strict) | 0.48 | **0.64** | +33% |
-| MBPP (code gen) | 0.72 | **0.78** | +8% |
-| GSM8K (flexible) | 0.82 | **0.86** | +5% |
-| BBH Navigate | 0.96 | **0.98** | +2% |
-| BBH Date Understanding | 0.82 | **0.84** | +2% |
-| BBH Causal Judgement | 0.66 | 0.66 | — |
-| IFEval (strict) | 0.68 | 0.68 | — |
+In the `results` folder you can find the run outputs under `eval_base` and `eval_surgery`.
+I also added `vastai_rys_eval.sh` to the repo, the script that runs the whole thing end to end on Vast.ai.
+The Vast.ai instance was created with:
+
+    vastai create instance somenumberhere --image vastai/base-image:cuda-12.8.1-cudnn-devel-ubuntu22.04 --disk 80 --direct --ssh
+
+    =================================================================================
+    lm_eval Results Comparison
+    =================================================================================
+    Metric                                             base     rys_12_15   Δ(last-first)
+    ---------------------------------------------------------------------------------
+    bbh/causal_judgement [exact_match]                 0.5775   0.6364      +0.0588
+    bbh/date_understanding [exact_match]               0.9440   0.9000      -0.0440
+    bbh/logical_deduction_five_objects [exact_match]   0.7440   0.7320      -0.0120
+    bbh/navigate [exact_match]                         0.9600   0.9440      -0.0160
+
+    gsm8k_cot [flexible-extract]                       0.8650   0.8787      +0.0136
+    gsm8k_cot [strict-match]                           0.8408   0.8704      +0.0296
+
+    ifeval [inst_level_loose_acc]                      0.7446   0.7206      -0.0240
+    ifeval [inst_level_strict_acc]                     0.6990   0.6595      -0.0396
+    ifeval [prompt_level_loose_acc]                    0.6728   0.6488      -0.0240
+    ifeval [prompt_level_strict_acc]                   0.6229   0.5767      -0.0462
+
+    mbpp [pass_at_1]                                   0.7000   0.6700      -0.0300
+    =================================================================================
+
+    Average (all metrics)                              0.7610   0.7488      -0.0122
-**Average improvement: +8% across all metrics. Nothing degraded.**
 
 ### Qwen2.5-Coder-32B: Layers 7, 8, 9 duplicated once
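The surgery the README describes — re-running a contiguous block of decoder layers with shared weights, no training — can be sketched roughly as follows. This is an illustrative stand-in, not the repo's actual code: the tiny `LlamaConfig`, the layer indices, and the variable names are all assumptions made so the example runs on CPU; the real experiment applies the same idea to Devstral layers 12–14.

```python
# Minimal sketch of layer-duplication "surgery": insert the *same*
# decoder-layer objects into the layer list a second time, so the block
# is re-run in the forward pass with shared weights (no new parameters).
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Tiny randomly initialized model, a stand-in for the real 24B checkpoint.
config = LlamaConfig(
    vocab_size=128,
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=6,
    num_attention_heads=4,
    num_key_value_heads=4,
)
model = LlamaForCausalLM(config)

# Duplicate the contiguous block of layers 2, 3, 4 once, right after itself
# (half-open range [start, end)).
start, end = 2, 5
layers = model.model.layers
new_order = list(layers[:end]) + list(layers[start:end]) + list(layers[end:])
model.model.layers = torch.nn.ModuleList(new_order)
model.config.num_hidden_layers = len(new_order)

# The duplicated entries are the very same modules, so weights are shared.
assert model.model.layers[start] is model.model.layers[end]

# use_cache=False: the duplicated layers share a layer index, which would
# confuse the KV cache during generation without extra handling.
out = model(torch.randint(0, 128, (1, 8)), use_cache=False)
print(out.logits.shape)  # -> torch.Size([1, 8, 128])
```

Note the design constraint this illustrates: because the duplicated entries are the same objects, memory cost is unchanged and only forward compute grows, which is what makes "no training, no merging" possible.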