mirror of
https://github.com/alainnothere/llm-circuit-finder.git
synced 2026-04-24 20:56:21 +02:00
Update README.md
This commit is contained in:
parent
fd18db1568
commit
b94f3734cb
1 changed file with 34 additions and 13 deletions
47
README.md
@@ -3,7 +3,7 @@ I replicated Ng's RYS method and found that duplicating 3 specific layers in Qwe
# llm-circuit-finder
**Duplicate 3 layers. No training. Logical deduction goes from 0.22 → 0.76.**
This toolkit finds and exploits "reasoning circuits" hidden inside transformer models. The idea: certain contiguous blocks of layers act as indivisible cognitive units. Duplicate them in the forward pass — same weights, no training, no merging — and the model gets measurably smarter on specific capabilities.
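As a rough sketch of what "duplicating layers in the forward pass" means (a hypothetical helper for illustration, not necessarily the toolkit's actual API): only the order in which decoder layers execute changes, so the parameter count stays the same while compute increases.

```python
def duplicated_layer_order(n_layers, start, end, repeats=1):
    """Indices of the decoder layers visited in the modified forward pass.

    Layers [start, end) are traversed repeats + 1 times in a row. The
    weights are shared, not copied, and nothing is trained: only the
    execution order (and therefore compute) changes.
    """
    block = list(range(start, end))
    return list(range(start)) + block * (repeats + 1) + list(range(end, n_layers))

# Devstral example: duplicate layers 12, 13, 14 once
# (40 decoder layers is an assumption for illustration).
order = duplicated_layer_order(40, 12, 15)
print(order[10:18])  # [10, 11, 12, 13, 14, 12, 13, 14]
```

In practice, for a Hugging Face-style model, this amounts to rebuilding the decoder's layer list as an `nn.ModuleList` containing the same layer modules in this order; since the entries reference the same modules, no weights are duplicated in memory.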
@@ -13,20 +13,41 @@ Built on [David Ng's RYS method](https://dnhkng.github.io/posts/rys/) and extend
### Devstral-Small-2-24B: Layers 12, 13, 14 duplicated once
Validated on standard benchmarks via [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) at n=50:
I ran the full tests on an H200 instance on Vast.ai, comparing the Devstral base model against the surgery model. The results show the surgery is doing something real and specific: it boosts mathematical reasoning and causal reasoning, but at the cost of instruction following and code generation. The model thinks harder but follows directions less precisely.
| Benchmark | Base | +3 layers | Change |
|-----------|------|-----------|--------|
| BBH Logical Deduction | 0.22 | **0.76** | **+245%** |
| GSM8K (strict) | 0.48 | **0.64** | +33% |
| MBPP (code gen) | 0.72 | **0.78** | +8% |
| GSM8K (flexible) | 0.82 | **0.86** | +5% |
| BBH Navigate | 0.96 | **0.98** | +2% |
| BBH Date Understanding | 0.82 | **0.84** | +2% |
| BBH Causal Judgement | 0.66 | 0.66 | — |
| IFEval (strict) | 0.68 | 0.68 | — |
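The Change column is the relative change, (after - base) / base, rounded to the nearest percent; a quick sanity check of the arithmetic:

```python
def pct_change(base, after):
    """Relative change in percent, rounded to the nearest integer."""
    return round(100 * (after - base) / base)

assert pct_change(0.22, 0.76) == 245  # BBH Logical Deduction
assert pct_change(0.48, 0.64) == 33   # GSM8K (strict)
assert pct_change(0.72, 0.78) == 8    # MBPP (code gen)
```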
In the results folder you can find the outputs under `eval_base` and `eval_surgery`.
The repo also includes `vastai_rys_eval.sh`, the script used to run the whole pipeline on Vast.ai.
The Vast.ai instance was created with:

    vastai create instance somenumberhere --image vastai/base-image:cuda-12.8.1-cudnn-devel-ubuntu22.04 --disk 80 --direct --ssh
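Inside the instance, an evaluation run with lm-evaluation-harness looks roughly like this (model path, task list, and batch size here are illustrative assumptions; see `vastai_rys_eval.sh` in the repo for the actual invocation):

```shell
# Illustrative lm_eval invocation, not the exact command from the script.
# --limit 50 caps each task at 50 examples; --output_path stores JSON results.
lm_eval --model hf \
  --model_args pretrained=./devstral-base \
  --tasks bbh,gsm8k_cot,ifeval,mbpp \
  --batch_size 8 \
  --limit 50 \
  --output_path results/eval_base
```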
**lm_eval Results Comparison** (Δ = rys_12_15 - base)

| Metric | base | rys_12_15 | Δ |
|--------|------|-----------|---|
| bbh/causal_judgement [exact_match] | 0.5775 | 0.6364 | +0.0588 |
| bbh/date_understanding [exact_match] | 0.9440 | 0.9000 | -0.0440 |
| bbh/logical_deduction_five_objects [exact_match] | 0.7440 | 0.7320 | -0.0120 |
| bbh/navigate [exact_match] | 0.9600 | 0.9440 | -0.0160 |
| gsm8k_cot [flexible-extract] | 0.8650 | 0.8787 | +0.0136 |
| gsm8k_cot [strict-match] | 0.8408 | 0.8704 | +0.0296 |
| ifeval [inst_level_loose_acc] | 0.7446 | 0.7206 | -0.0240 |
| ifeval [inst_level_strict_acc] | 0.6990 | 0.6595 | -0.0396 |
| ifeval [prompt_level_loose_acc] | 0.6728 | 0.6488 | -0.0240 |
| ifeval [prompt_level_strict_acc] | 0.6229 | 0.5767 | -0.0462 |
| mbpp [pass_at_1] | 0.7000 | 0.6700 | -0.0300 |
| **Average (all metrics)** | 0.7610 | 0.7488 | -0.0122 |
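The Average row is the unweighted mean over the eleven metrics; recomputing it from the per-metric scores (the Δ in the table is taken between the rounded averages):

```python
base = [0.5775, 0.9440, 0.7440, 0.9600, 0.8650, 0.8408,
        0.7446, 0.6990, 0.6728, 0.6229, 0.7000]
rys  = [0.6364, 0.9000, 0.7320, 0.9440, 0.8787, 0.8704,
        0.7206, 0.6595, 0.6488, 0.5767, 0.6700]

def avg(xs):
    return sum(xs) / len(xs)

print(round(avg(base), 4))  # 0.761
print(round(avg(rys), 4))   # 0.7488
print(round(round(avg(rys), 4) - round(avg(base), 4), 4))  # -0.0122
```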
**In the n=50 run, the average improvement was +8% with nothing degraded; the full comparison above shows a more mixed trade-off.**
### Qwen2.5-Coder-32B: Layers 7, 8, 9 duplicated once