Initial commit: cuGenOpt GPU optimization solver

2026-06-20 20:58:06 +02:00 · 2026-03-20 00:33:45 +08:00 · 2026-03-20 00:33:45 +08:00 · fc5a0ff4af
commit fc5a0ff4af
117 changed files with 25545 additions and 0 deletions
--- a/skills/cugenopt-problem-gen/SKILL.md
+++ b/skills/cugenopt-problem-gen/SKILL.md
@@ -0,0 +1,290 @@
+---
+name: cugenopt-problem-gen
+description: Generate cuGenOpt GPU-accelerated optimization Problem definitions from natural language descriptions. Use when the user describes a combinatorial optimization problem (TSP, VRP, knapsack, scheduling, assignment, graph coloring, bin packing, etc.) and wants to solve it with cuGenOpt, or asks to generate, create, model, or run an optimization problem on GPU.
+---
+
+# cuGenOpt Problem Generator
+
+Generate cuGenOpt Problem definitions from natural language descriptions. Users describe their optimization problem in plain language; this Skill produces a compilable `problem.cuh` and `main.cu`, and optionally compiles and runs them.
+
+## When to Use
+
+Activate this Skill when the user:
+- Describes a combinatorial optimization problem and wants to solve it with cuGenOpt
+- Asks to "generate", "create", "write", or "define" a cuGenOpt Problem
+- Asks to "solve", "run", or "try" an optimization problem on GPU
+- Mentions problem types like TSP, VRP, knapsack, scheduling, assignment, bin packing, graph coloring, etc.
+- Asks "how to express / model [problem] in cuGenOpt"
+
+## Prerequisites
+
+The cuGenOpt framework source must be accessible. Locate it by searching for `prototype/core/solver.cuh`:
+```
+FRAMEWORK_ROOT = directory containing prototype/core/solver.cuh
+```
+If not found, ask the user for the framework location.
+
+## Workflow
+
+### Step 0: Determine Execution Depth
+
+Infer from the user's wording how far to go:
+
+| Signal | Intent | Depth |
+|--------|--------|-------|
+| "generate" / "write" / "define" | Code only | Phase 1 → 2 |
+| "solve" / "run" / "try" / "see results" | Full run | Phase 1 → 2 → 3 |
+| Provides data (CSV, array, file path) | Full run | Phase 1 → 2 → 3 |
+| "model" / "express" / "how to" | Code + explanation | Phase 1 → 2 |
+| Ambiguous | Code, then ask | Phase 1 → 2, ask |
+
+**Principle**: when in doubt, ask rather than silently consuming GPU.
+
+### Step 1: Analyze the Problem (Phase 1)
+
+Extract from the user's description:
+
+1. **Decision variables** — what is being decided?
+2. **Encoding type** — use the decision tree below
+3. **Dimensions** — D1, D2, dim1, dim2_default, total_elements
+4. **Objective(s)** — minimize/maximize what?
+5. **Constraints** — what makes a solution infeasible?
+6. **Data** — what input data is needed?
+
+#### Encoding Decision Tree
+
+```
+What are the decision variables?
+├── Ordering / assignment (no repeats) → Permutation
+│   ├── Single sequence (TSP, QAP, assignment) → RowMode::Single, D1=1
+│   ├── Multiple equal-length sequences (JSP) → RowMode::Fixed, D1=num_machines
+│   └── Variable-length partitions (VRP, VRPTW) → RowMode::Partition, D1=max_vehicles
+├── Select / not-select (0-1) → Binary
+│   └── Knapsack, scheduling, subset selection → RowMode::Single, D1=1
+└── Bounded integer (multi-level) → Integer
+    ├── Single row (graph coloring, load balance) → RowMode::Single, D1=1
+    └── Multiple equal rows (multi-machine) → RowMode::Fixed, D1=num_machines
+```
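The decision tree above can be read as a small lookup. The following is a minimal host-side sketch, not framework code: the enums `Encoding`, `Rows`, `VarKind`, and `Layout` and the function `choose_encoding` are hypothetical stand-ins introduced only for illustration, and combinations the tree does not list return an empty result.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-ins for illustration; the framework's own types are not reproduced here.
enum class Encoding { Permutation, Binary, Integer };
enum class Rows { Single, Fixed, Partition };

// Traits extracted from the user's description (names are assumptions).
enum class VarKind { Ordering, YesNo, BoundedInt };
enum class Layout { OneSequence, EqualLengthRows, VariableLengthRows };

struct Choice {
    Encoding encoding;
    Rows row_mode;
};

// Returns the encoding and row mode for combinations listed in the tree,
// or std::nullopt for combinations the tree does not cover.
std::optional<Choice> choose_encoding(VarKind kind, Layout layout) {
    switch (kind) {
        case VarKind::Ordering:
            if (layout == Layout::OneSequence)     return Choice{Encoding::Permutation, Rows::Single};
            if (layout == Layout::EqualLengthRows) return Choice{Encoding::Permutation, Rows::Fixed};
            return Choice{Encoding::Permutation, Rows::Partition};
        case VarKind::YesNo:
            if (layout == Layout::OneSequence)     return Choice{Encoding::Binary, Rows::Single};
            return std::nullopt;
        case VarKind::BoundedInt:
            if (layout == Layout::OneSequence)     return Choice{Encoding::Integer, Rows::Single};
            if (layout == Layout::EqualLengthRows) return Choice{Encoding::Integer, Rows::Fixed};
            return std::nullopt;
    }
    return std::nullopt;
}
```

For a VRP-style description (ordering, variable-length rows) the sketch yields Permutation with Partition; an uncovered combination is a signal to ask the user rather than guess.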
+
+#### Dimension Calculation
+
+| Parameter | Rule |
+|-----------|------|
+| `D1` | Template max rows. Single=1; VRP=max_vehicles; JSP=num_machines. Round up to power of 2 if > 1. |
+| `D2` | Template max columns. Single=num_elements; Partition=max(num_elements/D1*2, 64). Round up to power of 2. |
+| `dim1` | Actual rows at runtime. ≤ D1. |
+| `dim2_default` | Single/Fixed: num_elements; Partition: 0 (framework allocates). |
+| `total_elements` | Partition only: total number of elements to distribute across rows. |
+
+#### Confirmation Strategy
+
+| Complexity | Criteria | Action |
+|------------|----------|--------|
+| **Low** | Standard problem with direct reference (TSP, knapsack, assignment) | Proceed without pause |
+| **Medium** | Custom constraints (prioritized VRP, skill-matched scheduling) | Generate code, then summarize logic in natural language for user confirmation |
+| **High** | Multi-objective, ambiguous description, non-standard encoding | Output problem spec first for confirmation, then generate code |
+
+For medium/high complexity, summarize like:
+> "Objective: minimize total distance. Constraints: each vehicle ≤ 100 capacity, penalty = 100 × excess. Encoding: Permutation with Partition (5 vehicles, 30 customers)."
+
+The user confirms the **logic summary**, not the code.
+
+### Step 2: Generate Code (Phase 2)
+
+Generate two files in the user's chosen directory (default: a new folder under the workspace):
+
+#### problem.cuh
+
+```cuda
+#pragma once
+#include "core/types.cuh"
+#include "core/cuda_utils.cuh"
+#include "core/operators.cuh"
+
+struct MyProblem : ProblemBase<MyProblem, D1, D2> {
+    // GPU data pointers
+    const float* d_data;
+    int n;
+
+    // === Objective definition ===
+    static constexpr ObjDef OBJ_DEFS[] = {
+        {ObjDir::Minimize, 1.0f, 0.0f},
+    };
+
+    __device__ float compute_obj(int idx, const Sol& sol) const {
+        switch (idx) {
+            case 0: return /* objective calculation */;
+            default: return 0.0f;
+        }
+    }
+
+    __device__ float compute_penalty(const Sol& sol) const {
+        return /* 0.0f if no constraints, else penalty value */;
+    }
+
+    ProblemConfig config() const {
+        ProblemConfig cfg;
+        cfg.encoding = EncodingType::/* Permutation | Binary | Integer */;
+        cfg.dim1 = /* actual rows */;
+        cfg.dim2_default = /* actual columns */;
+        fill_obj_config(cfg);
+        // For multi-row:
+        // cfg.row_mode = RowMode::/* Fixed | Partition */;
+        // cfg.cross_row_prob = 0.3f;
+        // cfg.total_elements = n;  // Partition only
+        return cfg;
+    }
+
+    // === Shared memory (when data fits) ===
+    size_t shared_mem_bytes() const {
+        return /* actual size in bytes of the data to cache; the framework falls back to global memory if it does not fit */;
+    }
+
+    size_t working_set_bytes() const {
+        return /* actual data size in bytes (for L2 cache estimation) */;
+    }
+
+    __device__ void load_shared(char* smem, int tid, int bsz) {
+        // Copy d_data into shared memory, then redirect pointer
+    }
+
+    // === Factory ===
+    static MyProblem create(/* host data params */) {
+        MyProblem prob;
+        // cudaMalloc + cudaMemcpy for each data array
+        return prob;
+    }
+
+    void destroy() {
+        // cudaFree each GPU pointer
+    }
+};
+```
+
+#### main.cu
+
+```cuda
+#include "core/solver.cuh"
+#include "problem.cuh"
+#include <cstdio>
+
+int main() {
+    // 1. Prepare data (hardcoded or read from file)
+
+    // 2. Create problem
+    auto prob = MyProblem::create(/* ... */);
+
+    // 3. Configure solver
+    SolverConfig scfg;
+    scfg.time_limit_sec = 30.0f;
+    scfg.use_aos = true;
+    scfg.verbose = true;
+
+    // 4. Solve
+    auto result = solve(prob, scfg);
+
+    // 5. Print results
+    printf("Best objective: %.4f\n", result.best_solution.objectives[0]);
+    printf("Penalty: %.4f\n", result.best_solution.penalty);
+    printf("Generations: %d, Time: %.1f ms\n", result.generations, result.elapsed_ms);
+
+    // 6. Print solution details
+    // ...
+
+    prob.destroy();
+    return 0;
+}
+```
+
+**Code generation rules**:
+- Read `reference/problem-api.md` for the full ProblemBase interface specification
+- Read `reference/encoding-guide.md` for encoding selection and dimension calculation details
+- Read `reference/examples.md` for end-to-end examples of natural language → code
+- `shared_mem_bytes()` should return the actual data size; the framework raises the shared memory limit via `cudaFuncSetAttribute` when the size exceeds 48KB and falls back to global memory when it exceeds the GPU maximum
+- `working_set_bytes()` should always return the actual data size (used for population sizing)
+- For Partition encoding, set `cfg.dim2_default = 0` and `cfg.total_elements = n`
+- Use `CUDA_CHECK()` macro for all CUDA API calls
+- Penalty should be proportional to constraint violation magnitude
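The last rule, a penalty proportional to the size of the violation, can be sketched on the host as follows. The helper name `overflow_penalty` and the `weight` parameter are assumptions for illustration; the weight of 100 mirrors the "penalty = 100 × excess" wording used in the confirmation summary earlier in this Skill.

```cpp
#include <cassert>

// Penalty that grows linearly with the amount by which a limit is exceeded.
// `weight` is a hypothetical tuning constant chosen per problem.
float overflow_penalty(float used, float limit, float weight) {
    assert(weight >= 0.0f);
    float excess = used - limit;
    // Feasible solutions (excess <= 0) receive no penalty at all.
    return excess > 0.0f ? excess * weight : 0.0f;
}
```

A load of 120 against a capacity of 100 with weight 100 gives a penalty of 2000, while any load at or below 100 gives 0, so the solver sees a gradient toward feasibility instead of a flat "infeasible" flag.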
+
+### Step 3: Validate & Run (Phase 3)
+
+Only enter this phase if execution depth requires it.
+
+#### 3a. Environment Detection
+
+```
+1. Run: nvidia-smi
+   ├── Success → local/direct mode, go to 3b
+   └── Failure → no local GPU
+       └── Ask: "No GPU detected locally. Do you have a remote GPU server?"
+           ├── User provides SSH info → remote mode
+           │   - scp files to remote
+           │   - Run all commands via ssh <host> "..."
+           ├── User says "I use Cursor Remote SSH" → suggest switching to remote window
+           └── No GPU → deliver code only, print compile command
+
+2. Detect compute capability:
+   nvidia-smi --query-gpu=compute_cap --format=csv,noheader → e.g. 8.6 → sm_86 (substitute for sm_XX below)
+```
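The compute capability query prints a value of the form major.minor, which has to be turned into the `sm_XX` value passed to `-arch`. A minimal sketch of that conversion, assuming the query output is a single line such as `8.6`; the function name `arch_from_compute_cap` is hypothetical.

```cpp
#include <cassert>
#include <string>

// Converts a compute capability string such as "8.6" into an nvcc arch value such as "sm_86".
// Any non-digit characters (the dot, spaces, a trailing newline) are dropped.
std::string arch_from_compute_cap(const std::string& cap) {
    std::string digits;
    for (char ch : cap) {
        if (ch >= '0' && ch <= '9') digits += ch;
    }
    assert(!digits.empty());  // the input must contain at least one digit
    return "sm_" + digits;
}
```

With several GPUs the query prints one line per device; this sketch handles a single line, so the caller would pick the device it intends to run on first.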
+
+#### 3b. Compile
+
+```bash
+nvcc -O2 -std=c++17 --extended-lambda \
+     -arch=sm_XX \
+     -I FRAMEWORK_ROOT/prototype \
+     -I FRAMEWORK_ROOT/prototype/core \
+     main.cu -o solve
+```
+
+The two `-I` paths are both needed: `prototype/` for `#include "core/solver.cuh"` in user code, and `prototype/core/` for `#include "types.cuh"` inside framework headers.
+
+If `nvcc` not found:
+- Check `ls /usr/local/cuda*/bin/nvcc` — if found, fix PATH
+- Otherwise ask: "nvcc not found. Want me to help configure CUDA?"
+  - Check and guide: CUDA Toolkit install, g++ install, driver compatibility
+
+If compilation fails: read error, fix code, retry (up to 3 attempts).
+
+#### 3c. Run & Verify
+
+```bash
+./solve
+```
+
+Verification checklist:
+| Check | Method | Pass |
+|-------|--------|------|
+| No crash | exit code 0 | ✓ |
+| penalty = 0 | check output | Feasible solution |
+| Objective reasonable | compare to known/estimated optimum | Within expected range |
+
+If penalty > 0: report constraint violations, suggest adjusting penalty weights or constraint logic.
+
+#### 3d. Remote Mode
+
+When using a remote GPU server:
+1. Check if framework exists on remote: `ssh <host> "ls <path>/prototype/core/solver.cuh"`
+   - Not found → `scp -r FRAMEWORK_ROOT/prototype <host>:<remote_path>/`
+2. Upload generated files: `scp problem.cuh main.cu <host>:<remote_path>/`
+3. Compile and run via SSH: `ssh <host> "cd <remote_path> && nvcc ... && ./solve"`
+4. Capture stdout for results
+
+### Common Errors
+
+| Error | Cause | Fix |
+|-------|-------|-----|
+| `nvcc: command not found` | CUDA not installed or not in PATH | Trigger environment setup |
+| `no supported gcc/g++ host compiler` | g++ version incompatible | Install compatible version |
+| D2 too small | Elements exceed template parameter | Increase D2 (power of 2) |
+| Penalty always > 0 | Constraints too tight or penalty logic wrong | Review `compute_penalty` |
+| Objective is 0 or NaN | Data not loaded to GPU | Check `cudaMalloc`/`cudaMemcpy` in `create()` |
+| Very low gens/s | Data matrix too large for shared memory | Check `shared_mem_bytes` / `working_set_bytes` |
+
+## Reference Files
+
+For detailed specifications, read these files in the `reference/` directory alongside this SKILL.md:
+
+- **`problem-api.md`** — Complete ProblemBase interface: every method, its signature, when to implement it, and gotchas
+- **`encoding-guide.md`** — Encoding type selection, dimension calculation rules, RowMode details, shared memory sizing
+- **`examples.md`** — 4 end-to-end examples: natural language input → analysis → generated code
--- a/skills/cugenopt-problem-gen/reference/encoding-guide.md
+++ b/skills/cugenopt-problem-gen/reference/encoding-guide.md
@@ -0,0 +1,155 @@
+# Encoding Selection & Dimension Guide
+
+## Encoding Types
+
+cuGenOpt supports three encoding types. Choose based on the nature of the decision variables.
+
+### Permutation
+
+**Use when**: each element appears exactly once (ordering/assignment).
+
+| Scenario | RowMode | D1 | D2 | dim2_default | total_elements |
+|----------|---------|----|----|-------------|----------------|
+| TSP (n cities) | Single | 1 | next_pow2(n) | n | — |
+| QAP (n facilities) | Single | 1 | next_pow2(n) | n | — |
+| Assignment (n tasks) | Single | 1 | next_pow2(n) | n | — |
+| JSP (m machines, j jobs) | Fixed | next_pow2(m) | next_pow2(j) | j | — |
+| VRP (k vehicles, n customers) | Partition | next_pow2(k) | max(next_pow2(n/k*2), 64) | 0 | n |
+| VRPTW (k vehicles, n customers) | Partition | next_pow2(k) | max(next_pow2(n/k*2), 64) | 0 | n |
+
+**Partition specifics**:
+- `dim2_default = 0` tells the framework to distribute elements across rows
+- `total_elements = n` is the count of elements to distribute
+- `cross_row_prob` controls how often cross-row operators fire (typically 0.2–0.4)
+- Elements are customer/job indices `0..n-1`; depot/source is implicit (not in the solution)
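Because the depot is not stored in the solution, evaluation code has to shift customer indices when it looks up distances. The following host-side sketch assumes the layout used by the VRP example in `examples.md`: an (n+1)×(n+1) row-major distance matrix whose node 0 is the depot, so customer `i` maps to node `i + 1`. The function name `route_length` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Length of one route when customers are stored as 0..n-1 and the depot is
// node 0 of an (n+1)x(n+1) row-major distance matrix (assumed layout).
float route_length(const std::vector<float>& dist, int n, const std::vector<int>& route) {
    const int stride = n + 1;
    assert(dist.size() == static_cast<std::size_t>(stride) * stride);
    if (route.empty()) return 0.0f;          // an unused vehicle travels nowhere
    float total = 0.0f;
    int prev = 0;                            // start at the depot
    for (int customer : route) {
        assert(customer >= 0 && customer < n);
        int node = customer + 1;             // shift past the depot row/column
        total += dist[prev * stride + node];
        prev = node;
    }
    return total + dist[prev * stride + 0];  // close the tour back at the depot
}
```

Running the same calculation on the host for the solver's best solution is a cheap way to cross-check the objective value reported by the GPU.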
+
+### Binary
+
+**Use when**: each position is a yes/no decision.
+
+| Scenario | RowMode | D1 | D2 | dim2_default |
+|----------|---------|----|----|-------------|
+| 0-1 Knapsack (n items) | Single | 1 | next_pow2(n) | n |
+| Scheduling (n shifts) | Single | 1 | next_pow2(n) | n |
+| Subset selection (n candidates) | Single | 1 | next_pow2(n) | n |
+| Multi-row scheduling (m workers, n shifts) | Fixed | next_pow2(m) | next_pow2(n) | n |
+
+**Solution values**: `sol.data[row][col]` is 0 or 1.
+
+### Integer
+
+**Use when**: each position takes a bounded integer value.
+
+| Scenario | RowMode | D1 | D2 | dim2_default | lower_bound | upper_bound |
+|----------|---------|----|----|-------------|-------------|-------------|
+| Graph coloring (n nodes, c colors) | Single | 1 | next_pow2(n) | n | 0 | c-1 |
+| Load balancing (n tasks, m machines) | Single | 1 | next_pow2(n) | n | 0 | m-1 |
+| Multi-machine scheduling | Fixed | next_pow2(m) | next_pow2(j) | j | 0 | max_time |
+
+**Solution values**: `sol.data[row][col]` is in `[value_lower_bound, value_upper_bound]`.
+
+Set bounds in config:
+```cuda
+cfg.value_lower_bound = 0;
+cfg.value_upper_bound = num_colors - 1;
+```
+
+## Dimension Calculation Rules
+
+### D1 and D2 (Template Parameters)
+
+These are **compile-time constants** and define the maximum capacity:
+- Must be sufficient for the largest instance you plan to solve
+- Power of 2 is recommended for memory alignment
+- Larger values waste registers/memory; keep as small as possible
+
+```
+next_pow2(x):
+  1→1, 2→2, 3→4, 5→8, 9→16, 17→32, 33→64, 65→128, ...
+```
+
+### dim1 and dim2_default (Runtime Parameters)
+
+Set in `config()` to the actual problem size:
+- `dim1 ≤ D1`: actual number of rows used
+- `dim2_default ≤ D2`: actual number of columns per row
+- For Partition mode: `dim2_default = 0` (framework handles distribution)
+
+### Choosing D2 for Partition Mode
+
+Since rows have variable length, D2 must accommodate the longest possible row:
+```
+D2 = max(next_pow2(total_elements / D1 * 2), 64)
+```
+The `*2` factor provides headroom for unbalanced distributions.
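The two rules above, `next_pow2` and the Partition-mode D2 formula, can be sketched on the host as follows. The function names mirror the guide's notation but are not framework APIs, and the use of integer division for `total_elements / D1` is an assumption, since the guide does not say how the quotient is rounded.

```cpp
#include <algorithm>
#include <cassert>

// Smallest power of two that is >= x, for x >= 1.
int next_pow2(int x) {
    assert(x >= 1);
    int p = 1;
    while (p < x) p *= 2;
    return p;
}

// D2 for Partition mode: twice the average row length, rounded up to a
// power of two, and never below 64. Integer division is an assumption.
int partition_d2(int total_elements, int d1) {
    assert(total_elements >= 1 && d1 >= 1);
    int doubled_avg = std::max(total_elements / d1 * 2, 1);
    return std::max(next_pow2(doubled_avg), 64);
}
```

For 30 customers and D1 = 4 the doubled average is 14, which rounds up to 16 and is then lifted to the floor of 64. The floor only stops mattering for larger instances: 1000 customers with D1 = 8 give 250, which rounds up to 256.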
+
+## Shared Memory Sizing
+
+### When to Use Shared Memory
+
+Shared memory provides ~10x faster access than global memory. Use it when:
+- Problem has a data matrix (distance, cost, weight)
+- The matrix is accessed repeatedly during objective/penalty evaluation
+
+### How to Size
+
+Report the **actual** data size. The framework handles the rest:
+
+```cuda
+size_t shared_mem_bytes() const {
+    // Distance matrix + demand array
+    return (size_t)stride * stride * sizeof(float) + (size_t)n * sizeof(float);
+}
+```
+
+The framework automatically:
+1. If ≤ 48KB: uses default shared memory
+2. If 48KB–max_smem: calls `cudaFuncSetAttribute` to extend (GPU-dependent max: T4=64KB, V100=96KB, A100/A800=164KB, H100=228KB)
+3. If > max_smem: falls back to global memory, uses `working_set_bytes()` for L2 cache population sizing
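To see which of the three cases a given instance lands in, the thresholds can be applied to the reported size by hand. This is a minimal sketch of that arithmetic, not the framework's implementation: `SmemTier`, `classify_smem`, and `matrix_bytes` are hypothetical names, and the per-GPU maximum is assumed to be supplied by the caller.

```cpp
#include <cassert>
#include <cstddef>

enum class SmemTier { Default, Extended, GlobalFallback };  // hypothetical names

// Applies the three cases listed above to a requested size.
// max_smem_bytes is the per-GPU maximum (assumed to be known to the caller).
SmemTier classify_smem(std::size_t requested, std::size_t max_smem_bytes) {
    const std::size_t kDefaultLimit = 48 * 1024;
    assert(max_smem_bytes >= kDefaultLimit);
    if (requested <= kDefaultLimit) return SmemTier::Default;
    if (requested <= max_smem_bytes) return SmemTier::Extended;
    return SmemTier::GlobalFallback;
}

// Size in bytes of a square float matrix with `nodes` rows and columns.
std::size_t matrix_bytes(std::size_t nodes) {
    return nodes * nodes * sizeof(float);
}
```

A 31-node VRP (distance matrix plus 30 demands) needs 3964 bytes and stays in the default case. A 128-node matrix is exactly 64KB, which needs the extended limit. A 500-node matrix is about 1MB, which exceeds every maximum listed above and ends up in global memory.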
+
+### working_set_bytes
+
+Always return the actual data size, regardless of whether it fits in shared memory:
+
+```cuda
+size_t working_set_bytes() const {
+    return (size_t)n * n * sizeof(float);
+}
+```
+
+This is used by the framework to auto-calculate population size based on L2 cache capacity.
+
+## RowMode Details
+
+### Single (default)
+- `dim1 = 1`, single row of elements
+- No cross-row operators
+- Simplest and most common
+
+### Fixed
+- `dim1 > 1`, all rows have the same length (`dim2_default`)
+- Cross-row operators: ROW_SWAP, ROW_REVERSE
+- No SPLIT/MERGE (rows cannot change length)
+- Use for: JSP (machines × jobs), multi-worker scheduling
+
+### Partition
+- `dim1 > 1`, rows have variable length
+- Elements are distributed across rows (total count = `total_elements`)
+- Cross-row operators: CROSS_RELOCATE, CROSS_SWAP, SEG_RELOCATE, SEG_SWAP, CROSS_EXCHANGE, SPLIT, MERGE
+- `cross_row_prob` controls the probability of selecting cross-row operators
+- Use for: VRP (vehicles × customers), any partitioning problem
+
+## Quick Reference: Problem → Config
+
+| Problem | Encoding | RowMode | D1 | D2 | cross_row_prob |
+|---------|----------|---------|----|----|---------------|
+| TSP-50 | Perm | Single | 1 | 64 | 0 |
+| TSP-500 | Perm | Single | 1 | 512 | 0 |
+| QAP-15 | Perm | Single | 1 | 16 | 0 |
+| Assignment-12 | Perm | Single | 1 | 16 | 0 |
+| VRP-30-4v | Perm | Partition | 4 | 64 | 0.3 |
+| VRPTW-100-25v | Perm | Partition | 32 | 64 | 0.3 |
+| Knapsack-100 | Binary | Single | 1 | 128 | 0 |
+| Scheduling-20 | Binary | Single | 1 | 32 | 0 |
+| Graph Color-50 | Integer | Single | 1 | 64 | 0 |
+| JSP-6m-6j | Perm | Fixed | 8 | 8 | 0.2 |
--- a/skills/cugenopt-problem-gen/reference/examples.md
+++ b/skills/cugenopt-problem-gen/reference/examples.md
@@ -0,0 +1,621 @@
+# End-to-End Examples
+
+Four complete examples from natural language description to generated code.
+
+---
+
+## Example 1: 0-1 Knapsack (Low Complexity)
+
+### User Input
+> "I have 8 items with weights [2,3,4,5,9,7,8,6] and values [3,4,5,8,10,7,9,6]. Knapsack capacity is 20. Maximize total value."
+
+### Analysis
+- **Decision**: select or not → **Binary**
+- **RowMode**: Single (D1=1)
+- **D2**: next_pow2(8) = 8
+- **Objective**: Maximize total value
+- **Constraint**: total weight ≤ 20
+- **Complexity**: Low (standard knapsack, direct reference)
+
+### Generated: problem.cuh
+
+```cuda
+#pragma once
+#include "core/types.cuh"
+#include "core/cuda_utils.cuh"
+#include "core/operators.cuh"
+
+struct Knapsack8 : ProblemBase<Knapsack8, 1, 8> {
+    const float* d_weights;
+    const float* d_values;
+    float capacity;
+    int n;
+
+    __device__ float calc_total_value(const Sol& sol) const {
+        float tv = 0.0f;
+        const int* sel = sol.data[0];
+        for (int i = 0; i < n; i++)
+            if (sel[i]) tv += d_values[i];
+        return tv;
+    }
+
+    static constexpr ObjDef OBJ_DEFS[] = {
+        {ObjDir::Maximize, 1.0f, 0.0f},
+    };
+
+    __device__ float compute_obj(int idx, const Sol& sol) const {
+        switch (idx) {
+            case 0: return calc_total_value(sol);
+            default: return 0.0f;
+        }
+    }
+
+    __device__ float compute_penalty(const Sol& sol) const {
+        float tw = 0.0f;
+        const int* sel = sol.data[0];
+        for (int i = 0; i < n; i++)
+            if (sel[i]) tw += d_weights[i];
+        float over = tw - capacity;
+        return (over > 0.0f) ? over * 50.0f : 0.0f;
+    }
+
+    ProblemConfig config() const {
+        ProblemConfig cfg;
+        cfg.encoding = EncodingType::Binary;
+        cfg.dim1 = 1;
+        cfg.dim2_default = n;
+        fill_obj_config(cfg);
+        return cfg;
+    }
+
+    size_t shared_mem_bytes() const {
+        return 2 * (size_t)n * sizeof(float);
+    }
+
+    size_t working_set_bytes() const {
+        return 2 * (size_t)n * sizeof(float);
+    }
+
+    __device__ void load_shared(char* smem, int tid, int bsz) {
+        float* sw = reinterpret_cast<float*>(smem);
+        float* sv = sw + n;
+        for (int i = tid; i < n; i += bsz) {
+            sw[i] = d_weights[i];
+            sv[i] = d_values[i];
+        }
+        d_weights = sw;
+        d_values = sv;
+    }
+
+    static Knapsack8 create(const float* hw, const float* hv, int n, float cap) {
+        Knapsack8 prob;
+        prob.n = n;
+        prob.capacity = cap;
+        float *dw, *dv;
+        CUDA_CHECK(cudaMalloc(&dw, sizeof(float) * n));
+        CUDA_CHECK(cudaMalloc(&dv, sizeof(float) * n));
+        CUDA_CHECK(cudaMemcpy(dw, hw, sizeof(float) * n, cudaMemcpyHostToDevice));
+        CUDA_CHECK(cudaMemcpy(dv, hv, sizeof(float) * n, cudaMemcpyHostToDevice));
+        prob.d_weights = dw;
+        prob.d_values = dv;
+        return prob;
+    }
+
+    void destroy() {
+        if (d_weights) cudaFree(const_cast<float*>(d_weights));
+        if (d_values) cudaFree(const_cast<float*>(d_values));
+        d_weights = nullptr;
+        d_values = nullptr;
+    }
+};
+```
+
+### Generated: main.cu
+
+```cuda
+#include "core/solver.cuh"
+#include "problem.cuh"
+#include <cstdio>
+
+int main() {
+    const int n = 8;
+    float weights[] = {2, 3, 4, 5, 9, 7, 8, 6};
+    float values[]  = {3, 4, 5, 8, 10, 7, 9, 6};
+    float capacity = 20.0f;
+
+    auto prob = Knapsack8::create(weights, values, n, capacity);
+
+    SolverConfig scfg;
+    scfg.time_limit_sec = 5.0f;
+    scfg.use_aos = true;
+    scfg.verbose = true;
+
+    auto result = solve(prob, scfg);
+
+    printf("Best value: %.2f\n", result.best_solution.objectives[0]);
+    printf("Penalty: %.2f\n", result.best_solution.penalty);
+    printf("Selected items: ");
+    for (int i = 0; i < n; i++)
+        if (result.best_solution.data[0][i]) printf("%d ", i);
+    printf("\n");
+
+    prob.destroy();
+    return 0;
+}
+```
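An instance this small can be checked exhaustively on the CPU, which gives a known optimum to compare the solver's result against. The following brute-force sketch is independent of cuGenOpt; the function name `brute_force_knapsack` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Exhaustive search over all 2^n subsets; only practical for small n (here n = 8).
float brute_force_knapsack(const std::vector<float>& weights,
                           const std::vector<float>& values, float capacity) {
    assert(weights.size() == values.size() && weights.size() < 25);
    const unsigned n = static_cast<unsigned>(weights.size());
    float best = 0.0f;  // the empty selection is always feasible
    for (unsigned mask = 0; mask < (1u << n); ++mask) {
        float w = 0.0f, v = 0.0f;
        for (unsigned i = 0; i < n; ++i) {
            if (mask & (1u << i)) { w += weights[i]; v += values[i]; }
        }
        if (w <= capacity && v > best) best = v;
    }
    return best;
}
```

For the data in this example the optimum is 26, reached for instance by items 0, 1, 2, 3, and 7 with total weight exactly 20. A GPU run that reports penalty 0 and a best value of 26 has found an optimal selection.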
+
+---
+
+## Example 2: Assignment Problem (Low Complexity)
+
+### User Input
+> "Assign 10 workers to 10 tasks. Cost matrix is in a file `cost_10x10.txt`. Minimize total cost."
+
+### Analysis
+- **Decision**: assign each worker to a unique task → **Permutation**
+- **RowMode**: Single (D1=1)
+- **D2**: next_pow2(10) = 16
+- **Objective**: Minimize total cost
+- **Constraint**: none (permutation encoding guarantees one-to-one)
+- **Data**: read from file
+- **Complexity**: Low (standard assignment)
+
+### Generated: problem.cuh
+
+```cuda
+#pragma once
+#include "core/types.cuh"
+#include "core/cuda_utils.cuh"
+#include "core/operators.cuh"
+
+struct Assignment10 : ProblemBase<Assignment10, 1, 16> {
+    const float* d_cost;
+    int n;
+
+    __device__ float calc_total_cost(const Sol& sol) const {
+        float total = 0.0f;
+        const int* assign = sol.data[0];
+        for (int i = 0; i < n; i++)
+            total += d_cost[i * n + assign[i]];
+        return total;
+    }
+
+    static constexpr ObjDef OBJ_DEFS[] = {
+        {ObjDir::Minimize, 1.0f, 0.0f},
+    };
+
+    __device__ float compute_obj(int idx, const Sol& sol) const {
+        switch (idx) {
+            case 0: return calc_total_cost(sol);
+            default: return 0.0f;
+        }
+    }
+
+    __device__ float compute_penalty(const Sol& sol) const {
+        return 0.0f;
+    }
+
+    ProblemConfig config() const {
+        ProblemConfig cfg;
+        cfg.encoding = EncodingType::Permutation;
+        cfg.dim1 = 1;
+        cfg.dim2_default = n;
+        fill_obj_config(cfg);
+        return cfg;
+    }
+
+    size_t shared_mem_bytes() const {
+        return (size_t)n * n * sizeof(float);
+    }
+
+    size_t working_set_bytes() const {
+        return (size_t)n * n * sizeof(float);
+    }
+
+    __device__ void load_shared(char* smem, int tid, int bsz) {
+        float* sc = reinterpret_cast<float*>(smem);
+        int total = n * n;
+        for (int i = tid; i < total; i += bsz) sc[i] = d_cost[i];
+        d_cost = sc;
+    }
+
+    static Assignment10 create(const float* hc, int n) {
+        Assignment10 prob;
+        prob.n = n;
+        float* dc;
+        CUDA_CHECK(cudaMalloc(&dc, sizeof(float) * n * n));
+        CUDA_CHECK(cudaMemcpy(dc, hc, sizeof(float) * n * n, cudaMemcpyHostToDevice));
+        prob.d_cost = dc;
+        return prob;
+    }
+
+    void destroy() {
+        if (d_cost) { cudaFree(const_cast<float*>(d_cost)); d_cost = nullptr; }
+    }
+};
+```
+
+### Generated: main.cu
+
+```cuda
+#include "core/solver.cuh"
+#include "problem.cuh"
+#include <cstdio>
+#include <cstdlib>
+
+int main() {
+    const int n = 10;
+    float cost[n * n];
+
+    FILE* f = fopen("cost_10x10.txt", "r");
+    if (!f) { fprintf(stderr, "Cannot open cost_10x10.txt\n"); return 1; }
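+    for (int i = 0; i < n * n; i++)
+        if (fscanf(f, "%f", &cost[i]) != 1) {  // reject truncated or malformed input
+            fprintf(stderr, "Bad value at entry %d\n", i); fclose(f); return 1;
+        }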
+    fclose(f);
+
+    auto prob = Assignment10::create(cost, n);
+
+    SolverConfig scfg;
+    scfg.time_limit_sec = 10.0f;
+    scfg.use_aos = true;
+    scfg.verbose = true;
+
+    auto result = solve(prob, scfg);
+
+    printf("Best cost: %.2f\n", result.best_solution.objectives[0]);
+    printf("Assignment: ");
+    for (int i = 0; i < n; i++)
+        printf("worker %d → task %d  ", i, result.best_solution.data[0][i]);
+    printf("\n");
+
+    prob.destroy();
+    return 0;
+}
+```
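For small n the result can be cross-checked on the CPU. The sketch below mirrors `calc_total_cost` on the host and enumerates all permutations with `std::next_permutation`; both function names are hypothetical and the code does not depend on cuGenOpt.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side mirror of calc_total_cost: worker i takes task assign[i];
// cost is a row-major n x n matrix.
float assignment_cost(const std::vector<float>& cost, const std::vector<int>& assign) {
    const int n = static_cast<int>(assign.size());
    assert(cost.size() == static_cast<std::size_t>(n) * n);
    float total = 0.0f;
    for (int i = 0; i < n; ++i) total += cost[i * n + assign[i]];
    return total;
}

// Exhaustive minimum over all n! permutations; only for very small n.
float brute_force_assignment(const std::vector<float>& cost, int n) {
    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), 0);  // start from the identity, the smallest permutation
    float best = assignment_cost(cost, perm);
    while (std::next_permutation(perm.begin(), perm.end()))
        best = std::min(best, assignment_cost(cost, perm));
    return best;
}
```

With n = 10 this visits 3,628,800 permutations, which is still manageable on a CPU; for larger instances an exact method such as the Hungarian algorithm would be the usual reference instead.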
+
+---
+
+## Example 3: Vehicle Routing with Capacity (Medium Complexity)
+
+### User Input
+> "I have 1 depot and 30 customers. 4 trucks, each with capacity 100. Customer coordinates and demands are in `customers.csv` (columns: id, x, y, demand). Minimize total travel distance."
+
+### Analysis
+- **Decision**: assign customers to trucks and determine visit order → **Permutation**
+- **RowMode**: Partition (variable-length routes)
+- **D1**: next_pow2(4) = 4
+- **D2**: max(next_pow2(30/4*2), 64) = 64
+- **Objective**: Minimize total distance (depot → customers → depot for each truck)
+- **Constraint**: each truck's total demand ≤ 100
+- **Data**: CSV with coordinates → compute distance matrix
+- **Complexity**: Medium (custom constraint, Partition encoding)
+
+### Logic Summary (for user confirmation)
+> "Objective: minimize total travel distance across all trucks. Each truck starts and ends at depot (id=0). Constraint: total demand per truck ≤ 100, penalty = 100 × excess. Encoding: Permutation with Partition, 4 trucks, 30 customers."
+
+### Generated: problem.cuh
+
+```cuda
+#pragma once
+#include "core/types.cuh"
+#include "core/cuda_utils.cuh"
+#include "core/operators.cuh"
+#include <cmath>
+
+struct VRP30 : ProblemBase<VRP30, 4, 64> {
+    const float* d_dist;    // (n+1)×(n+1) distance matrix including depot
+    const float* d_demand;  // n customer demands
+    int n;                  // number of customers (excluding depot)
+    int stride;             // n+1
+    float capacity;
+    int num_vehicles;
+
+    __device__ float compute_route_dist(const int* route, int size) const {
+        if (size == 0) return 0.0f;
+        float dist = 0.0f;
+        int prev = 0;  // depot
+        for (int j = 0; j < size; j++) {
+            int node = route[j] + 1;  // customer i is node i+1 in d_dist; node 0 is the depot
+            dist += d_dist[prev * stride + node];
+            prev = node;
+        }
+        dist += d_dist[prev * stride + 0];  // return to depot
+        return dist;
+    }
+
+    __device__ float calc_total_distance(const Sol& sol) const {
+        float total = 0.0f;
+        for (int r = 0; r < num_vehicles; r++)
+            total += compute_route_dist(sol.data[r], sol.dim2_sizes[r]);
+        return total;
+    }
+
+    static constexpr ObjDef OBJ_DEFS[] = {
+        {ObjDir::Minimize, 1.0f, 0.0f},
+    };
+
+    __device__ float compute_obj(int idx, const Sol& sol) const {
+        switch (idx) {
+            case 0: return calc_total_distance(sol);
+            default: return 0.0f;
+        }
+    }
+
+    __device__ float compute_penalty(const Sol& sol) const {
+        float penalty = 0.0f;
+        for (int r = 0; r < num_vehicles; r++) {
+            float load = 0.0f;
+            for (int j = 0; j < sol.dim2_sizes[r]; j++)
+                load += d_demand[sol.data[r][j]];
+            if (load > capacity)
+                penalty += (load - capacity) * 100.0f;
+        }
+        return penalty;
+    }
+
+    ProblemConfig config() const {
+        ProblemConfig cfg;
+        cfg.encoding = EncodingType::Permutation;
+        cfg.dim1 = num_vehicles;
+        cfg.dim2_default = 0;
+        fill_obj_config(cfg);
+        cfg.row_mode = RowMode::Partition;
+        cfg.cross_row_prob = 0.3f;
+        cfg.total_elements = n;
+        return cfg;
+    }
+
+    size_t shared_mem_bytes() const {
+        return (size_t)stride * stride * sizeof(float) + (size_t)n * sizeof(float);
+    }
+
+    size_t working_set_bytes() const {
+        return (size_t)stride * stride * sizeof(float) + (size_t)n * sizeof(float);
+    }
+
+    __device__ void load_shared(char* smem, int tid, int bsz) {
+        float* sd = reinterpret_cast<float*>(smem);
+        int dist_size = stride * stride;
+        for (int i = tid; i < dist_size; i += bsz) sd[i] = d_dist[i];
+        d_dist = sd;
+        float* sdem = sd + dist_size;
+        for (int i = tid; i < n; i += bsz) sdem[i] = d_demand[i];
+        d_demand = sdem;
+    }
+
+    static VRP30 create(const float* h_dist, const float* h_demand,
+                         int n, float capacity, int num_vehicles) {
+        VRP30 prob;
+        prob.n = n;
+        prob.stride = n + 1;
+        prob.capacity = capacity;
+        prob.num_vehicles = num_vehicles;
+
+        int nodes = n + 1;
+        float* dd;
+        CUDA_CHECK(cudaMalloc(&dd, sizeof(float) * nodes * nodes));
+        CUDA_CHECK(cudaMemcpy(dd, h_dist, sizeof(float) * nodes * nodes, cudaMemcpyHostToDevice));
+        prob.d_dist = dd;
+
+        float* ddem;
+        CUDA_CHECK(cudaMalloc(&ddem, sizeof(float) * n));
+        CUDA_CHECK(cudaMemcpy(ddem, h_demand, sizeof(float) * n, cudaMemcpyHostToDevice));
+        prob.d_demand = ddem;
+
+        return prob;
+    }
+
+    void destroy() {
+        if (d_dist) { cudaFree(const_cast<float*>(d_dist)); d_dist = nullptr; }
+        if (d_demand) { cudaFree(const_cast<float*>(d_demand)); d_demand = nullptr; }
+    }
+};
+```
+
+### Generated: main.cu
+
+```cuda
+#include "core/solver.cuh"
+#include "problem.cuh"
+#include <cstdio>
+#include <cmath>
+
+int main() {
+    const int n = 30;
+    const int num_vehicles = 4;
+    const float capacity = 100.0f;
+
+    float x[n + 1], y[n + 1], demand[n];
+
+    FILE* f = fopen("customers.csv", "r");
+    if (!f) { fprintf(stderr, "Cannot open customers.csv\n"); return 1; }
+
+    char header[256];
+    fgets(header, sizeof(header), f);  // skip header
+
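+    // Read depot (id=0)
+    int id;
+    if (fscanf(f, "%d,%f,%f,%*f", &id, &x[0], &y[0]) != 3) {  // %*f skips the depot's demand column
+        fprintf(stderr, "Bad depot row\n"); fclose(f); return 1;
+    }
+
+    // Read customers
+    for (int i = 0; i < n; i++) {
+        if (fscanf(f, "%d,%f,%f,%f", &id, &x[i + 1], &y[i + 1], &demand[i]) != 4) {
+            fprintf(stderr, "Bad customer row %d\n", i + 1); fclose(f); return 1;
+        }
+    }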
+    fclose(f);
+
+    // Compute distance matrix
+    const int nodes = n + 1;  // const so that dist[] below is a fixed-size array, not a VLA
+    float dist[nodes * nodes];
+    for (int i = 0; i < nodes; i++)
+        for (int j = 0; j < nodes; j++) {
+            float dx = x[i] - x[j], dy = y[i] - y[j];
+            dist[i * nodes + j] = sqrtf(dx * dx + dy * dy);
+        }
+
+    auto prob = VRP30::create(dist, demand, n, capacity, num_vehicles);
+
+    SolverConfig scfg;
+    scfg.time_limit_sec = 30.0f;
+    scfg.use_aos = true;
+    scfg.verbose = true;
+
+    auto result = solve(prob, scfg);
+
+    printf("Best distance: %.2f\n", result.best_solution.objectives[0]);
+    printf("Penalty: %.2f\n", result.best_solution.penalty);
+    for (int r = 0; r < num_vehicles; r++) {
+        printf("Truck %d: depot", r);
+        for (int j = 0; j < result.best_solution.dim2_sizes[r]; j++)
+            printf(" → %d", result.best_solution.data[r][j] + 1);
+        printf(" → depot\n");
+    }
+
+    prob.destroy();
+    return 0;
+}
+```
+
+---
+
+## Example 4: Graph Coloring (Low Complexity)
+
+### User Input
+> "Color a graph with 20 nodes using at most 4 colors. Edges: (0,1),(0,2),(1,3),(2,3),(3,4),...  Minimize the number of colors used, with no two adjacent nodes sharing a color."
+
+### Analysis
+- **Decision**: assign a color (0–3) to each node → **Integer**
+- **RowMode**: Single (D1=1)
+- **D2**: next_pow2(20) = 32
+- **Objective**: Minimize number of distinct colors used
+- **Constraint**: adjacent nodes must have different colors
+- **Complexity**: Low (standard graph coloring)
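
The `D2` value in the analysis above is the node count rounded up to a power of two. A minimal host-side sketch of that rounding follows; `next_pow2` here is a hypothetical helper written for illustration (the name is borrowed from the notation above, not from the framework), and it assumes C++14 or later for the `constexpr` loop.

```cpp
// Hypothetical helper: smallest power of two that is >= v (for v >= 1).
constexpr int next_pow2(int v) {
    int p = 1;
    while (p < v) p <<= 1;
    return p;
}

// 20 nodes round up to 32, matching ProblemBase<GraphColor20, 1, 32>.
static_assert(next_pow2(20) == 32, "20 nodes -> D2 = 32");
```

Because the result is a compile-time constant, it could be used directly as the `D2` template argument.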
+
+### Generated: problem.cuh
+
+```cuda
+#pragma once
+#include "core/types.cuh"
+#include "core/cuda_utils.cuh"
+#include "core/operators.cuh"
+
+struct GraphColor20 : ProblemBase<GraphColor20, 1, 32> {
+    const int* d_adj;     // adjacency matrix n×n (1=edge, 0=no edge)
+    int n;
+    int max_colors;
+
+    __device__ float calc_num_colors(const Sol& sol) const {
+        // Bit c is set when color c appears. A bitmask avoids a fixed-size
+        // array that would overflow if max_colors > 4; assumes max_colors <= 32.
+        unsigned int used = 0u;
+        const int* colors = sol.data[0];
+        for (int i = 0; i < n; i++) {
+            int c = colors[i];
+            if (c >= 0 && c < max_colors) used |= (1u << c);
+        }
+        return (float)__popc(used);  // number of set bits = distinct colors
+    }
+
+    static constexpr ObjDef OBJ_DEFS[] = {
+        {ObjDir::Minimize, 1.0f, 0.0f},
+    };
+
+    __device__ float compute_obj(int idx, const Sol& sol) const {
+        switch (idx) {
+            case 0: return calc_num_colors(sol);
+            default: return 0.0f;
+        }
+    }
+
+    __device__ float compute_penalty(const Sol& sol) const {
+        float conflicts = 0.0f;
+        const int* colors = sol.data[0];
+        for (int i = 0; i < n; i++)
+            for (int j = i + 1; j < n; j++)
+                if (d_adj[i * n + j] && colors[i] == colors[j])
+                    conflicts += 1.0f;
+        return conflicts * 10.0f;
+    }
+
+    ProblemConfig config() const {
+        ProblemConfig cfg;
+        cfg.encoding = EncodingType::Integer;
+        cfg.dim1 = 1;
+        cfg.dim2_default = n;
+        cfg.value_lower_bound = 0;
+        cfg.value_upper_bound = max_colors - 1;
+        fill_obj_config(cfg);
+        return cfg;
+    }
+
+    size_t shared_mem_bytes() const {
+        return (size_t)n * n * sizeof(int);
+    }
+
+    size_t working_set_bytes() const {
+        return (size_t)n * n * sizeof(int);
+    }
+
+    __device__ void load_shared(char* smem, int tid, int bsz) {
+        int* sa = reinterpret_cast<int*>(smem);
+        int total = n * n;
+        for (int i = tid; i < total; i += bsz) sa[i] = d_adj[i];
+        d_adj = sa;
+    }
+
+    static GraphColor20 create(const int* h_adj, int n, int max_colors) {
+        GraphColor20 prob;
+        prob.n = n;
+        prob.max_colors = max_colors;
+        int* da;
+        CUDA_CHECK(cudaMalloc(&da, sizeof(int) * n * n));
+        CUDA_CHECK(cudaMemcpy(da, h_adj, sizeof(int) * n * n, cudaMemcpyHostToDevice));
+        prob.d_adj = da;
+        return prob;
+    }
+
+    void destroy() {
+        if (d_adj) { cudaFree(const_cast<int*>(d_adj)); d_adj = nullptr; }
+    }
+};
+```
+
+### Generated: main.cu
+
+```cuda
+#include "core/solver.cuh"
+#include "problem.cuh"
+#include <cstdio>
+
+int main() {
+    const int n = 20;
+    const int max_colors = 4;
+
+    int adj[n * n] = {0};
+    // Define edges
+    int edges[][2] = {{0,1},{0,2},{1,3},{2,3},{3,4},
+                       {4,5},{5,6},{6,7},{7,8},{8,9},
+                       {9,10},{10,11},{11,12},{12,13},{13,14},
+                       {14,15},{15,16},{16,17},{17,18},{18,19},
+                       {0,19},{1,4},{2,5},{6,9},{7,10}};
+    int num_edges = sizeof(edges) / sizeof(edges[0]);
+    for (int e = 0; e < num_edges; e++) {
+        int u = edges[e][0], v = edges[e][1];
+        adj[u * n + v] = 1;
+        adj[v * n + u] = 1;
+    }
+
+    auto prob = GraphColor20::create(adj, n, max_colors);
+
+    SolverConfig scfg;
+    scfg.time_limit_sec = 10.0f;
+    scfg.use_aos = true;
+    scfg.verbose = true;
+
+    auto result = solve(prob, scfg);
+
+    printf("Colors used: %.0f\n", result.best_solution.objectives[0]);
+    printf("Conflicts (penalty): %.2f\n", result.best_solution.penalty);
+    printf("Coloring: ");
+    for (int i = 0; i < n; i++)
+        printf("node%d=%d ", i, result.best_solution.data[0][i]);
+    printf("\n");
+
+    prob.destroy();
+    return 0;
+}
+```
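
A coloring returned by the solver can also be checked on the host, independently of the device code. The sketch below mirrors the `i < j` conflict loop of `compute_penalty` and the color count of `calc_num_colors`; `count_conflicts` and `count_colors` are hypothetical names introduced here, not part of cuGenOpt.

```cpp
#include <vector>

// Hypothetical host-side check: number of edges whose endpoints share a color.
// adj is a row-major n*n 0/1 matrix, laid out as in main.cu.
int count_conflicts(const std::vector<int>& adj, const std::vector<int>& colors) {
    const int n = static_cast<int>(colors.size());
    int conflicts = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (adj[i * n + j] && colors[i] == colors[j]) conflicts++;
    return conflicts;
}

// Number of distinct colors in [0, max_colors) that appear in the coloring.
int count_colors(const std::vector<int>& colors, int max_colors) {
    std::vector<char> used(max_colors, 0);
    for (int c : colors)
        if (c >= 0 && c < max_colors) used[c] = 1;
    int count = 0;
    for (char u : used) count += u;
    return count;
}
```

With the penalty defined above, a feasible result should give `count_conflicts(...) == 0`, and the reported penalty should equal the conflict count times `10.0f`.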
--- a/skills/cugenopt-problem-gen/reference/problem-api.md
+++ b/skills/cugenopt-problem-gen/reference/problem-api.md
@ -0,0 +1,280 @@
+# ProblemBase API Reference
+
+Complete interface specification for `ProblemBase<Derived, D1, D2>` (defined in `core/types.cuh`).
+
+## Template Parameters
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `Derived` | struct | The concrete problem type (CRTP pattern) |
+| `D1` | int | Maximum number of rows (compile-time constant, power of 2 recommended) |
+| `D2` | int | Maximum columns per row (compile-time constant, power of 2 recommended) |
+
+The base class provides:
+- `using Sol = Solution<D1, D2>;` — the solution type
+- `static constexpr int NUM_OBJ` — auto-derived from `Derived::OBJ_DEFS`
+- `evaluate(Sol&)` — calls `compute_obj` for each objective + `compute_penalty`
+- `fill_obj_config(ProblemConfig&)` — populates objective fields from `OBJ_DEFS`
+- `obj_config()` — returns `ObjConfig` for the solver
+
+## Required Interface
+
+### 1. `OBJ_DEFS` — Objective Definitions (static constexpr)
+
+```cuda
+static constexpr ObjDef OBJ_DEFS[] = {
+    {ObjDir::Minimize, 1.0f, 0.0f},   // index 0
+    // {ObjDir::Maximize, 0.5f, 0.0f}, // index 1 (multi-objective)
+};
+```
+
+Each `ObjDef`:
+- `dir`: `ObjDir::Minimize` or `ObjDir::Maximize`
+- `weight`: importance weight for `CompareMode::Weighted` (default mode)
+- `tolerance`: tolerance for `CompareMode::Lexicographic`
+
+Most problems have a single objective. Multi-objective problems (up to 4 objectives) are supported.
+
+### 2. `compute_obj` — Objective Calculation
+
+```cuda
+__device__ float compute_obj(int idx, const Sol& sol) const;
+```
+
+- Runs on GPU (`__device__`)
+- `idx` corresponds to `OBJ_DEFS[idx]`
+- Use a `switch` statement dispatching to helper functions
+- Access solution data via `sol.data[row][col]` and `sol.dim2_sizes[row]`
+
+**Pattern**:
+```cuda
+__device__ float compute_obj(int idx, const Sol& sol) const {
+    switch (idx) {
+        case 0: return calc_total_cost(sol);
+        default: return 0.0f;
+    }
+}
+```
+
+### 3. `compute_penalty` — Constraint Violation
+
+```cuda
+__device__ float compute_penalty(const Sol& sol) const;
+```
+
+- Returns `0.0f` for feasible solutions
+- Returns a positive value proportional to violation magnitude for infeasible solutions
+- The solver always prefers feasible solutions (penalty=0) over infeasible ones
+- For multiple constraints, sum all violations
+
+**Guidelines**:
+- Scale penalty to be comparable to objective magnitude
+- Example: capacity overflow → `(excess_load) * 100.0f`
+- Example: vehicle count exceeded → `(excess_vehicles) * 1000.0f`
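
The capacity guideline above can be sketched on the host as follows. This is only an illustration of "sum the violations, then scale": `capacity_penalty` is a hypothetical name, and the default weight of `100.0f` is taken from the example in the guideline rather than from any framework default.

```cpp
#include <vector>

// Hypothetical host-side mirror of a capacity penalty: sum each route's
// excess load and scale it by a weight.
float capacity_penalty(const std::vector<float>& route_loads, float capacity,
                       float weight = 100.0f) {
    float excess = 0.0f;
    for (float load : route_loads)
        if (load > capacity) excess += load - capacity;
    return excess * weight;  // 0.0f when every route is within capacity
}
```

The same loop body would go inside `compute_penalty` on the device, with per-route loads accumulated from `sol.data[r]` and `sol.dim2_sizes[r]`.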
+
+### 4. `config` — Problem Configuration
+
+```cuda
+ProblemConfig config() const;
+```
+
+Returns runtime metadata. Must set:
+
+```cuda
+ProblemConfig config() const {
+    ProblemConfig cfg;
+    cfg.encoding = EncodingType::Permutation;  // or Binary, Integer
+    cfg.dim1 = /* actual rows used */;
+    cfg.dim2_default = /* actual columns */;
+    fill_obj_config(cfg);  // auto-fills objectives from OBJ_DEFS
+
+    // Multi-row problems:
+    // cfg.row_mode = RowMode::Fixed;       // equal-length rows
+    // cfg.row_mode = RowMode::Partition;   // variable-length rows
+    // cfg.cross_row_prob = 0.3f;           // cross-row operator probability
+    // cfg.total_elements = n;              // Partition: total elements across all rows
+
+    // Integer encoding:
+    // cfg.value_lower_bound = 0;
+    // cfg.value_upper_bound = num_colors - 1;
+
+    return cfg;
+}
+```
+
+### 5. `create` / `destroy` — Factory Methods
+
+```cuda
+static MyProblem create(/* host-side data */) {
+    MyProblem prob;
+    prob.n = n;
+    // Allocate GPU memory and copy data
+    float* d_ptr;
+    CUDA_CHECK(cudaMalloc(&d_ptr, sizeof(float) * n * n));
+    CUDA_CHECK(cudaMemcpy(d_ptr, h_ptr, sizeof(float) * n * n, cudaMemcpyHostToDevice));
+    prob.d_data = d_ptr;
+    return prob;
+}
+
+void destroy() {
+    if (d_data) { cudaFree(const_cast<float*>(d_data)); d_data = nullptr; }
+}
+```
+
+**Rules**:
+- All GPU memory allocated in `create()`, freed in `destroy()`
+- Use `CUDA_CHECK()` for every CUDA API call
+- Store `d_` (device) pointers and, optionally, `h_` (host) pointers
+- `const_cast` is needed in `destroy()` because the device pointers are declared as pointers to const (e.g. `const float*`)
+
+## Optional Interface
+
+### 6. `shared_mem_bytes` — Shared Memory Requirement
+
+```cuda
+size_t shared_mem_bytes() const;
+```
+
+- Returns total bytes of problem data to cache in shared memory
+- Return the **actual** data size; the framework handles overflow:
+  - ≤ 48KB: fits default shared memory
+  - 48KB–164KB: framework calls `cudaFuncSetAttribute` to extend (GPU-dependent)
+  - Too large: framework falls back to global memory automatically
+- Default (from base class): returns 0
+
+**Example** (distance matrix):
+```cuda
+size_t shared_mem_bytes() const {
+    return (size_t)n * n * sizeof(float);  // report actual need
+}
+```
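
To see which of the three cases a given problem size falls into, the thresholds listed above can be written as a small host-side classifier. This is a sketch of the documented tiers only, not the framework's actual decision logic; `SmemTier` and `classify_smem` are hypothetical names, and the 164 KB default is the GPU-dependent upper figure quoted above.

```cpp
#include <cstddef>

enum class SmemTier { Default, Extended, GlobalFallback };

// Hypothetical classifier for the three cases listed above.
// opt_in_limit is the device's maximum opt-in shared memory per block.
SmemTier classify_smem(std::size_t bytes, std::size_t opt_in_limit = 164 * 1024) {
    const std::size_t default_limit = 48 * 1024;
    if (bytes <= default_limit) return SmemTier::Default;        // fits as-is
    if (bytes <= opt_in_limit) return SmemTier::Extended;        // needs opt-in
    return SmemTier::GlobalFallback;                             // stays in global memory
}
```

On a real device the opt-in limit can be queried with `cudaDeviceGetAttribute` and `cudaDevAttrMaxSharedMemoryPerBlockOptin` instead of assuming 164 KB. For example, a 150×150 `float` matrix (90,000 bytes) lands in the extended tier, while a 300×300 one (360,000 bytes) falls back to global memory.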
+
+### 7. `working_set_bytes` — Global Memory Working Set
+
+```cuda
+size_t working_set_bytes() const;
+```
+
+- Returns the per-block hot data size in global memory
+- Used by the framework to estimate L2 cache pressure and auto-size population
+- Default: returns `shared_mem_bytes()`
+- **Override when** `shared_mem_bytes()` returns 0 (the problem data is not cached in shared memory): return the actual data size so population sizing works correctly
+
+**Example**:
+```cuda
+size_t working_set_bytes() const {
+    return (size_t)n * n * sizeof(float) + (size_t)n * sizeof(float);
+}
+```
+
+### 8. `load_shared` — Load Data into Shared Memory
+
+```cuda
+__device__ void load_shared(char* smem, int tid, int bsz);
+```
+
+- Called by framework when `shared_mem_bytes() > 0`
+- Copy data from global memory to shared memory using cooperative loading
+- **Redirect the device pointer** to shared memory after loading
+
+**Pattern**:
+```cuda
+__device__ void load_shared(char* smem, int tid, int bsz) {
+    float* s_data = reinterpret_cast<float*>(smem);
+    int total = n * n;
+    for (int i = tid; i < total; i += bsz)
+        s_data[i] = d_data[i];
+    d_data = s_data;  // redirect pointer to shared memory
+}
+```
+
+For multiple arrays, lay them out sequentially in `smem`:
+```cuda
+__device__ void load_shared(char* smem, int tid, int bsz) {
+    float* s_dist = reinterpret_cast<float*>(smem);
+    int dist_size = stride * stride;
+    for (int i = tid; i < dist_size; i += bsz) s_dist[i] = d_dist[i];
+    d_dist = s_dist;
+
+    float* s_demand = s_dist + dist_size;
+    for (int i = tid; i < n; i += bsz) s_demand[i] = d_demand[i];
+    d_demand = s_demand;
+}
+```
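
When several arrays are packed this way, the size reported by `shared_mem_bytes()` has to cover all of them. The sketch below computes the byte offsets for the two-array layout above; `SmemLayout` and `vrp_smem_layout` are hypothetical names, and the sketch assumes both arrays hold `float`, so no extra alignment padding is needed between them.

```cpp
#include <cstddef>

// Hypothetical layout for two float arrays packed back to back in smem:
// a stride*stride distance matrix followed by n demands.
struct SmemLayout {
    std::size_t dist_offset;    // byte offset of the distance matrix
    std::size_t demand_offset;  // byte offset of the demand array
    std::size_t total_bytes;    // what shared_mem_bytes() would need to report
};

SmemLayout vrp_smem_layout(int stride, int n) {
    SmemLayout l;
    l.dist_offset = 0;
    l.demand_offset = static_cast<std::size_t>(stride) * stride * sizeof(float);
    l.total_bytes = l.demand_offset + static_cast<std::size_t>(n) * sizeof(float);
    return l;
}
```

If arrays of different element types were mixed, each offset would additionally need rounding up to the alignment of the next array's element type.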
+
+### 9. `heuristic_matrices` — Data for Heuristic Initialization
+
+```cuda
+int heuristic_matrices(HeuristicMatrix* out, int max_count) const;
+```
+
+- Returns host-side matrices for constructing heuristic initial solutions
+- The framework sorts elements by row/column sums to generate better-than-random starting points
+- Return value: number of matrices provided (0 = no heuristic init)
+
+**Example** (distance matrix for TSP):
+```cuda
+int heuristic_matrices(HeuristicMatrix* out, int max_count) const {
+    if (max_count < 1 || !h_dist) return 0;
+    out[0] = {h_dist, n};
+    return 1;
+}
+```
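
The idea of ordering elements by row sums can be illustrated on the host. The framework's exact ordering rule is not specified here, so this is only a sketch under the assumption of an ascending sort by row sum; `order_by_row_sum` is a hypothetical name.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical illustration: order elements by ascending row sum of an
// N*N row-major matrix. Ties keep their original index order.
std::vector<int> order_by_row_sum(const std::vector<float>& m, int N) {
    std::vector<float> row_sum(N, 0.0f);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) row_sum[i] += m[i * N + j];
    std::vector<int> order(N);
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return row_sum[a] < row_sum[b]; });
    return order;
}
```

For a distance matrix, a small row sum means the node is close to most others, so such an ordering tends to place central nodes first.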
+
+### 10. `init_relation_matrix` — G/O Matrix for Guided Rebuild
+
+```cuda
+void init_relation_matrix(float* h_G, float* h_O, int N) const;
+```
+
+- Provides prior knowledge for the LNS guided rebuild operator
+- `G[i*N+j]`: grouping tendency (symmetric, higher = more likely in same group)
+- `O[i*N+j]`: ordering tendency (asymmetric, higher = i before j)
+- Values in [0, 1], typically scaled from problem data (e.g., distance proximity)
+- Default: does nothing (matrices stay zero, learned from search history)
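
One way to derive a grouping prior from a distance matrix is sketched below. It is an assumption-laden example rather than the framework's own scheme: `fill_proximity_G` is a hypothetical helper that sets `G` to `1 - dist / max_dist` (closer pairs score higher, values stay in [0, 1]) and leaves `O` at zero, meaning no ordering prior.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical prior: G[i*N+j] = 1 - dist/max_dist for i != j, 0 on the diagonal.
// G is symmetric whenever dist is symmetric. O is left at zero (no ordering prior).
void fill_proximity_G(const std::vector<float>& dist, int N,
                      float* h_G, float* h_O) {
    float max_d = 0.0f;
    for (float d : dist) max_d = std::max(max_d, d);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            const int k = i * N + j;
            h_G[k] = (i == j || max_d <= 0.0f) ? 0.0f : 1.0f - dist[k] / max_d;
            h_O[k] = 0.0f;
        }
}
```

An `init_relation_matrix` override could forward its `h_G`, `h_O`, and `N` arguments to a helper like this, given a host copy of the distance matrix.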
+
+## Solution Data Access
+
+```cuda
+sol.data[row][col]      // element value at (row, col)
+sol.dim2_sizes[row]     // actual length of row (may be < D2)
+sol.objectives[idx]     // objective value (set by evaluate())
+sol.penalty             // penalty value (set by evaluate())
+```
+
+- **Permutation (Single)**: `sol.data[0][0..n-1]` contains a permutation of `0..n-1`
+- **Permutation (Partition)**: `sol.data[r][0..sol.dim2_sizes[r]-1]` for each route/partition
+- **Binary**: `sol.data[0][i]` is 0 or 1
+- **Integer**: `sol.data[0][i]` is in `[value_lower_bound, value_upper_bound]`
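
The Permutation (Single) invariant above is easy to verify on the host once a solution has been copied back. The following is a minimal sketch; `is_permutation_row` is a hypothetical helper, not a framework function.

```cpp
#include <vector>

// Hypothetical host-side check: row holds each of 0..n-1 exactly once.
bool is_permutation_row(const int* row, int len, int n) {
    if (len != n) return false;
    std::vector<char> seen(n, 0);
    for (int i = 0; i < len; i++) {
        const int v = row[i];
        if (v < 0 || v >= n || seen[v]) return false;  // out of range or duplicate
        seen[v] = 1;
    }
    return true;
}
```

For a single-row problem it would be called with the first row of the solution and its `dim2_sizes[0]` length. A Partition check would instead mark `seen` across all rows and compare the total count against `total_elements`.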
+
+## Key Types Reference
+
+```cuda
+enum class EncodingType { Permutation, Binary, Integer };
+enum class RowMode { Single, Fixed, Partition };
+enum class ObjDir { Minimize, Maximize };
+enum class CompareMode { Weighted, Lexicographic };
+
+struct ObjDef { ObjDir dir; float weight; float tolerance; };
+struct HeuristicMatrix { const float* data; int N; };
+
+struct ProblemConfig {
+    EncodingType encoding;
+    int dim1, dim2_default, num_objectives;
+    ObjDir obj_dirs[4]; float obj_weights[4];
+    CompareMode compare_mode;
+    RowMode row_mode;
+    float cross_row_prob;
+    int total_elements;
+    int value_lower_bound, value_upper_bound;
+};
+
+struct SolverConfig {
+    int pop_size;           // 0 = auto
+    int max_gen;            // max generations
+    float time_limit_sec;   // 0 = no limit
+    bool use_aos;           // adaptive operator selection
+    bool verbose;
+    unsigned seed;
+};
+```