Initial commit: cuGenOpt GPU optimization solver

L-yang-yang 2026-03-20 00:33:45 +08:00
commit fc5a0ff4af
117 changed files with 25545 additions and 0 deletions

# Encoding Selection & Dimension Guide
## Encoding Types
cuGenOpt supports three encoding types. Choose based on the nature of the decision variables.
### Permutation
**Use when**: each element appears exactly once (ordering/assignment).
| Scenario | RowMode | D1 | D2 | dim2_default | total_elements |
|----------|---------|----|----|-------------|----------------|
| TSP (n cities) | Single | 1 | next_pow2(n) | n | — |
| QAP (n facilities) | Single | 1 | next_pow2(n) | n | — |
| Assignment (n tasks) | Single | 1 | next_pow2(n) | n | — |
| JSP (m machines, j jobs) | Fixed | next_pow2(m) | next_pow2(j) | j | — |
| VRP (k vehicles, n customers) | Partition | next_pow2(k) | max(next_pow2(n/k*2), 64) | 0 | n |
| VRPTW (k vehicles, n customers) | Partition | next_pow2(k) | max(next_pow2(n/k*2), 64) | 0 | n |
**Partition specifics**:
- `dim2_default = 0` tells the framework to distribute elements across rows
- `total_elements = n` is the count of elements to distribute
- `cross_row_prob` controls how often cross-row operators fire (typically 0.2–0.4)
- Elements are customer/job indices `0..n-1`; depot/source is implicit (not in the solution)
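Taken together, a sketch of a Partition config (field names as used throughout this guide; values for a hypothetical 4-vehicle, 30-customer VRP):

```cuda
ProblemConfig config() const {
    ProblemConfig cfg;
    cfg.encoding = EncodingType::Permutation;
    cfg.row_mode = RowMode::Partition;
    cfg.dim1 = 4;              // number of vehicles (rows)
    cfg.dim2_default = 0;      // 0: framework distributes elements across rows
    cfg.total_elements = 30;   // customers to distribute
    cfg.cross_row_prob = 0.3f; // probability of cross-row operators
    fill_obj_config(cfg);
    return cfg;
}
```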
### Binary
**Use when**: each position is a yes/no decision.
| Scenario | RowMode | D1 | D2 | dim2_default |
|----------|---------|----|----|-------------|
| 0-1 Knapsack (n items) | Single | 1 | next_pow2(n) | n |
| Scheduling (n shifts) | Single | 1 | next_pow2(n) | n |
| Subset selection (n candidates) | Single | 1 | next_pow2(n) | n |
| Multi-row scheduling (m workers, n shifts) | Fixed | next_pow2(m) | next_pow2(n) | n |
**Solution values**: `sol.data[row][col]` is 0 or 1.
### Integer
**Use when**: each position takes a bounded integer value.
| Scenario | RowMode | D1 | D2 | dim2_default | lower_bound | upper_bound |
|----------|---------|----|----|-------------|-------------|-------------|
| Graph coloring (n nodes, c colors) | Single | 1 | next_pow2(n) | n | 0 | c-1 |
| Load balancing (n tasks, m machines) | Single | 1 | next_pow2(n) | n | 0 | m-1 |
| Multi-machine scheduling | Fixed | next_pow2(m) | next_pow2(j) | j | 0 | max_time |
**Solution values**: `sol.data[row][col]` is in `[value_lower_bound, value_upper_bound]`.
Set bounds in config:
```cuda
cfg.value_lower_bound = 0;
cfg.value_upper_bound = num_colors - 1;
```
## Dimension Calculation Rules
### D1 and D2 (Template Parameters)
These are **compile-time constants** and define the maximum capacity:
- Must be sufficient for the largest instance you plan to solve
- Power of 2 is recommended for memory alignment
- Larger values waste registers/memory; keep as small as possible
```
next_pow2(x):
1→1, 2→2, 3→4, 5→8, 9→16, 17→32, 33→64, 65→128, ...
```
### dim1 and dim2_default (Runtime Parameters)
Set in `config()` to the actual problem size:
- `dim1 ≤ D1`: actual number of rows used
- `dim2_default ≤ D2`: actual number of columns per row
- For Partition mode: `dim2_default = 0` (framework handles distribution)
### Choosing D2 for Partition Mode
Since rows have variable length, D2 must accommodate the longest possible row:
```
D2 = max(next_pow2(total_elements / D1 * 2), 64)
```
The `*2` factor provides headroom for unbalanced distributions.
## Shared Memory Sizing
### When to Use Shared Memory
Shared memory provides ~10x faster access than global memory. Use it when:
- Problem has a data matrix (distance, cost, weight)
- The matrix is accessed repeatedly during objective/penalty evaluation
### How to Size
Report the **actual** data size. The framework handles the rest:
```cuda
size_t shared_mem_bytes() const {
// Distance matrix + demand array
return (size_t)stride * stride * sizeof(float) + (size_t)n * sizeof(float);
}
```
The framework automatically:
1. If ≤ 48KB: uses default shared memory
2. If > 48KB and ≤ max_smem: calls `cudaFuncSetAttribute` to extend (GPU-dependent max: T4=64KB, V100=96KB, A100/A800=164KB, H100=228KB)
3. If > max_smem: falls back to global memory, uses `working_set_bytes()` for L2 cache population sizing
### working_set_bytes
Always return the actual data size, regardless of whether it fits in shared memory:
```cuda
size_t working_set_bytes() const {
return (size_t)n * n * sizeof(float);
}
```
This is used by the framework to auto-calculate population size based on L2 cache capacity.
## RowMode Details
### Single (default)
- `dim1 = 1`, single row of elements
- No cross-row operators
- Simplest and most common
### Fixed
- `dim1 > 1`, all rows have the same length (`dim2_default`)
- Cross-row operators: ROW_SWAP, ROW_REVERSE
- No SPLIT/MERGE (rows cannot change length)
- Use for: JSP (machines × jobs), multi-worker scheduling
### Partition
- `dim1 > 1`, rows have variable length
- Elements are distributed across rows (total count = `total_elements`)
- Cross-row operators: CROSS_RELOCATE, CROSS_SWAP, SEG_RELOCATE, SEG_SWAP, CROSS_EXCHANGE, SPLIT, MERGE
- `cross_row_prob` controls the probability of selecting cross-row operators
- Use for: VRP (vehicles × customers), any partitioning problem
## Quick Reference: Problem → Config
| Problem | Encoding | RowMode | D1 | D2 | cross_row_prob |
|---------|----------|---------|----|----|---------------|
| TSP-50 | Perm | Single | 1 | 64 | 0 |
| TSP-500 | Perm | Single | 1 | 512 | 0 |
| QAP-15 | Perm | Single | 1 | 16 | 0 |
| Assignment-12 | Perm | Single | 1 | 16 | 0 |
| VRP-30-4v | Perm | Partition | 4 | 64 | 0.3 |
| VRPTW-100-25v | Perm | Partition | 32 | 64 | 0.3 |
| Knapsack-100 | Binary | Single | 1 | 128 | 0 |
| Scheduling-20 | Binary | Single | 1 | 32 | 0 |
| Graph Color-50 | Integer | Single | 1 | 64 | 0 |
| JSP-6m-6j | Perm | Fixed | 8 | 8 | 0.2 |

# End-to-End Examples
Four complete examples from natural language description to generated code.
---
## Example 1: 0-1 Knapsack (Low Complexity)
### User Input
> "I have 8 items with weights [2,3,4,5,9,7,8,6] and values [3,4,5,8,10,7,9,6]. Knapsack capacity is 20. Maximize total value."
### Analysis
- **Decision**: select or not → **Binary**
- **RowMode**: Single (D1=1)
- **D2**: next_pow2(8) = 8
- **Objective**: Maximize total value
- **Constraint**: total weight ≤ 20
- **Complexity**: Low (standard knapsack, direct reference)
### Generated: problem.cuh
```cuda
#pragma once
#include "core/types.cuh"
#include "core/cuda_utils.cuh"
#include "core/operators.cuh"
struct Knapsack8 : ProblemBase<Knapsack8, 1, 8> {
const float* d_weights;
const float* d_values;
float capacity;
int n;
__device__ float calc_total_value(const Sol& sol) const {
float tv = 0.0f;
const int* sel = sol.data[0];
for (int i = 0; i < n; i++)
if (sel[i]) tv += d_values[i];
return tv;
}
static constexpr ObjDef OBJ_DEFS[] = {
{ObjDir::Maximize, 1.0f, 0.0f},
};
__device__ float compute_obj(int idx, const Sol& sol) const {
switch (idx) {
case 0: return calc_total_value(sol);
default: return 0.0f;
}
}
__device__ float compute_penalty(const Sol& sol) const {
float tw = 0.0f;
const int* sel = sol.data[0];
for (int i = 0; i < n; i++)
if (sel[i]) tw += d_weights[i];
float over = tw - capacity;
return (over > 0.0f) ? over * 50.0f : 0.0f;
}
ProblemConfig config() const {
ProblemConfig cfg;
cfg.encoding = EncodingType::Binary;
cfg.dim1 = 1;
cfg.dim2_default = n;
fill_obj_config(cfg);
return cfg;
}
size_t shared_mem_bytes() const {
return 2 * (size_t)n * sizeof(float);
}
__device__ void load_shared(char* smem, int tid, int bsz) {
float* sw = reinterpret_cast<float*>(smem);
float* sv = sw + n;
for (int i = tid; i < n; i += bsz) {
sw[i] = d_weights[i];
sv[i] = d_values[i];
}
d_weights = sw;
d_values = sv;
}
static Knapsack8 create(const float* hw, const float* hv, int n, float cap) {
Knapsack8 prob;
prob.n = n;
prob.capacity = cap;
float *dw, *dv;
CUDA_CHECK(cudaMalloc(&dw, sizeof(float) * n));
CUDA_CHECK(cudaMalloc(&dv, sizeof(float) * n));
CUDA_CHECK(cudaMemcpy(dw, hw, sizeof(float) * n, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(dv, hv, sizeof(float) * n, cudaMemcpyHostToDevice));
prob.d_weights = dw;
prob.d_values = dv;
return prob;
}
void destroy() {
if (d_weights) cudaFree(const_cast<float*>(d_weights));
if (d_values) cudaFree(const_cast<float*>(d_values));
d_weights = nullptr;
d_values = nullptr;
}
};
```
### Generated: main.cu
```cuda
#include "core/solver.cuh"
#include "problem.cuh"
#include <cstdio>
int main() {
const int n = 8;
float weights[] = {2, 3, 4, 5, 9, 7, 8, 6};
float values[] = {3, 4, 5, 8, 10, 7, 9, 6};
float capacity = 20.0f;
auto prob = Knapsack8::create(weights, values, n, capacity);
SolverConfig scfg;
scfg.time_limit_sec = 5.0f;
scfg.use_aos = true;
scfg.verbose = true;
auto result = solve(prob, scfg);
printf("Best value: %.2f\n", result.best_solution.objectives[0]);
printf("Penalty: %.2f\n", result.best_solution.penalty);
printf("Selected items: ");
for (int i = 0; i < n; i++)
if (result.best_solution.data[0][i]) printf("%d ", i);
printf("\n");
prob.destroy();
return 0;
}
```
---
## Example 2: Assignment Problem (Low Complexity)
### User Input
> "Assign 10 workers to 10 tasks. Cost matrix is in a file `cost_10x10.txt`. Minimize total cost."
### Analysis
- **Decision**: assign each worker to a unique task → **Permutation**
- **RowMode**: Single (D1=1)
- **D2**: next_pow2(10) = 16
- **Objective**: Minimize total cost
- **Constraint**: none (permutation encoding guarantees one-to-one)
- **Data**: read from file
- **Complexity**: Low (standard assignment)
### Generated: problem.cuh
```cuda
#pragma once
#include "core/types.cuh"
#include "core/cuda_utils.cuh"
#include "core/operators.cuh"
struct Assignment10 : ProblemBase<Assignment10, 1, 16> {
const float* d_cost;
int n;
__device__ float calc_total_cost(const Sol& sol) const {
float total = 0.0f;
const int* assign = sol.data[0];
for (int i = 0; i < n; i++)
total += d_cost[i * n + assign[i]];
return total;
}
static constexpr ObjDef OBJ_DEFS[] = {
{ObjDir::Minimize, 1.0f, 0.0f},
};
__device__ float compute_obj(int idx, const Sol& sol) const {
switch (idx) {
case 0: return calc_total_cost(sol);
default: return 0.0f;
}
}
__device__ float compute_penalty(const Sol& sol) const {
return 0.0f;
}
ProblemConfig config() const {
ProblemConfig cfg;
cfg.encoding = EncodingType::Permutation;
cfg.dim1 = 1;
cfg.dim2_default = n;
fill_obj_config(cfg);
return cfg;
}
size_t shared_mem_bytes() const {
return (size_t)n * n * sizeof(float);
}
size_t working_set_bytes() const {
return (size_t)n * n * sizeof(float);
}
__device__ void load_shared(char* smem, int tid, int bsz) {
float* sc = reinterpret_cast<float*>(smem);
int total = n * n;
for (int i = tid; i < total; i += bsz) sc[i] = d_cost[i];
d_cost = sc;
}
static Assignment10 create(const float* hc, int n) {
Assignment10 prob;
prob.n = n;
float* dc;
CUDA_CHECK(cudaMalloc(&dc, sizeof(float) * n * n));
CUDA_CHECK(cudaMemcpy(dc, hc, sizeof(float) * n * n, cudaMemcpyHostToDevice));
prob.d_cost = dc;
return prob;
}
void destroy() {
if (d_cost) { cudaFree(const_cast<float*>(d_cost)); d_cost = nullptr; }
}
};
```
### Generated: main.cu
```cuda
#include "core/solver.cuh"
#include "problem.cuh"
#include <cstdio>
#include <cstdlib>
int main() {
const int n = 10;
float cost[n * n];
FILE* f = fopen("cost_10x10.txt", "r");
if (!f) { fprintf(stderr, "Cannot open cost_10x10.txt\n"); return 1; }
for (int i = 0; i < n * n; i++) fscanf(f, "%f", &cost[i]);
fclose(f);
auto prob = Assignment10::create(cost, n);
SolverConfig scfg;
scfg.time_limit_sec = 10.0f;
scfg.use_aos = true;
scfg.verbose = true;
auto result = solve(prob, scfg);
printf("Best cost: %.2f\n", result.best_solution.objectives[0]);
printf("Assignment: ");
for (int i = 0; i < n; i++)
printf("worker %d → task %d ", i, result.best_solution.data[0][i]);
printf("\n");
prob.destroy();
return 0;
}
```
---
## Example 3: Vehicle Routing with Capacity (Medium Complexity)
### User Input
> "I have 1 depot and 30 customers. 4 trucks, each with capacity 100. Customer coordinates and demands are in `customers.csv` (columns: id, x, y, demand). Minimize total travel distance."
### Analysis
- **Decision**: assign customers to trucks and determine visit order → **Permutation**
- **RowMode**: Partition (variable-length routes)
- **D1**: next_pow2(4) = 4
- **D2**: max(next_pow2(30/4*2), 64) = 64
- **Objective**: Minimize total distance (depot → customers → depot for each truck)
- **Constraint**: each truck's total demand ≤ 100
- **Data**: CSV with coordinates → compute distance matrix
- **Complexity**: Medium (custom constraint, Partition encoding)
### Logic Summary (for user confirmation)
> "Objective: minimize total travel distance across all trucks. Each truck starts and ends at depot (id=0). Constraint: total demand per truck ≤ 100, penalty = 100 × excess. Encoding: Permutation with Partition, 4 trucks, 30 customers."
### Generated: problem.cuh
```cuda
#pragma once
#include "core/types.cuh"
#include "core/cuda_utils.cuh"
#include "core/operators.cuh"
#include <cmath>
struct VRP30 : ProblemBase<VRP30, 4, 64> {
const float* d_dist; // (n+1)×(n+1) distance matrix including depot
const float* d_demand; // n customer demands
int n; // number of customers (excluding depot)
int stride; // n+1
float capacity;
int num_vehicles;
__device__ float compute_route_dist(const int* route, int size) const {
if (size == 0) return 0.0f;
float dist = 0.0f;
int prev = 0; // depot
for (int j = 0; j < size; j++) {
int node = route[j] + 1; // customer indices are 0-based, node indices 1-based
dist += d_dist[prev * stride + node];
prev = node;
}
dist += d_dist[prev * stride + 0]; // return to depot
return dist;
}
__device__ float calc_total_distance(const Sol& sol) const {
float total = 0.0f;
for (int r = 0; r < num_vehicles; r++)
total += compute_route_dist(sol.data[r], sol.dim2_sizes[r]);
return total;
}
static constexpr ObjDef OBJ_DEFS[] = {
{ObjDir::Minimize, 1.0f, 0.0f},
};
__device__ float compute_obj(int idx, const Sol& sol) const {
switch (idx) {
case 0: return calc_total_distance(sol);
default: return 0.0f;
}
}
__device__ float compute_penalty(const Sol& sol) const {
float penalty = 0.0f;
for (int r = 0; r < num_vehicles; r++) {
float load = 0.0f;
for (int j = 0; j < sol.dim2_sizes[r]; j++)
load += d_demand[sol.data[r][j]];
if (load > capacity)
penalty += (load - capacity) * 100.0f;
}
return penalty;
}
ProblemConfig config() const {
ProblemConfig cfg;
cfg.encoding = EncodingType::Permutation;
cfg.dim1 = num_vehicles;
cfg.dim2_default = 0;
fill_obj_config(cfg);
cfg.row_mode = RowMode::Partition;
cfg.cross_row_prob = 0.3f;
cfg.total_elements = n;
return cfg;
}
size_t shared_mem_bytes() const {
return (size_t)stride * stride * sizeof(float) + (size_t)n * sizeof(float);
}
size_t working_set_bytes() const {
return (size_t)stride * stride * sizeof(float) + (size_t)n * sizeof(float);
}
__device__ void load_shared(char* smem, int tid, int bsz) {
float* sd = reinterpret_cast<float*>(smem);
int dist_size = stride * stride;
for (int i = tid; i < dist_size; i += bsz) sd[i] = d_dist[i];
d_dist = sd;
float* sdem = sd + dist_size;
for (int i = tid; i < n; i += bsz) sdem[i] = d_demand[i];
d_demand = sdem;
}
static VRP30 create(const float* h_dist, const float* h_demand,
int n, float capacity, int num_vehicles) {
VRP30 prob;
prob.n = n;
prob.stride = n + 1;
prob.capacity = capacity;
prob.num_vehicles = num_vehicles;
int nodes = n + 1;
float* dd;
CUDA_CHECK(cudaMalloc(&dd, sizeof(float) * nodes * nodes));
CUDA_CHECK(cudaMemcpy(dd, h_dist, sizeof(float) * nodes * nodes, cudaMemcpyHostToDevice));
prob.d_dist = dd;
float* ddem;
CUDA_CHECK(cudaMalloc(&ddem, sizeof(float) * n));
CUDA_CHECK(cudaMemcpy(ddem, h_demand, sizeof(float) * n, cudaMemcpyHostToDevice));
prob.d_demand = ddem;
return prob;
}
void destroy() {
if (d_dist) { cudaFree(const_cast<float*>(d_dist)); d_dist = nullptr; }
if (d_demand) { cudaFree(const_cast<float*>(d_demand)); d_demand = nullptr; }
}
};
```
### Generated: main.cu
```cuda
#include "core/solver.cuh"
#include "problem.cuh"
#include <cstdio>
#include <cmath>
int main() {
const int n = 30;
const int num_vehicles = 4;
const float capacity = 100.0f;
float x[n + 1], y[n + 1], demand[n];
FILE* f = fopen("customers.csv", "r");
if (!f) { fprintf(stderr, "Cannot open customers.csv\n"); return 1; }
char header[256];
fgets(header, sizeof(header), f); // skip header
// Read depot (id=0)
int id;
fscanf(f, "%d,%f,%f,%*f", &id, &x[0], &y[0]); // depot has no demand
// Read customers
for (int i = 0; i < n; i++) {
fscanf(f, "%d,%f,%f,%f", &id, &x[i + 1], &y[i + 1], &demand[i]);
}
fclose(f);
// Compute distance matrix
  const int nodes = n + 1; // const so the stack array below has a constant size
float dist[nodes * nodes];
for (int i = 0; i < nodes; i++)
for (int j = 0; j < nodes; j++) {
float dx = x[i] - x[j], dy = y[i] - y[j];
dist[i * nodes + j] = sqrtf(dx * dx + dy * dy);
}
auto prob = VRP30::create(dist, demand, n, capacity, num_vehicles);
SolverConfig scfg;
scfg.time_limit_sec = 30.0f;
scfg.use_aos = true;
scfg.verbose = true;
auto result = solve(prob, scfg);
printf("Best distance: %.2f\n", result.best_solution.objectives[0]);
printf("Penalty: %.2f\n", result.best_solution.penalty);
for (int r = 0; r < num_vehicles; r++) {
printf("Truck %d: depot", r);
for (int j = 0; j < result.best_solution.dim2_sizes[r]; j++)
printf(" → %d", result.best_solution.data[r][j] + 1);
printf(" → depot\n");
}
prob.destroy();
return 0;
}
```
---
## Example 4: Graph Coloring (Low Complexity)
### User Input
> "Color a graph with 20 nodes using at most 4 colors. Edges: (0,1),(0,2),(1,3),(2,3),(3,4),... Minimize the number of colors used, with no two adjacent nodes sharing a color."
### Analysis
- **Decision**: assign a color (0–3) to each node → **Integer**
- **RowMode**: Single (D1=1)
- **D2**: next_pow2(20) = 32
- **Objective**: Minimize number of distinct colors used
- **Constraint**: adjacent nodes must have different colors
- **Complexity**: Low (standard graph coloring)
### Generated: problem.cuh
```cuda
#pragma once
#include "core/types.cuh"
#include "core/cuda_utils.cuh"
#include "core/operators.cuh"
struct GraphColor20 : ProblemBase<GraphColor20, 1, 32> {
const int* d_adj; // adjacency matrix n×n (1=edge, 0=no edge)
int n;
int max_colors;
__device__ float calc_num_colors(const Sol& sol) const {
    int used[32] = {0}; // sized to D2 so any max_colors ≤ 32 is safe
const int* colors = sol.data[0];
for (int i = 0; i < n; i++) {
int c = colors[i];
if (c >= 0 && c < max_colors) used[c] = 1;
}
float count = 0.0f;
for (int c = 0; c < max_colors; c++) count += used[c];
return count;
}
static constexpr ObjDef OBJ_DEFS[] = {
{ObjDir::Minimize, 1.0f, 0.0f},
};
__device__ float compute_obj(int idx, const Sol& sol) const {
switch (idx) {
case 0: return calc_num_colors(sol);
default: return 0.0f;
}
}
__device__ float compute_penalty(const Sol& sol) const {
float conflicts = 0.0f;
const int* colors = sol.data[0];
for (int i = 0; i < n; i++)
for (int j = i + 1; j < n; j++)
if (d_adj[i * n + j] && colors[i] == colors[j])
conflicts += 1.0f;
return conflicts * 10.0f;
}
ProblemConfig config() const {
ProblemConfig cfg;
cfg.encoding = EncodingType::Integer;
cfg.dim1 = 1;
cfg.dim2_default = n;
cfg.value_lower_bound = 0;
cfg.value_upper_bound = max_colors - 1;
fill_obj_config(cfg);
return cfg;
}
size_t shared_mem_bytes() const {
return (size_t)n * n * sizeof(int);
}
size_t working_set_bytes() const {
return (size_t)n * n * sizeof(int);
}
__device__ void load_shared(char* smem, int tid, int bsz) {
int* sa = reinterpret_cast<int*>(smem);
int total = n * n;
for (int i = tid; i < total; i += bsz) sa[i] = d_adj[i];
d_adj = sa;
}
static GraphColor20 create(const int* h_adj, int n, int max_colors) {
GraphColor20 prob;
prob.n = n;
prob.max_colors = max_colors;
int* da;
CUDA_CHECK(cudaMalloc(&da, sizeof(int) * n * n));
CUDA_CHECK(cudaMemcpy(da, h_adj, sizeof(int) * n * n, cudaMemcpyHostToDevice));
prob.d_adj = da;
return prob;
}
void destroy() {
if (d_adj) { cudaFree(const_cast<int*>(d_adj)); d_adj = nullptr; }
}
};
```
### Generated: main.cu
```cuda
#include "core/solver.cuh"
#include "problem.cuh"
#include <cstdio>
int main() {
const int n = 20;
const int max_colors = 4;
int adj[n * n] = {0};
// Define edges
int edges[][2] = {{0,1},{0,2},{1,3},{2,3},{3,4},
{4,5},{5,6},{6,7},{7,8},{8,9},
{9,10},{10,11},{11,12},{12,13},{13,14},
{14,15},{15,16},{16,17},{17,18},{18,19},
{0,19},{1,4},{2,5},{6,9},{7,10}};
int num_edges = sizeof(edges) / sizeof(edges[0]);
for (int e = 0; e < num_edges; e++) {
int u = edges[e][0], v = edges[e][1];
adj[u * n + v] = 1;
adj[v * n + u] = 1;
}
auto prob = GraphColor20::create(adj, n, max_colors);
SolverConfig scfg;
scfg.time_limit_sec = 10.0f;
scfg.use_aos = true;
scfg.verbose = true;
auto result = solve(prob, scfg);
printf("Colors used: %.0f\n", result.best_solution.objectives[0]);
printf("Conflicts (penalty): %.2f\n", result.best_solution.penalty);
printf("Coloring: ");
for (int i = 0; i < n; i++)
printf("node%d=%d ", i, result.best_solution.data[0][i]);
printf("\n");
prob.destroy();
return 0;
}
```

# ProblemBase API Reference
Complete interface specification for `ProblemBase<Derived, D1, D2>` (defined in `core/types.cuh`).
## Template Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `Derived` | struct | The concrete problem type (CRTP pattern) |
| `D1` | int | Maximum number of rows (compile-time constant, power of 2 recommended) |
| `D2` | int | Maximum columns per row (compile-time constant, power of 2 recommended) |
The base class provides:
- `using Sol = Solution<D1, D2>;` — the solution type
- `static constexpr int NUM_OBJ` — auto-derived from `Derived::OBJ_DEFS`
- `evaluate(Sol&)` — calls `compute_obj` for each objective + `compute_penalty`
- `fill_obj_config(ProblemConfig&)` — populates objective fields from `OBJ_DEFS`
- `obj_config()` — returns `ObjConfig` for the solver
## Required Interface
### 1. `OBJ_DEFS` — Objective Definitions (static constexpr)
```cuda
static constexpr ObjDef OBJ_DEFS[] = {
{ObjDir::Minimize, 1.0f, 0.0f}, // index 0
// {ObjDir::Maximize, 0.5f, 0.0f}, // index 1 (multi-objective)
};
```
Each `ObjDef`:
- `dir`: `ObjDir::Minimize` or `ObjDir::Maximize`
- `weight`: importance weight for `CompareMode::Weighted` (default mode)
- `tolerance`: tolerance for `CompareMode::Lexicographic`
Most problems have a single objective. Multi-objective (up to 4) is supported.
### 2. `compute_obj` — Objective Calculation
```cuda
__device__ float compute_obj(int idx, const Sol& sol) const;
```
- Runs on GPU (`__device__`)
- `idx` corresponds to `OBJ_DEFS[idx]`
- Use a `switch` statement dispatching to helper functions
- Access solution data via `sol.data[row][col]` and `sol.dim2_sizes[row]`
**Pattern**:
```cuda
__device__ float compute_obj(int idx, const Sol& sol) const {
switch (idx) {
case 0: return calc_total_cost(sol);
default: return 0.0f;
}
}
```
### 3. `compute_penalty` — Constraint Violation
```cuda
__device__ float compute_penalty(const Sol& sol) const;
```
- Returns `0.0f` for feasible solutions
- Returns a positive value proportional to violation magnitude for infeasible solutions
- The solver always prefers feasible solutions (penalty=0) over infeasible ones
- For multiple constraints, sum all violations
**Guidelines**:
- Scale penalty to be comparable to objective magnitude
- Example: capacity overflow → `(excess_load) * 100.0f`
- Example: vehicle count exceeded → `(excess_vehicles) * 1000.0f`
### 4. `config` — Problem Configuration
```cuda
ProblemConfig config() const;
```
Returns runtime metadata. Must set:
```cuda
ProblemConfig config() const {
ProblemConfig cfg;
cfg.encoding = EncodingType::Permutation; // or Binary, Integer
cfg.dim1 = /* actual rows used */;
cfg.dim2_default = /* actual columns */;
fill_obj_config(cfg); // auto-fills objectives from OBJ_DEFS
// Multi-row problems:
// cfg.row_mode = RowMode::Fixed; // equal-length rows
// cfg.row_mode = RowMode::Partition; // variable-length rows
// cfg.cross_row_prob = 0.3f; // cross-row operator probability
// cfg.total_elements = n; // Partition: total elements across all rows
// Integer encoding:
// cfg.value_lower_bound = 0;
// cfg.value_upper_bound = num_colors - 1;
return cfg;
}
```
### 5. `create` / `destroy` — Factory Methods
```cuda
static MyProblem create(/* host-side data */) {
MyProblem prob;
prob.n = n;
// Allocate GPU memory and copy data
float* d_ptr;
CUDA_CHECK(cudaMalloc(&d_ptr, sizeof(float) * n * n));
CUDA_CHECK(cudaMemcpy(d_ptr, h_ptr, sizeof(float) * n * n, cudaMemcpyHostToDevice));
prob.d_data = d_ptr;
return prob;
}
void destroy() {
if (d_data) { cudaFree(const_cast<float*>(d_data)); d_data = nullptr; }
}
```
**Rules**:
- All GPU memory allocated in `create()`, freed in `destroy()`
- Use `CUDA_CHECK()` for every CUDA API call
- Store both `d_` (device) and optionally `h_` (host) pointers
- `const_cast` needed in `destroy()` because pointers are `const float*`
## Optional Interface
### 6. `shared_mem_bytes` — Shared Memory Requirement
```cuda
size_t shared_mem_bytes() const;
```
- Returns total bytes of problem data to cache in shared memory
- Return the **actual** data size; the framework handles overflow:
- ≤ 48KB: fits default shared memory
  - 48KB–164KB: framework calls `cudaFuncSetAttribute` to extend (GPU-dependent)
- Too large: framework falls back to global memory automatically
- Default (from base class): returns 0
**Example** (distance matrix):
```cuda
size_t shared_mem_bytes() const {
return (size_t)n * n * sizeof(float); // report actual need
}
```
### 7. `working_set_bytes` — Global Memory Working Set
```cuda
size_t working_set_bytes() const;
```
- Returns the per-block hot data size in global memory
- Used by the framework to estimate L2 cache pressure and auto-size population
- Default: returns `shared_mem_bytes()`
- **Override when** `shared_mem_bytes()` returns 0 (data doesn't fit in shared memory) — return the actual data size so population sizing works correctly
**Example**:
```cuda
size_t working_set_bytes() const {
return (size_t)n * n * sizeof(float) + (size_t)n * sizeof(float);
}
```
### 8. `load_shared` — Load Data into Shared Memory
```cuda
__device__ void load_shared(char* smem, int tid, int bsz);
```
- Called by framework when `shared_mem_bytes() > 0`
- Copy data from global memory to shared memory using cooperative loading
- **Redirect the device pointer** to shared memory after loading
**Pattern**:
```cuda
__device__ void load_shared(char* smem, int tid, int bsz) {
float* s_data = reinterpret_cast<float*>(smem);
int total = n * n;
for (int i = tid; i < total; i += bsz)
s_data[i] = d_data[i];
d_data = s_data; // redirect pointer to shared memory
}
```
For multiple arrays, lay them out sequentially in `smem`:
```cuda
__device__ void load_shared(char* smem, int tid, int bsz) {
float* s_dist = reinterpret_cast<float*>(smem);
int dist_size = stride * stride;
for (int i = tid; i < dist_size; i += bsz) s_dist[i] = d_dist[i];
d_dist = s_dist;
float* s_demand = s_dist + dist_size;
for (int i = tid; i < n; i += bsz) s_demand[i] = d_demand[i];
d_demand = s_demand;
}
```
### 9. `heuristic_matrices` — Data for Heuristic Initialization
```cuda
int heuristic_matrices(HeuristicMatrix* out, int max_count) const;
```
- Returns host-side matrices for constructing heuristic initial solutions
- The framework sorts elements by row/column sums to generate better-than-random starting points
- Return value: number of matrices provided (0 = no heuristic init)
**Example** (distance matrix for TSP):
```cuda
int heuristic_matrices(HeuristicMatrix* out, int max_count) const {
if (max_count < 1 || !h_dist) return 0;
out[0] = {h_dist, n};
return 1;
}
```
### 10. `init_relation_matrix` — G/O Matrix for Guided Rebuild
```cuda
void init_relation_matrix(float* h_G, float* h_O, int N) const;
```
- Provides prior knowledge for the LNS guided rebuild operator
- `G[i*N+j]`: grouping tendency (symmetric, higher = more likely in same group)
- `O[i*N+j]`: ordering tendency (asymmetric, higher = i before j)
- Values in [0, 1], typically scaled from problem data (e.g., distance proximity)
- Default: does nothing (matrices stay zero, learned from search history)
## Solution Data Access
```cuda
sol.data[row][col] // element value at (row, col)
sol.dim2_sizes[row] // actual length of row (may be < D2)
sol.objectives[idx] // objective value (set by evaluate())
sol.penalty // penalty value (set by evaluate())
```
- **Permutation (Single)**: `sol.data[0][0..n-1]` contains a permutation of `0..n-1`
- **Permutation (Partition)**: `sol.data[r][0..sol.dim2_sizes[r]-1]` for each route/partition
- **Binary**: `sol.data[0][i]` is 0 or 1
- **Integer**: `sol.data[0][i]` is in `[value_lower_bound, value_upper_bound]`
## Key Types Reference
```cuda
enum class EncodingType { Permutation, Binary, Integer };
enum class RowMode { Single, Fixed, Partition };
enum class ObjDir { Minimize, Maximize };
enum class CompareMode { Weighted, Lexicographic };
struct ObjDef { ObjDir dir; float weight; float tolerance; };
struct HeuristicMatrix { const float* data; int N; };
struct ProblemConfig {
EncodingType encoding;
int dim1, dim2_default, num_objectives;
ObjDir obj_dirs[4]; float obj_weights[4];
CompareMode compare_mode;
RowMode row_mode;
float cross_row_prob;
int total_elements;
int value_lower_bound, value_upper_bound;
};
struct SolverConfig {
int pop_size; // 0 = auto
int max_gen; // max generations
float time_limit_sec; // 0 = no limit
bool use_aos; // adaptive operator selection
bool verbose;
unsigned seed;
};
```