Initial commit: cuGenOpt GPU optimization solver

L-yang-yang 2026-03-20 00:33:45 +08:00
commit fc5a0ff4af
117 changed files with 25545 additions and 0 deletions

# Encoding Selection & Dimension Guide
## Encoding Types
cuGenOpt supports three encoding types. Choose based on the nature of the decision variables.
### Permutation
**Use when**: each element appears exactly once (ordering/assignment).
| Scenario | RowMode | D1 | D2 | dim2_default | total_elements |
|----------|---------|----|----|-------------|----------------|
| TSP (n cities) | Single | 1 | next_pow2(n) | n | — |
| QAP (n facilities) | Single | 1 | next_pow2(n) | n | — |
| Assignment (n tasks) | Single | 1 | next_pow2(n) | n | — |
| JSP (m machines, j jobs) | Fixed | next_pow2(m) | next_pow2(j) | j | — |
| VRP (k vehicles, n customers) | Partition | next_pow2(k) | max(next_pow2(n/k*2), 64) | 0 | n |
| VRPTW (k vehicles, n customers) | Partition | next_pow2(k) | max(next_pow2(n/k*2), 64) | 0 | n |
**Partition specifics**:
- `dim2_default = 0` tells the framework to distribute elements across rows
- `total_elements = n` is the count of elements to distribute
- `cross_row_prob` controls how often cross-row operators fire (typically 0.2–0.4)
- Elements are customer/job indices `0..n-1`; depot/source is implicit (not in the solution)
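Taken together, a sketch of a Partition config (field names as used throughout this guide; values for a hypothetical 4-vehicle, 30-customer VRP):

```cuda
ProblemConfig config() const {
    ProblemConfig cfg;
    cfg.encoding = EncodingType::Permutation;
    cfg.row_mode = RowMode::Partition;
    cfg.dim1 = 4;              // number of vehicles (rows)
    cfg.dim2_default = 0;      // 0: framework distributes elements across rows
    cfg.total_elements = 30;   // customers to distribute
    cfg.cross_row_prob = 0.3f; // probability of cross-row operators
    fill_obj_config(cfg);
    return cfg;
}
```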
### Binary
**Use when**: each position is a yes/no decision.
| Scenario | RowMode | D1 | D2 | dim2_default |
|----------|---------|----|----|-------------|
| 0-1 Knapsack (n items) | Single | 1 | next_pow2(n) | n |
| Scheduling (n shifts) | Single | 1 | next_pow2(n) | n |
| Subset selection (n candidates) | Single | 1 | next_pow2(n) | n |
| Multi-row scheduling (m workers, n shifts) | Fixed | next_pow2(m) | next_pow2(n) | n |
**Solution values**: `sol.data[row][col]` is 0 or 1.
### Integer
**Use when**: each position takes a bounded integer value.
| Scenario | RowMode | D1 | D2 | dim2_default | lower_bound | upper_bound |
|----------|---------|----|----|-------------|-------------|-------------|
| Graph coloring (n nodes, c colors) | Single | 1 | next_pow2(n) | n | 0 | c-1 |
| Load balancing (n tasks, m machines) | Single | 1 | next_pow2(n) | n | 0 | m-1 |
| Multi-machine scheduling | Fixed | next_pow2(m) | next_pow2(j) | j | 0 | max_time |
**Solution values**: `sol.data[row][col]` is in `[value_lower_bound, value_upper_bound]`.
Set bounds in config:
```cuda
cfg.value_lower_bound = 0;
cfg.value_upper_bound = num_colors - 1;
```
## Dimension Calculation Rules
### D1 and D2 (Template Parameters)
These are **compile-time constants** and define the maximum capacity:
- Must be sufficient for the largest instance you plan to solve
- Power of 2 is recommended for memory alignment
- Larger values waste registers/memory; keep as small as possible
```
next_pow2(x):
1→1, 2→2, 3→4, 5→8, 9→16, 17→32, 33→64, 65→128, ...
```
### dim1 and dim2_default (Runtime Parameters)
Set in `config()` to the actual problem size:
- `dim1 ≤ D1`: actual number of rows used
- `dim2_default ≤ D2`: actual number of columns per row
- For Partition mode: `dim2_default = 0` (framework handles distribution)
### Choosing D2 for Partition Mode
Since rows have variable length, D2 must accommodate the longest possible row:
```
D2 = max(next_pow2(total_elements / D1 * 2), 64)
```
The `*2` factor provides headroom for unbalanced distributions.
## Shared Memory Sizing
### When to Use Shared Memory
Shared memory provides ~10x faster access than global memory. Use it when:
- Problem has a data matrix (distance, cost, weight)
- The matrix is accessed repeatedly during objective/penalty evaluation
### How to Size
Report the **actual** data size. The framework handles the rest:
```cuda
size_t shared_mem_bytes() const {
// Distance matrix + demand array
return (size_t)stride * stride * sizeof(float) + (size_t)n * sizeof(float);
}
```
The framework automatically:
1. If ≤ 48KB: uses default shared memory
2. If > 48KB and ≤ max_smem: calls `cudaFuncSetAttribute` to extend (GPU-dependent max: T4=64KB, V100=96KB, A100/A800=164KB, H100=228KB)
3. If > max_smem: falls back to global memory, uses `working_set_bytes()` for L2 cache population sizing
### working_set_bytes
Always return the actual data size, regardless of whether it fits in shared memory:
```cuda
size_t working_set_bytes() const {
return (size_t)n * n * sizeof(float);
}
```
This is used by the framework to auto-calculate population size based on L2 cache capacity.
## RowMode Details
### Single (default)
- `dim1 = 1`, single row of elements
- No cross-row operators
- Simplest and most common
### Fixed
- `dim1 > 1`, all rows have the same length (`dim2_default`)
- Cross-row operators: ROW_SWAP, ROW_REVERSE
- No SPLIT/MERGE (rows cannot change length)
- Use for: JSP (machines × jobs), multi-worker scheduling
### Partition
- `dim1 > 1`, rows have variable length
- Elements are distributed across rows (total count = `total_elements`)
- Cross-row operators: CROSS_RELOCATE, CROSS_SWAP, SEG_RELOCATE, SEG_SWAP, CROSS_EXCHANGE, SPLIT, MERGE
- `cross_row_prob` controls the probability of selecting cross-row operators
- Use for: VRP (vehicles × customers), any partitioning problem
## Quick Reference: Problem → Config
| Problem | Encoding | RowMode | D1 | D2 | cross_row_prob |
|---------|----------|---------|----|----|---------------|
| TSP-50 | Perm | Single | 1 | 64 | 0 |
| TSP-500 | Perm | Single | 1 | 512 | 0 |
| QAP-15 | Perm | Single | 1 | 16 | 0 |
| Assignment-12 | Perm | Single | 1 | 16 | 0 |
| VRP-30-4v | Perm | Partition | 4 | 64 | 0.3 |
| VRPTW-100-25v | Perm | Partition | 32 | 64 | 0.3 |
| Knapsack-100 | Binary | Single | 1 | 128 | 0 |
| Scheduling-20 | Binary | Single | 1 | 32 | 0 |
| Graph Color-50 | Integer | Single | 1 | 64 | 0 |
| JSP-6m-6j | Perm | Fixed | 8 | 8 | 0.2 |

# End-to-End Examples
Four complete examples from natural language description to generated code.
---
## Example 1: 0-1 Knapsack (Low Complexity)
### User Input
> "I have 8 items with weights [2,3,4,5,9,7,8,6] and values [3,4,5,8,10,7,9,6]. Knapsack capacity is 20. Maximize total value."
### Analysis
- **Decision**: select or not → **Binary**
- **RowMode**: Single (D1=1)
- **D2**: next_pow2(8) = 8
- **Objective**: Maximize total value
- **Constraint**: total weight ≤ 20
- **Complexity**: Low (standard knapsack, direct reference)
### Generated: problem.cuh
```cuda
#pragma once
#include "core/types.cuh"
#include "core/cuda_utils.cuh"
#include "core/operators.cuh"
struct Knapsack8 : ProblemBase<Knapsack8, 1, 8> {
const float* d_weights;
const float* d_values;
float capacity;
int n;
__device__ float calc_total_value(const Sol& sol) const {
float tv = 0.0f;
const int* sel = sol.data[0];
for (int i = 0; i < n; i++)
if (sel[i]) tv += d_values[i];
return tv;
}
static constexpr ObjDef OBJ_DEFS[] = {
{ObjDir::Maximize, 1.0f, 0.0f},
};
__device__ float compute_obj(int idx, const Sol& sol) const {
switch (idx) {
case 0: return calc_total_value(sol);
default: return 0.0f;
}
}
__device__ float compute_penalty(const Sol& sol) const {
float tw = 0.0f;
const int* sel = sol.data[0];
for (int i = 0; i < n; i++)
if (sel[i]) tw += d_weights[i];
float over = tw - capacity;
return (over > 0.0f) ? over * 50.0f : 0.0f;
}
ProblemConfig config() const {
ProblemConfig cfg;
cfg.encoding = EncodingType::Binary;
cfg.dim1 = 1;
cfg.dim2_default = n;
fill_obj_config(cfg);
return cfg;
}
size_t shared_mem_bytes() const {
return 2 * (size_t)n * sizeof(float);
}
__device__ void load_shared(char* smem, int tid, int bsz) {
float* sw = reinterpret_cast<float*>(smem);
float* sv = sw + n;
for (int i = tid; i < n; i += bsz) {
sw[i] = d_weights[i];
sv[i] = d_values[i];
}
d_weights = sw;
d_values = sv;
}
static Knapsack8 create(const float* hw, const float* hv, int n, float cap) {
Knapsack8 prob;
prob.n = n;
prob.capacity = cap;
float *dw, *dv;
CUDA_CHECK(cudaMalloc(&dw, sizeof(float) * n));
CUDA_CHECK(cudaMalloc(&dv, sizeof(float) * n));
CUDA_CHECK(cudaMemcpy(dw, hw, sizeof(float) * n, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(dv, hv, sizeof(float) * n, cudaMemcpyHostToDevice));
prob.d_weights = dw;
prob.d_values = dv;
return prob;
}
void destroy() {
if (d_weights) cudaFree(const_cast<float*>(d_weights));
if (d_values) cudaFree(const_cast<float*>(d_values));
d_weights = nullptr;
d_values = nullptr;
}
};
```
### Generated: main.cu
```cuda
#include "core/solver.cuh"
#include "problem.cuh"
#include <cstdio>
int main() {
const int n = 8;
float weights[] = {2, 3, 4, 5, 9, 7, 8, 6};
float values[] = {3, 4, 5, 8, 10, 7, 9, 6};
float capacity = 20.0f;
auto prob = Knapsack8::create(weights, values, n, capacity);
SolverConfig scfg;
scfg.time_limit_sec = 5.0f;
scfg.use_aos = true;
scfg.verbose = true;
auto result = solve(prob, scfg);
printf("Best value: %.2f\n", result.best_solution.objectives[0]);
printf("Penalty: %.2f\n", result.best_solution.penalty);
printf("Selected items: ");
for (int i = 0; i < n; i++)
if (result.best_solution.data[0][i]) printf("%d ", i);
printf("\n");
prob.destroy();
return 0;
}
```
---
## Example 2: Assignment Problem (Low Complexity)
### User Input
> "Assign 10 workers to 10 tasks. Cost matrix is in a file `cost_10x10.txt`. Minimize total cost."
### Analysis
- **Decision**: assign each worker to a unique task → **Permutation**
- **RowMode**: Single (D1=1)
- **D2**: next_pow2(10) = 16
- **Objective**: Minimize total cost
- **Constraint**: none (permutation encoding guarantees one-to-one)
- **Data**: read from file
- **Complexity**: Low (standard assignment)
### Generated: problem.cuh
```cuda
#pragma once
#include "core/types.cuh"
#include "core/cuda_utils.cuh"
#include "core/operators.cuh"
struct Assignment10 : ProblemBase<Assignment10, 1, 16> {
const float* d_cost;
int n;
__device__ float calc_total_cost(const Sol& sol) const {
float total = 0.0f;
const int* assign = sol.data[0];
for (int i = 0; i < n; i++)
total += d_cost[i * n + assign[i]];
return total;
}
static constexpr ObjDef OBJ_DEFS[] = {
{ObjDir::Minimize, 1.0f, 0.0f},
};
__device__ float compute_obj(int idx, const Sol& sol) const {
switch (idx) {
case 0: return calc_total_cost(sol);
default: return 0.0f;
}
}
__device__ float compute_penalty(const Sol& sol) const {
return 0.0f;
}
ProblemConfig config() const {
ProblemConfig cfg;
cfg.encoding = EncodingType::Permutation;
cfg.dim1 = 1;
cfg.dim2_default = n;
fill_obj_config(cfg);
return cfg;
}
size_t shared_mem_bytes() const {
return (size_t)n * n * sizeof(float);
}
size_t working_set_bytes() const {
return (size_t)n * n * sizeof(float);
}
__device__ void load_shared(char* smem, int tid, int bsz) {
float* sc = reinterpret_cast<float*>(smem);
int total = n * n;
for (int i = tid; i < total; i += bsz) sc[i] = d_cost[i];
d_cost = sc;
}
static Assignment10 create(const float* hc, int n) {
Assignment10 prob;
prob.n = n;
float* dc;
CUDA_CHECK(cudaMalloc(&dc, sizeof(float) * n * n));
CUDA_CHECK(cudaMemcpy(dc, hc, sizeof(float) * n * n, cudaMemcpyHostToDevice));
prob.d_cost = dc;
return prob;
}
void destroy() {
if (d_cost) { cudaFree(const_cast<float*>(d_cost)); d_cost = nullptr; }
}
};
```
### Generated: main.cu
```cuda
#include "core/solver.cuh"
#include "problem.cuh"
#include <cstdio>
#include <cstdlib>
int main() {
const int n = 10;
float cost[n * n];
FILE* f = fopen("cost_10x10.txt", "r");
if (!f) { fprintf(stderr, "Cannot open cost_10x10.txt\n"); return 1; }
for (int i = 0; i < n * n; i++) fscanf(f, "%f", &cost[i]);
fclose(f);
auto prob = Assignment10::create(cost, n);
SolverConfig scfg;
scfg.time_limit_sec = 10.0f;
scfg.use_aos = true;
scfg.verbose = true;
auto result = solve(prob, scfg);
printf("Best cost: %.2f\n", result.best_solution.objectives[0]);
printf("Assignment: ");
for (int i = 0; i < n; i++)
printf("worker %d → task %d ", i, result.best_solution.data[0][i]);
printf("\n");
prob.destroy();
return 0;
}
```
---
## Example 3: Vehicle Routing with Capacity (Medium Complexity)
### User Input
> "I have 1 depot and 30 customers. 4 trucks, each with capacity 100. Customer coordinates and demands are in `customers.csv` (columns: id, x, y, demand). Minimize total travel distance."
### Analysis
- **Decision**: assign customers to trucks and determine visit order → **Permutation**
- **RowMode**: Partition (variable-length routes)
- **D1**: next_pow2(4) = 4
- **D2**: max(next_pow2(30/4*2), 64) = 64
- **Objective**: Minimize total distance (depot → customers → depot for each truck)
- **Constraint**: each truck's total demand ≤ 100
- **Data**: CSV with coordinates → compute distance matrix
- **Complexity**: Medium (custom constraint, Partition encoding)
### Logic Summary (for user confirmation)
> "Objective: minimize total travel distance across all trucks. Each truck starts and ends at depot (id=0). Constraint: total demand per truck ≤ 100, penalty = 100 × excess. Encoding: Permutation with Partition, 4 trucks, 30 customers."
### Generated: problem.cuh
```cuda
#pragma once
#include "core/types.cuh"
#include "core/cuda_utils.cuh"
#include "core/operators.cuh"
#include <cmath>
struct VRP30 : ProblemBase<VRP30, 4, 64> {
const float* d_dist; // (n+1)×(n+1) distance matrix including depot
const float* d_demand; // n customer demands
int n; // number of customers (excluding depot)
int stride; // n+1
float capacity;
int num_vehicles;
__device__ float compute_route_dist(const int* route, int size) const {
if (size == 0) return 0.0f;
float dist = 0.0f;
int prev = 0; // depot
for (int j = 0; j < size; j++) {
int node = route[j] + 1; // customer indices are 0-based, node indices 1-based
dist += d_dist[prev * stride + node];
prev = node;
}
dist += d_dist[prev * stride + 0]; // return to depot
return dist;
}
__device__ float calc_total_distance(const Sol& sol) const {
float total = 0.0f;
for (int r = 0; r < num_vehicles; r++)
total += compute_route_dist(sol.data[r], sol.dim2_sizes[r]);
return total;
}
static constexpr ObjDef OBJ_DEFS[] = {
{ObjDir::Minimize, 1.0f, 0.0f},
};
__device__ float compute_obj(int idx, const Sol& sol) const {
switch (idx) {
case 0: return calc_total_distance(sol);
default: return 0.0f;
}
}
__device__ float compute_penalty(const Sol& sol) const {
float penalty = 0.0f;
for (int r = 0; r < num_vehicles; r++) {
float load = 0.0f;
for (int j = 0; j < sol.dim2_sizes[r]; j++)
load += d_demand[sol.data[r][j]];
if (load > capacity)
penalty += (load - capacity) * 100.0f;
}
return penalty;
}
ProblemConfig config() const {
ProblemConfig cfg;
cfg.encoding = EncodingType::Permutation;
cfg.dim1 = num_vehicles;
cfg.dim2_default = 0;
fill_obj_config(cfg);
cfg.row_mode = RowMode::Partition;
cfg.cross_row_prob = 0.3f;
cfg.total_elements = n;
return cfg;
}
size_t shared_mem_bytes() const {
return (size_t)stride * stride * sizeof(float) + (size_t)n * sizeof(float);
}
size_t working_set_bytes() const {
return (size_t)stride * stride * sizeof(float) + (size_t)n * sizeof(float);
}
__device__ void load_shared(char* smem, int tid, int bsz) {
float* sd = reinterpret_cast<float*>(smem);
int dist_size = stride * stride;
for (int i = tid; i < dist_size; i += bsz) sd[i] = d_dist[i];
d_dist = sd;
float* sdem = sd + dist_size;
for (int i = tid; i < n; i += bsz) sdem[i] = d_demand[i];
d_demand = sdem;
}
static VRP30 create(const float* h_dist, const float* h_demand,
int n, float capacity, int num_vehicles) {
VRP30 prob;
prob.n = n;
prob.stride = n + 1;
prob.capacity = capacity;
prob.num_vehicles = num_vehicles;
int nodes = n + 1;
float* dd;
CUDA_CHECK(cudaMalloc(&dd, sizeof(float) * nodes * nodes));
CUDA_CHECK(cudaMemcpy(dd, h_dist, sizeof(float) * nodes * nodes, cudaMemcpyHostToDevice));
prob.d_dist = dd;
float* ddem;
CUDA_CHECK(cudaMalloc(&ddem, sizeof(float) * n));
CUDA_CHECK(cudaMemcpy(ddem, h_demand, sizeof(float) * n, cudaMemcpyHostToDevice));
prob.d_demand = ddem;
return prob;
}
void destroy() {
if (d_dist) { cudaFree(const_cast<float*>(d_dist)); d_dist = nullptr; }
if (d_demand) { cudaFree(const_cast<float*>(d_demand)); d_demand = nullptr; }
}
};
```
### Generated: main.cu
```cuda
#include "core/solver.cuh"
#include "problem.cuh"
#include <cstdio>
#include <cmath>
int main() {
const int n = 30;
const int num_vehicles = 4;
const float capacity = 100.0f;
float x[n + 1], y[n + 1], demand[n];
FILE* f = fopen("customers.csv", "r");
if (!f) { fprintf(stderr, "Cannot open customers.csv\n"); return 1; }
char header[256];
fgets(header, sizeof(header), f); // skip header
// Read depot (id=0)
int id;
fscanf(f, "%d,%f,%f,%*f", &id, &x[0], &y[0]); // depot has no demand
// Read customers
for (int i = 0; i < n; i++) {
fscanf(f, "%d,%f,%f,%f", &id, &x[i + 1], &y[i + 1], &demand[i]);
}
fclose(f);
// Compute distance matrix
  const int nodes = n + 1; // const so the stack array below has a constant size
float dist[nodes * nodes];
for (int i = 0; i < nodes; i++)
for (int j = 0; j < nodes; j++) {
float dx = x[i] - x[j], dy = y[i] - y[j];
dist[i * nodes + j] = sqrtf(dx * dx + dy * dy);
}
auto prob = VRP30::create(dist, demand, n, capacity, num_vehicles);
SolverConfig scfg;
scfg.time_limit_sec = 30.0f;
scfg.use_aos = true;
scfg.verbose = true;
auto result = solve(prob, scfg);
printf("Best distance: %.2f\n", result.best_solution.objectives[0]);
printf("Penalty: %.2f\n", result.best_solution.penalty);
for (int r = 0; r < num_vehicles; r++) {
printf("Truck %d: depot", r);
for (int j = 0; j < result.best_solution.dim2_sizes[r]; j++)
printf(" → %d", result.best_solution.data[r][j] + 1);
printf(" → depot\n");
}
prob.destroy();
return 0;
}
```
---
## Example 4: Graph Coloring (Low Complexity)
### User Input
> "Color a graph with 20 nodes using at most 4 colors. Edges: (0,1),(0,2),(1,3),(2,3),(3,4),... Minimize the number of colors used, with no two adjacent nodes sharing a color."
### Analysis
- **Decision**: assign a color (0–3) to each node → **Integer**
- **RowMode**: Single (D1=1)
- **D2**: next_pow2(20) = 32
- **Objective**: Minimize number of distinct colors used
- **Constraint**: adjacent nodes must have different colors
- **Complexity**: Low (standard graph coloring)
### Generated: problem.cuh
```cuda
#pragma once
#include "core/types.cuh"
#include "core/cuda_utils.cuh"
#include "core/operators.cuh"
struct GraphColor20 : ProblemBase<GraphColor20, 1, 32> {
const int* d_adj; // adjacency matrix n×n (1=edge, 0=no edge)
int n;
int max_colors;
__device__ float calc_num_colors(const Sol& sol) const {
    int used[32] = {0}; // sized to D2 so any max_colors ≤ 32 is safe
const int* colors = sol.data[0];
for (int i = 0; i < n; i++) {
int c = colors[i];
if (c >= 0 && c < max_colors) used[c] = 1;
}
float count = 0.0f;
for (int c = 0; c < max_colors; c++) count += used[c];
return count;
}
static constexpr ObjDef OBJ_DEFS[] = {
{ObjDir::Minimize, 1.0f, 0.0f},
};
__device__ float compute_obj(int idx, const Sol& sol) const {
switch (idx) {
case 0: return calc_num_colors(sol);
default: return 0.0f;
}
}
__device__ float compute_penalty(const Sol& sol) const {
float conflicts = 0.0f;
const int* colors = sol.data[0];
for (int i = 0; i < n; i++)
for (int j = i + 1; j < n; j++)
if (d_adj[i * n + j] && colors[i] == colors[j])
conflicts += 1.0f;
return conflicts * 10.0f;
}
ProblemConfig config() const {
ProblemConfig cfg;
cfg.encoding = EncodingType::Integer;
cfg.dim1 = 1;
cfg.dim2_default = n;
cfg.value_lower_bound = 0;
cfg.value_upper_bound = max_colors - 1;
fill_obj_config(cfg);
return cfg;
}
size_t shared_mem_bytes() const {
return (size_t)n * n * sizeof(int);
}
size_t working_set_bytes() const {
return (size_t)n * n * sizeof(int);
}
__device__ void load_shared(char* smem, int tid, int bsz) {
int* sa = reinterpret_cast<int*>(smem);
int total = n * n;
for (int i = tid; i < total; i += bsz) sa[i] = d_adj[i];
d_adj = sa;
}
static GraphColor20 create(const int* h_adj, int n, int max_colors) {
GraphColor20 prob;
prob.n = n;
prob.max_colors = max_colors;
int* da;
CUDA_CHECK(cudaMalloc(&da, sizeof(int) * n * n));
CUDA_CHECK(cudaMemcpy(da, h_adj, sizeof(int) * n * n, cudaMemcpyHostToDevice));
prob.d_adj = da;
return prob;
}
void destroy() {
if (d_adj) { cudaFree(const_cast<int*>(d_adj)); d_adj = nullptr; }
}
};
```
### Generated: main.cu
```cuda
#include "core/solver.cuh"
#include "problem.cuh"
#include <cstdio>
int main() {
const int n = 20;
const int max_colors = 4;
int adj[n * n] = {0};
// Define edges
int edges[][2] = {{0,1},{0,2},{1,3},{2,3},{3,4},
{4,5},{5,6},{6,7},{7,8},{8,9},
{9,10},{10,11},{11,12},{12,13},{13,14},
{14,15},{15,16},{16,17},{17,18},{18,19},
{0,19},{1,4},{2,5},{6,9},{7,10}};
int num_edges = sizeof(edges) / sizeof(edges[0]);
for (int e = 0; e < num_edges; e++) {
int u = edges[e][0], v = edges[e][1];
adj[u * n + v] = 1;
adj[v * n + u] = 1;
}
auto prob = GraphColor20::create(adj, n, max_colors);
SolverConfig scfg;
scfg.time_limit_sec = 10.0f;
scfg.use_aos = true;
scfg.verbose = true;
auto result = solve(prob, scfg);
printf("Colors used: %.0f\n", result.best_solution.objectives[0]);
printf("Conflicts (penalty): %.2f\n", result.best_solution.penalty);
printf("Coloring: ");
for (int i = 0; i < n; i++)
printf("node%d=%d ", i, result.best_solution.data[0][i]);
printf("\n");
prob.destroy();
return 0;
}
```

# ProblemBase API Reference
Complete interface specification for `ProblemBase<Derived, D1, D2>` (defined in `core/types.cuh`).
## Template Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `Derived` | struct | The concrete problem type (CRTP pattern) |
| `D1` | int | Maximum number of rows (compile-time constant, power of 2 recommended) |
| `D2` | int | Maximum columns per row (compile-time constant, power of 2 recommended) |
The base class provides:
- `using Sol = Solution<D1, D2>;` — the solution type
- `static constexpr int NUM_OBJ` — auto-derived from `Derived::OBJ_DEFS`
- `evaluate(Sol&)` — calls `compute_obj` for each objective + `compute_penalty`
- `fill_obj_config(ProblemConfig&)` — populates objective fields from `OBJ_DEFS`
- `obj_config()` — returns `ObjConfig` for the solver
## Required Interface
### 1. `OBJ_DEFS` — Objective Definitions (static constexpr)
```cuda
static constexpr ObjDef OBJ_DEFS[] = {
{ObjDir::Minimize, 1.0f, 0.0f}, // index 0
// {ObjDir::Maximize, 0.5f, 0.0f}, // index 1 (multi-objective)
};
```
Each `ObjDef`:
- `dir`: `ObjDir::Minimize` or `ObjDir::Maximize`
- `weight`: importance weight for `CompareMode::Weighted` (default mode)
- `tolerance`: tolerance for `CompareMode::Lexicographic`
Most problems have a single objective. Multi-objective (up to 4) is supported.
### 2. `compute_obj` — Objective Calculation
```cuda
__device__ float compute_obj(int idx, const Sol& sol) const;
```
- Runs on GPU (`__device__`)
- `idx` corresponds to `OBJ_DEFS[idx]`
- Use a `switch` statement dispatching to helper functions
- Access solution data via `sol.data[row][col]` and `sol.dim2_sizes[row]`
**Pattern**:
```cuda
__device__ float compute_obj(int idx, const Sol& sol) const {
switch (idx) {
case 0: return calc_total_cost(sol);
default: return 0.0f;
}
}
```
### 3. `compute_penalty` — Constraint Violation
```cuda
__device__ float compute_penalty(const Sol& sol) const;
```
- Returns `0.0f` for feasible solutions
- Returns a positive value proportional to violation magnitude for infeasible solutions
- The solver always prefers feasible solutions (penalty=0) over infeasible ones
- For multiple constraints, sum all violations
**Guidelines**:
- Scale penalty to be comparable to objective magnitude
- Example: capacity overflow → `(excess_load) * 100.0f`
- Example: vehicle count exceeded → `(excess_vehicles) * 1000.0f`
### 4. `config` — Problem Configuration
```cuda
ProblemConfig config() const;
```
Returns runtime metadata. Must set:
```cuda
ProblemConfig config() const {
ProblemConfig cfg;
cfg.encoding = EncodingType::Permutation; // or Binary, Integer
cfg.dim1 = /* actual rows used */;
cfg.dim2_default = /* actual columns */;
fill_obj_config(cfg); // auto-fills objectives from OBJ_DEFS
// Multi-row problems:
// cfg.row_mode = RowMode::Fixed; // equal-length rows
// cfg.row_mode = RowMode::Partition; // variable-length rows
// cfg.cross_row_prob = 0.3f; // cross-row operator probability
// cfg.total_elements = n; // Partition: total elements across all rows
// Integer encoding:
// cfg.value_lower_bound = 0;
// cfg.value_upper_bound = num_colors - 1;
return cfg;
}
```
### 5. `create` / `destroy` — Factory Methods
```cuda
static MyProblem create(/* host-side data */) {
MyProblem prob;
prob.n = n;
// Allocate GPU memory and copy data
float* d_ptr;
CUDA_CHECK(cudaMalloc(&d_ptr, sizeof(float) * n * n));
CUDA_CHECK(cudaMemcpy(d_ptr, h_ptr, sizeof(float) * n * n, cudaMemcpyHostToDevice));
prob.d_data = d_ptr;
return prob;
}
void destroy() {
if (d_data) { cudaFree(const_cast<float*>(d_data)); d_data = nullptr; }
}
```
**Rules**:
- All GPU memory allocated in `create()`, freed in `destroy()`
- Use `CUDA_CHECK()` for every CUDA API call
- Store both `d_` (device) and optionally `h_` (host) pointers
- `const_cast` needed in `destroy()` because pointers are `const float*`
## Optional Interface
### 6. `shared_mem_bytes` — Shared Memory Requirement
```cuda
size_t shared_mem_bytes() const;
```
- Returns total bytes of problem data to cache in shared memory
- Return the **actual** data size; the framework handles overflow:
- ≤ 48KB: fits default shared memory
  - 48KB–164KB: framework calls `cudaFuncSetAttribute` to extend (GPU-dependent)
- Too large: framework falls back to global memory automatically
- Default (from base class): returns 0
**Example** (distance matrix):
```cuda
size_t shared_mem_bytes() const {
return (size_t)n * n * sizeof(float); // report actual need
}
```
### 7. `working_set_bytes` — Global Memory Working Set
```cuda
size_t working_set_bytes() const;
```
- Returns the per-block hot data size in global memory
- Used by the framework to estimate L2 cache pressure and auto-size population
- Default: returns `shared_mem_bytes()`
- **Override when** `shared_mem_bytes()` returns 0 (data doesn't fit in shared memory) — return the actual data size so population sizing works correctly
**Example**:
```cuda
size_t working_set_bytes() const {
return (size_t)n * n * sizeof(float) + (size_t)n * sizeof(float);
}
```
### 8. `load_shared` — Load Data into Shared Memory
```cuda
__device__ void load_shared(char* smem, int tid, int bsz);
```
- Called by framework when `shared_mem_bytes() > 0`
- Copy data from global memory to shared memory using cooperative loading
- **Redirect the device pointer** to shared memory after loading
**Pattern**:
```cuda
__device__ void load_shared(char* smem, int tid, int bsz) {
float* s_data = reinterpret_cast<float*>(smem);
int total = n * n;
for (int i = tid; i < total; i += bsz)
s_data[i] = d_data[i];
d_data = s_data; // redirect pointer to shared memory
}
```
For multiple arrays, lay them out sequentially in `smem`:
```cuda
__device__ void load_shared(char* smem, int tid, int bsz) {
float* s_dist = reinterpret_cast<float*>(smem);
int dist_size = stride * stride;
for (int i = tid; i < dist_size; i += bsz) s_dist[i] = d_dist[i];
d_dist = s_dist;
float* s_demand = s_dist + dist_size;
for (int i = tid; i < n; i += bsz) s_demand[i] = d_demand[i];
d_demand = s_demand;
}
```
### 9. `heuristic_matrices` — Data for Heuristic Initialization
```cuda
int heuristic_matrices(HeuristicMatrix* out, int max_count) const;
```
- Returns host-side matrices for constructing heuristic initial solutions
- The framework sorts elements by row/column sums to generate better-than-random starting points
- Return value: number of matrices provided (0 = no heuristic init)
**Example** (distance matrix for TSP):
```cuda
int heuristic_matrices(HeuristicMatrix* out, int max_count) const {
if (max_count < 1 || !h_dist) return 0;
out[0] = {h_dist, n};
return 1;
}
```
### 10. `init_relation_matrix` — G/O Matrix for Guided Rebuild
```cuda
void init_relation_matrix(float* h_G, float* h_O, int N) const;
```
- Provides prior knowledge for the LNS guided rebuild operator
- `G[i*N+j]`: grouping tendency (symmetric, higher = more likely in same group)
- `O[i*N+j]`: ordering tendency (asymmetric, higher = i before j)
- Values in [0, 1], typically scaled from problem data (e.g., distance proximity)
- Default: does nothing (matrices stay zero, learned from search history)
## Solution Data Access
```cuda
sol.data[row][col] // element value at (row, col)
sol.dim2_sizes[row] // actual length of row (may be < D2)
sol.objectives[idx] // objective value (set by evaluate())
sol.penalty // penalty value (set by evaluate())
```
- **Permutation (Single)**: `sol.data[0][0..n-1]` contains a permutation of `0..n-1`
- **Permutation (Partition)**: `sol.data[r][0..sol.dim2_sizes[r]-1]` for each route/partition
- **Binary**: `sol.data[0][i]` is 0 or 1
- **Integer**: `sol.data[0][i]` is in `[value_lower_bound, value_upper_bound]`
## Key Types Reference
```cuda
enum class EncodingType { Permutation, Binary, Integer };
enum class RowMode { Single, Fixed, Partition };
enum class ObjDir { Minimize, Maximize };
enum class CompareMode { Weighted, Lexicographic };
struct ObjDef { ObjDir dir; float weight; float tolerance; };
struct HeuristicMatrix { const float* data; int N; };
struct ProblemConfig {
EncodingType encoding;
int dim1, dim2_default, num_objectives;
ObjDir obj_dirs[4]; float obj_weights[4];
CompareMode compare_mode;
RowMode row_mode;
float cross_row_prob;
int total_elements;
int value_lower_bound, value_upper_bound;
};
struct SolverConfig {
int pop_size; // 0 = auto
int max_gen; // max generations
float time_limit_sec; // 0 = no limit
bool use_aos; // adaptive operator selection
bool verbose;
unsigned seed;
};
```