11 KiB
| name | description |
|---|---|
| cugenopt-problem-gen | Generate cuGenOpt GPU-accelerated optimization Problem definitions from natural language descriptions. Use when the user describes a combinatorial optimization problem (TSP, VRP, knapsack, scheduling, assignment, graph coloring, bin packing, etc.) and wants to solve it with cuGenOpt, or asks to generate, create, model, or run an optimization problem on GPU. |
cuGenOpt Problem Generator
Generate cuGenOpt Problem definitions from natural language descriptions. Users describe their optimization problem in plain language; this Skill produces compilable .cuh + main.cu files and optionally compiles/runs them.
When to Use
Activate this Skill when the user:
- Describes a combinatorial optimization problem and wants to solve it with cuGenOpt
- Asks to "generate", "create", "write", or "define" a cuGenOpt Problem
- Asks to "solve", "run", or "try" an optimization problem on GPU
- Mentions problem types like TSP, VRP, knapsack, scheduling, assignment, bin packing, graph coloring, etc.
- Asks "how to express / model [problem] in cuGenOpt"
Prerequisites
The cuGenOpt framework source must be accessible. Locate it by searching for prototype/core/solver.cuh:
FRAMEWORK_ROOT = directory containing prototype/core/solver.cuh
If not found, ask the user for the framework location.
Workflow
Step 0: Determine Execution Depth
Infer from the user's wording how far to go:
| Signal | Intent | Depth |
|---|---|---|
| "generate" / "write" / "define" | Code only | Phase 1 → 2 |
| "solve" / "run" / "try" / "see results" | Full run | Phase 1 → 2 → 3 |
| Provides data (CSV, array, file path) | Full run | Phase 1 → 2 → 3 |
| "model" / "express" / "how to" | Code + explanation | Phase 1 → 2 |
| Ambiguous | Code, then ask | Phase 1 → 2, ask |
Principle: when in doubt, ask rather than silently consuming GPU.
Step 1: Analyze the Problem (Phase 1)
Extract from the user's description:
- Decision variables — what is being decided?
- Encoding type — use the decision tree below
- Dimensions — D1, D2, dim1, dim2_default, total_elements
- Objective(s) — minimize/maximize what?
- Constraints — what makes a solution infeasible?
- Data — what input data is needed?
Encoding Decision Tree
What are the decision variables?
├── Ordering / assignment (no repeats) → Permutation
│ ├── Single sequence (TSP, QAP, assignment) → RowMode::Single, D1=1
│ ├── Multiple equal-length sequences (JSP) → RowMode::Fixed, D1=num_machines
│ └── Variable-length partitions (VRP, VRPTW) → RowMode::Partition, D1=max_vehicles
├── Select / not-select (0-1) → Binary
│ └── Knapsack, scheduling, subset selection → RowMode::Single, D1=1
└── Bounded integer (multi-level) → Integer
├── Single row (graph coloring, load balance) → RowMode::Single, D1=1
└── Multiple equal rows (multi-machine) → RowMode::Fixed, D1=num_machines
Dimension Calculation
| Parameter | Rule |
|---|---|
D1 |
Template max rows. Single=1; VRP=max_vehicles; JSP=num_machines. Round up to power of 2 if > 1. |
D2 |
Template max columns. Single=num_elements; Partition=max(num_elements/D1*2, 64). Round up to power of 2. |
dim1 |
Actual rows at runtime. ≤ D1. |
dim2_default |
Single/Fixed: num_elements; Partition: 0 (framework allocates). |
total_elements |
Partition only: total number of elements to distribute across rows. |
Confirmation Strategy
| Complexity | Criteria | Action |
|---|---|---|
| Low | Standard problem with direct reference (TSP, knapsack, assignment) | Proceed without pause |
| Medium | Custom constraints (prioritized VRP, skill-matched scheduling) | Generate code, then summarize logic in natural language for user confirmation |
| High | Multi-objective, ambiguous description, non-standard encoding | Output problem spec first for confirmation, then generate code |
For medium/high complexity, summarize like:
"Objective: minimize total distance. Constraints: each vehicle ≤ 100 capacity, penalty = 100 × excess. Encoding: Permutation with Partition (5 vehicles, 30 customers)."
The user confirms the logic summary, not the code.
Step 2: Generate Code (Phase 2)
Generate two files in the user's chosen directory (default: a new folder under the workspace):
problem.cuh
#pragma once
#include "core/types.cuh"
#include "core/cuda_utils.cuh"
#include "core/operators.cuh"
struct MyProblem : ProblemBase<MyProblem, D1, D2> {
// GPU data pointers
const float* d_data;
int n;
// === Objective definition ===
static constexpr ObjDef OBJ_DEFS[] = {
{ObjDir::Minimize, 1.0f, 0.0f},
};
__device__ float compute_obj(int idx, const Sol& sol) const {
switch (idx) {
case 0: return /* objective calculation */;
default: return 0.0f;
}
}
__device__ float compute_penalty(const Sol& sol) const {
return /* 0.0f if no constraints, else penalty value */;
}
ProblemConfig config() const {
ProblemConfig cfg;
cfg.encoding = EncodingType::/* Permutation | Binary | Integer */;
cfg.dim1 = /* actual rows */;
cfg.dim2_default = /* actual columns */;
fill_obj_config(cfg);
// For multi-row:
// cfg.row_mode = RowMode::/* Fixed | Partition */;
// cfg.cross_row_prob = 0.3f;
// cfg.total_elements = n; // Partition only
return cfg;
}
// === Shared memory (when data fits) ===
size_t shared_mem_bytes() const {
return /* total bytes of data to cache, or 0 if too large */;
}
size_t working_set_bytes() const {
return /* actual data size in bytes (for L2 cache estimation) */;
}
__device__ void load_shared(char* smem, int tid, int bsz) {
// Copy d_data into shared memory, then redirect pointer
}
// === Factory ===
static MyProblem create(/* host data params */) {
MyProblem prob;
// cudaMalloc + cudaMemcpy for each data array
return prob;
}
void destroy() {
// cudaFree each GPU pointer
}
};
main.cu
#include "core/solver.cuh"
#include "problem.cuh"
#include <cstdio>
int main() {
// 1. Prepare data (hardcoded or read from file)
// 2. Create problem
auto prob = MyProblem::create(/* ... */);
// 3. Configure solver
SolverConfig scfg;
scfg.time_limit_sec = 30.0f;
scfg.use_aos = true;
scfg.verbose = true;
// 4. Solve
auto result = solve(prob, scfg);
// 5. Print results
printf("Best objective: %.4f\n", result.best_solution.objectives[0]);
printf("Penalty: %.4f\n", result.best_solution.penalty);
printf("Generations: %d, Time: %.1f ms\n", result.generations, result.elapsed_ms);
// 6. Print solution details
// ...
prob.destroy();
return 0;
}
Code generation rules:
- Read
reference/problem-api.mdfor the full ProblemBase interface specification - Read
reference/encoding-guide.mdfor encoding selection and dimension calculation details - Read
reference/examples.mdfor end-to-end examples of natural language → code shared_mem_bytes()should return the actual data size; the framework handles overflow automatically viacudaFuncSetAttributeworking_set_bytes()should always return the actual data size (used for population sizing)- For Partition encoding, set
cfg.dim2_default = 0andcfg.total_elements = n - Use
CUDA_CHECK()macro for all CUDA API calls - Penalty should be proportional to constraint violation magnitude
Step 3: Validate & Run (Phase 3)
Only enter this phase if execution depth requires it.
3a. Environment Detection
1. Run: nvidia-smi
├── Success → local/direct mode, go to 3b
└── Failure → no local GPU
└── Ask: "No GPU detected locally. Do you have a remote GPU server?"
├── User provides SSH info → remote mode
│ - scp files to remote
│ - Run all commands via ssh <host> "..."
├── User says "I use Cursor Remote SSH" → suggest switching to remote window
└── No GPU → deliver code only, print compile command
2. Detect compute capability:
nvidia-smi --query-gpu=compute_cap --format=csv,noheader → sm_XX
3b. Compile
nvcc -O2 -std=c++17 --extended-lambda \
-arch=sm_XX \
-I FRAMEWORK_ROOT/prototype \
-I FRAMEWORK_ROOT/prototype/core \
main.cu -o solve
The two -I paths are both needed: prototype/ for #include "core/solver.cuh" in user code, and prototype/core/ for #include "types.cuh" inside framework headers.
If nvcc not found:
- Check
ls /usr/local/cuda*/bin/nvcc— if found, fix PATH - Otherwise ask: "nvcc not found. Want me to help configure CUDA?"
- Check and guide: CUDA Toolkit install, g++ install, driver compatibility
If compilation fails: read error, fix code, retry (up to 3 attempts).
3c. Run & Verify
./solve
Verification checklist:
| Check | Method | Pass |
|---|---|---|
| No crash | exit code 0 | ✓ |
| penalty = 0 | check output | Feasible solution |
| Objective reasonable | compare to known/estimated optimum | Within expected range |
If penalty > 0: report constraint violations, suggest adjusting penalty weights or constraint logic.
3d. Remote Mode
When using a remote GPU server:
- Check if framework exists on remote:
ssh <host> "ls <path>/prototype/core/solver.cuh"- Not found →
scp -r FRAMEWORK_ROOT/prototype <host>:<remote_path>/
- Not found →
- Upload generated files:
scp problem.cuh main.cu <host>:<remote_path>/ - Compile and run via SSH:
ssh <host> "cd <remote_path> && nvcc ... && ./solve" - Capture stdout for results
Common Errors
| Error | Cause | Fix |
|---|---|---|
nvcc: command not found |
CUDA not installed or not in PATH | Trigger environment setup |
no supported gcc/g++ host compiler |
g++ version incompatible | Install compatible version |
| D2 too small | Elements exceed template parameter | Increase D2 (power of 2) |
penalty always > 0 |
Constraints too tight or penalty logic wrong | Review compute_penalty |
| Objective is 0 or NaN | Data not loaded to GPU | Check cudaMalloc/cudaMemcpy in create() |
| Very low gens/s | Data matrix too large for shared memory | Check shared_mem_bytes / working_set_bytes |
Reference Files
For detailed specifications, read these files in the reference/ directory alongside this SKILL.md:
problem-api.md— Complete ProblemBase interface: every method, its signature, when to implement it, and gotchasencoding-guide.md— Encoding type selection, dimension calculation rules, RowMode details, shared memory sizingexamples.md— 4 end-to-end examples: natural language input → analysis → generated code