cuGenOpt/README.md

# cuGenOpt

> **A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![CUDA](https://img.shields.io/badge/CUDA-11.0%2B-green.svg)](https://developer.nvidia.com/cuda-toolkit)
[![Python](https://img.shields.io/badge/Python-3.8%2B-blue.svg)](https://www.python.org/)

**Paper**: [cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization](https://arxiv.org/abs/2603.19163) 

---

## Overview

cuGenOpt is a high-performance, problem-agnostic GPU metaheuristic framework designed for combinatorial optimization. It provides:

- **Generic Solution Encodings**: Permutation, Binary, Integer, and Partition representations
- **Adaptive Operator Selection (AOS)**: Runtime weight adjustment via exponential moving average
- **Three-Layer Adaptive Architecture**: Static priors (L1) + Runtime AOS (L3) for cold-start avoidance
- **GPU Memory Hierarchy Optimization**: L2 cache-aware population sizing and adaptive shared memory management
- **Multi-GPU Support**: Independent parallel solving with automatic device management
- **Python API + CUDA C++**: High-level interface with JIT compilation for custom problems

### Key Features

| Feature | Description |
|---------|-------------|
| **12+ Problem Types** | TSP, VRP, VRPTW, Knapsack, QAP, JSP, Assignment, Graph Coloring, Bin Packing, and more |
| **Adaptive Search** | EMA-driven operator weight adjustment during runtime |
| **Problem Profiling** | Automatic initial strategy selection based on problem characteristics |
| **Memory-Aware** | Automatic population sizing based on GPU L2 cache capacity |
| **Multi-Objective** | Weighted sum and lexicographic optimization modes |
| **Cross-Platform** | Unified workflow on Linux and Windows |

---

## Quick Start

### Option 1: Python API (Recommended)

```bash
pip install cugenopt
pip install nvidia-cuda-nvcc-cu12  # If system CUDA Toolkit not available
```

**Solve Built-in Problems:**

```python
import numpy as np
import cugenopt

# Solve TSP
dist = np.random.rand(50, 50).astype(np.float32)
dist = (dist + dist.T) / 2  # Make symmetric
result = cugenopt.solve_tsp(dist, time_limit=10.0)
print(f"Best tour length: {result['best_obj']}")
print(f"Tour: {result['best_solution']}")
```

**Define Custom Problems with JIT:**

```python
result = cugenopt.solve_custom(
    compute_obj="""
        if (idx != 0) return 0.0f;
        float total = 0.0f;
        const int* route = sol.data[0];
        int size = sol.dim2_sizes[0];
        for (int i = 0; i < size; i++)
            total += d_dist[route[i] * _n + route[(i+1) % size]];
        return total;
    """,
    data={"d_dist": dist},
    encoding="permutation",
    dim2=50,
    n=50,
    time_limit=10.0
)
```

### Option 2: CUDA C++ Direct Usage

```bash
cd prototype
make tsp
./tsp
```

Define your own problem by inheriting `ProblemBase` and implementing `compute_obj` / `compute_penalty`.

---

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    Python API Layer                     │
│  (Built-in Problems + JIT Compiler for Custom Problems) │
└─────────────────────────────────────────────────────────┘
                           │
┌─────────────────────────────────────────────────────────┐
│                 Core Framework (CUDA C++)               │
│  • Adaptive Solver (L1 Priors + L3 Runtime AOS)        │
│  • Operator Registry (Swap, Reverse, Insert, LNS, ...)  │
│  • Population Management (Elite + Diversity)            │
│  • Multi-GPU Coordinator                                │
└─────────────────────────────────────────────────────────┘
                           │
┌─────────────────────────────────────────────────────────┐
│              GPU Execution Engine                       │
│  • L2 Cache-Aware Memory Management                     │
│  • Adaptive Shared Memory Allocation                    │
│  • CUDA Kernels (Population-level + Neighborhood-level) │
└─────────────────────────────────────────────────────────┘
```


## Performance Highlights

### Benchmark Results

| Problem | Instance | cuGenOpt | Best Known | Gap |
|---------|----------|----------|------------|-----|
| TSP | kroA100 | 21,282 | 21,282 | 0.00% |
| TSP | kroA200 | 29,368 | 29,368 | 0.00% |
| QAP | nug12 | 578 | 578 | **0.00%** (Optimal) |
| VRPTW | C101 | 828.94 | 828.94 | 0.00% |
| VRPTW | R101 | 1,650.80 | 1,645.79 | 0.30% |

### GPU Scalability

| GPU | Memory Bandwidth | TSP n=1000 Speedup |
|-----|------------------|-------------------|
| T4 | 300 GB/s | 1.0× (baseline) |
| V100 | 900 GB/s | 1.6× |
| A800 | 1,935 GB/s | 3.6× |

*Memory-bound workload: performance scales linearly with bandwidth.*

### Multi-GPU Effectiveness

| Problem | Single GPU | 2× GPU | 4× GPU | Improvement |
|---------|-----------|--------|--------|-------------|
| TSP n=1000 | 7,542,668 | 7,277,989 | 7,236,344 | **3.51%** |
| QAP n=100 | 1,520,516 | 1,502,084 | 1,498,404 | **1.45%** |

*With CUDA Graph enabled. Larger problems benefit more from parallel exploration.*

---

## Requirements

### Hardware
- NVIDIA GPU with Compute Capability 7.0+ (Volta or newer)
- Recommended: 8GB+ GPU memory for large-scale problems

### Software
- CUDA Toolkit 11.0+
- Python 3.8+ (for Python API)
- GCC 7.5+ or MSVC 2019+ (for C++ compilation)

---

## Installation

### Python Package
come soon
```bash
pip install cugenopt
```

### Build from Source

```bash
git clone https://github.com/L-yang-yang/cugenopt.git
cd cugenopt/python
pip install -e .
```

### CUDA C++ Only

```bash
cd prototype
make all
```


## Citation

If you use cuGenOpt in your research, please cite:

```bibtex
@article{liu2026cugenopt,
  title={cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization},
  author={Liu, Yuyang},
  journal={arXiv preprint arXiv:2603.19163},
  year={2026}
}
```

---

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

---

## Contact

**Yuyang Liu**  
Independent Researcher, Shenzhen, China  
Email: 15251858055@163.com

---

## Acknowledgments

This work was conducted as independent research. Special thanks to the open-source community for providing excellent tools and libraries that made this project possible.
-												Initial commit: cuGenOpt GPU optimization solver

											
										
										
											2026-03-20 00:33:45 +08:00
+								# cuGenOpt
 								> **A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization**
 								[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 								[![CUDA](https://img.shields.io/badge/CUDA-11.0%2B-green.svg)](https://developer.nvidia.com/cuda-toolkit)
 								[![Python](https://img.shields.io/badge/Python-3.8%2B-blue.svg)](https://www.python.org/)
-												fix: harden CUDA safety checks and translate comments to English

Safety fixes (4 critical, 4 warning) from code review:

- qap.cuh: fix clone_to_device cross-device D2H by retaining host matrices
- types.cuh: add CUDA_CHECK to InjectBuffer, track owner_gpu for safe destroy
- types.cuh: add bounds check on lexicographic priority index
- solver.cuh: cap migrate_kernel islands to MAX_ISLANDS=64 to prevent stack overflow
- multi_gpu_solver.cuh: guard against 0 GPUs, propagate stop_reason from best GPU
- types.cuh: warn on SeqRegistry overflow
- solver.cuh: warn when constraint_directed/phased_search disabled without AOS

Translate all Chinese comments to English across 25+ source files
(core/*.cuh, problems/*.cuh, Makefile, multi-GPU tests).

Verified on V100S×2 (sm_70, CUDA 12.8): e5 (12 problem types, all optimal),
e13 (multi-objective + multi-GPU, 9 configs, all passed).

											
										
										
											2026-03-25 11:52:50 +08:00
+								**Paper**: [cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization](https://arxiv.org/abs/2603.19163)
-												Initial commit: cuGenOpt GPU optimization solver

											
										
										
											2026-03-20 00:33:45 +08:00
 								---
 								## Overview
 								cuGenOpt is a high-performance, problem-agnostic GPU metaheuristic framework designed for combinatorial optimization. It provides:
 								- **Generic Solution Encodings**: Permutation, Binary, Integer, and Partition representations
 								- **Adaptive Operator Selection (AOS)**: Runtime weight adjustment via exponential moving average
 								- **Three-Layer Adaptive Architecture**: Static priors (L1) + Runtime AOS (L3) for cold-start avoidance
 								- **GPU Memory Hierarchy Optimization**: L2 cache-aware population sizing and adaptive shared memory management
 								- **Multi-GPU Support**: Independent parallel solving with automatic device management
 								- **Python API + CUDA C++**: High-level interface with JIT compilation for custom problems
 								### Key Features
 								| Feature | Description |
 								|---------|-------------|
 								| **12+ Problem Types** | TSP, VRP, VRPTW, Knapsack, QAP, JSP, Assignment, Graph Coloring, Bin Packing, and more |
 								| **Adaptive Search** | EMA-driven operator weight adjustment during runtime |
 								| **Problem Profiling** | Automatic initial strategy selection based on problem characteristics |
 								| **Memory-Aware** | Automatic population sizing based on GPU L2 cache capacity |
 								| **Multi-Objective** | Weighted sum and lexicographic optimization modes |
 								| **Cross-Platform** | Unified workflow on Linux and Windows |
 								---
 								## Quick Start
 								### Option 1: Python API (Recommended)
 								```bash
 								pip install cugenopt
 								pip install nvidia-cuda-nvcc-cu12  # If system CUDA Toolkit not available
 								```
 								**Solve Built-in Problems:**
 								```python
 								import numpy as np
 								import cugenopt
 								# Solve TSP
 								dist = np.random.rand(50, 50).astype(np.float32)
 								dist = (dist + dist.T) / 2  # Make symmetric
 								result = cugenopt.solve_tsp(dist, time_limit=10.0)
 								print(f"Best tour length: {result['best_obj']}")
 								print(f"Tour: {result['best_solution']}")
 								```
 								**Define Custom Problems with JIT:**
 								```python
 								result = cugenopt.solve_custom(
 								    compute_obj="""
 								        if (idx != 0) return 0.0f;
 								        float total = 0.0f;
 								        const int* route = sol.data[0];
 								        int size = sol.dim2_sizes[0];
 								        for (int i = 0; i < size; i++)
 								            total += d_dist[route[i] * _n + route[(i+1) % size]];
 								        return total;
 								    """,
 								    data={"d_dist": dist},
 								    encoding="permutation",
 								    dim2=50,
 								    n=50,
 								    time_limit=10.0
 								)
 								```
 								### Option 2: CUDA C++ Direct Usage
 								```bash
 								cd prototype
 								make tsp
 								./tsp
 								```
 								Define your own problem by inheriting `ProblemBase` and implementing `compute_obj` / `compute_penalty`.
 								---
 								## Architecture
 								```
 								┌─────────────────────────────────────────────────────────┐
 								│                    Python API Layer                     │
 								│  (Built-in Problems + JIT Compiler for Custom Problems) │
 								└─────────────────────────────────────────────────────────┘
 								                           │
 								┌─────────────────────────────────────────────────────────┐
 								│                 Core Framework (CUDA C++)               │
 								│  • Adaptive Solver (L1 Priors + L3 Runtime AOS)        │
 								│  • Operator Registry (Swap, Reverse, Insert, LNS, ...)  │
 								│  • Population Management (Elite + Diversity)            │
 								│  • Multi-GPU Coordinator                                │
 								└─────────────────────────────────────────────────────────┘
 								                           │
 								┌─────────────────────────────────────────────────────────┐
 								│              GPU Execution Engine                       │
 								│  • L2 Cache-Aware Memory Management                     │
 								│  • Adaptive Shared Memory Allocation                    │
 								│  • CUDA Kernels (Population-level + Neighborhood-level) │
 								└─────────────────────────────────────────────────────────┘
 								```
 								## Performance Highlights
 								### Benchmark Results
 								| Problem | Instance | cuGenOpt | Best Known | Gap |
 								|---------|----------|----------|------------|-----|
 								| TSP | kroA100 | 21,282 | 21,282 | 0.00% |
 								| TSP | kroA200 | 29,368 | 29,368 | 0.00% |
 								| QAP | nug12 | 578 | 578 | **0.00%** (Optimal) |
 								| VRPTW | C101 | 828.94 | 828.94 | 0.00% |
 								| VRPTW | R101 | 1,650.80 | 1,645.79 | 0.30% |
 								### GPU Scalability
 								| GPU | Memory Bandwidth | TSP n=1000 Speedup |
 								|-----|------------------|-------------------|
 								| T4 | 300 GB/s | 1.0× (baseline) |
 								| V100 | 900 GB/s | 1.6× |
 								| A800 | 1,935 GB/s | 3.6× |
 								*Memory-bound workload: performance scales linearly with bandwidth.*
 								### Multi-GPU Effectiveness
 								| Problem | Single GPU | 2× GPU | 4× GPU | Improvement |
 								|---------|-----------|--------|--------|-------------|
 								| TSP n=1000 | 7,542,668 | 7,277,989 | 7,236,344 | **3.51%** |
 								| QAP n=100 | 1,520,516 | 1,502,084 | 1,498,404 | **1.45%** |
 								*With CUDA Graph enabled. Larger problems benefit more from parallel exploration.*
 								---
 								## Requirements
 								### Hardware
 								- NVIDIA GPU with Compute Capability 7.0+ (Volta or newer)
 								- Recommended: 8GB+ GPU memory for large-scale problems
 								### Software
 								- CUDA Toolkit 11.0+
 								- Python 3.8+ (for Python API)
 								- GCC 7.5+ or MSVC 2019+ (for C++ compilation)
 								---
 								## Installation
 								### Python Package
-												fix: harden CUDA safety checks and translate comments to English

Safety fixes (4 critical, 4 warning) from code review:

- qap.cuh: fix clone_to_device cross-device D2H by retaining host matrices
- types.cuh: add CUDA_CHECK to InjectBuffer, track owner_gpu for safe destroy
- types.cuh: add bounds check on lexicographic priority index
- solver.cuh: cap migrate_kernel islands to MAX_ISLANDS=64 to prevent stack overflow
- multi_gpu_solver.cuh: guard against 0 GPUs, propagate stop_reason from best GPU
- types.cuh: warn on SeqRegistry overflow
- solver.cuh: warn when constraint_directed/phased_search disabled without AOS

Translate all Chinese comments to English across 25+ source files
(core/*.cuh, problems/*.cuh, Makefile, multi-GPU tests).

Verified on V100S×2 (sm_70, CUDA 12.8): e5 (12 problem types, all optimal),
e13 (multi-objective + multi-GPU, 9 configs, all passed).

											
										
										
											2026-03-25 11:52:50 +08:00
+								come soon
-												Initial commit: cuGenOpt GPU optimization solver

											
										
										
											2026-03-20 00:33:45 +08:00
+								```bash
 								pip install cugenopt
 								```
 								### Build from Source
 								```bash
 								git clone https://github.com/L-yang-yang/cugenopt.git
 								cd cugenopt/python
 								pip install -e .
 								```
 								### CUDA C++ Only
 								```bash
 								cd prototype
 								make all
 								```
 								## Citation
 								If you use cuGenOpt in your research, please cite:
 								```bibtex
 								@article{liu2026cugenopt,
 								  title={cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization},
 								  author={Liu, Yuyang},
-												update arvix url

											
										
										
											2026-03-20 12:55:02 +08:00
+								  journal={arXiv preprint arXiv:2603.19163},
-												Initial commit: cuGenOpt GPU optimization solver

											
										
										
											2026-03-20 00:33:45 +08:00
+								  year={2026}
 								}
 								```
 								---
 								## License
 								This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
 								---
 								## Contributing
 								Contributions are welcome! Please feel free to submit a Pull Request.
 								---
 								## Contact
 								**Yuyang Liu**
 								Independent Researcher, Shenzhen, China
 								Email: 15251858055@163.com
 								---
 								## Acknowledgments
 								This work was conducted as independent research. Special thanks to the open-source community for providing excellent tools and libraries that made this project possible.