Scaling PDLP to Distributed Multi-GPU Systems
Solve massive-scale linear programming problems with billions of variables across multiple GPUs. Achieve near-linear speedup with full FP64 numerical accuracy.
We present a distributed framework based on the Primal-Dual Hybrid Gradient (PDHG) algorithm for solving massive-scale linear programming (LP) problems. Although PDHG-based solvers demonstrate strong performance on single-node GPU architectures, their applicability to industrial-scale instances is often limited by single-GPU computational throughput.
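For context, the computational core that D-PDLP distributes is the PDHG update, which is dominated by matrix-vector products with the constraint matrix and its transpose. The sketch below is our own dense, single-device illustration of that update for an LP written as min cᵀx subject to Ax = b, x ≥ 0; the function name, row-major storage, and fixed step sizes `tau`/`sigma` are assumptions rather than the solver's actual code, and PDLP-style solvers additionally use sparse matrices, adaptive step sizes, and restarts.

```c
/* One PDHG iteration for  min c^T x  s.t.  A x = b, x >= 0  (a hedged,
 * dense illustration; the real solver uses sparse data, adaptive step
 * sizes, and restarts). Plain PDHG converges when tau*sigma*||A||^2 <= 1. */
#include <stddef.h>

static void pdhg_step(size_t m, size_t n, const double *A, /* m x n, row-major */
                      const double *b, const double *c,
                      double *x, double *y,   /* current iterates, updated in place */
                      double *x_new,          /* scratch buffer of length n */
                      double tau, double sigma) {
    /* Primal step: x+ = max(0, x - tau * (c - A^T y)). */
    for (size_t j = 0; j < n; ++j) {
        double aty = 0.0;
        for (size_t i = 0; i < m; ++i) aty += A[i * n + j] * y[i];
        double v = x[j] - tau * (c[j] - aty);
        x_new[j] = v > 0.0 ? v : 0.0;
    }
    /* Dual step with extrapolation: y+ = y + sigma * (b - A(2x+ - x)). */
    for (size_t i = 0; i < m; ++i) {
        double ax = 0.0;
        for (size_t j = 0; j < n; ++j) ax += A[i * n + j] * (2.0 * x_new[j] - x[j]);
        y[i] += sigma * (b[i] - ax);
    }
    for (size_t j = 0; j < n; ++j) x[j] = x_new[j];
}
```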
To overcome this limitation, we propose D-PDLP, the first distributed PDLP framework, which extends PDHG to a multi-GPU setting via a practical two-dimensional grid partitioning of the constraint matrix. We introduce a block-wise random permutation strategy combined with nonzero-aware matrix partitioning to improve load balance and computational efficiency.
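To make the nonzero-aware partitioning concrete, here is a minimal sketch (our illustration, with an assumed function name and a 1D row-only view) that chooses block boundaries by balancing nonzero counts rather than row counts; the same idea extends to both dimensions of a 2D grid.

```c
/* A hedged sketch of nonzero-aware partitioning along one dimension:
 * split m rows into P consecutive blocks so that each block holds
 * roughly total_nnz / P nonzeros. The function name and the 1D view are
 * illustrative; a 2D grid scheme can apply the same balancing to both
 * the row and column dimensions. */
#include <stddef.h>

/* row_nnz[i] = nonzeros in row i; on return, block p owns rows
 * [row_start[p], row_start[p+1]) with row_start of length P + 1. */
static void split_rows_by_nnz(size_t m, const size_t *row_nnz,
                              size_t P, size_t *row_start) {
    size_t total = 0;
    for (size_t i = 0; i < m; ++i) total += row_nnz[i];

    row_start[0] = 0;
    size_t seen = 0, p = 1;
    for (size_t i = 0; i < m && p < P; ++i) {
        seen += row_nnz[i];
        /* Place the p-th cut once the running count reaches p/P of the total. */
        if (seen * P >= total * p) row_start[p++] = i + 1;
    }
    while (p <= P) row_start[p++] = m;   /* remaining cuts and the end marker */
}
```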
Extensive experiments on standard LP benchmarks (MIPLIB and Mittelmann) as well as huge-scale real-world datasets show that our distributed implementation achieves strong scalability and high performance while preserving full FP64 numerical accuracy.
The first distributed PDHG-based solver that scales across multiple GPUs via efficient 2D grid partitioning.
Practical decomposition of the constraint matrix across GPU clusters with minimal communication overhead.
Novel random shuffling strategy that balances load while preserving local structure for efficient SpMV.
Both Julia (prototyping) and C/CUDA (production) implementations are available; the C/CUDA version is open source.
Achieve up to 6× speedup on 8 GPUs with strong scalability on computationally intensive problems.
Maintains full double-precision numerical accuracy while delivering massive performance gains.
The constraint matrix is partitioned into blocks distributed across a 2D device mesh. Each device stores local blocks and executes partial SpMV operations before global synchronization.
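A minimal sketch of that pattern, assuming one MPI process per device and one dense tile per rank; the tile layout, function name, and CPU/MPI setting are illustrative assumptions, while the actual solver keeps sparse tiles resident on the GPUs.

```c
/* A hedged MPI/CPU sketch of the 2D-mesh SpMV pattern: rank (r, c) on a
 * Pr x Pc mesh owns one tile A_rc (dense here for brevity), multiplies it
 * by its local slice of x, and the partial products are summed across the
 * mesh row. */
#include <mpi.h>
#include <stdlib.h>

/* After the call, every rank in the mesh row holds y_local = sum_c A_rc * x_c. */
static void mesh_spmv(const double *A_rc, const double *x_local,
                      double *y_local, int m_local, int n_local,
                      MPI_Comm row_comm) {
    double *partial = malloc((size_t)m_local * sizeof *partial);
    for (int i = 0; i < m_local; ++i) {
        double s = 0.0;
        for (int j = 0; j < n_local; ++j) s += A_rc[i * n_local + j] * x_local[j];
        partial[i] = s;
    }
    /* The "global synchronization" step: one allreduce per mesh row. */
    MPI_Allreduce(partial, y_local, m_local, MPI_DOUBLE, MPI_SUM, row_comm);
    free(partial);
}

/* Building the row communicator from a flat rank on a Pr x Pc mesh:
 *   int rank;  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 *   int r = rank / Pc, c = rank % Pc;
 *   MPI_Comm row_comm;
 *   MPI_Comm_split(MPI_COMM_WORLD, r, c, &row_comm);
 * The transposed product A^T y follows the same pattern with the roles of
 * the row and column communicators exchanged. */
```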
Block-wise random permutation balances workload while preserving dense micro-structures, unlike naive approaches that cause imbalance or destroy locality.
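A minimal sketch of the idea, assuming fixed-size row blocks and a simple Fisher-Yates shuffle over block indices; the block size, RNG, and row-only view are our assumptions, not the paper's exact scheme.

```c
/* A hedged sketch of block-wise row permutation: shuffle fixed-size row
 * blocks rather than individual rows, so rows inside a block stay
 * contiguous (preserving dense micro-structure) while the block order is
 * randomized for load balance. Seed with srand() for reproducibility. */
#include <stddef.h>
#include <stdlib.h>

/* On return, perm[new_row] = old_row for m rows grouped into blocks of bs. */
static void blockwise_permutation(size_t m, size_t bs, size_t *perm) {
    size_t nb = (m + bs - 1) / bs;        /* number of row blocks */
    size_t *blocks = malloc(nb * sizeof *blocks);
    for (size_t b = 0; b < nb; ++b) blocks[b] = b;

    /* Fisher-Yates shuffle over block indices only. */
    for (size_t b = nb; b > 1; --b) {
        size_t j = (size_t)rand() % b;
        size_t t = blocks[b - 1]; blocks[b - 1] = blocks[j]; blocks[j] = t;
    }

    /* Expand the shuffled block order into a full row permutation. */
    size_t pos = 0;
    for (size_t b = 0; b < nb; ++b) {
        size_t lo = blocks[b] * bs;
        size_t hi = lo + bs < m ? lo + bs : m;   /* last block may be short */
        for (size_t r = lo; r < hi; ++r) perm[pos++] = r;
    }
    free(blocks);
}
```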
Running time comparison on massive instances
| Instance | Problem class | 4 GPUs (Julia) | 4 GPUs (C) | 8 GPUs (Julia) | 8 GPUs (C) |
|---|---|---|---|---|---|
| zib03 | Koch | 336 s | 283 s | 245 s | 205 s |
| rand-10m | PageRank | 5 s | 3 s | 4 s | 2 s |
| mcf_2500 | MCF | 935 s | 836 s | 504 s | 435 s |
| wil50 | QAP | 31 s | 29 s | 25 s | 22 s |
| tai50b | QAP | 52 s | 43 s | 43 s | 35 s |
| ds1 | Unit Commitment | 143 s | 42 s | 122 s | 31 s |
| ds2 | Unit Commitment | 691 s | 397 s | 591 s | 288 s |
| SGM10 | Average | 101.2 s | 86.1 s (1.17×) | 77.7 s | 62.2 s (1.25×) |

SGM10 denotes the shifted geometric mean of running times; the 1.17× and 1.25× figures are the speedup of the C/CUDA implementation over the Julia implementation on 4 and 8 GPUs, respectively.
Get started with D-PDLP today and experience massive speedups on your large-scale optimization problems.
@misc{li2026dpdlpscalingpdlpdistributed,
title={D-PDLP: Scaling PDLP to Distributed Multi-GPU Systems},
author={Hongpei Li and Yicheng Huang and Huikang Liu and Dongdong Ge and Yinyu Ye},
year={2026},
eprint={2601.07628},
archivePrefix={arXiv},
primaryClass={math.OC},
url={https://arxiv.org/abs/2601.07628},
}