First Distributed PDHG Solver

D-PDLP

Scaling PDLP to Distributed Multi-GPU Systems

Solve massive-scale linear programming problems with billions of variables across multiple GPUs. Achieve near-linear speedup with full FP64 numerical accuracy.

Up to 6× Speedup on 8 GPUs
10⁹+ Variables Supported
FP64 Full Precision

Hongpei Li¹, Yicheng Huang², Huikang Liu³, Dongdong Ge³, Yinyu Ye⁴

¹Cardinal Operations  |  ²Shanghai University of Finance and Economics  |  ³Shanghai Jiao Tong University  |  ⁴Stanford University

D-PDLP Framework

Open Source & Ready to Use

D-PDLP is freely available on GitHub. We welcome contributions, bug reports, and feature requests from the community!

Star on GitHub

Abstract

We present a distributed framework for the Primal-Dual Hybrid Gradient (PDHG) algorithm for solving massive-scale linear programming (LP) problems. Although PDHG-based solvers demonstrate strong performance on single-node GPU architectures, their applicability to industrial-scale instances is often limited by single-GPU computational throughput.

To overcome these challenges, we propose D-PDLP, the first distributed PDLP framework, which extends PDHG to a multi-GPU setting via a practical two-dimensional grid partitioning of the constraint matrix. We introduce a block-wise random permutation strategy combined with nonzero-aware matrix partitioning to improve load balance and computational efficiency.

Extensive experiments on standard LP benchmarks (MIPLIB and Mittelmann) as well as huge-scale real-world datasets show that our distributed implementation achieves strong scalability and high performance while preserving full FP64 numerical accuracy.

What Makes D-PDLP Special

2D Grid Partitioning

Practical decomposition of the constraint matrix across GPU clusters with minimal communication overhead.

Block-Wise Permutation

Novel random shuffling strategy that balances load while preserving local structure for efficient SpMV.

Open-Source Software

Both a Julia version (for prototyping) and a C/CUDA version (for production) are available; the C/CUDA version is open-sourced.

Near-Linear Speedup

Achieve up to 6× speedup on 8 GPUs with strong scalability on computationally intensive problems.

Full FP64 Accuracy

Maintains full double-precision numerical accuracy while delivering massive performance gains.

How It Works

2D Grid Partitioning

The constraint matrix is partitioned into blocks distributed across a 2D device mesh. Each device stores local blocks and executes partial SpMV operations before global synchronization.

Permutation Strategies

Smart Load Balancing

Block-wise random permutation balances workload while preserving dense micro-structures, unlike naive approaches that cause imbalance or destroy locality.

Scalability & Performance

Mega-Scale Speedups

Running time comparison on massive instances

Instance    Size           1 GPU    8 GPUs   Speedup
zib03       19M × 29M      812s     245s     3.3× faster
mcf_2500    1.5M × 126M    2943s    504s     5.8× faster
sdm_50k     5.5M × 10M     377s     61s      6.2× faster

Julia vs C Implementation

Instance    Problem            4 GPUs (Julia / C)        8 GPUs (Julia / C)
zib03       Koch               336s / 283s               245s / 205s
rand-10m    PageRank           5s / 3s                   4s / 2s
mcf_2500    MCF                935s / 836s               504s / 435s
wil50       QAP                31s / 29s                 25s / 22s
tai50b      QAP                52s / 43s                 43s / 35s
ds1         Unit Commitment    143s / 42s                122s / 31s
ds2         Unit Commitment    691s / 397s               591s / 288s
SGM10       Average            101.2s / 86.1s (1.17×)    77.7s / 62.2s (1.25×)
The C-based solver is open-sourced on GitHub.

Benchmark Coverage

MIPLIB 2017 · Mittelmann · PageRank · Multicommodity Flow · QAP Relaxations · Unit Commitment · Design Matching · zib03

Ready to Accelerate Your LP Solving?

Get started with D-PDLP today and experience massive speedups on your large-scale optimization problems.

BibTeX

@misc{li2026dpdlpscalingpdlpdistributed,
      title={D-PDLP: Scaling PDLP to Distributed Multi-GPU Systems}, 
      author={Hongpei Li and Yicheng Huang and Huikang Liu and Dongdong Ge and Yinyu Ye},
      year={2026},
      eprint={2601.07628},
      archivePrefix={arXiv},
      primaryClass={math.OC},
      url={https://arxiv.org/abs/2601.07628}, 
}