Breaking the Accuracy–Parallelism Trade-off of Block-wise dLLMs via Reinforcement Learning
We propose LightningRL, a reinforcement learning framework that breaks the accuracy–parallelism trade-off of block-wise diffusion Large Language Models (dLLMs). LightningRL optimizes both speed and generation quality simultaneously through three key modifications to GRPO: per-reward decoupled normalization, token-level NLL regularization, and TPF-aware filtering. Applied to SDAR-8B, LightningRL achieves an average TPF of 7.32 and AUP of 497.9, significantly outperforming EAGLE-3, Fast-dLLM-v2, and other leading baselines across math and code benchmarks.
LightningRL samples a group of decoding trajectories per prompt and applies per-reward decoupled normalization, which preserves the within-group ranking even when reward components live on heterogeneous scales. The policy is then optimized with a GRPO-style objective plus a token-level NLL anchor. The resulting update shifts probability mass toward the fastest correct trajectory, improving TPF without degrading accuracy.
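To make the training recipe above concrete, here is a minimal NumPy sketch of per-reward decoupled normalization and a GRPO-style clipped objective with a token-level NLL anchor. All names (`decoupled_normalize`, `grpo_nll_loss`, `nll_weight`, `clip_eps`) and the toy reward components are illustrative assumptions, not the paper's implementation; TPF-aware filtering is omitted.

```python
import numpy as np

def decoupled_normalize(rewards):
    """Per-reward decoupled normalization (sketch).

    rewards: (G, K) array for a group of G trajectories and K reward
    components (e.g., a correctness reward and a speed/TPF reward).
    Each component is z-scored within the group on its own scale, so
    no single component dominates and the within-group ranking
    survives heterogeneous reward magnitudes.
    """
    mu = rewards.mean(axis=0, keepdims=True)
    sigma = rewards.std(axis=0, keepdims=True) + 1e-6
    return ((rewards - mu) / sigma).sum(axis=1)  # (G,) advantages

def grpo_nll_loss(logp_new, logp_old, advantages,
                  nll_weight=0.1, clip_eps=0.2):
    """GRPO-style clipped surrogate plus a token-level NLL anchor.

    logp_new, logp_old: (G, T) per-token log-probs under the current
    and behavior policies; advantages: (G,). The NLL term penalizes
    low likelihood on the sampled tokens, anchoring the policy while
    the advantage term shifts mass toward fast, correct trajectories.
    """
    ratio = np.exp(logp_new - logp_old)
    adv = advantages[:, None]
    surrogate = np.minimum(ratio * adv,
                           np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    nll = -logp_new.mean()
    return -surrogate.mean() + nll_weight * nll

# Toy group: trajectory 2 is both correct and fastest, so it receives
# the largest advantage even though the two rewards differ in scale
# by orders of magnitude.
rewards = np.array([[1.0, 100.0],   # correct, slow
                    [0.0, 200.0],   # wrong, medium
                    [1.0, 300.0]])  # correct, fast
adv = decoupled_normalize(rewards)
print(int(adv.argmax()))  # → 2
```

Because each column is normalized separately, the cheap binary correctness signal is not drowned out by the much larger speed reward, which is the failure mode plain group normalization would hit.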
LightningRL vs. RL Methods

| Model | GSM8K Acc | TPF | AUP | MATH500 Acc | TPF | AUP | MBPP Acc | TPF | AUP | HumanEval Acc | TPF | AUP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SDAR-8B-b32 | 88.9 | 2.85 | 252.5 | 63.6 | 4.81 | 299.5 | 58.0 | 2.44 | 81.1 | 73.5 | 2.39 | 123.8 |
| TraceRL-8B-b32 | 76.9 | 5.04 | 378.6 | 60.6 | 4.82 | 284.6 | 57.8 | 2.50 | 144.2 | 75.0 | 2.29 | 171.6 |
| GRPO(traj)-8B-b32 | 86.6 | 4.87 | 414.1 | 61.6 | 5.04 | 301.2 | 56.8 | 2.70 | 152.7 | 76.0 | 2.33 | 176.7 |
| LightningRL-8B-b32 | 90.3 | 5.58 | 492.4 | 63.0 | 6.28 | 407.5 | 58.3 | 11.10 | 641.6 | 72.6 | 6.30 | 450.1 |
Full Comparison
| Model | GSM8K Acc | TPF | AUP | MATH500 Acc | TPF | AUP | MBPP Acc | TPF | AUP | HumanEval Acc | TPF | AUP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dLLMs | ||||||||||||
| Dream | 83.9 | 1.00 | 83.9 | 39.6 | 1.00 | 39.6 | 57.2 | 1.00 | 57.2 | 55.2 | 1.00 | 55.2 |
| Fast-dLLM-Dream | 79.0 | 1.44 | 116.5 | 38.3 | 1.78 | 55.2 | 53.2 | 1.20 | 63.6 | 54.3 | 1.33 | 63.5 |
| dParallel-Dream | 82.1 | 3.02 | 245.7 | 38.7 | 2.94 | 77.9 | 55.4 | 2.24 | 108.0 | 54.3 | 2.57 | 98.8 |
| d3LLM-Dream | 81.4 | 4.94 | 391.3 | 38.2 | 3.92 | 97.5 | 55.6 | 2.96 | 141.4 | 57.1 | 3.20 | 129.5 |
| LLaDA | 72.6 | 1.00 | 72.6 | 32.2 | 1.00 | 32.2 | 47.1 | 1.00 | 47.1 | 38.3 | 1.00 | 38.3 |
| Fast-dLLM-LLaDA | 74.7 | 2.77 | 205.8 | 30.8 | 1.97 | 47.2 | 38.6 | 2.13 | 56.6 | 37.8 | 2.56 | 54.0 |
| D2F-LLaDA | 73.2 | 2.88 | 209.7 | 28.7 | 2.38 | 45.5 | 38.0 | 1.94 | 50.0 | 36.6 | 2.69 | 62.0 |
| dParallel-LLaDA | 72.6 | 5.14 | 358.1 | 30.2 | 3.17 | 64.5 | 40.0 | 2.35 | 60.5 | 39.0 | 4.93 | 83.7 |
| d3LLM-LLaDA | 73.1 | 9.11 | 637.7 | 30.4 | 5.74 | 107.6 | 40.6 | 4.21 | 88.4 | 39.6 | 5.95 | 96.6 |
| AR Models | | | | | | | | | | | | |
| Qwen-2.5-7B-Instruct | 74.1 | 1.00 | 74.1 | 41.4 | 1.00 | 41.1 | 63.6 | 1.00 | 63.6 | 67.7 | 1.00 | 67.7 |
| EAGLE-3 (LLaMA-3.1) | 76.6 | 5.12 | 319.0 | 39.8 | 5.72 | 142.1 | 60.2 | 5.69 | 298.6 | 67.6 | 5.98 | 344.8 |
| Block-wise dLLMs | | | | | | | | | | | | |
| Fast-dLLM-v2 | 77.5 | 2.21 | 156.0 | 48.7 | 2.61 | 126.7 | 50.1 | 2.04 | 81.9 | 61.7 | 2.58 | 128.9 |
| SDAR-8B-b32 | 88.9 | 2.85 | 252.5 | 63.6 | 4.81 | 299.5 | 58.0 | 2.44 | 81.1 | 73.5 | 2.39 | 123.8 |
| LightningRL-8B-b32 | 90.3 | 5.58 | 492.4 | 63.0 | 6.28 | 407.5 | 58.3 | 11.10 | 641.6 | 72.6 | 6.30 | 450.1 |
Figure 6. Total reward curves. LightningRL converges stably while TraceRL collapses.
Figure 7. Decoding behavior. LightningRL maintains higher throughput and reduces long-tail decoding steps.
Scalability
| Scale | BS | Model | GSM8K Acc | TPF | AUP | MATH500 Acc | TPF | AUP | MBPP Acc | TPF | AUP | HumanEval Acc | TPF | AUP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.7B | 32 | SDAR | 71.5 | 2.48 | 176.4 | 41.2 | 5.62 | 226.3 | 39.0 | 3.86 | 149.8 | 48.8 | 2.81 | 136.0 |
| 1.7B | 32 | LightningRL | 71.7 | 3.40 | 251.4 | 41.2 | 6.01 | 241.2 | 37.3 | 5.07 | 185.8 | 48.2 | 3.43 | 165.2 |
| 4B | 32 | SDAR | 86.6 | 3.10 | 243.8 | 53.6 | 5.09 | 263.9 | 54.0 | 1.96 | 105.9 | 59.8 | 3.67 | 216.8 |
| 4B | 32 | LightningRL | 85.4 | 4.37 | 374.7 | 56.4 | 6.55 | 358.7 | 52.2 | 3.05 | 158.3 | 57.3 | 4.35 | 242.3 |
| 8B | 32 | SDAR | 88.9 | 2.85 | 252.5 | 63.6 | 4.81 | 299.5 | 58.0 | 2.44 | 81.1 | 74.4 | 2.39 | 123.8 |
| 8B | 32 | LightningRL | 90.3 | 5.58 | 492.4 | 63.0 | 6.28 | 407.5 | 58.3 | 11.10 | 641.6 | 72.6 | 6.30 | 450.1 |
| 8B | 8 | SDAR | 91.0 | 2.96 | 269.1 | 64.1 | 3.53 | 224.8 | 59.6 | 1.72 | 103.3 | 76.9 | 2.77 | 211.6 |
| 8B | 8 | LightningRL | 89.4 | 3.75 | 331.8 | 67.0 | 4.21 | 279.5 | 58.2 | 3.11 | 179.3 | 76.8 | 3.30 | 258.8 |
| 8B | 4 | SDAR | 91.1 | 2.35 | 213.9 | 71.2 | 2.49 | 176.8 | 63.3 | 1.84 | 116.7 | 78.6 | 1.53 | 120.3 |
| 8B | 4 | LightningRL | 91.0 | 3.21 | 291.6 | 70.3 | 3.42 | 237.7 | 63.2 | 2.47 | 155.5 | 78.2 | 2.32 | 181.3 |
Wall-Clock Speed
| Model | Tokens/s (H100) | TPF | Acc (%) |
|---|---|---|---|
| Qwen-2.5-7B-Instruct | 57.3 | 1.00 | 74.1 |
| Fast-dLLM-v2 | 150.0 | 2.21 | 77.5 |
| dParallel-LLaDA | 172.2 | 5.14 | 72.6 |
| d3LLM-LLaDA | 288.9 | 9.11 | 73.1 |
| SDAR | 105.55 | 2.85 | 88.9 |
| LightningRL | 336.03 | 5.58 | 90.3 |
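As a sanity check on the table, the throughput column converts directly into wall-clock speedups over the AR baseline. A quick sketch (model names and TPS values copied from the table above):

```python
# Wall-clock speedup relative to the AR baseline (Qwen-2.5-7B-Instruct),
# computed from the H100 tokens-per-second column above.
tps = {
    "Qwen-2.5-7B-Instruct": 57.3,
    "Fast-dLLM-v2": 150.0,
    "dParallel-LLaDA": 172.2,
    "d3LLM-LLaDA": 288.9,
    "SDAR": 105.55,
    "LightningRL": 336.03,
}
baseline = tps["Qwen-2.5-7B-Instruct"]
speedup = {model: round(v / baseline, 2) for model, v in tps.items()}
print(speedup["LightningRL"])  # → 5.86
```

That is a 5.86x end-to-end speedup over the AR baseline while also improving accuracy (90.3 vs. 74.1 on GSM8K).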
@article{hu2026lightningrl,
title={LightningRL: Breaking the Accuracy--Parallelism Trade-off of Block-wise dLLMs via Reinforcement Learning},
author={Hu, Yanzhe and Jin, Yijie and Liu, Pengfei and Yu, Kai and Deng, Zhijie},
journal={arXiv preprint},
year={2026},
note={Coming soon}
}