LightningRL: Breaking the Accuracy–Parallelism Trade-off of Block-wise dLLMs via Reinforcement Learning

Yanzhe Hu1,2, Yijie Jin1, Pengfei Liu1, Kai Yu1, Zhijie Deng1,†
1 Shanghai Jiao Tong University, 2 Huazhong University of Science and Technology
† Correspondence

TL;DR

We propose LightningRL, a reinforcement learning framework that breaks the accuracy–parallelism trade-off of block-wise diffusion large language models (dLLMs). LightningRL optimizes speed and generation quality simultaneously through three key modifications to GRPO: per-reward decoupled normalization, token-level NLL regularization, and TPF-aware filtering. Applied to SDAR-8B, LightningRL achieves an average TPF (tokens per forward pass) of 7.32 and an average AUP of 497.9, significantly outperforming EAGLE-3, Fast-dLLM-v2, and other leading baselines across math and code benchmarks.


Method Overview

[Figure: LightningRL overview]

For each prompt, LightningRL samples a group of decoding trajectories and applies per-reward decoupled normalization so that rewards on heterogeneous scales preserve their within-group ranking. The policy is then optimized with a GRPO-style clipped objective plus a token-level NLL anchor. The resulting update moves probability mass toward the fastest correct trajectory, improving TPF without degrading accuracy. A minimal sketch of this training signal follows below.

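The sketch below is an illustrative PyTorch reconstruction of that signal: the function names, tensor shapes, and the `clip`/`beta` coefficients are our assumptions, not the released implementation.

```python
import torch

def decoupled_advantages(acc_reward: torch.Tensor,
                         speed_reward: torch.Tensor,
                         eps: float = 1e-6) -> torch.Tensor:
    """Per-reward decoupled normalization: z-normalize each reward stream
    within the group on its own, so a binary accuracy reward and an
    unbounded speed reward (e.g., TPF) keep their within-group ranking
    instead of one scale dominating the other. Shapes: (group_size,)."""
    def znorm(r: torch.Tensor) -> torch.Tensor:
        return (r - r.mean()) / (r.std() + eps)
    return znorm(acc_reward) + znorm(speed_reward)

def grpo_nll_loss(logp_new: torch.Tensor,    # (group, seq_len) per-token log-probs
                  logp_old: torch.Tensor,    # same shape, from the sampling policy
                  adv: torch.Tensor,         # (group,) per-trajectory advantages
                  nll_anchor: torch.Tensor,  # per-token NLL on correct trajectories
                  clip: float = 0.2,
                  beta: float = 0.05) -> torch.Tensor:
    """GRPO-style clipped surrogate plus a token-level NLL anchor term."""
    ratio = (logp_new - logp_old.detach()).exp()
    adv = adv.unsqueeze(-1)  # broadcast trajectory advantage over tokens
    surrogate = torch.minimum(ratio * adv,
                              ratio.clamp(1 - clip, 1 + clip) * adv)
    return -surrogate.mean() + beta * nll_anchor.mean()
```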

Highlights

  • Breaking the Trade-off — LightningRL achieves 7.32 average TPF and 497.9 AUP, simultaneously improving both speed and accuracy
  • Three Key Innovations — decoupled normalization, token-level NLL loss, and TPF-aware filtering work together to stabilize multi-objective RL training (one reading of the filtering step is sketched after this list)
  • Strong Generalization — consistent improvements across math (GSM8K, MATH500) and code (MBPP, HumanEval) benchmarks
  • Practical Speed — 336.03 TPS on H100 GPUs, 3.2× faster than the SDAR baseline
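This page does not spell out how TPF-aware filtering works. One plausible reading, sketched below with hypothetical names, is that only verified-correct trajectories feed the speed objective, so the policy is never rewarded for being fast but wrong:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    n_tokens: int    # tokens generated
    n_forwards: int  # model forward passes used to generate them
    correct: bool    # verifier outcome for the final answer

    @property
    def tpf(self) -> float:
        # tokens per forward pass: the parallelism being optimized
        return self.n_tokens / self.n_forwards

def tpf_aware_filter(group: list[Trajectory]) -> list[Trajectory]:
    """Hypothetical TPF-aware filtering: keep only verified-correct
    trajectories for the speed reward; an empty result means the prompt
    contributes no speed signal this step."""
    return [t for t in group if t.correct]
```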

Results

Table 1. Evaluation results of LightningRL and RL methods on math and code benchmarks.

LightningRL vs RL Methods

Model                GSM8K                MATH500              MBPP                 HumanEval
                     Acc   TPF   AUP      Acc   TPF   AUP      Acc   TPF    AUP     Acc   TPF   AUP
SDAR-8B-b32          88.9  2.85  252.5    63.6  4.81  299.5    58.0  2.44   81.1    73.5  2.39  123.8
TraceRL-8B-b32       76.9  5.04  378.6    60.6  4.82  284.6    57.8  2.50   144.2   75.0  2.29  171.6
GRPO(traj)-8B-b32    86.6  4.87  414.1    61.6  5.04  301.2    56.8  2.70   152.7   76.0  2.33  176.7
LightningRL-8B-b32   90.3  5.58  492.4    63.0  6.28  407.5    58.3  11.10  641.6   72.6  6.30  450.1
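AUP is not defined on this page; in every row with TPF = 1.00 the reported AUP equals Acc, which is consistent with AUP being a per-sample accuracy-weighted parallelism score. A minimal sketch under that assumption (function names are ours, not the paper's):

```python
def tokens_per_forward(n_tokens: int, n_forwards: int) -> float:
    """TPF: generated tokens divided by model forward passes."""
    return n_tokens / n_forwards

def aup(per_sample: list[tuple[bool, float]]) -> float:
    """Assumed AUP: mean over benchmark samples of (correct * TPF),
    in percentage points, so AUP == Acc whenever TPF == 1.00.
    Each element of per_sample is (is_correct, tpf) for one sample."""
    return 100.0 * sum(ok * tpf for ok, tpf in per_sample) / len(per_sample)
```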

Table 2. Evaluation results of LightningRL and baselines on math and code benchmarks.

Full Comparison

Model                  GSM8K                MATH500              MBPP                 HumanEval
                       Acc   TPF   AUP      Acc   TPF   AUP      Acc   TPF    AUP     Acc   TPF   AUP
dLLMs
Dream                  83.9  1.00  83.9     39.6  1.00  39.6     57.2  1.00   57.2    55.2  1.00  55.2
Fast-dLLM-Dream        79.0  1.44  116.5    38.3  1.78  55.2     53.2  1.20   63.6    54.3  1.33  63.5
dParallel-Dream        82.1  3.02  245.7    38.7  2.94  77.9     55.4  2.24   108.0   54.3  2.57  98.8
d3LLM-Dream            81.4  4.94  391.3    38.2  3.92  97.5     55.6  2.96   141.4   57.1  3.20  129.5
LLaDA                  72.6  1.00  72.6     32.2  1.00  32.2     47.1  1.00   47.1    38.3  1.00  38.3
Fast-dLLM-LLaDA        74.7  2.77  205.8    30.8  1.97  47.2     38.6  2.13   56.6    37.8  2.56  54.0
D2F-LLaDA              73.2  2.88  209.7    28.7  2.38  45.5     38.0  1.94   50.0    36.6  2.69  62.0
dParallel-LLaDA        72.6  5.14  358.1    30.2  3.17  64.5     40.0  2.35   60.5    39.0  4.93  83.7
d3LLM-LLaDA            73.1  9.11  637.7    30.4  5.74  107.6    40.6  4.21   88.4    39.6  5.95  96.6
AR Models
Qwen-2.5-7B-Instruct   74.1  1.00  74.1     41.4  1.00  41.1     63.6  1.00   63.6    67.7  1.00  67.7
EAGLE-3 (LLaMA-3.1)    76.6  5.12  319.0    39.8  5.72  142.1    60.2  5.69   298.6   67.6  5.98  344.8
Block-wise dLLMs
Fast-dLLM-v2           77.5  2.21  156.0    48.7  2.61  126.7    50.1  2.04   81.9    61.7  2.58  128.9
SDAR-8B-b32            88.9  2.85  252.5    63.6  4.81  299.5    58.0  2.44   81.1    73.5  2.39  123.8
LightningRL-8B-b32     90.3  5.58  492.4    63.0  6.28  407.5    58.3  11.10  641.6   72.6  6.30  450.1

Training Dynamics & Decoding Behavior

Figure 6. Total reward curves. LightningRL converges stably while TraceRL collapses.

Figure 7. Decoding behavior. LightningRL maintains higher throughput and reduces long-tail decoding steps.


Table 6. Comparison of SDAR and LightningRL under identical settings, grouped by model scale and block size.

Scalability

Scale  BS  Model        GSM8K               MATH500             MBPP                 HumanEval
                        Acc   TPF   AUP     Acc   TPF   AUP     Acc   TPF    AUP     Acc   TPF   AUP
1.7B   32  SDAR         71.5  2.48  176.4   41.2  5.62  226.3   39.0  3.86   149.8   48.8  2.81  136.0
           LightningRL  71.7  3.40  251.4   41.2  6.01  241.2   37.3  5.07   185.8   48.2  3.43  165.2
4B     32  SDAR         86.6  3.10  243.8   53.6  5.09  263.9   54.0  1.96   105.9   59.8  3.67  216.8
           LightningRL  85.4  4.37  374.7   56.4  6.55  358.7   52.2  3.05   158.3   57.3  4.35  242.3
8B     32  SDAR         88.9  2.85  252.5   63.6  4.81  299.5   58.0  2.44   81.1    74.4  2.39  123.8
           LightningRL  90.3  5.58  492.4   63.0  6.28  407.5   58.3  11.10  641.6   72.6  6.30  450.1
8B     8   SDAR         91.0  2.96  269.1   64.1  3.53  224.8   59.6  1.72   103.3   76.9  2.77  211.6
           LightningRL  89.4  3.75  331.8   67.0  4.21  279.5   58.2  3.11   179.3   76.8  3.30  258.8
8B     4   SDAR         91.1  2.35  213.9   71.2  2.49  176.8   63.3  1.84   116.7   78.6  1.53  120.3
           LightningRL  91.0  3.21  291.6   70.3  3.42  237.7   63.2  2.47   155.5   78.2  2.32  181.3

Table 7. Tokens per second (TPS) performance comparison on H100 GPUs (GSM8K).

Wall-Clock Speed

Model                  TPS (H100)   TPF    Acc (%)
Qwen-2.5-7B-Instruct   57.3         1.00   74.1
Fast-dLLM-v2           150.0        2.21   77.5
dParallel-LLaDA        172.2        5.14   72.6
d3LLM-LLaDA            288.9        9.11   73.1
SDAR                   105.55       2.85   88.9
LightningRL            336.03       5.58   90.3
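TPS here is wall-clock throughput (generated tokens per second of decode time), as opposed to the per-step TPF above. A trivial measurement harness, with `generate` as a hypothetical stand-in for the model's batched decode call:

```python
import time

def measure_tps(generate, prompts):
    """Wall-clock tokens per second: total generated tokens divided by
    elapsed decode time. `generate(prompt)` is assumed to return the
    list of generated token ids for that prompt."""
    start = time.perf_counter()
    n_tokens = sum(len(generate(p)) for p in prompts)
    return n_tokens / (time.perf_counter() - start)
```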

BibTeX

@article{hu2026lightningrl,
  title={LightningRL: Breaking the Accuracy--Parallelism Trade-off of Block-wise dLLMs via Reinforcement Learning},
  author={Hu, Yanzhe and Jin, Yijie and Liu, Pengfei and Yu, Kai and Deng, Zhijie},
  journal={arXiv preprint},
  year={2026},
  note={Coming soon}
}