Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Xingyu Su*¹, Jacob Helwig*¹, Shubham Parashar*¹, Atharv Chagi¹, Lakshmi Jotsna¹, Degui Zhi², James Caverlee¹, Dileep Kalathil¹,³, Shuiwang Ji¹

* Equal contribution

¹ Department of Computer Science and Engineering, Texas A&M University
² Department of Bioinformatics and Systems Medicine, University of Texas Health Science Center at Houston
³ Department of Electrical and Computer Engineering, Texas A&M University

Preprint

Motivation

Pretraining a diffusion language model (DLM) from scratch is expensive. Recent AR-to-DLM conversion methods reduce this cost by starting from pretrained autoregressive language models (ARLMs), but they still face two key mismatches: changing the training objective can weaken knowledge inherited from the ARLM, and standard DLM training uses randomly masked states that differ from the partially denoised states encountered during inference with confidence-based samplers.

OPDLM asks whether a pretrained ARLM can be converted into a DLM through lightweight post-training that both preserves the capabilities learned during autoregressive pretraining and narrows the gap between training-time and inference-time state distributions.

A natural tool for this is on-policy distillation (OPD), which trains a student on states sampled from its own generation process while using a teacher to provide token-level supervision. But applying OPD to DLM conversion creates a chicken-and-egg problem: a DLM student visits partially masked diffusion states, so a direct OPD setup would require a capable DLM teacher to score those states.

OPDLM bypasses this requirement by using the original frozen ARLM as the teacher. The student is initialized from the same ARLM weights, converted into a block-diffusion model, and trained on its own reverse diffusion trajectories. For each partially denoised student state, OPDLM uses the terminal denoised sequence to query the ARLM teacher on causal prefixes, producing token-level target distributions for the masked positions.

This creates a self-distillation setup:

Teacher: the original frozen ARLM, queried for token-level distributions on causal prefixes of the student's generated sequence.
Student: a block-diffusion LM initialized from the same ARLM, trained to predict masked tokens under blockwise bidirectional attention.

By training directly on the student's inference-time states while distilling from the original ARLM, OPDLM reduces the train-inference gap in DLM training and improves knowledge retention during conversion.
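To make "blockwise bidirectional attention" concrete, the sketch below builds a block-causal attention mask in plain Python: tokens attend bidirectionally within their own block and causally to all earlier blocks. This is a common block-diffusion pattern and is an assumption here, since the text above does not spell out OPDLM's exact mask; `block_attention_mask` is a hypothetical helper, not a function from the OPDLM repository.

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumed rule (not confirmed by the source): position i sees every position
    in its own block (bidirectional) and in all earlier blocks (causal).
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [j // block_size <= i // block_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy example: 6 tokens split into blocks of 2.
mask = block_attention_mask(seq_len=6, block_size=2)
for row in mask:
    # "x" marks an allowed attention edge, "." a blocked one.
    print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` this reduces to the usual causal mask of the ARLM, which is one way to see why the student can be initialized from the same weights.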

How OPDLM Works

Figure: OPDLM training step framework. At each training step, the student DLM samples a reverse trajectory, a partially denoised state is selected, and masked-token predictions are aligned with frozen ARLM teacher distributions.

Roll out the student: sample a reverse unmasking trajectory from the current DLM and fixed sampler.
Select an on-policy state: choose a non-terminal partially denoised state from the realized trajectory.
Query the ARLM teacher: build causal prefixes from the terminal sequence and retrieve token-level teacher distributions.
Optimize KL: align the DLM prediction at masked positions with the frozen ARLM distribution.
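The last two steps can be sketched in plain Python under stated assumptions. In a causal language model, the logits produced after reading the prefix `terminal[:p]` describe the token at index `p`, so the teacher target for a masked position `p` is read from teacher index `p - 1`. The KL direction (teacher to student), the averaging over masked positions, and the helper names `softmax`, `kl_divergence`, and `masked_kl_loss` are illustrative assumptions; the text above only says that a KL term aligns the student with the frozen teacher at masked positions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def masked_kl_loss(teacher_next_logits, student_logits, masked_positions):
    """Average KL(teacher || student) over masked positions (assumed direction).

    teacher_next_logits[t]: teacher logits for the token at index t + 1,
        computed from the causal prefix terminal[: t + 1].
    student_logits[p]: student logits at position p of the partially
        denoised state.
    """
    losses = []
    for p in masked_positions:
        if p == 0:
            continue  # no causal prefix exists; assumes prompt tokens are never masked
        teacher = softmax(teacher_next_logits[p - 1])
        student = softmax(student_logits[p])
        losses.append(kl_divergence(teacher, student))
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: vocabulary of 3 tokens, sequence of 4 positions, positions 2 and 3 masked.
teacher_next_logits = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
student_matched = [[0.0] * 3, [0.0] * 3, [0.5, 1.5, 0.0], [0.0, 0.0, 3.0]]
student_uniform = [[0.0] * 3 for _ in range(4)]

loss_matched = masked_kl_loss(teacher_next_logits, student_matched, [2, 3])
loss_uniform = masked_kl_loss(teacher_next_logits, student_uniform, [2, 3])
print(loss_matched, loss_uniform)
```

A student whose masked-position predictions already equal the shifted teacher distributions gets a loss of zero, while an uninformed (uniform) student gets a positive loss.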

Rollout-Length Curriculum

Early in training, the terminal sequences the student generates by running its reverse diffusion process are low-quality, since the converted student is being queried with masked-token inputs and blockwise bidirectional attention for the first time. To address this, OPDLM begins by generating shorter sequences and gradually increases their length, which improves training stability and convergence.
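One simple way to realize such a curriculum is a linear ramp of the rollout length over a warmup period. The sketch below is only an illustration: the linear shape, the rounding to a multiple of the block size, and every parameter name and value (`warmup_steps`, `min_len`, `max_len`, `block_size`) are assumptions, not the schedule used in the paper.

```python
def rollout_length(step: int, warmup_steps: int, min_len: int, max_len: int, block_size: int) -> int:
    """Hypothetical linear rollout-length schedule.

    Grows from min_len at step 0 to max_len at warmup_steps, then stays at
    max_len. The result is rounded down to a multiple of block_size so that
    rollouts always consist of whole blocks (an assumption of this sketch).
    """
    if warmup_steps <= 0:
        frac = 1.0
    else:
        frac = min(max(step, 0) / warmup_steps, 1.0)
    length = int(min_len + frac * (max_len - min_len))
    return max(block_size, (length // block_size) * block_size)


# Illustrative values only.
for step in (0, 25, 50, 100, 200):
    print(step, rollout_length(step, warmup_steps=100, min_len=64, max_len=512, block_size=4))
```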

Key Highlights

Efficiency Frontier: 0.066B training tokens, 4.2e18 FLOPs (OPDLM-8B AIME-24 run from Figure 1)

Token Reduction: 15x-7,000x (compared with established AR-to-DLM baselines)

Zero-Shot Thinking: 18.6 on AIME-24 with OPDLM-8B (think@eval, without explicit thinking training)

Specialized Math: 50.0 on AIME-24 (OPDLM-MATH-8B-Thinking)

OPDLM-8B defines a new AIME-24 Pareto point at 0.066B training tokens and 4.2e18 FLOPs, using 15x to 7,000x fewer training tokens than established AR-to-DLM baselines.

Results

General-Purpose DLM Results

OPDLM converts Qwen3 into a diffusion language model for general-purpose reasoning across knowledge, mathematics, science, and code. OPDLM-4B and OPDLM-8B reach performance competitive with existing DLMs while training on only 0.076B and 0.066B tokens, orders of magnitude fewer than the baselines, and at substantially lower FLOPs.

OPDLM-4B and OPDLM-8B achieve competitive performance across general knowledge, math, and code while using only 0.066B-0.076B training tokens.

Benchmark	SDAR-4B	OPDLM-4B	LLaDA-8B	Dream-7B	SDAR-8B	Fast-dLLM-v2-7B	OPDLM-8B
Training tokens	55B	0.076B	1500B	580B	55B	1B	0.066B
FLOPs (1e18)	1320	2.4	72000	24360	2640	42	4.2
General Knowledge & Instruction Following
MMLU	74.9	65.5	65.5	67.0	78.6	66.6	70.9
MMLU-Pro	50.9	46.3	37.0	43.3	56.9	41.5	53.7
GPQA-Diamond	33.0	29.1	31.8	32.1	40.2	27.3	36.1
IFEval	56.6	53.8	59.9	62.5	61.4	65.4	50.1
CEval	62.9	66.9	-	-	70.2	70.3	73.3
LiveBench	25.3	27.8	-	-	28.6	9.5	25.8
Mathematics & Reasoning
GSM8K	89.9	87.6	78.6	81.0	91.3	83.7	87.1
MATH-500	72.8	72.8	26.6	39.2	78.6	65.6	71.2
AIME-24	10.0	14.4	2.1	0.0	10.0	10.0	14.7
AIME-25	7.5	12.6	0.4	0.0	10.0	0.0	12.4
LMB-Hard	6.9	11.1	-	-	8.9	8.9	20.0
ZebraLogic	6.3	10.5	-	-	7.8	3.5	12.9
Code Generation
HumanEval-base	76.8	56.1	35.4	57.9	82.3	63.4	59.8
MBPP-base	80.7	57.7	31.5	68.3	79.6	63.0	48.7
LCB-v6	12.6	10.4	-	-	14.5	9.7	9.7
Codeforces	4.0	5.0	-	-	5.8	5.0	3.5
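As a sanity check on the FLOPs row, the baseline entries are consistent with the common rule of thumb of roughly 6 x N x D training FLOPs for N parameters and D training tokens. The sketch below recomputes them, taking the nominal parameter counts from the model names (an assumption). The same rule gives about 1.8e18 and 3.2e18 for OPDLM-4B and OPDLM-8B, below the reported 2.4e18 and 4.2e18; the text does not say how the OPDLM figures were computed, so they are left out of the check.

```python
def approx_training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


# (nominal parameter count, training tokens, FLOPs reported in the table in units of 1e18)
baselines = {
    "SDAR-4B": (4e9, 55e9, 1320),
    "LLaDA-8B": (8e9, 1500e9, 72000),
    "Dream-7B": (7e9, 580e9, 24360),
    "SDAR-8B": (8e9, 55e9, 2640),
    "Fast-dLLM-v2-7B": (7e9, 1e9, 42),
}

estimates = {}
for name, (params, tokens, reported) in baselines.items():
    estimates[name] = approx_training_flops(params, tokens) / 1e18
    print(f"{name}: estimated {estimates[name]:.0f}e18 FLOPs, table reports {reported}e18")
```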

Zero-Shot Results

Zero-Shot Extended Thinking

Modern ARLMs can reason through a problem inside a <think>...</think> trace before committing to an answer. We never train OPDLM to do this, yet the converted DLM does it zero-shot: when prompted to think, OPDLM-8B improves on most reasoning benchmarks, raising AIME-24 from 14.7 to 18.6 and AIME-25 from 12.4 to 19.4 (LMB-Hard is the exception, dropping from 20.0 to 17.8). This suggests that much of the base ARLM's reasoning ability survives on-policy conversion, even though extended thinking was never an explicit training target.

OPDLM retains ARLM priors for zero-shot extended thinking, despite that behavior not being explicitly included in OPDLM training.

Benchmark	OPDLM-4B	OPDLM-4B think@eval	OPDLM-8B	OPDLM-8B think@eval
GSM8K	87.6	85.3	87.1	88.0
MATH-500	72.8	75.0	71.2	75.6
AIME-24	14.4	11.2	14.7	18.6
AIME-25	12.6	13.6	12.4	19.4
LMB-Hard	11.1	17.8	20.0	17.8
ZebraLogic	10.5	9.5	12.9	17.3

Multilingual Results

OPDLM keeps the multilingual ability of the base ARLM after conversion, without any multilingual-specific training. It holds performance across MMMLU-lite, INCLUDE-lite, and MLogiQA, and even improves on multilingual math (MT-AIME 2024).

OPDLM preserves multilingual ability from the base ARLM after on-policy conversion.

Benchmark	SDAR-4B	OPDLM-4B	Fast-dLLM-v2-7B	SDAR-8B	OPDLM-8B
MMMLU-lite	50.7	51.6	51.5	60.8	56.0
INCLUDE-lite	53.3	49.6	45.1	57.8	51.9
MT-AIME 2024	3.0	5.3	4.3	4.0	7.9
MLogiQA	46.5	46.5	42.6	46.3	42.0

Specialized DLM Results

Since OPDLM is a form of post-training applied to ARLMs, we can also build specialized DLMs. Below, we train OPDLM specifically for math to obtain OPDLM-MATH, using the same on-policy distillation setup. Additionally, we train OPDLM-MATH-Thinking for extended reasoning.

Without RLVR or DLM pretraining, OPDLM-MATH performs competitively with baselines and is especially strong on harder math benchmarks; thinking variants are trained as separate models for extended reasoning.

Model	GSM8K	MATH-500	AIME-24
Reference
SDAR-4B-Chat	90.2	70.2	5.0
LLaDA-8B-Instruct	82.5	37.3	0.5
Dream-7B-Instruct	72.7	38.7	0.0
SDAR-8B-Chat	91.1	74.3	11.8
4B Scale
TraDo-4B-Instruct	91.2	75.6	8.3
OPDLM-MATH-4B	83.8	75.8	10.0
OPDLM-MATH-4B-Thinking	91.7	90.2	43.3
8B Scale
TraDo-8B-Instruct	92.3	78.5	13.3
OPDLM-MATH-8B	86.2	76.6	23.3
TraDo-8B-Thinking	94.2	87.4	35.5
OPDLM-MATH-8B-Thinking	93.8	92.4	50.0

Parallelization

We show two controls on inference throughput: lowering the decoding threshold lets OPDLM produce more tokens per denoising step, while the training block size sets the upper bound on the parallelism it can expose at inference time.

MATH-500: Accuracy vs. Decoding Threshold

Threshold	Accuracy	Tokens per step
0.80	69.2	2.12
0.85	69.4	2.04
0.90	71.6	1.91
0.95	72.0	1.80
1.00	72.8	1.00

Lowering the decoding threshold increases tokens per step with an accuracy trade-off.
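The mechanism behind this trade-off can be sketched as confidence-threshold unmasking: at each denoising step, every masked position whose top-token probability exceeds the threshold is committed, and at least one position is always committed so that decoding makes progress. The strict `>` comparison, the fallback to the single most confident position, and the helper name `select_positions_to_unmask` are assumptions for illustration, not the sampler shipped in the repository.

```python
def select_positions_to_unmask(confidences: dict[int, float], threshold: float) -> list[int]:
    """Pick which masked positions to commit in one denoising step.

    confidences maps each still-masked position to the student's top-token
    probability at that position. Positions above the threshold are committed
    in parallel; if none qualifies, the single most confident one is taken.
    """
    if not confidences:
        return []
    chosen = [pos for pos, conf in confidences.items() if conf > threshold]
    if not chosen:
        chosen = [max(confidences, key=confidences.get)]
    return sorted(chosen)


# Toy confidences for the four positions of one block.
block_confidences = {4: 0.97, 5: 0.91, 6: 0.62, 7: 0.88}
for gamma in (0.80, 0.90, 1.00):
    print(gamma, select_positions_to_unmask(block_confidences, gamma))
```

Under these assumptions a threshold of 1.00 can never be exceeded, so exactly one token is committed per step, which matches the 1.00 tokens per step in the last row of the table.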

MATH-500: Accuracy vs. Block Size at Threshold 0.90

Block Size	Accuracy	Tokens per step
4	71.6	1.91
8	61.4	2.34
16	49.2	3.59

At a fixed decoding threshold of 0.90 (gamma = 0.9), larger block sizes increase tokens per step while trading off accuracy.

Quick Start

OPDLM converts a pretrained Qwen3 autoregressive model into a BD3LM student with on-policy distillation. Data and model artifacts are hosted in the divelab/opdlm Hugging Face collection.

Environment

bash

git clone https://github.com/divelab/OPDLM.git
cd OPDLM

conda create -n opdlm python=3.10.19 -y
conda activate opdlm

# Install torch first.
pip install torch==2.6.0+cu124 --index-url https://download.pytorch.org/whl/cu124

# Install project dependencies.
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124

# Install flash-attn last so it builds against the active torch install.
pip install flash-attn==2.7.4.post1 --no-build-isolation

If DeepSpeed rejects a CUDA 12.x minor-version mismatch while compiling CPU Adam, set DS_SKIP_CUDA_CHECK=1.

Data

bash

# Evaluation data: 19 of the 20 paper benchmarks.
huggingface-cli download divelab/opdlm_eval_data --local-dir data/ --repo-type dataset

# Training data: opdlm_train.json, 61,816 rows.
huggingface-cli download divelab/opdlm_train_data --local-dir data/ --repo-type dataset

# Paper eval and DAPO math data that live outside the OPDLM collection.
python data/prepare_codeforces.py
huggingface-cli download BytedTsinghua-SIA/DAPO-Math-17k --local-dir data/ --repo-type dataset

Models

bash

# Teacher ARLMs.
huggingface-cli download Qwen/Qwen3-4B --local-dir $HF_HOME/Qwen3-4B
huggingface-cli download Qwen/Qwen3-8B --local-dir $HF_HOME/Qwen3-8B

# Student initializations with bidirectional attention.
huggingface-cli download divelab/Qwen3-4B-a2d-init --local-dir $HF_HOME/Qwen3-4B-a2d-init
huggingface-cli download divelab/Qwen3-8B-a2d-init --local-dir $HF_HOME/Qwen3-8B-a2d-init

Smaller Qwen3-0.6B and Qwen3-1.7B init models can be regenerated with convert_qwen_to_bd3lm.py.

Train

bash

python rl.py config=configs/rl_bd3lm.yaml \
    model.pretrained_model=$HF_HOME/Qwen3-4B-a2d-init \
    model.teacher_model=$HF_HOME/Qwen3-4B \
    dataset.train_dataset=opdlm_train

Reference launchers with the paper hyperparameters live in scripts/general_pre_train/ and scripts/post_train_dapo/. Edit DATA_PATH, STUDENT, TEACHER, and the SBATCH header for your cluster.

Evaluate

bash

python pure_inference/eval.py \
    --models <path-to-your-trained-opdlm-ckpt> \
    --model_bases bd3lm \
    --datasets HumanEval MBPP MATH500 GSM8K AIME2024 \
    --max_token 2048 \
    --remasking_strategy low_confidence_static \
    --dynamic_threshold 0.9 \
    --temperature 0.0 \
    --block_size 4 --denoising_steps_per_block 4 \
    --out_dir pure_inference/results

Citation

BibTeX

@misc{su2026opdlm,
      title={Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation},
      author={Xingyu Su and Jacob Helwig and Shubham Parashar and Atharv Chagi and Lakshmi Jotsna and Degui Zhi and James Caverlee and Dileep Kalathil and Shuiwang Ji},
      year={2026},
      eprint={2606.06712},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.06712},
}