Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
Xingyu Su*1Jacob Helwig*1Shubham Parashar*1Atharv Chagi1Lakshmi Jotsna1Degui Zhi2James Caverlee1
Dileep Kalathil1,3Shuiwang Ji1
* Equal contribution
1Department of Computer Science and Engineering, Texas A&M University·2Department of Bioinformatics and Systems Medicine, University of Texas Health Science Center at Houston·3Department of Electrical and Computer Engineering, Texas A&M University
Preprint
Motivation
Pretraining a diffusion language model (DLM) from scratch is expensive. Recent AR-to-DLM conversion methods reduce this cost by starting from pretrained autoregressive language models (ARLMs), but they still face two key mismatches: changing the training objective can weaken knowledge inherited from the ARLM, and standard DLM training uses randomly masked states that differ from the partially denoised states encountered during inference with confidence-based samplers.
OPDLM asks whether we can convert a pretrained ARLM into a DLM as a lightweight post-training procedure while preserving the capabilities learned during autoregressive pretraining and while addressing the divide between train and inference state distributions.
A natural tool for this is on-policy distillation (OPD), which trains a student on states sampled from its own generation process while using a teacher to provide token-level supervision. But applying OPD to DLM conversion creates a chicken-and-egg problem: a DLM student visits partially masked diffusion states, so a direct OPD setup would require a capable DLM teacher to score those states.
OPDLM bypasses this requirement by using the original frozen ARLM as the teacher. The student is initialized from the same ARLM weights, converted into a block-diffusion model, and trained on its own reverse diffusion trajectories. For each partially denoised student state, OPDLM uses the terminal denoised sequence to query the ARLM teacher on causal prefixes, producing token-level target distributions for the masked positions.
This creates a self-distillation setup:
- Teacher: the original frozen ARLM, queried for token-level distributions on causal prefixes of the student's generated sequence.
- Student: a block-diffusion LM initialized from the same ARLM, trained to predict masked tokens under blockwise bidirectional attention.
By training directly on the student's inference-time states while distilling from the original ARLM, OPDLM reduces the train-inference gap in DLM training and improves knowledge retention during conversion.
How OPDLM Works

- Roll out the student: sample a reverse unmasking trajectory from the current DLM and fixed sampler.
- Select an on-policy state: choose a non-terminal partially denoised state from the realized trajectory.
- Query the ARLM teacher: build causal prefixes from the terminal sequence and retrieve token-level teacher distributions.
- Optimize KL: align the DLM prediction at masked positions with the frozen ARLM distribution.
Rollout-Length Curriculum
Early in training, the terminal sequences the student generates by running its reverse diffusion process are low-quality, since the converted student is being queried with masked-token inputs and blockwise bidirectional attention for the first time. To address this, OPDLM begins by generating shorter sequences and gradually increases their length, helping training stability and convergence.
Key Highlights
OPDLM-8B AIME-24 run from Figure 1
Compared with established AR-to-DLM baselines
Think@eval without explicit thinking training
OPDLM-MATH-8B-Thinking
OPDLM-8B defines a new AIME-24 Pareto point: 0.066B training tokens and 4.2e18 FLOPs, or 15x to 7,000x less training than established DLMs converted from ARLMs.
Results
General-Purpose DLM Results
OPDLM converts Qwen3 into a diffusion language model for general-purpose reasoning across knowledge, mathematics, science, and code. OPDLM-4B and OPDLM-8B reach performance competitive with existing DLMs while training on only 0.076B and 0.066B tokens, orders of magnitude fewer than the baselines, and at substantially lower FLOPs.
| Benchmark | SDAR-4B | OPDLM-4B | LLaDA-8B | Dream-7B | SDAR-8B | Fast-dLLM-v2-7B | OPDLM-8B |
|---|---|---|---|---|---|---|---|
| Training tokens | 55B | 0.076B | 1500B | 580B | 55B | 1B | 0.066B |
| FLOPs (1e18) | 1320 | 2.4 | 72000 | 24360 | 2640 | 42 | 4.2 |
| General Knowledge & Instruction Following | |||||||
| MMLU | 74.9 | 65.5 | 65.5 | 67.0 | 78.6 | 66.6 | 70.9 |
| MMLU-Pro | 50.9 | 46.3 | 37.0 | 43.3 | 56.9 | 41.5 | 53.7 |
| GPQA-Diamond | 33.0 | 29.1 | 31.8 | 32.1 | 40.2 | 27.3 | 36.1 |
| IFEval | 56.6 | 53.8 | 59.9 | 62.5 | 61.4 | 65.4 | 50.1 |
| CEval | 62.9 | 66.9 | - | - | 70.2 | 70.3 | 73.3 |
| LiveBench | 25.3 | 27.8 | - | - | 28.6 | 9.5 | 25.8 |
| Mathematics & Reasoning | |||||||
| GSM8K | 89.9 | 87.6 | 78.6 | 81.0 | 91.3 | 83.7 | 87.1 |
| MATH-500 | 72.8 | 72.8 | 26.6 | 39.2 | 78.6 | 65.6 | 71.2 |
| AIME-24 | 10.0 | 14.4 | 2.1 | 0.0 | 10.0 | 10.0 | 14.7 |
| AIME-25 | 7.5 | 12.6 | 0.4 | 0.0 | 10.0 | 0.0 | 12.4 |
| LMB-Hard | 6.9 | 11.1 | - | - | 8.9 | 8.9 | 20.0 |
| ZebraLogic | 6.3 | 10.5 | - | - | 7.8 | 3.5 | 12.9 |
| Code Generation | |||||||
| HumanEval-base | 76.8 | 56.1 | 35.4 | 57.9 | 82.3 | 63.4 | 59.8 |
| MBPP-base | 80.7 | 57.7 | 31.5 | 68.3 | 79.6 | 63.0 | 48.7 |
| LCB-v6 | 12.6 | 10.4 | - | - | 14.5 | 9.7 | 9.7 |
| Codeforces | 4.0 | 5.0 | - | - | 5.8 | 5.0 | 3.5 |
Zero-Shot Results
Zero-Shot Extended Thinking
Modern ARLMs can reason through a problem inside a <think>...</think> trace before committing to an answer. We never train OPDLM to do this, yet the converted DLM does it zero-shot: when prompted to think, OPDLM-8B improves on the hardest reasoning benchmarks, raising AIME-24 from 14.7 to 18.6 and AIME-25 from 12.4 to 19.4. The base ARLM's reasoning ability survives on-policy conversion intact, emerging as a capability we never explicitly trained for.
| Benchmark | OPDLM-4B | OPDLM-4B think@eval | OPDLM-8B | OPDLM-8B think@eval |
|---|---|---|---|---|
| GSM8K | 87.6 | 85.3 | 87.1 | 88.0 |
| MATH-500 | 72.8 | 75.0 | 71.2 | 75.6 |
| AIME-24 | 14.4 | 11.2 | 14.7 | 18.6 |
| AIME-25 | 12.6 | 13.6 | 12.4 | 19.4 |
| LMB-Hard | 11.1 | 17.8 | 20.0 | 17.8 |
| ZebraLogic | 10.5 | 9.5 | 12.9 | 17.3 |
Multilingual Results
OPDLM keeps the multilingual ability of the base ARLM after conversion. without any multilingual-specific training. It holds performance across MMMLU-lite, INCLUDE-lite, and MLogiQA, and even improves on multilingual Math (MT-AIME 2024).
| Benchmark | SDAR-4B | OPDLM-4B | Fast-dLLM-v2-7B | SDAR-8B | OPDLM-8B |
|---|---|---|---|---|---|
| MMMLU-lite | 50.7 | 51.6 | 51.5 | 60.8 | 56.0 |
| INCLUDE-lite | 53.3 | 49.6 | 45.1 | 57.8 | 51.9 |
| MT-AIME 2024 | 3.0 | 5.3 | 4.3 | 4.0 | 7.9 |
| MLogiQA | 46.5 | 46.5 | 42.6 | 46.3 | 42.0 |
Specialized DLM Results
Since OPDLM is a form of post-training applied to ARLMs, we can also build specialized DLMs. Below, we train OPDLM specifically for math to obtain OPDLM-MATH, using the same on-policy distillation setup. Additionally, we train OPDLM-MATH-Thinking for extended reasoning.
| Model | GSM8K | MATH-500 | AIME-24 |
|---|---|---|---|
| Reference | |||
| SDAR-4B-Chat | 90.2 | 70.2 | 5.0 |
| LLaDA-8B-Instruct | 82.5 | 37.3 | 0.5 |
| Dream-7B-Instruct | 72.7 | 38.7 | 0.0 |
| SDAR-8B-Chat | 91.1 | 74.3 | 11.8 |
| 4B Scale | |||
| TraDo-4B-Instruct | 91.2 | 75.6 | 8.3 |
| OPDLM-MATH-4B | 83.8 | 75.8 | 10.0 |
| OPDLM-MATH-4B-Thinking | 91.7 | 90.2 | 43.3 |
| 8B Scale | |||
| TraDo-8B-Instruct | 92.3 | 78.5 | 13.3 |
| OPDLM-MATH-8B | 86.2 | 76.6 | 23.3 |
| TraDo-8B-Thinking | 94.2 | 87.4 | 35.5 |
| OPDLM-MATH-8B-Thinking | 93.8 | 92.4 | 50.0 |
Parallelization
We show two controls on inference throughput: lowering the decoding threshold lets OPDLM produce more tokens per denoising step, while the training block size sets the upper bound on the parallelism it can expose at inference time.
MATH-500: Accuracy vs. Decoding Threshold
| Threshold | Accuracy | Tokens per step |
|---|---|---|
| 0.80 | 69.2 | 2.12 |
| 0.85 | 69.4 | 2.04 |
| 0.90 | 71.6 | 1.91 |
| 0.95 | 72.0 | 1.80 |
| 1.00 | 72.8 | 1.00 |
MATH-500: Accuracy vs. Block Size at Threshold 0.90
| Block Size | Accuracy | Tokens per step |
|---|---|---|
| 4 | 71.6 | 1.91 |
| 8 | 61.4 | 2.34 |
| 16 | 49.2 | 3.59 |
Quick Start
OPDLM converts a pretrained Qwen3 autoregressive model into a BD3LM student with on-policy distillation. Data and model artifacts are hosted in the divelab/opdlm Hugging Face collection.
Environment
git clone https://github.com/divelab/OPDLM.git
cd OPDLM
conda create -n opdlm python=3.10.19 -y
conda activate opdlm
# Install torch first.
pip install torch==2.6.0+cu124 --index-url https://download.pytorch.org/whl/cu124
# Install project dependencies.
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124
# Install flash-attn last so it builds against the active torch install.
pip install flash-attn==2.7.4.post1 --no-build-isolationIf DeepSpeed rejects a CUDA 12.x minor-version mismatch while compiling CPU Adam, set DS_SKIP_CUDA_CHECK=1.
Data
# Evaluation data: 19 of the 20 paper benchmarks.
huggingface-cli download divelab/opdlm_eval_data --local-dir data/ --repo-type dataset
# Training data: opdlm_train.json, 61,816 rows.
huggingface-cli download divelab/opdlm_train_data --local-dir data/ --repo-type dataset
# Paper eval and DAPO math data that live outside the OPDLM collection.
python data/prepare_codeforces.py
huggingface-cli download BytedTsinghua-SIA/DAPO-Math-17k --local-dir data/ --repo-type datasetModels
# Teacher ARLMs.
huggingface-cli download Qwen/Qwen3-4B --local-dir $HF_HOME/Qwen3-4B
huggingface-cli download Qwen/Qwen3-8B --local-dir $HF_HOME/Qwen3-8B
# Student initializations with bidirectional attention.
huggingface-cli download divelab/Qwen3-4B-a2d-init --local-dir $HF_HOME/Qwen3-4B-a2d-init
huggingface-cli download divelab/Qwen3-8B-a2d-init --local-dir $HF_HOME/Qwen3-8B-a2d-initSmaller Qwen3-0.6B and Qwen3-1.7B init models can be regenerated with convert_qwen_to_bd3lm.py.
Train
python rl.py config=configs/rl_bd3lm.yaml \
model.pretrained_model=$HF_HOME/Qwen3-4B-a2d-init \
model.teacher_model=$HF_HOME/Qwen3-4B \
dataset.train_dataset=opdlm_trainReference launchers with the paper hyperparameters live in scripts/general_pre_train/ and scripts/post_train_dapo/. Edit DATA_PATH, STUDENT, TEACHER, and the SBATCH header for your cluster.
Evaluate
python pure_inference/eval.py \
--models <path-to-your-trained-opdlm-ckpt> \
--model_bases bd3lm \
--datasets HumanEval MBPP MATH500 GSM8K AIME2024 \
--max_token 2048 \
--remasking_strategy low_confidence_static \
--dynamic_threshold 0.9 \
--temperature 0.0 \
--block_size 4 --denoising_steps_per_block 4 \
--out_dir pure_inference/resultsCitation
@misc{su2026opdlm,
title={Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation},
author={Xingyu Su and Jacob Helwig and Shubham Parashar and Atharv Chagi and Lakshmi Jotsna and Degui Zhi and James Caverlee and Dileep Kalathil and Shuiwang Ji},
year={2026},
eprint={2606.06712},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.06712},
}

