Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
Xingyu Su*¹, Jacob Helwig*¹, Shubham Parashar*¹, Atharv Chagi¹, Lakshmi Jotsna¹, Degui Zhi², James Caverlee¹, Dileep Kalathil¹,³, Shuiwang Ji¹
* Equal contribution
¹Department of Computer Science and Engineering, Texas A&M University · ²Department of Bioinformatics and Systems Medicine, University of Texas Health Science Center at Houston · ³Department of Electrical and Computer Engineering, Texas A&M University
Preprint
Motivation
Pretraining a diffusion language model (DLM) from scratch is expensive, and existing open DLMs still trail autoregressive language models (ARLMs) of comparable scale on standard benchmarks. Rather than start over, we ask whether the capabilities already learned by a pretrained ARLM can be transferred to a DLM. A natural candidate is on-policy distillation (OPD), which supervises a student on its own rollouts and has proven effective in post-training. This motivates the central research question of our work:
Can we convert a pretrained ARLM into a DLM while preserving its prior? And can OPD do it?
Applying OPD here runs into a chicken-and-egg problem: the teacher would need to be a capable DLM in order to score the fully or partially masked states the student visits, but a capable DLM is exactly what we are trying to build. OPDLM bypasses this by querying the ARLM directly as the teacher, reading out its prior through causal prefixes of the student's rollouts. This gives a self-distillation setup:
- Teacher: the frozen pretrained ARLM, queried only for token-level distributions over causal clean blocks.
- Student: a block-diffusion LM initialized from the same ARLM weights, trained to predict masked tokens under blockwise bidirectional attention.
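To make the contrast between the teacher's causal attention and the student's blockwise bidirectional attention concrete, the following is a minimal sketch of the two masks. It assumes a simplified, inference-style view in which a token attends to every token in its own block and in all earlier blocks; the function names are hypothetical and the mask actually used to train the BD3LM student may differ.

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    block_of = [pos // block_size for pos in range(seq_len)]
    # Bidirectional inside a block, causal across blocks (assumed layout).
    return [
        [block_of[j] <= block_of[i] for j in range(seq_len)]
        for i in range(seq_len)
    ]


def causal_attention_mask(seq_len: int) -> list[list[bool]]:
    """Standard autoregressive mask: position i sees positions 0..i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    for row in block_attention_mask(seq_len=6, block_size=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this sketch, a block size of 1 reduces to the ordinary causal mask, which is one way to see why the student can be initialized from the ARLM weights.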
How OPDLM Works

- Roll out the student: sample a reverse unmasking trajectory from the current DLM and fixed sampler.
- Select an on-policy state: choose a non-terminal partially denoised state from the realized trajectory.
- Query the ARLM teacher: build causal prefixes from the terminal sequence and retrieve token-level teacher distributions.
- Optimize KL: align the DLM prediction at masked positions with the frozen ARLM distribution.
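The last two steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the training code: it assumes the student's output at index i scores token i, that the causal teacher's output at index i - 1 (after reading tokens 0..i-1 of the terminal sequence) scores the same token, and that the divergence is KL(student || teacher). The source does not state the KL direction or the indexing convention, and all function names here are hypothetical.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(log_p: list[float], log_q: list[float]) -> float:
    """KL(p || q) computed from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def masked_kl_loss(
    student_logits: list[list[float]],
    teacher_logits: list[list[float]],
    masked_positions: list[int],
) -> float:
    """Mean KL(student || teacher) over the masked positions.

    student_logits[i]: student prediction for token i (assumed convention).
    teacher_logits[i]: causal teacher output after tokens 0..i, i.e. its
    prediction for token i + 1, so position i is scored by row i - 1.
    """
    if not masked_positions:
        return 0.0
    total = 0.0
    for i in masked_positions:
        if i == 0:
            raise ValueError("position 0 has no causal prefix to condition on")
        log_student = log_softmax(student_logits[i])
        log_teacher = log_softmax(teacher_logits[i - 1])
        total += kl_divergence(log_student, log_teacher)
    return total / len(masked_positions)
```

Only masked positions contribute to the loss, so tokens the student has already committed in the selected state are left untouched by the update.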
Rollout-Length Curriculum
Early in training, the terminal sequences the student generates by running its reverse diffusion process are low-quality, since the converted student is being queried with masked-token inputs and blockwise bidirectional attention for the first time. To address this, OPDLM begins by generating shorter sequences and gradually increases their length, helping training stability and convergence.
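One way such a curriculum could be scheduled is sketched below. The source only says that rollouts start short and grow gradually; the linear ramp, the rounding to whole blocks, and every name and number in this example are assumptions made for illustration.

```python
def rollout_length(
    step: int,
    warmup_steps: int,
    min_len: int,
    max_len: int,
    block_size: int,
) -> int:
    """Rollout length at a training step under a linear ramp (assumed shape)."""
    if warmup_steps <= 0 or step >= warmup_steps:
        length = float(max_len)
    else:
        frac = max(step, 0) / warmup_steps
        length = min_len + frac * (max_len - min_len)
    # Round down to a whole number of blocks, but always keep at least one.
    blocks = max(1, int(length) // block_size)
    return min(blocks * block_size, max_len)


if __name__ == "__main__":
    # Hypothetical settings: ramp from 64 to 512 tokens over 100 steps.
    for step in (0, 25, 50, 100, 200):
        print(step, rollout_length(step, 100, 64, 512, 4))
```

Rounding to a multiple of the block size keeps every rollout made of complete blocks, which matches how a block-diffusion sampler generates.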
Key Highlights
OPDLM-8B AIME-24 run from Figure 1
Compared with established AR-to-DLM baselines
Think@eval without explicit thinking training
OPDLM-MATH-8B-Thinking
OPDLM-8B defines a new AIME-24 Pareto point: 0.066B training tokens and 4.2e18 FLOPs, or 15x to 7,000x less training than established DLMs converted from ARLMs.
Results
General-Purpose DLM Results
OPDLM converts Qwen3 into a diffusion language model for general-purpose reasoning across knowledge, mathematics, science, and code. OPDLM-4B and OPDLM-8B reach performance competitive with existing DLMs while training on only 0.076B and 0.066B tokens, orders of magnitude fewer than the baselines, and at substantially lower FLOPs.
| Benchmark | SDAR-4B | OPDLM-4B | LLaDA-8B | Dream-7B | SDAR-8B | Fast-dLLM-v2-7B | OPDLM-8B |
|---|---|---|---|---|---|---|---|
| Training tokens | 55B | 0.076B | 1500B | 580B | 55B | 1B | 0.066B |
| FLOPs (1e18) | 1320 | 2.4 | 72000 | 24360 | 2640 | 42 | 4.2 |
| General Knowledge & Instruction Following | | | | | | | |
| MMLU | 74.9 | 65.5 | 65.5 | 67.0 | 78.6 | 66.6 | 70.9 |
| MMLU-Pro | 50.9 | 46.3 | 37.0 | 43.3 | 56.9 | 41.5 | 53.7 |
| GPQA-Diamond | 33.0 | 29.1 | 31.8 | 32.1 | 40.2 | 27.3 | 36.1 |
| IFEval | 56.6 | 53.8 | 59.9 | 62.5 | 61.4 | 65.4 | 50.1 |
| CEval | 62.9 | 66.9 | - | - | 70.2 | 70.3 | 73.3 |
| LiveBench | 25.3 | 27.8 | - | - | 28.6 | 9.5 | 25.8 |
| Mathematics & Reasoning | | | | | | | |
| GSM8K | 89.9 | 87.6 | 78.6 | 81.0 | 91.3 | 83.7 | 87.1 |
| MATH-500 | 72.8 | 72.8 | 26.6 | 39.2 | 78.6 | 65.6 | 71.2 |
| AIME-24 | 10.0 | 14.4 | 2.1 | 0.0 | 10.0 | 10.0 | 14.7 |
| AIME-25 | 7.5 | 12.6 | 0.4 | 0.0 | 10.0 | 0.0 | 12.4 |
| LMB-Hard | 6.9 | 11.1 | - | - | 8.9 | 8.9 | 20.0 |
| ZebraLogic | 6.3 | 10.5 | - | - | 7.8 | 3.5 | 12.9 |
| Code Generation | | | | | | | |
| HumanEval-base | 76.8 | 56.1 | 35.4 | 57.9 | 82.3 | 63.4 | 59.8 |
| MBPP-base | 80.7 | 57.7 | 31.5 | 68.3 | 79.6 | 63.0 | 48.7 |
| LCB-v6 | 12.6 | 10.4 | - | - | 14.5 | 9.7 | 9.7 |
| Codeforces | 4.0 | 5.0 | - | - | 5.8 | 5.0 | 3.5 |
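The FLOPs row can be reproduced with simple arithmetic. One reading, which the source does not state, is that baselines follow the common approximation of 6 FLOPs per parameter per training token, while the OPDLM entries use 8 (6 for the student plus 2 for a forward pass of the frozen teacher). The sketch below checks this reading against the table using nominal parameter counts; the constants and the function name are assumptions.

```python
def train_flops_1e18(params_b: float, tokens_b: float, flops_per_param_token: float) -> float:
    """Training FLOPs in units of 1e18, with parameters and tokens in billions."""
    # (params_b * 1e9) * (tokens_b * 1e9) = params_b * tokens_b * 1e18
    return flops_per_param_token * params_b * tokens_b


BASELINE = 6  # forward + backward of the trained model (common approximation)
DISTILL = 8   # assumed: student forward + backward, plus teacher forward

# (nominal parameters in billions, training tokens in billions, multiplier)
rows = {
    "SDAR-4B": (4, 55, BASELINE),
    "OPDLM-4B": (4, 0.076, DISTILL),
    "LLaDA-8B": (8, 1500, BASELINE),
    "Dream-7B": (7, 580, BASELINE),
    "SDAR-8B": (8, 55, BASELINE),
    "Fast-dLLM-v2-7B": (7, 1, BASELINE),
    "OPDLM-8B": (8, 0.066, DISTILL),
}

for name, (params_b, tokens_b, k) in rows.items():
    print(f"{name}: {train_flops_1e18(params_b, tokens_b, k):.1f}e18 FLOPs")
```

Each computed value agrees with the table after rounding to the precision shown there, which is consistent with this reading but does not confirm it.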
Zero-Shot Results
Zero-Shot Extended Thinking
Modern ARLMs can reason through a problem inside a <think>...</think> trace before committing to an answer. We never train OPDLM to do this, yet the converted DLM does it zero-shot: when prompted to think, OPDLM-8B improves on the hardest reasoning benchmarks, raising AIME-24 from 14.7 to 18.6 and AIME-25 from 12.4 to 19.4. The base ARLM's reasoning ability largely survives on-policy conversion, emerging as a capability we never explicitly trained for.
| Benchmark | OPDLM-4B | OPDLM-4B think@eval | OPDLM-8B | OPDLM-8B think@eval |
|---|---|---|---|---|
| GSM8K | 87.6 | 85.3 | 87.1 | 88.0 |
| MATH-500 | 72.8 | 75.0 | 71.2 | 75.6 |
| AIME-24 | 14.4 | 11.2 | 14.7 | 18.6 |
| AIME-25 | 12.6 | 13.6 | 12.4 | 19.4 |
| LMB-Hard | 11.1 | 17.8 | 20.0 | 17.8 |
| ZebraLogic | 10.5 | 9.5 | 12.9 | 17.3 |
Multilingual Results
OPDLM keeps the multilingual ability of the base ARLM after conversion, without any multilingual-specific training. It holds performance across MMMLU-lite, INCLUDE-lite, and MLogiQA, and even improves on multilingual math (MT-AIME 2024).
| Benchmark | SDAR-4B | OPDLM-4B | Fast-dLLM-v2-7B | SDAR-8B | OPDLM-8B |
|---|---|---|---|---|---|
| MMMLU-lite | 50.7 | 51.6 | 51.5 | 60.8 | 56.0 |
| INCLUDE-lite | 53.3 | 49.6 | 45.1 | 57.8 | 51.9 |
| MT-AIME 2024 | 3.0 | 5.3 | 4.3 | 4.0 | 7.9 |
| MLogiQA | 46.5 | 46.5 | 42.6 | 46.3 | 42.0 |
Specialized DLM Results
Since OPDLM is a form of post-training applied to ARLMs, we can also build specialized DLMs. Below, we train OPDLM specifically for math to obtain OPDLM-MATH, using the same on-policy distillation setup. Additionally, we train OPDLM-MATH-Thinking for extended reasoning.
| Model | GSM8K | MATH-500 | AIME-24 |
|---|---|---|---|
| Reference | | | |
| SDAR-4B-Chat | 90.2 | 70.2 | 5.0 |
| LLaDA-8B-Instruct | 82.5 | 37.3 | 0.5 |
| Dream-7B-Instruct | 72.7 | 38.7 | 0.0 |
| SDAR-8B-Chat | 91.1 | 74.3 | 11.8 |
| 4B Scale | | | |
| TraDo-4B-Instruct | 91.2 | 75.6 | 8.3 |
| OPDLM-MATH-4B | 83.8 | 75.8 | 10.0 |
| OPDLM-MATH-4B-Thinking | 91.7 | 90.2 | 43.3 |
| 8B Scale | | | |
| TraDo-8B-Instruct | 92.3 | 78.5 | 13.3 |
| OPDLM-MATH-8B | 86.2 | 76.6 | 23.3 |
| TraDo-8B-Thinking | 94.2 | 87.4 | 35.5 |
| OPDLM-MATH-8B-Thinking | 93.8 | 92.4 | 50.0 |
Parallelization
We show two controls on inference throughput: lowering the decoding threshold lets OPDLM produce more tokens per denoising step, while the training block size sets the upper bound on the parallelism it can expose at inference time.
MATH-500: Accuracy vs. Decoding Threshold
| Threshold | Accuracy | Tokens per step |
|---|---|---|
| 0.80 | 69.2 | 2.12 |
| 0.85 | 69.4 | 2.04 |
| 0.90 | 71.6 | 1.91 |
| 0.95 | 72.0 | 1.80 |
| 1.00 | 72.8 | 1.00 |
MATH-500: Accuracy vs. Block Size at Threshold 0.90
| Block Size | Accuracy | Tokens per step |
|---|---|---|
| 4 | 71.6 | 1.91 |
| 8 | 61.4 | 2.34 |
| 16 | 49.2 | 3.59 |
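The effect of the decoding threshold can be illustrated with a small sketch of confidence-based parallel unmasking. This is one common rule, assumed here rather than taken from the source: within the current block, commit every masked position whose top-1 probability exceeds the threshold, and if none does, commit the single most confident position so that decoding always advances. The function name is hypothetical.

```python
def select_tokens_to_unmask(confidences: dict[int, float], threshold: float) -> list[int]:
    """Pick the masked positions to commit in one denoising step.

    confidences maps each still-masked position in the current block to the
    model's top-1 probability at that position.
    """
    if not confidences:
        return []
    chosen = [pos for pos, conf in confidences.items() if conf > threshold]
    if not chosen:
        # Fall back to the single most confident position to guarantee progress.
        chosen = [max(confidences, key=confidences.get)]
    return sorted(chosen)


if __name__ == "__main__":
    block = {0: 0.97, 1: 0.50, 2: 0.92, 3: 0.30}  # made-up confidences
    for threshold in (0.80, 0.90, 0.95, 1.00):
        print(threshold, select_tokens_to_unmask(block, threshold))
```

Under this rule a threshold of 1.00 commits exactly one token per step, in line with the 1.00 tokens per step in the first table, and a block of size B can never commit more than B tokens in a step, which is why the training block size caps the available parallelism.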
Quick Start
OPDLM converts a pretrained Qwen3 autoregressive model into a BD3LM student with on-policy distillation. Data and model artifacts are hosted in the divelab/opdlm Hugging Face collection.
Environment
git clone <opdlm-repo-url>
cd opdlm
conda create -n opdlm python=3.10.19 -y
conda activate opdlm
# Install torch first.
pip install torch==2.6.0+cu124 --index-url https://download.pytorch.org/whl/cu124
# Install project dependencies.
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124
# Install flash-attn last so it builds against the active torch install.
pip install flash-attn==2.7.4.post1 --no-build-isolation
If DeepSpeed rejects a CUDA 12.x minor-version mismatch while compiling CPU Adam, set DS_SKIP_CUDA_CHECK=1.
Data
# Evaluation data: 19 of the 20 paper benchmarks.
huggingface-cli download divelab/opdlm_eval_data --local-dir data/ --repo-type dataset
# Training data: opdlm_train.json, 61,816 rows.
huggingface-cli download divelab/opdlm_train_data --local-dir data/ --repo-type dataset
# Paper eval and DAPO math data that live outside the OPDLM collection.
python data/prepare_codeforces.py
huggingface-cli download BytedTsinghua-SIA/DAPO-Math-17k --local-dir data/ --repo-type dataset
Models
# Teacher ARLMs.
huggingface-cli download Qwen/Qwen3-4B --local-dir $HF_HOME/Qwen3-4B
huggingface-cli download Qwen/Qwen3-8B --local-dir $HF_HOME/Qwen3-8B
# Student initializations with bidirectional attention.
huggingface-cli download divelab/Qwen3-4B-a2d-init --local-dir $HF_HOME/Qwen3-4B-a2d-init
huggingface-cli download divelab/Qwen3-8B-a2d-init --local-dir $HF_HOME/Qwen3-8B-a2d-init
Smaller Qwen3-0.6B and Qwen3-1.7B init models can be regenerated with convert_qwen_to_bd3lm.py.
Train
python rl.py config=configs/rl_bd3lm.yaml \
    model.pretrained_model=$HF_HOME/Qwen3-4B-a2d-init \
    model.teacher_model=$HF_HOME/Qwen3-4B \
    dataset.train_dataset=opdlm_train
Reference launchers with the paper hyperparameters live in scripts/general_pre_train/ and scripts/post_train_dapo/. Edit DATA_PATH, STUDENT, TEACHER, and the SBATCH header for your cluster.
Evaluate
python pure_inference/eval.py \
    --models <path-to-your-trained-opdlm-ckpt> \
    --model_bases bd3lm \
    --datasets HumanEval MBPP MATH500 GSM8K AIME2024 \
    --max_token 2048 \
    --remasking_strategy low_confidence_static \
    --dynamic_threshold 0.9 \
    --temperature 0.0 \
    --block_size 4 --denoising_steps_per_block 4 \
    --out_dir pure_inference/results
Citation
@misc{su2026opdlm,
title = {Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation},
author = {Su, Xingyu and Helwig, Jacob and Parashar, Shubham and Chagi, Atharv and
Jotsna, Lakshmi and Zhi, Degui and Caverlee, James and Kalathil, Dileep and
Ji, Shuiwang},
year = {2026},
note = {Preprint}
}

