Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Xingyu Su* (1), Jacob Helwig* (1), Shubham Parashar* (1), Atharv Chagi (1), Lakshmi Jotsna (1), Degui Zhi (2), James Caverlee (1), Dileep Kalathil (1, 3), Shuiwang Ji (1)

* Equal contribution

(1) Department of Computer Science and Engineering, Texas A&M University
(2) Department of Bioinformatics and Systems Medicine, University of Texas Health Science Center at Houston
(3) Department of Electrical and Computer Engineering, Texas A&M University

Preprint

Motivation

Pretraining a diffusion language model (DLM) from scratch is expensive, and existing open DLMs still trail autoregressive language models (ARLMs) of comparable scale on standard benchmarks. Rather than start over, we ask whether the capabilities already learned by a pretrained ARLM can be transferred to a DLM. A natural candidate is on-policy distillation (OPD), which supervises a student on its own rollouts and has proven effective in post-training. This motivates the central research question of our work:

Can we convert a pretrained ARLM into a DLM while preserving its prior? And can OPD do it?

Applying OPD here runs into a chicken-and-egg problem: the teacher needs to be a capable DLM in order to score the partially masked states the student visits, but a capable DLM is exactly what we are trying to build. OPDLM bypasses this by querying the ARLM directly as the teacher, reading out its prior through causal prefixes of the student's rollouts. This gives a self-distillation setup:

  • Teacher: the frozen pretrained ARLM, queried only for token-level distributions over causal clean blocks.
  • Student: a block-diffusion LM initialized from the same ARLM weights, trained to predict masked tokens under blockwise bidirectional attention.
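
To make "blockwise bidirectional attention" concrete, here is a minimal sketch of one common block-causal mask: positions attend bidirectionally inside their own block and causally to all earlier blocks. The function name and the exact masking rule are assumptions for illustration only; the mask OPDLM actually uses during training may differ.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumed rule (illustrative, not taken from the OPDLM code):
    full attention inside a block, causal attention across blocks.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    # Block index of every position, e.g. block_size=2 -> [0, 0, 1, 1, 2, 2].
    block_of = [pos // block_size for pos in range(seq_len)]
    return [
        [block_of[j] <= block_of[i] for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Print the mask as a grid: "x" = may attend, "." = blocked.
    for row in block_causal_mask(seq_len=6, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` this reduces to an ordinary causal mask, which is one way to see why a student initialized from ARLM weights starts close to the teacher's behavior.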

How OPDLM Works

Figure: OPDLM training step framework.
At each training step, the student DLM samples a reverse trajectory, a partially denoised state is selected, and masked-token predictions are aligned with frozen ARLM teacher distributions.
  1. Roll out the student: sample a reverse unmasking trajectory from the current DLM using a fixed sampler.
  2. Select an on-policy state: choose a non-terminal partially denoised state from the realized trajectory.
  3. Query the ARLM teacher: build causal prefixes from the terminal sequence and retrieve token-level teacher distributions.
  4. Optimize KL: align the DLM prediction at masked positions with the frozen ARLM distribution.
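
Steps 3 and 4 above can be sketched on a toy vocabulary. This is a hedged illustration, not the OPDLM implementation: `MASK`, `masked_kl_loss`, and `toy_teacher` are hypothetical names, the teacher is a stand-in function rather than a real ARLM, and the KL direction (student against teacher) and the choice to condition the teacher on `terminal[:i]` are assumptions, since the text above does not pin them down.

```python
import math
from typing import Callable, Optional, Sequence

MASK = -1  # hypothetical mask-token id, chosen only for this toy example


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def masked_kl_loss(
    state: Sequence[int],
    terminal: Sequence[int],
    student_probs: Sequence[Optional[Sequence[float]]],
    teacher_next_token: Callable[[Sequence[int]], Sequence[float]],
) -> float:
    """Average KL(student || teacher) over the masked positions of `state`.

    `state` is a partially denoised sequence from the student's own rollout and
    `terminal` is the fully unmasked sequence that rollout ended in.
    """
    masked = [i for i, tok in enumerate(state) if tok == MASK]
    if not masked:
        raise ValueError("state must be non-terminal (contain at least one mask)")
    total = 0.0
    for i in masked:
        # Assumption: the teacher scores position i from the causal prefix
        # of the student's terminal sequence.
        teacher = teacher_next_token(terminal[:i])
        total += kl_divergence(student_probs[i], teacher)
    return total / len(masked)


def toy_teacher(prefix: Sequence[int]) -> Sequence[float]:
    """Stand-in for a frozen ARLM: a fixed next-token distribution per prefix length."""
    return [0.7, 0.2, 0.1] if len(prefix) % 2 == 0 else [0.1, 0.2, 0.7]


terminal = [0, 2, 2, 0]
state = [0, MASK, 2, MASK]
uniform = [1 / 3, 1 / 3, 1 / 3]
student_probs = [None, uniform, None, uniform]  # predictions only at masked positions
print(masked_kl_loss(state, terminal, student_probs, toy_teacher))
```

The loss is zero exactly when the student's masked-position predictions match the teacher's distributions, which is the alignment that step 4 describes.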

Rollout-Length Curriculum

Early in training, the terminal sequences the student generates by running its reverse diffusion process are low-quality, since the converted student is being queried with masked-token inputs and blockwise bidirectional attention for the first time. To address this, OPDLM begins by generating shorter sequences and gradually increases their length, which improves training stability and convergence.
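
One way such a curriculum could look is a linear ramp of the rollout length over a warmup period. The schedule shape, the function name `rollout_length`, and every numeric value below are hypothetical; the description above only states that lengths start short and grow gradually.

```python
def rollout_length(
    step: int,
    warmup_steps: int,
    min_len: int,
    max_len: int,
    block_size: int = 4,
) -> int:
    """Linearly ramp the rollout length from min_len to max_len.

    The result is rounded down to a whole number of blocks (at least one block).
    All arguments are illustrative assumptions, not OPDLM hyperparameters.
    """
    if warmup_steps <= 0:
        return max_len
    # Fraction of the warmup completed, clamped to [0, 1].
    frac = min(max(step / warmup_steps, 0.0), 1.0)
    length = min_len + frac * (max_len - min_len)
    blocks = max(1, int(length) // block_size)
    return min(max_len, blocks * block_size)


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100, 200):
        print(step, rollout_length(step, warmup_steps=100, min_len=64, max_len=512))
```

After the warmup the function returns `max_len` for every step, so the rest of training runs at full rollout length.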

Key Highlights

  • Efficiency Frontier: 0.066B training tokens, 4.2e18 FLOPs (OPDLM-8B AIME-24 run from Figure 1)
  • Token Reduction: 15x-7,000x (compared with established AR-to-DLM baselines)
  • Zero-Shot Thinking: 18.6 on AIME-24 for OPDLM-8B (think@eval without explicit thinking training)
  • Specialized Math: 50.0 on AIME-24 (OPDLM-MATH-8B-Thinking)

OPDLM-8B defines a new AIME-24 Pareto point at 0.066B training tokens and 4.2e18 FLOPs, using 15x to 7,000x fewer training tokens than established DLMs converted from ARLMs.

Results

General-Purpose DLM Results

OPDLM converts Qwen3 into a diffusion language model for general-purpose reasoning across knowledge, mathematics, science, and code. OPDLM-4B and OPDLM-8B reach performance competitive with existing DLMs while training on only 0.076B and 0.066B tokens, orders of magnitude fewer than the baselines, and at substantially lower FLOPs.

OPDLM-4B and OPDLM-8B achieve competitive performance across general knowledge, math, and code while using only 0.066B-0.076B training tokens.
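| Benchmark | SDAR-4B | OPDLM-4B | LLaDA-8B | Dream-7B | SDAR-8B | Fast-dLLM-v2-7B | OPDLM-8B |
|---|---|---|---|---|---|---|---|
| Training tokens | 55B | 0.076B | 1500B | 580B | 55B | 1B | 0.066B |
| FLOPs (1e18) | 1320 | 2.47 | 72000 | 24360 | 2640 | 42 | 4.2 |
| General Knowledge & Instruction Following | | | | | | | |
| MMLU | 74.9 | 65.5 | 65.5 | 67.0 | 78.6 | 66.6 | 70.9 |
| MMLU-Pro | 50.9 | 46.3 | 37.0 | 43.3 | 56.9 | 41.5 | 53.7 |
| GPQA-Diamond | 33.0 | 29.1 | 31.8 | 32.1 | 40.2 | 27.3 | 36.1 |
| IFEval | 56.6 | 53.8 | 59.9 | 62.5 | 61.4 | 65.4 | 50.1 |
| CEval | 62.9 | 66.9 | - | - | 70.2 | 70.3 | 73.3 |
| LiveBench | 25.3 | 27.8 | - | - | 28.6 | 9.5 | 25.8 |
| Mathematics & Reasoning | | | | | | | |
| GSM8K | 89.9 | 87.6 | 78.6 | 81.0 | 91.3 | 83.7 | 87.1 |
| MATH-500 | 72.8 | 72.8 | 26.6 | 39.2 | 78.6 | 65.6 | 71.2 |
| AIME-24 | 10.0 | 14.4 | 2.1 | 0.0 | 10.0 | 10.0 | 14.7 |
| AIME-25 | 7.5 | 12.6 | 0.4 | 0.0 | 10.0 | 0.0 | 12.4 |
| LMB-Hard | 6.9 | 11.1 | - | - | 8.9 | 8.9 | 20.0 |
| ZebraLogic | 6.3 | 10.5 | - | - | 7.8 | 3.5 | 12.9 |
| Code Generation | | | | | | | |
| HumanEval-base | 76.8 | 56.1 | 35.4 | 57.9 | 82.3 | 63.4 | 59.8 |
| MBPP-base | 80.7 | 57.7 | 31.5 | 68.3 | 79.6 | 63.0 | 48.7 |
| LCB-v6 | 12.6 | 10.4 | - | - | 14.5 | 9.7 | 9.7 |
| Codeforces | 4.0 | 5.0 | - | - | 5.8 | 5.0 | 3.5 |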

Zero-Shot Results

Zero-Shot Extended Thinking

Modern ARLMs can reason through a problem inside a <think>...</think> trace before committing to an answer. We never train OPDLM to do this, yet the converted DLM does it zero-shot: when prompted to think, OPDLM-8B improves on the hardest reasoning benchmarks, raising AIME-24 from 14.7 to 18.6 and AIME-25 from 12.4 to 19.4. The base ARLM's reasoning ability survives on-policy conversion intact, emerging as a capability we never explicitly trained for.

OPDLM retains ARLM priors for zero-shot extended thinking, despite that behavior not being explicitly included in OPDLM training.
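| Benchmark | OPDLM-4B | OPDLM-4B think@eval | OPDLM-8B | OPDLM-8B think@eval |
|---|---|---|---|---|
| GSM8K | 87.6 | 85.3 | 87.1 | 88.0 |
| MATH-500 | 72.8 | 75.0 | 71.2 | 75.6 |
| AIME-24 | 14.4 | 11.2 | 14.7 | 18.6 |
| AIME-25 | 12.6 | 13.6 | 12.4 | 19.4 |
| LMB-Hard | 11.1 | 17.8 | 20.0 | 17.8 |
| ZebraLogic | 10.5 | 9.5 | 12.9 | 17.3 |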

Multilingual Results

OPDLM keeps the multilingual ability of the base ARLM after conversion, without any multilingual-specific training. It holds performance across MMMLU-lite, INCLUDE-lite, and MLogiQA, and even improves on multilingual math (MT-AIME 2024).

OPDLM preserves multilingual ability from the base ARLM after on-policy conversion.
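| Benchmark | SDAR-4B | OPDLM-4B | Fast-dLLM-v2-7B | SDAR-8B | OPDLM-8B |
|---|---|---|---|---|---|
| MMMLU-lite | 50.7 | 51.6 | 51.5 | 60.8 | 56.0 |
| INCLUDE-lite | 53.3 | 49.6 | 45.1 | 57.8 | 51.9 |
| MT-AIME 2024 | 3.0 | 5.3 | 4.3 | 4.0 | 7.9 |
| MLogiQA | 46.5 | 46.5 | 42.6 | 46.3 | 42.0 |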

Specialized DLM Results

Since OPDLM is a form of post-training applied to ARLMs, we can also build specialized DLMs. Below, we train OPDLM specifically for math to obtain OPDLM-MATH, using the same on-policy distillation setup. Additionally, we train OPDLM-MATH-Thinking for extended reasoning.

Without RLVR or DLM pretraining, OPDLM-MATH performs competitively with baselines and is especially strong on harder math benchmarks; thinking variants are trained as separate models for extended reasoning.
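| Model | GSM8K | MATH-500 | AIME-24 |
|---|---|---|---|
| Reference | | | |
| SDAR-4B-Chat | 90.2 | 70.2 | 5.0 |
| LLaDA-8B-Instruct | 82.5 | 37.3 | 0.5 |
| Dream-7B-Instruct | 72.7 | 38.7 | 0.0 |
| SDAR-8B-Chat | 91.1 | 74.3 | 11.8 |
| 4B Scale | | | |
| TraDo-4B-Instruct | 91.2 | 75.6 | 8.3 |
| OPDLM-MATH-4B | 83.8 | 75.8 | 10.0 |
| OPDLM-MATH-4B-Thinking | 91.7 | 90.2 | 43.3 |
| 8B Scale | | | |
| TraDo-8B-Instruct | 92.3 | 78.5 | 13.3 |
| OPDLM-MATH-8B | 86.2 | 76.6 | 23.3 |
| TraDo-8B-Thinking | 94.2 | 87.4 | 35.5 |
| OPDLM-MATH-8B-Thinking | 93.8 | 92.4 | 50.0 |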

Parallelization

We show two controls on inference throughput: lowering the decoding threshold lets OPDLM produce more tokens per denoising step, while the training block size sets the upper bound on the parallelism it can expose at inference time.

MATH-500: Accuracy vs. Decoding Threshold

Figure: MATH-500 accuracy and tokens per step as a function of the decoding threshold.

| Threshold | Accuracy | Tokens per step |
|---|---|---|
| 0.80 | 69.2 | 2.12 |
| 0.85 | 69.4 | 2.04 |
| 0.90 | 71.6 | 1.91 |
| 0.95 | 72.0 | 1.80 |
| 1.00 | 72.8 | 1.00 |
Lowering the decoding threshold increases tokens per step with an accuracy trade-off.
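
A minimal sketch of how a confidence threshold can control parallelism within a block: every still-masked position whose top-token probability exceeds the threshold is committed in the same denoising step, and at least one position is always committed so decoding makes progress. The function name, the strict comparison, and the fallback rule are assumptions for illustration; they are not taken from the OPDLM sampler.

```python
def positions_to_unmask(confidences: dict[int, float], threshold: float) -> list[int]:
    """Choose which masked positions to commit in one denoising step.

    `confidences` maps each still-masked position to the student's
    top-token probability at that position (illustrative input format).
    """
    if not confidences:
        return []
    chosen = sorted(pos for pos, conf in confidences.items() if conf > threshold)
    if not chosen:
        # Fallback: commit the single most confident position.
        chosen = [max(confidences, key=confidences.get)]
    return chosen


if __name__ == "__main__":
    step_confidences = {0: 0.97, 1: 0.85, 2: 0.92, 3: 0.40}
    for threshold in (0.80, 0.90, 1.00):
        print(threshold, positions_to_unmask(step_confidences, threshold))
```

Under this rule a threshold of 1.00 can never be exceeded, so exactly one token is committed per step, which is consistent with the 1.00 tokens per step reported in the last row of the table.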

MATH-500: Accuracy vs. Block Size at Threshold 0.90

Figure: MATH-500 accuracy and tokens per step as a function of block size, at threshold 0.90.

| Block Size | Accuracy | Tokens per step |
|---|---|---|
| 4 | 71.6 | 1.91 |
| 8 | 61.4 | 2.34 |
| 16 | 49.2 | 3.59 |
At a fixed decoding threshold of gamma = 0.90, larger block sizes increase tokens per step at the cost of accuracy.
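
To put the tokens-per-step figures in perspective, the short calculation below converts them into an approximate number of denoising steps for a hypothetical 2048-token generation. The 2048-token length is an assumption chosen for illustration, and the estimate uses only the average tokens per step reported in the table above.

```python
def estimated_steps(num_tokens: int, tokens_per_step: float) -> float:
    """Average number of denoising steps needed to emit `num_tokens` tokens."""
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    return num_tokens / tokens_per_step


# block size -> (MATH-500 accuracy, tokens per step), copied from the table above.
BLOCK_SIZE_RESULTS = {4: (71.6, 1.91), 8: (61.4, 2.34), 16: (49.2, 3.59)}

for block_size, (accuracy, tps) in BLOCK_SIZE_RESULTS.items():
    steps = estimated_steps(2048, tps)  # 2048 tokens is a hypothetical length
    print(f"block_size={block_size:>2}  accuracy={accuracy:.1f}  steps~{steps:.0f}")
```

Fewer steps means fewer forward passes of the student, so the block size and threshold together set where a deployment sits on the accuracy-throughput curve.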

Quick Start

OPDLM converts a pretrained Qwen3 autoregressive model into a BD3LM student with on-policy distillation. Data and model artifacts are hosted in the divelab/opdlm Hugging Face collection.

Environment

```bash
git clone <opdlm-repo-url>
cd opdlm

conda create -n opdlm python=3.10.19 -y
conda activate opdlm

# Install torch first.
pip install torch==2.6.0+cu124 --index-url https://download.pytorch.org/whl/cu124

# Install project dependencies.
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124

# Install flash-attn last so it builds against the active torch install.
pip install flash-attn==2.7.4.post1 --no-build-isolation
```

If DeepSpeed rejects a CUDA 12.x minor-version mismatch while compiling CPU Adam, set DS_SKIP_CUDA_CHECK=1.

Data

```bash
# Evaluation data: 19 of the 20 paper benchmarks.
huggingface-cli download divelab/opdlm_eval_data --local-dir data/ --repo-type dataset

# Training data: opdlm_train.json, 61,816 rows.
huggingface-cli download divelab/opdlm_train_data --local-dir data/ --repo-type dataset

# Paper eval and DAPO math data that live outside the OPDLM collection.
python data/prepare_codeforces.py
huggingface-cli download BytedTsinghua-SIA/DAPO-Math-17k --local-dir data/ --repo-type dataset
```

Models

```bash
# Teacher ARLMs.
huggingface-cli download Qwen/Qwen3-4B --local-dir $HF_HOME/Qwen3-4B
huggingface-cli download Qwen/Qwen3-8B --local-dir $HF_HOME/Qwen3-8B

# Student initializations with bidirectional attention.
huggingface-cli download divelab/Qwen3-4B-a2d-init --local-dir $HF_HOME/Qwen3-4B-a2d-init
huggingface-cli download divelab/Qwen3-8B-a2d-init --local-dir $HF_HOME/Qwen3-8B-a2d-init
```

Smaller Qwen3-0.6B and Qwen3-1.7B init models can be regenerated with convert_qwen_to_bd3lm.py.

Train

```bash
python rl.py config=configs/rl_bd3lm.yaml \
    model.pretrained_model=$HF_HOME/Qwen3-4B-a2d-init \
    model.teacher_model=$HF_HOME/Qwen3-4B \
    dataset.train_dataset=opdlm_train
```

Reference launchers with the paper hyperparameters live in scripts/general_pre_train/ and scripts/post_train_dapo/. Edit DATA_PATH, STUDENT, TEACHER, and the SBATCH header for your cluster.

Evaluate

```bash
python pure_inference/eval.py \
    --models <path-to-your-trained-opdlm-ckpt> \
    --model_bases bd3lm \
    --datasets HumanEval MBPP MATH500 GSM8K AIME2024 \
    --max_token 2048 \
    --remasking_strategy low_confidence_static \
    --dynamic_threshold 0.9 \
    --temperature 0.0 \
    --block_size 4 --denoising_steps_per_block 4 \
    --out_dir pure_inference/results
```

Citation

```bibtex
@misc{su2026opdlm,
  title  = {Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation},
  author = {Su, Xingyu and Helwig, Jacob and Parashar, Shubham and Chagi, Atharv and
            Jotsna, Lakshmi and Zhi, Degui and Caverlee, James and Kalathil, Dileep and
            Ji, Shuiwang},
  year   = {2026},
  note   = {Preprint}
}
```