Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Xingyu Su* (1), Jacob Helwig* (1), Shubham Parashar* (1), Atharv Chagi (1), Lakshmi Jotsna (1), Degui Zhi (2), James Caverlee (1), Dileep Kalathil (1, 3), Shuiwang Ji (1)

* Equal contribution

(1) Department of Computer Science and Engineering, Texas A&M University
(2) Department of Bioinformatics and Systems Medicine, University of Texas Health Science Center at Houston
(3) Department of Electrical and Computer Engineering, Texas A&M University

Preprint

Motivation

Pretraining a diffusion language model (DLM) from scratch is expensive. Recent AR-to-DLM conversion methods reduce this cost by starting from pretrained autoregressive language models (ARLMs), but they still face two key mismatches: changing the training objective can weaken knowledge inherited from the ARLM, and standard DLM training uses randomly masked states that differ from the partially denoised states encountered during inference with confidence-based samplers.

OPDLM asks whether a pretrained ARLM can be converted into a DLM through lightweight post-training that both preserves the capabilities learned during autoregressive pretraining and reduces the gap between training-time and inference-time state distributions.

A natural tool for this is on-policy distillation (OPD), which trains a student on states sampled from its own generation process while using a teacher to provide token-level supervision. But applying OPD to DLM conversion creates a chicken-and-egg problem: a DLM student visits partially masked diffusion states, so a direct OPD setup would require a capable DLM teacher to score those states.

OPDLM bypasses this requirement by using the original frozen ARLM as the teacher. The student is initialized from the same ARLM weights, converted into a block-diffusion model, and trained on its own reverse diffusion trajectories. For each partially denoised student state, OPDLM uses the terminal denoised sequence to query the ARLM teacher on causal prefixes, producing token-level target distributions for the masked positions.

This creates a self-distillation setup:

  • Teacher: the original frozen ARLM, queried for token-level distributions on causal prefixes of the student's generated sequence.
  • Student: a block-diffusion LM initialized from the same ARLM, trained to predict masked tokens under blockwise bidirectional attention.

By training directly on the student's inference-time states while distilling from the original ARLM, OPDLM reduces the train-inference gap in DLM training and improves knowledge retention during conversion.
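
To make "blockwise bidirectional attention" concrete, the sketch below builds a toy attention mask in which each position attends bidirectionally within its own block and to every earlier block. The function name `block_attention_mask` and this block-causal layout are illustrative assumptions based on the usual block-diffusion setup, not the repository's implementation.

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed layout (not taken from the OPDLM code): a position sees its own
    block in full and all earlier blocks, but no later block.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Same block: bidirectional. Earlier block: visible. Later block: hidden.
            row.append(j // block_size <= i // block_size)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for 6 tokens in blocks of 2 ("x" = may attend).
    for row in block_attention_mask(seq_len=6, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` the mask reduces to the ordinary causal mask of the ARLM, which is one way to see why the student can be initialized from the same weights.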

How OPDLM Works

Figure: OPDLM training step framework. At each training step, the student DLM samples a reverse trajectory, a partially denoised state is selected, and masked-token predictions are aligned with frozen ARLM teacher distributions.

  1. Roll out the student: sample a reverse unmasking trajectory from the current DLM and fixed sampler.
  2. Select an on-policy state: choose a non-terminal partially denoised state from the realized trajectory.
  3. Query the ARLM teacher: build causal prefixes from the terminal sequence and retrieve token-level teacher distributions.
  4. Optimize KL: align the DLM prediction at masked positions with the frozen ARLM distribution.
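
The four steps can be sketched with toy stand-ins for both models. Everything here is hypothetical: `toy_student` and `toy_teacher` are hand-written functions rather than neural networks, the rollout unmasks one token per step in a single block, and the KL direction (teacher to student) is an assumption, since the text above does not specify it. The sketch only shows how the terminal sequence supplies causal prefixes for the teacher.

```python
import math
import random

MASK = -1   # placeholder id for a masked position (assumption)
VOCAB = 4   # toy vocabulary size


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_student(state, pos):
    """Hypothetical DLM: distribution at `pos` given the whole partially masked state."""
    visible = sum(t for t in state if t != MASK)
    return softmax([((visible + pos + v) % VOCAB) * 0.5 for v in range(VOCAB)])


def toy_teacher(prefix):
    """Hypothetical frozen ARLM: next-token distribution given a causal prefix."""
    return softmax([float((sum(prefix) + len(prefix) + v) % VOCAB) for v in range(VOCAB)])


def rollout(prompt, length):
    """Step 1: unmask the most confident position each step; return every visited state."""
    state = list(prompt) + [MASK] * length
    trajectory = [list(state)]
    while MASK in state:
        best = None  # (confidence, position, token)
        for pos, tok in enumerate(state):
            if tok == MASK:
                probs = toy_student(state, pos)
                conf = max(probs)
                if best is None or conf > best[0]:
                    best = (conf, pos, probs.index(conf))
        state[best[1]] = best[2]
        trajectory.append(list(state))
    return trajectory


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def opd_loss(prompt, length, rng):
    trajectory = rollout(prompt, length)
    terminal = trajectory[-1]
    state = rng.choice(trajectory[:-1])          # Step 2: non-terminal on-policy state
    masked = [i for i, t in enumerate(state) if t == MASK]
    loss = 0.0
    for i in masked:
        target = toy_teacher(terminal[:i])       # Step 3: causal prefix of the terminal sequence
        loss += kl(target, toy_student(state, i))  # Step 4: KL at each masked position
    return loss / len(masked)


if __name__ == "__main__":
    print(opd_loss(prompt=[1, 2], length=3, rng=random.Random(0)))
```

In a real setup the student and teacher would be neural networks and the loss would be backpropagated through the student only, with the teacher kept frozen.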

Rollout-Length Curriculum

Early in training, the terminal sequences produced by the student's reverse diffusion process are low-quality, because the converted student is seeing masked-token inputs and blockwise bidirectional attention for the first time. To address this, OPDLM starts with shorter rollouts and gradually increases their length, which improves training stability and convergence.
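
One simple way to realize such a curriculum is a linear ramp. The schedule shape and every parameter below (`warmup_steps`, `min_len`, `max_len`, and rounding to a multiple of `block_size`) are assumptions for illustration; the text above says only that lengths start short and grow.

```python
def rollout_length(step: int, warmup_steps: int, min_len: int,
                   max_len: int, block_size: int) -> int:
    """Hypothetical linear curriculum for the rollout length.

    Grows from min_len at step 0 to max_len at warmup_steps, then stays
    at max_len. The result is rounded down to a whole number of blocks.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        length = float(max_len)
    else:
        length = min_len + (max_len - min_len) * step / warmup_steps
    return max(block_size, int(length) // block_size * block_size)


if __name__ == "__main__":
    for step in (0, 25, 50, 100, 200):
        print(step, rollout_length(step, warmup_steps=100, min_len=64,
                                   max_len=512, block_size=4))
```

Other shapes (stepwise, exponential) would fit the same description; the linear ramp is chosen here only because it is the simplest to read.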

Key Highlights

  • Efficiency Frontier: 0.066B training tokens, 4.2e18 FLOPs (OPDLM-8B AIME-24 run from Figure 1)
  • Token Reduction: 15x-7,000x (compared with established AR-to-DLM baselines)
  • Zero-Shot Thinking: 18.6 on AIME-24 with OPDLM-8B (think@eval without explicit thinking training)
  • Specialized Math: 50.0 on AIME-24 (OPDLM-MATH-8B-Thinking)

OPDLM-8B defines a new AIME-24 Pareto point: 0.066B training tokens and 4.2e18 FLOPs, or 15x to 7,000x less training than established DLMs converted from ARLMs.

Results

General-Purpose DLM Results

OPDLM converts Qwen3 into a diffusion language model for general-purpose reasoning across knowledge, mathematics, science, and code. OPDLM-4B and OPDLM-8B reach performance competitive with existing DLMs while training on only 0.076B and 0.066B tokens, orders of magnitude fewer than the baselines, and at substantially lower FLOPs.

OPDLM-4B and OPDLM-8B achieve competitive performance across general knowledge, math, and code while using only 0.066B-0.076B training tokens.
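| Benchmark | SDAR-4B | OPDLM-4B | LLaDA-8B | Dream-7B | SDAR-8B | Fast-dLLM-v2-7B | OPDLM-8B |
|---|---|---|---|---|---|---|---|
| Training tokens | 55B | 0.076B | 1500B | 580B | 55B | 1B | 0.066B |
| FLOPs (1e18) | 1320 | 2.4 | 72000 | 24360 | 2640 | 42 | 4.2 |
| General Knowledge & Instruction Following | | | | | | | |
| MMLU | 74.9 | 65.5 | 65.5 | 67.0 | 78.6 | 66.6 | 70.9 |
| MMLU-Pro | 50.9 | 46.3 | 37.0 | 43.3 | 56.9 | 41.5 | 53.7 |
| GPQA-Diamond | 33.0 | 29.1 | 31.8 | 32.1 | 40.2 | 27.3 | 36.1 |
| IFEval | 56.6 | 53.8 | 59.9 | 62.5 | 61.4 | 65.4 | 50.1 |
| CEval | 62.9 | 66.9 | - | - | 70.2 | 70.3 | 73.3 |
| LiveBench | 25.3 | 27.8 | - | - | 28.6 | 9.5 | 25.8 |
| Mathematics & Reasoning | | | | | | | |
| GSM8K | 89.9 | 87.6 | 78.6 | 81.0 | 91.3 | 83.7 | 87.1 |
| MATH-500 | 72.8 | 72.8 | 26.6 | 39.2 | 78.6 | 65.6 | 71.2 |
| AIME-24 | 10.0 | 14.4 | 2.1 | 0.0 | 10.0 | 10.0 | 14.7 |
| AIME-25 | 7.5 | 12.6 | 0.4 | 0.0 | 10.0 | 0.0 | 12.4 |
| LMB-Hard | 6.9 | 11.1 | - | - | 8.9 | 8.9 | 20.0 |
| ZebraLogic | 6.3 | 10.5 | - | - | 7.8 | 3.5 | 12.9 |
| Code Generation | | | | | | | |
| HumanEval-base | 76.8 | 56.1 | 35.4 | 57.9 | 82.3 | 63.4 | 59.8 |
| MBPP-base | 80.7 | 57.7 | 31.5 | 68.3 | 79.6 | 63.0 | 48.7 |
| LCB-v6 | 12.6 | 10.4 | - | - | 14.5 | 9.7 | 9.7 |
| Codeforces | 4.0 | 5.0 | - | - | 5.8 | 5.0 | 3.5 |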
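
As a reading aid for the FLOPs row, the baseline entries appear consistent with the common rule of thumb that training costs roughly 6 x N x D FLOPs for N parameters and D training tokens. This is an observation from the numbers, not a statement from the source about how the row was computed, and the nominal parameter counts (4B, 7B, 8B) are assumptions read off the model names.

```python
def approx_training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6.0 * params * tokens


# (nominal parameter count, training tokens, reported FLOPs in units of 1e18)
# Parameter counts are assumed from the model names; tokens and reported
# FLOPs are the values listed for each baseline.
BASELINES = {
    "SDAR-4B": (4e9, 55e9, 1320),
    "LLaDA-8B": (8e9, 1500e9, 72000),
    "Dream-7B": (7e9, 580e9, 24360),
    "SDAR-8B": (8e9, 55e9, 2640),
    "Fast-dLLM-v2-7B": (7e9, 1e9, 42),
}

if __name__ == "__main__":
    for name, (n, d, reported) in BASELINES.items():
        estimate = approx_training_flops(n, d) / 1e18
        print(f"{name}: estimate {estimate:.0f}e18, reported {reported}e18")
```

The same rule gives about 1.8e18 and 3.2e18 for OPDLM-4B and OPDLM-8B, below the reported 2.4e18 and 4.2e18, so the OPDLM entries evidently include costs this rule does not cover; the accounting is not described here.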

Zero-Shot Results

Zero-Shot Extended Thinking

Modern ARLMs can reason through a problem inside a <think>...</think> trace before committing to an answer. We never train OPDLM to do this, yet the converted DLM does it zero-shot: when prompted to think, OPDLM-8B improves on the hardest reasoning benchmarks, raising AIME-24 from 14.7 to 18.6 and AIME-25 from 12.4 to 19.4. This suggests that the base ARLM's reasoning ability largely survives on-policy conversion, surfacing as a capability we never explicitly trained for.

OPDLM retains ARLM priors for zero-shot extended thinking, despite that behavior not being explicitly included in OPDLM training.
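| Benchmark | OPDLM-4B | OPDLM-4B think@eval | OPDLM-8B | OPDLM-8B think@eval |
|---|---|---|---|---|
| GSM8K | 87.6 | 85.3 | 87.1 | 88.0 |
| MATH-500 | 72.8 | 75.0 | 71.2 | 75.6 |
| AIME-24 | 14.4 | 11.2 | 14.7 | 18.6 |
| AIME-25 | 12.6 | 13.6 | 12.4 | 19.4 |
| LMB-Hard | 11.1 | 17.8 | 20.0 | 17.8 |
| ZebraLogic | 10.5 | 9.5 | 12.9 | 17.3 |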

Multilingual Results

OPDLM keeps the multilingual ability of the base ARLM after conversion, without any multilingual-specific training. It holds performance across MMMLU-lite, INCLUDE-lite, and MLogiQA, and even improves on multilingual math (MT-AIME 2024).

OPDLM preserves multilingual ability from the base ARLM after on-policy conversion.
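| Benchmark | SDAR-4B | OPDLM-4B | Fast-dLLM-v2-7B | SDAR-8B | OPDLM-8B |
|---|---|---|---|---|---|
| MMMLU-lite | 50.7 | 51.6 | 51.5 | 60.8 | 56.0 |
| INCLUDE-lite | 53.3 | 49.6 | 45.1 | 57.8 | 51.9 |
| MT-AIME 2024 | 3.0 | 5.3 | 4.3 | 4.0 | 7.9 |
| MLogiQA | 46.5 | 46.5 | 42.6 | 46.3 | 42.0 |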

Specialized DLM Results

Since OPDLM is a form of post-training applied to ARLMs, we can also build specialized DLMs. Below, we train OPDLM specifically for math to obtain OPDLM-MATH, using the same on-policy distillation setup. Additionally, we train OPDLM-MATH-Thinking for extended reasoning.

Without RLVR or DLM pretraining, OPDLM-MATH performs competitively with baselines and is especially strong on harder math benchmarks; thinking variants are trained as separate models for extended reasoning.
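| Model | GSM8K | MATH-500 | AIME-24 |
|---|---|---|---|
| Reference | | | |
| SDAR-4B-Chat | 90.2 | 70.2 | 5.0 |
| LLaDA-8B-Instruct | 82.5 | 37.3 | 0.5 |
| Dream-7B-Instruct | 72.7 | 38.7 | 0.0 |
| SDAR-8B-Chat | 91.1 | 74.3 | 11.8 |
| 4B Scale | | | |
| TraDo-4B-Instruct | 91.2 | 75.6 | 8.3 |
| OPDLM-MATH-4B | 83.8 | 75.8 | 10.0 |
| OPDLM-MATH-4B-Thinking | 91.7 | 90.2 | 43.3 |
| 8B Scale | | | |
| TraDo-8B-Instruct | 92.3 | 78.5 | 13.3 |
| OPDLM-MATH-8B | 86.2 | 76.6 | 23.3 |
| TraDo-8B-Thinking | 94.2 | 87.4 | 35.5 |
| OPDLM-MATH-8B-Thinking | 93.8 | 92.4 | 50.0 |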

Parallelization

We show two controls on inference throughput: lowering the decoding threshold lets OPDLM produce more tokens per denoising step, while the training block size sets the upper bound on the parallelism it can expose at inference time.
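
The threshold mechanism can be illustrated with a minimal selection rule for one denoising step inside a block. This is a sketch under stated assumptions, not the repository's sampler: the function name `select_positions`, the strict greater-than comparison, and the fallback to the single most confident position are all hypothetical choices.

```python
def select_positions(confidences: dict[int, float], threshold: float) -> list[int]:
    """Pick which masked positions of the current block to commit this step.

    Assumed rule: commit every position whose top-token confidence exceeds
    the threshold; if none does, commit only the most confident position so
    that decoding always advances by at least one token.
    """
    chosen = sorted(pos for pos, c in confidences.items() if c > threshold)
    if not chosen:
        chosen = [max(confidences, key=confidences.get)]
    return chosen


if __name__ == "__main__":
    # Toy confidences for the four masked positions of one block.
    block = {0: 0.97, 1: 0.92, 2: 0.60, 3: 0.85}
    for threshold in (0.80, 0.90, 1.00):
        print(threshold, select_positions(block, threshold))
```

Under this rule a threshold of 1.00 can never be exceeded, so exactly one token is committed per step, and no step can commit more tokens than the block contains, which is why the block size caps the available parallelism.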

MATH-500: Accuracy vs. Decoding Threshold

[Figure: MATH-500 accuracy and tokens per step versus decoding threshold]

| Threshold | Accuracy | Tokens per step |
|---|---|---|
| 0.80 | 69.2 | 2.12 |
| 0.85 | 69.4 | 2.04 |
| 0.90 | 71.6 | 1.91 |
| 0.95 | 72.0 | 1.80 |
| 1.00 | 72.8 | 1.00 |
Lowering the decoding threshold increases tokens per step with an accuracy trade-off.

MATH-500: Accuracy vs. Block Size at Threshold 0.90

[Figure: MATH-500 accuracy and tokens per step versus block size at threshold 0.90]

| Block Size | Accuracy | Tokens per step |
|---|---|---|
| 4 | 71.6 | 1.91 |
| 8 | 61.4 | 2.34 |
| 16 | 49.2 | 3.59 |
At a fixed decoding threshold of 0.90, larger block sizes increase tokens per step at the cost of accuracy.

Quick Start

OPDLM converts a pretrained Qwen3 autoregressive model into a BD3LM student with on-policy distillation. Data and model artifacts are hosted in the divelab/opdlm Hugging Face collection.

Environment

bash
git clone https://github.com/divelab/OPDLM.git
cd OPDLM

conda create -n opdlm python=3.10.19 -y
conda activate opdlm

# Install torch first.
pip install torch==2.6.0+cu124 --index-url https://download.pytorch.org/whl/cu124

# Install project dependencies.
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124

# Install flash-attn last so it builds against the active torch install.
pip install flash-attn==2.7.4.post1 --no-build-isolation

If DeepSpeed rejects a CUDA 12.x minor-version mismatch while compiling CPU Adam, set DS_SKIP_CUDA_CHECK=1.

Data

bash
# Evaluation data: 19 of the 20 paper benchmarks.
huggingface-cli download divelab/opdlm_eval_data --local-dir data/ --repo-type dataset

# Training data: opdlm_train.json, 61,816 rows.
huggingface-cli download divelab/opdlm_train_data --local-dir data/ --repo-type dataset

# Paper eval and DAPO math data that live outside the OPDLM collection.
python data/prepare_codeforces.py
huggingface-cli download BytedTsinghua-SIA/DAPO-Math-17k --local-dir data/ --repo-type dataset

Models

bash
# Teacher ARLMs.
huggingface-cli download Qwen/Qwen3-4B --local-dir $HF_HOME/Qwen3-4B
huggingface-cli download Qwen/Qwen3-8B --local-dir $HF_HOME/Qwen3-8B

# Student initializations with bidirectional attention.
huggingface-cli download divelab/Qwen3-4B-a2d-init --local-dir $HF_HOME/Qwen3-4B-a2d-init
huggingface-cli download divelab/Qwen3-8B-a2d-init --local-dir $HF_HOME/Qwen3-8B-a2d-init

Smaller Qwen3-0.6B and Qwen3-1.7B init models can be regenerated with convert_qwen_to_bd3lm.py.

Train

bash
python rl.py config=configs/rl_bd3lm.yaml \
    model.pretrained_model=$HF_HOME/Qwen3-4B-a2d-init \
    model.teacher_model=$HF_HOME/Qwen3-4B \
    dataset.train_dataset=opdlm_train

Reference launchers with the paper hyperparameters live in scripts/general_pre_train/ and scripts/post_train_dapo/. Edit DATA_PATH, STUDENT, TEACHER, and the SBATCH header for your cluster.

Evaluate

bash
python pure_inference/eval.py \
    --models <path-to-your-trained-opdlm-ckpt> \
    --model_bases bd3lm \
    --datasets HumanEval MBPP MATH500 GSM8K AIME2024 \
    --max_token 2048 \
    --remasking_strategy low_confidence_static \
    --dynamic_threshold 0.9 \
    --temperature 0.0 \
    --block_size 4 --denoising_steps_per_block 4 \
    --out_dir pure_inference/results

Citation

BibTeX
@misc{su2026opdlm,
      title={Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation},
      author={Xingyu Su and Jacob Helwig and Shubham Parashar and Atharv Chagi and Lakshmi Jotsna and Degui Zhi and James Caverlee and Dileep Kalathil and Shuiwang Ji},
      year={2026},
      eprint={2606.06712},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.06712},
}