DPO for Alignment

Check out the tutorial notebooks at the links below. Right-click a GitHub link to save the notebook file locally.

DPO for alignment: View on GitHub. Use this version if your GPU has >= 80 GB of HBM.

Lite version: View on GitHub. Use this version if your GPU has < 80 GB of HBM; it uses smaller LLMs and finishes faster.

Task, Dataset, and Prompt

This tutorial shows Direct Preference Optimization (DPO) for aligning LLMs with human preferences. DPO is a simpler alternative to PPO that directly optimizes the policy model using preference data without requiring a separate reward model.
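For reference, the standard sigmoid (Bradley-Terry) DPO objective trains the policy $\pi_\theta$ directly on preference triples of a prompt $x$, a chosen response $y_w$, and a rejected response $y_l$, using a frozen reference model $\pi_{\mathrm{ref}}$:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

The coefficient $\beta$ controls how far the policy may drift from the reference model: a smaller $\beta$ permits more divergence, a larger $\beta$ keeps the policy closer to it. The configurations later in this tutorial vary $\beta$ and the loss type around this basic objective.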

The tutorial uses the “ultrafeedback binarized” dataset; see its details on Hugging Face. We sample 500 training examples to keep demo runtimes tractable.

The dataset contains paired preference data with chosen (preferred) and rejected (dispreferred) responses for the same prompts.
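As a minimal sketch, such a sample can be drawn with the datasets library; the dataset id and split name below follow the public HuggingFaceH4/ultrafeedback_binarized release and may differ from what the notebook actually uses.

```python
# Sketch: load the binarized UltraFeedback preference pairs and take a small sample.
# The dataset id and "train_prefs" split name are assumptions based on the public release.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
sample = ds.shuffle(seed=42).select(range(500))  # 500 examples keep the demo runtime short

row = sample[0]
print(row["prompt"])                   # the shared prompt
print(row["chosen"][-1]["content"])    # preferred (chosen) assistant reply
print(row["rejected"][-1]["content"])  # dispreferred (rejected) assistant reply
```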

Training starts from a pre-trained SFT model (rapidfire-ai-inc/mistral-7b-sft-bnb-4bit) so that the model's output distribution is already well suited to DPO alignment training.

Model, Adapter, and Trainer Knobs

We fine-tune the Mistral-7B-Instruct-v0.3 base model with 4-bit quantization and LoRA adapters (QLoRA).
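A minimal sketch of this setup with TRL's DPOTrainer and a PEFT LoRA adapter is shown below. The checkpoint name comes from this tutorial; the LoRA rank, target modules, beta, and batch-size values are illustrative assumptions rather than the notebook's exact settings, and the API follows recent TRL releases.

```python
# Sketch: 4-bit SFT checkpoint + LoRA adapter + DPOTrainer (hyperparameters are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

model_id = "rapidfire-ai-inc/mistral-7b-sft-bnb-4bit"  # pre-quantized SFT starting checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Mistral tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # compute dtype for the non-quantized layers
    device_map="auto",
)

peft_config = LoraConfig(
    r=64,                        # LoRA rank for the medium-capacity variant
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="dpo-mistral-7b",
    beta=0.3,                    # large beta keeps the policy close to the reference
    loss_type="sigmoid",         # standard Bradley-Terry DPO loss
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=sample,        # the 500-example preference sample from the previous sketch
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```

Because a peft_config is supplied and no explicit ref_model is passed, TRL uses the adapter-disabled base weights as the frozen reference model, so a second full copy of the 7B model is not kept in memory.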

There are 4 DPO training configurations that explore different loss functions and hyperparameters:

  • Basic Bradley-Terry: Standard sigmoid loss with a medium-capacity LoRA adapter (rank 64) and a large beta.

  • High divergence: A high-capacity LoRA adapter (rank 128) with a small beta to encourage divergence from the reference model.

  • Robust loss: Uses the robust loss type with label smoothing to handle noisy preference data.

  • Combined loss: A weighted combination of sigmoid, BCO pair, and SFT losses.

All configurations use QLoRA with the same target modules and are launched with a simple grid search, totaling 4 combinations.
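To make the grid concrete, one way to write the four variants down is as plain Python dicts, as in the sketch below; the specific ranks, betas, smoothing, and weights are illustrative placeholders, not the notebook's exact values. In recent TRL releases, DPOConfig also accepts loss_type as a list paired with loss_weights, which is how the combined variant can be expressed.

```python
# Illustrative grid of the four DPO configurations (all values are assumptions).
dpo_grid = [
    {   # 1. Basic Bradley-Terry: medium-capacity adapter, large beta
        "lora_r": 64, "beta": 0.3, "loss_type": "sigmoid",
    },
    {   # 2. High divergence: high-capacity adapter, small beta
        "lora_r": 128, "beta": 0.05, "loss_type": "sigmoid",
    },
    {   # 3. Robust loss: label smoothing to tolerate noisy preference labels
        "lora_r": 64, "beta": 0.1, "loss_type": "robust", "label_smoothing": 0.1,
    },
    {   # 4. Combined loss: weighted mix of sigmoid, BCO-pair, and SFT terms
        "lora_r": 64, "beta": 0.1,
        "loss_type": ["sigmoid", "bco_pair", "sft"],
        "loss_weights": [1.0, 0.5, 0.5],
    },
]

for cfg in dpo_grid:
    print(cfg)  # each dict would parameterize one LoraConfig + DPOConfig run in the grid
```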

The lite version uses a smaller LoRA rank of 16 and a subset of 3 of the configurations above.