Aligning LLMs: DPO
Join AI Makerspace on Wednesday, Feb. 28, at 10 AM PT!
Fine-tuning and alignment are often misunderstood terms when it comes to Large Language Models (LLMs). In this series on Aligning LLMs, we will cover the most popular fine-tuning-based alignment methods, as well as emerging techniques, namely:
- Reinforcement Learning with Human Feedback (RLHF)
- Reinforcement Learning with AI Feedback (RLAIF)
- Direct Preference Optimization (DPO)
- Reasoning with Reinforced Fine-Tuning (ReFT)
In our third event, we tackle Direct Preference Optimization, the “much simpler alternative to RLHF.”
The methods that we’ve covered so far rely on reinforcement learning, typically via a Proximal Policy Optimization (PPO) scheme that leverages a separately trained reward model. With DPO, we do not need to train another LLM to act as the reward model, and furthermore, we do not need reinforcement learning at all.
The “standard RLHF problem” can be characterized as reward maximization with a KL-divergence constraint; in other words, we want our policy LLM to produce answers that are as harmless (i.e., as high-reward) as possible while simultaneously not straying too far from our initial reference model.
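Concretely, this objective is commonly written as follows (this is the form used in the DPO paper), where $\pi_\theta$ is the policy LLM being trained, $\pi_{\mathrm{ref}}$ is the frozen reference model, $r_\phi$ is the learned reward model, and $\beta$ controls how strongly we penalize drift away from the reference:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]$$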
Using DPO, we can maximize rewards using our preference data (containing chosen and rejected pairs) along with a simple binary cross-entropy loss function. This is the same loss function used in classic ML classification problems!
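To make that concrete, here is a minimal sketch of the DPO loss in PyTorch. It assumes you have already computed per-sequence log-probabilities of the chosen and rejected completions under both the policy and the frozen reference model (the function name and argument names here are illustrative, not from any particular library):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-sequence log-probabilities, each of shape (batch,)."""
    # Implicit "rewards": how much more likely the policy makes each
    # completion relative to the frozen reference model.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Binary cross-entropy on the margin: push the chosen completion's
    # implicit reward above the rejected completion's.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

Notice there is no reward model and no PPO rollout here: the preference pairs and the log-ratios against the reference model do all the work.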
In this event, we’ll break down the steps of DPO and in doing so we’ll point out where it differs from the techniques of RLHF and RLAIF already covered. We’ll also discuss why it seems to be everywhere on the Open LLM leaderboard, and why all indications are that it is rapidly becoming the new industry standard.
As always, we’ll perform a detailed demonstration of how to code the core aspects of DPO yourself in a Google Colab notebook environment, and all code will be provided directly to attendees!
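To give a flavor of what the demo covers, here is a rough sketch of DPO fine-tuning with Hugging Face’s trl library. Exact argument names and defaults vary across trl versions, and the model and dataset names below are placeholders, so treat this as an outline rather than copy-paste-ready code:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "your-sft-model"  # placeholder: start from a supervised fine-tuned model
model = AutoModelForCausalLM.from_pretrained(model_name)      # policy to optimize
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with "prompt", "chosen", and "rejected" text columns.
train_dataset = load_dataset("your-preference-dataset", split="train")  # placeholder

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    beta=0.1,  # strength of the implicit KL-style penalty
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="dpo-output",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=5e-7,
    ),
)
trainer.train()
```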
Join us live to learn:
- Why DPO is quickly becoming the de-facto industry standard for alignment
- How DPO uses a bit of sophisticated math to replace the need for RL
- How to leverage DPO in your own LLM application development for alignment
Speakers:
- Dr. Greg Loughnane is the Co-Founder & CEO of AI Makerspace, where he serves as an instructor for their AI Engineering Bootcamp. Since 2021 he has built and led industry-leading Machine Learning education programs. Previously, he worked as an AI product manager, a university professor teaching AI, an AI consultant and startup advisor, and an ML researcher. He loves trail running and is based in Dayton, Ohio.
- Chris Alexiuk is the Co-Founder & CTO at AI Makerspace, where he serves as an instructor for their AI Engineering Bootcamp. Previously, he’s held roles as a Founding Machine Learning Engineer, Data Scientist, and ML curriculum developer and instructor. He’s a YouTube content creator whose motto is “Build, build, build!” He loves Dungeons & Dragons and is based in Toronto, Canada.