
Aligning LLMs: DPO

Join AI Makerspace on Wednesday, Feb. 28, at 10 AM PT, for a workshop on DPO! Come learn how it differs from the techniques of RLHF and why all indications show it's rapidly becoming the new industry standard.

by Sarah DeSouza

Fine-tuning and alignment are often misunderstood terms regarding Large Language Models (LLMs). In this series on Aligning LLMs, we will cover the most popular fine-tuning alignment methods, as well as emerging techniques, namely:

  1. Reinforcement Learning with Human Feedback (RLHF)
  2. Reinforcement Learning with AI Feedback (RLAIF)
  3. Direct Preference Optimization (DPO)
  4. Reasoning with Reinforced Fine-Tuning (ReFT)

In our third event, we tackle Direct Preference Optimization, the “much simpler alternative to RLHF.”

The methods we’ve covered so far rely on reinforcement learning, specifically a Proximal Policy Optimization (PPO) scheme that leverages a separately trained reward model. With DPO, we need neither to train another LLM to act as the reward model nor to use reinforcement learning at all.

The “standard RLHF problem” can be characterized as reward maximization under a KL-divergence constraint; in other words, we want our policy LLM to produce answers that are as harmless as possible while not straying too far from our initial reference model.
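In symbols, that constrained objective is commonly sketched as follows (notation assumed here: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $r$ the reward model, and $\beta$ the weight on the KL penalty):

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(y \mid x)}
\big[\, r(x, y) \,\big]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(y \mid x) \;\|\; \pi_{\mathrm{ref}}(y \mid x) \,\big]
```

The first term pushes the policy toward high-reward answers; the second penalizes it for drifting away from the reference model.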

Using DPO, we can maximize rewards using our preference data (containing chosen and rejected pairs) along with a simple binary cross-entropy loss function, the same loss function used in classic ML classification problems!
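To make that concrete, here is a minimal, hypothetical sketch of the per-pair DPO loss in plain Python. It assumes you already have log-probabilities of the chosen and rejected responses under both the policy and the frozen reference model; real implementations (e.g., TRL's DPOTrainer) operate on batched tensors instead, and the function name and signature below are illustrative only:

```python
import math

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) preference pair (illustrative)."""
    # Implicit rewards are beta-scaled log-ratios of policy to reference.
    chosen_reward = beta * (pi_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (pi_rejected_logp - ref_rejected_logp)
    # Binary cross-entropy with the "chosen wins" label: -log(sigmoid(margin)).
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree exactly, the margin is 0 and the loss is log(2).
```

Minimizing this loss widens the gap between the policy's implicit reward for chosen versus rejected responses, with no reward model and no RL loop.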

In this event, we’ll break down the steps of DPO, and in doing so we’ll point out where it differs from the RLHF and RLAIF techniques we’ve already covered. We’ll also discuss why it seems to be everywhere on the Open LLM Leaderboard, and why all indications are that it is rapidly becoming the new industry standard.

As always, we’ll perform a detailed demonstration of how to code the core aspects of DPO yourself in a Google Colab notebook environment, and all code will be provided directly to attendees!

Join us live to learn:

  • Why DPO is quickly becoming the de facto industry standard for alignment
  • How DPO uses a bit of sophisticated math to replace the need for RL
  • How to leverage DPO in your own LLM application development for alignment

Speakers:

  • Dr. Greg Loughnane is the Co-Founder & CEO of AI Makerspace, where he serves as an instructor for their AI Engineering Bootcamp. Since 2021 he has built and led industry-leading Machine Learning education programs.  Previously, he worked as an AI product manager, a university professor teaching AI, an AI consultant and startup advisor, and an ML researcher.  He loves trail running and is based in Dayton, Ohio.
  • Chris Alexiuk is the Co-Founder & CTO at AI Makerspace, where he serves as an instructor for their AI Engineering Bootcamp. Previously, he’s held roles as a Founding Machine Learning Engineer, Data Scientist, and ML curriculum developer and instructor. He’s a YouTube content creator whose motto is “Build, build, build!” He loves Dungeons & Dragons and is based in Toronto, Canada.

Data Phoenix Digest

Subscribe to the weekly digest with a summary of the top research papers, articles, news, and our community events, to keep track of trends and grow in the Data & AI world!

