
january 27, 2025
<think>
GRPO (Group Relative Policy Optimization), the RL algorithm behind DeepSeekMath [1] and DeepSeek-R1 [2], in increasing levels of detail.

At its simplest:

- Take a question
- Generate multiple answers
- Score each answer
- Tell model which answers were good
- Repeat (sketched as a toy loop below)
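
In toy Python, the loop is just this (every name here is a made-up placeholder, not anything from the DeepSeek papers):

```python
import random

# Placeholder pieces: generate() stands in for sampling an answer from the model,
# score() for a reward model or hardcoded verifier, update() for the weight update.
def generate(model, question):
    return f"{question} -> answer #{random.randint(0, 9)}"

def score(question, answer):
    return random.random()

def update(model, question, answers, scores):
    pass  # nudge the model toward the answers that scored well

model = None                    # stand-in for the actual LLM
questions = ["q1", "q2", "q3"]  # stand-in training questions
G = 4                           # answers generated per question

for step in range(100):                                       # repeat
    question = random.choice(questions)                       # take a question
    answers = [generate(model, question) for _ in range(G)]   # generate multiple answers
    scores = [score(question, a) for a in answers]            # score each answer
    update(model, question, answers, scores)                  # tell the model which were good
```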
With a bit more detail, each training step looks like this:

- Sample a batch of questions
- For each question, generate \(G\) different answers
- Score all answers
- Calculate the relative goodness of each answer (how much better/worse than average; see the small example after this list)
- Update the model to make good answers more likely and bad answers less likely, weighted by the relative goodness
- But don’t change the model too much!
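
The "relative goodness" step is just each answer's reward minus the group average. A minimal numpy sketch with made-up rewards for one question:

```python
import numpy as np

# Made-up rewards for G = 4 answers to a single question.
rewards = np.array([1.0, 0.0, 1.0, 0.5])

# How much better or worse each answer is than the group average.
relative_goodness = rewards - rewards.mean()
print(relative_goodness)  # [ 0.375 -0.625  0.375 -0.125]

# The update then makes answers with positive values more likely and answers
# with negative values less likely, in proportion to these numbers.
```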
Now with some notation:

- Sample a batch of questions \(q \sim P(Q)\)
- For each question, generate \(G\) different answers (let’s call them \(\{o_i\}_{i=1}^G\))
- Score all answers \(r_i \sim r_{\varphi}(o_i)\) (\(r_{\varphi}\) can be a trained reward model or just a hardcoded verifier)
- Calculate the advantage \(A_i\) (just \(r_i\) with some normalization across group)
- Weight \(A_i\) by the relative likelihood of sampling \(o_i\) compared to the last step
- Update the model to maximize the weighted advantage!
- To keep from changing the model too much, punish big changes (\(KL\) divergence); see the numerical sketch below
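
A rough numpy sketch of one group's worth of those steps, with made-up rewards and probabilities (the KL penalty from the last bullet shows up in the full objective in the final pass, so it is left out here):

```python
import numpy as np

# One group of G = 4 answers to a single question (made-up rewards).
rewards = np.array([1.0, 0.0, 1.0, 0.5])

# Advantage: reward normalized across the group.
A = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# How much more/less likely each answer is now vs. when it was sampled
# (placeholder probabilities; in practice these come from the model's log-probs).
p_new = np.array([0.20, 0.05, 0.30, 0.10])
p_old = np.array([0.15, 0.10, 0.20, 0.10])
ratio = p_new / p_old

# Clip the ratio so no single answer can drive a huge update.
eps = 0.2
clipped = np.clip(ratio, 1 - eps, 1 + eps)

# Per-answer goodness of the update: the more pessimistic of the two weighted terms.
per_answer = np.minimum(ratio * A, clipped * A)

# Averaged over the group, this is (most of) the quantity we maximize.
print(per_answer.mean())
```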
Finally, in full detail:

- Sample \(q \sim P(Q), \{o_i\}^G_{i=1}\sim \pi_{\theta_{old}}(O\vert q)\)
- Score all answers \(r_i \sim r_{\varphi}(o_i)\)
- Calculate the advantage \(A_i = \frac{r_i - \mu_r}{\sigma_r}\), where \(\mu_r\) and \(\sigma_r\) are the mean and standard deviation of the rewards within the group
- Denote our old policy (model) \(\pi_{\theta_{old}}\) and our new policy \(\pi_{\theta_{new}}\)
- Given a question \(q\) and an answer \(o_i\), the ratio \(\frac{\pi_{\theta_{new}}(o_i\vert q)}{\pi_{\theta_{old}}(o_i\vert q)}\) tells us how much more/less likely that answer is after the update.
- To avoid really big updates, we can also clip the ratio to be between \(1 - \varepsilon\) and \(1 + \varepsilon\) (e.g. \(0.8\) and \(1.2\)).
- Get an estimate of how good/bad the update is by taking the min of two terms: \(A_i\) weighted by the ratio, and \(A_i\) weighted by the clipped ratio.
- Averaged over the group, we get an estimate of the total goodness of the update.
- Our final goodness is this value minus a weighted term for how far the model has drifted from a fixed reference policy \(\pi_{ref}\) (the model we started RL from): \(\beta\,\mathbb{D}_{KL}(\pi_{\theta_{new}} \,\vert\vert\, \pi_{ref})\)
- Update the model to maximize this goodness!
- Let our goodness function be:

  \[
  \mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O\vert q)}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\rho_i A_i,\ \text{clip}\left(\rho_i, 1-\varepsilon, 1+\varepsilon\right) A_i\right) - \beta\,\mathbb{D}_{KL}\left(\pi_{\theta_{new}}\,\vert\vert\,\pi_{ref}\right)\right)\right]
  \]

  where \(\rho_i = \frac{\pi_{\theta_{new}}(o_i\vert q)}{\pi_{\theta_{old}}(o_i\vert q)}\) is the likelihood ratio from before.
- Expanding the ratio, we get:

  \[
  \mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O\vert q)}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_{\theta_{new}}(o_i\vert q)}{\pi_{\theta_{old}}(o_i\vert q)} A_i,\ \text{clip}\left(\frac{\pi_{\theta_{new}}(o_i\vert q)}{\pi_{\theta_{old}}(o_i\vert q)}, 1-\varepsilon, 1+\varepsilon\right) A_i\right) - \beta\,\mathbb{D}_{KL}\left(\pi_{\theta_{new}}\,\vert\vert\,\pi_{ref}\right)\right)\right]
  \]
- The advantage \(A_i\) is calculated within each group:

  \[
  A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_2, \ldots, r_G\})}
  \]
- The clipped surrogate objective ensures that updates do not deviate excessively from the old policy \(\pi_{\theta_{old}}\) by bounding the policy ratio between \(1 - \varepsilon\) and \(1 + \varepsilon\).
- And the unbiased estimator of the KL divergence term keeps the model close to the reference policy:

  \[
  \mathbb{D}_{KL}\left(\pi_{\theta_{new}}\,\vert\vert\,\pi_{ref}\right) = \frac{\pi_{ref}(o_i\vert q)}{\pi_{\theta_{new}}(o_i\vert q)} - \log\frac{\pi_{ref}(o_i\vert q)}{\pi_{\theta_{new}}(o_i\vert q)} - 1
  \]
- Maximize \(\mathcal{J}\)! (kinda looks like a whale)
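
Here is a minimal PyTorch-style sketch of that objective as a loss function, assuming you already have sequence-level log-probabilities for each answer under the new, old, and reference policies (names, the 1e-8 epsilon, and the \(\beta\) value are placeholders; real implementations work per-token, with masking and batching across many questions):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """GRPO objective for one group of G answers, negated so it can be minimized."""
    # Advantage A_i: reward normalized across the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Likelihood ratio pi_new / pi_old, plus its clipped version.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)

    # Clipped surrogate: the more pessimistic of the two weighted advantages.
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)

    # Unbiased KL estimator: pi_ref/pi_new - log(pi_ref/pi_new) - 1.
    log_ref_ratio = logp_ref - logp_new
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1

    # Maximize (surrogate - beta * KL)  <=>  minimize its negation.
    return -(surrogate - beta * kl).mean()

# Made-up numbers for one group of G = 4 answers.
G = 4
logp_new = torch.randn(G, requires_grad=True)
logp_old = logp_new.detach() + 0.1 * torch.randn(G)
logp_ref = logp_new.detach() + 0.1 * torch.randn(G)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])

loss = grpo_loss(logp_new, logp_old, logp_ref, rewards)
loss.backward()  # gradients flow into logp_new (and, in practice, the model)
```

The min with the clipped term and the KL penalty are what implement the "don't change the model too much" idea from the earlier passes.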
[1] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
[2] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning