Core Principle: Reinforcement Learning from Human Feedback (RLHF). In RLHF, humans evaluate the model's responses, and those assessments are used to steer the model toward better answers.
Brief summary. We take the GPT-3 model and additionally fine-tune it on answers written by human labelers, producing GPT-3.5. We then train a separate Reward Model to score the responses of the GPT-3.5 model (its training data is again produced by humans). Finally, we improve GPT-3.5 with reinforcement learning and obtain ChatGPT.
Supervised Fine-Tuning (SFT) Model (the GPT-3.5 model)
- Data collection: a prompt is selected from the database, and a human labeler writes the desired answer.
- This data is used to further train GPT-3 with supervised learning. The result is the SFT model (i.e., GPT-3.5).
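The SFT step above can be sketched in miniature. This is a toy stand-in, not GPT internals: a tiny next-token model is fine-tuned with cross-entropy on human-written (prompt, desired answer) demonstrations. The vocabulary, data, and learning rate are all made up for illustration.

```python
import numpy as np

# Illustrative SFT: fine-tune a tiny next-token model on
# human-written (prompt, desired answer) demonstrations.
VOCAB = ["<pad>", "hi", "hello", "bye", "goodbye"]
IDX = {t: i for i, t in enumerate(VOCAB)}

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(VOCAB), len(VOCAB)))  # "pretrained" weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Demonstrations: given the prompt token, the labeler wrote the desired reply.
demos = [("hi", "hello"), ("bye", "goodbye")]

lr = 1.0
for _ in range(200):                      # supervised fine-tuning loop
    for prompt, answer in demos:
        x, y = IDX[prompt], IDX[answer]
        p = softmax(W[x])                 # next-token distribution
        grad = p.copy()
        grad[y] -= 1.0                    # d(cross-entropy)/d(logits)
        W[x] -= lr * grad                 # gradient step

def reply(prompt):
    # Greedy decoding: most likely next token after fine-tuning.
    return VOCAB[int(np.argmax(softmax(W[IDX[prompt]])))]
```

After training, `reply("hi")` returns the demonstrated answer `"hello"`; the real procedure does the same thing at the scale of a transformer over text sequences.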
Reward Model (needed for the reinforcement learning step)
- Data collection: a prompt is selected from the database, GPT-3.5 generates several answers, and a human labeler ranks these answers from best to worst.
- This data is used to train the reward model. Input: a prompt-response pair. Output: a score (a single number).
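Human rankings are commonly turned into training signal pairwise: for each pair of responses, the loss -log sigmoid(r_better - r_worse) pushes the higher-ranked response's score up. The sketch below assumes a made-up linear reward head over hand-picked feature vectors; a real reward model scores (prompt, response) text with a transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=4)         # linear reward head (illustrative)

def reward(features):
    # Scalar score for one (prompt, response) pair, here a feature vector.
    return float(w @ features)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each pair: features of the response the labeler ranked higher vs lower.
pairs = [
    (np.array([1.0, 0.5, 0.0, 0.2]), np.array([0.1, 0.9, 1.0, 0.0])),
    (np.array([0.8, 0.1, 0.3, 0.9]), np.array([0.2, 0.4, 0.6, 0.1])),
]

lr = 0.5
for _ in range(300):
    for better, worse in pairs:
        p = sigmoid(reward(better) - reward(worse))
        g = -(1.0 - p) * (better - worse)  # gradient of -log p w.r.t. w
        w -= lr * g
```

After training, the model assigns the higher score to every response the labeler preferred, which is exactly the property the reinforcement learning step relies on.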
Reinforcement Learning
Updating the GPT-3.5 model's "policy" with Proximal Policy Optimization (PPO):
- A prompt is selected from the database.
- The model generates a response.
- The Reward Model scores the response.
- Return to step 2 (a fixed number of times, then back to step 1).
- During this process, the model's policy is updated so as to maximize the score produced by the Reward Model.
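The core of the PPO update in the loop above is the clipped surrogate objective: the probability ratio between the new and old policy is clipped to [1 - eps, 1 + eps], so each update stays close to the old policy. The function below is a minimal sketch of that objective with made-up numbers; in actual RLHF the reward-model score (minus a baseline, plus a KL penalty toward the SFT model) plays the role of the advantage.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO-clip surrogate for one action.

    ratio = pi_new(a|s) / pi_old(a|s); the min with the clipped term
    removes the incentive to move the policy far in a single update.
    """
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

adv = 1.5  # illustrative reward-model score above a baseline
ppo_clip_objective(1.0, adv)   # ratio inside the clip range: ratio * adv
ppo_clip_objective(2.0, adv)   # large ratio: capped at (1 + eps) * adv
```

Maximizing this objective over many prompts is what makes the policy drift toward responses the Reward Model scores highly, while the clipping keeps each step conservative.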