This short tutorial explains the training objectives used to develop ChatGPT, the new chatbot language model from OpenAI.

Timestamps:
0:00 – Non-intro
0:24 – Training overview
1:33 – Generative pretraining (the raw language model)
4:18 – The alignment problem
6:26 – Supervised fine-tuning
7:19 – Limitations of supervision: distributional shift
8:50 – Reward learning based on preferences
10:39 – Reinforcement learning from human feedback
13:02 – Room for improvement

ChatGPT:

Relevant papers for learning more:
InstructGPT: Ouyang et al., 2022 –
GPT-3: Brown et al., 2020 –
PaLM: Chowdhery et al., 2022 –
Efficient reductions for imitation learning: Ross & Bagnell, 2010 –
Deep reinforcement learning from human preferences: Christiano et al., 2017 –
Learning to summarize from human feedback: Stiennon et al., 2020 –
Scaling laws for reward model overoptimization: Gao et al., 2022 –
Proximal policy optimization algorithms: Schulman et al., 2017 –

Special thanks to Elmira Amirloo for feedback on this video.

Links:
YouTube:
Twitter:
Homepage:

If you’d like to help support the channel (completely optional), you can donate a cup of coffee via the following:
Venmo:
PayPal:


At this point, ChatGPT may not really need an introduction. If you’ve used it, you’ve probably seen that it can often answer questions about code snippets, help summarize documents, write coherent short stories, and the list goes on. It has plenty of limitations and is very capable of making mistakes, but it represents just how far language modeling has come these past few years. In this video we’ll specifically cover the training objectives used to develop ChatGPT.

The approach follows a similar methodology to InstructGPT, which was released about a year prior. That model was specifically geared towards instruction following: the user provides a single request and the model outputs a response. ChatGPT extends this to more interactive dialogues with back-and-forth messages, where the model can retain and use context from earlier in an exchange.

There are three main stages to the training recipe: generative pre-training, where we train a raw language model on text data; supervised fine-tuning, where the model is further trained to mimic ideal chatbot behavior demonstrated by humans; and reinforcement learning from human feedback.

In that last stage, human preferences over alternative model outputs are used to define a reward function for additional training with RL. In each of these steps, the model is fine-tuned from the result of the step prior; that is, its weights are initialized to the final weights obtained in the previous stage. We’ll step through these one by one.

So first, to refresh: what is a language model? It’s basically a special case of an autoregressive sequence model. Given some history of observed variables x1 through xt, a basic sequence model is tasked with predicting x(t+1). During training, we extract sequences from a dataset and adjust the model’s parameters to maximize the probability assigned to the true x(t+1) values, conditioned on the histories (a small code sketch of this objective appears below). This next-step prediction paradigm has been applied across different domains, like audio waveforms and molecular graphs. In the case of language models, the individual variables, which we’ll refer to as tokens, can represent words or sub-components of words. For example, given the first four words of a sentence, a language model using word tokens will output a probability distribution over all the words in its vocabulary, indicating how likely each word is to come next, at least according to the model’s estimation.
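To make the next-token objective concrete, here is a minimal PyTorch-style sketch. The model name, batch shapes, and training setup are assumptions for illustration, not the actual OpenAI training code.

import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: integer tensor of shape (batch, seq_len) holding token ids
    inputs = tokens[:, :-1]     # histories x_1 .. x_{T-1}
    targets = tokens[:, 1:]     # true next tokens x_2 .. x_T
    logits = model(inputs)      # assumed shape: (batch, seq_len - 1, vocab_size)
    # Maximizing the log-probability of the true next token is the same as
    # minimizing the cross-entropy between the predicted distribution and the target.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))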

The vast majority of tokens would receive close to zero probability, as they wouldn’t make any sense here, but a few plausible ones will get some non-zero probability mass.

Now, even though a language model is simply trained to predict the next token in text data, at inference time we usually want to not just passively predict but actually generate sequences of tokens. At each step we can sample a token from the probability distribution output by the model and repeat this process over and over until a special stop token is selected (see the sketch below).

This setup is agnostic to the specific model architecture used, but typically modern language models are large Transformers consisting of billions of parameters. I have a previous video diving into the architectural details and math behind Transformers. Example models from the past few years include GPT-3 from OpenAI or PaLM from Google. These are trained on massive amounts of text scraped from the internet: chat forums, blogs, books, scripts, academic papers, code, really anything.
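Here is a minimal sketch of that autoregressive sampling loop, assuming a hypothetical model that returns next-token logits and an integer stop_token id. This is illustrative only, not ChatGPT’s actual decoding code.

import torch

def sample_sequence(model, prompt_ids, stop_token, max_new_tokens=200):
    # prompt_ids: 1-D tensor of token ids for the conditioning history
    tokens = prompt_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(tokens.unsqueeze(0))[0, -1]   # logits for the next position
        probs = torch.softmax(logits, dim=-1)        # distribution over the vocabulary
        next_token = torch.multinomial(probs, 1)     # sample one token
        tokens = torch.cat([tokens, next_token])
        if next_token.item() == stop_token:          # stop when the special token is drawn
            break
    return tokens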

The amount of history the models can condition on during inference has a limit, though. For ChatGPT, the underlying language model can attend to about 3,000 words of prior context: long enough for short conversations, but obviously not sufficient to output an entire novel.

By pre-training on this large amount of unstructured, heterogeneous text data, we allow the model to learn sophisticated probabilistic dependencies between words, sentences, and paragraphs across different use cases of human language.

So why is this generative modeling formulation, this basic language model objective, not enough? Why can’t this alone produce the end behavior we see from ChatGPT? Well, the user wants the model to directly follow instructions or to engage in an informative dialogue. So there’s actually a misalignment between the task implicit in the language modeling objective and the downstream task that the model developers or end users want the model to perform. The task represented by the language model pre-training is actually a huge mixture of tasks, and the user input is not necessarily enough to disambiguate among these.

For example, say a user provides this input: “Explain how the bubble sort algorithm works.” To us it’s obvious what the user wants the model to do, but the model is only trained to output plausible completions or continuations to pieces of text, so responding with a sentence like “Explain how the merge sort algorithm works” is not entirely unreasonable. After all, somewhere in its training data there are documents that just contain lists of questions on different topics, like exams. The task we want the model to perform is actually just a subset of those represented in the data.

Now, even without any extra training, we can often get a language model to perform a desired task via prompting. We do this by first conditioning the model on a manually constructed example illustrating the desired behavior (a small example prompt is sketched below). But this is extra work on the part of the user and can be tedious.

In addition to the task being underspecified for a raw language model, there are also subjective preferences that the developers may have regarding other characteristics of the model’s output. For example, they may want the model to refuse to answer queries seeking advice for committing acts of violence or other illicit activities.
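As a hypothetical illustration of that kind of prompt, a user might prepend a worked example so the raw language model treats the input as a question to answer rather than a list to continue. The example text here is purely made up for illustration.

# A hypothetical one-shot prompt for a raw (non-fine-tuned) language model.
prompt = (
    "Q: Explain how binary search works.\n"
    "A: Binary search repeatedly halves a sorted list, comparing the target "
    "to the middle element until it is found or the range is empty.\n\n"
    "Q: Explain how the bubble sort algorithm works.\n"
    "A:"
)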

When dealing with a probabilistic model, it may be difficult to completely eliminate violations of these specifications, but developers would like to minimize their frequency if possible.

So in the second stage, the model is fine-tuned with straightforward supervised learning. Human contractors first conduct conversations where they play both sides, both the human user and the ideal chatbot. These conversations are aggregated into a dataset where each training example consists of a particular conversation history paired with the next response of the human acting as the chatbot. Given a particular history, the objective is to maximize the probability the model assigns to the sequence of tokens in the corresponding response. This can be viewed as a typical imitation learning setup, specifically behavior cloning, where we attempt to mimic an expert’s action distribution conditioned on an input state.

Already with this step, the model does much better than the raw language model at responding to user requests, with less need for prompting. But it still has limitations.
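A minimal sketch of the supervised fine-tuning objective described above, assuming the conversation history and the demonstrated response have already been tokenized. The names and shapes are assumptions for illustration, not the actual training pipeline.

import torch
import torch.nn.functional as F

def sft_loss(model, history_ids, response_ids):
    # history_ids, response_ids: 1-D tensors of token ids
    tokens = torch.cat([history_ids, response_ids]).unsqueeze(0)
    logits = model(tokens[:, :-1])                 # predict each next token
    targets = tokens[:, 1:].clone()
    targets[:, : history_ids.numel() - 1] = -100   # only score the response tokens
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=-100)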

There’s a problem of distributional shift when it comes to this type of imitative setting, whether it’s in language or other domains like driving a car or playing a game. The distribution of states during training of the model is determined by the expert policy, the behavior of the human demonstrator. But at inference time it’s the model, or the agent itself, that influences the distribution of visited states. And in general the model does not learn the expert’s policy exactly; it may approximate it decently, but a variety of practical factors can limit this approximation, whether it’s insufficient training data, partial observability of the environment, or optimization difficulties.

So as the model takes actions, it may make mistakes that the human demonstrator was unlikely to, and once such an action takes place it can lead to a new state that has lower support under the training distribution. This can lead to a compounding-error effect: the model may be increasingly prone to errors as these novel states are encountered, where it has less training experience. It can be shown theoretically that the expected error actually grows quadratically in the length of an episode. In the case of a language model, early mistakes may derail it, causing it to make overconfident assertions or output complete nonsense.
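The quadratic growth mentioned here comes from the behavior cloning analysis in the Ross & Bagnell (2010) paper listed in the description. Roughly, if the cloned policy makes a mistake with probability at most \epsilon per step under the expert’s state distribution, then over an episode of length T the expected cost can degrade as

J(\hat{\pi}) \le J(\pi^{*}) + T^{2}\epsilon

so the gap from the expert grows quadratically in T rather than linearly, reflecting how errors compound once the agent drifts off the training distribution.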

To mitigate this, we need the model, or the agent, to also act during training, not merely passively observe an expert. One way to do this is to further fine-tune the model with reinforcement learning.

Certain RL settings already come with a predefined reward function. If we think about, say, Atari games, then there is an unambiguous reward collected as the game progresses. Without this, we would typically need to manually construct some reward function, but of course this is hard in the case of language: doing well in a conversation is difficult to define precisely. We could have labelers try to assign numerical scores directly, but it may be challenging to calibrate these. Instead, the developers of ChatGPT establish a reward function based on human preferences. AI trainers first have conversations with the current model. Then, for any given model response, a set of alternative responses are also sampled, and a human labeler ranks them from most to least preferred. To distill this information into a scalar reward suitable for reinforcement learning, a separate reward model, initialized with weights from the supervised model, is trained on these rankings.

Now, given a ranking over K outputs for a given input, we can form K-choose-2 training pairs. The reward model will assign a scalar score to each member of a pair, representing logits, or unnormalized log probabilities: the greater the score, the greater the probability the model is placing on that response being preferred. Standard cross-entropy is used for the loss, treating the reward model as a binary classifier. Once trained, the scalar scores can be used as rewards. This will enable more interactive training than the purely supervised setting.

During the reinforcement learning stage, our policy model, that is, the chatbot, will be fine-tuned from the final supervised model. It emits actions, i.e., sequences of tokens, when responding to a human in a conversational environment. Given a particular state, that is, a conversation history, and a corresponding action, the reward model returns the numerical reward. The developers elect to use proximal policy optimization, or PPO, as the reinforcement learning algorithm here. We won’t go into the details of PPO in this video, but it has been a popular choice across different domains.
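A minimal sketch of that pairwise reward-model loss, assuming a hypothetical reward_model that maps a (history + response) token sequence to a single scalar score. Illustrative only.

import torch
import torch.nn.functional as F

def pairwise_rm_loss(reward_model, input_ids, preferred_ids, rejected_ids):
    # The scalar scores act as logits for "this response is the preferred one".
    score_pref = reward_model(torch.cat([input_ids, preferred_ids]).unsqueeze(0))
    score_rej = reward_model(torch.cat([input_ids, rejected_ids]).unsqueeze(0))
    # Cross-entropy on the score difference: push up P(preferred beats rejected).
    return -F.logsigmoid(score_pref - score_rej).mean()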

Now, the learned reward model we’re optimizing against here is a decent approximation to the true objective we care about, the human’s subjective assessment, but it’s still just an approximation, a proxy objective. In previous work it’s been shown that there’s a danger of over-optimizing this kind of learned reward, where the policy’s performance eventually starts degrading on the true downstream task even while the reward model scores continue to improve: the policy is exploiting deficiencies in the learned reward model. This is reminiscent of Goodhart’s law, which states that when a measure becomes a target, it ceases to be a good measure. So the authors of the InstructGPT paper describe avoiding this over-optimization by applying an additional term to the PPO objective, penalizing the KL divergence between the RL policy and the policy learned previously from supervised fine-tuning.

The combination of reward model learning and PPO is iterated several times. At each iteration, the updated policy can be used to gather more responses for preference ranking, and a new reward model is trained. Then the policy is updated through another round of PPO.
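A sketch of the KL-penalized reward just described, assuming log-probabilities of the sampled response tokens are available from both the RL policy and the frozen supervised policy. The function name and the coefficient beta are illustrative assumptions, not values from the paper.

def penalized_reward(rm_score, logprobs_rl, logprobs_sft, beta=0.02):
    # rm_score: scalar score from the learned reward model for the full response
    # logprobs_rl / logprobs_sft: per-token log-probabilities under the RL policy
    # and the frozen supervised (SFT) policy, respectively
    kl_estimate = sum(lp_rl - lp_sft
                      for lp_rl, lp_sft in zip(logprobs_rl, logprobs_sft))
    # Subtracting the KL estimate discourages drifting far from the SFT policy.
    return rm_score - beta * kl_estimate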

The two fine-tuning steps together, that is, the supervised learning and the reinforcement learning from human feedback, have a dramatic effect on the model’s performance. For InstructGPT, the predecessor model to ChatGPT, evaluation showed that on average labelers preferred responses from a model with “only” 1.3 billion parameters over the original 175-billion-parameter GPT-3 from which it was fine-tuned.

Despite ChatGPT’s legitimately sophisticated capabilities, there is still much room for improvement. It will sometimes spit out inaccurate or completely made-up facts, and it cannot link out to explicit sources. And its behavior is still highly dependent on the specific wording of an input: even though prompting is far less needed compared to a base language model, some amount of input hacking may still be required depending on the specific behavior desired. But of course these are interesting questions to explore from here, as new models are developed building on top of this progress.

Check out the links in the description to learn more. Thanks for watching!
