Understanding ChatGPT

A summary of ChatGPT and GPT techniques

OpenAI is back in the headlines with news that it is updating its viral ChatGPT with a new version called GPT-4. If ChatGPT is the car, then GPT-4 is the engine: a powerful general-purpose technology that can be adapted to a number of different uses. GPT-3, and now GPT-4, are among the internet’s best-known language-processing AI models.

What is ChatGPT?

ChatGPT is a large language model developed by OpenAI and based on the GPT (Generative Pre-trained Transformer) architecture. It is designed to understand natural language, generate human-like responses, and complete a variety of language tasks including language translation, text summarization, question answering, and more. It has been trained on a massive amount of text data from the internet, books, and other sources, allowing it to learn about a wide range of topics and provide informative and engaging responses.

What is GPT?

GPT (Generative Pre-trained Transformer) is a type of neural network architecture used for natural language processing (NLP) tasks, such as language modelling, text classification, and machine translation. It was first introduced by OpenAI in 2018 and has been used in a wide variety of applications.

The GPT architecture is based on the Transformer model, introduced in 2017 by Vaswani et al., which captures long-term dependencies in text without the need for recurrent neural networks and generates high-quality language output. GPT models are pre-trained on large amounts of text data using an unsupervised learning approach, in which the model is trained to predict the next word in a sequence of text. The pre-training allows the model to learn the statistical patterns and contextual relationships within natural language, after which it can be fine-tuned on specific NLP tasks, such as language translation, text classification, and text generation, with smaller amounts of labelled data.

GPT has been a major breakthrough in natural language processing; GPT-3 alone has 175 billion parameters. It has been used to achieve state-of-the-art results on a wide variety of language tasks and has been adopted across industries for language-related applications such as chatbots, text summarization, and language translation. Its ability to pre-train on large amounts of data and adapt to new tasks with few examples makes it a highly versatile and valuable tool for language-related applications.


GPT Models

The first version of GPT was introduced based on the ideas of transformers and unsupervised pre-training. Its results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well. In summary, GPT, as the name suggests, adopts generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative supervised fine-tuning on each specific task.

Generative pre-training

The term generative pre-training refers to the unsupervised pre-training of the generative model. The authors used a multi-layer Transformer decoder to produce an output distribution over target tokens. Given an unsupervised corpus of tokens $\mathcal{U} = (u_1,\dots,u_n)$, they use a standard language-modelling objective to maximize the following likelihood:

\[L_1(\mathcal{U})=\sum_i\log P(u_i\mid u_{i-k},\dots,u_{i-1};\Theta)\]

where $k$ is the size of the context window, and the conditional probability $P$ is modelled using a neural network with parameters $\Theta$ trained using stochastic gradient descent. Intuitively, we train the Transformer-based model to predict the next token given the preceding $k$ tokens of unlabeled text, from which we also extract the latent features $h$.
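
The following is a minimal sketch of this objective, written as a negative log-likelihood to be minimized (equivalent to maximizing $L_1$). It assumes `model` is some autoregressive Transformer decoder that maps a window of token ids to per-position next-token logits; the function name and tensor shapes are illustrative, and a real implementation would predict all positions in parallel under causal masking rather than looping.

```python
import torch
import torch.nn.functional as F

def lm_loss(model, tokens, k):
    """Average negative log-likelihood of each token given the previous k tokens."""
    losses = []
    for i in range(k, tokens.size(0)):
        context = tokens[i - k:i].unsqueeze(0)   # (1, k) window u_{i-k}, ..., u_{i-1}
        logits = model(context)[:, -1, :]        # (1, vocab) logits for the next token
        target = tokens[i].unsqueeze(0)          # the true u_i
        losses.append(F.cross_entropy(logits, target))
    return torch.stack(losses).mean()
```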

Supervised fine-tuning

After training the model with the objective above, the parameters are adapted to the supervised target task, a step referred to as supervised fine-tuning. Assume a labelled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens $x^1,\dots, x^m$ along with a label $y$. The inputs are passed through the pre-trained model to obtain the final transformer block’s activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:

\[P(y\mid x^1,\dots,x^m)=\mathrm{softmax}(h_l^m W_y).\]

This gives us the following objective to maximize:

\[L_2(\mathcal{C})=\sum_{(x,y)}\log P(y\mid x^1,\dots,x^m)\]

They additionally found that including language modelling as an auxiliary objective to the fine-tuning helped learning by (a) improving the generalization of the supervised model, and (b) accelerating convergence. Specifically, we optimize the following objective (with weight $\lambda$): $L_3(\mathcal{C})=L_2(\mathcal{C})+\lambda\cdot L_1(\mathcal{C})$.
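
Below is a minimal sketch of this fine-tuning setup under the assumptions above: a pre-trained `transformer` returning the final block’s activations, an added linear layer for $W_y$, and a combined loss corresponding to $L_3$. The class and function names, and the default $\lambda = 0.5$, are illustrative choices rather than the original implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class GPTClassifier(nn.Module):
    """Pre-trained transformer followed by the added linear output layer W_y."""
    def __init__(self, transformer, hidden_dim, num_labels):
        super().__init__()
        self.transformer = transformer                 # returns (batch, seq, hidden) activations
        self.W_y = nn.Linear(hidden_dim, num_labels, bias=False)

    def forward(self, x):
        h = self.transformer(x)                        # final transformer block activations
        h_last = h[:, -1, :]                           # h_l^m: activation at the last input token
        return self.W_y(h_last)                        # logits; softmax is applied inside the loss

def fine_tune_loss(logits, labels, lm_loss_value, lam=0.5):
    """Negative of L3(C) = L2(C) + lambda * L1(C) for one batch."""
    l2 = F.cross_entropy(logits, labels)               # classification term (corresponds to -L2)
    return l2 + lam * lm_loss_value                    # auxiliary language-modelling term
```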

Some tasks, like question answering or textual entailment, have structured inputs, such as ordered sentence pairs or triplets of document, question, and answer, that differ from the contiguous text sequences the model sees during pre-training, so they require some modifications to apply GPT. The authors therefore use input transformations, which avoid making extensive changes to the architecture across tasks. A brief description of these input transformations is shown in Figure 1 (credit to the paper).
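
As a rough illustration of the idea, structured inputs can be serialised into a single contiguous token sequence with special start, delimiter, and extract tokens, so the pre-trained model itself needs no architectural changes. The token strings and function names below are illustrative placeholders, not the exact tokens used in the paper.

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def entailment_input(premise_tokens, hypothesis_tokens):
    """Serialise a (premise, hypothesis) pair into one contiguous token sequence."""
    return [START, *premise_tokens, DELIM, *hypothesis_tokens, EXTRACT]

def multiple_choice_inputs(context_tokens, candidate_answers):
    """One sequence per candidate answer; the resulting scores are compared via a softmax."""
    return [[START, *context_tokens, DELIM, *answer, EXTRACT]
            for answer in candidate_answers]
```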

In GPT-3, tasks and few-shot demonstrations are specified purely via text interaction with the model. Fine-tuning, by contrast, updates the weights of a pre-trained model by training on a supervised dataset specific to the desired task and typically requires thousands to hundreds of thousands of labelled examples. Its main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution, and the potential to exploit spurious features of the training data, potentially resulting in an unfair comparison with human performance.

Few-shot learning

Few-shot refers to the setting in which the model is given a few demonstrations of the task at inference time as conditioning, but no weight updates are allowed. Few-shot learning involves learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task. The primary goal in traditional few-shot frameworks is to learn a similarity function that can map the similarities between the classes in the support and query sets.

Figure 2.1 illustrates the different settings. In a typical dataset, an example has a context and a desired completion (for example, an English sentence and its French translation); few-shot works by giving $K$ examples of context and completion, followed by one final example of context, with the model expected to provide the completion.
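
A minimal sketch of how such a $K$-shot prompt might be assembled as plain text is given below. The demonstration pairs and the arrow separator are illustrative assumptions for the example, not a prescribed format.

```python
def build_few_shot_prompt(examples, query, task="Translate English to French"):
    """Assemble K context/completion demonstrations followed by one final context."""
    lines = [task + ":"]
    for context, completion in examples:   # K demonstrations of context => completion
        lines.append(f"{context} => {completion}")
    lines.append(f"{query} =>")            # final context; the model supplies the completion
    return "\n".join(lines)

print(build_few_shot_prompt(
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],  # K = 2 demonstrations
    "peppermint",
))
```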

The main advantages of few-shot are a major reduction in the need for task-specific data and a reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task-specific data is still required.


Improvements of GPTs

GPT-1 employs unsupervised learning to train representations of words on large amounts of unlabeled text, and then integrates supervised fine-tuning to improve performance on a wide range of NLP tasks. However, it has drawbacks, including (1) high compute requirements (an expensive pre-training step), (2) the limits and bias of learning about the world through text, and (3) still-brittle generalization.

GPT-2 is a larger model with 1.5 billion parameters that follows the design of GPT-1 (117 million parameters) with a few modifications: pre-normalization, modified initialization, a vocabulary expanded to 50,257 tokens, a context size increased to 1024 tokens, and a larger batch size of 512. This larger size allows it to capture more complex language patterns and relationships. In short, GPT-2 is a direct scale-up of GPT-1, with more than $10\times$ the parameters and trained on more than $10\times$ the amount of data.

GPT-3 uses a variety of techniques to improve performance, including:

  1. Larger size: its 175 billion parameters allow it to capture even more complex language patterns and relationships.
  2. Adaptive computation: dynamically adjusts the number of parameters used for each task, allocating more resources to complex tasks and fewer to simpler ones.
  3. Few-shot learning: learns to perform a new task with just a few examples, making it highly flexible and adaptable to new tasks and contexts.
  4. Prompt engineering: can be given a natural language prompt to generate text that fits a specific context or follows a specific style (a minimal example is sketched below).

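As an illustration of prompt engineering, the prompt below specifies the task, output length, and tone purely in natural language, with no demonstrations and no weight updates; the wording is an invented example, not taken from OpenAI's documentation.

```python
# An illustrative zero-shot, style-constrained prompt.
prompt = (
    "Summarise the following paragraph in one sentence, in a formal tone:\n\n"
    "GPT models are pre-trained on large amounts of text and then adapted to "
    "downstream tasks with little or no labelled data."
)
print(prompt)
```
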
Limitations of Generative AI

Reference: The University of Edinburgh