11 minutes to read - Apr 19, 2023

What Is a Large Language Model, the Tech Behind ChatGPT?

This article will explain what these LLMs are and how they are developed. Understanding LLMs is key to understanding how ChatGPT works.
Table of Contents
1. A Large Language Model Is a Type of Neural Network
   - An LLM Uses a Transformer Architecture
2. An LLM Builds Itself
   - An LLM Predicts Which Word Should Follow the Previous
   - LLMs Produce Text that Sounds Right but Cannot Guarantee That it Is Right
3. Is GPT-4 an LLM?

The release of ChatGPT by OpenAI in December 2022 has drawn an incredible amount of attention. This curiosity extends from artificial intelligence in general to the class of technologies that underpins the AI chatbot in particular. These models, called large language models (LLMs), are capable of generating text on a seemingly endless range of topics. Understanding LLMs is key to understanding how ChatGPT works.

What makes LLMs impressive is their ability to generate human-like text in almost any language (including coding languages). These models are a true innovation — nothing like them has existed in the past.

This article will explain what these models are, how they are developed, and how they work. That is, to the extent that we understand how they work. As it turns out, our understanding of why they work is, somewhat spookily, only partial.

A Large Language Model Is a Type of Neural Network

A neural network is a type of machine learning model based on a number of small mathematical functions called neurons. Like the neurons in a human brain, they are the lowest level of computation.

Each neuron is a simple mathematical function that calculates an output based on some input. The power of the neural network, however, comes from the connections between the neurons.

Each neuron is connected to some of its peers, and the strength of each connection is quantified through a numerical weight. These weights determine the degree to which the output of one neuron will be taken into account as an input to a following neuron.

A neural network could be very small. For example, a basic one could have six neurons with a total of eight connections between them. However, a neural network could also be very large, as is the case for LLMs. These may have millions of neurons with many hundreds of billions of connections between them, each connection having its own weight.
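To make this concrete, here is a minimal sketch in Python of the small network described above: six neurons (two inputs, two hidden, two outputs) and eight weighted connections. This is purely an illustration of the structure, not code from any real LLM.

```python
import numpy as np

# A toy network matching the "very small" example above: 2 input neurons,
# 2 hidden neurons, and 2 output neurons (6 total), with 8 connections.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(2, 2))  # 4 connections: inputs -> hidden layer
W_output = rng.normal(size=(2, 2))  # 4 connections: hidden -> output layer

def layer(x, W):
    # Each neuron computes a simple function (here, tanh) of a weighted
    # sum of its inputs; the weights set how much each input counts.
    return np.tanh(x @ W)

x = np.array([0.5, -1.0])           # some input data
print(layer(layer(x, W_hidden), W_output))
```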

An LLM Uses a Transformer Architecture

We already know that an LLM is a type of neural network. More specifically, LLMs use a particular neural network architecture called a transformer, which is designed to process and generate data in sequence, like text.

An architecture in this context describes how the neurons are connected to one another. All neural networks group their neurons into a number of different layers. If there are many layers, the network is described as being “deep,” which is where the term “deep learning” comes from.

In a very simple neural network architecture, each neuron may be connected to every neuron in the layer above it. In others, a neuron may only be connected to some other neurons that are near it in a grid.

The latter is the case in what are called Convolutional Neural Networks (CNN). CNNs have formed the foundation of modern image recognition over the past decade. The fact that the CNN is structured in a grid (like the pixels in an image) is no coincidence — in fact, it is an important reason for why that architecture works well for image data. 

A transformer, however, is somewhat different. Developed in 2017 by researchers at Google, a transformer introduces the idea of “attention,” whereby certain neurons are more strongly connected to (or “pay more attention to”) other neurons in a sequence.

Text is read in and read out in a sequence, one word after the other, with different parts of a sentence referring to or modifying others (such as an adjective that modifies the noun but not the verb). It is therefore no coincidence that an architecture built to work on sequences, with different strengths of connection between different parts of the sequence, works well on text-based data.
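As a rough illustration, here is the core attention calculation from the 2017 transformer paper, sketched in plain Python on a toy four-token sequence. The weight matrices here are random placeholders; in a trained model they are learned, and real transformers stack many such operations.

```python
import numpy as np

# Scaled dot-product attention on a toy "sentence" of 4 tokens, each
# represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))

Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

scores = Q @ K.T / np.sqrt(8)              # how strongly each token relates to each other
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
output = attn @ V                          # each token becomes a weighted mix of the others

print(attn.round(2))  # row i shows how much token i "pays attention to" every token
```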

An LLM Builds Itself

In its simplest terms, a model is a computer program. It is a set of instructions that performs various calculations on input data and provides an output.

What is particular about a machine learning or AI model, however, is that rather than writing those instructions explicitly, the human programmers instead write a set of instructions (an algorithm) that then reviews large volumes of existing data to define the model itself. As such, the human programmers do not build the model, they build the algorithm that builds the model.

In the case of an LLM, this means that the programmers define the architecture for the model and the rules by which it will be built. But they do not create the neurons or the weights between the neurons. That is done in a process called “training” during which the model, following the instructions of the algorithm, defines those variables itself. 
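As a sketch of that division of labor, here is how a programmer might define a tiny architecture in PyTorch. The layer structure is written by hand; the weights inside it start as random numbers and only take on meaningful values during training. (This is an illustration of the idea, not the code behind any actual LLM.)

```python
import torch.nn as nn

# The programmer writes the architecture: which layers exist, how they connect.
model = nn.Sequential(
    nn.Linear(8, 16),   # 8 inputs feeding 16 neurons
    nn.ReLU(),
    nn.Linear(16, 1),   # 16 neurons feeding 1 output
)

# The weights themselves are initialized randomly; no human writes them.
n_weights = sum(p.numel() for p in model.parameters())
print(f"{n_weights} weights, all set by training rather than by a programmer")
```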

In the case of an LLM, the data that is reviewed is text. Depending on the model, that text may be more specialized or more generic. In the largest models, the objective is to provide the model with as much grammatical text as possible to learn from.

Over the process of training, which may consume many millions or billions of dollars worth of cloud computing resources, the model reviews this text and attempts to produce text of its own.

Initially, the output is gibberish, but through a massive process of trial and error — and by continually comparing its output to its input — the quality of the output gradually improves. The text becomes more intelligible.
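A heavily simplified version of that trial-and-error loop might look like the following PyTorch sketch, where a toy model repeatedly guesses the next token and adjusts its weights based on how wrong each guess was. Real LLM training does this over trillions of tokens; the random "text" here is just a stand-in.

```python
import torch
import torch.nn as nn

# Toy setup: a vocabulary of 100 "tokens" and a model that predicts
# token i+1 from token i.
vocab_size = 100
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

data = torch.randint(0, vocab_size, (1000,))  # stand-in for real training text
inputs, targets = data[:-1], data[1:]         # each token's target is simply the next one

for step in range(100):
    logits = model(inputs)             # the model's guess for every next token
    loss = loss_fn(logits, targets)    # compare the output to the actual text
    optimizer.zero_grad()
    loss.backward()                    # trial and error: compute how to adjust...
    optimizer.step()                   # ...each weight to make the guesses less wrong
```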

Given enough time, enough computing resources, and enough training data, the model “learns” to produce text that, to the human reader, is indistinguishable from text written by a human. In some cases, human readers may provide feedback in a sort of reward model, telling it when its text reads well, or when it doesn’t (this is called “reinforcement learning from human feedback,” or RLHF). The model takes this into account and continuously improves itself, based on that feedback. 

An LLM Predicts Which Word Should Follow the Previous

A reductive description of LLMs that has emerged is that they “simply predict the next word in a sequence.” This is true, but it ignores the fact that this simple process, at the scale of tools like ChatGPT, can generate remarkably high-quality text. It is just as easy to say that “the model is simply doing math,” which is also true, but not very useful in helping us to understand how the model works or in appreciating its power.

The result of the training process described above is a neural network with hundreds of billions of connections between the millions of neurons, each defined by the model itself. The largest models represent a large volume of data, perhaps several hundred gigabytes just to store all of the weights.

Each of the weights and each of the neurons is a mathematical formula that must be calculated for each word (or, in some cases, a part of a word) that is provided to the model for its input, and for each word (or part of a word) that it generates as its output.

It’s a technical detail, but these words and parts of words are called “tokens,” and usage of these models is often priced per token when they are provided as a service.

The user interacting with one of these models provides an input in the form of text. For example, we can provide the following prompt to ChatGPT:

Hello ChatGPT, please provide me with a 100-word description of Dataiku. Include a description of its software and its core value proposition.

The models behind ChatGPT would then break that prompt into tokens. On average, a token is about ⅘ of a word, so the above prompt and its 23 words might result in about 30 tokens. The GPT-3 model on which gpt-3.5-turbo is based has 175 billion weights, meaning that the 30 tokens of input text would result in 30 x 175 billion = 5.25 trillion calculations. The GPT-4 model, which is also available in ChatGPT, has an unknown number of weights.
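The token arithmetic above can be checked with OpenAI's open-source tiktoken library. Exact counts vary by tokenizer, so treat the numbers as estimates:

```python
import tiktoken  # OpenAI's open-source tokenizer

prompt = (
    "Hello ChatGPT, please provide me with a 100-word description of Dataiku. "
    "Include a description of its software and its core value proposition."
)

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = enc.encode(prompt)
print(len(tokens), "tokens")       # roughly 30, per the estimate above

weights = 175_000_000_000          # GPT-3's reported parameter count
print(f"~{len(tokens) * weights:.2e} calculations for the input")  # ~5.25e12
```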

Then, the model would set about generating a response that sounds right based on the immense volume of text that it consumed during its training. Importantly, it is not looking up anything about the query. It does not have any memory wherein it can search for “dataiku,” “value proposition,” “software,” or any other relevant terms. Instead, as it generates each token of output text, it performs those 175 billion calculations again, producing the token that has the highest probability of sounding right.
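Schematically, the generation loop looks like the sketch below. The model_forward function is a placeholder for the 175 billion calculations a real model performs; the point is that each output token requires a full pass through the network, and the most plausible-sounding token is kept each time.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Dataiku", "is", "a", "platform", "for", "data", "science", "."]

def model_forward(context):
    # Placeholder: a real LLM runs all of its weights over the context
    # to produce a probability for every token in its vocabulary.
    scores = rng.normal(size=len(vocab))
    return np.exp(scores) / np.exp(scores).sum()

context = ["Dataiku"]
for _ in range(5):
    probs = model_forward(context)                 # one full pass per output token
    context.append(vocab[int(np.argmax(probs))])   # keep what "sounds right" most
print(" ".join(context))
```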

LLMs Produce Text that Sounds Right but Cannot Guarantee That it Is Right

ChatGPT can provide no guarantee that its output is right, only that it sounds right. Its responses are not looked up in its memory — they are generated on the fly based on those 175 billion weights described earlier.

This is not a shortcoming specific to ChatGPT but of the current state of all LLMs. Their skill is not in recalling facts — the simplest databases do that perfectly well. Their strength is, instead, in generating text that reads like human-written text and that, well, sounds right. In many cases, the text that sounds right will also actually be right, but not always. 

In the future, it is likely that LLMs will be integrated in systems that combine the power of the LLM’s text generation with a computational engine or knowledge base to provide factual answers in compelling natural language text. Those systems do not exist today, but it would be easy to overestimate how long it will take until we see them.

Another possibility: if you want to provide users with information that you already possess, but in a natural language answer format, you can supply that information to tools like ChatGPT and have them build responses based on it. Dataiku has developed a demo that does just that, using GPT-3 to provide answers from Dataiku’s documentation.
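In its simplest form, that pattern just means placing the information you already have into the prompt. Here is a hypothetical sketch of the idea (call_llm is a placeholder for whatever LLM API you use; this is not Dataiku's actual demo code):

```python
# Information you already possess, e.g., a snippet from your documentation.
documentation = (
    "Dataiku is a collaborative platform for building, deploying, and "
    "monitoring data and AI projects."
)
question = "What is Dataiku?"

prompt = (
    "Answer the question using ONLY the information provided below.\n\n"
    f"Information: {documentation}\n\n"
    f"Question: {question}"
)

# answer = call_llm(prompt)  # placeholder for a ChatGPT/GPT-3 API call
print(prompt)
```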

Is GPT-4 an LLM?

On March 14, 2023, OpenAI released GPT-4, the latest version of its models in the GPT family. In addition to generating higher-quality text compared to GPT-3.5, GPT-4 introduces the ability to recognize images. It may be able to generate images as well, but that functionality, if it exists, is not yet available. The ability to handle input and output data of different types (text, images, video, audio, etc.) means that GPT-4 is multimodal.

The terminology for these latest models is evolving rapidly, with some debate in the expert community arguing that “language model” is too limiting. The term “foundation model” has been popularized by researchers at Stanford, but is also the source of some debate. Like the technology itself, the language used to describe the technology will continue to evolve rapidly.
