A Brief Tour of LLMs

Introduction

When we think of computers, we sort of know how they work. We know that there is the hardware, the operating system, and then the programs, which are basically lines of code. If you’re an avid programmer, you might even know how memory works, what header files and pre-compiled libraries are, how to write a program, and so on. The point is, you know how and why the output of these systems is the way it is.

AI models, on the other hand, are different. When you input something to an AI model, you don’t exactly know why or how the model produced the output it did. This can pose reliability and trust issues whenever an AI model is used. Today, with the rise of Large Language Models (LLMs) that can pass the Turing Test, the question of how these models achieve near-human levels of performance is more relevant than ever. In this blog post, I will attempt to dive deep into this question by exploring how AI learns. Then, I will discuss whether AI can surpass humans or even gain consciousness in the near future.

I will try to make this article approachable for people with no prior experience in AI, so it won’t go too deep into the technical details. As long as you have some high school level math under your belt, it should (hopefully) be fairly easy to understand everything.

Training an AI Model

In order to understand whether machines can become sentient beings, we first have to know how AI models are created and how they learn on their own. Let’s begin by defining and distinguishing between some terms: artificial intelligence, machine learning, and deep learning. Note that the boundaries between these fields are somewhat vague; however, it is still useful to know the difference between them. In most contexts, the terms describe the following:

  • Artificial Intelligence describes machines and computer programs that appear intelligent. It covers any computer program that can solve tasks that would normally require a human. It is a very general term that encompasses a large field.
  • Machine Learning describes machines that can “self-learn” from data using relatively simple algorithms, such as regression. These models are generally only able to find patterns in simple, structured data. Machine learning is a subfield of artificial intelligence.
  • Deep Learning is machine learning with more complex algorithms, such as multi-layered neural networks. These models can grasp patterns in complex, unstructured data, and they are the ones that produce the best results and achieve human-level performance. Deep learning is a subfield of machine learning.

In this part of the article, I will start with basic machine learning models built on statistics, then explore neural networks, and finally discuss large language models.

1. Linear Regression

One of the simplest ways to train a machine learning model is with a regression model. A regression model calculates the function of best fit to represent the relationship between input and output data.

For example, simple linear regression calculates the line of best fit. A fictional application could be determining a tumor’s rate of growth from its hormonal activity. We can first collect some real-world data on the correlation between a tumor’s hormonal activity and its rate of growth. Then, we can calculate the line of best fit for that data using statistics. When we encounter hormonal data that we haven’t seen before, we can estimate the rate at which the tumor grows based on our line of best fit. The more data we have, the more accurate the model gets at predicting the growth rate of the tumor. The resulting model would look something like the following:

Visual of linear regression
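If you are curious what this looks like in code, below is a minimal sketch of fitting a line of best fit with NumPy. The numbers, and the variable names hormonal_activity and growth_rate, are made up purely for illustration:

    # A minimal sketch of simple linear regression, using made-up numbers.
    # "hormonal_activity" and "growth_rate" are hypothetical measurements.
    import numpy as np

    hormonal_activity = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input (x)
    growth_rate       = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # output (y)

    # Least-squares fit of a degree-1 polynomial, i.e. the line of best fit.
    slope, intercept = np.polyfit(hormonal_activity, growth_rate, 1)

    # Predict the growth rate for a hormonal activity value we haven't seen.
    new_activity = 3.5
    predicted_growth = slope * new_activity + intercept
    print(predicted_growth)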

2. Logistic Regression

So far, linear regression has worked fine for our purpose of predicting one continuous value from another. In our example above, both the hormonal activity and the growth rate lie on a continuous spectrum of values. However, if we need to determine a binary value (yes or no) from a continuous input, linear regression fails. This can be illustrated with another fictional example: say we want to determine whether or not a tumor is cancerous (a binary output) based on its rate of growth (a continuous input). We will plot the rate of growth on the x-axis and the probability that the tumor is cancerous on the y-axis. The following image shows the prediction of a linear regression model:

Visual of linear regression (binary)

There are several issues with this approach. Notice that the “line of best fit” doesn’t really fit the data well: the predicted output values fall outside the range of 0 to 1. There is also a large section in the middle of the input range where the output values hover around 0.5. In other words, we aren’t really confident in the output.

To remedy these issues, we can turn to logistic regression, which uses more advanced statistical methods to squish the output into a logistic curve between 0 and 1. We will skip over the math here, but the end result would be similar to the following image:

Visual of logistic regression (binary)

As we can see, the regression curve fits the data much better. In addition, the predicted output values stay between 0 and 1, and there is a much narrower section in the middle where the model isn’t confident in its output. Clearly, the logistic regression model is much better at producing a binary output from a continuous input.

Regression models are the most rudimentary form of machine learning. They generally work well with simple inputs and outputs. We can use statistical and mathematical tools to generate a function that takes an input and spits out an output matching the training data as closely as possible (minimizing the error). In our logistic regression example above, the regression function takes in the growth rate of the tumor and outputs the probability that the tumor is cancerous.
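For the code-inclined, here is a minimal sketch of that logistic regression example using scikit-learn. The growth rates and labels are invented; the point is only to show the shape of the workflow:

    # A minimal sketch of logistic regression with scikit-learn, on made-up data.
    # growth_rate is the continuous input; is_cancerous is the binary label.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    growth_rate  = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
    is_cancerous = np.array([  0,     0,     0,     0,     1,     1,     1,     1  ])

    model = LogisticRegression()
    model.fit(growth_rate, is_cancerous)

    # The model outputs a probability between 0 and 1 (the logistic curve),
    # rather than an unbounded straight line.
    print(model.predict_proba([[2.2]]))   # [P(not cancerous), P(cancerous)]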

There are many regression models other than linear and logistic regression, each suited to a different kind of data. However, they all fall short when the relationship between the input and output variables is more complicated. For example, if our input is an image and our output is whether that image contains a cat, a dog, or a fish, regression models fall flat on their faces (how would we even start with a regression model?).

3. Neural Networks

This is where neural networks come into play. Neural networks use algorithms to approximate the function (i.e. the relationship between the input variable and the output variable) that statistics and math alone cannot find. Neural networks with multiple hidden layers (see below) are what we call deep learning. To illustrate the basic workings of neural networks, I’ll use the aforementioned cat-dog-fish image classification example.

Visual of a rat terrier

Each input image is represented as a grid of thousands of colored pixels, and each pixel stores 3 color values. At a resolution of 300x300 alone, that is 90,000 pixels. This means that storing a raw image on the computer requires 90,000 pixels x 3 color channels = 270,000 numerical values. These numerical values would be the inputs of the neural network.

Our output values would be cat, dog, and fish. Each output would have a numerical value between 0 and 1 associated with it: the probability that the input image contains that animal. What we need our neural network to accomplish is to “magically” turn the 270,000 input values into these 3 outputs. This sounds like a daunting task, but I promise that it is not too difficult with neural networks.

Neurons are the heart of neural networks (as the name suggests), and they are partially inspired by the neurons in our brains. However, the neuron as a machine learning concept can be difficult to pin down. To quote math YouTuber 3B1B: when I say neuron, all I want you to think about is a thing that holds a number between 0 and 1.

Our neural network would start with 270,000 neurons, each corresponding to the value of one color channel of our 300 by 300 image. The neural network would end with 3 neurons, representing the three output probabilities (cat, dog, fish). These would be our starting and ending (i.e. input and output) layers of the neural network. There is also an arbitrary number of layers between the input layer and the output layer (called the hidden layers), each containing an arbitrary number of neurons.

Visual of a basic neural network

Each connection between neurons is weighted, meaning each neuron in the first layer affects the neurons in the second layer differently depending on its importance. Then, each neuron in the second layer affects each neuron in the third, the third layer affects the fourth, and so on, until we reach the output layer. The basic premise of a neural network is that, given a large amount of data, we can use algorithms to adjust the weights between the neurons so that the model activates the correct output neuron for a given input. This process of adjusting the weights is known as training.
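To give a concrete feel for what “layers of weighted connections” means, here is a toy forward pass written with plain NumPy. The sizes are scaled way down (4 inputs instead of 270,000) and the weights are random, just as they would be before training:

    # A toy forward pass through a tiny neural network, written with NumPy.
    # Sizes are scaled down for readability: 4 inputs instead of 270,000.
    import numpy as np

    def sigmoid(x):
        # Squishes any number into the range 0 to 1 (a "neuron activation").
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)

    # Randomly initialized weights, as at the very start of training.
    W1 = rng.normal(size=(4, 5))   # input layer (4 neurons) -> hidden layer (5 neurons)
    W2 = rng.normal(size=(5, 3))   # hidden layer (5) -> output layer (3: cat, dog, fish)

    x = rng.random(4)              # a made-up "image" with 4 pixel values

    hidden = sigmoid(x @ W1)       # each hidden neuron holds a number between 0 and 1
    output = sigmoid(hidden @ W2)  # three numbers, one per animal
    print(output)

During training, the algorithm would repeatedly nudge the values inside W1 and W2 so that the three output numbers start to match the labels in the dataset.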

In our cat-dog-fish example, we would prepare a large dataset of labeled images of cats, dogs, and fish, and feed it into the neural network. The network would start out with random weights. For each image, we would compute the error between the prediction and the label, compute the derivative of the error function at that point, and then adjust the parameters of the model to reduce the error. We would adjust the weights again and again until we have minimized the error function. This is known as the gradient descent optimization algorithm:

Visual of the steps of training

We will, again, skip over the mathematical details here, because gradient descent and other optimization algorithms use multivariate differential calculus, which I am frankly unqualified to talk about. However, the end result would look like the following (the yellow surface is the error function, and the black dots represent the current error value):

Visual of gradient descent
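Even without the multivariate calculus, the core idea of gradient descent fits in a few lines. The sketch below minimizes a made-up one-parameter error function, (w - 3)^2, by repeatedly stepping downhill along its derivative:

    # A minimal sketch of gradient descent on a one-parameter error function.
    # error(w) = (w - 3)^2, whose minimum is at w = 3.
    def error(w):
        return (w - 3) ** 2

    def error_derivative(w):
        return 2 * (w - 3)

    w = 10.0             # start from an arbitrary guess
    learning_rate = 0.1

    for step in range(50):
        # Move a small step in the direction that decreases the error.
        w = w - learning_rate * error_derivative(w)

    print(w, error(w))   # w ends up very close to 3, the minimum

A real neural network does the same thing, except that instead of one parameter w, it adjusts millions of weights at once, using the derivatives of the error with respect to each of them.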

The product we get at the end of the training process is a neural network with a set of weights. We know what each of the weights in the hidden layers is, and we know that this set of weights produces the least error on our dataset. However, we don’t exactly know what purpose each weight serves.

In other words, we do not know why the algorithm settled on a particular set of weights, but we do know that it is the best set the training algorithm could find to minimize the error. In our case, the algorithm should adjust the weights of our model so that the dog neuron is activated when the input image contains a dog, the cat neuron when it contains a cat, and the fish neuron when it contains a fish.

Visual of neural network

There are many special types of neural networks out in the wild, each using different techniques to perform better on different types of data. For example, Convolutional Neural Networks (CNNs) work best for image tasks such as classification and object detection, while Recurrent Neural Networks (RNNs) are suited to sequential data such as text, powering natural language processing and machine translation. Many others exist for different purposes.

Recently, neural networks with the Transformer architecture, which also specializes in text and natural language processing, have widely overtaken older architectures such as RNNs, becoming the backbone of today’s LLMs.

4. Large Language Models

If you have made it this far into this essay-esque blog post, you have all the necessary components to learn how a large language model functions. At their core, LLMs are next-word predictors built on neural networks; something along the lines of a text autocompletion engine on steroids.

Visual of ChatGPT

As alluded to earlier, the neural network architecture that powers today’s LLMs is the transformer. The transformer first breaks the input text into pieces known as “tokens”. These tokens are then converted into vectors (embeddings) that were learned during pre-training. The architecture then applies an attention mechanism, where each token affects the other tokens in the sentence differently based on context. This makes the model much better at determining the meaning and connotations of words depending on their context. For example, the transformer model would know that the meaning of “minute” differs between the following sentences:

  • “The minute hand of the clock moved”
  • “They did not notice the minute change”
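To make the attention idea slightly more concrete, here is a heavily simplified sketch of scaled dot-product attention in NumPy. The token vectors are random stand-ins for learned embeddings, and a real transformer uses learned projections, many attention heads, and many layers on top of this:

    # A heavily simplified sketch of scaled dot-product attention in NumPy.
    # The 4 "token" vectors are random stand-ins for learned embeddings.
    import numpy as np

    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(4, 8))          # 4 tokens, each an 8-dimensional vector

    # In a real transformer, queries, keys, and values come from learned projections;
    # here we use the token vectors directly to keep the sketch short.
    Q, K, V = tokens, tokens, tokens

    scores = Q @ K.T / np.sqrt(K.shape[1])    # how strongly each token relates to the others
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax per row
    contextualized = weights @ V              # each token becomes a context-aware mixture

    print(weights.round(2))   # row i: how much token i "pays attention" to each token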

Then, using further layers and methods, the network generates a probability distribution over what the next word could be. This process is repeated over and over to “generate” new text; and when prompted correctly, these autocomplete models appear to be intelligent chatbots.
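Putting it all together, the generation loop itself is surprisingly simple. Below is a toy sketch of “autocomplete on steroids”: a fake model hands back a probability distribution over a tiny vocabulary, and we sample from it repeatedly to grow a sentence. The vocabulary and probabilities are invented; in a real LLM, the distribution would come from the transformer described above:

    # A toy sketch of the next-word generation loop. The "model" here is fake:
    # it returns made-up probabilities instead of running a real transformer.
    import numpy as np

    vocabulary = ["the", "cat", "sat", "on", "mat", "."]
    rng = np.random.default_rng(0)

    def fake_next_word_probabilities(words_so_far):
        # A real LLM would run the transformer here and return a probability
        # for every token in its vocabulary, conditioned on the context.
        p = rng.random(len(vocabulary))
        return p / p.sum()

    sentence = ["the"]
    for _ in range(6):
        probabilities = fake_next_word_probabilities(sentence)
        next_word = rng.choice(vocabulary, p=probabilities)  # sample the next word
        sentence.append(next_word)

    print(" ".join(sentence))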

I will not go deeper into the technical and mathematical details of large language models here. You can learn more about them in this excellent playlist by 3B1B on YouTube. For the rest of this article, knowing that these large language models are basically autocomplete on steroids will be enough.

Will AI Replace Us?

[Under construction]