Train LLMs on your Laptop Using Low Rank Approximation
Recent advances in AI have made it possible to run Large Language Models (LLMs) on small machines with limited resources. Only a few months ago, these models were the preserve of large companies with hundreds of GPUs, hundreds of millions of dollars, and months to spend on training. This new era of AI accessibility is enabling individuals to run and train LLMs on their laptops, Pixel phones, and even Raspberry Pis. In this post, we'll discuss the developments that led to this change and explore the techniques that make LLMs accessible to everyone - in particular, something called Low Rank Adaptation.
The Turning Point: Llama.cpp and the Leaked LLaMA Weights
On March 10th, Bulgarian developer Georgi Gerganov released Llama.cpp, a C/C++ implementation of LLaMA inference that uses quantization to run the model on an M1 MacBook. Llama.cpp is based on Meta's LLaMA model, a GPT-3-class model intended for academic use. Although Meta open-sourced the code, the model weights were kept private, accessible only to qualified researchers. However, the LLaMA weights were leaked on BitTorrent on March 2nd, leading to a surge of interest in running LLMs on smaller devices.
Georgi's work involved taking the 7B LLaMA model and applying 4-bit quantization to reduce its size from 13GB to 4GB. His code, available under an MIT license, has enabled researchers and enthusiasts alike to run LLMs on various devices, including MacBooks, Pixel phones, and Raspberry Pis.
One key metric for evaluating LLM inference is tokens per second. Tokens are the smallest units of text a model processes; a token is a short string of characters that corresponds to roughly three-quarters of an English word. Throughput is limited by the model size and the hardware available for inference: for example, the quantized model running on a Pixel 6 generates about 5 tokens per second.
Training LLMs on Your Laptop
Beyond running LLMs on smaller devices, recent work has made it possible to fine-tune them with far fewer GPUs - even on a single laptop. Stanford's Alpaca fine-tuned LLaMA on a 52k instruction-following dataset using Hugging Face's training framework, mixed-precision training, and Fully Sharded Data Parallel (FSDP), taking about 3 hours on 8 80GB A100 GPUs and costing less than $100. The Alpaca-LoRA repository then showed that similar results can be reproduced on a single consumer GPU using Low Rank Adaptation.
A variety of techniques have contributed to the development of smaller models, including:
- Low Rank Adaptation (LoRA)
- Mu-Parameterization
- Zero-shot Hyper-parameter Transfer
- Parameter Efficient Fine Tuning (PEFT), including Llama-Adapter, Prefix Tuning, and the insertion of adapter layers in pre-trained LLMs.
These techniques have not only made LLMs more accessible but also expanded their capabilities. For instance, the LLaMA-Adapter model can perform image-conditioned question answering because it is also trained with image inputs.
In the rest of this post, we will focus on Low Rank Adaptation (LoRA) and the low-rank matrix approximation that underlies it.
Matrix Rank and Compression
Matrix rank is a measure of the linear independence of a matrix's columns or rows: it is the size of the largest linearly independent set of columns (equivalently, rows) of a matrix A. If the rank is k, then all n columns of the mxn matrix A can be written as linear combinations of just k of them; similarly, all m rows can be written as linear combinations of only k rows.
$$\mathbf{A}=\mathbf{Y Z}^{\top}$$
Consider a matrix A that can be written as the product of Y and $\mathbf{Z}^{\top}$, where Y and Z have dimensions mxk and nxk, respectively. If k is small compared to m and n, we can compress the information contained in mxn numbers into k(m+n) numbers. For example, a 1000x1000 pixel image (1 MB when stored as unsigned chars) could be represented with 5*(1000+1000) numbers for k=5 (about 10 KB), a 100:1 compression ratio. We can compute such a factorization using Singular Value Decomposition.
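To make the counting concrete, here is a minimal numpy sketch (the matrix is synthetic and the sizes are the ones from the example above):

```python
import numpy as np

m, n, k = 1000, 1000, 5
rng = np.random.default_rng(0)

# Build a rank-k matrix A as the product of two thin factors Y (m x k) and Z (n x k).
Y = rng.standard_normal((m, k))
Z = rng.standard_normal((n, k))
A = Y @ Z.T                          # m x n, but its rank is only k

print(np.linalg.matrix_rank(A))      # 5
print(A.size)                        # 1,000,000 numbers to store A directly
print(Y.size + Z.size)               # 10,000 numbers to store the two factors
```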
Singular Value Decomposition (SVD) for Low-Rank Matrix Factorization
SVD is a powerful technique that decomposes an mxn matrix A into the product of three matrices, $\mathbf{A}=\mathbf{U S V}^{\top}$, where:
- U: mxm orthogonal matrix
- V: nxn orthogonal matrix
- S: mxn diagonal matrix with diagonal entries sorted from high to low
To obtain a k-rank approximation of A, follow these steps:
- Keep only the top k right singular vectors (the first k rows of $\mathbf{V}^{\top}$) to get $\mathbf{V}_{k}^{\top}$, resulting in a kxn matrix.
- Keep only the top k left singular vectors (the first k columns of U) to get $\mathbf{U}_{k}$, resulting in an mxk matrix.
- Retain the top k singular values in the diagonal matrix, forming a kxk matrix $\mathbf{S}_{k}$.
Finally, multiply these matrices to obtain the k-rank approximation of A:
$$\mathbf{A}_{k}=\mathbf{U}_{k} \mathbf{S}_{k} \mathbf{V}_{k}^{\top}$$
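In numpy these steps take only a few lines; a minimal sketch, assuming `A` is any 2-D array:

```python
import numpy as np

def rank_k_approximation(A: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of A via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]            # top k left singular vectors   (m x k)
    S_k = np.diag(s[:k])      # top k singular values         (k x k)
    Vt_k = Vt[:k, :]          # top k right singular vectors  (k x n)
    return U_k @ S_k @ Vt_k   # A_k = U_k S_k V_k^T
```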
To test this out, I took an image of a llama off the web and computed SVD on it. I then took the k largest components to construct the $\mathbf{U}_{k}$, $\mathbf{V}_{k}^{\top}$ and $\mathbf{S}_{k}$ matrices - and multiplied these to get the reconstructed image $\mathbf{A}_{k}$. Above, you see the original and reconstructed images with k = 5 and 50.
To get an idea of the compression ratio, I saved the original raw image and the truncated components $\mathbf{U}_{k}$, $\mathbf{S}_{k}$ and $\mathbf{V}_{k}^{\top}$ as numpy arrays (from which the reconstructed image can be recovered). With k=5 the compression ratio was about 50:1, and with k=50 about 5:1. So at a compression ratio of 5 we get an image that is almost indistinguishable from the original, while at a compression ratio of 50 we can barely make out the animal.
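A rough sketch of that experiment, assuming a grayscale image and using Pillow for I/O (the file name is a placeholder, and the exact ratios depend on the image size and how the arrays are stored):

```python
import numpy as np
from PIL import Image

# Load the image as a 2-D grayscale array ("llama.png" is a placeholder).
A = np.asarray(Image.open("llama.png").convert("L"), dtype=np.float64)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in (5, 50):
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    kept = U[:, :k].size + k + Vt[:k, :].size        # numbers kept for rank k
    print(f"k={k}: keep {kept} of {A.size} numbers")
    Image.fromarray(np.clip(A_k, 0, 255).astype(np.uint8)).save(f"llama_k{k}.png")
```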
This same technique can be used to simplify the training of LLMs, as shown by Hu et al. in their recent paper: LoRA: Low-Rank Adaptation of Large Language Models.
Transformers and other neural networks contain numerous dense layers that perform matrix multiplication, and these account for the massive parameter counts of models like GPT-3. Low-rank parameterized updates of these weight matrices offer a promising way to adapt such models efficiently.
Dense Neural Network Layers
A standard dense neural network layer applies a linear transformation followed by an activation function. The linear transformation can be written as y = Ax + b, where A is the weight matrix, x is the input vector, b is the bias vector, and y is the output vector before the activation is applied. This operation is repeated across many stacked layers within a network like a Transformer (the building block of an LLM), so fine-tuning these weight matrices is a crucial step in adapting the network to new tasks. The entries of these weight matrices (and bias vectors) are the parameters of the network: when OpenAI says that GPT-3 has 175 billion parameters, it is referring to the number of weights in such dense layers.
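As a concrete illustration, here is what one such layer looks like in PyTorch (a sketch; the layer sizes are made up):

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=1024, out_features=4096)   # computes y = A x + b
x = torch.randn(1024)
y = torch.relu(layer(x))                                  # activation applied to the linear output

# The parameters are the entries of the weight matrix A and the bias vector b.
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)                                           # 1024 * 4096 + 4096 = 4,198,400
```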
Challenges with Training Large Networks
The sheer number of parameters in large models like GPT-3 (175 billion) creates significant challenges for memory and compute. For example, training GPT-3 175B requires around 1.2 TB of GPU memory (VRAM), and each model checkpoint stored during training is around 350 GB. Updating these large weight matrices is both memory- and compute-intensive.
Low Rank Approximation with LoRA
LoRA addresses these challenges by factorizing the large weight matrices into lower rank matrices. Given a pre-trained weight matrix $W_{0} \in \mathbb{R}^{d \times k}$, its update is constrained by a low-rank decomposition: $W_{0}+\Delta W=W_{0}+B A$, where $B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min (d, k)$. During training, the low-rank constituents are updated using back-propagation, while $W_0$ remains frozen.
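Here is a minimal PyTorch sketch of the idea (not the authors' reference implementation; the initialization and the scaling factor used in the paper are simplified):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update B A."""

    def __init__(self, pretrained: nn.Linear, r: int = 4):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():
            p.requires_grad = False                       # W_0 (and its bias) stay frozen

        d, k = pretrained.out_features, pretrained.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # r x k
        self.B = nn.Parameter(torch.zeros(d, r))          # d x r, zero init so Delta W starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (W_0 + B A) x = W_0 x + B (A x); only A and B receive gradients.
        return self.pretrained(x) + (x @ self.A.T) @ self.B.T
```

During fine-tuning, only A and B - r(d+k) numbers instead of d*k - accumulate gradients and optimizer state, which is where the savings below come from.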
Benefits of LoRA
- Reduced VRAM usage: With a low rank (e.g., r=4), VRAM consumption during training can be reduced by up to 2/3, from 1.2 TB to 350 GB (for GPT-3 175B), as only the low-rank factorized matrices need to be stored and updated.
- Smaller model checkpoint size: Checkpoint sizes can be reduced by 10,000x, from 350 GB to 35 MB, enabling training on significantly fewer GPUs and avoiding I/O bottlenecks.
- Task-specific low-rank decompositions: Different low-rank decompositions can be created for different tasks, allowing smaller, customized models to be swapped in and out as required. This enables on-the-fly model switching on machines that store pre-trained weights in VRAM.
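Because only B and A are trainable, a task-specific checkpoint is just those two small matrices, and switching tasks means loading a different pair into the same frozen base. Here is a sketch building on the `LoRALinear` class above (the layer size matches GPT-3's hidden dimension for illustration, and the file names are placeholders):

```python
# Save only the trainable LoRA parameters for one task - tiny compared with W_0.
lora = LoRALinear(nn.Linear(12288, 12288), r=4)
torch.save({"A": lora.A, "B": lora.B}, "summarization_lora.pt")

# Later, swap in another task's adapter (assumed to have been saved the same way)
# without touching the frozen pre-trained weights.
adapter = torch.load("translation_lora.pt")
with torch.no_grad():
    lora.A.copy_(adapter["A"])
    lora.B.copy_(adapter["B"])
```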
Such low-rank parameterized update matrices offer a powerful way to make the training and adaptation of large neural networks more efficient. The LLaMA/Alpaca models discussed earlier offer performance comparable to GPT-3 while being much smaller, and with the low-rank adaptation described here, fine-tuning them becomes cheap enough for a tinkerer sitting at home to do on her laptop with a single GPU. That is exactly what many early adopters and researchers around the world are doing right now - people who don't have the resources of OpenAI or Microsoft, but who want to adapt such models to their own tasks.