Fine Tuning Llama 2 using PEFT techniques

Tulsipatro
4 min read · Feb 22, 2024


A number of open-source LLMs such as Llama 2, Mistral, and Falcon have been released recently.
As the full form of LLM, Large Language Model, suggests, these models carry a very large number of weights and parameters.
The open-source LLMs that are available are already pre-trained models.
But there is also an option to fine-tune these models, and this fine-tuning is possible with the help of PEFT.


PEFT: Parameter-Efficient Fine-Tuning

PEFT stands for Parameter-Efficient Fine-Tuning, a technique for fine-tuning LLMs with heavy parameter counts and huge sizes, even models with up to 70 billion parameters.

This article walks the reader through a practical implementation of PEFT.

Steps Involved:

  1. Install the Packages
  2. Import the Libraries
  3. Load the Model
  4. Load the dataset and start fine-tuning

1. Install the Packages

The first step is installing the required packages.
The "peft" package provides both the PEFT configuration and the LoRA implementation, while the "bitsandbytes" package is used for quantisation.
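A minimal install cell for this workflow might look like the following (the package list is inferred from the steps described here; the original notebook may pin specific versions):

```python
# Notebook cell: install the libraries used throughout this tutorial
!pip install -q transformers peft bitsandbytes accelerate trl datasets
```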
Quantisation: LLM weights are stored as 32-bit floating-point values, so most systems do not have enough memory to hold the full model. To use the model, we quantise the weights, i.e. convert the 32-bit floating-point values into a lower-precision format such as 8-bit integers, which takes far less memory.

PEFT applies transfer learning to LLMs by freezing weights: most of the LLM's weights stay frozen, and only a small set of weights (the added adapter parameters) is retrained.
Those retrained weights are what adapt the model to give accurate results on our custom dataset. A small illustration follows below.
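As a minimal sketch of this idea (using the small model facebook/opt-125m purely for illustration, not the model fine-tuned in this article):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a small causal LM just to illustrate the effect of PEFT/LoRA
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# get_peft_model freezes the base weights and injects small trainable adapters
peft_model = get_peft_model(base_model, LoraConfig(task_type="CAUSAL_LM"))

# Prints the trainable parameter count, typically well under 1% of the total
peft_model.print_trainable_parameters()
```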

Prompt Template for Llama 2

<s> marks the beginning of a sequence, and the instruction is wrapped in [INST] … [/INST] tags. The system prompt, which guides the model's behaviour, is enclosed between the <<SYS>> and <</SYS>> markers inside the instruction. The user prompt carries the actual request we give the model. The full template is shown after the list below.

  • System Prompt (optional) to guide the model
  • User prompt (required) to give the instruction
  • Model Answer (required)
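Putting the three parts together, the Llama 2 chat prompt format looks like this (the double-braced names are placeholders for the pieces listed above, not literal text):

```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_prompt }} [/INST] {{ model_answer }} </s>
```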

Any custom dataset used for fine-tuning the Llama 2 chat model should be converted into the above format.

Note: We don't need to convert our data into this prompt template if we are using the base Llama 2 model.

Why PEFT?

i. PEFT makes it possible to run and fine-tune a model even with low memory. Google Colab's free tier provides a GPU with only about 15 GB of memory, which is a limited resource: it can barely hold the weights of the Llama 2-7B model.
ii. Additional memory is required on top of that for the optimiser states, gradients, and forward activations, which drives memory consumption even higher.
iii. Full fine-tuning is therefore not possible: models of this scale, up to 70 billion weights, cannot be stored in 15 GB together with their training state.
PEFT freezes most of the weights in the LLM; the frozen base model is quantised to save memory, and only a small set of selected weights is then fine-tuned.

2. Import the Libraries

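A typical import cell for this workflow, covering the libraries used in the steps below, might look like this (the exact imports in the original notebook may differ):

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer
```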

3. Load the model

In this step, we will load the "llama-2-7b-chat-hf" model and then train it on 1,000 samples, which will produce a fine-tuned model.

QLoRA will use a rank of 64 with a scaling parameter of 16.
The rank (r = 64) is a hyperparameter that sets the dimension of the low-rank update matrices, and alpha (16) is the scaling parameter applied to those updates. We'll load the Llama 2 model directly in 4-bit precision using the NF4 type and train it for one epoch.
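A sketch of this step, assuming the NousResearch mirror of the checkpoint (the gated meta-llama repository works the same way) and the 4-bit NF4 settings just described:

```python
model_name = "NousResearch/Llama-2-7b-chat-hf"  # assumed checkpoint name

# 4-bit NF4 quantisation configuration for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the base model directly in 4-bit precision
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Load the matching tokeniser and set up padding
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
```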

4. Load the dataset and start fine-tuning

First we load our custom dataset and preprocess it. Preprocessing involves reformatting the prompts, filtering out bad text, and combining multiple datasets; all of this has already been done in the dataset we are using here.
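Loading such a ready-made dataset is a single call. The dataset name below is an assumption: a 1,000-sample Guanaco subset already reformatted into the Llama 2 prompt template, which matches the description above:

```python
# Assumed dataset: 1,000 Guanaco samples pre-formatted for Llama 2
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")
```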

Then comes configuring bitsandbytes for 4-bit quantisation (from 32-bit down to 4-bit). This conversion helps the fine-tuning process because the model consumes far less GPU memory. Thereafter we load the Llama 2 model in 4-bit precision on the GPU with the corresponding tokeniser. Finally, we load the QLoRA configuration and the regular training parameters, and pass everything to the SFT Trainer.

After that, we need to load the base model using AutoModelForCausalLM and then load the Llama tokeniser, which converts the text input into the token IDs the model expects. A minimal sketch of these remaining pieces is shown below.
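This sketch assumes the trl SFTTrainer API from around the time of writing; apart from r=64 and alpha=16, the hyperparameter values are illustrative defaults, not necessarily the notebook's exact settings:

```python
# QLoRA configuration: rank 64 with scaling parameter (alpha) 16, as above
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

# Regular training parameters (illustrative values)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=25,
)

# Pass everything to the supervised fine-tuning trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_args,
)

trainer.train()
```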

Summarised steps:

  • Load the dataset
  • Set the dtype
  • Set up the quantisation process
  • Check GPU availability and compatibility
  • Load the base LLM model (Llama 2)
  • Load the Llama tokeniser to be used with Llama 2, along with the padding parameter
  • Configure the parameters for LoRA (in terms of the PEFT configuration)
  • Set the training arguments
  • Set the supervised fine-tuning parameters

You can find the entire source code here : https://github.com/tulsipatro/LLM/blob/main/1.Fine_tuning_Llama2_using_PEFT.ipynb
