I want to train a Model...
You have landed at a great place if you want to learn how to train a model on your own custom data. I just trained a model on an SQL query dataset and tested how accurately it can generate SQL statements from a user's request in plain English… The goal of this exercise is a tool for end users who want to query different data sources with plain-English instructions; the model helps translate those instructions into SQL queries.
Ok, so without delay, let's dive in. First things first… You need compute to run the training, and a platform where you can write your code, bring up the execution environment, and run the training.
I used Google Colab… and Jupyter Notebooks. A Jupyter notebook is essentially a JSON file containing blocks of Python code and supporting text. (If you are a newbie and unfamiliar with Jupyter Notebooks, don't worry.)
I will attach the Jupyter Notebook, which you can open up in Google Colab and go through the code.
But before I dive into the code and how to train the model, we need to set up the context.
Why.. Why are we training the model, and what will we achieve with that..?
Anyone reading this might have their own use case; you might have your own data and be looking to create a model that can respond based on that data and be helpful…
I will provide you with the context around my data, how I am using it to train the model, and what I am achieving. You can then follow my lead and apply this to your dataset.
To start with, the dataset I am using is on Hugging Face. (If you are new to Hugging Face, create an account there and start using it. It's an amazing site where people release their models, use other people's models, and share datasets.)
So, the dataset I am using is located at https://huggingface.co/datasets/b-mc2/sql-create-context
There are 3 columns in this dataset. Here is how a row looks:
Question - How many heads of the departments are older than 56?
Answer - SELECT COUNT(*) FROM head WHERE age > 56
Context - CREATE TABLE head (age INTEGER)
We will train our model on this kind of data, where plain-English instructions are mapped to SQL queries and schema context. Once trained on this dataset, the model will be able to respond to users' questions by generating SQL. Let's explore further.
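If you want a quick look at the data before we start, you can load it and print a row right in Colab (a minimal sketch; the field names match the three columns above):
from datasets import load_dataset
# Load the text-to-SQL dataset from the Hugging Face Hub and inspect one row
ds = load_dataset("b-mc2/sql-create-context", split="train")
print(len(ds))   # total number of rows
print(ds[0])     # one row with its 'question', 'context' and 'answer' fields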
So now we have a dataset with roughly 70K entries. Let's start the process of training the model.
I am using the CodeLlama model from Meta; you can find the details about it here:
https://huggingface.co/codellama/CodeLlama-7b-hf
From here on, we work in our Google Colab. For training I used a runtime with an A100 class GPU. (I will attach the full notebook at the end of this blog, but I suggest you work through it step by step to understand each part.)
Step 1: Development environment
Let's install all the required libraries, such as PyTorch, trl, transformers, and datasets. We will need these components to train the model on the dataset.
# Install Pytorch & other libraries
!pip install "torch==2.1.2" tensorboard
# Install Hugging Face libraries
!pip install --upgrade \
  "transformers==4.36.2" \
  "datasets==2.16.1" \
  "accelerate==0.26.1" \
  "evaluate==0.4.1" \
  "bitsandbytes==0.42.0"
# Install peft & trl from GitHub (pinned commits) instead of the released "trl==0.7.10" and "peft==0.7.1" packages
!pip install git+https://github.com/huggingface/trl@a3c5b7178ac4f65569975efadc97db2f3749c65e --upgrade
!pip install git+https://github.com/huggingface/peft@4a1559582281fc3c9283892caea8ccef1d6f5a4f --upgrade
It will take a couple of minutes to download all the required libraries…
Once this is done, please install flash-attn. Flash Attention can accelerate training by up to three times.
import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'
# install flash-attn
!pip install ninja packaging
!MAX_JOBS=4 pip install flash-attn --no-build-isolation
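Once the build finishes, a quick import confirms that the package is usable (an optional sanity check; the import name is flash_attn):
import flash_attn
print(flash_attn.__version__)  # prints the installed flash-attn version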
Step 2: Connect with Hugging Face…
…so that you can use Hugging Face to save the trained model. Before you set the token here, go to the Hugging Face site and get yourself a token. You can create one under Settings → Access Tokens.
Once you have the token, run this code in your Google Colab to connect Colab with Hugging Face.
from huggingface_hub import login
login(
token=".......", # ADD YOUR TOKEN HERE
add_to_git_credential=True
)
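A small note on the token: rather than pasting it directly into the notebook, you can prompt for it at runtime so it never ends up in a saved copy (an optional variation using Python's getpass):
from getpass import getpass
from huggingface_hub import login
# Prompt for the token at runtime instead of hardcoding it in the notebook
login(token=getpass("Hugging Face token: "), add_to_git_credential=True)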
Step 3: Load the Dataset…
Now we need to pay attention. We have to teach the model which message is context for it, which message comes directly from the user, and what kind of output it needs to produce. Remember, our dataset has three columns: question, answer, and context. We will convert each row into that three-part conversational format so the model learns what output is expected from our data and our users.
from datasets import load_dataset
# Convert dataset to OAI messages
system_message = """You are an text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
{schema}"""
def create_conversation(sample):
    return {
        "messages": [
            {"role": "system", "content": system_message.format(schema=sample["context"])},
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["answer"]}
        ]
    }
# Load dataset from the hub
dataset = load_dataset("b-mc2/sql-create-context", split="train")
dataset = dataset.shuffle().select(range(12500))
# Convert dataset to OAI messages
dataset = dataset.map(create_conversation, remove_columns=dataset.features,batched=False)
# split dataset into 10,000 training samples and 2,500 test samples
dataset = dataset.train_test_split(test_size=2500/12500)
print(dataset["train"][345]["messages"])
# save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")
Let's look at this code. It maps over each entry in our dataset and creates a conversation from it. For each row of our data, the message list has three entries: one for the system (the schema as context), one with the question from the user, and one with the answer the assistant should produce. If it is hard to grasp, hold on; we will examine the output from the trained model shortly.
{"role": "system", "content": system_message.format(schema=sample["context"])},
{"role": "user", "content": sample["question"]},
{"role": "assistant", "content": sample["answer"]}
Apart from converting the dataset to messages, this code also prepares two datasets: one with 10,000 samples for training the model and one with 2,500 samples for testing its accuracy.
So let's run this in Colab.
from datasets import load_dataset
# Load jsonl data from disk
dataset = load_dataset("json", data_files="train_dataset.json", split="train")
The datasets are ready; the code above loads the training split back from disk.
Step 4: Prepare for Fine Tuning Using TRL and SFTTrainer.
We are now ready to fine-tune our model using the SFTTrainer from the trl library. The SFTTrainer, a subclass of the Trainer from the transformers library, simplifies the supervised fine-tuning of open LLMs. It supports all standard features like logging, evaluation, and checkpointing, while offering additional conveniences, including:
Formatting datasets for conversational and instruction formats
Training on completions only, ignoring prompts
Efficient dataset packing
PEFT (parameter-efficient fine-tuning) support, including Q-LoRA
Preparing the model and tokenizer for conversational fine-tuning (e.g., adding special tokens)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import setup_chat_format
# Hugging Face model id
model_id = "codellama/CodeLlama-7b-hf" # or `mistralai/Mistral-7B-v0.1`
# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right' # to prevent warnings
# set chat template to OAI chatML, remove if you start from a fine-tuned model
model, tokenizer = setup_chat_format(model, tokenizer)
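If you are curious what the ChatML template actually produces, you can render one of our training samples as plain text (a quick check; dataset here is the train split we loaded back in Step 3):
# Render the first training sample with the chat template, without tokenizing
print(tokenizer.apply_chat_template(dataset[0]["messages"], tokenize=False))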
We will use QLoRA to fine-tune the model.
from peft import LoraConfig
# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
lora_alpha=128,
lora_dropout=0.05,
r=256,
bias="none",
target_modules="all-linear",
task_type="CAUSAL_LM",
)
Next, set up the training arguments. Note that with a per-device batch size of 3 and 2 gradient accumulation steps, the effective batch size is 6.
from transformers import TrainingArguments
args = TrainingArguments(
output_dir="code-llama-7b-text-to-sql", # directory to save and repository id
num_train_epochs=3, # number of training epochs
per_device_train_batch_size=3, # batch size per device during training
gradient_accumulation_steps=2, # number of steps before performing a backward/update pass
gradient_checkpointing=True, # use gradient checkpointing to save memory
optim="adamw_torch_fused", # use fused adamw optimizer
logging_steps=10, # log every 10 steps
save_strategy="epoch", # save checkpoint every epoch
learning_rate=2e-4, # learning rate, based on QLoRA paper
bf16=True, # use bfloat16 precision
tf32=True, # use tf32 precision
max_grad_norm=0.3, # max gradient norm based on QLoRA paper
warmup_ratio=0.03, # warmup ratio based on QLoRA paper
lr_scheduler_type="constant", # use constant learning rate scheduler
push_to_hub=True, # push model to hub
report_to="tensorboard", # report metrics to tensorboard
)
Finally, create an SFTTrainer to run the training.
from trl import SFTTrainer
max_seq_length = 3072 # max sequence length for model and packing of the dataset
trainer = SFTTrainer(
model=model,
args=args,
train_dataset=dataset,
peft_config=peft_config,
max_seq_length=max_seq_length,
tokenizer=tokenizer,
packing=True,
dataset_kwargs={
"add_special_tokens": False, # We template with special tokens
"append_concat_token": False, # No need to add additional separator token
}
)
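Before kicking off the training run, it can be reassuring to see how small the trainable LoRA footprint is compared to the full 7B model (assuming the trainer has wrapped the model as a PEFT model, which SFTTrainer does when you pass peft_config):
# Show how many parameters LoRA actually trains vs. the total parameter count
trainer.model.print_trainable_parameters()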
Step 5: Train the Model
We are all set; now let's train our model.
# start training, the model will be automatically saved to the hub and the output directory
trainer.train()
# save model
trainer.save_model()
That's it; training will take 20 - 40 minutes depending on the runtime you have chosen.
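One practical tip: if you load the fine-tuned model for testing in the same Colab session that just finished training, you may run out of GPU memory. Freeing the training objects first usually helps (a minimal sketch; alternatively, simply restart the runtime and rerun the imports):
import gc
import torch
# Release the training model and trainer, then clear the CUDA cache
del model
del trainer
gc.collect()
torch.cuda.empty_cache()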
Now let's test it.
Step 6: Set up the model for testing.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline
peft_model_id = "./code-llama-7b-text-to-sql"
# peft_model_id = args.output_dir
# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
peft_model_id,
device_map="auto",
torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
# load into pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
Now let's load the test dataset and run a quick check.
from datasets import load_dataset
from random import randint
# Load our test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset) - 1)  # randint is inclusive on both ends
# Test on sample
prompt = pipe.tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=False, temperature=0.1, top_k=50, top_p=0.1, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)
print(f"Prompt:\n{prompt}")
print(f"Prompt:\n{eval_dataset[rand_idx]}")
print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")
print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")
Here is the output of this test. Check the prompt created, the query asked, and the response generated by the model:
Prompt: {'messages': [{'content': 'You are an text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.\nSCHEMA:\nCREATE TABLE table_2182654_3 (no_in_season VARCHAR, no_in_series VARCHAR)', 'role': 'system'}, {'content': 'What episode number in the season is episode 24 in the series?', 'role': 'user'}, {'content': 'SELECT no_in_season FROM table_2182654_3 WHERE no_in_series = 24', 'role': 'assistant'}]}
Query: What episode number in the season is episode 24 in the series?
Original Answer: SELECT no_in_season FROM table_2182654_3 WHERE no_in_series = 24
Generated Answer: SELECT no_in_season FROM table_2182654_3 WHERE no_in_series = "24"
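The single sample above looks good, but if you want a rough accuracy number you can loop over a slice of the test set and compare the generated SQL with the reference answer. Exact string matching undercounts semantically equivalent queries (note the quoted "24" above), so treat the number as a lower bound (a minimal sketch):
from tqdm import tqdm
def evaluate_sample(sample):
    # Build the prompt from the system and user messages only
    prompt = pipe.tokenizer.apply_chat_template(sample["messages"][:2], tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=False, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)
    predicted = outputs[0]["generated_text"][len(prompt):].strip()
    # Exact string match against the reference SQL
    return predicted == sample["messages"][2]["content"]
number_of_samples = 100  # a slice of the 2,500 test rows; the full set takes much longer
correct = sum(evaluate_sample(eval_dataset[i]) for i in tqdm(range(number_of_samples)))
print(f"Exact-match accuracy on {number_of_samples} samples: {correct / number_of_samples:.2%}")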
That's it… Now, here is the notebook, which you can use in your own Google Colab:
https://github.com/aroravce/AIPracticalBootcamp/blob/main/TrainingTheCodeLlama%20(1).ipynb
https://colab.research.google.com/drive/1Ysv1o3gOqZeJcMuc1nIWbWAik8hasKRg?usp=sharing
Thanks for reading along… Would love your appreciation in the form of likes and subscribes… More to come!