# Objective

 * We retain everything from the previous notebook, with the following changes:
 * Load the BERT large (330 million parameters) model in FP32.
 * Therefore, the memory required to load the model alone is $330 \times 10^6 \times 4$ bytes $\approx 1.3$ GB (about 1.2 GiB; we use the rounded figure of 1.2 GB below)
 * When we push the model to CUDA for training, it requires about 1 to 2 GB of additional memory for loading kernels.
 * Setting the batch size to 8 will throw the CUDA OOM (Out of Memory) error.
 * However, increasing the batch size often helps in faster convergence and better test performance.
 * How do we accomplish this?

* In this notebook, we are going to use a technique that increases the effective batch size without raising an OOM error
* Please read the previous notebook [here](https://github.com/Arunprakash-A/DL-Pytorch-Workshop) before proceeding further.
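
As a quick check of the arithmetic above, the following sketch converts a parameter count into the memory needed just to hold the weights. The helper name `weight_memory` is hypothetical, and 330 million is the rounded parameter count used in this notebook.

```python
def weight_memory(num_params, bytes_per_param=4):
    """Return (GB, GiB) needed to store the weights alone."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / 10**9, total_bytes / 2**30


# FP32 uses 4 bytes per parameter
gb, gib = weight_memory(330 * 10**6)
print(f"FP32  : {gb:.2f} GB ({gib:.2f} GiB)")

# For comparison only: a 16-bit format would use 2 bytes per parameter
half_gb, half_gib = weight_memory(330 * 10**6, bytes_per_param=2)
print(f"16-bit: {half_gb:.2f} GB ({half_gib:.2f} GiB)")
```

The gap between GB and GiB explains why the same model is quoted as roughly 1.3 GB or 1.2 GB depending on the unit.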

# Imports

In [None]:
%%capture
!pip install datasets
!pip install transformers[torch]==4.38.2
!pip install nvidia-ml-py3

In [None]:
import transformers
transformers.__version__

'4.38.2'

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
from pprint import pprint
import torch
from transformers import AutoTokenizer,pipeline

In [None]:
import datasets
from datasets import load_dataset, get_dataset_split_names, get_dataset_config_names, get_dataset_config_info
from transformers import AutoTokenizer

# Loading dataset

 * We are going to do Masked Language Modelling (MLM) as continual pre-training, using the same "MRPC" dataset as in the previous notebook
 * We will just drop the label column and randomly mask the words in the sentences for training the model with MLM
 * Though the dataset is small, we refrain from doing full pre-training (you can do that if you wish to)

In [None]:
raw_dataset = load_dataset(path="glue", name="mrpc") # name=config_name
train_split = raw_dataset["train"]
checkpoint = "bert-large-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
def tokenize_function(example):
    return tokenizer(text=example["sentence1"], text_pair=example["sentence2"], return_special_tokens_mask=True,
                     padding='max_length',truncation=True)

In [None]:
tokenized_datasets = raw_dataset.map(tokenize_function, batched=True)


In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 1725
    })
})

* Remove the columns that are not needed for MLM (the raw sentences and the label)

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1","sentence2",'label'])

In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['idx', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['idx', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['idx', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 1725
    })
})

* Apply masking with masking probability 0.3.

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,mlm_probability=0.3)
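
To make the masking step concrete, here is a simplified, framework-free sketch of what an MLM collator does to one sequence. It is not the library's implementation: `mask_tokens` is a hypothetical helper, the 80/10/10 replacement rule is the standard BERT recipe, and the `[MASK]` id of 103 and vocabulary size of 30522 are assumptions that hold for the `bert-*-uncased` vocabularies.

```python
import random

MASK_ID = 103   # assumed id of [MASK] in bert-*-uncased vocabularies
IGNORE = -100   # label value that the cross-entropy loss skips


def mask_tokens(input_ids, special_mask, vocab_size, mlm_probability=0.3, rng=random):
    """Return (masked_inputs, labels) for one tokenized sequence."""
    inputs, labels = list(input_ids), []
    for i, (tok, is_special) in enumerate(zip(input_ids, special_mask)):
        # Special tokens ([CLS], [SEP], [PAD]) are never selected
        if is_special or rng.random() >= mlm_probability:
            labels.append(IGNORE)   # no loss at this position
            continue
        labels.append(tok)          # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


rng = random.Random(0)
ids = [101, 2023, 2003, 1037, 7099, 102, 0, 0]   # illustrative ids, padded with 0
special = [1, 0, 0, 0, 0, 1, 1, 1]               # special_tokens_mask from the tokenizer
masked, labels = mask_tokens(ids, special, vocab_size=30522, rng=rng)
print(masked)
print(labels)
```

Because masking happens in the collator, each batch sees a fresh random mask, which is why the tokenized dataset itself stores no labels.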

# The Model

In [None]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
print(f'Number of parameters {model.num_parameters()/(10**6)} million')

Number of parameters 335.174458 million


* Let's quickly check that we are actually fine-tuning (continual pre-training) all the parameters in the model (uncomment the cell below to print the shapes of the trainable parameters)

In [None]:
# for parameter in model.parameters():
#     if parameter.requires_grad:
#         print(parameter.shape)

# Training using Trainer API

* So far, we have loaded neither the data nor the model into the GPU.
* Still, a small portion (about 260 MB) of the GPU memory is already occupied.

In [None]:
from pynvml import *


def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")

print_gpu_utilization()

GPU memory occupied: 258 MB.


* A callback for tracing the memory usage before and after the parameter update.

In [None]:
from transformers import TrainerCallback

class TraceMemory(TrainerCallback):

    def on_step_begin(self, args, state, control, **kwargs):
        nvmlInit()
        handle = nvmlDeviceGetHandleByIndex(0)
        info = nvmlDeviceGetMemoryInfo(handle)
        print(f'GPU memory step begin: {info.used//1024**2}MB')

    def on_step_end(self, args, state, control, **kwargs):
        nvmlInit()
        handle = nvmlDeviceGetHandleByIndex(0)
        info = nvmlDeviceGetMemoryInfo(handle)
        print(f'GPU memory step end: {info.used//1024**2}MB')


## Batch size 1: SGD

* Let's first run the model with a batch size of 1 (later we will change it to 8) and see how much memory is occupied
* Internally, the Trainer uses the AdamW optimizer by default for updating the weights
* We train using just 10 samples from the dataset.

In [None]:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments("test-trainer",num_train_epochs=1, per_device_train_batch_size=1,
                                  disable_tqdm=None,gradient_accumulation_steps=1)

* Visit this [page](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/callback#transformers.TrainerCallback) to see all the available callbacks
* Note that `transformers` version 4.42.0 has a few more callbacks, like "on_optimizer_step"

In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"].select(range(10)),
    eval_dataset=tokenized_datasets["validation"].select(range(10)),
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[TraceMemory],

)

* How much memory do we really need to run the model with a batch size of 8?

* The model parameters take 1.2 GB, plus about 1 GB for the kernels (**this may differ based on the GPU (T4, L4, V100, A100, ...)**)

In [None]:
print_gpu_utilization()

GPU memory occupied: 1650 MB.


* Training needs a further 1.2 GB for the gradients and 2.4 GB for the optimizer states (assuming an Adam-like optimizer, which keeps two moment estimates per parameter)

* As a rough rule of thumb, we budget about 1.2 GB of additional memory per sample in the batch **for training**

* So, with batch size 1, it requires in total **at least** $1.2 + 1.2 + 2.4 + 1.3 = 6.1$ GB (param + grad + states + kernel) of memory
* Note: We have ignored memory for activation
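
The batch-size-1 budget above can be reproduced from the parameter count. This is a minimal sketch under stated assumptions: FP32 weights and gradients (4 bytes each per parameter), two FP32 Adam moment estimates (8 bytes per parameter), and the 1.3 GB kernel overhead quoted in the text. The function name `training_memory_gib` is hypothetical, and activations are ignored here as well.

```python
def training_memory_gib(num_params, kernel_gib=1.3):
    """Lower bound on FP32 training memory with an Adam-like optimizer."""
    gib = 2**30
    weights = 4 * num_params / gib   # FP32 parameters
    grads = 4 * num_params / gib     # one FP32 gradient per parameter
    adam = 8 * num_params / gib      # two FP32 moment estimates per parameter
    total = weights + grads + adam + kernel_gib
    return {"weights": weights, "grads": grads, "adam_states": adam,
            "kernel": kernel_gib, "total": total}


budget = training_memory_gib(330 * 10**6)
for name, value in budget.items():
    print(f"{name:12s}{value:5.2f} GiB")
```

The total comes out at about 6.2 GiB, which agrees with the 6.1 figure once each term is rounded to one decimal place.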


In [None]:
result = trainer.train()
print(result)

GPU memory step begin: 1650MB
GPU memory step end: 6846MB


Step,Training Loss


GPU memory step begin: 6846MB
GPU memory step end: 6970MB
GPU memory step begin: 6970MB
GPU memory step end: 6970MB
GPU memory step begin: 6970MB
GPU memory step end: 6970MB
GPU memory step begin: 6970MB
GPU memory step end: 6970MB
GPU memory step begin: 6970MB
GPU memory step end: 6970MB
GPU memory step begin: 6970MB
GPU memory step end: 6970MB
GPU memory step begin: 6970MB
GPU memory step end: 6970MB
GPU memory step begin: 6970MB
GPU memory step end: 6970MB
GPU memory step begin: 6970MB
GPU memory step end: 6970MB
TrainOutput(global_step=10, training_loss=2.392757225036621, metrics={'train_runtime': 5.6187, 'train_samples_per_second': 1.78, 'train_steps_per_second': 1.78, 'total_flos': 9320251207680.0, 'train_loss': 2.392757225036621, 'epoch': 1.0})


* At the end of training, we have used about 7 GB of memory for a batch of size 1
* Note that there are 10 update steps (10 samples, one sample per batch).

* Therefore, if we use 8 samples in a batch, we need **at least** $1.2 + 8 \times 1.2 + 2.4 + 1.3 = 14.5$ GB (again ignoring the memory for activation)

* The Colab GPU has only 16 GB of memory, and **hence it will raise an OOM error if the batch size goes beyond 7**

* Now restart the session and execute the section below (it will throw the OOM error in Google Colab)
* I am running this notebook on an A100 (so it won't be a problem)
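
The readings printed by the callback during the batch-size-1 run can be summarised with a few lines of Python. The sketch below uses the first few log lines from that run, copied verbatim; the regular expression is written for this notebook's log format only.

```python
import re

log = """\
GPU memory step begin: 1650MB
GPU memory step end: 6846MB
GPU memory step begin: 6846MB
GPU memory step end: 6970MB
GPU memory step begin: 6970MB
GPU memory step end: 6970MB
"""

# Extract every reading, in order, from the callback output
readings = [int(mb) for mb in re.findall(r"GPU memory step (?:begin|end): (\d+)MB", log)]
baseline, peak = readings[0], max(readings)

print(f"baseline: {baseline} MB, peak: {peak} MB ({peak / 1024:.2f} GiB)")
print(f"added by training: {peak - baseline} MB")
```

Almost all of the growth happens in the first step, when the gradients and optimizer states are allocated; later steps reuse the same buffers.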

## Batch size 8: Mini-batch GD

* Ensure that you restarted the run-time (session) and executed all the cells above the "Batch size 1: SGD" section

In [None]:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments("test-trainer",num_train_epochs=1,per_device_train_batch_size=8,
                                  disable_tqdm=None,gradient_accumulation_steps=1)

In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"].select(range(10)),
    eval_dataset=tokenized_datasets["validation"].select(range(10)),
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[TraceMemory],

)

* Note that there won't be any change in memory when we load the model parameters into the GPU

In [None]:
print_gpu_utilization()

GPU memory occupied: 1650 MB.


In [None]:
result = trainer.train()
print(result)

GPU memory step begin: 1650MB


OutOfMemoryError: CUDA out of memory. Tried to allocate 478.00 MiB. GPU 

* On an A100, training occupied about 17 GB of memory

* Note also that there are only two updates (makes sense, right? 10 samples in batches of 8 give 2 batches)
* If you are using Colab, you will have encountered the **OOM** error.
* Now, restart the session and execute all the cells above the section "Batch size 1: SGD" and then execute the cells below

# Gradient Accumulation

* We know that a larger batch size often gives faster convergence and better test performance.

* How do we increase the batch size then?

* That's where gradient accumulation (a simple idea) helps.

* The idea is: instead of pushing all the samples of a large batch through the model at once, pass them in small batches that fit in memory and accumulate (sum or mean) the gradients (like we accumulate the Adam optimizer states).

* Update the weights after **accumulation_steps** (instead of after passing each batch)

* In the *TrainingArguments*, change the gradient accumulation steps to 8 and see the amount of memory occupied!
* Carefully note that we have done only one update (because the accumulation step is 8 but there are 10 samples in the dataset, the last 2 samples are ignored)
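
The number of weight updates in each configuration can be worked out with a small helper. This is a sketch, not the Trainer's code: `optimizer_updates` is a hypothetical function, and the rule that an incomplete accumulation cycle produces no update is an assumption chosen to match the step counts reported in this notebook with `transformers` 4.38.2. Other versions may treat leftover batches differently.

```python
import math


def optimizer_updates(num_samples, batch_size, accumulation_steps=1):
    """Weight updates in one epoch over num_samples samples."""
    # Number of forward/backward passes available in the epoch
    micro_batches = math.ceil(num_samples / batch_size)
    # Assumption: leftover passes that do not fill a cycle are dropped
    return max(micro_batches // accumulation_steps, 1)


print(optimizer_updates(10, batch_size=1))                        # 10
print(optimizer_updates(10, batch_size=8))                        # 2
print(optimizer_updates(10, batch_size=1, accumulation_steps=8))  # 1
```

With accumulation, only 8 of the 10 samples contribute to the single update, which is why the run reports `epoch` as 0.8.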

In [None]:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments("test-trainer",num_train_epochs=1,per_device_train_batch_size=1,
                                  disable_tqdm=None,gradient_accumulation_steps=8)

* At each step, we pass a batch of samples (1 in this case) that can fit into the memory
* The gradient is accumulated at each step until the step count reaches accumulation_steps
* When the step count reaches accumulation_steps (8 in this case), the weights are updated
* In effect, we have done mini-batch GD with batch size 8 (at the cost of training time)
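
The steps above can be sketched without any framework. The toy example below fits a single weight `w` with plain SGD (not AdamW) and checks that one update from 8 accumulated micro-batches of size 1 matches one update from a true batch of 8. The data, learning rate, and function names are made up for illustration.

```python
def grad_sample(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x


# Toy data, roughly y = 2x
data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w, lr, accumulation_steps = 0.0, 0.01, 8

# (a) One update from a true mini-batch of 8 (mean gradient over the batch)
full_batch_grad = sum(grad_sample(w, x, y) for x, y in data) / len(data)
w_full = w - lr * full_batch_grad

# (b) The same update from 8 micro-batches of size 1
w_acc, accumulated = w, 0.0
for step, (x, y) in enumerate(data, start=1):
    # Scale each contribution so that the running sum is a mean
    accumulated += grad_sample(w_acc, x, y) / accumulation_steps
    if step % accumulation_steps == 0:
        w_acc -= lr * accumulated   # single weight update after 8 steps
        accumulated = 0.0           # reset the buffer for the next cycle

print(w_full, w_acc)
```

In PyTorch the same pattern relies on `loss.backward()` adding into each parameter's `.grad`, with `optimizer.step()` and `optimizer.zero_grad()` called only once per accumulation cycle. Only one micro-batch is in memory at a time, so the peak memory stays at the batch-size-1 level.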

In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"].select(range(10)),
    eval_dataset=tokenized_datasets["validation"].select(range(10)),
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[TraceMemory],

)

In [None]:
print_gpu_utilization()

GPU memory occupied: 1650 MB.


In [None]:
result = trainer.train()
print(result)

GPU memory step begin: 1650MB
GPU memory step end: 6848MB


Step,Training Loss


TrainOutput(global_step=1, training_loss=2.60465931892395, metrics={'train_runtime': 4.1165, 'train_samples_per_second': 2.429, 'train_steps_per_second': 0.243, 'total_flos': 7456200966144.0, 'train_loss': 2.60465931892395, 'epoch': 0.8})


* Note that the GPU memory required to train the model is about 7 GB (the same as with batch size 1).
* The larger effective batch size can give us better test performance.