
Finetune Transformers

Learning how to finetune a BERT model from HuggingFace using PyTorch or TensorFlow for your use case is an art in itself, because there are so many ways and methods to do it that it is hard to figure out which one is best for your use case. BTW, you can always refer to the HuggingFace documentation.

For example:
  1. Choose between PyTorch and TensorFlow (let's choose PyTorch).
  2. If you are importing your dataset with pandas or polars, then you need to create a custom class by inheriting from torch.utils.data.Dataset (a sketch of this follows the list).
  3. Then you need to tokenize the data and use a DataLoader and a Data Collator.
  4. Then write a for-loop to train and validate the model.
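Step 2 might look like the following. This is a minimal sketch, not a definitive implementation; the text/label column names and the max_length value are assumptions.

```python
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    """Wraps a pandas DataFrame so PyTorch can index and batch it."""

    def __init__(self, df, tokenizer, max_length=128):
        self.texts = df["text"].tolist()    # hypothetical column name
        self.labels = df["label"].tolist()  # hypothetical column name
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        # Drop the batch dimension added by return_tensors="pt"
        item = {k: v.squeeze(0) for k, v in encoding.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
```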

But there is an easier way to finetune: using objects like transformers.TrainingArguments and transformers.Trainer, which remove most of the manual looping.
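A minimal sketch of that approach, assuming already-tokenized train/validation splits named train_ds and val_ds (they are prepared in the steps below) and a binary classification task:

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="bert-finetuned",  # where checkpoints get saved
    num_train_epochs=3,
    per_device_train_batch_size=16,
    eval_strategy="epoch",  # called evaluation_strategy in older versions
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # hypothetical tokenized splits
    eval_dataset=val_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```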

Finetune Process

Load Data

Import your dataset with your preferred method, such as pandas, polars, or other tools.
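For example, with pandas (the file name and the text/label columns are hypothetical):

```python
import pandas as pd

# Hypothetical CSV with "text" and "label" columns
df = pd.read_csv("data.csv")
print(df.head())
print(df["label"].value_counts())  # quick sanity check on class balance
```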

Preprocess Data

Preprocess the data and check the labels.
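A common preprocessing step for classification is mapping string labels to integer ids, since models expect numeric labels. A sketch, continuing with the hypothetical DataFrame from above:

```python
# Map string labels to integer ids and keep the reverse mapping for later
label2id = {label: i for i, label in enumerate(sorted(df["label"].unique()))}
id2label = {i: label for label, i in label2id.items()}
df["label"] = df["label"].map(label2id)

# Drop rows with missing text, a typical cleanup step
df = df.dropna(subset=["text"])
```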


Train-Val-Test Dataset

Split the data into train, validation, and test sets. Before doing this you have to consider many things (see the sketch after this list), like:

  • How to tokenize the data: with what padding, truncation, max_length, return_tensors, etc.?
  • Do you need to shuffle the data? (Only shuffle the training dataset.)
  • Which object will you use to store the data? (datasets.Dataset or torch.utils.data.DataLoader)
  • What representation or data type should the labels column have? This differs between problem types, so you have to make sure the data is in the correct format.
  • Is DataCollator required?
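One way to produce the three splits is datasets.Dataset.train_test_split, applied twice. A sketch, assuming the DataFrame from the preprocessing step:

```python
from datasets import Dataset

dataset = Dataset.from_pandas(df)

# First carve out the test set, then split the rest into train/validation
split = dataset.train_test_split(test_size=0.2, shuffle=True, seed=42)
test_ds = split["test"]
split = split["train"].train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = split["train"], split["test"]
```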

Tokenize Data

You need to tokenize the data before sending it to the model to be trained on. This is done using the respective model's tokenizer. You can tokenize the data example by example or in batches (recommended).
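Batch tokenization is usually done with datasets.Dataset.map and batched=True. A sketch, reusing the splits from above (the max_length value is an assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_fn(batch):
    # Truncate long texts; padding is deferred to the data collator
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_ds = train_ds.map(tokenize_fn, batched=True)
val_ds = val_ds.map(tokenize_fn, batched=True)
```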

Batch Creation

You have to cast the dataset into an object that supports batching: a datasets.Dataset can be handed to Trainer directly, while a manual training loop needs a torch.utils.data.DataLoader (see the sketch in the next section).

Data Collator

Data collators are objects that will form a batch by using a list of dataset elements as input.
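For example, transformers.DataCollatorWithPadding pads each batch to its longest sequence (dynamic padding). A sketch combining it with a DataLoader for the manual path, reusing the tokenized split from above:

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Keep only the columns the model expects before batching
train_ds = train_ds.remove_columns(["text"])

# The collator pads every batch to its longest sequence and returns tensors
train_loader = DataLoader(
    train_ds, batch_size=16, shuffle=True, collate_fn=collator)
```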

Load Tokenizer

A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full Python implementation and a “Fast” implementation based on the Rust library Tokenizers.
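For example, AutoTokenizer returns the fast flavor by default when one exists; use_fast=False selects the pure-Python one:

```python
from transformers import AutoTokenizer

fast_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # Rust-backed
slow_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
print(fast_tok.is_fast, slow_tok.is_fast)  # True False
```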

Load Model

A pre-trained model that we are going to finetune using our custom dataset.

BERT Model
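A sketch of loading BERT with a fresh classification head (num_labels and the label mappings come from the earlier preprocessing step):

```python
from transformers import AutoModelForSequenceClassification

# A new classification head is initialized on top of the pretrained encoder;
# num_labels must match your label set
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    id2label=id2label,  # optional, gives human-readable prediction labels
    label2id=label2id,
)
```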

Model Training/Finetuning

PEFT Methods

PEFT offers parameter-efficient methods for finetuning large pretrained models by training a much smaller number of parameters, using reparametrization methods like LoRA, among others.
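A minimal LoRA sketch with the peft library, wrapping the classification model from above (the rank and alpha values are assumptions):

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification
    r=8,                         # rank of the LoRA update matrices
    lora_alpha=16,
    lora_dropout=0.1,
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only a small fraction is trainable
```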

Evaluate Model
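A common approach is to pass a compute_metrics function to the Trainer and then call trainer.evaluate() on the held-out split. A sketch using the evaluate library:

```python
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# Pass compute_metrics=compute_metrics when constructing the Trainer,
# then run trainer.evaluate(test_ds) after training.
```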

Model Prediction/Inference
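For quick inference, transformers.pipeline wraps the tokenizer and model together. A sketch, assuming the finetuned model was saved with trainer.save_model("bert-finetuned"):

```python
from transformers import pipeline

clf = pipeline("text-classification", model="bert-finetuned")
print(clf("This movie was great!"))
# Returns a list of dicts with "label" and "score" keys, one per input
```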

Training with PyTorch

You can either finetune a pretrained model in native PyTorch or with the transformers.Trainer class (recommended).
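For comparison with the Trainer sketch earlier, here is a condensed native-PyTorch loop, assuming the model and train_loader from the previous sections:

```python
import torch
from torch.optim import AdamW

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)  # loss is computed when labels are present
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```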

Read this documentation by HuggingFace, "Fine-tune a pre-trained model", where they explain how to finetune a pre-trained model using both methods.

Also refer to this tutorial by the same team on "Text Classification".