Fine-Tune Transformers

05 Jan 2025 Anshul Raj Verma


References for fine-tuning Transformers from HuggingFace using PyTorch or TensorFlow.


Learning how to fine-tune a BERT model from HuggingFace using PyTorch or TensorFlow is an art in itself, because there are so many ways and methods to do it, and it is not easy to figure out which one is best for your use case. You can always refer to the HuggingFace documentation.

For example, the manual route looks like this:

  1. Choose between PyTorch and TensorFlow (let's choose PyTorch).
  2. If you import your dataset with pandas or polars, you need to create a custom class that inherits from torch.utils.data.Dataset.
  3. Then you need to tokenize the data and use a DataLoader and a data collator.
  4. Then write a for-loop to train and validate the model.
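The manual route above can be sketched in a few lines. This is a minimal illustration, assuming the text has already been tokenized; the class and variable names are mine, and the toy token ids are made up:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """Wraps pre-tokenized encodings and labels for DataLoader (step 2)."""
    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of lists: input_ids, attention_mask
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Toy pre-tokenized data (already padded to the same length here).
encodings = {"input_ids": [[101, 7592, 102], [101, 2088, 102]],
             "attention_mask": [[1, 1, 1], [1, 1, 1]]}
train_ds = TextDataset(encodings, labels=[0, 1])
loader = DataLoader(train_ds, batch_size=2, shuffle=True)  # step 3

# Step 4 is the usual loop (model and optimizer omitted here):
# for batch in loader:
#     outputs = model(**batch)   # HF models return a loss when labels are given
#     outputs.loss.backward()
#     optimizer.step(); optimizer.zero_grad()
```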

But there is an easier way to fine-tune: use objects like transformers.TrainingArguments and transformers.Trainer, which remove the complexity of the manual loop.

Fine-Tune Process

Load Data

Import the dataset with your preferred method, such as pandas, polars, or another library.

Preprocess Data

Process the data and check the labels. Docs: Preprocess Data.

Train-Val-Test Dataset

Split the data into train, validation, and test sets. Before doing this, you have to consider several things:

  • How to tokenize the data: which padding, truncation, max_length, return_tensors, etc.?
  • Do you need to shuffle the data? (Shuffle only the train split.)
  • Which object will you use to store the data? (datasets.Dataset or torch.utils.data.DataLoader)
  • What representation or data type should the labels column have? This differs between problem types, so make sure the data is in the correct format.
  • Is a DataCollator required?
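The split itself is simple; in practice you would use datasets.Dataset.train_test_split or scikit-learn's train_test_split, but a stdlib sketch makes the idea plain (the function name and fractions are mine):

```python
import random

def train_val_test_split(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle once, then slice into three disjoint parts."""
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # seeded for reproducibility
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
```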

Tokenize Data

You need to tokenize the data before sending it to the model for training. This is done with the respective model's tokenizer. You can tokenize examples one by one or in batches (recommended).

Padding and Truncation
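The padding and truncation settings decide how sequences of different lengths become equal-length tensors. Here is a toy stdlib sketch of roughly what `padding="max_length"` with `truncation=True` produce; using 0 as the pad token id is an assumption for illustration, real tokenizers use the model's own pad token:

```python
def pad_and_truncate(batch_ids, max_length=8, pad_id=0):
    """Toy version of fixed-length padding plus truncation."""
    out_ids, out_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]                    # truncation to max_length
        pad = [pad_id] * (max_length - len(ids))  # right-padding
        out_ids.append(ids + pad)
        out_mask.append([1] * len(ids) + [0] * len(pad))
    return {"input_ids": out_ids, "attention_mask": out_mask}

enc = pad_and_truncate([[5, 6, 7], list(range(12))], max_length=8)
```

The attention mask marks real tokens with 1 and padding with 0, so the model can ignore the padded positions.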

Batch Creation

You have to cast the dataset into an object that supports batching.
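A batch is just a fixed-size slice of the dataset; torch.utils.data.DataLoader does this (plus shuffling and collating) for you. A stdlib sketch of the chunking itself:

```python
def batches(items, batch_size):
    """Yield successive fixed-size chunks; the last one may be smaller."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

chunks = list(batches(list(range(10)), batch_size=4))
```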

Data Collator

Data collators are objects that will form a batch by using a list of dataset elements as input.

Docs: Data Collator (transformers.DataCollatorWithPadding).
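DataCollatorWithPadding implements dynamic padding: each batch is padded only to the longest sequence in that batch, rather than to a global max_length. A pure-Python sketch of that idea (pad id 0 and the dict keys are illustrative):

```python
def collate_with_padding(features, pad_id=0):
    """Pad a list of examples to the longest input_ids in this batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        ids = f["input_ids"]
        pad = [pad_id] * (longest - len(ids))
        batch["input_ids"].append(ids + pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * len(pad))
        batch["labels"].append(f["label"])
    return batch

batch = collate_with_padding([{"input_ids": [101, 102], "label": 0},
                              {"input_ids": [101, 7592, 2088, 102], "label": 1}])
```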

Load Tokenizer

A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full Python implementation and a "Fast" implementation based on the Rust library Tokenizers.

Load Model

A pretrained model that we are going to fine-tune on our custom dataset.

BERT Model

Model Training/Finetuning

PEFT Methods

PEFT offers parameter-efficient methods for fine-tuning large pretrained models by training a smaller number of parameters, using reparametrization methods such as LoRA.
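In practice you would use the peft library (LoraConfig and get_peft_model), but the reparametrization idea behind LoRA can be sketched in plain PyTorch: freeze the pretrained weight W and train only a low-rank update B @ A. This class is my illustration, not the peft implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of the LoRA idea: freeze the base layer, train a rank-r update."""
    def __init__(self, linear: nn.Linear, rank=4, alpha=8):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad_(False)               # pretrained weights stay frozen
        out_f, in_f = linear.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # base output plus the scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
```

With rank 4 on a 768x768 layer, only 2 x 4 x 768 = 6,144 parameters are trainable instead of about 590k, which is the whole point of the method.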

Evaluate Model

Model Prediction/Inference
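At inference time, the model returns raw logits per class; you apply softmax to get probabilities and argmax to pick the label, mapping the index back through the model's id2label. A stdlib sketch of that last step (the id2label mapping here is illustrative):

```python
import math

def predict_label(logits, id2label):
    """Turn raw logits into probabilities (softmax) and pick the argmax."""
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = probs.index(max(probs))
    return id2label[best], probs

label, probs = predict_label([-1.2, 3.4], {0: "NEGATIVE", 1: "POSITIVE"})
```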

Training with PyTorch

You can fine-tune a pretrained model either in native PyTorch or with the transformers.Trainer class (recommended).

Read the HuggingFace documentation "Fine-tune a pre-trained model", which explains how to fine-tune a pretrained model using each of the two methods.

Also refer to the tutorial by the same team on "Text Classification".