Learning how to fine-tune a BERT model with PyTorch/TensorFlow and HuggingFace for your use case is an art in itself, because there are so many ways and methods to do it, and it is not easy to figure out which one is best for your use case. You can always refer to the HuggingFace documentation.
For example:
- Choose between PyTorch and TensorFlow (let's choose PyTorch).
- If you are importing your dataset with `pandas` or `polars`, you need to create a custom class by inheriting from the `torch.utils.data.Dataset` class, as sketched below.
- Then you need to tokenize the data and use a `DataLoader` and a data collator.
- Then you use a for-loop to train and validate the model.
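For instance, here is a minimal sketch of such a custom dataset class (the name `TextDataset` and the field layout are illustrative, not from any library):

```python
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    """Wraps tokenizer encodings and labels so a DataLoader can batch them."""
    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict returned by the tokenizer
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
```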
But there is an easier way to fine-tune: using objects like `transformers.TrainingArguments` and `transformers.Trainer`, which reduce the manual looping complexity.
Fine-Tune Process
Load Data
Import the dataset with your preferred method, such as pandas, polars, or another way.
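For example, a minimal sketch assuming the data lives in a CSV file with `text` and `label` columns (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical CSV with "text" and "label" columns
df = pd.read_csv("reviews.csv")
print(df.head())
```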
Preprocess Data
Process the data and check the labels (see the HuggingFace docs on preprocessing data).
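For instance, a quick sanity check with pandas (assuming the DataFrame `df` from the Load Data step):

```python
# Drop rows with missing text and inspect the label distribution
df = df.dropna(subset=["text"])
print(df["label"].value_counts())
```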
Train-Val-Test Dataset
Split the data into train, validation, and test sets (a possible split is sketched after this list). Before doing this you have to consider several things, like:
- How to tokenize the data: which `padding`, `truncation`, `max_length`, `return_tensors`, etc. to use?
- Do you need to shuffle the data? (Shuffle only the train dataset.)
- Which object will you use to store the data? (`datasets.Dataset` or `torch.utils.data.DataLoader`)
- What representation or data type should the labels column have? This differs between problem types, so make sure the data is in the correct format.
- Is a data collator required?
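One possible split uses scikit-learn's `train_test_split`; the 80/10/10 ratio below is just an example, not a recommendation from the original:

```python
from sklearn.model_selection import train_test_split

# Carve off the test set first, then split the remainder into train/validation
train_val_df, test_df = train_test_split(
    df, test_size=0.1, stratify=df["label"], random_state=42
)
train_df, val_df = train_test_split(
    train_val_df, test_size=0.111, stratify=train_val_df["label"], random_state=42
)  # 0.111 of the remaining 90% is roughly 10% of the full data
```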
Tokenize Data
You need to tokenize the data before sending it to the model to be trained on. This is done using the respective model's tokenizer. You can tokenize the data one example at a time or in batches (recommended).
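A sketch of batch tokenization with a BERT tokenizer (the padding/truncation values are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the whole text column in one batched call
train_encodings = tokenizer(
    train_df["text"].tolist(),
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
```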
Batch Creation
You have to cast the dataset into an object which supports batching.
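For example, a sketch converting a pandas DataFrame into a `datasets.Dataset` that can later be batched (column names are assumed from the earlier steps):

```python
from datasets import Dataset

train_ds = Dataset.from_pandas(train_df[["text", "label"]], preserve_index=False)

# Tokenize in batches; no padding here, the data collator will pad later
train_ds = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)
train_ds = train_ds.remove_columns(["text"]).rename_column("label", "labels")
train_ds.set_format("torch")
```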
Data Collator
Data collators are objects that will form a batch by using a list of dataset elements as input.
`transformers.DataCollatorWithPadding`
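A sketch wiring the collator into a `DataLoader` (assuming the tokenized `train_ds` from the Batch Creation step):

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_loader = DataLoader(
    train_ds,
    batch_size=16,           # illustrative batch size
    shuffle=True,            # shuffle only the training split
    collate_fn=data_collator,
)
```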
Load Tokenizer
A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full Python implementation and a "Fast" implementation based on the Rust library Tokenizers.
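For example (the checkpoint name is the standard BERT base model):

```python
from transformers import AutoTokenizer, BertTokenizer, BertTokenizerFast

# The Auto class returns the Rust-backed "Fast" tokenizer when one is available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The two flavors can also be requested explicitly
slow_tok = BertTokenizer.from_pretrained("bert-base-uncased")      # pure Python
fast_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")  # Rust "Fast"
```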
Load Model
A pre-trained model that we are going to fine-tune on our custom dataset.
BERT Model
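A sketch loading a BERT checkpoint with a sequence-classification head (the two-label setup is an assumption for illustration):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # assumed binary classification; set this to your number of classes
)
```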
Model Training/Finetuning
PEFT Methods
PEFT offers parameter-efficient methods for fine-tuning large pretrained models by training a smaller number of parameters, using reparametrization methods such as LoRA.
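A minimal LoRA sketch using the `peft` library (the hyperparameter values are illustrative, not prescriptive):

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification
    r=8,                         # rank of the LoRA update matrices
    lora_alpha=16,
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights is trainable
```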
Evaluate Model
Model Prediction/Inference
Training with PyTorch
You can either fine-tune a pretrained model in native PyTorch or with the `transformers.Trainer` class (recommended).
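A sketch of the Trainer route (the argument values and output directory are assumptions; `val_ds` is a validation split prepared the same way as `train_ds`):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-finetuned",     # assumed output directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```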
Read the HuggingFace documentation "Fine-tune a pre-trained model", where they explain how to fine-tune a pretrained model using both methods. Also refer to the tutorial by the same team on "Text Classification".