Finetune Transformers¶
Learning how to finetune a BERT model from HuggingFace using PyTorch/TensorFlow for your use case is an art in itself, because there are so many ways and methods to do it that it is hard to figure out which one is best for your use case. BTW, you can always refer to the HuggingFace documentation.
For example:
- Choose between PyTorch and TensorFlow (let's choose PyTorch).
- If you are importing your dataset with `pandas` or `polars`, then you need to create a custom class by inheriting from the `torch.utils.data.Dataset` class.
- Then you need to tokenize the data and use a `DataLoader` and a Data Collator.
- Then use a for-loop to train and validate the model.
But there is an easier way to finetune: use objects like `transformers.TrainingArguments` and `transformers.Trainer`, which remove most of the manual looping complexity.
Finetune Process¶
Load Data¶
Import your dataset using your preferred method, such as `pandas`, `polars`, or another library.
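For instance, a minimal sketch of loading a CSV with `pandas`; the file name and the `text`/`label` column names are placeholders for your own data:

```python
import pandas as pd

# Hypothetical dataset: a CSV with a "text" column and a "label" column.
df = pd.read_csv("reviews.csv")

print(df.head())
print(df["label"].value_counts())  # quick sanity check of the label distribution
```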
Preprocess Data¶
Preprocess the data and check the `labels`.
Related Docs
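As a sketch, assuming the hypothetical `df` from the previous step, a common preprocessing step is to drop incomplete rows and map string labels to integer ids, which classification heads expect:

```python
# Assumes the hypothetical "text"/"label" columns from the loading step.
df = df.dropna(subset=["text", "label"])

# Map string labels to integer ids (and keep the reverse mapping for later).
label2id = {label: i for i, label in enumerate(sorted(df["label"].unique()))}
id2label = {i: label for label, i in label2id.items()}
df["label"] = df["label"].map(label2id)

print(df["label"].value_counts())
```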
Train-Val-Test Dataset¶
Split the data into train, validation and test sets; one way to do the split is sketched after this list. Before doing this you have to consider many things, like:
- How to tokenize the data with certain `padding`, `truncation`, `max_length`, `return_tensors`, etc.?
- Do you need to shuffle the data? (Only shuffle the train dataset.)
- Which object will you use to store the data? (`datasets.Dataset` or `torch.utils.data.DataLoader`)
- What representation or data type should the labels column have? This differs between problem types, so you have to make sure the data is in the correct format.
- Is a DataCollator required?
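A minimal sketch of one possible split, assuming the preprocessed `df` from the previous steps and using `datasets.Dataset`:

```python
from datasets import Dataset

# Convert the pandas DataFrame into a Hugging Face Dataset.
ds = Dataset.from_pandas(df, preserve_index=False)

# Hold out 20% of the data, then split that half-and-half into validation and test.
split = ds.train_test_split(test_size=0.2, shuffle=True, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], heldout["train"], heldout["test"]
print(train_ds, val_ds, test_ds)
```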
Tokenize Data¶
You need to tokenize the data before sending it to the model for training. This is done with the respective model's tokenizer. You can tokenize the data example by example or in batches (recommended).
Related Docs
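A sketch of batched tokenization with `datasets.Dataset.map`, assuming the splits from the previous step; the `bert-base-uncased` checkpoint and the `text` column are placeholder choices:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Truncate long examples; padding is deferred to the data collator (next steps).
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)
```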
Batch Creation¶
You have to cast the dataset into an object which supports batching.
Data Collator¶
Data collators are objects that will form a batch by using a list of dataset elements as input.
Related Docs
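If you write your own PyTorch loop, here is a sketch with `DataCollatorWithPadding` and a `DataLoader`; the `text` column name is an assumption carried over from the earlier sketches, and note that `transformers.Trainer` builds its own DataLoaders, so this step is only needed for manual loops:

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# Pads every batch dynamically to the length of its longest example.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Drop the raw text column so only model inputs (and labels) remain.
train_ds = train_ds.remove_columns(["text"])
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True, collate_fn=data_collator)
```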
Load Tokenizer¶
A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a “Fast” implementation based on the Rust library Tokenizers.
Related Docs
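As a sketch (using `bert-base-uncased` as a placeholder checkpoint), `AutoTokenizer` picks the Rust-backed "fast" implementation when one is available, and you can also request a flavor explicitly:

```python
from transformers import AutoTokenizer, BertTokenizer, BertTokenizerFast

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.is_fast)  # True when the Rust-backed implementation is used

slow_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")      # pure Python
fast_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # Rust-backed
```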
Load Model¶
A pre-trained model which we are going to finetune using our custom dataset.
BERT Model¶
Related Docs
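A sketch of loading a pretrained BERT checkpoint with a fresh classification head; `num_labels`, `id2label` and `label2id` come from the hypothetical label mapping built during preprocessing:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",          # placeholder checkpoint
    num_labels=len(label2id),     # size of the new classification head
    id2label=id2label,
    label2id=label2id,
)
```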
Model Training/Finetuning¶
PEFT Methods¶
PEFT offers parameter-efficient methods for finetuning large pretrained models by training a much smaller number of parameters, using reparametrization methods such as LoRA, among others.
Related Docs
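A sketch of wrapping the model with LoRA adapters via the `peft` library; the rank, alpha and dropout values are just common starting points, and `target_modules` here points at BERT's attention projections:

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # sequence classification
    target_modules=["query", "value"],  # BERT attention projection layers
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # shows how few parameters are trained
```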
Evaluate Model¶
Related Docs
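As a sketch, a metric function in the shape `transformers.Trainer` expects, using the `evaluate` library; accuracy is just an example metric:

```python
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # Trainer passes (logits, labels) for the whole evaluation set.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
```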
Model Prediction/Inference¶
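A minimal inference sketch, assuming the finetuned `model` and `tokenizer` from the previous steps; the input sentence is only an example:

```python
import torch

model.eval()
inputs = tokenizer("This movie was surprisingly good!", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

pred_id = logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])  # map the predicted id back to its label name
```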
Training with PyTorch¶
You can either finetune a pretrained model in native PyTorch or with the `transformers.Trainer` class (recommended).
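A minimal `Trainer` sketch that ties the earlier pieces together; the hyperparameters and the `bert-finetuned` output directory are placeholder choices, not recommendations, and `model`, `train_ds`, `val_ds`, `data_collator` and `compute_metrics` come from the previous sketches:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-finetuned",     # where checkpoints are written
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()      # finetune
trainer.evaluate()   # evaluate on the validation set
```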
Read this documentation by HuggingFace, "Fine-tune a pre-trained model", where they explain how to finetune a pre-trained model using both methods.
Also refer to this tutorial by the same team on "Text Classification".