Learning how to fine-tune a BERT model with PyTorch/TensorFlow and HuggingFace for your use case is an art in itself, because there are so many ways and methods to do it, and it is not easy to figure out which one is best for your use case. You can always refer to the HuggingFace documentation.
For example:
- Choose between PyTorch and TensorFlow (let's choose PyTorch).
- If you are importing your dataset with `pandas` or `polars`, you need to create a custom class by inheriting from the `torch.utils.data.Dataset` class, as sketched below.
- Then you need to tokenize the data and use a `DataLoader` and a data collator.
- Then you use a for-loop to train and validate the model.
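For instance, here is a minimal sketch of such a custom dataset class (the name `TextDataset` and the field layout are illustrative, not from any library):

```python
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    """Wraps tokenizer encodings and labels so a DataLoader can batch them."""
    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict returned by the tokenizer
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
```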
But there is an easier way to fine-tune: using objects like `transformers.TrainingArguments` and `transformers.Trainer`, which reduce the manual looping complexity.
Fine-Tune Process
Load Data
Import the dataset with your preferred method, such as pandas, polars, or another way.
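For example, a minimal sketch assuming the data lives in a CSV file with `text` and `label` columns (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical CSV with "text" and "label" columns
df = pd.read_csv("reviews.csv")
print(df.head())
```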
Preprocess Data
Process the data and check the labels (see the HuggingFace docs on preprocessing data).
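For instance, a quick sanity check with pandas (assuming the DataFrame `df` from the Load Data step):

```python
# Drop rows with missing text and inspect the label distribution
df = df.dropna(subset=["text"])
print(df["label"].value_counts())
```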
Train-Val-Test Dataset
Split the data into train, validation, and test sets (a possible split is sketched after this list). Before doing this you have to consider several things, like:
- How to tokenize the data: which `padding`, `truncation`, `max_length`, `return_tensors`, etc. to use?
- Do you need to shuffle the data? (Shuffle only the train dataset.)
- Which object will you use to store the data? (`datasets.Dataset` or `torch.utils.data.DataLoader`)
- What representation or data type should the labels column have? This differs between problem types, so make sure the data is in the correct format.
- Is a data collator required?
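One possible split uses scikit-learn's `train_test_split`; the 80/10/10 ratio below is just an example, not a recommendation from the original:

```python
from sklearn.model_selection import train_test_split

# Carve off the test set first, then split the remainder into train/validation
train_val_df, test_df = train_test_split(
    df, test_size=0.1, stratify=df["label"], random_state=42
)
train_df, val_df = train_test_split(
    train_val_df, test_size=0.111, stratify=train_val_df["label"], random_state=42
)  # 0.111 of the remaining 90% is roughly 10% of the full data
```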
Tokenize Data
You need to tokenize the data before sending it to the model to be trained on. This is done using the respective model's tokenizer. You can tokenize the data one example at a time or in batches (recommended).
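A sketch of batch tokenization with a BERT tokenizer (the padding/truncation values are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the whole text column in one batched call
train_encodings = tokenizer(
    train_df["text"].tolist(),
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
```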
Batch Creation
You have to cast the dataset into an object which supports batching.
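For example, a sketch converting a pandas DataFrame into a `datasets.Dataset` that can later be batched (column names are assumed from the earlier steps):

```python
from datasets import Dataset

train_ds = Dataset.from_pandas(train_df[["text", "label"]], preserve_index=False)

# Tokenize in batches; no padding here, the data collator will pad later
train_ds = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)
train_ds = train_ds.remove_columns(["text"]).rename_column("label", "labels")
train_ds.set_format("torch")
```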
Data Collator
Data collators are objects that will form a batch by using a list of dataset elements as input.
`transformers.DataCollatorWithPadding`
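A sketch wiring the collator into a `DataLoader` (assuming the tokenized `train_ds` from the Batch Creation step):

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_loader = DataLoader(
    train_ds,
    batch_size=16,           # illustrative batch size
    shuffle=True,            # shuffle only the training split
    collate_fn=data_collator,
)
```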
Load Tokenizer
A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full Python implementation and a "Fast" implementation based on the Rust library Tokenizers.
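For example (the checkpoint name is the standard BERT base model):

```python
from transformers import AutoTokenizer, BertTokenizer, BertTokenizerFast

# The Auto class returns the Rust-backed "Fast" tokenizer when one is available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The two flavors can also be requested explicitly
slow_tok = BertTokenizer.from_pretrained("bert-base-uncased")      # pure Python
fast_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")  # Rust "Fast"
```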
Load Model
A pre-trained model that we are going to fine-tune on our custom dataset.
BERT Model
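A sketch loading a BERT checkpoint with a sequence-classification head (the two-label setup is an assumption for illustration):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # assumed binary classification; set this to your number of classes
)
```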
Model Training/Finetuning
PEFT Methods
PEFT offers parameter-efficient methods for fine-tuning large pretrained models by training a smaller number of parameters, using reparametrization methods such as LoRA.
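A minimal LoRA sketch using the `peft` library (the hyperparameter values are illustrative, not prescriptive):

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification
    r=8,                         # rank of the LoRA update matrices
    lora_alpha=16,
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights is trainable
```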
Evaluate Model
Model Prediction/Inference
Training with PyTorch
You can either fine-tune a pretrained model in native PyTorch or with the `transformers.Trainer` class (recommended).
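A sketch of the Trainer route (the argument values and output directory are assumptions; `val_ds` is a validation split prepared the same way as `train_ds`):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-finetuned",     # assumed output directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```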
Read the HuggingFace documentation "Fine-tune a pre-trained model", where they explain how to fine-tune a pretrained model using both methods. Also refer to the tutorial by the same team on "Text Classification".