In this tutorial, we will learn how to use BERT for text classification. We will begin with a brief introduction to BERT, its architecture and its fine-tuning mechanism. Then we will learn how to fine-tune BERT for text classification on the following tasks: binary classification (IMDB movie review sentiment), multi-class classification (20 Newsgroups) and multi-label classification (toxic comment classification).
We will use BERT through the keras-bert Python library, and train and test our model on GPUs provided by Google Colab, with a TensorFlow backend.
BERT stands for Bidirectional Encoder Representations from Transformers. It is a deep-learning-based unsupervised language representation model developed by researchers at Google AI Language, and the first deeply bidirectional unsupervised language model. Language models before BERT learnt from text sequences in either a left-to-right or a combined left-to-right and right-to-left context; they were therefore either not bidirectional, or not bidirectional in all layers. The diagram below compares BERT's bidirectional architecture with other language models.
Deep bi-directionality in BERT (Source)
BERT achieves deep bi-directionality in learning representations using a novel Masked Language Model (MLM) approach. This allows BERT to learn a word's representation from both its left and its right context. Under the hood, BERT uses the Transformer's attention mechanism for bidirectional training. With this approach, BERT achieved state-of-the-art results on a series of natural language processing and understanding tasks.
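To make the MLM idea concrete, here is a minimal, simplified sketch (not from the original BERT code) of how a fraction of input tokens could be masked so that the model must predict them from the surrounding context on both sides. The 15% masking rate follows the BERT paper; the real procedure additionally keeps or randomly replaces some of the selected tokens.

import random

def mask_tokens(tokens, mask_rate=0.15, mask_token='[MASK]'):
    # Replace a random subset of tokens with [MASK]; during pre-training BERT
    # is asked to recover the original token at each masked position, using
    # context from both the left and the right.
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets.append(tok)    # token the model must predict
        else:
            masked.append(tok)
            targets.append(None)   # position excluded from the MLM loss
    return masked, targets

masked, targets = mask_tokens("the quick brown fox jumps over the lazy dog".split())
print(masked)
print(targets)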
Before diving into using BERT for text classification, let us take a quick look at BERT's architecture. BERT is a multi-layer bidirectional Transformer encoder. The diagram below shows a 12-layer BERT model (the BERT-Base version). Note that each Transformer encoder layer is built around the attention mechanism.
There are multiple pre-trained model versions with varying numbers of encoder layers, attention heads and hidden size dimensions available. Below is a list of different model variants available.
H = Hidden size.
A = Number of self-attention heads.
L = Number of layers (Transformer blocks).
The largest model available is BERT-Large, which has 24 layers, 16 attention heads and 1024-dimensional output hidden vectors. For each model, cased and uncased variants are available. In this tutorial we will use BERT-Base, which has 12 encoder layers, 12 attention heads and 768-dimensional hidden representations.
BERT can be used for text classification in three ways: by fine-tuning the pre-trained model on the task dataset, by using it as a feature extractor, or by generating fixed token embeddings and feeding them to a separate classifier. The snippet below shows how to generate token embeddings with the bert-embedding library.
# Source: https://pypi.org/project/bert-embedding/
# pip install bert-embedding

from bert_embedding import BertEmbedding

text = "A tutorial on how to generate token embeddings using BERT"

bert_embedding = BertEmbedding()
result = bert_embedding(text.split('\n'))

# result holds one entry per sentence; first_sentence[1] holds the BERT
# embeddings for the tokenized sentence.
first_sentence = result[0]
embedding = first_sentence[1]
print(embedding)
# array([ 0.4805648 ,  0.18369392, -0.28554988, ..., -0.01961522,
#         1.0207764 , -0.67167974], dtype=float32)
So which approach should we choose for text classification with BERT? The answer depends on the performance requirements and the amount of effort we wish to put in, in terms of resources and time. The fine-tuning and feature-based extraction approaches require training, testing and validating on a GPU or TPU and are therefore more time-consuming and resource-intensive than the embedding-based approach. However, they are expected to yield better results, as they benefit from bidirectional contextual representations of whole sentences, tuned specifically for the task at hand.
The BERT paper recommends fine-tuning for better results. A few advantages of fine-tuning BERT are as follows:
So, what is the input to BERT? The input to BERT is an embedding representation derived by summing the token embedding, the segment embedding and the position embedding of the text.
What are the token, segment and position embeddings?
Note that each of the embeddings (token, position and segment) being summed to derive the input has dimension (SEQ_LEN x Hidden-Size). The SEQ_LEN value can be changed and is decided based on the length of the sentences in the downstream task dataset; sentences shorter than the sequence length are padded. The Hidden-Size (H) is determined by the choice of BERT model (BERT Tiny, Small, Base, Large, etc.).
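As a rough, self-contained illustration (not part of the original tutorial), the sketch below sums three stand-in embedding tables to produce the (SEQ_LEN x Hidden-Size) input representation described above; the vocabulary and segment-table sizes are assumptions for the uncased BERT-Base model.

import numpy as np

SEQ_LEN, HIDDEN = 128, 768            # dimensions used throughout this tutorial
VOCAB_SIZE, NUM_SEGMENTS = 30522, 2   # assumed sizes for uncased BERT-Base

# Stand-in embedding tables; in the real model these are learned weights.
token_table = np.random.randn(VOCAB_SIZE, HIDDEN)
segment_table = np.random.randn(NUM_SEGMENTS, HIDDEN)
position_table = np.random.randn(SEQ_LEN, HIDDEN)

token_ids = np.zeros(SEQ_LEN, dtype=int)    # padded token indices for one text
segment_ids = np.zeros(SEQ_LEN, dtype=int)  # all zeros for single-sentence tasks

# The BERT input representation is the element-wise sum of the three embeddings.
input_embedding = (token_table[token_ids]
                   + segment_table[segment_ids]
                   + position_table[np.arange(SEQ_LEN)])
print(input_embedding.shape)  # (128, 768) == (SEQ_LEN, Hidden-Size)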
To fine-tune BERT for text classification, we take a pre-trained BERT model, apply an additional fully-connected dense layer on top of its output layer and train the entire model on the task dataset. The diagram below shows how BERT is used for text classification:
Note that only the final hidden state corresponding to the class token ([CLS]) is used as the aggregate sequence representation fed into the fully-connected dense layer for classification. To understand this better, let us look at the last layers of BERT (BERT-Base, 12 layers).
Encoder-11-FeedForward-Norm (La (None, 128, 768) 1536 Encoder-11-FeedForward-Add[0][0]
__________________________________________________________________________________________________
Encoder-12-MultiHeadSelfAttenti (None, 128, 768) 2362368 Encoder-11-FeedForward-Norm[0][0]
__________________________________________________________________________________________________
Encoder-12-MultiHeadSelfAttenti (None, 128, 768) 0 Encoder-12-MultiHeadSelfAttention
__________________________________________________________________________________________________
Encoder-12-MultiHeadSelfAttenti (None, 128, 768) 0 Encoder-11-FeedForward-Norm[0][0]
Encoder-12-MultiHeadSelfAttention
__________________________________________________________________________________________________
Encoder-12-MultiHeadSelfAttenti (None, 128, 768) 1536 Encoder-12-MultiHeadSelfAttention
__________________________________________________________________________________________________
Encoder-12-FeedForward (FeedFor (None, 128, 768) 4722432 Encoder-12-MultiHeadSelfAttention
__________________________________________________________________________________________________
Encoder-12-FeedForward-Dropout (None, 128, 768) 0 Encoder-12-FeedForward[0][0]
__________________________________________________________________________________________________
Encoder-12-FeedForward-Add (Add (None, 128, 768) 0 Encoder-12-MultiHeadSelfAttention
Encoder-12-FeedForward-Dropout[0]
__________________________________________________________________________________________________
Encoder-12-FeedForward-Norm (La (None, 128, 768) 1536 Encoder-12-FeedForward-Add[0][0]
__________________________________________________________________________________________________
Extract (Extract) (None, 768) 0 Encoder-12-FeedForward-Norm[0][0]
__________________________________________________________________________________________________
NSP-Dense (Dense) (None, 768) 590592 Extract[0][0]
__________________________________________________________________________________________________
For fine-tuning this model for classification tasks, we take the last layer NSP-Dense (Next Sentence Prediction-Dense) and tie its output to a new fully connected dense layer, as shown below.
# Add dense layer for classification
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(units=20, activation='softmax')(dense)
model = keras.models.Model(inputs, outputs)
The updated model looks like this (shown here with a 20-unit output layer, as used in the multi-class example later; for binary classification the output layer has 2 units):
Encoder-12-FeedForward-Norm (La (None, 128, 768) 1536 Encoder-12-FeedForward-Add[0][0]
__________________________________________________________________________________________________
Extract (Extract) (None, 768) 0 Encoder-12-FeedForward-Norm[0][0]
__________________________________________________________________________________________________
NSP-Dense (Dense) (None, 768) 590592 Extract[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 20) 15380 NSP-Dense[0][0]
==================================================================================================
Total params: 109,202,708
Trainable params: 109,202,708
Non-trainable params: 0
__________________________________________________________________________________________________
None
The size of the last fully connected dense layer is equal to the number of classification classes or labels.
So, how do we choose the activation and loss functions for text classification? For binary and multi-class text classification we use the softmax activation function with the sparse categorical cross-entropy loss, while for multi-label text classification the sigmoid activation function with the binary cross-entropy loss is more suitable.
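The sketch below shows these two head configurations side by side. It is standalone and for illustration only: a plain 768-dimensional Input stands in for BERT's pooled NSP-Dense output, and the class/label counts simply mirror the examples later in this tutorial.

import keras

# Stand-in for the 768-dimensional pooled output of the pre-trained BERT model.
features = keras.layers.Input(shape=(768,))

# Binary / multi-class head: softmax + sparse categorical cross-entropy.
num_classes = 20  # e.g. 2 for IMDB sentiment, 20 for 20 Newsgroups
softmax_out = keras.layers.Dense(units=num_classes, activation='softmax')(features)
clf = keras.models.Model(features, softmax_out)
clf.compile(optimizer='adam',
            loss='sparse_categorical_crossentropy',
            metrics=['sparse_categorical_accuracy'])

# Multi-label head: sigmoid + binary cross-entropy, one independent probability per label.
num_labels = 6  # e.g. the six toxicity labels in the Kaggle dataset
sigmoid_out = keras.layers.Dense(units=num_labels, activation='sigmoid')(features)
multilabel_clf = keras.models.Model(features, sigmoid_out)
multilabel_clf.compile(optimizer='adam',
                       loss='binary_crossentropy',
                       metrics=['accuracy'])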
According to the BERT paper, the following ranges of fine-tuning hyper-parameter values are recommended: a batch size of 16 or 32, a learning rate of 5e-5, 3e-5 or 2e-5, and 2 to 4 training epochs.
Let us take a look at working examples of binary, multi-class and multi-label text classification by fine-tuning BERT. We will use the Python keras-bert library with a TensorFlow backend and run our examples on Google Colab with GPU accelerators. Some of the code for these examples is adapted from the keras-bert documentation.
One method that is common across all the tasks is the one that prepares the training, test and validation datasets. We need a method that generates these sets in the format BERT expects for text classification.
For fine-tuning using keras-bert, each text must be converted into two inputs: its token indices (token embedding) and its segment ids (segment embedding), paired with its target label for training.
The positional embedding is derived internally and does not need to be passed explicitly.
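For instance, keras-bert's Tokenizer produces both arrays in a single call. Here is a minimal sketch with a toy vocabulary (purely illustrative; the examples below build the tokenizer from the vocab.txt file shipped with the pre-trained checkpoint):

from keras_bert import Tokenizer

SEQ_LEN = 128
# Toy vocabulary for illustration only.
token_dict = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3,
              'a': 4, 'short': 5, 'review': 6}
tokenizer = Tokenizer(token_dict)

# encode() returns the token indices and segment ids, padded/truncated to max_len.
ids, segments = tokenizer.encode('A short review', max_len=SEQ_LEN)
print(len(ids), len(segments))  # 128 128
print(ids[:5])                  # ids for [CLS], a, short, review, [SEP]
print(set(segments))            # {0} -- single-sentence classification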
To prepare these inputs we will use a method called load_data. Its arguments vary depending on the dataset format, but the processing logic and the output are the same across all tasks. The output of load_data is a tuple: the first item is a list of size two, holding the texts' token indices and their segment ids (arrays of zeros, as we classify or label only one sentence at a time); the second item is the target class, index-wise paired with the token and segment arrays.
To demonstrate using BERT with fine-tuning for binary text classification, we will use the Large Movie Review Dataset. This is a dataset for binary sentiment classification and contains a set of 25,000 highly polar movie reviews for training, and 25,000 for testing.
Let us begin with first downloading the dataset and preparing the training and test datasets.
# Required imports for the data-preparation code below.
import os
import codecs
import numpy as np
import tensorflow as tf
from tqdm import tqdm
from keras_bert import Tokenizer

# Download the pre-trained BERT-Base checkpoint (uncomment when running in Colab).
#!wget -q https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
#!unzip -o uncased_L-12_H-768_A-12.zip

# Download the IMDB movie review dataset.
dataset = tf.keras.utils.get_file(
    fname="aclImdb.tar.gz",
    origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    extract=True,
)

# Build the WordPiece vocabulary (vocab_path, SEQ_LEN and BATCH_SIZE are defined
# in the constants block shown next).
token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)

tokenizer = Tokenizer(token_dict)

def load_data(path, tagset):
    global tokenizer
    indices, sentiments = [], []
    for folder, sentiment in tagset:
        folder = os.path.join(path, folder)
        for name in tqdm(os.listdir(folder)):
            with open(os.path.join(folder, name), 'r') as reader:
                text = reader.read()
            ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
            indices.append(ids)
            sentiments.append(sentiment)
    items = list(zip(indices, sentiments))
    np.random.shuffle(items)
    indices, sentiments = zip(*items)
    indices = np.array(indices)
    # Drop the tail so the number of examples is a multiple of the batch size.
    mod = indices.shape[0] % BATCH_SIZE
    if mod > 0:
        indices, sentiments = indices[:-mod], sentiments[:-mod]
    return [indices, np.zeros_like(indices)], np.array(sentiments)

train_path = os.path.join(os.path.dirname(dataset), 'aclImdb', 'train')
test_path = os.path.join(os.path.dirname(dataset), 'aclImdb', 'test')
tagset = [('neg', 0), ('pos', 1)]
id_to_labels = {0: 'negative', 1: 'positive'}

train_x, train_y = load_data(train_path, tagset)
test_x, test_y = load_data(test_path, tagset)
Once we have our training data ready, let us define our model training hyper-parameters. We set the batch size to 16 and the learning rate to 2e-5, as recommended by the BERT paper. It is important not to set the learning rate too high, as that could prevent training from converging or cause catastrophic forgetting.
# Bert Model Constants
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 3
LR = 2e-5

pretrained_path = 'uncased_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')
The next step is to build and train the model. We first load the pre-trained BERT-Base model, then take its last layer (NSP-Dense) and connect it to a binary classification layer. The binary classification layer is essentially a fully-connected dense layer of size 2. Since this is binary classification and we want the probabilities of the output nodes to sum up to 1, we use softmax as the activation function.
import keras
from keras_bert import load_trained_model_from_checkpoint
from keras_radam import RAdam

# Load the pre-trained BERT-Base model.
model = load_trained_model_from_checkpoint(
    config_path,
    checkpoint_path,
    training=True,
    trainable=True,
    seq_len=SEQ_LEN,
)

# Add a dense layer of size 2 for binary classification.
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(units=2, activation='softmax')(dense)
model = keras.models.Model(inputs, outputs)

model.compile(
    RAdam(lr=LR),
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy'],
)

history = model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_split=0.20,
    shuffle=True,
)
Train on 19993 samples, validate on 4999 samples
Epoch 1/3
19993/19993 [==============================] - 426s 21ms/sample - loss: 0.3789 - sparse_categorical_accuracy: 0.8250 - val_loss: 0.3106 - val_sparse_categorical_accuracy: 0.8666
Epoch 2/3
19993/19993 [==============================] - 410s 20ms/sample - loss: 0.2370 - sparse_categorical_accuracy: 0.9029 - val_loss: 0.2764 - val_sparse_categorical_accuracy: 0.8852
Epoch 3/3
19993/19993 [==============================] - 408s 20ms/sample - loss: 0.1392 - sparse_categorical_accuracy: 0.9472 - val_loss: 0.3310 - val_sparse_categorical_accuracy: 0.8898
Once the training is done, let us evaluate the model.
from sklearn.metrics import accuracy_score, f1_score

predicts = model.predict(test_x, verbose=True).argmax(axis=-1)
accuracy = accuracy_score(test_y, predicts)
macro_f1 = f1_score(test_y, predicts, average='macro')
print("Accuracy: %s" % accuracy)
print("macro_f1: %s" % macro_f1)
Accuracy: 0.8842429577464789
macro_f1: 0.8841799318689518
We can save the model with model.save('model_name.h5'); a short sketch of saving and reloading the model follows the prediction example below. The following code shows how to generate predictions.
texts = [
    "It's a must watch",
    "Can't wait for it's next part!",
    'It fell short of expectations.',
    'Wish there was more to it!',
    'Just wow!',
    'Colossial waste of time',
    'Save youself from this 90 mins trauma!'
]

for text in texts:
    ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
    inpu = np.array(ids).reshape([1, SEQ_LEN])
    predicted_id = model.predict([inpu, np.zeros_like(inpu)]).argmax(axis=-1)[0]
    print("%s: %s" % (id_to_labels[predicted_id], text))
positive: It's a must watch
positive: Can't wait for it's next part!
negative: It fell short of expectations.
positive: Wish there was more to it!
positive: Just wow!
negative: Colossial waste of time
negative: Save youself from this 90 mins trauma!
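Since the fine-tuned network contains keras-bert's custom layers (and we compiled it with the RAdam optimizer), those objects need to be registered when a saved model is loaded back. A minimal sketch, assuming keras-bert's get_custom_objects helper and an arbitrary example file name:

import keras
from keras_bert import get_custom_objects
from keras_radam import RAdam

# Save the fine-tuned classifier ('imdb_bert_classifier.h5' is just an example name).
model.save('imdb_bert_classifier.h5')

# Register keras-bert's custom layers (plus RAdam) so the model can be deserialized.
custom_objects = get_custom_objects()
custom_objects['RAdam'] = RAdam
restored = keras.models.load_model('imdb_bert_classifier.h5',
                                   custom_objects=custom_objects)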
Google Colab for IMDB sentiment analysis with BERT fine-tuning.
To demonstrate multi-class text classification we will use the 20-Newsgroup dataset. It is a collection of about 20,000 newsgroup documents, spread evenly across 20 different newsgroups.
Let us first prepare the training and test datasets.
# Download the 20 Newsgroups dataset.
dataset = tf.keras.utils.get_file(
    fname="20news-18828.tar.gz",
    origin="http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz",
    extract=True,
)

tokenizer = Tokenizer(token_dict)

def load_data(path, tagset):
    global tokenizer
    indices, labels = [], []
    for folder, label in tagset:
        folder = os.path.join(path, folder)
        for name in tqdm(os.listdir(folder)):
            with open(os.path.join(folder, name), 'r', encoding="utf-8", errors='ignore') as reader:
                text = reader.read()
            ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
            indices.append(ids)
            labels.append(label)
    items = list(zip(indices, labels))
    np.random.shuffle(items)
    indices, labels = zip(*items)
    indices = np.array(indices)
    mod = indices.shape[0] % BATCH_SIZE
    if mod > 0:
        indices, labels = indices[:-mod], labels[:-mod]
    return [indices, np.zeros_like(indices)], np.array(labels)

path = os.path.join(os.path.dirname(dataset), '20news-18828')
tagset = [(x, i) for i, x in enumerate(os.listdir(path))]
id_to_labels = {id_: label for label, id_ in tagset}

# Load data, split 80-20 for training/testing.
all_x, all_y = load_data(path, tagset)
train_perc = 0.8
total = len(all_y)
n_train = int(train_perc * total)
n_test = (total - n_train)
test_x = [all_x[0][n_train:], all_x[1][n_train:]]
train_x = [all_x[0][:n_train], all_x[1][:n_train]]
train_y, test_y = all_y[:n_train], all_y[n_train:]
print("# Total: %s, # Train: %s, # Test: %s" % (total, n_train, n_test))
# Total: 18816, # Train: 15052, # Test: 3764
Next, we build and train our model. We use the recommended BERT fine-tuning parameters and train our model for 4 epochs. The classification layer added on top of the pre-trained BERT model is a fully-connected dense layer of size 20 (one unit per output class).
# pip install -q keras-bert keras-rectified-adam

# Bert Model Constants
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 4
LR = 2e-5

pretrained_path = 'uncased_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')

model = load_trained_model_from_checkpoint(
    config_path,
    checkpoint_path,
    training=True,
    trainable=True,
    seq_len=SEQ_LEN,
)

# Add dense layer for classification
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(units=20, activation='softmax')(dense)
model = keras.models.Model(inputs, outputs)

model.compile(
    RAdam(lr=LR),
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy'],
)

history = model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_split=0.20,
    shuffle=True,
)
Train on 12041 samples, validate on 3011 samples
Epoch 1/4
12041/12041 [==============================] - 765s 64ms/sample - loss: 1.6826 - sparse_categorical_accuracy: 0.5052 - val_loss: 0.6773 - val_sparse_categorical_accuracy: 0.7948
Epoch 2/4
12041/12041 [==============================] - 749s 62ms/sample - loss: 0.4951 - sparse_categorical_accuracy: 0.8481 - val_loss: 0.4421 - val_sparse_categorical_accuracy: 0.8698
Epoch 3/4
12041/12041 [==============================] - 748s 62ms/sample - loss: 0.2534 - sparse_categorical_accuracy: 0.9239 - val_loss: 0.3752 - val_sparse_categorical_accuracy: 0.8947
Epoch 4/4
12041/12041 [==============================] - 746s 62ms/sample - loss: 0.1386 - sparse_categorical_accuracy: 0.9588 - val_loss: 0.3471 - val_sparse_categorical_accuracy: 0.9083
Once we have our model trained, let us evaluate it and use it for multi-class labelling.
from sklearn.metrics import accuracy_score, f1_score

predicts = model.predict(test_x, verbose=True).argmax(axis=-1)
accuracy = accuracy_score(test_y, predicts)
macro_f1 = f1_score(test_y, predicts, average='macro')
print("Accuracy: %s" % accuracy)
print("macro_f1: %s" % macro_f1)
Accuracy: 0.9024973432518597
macro_f1: 0.9001928370898599
Predict newsgroup labels with the trained model.
texts = [
    'Who scored the maximum goals?',
    'Mars might have water and dragons!',
    'CPU is over-clocked, causing it to heating too much!',
    'I need to buy new prescriptions.',
    'This is just government propaganda.'
]

for text in texts:
    ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
    inpu = np.array(ids).reshape([1, SEQ_LEN])
    predicted_id = model.predict([inpu, np.zeros_like(inpu)]).argmax(axis=-1)[0]
    print("%s: %s" % (id_to_labels[predicted_id], text))
rec.sport.hockey: Who scored the maximum goals?
sci.space: Mars might have water and dragons!
comp.sys.ibm.pc.hardware: CPU is over-clocked, causing it to heating too much!
sci.med: I need to buy new prescriptions.
talk.politics.misc: This is just government propaganda.
Google Colab for 20 Newsgroup Multi-class Text Classification using BERT
To demonstrate multi-label text classification we will use the Toxic Comment Classification dataset. It is a Kaggle dataset of Wikipedia comments which have been labeled by human raters for toxic behaviour. The different types of toxicity are: toxic, severe_toxic, obscene, threat, insult and identity_hate. Each comment can have none, one or more types of toxicity. The dataset has over 100,000 labelled comments, but for this tutorial we will use 25% of it to keep training memory and time requirements manageable.
Let us first build the training and test datasets.
import pandas as pd
from google.colab import drive

drive.mount('/content/gdrive')
RESOURCE_DIR = "/content/gdrive/My\ Drive/resources"

# Train/test files
datasets_dir = "%s/datasets/jigsaw-toxic-comment-classification-challenge" % (RESOURCE_DIR)
test_datapath = "%s/test.csv" % (datasets_dir)
test_labels = "%s/test_labels.csv" % (datasets_dir)
train_datapath = "%s/train.csv" % (datasets_dir)

tokenizer = Tokenizer(token_dict)

def load_data(comments, comment_labels):
    global tokenizer
    indices, labels = [], []
    for x in range(comments.shape[0]):
        ids, segments = tokenizer.encode(comments[x], max_len=SEQ_LEN)
        indices.append(ids)
        labels.append(comment_labels[x])
    items = list(zip(indices, labels))
    np.random.shuffle(items)
    indices, labels = zip(*items)
    indices = np.array(indices)
    mod = indices.shape[0] % BATCH_SIZE
    if mod > 0:
        indices, labels = indices[:-mod], labels[:-mod]
    return [indices, np.zeros_like(indices)], np.array(labels)

# Use a 25% sample of the training data to keep memory and time manageable.
train_df = pd.read_csv(train_datapath.replace('\\', ''))
train_df = train_df.sample(frac=0.25, random_state=42)
train_lines = train_df['comment_text'].values

labels_ordered = [
    'toxic',
    'severe_toxic',
    'obscene',
    'threat',
    'insult',
    'identity_hate'
]
train_labels = train_df[labels_ordered].values

train_x, train_y = load_data(train_lines, train_labels)
Next, we build the model and train it. The multi-label classification layer is a fully-connected dense layer of size 6 (one unit per possible label), and we use the sigmoid activation function to get independent probabilities for each label.
# The training log below corresponds to a 2-epoch run.
EPOCHS = 2

model = load_trained_model_from_checkpoint(
    config_path.replace('\\', ''),
    checkpoint_path.replace('\\', ''),
    training=True,
    trainable=True,
    seq_len=SEQ_LEN,
)

# Add dense layer for multi-label classification
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(
    units=len(labels_ordered),
    activation='sigmoid',
    name='Toxic-Categories-Dense'
)(dense)
model = keras.models.Model(inputs, outputs)

model.compile(
    RAdam(lr=LR),
    loss='binary_crossentropy',
    metrics=['accuracy'],
)

history = model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_split=0.33,
    shuffle=True,
)
Train on 26724 samples, validate on 13164 samples
Epoch 1/2
26724/26724 [==============================] - 1251s 47ms/sample - loss: 0.0858 - acc: 0.9660 - val_loss: 0.0450 - val_acc: 0.9822
Epoch 2/2
26724/26724 [==============================] - 1235s 46ms/sample - loss: 0.0404 - acc: 0.9845 - val_loss: 0.0431 - val_acc: 0.9827
We see that in just 2 epochs, our model achieved about 98% accuracy on the validation set. We can save this model and use it to generate labels as follows:
texts = [
    'You are an idiot!',
    'You are a drug addict!',
    'I will kill you!',
    'I want to goto London',
]

for text in texts:
    ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
    inpu = np.array(ids).reshape([1, SEQ_LEN])
    predicted = (model.predict([inpu, np.zeros_like(inpu)]) >= 0.5).astype(int)
    labels = [
        label
        for i, label in enumerate(labels_ordered)
        if predicted[0][i]
    ]
    print("%s: %s" % (text, labels))
You are an idiot!: ['toxic', 'obscene', 'insult']
You are a drug addict!: ['toxic']
I will kill you!: ['toxic', 'threat']
I want to goto London: []
Google Colab for Toxic Comment Classification with BERT fine-tuning.
In this tutorial, we learnt how to use BERT with fine-tuning for text classification. We saw how, with a pre-trained BERT model and just one additional classification layer, we can achieve high classification accuracy on different text classification tasks. BERT proves to be a very powerful language model and can be of immense value for text classification tasks.