In this tutorial, we will learn how to use BERT for text classification. We will begin with a brief introduction to BERT, its architecture and its fine-tuning mechanism. Then we will learn how to fine-tune BERT for text classification on the following tasks: binary classification (IMDB movie review sentiment), multi-class classification (20 Newsgroups) and multi-label classification (toxic comment classification).
We will use BERT through the keras-bert Python library, and train and test our model on GPUs provided by Google Colab, with a TensorFlow backend.
BERT stands for Bidirectional Encoder Representations from Transformers. It is a deep-learning-based unsupervised language representation model developed by researchers at Google AI Language, and the first deeply bidirectional unsupervised language model. Language models before BERT learnt from text sequences in either a left-to-right or a combined left-to-right and right-to-left context; they were therefore either not bidirectional, or not bidirectional in all layers. The diagram below compares BERT's bidirectional architecture with other language models.
Deep bi-directionality in BERT (Source)
BERT achieves deep bi-directionality in learning representations using a novel Masked Language Model (MLM) approach. This allows BERT to learn a word's representation from both its left and its right context. Under the hood, BERT uses the Transformer's attention mechanism for bidirectional training. With this approach, BERT achieved state-of-the-art results on a series of natural language processing and understanding tasks.
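To make the MLM idea concrete, here is a minimal, simplified sketch (not from the original BERT code) of how a fraction of input tokens could be masked so that the model must predict them from the surrounding context on both sides. The 15% masking rate follows the BERT paper; the real procedure additionally keeps or randomly replaces some of the selected tokens.

import random

def mask_tokens(tokens, mask_rate=0.15, mask_token='[MASK]'):
    # Replace a random subset of tokens with [MASK]; during pre-training BERT
    # is asked to recover the original token at each masked position, using
    # context from both the left and the right.
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets.append(tok)    # token the model must predict
        else:
            masked.append(tok)
            targets.append(None)   # position excluded from the MLM loss
    return masked, targets

masked, targets = mask_tokens("the quick brown fox jumps over the lazy dog".split())
print(masked)
print(targets)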
Before diving into using BERT for text classification, let us take a quick look at BERT's architecture. BERT is a multi-layer bidirectional Transformer encoder. The diagram below shows a 12-layer BERT model (the BERT-Base version). Note that each Transformer encoder layer is built around the attention mechanism.
There are multiple pre-trained model versions with varying numbers of encoder layers, attention heads and hidden size dimensions available. Below is a list of different model variants available.
H = Hidden size.
A = Number of self-attention heads.
L = Number of layers (Transformer blocks).
The largest model available is BERT-Large, which has 24 layers, 16 attention heads and 1024-dimensional output hidden vectors. For each model, cased and uncased variants are available. In this tutorial we will use BERT-Base, which has 12 encoder layers, 12 attention heads and 768-dimensional hidden representations.
BERT can be used for text classification in three ways: by fine-tuning the pre-trained model on the task dataset, by using it as a feature extractor, or by generating fixed token embeddings and feeding them to a separate classifier. The snippet below shows how to generate token embeddings with the bert-embedding library.
# Source: https://pypi.org/project/bert-embedding/
# pip install bert-embedding

from bert_embedding import BertEmbedding

text = "A tutorial on how to generate token embeddings using BERT"

bert_embedding = BertEmbedding()
result = bert_embedding(text.split('\n'))

# result holds one entry per sentence; first_sentence[1] holds the BERT
# embeddings for the tokenized sentence.
first_sentence = result[0]
embedding = first_sentence[1]
print(embedding)
# array([ 0.4805648 ,  0.18369392, -0.28554988, ..., -0.01961522,
#         1.0207764 , -0.67167974], dtype=float32)
So which approach should we choose for text classification with BERT? The answer depends on the performance requirements and the amount of effort we wish to put in, in terms of resources and time. The fine-tuning and feature-based extraction approaches require training, testing and validating on a GPU or TPU and are therefore more time-consuming and resource-intensive than the embedding-based approach. However, they are expected to yield better results, as they benefit from bidirectional contextual representations of whole sentences, tuned specifically for the task at hand.
The BERT paper recommends fine-tuning for better results. A few advantages of fine-tuning BERT are as follows:
So, what is the input to BERT? The input to BERT is an embedding representation derived by summing the token embedding, the segment embedding and the position embedding of the text.
What are the token, segment and position embeddings?
Note that each of the embeddings (token, position and segment) being summed to derive the input has dimension (SEQ_LEN x Hidden-Size). The SEQ_LEN value can be changed and is decided based on the length of the sentences in the downstream task dataset; sentences shorter than the sequence length are padded. The Hidden-Size (H) is determined by the choice of BERT model (BERT Tiny, Small, Base, Large, etc.).
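As a rough, self-contained illustration (not part of the original tutorial), the sketch below sums three stand-in embedding tables to produce the (SEQ_LEN x Hidden-Size) input representation described above; the vocabulary and segment-table sizes are assumptions for the uncased BERT-Base model.

import numpy as np

SEQ_LEN, HIDDEN = 128, 768            # dimensions used throughout this tutorial
VOCAB_SIZE, NUM_SEGMENTS = 30522, 2   # assumed sizes for uncased BERT-Base

# Stand-in embedding tables; in the real model these are learned weights.
token_table = np.random.randn(VOCAB_SIZE, HIDDEN)
segment_table = np.random.randn(NUM_SEGMENTS, HIDDEN)
position_table = np.random.randn(SEQ_LEN, HIDDEN)

token_ids = np.zeros(SEQ_LEN, dtype=int)    # padded token indices for one text
segment_ids = np.zeros(SEQ_LEN, dtype=int)  # all zeros for single-sentence tasks

# The BERT input representation is the element-wise sum of the three embeddings.
input_embedding = (token_table[token_ids]
                   + segment_table[segment_ids]
                   + position_table[np.arange(SEQ_LEN)])
print(input_embedding.shape)  # (128, 768) == (SEQ_LEN, Hidden-Size)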
To fine-tune BERT for text classification, we take a pre-trained BERT model, apply an additional fully-connected dense layer on top of its output layer and train the entire model on the task dataset. The diagram below shows how BERT is used for text classification:
Note that only the final hidden state corresponding to the class token ([CLS]) is used as the aggregate sequence representation fed into the fully-connected dense layer for classification. To understand this better, let us look at the last layers of BERT (BERT-Base, 12 layers).
Encoder-11-FeedForward-Norm (La (None, 128, 768) 1536 Encoder-11-FeedForward-Add[0][0]
__________________________________________________________________________________________________
Encoder-12-MultiHeadSelfAttenti (None, 128, 768) 2362368 Encoder-11-FeedForward-Norm[0][0]
__________________________________________________________________________________________________
Encoder-12-MultiHeadSelfAttenti (None, 128, 768) 0 Encoder-12-MultiHeadSelfAttention
__________________________________________________________________________________________________
Encoder-12-MultiHeadSelfAttenti (None, 128, 768) 0 Encoder-11-FeedForward-Norm[0][0]
Encoder-12-MultiHeadSelfAttention
__________________________________________________________________________________________________
Encoder-12-MultiHeadSelfAttenti (None, 128, 768) 1536 Encoder-12-MultiHeadSelfAttention
__________________________________________________________________________________________________
Encoder-12-FeedForward (FeedFor (None, 128, 768) 4722432 Encoder-12-MultiHeadSelfAttention
__________________________________________________________________________________________________
Encoder-12-FeedForward-Dropout (None, 128, 768) 0 Encoder-12-FeedForward[0][0]
__________________________________________________________________________________________________
Encoder-12-FeedForward-Add (Add (None, 128, 768) 0 Encoder-12-MultiHeadSelfAttention
Encoder-12-FeedForward-Dropout[0]
__________________________________________________________________________________________________
Encoder-12-FeedForward-Norm (La (None, 128, 768) 1536 Encoder-12-FeedForward-Add[0][0]
__________________________________________________________________________________________________
Extract (Extract) (None, 768) 0 Encoder-12-FeedForward-Norm[0][0]
__________________________________________________________________________________________________
NSP-Dense (Dense) (None, 768) 590592 Extract[0][0]
__________________________________________________________________________________________________
For fine-tuning this model for classification tasks, we take the last layer NSP-Dense (Next Sentence Prediction-Dense) and tie its output to a new fully connected dense layer, as shown below.
# Add dense layer for classification
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(units=20, activation='softmax')(dense)
model = keras.models.Model(inputs, outputs)
The updated model looks like this (shown here with a 20-unit output layer, as used in the multi-class example later; for binary classification the output layer has 2 units):
Encoder-12-FeedForward-Norm (La (None, 128, 768) 1536 Encoder-12-FeedForward-Add[0][0]
__________________________________________________________________________________________________
Extract (Extract) (None, 768) 0 Encoder-12-FeedForward-Norm[0][0]
__________________________________________________________________________________________________
NSP-Dense (Dense) (None, 768) 590592 Extract[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 20) 15380 NSP-Dense[0][0]
==================================================================================================
Total params: 109,202,708
Trainable params: 109,202,708
Non-trainable params: 0
__________________________________________________________________________________________________
None
The size of the last fully connected dense layer is equal to the number of classification classes or labels.
So, how do we choose the activation and loss functions for text classification? For binary and multi-class text classification we use the softmax activation function with the sparse categorical cross-entropy loss, while for multi-label text classification the sigmoid activation function with the binary cross-entropy loss is more suitable.
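The sketch below shows these two head configurations side by side. It is standalone and for illustration only: a plain 768-dimensional Input stands in for BERT's pooled NSP-Dense output, and the class/label counts simply mirror the examples later in this tutorial.

import keras

# Stand-in for the 768-dimensional pooled output of the pre-trained BERT model.
features = keras.layers.Input(shape=(768,))

# Binary / multi-class head: softmax + sparse categorical cross-entropy.
num_classes = 20  # e.g. 2 for IMDB sentiment, 20 for 20 Newsgroups
softmax_out = keras.layers.Dense(units=num_classes, activation='softmax')(features)
clf = keras.models.Model(features, softmax_out)
clf.compile(optimizer='adam',
            loss='sparse_categorical_crossentropy',
            metrics=['sparse_categorical_accuracy'])

# Multi-label head: sigmoid + binary cross-entropy, one independent probability per label.
num_labels = 6  # e.g. the six toxicity labels in the Kaggle dataset
sigmoid_out = keras.layers.Dense(units=num_labels, activation='sigmoid')(features)
multilabel_clf = keras.models.Model(features, sigmoid_out)
multilabel_clf.compile(optimizer='adam',
                       loss='binary_crossentropy',
                       metrics=['accuracy'])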
According to the BERT paper, the following ranges of fine-tuning hyper-parameter values are recommended: a batch size of 16 or 32, a learning rate of 5e-5, 3e-5 or 2e-5, and 2 to 4 training epochs.
Let us take a look at working examples of binary, multi-class and multi-label text classification by fine-tuning BERT. We will use the Python keras-bert library with a TensorFlow backend and run our examples on Google Colab with GPU accelerators. Some of the code for these examples is adapted from the keras-bert documentation.
One method that is common across all the tasks is the one that prepares the training, test and validation datasets. We need a method that generates these sets in the format BERT expects for text classification.
For fine-tuning using keras-bert, each text must be converted into two inputs: its token indices (token embedding) and its segment ids (segment embedding), paired with its target label for training.
The positional embedding is derived internally and does not need to be passed explicitly.
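For instance, keras-bert's Tokenizer produces both arrays in a single call. Here is a minimal sketch with a toy vocabulary (purely illustrative; the examples below build the tokenizer from the vocab.txt file shipped with the pre-trained checkpoint):

from keras_bert import Tokenizer

SEQ_LEN = 128
# Toy vocabulary for illustration only.
token_dict = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3,
              'a': 4, 'short': 5, 'review': 6}
tokenizer = Tokenizer(token_dict)

# encode() returns the token indices and segment ids, padded/truncated to max_len.
ids, segments = tokenizer.encode('A short review', max_len=SEQ_LEN)
print(len(ids), len(segments))  # 128 128
print(ids[:5])                  # ids for [CLS], a, short, review, [SEP]
print(set(segments))            # {0} -- single-sentence classification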
To prepare these inputs we will use a method called load_data. Its arguments vary depending on the dataset format, but the processing logic and the output are the same across all tasks. The output of load_data is a tuple: the first item is a list of size two, holding the texts' token indices and their segment ids (arrays of zeros, as we classify or label only one sentence at a time); the second item is the target class, index-wise paired with the token and segment arrays.
To demonstrate using BERT with fine-tuning for binary text classification, we will use the Large Movie Review Dataset. This is a dataset for binary sentiment classification and contains a set of 25,000 highly polar movie reviews for training, and 25,000 for testing.
Let us begin with first downloading the dataset and preparing the training and test datasets.
# Required imports for the data-preparation code below.
import os
import codecs
import numpy as np
import tensorflow as tf
from tqdm import tqdm
from keras_bert import Tokenizer

# Download the pre-trained BERT-Base checkpoint (uncomment when running in Colab).
#!wget -q https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
#!unzip -o uncased_L-12_H-768_A-12.zip

# Download the IMDB movie review dataset.
dataset = tf.keras.utils.get_file(
    fname="aclImdb.tar.gz",
    origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    extract=True,
)

# Build the WordPiece vocabulary (vocab_path, SEQ_LEN and BATCH_SIZE are defined
# in the constants block shown next).
token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)

tokenizer = Tokenizer(token_dict)

def load_data(path, tagset):
    global tokenizer
    indices, sentiments = [], []
    for folder, sentiment in tagset:
        folder = os.path.join(path, folder)
        for name in tqdm(os.listdir(folder)):
            with open(os.path.join(folder, name), 'r') as reader:
                text = reader.read()
            ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
            indices.append(ids)
            sentiments.append(sentiment)
    items = list(zip(indices, sentiments))
    np.random.shuffle(items)
    indices, sentiments = zip(*items)
    indices = np.array(indices)
    # Drop the tail so the number of examples is a multiple of the batch size.
    mod = indices.shape[0] % BATCH_SIZE
    if mod > 0:
        indices, sentiments = indices[:-mod], sentiments[:-mod]
    return [indices, np.zeros_like(indices)], np.array(sentiments)

train_path = os.path.join(os.path.dirname(dataset), 'aclImdb', 'train')
test_path = os.path.join(os.path.dirname(dataset), 'aclImdb', 'test')
tagset = [('neg', 0), ('pos', 1)]
id_to_labels = {0: 'negative', 1: 'positive'}

train_x, train_y = load_data(train_path, tagset)
test_x, test_y = load_data(test_path, tagset)
Once we have our training data ready, let us define our model training hyper-parameters. We set the batch size to 16 and the learning rate to 2e-5, as recommended by the BERT paper. It is important not to set the learning rate too high, as that could prevent training from converging or cause catastrophic forgetting.
# Bert Model Constants
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 3
LR = 2e-5

pretrained_path = 'uncased_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')
The next step is to build and train the model. We first load the pre-trained BERT-Base model, then take its last layer (NSP-Dense) and connect it to a binary classification layer. The binary classification layer is essentially a fully-connected dense layer of size 2. Since this is binary classification and we want the probabilities of the output nodes to sum up to 1, we use softmax as the activation function.
import keras
from keras_bert import load_trained_model_from_checkpoint
from keras_radam import RAdam

# Load the pre-trained BERT-Base model.
model = load_trained_model_from_checkpoint(
    config_path,
    checkpoint_path,
    training=True,
    trainable=True,
    seq_len=SEQ_LEN,
)

# Add a dense layer of size 2 for binary classification.
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(units=2, activation='softmax')(dense)
model = keras.models.Model(inputs, outputs)

model.compile(
    RAdam(lr=LR),
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy'],
)

history = model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_split=0.20,
    shuffle=True,
)
Train on 19993 samples, validate on 4999 samples
Epoch 1/3
19993/19993 [==============================] - 426s 21ms/sample - loss: 0.3789 - sparse_categorical_accuracy: 0.8250 - val_loss: 0.3106 - val_sparse_categorical_accuracy: 0.8666
Epoch 2/3
19993/19993 [==============================] - 410s 20ms/sample - loss: 0.2370 - sparse_categorical_accuracy: 0.9029 - val_loss: 0.2764 - val_sparse_categorical_accuracy: 0.8852
Epoch 3/3
19993/19993 [==============================] - 408s 20ms/sample - loss: 0.1392 - sparse_categorical_accuracy: 0.9472 - val_loss: 0.3310 - val_sparse_categorical_accuracy: 0.8898
Once the training is done, let us evaluate the model.
from sklearn.metrics import accuracy_score, f1_score

predicts = model.predict(test_x, verbose=True).argmax(axis=-1)
accuracy = accuracy_score(test_y, predicts)
macro_f1 = f1_score(test_y, predicts, average='macro')
print("Accuracy: %s" % accuracy)
print("macro_f1: %s" % macro_f1)
Accuracy: 0.8842429577464789
macro_f1: 0.8841799318689518
We can save the model with model.save('model_name.h5'); a short sketch of saving and reloading the model follows the prediction example below. The following code shows how to generate predictions.
texts = [
    "It's a must watch",
    "Can't wait for it's next part!",
    'It fell short of expectations.',
    'Wish there was more to it!',
    'Just wow!',
    'Colossial waste of time',
    'Save youself from this 90 mins trauma!'
]

for text in texts:
    ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
    inpu = np.array(ids).reshape([1, SEQ_LEN])
    predicted_id = model.predict([inpu, np.zeros_like(inpu)]).argmax(axis=-1)[0]
    print("%s: %s" % (id_to_labels[predicted_id], text))
positive: It's a must watch
positive: Can't wait for it's next part!
negative: It fell short of expectations.
positive: Wish there was more to it!
positive: Just wow!
negative: Colossial waste of time
negative: Save youself from this 90 mins trauma!
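Since the fine-tuned network contains keras-bert's custom layers (and we compiled it with the RAdam optimizer), those objects need to be registered when a saved model is loaded back. A minimal sketch, assuming keras-bert's get_custom_objects helper and an arbitrary example file name:

import keras
from keras_bert import get_custom_objects
from keras_radam import RAdam

# Save the fine-tuned classifier ('imdb_bert_classifier.h5' is just an example name).
model.save('imdb_bert_classifier.h5')

# Register keras-bert's custom layers (plus RAdam) so the model can be deserialized.
custom_objects = get_custom_objects()
custom_objects['RAdam'] = RAdam
restored = keras.models.load_model('imdb_bert_classifier.h5',
                                   custom_objects=custom_objects)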
Google Colab for IMDB sentiment analysis with BERT fine-tuning.
To demonstrate multi-class text classification we will use the 20-Newsgroup dataset. It is a collection of about 20,000 newsgroup documents, spread evenly across 20 different newsgroups.
Let us first prepare the training and test datasets.
# Download the 20 Newsgroups dataset.
dataset = tf.keras.utils.get_file(
    fname="20news-18828.tar.gz",
    origin="http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz",
    extract=True,
)

tokenizer = Tokenizer(token_dict)

def load_data(path, tagset):
    global tokenizer
    indices, labels = [], []
    for folder, label in tagset:
        folder = os.path.join(path, folder)
        for name in tqdm(os.listdir(folder)):
            with open(os.path.join(folder, name), 'r', encoding="utf-8", errors='ignore') as reader:
                text = reader.read()
            ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
            indices.append(ids)
            labels.append(label)
    items = list(zip(indices, labels))
    np.random.shuffle(items)
    indices, labels = zip(*items)
    indices = np.array(indices)
    mod = indices.shape[0] % BATCH_SIZE
    if mod > 0:
        indices, labels = indices[:-mod], labels[:-mod]
    return [indices, np.zeros_like(indices)], np.array(labels)

path = os.path.join(os.path.dirname(dataset), '20news-18828')
tagset = [(x, i) for i, x in enumerate(os.listdir(path))]
id_to_labels = {id_: label for label, id_ in tagset}

# Load data, split 80-20 for training/testing.
all_x, all_y = load_data(path, tagset)
train_perc = 0.8
total = len(all_y)
n_train = int(train_perc * total)
n_test = (total - n_train)
test_x = [all_x[0][n_train:], all_x[1][n_train:]]
train_x = [all_x[0][:n_train], all_x[1][:n_train]]
train_y, test_y = all_y[:n_train], all_y[n_train:]
print("# Total: %s, # Train: %s, # Test: %s" % (total, n_train, n_test))
# Total: 18816, # Train: 15052, # Test: 3764
Next, we build and train our model. We use the recommended BERT fine-tuning parameters and train our model for 4 epochs. The classification layer added on top of the pre-trained BERT model is a fully-connected dense layer of size 20 (one unit per output class).
# pip install -q keras-bert keras-rectified-adam

# Bert Model Constants
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 4
LR = 2e-5

pretrained_path = 'uncased_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')

model = load_trained_model_from_checkpoint(
    config_path,
    checkpoint_path,
    training=True,
    trainable=True,
    seq_len=SEQ_LEN,
)

# Add dense layer for classification
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(units=20, activation='softmax')(dense)
model = keras.models.Model(inputs, outputs)

model.compile(
    RAdam(lr=LR),
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy'],
)

history = model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_split=0.20,
    shuffle=True,
)
Train on 12041 samples, validate on 3011 samples
Epoch 1/4
12041/12041 [==============================] - 765s 64ms/sample - loss: 1.6826 - sparse_categorical_accuracy: 0.5052 - val_loss: 0.6773 - val_sparse_categorical_accuracy: 0.7948
Epoch 2/4
12041/12041 [==============================] - 749s 62ms/sample - loss: 0.4951 - sparse_categorical_accuracy: 0.8481 - val_loss: 0.4421 - val_sparse_categorical_accuracy: 0.8698
Epoch 3/4
12041/12041 [==============================] - 748s 62ms/sample - loss: 0.2534 - sparse_categorical_accuracy: 0.9239 - val_loss: 0.3752 - val_sparse_categorical_accuracy: 0.8947
Epoch 4/4
12041/12041 [==============================] - 746s 62ms/sample - loss: 0.1386 - sparse_categorical_accuracy: 0.9588 - val_loss: 0.3471 - val_sparse_categorical_accuracy: 0.9083
Once we have our model trained, let us evaluate it and use it for multi-class labelling.
from sklearn.metrics import accuracy_score, f1_score

predicts = model.predict(test_x, verbose=True).argmax(axis=-1)
accuracy = accuracy_score(test_y, predicts)
macro_f1 = f1_score(test_y, predicts, average='macro')
print("Accuracy: %s" % accuracy)
print("macro_f1: %s" % macro_f1)
Accuracy: 0.9024973432518597
macro_f1: 0.9001928370898599
Predict newsgroup labels with the trained model.
texts = [
    'Who scored the maximum goals?',
    'Mars might have water and dragons!',
    'CPU is over-clocked, causing it to heating too much!',
    'I need to buy new prescriptions.',
    'This is just government propaganda.'
]

for text in texts:
    ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
    inpu = np.array(ids).reshape([1, SEQ_LEN])
    predicted_id = model.predict([inpu, np.zeros_like(inpu)]).argmax(axis=-1)[0]
    print("%s: %s" % (id_to_labels[predicted_id], text))
rec.sport.hockey: Who scored the maximum goals?
sci.space: Mars might have water and dragons!
comp.sys.ibm.pc.hardware: CPU is over-clocked, causing it to heating too much!
sci.med: I need to buy new prescriptions.
talk.politics.misc: This is just government propaganda.
Google Colab for 20 Newsgroup Multi-class Text Classification using BERT
To demonstrate multi-label text classification we will use the Toxic Comment Classification dataset. It is a Kaggle dataset of Wikipedia comments which have been labeled by human raters for toxic behaviour. The different types of toxicity are: toxic, severe_toxic, obscene, threat, insult and identity_hate. Each comment can have none, one or more types of toxicity. The dataset has over 100,000 labelled comments, but for this tutorial we will use 25% of it to keep training memory and time requirements manageable.
Let us first build the training and test datasets.
import pandas as pd
from google.colab import drive

drive.mount('/content/gdrive')
RESOURCE_DIR = "/content/gdrive/My\ Drive/resources"

# Train/test files
datasets_dir = "%s/datasets/jigsaw-toxic-comment-classification-challenge" % (RESOURCE_DIR)
test_datapath = "%s/test.csv" % (datasets_dir)
test_labels = "%s/test_labels.csv" % (datasets_dir)
train_datapath = "%s/train.csv" % (datasets_dir)

tokenizer = Tokenizer(token_dict)

def load_data(comments, comment_labels):
    global tokenizer
    indices, labels = [], []
    for x in range(comments.shape[0]):
        ids, segments = tokenizer.encode(comments[x], max_len=SEQ_LEN)
        indices.append(ids)
        labels.append(comment_labels[x])
    items = list(zip(indices, labels))
    np.random.shuffle(items)
    indices, labels = zip(*items)
    indices = np.array(indices)
    mod = indices.shape[0] % BATCH_SIZE
    if mod > 0:
        indices, labels = indices[:-mod], labels[:-mod]
    return [indices, np.zeros_like(indices)], np.array(labels)

# Use a 25% sample of the training data to keep memory and time manageable.
train_df = pd.read_csv(train_datapath.replace('\\', ''))
train_df = train_df.sample(frac=0.25, random_state=42)
train_lines = train_df['comment_text'].values

labels_ordered = [
    'toxic',
    'severe_toxic',
    'obscene',
    'threat',
    'insult',
    'identity_hate'
]
train_labels = train_df[labels_ordered].values

train_x, train_y = load_data(train_lines, train_labels)
Next, we build the model and train it. The multi-label classification layer is a fully-connected dense layer of size 6 (one unit per possible label), and we use the sigmoid activation function to get independent probabilities for each label.
# The training log below corresponds to a 2-epoch run.
EPOCHS = 2

model = load_trained_model_from_checkpoint(
    config_path.replace('\\', ''),
    checkpoint_path.replace('\\', ''),
    training=True,
    trainable=True,
    seq_len=SEQ_LEN,
)

# Add dense layer for multi-label classification
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(
    units=len(labels_ordered),
    activation='sigmoid',
    name='Toxic-Categories-Dense'
)(dense)
model = keras.models.Model(inputs, outputs)

model.compile(
    RAdam(lr=LR),
    loss='binary_crossentropy',
    metrics=['accuracy'],
)

history = model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_split=0.33,
    shuffle=True,
)
Train on 26724 samples, validate on 13164 samples
Epoch 1/2
26724/26724 [==============================] - 1251s 47ms/sample - loss: 0.0858 - acc: 0.9660 - val_loss: 0.0450 - val_acc: 0.9822
Epoch 2/2
26724/26724 [==============================] - 1235s 46ms/sample - loss: 0.0404 - acc: 0.9845 - val_loss: 0.0431 - val_acc: 0.9827
We see that in just 2 epochs, our model achieved about 98% accuracy on the validation set. We can save this model and use it to generate labels as follows:
texts = [
    'You are an idiot!',
    'You are a drug addict!',
    'I will kill you!',
    'I want to goto London',
]

for text in texts:
    ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
    inpu = np.array(ids).reshape([1, SEQ_LEN])
    predicted = (model.predict([inpu, np.zeros_like(inpu)]) >= 0.5).astype(int)
    labels = [
        label
        for i, label in enumerate(labels_ordered)
        if predicted[0][i]
    ]
    print("%s: %s" % (text, labels))
You are an idiot!: ['toxic', 'obscene', 'insult']
You are a drug addict!: ['toxic']
I will kill you!: ['toxic', 'threat']
I want to goto London: []
Google Colab for Toxic Comment Classification with BERT fine-tuning.
In this tutorial, we learnt how to use BERT with fine-tuning for text classification. We saw how, with a pre-trained BERT model and just one additional classification layer, we can achieve high classification accuracy on different text classification tasks. BERT proves to be a very powerful language model and can be of immense value for text classification tasks.