An Overview of Pre-trained Models in NLP

 & Rishi Prakash


Compared with many other machine learning (ML) tasks, natural language processing (NLP) has always been a challenge for data scientists because human languages are complex. Sentence syntax, semantic meaning, parts of speech, grammar, and homonyms all differ from one language to another. Nevertheless, several methods have been developed to tackle each of these difficulties. This blog discusses a few of the latest developments in NLP.

Bidirectional Encoder Representations from Transformers (BERT)

BERT is an open-source, pre-trained NLP model based on the transformer architecture. Its primary purpose is to help machines understand the meaning of a word from the words that surround it. The model was pre-trained on unlabelled text from the BooksCorpus and English Wikipedia and can be fine-tuned by adding task-specific layers based on the user’s requirements. The approach to pre-training includes:

  • Masked Language Modelling: Some words in the input are masked, and the objective is to predict them from the surrounding words.
  • Next Sentence Prediction: Pairs of sentences are provided as inputs. In some pairs the second sentence follows the first in the original text, and in others it is a random sentence. The model predicts whether the second sentence follows the first.
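To make the two objectives concrete, here is a minimal sketch of how training examples for them could be built. It is not BERT’s actual data pipeline: the function names are hypothetical, the 15% masking rate and the 50/50 split between real and random sentence pairs are treated as assumptions, and BERT’s extra step of sometimes keeping or randomly replacing a selected token is left out.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Masked language modelling: hide some tokens and keep the originals as labels.

    Returns (masked_tokens, labels); labels[i] is the original token where
    position i was masked and None everywhere else.
    """
    if rng is None:
        rng = random.Random(0)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

def make_nsp_pair(sentences, index, rng=None):
    """Next sentence prediction: build (sentence_a, sentence_b, is_next).

    Assumes the document has at least three sentences so that a random
    negative sentence is always available.
    """
    if rng is None:
        rng = random.Random(0)
    sentence_a = sentences[index]
    if rng.random() < 0.5 and index + 1 < len(sentences):
        return sentence_a, sentences[index + 1], True
    # Negative example: any sentence that is neither A nor the true next one.
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), False

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, mask_prob=0.3, rng=random.Random(1)))

doc = ["I went out.", "It was raining.", "I took an umbrella.", "The bus was late."]
print(make_nsp_pair(doc, 1, rng=random.Random(2)))
```

The model never sees the labels as input. They are only used to score its predictions at the masked positions and its yes/no answer for the sentence pair.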


BERT’s essential technical advancement is the bidirectional training of the transformer: every position in the sequence can directly access all other positions. Earlier approaches, in contrast, read a text sequence only from left to right or only from right to left.
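The difference between the two reading styles can be sketched with attention masks. This is an illustration only: the helper names are hypothetical, and real models apply such masks to attention scores inside each layer rather than to token lists.

```python
def causal_mask(n):
    """Left-to-right: position i may look at position j only when j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Bidirectional: every position may look at every other position."""
    return [[True] * n for _ in range(n)]

def visible_context(tokens, mask, i):
    """Tokens that position i is allowed to use as context (excluding itself)."""
    return [tok for j, tok in enumerate(tokens) if mask[i][j] and j != i]

tokens = ["the", "bank", "of", "the", "river"]
n = len(tokens)

# Context available when interpreting "bank" (position 1).
print(visible_context(tokens, causal_mask(n), 1))         # ['the']
print(visible_context(tokens, bidirectional_mask(n), 1))  # ['the', 'of', 'the', 'river']
```

With the left-to-right mask, the word “river” is not available when “bank” is processed, so the model has less evidence for picking the right sense of the word.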

The Need for Bidirectional Training

The meaning of a word can vary with the context in which it is used, and each word added to a sentence refines the meaning of the word in focus. Earlier word-embedding techniques, such as GloVe and Word2Vec, map every word to a single fixed vector, regardless of the sentence in which it appears.

Because each word is tied to a single representation, these embeddings struggle in context-heavy tasks. In contrast, BERT considers the context of the word by reading bidirectionally, eliminating the left-to-right momentum problem, where the interpretation of certain words becomes biased towards a specific meaning as the sentence progresses. The use cases for bidirectional training include:

●     Sentiment analysis

●     Question-answering tasks

●     Machine translation
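The contrast between fixed word embeddings and context-dependent representations, described earlier in this section, can be shown with a toy example. The two-dimensional vectors below are made up, and the “contextual” function is a crude stand-in that mixes each word with the sentence average. It is not how BERT computes its representations, but it shows the kind of behaviour a static lookup cannot provide.

```python
# Hypothetical 2-d static embeddings (values invented for illustration).
STATIC = {
    "river": [1.0, 0.0],
    "bank":  [0.5, 0.5],
    "loan":  [0.0, 1.0],
}

def static_embed(tokens):
    """Word2Vec/GloVe style: one fixed vector per word, whatever the sentence."""
    return [STATIC[t] for t in tokens]

def toy_contextual_embed(tokens):
    """Crude stand-in for contextual embeddings: blend each word's vector
    with the average vector of the whole sentence (left and right context)."""
    vecs = static_embed(tokens)
    dim = len(vecs[0])
    mean = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    return [[(v[d] + mean[d]) / 2 for d in range(dim)] for v in vecs]

s1 = ["river", "bank"]
s2 = ["bank", "loan"]

# Static: "bank" gets the same vector in both sentences.
print(static_embed(s1)[1], static_embed(s2)[0])                  # [0.5, 0.5] [0.5, 0.5]
# Contextual: the vector for "bank" shifts towards its neighbours.
print(toy_contextual_embed(s1)[1], toy_contextual_embed(s2)[0])  # [0.625, 0.375] [0.375, 0.625]
```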

Several variants have been developed to improve BERT, either on prediction metrics or on computational speed. A few of them are RoBERTa, DistilBERT, and ALBERT.

Generative Pre-trained Transformer 3 (GPT-3)

GPT-3 is an autoregressive model used for text generation. It is trained on a vast body of internet text to predict the next token from the sequence of tokens that precedes it. This pre-training objective produces models that are well suited to text generation but less suited to language understanding.

GPT-3 is called an autoregressive model because once a second word is predicted based on the first word, the model takes both the first and the second words as inputs and predicts the third word. The process continues until the required number of words has been generated. The generated text can read much like something a human would have written, which is largely attributed to the scale of GPT-3’s neural network, with about 175 billion parameters. The use cases of GPT-3 include:

●     It can be used whenever a large amount of text needs to be generated from a small amount of text input. For example, it can be used to create articles, poetry, stories, news reports, and dialogues using a small amount of input text.

●     It is also used in gaming to create realistic chat dialogues and quizzes from text prompts. It can also generate the text for memes, recipes, and comic strips.
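The autoregressive loop described above can be sketched in a few lines. The “model” here is a hypothetical hand-written table that looks only at the previous token, and the loop always picks the most likely token (greedy decoding). GPT-3 itself conditions on the whole preceding sequence and is usually sampled from rather than decoded greedily, so treat this purely as an illustration of the feedback loop.

```python
# Hypothetical toy "language model": previous token -> next-token probabilities.
TOY_NEXT = {
    "<s>":    {"the": 1.0},
    "the":    {"dog": 0.6, "cat": 0.4},
    "dog":    {"barks": 0.7, "sleeps": 0.3},
    "cat":    {"sleeps": 1.0},
    "barks":  {"</s>": 1.0},
    "sleeps": {"</s>": 1.0},
}

def next_token(sequence):
    """Pick the most probable next token (the toy model only uses the last token)."""
    probs = TOY_NEXT[sequence[-1]]
    return max(probs, key=probs.get)

def generate(prompt, max_new_tokens=10):
    """Autoregressive generation: each predicted token is appended to the
    input and fed back in to predict the following one."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(sequence)
        if tok == "</s>":  # end-of-sequence marker
            break
        sequence.append(tok)
    return sequence

print(generate(["<s>", "the"]))  # ['<s>', 'the', 'dog', 'barks']
```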

 GPT-3 Versus BERT

The original transformer includes two separate mechanisms, an encoder and a decoder. BERT uses only the encoder to build a language model, whereas GPT-3 uses only the transformer decoder to produce text.

Bidirectional and Auto-Regressive Transformer (BART)

BART combines a bidirectional encoder (like BERT) and an autoregressive decoder (like GPT) to form a Seq2Seq model. BART uses a standard transformer architecture (encoder-decoder) like the original transformer model but also incorporates the changes made by BERT, which only uses the encoder to predict the masked words, and GPT, which only uses the decoder to predict the next token. Thus, BART gets the best of both worlds.

Benefits of BART

Unlike many other pre-trained transformers, BART can be trained with any type of text corruption, including masking, word removal, and word replacement. This gives it great flexibility in choosing the corruption scheme, including schemes that change the length of the original input. The model is then trained to reconstruct the original text from the corrupted version. The corruption schemes explored in the BART paper are token masking, token deletion, text infilling, sentence permutation, and document rotation.
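The five corruption schemes can be sketched as simple list operations. This is a simplified illustration with hypothetical function names. The positions and spans are passed in explicitly so the results are easy to follow, whereas in actual pre-training they are sampled at random.

```python
import random

MASK = "<mask>"

def token_masking(tokens, positions):
    """Replace the chosen tokens with a mask symbol (length unchanged)."""
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

def token_deletion(tokens, positions):
    """Remove the chosen tokens; the model must work out where text is missing."""
    return [t for i, t in enumerate(tokens) if i not in positions]

def text_infilling(tokens, start, length):
    """Replace a whole span with a single mask, hiding how many tokens it held."""
    return tokens[:start] + [MASK] + tokens[start + length:]

def sentence_permutation(sentences, rng):
    """Shuffle the order of the sentences in a document."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def document_rotation(tokens, start):
    """Rotate the document so that it begins at the chosen token."""
    return tokens[start:] + tokens[:start]

tokens = ["A", "B", "C", "D", "E"]
print(token_masking(tokens, {1, 3}))   # ['A', '<mask>', 'C', '<mask>', 'E']
print(token_deletion(tokens, {1, 3}))  # ['A', 'C', 'E']
print(text_infilling(tokens, 1, 3))    # ['A', '<mask>', 'E']
print(document_rotation(tokens, 2))    # ['C', 'D', 'E', 'A', 'B']
print(sentence_permutation(["S1.", "S2.", "S3."], random.Random(0)))
```

Token deletion and text infilling both change the length of the input, which is the flexibility the paragraph above refers to.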


XLNet

XLNet is an advancement over the BERT model and uses a generalized autoregressive pre-training method. Although it is autoregressive, it still captures bidirectional context. The use cases of XLNet include:

●     Sentiment analysis

●     Question-answering tasks

●     Machine translation

Advantages of XLNet over BERT

During pre-training, BERT hides some tokens in the input data (masking) and tries to predict them from the unmasked tokens that remain. The drawback of this process is its assumption that each masked token depends on all the unmasked tokens but is independent of the other masked tokens.

To better understand this, take “Dogs love _ _” as an input. Because BERT predicts each blank independently of the other, it might fill them in mismatched ways, such as “Dogs love eating fetch” or “Dogs love playing meat.” If the model had considered the dependency between the masked tokens, the output would have been “Dogs love eating meat” or “Dogs love playing fetch.” XLNet overcomes this disadvantage of BERT by using Permutation Language Modelling and Two-Stream Self-Attention.
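The idea behind Permutation Language Modelling can be sketched as follows. For a chosen factorization order (a permutation of the token positions), each token is predicted from the tokens that come before it in that order, wherever they sit in the sentence. The function name and the example order are hypothetical, and Two-Stream Self-Attention, which XLNet needs to make this work inside a transformer, is not modelled here.

```python
def permutation_lm_targets(tokens, order):
    """For each step of a factorization order, return
    (target_position, target_token, context), where context maps the
    positions already seen in that order to their tokens."""
    steps = []
    for step, pos in enumerate(order):
        context = {p: tokens[p] for p in sorted(order[:step])}
        steps.append((pos, tokens[pos], context))
    return steps

tokens = ["Dogs", "love", "eating", "meat"]
order = [0, 1, 3, 2]  # hypothetical sampled order: "meat" is predicted before "eating"

for pos, target, context in permutation_lm_targets(tokens, order):
    print(f"predict position {pos} ({target!r}) given {context}")
```

In the last step, “eating” is predicted with “meat” already in its context, so the two formerly masked words are no longer treated as independent. Over many sampled orders, each token gets to be predicted from context on both its left and its right.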


BERT, BART, GPT-3, and XLNet are a few of the pre-trained NLP models that have pushed the boundaries of language understanding and language generation by machines. These are golden years for NLP, as researchers are developing ever larger pre-trained models and novel pre-training approaches that push the boundaries further.



