GPT-2 Text Prediction with OpenVINO

This notebook demonstrates text prediction with OpenVINO. We use the GPT-2 model, which is part of the Generative Pre-trained Transformer (GPT) family. GPT-2 is pre-trained on a large corpus of English text using unsupervised training. The model is available from Open Model Zoo, whose tools we use to download the model and convert it to OpenVINO IR.


import sys
import numpy as np
from openvino.runtime import Core
from IPython.display import Markdown, display
import json
from pathlib import Path

from transformers import GPT2Tokenizer

The model

# directory where the model will be downloaded.
base_model_dir = "model"

# name of the model
model_name = 'gpt-2'

# desired precision
precision = "FP16"

# build the paths from base_model_dir so they follow the download directory
model_path = f"{base_model_dir}/public/{model_name}/{precision}/{model_name}.xml"
model_weights_path = f"{base_model_dir}/public/{model_name}/{precision}/{model_name}.bin"

Download GPT-2 from Open Model Zoo

We use omz_downloader, a command-line tool from the openvino-dev package. omz_downloader automatically creates a directory structure and downloads the selected model. Skip this step if the model is already downloaded. For this demo, we download and use the gpt-2 model.

download_command = f"omz_downloader " \
                   f"--name {model_name} " \
                   f"--output_dir {base_model_dir} " \
                   f"--cache_dir {base_model_dir}"

display(Markdown(f"Download command: `{download_command}`"))
display(Markdown(f"Downloading {model_name}... (This may take a few minutes depending on your connection.)"))

! $download_command

Download command: omz_downloader --name gpt-2 --output_dir model --cache_dir model

Downloading gpt-2… (This may take a few minutes depending on your connection.)

Load the model

Converted models are located in a fixed directory structure that indicates the source, model name and precision. We start by creating an OpenVINO Runtime Core object. Then we read the network architecture and model weights from the .xml and .bin files, respectively. Finally, we compile the model for the desired device. Because we use the dynamic shapes feature, which is only available on CPU, we must use CPU for the device. Dynamic shapes support on GPU is coming soon.

Since the text prediction model has a dynamic input shape, you cannot directly switch the device to GPU for inference on integrated or discrete Intel GPUs. To run inference on an iGPU or dGPU with this model, first reshape the model inputs to a fixed size, then compile the model for the GPU device.

# initialize inference engine
ie_core = Core()

# read the model and corresponding weights from file
model = ie_core.read_model(model=model_path, weights=model_weights_path)

# assign dynamic shapes to every input layer
for input_layer in model.inputs:
    input_shape = input_layer.partial_shape
    input_shape[0] = -1
    input_shape[1] = -1
    model.reshape({input_layer: input_shape})

# compile the model for CPU devices
compiled_model = ie_core.compile_model(model=model, device_name="CPU")

# get input and output names of nodes
input_keys = next(iter(compiled_model.inputs))
output_keys = next(iter(compiled_model.outputs))

input_keys and output_keys refer to the first input node and the first output node of the compiled model. In the case of GPT-2, the input has the shape (batch size, sequence length) and the output has the shape (batch size, sequence length, vocab size).
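The shapes described above determine how the next-token prediction is read from the output. The following is a minimal sketch that uses random numbers in place of a real model output; the sizes and the names `logits` and `cur_input_len` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical sizes for illustration only; GPT-2's real vocabulary is much larger.
batch_size, sequence_length, vocab_size = 1, 6, 10

rng = np.random.default_rng(0)
# Stand-in for the model output: one vector of logits per position in the sequence.
logits = rng.normal(size=(batch_size, sequence_length, vocab_size))

# Suppose only the first 4 positions hold real tokens and the rest is padding.
cur_input_len = 4
# The prediction for the next token is read from the last real position.
next_token_logits = logits[:, cur_input_len - 1, :]

print(next_token_logits.shape)
```

Indexing the sequence axis with a single integer drops that axis, so the result keeps only the batch and vocabulary dimensions.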


NLP models often take a list of tokens as a standard input. A token is a word or a piece of a word mapped to an integer. To provide the proper input, we use a vocabulary file to handle the mapping. So first, let's load the vocabulary file.

def load_vocab_file(vocab_file_path):
    with open(vocab_file_path, "r", encoding="utf-8") as content:
        return json.load(content)

vocab_file_path = f"model/public/{model_name}/gpt2/vocab.json"
vocab = load_vocab_file(vocab_file_path)

Define tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# this function converts text to tokens
def tokenize(text):
    input_ids = tokenizer(text)['input_ids']
    input_ids = np.array(input_ids).reshape(1, -1)
    return input_ids

The last token in the vocabulary list is the end-of-text token. We store the index of this token so that we can use it as padding at a later stage.

eos_token_id = len(vocab) - 1
tokenizer._convert_id_to_token(len(vocab) - 1)
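As a cross-check on the "last entry" assumption, the end-of-text id can also be looked up by name. The sketch below uses a hypothetical four-entry vocabulary laid out like vocab.json (token string mapped to integer id); `toy_vocab`, `toy_eos_token_id` and `id_to_token` are illustrative names, not part of the notebook.

```python
# Hypothetical miniature vocabulary: token string -> integer id.
# In GPT-2's byte-pair encoding, a leading "Ġ" marks a token that starts with a space.
toy_vocab = {"Deep": 0, "Ġlearning": 1, "Ġis": 2, "<|endoftext|>": 3}

# Look the token up by name instead of relying on its position.
toy_eos_token_id = toy_vocab["<|endoftext|>"]

# Invert the mapping to go from ids back to token strings.
id_to_token = {idx: token for token, idx in toy_vocab.items()}

# In this toy vocabulary the end-of-text token is also the last entry.
print(toy_eos_token_id == len(toy_vocab) - 1)
print(id_to_token[toy_eos_token_id])
```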

Define Softmax layer

A softmax function is used to convert top-k logits into a probability distribution.

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    summation = e_x.sum(axis=-1, keepdims=True)
    return e_x / summation
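To see why this formulation works well with filtered logits, here is a small sketch on made-up values. The function is repeated under the hypothetical name `softmax_demo` so that the example is self-contained; an entry of negative infinity ends up with a probability of exactly zero.

```python
import numpy as np

def softmax_demo(x):
    # Subtracting the row maximum keeps np.exp from overflowing.
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

# Made-up logits; the third entry has been filtered out with -inf.
demo_logits = np.array([[2.0, 1.0, -np.inf, 0.0]])
demo_probs = softmax_demo(demo_logits)

print(demo_probs)
print(demo_probs.sum())
```

The remaining three entries share the whole probability mass, which is what lets top-k filtering and softmax be applied one after the other.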

Set the minimum sequence length

If the minimum sequence length has not been reached, the following code sets the score of the end-of-text (EOS) token to negative infinity so that it cannot be selected. This keeps the generation of the next words going.

def process_logits(input_ids, scores, eos_token_id, min_length=0):
    cur_length = input_ids.shape[-1]
    if cur_length < min_length:
        scores[:, eos_token_id] = -float("inf")
    return scores
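The effect is easiest to see on a toy example. The sketch below repeats the same logic under the hypothetical name `suppress_eos` and applies it to made-up scores for a five-token vocabulary whose last id plays the role of the EOS token.

```python
import numpy as np

def suppress_eos(input_ids, scores, eos_token_id, min_length=0):
    # Same logic as above: block EOS while the sequence is shorter than min_length.
    if input_ids.shape[-1] < min_length:
        scores[:, eos_token_id] = -float("inf")
    return scores

toy_input_ids = np.array([[0, 1, 2]])                  # three tokens so far
toy_scores = np.array([[0.1, 0.4, 0.2, 0.3, 0.9]])     # made-up scores
toy_eos = 4

# The function writes into its argument, so pass copies to compare both cases.
blocked = suppress_eos(toy_input_ids, toy_scores.copy(), toy_eos, min_length=5)
unchanged = suppress_eos(toy_input_ids, toy_scores.copy(), toy_eos, min_length=0)

print(blocked)
print(unchanged)
```

With the default `min_length=0` the scores pass through untouched, so the minimum length only takes effect when a positive value is supplied.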

Top-K sampling

In Top-K sampling, we filter the K most likely next words and redistribute the probability mass among only those K next words.

def get_top_k_logits(scores, top_k):
    filter_value = -float("inf")
    top_k = min(max(top_k, 1), scores.shape[-1])
    top_k_scores = -np.sort(-scores)[:, :top_k]
    indices_to_remove = scores < np.min(top_k_scores)
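    # replace every score below the k-th largest one with -inf
    filtered_scores = np.ma.array(scores, mask=indices_to_remove,
                                  fill_value=filter_value).filled()
    return filtered_scores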
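As an illustration of the filtering step, here is a minimal sketch on made-up scores. It is not the notebook's function: the hypothetical `top_k_filter` uses `np.where` and a per-row threshold, which should give the same result for a batch of one.

```python
import numpy as np

def top_k_filter(scores, top_k):
    # Clamp top_k to the valid range [1, vocabulary size].
    top_k = min(max(top_k, 1), scores.shape[-1])
    # The k-th largest score in each row is the cut-off.
    kth_largest = -np.sort(-scores, axis=-1)[:, top_k - 1:top_k]
    # Everything below the cut-off is set to -inf.
    return np.where(scores < kth_largest, -np.inf, scores)

toy_scores = np.array([[1.0, 3.0, 0.5, 2.0, -1.0]])
filtered = top_k_filter(toy_scores, top_k=2)

print(filtered)
```

Only the two largest scores (3.0 and 2.0) survive; a subsequent softmax then spreads the probability mass over those two entries alone.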

Main Processing Function

Generating the predicted sequence.

def generate_sequence(input_ids, max_sequence_length=128,
                      eos_token_id=eos_token_id):
    while True:
        cur_input_len = len(input_ids[0])
        # pad the sequence with the end-of-text token up to the maximum length
        pad_len = max_sequence_length - cur_input_len
        model_input = np.concatenate((input_ids,
                                      [[eos_token_id] * pad_len]), axis=-1)
        # pass the padded sequence into the model
        outputs = compiled_model(inputs=[model_input])[output_keys]
        # logits predicted at the last real (non-padding) position
        next_token_logits = outputs[:, cur_input_len - 1, :]
        # pre-process distribution
        next_token_scores = process_logits(input_ids,
                                           next_token_logits, eos_token_id)
        top_k = 20
        next_token_scores = get_top_k_logits(next_token_scores, top_k)
        # get next token id
        probs = softmax(next_token_scores)
        next_tokens = np.random.choice(probs.shape[-1], 1,
                                       p=probs[0], replace=True)
        # break the loop if max length or end of text token is reached
        if cur_input_len == max_sequence_length or next_tokens == eos_token_id:
            break
        else:
            input_ids = np.concatenate((input_ids, [next_tokens]), axis=-1)
    return input_ids
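The control flow of the loop can be tried without the real model. The sketch below is an illustration under stated assumptions, not the notebook's implementation: `dummy_model` is a hypothetical stand-in that returns random logits, the vocabulary has only eight tokens, and the sequence is fed without padding, so the prediction is read from the last position.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 8                  # hypothetical toy vocabulary size
EOS_TOKEN_ID = VOCAB_SIZE - 1   # last id plays the role of the end-of-text token

def dummy_model(token_ids):
    # Hypothetical stand-in for the compiled model:
    # random logits of shape (batch, sequence, vocab).
    return rng.normal(size=(token_ids.shape[0], token_ids.shape[1], VOCAB_SIZE))

def sample_sequence(input_ids, max_sequence_length=12, top_k=3):
    while input_ids.shape[-1] < max_sequence_length:
        # logits for the next token, taken from the last position
        logits = dummy_model(input_ids)[:, -1, :]
        # top-k filtering: everything below the k-th largest logit becomes -inf
        kth_largest = -np.sort(-logits, axis=-1)[:, top_k - 1:top_k]
        logits = np.where(logits < kth_largest, -np.inf, logits)
        # softmax over the remaining logits
        e_x = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs = e_x / e_x.sum(axis=-1, keepdims=True)
        # sample the next token id from the filtered distribution
        next_token = rng.choice(VOCAB_SIZE, p=probs[0])
        if next_token == EOS_TOKEN_ID:
            break
        input_ids = np.concatenate((input_ids, [[next_token]]), axis=-1)
    return input_ids

prompt = np.array([[1, 2, 3]])
generated = sample_sequence(prompt)
print(generated)
```

Each iteration appends at most one token, and the loop stops either when the end-of-text id is sampled or when the maximum length is reached, which mirrors the two exit conditions described above.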


The text variable below is the input used to generate a predicted sequence.

text = "Deep learning is a type of machine learning that uses neural networks"
input_ids = tokenize(text)
output_ids = generate_sequence(input_ids)
S = " "
# Convert IDs to words and make the sentence from it
for i in output_ids[0]:
    S += tokenizer.convert_tokens_to_string(tokenizer._convert_id_to_token(i))
print("Input Text: ", text)
print(f"Predicted Sequence:{S}")

Input Text:  Deep learning is a type of machine learning that uses neural networks

Predicted Sequence: Deep learning is a type of machine learning that uses neural networks to learn a large set of facts about a situation and then compares and contrasts that information with what's in the background.

The team of researchers from the University of Washington in Seattle, in collaboration with the University of Michigan and the University of Pennsylvania, analyzed the data on a large, well-known social network (Facebook Twitter) called Twitter Learning.

The researchers found that Twitter Learning was more efficient than the previous network of Facebook Twitter Learning in predicting outcomes on a scale of 1 to 10 as compared to Facebook Facebook learning. This was not surprising given that the social networks