Dynamic Shapes

As demonstrated in the Changing Input Shapes article, some models support changing input shapes before model compilation in Core::compile_model. Reshaping lets you fit the model input shape to the exact size required by the end application. This article explains how a model's ability to reshape can be leveraged further in more dynamic scenarios.

Applying Dynamic Shapes

Conventional “static” model reshaping works well when it can be done once for many inference calls with the same shape. However, it is inefficient when the input tensor shape changes on every inference call: calling the reshape() and compile_model() methods each time a new size arrives is extremely time-consuming. A popular example is inference of natural language processing models (like BERT) with arbitrarily sized user input sequences. In this case, the sequence length cannot be predicted and may change on every inference call. Dimensions that can change frequently are called dynamic dimensions. Consider dynamic shapes when the real shape of the input is not known at the time of the compile_model() method call.

Below are several examples of dimensions that can be naturally dynamic:

  • Sequence length dimension for various sequence processing models, like BERT

  • Spatial dimensions in segmentation and style transfer models

  • Batch dimension

  • Arbitrary number of detections in the output of object detection models

There are various methods to address dynamic input dimensions by combining multiple pre-reshaped models and padding the input data. These methods are sensitive to model internals, do not always give optimal performance, and are cumbersome. For a short overview of the methods, refer to the When Dynamic Shapes API is Not Applicable page. Apply those methods only if the native dynamic shape API described in the following sections does not work or does not perform as expected.
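To make the padding-based workaround concrete, the following is a minimal sketch in plain Python. It assumes a set of static models pre-reshaped to a few fixed sequence lengths. The bucket sizes and the names pick_bucket and pad_to_bucket are hypothetical and are not part of the OpenVINO API.

```python
from bisect import bisect_left

# Hypothetical sequence lengths for which static models were pre-reshaped and compiled.
BUCKETS = [32, 64, 128, 256]


def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket that fits `length`; `buckets` must be sorted."""
    index = bisect_left(buckets, length)
    if index == len(buckets):
        raise ValueError(f"length {length} exceeds the largest bucket {buckets[-1]}")
    return buckets[index]


def pad_to_bucket(tokens, pad_id=0, buckets=BUCKETS):
    """Pad a token list to its bucket length and return (padded, attention_mask)."""
    target = pick_bucket(len(tokens), buckets)
    padding = target - len(tokens)
    padded = list(tokens) + [pad_id] * padding
    mask = [1] * len(tokens) + [0] * padding
    return padded, mask
```

The sketch also shows why the approach is cumbersome: every bucket needs its own compiled model, padded positions are still computed, and the model has to tolerate padding (for example, through an attention mask).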

The decision to use dynamic shapes should be based on proper benchmarking of a real application with real data. Unlike statically shaped models, dynamically shaped ones take a varying amount of inference time, depending on the input data shape or input tensor content. Furthermore, dynamic shapes can add memory and run-time overhead to each inference call, depending on the hardware plugin and the model used.
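One possible way to collect such measurements is a small timing loop that reports latency per input shape. The sketch below is framework-agnostic. The function name benchmark_by_shape and its parameters are hypothetical, and infer stands for any callable that runs one inference, such as a wrapper around an infer request.

```python
import statistics
import time


def benchmark_by_shape(infer, make_input, shapes, repeats=20, warmup=3):
    """Return {shape: median latency in seconds} for the callable `infer`.

    `make_input(shape)` must build input data of the given shape.
    """
    results = {}
    for shape in shapes:
        data = make_input(shape)
        # Warm-up calls are not timed; the first call for a new shape is often slower.
        for _ in range(warmup):
            infer(data)
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            infer(data)
            samples.append(time.perf_counter() - start)
        results[tuple(shape)] = statistics.median(samples)
    return results
```

Running the same loop against a statically reshaped model and a dynamically shaped one, with shapes taken from real application data, gives a direct comparison for the decision described above.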

Handling Dynamic Shapes Natively

This section describes how to handle dynamically shaped models natively with OpenVINO Runtime API version 2022.1 and higher. There are three main parts in the flow that differ from static shapes:

  • Configure the model.

  • Prepare data for inference.

  • Read resulting data after inference.

Configuring the Model

To avoid the methods mentioned in the previous section, one or more dimensions can be specified as dynamic directly in the model inputs. This is done with the same reshape method that is used for altering the static shape of inputs. Dynamic dimensions are specified as -1 or ov::Dimension() instead of the positive number used for static dimensions:

C++:
ov::Core core;
auto model = core.read_model("model.xml");

// Set one static dimension (= 1) and another dynamic dimension (= Dimension())
model->reshape({{1, ov::Dimension()}});  // {1,?}

// The same as above
model->reshape({{1, -1}}); // {1,?}

// Or set both dimensions as dynamic if both are going to be changed dynamically
model->reshape({{ov::Dimension(), ov::Dimension()}});  // {?,?}

// The same as above
model->reshape({{-1, -1}});  // {?,?}

Python:
import openvino as ov  # older releases use: import openvino.runtime as ov

core = ov.Core()
model = core.read_model("model.xml")

# Set one static dimension (= 1) and another dynamic dimension (= Dimension())
model.reshape([1, ov.Dimension()])

# The same as above
model.reshape([1, -1])

# The same as above
model.reshape("1, ?")

# Or set both dimensions as dynamic if both are going to be changed dynamically
model.reshape([ov.Dimension(), ov.Dimension()])

# The same as above
model.reshape([-1, -1])

# The same as above
model.reshape("?, ?")

C:
ov_core_t* core = NULL;
ov_core_create(&core);

ov_model_t* model = NULL;
ov_core_read_model(core, "model.xml", NULL, &model);

// Set one static dimension (= 1) and another dynamic dimension (= Dimension())
{
ov_partial_shape_t partial_shape;
ov_dimension_t dims[2] = {{1, 1}, {-1, -1}};
ov_partial_shape_create(2, dims, &partial_shape);
ov_model_reshape_single_input(model, partial_shape); // {1,?}
ov_partial_shape_free(&partial_shape);
}

// Or set both dimensions as dynamic if both are going to be changed dynamically
{
ov_partial_shape_t partial_shape;
ov_dimension_t dims[2] = {{-1, -1}, {-1, -1}};
ov_partial_shape_create(2, dims, &partial_shape);
ov_model_reshape_single_input(model, partial_shape); // {?,?}
ov_partial_shape_free(&partial_shape);
}

To simplify the code, the examples assume that the model has a single input and a single output. However, dynamic shapes can be applied to models with any number of inputs and outputs.

Undefined Dimensions “Out of the Box”

Dynamic dimensions may appear in the input model without calling the reshape method. Many DL frameworks support undefined dimensions. If such a model is converted with Model Optimizer or read directly by Core::read_model, undefined dimensions are preserved and automatically treated as dynamic. Therefore, there is no need to call the reshape method if undefined dimensions are already configured in the original or the IR model.

If the input model has undefined dimensions that will not change during inference, it is recommended to set them to static values using the same reshape method of the model. From the API perspective, any combination of dynamic and static dimensions can be configured.

Model Optimizer provides the same capability to reshape the model during conversion, including specifying dynamic dimensions. Use this capability to avoid calling the reshape method in the end application. For information about setting input shapes with Model Optimizer, refer to Setting Input Shapes.

Dimension Bounds

Apart from marking a dimension as dynamic, its lower and/or upper bounds can also be specified. They define a range of allowed values for the dimension. The bounds are passed as arguments to ov::Dimension:

C++:
// Both dimensions are dynamic, first has a size within 1..10 and the second has a size within 8..512
model->reshape({{ov::Dimension(1, 10), ov::Dimension(8, 512)}});  // {1..10,8..512}

// Both dimensions are dynamic, first doesn't have bounds, the second is in the range of 8..512
model->reshape({{-1, ov::Dimension(8, 512)}});   // {?,8..512}

Python:
# Both dimensions are dynamic, first has a size within 1..10 and the second has a size within 8..512
model.reshape([ov.Dimension(1, 10), ov.Dimension(8, 512)])

# The same as above
model.reshape([(1, 10), (8, 512)])

# The same as above
model.reshape("1..10, 8..512")

# Both dimensions are dynamic, first doesn't have bounds, the second is in the range of 8..512
model.reshape([-1, (8, 512)])

C:
// Both dimensions are dynamic, first has a size within 1..10 and the second has a size within 8..512
{
ov_partial_shape_t partial_shape;
ov_dimension_t dims[2] = {{1, 10}, {8, 512}};
ov_partial_shape_create(2, dims, &partial_shape);
ov_model_reshape_single_input(model, partial_shape); // {1..10,8..512}
ov_partial_shape_free(&partial_shape);
}

// Both dimensions are dynamic, first doesn't have bounds, the second is in the range of 8..512
{
ov_partial_shape_t partial_shape;
ov_dimension_t dims[2] = {{-1, -1}, {8, 512}};
ov_partial_shape_create(2, dims, &partial_shape);
ov_model_reshape_single_input(model, partial_shape); // {?,8..512}
ov_partial_shape_free(&partial_shape);
}

Information about bounds gives the inference plugin an opportunity to apply additional optimizations. With dynamic shapes, plugins have to apply a more flexible optimization approach during model compilation, which may require more time and memory for compilation and inference. Therefore, providing any additional information, such as bounds, can be beneficial. For the same reason, it is not recommended to leave dimensions undefined without a real need.

When specifying bounds, the lower bound is not as important as the upper one. The upper bound allows inference devices to allocate memory for intermediate tensors more precisely. It also allows using fewer tuned kernels for different sizes. The exact benefits of specifying the lower or upper bound are device dependent. Depending on the plugin, specifying the upper bound may be required. For information about dynamic shapes support on different devices, refer to the Features Support Matrix.

If the lower and upper bounds for a dimension are known, it is recommended to specify them, even if a plugin can execute a model without the bounds.
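To illustrate what bounds mean for the tensors passed at inference time, here is a plain-Python sketch of a compatibility check between a concrete shape and a shape specification with bounds. The helpers dim_accepts and shape_accepts are hypothetical and only mirror the notation used in the Python examples above: an integer for a static dimension, -1 for an unbounded dynamic one, and a (lower, upper) tuple for a bounded one. They are not OpenVINO functions.

```python
def dim_accepts(dim, value):
    """Check one concrete size against one dimension specification.

    `dim` is an int (static), -1 (unbounded dynamic), or a (lower, upper) tuple.
    An upper bound of -1 inside a tuple is treated as "no upper bound".
    """
    if isinstance(dim, tuple):
        lower, upper = dim
        return lower <= value and (upper == -1 or value <= upper)
    return dim == -1 or dim == value


def shape_accepts(partial_shape, shape):
    """Check that `shape` has the same rank and fits every dimension specification."""
    return len(partial_shape) == len(shape) and all(
        dim_accepts(dim, value) for dim, value in zip(partial_shape, shape)
    )
```

For example, under these assumptions shape_accepts([(1, 10), (8, 512)], [1, 128]) is true, while a sequence length of 600 falls outside the 8..512 range and is rejected.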

Setting Input Tensors

Preparing a model with the reshape method is the first step. The second step is passing a tensor with an appropriate shape to the infer request. This is similar to the regular flow. However, tensors with different shapes can now be passed for the same executable model and even for the same infer request:

C++:
// The first inference call

// Create tensor compatible with the model input
// Shape {1, 128} is compatible with any reshape statements made in previous examples
auto input_tensor_1 = ov::Tensor(model->input().get_element_type(), {1, 128});
// ... write values to input_tensor_1

// Set the tensor as an input for the infer request
infer_request.set_input_tensor(input_tensor_1);

// Do the inference
infer_request.infer();

// Retrieve a tensor representing the output data
ov::Tensor output_tensor = infer_request.get_output_tensor();

// For dynamic models output shape usually depends on input shape,
// that means shape of output tensor is initialized after the first inference only
// and has to be queried after every infer request
auto output_shape_1 = output_tensor.get_shape();

// Take a pointer of an appropriate type to tensor data and read elements according to the shape
// Assuming model output is f32 data type
auto data_1 = output_tensor.data<float>();
// ... read values

// The second inference call, repeat steps:

// Create another tensor (if the previous one cannot be utilized)
// Notice, the shape is different from input_tensor_1
auto input_tensor_2 = ov::Tensor(model->input().get_element_type(), {1, 200});
// ... write values to input_tensor_2

infer_request.set_input_tensor(input_tensor_2);

infer_request.infer();

// No need to call infer_request.get_output_tensor() again
// output_tensor queried after the first inference call above is valid here.
// But it may not be true for the memory underneath as shape changed, so re-take a pointer:
auto data_2 = output_tensor.data<float>();

// and new shape as well
auto output_shape_2 = output_tensor.get_shape();

// ... read values in data_2 according to the shape output_shape_2

Python:
import numpy as np

# The first inference call

# Create a tensor compatible with the model input
# Shape [1, 128] is compatible with any reshape statements made in previous examples
input_tensor1 = ov.Tensor(model.input().element_type, [1, 128])
# ... write values to input_tensor1

# Set the tensor as an input for the infer request
infer_request.set_input_tensor(input_tensor1)

# Do the inference
infer_request.infer()

# Or pass a tensor in infer to set the tensor as a model input and make the inference
infer_request.infer([input_tensor1])

# Or pass the numpy array to set inputs of the infer request
input_data = np.ones(shape=[1, 128])
infer_request.infer([input_data])

# Retrieve a tensor representing the output data
output_tensor = infer_request.get_output_tensor()

# Copy data from the tensor into a NumPy array (slicing with [:] would only create a view)
data1 = output_tensor.data.copy()

# The second inference call, repeat steps:

# Create another tensor (if the previous one cannot be utilized)
# Notice, the shape is different from input_tensor1
input_tensor2 = ov.Tensor(model.input().element_type, [1, 200])
# ... write values to input_tensor2

infer_request.infer([input_tensor2])

# No need to call infer_request.get_output_tensor() again
# output_tensor queried after the first inference call above is valid here.
# But it may not be true for the memory underneath as shape changed, so re-take the output data:
data2 = output_tensor.data[:]

C:
ov_output_port_t* input_port = NULL;
ov_element_type_e type;  // filled in by ov_port_get_element_type below
ov_shape_t input_shape_1;
ov_tensor_t* input_tensor_1 = NULL;
ov_tensor_t* output_tensor = NULL;
ov_shape_t output_shape_1;
void* data_1 = NULL;
ov_shape_t input_shape_2;
ov_tensor_t* input_tensor_2 = NULL;
ov_shape_t output_shape_2;
void* data_2 = NULL;

// The first inference call

// Create tensor compatible with the model input
// Shape {1, 128} is compatible with any reshape statements made in previous examples
{
ov_model_input(model, &input_port);
// Pass the address of the enum; ov_tensor_create below takes the type by value
ov_port_get_element_type(input_port, &type);
int64_t dims[2] = {1, 128};
ov_shape_create(2, dims, &input_shape_1);
ov_tensor_create(type, input_shape_1, &input_tensor_1);
// ... write values to input_tensor_1
}

// Set the tensor as an input for the infer request
ov_infer_request_set_input_tensor(infer_request, input_tensor_1);

// Do the inference
ov_infer_request_infer(infer_request);

// Retrieve a tensor representing the output data
ov_infer_request_get_output_tensor(infer_request, &output_tensor);

// For dynamic models output shape usually depends on input shape,
// that means shape of output tensor is initialized after the first inference only
// and has to be queried after every infer request
ov_tensor_get_shape(output_tensor, &output_shape_1);

// Take a pointer of an appropriate type to tensor data and read elements according to the shape
// Assuming model output is f32 data type
ov_tensor_data(output_tensor, &data_1);
// ... read values

// The second inference call, repeat steps:

// Create another tensor (if the previous one cannot be utilized)
// Notice, the shape is different from input_tensor_1
{
int64_t dims[2] = {1, 200};
ov_shape_create(2, dims, &input_shape_2);
ov_tensor_create(type, input_shape_2, &input_tensor_2);
// ... write values to input_tensor_2
}

ov_infer_request_set_input_tensor(infer_request, input_tensor_2);
ov_infer_request_infer(infer_request);

// No need to call infer_request.get_output_tensor() again
// output_tensor queried after the first inference call above is valid here.
// But it may not be true for the memory underneath as shape changed, so re-take a pointer:
ov_tensor_data(output_tensor, &data_2);

// and new shape as well
ov_tensor_get_shape(output_tensor, &output_shape_2);
// ... read values in data_2 according to the shape output_shape_2

// free resource
ov_output_port_free(input_port);
ov_shape_free(&input_shape_1);
ov_tensor_free(input_tensor_1);
ov_shape_free(&output_shape_1);
ov_shape_free(&input_shape_2);
ov_tensor_free(input_tensor_2);
ov_shape_free(&output_shape_2);
ov_tensor_free(output_tensor);

In the example above, set_input_tensor is used to specify input tensors. The real dimensions of a tensor are always static, because it is a particular tensor and, in contrast to model inputs, it has no dimension variations.

Similar to static shapes, get_input_tensor can be used instead of set_input_tensor. In contrast to static input shapes, when using get_input_tensor for dynamic inputs, the set_shape method of the returned tensor should be called to define the shape and allocate memory. Without doing so, the tensor returned by get_input_tensor is an empty tensor: its shape is not initialized and memory is not allocated, because the infer request has no information about the real shape that will be provided. Setting the shape of an input tensor is required when the corresponding input has at least one dynamic dimension, regardless of the bounds. In contrast to the previous example, the following one shows the same sequence of two inference calls, using get_input_tensor instead of set_input_tensor:

C++:
// The first inference call

// Get the tensor; shape is not initialized
auto input_tensor = infer_request.get_input_tensor();

// Set shape is required
input_tensor.set_shape({1, 128});
// ... write values to input_tensor

infer_request.infer();
ov::Tensor output_tensor = infer_request.get_output_tensor();
auto output_shape_1 = output_tensor.get_shape();
auto data_1 = output_tensor.data<float>();
// ... read values

// The second inference call, repeat steps:

// Set a new shape, may reallocate tensor memory
input_tensor.set_shape({1, 200});
// ... write values to input_tensor memory

infer_request.infer();
auto data_2 = output_tensor.data<float>();
auto output_shape_2 = output_tensor.get_shape();
// ... read values in data_2 according to the shape output_shape_2

Python:
# The first inference call

# Get the tensor; shape is not initialized
input_tensor = infer_request.get_input_tensor()

# Set shape is required
input_tensor.shape = [1, 128]
# ... write values to input_tensor

infer_request.infer()
output_tensor = infer_request.get_output_tensor()
data1 = output_tensor.data[:]

# The second inference call, repeat steps:

# Set a new shape, may reallocate tensor memory
input_tensor.shape = [1, 200]
# ... write values to input_tensor

infer_request.infer()
data2 = output_tensor.data[:]

C:
ov_tensor_t* input_tensor = NULL;
ov_shape_t input_shape_1;
ov_tensor_t* output_tensor = NULL;
void* data_1 = NULL;
ov_shape_t output_shape_1;
ov_shape_t input_shape_2;
ov_shape_t output_shape_2;
void* data_2 = NULL;

// The first inference call
// Get the tensor; shape is not initialized
ov_infer_request_get_input_tensor(infer_request, &input_tensor);

// Set shape is required
{
int64_t dims[2] = {1, 128};
ov_shape_create(2, dims, &input_shape_1);
ov_tensor_set_shape(input_tensor, input_shape_1);
// ... write values to input_tensor
}
// do inference
ov_infer_request_infer(infer_request);
// get output tensor data & shape
{
ov_infer_request_get_output_tensor(infer_request, &output_tensor);
ov_tensor_get_shape(output_tensor, &output_shape_1);
ov_tensor_data(output_tensor, &data_1);
// ... read values
}

// The second inference call, repeat steps:
// Set a new shape, may reallocate tensor memory
{
int64_t dims[2] = {1, 200};
ov_shape_create(2, dims, &input_shape_2);
ov_tensor_set_shape(input_tensor, input_shape_2);
// ... write values to input_tensor memory
}
// do inference
ov_infer_request_infer(infer_request);
// get output tensor data & shape
{
ov_tensor_get_shape(output_tensor, &output_shape_2);
ov_tensor_data(output_tensor, &data_2);
// ... read values in data_2 according to the shape output_shape_2
}

// free resources
ov_shape_free(&input_shape_1);
ov_shape_free(&output_shape_1);
ov_shape_free(&input_shape_2);
ov_shape_free(&output_shape_2);
ov_tensor_free(output_tensor);
ov_tensor_free(input_tensor);

Dynamic Shapes in Outputs

The examples above are valid when dynamic dimensions in the output are implied by propagation of dynamic dimensions from the inputs. For example, the batch dimension in an input shape is usually propagated through the whole model and appears in the output shape. The same applies to other dimensions, such as the sequence length for NLP models or spatial dimensions for segmentation models, that are propagated through the entire network.
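The propagation described above can be pictured with a toy shape-inference function for a single unpadded convolution over an NCHW input. This is only an illustration written for this article, not OpenVINO code. The names DYNAMIC, conv_dim and propagate are hypothetical, and a dynamic dimension is written as -1.

```python
DYNAMIC = -1  # marker for a dimension whose size is unknown until inference


def conv_dim(size, kernel, stride=1):
    """Output size of one spatial dimension for an unpadded convolution."""
    if size == DYNAMIC:
        # An unknown input size yields an unknown output size.
        return DYNAMIC
    return (size - kernel) // stride + 1


def propagate(input_shape, out_channels, kernel, stride=1):
    """Map an NCHW input shape to the NCHW output shape of the convolution."""
    batch, _, height, width = input_shape
    return [batch, out_channels, conv_dim(height, kernel, stride), conv_dim(width, kernel, stride)]
```

With a fully static input such as [1, 3, 224, 224] the output shape is fully static as well, while dynamic batch and spatial dimensions in the input stay dynamic in the output. Only the channel dimension, which is fixed by the layer itself, becomes static.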

Whether the output has dynamic dimensions or not can be verified by querying the output partial shape after the model is read or reshaped. The same applies to inputs. For example:

C++:
// Print output partial shape
std::cout << model->output().get_partial_shape() << "\n";

// Print input partial shape
std::cout << model->input().get_partial_shape() << "\n";

Python:
# Print output partial shape
print(model.output().partial_shape)

# Print input partial shape
print(model.input().partial_shape)

C:
ov_output_port_t* output_port = NULL;
ov_output_port_t* input_port = NULL;
ov_partial_shape_t partial_shape;
const char* str_partial_shape = NULL;

// Print output partial shape
{
ov_model_output(model, &output_port);
ov_port_get_partial_shape(output_port, &partial_shape);
str_partial_shape = ov_partial_shape_to_string(partial_shape);
printf("The output partial shape: %s\n", str_partial_shape);
// Release the string and the shape before they are reused for the input
ov_free(str_partial_shape);
ov_partial_shape_free(&partial_shape);
}

// Print input partial shape
{
ov_model_input(model, &input_port);
ov_port_get_partial_shape(input_port, &partial_shape);
str_partial_shape = ov_partial_shape_to_string(partial_shape);
printf("The input partial shape: %s\n", str_partial_shape);
}

// free allocated resource
ov_free(str_partial_shape);
ov_partial_shape_free(&partial_shape);
ov_output_port_free(output_port);
ov_output_port_free(input_port);

When there are dynamic dimensions in the corresponding inputs or outputs, ? or ranges like 1..10 appear in the printed shape.
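As an aid for reading this notation, the sketch below turns such a string into a list of (lower, upper) pairs, where None stands for "no upper bound". It is a hypothetical helper written for this article and is not the parser used by OpenVINO. It assumes dimensions are separated by commas and optionally wrapped in braces or brackets.

```python
def parse_partial_shape(text):
    """Parse strings such as "{1..10,?}" or "1, 8..512" into (lower, upper) pairs."""
    dims = []
    for token in text.strip().strip("{}[]").split(","):
        token = token.strip()
        if token == "?":
            # Fully dynamic dimension: any non-negative size.
            dims.append((0, None))
        elif ".." in token:
            lower, upper = token.split("..")
            # A missing side of the range is treated as unbounded on that side.
            dims.append((int(lower) if lower else 0, int(upper) if upper else None))
        else:
            # Static dimension: lower and upper bounds coincide.
            size = int(token)
            dims.append((size, size))
    return dims
```

A dimension is static exactly when its two bounds are equal, which matches the distinction between a plain number and a ? or range in the printed shape.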

It can also be verified in a more programmatic way:

C++:
auto model = core.read_model("model.xml");

if (model->input(0).get_partial_shape().is_dynamic()) {
    // input is dynamic
}

if (model->output(0).get_partial_shape().is_dynamic()) {
    // output is dynamic
}

if (model->output(0).get_partial_shape()[1].is_dynamic()) {
    // dimension at index 1 of the output is dynamic
}

Python:
model = core.read_model("model.xml")

if model.input(0).partial_shape.is_dynamic():
    # input is dynamic
    pass

if model.output(0).partial_shape.is_dynamic():
    # output is dynamic
    pass

if model.output(0).partial_shape[1].is_dynamic():
    # dimension at index 1 of the output is dynamic
    pass

C:
ov_model_t* model = NULL;
ov_output_port_t* input_port = NULL;
ov_output_port_t* output_port = NULL;
ov_partial_shape_t partial_shape;

ov_core_read_model(core, "model.xml", NULL, &model);

// for input
{
ov_model_input_by_index(model, 0, &input_port);
ov_port_get_partial_shape(input_port, &partial_shape);
if (ov_partial_shape_is_dynamic(partial_shape)) {
    // input is dynamic
}
// Release the shape before it is reused for the output
ov_partial_shape_free(&partial_shape);
}

// for output
{
ov_model_output_by_index(model, 0, &output_port);
ov_port_get_partial_shape(output_port, &partial_shape);
if (ov_partial_shape_is_dynamic(partial_shape)) {
    // output is dynamic
}
}

// free allocated resource
ov_partial_shape_free(&partial_shape);
ov_output_port_free(input_port);
ov_output_port_free(output_port);

If at least one dynamic dimension exists in a model output, the shape of the corresponding output tensor is set as a result of the inference call. Before the first inference, memory for such a tensor is not allocated and the tensor has the [0] shape. If the set_output_tensor method is called with a pre-allocated tensor, the inference calls set_shape internally and the initial shape is replaced by the calculated shape. Therefore, setting a shape for output tensors in this case is useful only for pre-allocating enough memory for the output tensor. Normally, the set_shape method of a Tensor re-allocates memory only if the new shape requires more storage.
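The re-allocation behavior described above can be modeled with a toy class whose storage only grows. This is a simplified stand-in for illustration, not the actual Tensor implementation. The class name GrowOnlyTensor and its attributes are hypothetical, and capacity is counted in elements rather than bytes.

```python
from math import prod


class GrowOnlyTensor:
    """Toy model of a tensor that re-allocates only when it needs more storage."""

    def __init__(self):
        self.shape = [0]        # shape before the first inference
        self.capacity = 0       # number of elements the current storage can hold
        self.reallocations = 0  # how many times storage had to grow

    def set_shape(self, shape):
        needed = prod(shape)
        if needed > self.capacity:
            # Larger than anything seen so far: new storage is required.
            self.capacity = needed
            self.reallocations += 1
        self.shape = list(shape)
```

In this model, setting the largest expected shape once up front means that later, smaller shapes reuse the same storage. This is the reason pre-allocating an output tensor can be worthwhile.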