Automatic Batching¶
The Automatic Batching Execution mode (or Auto-batching for short) performs automatic batching on-the-fly to improve device utilization by grouping inference requests together, without programming effort from the user. With Automatic Batching, gathering the input and scattering the output from the individual inference requests required for the batch happen transparently, without affecting the application code.
Auto-batching can be used directly as a virtual device or as an option for inference on CPU/GPU/VPU (by means of a configuration/hint). These two ways let the user enable the BATCH device explicitly or implicitly, with the underlying logic remaining the same. One difference is that the CPU device does not support enabling the BATCH device implicitly: a command such as ./benchmark_app -m <model> -d CPU -hint tput will not apply the BATCH device implicitly, but ./benchmark_app -m <model> -d "BATCH:CPU(16)" loads the BATCH device explicitly.
Auto-batching primarily targets the existing code written for inferencing many requests, each instance with the batch size 1. To get corresponding performance improvements, the application must be running multiple inference requests simultaneously. Auto-batching can also be used via a particular virtual device.
This article provides a preview of the Automatic Batching function, including how it works, its configurations, and testing performance.
How Automatic Batching Works¶
Batching is a straightforward way of leveraging the compute power of the GPU and saving on communication overheads. Automatic Batching is “implicitly” triggered on the GPU when ov::hint::PerformanceMode::THROUGHPUT is specified for the ov::hint::performance_mode property in the compile_model or set_property calls.
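// C++: the throughput hint implicitly triggers Automatic Batching on the GPU
auto compiled_model = core.compile_model(model, "GPU",
    ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
# Python: the same hint, passed as a configuration dictionary
config = {"PERFORMANCE_HINT": "THROUGHPUT"}
compiled_model = core.compile_model(model, "GPU", config)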
To enable Auto-batching in legacy applications that do not use performance hints, use the explicit device notation, such as BATCH:GPU.
Auto-batching can be disabled (for example, for the GPU device) to prevent it from being triggered by ov::hint::PerformanceMode::THROUGHPUT. To do that, set ov::hint::allow_auto_batching to false in addition to the ov::hint::performance_mode, as shown below:
// disabling the automatic batching
// leaving intact the other configuration options that the device selects for the 'throughput' hint
auto compiled_model = core.compile_model(model, "GPU",
    ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
    ov::hint::allow_auto_batching(false));
# disabling the automatic batching
# leaving intact the other configuration options that the device selects for the 'throughput' hint
config = {"PERFORMANCE_HINT": "THROUGHPUT",
          "ALLOW_AUTO_BATCHING": False}
compiled_model = core.compile_model(model, "GPU", config)
Configuring Automatic Batching¶
Following the OpenVINO naming convention, the batching device is assigned the label of BATCH. The configuration options are as follows:
Parameter name | Parameter description | Examples |
---|---|---|
| The name of the device to apply Automatic Batching to, with the optional batch size value in brackets. | |
AUTO_BATCH_TIMEOUT | The timeout value, in ms (1000 by default). | You can reduce the timeout value to avoid a performance penalty when the data arrives too unevenly, for example by setting it to “100”. Conversely, you can make it large enough to accommodate input preparation (e.g. when it is a serial process). |
Automatic Batch Size Selection¶
In both the THROUGHPUT hint and the explicit BATCH device cases, the optimal batch size is selected automatically, as the implementation queries the ov::optimal_batch_size property from the device and passes the model graph as the parameter. The actual value depends on the model and device specifics, for example, the on-device memory for dGPUs.
The support for Auto-batching is not limited to GPU. However, if a device does not support ov::optimal_batch_size yet, an explicit batch size must be specified for it to work with Auto-batching, e.g., BATCH:<device>(16).
This “automatic batch size selection” works on the presumption that the application queries ov::optimal_number_of_infer_requests to create the returned number of requests and run them simultaneously:
// when the batch size is automatically selected by the implementation
// it is important to query/create and run the sufficient #requests
auto compiled_model = core.compile_model(model, "GPU",
ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
auto num_requests = compiled_model.get_property(ov::optimal_number_of_infer_requests);
# when the batch size is automatically selected by the implementation
# it is important to query/create and run the sufficient requests
config = {"PERFORMANCE_HINT": "THROUGHPUT"}
compiled_model = core.compile_model(model, "GPU", config)
num_requests = compiled_model.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
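The query only pays off if the application then creates and runs that many requests. A minimal Python sketch of that step, assuming `compiled_model` is the object returned by `core.compile_model` and that requests are created with its `create_infer_request()` method. The helper name `create_request_pool` is hypothetical and used only for illustration:

```python
def create_request_pool(compiled_model):
    """Create as many infer requests as the compiled model reports as optimal.

    Assumption: `compiled_model` exposes get_property() and
    create_infer_request(); the helper itself is illustrative only.
    """
    num_requests = compiled_model.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
    # One request object per slot, so that all of them can be in flight at once.
    return [compiled_model.create_infer_request() for _ in range(num_requests)]
```

All requests in the returned list are meant to be started asynchronously and kept busy; creating them without running them in parallel would leave the batch under-filled.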
Optimizing Performance by Limiting Batch Size¶
If not enough inputs were collected, the timeout value makes the transparent execution fall back to the execution of individual requests. This value can be configured via the AUTO_BATCH_TIMEOUT property.
The timeout adds to the execution time of the requests and can heavily penalize performance. To avoid this when the available parallelism (the “parallel slack”) is bounded, provide OpenVINO with an additional hint.
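The effect of the timeout can be illustrated with a toy model. The sketch below is not the OpenVINO implementation; it assumes a simplified collector that opens a window when the first request arrives and either dispatches a full batch or, once the timeout expires, falls back to individual execution. The function name and the returned tuple layout are hypothetical:

```python
def simulate_collection(arrivals_ms, batch_size, timeout_ms):
    """Toy model of request collection with a timeout (illustrative only).

    arrivals_ms: sorted arrival times of requests, in milliseconds.
    Returns a list of (kind, request_indices, dispatch_time_ms) tuples,
    where kind is "batched" or "individual".
    """
    groups = []
    i = 0
    n = len(arrivals_ms)
    while i < n:
        # The window opens with the first pending request.
        deadline = arrivals_ms[i] + timeout_ms
        j = i
        while j < n and j - i < batch_size and arrivals_ms[j] <= deadline:
            j += 1
        collected = list(range(i, j))
        if len(collected) == batch_size:
            # Full batch: dispatched as soon as the last request arrives.
            groups.append(("batched", collected, arrivals_ms[j - 1]))
        else:
            # Not enough inputs: the requests wait for the whole timeout first.
            groups.append(("individual", collected, deadline))
        i = j
    return groups
```

In this model, a request that arrives alone is dispatched only after the full timeout, which is the penalty described above.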
For example, when the application processes only 4 video streams, there is no need to use a batch larger than 4. The most future-proof way to communicate the limitations on the parallelism is to equip the performance hint with the optional ov::hint::num_requests configuration key set to 4. This limits the batch size for the GPU and the number of inference streams for the CPU, because each device uses ov::hint::num_requests when converting the hint to the actual device configuration options:
// limiting the available parallel slack for the 'throughput' hint via the ov::hint::num_requests
// so that certain parameters (like selected batch size) are automatically accommodated accordingly
auto compiled_model = core.compile_model(model, "GPU",
ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
ov::hint::num_requests(4));
# limiting the available parallel slack for the 'throughput' hint via PERFORMANCE_HINT_NUM_REQUESTS
# so that certain parameters (like selected batch size) are automatically accommodated accordingly
config = {"PERFORMANCE_HINT": "THROUGHPUT",
          "PERFORMANCE_HINT_NUM_REQUESTS": "4"}
compiled_model = core.compile_model(model, "GPU", config)
For the explicit usage, you can limit the batch size by using BATCH:GPU(4), where 4 is the number of requests running in parallel.
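When the number of parallel inputs is only known at run time (for example, the number of opened video streams), the configuration dictionary can be derived from it. A minimal sketch using the same two keys as the Python snippet above; the helper name `throughput_config` is hypothetical:

```python
def throughput_config(num_parallel_inputs):
    """Build a throughput-hint config bounded by the available parallelism.

    Illustrative helper: only the two dictionary keys come from the
    documentation, the function itself is an assumption.
    """
    if num_parallel_inputs < 1:
        raise ValueError("num_parallel_inputs must be at least 1")
    return {
        "PERFORMANCE_HINT": "THROUGHPUT",
        # Passed as a string, as in the snippet above.
        "PERFORMANCE_HINT_NUM_REQUESTS": str(num_parallel_inputs),
    }
```

The result would be passed as the third argument of `core.compile_model(model, "GPU", config)`.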
Automatic Batching as an explicit device¶
The examples below show how Auto-batching can be used as a device that the user applies directly to perform inference:
./benchmark_app -m <model> -d "BATCH:GPU"
./benchmark_app -m <model> -d "BATCH:GPU(16)"
./benchmark_app -m <model> -d "BATCH:CPU(16)"
BATCH – loads the BATCH device explicitly.
:GPU(16) – the BATCH device configuration, which tells the BATCH device to use the GPU device with batch size = 16.
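To make the two parts of the device string concrete, the sketch below splits a name such as BATCH:GPU(16) into the underlying device and the optional batch size. The grammar is inferred only from the examples in this section and is an assumption; it is not the parser OpenVINO uses, and `parse_batch_device` is a hypothetical helper:

```python
import re

# Assumed shape: "BATCH:<device>" optionally followed by "(<batch size>)".
_BATCH_DEVICE_RE = re.compile(r"^BATCH:(?P<device>[A-Za-z0-9_.]+)(?:\((?P<size>\d+)\))?$")


def parse_batch_device(name):
    """Return (device, batch_size); batch_size is None when it is not given."""
    match = _BATCH_DEVICE_RE.match(name)
    if match is None:
        raise ValueError(f"not an explicit BATCH device string: {name!r}")
    size = match.group("size")
    return match.group("device"), (int(size) if size is not None else None)
```

A missing batch size (plain BATCH:GPU) corresponds to the case where the batch size is selected automatically.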
Automatic Batching as underlying device configured to other devices¶
In the following examples, the BATCH device is configured implicitly underneath another device when the tput/ctput mode is used.
./benchmark_app -m <model> -d GPU -hint tput
./benchmark_app -m <model> -d AUTO -hint tput
./benchmark_app -m <model> -d AUTO -hint ctput
./benchmark_app -m <model> -d AUTO:GPU -hint ctput
Note
If you run ./benchmark_app, do not set batch_size by -b <batch_size>, otherwise AUTO mode will not be applied.
Other Performance Considerations¶
To achieve the best performance with Automatic Batching, the application should:
Operate inference requests of the number that represents the multiple of the batch size. In the example from Optimizing Performance by Limiting Batch Size section – for batch size 4, the application should operate 4, 8, 12, 16, etc. requests.
Use the requests that are grouped by the batch size together. For example, the first 4 requests are inferred, while the second group of the requests is being populated. Essentially, Automatic Batching shifts the asynchronicity from the individual requests to the groups of requests that constitute the batches.
Balance the timeout value vs. the batch size. For example, in many cases, having a smaller timeout value/batch size may yield better performance than having a larger batch size with a timeout value that is not large enough to accommodate the full number of the required requests.
When Automatic Batching is enabled, the timeout property of ov::CompiledModel can be changed anytime, even after the loading/compilation of the model. For example, setting the value to 0 effectively disables Auto-batching, as the collection of requests would be omitted.
Carefully apply Auto-batching to the pipelines. For example, in the conventional “video-sources -> detection -> classification” flow, it is most beneficial to do Auto-batching over the inputs to the detection stage. The resulting number of detections usually fluctuates, which makes Auto-batching less applicable for the classification stage.
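The first two recommendations come down to simple arithmetic, sketched below: round the number of requests up to a multiple of the batch size, then treat the requests in batch-sized groups. Both helper names are hypothetical, and the batch size is assumed to be known (for example, 4 as in the earlier section):

```python
def round_up_to_batch_multiple(num_requests, batch_size):
    """Smallest multiple of batch_size that is >= num_requests."""
    if num_requests < 1 or batch_size < 1:
        raise ValueError("num_requests and batch_size must be positive")
    return -(-num_requests // batch_size) * batch_size  # ceiling division


def group_by_batch(requests, batch_size):
    """Split a list of requests into consecutive groups of batch_size."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]
```

With a batch size of 4, an application that planned for 10 requests would operate 12, and one group of 4 can be inferred while the next group is being populated.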
Limitations¶
The following are limitations of the current AUTO Batching implementations:
Dynamic models are not supported by the BATCH device.
The BATCH device supports only the tput/ctput mode. The latency/none mode is not supported.
Only models with batch dimension = 1 are supported.
The input/output tensors should come from inferRequest; otherwise, the user-created tensor will trigger a memory copy.
The OPTIMAL_BATCH_SIZE should be greater than 2. If it is not, the user needs to specify a batch size, which depends on the model and device (CPU does not support this property).
The BATCH device supports GPU by default, while CPU will not trigger auto_batch in tput mode.
AUTO_BATCH will bring much more compilation latency.
Although it is less critical for the throughput-oriented scenarios, the load time with Auto-batching increases by almost double.
Certain networks are not safely reshapable by the “batching” dimension (specified as N in the layout terms). Besides, if the batching dimension is not zeroth, Auto-batching will not be triggered “implicitly” by the throughput hint.
The “explicit” notion, for example, BATCH:GPU, using the relaxed dimensions tracking, often makes Auto-batching possible. For example, this method unlocks most detection networks.
When forcing Auto-batching via the “explicit” device notion, make sure that you validate the results for correctness.
Performance improvements happen at the cost of the growth of memory footprint. However, Auto-batching queries the available memory (especially for dGPU) and limits the selected batch size accordingly.
Testing Performance with Benchmark_app¶
The benchmark_app sample, which has both C++ and Python versions, is the best way to evaluate the performance of Automatic Batching:
The most straightforward way is using the performance hints:
benchmark_app -hint tput -d GPU -m 'path to your favorite model'
You can also use the “explicit” device notion to override the strict rules of the implicit reshaping by the batch dimension:
benchmark_app -hint none -d BATCH:GPU -m 'path to your favorite model'
or override the automatically deduced batch size as well:
benchmark_app -hint none -d BATCH:GPU(16) -m 'path to your favorite model'
This example also applies to CPU or any other device that generally supports batch execution.
Keep in mind that some shell versions (e.g. bash) may require adding quotes around complex device names, i.e. -d "BATCH:GPU(16)" in this example.
Note that Benchmark_app performs a warm-up run of a single request. As Auto-Batching requires significantly more requests to execute in batch, this warm-up run hits the default timeout value (1000 ms), as reported in the following example:
[ INFO ] First inference took 1000.18ms
This value is also exposed in the final execution statistics on benchmark_app exit:
[ INFO ] Latency:
[ INFO ] Max: 1000.18 ms
This is NOT the actual latency of the batched execution, so it is recommended to refer to other metrics in the same log, for example, “Median” or “Average” execution.
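When the log is post-processed by a script, the same advice can be applied programmatically. The sketch below assumes that the other latency metrics are printed in the same "[ INFO ] <Name>: <value> ms" shape as the Max line shown above; that format, and the exact metric names "Median" and "Average", are assumptions and should be checked against a real log. Both function names are hypothetical:

```python
import re

# Assumed line shape, modelled on the "[ INFO ] Max: 1000.18 ms" line.
_METRIC_RE = re.compile(r"^\[ INFO \]\s+(\w+):\s+([0-9.]+)\s*ms\s*$")


def latency_metrics(log_text):
    """Collect 'Name: value ms' metrics from log text into a dictionary."""
    metrics = {}
    for line in log_text.splitlines():
        match = _METRIC_RE.match(line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


def representative_latency(metrics):
    """Prefer Median, then Average; never fall back to Max (warm-up timeout)."""
    for name in ("Median", "Average"):
        if name in metrics:
            return metrics[name]
    return None
```

Returning None instead of the Max value is a deliberate choice here, since Max may only reflect the warm-up request hitting the timeout.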