Deployment Optimization Guide Additional Configurations

To optimize your performance results during the runtime (deployment) step, you can experiment with:

  • multi-socket CPUs

  • threading

  • basic interoperability with other APIs

Best Latency on Multi-Socket CPUs

Note that when latency is of concern, there are additional tips for multi-socket systems. When the input is limited to a single image, the only way to achieve the best latency is to limit execution to a single socket: a single image is simply not enough to saturate more than one socket, and NUMA overheads might dominate the execution time. Below is an example command line that limits execution to a single socket using numactl for the best latency value (assuming a machine with 28 physical cores per socket):

$ numactl -m 0 --physcpubind 0-27  benchmark_app -m <model.xml> -api sync -nthreads 28

Note that if you have more than one input, running as many inference requests as you have NUMA nodes (or sockets) usually gives the same best latency as a single request on a single socket, but much higher throughput. Assuming a machine with two NUMA nodes:

$ benchmark_app -m <model.xml> -nstreams 2

The number of NUMA nodes on the machine can be queried via lscpu. For more details on NUMA support, see the Optimization Guide.
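
If your application manages inference requests itself (rather than relying on benchmark_app), the same multi-stream behavior can be requested through the CPU plugin configuration. Below is a minimal sketch that assumes a placeholder model path and omits input setup; KEY_CPU_THROUGHPUT_STREAMS accepts either an explicit number of streams or the special CPU_THROUGHPUT_NUMA value, which lets the plugin pick the number of streams based on the NUMA topology:

#include <inference_engine.hpp>

// Request CPU throughput streams aligned with the NUMA topology ("model.xml" is a placeholder)
InferenceEngine::Core ie;
InferenceEngine::CNNNetwork network = ie.ReadNetwork("model.xml");
InferenceEngine::ExecutableNetwork exeNetwork = ie.LoadNetwork(network, "CPU",
    {{InferenceEngine::PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS,
      InferenceEngine::PluginConfigParams::CPU_THROUGHPUT_NUMA}});
// Create as many inference requests as there are streams and run them asynchronously
InferenceEngine::InferRequest req0 = exeNetwork.CreateInferRequest();
InferenceEngine::InferRequest req1 = exeNetwork.CreateInferRequest();
// ... fill the input blobs of both requests here ...
req0.StartAsync();
req1.StartAsync();
req0.Wait(InferenceEngine::IInferRequest::WaitMode::RESULT_READY);
req1.Wait(InferenceEngine::IInferRequest::WaitMode::RESULT_READY);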

Threading

  • As explained in the CPU Checklist section, by default the Inference Engine uses Intel TBB as a parallel engine. Thus, any OpenVINO-internal threading (including CPU inference) uses the same thread pool, provided by TBB. However, there are also other threads in your application, so oversubscription is possible at the application level:

The rule of thumb is that you should try to have the overall number of active threads in your application equal to the number of cores in your machine. Keep in mind the spare core(s) that the OpenCL driver under the GPU plugin might also need.

  • One specific workaround to limit the number of threads for the Inference Engine is to use the CPU configuration options (see the sketch after this list).

  • To avoid further oversubscription, use the same threading model in all modules/libraries that your application uses. Notice that third-party components might bring their own threading: for example, the Inference Engine, which is compiled with TBB by default, might lead to performance troubles when mixed in the same application with another computationally intensive library compiled with OpenMP. You can try to compile the open-source version of the Inference Engine with OpenMP as well, but notice that, in general, TBB offers much better composability than other threading solutions.

  • If your code (or third-party libraries) uses GNU OpenMP, the Intel OpenMP (if you have recompiled the Inference Engine with it) must be initialized first. This can be achieved by linking your application with the Intel OpenMP instead of GNU OpenMP, or by using LD_PRELOAD on Linux* OS.
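
Returning to the CPU configuration options mentioned above, below is a minimal sketch of limiting the number of threads that the CPU plugin (and thereby its TBB pool) may use; the value of 4 is only an illustration, and "model.xml" is a placeholder for your own model:

// Limit the number of threads the CPU plugin may use for inference (illustrative value)
InferenceEngine::Core ie;
InferenceEngine::CNNNetwork network = ie.ReadNetwork("model.xml");
InferenceEngine::ExecutableNetwork exeNetwork = ie.LoadNetwork(network, "CPU",
    {{InferenceEngine::PluginConfigParams::KEY_CPU_THREADS_NUM, "4"}});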

Basic Interoperability with Other APIs

The general approach for sharing data between Inference Engine and media/graphics APIs like Intel Media Server Studio (Intel MSS) is based on sharing the system memory. That is, in your code, you should map or copy the data from the API to the CPU address space first.

For Intel MSS, it is recommended to perform viable pre-processing, for example crop/resize, and then convert to RGB again with the Video Processing Procedures (VPP). Then lock the result and create an Inference Engine blob on top of it. The resulting pointer can be used with SetBlob:

//Lock Intel MSS surface
mfxFrameSurface1 *frame_in;   //Input MSS surface.
mfxFrameAllocator* pAlloc = &m_mfxCore.FrameAllocator();
pAlloc->Lock(pAlloc->pthis, frame_in->Data.MemId, &frame_in->Data);
//Inference Engine code

WARNING: The InferenceEngine::NHWC layout is not supported natively by most Inference Engine plugins, so an internal conversion might happen.

// Note: the dims are specified in the NCHW order; the NHWC layout in the TensorDesc
// below describes the actual (interleaved) memory arrangement of the surface
InferenceEngine::SizeVector dims_src = {
    1                                 /* batch, N */,
    3                                 /* channels, C */,
    (size_t) frame_in->Info.Height    /* height, H */,
    (size_t) frame_in->Info.Width     /* width, W */,
    };
InferenceEngine::TensorDesc desc(InferenceEngine::Precision::U8, dims_src, InferenceEngine::NHWC);
/* Wrap the surface data; as RGB is interleaved, only the pointer to the R component needs to be passed. Note that this would not work for planar formats, as those are three separate planes/pointers */
InferenceEngine::TBlob<uint8_t>::Ptr p = InferenceEngine::make_shared_blob<uint8_t>( desc, (uint8_t*) frame_in->Data.R);
inferRequest.SetBlob("input", p);
inferRequest.Infer();
//Make sure to unlock the surface upon inference completion, to return the ownership back to the Intel MSS
pAlloc->Unlock(pAlloc->pthis, frame_in->Data.MemId, &frame_in->Data);

Alternatively, you can use the RGBP (planar RGB) output from Intel MSS. This allows wrapping the (locked) result as a regular NCHW blob, which is generally friendly for most plugins (unlike NHWC). Then you can use it with SetBlob just like in the previous example:

InferenceEngine::SizeVector dims_src = {
       1     /* batch, N*/,
       3     /*Channels,*/,
       (size_t) frame_in->Info.Height  /* Height */,
       (size_t) frame_in->Info.Width    /* Width */,
       };
InferenceEngine::TensorDesc desc(InferenceEngine::Precision::U8, dims_src, InferenceEngine::NCHW);
/* wrapping the RGBP surface data*/
InferenceEngine::TBlob<uint8_t>::Ptr p = InferenceEngine::make_shared_blob<uint8_t>( desc, (uint8_t*) frame_in->Data.R);
inferRequest.SetBlob("input", p);
// …

The only downside of this approach is that the VPP conversion to RGBP is not hardware accelerated (it is performed on the GPU EUs). Also, it is available only on Linux.

OpenCV* Interoperability Example

Unlike APIs that use dedicated address space and/or special data layouts (for instance, compressed OpenGL* textures), regular OpenCV data objects like cv::Mat reside in conventional system memory. That is, the memory can actually be shared with the Inference Engine, and only the data ownership needs to be transferred.

Again, if the OpenCV and Inference Engine layouts match, the data can be wrapped as an Inference Engine (input/output) blob. Notice that by default, the Inference Engine accepts planar (not interleaved) inputs in NCHW, so NHWC (which is exactly the interleaved layout) should be specified explicitly:

WARNING: The InferenceEngine::NHWC layout is not supported natively by most Inference Engine plugins, so an internal conversion might happen.

cv::Mat frame(cv::Size(100, 100), CV_8UC3);  // regular CV_8UC3 image, interleaved
// creating blob that wraps the OpenCV’s Mat
// (the data it points to must persist until the blob is released):
// Note: the dims are specified in the NCHW order; the NHWC layout in the TensorDesc
// below describes the actual (interleaved) memory arrangement of the cv::Mat
InferenceEngine::SizeVector dims_src = {
    1                            /* batch, N */,
    (size_t)frame.channels()     /* channels, C */,
    (size_t)frame.rows           /* height, H */,
    (size_t)frame.cols           /* width, W */,
    };
InferenceEngine::TensorDesc desc(InferenceEngine::Precision::U8, dims_src, InferenceEngine::NHWC);
InferenceEngine::TBlob<uint8_t>::Ptr p = InferenceEngine::make_shared_blob<uint8_t>( desc, (uint8_t*)frame.data, frame.step[0] * frame.rows);
inferRequest.SetBlob("input", p);
inferRequest.Infer();
// …
// similarly, you can wrap the output tensor (let’s assume it is FP32)
// notice that the output should also be explicitly set to NHWC via setLayout (see below)
auto output_blob = inferRequest.GetBlob("output");
const float* output_data = output_blob->buffer().as<float*>();
auto dims = output_blob->getTensorDesc().getDims();
cv::Mat res (dims[2], dims[3], CV_32FC3, (void *)output_data);
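
The snippet above assumes that the output was configured as NHWC (and FP32) before the network was loaded. One possible way to request that, reusing the output name "output" from the example, is through the CNNNetwork outputs info; this is a sketch, assuming network is your CNNNetwork object and that the calls are made before LoadNetwork:

// Request an interleaved FP32 output, so that the resulting blob can be
// wrapped directly into an interleaved cv::Mat
InferenceEngine::OutputsDataMap outputsInfo = network.getOutputsInfo();
outputsInfo["output"]->setPrecision(InferenceEngine::Precision::FP32);
outputsInfo["output"]->setLayout(InferenceEngine::Layout::NHWC);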

Notice that the original cv::Mat / blobs cannot be used simultaneously by the application and the Inference Engine. Alternatively, the data that the pointer references can be copied to unlock the original data and return ownership to the original API.
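
For example, with the OpenCV wrapper above, a deep copy detaches the result from the Inference Engine output buffer, after which the blob (or the underlying surface) can be safely released:

cv::Mat res_copy = res.clone();   // deep copy; res_copy remains valid after the blob is released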

To learn more about optimizations during the development step, visit the Deployment Optimization Guide page.