Remote Tensor API of NPU Plugin#
The NPU plugin supports memory sharing between OpenVINO and native APIs such as OpenCL, Vulkan, or DirectX 12.
It implements the ov::RemoteContext and ov::RemoteTensor interfaces, which provide mechanisms for efficient memory sharing.
On Windows, shared memory is represented by an NT handle; on Linux, by a DMA-BUF system heap file descriptor. You share this
memory by passing the handle as the shared_buffer argument to the create_tensor(..., shared_buffer, ...) function.
Alternatively, memory can be imported by mapping a file into memory or by using a CPU virtual address allocation. These methods
help avoid memory copy overhead when plugging OpenVINO inference into an existing NPU pipeline.
The Remote Tensor API supports the following scenario:
The NPU plugin context and memory objects can be constructed from low-level device, display, or memory handles and used to create the OpenVINO™
ov::CompiledModel or ov::Tensor objects.
Class and function declarations for the API are defined in the following file: src/inference/include/openvino/runtime/intel_npu/level_zero/level_zero.hpp
The most common way to enable the interaction of your application with the Remote Tensor API is to use user-side utility classes and functions that consume or produce native handles directly.
Context Sharing Between Application and NPU Plugin#
NPU plugin classes that implement the ov::RemoteContext interface are responsible for context sharing.
Obtaining a context object is the first step in sharing pipeline objects.
The context object of the NPU plugin directly wraps the Level Zero context, setting the scope for sharing
ov::RemoteTensor objects. The ov::RemoteContext object is retrieved from the NPU plugin.
Once you have obtained the context, you can use it to create the ov::RemoteTensor objects.
Getting RemoteContext from the Plugin#
To request the current default context of the plugin, use one of the following methods:
// Get the default RemoteContext from ov::Core
auto npu_context = core.get_default_context("NPU").as<ov::intel_npu::level_zero::ZeroContext>();
// Extract the raw Level Zero context handle from the RemoteContext
void* context_handle = npu_context.get();

// Get the RemoteContext from an existing ov::CompiledModel
auto npu_context = compiled_model.get_context().as<ov::intel_npu::level_zero::ZeroContext>();
// Extract the raw Level Zero context handle from the RemoteContext
void* context_handle = npu_context.get();
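If you need to call native Level Zero APIs on the shared context, the raw handle can be cast back to the Level Zero type. A minimal sketch, assuming the handle returned by get() is the underlying ze_context_handle_t and the application links against the Level Zero loader (ze_api.h); buffer_size is an illustrative placeholder:

#include <ze_api.h>

// Assumption: the void* returned by ZeroContext::get() is the underlying
// ze_context_handle_t
auto ze_ctx = static_cast<ze_context_handle_t>(context_handle);

// Example: allocate host-visible memory on the shared Level Zero context
ze_host_mem_alloc_desc_t host_desc = {ZE_STRUCTURE_TYPE_HOST_MEM_ALLOC_DESC, nullptr, 0};
void* host_ptr = nullptr;
ze_result_t result = zeMemAllocHost(ze_ctx, &host_desc, buffer_size, 4096, &host_ptr);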
Memory Sharing Between Application and NPU Plugin#
The classes that implement the ov::RemoteTensor interface are wrappers for native API
memory handles, which can be obtained from them at any time.
To create a shared tensor from a native memory handle or a file, use the dedicated create_tensor, create_l0_host_tensor, or create_host_tensor
methods of the ov::RemoteContext sub-classes.
ov::intel_npu::level_zero::ZeroContext provides multiple create_tensor overloads, which either wrap pre-allocated native handles in an
ov::RemoteTensor object or request the plugin to allocate specific device memory.
For more details, see the code snippets below:
// Create a remote tensor backed by a file mapped into memory
ov::intel_npu::FileDescriptor file_descriptor{"file_name"};
auto remote_tensor = npu_context.create_tensor(in_element_type, in_shape, file_descriptor);

// Wrap a pre-allocated CPU virtual address allocation (see the alignment
// and lifetime requirements in the Limitations section below)
void* standard_allocation = nullptr;
ov::intel_npu::MemType memory_type = ov::intel_npu::MemType::CPU_VA;
auto remote_tensor = npu_context.create_tensor(in_element_type, in_shape, standard_allocation, memory_type);

// Wrap a shared buffer identified by an NT handle (Windows)
void* shared_buffer = nullptr;
ov::intel_npu::MemType memory_type = ov::intel_npu::MemType::SHARED_BUF;
auto remote_tensor = npu_context.create_tensor(in_element_type, in_shape, shared_buffer, memory_type);

// Wrap a shared buffer identified by a DMA-BUF system heap file descriptor (Linux)
int32_t fd_heap = 0; // create the DMA-BUF system heap file descriptor
ov::intel_npu::MemType memory_type = ov::intel_npu::MemType::SHARED_BUF;
auto remote_tensor = npu_context.create_tensor(in_element_type, in_shape, fd_heap, memory_type);

// Ask the plugin to allocate Level Zero host memory wrapped in a remote tensor
auto remote_tensor = npu_context.create_l0_host_tensor(in_element_type, in_shape);
// Extract the raw Level Zero pointer from the remote tensor
void* level_zero_ptr = remote_tensor.get();

// Ask the plugin to allocate Level Zero host memory exposed as a regular ov::Tensor
auto tensor = npu_context.create_host_tensor(in_element_type, in_shape);
// Extract the raw Level Zero pointer from the tensor
void* level_zero_ptr = tensor.data();
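Once created, a remote tensor is used like any other ov::Tensor. A minimal sketch, assuming a compiled_model obtained for the "NPU" device and a remote_tensor created with one of the methods above:

// Bind the shared memory to the first model input and run inference
auto infer_request = compiled_model.create_infer_request();
infer_request.set_input_tensor(remote_tensor);
infer_request.infer();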
Limitations#
The NPU plugin does not support methods for direct allocation of native handles.
Warning
CPU Virtual Address Allocation Requirements
When using CPU virtual address allocations, you must comply with the following requirements to prevent memory corruption and crashes:
1. Memory Alignment (Mandatory): Both the allocation pointer and its size must be aligned to the standard page size (4 KB). Non-aligned allocations will be rejected.
2. Allocation Lifetime (Critical): The allocation must remain valid until ALL of the following have occurred:
All inference requests using this remote tensor have completed execution, AND
All inference requests using this remote tensor have been destroyed, AND
The remote tensor has been destroyed.
Failure to maintain the allocation for the entire lifecycle will result in undefined behavior and potential crashes.
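A minimal sketch of a compliant allocation, assuming a POSIX-like platform where std::aligned_alloc is available; in_shape_byte_size is a placeholder for the real tensor size:

#include <cstdlib>

constexpr size_t page_size = 4096;

// Round the requested size up to a multiple of the page size so that
// both the pointer and the size satisfy the 4 KB alignment requirement
size_t tensor_bytes = in_shape_byte_size; // placeholder
size_t aligned_size = (tensor_bytes + page_size - 1) & ~(page_size - 1);
void* standard_allocation = std::aligned_alloc(page_size, aligned_size);

auto remote_tensor = npu_context.create_tensor(in_element_type, in_shape,
                                               standard_allocation,
                                               ov::intel_npu::MemType::CPU_VA);

// ... run and destroy all inference requests that use remote_tensor,
// then destroy remote_tensor itself, and only then free the allocation:
std::free(standard_allocation);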
Low-Level Methods for RemoteContext and RemoteTensor Creation#
The high-level wrappers mentioned above bring a direct dependency on native APIs to your program.
If you want to avoid that dependency, you can still use the ov::Core::create_context(),
ov::RemoteContext::create_tensor(), and ov::RemoteContext::get_params() methods directly.
On this level, native handles are re-interpreted as void pointers and all arguments are passed
using ov::AnyMap containers filled with std::string, ov::Any pairs.
Two types of map entries are possible: a descriptor and a container.
The descriptor sets the expected structure and possible parameter values of the map.
For possible low-level properties and their description, refer to the header file: remote_properties.hpp.
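For illustration, a minimal sketch of the low-level path, assuming the property names declared in remote_properties.hpp (e.g. ov::intel_npu::mem_type and ov::intel_npu::mem_handle; verify them against your OpenVINO version) and a pre-allocated shared_buffer handle:

#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_npu/remote_properties.hpp>

ov::Core core;

// Obtain the default context through the generic API
ov::RemoteContext context = core.get_default_context("NPU");

// Inspect the parameters describing the underlying Level Zero context
ov::AnyMap context_params = context.get_params();

// Wrap a pre-allocated shared buffer without the high-level wrappers;
// property names are taken from remote_properties.hpp (assumption)
ov::AnyMap tensor_params = {
    {ov::intel_npu::mem_type.name(), ov::intel_npu::MemType::SHARED_BUF},
    {ov::intel_npu::mem_handle.name(), shared_buffer}};
ov::RemoteTensor remote_tensor = context.create_tensor(in_element_type, in_shape, tensor_params);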