Executable Network

ExecutableNetwork class functionality:

  • Compile an InferenceEngine::ICNNNetwork instance to a backend specific graph representation

  • Create an arbitrary number of InferRequest objects

  • Hold some common resources shared between different instances of InferRequest. For example:

    • InferenceEngine::IExecutableNetworkInternal::_taskExecutor task executor to implement asynchronous execution

    • InferenceEngine::IExecutableNetworkInternal::_callbackExecutor task executor to run an asynchronous inference request callback in a separate thread

Class

The Inference Engine Plugin API provides the helper InferenceEngine::ExecutableNetworkThreadSafeDefault class, which is recommended as a base class for an executable network. Based on it, a declaration of an executable network class can look as follows:

class ExecutableNetwork : public InferenceEngine::ExecutableNetworkThreadSafeDefault {
public:
    ExecutableNetwork(const std::shared_ptr<const ngraph::Function>& function,
                      const InferenceEngine::InputsDataMap& inputInfoMap,
                      const InferenceEngine::OutputsDataMap& outputsInfoMap,
                      const Configuration& cfg,
                      const std::shared_ptr<Plugin>& plugin);

    ExecutableNetwork(std::istream& model, const Configuration& cfg, const std::shared_ptr<Plugin>& plugin);

    // Methods from a base class ExecutableNetworkThreadSafeDefault

    void Export(std::ostream& model) override;
    InferenceEngine::IInferRequestInternal::Ptr CreateInferRequestImpl(
        InferenceEngine::InputsDataMap networkInputs,
        InferenceEngine::OutputsDataMap networkOutputs) override;
    InferenceEngine::IInferRequestInternal::Ptr CreateInferRequestImpl(
        const std::vector<std::shared_ptr<const ov::Node>>& inputs,
        const std::vector<std::shared_ptr<const ov::Node>>& outputs) override;
    InferenceEngine::IInferRequestInternal::Ptr CreateInferRequest() override;
    InferenceEngine::Parameter GetMetric(const std::string& name) const override;
    InferenceEngine::Parameter GetConfig(const std::string& name) const override;

private:
    friend class TemplateInferRequest;
    friend class Plugin;

    void CompileNetwork(const std::shared_ptr<const ngraph::Function>& function,
                        const InferenceEngine::InputsDataMap& inputInfoMap,
                        const InferenceEngine::OutputsDataMap& outputsInfoMap);
    void InitExecutor();

    std::atomic<std::size_t> _requestId = {0};
    Configuration _cfg;
    std::shared_ptr<Plugin> _plugin;
    std::shared_ptr<ngraph::Function> _function;
    std::map<std::string, std::size_t> _inputIndex;
    std::map<std::string, std::size_t> _outputIndex;
};

Class Fields

The example class has several fields:

  • _requestId - Tracks the number of created inference requests. It is used to distinguish inference requests during profiling via the Intel® Instrumentation and Tracing Technology (ITT) library.

  • _cfg - Stores the configuration the executable network was compiled with.

  • _plugin - Refers to the plugin instance.

  • _function - Keeps a reference to the transformed ngraph::Function, which is used in ngraph reference backend computations. For other backends with a backend specific graph representation, _function has a different type and represents a backend specific graph or just a set of computational kernels to perform inference.

  • _inputIndex - Maps the name of an input to its index among all network inputs.

  • _outputIndex - Maps the name of an output to its index among all network outputs.
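The role of _requestId can be illustrated with a minimal sketch that does not depend on the Inference Engine API. The names RequestFactory, NextName, and CollectNames are hypothetical and exist only for this example. The sketch assumes that request names are derived from the counter value.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for an executable network that hands out request names.
class RequestFactory {
public:
    // fetch_add returns the previous value, so every caller gets a distinct id
    // even when several threads create requests at the same time.
    std::string NextName() {
        return "InferRequest#" + std::to_string(_requestId.fetch_add(1));
    }

private:
    std::atomic<std::size_t> _requestId = {0};
};

// Creates requests from several threads and returns the set of generated names.
std::set<std::string> CollectNames(std::size_t threads, std::size_t perThread) {
    RequestFactory factory;
    std::vector<std::vector<std::string>> local(threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t) {
        workers.emplace_back([&factory, &local, t, perThread] {
            for (std::size_t i = 0; i < perThread; ++i)
                local[t].push_back(factory.NextName());
        });
    }
    for (auto& worker : workers)
        worker.join();

    std::set<std::string> names;
    for (const auto& chunk : local)
        names.insert(chunk.begin(), chunk.end());
    return names;
}
```

Calling CollectNames(4, 100) yields 400 distinct names, because no two calls to fetch_add observe the same counter value.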

Constructor with ICNNNetwork

This constructor accepts a generic representation of a neural network as an InferenceEngine::ICNNNetwork reference and compiles it into a backend specific device graph:

TemplatePlugin::ExecutableNetwork::ExecutableNetwork(const std::shared_ptr<const ngraph::Function>& function,
                                                     const InferenceEngine::InputsDataMap& inputInfoMap,
                                                     const InferenceEngine::OutputsDataMap& outputsInfoMap,
                                                     const Configuration& cfg,
                                                     const Plugin::Ptr& plugin)
    : InferenceEngine::ExecutableNetworkThreadSafeDefault(nullptr, nullptr),  // Disable default threads creation
      _cfg(cfg),
      _plugin(plugin) {
    // TODO: if your plugin supports device ID (more than one device instance can be present on the host machine),
    // select the proper device based on KEY_DEVICE_ID or automatic behavior.
    // In this case, _waitExecutor should also be created per device.
    try {
        CompileNetwork(function, inputInfoMap, outputsInfoMap);
        InitExecutor();  // creates a thread-based executor used for async requests
    } catch (const InferenceEngine::Exception&) {
        throw;
    } catch (const std::exception& e) {
        IE_THROW(Unexpected) << "Standard exception from compilation library: " << e.what();
    } catch (...) {
        IE_THROW(Unexpected) << "Generic exception is thrown";
    }
}
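The try/catch chain in the constructor converts every failure into a single exception family. The following minimal sketch shows the same pattern in isolation. PluginError, GuardedCompile, and MessageOf are hypothetical names, and PluginError only stands in for InferenceEngine::Exception.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical plugin-level exception type standing in for InferenceEngine::Exception.
struct PluginError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Runs `compile` and converts every failure into PluginError.
void GuardedCompile(const std::function<void()>& compile) {
    try {
        compile();
    } catch (const PluginError&) {
        throw;  // already the expected type: pass it through unchanged
    } catch (const std::exception& e) {
        throw PluginError(std::string("Standard exception from compilation library: ") + e.what());
    } catch (...) {
        throw PluginError("Generic exception is thrown");
    }
}

// Helper for demonstration: returns the message seen by the caller, or "" on success.
std::string MessageOf(const std::function<void()>& compile) {
    try {
        GuardedCompile(compile);
    } catch (const PluginError& e) {
        return e.what();
    }
    return "";
}
```

The order of the handlers matters in this sketch: PluginError derives from std::exception, so its handler must come first, otherwise the message of an already translated error would be wrapped a second time.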

The implementation of CompileNetwork is fully device-specific.

The function accepts a const shared pointer to an ngraph::Function object and performs the following steps:

  1. Applies nGraph passes using the TransformNetwork function, which defines a plugin-specific conversion pipeline. To support low precision inference, the pipeline can include Low Precision Transformations. These transformations are usually hardware specific. You can find how to use and configure Low Precision Transformations in the Low Precision Transformations guide.

  2. Maps the transformed graph to a backend specific graph representation (for example, to CPU plugin internal graph representation).

  3. Allocates and fills memory for graph weights, backend specific memory handles and so on.

// forward declaration
void TransformNetwork(std::shared_ptr<ngraph::Function>& function,
                      const InferenceEngine::InputsDataMap& inputInfoMap,
                      const InferenceEngine::OutputsDataMap& outputsInfoMap);

void TemplatePlugin::ExecutableNetwork::CompileNetwork(const std::shared_ptr<const ngraph::Function>& function,
                                                       const InferenceEngine::InputsDataMap& inputInfoMap,
                                                       const InferenceEngine::OutputsDataMap& outputsInfoMap) {
    // TODO: perform actual graph compilation / mapping to backend graph representation / kernels

    // clone network
    _function = ngraph::clone_function(*function);

    // apply plugins transformations
    TransformNetwork(_function, inputInfoMap, outputsInfoMap);

    // Generate backend specific blob mappings. For example, Inference Engine names an inference request output
    // after the layer that feeds an ngraph::Result node, not after the friendly name of the Result node itself.
    size_t idx = 0;
    for (auto&& result : _function->get_results()) {
        const auto& input = result->input_value(0);
        auto name = ngraph::op::util::get_ie_output_name(input);
        if (_outputIndex.emplace(name, idx).second)
            idx++;
    }
    for (auto&& parameter : _function->get_parameters()) {
        _inputIndex.emplace(parameter->get_friendly_name(), _function->get_parameter_index(parameter));
    }

    // Perform any other steps like allocation and filling backend specific memory handles and so on
}

Note

After all these steps, the backend specific graph is ready to create inference requests and perform inference.
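The output index loop in CompileNetwork advances idx only when emplace inserts a new name. A minimal sketch of that idiom with plain standard containers follows. BuildIndex is a hypothetical helper, and the layer names used with it are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Builds a name -> index map; a repeated name keeps the index of its first occurrence.
std::map<std::string, std::size_t> BuildIndex(const std::vector<std::string>& names) {
    std::map<std::string, std::size_t> index;
    std::size_t idx = 0;
    for (const auto& name : names) {
        // emplace returns {iterator, inserted}; advance idx only for a new name
        if (index.emplace(name, idx).second)
            ++idx;
    }
    return index;
}
```

For the input {"conv1", "prob", "conv1"} the map holds two entries, with "conv1" at index 0 and "prob" at index 1. The indices therefore count unique names, not positions in the original sequence.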

Constructor Importing from Stream

This constructor creates a backend specific graph by importing from a stream object:

Note

The export of backend specific graph is done in the Export method, and data formats must be the same for both import and export.

TemplatePlugin::ExecutableNetwork::ExecutableNetwork(std::istream& model,
                                                     const Configuration& cfg,
                                                     const Plugin::Ptr& plugin)
    : _cfg(cfg),
      _plugin(plugin) {
    // read XML content
    std::string xmlString;
    std::uint64_t dataSize = 0;
    model.read(reinterpret_cast<char*>(&dataSize), sizeof(dataSize));
    xmlString.resize(dataSize);
    model.read(&xmlString[0], dataSize);  // write into the string's own buffer instead of casting away const

    // read blob content
    InferenceEngine::Blob::Ptr dataBlob;
    model.read(reinterpret_cast<char*>(&dataSize), sizeof(dataSize));
    if (0 != dataSize) {
        dataBlob = InferenceEngine::make_shared_blob<std::uint8_t>(
            InferenceEngine::TensorDesc(InferenceEngine::Precision::U8,
                                        {static_cast<std::size_t>(dataSize)},
                                        InferenceEngine::Layout::C));
        dataBlob->allocate();
        model.read(dataBlob->buffer(), dataSize);
    }

    auto cnnnetwork = _plugin->GetCore()->ReadNetwork(xmlString, std::move(dataBlob));

    // TODO: implement Import / Export of configuration options and merge with `cfg`
    // TODO: implement Import / Export of network precisions, layouts, preprocessing info
    InferenceEngine::InputsDataMap inputInfoMap = cnnnetwork.getInputsInfo();
    InferenceEngine::OutputsDataMap outputInfoMap = cnnnetwork.getOutputsInfo();

    setNetworkInputs(inputInfoMap);
    setNetworkOutputs(outputInfoMap);
    SetPointerToPlugin(_plugin->shared_from_this());

    try {
        // TODO: remove compilation, network is already compiled and serialized in compiled form
        CompileNetwork(cnnnetwork.getFunction(), inputInfoMap, outputInfoMap);
        InitExecutor();  // creates a thread-based executor used for async requests
    } catch (const InferenceEngine::Exception&) {
        throw;
    } catch (const std::exception& e) {
        IE_THROW(Unexpected) << "Standard exception from compilation library: " << e.what();
    } catch (...) {
        IE_THROW(Unexpected) << "Generic exception is thrown";
    }
}

The Export method should write to the model stream all the data required to import the backend specific graph later in the Plugin::Import method:

void TemplatePlugin::ExecutableNetwork::Export(std::ostream& modelStream) {
    OV_ITT_SCOPED_TASK(itt::domains::TemplatePlugin, "ExecutableNetwork::Export");

    // Note: custom ngraph extensions are not supported
    std::map<std::string, ngraph::OpSet> custom_opsets;
    std::stringstream xmlFile, binFile;
    OPENVINO_SUPPRESS_DEPRECATED_START
    ov::pass::Serialize serializer(xmlFile, binFile, custom_opsets);
    OPENVINO_SUPPRESS_DEPRECATED_END
    serializer.run_on_model(_function);

    auto m_constants = binFile.str();
    auto m_model = xmlFile.str();

    auto dataSize = static_cast<std::uint64_t>(m_model.size());
    modelStream.write(reinterpret_cast<char*>(&dataSize), sizeof(dataSize));
    modelStream.write(m_model.c_str(), dataSize);

    dataSize = static_cast<std::uint64_t>(m_constants.size());
    modelStream.write(reinterpret_cast<char*>(&dataSize), sizeof(dataSize));
    modelStream.write(reinterpret_cast<char*>(&m_constants[0]), dataSize);

    // TODO: implement network precision, layout, preprocessing info serialization
}
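Both the importing constructor and Export rely on the same layout: each section is stored as a 64-bit size followed by the raw bytes. The following minimal sketch shows that layout as a round trip through a standard stream. WriteSection and ReadSection are hypothetical helpers and are not part of the plugin API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Writes one section as a 64-bit size (host byte order) followed by the raw bytes.
void WriteSection(std::ostream& out, const std::string& data) {
    const auto size = static_cast<std::uint64_t>(data.size());
    out.write(reinterpret_cast<const char*>(&size), sizeof(size));
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
}

// Reads one section that was written by WriteSection.
std::string ReadSection(std::istream& in) {
    std::uint64_t size = 0;
    in.read(reinterpret_cast<char*>(&size), sizeof(size));
    std::string data(static_cast<std::size_t>(size), '\0');
    if (size != 0)
        in.read(&data[0], static_cast<std::streamsize>(size));
    return data;
}
```

Sections must be read in the order they were written. As in the listings above, the size is stored in host byte order, so a stream written this way is only portable between machines with the same endianness.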

The CreateInferRequest method creates an asynchronous inference request and returns it. While the public Inference Engine API has a single interface for an inference request, which can be executed in synchronous and asynchronous modes, a plugin library implementation has two separate classes:

  • Synchronous inference request, which defines pipeline stages and runs them synchronously in the Infer method.

  • Asynchronous inference request, which is a wrapper for a synchronous inference request and can run a pipeline asynchronously. Depending on the device pipeline structure, it can have one or several stages:

    • For single-stage pipelines, there is no need to define this method or to create a class derived from InferenceEngine::AsyncInferRequestThreadSafeDefault. The default implementation of this method creates an InferenceEngine::AsyncInferRequestThreadSafeDefault that wraps a synchronous inference request and runs it asynchronously in the _taskExecutor executor.

    • For pipelines with multiple stages, such as performing some preprocessing on the host, uploading input data to a device, running inference on a device, or downloading and postprocessing output data, schedule stages on several task executors to achieve better device use and performance. You can do it by creating a sufficient number of inference requests running in parallel. In this case, device stages of different inference requests overlap with preprocessing and postprocessing stages, giving better performance.

      Warning

      It is up to you to decide how many task executors you need to optimally execute a device pipeline.

      InferenceEngine::IInferRequestInternal::Ptr TemplatePlugin::ExecutableNetwork::CreateInferRequest() {
          InferenceEngine::IInferRequestInternal::Ptr internalRequest;
          if (this->_plugin && _plugin->IsNewAPI()) {
              internalRequest = CreateInferRequestImpl(_parameters, _results);
          }
          if (!internalRequest)
              internalRequest = CreateInferRequestImpl(_networkInputs, _networkOutputs);
          return std::make_shared<TemplateAsyncInferRequest>(std::static_pointer_cast<TemplateInferRequest>(internalRequest),
                                                             _taskExecutor,
                                                             _plugin->_waitExecutor,
                                                             _callbackExecutor);
      }
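The relationship between the synchronous request and its asynchronous wrapper can be sketched with the standard library alone. SyncRequest and AsyncRequest below are hypothetical classes, not the Inference Engine ones. The sketch also simplifies scheduling: all stages run in order on one background thread started by std::async, whereas a real plugin would typically assign stages to separate task executors.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Hypothetical synchronous request: each pipeline stage is a separate method.
struct SyncRequest {
    std::vector<std::string> trace;
    void Preprocess() { trace.push_back("preprocess"); }
    void RunOnDevice() { trace.push_back("device"); }
    void Postprocess() { trace.push_back("postprocess"); }
    // Synchronous mode: run all stages on the calling thread.
    void Infer() { Preprocess(); RunOnDevice(); Postprocess(); }
};

// Hypothetical asynchronous wrapper: runs the same stages in order on a background thread.
class AsyncRequest {
public:
    explicit AsyncRequest(SyncRequest& sync)
        : _stages{[&sync] { sync.Preprocess(); },
                  [&sync] { sync.RunOnDevice(); },
                  [&sync] { sync.Postprocess(); }} {}

    // Returns immediately; the stages run on a thread started by std::async.
    void StartAsync() {
        _done = std::async(std::launch::async, [this] {
            for (auto& stage : _stages)
                stage();
        });
    }

    // Blocks until the pipeline finishes and rethrows any exception from a stage.
    void Wait() {
        if (_done.valid())
            _done.get();
    }

private:
    std::vector<std::function<void()>> _stages;
    std::future<void> _done;
};
```

A caller creates a SyncRequest, wraps it in an AsyncRequest, calls StartAsync, and reads the results only after Wait returns. Keeping the stages as a list of callables is what makes it possible to hand each stage to a different executor later.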

CreateInferRequestImpl is a helper method used by CreateInferRequest to create a synchronous inference request, which is later wrapped with the asynchronous inference request class:

InferenceEngine::IInferRequestInternal::Ptr TemplatePlugin::ExecutableNetwork::CreateInferRequestImpl(
    InferenceEngine::InputsDataMap networkInputs,
    InferenceEngine::OutputsDataMap networkOutputs) {
    return std::make_shared<TemplateInferRequest>(networkInputs,
                                                  networkOutputs,
                                                  std::static_pointer_cast<ExecutableNetwork>(shared_from_this()));
}

InferenceEngine::IInferRequestInternal::Ptr TemplatePlugin::ExecutableNetwork::CreateInferRequestImpl(
    const std::vector<std::shared_ptr<const ov::Node>>& inputs,
    const std::vector<std::shared_ptr<const ov::Node>>& outputs) {
    return std::make_shared<TemplateInferRequest>(inputs,
                                                  outputs,
                                                  std::static_pointer_cast<ExecutableNetwork>(shared_from_this()));
}

The GetMetric method returns the value of the metric with the given name. A metric is static information about an executable network. Examples of metrics:

  • EXEC_NETWORK_METRIC_KEY(NETWORK_NAME) - name of an executable network

  • EXEC_NETWORK_METRIC_KEY(OPTIMAL_NUMBER_OF_INFER_REQUESTS) - heuristic to denote an optimal (or at least sub-optimal) number of inference requests needed to run asynchronously to use the current device fully

  • Any other executable network metric specific for a particular device. Such metrics and possible values must be declared in a plugin configuration public header, for example, template/template_config.hpp

InferenceEngine::Parameter TemplatePlugin::ExecutableNetwork::GetMetric(const std::string& name) const {
    // TODO: return more supported values for metrics
    if (EXEC_NETWORK_METRIC_KEY(SUPPORTED_METRICS) == name) {
        IE_SET_METRIC_RETURN(SUPPORTED_METRICS,
                             std::vector<std::string>{METRIC_KEY(NETWORK_NAME),
                                                      METRIC_KEY(SUPPORTED_METRICS),
                                                      METRIC_KEY(SUPPORTED_CONFIG_KEYS),
                                                      METRIC_KEY(OPTIMAL_NUMBER_OF_INFER_REQUESTS)});
    } else if (EXEC_NETWORK_METRIC_KEY(SUPPORTED_CONFIG_KEYS) == name) {
        std::vector<std::string> configKeys = {CONFIG_KEY(DEVICE_ID),
                                               CONFIG_KEY(PERF_COUNT),
                                               TEMPLATE_CONFIG_KEY(THROUGHPUT_STREAMS)};
        auto streamExecutorConfigKeys = InferenceEngine::IStreamsExecutor::Config{}.SupportedKeys();
        for (auto&& configKey : streamExecutorConfigKeys) {
            configKeys.emplace_back(configKey);
        }
        IE_SET_METRIC_RETURN(SUPPORTED_CONFIG_KEYS, configKeys);
    } else if (EXEC_NETWORK_METRIC_KEY(NETWORK_NAME) == name) {
        auto networkName = _function->get_friendly_name();
        IE_SET_METRIC_RETURN(NETWORK_NAME, networkName);
    } else if (EXEC_NETWORK_METRIC_KEY(OPTIMAL_NUMBER_OF_INFER_REQUESTS) == name) {
        unsigned int value = _cfg._streamsExecutorConfig._streams;
        IE_SET_METRIC_RETURN(OPTIMAL_NUMBER_OF_INFER_REQUESTS, value);
    } else {
        IE_THROW() << "Unsupported ExecutableNetwork metric: " << name;
    }
}

The IE_SET_METRIC_RETURN helper macro sets the metric value and checks that the actual metric type matches the type of the specified value.
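The idea of tying each metric name to a fixed value type can be illustrated with a small traits-based sketch. This is not the definition of IE_SET_METRIC_RETURN, which lives in the plugin API headers and may perform its check differently. The tags NetworkName and OptimalNumberOfInferRequests, the MetricType trait, and MakeMetric are hypothetical, and the sketch assumes a strict same-type check.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical metric tags; each tag fixes the value type of one metric.
struct NetworkName {};
struct OptimalNumberOfInferRequests {};

template <typename Tag>
struct MetricType;
template <>
struct MetricType<NetworkName> {
    using type = std::string;
};
template <>
struct MetricType<OptimalNumberOfInferRequests> {
    using type = unsigned int;
};

// Returns `value` as the metric's declared type; a mismatched type fails to compile.
template <typename Tag, typename T>
typename MetricType<Tag>::type MakeMetric(T&& value) {
    using Expected = typename MetricType<Tag>::type;
    static_assert(std::is_same<typename std::decay<T>::type, Expected>::value,
                  "metric value type does not match the declared metric type");
    return std::forward<T>(value);
}
```

With these definitions, MakeMetric<OptimalNumberOfInferRequests>(4u) compiles, while passing a std::string for the same tag is rejected at compile time, so a type mismatch cannot reach the caller.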

The GetConfig method returns the current value of the configuration key with the given name. The method extracts configuration values the executable network was compiled with.

InferenceEngine::Parameter TemplatePlugin::ExecutableNetwork::GetConfig(const std::string& name) const {
    return _cfg.Get(name);
}

This function is the only way to get configuration values when a network is imported and compiled by other developers and tools (for example, the Compile tool).
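A configuration lookup of this kind can be sketched as a read-only snapshot of the key/value pairs captured at compile time. CompiledConfig is a hypothetical class and does not reproduce the plugin's Configuration type. The sketch assumes string values and an exception for unknown keys.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical configuration snapshot captured when the network was compiled.
class CompiledConfig {
public:
    explicit CompiledConfig(std::map<std::string, std::string> values) : _values(std::move(values)) {}

    // Returns the stored value, or throws for a key the network was not compiled with.
    const std::string& Get(const std::string& name) const {
        const auto it = _values.find(name);
        if (it == _values.end())
            throw std::invalid_argument("Unsupported configuration key: " + name);
        return it->second;
    }

private:
    std::map<std::string, std::string> _values;
};
```

Because the snapshot is immutable after construction, a tool that only holds an imported network can still query the settings it was compiled with, and an unknown key is reported explicitly instead of returning a default.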

The next step in plugin library implementation is the Synchronous Inference Request class.