Converting TensorFlow* Object Detection API Models

What's New in the 2018 R4 Release

How to Convert a Model

With the 2018 R3 release, the Model Optimizer introduced a new approach to convert models created using the TensorFlow* Object Detection API. Compared with the previous approach, the new process produces inference results with higher accuracy and does not require modifying configuration files or providing intricate command line parameters.

You can download TensorFlow* Object Detection API models from the Object Detection Model Zoo.

NOTE: Before converting, make sure you have configured the Model Optimizer. For configuration steps, refer to Configuring the Model Optimizer.

To convert a TensorFlow* Object Detection API model, go to the <INSTALL_DIR>/deployment_tools/model_optimizer directory and run the mo_tf.py script with the following required parameters:

  1. --input_model with the path to the frozen model file (frozen_inference_graph.pb)
  2. --tensorflow_use_custom_operations_config with the path to the sub-graph replacement configuration file for the topology (for example, ssd_v2_support.json)
  3. --tensorflow_object_detection_api_pipeline_config with the path to the pipeline.config file of the model

NOTE: If you convert a TensorFlow* Object Detection API model to use with the Inference Engine sample applications, you must also specify the --reverse_input_channels parameter.

In addition to the mandatory parameters listed above, you can use optional conversion parameters if needed. A full list of parameters is available in the Converting a TensorFlow* Model topic.

For example, if you downloaded the pre-trained SSD InceptionV2 topology and extracted the archive to the directory /tmp/ssd_inception_v2_coco_2018_01_28, the sample command line to convert the model looks as follows:

<INSTALL_DIR>/deployment_tools/model_optimizer/mo_tf.py --input_model=/tmp/ssd_inception_v2_coco_2018_01_28/frozen_inference_graph.pb --tensorflow_use_custom_operations_config <INSTALL_DIR>/deployment_tools/model_optimizer/extensions/front/tf/ssd_v2_support.json --tensorflow_object_detection_api_pipeline_config /tmp/ssd_inception_v2_coco_2018_01_28/pipeline.config --reverse_input_channels

Custom Input Shape

Model Optimizer handles the --input_shape command line parameter for TensorFlow* Object Detection API models in a special way, depending on the image resizer type defined in the pipeline.config file. The TensorFlow* Object Detection API generates a different Preprocessor sub-graph based on the image resizer type. Model Optimizer supports two types of image resizer:

Fixed Shape Resizer Replacement
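
The resizer type is declared in the image_resizer section of the pipeline.config file. A typical fixed_shape_resizer declaration looks like the following (the dimension values are illustrative and vary per model):

image_resizer {
  fixed_shape_resizer {
    height: 300
    width: 300
  }
}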

Keep Aspect Ratio Resizer Replacement
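
A typical keep_aspect_ratio_resizer declaration looks like the following (again, the values are illustrative):

image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}

For this resizer type, the effective spatial shape of the input is derived from min_dimension and max_dimension with logic equivalent to the following function: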

def calculate_shape_keeping_aspect_ratio(H: int, W: int, min_dimension: int, max_dimension: int):
    ratio_min = min_dimension / min(H, W)
    ratio_max = max_dimension / max(H, W)
    ratio = min(ratio_min, ratio_max)
    return int(round(H * ratio)), int(round(W * ratio))
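
For example, with the illustrative keep_aspect_ratio_resizer values above (min_dimension of 600, max_dimension of 1024), a 1024x1536 input image maps to a 600x900 input shape:

calculate_shape_keeping_aspect_ratio(1024, 1536, 600, 1024)  # ratio = min(600/1024, 1024/1536) -> (600, 900)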

Models with keep_aspect_ratio_resizer were trained to recognize objects in their original aspect ratio, in contrast with most classification topologies, which are trained to recognize objects stretched vertically and horizontally as well. By default, the Model Optimizer converts topologies with keep_aspect_ratio_resizer to consume a square input image. If a non-square image is provided as input, it is stretched without preserving aspect ratio, which decreases object detection quality.

NOTE: It is highly recommended to specify the --input_shape command line parameter for the models with keep_aspect_ratio_resizer if the input image dimensions are known in advance.
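
For example, a hypothetical command line for a model with the keep_aspect_ratio_resizer values above and input images known to be 1024x1536 could look as follows (the model directory is illustrative, and the spatial dimensions [600, 900] come from the calculation above):

<INSTALL_DIR>/deployment_tools/model_optimizer/mo_tf.py --input_model=/tmp/faster_rcnn/frozen_inference_graph.pb --tensorflow_use_custom_operations_config <INSTALL_DIR>/deployment_tools/model_optimizer/extensions/front/tf/faster_rcnn_support.json --tensorflow_object_detection_api_pipeline_config /tmp/faster_rcnn/pipeline.config --input_shape [1,600,900,3] --reverse_input_channels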

Important Notes About Feeding Input Images to the Samples

Inference Engine comes with a number of samples that can be used with Object Detection API models.

There are a number of important notes about feeding input images to the samples:

  1. Inference Engine samples stretch the input image to the size of the input layer without preserving aspect ratio. This behavior is usually correct for most topologies (including SSDs), but incorrect for the following Faster R-CNN topologies: Inception ResNet, Inception V2, ResNet50 and ResNet101. Image pre-processing for these topologies preserves aspect ratio. All Mask R-CNN and R-FCN topologies also require the aspect ratio to be preserved. The type of pre-processing is defined in the image_resizer section of the pipeline configuration file. If keeping aspect ratio is required, resize the image before passing it to the sample (see the sketch after this list).
  2. The TensorFlow* implementation of image resizing may differ from the one implemented in the sample. Even reading the input image from a compressed format (like .jpg) could give different results in the sample and in TensorFlow*. So, if it is necessary to compare accuracy between TensorFlow* and the Inference Engine, it is recommended to pass a pre-scaled input image in a non-compressed format (like .bmp).
  3. If you want to infer the model with the Inference Engine samples, convert the model specifying the --reverse_input_channels command line parameter. The samples load images in the BGR channels order, while TensorFlow* models were trained with images in the RGB order. When the --reverse_input_channels command line parameter is specified, the Model Optimizer modifies the weights of the first convolution (or another channel-dependent operation) so the output is the same as if the image were passed in the RGB channels order.
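
The following is a minimal sketch addressing notes 1 and 2 above: it resizes an image with preserved aspect ratio (reusing the calculate_shape_keeping_aspect_ratio function shown earlier) and saves it in a non-compressed format. It assumes OpenCV is installed; the file names and resizer parameters are illustrative.

import cv2

def prepare_image(src_path: str, dst_path: str, min_dimension: int, max_dimension: int):
    image = cv2.imread(src_path)
    h, w = image.shape[:2]
    # compute the target size the same way the keep_aspect_ratio_resizer does
    new_h, new_w = calculate_shape_keeping_aspect_ratio(h, w, min_dimension, max_dimension)
    # cv2.resize expects the target size as (width, height)
    resized = cv2.resize(image, (new_w, new_h))
    # the .bmp extension produces a non-compressed file
    cv2.imwrite(dst_path, resized)

prepare_image('input.jpg', 'input.bmp', 600, 1024)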

Detailed Explanations of Model Conversion Process

This section is intended for users who want to understand in detail how the Model Optimizer converts Object Detection API models. The knowledge given in this section is also useful for users whose complex models are not converted by the Model Optimizer out of the box. It is highly recommended to read the Sub-Graph Replacement in Model Optimizer chapter first to understand the sub-graph replacement concepts used here.

Implementation of the sub-graph replacers for Object Detection API models is located in the file <INSTALL_DIR>/deployment_tools/model_optimizer/extensions/front/tf/ObjectDetectionAPI.py.

It is also important to open the model in TensorBoard* to see the topology structure. Model Optimizer can create an event file that can then be fed to the TensorBoard* tool. To generate it, run the Model Optimizer providing two command line parameters: --input_model pointing to the frozen model file and --tensorboard_logdir pointing to the directory where the event file should be created.
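
For example, a hypothetical pair of commands (the event log directory is illustrative) could be:

<INSTALL_DIR>/deployment_tools/model_optimizer/mo_tf.py --input_model=/tmp/ssd_inception_v2_coco_2018_01_28/frozen_inference_graph.pb --tensorboard_logdir /tmp/tensorboard_logs
tensorboard --logdir /tmp/tensorboard_logs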

SSD (Single Shot Multibox Detector) Topologies

The SSD topologies are the simplest ones among Object Detection API topologies, so they will be analyzed first. The sub-graph replacement configuration file ssd_v2_support.json, which should be used to convert these models, contains three sub-graph replacements: ObjectDetectionAPIPreprocessorReplacement, ObjectDetectionAPISSDPostprocessorReplacement and ObjectDetectionAPIOutputReplacement. Their implementation is described below.

Preprocessor Block

All Object Detection API topologies contain a Preprocessor block of nodes (aka "scope") that performs two tasks:

  1. Scales the image to the size required by the topology.
  2. Applies mean and scale values to the image.

Model Optimizer cannot convert the part of the Preprocessor block performing scaling, because the TensorFlow implementation uses while-loops, which the Inference Engine does not support. Another reason is that the Inference Engine samples scale input images to the size of the input layer from the Intermediate Representation (IR) automatically. Given that, it is necessary to cut off the scaling part of the Preprocessor block and leave only the operations applying mean and scale values. This task is solved using the Model Optimizer sub-graph replacement mechanism.

The Preprocessor block has two outputs: a tensor with pre-processed image(s) data and a tensor with pre-processed image(s) size(s). While converting the model, Model Optimizer keeps only the nodes producing the first tensor. The second tensor is a constant whose value can be obtained from the pipeline.config file, to be used in other replacers.

The implementation of the Preprocessor block sub-graph replacer is the following (file <INSTALL_DIR>/deployment_tools/model_optimizer/extensions/front/tf/ObjectDetectionAPI.py):

class ObjectDetectionAPIPreprocessorReplacement(FrontReplacementFromConfigFileSubGraph):
    """
    The class replaces the "Preprocessor" block resizing input image and applying mean/scale values. Only nodes related
    to applying mean/scaling values are kept.
    """
    replacement_id = 'ObjectDetectionAPIPreprocessorReplacement'

    def run_before(self):
        return [Pack, Sub]

    def nodes_to_remove(self, graph: nx.MultiDiGraph, match: SubgraphMatch):
        new_nodes_to_remove = match.matched_nodes_names()
        # do not remove nodes that perform input image scaling and mean value subtraction
        for node_to_keep in ('Preprocessor/sub', 'Preprocessor/sub/y', 'Preprocessor/mul', 'Preprocessor/mul/x'):
            if node_to_keep in new_nodes_to_remove:
                new_nodes_to_remove.remove(node_to_keep)
        return new_nodes_to_remove

    def generate_sub_graph(self, graph: nx.MultiDiGraph, match: SubgraphMatch):
        argv = graph.graph['cmd_params']
        layout = graph.graph['layout']
        if argv.tensorflow_object_detection_api_pipeline_config is None:
            raise Error(missing_param_error)
        pipeline_config = PipelineConfig(argv.tensorflow_object_detection_api_pipeline_config)

        sub_node = match.output_node(0)[0]
        if not sub_node.has('op') or sub_node.op != 'Sub':
            raise Error('The output op of the Preprocessor sub-graph is not of type "Sub". Looks like the topology is '
                        'not created with TensorFlow Object Detection API.')

        mul_node = None
        if sub_node.in_node(0).has('op') and sub_node.in_node(0).op == 'Mul':
            log.info('There is image scaling node in the Preprocessor block.')
            mul_node = sub_node.in_node(0)

        initial_input_node_name = 'image_tensor'
        if initial_input_node_name not in graph.nodes():
            raise Error('Input node "{}" of the graph is not found. Do not run the Model Optimizer with '
                        '"--input" command line parameter.'.format(initial_input_node_name))
        placeholder_node = Node(graph, initial_input_node_name)

        # set default value of the batch size to 1 if user didn't specify batch size and input shape
        batch_dim = get_batch_dim(layout, 4)
        if argv.batch is None and placeholder_node.shape[batch_dim] == -1:
            placeholder_node.shape[batch_dim] = 1
        if placeholder_node.shape[batch_dim] > 1:
            print("[ WARNING ] The batch size more than 1 is supported for SSD topologies only.")
        height, width = calculate_placeholder_spatial_shape(graph, match, pipeline_config)
        placeholder_node.shape[get_height_dim(layout, 4)] = height
        placeholder_node.shape[get_width_dim(layout, 4)] = width

        # save the pre-processed image spatial sizes to be used in the other replacers
        graph.graph['preprocessed_image_height'] = placeholder_node.shape[get_height_dim(layout, 4)]
        graph.graph['preprocessed_image_width'] = placeholder_node.shape[get_width_dim(layout, 4)]

        to_float_node = placeholder_node.out_node(0)
        if not to_float_node.has('op') or to_float_node.op != 'Cast':
            raise Error('The output of the node "{}" is not Cast operation. Cannot apply replacer.'.format(
                initial_input_node_name))

        # connect to_float_node directly with node performing scale on mean value subtraction
        if mul_node is None:
            create_edge(to_float_node, sub_node, 0, 0)
        else:
            create_edge(to_float_node, mul_node, 0, 1)

        print('The Preprocessor block has been removed. Only nodes performing mean value subtraction and scaling (if'
              ' applicable) are kept.')
        return {}

The run_before function defines a list of replacers that the current replacer should be run before. In this case, it is Pack and Sub. The Sub operation is not supported by the Inference Engine plugins, so Model Optimizer replaces it with a combination of the Eltwise layer (element-wise sum) and the ScaleShift layer. But the Preprocessor replacer expects to see a Sub node, so it should be called before the Sub is replaced.

The nodes_to_remove function returns a list of nodes that should be removed after the replacement happens. In this case, it removes all nodes matched in the Preprocessor scope except the Sub and Mul nodes performing mean value subtraction and scaling.

The generate_sub_graph function performs the following actions:

  1. Checks that the --tensorflow_object_detection_api_pipeline_config command line parameter is provided and parses the pipeline.config file.
  2. Checks that the output node of the matched sub-graph is of type Sub and finds the optional Mul node performing image scaling.
  3. Sets the batch, height and width dimensions of the 'image_tensor' placeholder and saves the pre-processed image spatial sizes to the graph attributes so that other replacers can use them.
  4. Connects the Cast node (conversion of the input image to floating point) directly to the node performing scaling or mean value subtraction.

Postprocessor Block

A distinct feature of any SSD topology is the part performing non-maximum suppression of proposed bounding boxes. This part of the topology is implemented with dozens of primitive operations in TensorFlow, while in the Inference Engine it is a single layer called DetectionOutput. Thus, to convert an SSD model from TensorFlow, the Model Optimizer should replace the entire sub-graph of operations that implements the DetectionOutput layer with a single DetectionOutput node.

The Inference Engine DetectionOutput layer implementation consumes three tensors in the following order:

  1. Tensor with locations of bounding boxes
  2. Tensor with confidences for each bounding box
  3. Tensor with prior boxes ("anchors" in a TensorFlow terminology)

The Inference Engine DetectionOutput layer implementation produces one tensor with seven numbers for each actual detection:

  1. Batch index of the image the detection belongs to
  2. Label of the predicted class
  3. Confidence of the prediction
  4. x_min — normalized X coordinate of the top-left bounding box corner
  5. y_min — normalized Y coordinate of the top-left bounding box corner
  6. x_max — normalized X coordinate of the bottom-right bounding box corner
  7. y_max — normalized Y coordinate of the bottom-right bounding box corner

There are more output tensors in the TensorFlow Object Detection API: "detection_boxes", "detection_classes", "detection_scores" and "num_detections", but the values in them are consistent with the output values of the Inference Engine DetectionOutput layer.
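
For illustration, a minimal sketch of interpreting this output in a Python application might look as follows (the [1, 1, N, 7] output blob layout and the meaning of the seven values follow the list above; the function name and the default threshold are illustrative):

import numpy as np

def parse_detections(output_blob: np.ndarray, confidence_threshold: float = 0.5):
    detections = output_blob.reshape(-1, 7)
    results = []
    for image_id, label, confidence, x_min, y_min, x_max, y_max in detections:
        if image_id < 0:  # a negative batch index marks the end of valid detections
            break
        if confidence >= confidence_threshold:
            results.append((int(label), float(confidence), (x_min, y_min, x_max, y_max)))
    return results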

The sub-graph replacement by points is used in the ssd_v2_support.json file to match the Postprocessor block; the start and end points of the matched sub-graph are defined in that file.

There are a number of differences in the layout, format and content between the input tensors expected by the DetectionOutput layer and the tensors that TensorFlow generates, so additional tensor processing is required before creating the DetectionOutput layer. It is described below. The sub-graph replacement class for the DetectionOutput layer is given below:

class ObjectDetectionAPISSDPostprocessorReplacement(FrontReplacementFromConfigFileSubGraph):
    replacement_id = 'ObjectDetectionAPISSDPostprocessorReplacement'

    def run_after(self):
        return [ObjectDetectionAPIPreprocessorReplacement]

    def run_before(self):
        # the replacer uses node of type "RealDiv" as one of the start points, but Model Optimizer replaces nodes of
        # type "RealDiv" with new ones, so it is necessary to replace the sub-graph before replacing the "RealDiv"
        # nodes
        return [Div, StandaloneConstEraser]

    def output_edges_match(self, graph: nx.DiGraph, match: SubgraphMatch, new_sub_graph: dict):
        # the DetectionOutput in IE produces single tensor, but in TF it produces two tensors, so create only one output
        # edge match
        return {match.output_node(0)[0].id: new_sub_graph['detection_output_node'].id}

    def generate_sub_graph(self, graph: nx.MultiDiGraph, match: SubgraphMatch):
        argv = graph.graph['cmd_params']
        if argv.tensorflow_object_detection_api_pipeline_config is None:
            raise Error(missing_param_error)
        pipeline_config = PipelineConfig(argv.tensorflow_object_detection_api_pipeline_config)
        num_classes = _value_or_raise(match, pipeline_config, 'num_classes')

        # reshapes confidences to 4D before applying activation function
        expand_dims_op = Reshape(graph, {'dim': np.array([0, 1, -1, num_classes + 1])})
        # do not convert from NHWC to NCHW this node shape
        expand_dims_node = expand_dims_op.create_node([match.input_nodes(1)[0][0].in_node(0)],
                                                      dict(name='do_ExpandDims_conf'))

        activation_function = _value_or_raise(match, pipeline_config, 'postprocessing_score_converter')
        activation_conf_node = add_activation_function_after_node(graph, expand_dims_node, activation_function)
        PermuteAttrs.set_permutation(expand_dims_node, expand_dims_node.out_node(), None)

        # IE DetectionOutput layer consumes flattened tensors
        # reshape operation to flatten locations tensor
        reshape_loc_op = Reshape(graph, {'dim': np.array([0, -1])})
        reshape_loc_node = reshape_loc_op.create_node([match.input_nodes(0)[0][0].in_node(0)],
                                                      dict(name='do_reshape_loc'))

        # IE DetectionOutput layer consumes flattened tensors
        # reshape operation to flatten confidence tensor
        reshape_conf_op = Reshape(graph, {'dim': np.array([0, -1])})
        reshape_conf_node = reshape_conf_op.create_node([activation_conf_node], dict(name='do_reshape_conf'))

        if pipeline_config.get_param('ssd_anchor_generator_num_layers') is not None or \
                pipeline_config.get_param('multiscale_anchor_generator_min_level') is not None:
            # change the Reshape operations with hardcoded number of output elements of the convolution nodes to be
            # reshapable
            _relax_reshape_nodes(graph, pipeline_config)

            # create PriorBoxClustered nodes instead of a constant value with prior boxes so the model could be reshaped
            if pipeline_config.get_param('ssd_anchor_generator_num_layers') is not None:
                priors_node = _create_prior_boxes_node(graph, pipeline_config)
            elif pipeline_config.get_param('multiscale_anchor_generator_min_level') is not None:
                priors_node = _create_multiscale_prior_boxes_node(graph, pipeline_config)
        else:
            log.info('The anchor generator is not known. Save constant with prior-boxes to IR.')
            priors_node = match.input_nodes(2)[0][0].in_node(0)

        # creates DetectionOutput Node object from Op class
        detection_output_op = DetectionOutput(graph, match.custom_replacement_desc.custom_attributes)
        detection_output_op.attrs['old_infer'] = detection_output_op.attrs['infer']
        detection_output_op.attrs['infer'] = __class__.do_infer
        detection_output_node = detection_output_op.create_node(
            [reshape_loc_node, reshape_conf_node, priors_node],
            dict(name=detection_output_op.attrs['type'],
                 clip=1,
                 confidence_threshold=_value_or_raise(match, pipeline_config, 'postprocessing_score_threshold'),
                 top_k=_value_or_raise(match, pipeline_config, 'postprocessing_max_detections_per_class'),
                 keep_top_k=_value_or_raise(match, pipeline_config, 'postprocessing_max_total_detections'),
                 nms_threshold=_value_or_raise(match, pipeline_config, 'postprocessing_iou_threshold')))
        return {'detection_output_node': detection_output_node}

The run_before and run_after functions define lists of replacers that this replacer should be run before and after respectively.

The input_edges_match and output_edges_match functions generate dictionaries describing how the input/output nodes matched with the replacer should be connected with new nodes generated in the generate_sub_graph function. Refer to sub-graph replacements documentation for more information.

The generate_sub_graph function performs the following actions:

  1. Parses the pipeline.config file and reads the number of classes and the post-processing parameters.
  2. Reshapes the tensor with confidences to 4D and applies the activation function specified in the pipeline.config file.
  3. Flattens the locations and confidences tensors, as required by the DetectionOutput layer.
  4. Creates PriorBoxClustered nodes instead of a constant tensor with prior boxes if the anchor generator type is known, so that the model can be reshaped.
  5. Creates the DetectionOutput node, replaces its inference function with the do_infer function described below, and initializes its attributes from the pipeline.config values.

The paragraphs below explain why the inference function for the DetectionOutput layer is modified. Before that, it is necessary to become acquainted with selected high-level steps of the Model Optimizer conversion pipeline. Note that only the steps required for understanding the change are mentioned:

  1. Model Optimizer creates a calculation graph from the initial topology where each node corresponds to an operation from the initial model.
  2. Model Optimizer performs "Front replacers" (including the one being described now).
  3. Model Optimizer adds data nodes between operation nodes to the graph.
  4. Model Optimizer performs "Middle replacers".
  5. Model Optimizer performs the "shape inference" phase. During this phase, the shape of all data nodes is calculated. Model Optimizer also calculates values of data tensors which are constant, i.e. do not depend on the input. For example, the tensor with prior boxes (generated with the MultipleGridAnchorGenerator or similar scopes) does not depend on the input and is evaluated by Model Optimizer during shape inference. Model Optimizer uses the inference function stored in the 'infer' attribute of operation nodes.
  6. Model Optimizer performs "Back replacers".
  7. Model Optimizer generates IR.

The do_infer function is needed to perform adjustments to the tensor with prior boxes (anchors), which is known only after the shape inference phase, and to perform the additional transformations described below. This change is performed only if the tensor with prior boxes is not constant (i.e. it is produced by PriorBoxClustered layers during inference). It would be possible to implement the Postprocessor block replacement as a Middle replacer (so the prior boxes tensor would be evaluated by the time the replacer is called), but then it would be necessary to correctly handle the data nodes which are created between each pair of initially adjacent operation nodes. In order to inject the required modification into the inference function of the DetectionOutput node, a new function is created that performs the modifications and then calls the initial inference function. The code of the new inference function is the following:

@staticmethod
def do_infer(node: Node):
    prior_boxes = node.in_node(2).value
    if prior_boxes is not None:
        # these are default variances values
        variance = np.array([[0.1, 0.1, 0.2, 0.2]])
        # replicating the variance values for all prior-boxes
        variances = np.tile(variance, [prior_boxes.shape[-2], 1])
        # DetectionOutput in the Inference Engine expects the prior-boxes in the following layout: (values, variances)
        prior_boxes = prior_boxes.reshape([-1, 4])
        prior_boxes = np.concatenate((prior_boxes, variances), 0)
        # compared to the IE's DetectionOutput, the TF keeps the prior-boxes in YXYX, need to get back to the XYXY
        prior_boxes = np.concatenate((prior_boxes[:, 1:2], prior_boxes[:, 0:1],
                                      prior_boxes[:, 3:4], prior_boxes[:, 2:3]), 1)
        # adding another dimension, as the prior-boxes are expected as 3d tensors
        prior_boxes = prior_boxes.reshape((1, 2, -1))
        node.in_node(2).shape = np.array(prior_boxes.shape, dtype=np.int64)
        node.in_node(2).value = prior_boxes

    node.old_infer(node)
    # compared to the IE's DetectionOutput, the TF keeps the locations in YXYX, need to get back to the XYXY
    # for last convolutions that operate the locations need to swap the X and Y for output feature weights & biases
    conv_nodes = backward_bfs_for_operation(node.in_node(0), ['Conv2D'])
    swap_weights_xy(conv_nodes)
    squeeze_reshape_and_concat(conv_nodes)

    for node_name in node.graph.nodes():
        node = Node(node.graph, node_name)
        if node.has_and_set('swap_xy_count') and len(node.out_nodes()) != node['swap_xy_count']:
            raise Error('The weights were swapped for node "{}", but this weight was used in other nodes.'.format(
                node.name))

Faster R-CNN Topologies

The Faster R-CNN models contain several building blocks similar to those in the SSD models, so it is highly recommended to read the section about converting them first. Detailed information about Faster R-CNN topologies is provided in the abstract.

Preprocessor Block

Faster R-CNN topologies contain a Preprocessor block similar to the one in SSD topologies. The same ObjectDetectionAPIPreprocessorReplacement sub-graph replacer is used to cut it off.

Proposal

The Proposal layer is implemented with dozens of primitive operations in TensorFlow, while it is a single layer in the Inference Engine. The ObjectDetectionAPIProposalReplacement sub-graph replacer identifies the nodes corresponding to the layer and replaces them with the required new nodes.

class ObjectDetectionAPIProposalReplacement(FrontReplacementFromConfigFileSubGraph):
    """
    This class replaces sub-graph of operations with Proposal layer and additional layers transforming
    tensors from layout of TensorFlow to layout required by Inference Engine.
    Refer to comments inside the function for more information about performed actions.
    """
    replacement_id = 'ObjectDetectionAPIProposalReplacement'

    def run_after(self):
        return [ObjectDetectionAPIPreprocessorReplacement]

    def run_before(self):
        return [Sub, CropAndResizeReplacement]

    def output_edges_match(self, graph: nx.DiGraph, match: SubgraphMatch, new_sub_graph: dict):
        return {match.output_node(0)[0].id: new_sub_graph['proposal_node'].id}

    def nodes_to_remove(self, graph: nx.MultiDiGraph, match: SubgraphMatch):
        new_list = match.matched_nodes_names().copy()
        # do not remove nodes that produce box predictions and class predictions
        new_list.remove(match.single_input_node(0)[0].id)
        new_list.remove(match.single_input_node(1)[0].id)
        return new_list

    def generate_sub_graph(self, graph: nx.MultiDiGraph, match: SubgraphMatch):
        argv = graph.graph['cmd_params']
        if argv.tensorflow_object_detection_api_pipeline_config is None:
            raise Error(missing_param_error)
        pipeline_config = PipelineConfig(argv.tensorflow_object_detection_api_pipeline_config)

        input_height = graph.graph['preprocessed_image_height']
        input_width = graph.graph['preprocessed_image_width']
        max_proposals = _value_or_raise(match, pipeline_config, 'first_stage_max_proposals')
        proposal_ratios = _value_or_raise(match, pipeline_config, 'anchor_generator_aspect_ratios')
        proposal_scales = _value_or_raise(match, pipeline_config, 'anchor_generator_scales')
        anchors_count = len(proposal_ratios) * len(proposal_scales)

        # Convolution/matmul node that produces classes predictions
        # Permute result of the tensor with class predictions so it will be in a correct layout for Softmax
        predictions_node = backward_bfs_for_operation(match.single_input_node(1)[0], ['Add'])[0]
        permute_predictions_op = Permute(graph, dict(order=np.array([0, 2, 3, 1])))
        permute_predictions_node = permute_predictions_op.create_node([], dict(name=predictions_node.name + '/Permute'))
        insert_node_after(predictions_node, permute_predictions_node, 0)

        # creates constant input with the image height, width and scale H and scale W (if present) required for Proposal
        const_op = Const(graph, dict(value=np.array([[input_height, input_width, 1]], dtype=np.float32)))
        const_node = const_op.create_node([], dict(name='proposal_const_image_size'))

        reshape_classes_op = Reshape(graph, dict(dim=np.array([0, -1, 2])))
        reshape_classes_node = reshape_classes_op.create_node([permute_predictions_node],
                                                              dict(name='reshape_FirstStageBoxPredictor_class'))

        softmax_conf_op = Softmax(graph, dict(axis=2))
        softmax_conf_node = softmax_conf_op.create_node([reshape_classes_node],
                                                        dict(name='FirstStageBoxPredictor_softMax_class'))
        PermuteAttrs.set_permutation(reshape_classes_node, softmax_conf_node, None)

        reshape_softmax_op = Reshape(graph, dict(dim=np.array([1, anchors_count, 2, -1])))
        reshape_softmax_node = reshape_softmax_op.create_node([softmax_conf_node], dict(name='reshape_softmax_class'))
        PermuteAttrs.set_permutation(softmax_conf_node, reshape_softmax_node, None)

        permute_reshape_softmax_op = Permute(graph, dict(order=np.array([0, 1, 3, 2])))
        permute_reshape_softmax_node = permute_reshape_softmax_op.create_node([reshape_softmax_node], dict(
            name=reshape_softmax_node.name + '/Permute'))

        # implement custom reshape infer function because we need to know the input convolution node output dimension
        # sizes but we can know it only after partial infer
        reshape_permute_op = Reshape(graph,
                                     dict(dim=np.ones([4]), anchors_count=anchors_count, conv_node=predictions_node))
        reshape_permute_op.attrs['old_infer'] = reshape_permute_op.attrs['infer']
        reshape_permute_op.attrs['infer'] = __class__.classes_probabilities_reshape_shape_infer
        reshape_permute_node = reshape_permute_op.create_node([permute_reshape_softmax_node],
                                                              dict(name='Reshape_Permute_Class'))

        proposal_op = ProposalOp(graph, dict(min_size=1,
                                             framework='tensorflow',
                                             pre_nms_topn=2 ** 31 - 1,
                                             box_size_scale=5,
                                             box_coordinate_scale=10,
                                             post_nms_topn=max_proposals,
                                             feat_stride=_value_or_raise(match, pipeline_config,
                                                                         'features_extractor_stride'),
                                             ratio=proposal_ratios,
                                             scale=proposal_scales,
                                             base_size=_value_or_raise(match, pipeline_config,
                                                                       'anchor_generator_base_size'),
                                             nms_thresh=_value_or_raise(match, pipeline_config,
                                                                        'first_stage_nms_iou_threshold')))

        anchors_node = backward_bfs_for_operation(match.single_input_node(0)[0], ['Add'])[0]
        proposal_node = proposal_op.create_node([reshape_permute_node, anchors_node, const_node],
                                                dict(name='proposals'))

        # the TF implementation of ROIPooling with bi-linear filtration need proposals scaled by image size
        proposal_scale_const = np.array([1.0, 1 / input_height, 1 / input_width, 1 / input_height, 1 / input_width],
                                        dtype=np.float32)
        proposal_scale_const_op = Const(graph, dict(value=proposal_scale_const))
        proposal_scale_const_node = proposal_scale_const_op.create_node([], dict(name='Proposal_scale_const'))

        scale_proposals_op = Eltwise(graph, dict(operation='mul'))
        scale_proposals_node = scale_proposals_op.create_node([proposal_node, proposal_scale_const_node],
                                                              dict(name='scaled_proposals'))

        proposal_reshape_4d_op = Reshape(graph, dict(dim=np.array([1, 1, max_proposals, 5]), nchw_layout=True))
        proposal_reshape_4d_node = proposal_reshape_4d_op.create_node([scale_proposals_node],
                                                                      dict(name="reshape_proposals_4d"))

        # creates the Crop operation that gets input from the Proposal layer and gets tensor with bounding boxes only
        crop_op = Crop(graph, dict(axis=np.array([3]), offset=np.array([1]), dim=np.array([4]), nchw_layout=True))
        crop_node = crop_op.create_node([proposal_reshape_4d_node], dict(name='crop_proposals'))

        proposal_reshape_3d_op = Reshape(graph, dict(dim=np.array([0, -1, 4]), nchw_layout=True))
        proposal_reshape_3d_node = proposal_reshape_3d_op.create_node([crop_node], dict(name="tf_proposals"))
        return {'proposal_node': proposal_reshape_3d_node}

    @staticmethod
    def classes_probabilities_reshape_shape_infer(node: Node):
        # now we can determine the reshape dimensions from Convolution node
        conv_node = node.conv_node
        conv_output_shape = conv_node.out_node().shape

        # update desired shape of the Reshape node
        node.dim = np.array([0, conv_output_shape[1], conv_output_shape[2], node.anchors_count * 2])
        node.old_infer(node)

The most interesting part of this replacer is the generate_sub_graph function.

Lines 26-36: Parses the pipeline.config file and gets required parameters for the Proposal layer.

Lines 38-73: Performs the following manipulations with the tensor with class predictions:

  1. TensorFlow uses the NHWC layout, while the Inference Engine uses NCHW. Model Optimizer by default transforms all node data in the inference graph to the NCHW layout. The size of the 'C' dimension of the tensor with class predictions is equal to $base\_anchors\_count \cdot 2$, where 2 corresponds to the number of classes (background and foreground) and $base\_anchors\_count$ is the number of anchors applied to each position of the 'H' and 'W' dimensions. Therefore, there are $H \cdot W \cdot base\_anchors\_count$ bounding boxes. Lines 54-56 apply the Softmax layer to this tensor to get class probabilities for each bounding box.
  2. The dimension with classes must be the fastest growing dimension to apply the Softmax activation. Lines 41-43 permute the tensor to the NHWC layout first (because the Model Optimizer automatically permuted it to NCHW before) and then reshape it to [N, total_bounding_boxes, 2].
  3. After applying the Softmax activation, lines 52-64 perform the reverse actions to reshape the tensor to its initial dimensions (a toy illustration of this sequence of manipulations is sketched after this list).
  4. The inference function injection (like with the DetectionOutput layer for the SSD conversion) is used for the last reshape (lines 71-72), because the values of the 'H' and 'W' dimensions are unknown during the replacement (this is a Front replacer, which is performed before the shape inference).
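
To make these manipulations concrete, the following is a toy numpy sketch (not Model Optimizer code; the tensor sizes are illustrative) of the permute-reshape-Softmax-restore sequence from items 1-3:

import numpy as np

N, C, H, W = 1, 2 * 9, 19, 19                         # C = 2 classes * base_anchors_count (9)
predictions = np.random.rand(N, C, H, W).astype(np.float32)

nhwc = predictions.transpose(0, 2, 3, 1)              # NCHW -> NHWC
flat = nhwc.reshape(N, -1, 2)                         # [N, H * W * base_anchors_count, 2]
exp = np.exp(flat - flat.max(axis=2, keepdims=True))
probabilities = exp / exp.sum(axis=2, keepdims=True)  # Softmax over the class dimension
restored = probabilities.reshape(N, H, W, C).transpose(0, 3, 1, 2)  # back to NCHW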

Lines 75-92: Adds the Proposal layer to the graph. This layer has one input containing the input image size (lines 46-47). The image sizes are read from the pipeline.config file.

Lines 94-106: Scales bounding boxes to the [0, 1] interval, as required by the ROIPooling layer with bi-linear filtration.

Lines 108-113: Crops the output from the Proposal node to remove the batch indices (the Inference Engine implementation of the Proposal layer generates a tensor of shape [num_proposals, 5]). The final tensor contains just box coordinates, as in the TensorFlow implementation.

Lines 118-125: The updated inference function for the Reshape layer, which restores the original shape of the tensor with class probabilities. The inference function is patched because the original shape of this tensor is known only during the shape inference phase.

SecondStagePostprocessor Block

The SecondStagePostprocessor block is similar to the Postprocessor block from the SSD topologies, but there are a number of differences in its conversion, described after the code.

class ObjectDetectionAPIDetectionOutputReplacement(FrontReplacementFromConfigFileSubGraph):
    """
    Replaces the sub-graph that is equal to the DetectionOutput layer from Inference Engine. This replacer is used for
    Faster R-CNN, R-FCN and Mask R-CNN topologies conversion.
    The replacer uses a value of the custom attribute 'coordinates_swap_method' from the sub-graph replacement
    configuration file to choose how to swap box coordinates of the 0-th input of the generated DetectionOutput layer.
    Refer to the code for more details.
    """
    replacement_id = 'ObjectDetectionAPIDetectionOutputReplacement'

    def run_before(self):
        return [ObjectDetectionAPIMaskRCNNROIPoolingSecondReplacement, Unpack]

    def run_after(self):
        return [ObjectDetectionAPIProposalReplacement, CropAndResizeReplacement]

    def nodes_to_remove(self, graph: nx.MultiDiGraph, match: SubgraphMatch):
        new_nodes_to_remove = match.matched_nodes_names().copy()
        new_nodes_to_remove.extend(['detection_boxes', 'detection_scores', 'num_detections'])
        return new_nodes_to_remove

    def output_edges_match(self, graph: nx.DiGraph, match: SubgraphMatch, new_sub_graph: dict):
        # the DetectionOutput in IE produces single tensor, but in TF it produces four tensors, so we need to create
        # only one output edge match
        return {match.output_node(0)[0].id: new_sub_graph['detection_output_node'].id}

    def generate_sub_graph(self, graph: nx.MultiDiGraph, match: SubgraphMatch):
        argv = graph.graph['cmd_params']
        if argv.tensorflow_object_detection_api_pipeline_config is None:
            raise Error(missing_param_error)
        pipeline_config = PipelineConfig(argv.tensorflow_object_detection_api_pipeline_config)
        num_classes = _value_or_raise(match, pipeline_config, 'num_classes')
        first_stage_max_proposals = _value_or_raise(match, pipeline_config, 'first_stage_max_proposals')
        activation_function = _value_or_raise(match, pipeline_config, 'postprocessing_score_converter')

        activation_conf_node = add_activation_function_after_node(graph, match.single_input_node(1)[0].in_node(0),
                                                                  activation_function)

        # IE DetectionOutput layer consumes flattened tensors
        # reshape operation to flatten confidence tensor
        reshape_conf_op = Reshape(graph, dict(dim=np.array([1, -1])))
        reshape_conf_node = reshape_conf_op.create_node([activation_conf_node], dict(name='do_reshape_conf'))

        # TF produces locations tensor without boxes for background.
        # Inference Engine DetectionOutput layer requires background boxes so we generate them with some values
        # and concatenate with locations tensor
        fake_background_locs_blob = np.tile([[[1, 1, 2, 2]]], [first_stage_max_proposals, 1, 1])
        fake_background_locs_const_op = Const(graph, dict(value=fake_background_locs_blob))
        fake_background_locs_const_node = fake_background_locs_const_op.create_node([])

        reshape_loc_op = Reshape(graph, dict(dim=np.array([first_stage_max_proposals, num_classes, 4])))
        reshape_loc_node = reshape_loc_op.create_node([match.single_input_node(0)[0].in_node(0)],
                                                      dict(name='reshape_loc'))

        concat_loc_op = Concat(graph, dict(axis=1))
        concat_loc_node = concat_loc_op.create_node([fake_background_locs_const_node, reshape_loc_node],
                                                    dict(name='concat_fake_loc'))
        PermuteAttrs.set_permutation(reshape_loc_node, concat_loc_node, None)
        PermuteAttrs.set_permutation(fake_background_locs_const_node, concat_loc_node, None)

        # constant node with variances
        variances_const_op = Const(graph, dict(value=np.array([0.1, 0.1, 0.2, 0.2])))
        variances_const_node = variances_const_op.create_node([])

        # reshape locations tensor to 2D so it could be passed to Eltwise which will be converted to ScaleShift
        reshape_loc_2d_op = Reshape(graph, dict(dim=np.array([-1, 4])))
        reshape_loc_2d_node = reshape_loc_2d_op.create_node([concat_loc_node], dict(name='reshape_locs_2'))
        PermuteAttrs.set_permutation(concat_loc_node, reshape_loc_2d_node, None)

        # element-wise multiply locations with variances
        eltwise_locs_op = Eltwise(graph, dict(operation='mul'))
        eltwise_locs_node = eltwise_locs_op.create_node([reshape_loc_2d_node, variances_const_node],
                                                        dict(name='scale_locs'))

        # IE DetectionOutput layer consumes flattened tensors
        reshape_loc_do_op = Reshape(graph, dict(dim=np.array([1, -1])))

        custom_attributes = match.custom_replacement_desc.custom_attributes
        coordinates_swap_method = 'add_convolution'
        if 'coordinates_swap_method' not in custom_attributes:
            log.error('The ObjectDetectionAPIDetectionOutputReplacement sub-graph replacement configuration file '
                      'must contain "coordinates_swap_method" in the "custom_attributes" dictionary. Two values are '
                      'supported: "swap_weights" and "add_convolution". The first one should be used when there is '
                      'a MatMul or Conv2D node before the "SecondStagePostprocessor" block in the topology. With this '
                      'solution the weights of the MatMul or Conv2D nodes are permuted, simulating the swap of XY '
                      'coordinates in the tensor. The second could be used in any other cases but it is worse in terms '
                      'of performance because it adds the Conv2D node which performs permuting of data. Since the '
                      'attribute is not defined the second approach is used by default.')
        else:
            coordinates_swap_method = custom_attributes['coordinates_swap_method']
            supported_swap_methods = ['swap_weights', 'add_convolution']
            if coordinates_swap_method not in supported_swap_methods:
                raise Error('Unsupported "coordinates_swap_method" defined in the sub-graph replacement configuration '
                            'file. Supported methods are: {}'.format(', '.join(supported_swap_methods)))

        if coordinates_swap_method == 'add_convolution':
            swapped_locs_node = add_convolution_to_swap_xy_coordinates(graph, eltwise_locs_node, 4)
            reshape_loc_do_node = reshape_loc_do_op.create_node([swapped_locs_node], dict(name='do_reshape_locs'))
        else:
            reshape_loc_do_node = reshape_loc_do_op.create_node([eltwise_locs_node], dict(name='do_reshape_locs'))

        # find Proposal output which has the data layout as in TF: YXYX coordinates without batch indices.
        proposal_nodes_ids = [node_id for node_id, attrs in graph.nodes(data=True)
                              if 'name' in attrs and attrs['name'] == 'proposals']
        if len(proposal_nodes_ids) != 1:
            raise Error("Found the following nodes '{}' with name 'proposals' but there should be exactly 1. "
                        "Looks like ObjectDetectionAPIProposalReplacement replacement didn't work.".
                        format(proposal_nodes_ids))
        proposal_node = Node(graph, proposal_nodes_ids[0])
        swapped_proposals_node = add_convolution_to_swap_xy_coordinates(graph, proposal_node, 5)

        # reshape priors boxes as Detection Output expects
        reshape_priors_op = Reshape(graph, dict(dim=np.array([1, 1, -1])))
        reshape_priors_node = reshape_priors_op.create_node([swapped_proposals_node],
                                                            dict(name='DetectionOutput_reshape_priors_'))

        detection_output_op = DetectionOutput(graph, {})
        if coordinates_swap_method == 'swap_weights':
            # update infer function to re-pack weights
            detection_output_op.attrs['old_infer'] = detection_output_op.attrs['infer']
            detection_output_op.attrs['infer'] = __class__.do_infer
        detection_output_node = detection_output_op.create_node(
            [reshape_loc_do_node, reshape_conf_node, reshape_priors_node],
            dict(name=detection_output_op.attrs['type'], share_location=0, normalized=0, variance_encoded_in_target=1,
                 clip=1, code_type='caffe.PriorBoxParameter.CENTER_SIZE', pad_mode='caffe.ResizeParameter.CONSTANT',
                 resize_mode='caffe.ResizeParameter.WARP',
                 num_classes=num_classes,
                 input_height=graph.graph['preprocessed_image_height'],
                 input_width=graph.graph['preprocessed_image_width'],
                 confidence_threshold=_value_or_raise(match, pipeline_config, 'postprocessing_score_threshold'),
                 top_k=_value_or_raise(match, pipeline_config, 'postprocessing_max_detections_per_class'),
                 keep_top_k=_value_or_raise(match, pipeline_config, 'postprocessing_max_total_detections'),
                 nms_threshold=_value_or_raise(match, pipeline_config, 'postprocessing_iou_threshold')))
        PermuteAttrs.set_permutation(reshape_priors_node, detection_output_node, None)
        # sets specific name to the node so we can find it in other replacers
        detection_output_node.name = 'detection_output'

        output_op = Output(graph, dict(name='do_OutputOp'))
        output_op.create_node([detection_output_node])

        print('The graph output nodes "num_detections", "detection_boxes", "detection_classes", "detection_scores" '
              'have been replaced with a single layer of type "Detection Output". Refer to IR catalogue in the '
              'documentation for information about this layer.')
        return {'detection_output_node': detection_output_node}

    @staticmethod
    def do_infer(node):
        node.old_infer(node)
        # compared to the IE's DetectionOutput, the TF keeps the locations in YXYX, need to get back to the XYXY
        # for last matmul/Conv2D that operate the locations need to swap the X and Y for output feature weights & biases
        swap_weights_xy(backward_bfs_for_operation(node.in_node(0), ['MatMul', 'Conv2D']))

The differences in conversion are the following:

  1. The activation function defined in the pipeline.config file is applied to the tensor with confidences, and the result is flattened, as in the SSD case.
  2. TensorFlow produces the locations tensor without boxes for the background class, while the Inference Engine DetectionOutput layer requires them, so a constant tensor with fake background locations is created and concatenated with the locations tensor.
  3. The locations are multiplied by the variance values using an Eltwise operation, because the variance_encoded_in_target attribute of the created DetectionOutput node is set to 1.
  4. The way box coordinates are swapped from the YXYX to the XYXY layout is controlled by the coordinates_swap_method custom attribute from the sub-graph replacement configuration file: either the weights of the preceding MatMul/Conv2D nodes are permuted ('swap_weights'), or a special convolution performing the swap is added ('add_convolution').
  5. The third input of the DetectionOutput layer (prior boxes) is produced from the output of the Proposal layer (the node named 'proposals'), not from a constant or PriorBoxClustered nodes.

Cutting Off Part of the Topology

It is possible to cut off part of the topology using the --output command line parameter. Detailed information on why it could be useful is provided here. The Faster R-CNN models are cut at the end using the sub-graph replacer ObjectDetectionAPIOutputReplacement.

class ObjectDetectionAPIOutputReplacement(FrontReplacementFromConfigFileGeneral):
    """
    This replacer is used to cut-off the network by specified nodes for models generated with Object Detection API.
    The custom attribute for the replacer contains one value for key "outputs". This string is a comma separated list
    of output alternatives. Each output alternative is a '|' separated list of node names which could be outputs. The
    first node from each alternative that exists in the graph is chosen. Others are ignored.
    For example, if the "outputs" is equal to the following string:
    "Reshape_16,SecondStageBoxPredictor_1/Conv_3/BiasAdd|SecondStageBoxPredictor_1/Conv_1/BiasAdd"
    then the "Reshape_16" will be an output if it exists in the graph. The second output will be
    SecondStageBoxPredictor_1/Conv_3/BiasAdd if it exists in the graph, if not then
    SecondStageBoxPredictor_1/Conv_1/BiasAdd will be output if it exists in the graph.
    """
    replacement_id = 'ObjectDetectionAPIOutputReplacement'

    def run_before(self):
        return [ObjectDetectionAPIPreprocessorReplacement]

    def transform_graph(self, graph: nx.MultiDiGraph, replacement_descriptions: dict):
        if graph.graph['cmd_params'].output is not None:
            log.warning('User defined output nodes are specified. Skip the graph cut-off by the '
                        'ObjectDetectionAPIOutputReplacement.')
            return
        outputs = []
        outputs_string = replacement_descriptions['outputs']
        for alternatives in outputs_string.split(','):
            for out_node_name in alternatives.split('|'):
                if graph.has_node(out_node_name):
                    outputs.append(out_node_name)
                    break
                else:
                    log.debug('A node "{}" does not exist in the graph. Do not add it as output'.format(out_node_name))
        _outputs = output_user_data_repack(graph, outputs)
        add_output_ops(graph, _outputs, graph.graph['inputs'])

This is a replacer of type "general", which is called just once, in contrast with other Front replacers ("scope" and "points") that are called for each matched instance. The replacer reads node names that should become new output nodes, like specifying --output <node_names>. The only difference is that the string containing node names can contain the '|' character separating alternative output node names. A detailed explanation is provided in the class description in the code.

The detection_boxes, detection_scores and num_detections nodes are specified as outputs in the faster_rcnn_support.json file. These output nodes are used to remove the part of the graph that is not needed to calculate the values of the specified output nodes.

R-FCN Topologies

The R-FCN models are based on the Faster R-CNN models, so it is highly recommended to read the section about converting them first. Detailed information about R-FCN topologies is provided in the abstract.

Preprocessor Block

R-FCN topologies contain a Preprocessor block similar to the one in SSD and Faster R-CNN topologies. The same ObjectDetectionAPIPreprocessorReplacement sub-graph replacer is used to cut it off.

Proposal

Similar to Faster R-CNNs, R-FCN topologies contain an implementation of the Proposal layer before the SecondStageBoxPredictor block, so the ObjectDetectionAPIProposalReplacement replacement is used in the sub-graph replacement configuration file.

SecondStageBoxPredictor Block

The SecondStageBoxPredictor block differs from the block of the same name in Faster R-CNN topologies. It contains a number of CropAndResize operations consuming variously scaled boxes generated with the Proposal layer. This block of operations is converted to the Intermediate Representation as is, without sub-graph replacements.

SecondStagePostprocessor Block

The SecondStagePostprocessor block implements the functionality of the DetectionOutput layer from the Inference Engine. The ObjectDetectionAPIDetectionOutputReplacement sub-graph replacement is used to replace the block. For this type of topology, the replacer adds a convolution node to swap the coordinates of the boxes in the 0-th input tensor of the DetectionOutput layer. The custom attribute coordinates_swap_method is set to the value add_convolution in the sub-graph replacement configuration file to enable this behaviour. The other method (swap_weights) is not suitable for this type of topology because there are no MatMul or Conv2D operations before the 0-th input of the DetectionOutput layer.
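
As a toy numpy illustration of the add_convolution idea (a conceptual sketch, not the actual add_convolution_to_swap_xy_coordinates implementation): a 1x1 convolution whose weights form a permutation matrix swaps each (y, x) coordinate pair into (x, y) order.

import numpy as np

# weights of a 1x1 convolution that reorder [y1, x1, y2, x2] into [x1, y1, x2, y2]
swap_weights = np.array([[0, 1, 0, 0],
                         [1, 0, 0, 0],
                         [0, 0, 0, 1],
                         [0, 0, 1, 0]], dtype=np.float32)
boxes_yxyx = np.array([[0.1, 0.2, 0.3, 0.4]], dtype=np.float32)  # [y1, x1, y2, x2]
boxes_xyxy = boxes_yxyx @ swap_weights.T                         # -> [[0.2, 0.1, 0.4, 0.3]]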

Cutting Off Part of the Topology

The R-FCN models are cut at the end with the sub-graph replacer ObjectDetectionAPIOutputReplacement, as the Faster R-CNN topologies are, using the following output node name: detection_boxes.

Mask R-CNN Topologies

The Mask R-CNN models are based on the Faster R-CNN models, so it is highly recommended to read the section about converting them first. Detailed information about Mask R-CNN topologies is provided in the abstract.

Preprocessor Block

Mask R-CNN topologies contain a Preprocessor block similar to the one in SSD and Faster R-CNN topologies. The same ObjectDetectionAPIPreprocessorReplacement sub-graph replacer is used to cut it off.

Proposal and ROI (Region of Interest) Pooling

The Proposal and ROI Pooling layers are added to Mask R-CNN topologies in the same way as in Faster R-CNNs.

DetectionOutput

Unlike in SSDs and Faster R-CNNs, the implementation of the DetectionOutput layer in Mask R-CNN topologies is not separated into a dedicated scope. However, the matcher is defined with start/end points in the mask_rcnn_support.json file, so the replacer correctly adds the DetectionOutput layer.

One More ROIPooling

There is a second CropAndResize (the equivalent of the ROIPooling layer) that uses boxes produced with the DetectionOutput layer. The ObjectDetectionAPIMaskRCNNROIPoolingSecondReplacement replacer is used to replace this node.

class ObjectDetectionAPIMaskRCNNROIPoolingSecondReplacement(FrontReplacementFromConfigFileSubGraph):
    replacement_id = 'ObjectDetectionAPIMaskRCNNROIPoolingSecondReplacement'

    def output_edges_match(self, graph: nx.DiGraph, match: SubgraphMatch, new_sub_graph: dict):
        return {match.output_node(0)[0].id: new_sub_graph['roi_pooling_node'].id}

    def generate_sub_graph(self, graph: nx.MultiDiGraph, match: SubgraphMatch):
        argv = graph.graph['cmd_params']
        if argv.tensorflow_object_detection_api_pipeline_config is None:
            raise Error(missing_param_error)
        pipeline_config = PipelineConfig(argv.tensorflow_object_detection_api_pipeline_config)
        roi_pool_size = _value_or_raise(match, pipeline_config, 'initial_crop_size')

        detection_output_nodes_ids = [node_id for node_id, attrs in graph.nodes(data=True)
                                      if 'name' in attrs and attrs['name'] == 'detection_output']
        if len(detection_output_nodes_ids) != 1:
            raise Error("Found the following nodes '{}' with 'detection_output' but there should be exactly 1.".
                        format(detection_output_nodes_ids))
        detection_output_node = Node(graph, detection_output_nodes_ids[0])

        # add reshape of Detection Output so it can be an output of the topology
        reshape_detection_output_2d_op = Reshape(graph, dict(dim=np.array([-1, 7])))
        reshape_detection_output_2d_node = reshape_detection_output_2d_op.create_node(
            [detection_output_node], dict(name='reshape_do_2d'))

        # adds special node of type "Output" that is a marker for the output nodes of the topology
        output_op = Output(graph, dict(name='do_reshaped_OutputOp'))
        output_node = output_op.create_node([reshape_detection_output_2d_node])

        # add attribute 'output_sort_order' so it will be used as a key to sort output nodes before generation of IR
        output_node.in_edge()['data_attrs'].append('output_sort_order')
        output_node.in_edge()['output_sort_order'] = [('detection_boxes', 0)]

        # creates the Crop operation that gets input from the DetectionOutput layer, cuts of slices of data with batch
        # indices and class labels producing a tensor with classes probabilities and bounding boxes only as it is
        # expected by the ROIPooling layer
        crop_op = Crop(graph, dict(axis=np.array([3]), offset=np.array([2]), dim=np.array([5]), nchw_layout=True))
        crop_node = crop_op.create_node([detection_output_node], dict(name='crop_do'))

        # reshape bounding boxes as required by ROIPooling
        reshape_do_op = Reshape(graph, dict(dim=np.array([-1, 5])))
        reshape_do_node = reshape_do_op.create_node([crop_node], dict(name='reshape_do'))

        roi_pooling_op = ROIPooling(graph, dict(method="bilinear", spatial_scale=1,
                                                pooled_h=roi_pool_size, pooled_w=roi_pool_size))
        roi_pooling_node = roi_pooling_op.create_node([match.single_input_node(0)[0].in_node(), reshape_do_node],
                                                      dict(name='ROI_pooling_2'))
        return {'roi_pooling_node': roi_pooling_node}

As described above, the Inference Engine DetectionOutput layer implementation produces one tensor with seven numbers for each actual detection. The box coordinates must be fed to the ROIPooling layer, so the Crop layer is added to remove the unnecessary part (lines 37-38).

Then the resulting tensor is reshaped (lines 41-42) and the ROIPooling layer is created (lines 44-47).

Mask Tensors Processing

The post-processing part of Mask R-CNN topologies filters out bounding boxes with low probabilities and applies an activation function to the remaining ones. This post-processing is implemented using the Gather operation, which is not supported by the Inference Engine. A special Front replacer removes this post-processing and just inserts the activation layer at the end. The filtering of bounding boxes is done in the dedicated demo mask_rcnn_demo. The code of the replacer is the following:

class ObjectDetectionAPIMaskRCNNSigmoidReplacement(FrontReplacementFromConfigFileGeneral):
    """
    This replacer is used to convert Mask R-CNN topologies only.
    Adds activation with sigmoid function to the end of the network producing masks tensors.
    """
    replacement_id = 'ObjectDetectionAPIMaskRCNNSigmoidReplacement'

    def run_after(self):
        return [ObjectDetectionAPIMaskRCNNROIPoolingSecondReplacement]

    def transform_graph(self, graph: nx.MultiDiGraph, replacement_descriptions):
        output_node = None
        op_outputs = [n for n, d in graph.nodes(data=True) if 'op' in d and d['op'] == 'OpOutput']
        for op_output in op_outputs:
            last_node = Node(graph, op_output).in_node(0)
            if last_node.name.startswith('SecondStageBoxPredictor'):
                sigmoid_op = Activation(graph, dict(operation='sigmoid'))
                sigmoid_node = sigmoid_op.create_node([last_node], dict(name=last_node.id + '/sigmoid'))
                sigmoid_node.name = 'masks'
                if output_node is not None:
                    raise Error('Identified two possible outputs from the topology. Cannot proceed.')
                # add special node of type "Output" that is a marker for the output nodes of the topology
                output_op = Output(graph, dict(name=sigmoid_node.name + '/OutputOp'))
                output_node = output_op.create_node([sigmoid_node])

        print('The predicted masks are produced by the "masks" layer for each bounding box generated with a '
              '"detection_output" layer.\n Refer to IR catalogue in the documentation for information '
              'about the DetectionOutput layer and Inference Engine documentation about output data interpretation.\n'
              'The topology can be inferred using dedicated demo "mask_rcnn_demo".')

The replacer looks for the output node whose name starts with 'SecondStageBoxPredictor' (the other node of type 'OpOutput' is located after the DetectionOutput node). This node contains the generated masks. The replacer adds a 'Sigmoid' activation layer after this node, as is done in the initial TensorFlow* model.

Cutting Off Part of the Topology

The Mask R-CNN models are cut at the end with the sub-graph replacer ObjectDetectionAPIOutputReplacement using the following output node names:

SecondStageBoxPredictor_1/Conv_3/BiasAdd|SecondStageBoxPredictor_1/Conv_1/BiasAdd

One of these two nodes produces the output mask tensors. The child nodes of these nodes are related to post-processing, which is implemented in the Mask R-CNN demo and should be cut off.