Overview of OpenVINO™ Toolkit Public Pre-Trained Models
OpenVINO™ toolkit provides a set of public pre-trained models that you can use for learning and demo purposes or for developing deep learning software. The most recent version is available in the repository on GitHub. The table Public Pre-Trained Models Device Support summarizes the devices supported by each model.
You can download models and convert them into OpenVINO™ IR format (*.xml + *.bin) using the OpenVINO™ Model Downloader and other automation tools.
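The download-and-convert step can be sketched in Python as below. This is a minimal sketch, not the toolkit's own recipe: it assumes the `omz_downloader` and `omz_converter` command-line tools (installed with the OpenVINO development tools) are on `PATH` and accept the flags shown. The helper names `build_omz_commands` and `fetch_and_convert`, the `omz_models` directory, and the `FP16` precision are illustrative choices.

```python
import shlex
import subprocess


def build_omz_commands(model_name, work_dir="omz_models", precisions=("FP16",)):
    """Return the (download, convert) argument lists for one model.

    Assumption: the flag names below match the installed Model Downloader
    tools; check `omz_downloader --help` and `omz_converter --help`.
    """
    download = ["omz_downloader", "--name", model_name, "--output_dir", work_dir]
    convert = [
        "omz_converter",
        "--name", model_name,
        "--download_dir", work_dir,   # where the original model files were saved
        "--output_dir", work_dir,     # where the IR (*.xml + *.bin) is written
        "--precisions", ",".join(precisions),
    ]
    return download, convert


def fetch_and_convert(model_name, **kwargs):
    """Run the download step, then the conversion step, stopping on failure."""
    for command in build_omz_commands(model_name, **kwargs):
        print("running:", shlex.join(command))
        subprocess.run(command, check=True)


# Example (requires the tools and network access):
# fetch_and_convert("mobilenet-v3-large-1.0-224-tf")
```

Building the argument lists separately from running them makes it easy to inspect or log the exact commands before anything is downloaded.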
Classification Models
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
AlexNet | Caffe* | | 56.598%/79.812% | 1.5 | 60.965 |
AntiSpoofNet | PyTorch* | | 3.81% | 0.15 | 3.02 |
CaffeNet | Caffe* | | 56.714%/79.916% | 1.5 | 60.965 |
ConvNeXt Tiny | PyTorch* | | 82.05%/95.86% | 8.9419 | 28.5892 |
DenseNet 121 | Caffe* | | 74.42%/92.136% | 5.723~5.7287 | 7.971 |
DLA 34 | PyTorch* | | 74.64%/92.06% | 6.1368 | 15.7344 |
EfficientNet B0 | TensorFlow* | | 75.70%/92.76% | 0.819 | 5.268 |
EfficientNet V2 B0 | PyTorch* | | 78.36%/94.02% | 1.4641 | 7.1094 |
EfficientNet V2 Small | PyTorch* | | 84.29%/97.26% | 16.9406 | 21.3816 |
HBONet 1.0 | PyTorch* | | 73.1%/91.0% | 0.6208 | 4.5443 |
HBONet 0.25 | PyTorch* | | 57.3%/79.8% | 0.0758 | 1.9299 |
Inception (GoogleNet) V1 | Caffe* | | 68.928%/89.144% | 3.016~3.266 | 6.619~6.999 |
Inception (GoogleNet) V2 | Caffe* | | 72.024%/90.844% | 4.058 | 11.185 |
Inception (GoogleNet) V3 | TensorFlow* | | 77.904%/93.808% | 11.469 | 23.817 |
Inception (GoogleNet) V4 | TensorFlow* | | 80.204%/95.21% | 24.584 | 42.648 |
Inception-ResNet V2 | TensorFlow* | | 77.82%/94.03% | 22.227 | 30.223 |
LeViT 128S | PyTorch* | | 76.54%/92.85% | 0.6177 | 8.2199 |
MixNet L | TensorFlow* | | 78.30%/93.91% | 0.565 | 7.3 |
MobileNet V1 0.25 128 | Caffe* | | 40.54%/65% | 0.028 | 0.468 |
MobileNet V1 1.0 224 | Caffe* | | 69.496%/89.224% | 1.148 | 4.221 |
MobileNet V2 1.0 224 | Caffe* | | 71.218%/90.178% | 0.615~0.876 | 3.489 |
MobileNet V2 1.4 224 | TensorFlow* | | 74.09%/91.97% | 1.183 | 6.087 |
MobileNet V3 Small 1.0 | TensorFlow* | | 67.36%/87.44% | 0.1168 | 2.537 |
MobileNet V3 Large 1.0 | TensorFlow* | mobilenet-v3-large-1.0-224-tf | 75.30%/92.62% | 0.4450 | 5.4721 |
NFNet F0 | PyTorch* | | 83.34%/96.56% | 24.8053 | 71.4444 |
RegNetX-3.2GF | PyTorch* | | 78.17%/94.08% | 6.3893 | 15.2653 |
ResNet 26, alpha=0.25 | MXNet* | | 76.076%/92.584% | 3.768 | 15.99 |
open-closed-eye-0001 | PyTorch* | | 95.84% | 0.0014 | 0.0113 |
RepVGG A0 | PyTorch* | | 72.40%/90.49% | 2.7286 | 8.3094 |
RepVGG B1 | PyTorch* | | 78.37%/94.09% | 23.6472 | 51.8295 |
RepVGG B3 | PyTorch* | | 80.50%/95.25% | 52.4407 | 110.9609 |
ResNeSt 50 | PyTorch* | | 81.11%/95.36% | 10.8148 | 27.4493 |
ResNet 18 | PyTorch* | | 69.754%/89.088% | 3.637 | 11.68 |
ResNet 34 | PyTorch* | | 73.30%/91.42% | 7.3409 | 21.7892 |
ResNet 50 | PyTorch* | | 75.168%/92.212% | 6.996~8.216 | 25.53 |
ReXNet V1 x1.0 | PyTorch* | | 77.86%/93.87% | 0.8325 | 4.7779 |
SE-Inception | Caffe* | | 75.996%/92.964% | 4.091 | 11.922 |
SE-ResNet 50 | Caffe* | | 77.596%/93.85% | 7.775 | 28.061 |
SE-ResNeXt 50 | Caffe* | | 78.968%/94.63% | 8.533 | 27.526 |
Shufflenet V2 x0.5 | Caffe* | | 58.80%/81.13% | 0.08465 | 1.363 |
Shufflenet V2 x1.0 | PyTorch* | | 69.36%/88.32% | 0.2957 | 2.2705 |
SqueezeNet v1.0 | Caffe* | | 57.684%/80.38% | 1.737 | 1.248 |
SqueezeNet v1.1 | Caffe* | | 58.382%/81% | 0.785 | 1.236 |
Swin Transformer Tiny, window size=7 | PyTorch* | | 81.38%/95.51% | 9.0280 | 28.8173 |
T2T-ViT, transformer layers number=14 | PyTorch* | | 81.44%/95.66% | 9.5451 | 21.5498 |
VGG 16 | Caffe* | | 70.968%/89.878% | 30.974 | 138.358 |
VGG 19 | Caffe* | | 71.062%/89.832% | 39.3 | 143.667 |
Segmentation Models
Semantic segmentation is an extension of the object detection problem. Instead of returning bounding boxes, semantic segmentation models return a “painted” version of the input image, where the “color” of each pixel represents a certain class. These networks are much bigger than the respective object detection networks, but they provide better (pixel-level) localization of objects and can detect areas with complex shapes.
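The “painting” step can be illustrated with a small, dependency-free sketch. It assumes the network returns one score map per class, laid out as `class_scores[class][row][column]`; some models instead return class indices directly, in which case only the palette lookup applies. The function name `paint_segmentation` and the palette are hypothetical.

```python
def paint_segmentation(class_scores, palette):
    """Turn per-class score maps into a per-pixel color image.

    class_scores[c][y][x] is the score of class c at pixel (y, x).
    palette[c] is the color assigned to class c.
    """
    num_classes = len(class_scores)
    height = len(class_scores[0])
    width = len(class_scores[0][0])
    painted = []
    for y in range(height):
        row = []
        for x in range(width):
            # Pick the class with the highest score at this pixel (argmax).
            best = max(range(num_classes), key=lambda c: class_scores[c][y][x])
            row.append(palette[best])
        painted.append(row)
    return painted


# Two classes on a 2x2 image: background (black) and object (red).
scores = [
    [[0.9, 0.2], [0.6, 0.1]],  # class 0 scores
    [[0.1, 0.8], [0.4, 0.9]],  # class 1 scores
]
colors = [(0, 0, 0), (255, 0, 0)]
painted_image = paint_segmentation(scores, colors)
```

Each output pixel carries exactly one class color, which is what distinguishes this result from a list of bounding boxes.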
Semantic Segmentation Models
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
DeepLab V3 | TensorFlow* | | 68.41% | 11.469 | 23.819 |
DRN-D-38 | PyTorch* | | 71.31% | 1768.3276 | 25.9939 |
Erfnet | PyTorch* | | 76.47% | 11.13 | 7.87 |
HRNet V2 C1 Segmentation | PyTorch* | | 77.69% | 81.993 | 66.4768 |
Fastseg MobileV3Large LR-ASPP, F=128 | PyTorch* | | 72.67% | 140.9611 | 3.2 |
Fastseg MobileV3Small LR-ASPP, F=128 | PyTorch* | | 67.15% | 69.2204 | 1.1 |
PSPNet R-50-D8 | PyTorch* | | 70.6% | 357.1719 | 46.5827 |
OCRNet HRNet_w48 | Paddle* | | 82.15% | 324.66 | 70.47 |
Instance Segmentation Models
Instance segmentation is an extension of the object detection and semantic segmentation problems. Instead of predicting a bounding box around each object instance, an instance segmentation model outputs pixel-wise masks for all instances.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
Mask R-CNN Inception ResNet V2 | TensorFlow* | | 39.86%/35.36% | 675.314 | 92.368 |
Mask R-CNN ResNet 50 | TensorFlow* | | 29.75%/27.46% | 294.738 | 50.222 |
YOLACT ResNet 50 FPN | PyTorch* | | 28.0%/30.69% | 118.575 | 36.829 |
3D Semantic Segmentation Models
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
Brain Tumor Segmentation | MXNet* | | 92.4003% | 409.996 | 38.192 |
Brain Tumor Segmentation 2 | PyTorch* | | 91.4826% | 300.801 | 4.51 |
Object Detection Models
Several detection models can be used to detect a set of the most popular objects, for example, faces, people, and vehicles. Most of the networks are SSD-based and provide reasonable accuracy/performance trade-offs.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
CTPN | TensorFlow* | | 73.67% | 55.813 | 17.237 |
CenterNet (CTDET with DLAV0) 512x512 | ONNX* | | 44.2756% | 62.211 | 17.911 |
DETR-ResNet50 | PyTorch* | | 39.27%/42.36% | 174.4708 | 41.3293 |
EfficientDet-D0 | TensorFlow* | | 31.95% | 2.54 | 3.9 |
EfficientDet-D1 | TensorFlow* | | 37.54% | 6.1 | 6.6 |
FaceBoxes | PyTorch* | | 83.565% | 1.8975 | 1.0059 |
Face Detection Retail | Caffe* | | 83.00% | 1.067 | 0.588 |
Faster R-CNN with Inception-ResNet v2 | TensorFlow* | | 40.69% | 30.687 | 13.307 |
Faster R-CNN with ResNet 50 | TensorFlow* | | 31.09% | 57.203 | 29.162 |
MobileFace Detection V1 | MXNet* | | 78.7488% | 3.5456 | 7.6828 |
Mobilenet-yolo-v4-syg | Keras* | | 86.35% | 65.981 | 61.922 |
MTCNN | Caffe* | mtcnn: | 48.1308%/62.2625% | | |
NanoDet with ShuffleNetV2 1.5x, size=416 | PyTorch* | | 27.38%/26.63% | 2.3895 | 2.0534 |
NanoDet Plus with ShuffleNetV2 1.5x, size=416 | PyTorch* | | 34.53%/33.77% | 3.0147 | 2.4614 |
Pelee | Caffe* | | 21.9761% | 1.290 | 5.98 |
RetinaFace with ResNet 50 | PyTorch* | | 91.78% | 88.8627 | 27.2646 |
RetinaNet with Resnet 50 | TensorFlow* | | 33.15% | 238.9469 | 64.9706 |
R-FCN with Resnet-101 | TensorFlow* | | 28.40%/45.02% | 53.462 | 171.85 |
SSD 300 | Caffe* | | 87.09% | 62.815 | 26.285 |
SSD 512 | Caffe* | | 91.07% | 180.611 | 27.189 |
SSD with MobileNet | Caffe* | | 67.00% | 2.316~2.494 | 5.783~6.807 |
SSD with MobileNet FPN | TensorFlow* | | 35.5453% | 123.309 | 36.188 |
SSD lite with MobileNet V2 | TensorFlow* | | 24.2946% | 1.525 | 4.475 |
SSD with ResNet 34 1200x1200 | PyTorch* | | 20.7198%/39.2752% | 433.411 | 20.058 |
Ultra Lightweight Face Detection RFB 320 | PyTorch* | | 84.78% | 0.2106 | 0.3004 |
Ultra Lightweight Face Detection slim 320 | PyTorch* | | 83.32% | 0.1724 | 0.2844 |
Vehicle License Plate Detection Barrier | TensorFlow* | | 99.52% | 0.271 | 0.547 |
YOLO v1 Tiny | TensorFlow.js* | | 54.79% | 6.9883 | 15.8587 |
YOLO v2 Tiny | Keras* | | 27.3443%/29.1184% | 5.4236 | 11.2295 |
YOLO v2 | Keras* | | 53.1453%/56.483% | 63.0301 | 50.9526 |
YOLO v3 | Keras* | | 62.2759%/67.7221% | 65.9843~65.998 | 61.9221~61.930 |
YOLO v3 Tiny | Keras* | | 35.9%/39.7% | 5.582 | 8.848~8.8509 |
YOLO v4 | Keras* | | 71.23%/77.40%/50.26% | 129.5567 | 64.33 |
YOLO v4 Tiny | Keras* | | | 6.9289 | 6.0535 |
YOLOF | PyTorch* | | 60.69%/66.23%/43.63% | 175.37942 | 48.228 |
YOLOX Tiny | PyTorch* | | 47.85%/52.56%/31.82% | 6.4813 | 5.0472 |
Face Recognition Models
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
FaceNet | TensorFlow* | | 99.14% | 2.846 | 23.469 |
LResNet100E-IR,ArcFace@ms1m-refine-v2 | MXNet* | | 99.68% | 24.2115 | 65.1320 |
SphereFace | Caffe* | | 98.8321% | 3.504 | 22.671 |
Human Pose Estimation Models
The human pose estimation task is to predict a pose (a body skeleton that consists of keypoints and the connections between them) for every person in an input image or video. Keypoints are body joints, such as ears, eyes, nose, shoulders, and knees. There are two major groups of such methods: top-down and bottom-up. Top-down methods detect persons in a given frame, crop or rescale the detections, and then run a pose estimation network on every detection. These methods are very accurate. Bottom-up methods find all keypoints in a given frame and then group them by person instance. They are faster than top-down methods because the network runs only once.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
human-pose-estimation-3d-0001 | PyTorch* | | 100.44437mm | 18.998 | 5.074 |
single-human-pose-estimation-0001 | PyTorch* | | 69.0491% | 60.125 | 33.165 |
higher-hrnet-w32-human-pose-estimation | PyTorch* | | 64.64% | 92.8364 | 28.6180 |
Monocular Depth Estimation Models
The task of monocular depth estimation is to predict a depth (or inverse depth) map based on a single input image. Since this task is ambiguous in the general setting, the resulting depth maps are often defined only up to an unknown scaling factor.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
midasnet | PyTorch* | | 0.07071 | 207.25144 | 104.081 |
FCRN ResNet50-Upproj | TensorFlow* | | 0.573 | 63.5421 | 34.5255 |
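Because a predicted depth map may be correct only up to an unknown scaling factor, it usually has to be rescaled before it is compared with reference depth. One simple option is a least-squares scale fit, sketched below on flattened pixel lists. This is an illustration only and not necessarily how the accuracy values in the table were computed; the function name `align_scale` is hypothetical.

```python
def align_scale(predicted, reference):
    """Return the scale s that minimizes sum((s * p - r) ** 2).

    predicted and reference are equal-length sequences of depth values
    (for example, a depth map flattened to one list of pixels).
    Setting the derivative to zero gives s = sum(p * r) / sum(p * p).
    """
    numerator = sum(p * r for p, r in zip(predicted, reference))
    denominator = sum(p * p for p in predicted)
    if denominator == 0:
        raise ValueError("predicted depth is all zeros; scale is undefined")
    return numerator / denominator


# The prediction has the right shape but half the reference scale.
scale = align_scale([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
rescaled = [scale * p for p in [1.0, 2.0, 3.0]]
```

Some evaluation protocols also fit an offset (shift) in addition to the scale; the one-parameter fit above is the smallest example of the idea.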
Image Inpainting Models
The image inpainting task is to estimate suitable pixel information to fill holes in images.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
GMCNN Inpainting | TensorFlow* | | 33.47dB | 691.1589 | 12.7773 |
Hybrid-CS-Model-MRI | TensorFlow* | | 34.27dB | 146.6037 | 11.3313 |
Style Transfer Models
The style transfer task is to transfer the style of one image to another.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
fast-neural-style-mosaic-onnx | ONNX* | | 12.04dB | 15.518 | 1.679 |
Action Recognition Models
The task of action recognition is to predict the action being performed in a short video clip (a tensor formed by stacking frames sampled from the input video).
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
RGB-I3D, pretrained on ImageNet* | TensorFlow* | | 64.83%/84.58% | 278.9815 | 12.6900 |
common-sign-language-0001 | PyTorch* | | 93.58% | 4.2269 | 4.1128 |
Colorization Models
The colorization task is to predict the colors of a scene from a grayscale image.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
colorization-v2 | PyTorch* | | 26.99dB | 83.6045 | 32.2360 |
colorization-siggraph | PyTorch* | | 27.73dB | 150.5441 | 34.0511 |
Sound Classification Models
The task of sound classification is to predict what sounds are in an audio fragment.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
ACLNet | PyTorch* | | 86%/92% | 1.4 | 2.7 |
ACLNet-int8 | PyTorch* | | 87%/93% | 1.41 | 2.71 |
Speech Recognition Models
The task of speech recognition is to recognize and translate spoken language into text.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
DeepSpeech V0.6.1 | TensorFlow* | | 7.55% | 0.0472 | 47.2 |
DeepSpeech V0.8.2 | TensorFlow* | | 6.13% | 0.0472 | 47.2 |
QuartzNet | PyTorch* | | 3.86% | 2.4195 | 18.8857 |
Wav2Vec 2.0 Base | PyTorch* | | 3.39% | 26.843 | 94.3965 |
Image Translation Models
The task of image translation is to generate an output image based on an exemplar.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
CoCosNet | PyTorch* | | 12.93dB | 1080.7032 | 167.9141 |
Optical Character Recognition Models
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
license-plate-recognition-barrier-0007 | TensorFlow* | | 98% | 0.347 | 1.435 |
Place Recognition Models
The task of place recognition is to quickly and accurately recognize the location of a given query photograph.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
NetVLAD | TensorFlow* | | 82.0321% | 36.6374 | 149.0021 |
Deblurring Models
The task of image deblurring is to restore a sharp image from a blurred one.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
DeblurGAN-v2 | PyTorch* | | 28.25dB | 80.8919 | 2.1083 |
JPEG Artifacts Removal Models
The task of JPEG artifacts removal is to restore images degraded by JPEG compression.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
FBCNN | PyTorch* | | 34.34dB | 1420.78235 | 71.922 |
Salient Object Detection Models
Salient object detection is a task based on a visual attention mechanism, in which algorithms aim to find objects or regions that attract more attention than the surrounding areas of a scene or image.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
F3Net | PyTorch* | | 84.21% | 31.2883 | 25.2791 |
Text Prediction Models
Text prediction is the task of predicting the next word, given all of the previous words within some text.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
GPT-2 | PyTorch* | | 29.00% | 293.0489 | 175.6203 |
Text Recognition Models
Scene text recognition is the task of recognizing text in a given image. Researchers compete on creating algorithms that can recognize text of different shapes, fonts, and backgrounds. See details about the datasets here. The reported metric is collected over the alphanumeric subset of ICDAR13 (1015 images) in case-insensitive mode.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
Resnet-FC | PyTorch* | | 90.94% | 40.3704 | 177.9668 |
ViTSTR Small patch=16, size=224 | PyTorch* | | 90.34% | 9.1544 | 21.5061 |
Text to Speech Models
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
ForwardTacotron | PyTorch* | forward-tacotron: | | | |
WaveRNN | PyTorch* | wavernn: | | | |
Named Entity Recognition Models
Named entity recognition (NER) is the task of tagging entities in text with their corresponding type.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
bert-base-NER | PyTorch* | | 94.45% | 22.3874 | 107.4319 |
Vehicle Reidentification Models
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
vehicle-reid-0001 | PyTorch* | | 96.31%/85.15% | 2.643 | 2.183 |
Background Matting Models
Background matting is a method of separating a foreground from a background in an image or video, wherein some pixels may belong to the foreground as well as the background. Such pixels are called partial or mixed pixels. This distinguishes background matting from segmentation approaches, where the result is a binary mask.
Model Name | Implementation | OMZ Model Name | Accuracy | GFlops | mParams |
---|---|---|---|---|---|
background-matting-mobilenetv2 | PyTorch* | | 4.32/1.0/2.48/2.7 | 6.7419 | 5.052 |
modnet-photographic-portrait-matting | PyTorch* | | 5.21/727.95 | 31.1564 | 6.4597 |
modnet-webcam-portrait-matting | PyTorch* | | 5.66/762.52 | 31.1564 | 6.4597 |
robust-video-matting-mobilenetv3 | PyTorch* | | 20.8/15.1/4.42/4.05 | 9.3892 | 3.7363 |
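The difference between a mixed pixel and a binary mask can be made concrete with the standard alpha compositing equation, `pixel = alpha * foreground + (1 - alpha) * background`. The sketch below assumes a matting model has produced an alpha value in [0, 1] for the pixel; the exact outputs of the models listed above may differ, and the function name `composite_pixel` is hypothetical.

```python
def composite_pixel(alpha, foreground, background):
    """Blend one foreground color over one background color.

    alpha = 1.0 means pure foreground and alpha = 0.0 means pure background.
    Values in between describe a partial (mixed) pixel, which a binary
    segmentation mask cannot represent.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the range [0, 1]")
    return tuple(
        alpha * f + (1.0 - alpha) * b
        for f, b in zip(foreground, background)
    )


# A half-transparent pixel, such as one on the edge of a strand of hair.
mixed = composite_pixel(0.5, (200, 100, 0), (0, 100, 200))
```

Replacing the background of a frame then amounts to applying this blend per pixel with a new background color or image.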
Legal Information
Caffe, Caffe2, Keras, MXNet, PyTorch, and TensorFlow are trademarks or brand names of their respective owners. All company, product, and service names used on this website are for identification purposes only. Use of these names, trademarks, and brands does not imply endorsement.