vitstr-small-patch16-224

Use-case and high-level description

The vitstr-small-patch16-224 model is small version of the ViTSTR models. ViTSTR is a simple single-stage model that uses a pre-trained Vision Transformer (ViT) to perform Scene Text Recognition (ViTSTR). Small version of model has an embedding size of 384 and number of heads of 6. Model is able to recognize alphanumeric case sensitive text and special characters.

More details provided in the paper and repository.

Specification

Metric

Value

Type

Scene Text Recognition

GFLOPs

9.1544

MParams

21.5061

Source framework

PyTorch*

Accuracy

Alphanumeric subset of common scene text recognition benchmarks are used. For your convenience you can see dataset size. Note, that we use here ICDAR15 alphanumeric subset without irregular (arbitrary oriented, perspective or curved) texts. See details here, section 4.1. All reported results are achieved without using any lexicon.

Dataset

Accuracy

Dataset size

ICDAR-03

93.43%

867

ICDAR-13

90.34%

1015

ICDAR-15

75.04%

1811

SVT

85.47%

647

IIIT5K

87.07%

3000

Use accuracy_check [...] --model_attributes <path_to_folder_with_downloaded_model> to specify the path to additional model attributes. path_to_folder_with_downloaded_model is a path to the folder, where the current model is downloaded by Model Downloader tool.

Input

Original model

Image, name: image, shape: 1, 1, 224, 224 in the format B, C, H, W, where:

  • B - batch size

  • C - number of channels

  • H - image height

  • W - image width

Note that the source image should be tight aligned crop with detected text converted to grayscale.

Scale values - [255].

Converted model

Image, name: image, shape: 1, 1, 224, 224 in the format B, C, H, W, where:

  • B - batch size

  • C - number of channels

  • H - image height

  • W - image width

Note that the source image should be tight aligned crop with detected text converted to grayscale.

Output

Original model

Output tensor, name: logits, shape: 1, 25, 96 in the format B, W, L, where:

  • B - batch size

  • W - output sequence length

  • L - confidence distribution across [GO] - special start token for decoder, [s] - special end of sequence character for decoder and characters, listed in enclosed file vocab.txt.

The network output decoding process is pretty easy: get the argmax on L dimension, transform indices to letters and slice the resulting phrase on the first entry of end-of-sequence symbol.

Converted model

Output tensor, name: logits, shape: 1, 25, 96 in the format B, W, L, where:

  • B - batch size

  • W - output sequence length

  • L - confidence distribution across [GO] - special start token for decoder, [s] - special end of sequence character for decoder and characters, listed in enclosed file vocab.txt.

The network output decoding process is pretty easy: get the argmax on L dimension, transform indices to letters and slice the resulting phrase on the first entry of end-of-sequence symbol.

Download a Model and Convert it into OpenVINO™ IR Format

You can download models and if necessary convert them into OpenVINO™ IR format using the Model Downloader and other automation tools as shown in the examples below.

An example of using the Model Downloader:

omz_downloader --name <model_name>

An example of using the Model Converter:

omz_converter --name <model_name>

Demo usage

The model can be used in the following demos provided by the Open Model Zoo to show its capabilities: