# vitstr-small-patch16-224

## Use-case and high-level description

The vitstr-small-patch16-224 model is the small version of the ViTSTR models. ViTSTR is a simple single-stage model that uses a pre-trained Vision Transformer (ViT) to perform Scene Text Recognition (ViTSTR). The small version has an embedding size of 384 and 6 attention heads. The model recognizes case-sensitive alphanumeric text and special characters.

More details are provided in the paper and repository.

## Specification

| Metric           | Value                  |
|------------------|------------------------|
| Type             | Scene Text Recognition |
| GFLOPs           | 9.1544                 |
| MParams          | 21.5061                |
| Source framework | PyTorch*               |

## Accuracy

Alphanumeric subsets of common scene text recognition benchmarks are used; dataset sizes are listed for convenience. Note that the ICDAR15 subset used here is alphanumeric and excludes irregular (arbitrarily oriented, perspective, or curved) text. See details here, section 4.1. All reported results are achieved without using any lexicon.

| Dataset  | Accuracy | Dataset size |
|----------|----------|--------------|
| ICDAR-03 | 93.43%   | 867          |
| ICDAR-13 | 90.34%   | 1015         |
| ICDAR-15 | 75.04%   | 1811         |
| SVT      | 85.47%   | 647          |
| IIIT5K   | 87.07%   | 3000         |

Use `accuracy_check [...] --model_attributes <path_to_folder_with_downloaded_models>` to specify the path to additional model attributes, where `<path_to_folder_with_downloaded_models>` is the path to the folder into which the models are downloaded by the Model Downloader tool.

## Input

### Original model

Image, name: `image`, shape: `1, 1, 224, 224` in the format `B, C, H, W`, where:

- `B` - batch size
- `C` - number of channels
- `H` - image height
- `W` - image width

Note that the source image should be a tightly aligned crop of the detected text, converted to grayscale.

Scale values - `[255]`.
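For illustration, a minimal preprocessing sketch in Python (OpenCV and NumPy are assumed to be available; the function name is hypothetical):

```python
import cv2
import numpy as np

def preprocess(image_path: str) -> np.ndarray:
    # Read the text crop as a single grayscale channel.
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Resize to the model's expected 224x224 input resolution.
    img = cv2.resize(img, (224, 224))
    # Apply the scale value of 255 to map pixel values to [0, 1].
    img = img.astype(np.float32) / 255.0
    # Add batch and channel dimensions: (224, 224) -> (1, 1, 224, 224).
    return img[np.newaxis, np.newaxis, :, :]
```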

### Converted model

Image, name: `image`, shape: `1, 1, 224, 224` in the format `B, C, H, W`, where:

- `B` - batch size
- `C` - number of channels
- `H` - image height
- `W` - image width

Note that the source image should be a tightly aligned crop of the detected text, converted to grayscale.

## Output

### Original model

Output tensor, name: `logits`, shape: `1, 25, 96` in the format `B, W, L`, where:

- `B` - batch size
- `W` - output sequence length
- `L` - confidence distribution over `[GO]` (special start token for the decoder), `[s]` (special end-of-sequence character for the decoder), and the characters listed in the enclosed `vocab.txt` file

Decoding the network output is straightforward: take the argmax over the `L` dimension, map the indices to characters, and truncate the resulting string at the first end-of-sequence symbol, as in the sketch below.
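A minimal decoding sketch in Python, assuming `logits` is the `(1, 25, 96)` output array and the class order is `[GO]`, `[s]`, then the characters from `vocab.txt` (this ordering is an assumption based on the description above):

```python
import numpy as np

def decode(logits: np.ndarray, vocab: str) -> str:
    # Assumed class order: [GO] at index 0, [s] at index 1,
    # then the characters from vocab.txt.
    alphabet = ["[GO]", "[s]"] + list(vocab)
    # Argmax over the L (confidence) dimension -> one index per position.
    indices = logits[0].argmax(axis=-1)
    text = ""
    for i in indices:
        token = alphabet[i]
        if token == "[s]":   # truncate at the first end-of-sequence symbol
            break
        if token != "[GO]":  # skip the decoder start token
            text += token
    return text
```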

### Converted model

Output tensor, name: `logits`, shape: `1, 25, 96` in the format `B, W, L`, where:

- `B` - batch size
- `W` - output sequence length
- `L` - confidence distribution over `[GO]` (special start token for the decoder), `[s]` (special end-of-sequence character for the decoder), and the characters listed in the enclosed `vocab.txt` file

Decoding the network output is straightforward: take the argmax over the `L` dimension, map the indices to characters, and truncate the resulting string at the first end-of-sequence symbol (see the sketch above).

## Download a Model and Convert

You can download models and, if necessary, convert them into OpenVINO™ IR format using the Model Downloader and other automation tools, as shown in the examples below.

```sh
omz_downloader --name <model_name>
omz_converter --name <model_name>
```
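For completeness, a minimal inference sketch using the OpenVINO Runtime Python API (the IR path below is an assumption; `omz_converter` reports the actual output location):

```python
import numpy as np
from openvino.runtime import Core

core = Core()
# Hypothetical IR path; check the omz_converter output for the real one.
model = core.read_model("public/vitstr-small-patch16-224/FP32/vitstr-small-patch16-224.xml")
compiled = core.compile_model(model, "CPU")

# `image` is a preprocessed (1, 1, 224, 224) float32 array (see the Input section).
image = np.zeros((1, 1, 224, 224), dtype=np.float32)
logits = compiled([image])[compiled.output(0)]  # shape: (1, 25, 96)
```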