FakeConvert#

Note: FakeConvert is an experimental operation and subject to change.

Versioned name: FakeConvert-13

Category: Quantization

Short description: FakeConvert is element-wise quantization of floating-point input values into a set of values corresponding to a target low-precision floating-point type.

Detailed description: FakeConvert operation converts the input tensor to a specified target low-precision floating-point type and performs backward conversion to the source precision. It also applies affine transformation defined by scale and shift parameters before the conversion step and its reverse form after the backward conversion.

It emulates types defined by the destination_type attribute, on the original type of the data input.

Possible destination types are: “f8e4m3”, “f8e5m2”. The “f8e4m3” is an 8-bit floating-point format, where 1 bit for the sign, 4 bits for the exponents and 3 bits for the mantissa. The “f8e5m2” is an 8-bit floating-point format, where 1 bit is for the sign, 5 bits for the exponents and 2 for the mantissa. The FP8 types were introduced in the following paper: FP8 Formats for Deep Learning .

Fake in FakeConvert means that the output tensor preserve the same element type as an original type of the input tensor, not the destination_type.

Each element of the output is defined as the result of the following expression:

 data = data * scale - shift
 ConvertLike(Convert(data, destination_type), data)
 data = (data + shift) / scale

Attributes

destination_type
- Description: destination_type is the emulated type
- Range of values: “f8e4m3”, “f8e5m2”
- Type: string
- Required: yes

Inputs:

1: data - tensor of type T_F and arbitrary shape. Required.
2: scale - tensor of type T_F with a scale factor for the data input value. The shape must be numpy-broadcastable to the shape of data. Required.
3: shift - tensor of type T_F with value to subtract before and add after conversion of the data input value. The shape must be numpy-broadcastable to the shape of data, and match the shape of the scale input. Optional.

Outputs:

1: Output tensor of type T_F with shape and type matching the 1st input tensor data.

Types

T_F: supported floating-point type (FP16, BF16, FP32).

Example

<layer … type="FakeConvert"…>
    <data destination_type="f8e4m3"/>
    <input>
        <port id="0">
            <dim>1</dim>
            <dim>64</dim>
            <dim>56</dim>
            <dim>56</dim>
        </port>
        <port id="1">
            <dim>1</dim>
            <dim>64</dim>
            <dim>1</dim>
            <dim>1</dim>
        </port>
        <port id="2">
            <dim>1</dim>
            <dim>64</dim>
            <dim>1</dim>
            <dim>1</dim>
        </port>
    </input>
    <output>
        <port id="3">
            <dim>1</dim>
            <dim>64</dim>
            <dim>56</dim>
            <dim>56</dim>
        </port>
    </output>
</layer>