AUGRUSequence
Versioned name: AUGRUSequence
Category: Sequence processing
Short description: The AUGRUSequence operation represents a series of AUGRU cells (GRU with an attentional update gate).
Detailed description: The main difference between AUGRUSequence and GRUSequence is the additional attention score input A, which is a multiplier for the update gate. The AUGRU formula is based on the paper arXiv:1809.03672.
AUGRU formula:

* - matrix multiplication
(.) - Hadamard product (element-wise)
f, g - activation functions
z - update gate, r - reset gate, h - hidden gate
a - attention score

rt = f(Xt*(Wr^T) + Ht-1*(Rr^T) + Wbr + Rbr)
zt = f(Xt*(Wz^T) + Ht-1*(Rz^T) + Wbz + Rbz)
ht = g(Xt*(Wh^T) + (rt (.) Ht-1)*(Rh^T) + Wbh + Rbh)  # 'linear_before_reset' is False
zt' = (1 - at) (.) zt                                 # update gate scaled by the attention score
Ht = (1 - zt') (.) ht + zt' (.) Ht-1
Activation functions for gates: sigmoid for f, tanh for g.
Only forward direction is supported, so num_directions is always equal to 1.
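For reference, the formulas above can be expressed as a minimal NumPy sketch of a single AUGRU step (single direction, gate order zrh, linear_before_reset set to False; the function and argument names are illustrative, not part of this specification):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def augru_cell(x_t, h_prev, a_t, W, R, B, clip=0.0):
    # x_t:    [batch_size, input_size]        input at the current time step
    # h_prev: [batch_size, hidden_size]       previous hidden state Ht-1
    # a_t:    [batch_size, 1]                 attention score at the current step
    # W:      [3 * hidden_size, input_size]   packed weights, gate order zrh
    # R:      [3 * hidden_size, hidden_size]  packed recurrence weights, gate order zrh
    # B:      [3 * hidden_size]               packed biases (Wb + Rb per gate), gate order zrh
    Wz, Wr, Wh = np.split(W, 3, axis=0)
    Rz, Rr, Rh = np.split(R, 3, axis=0)
    Bz, Br, Bh = np.split(B, 3)

    def pre(v):
        # the optional clip attribute bounds values to [-C, C] before activations
        return np.clip(v, -clip, clip) if clip > 0.0 else v

    z = sigmoid(pre(x_t @ Wz.T + h_prev @ Rz.T + Bz))        # update gate
    r = sigmoid(pre(x_t @ Wr.T + h_prev @ Rr.T + Br))        # reset gate
    h = np.tanh(pre(x_t @ Wh.T + (r * h_prev) @ Rh.T + Bh))  # hidden gate
    z = (1.0 - a_t) * z                                      # attentional update gate zt'
    return (1.0 - z) * h + z * h_prev                        # Ht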
Attributes
hidden_size
Description: hidden_size specifies hidden state size.
Range of values: a positive integer
Type: int
Required: yes
activations
Description: activation functions for gates
Range of values: sigmoid, tanh
Type: a list of strings
Default value: sigmoid for f, tanh for g
Required: no
activations_alpha, activations_beta
Description: activations_alpha, activations_beta attributes of the activation functions; applicability and meaning of these attributes depend on the chosen activation functions
Range of values: []
Type: float[]
Default value: []
Required: no
clip
Description: clip specifies bound values [-C, C] for tensor clipping. Clipping is performed before activations.
Range of values: 0.
Type: float
Default value: 0. (that means clipping is not applied)
Required: no
direction
Description: Specifies if the RNN is forward, reverse, or bidirectional. If it is forward or reverse, then num_directions = 1; if it is bidirectional, then num_directions = 2. This num_directions value specifies input/output shape requirements.
Range of values: forward
Type: string
Default value: forward
Required: no
linear_before_reset
Description: The linear_before_reset flag denotes whether the output of the hidden gate is multiplied by the reset gate before or after the linear transformation.
Range of values: False
Type: boolean
Default value: False
Required: no
Inputs
1: X - 3D tensor of type T1 and shape [batch_size, seq_length, input_size], the input data. Required.
2: H_t - 3D tensor of type T1 and shape [batch_size, num_directions, hidden_size], the initial hidden state data. Required.
3: sequence_lengths - 1D tensor of type T2 and shape [batch_size]. Specifies real sequence lengths for each batch element. Required.
4: W - 3D tensor of type T1 and shape [num_directions, 3 * hidden_size, input_size]. The weights for matrix multiplication, gate order: zrh. Required.
5: R - 3D tensor of type T1 and shape [num_directions, 3 * hidden_size, hidden_size]. The recurrence weights for matrix multiplication, gate order: zrh. Required.
6: B - 2D tensor of type T1. The biases. If linear_before_reset is set to False, the shape is [num_directions, 3 * hidden_size], gate order: zrh. Otherwise the shape is [num_directions, 4 * hidden_size]: the sums of biases for the z and r gates (weights and recurrence weights), with the biases for the h gate placed separately. Required.
7: A - 3D tensor of type T1 and shape [batch_size, seq_length, 1], the attention score. Required.
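To illustrate the packed layouts above, the per-gate slices for a single direction could be obtained as follows (a sketch; the shapes are taken from the Example below, and the four-way bias split is an assumption about the ordering described for input 6):

import numpy as np

num_directions, hidden_size, input_size = 1, 128, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((num_directions, 3 * hidden_size, input_size))
R = rng.standard_normal((num_directions, 3 * hidden_size, hidden_size))
B = np.zeros((num_directions, 3 * hidden_size))  # linear_before_reset == False

d = 0  # direction index; only forward is supported, so d == 0
Wz, Wr, Wh = np.split(W[d], 3, axis=0)  # each [hidden_size, input_size]
Rz, Rr, Rh = np.split(R[d], 3, axis=0)  # each [hidden_size, hidden_size]
Bz, Br, Bh = np.split(B[d], 3)          # each is the sum Wb + Rb for its gate
# With linear_before_reset == True (not supported by this operation), B[d] would
# have shape [4 * hidden_size] and, assuming the h-gate biases follow the summed
# z and r biases, would split as: Bz, Br, Wbh, Rbh = np.split(B[d], 4)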
Outputs
1: Y - 4D tensor of type T1 and shape [batch_size, num_directions, seq_length, hidden_size], concatenation of all the intermediate output values of the hidden state.
2: Ho - 3D tensor of type T1 and shape [batch_size, num_directions, hidden_size], the last output value of the hidden state.
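Putting the shapes together, a hedged end-to-end sketch of the sequence semantics, reusing the augru_cell function from the earlier snippet (the dimensions match the Example below; freezing the hidden state and zeroing Y past each sequence's real length is an assumption about typical RNN-sequence behavior, not something this specification states):

import numpy as np

batch_size, seq_length, input_size, hidden_size = 1, 4, 16, 128
num_directions = 1  # only forward is supported

rng = np.random.default_rng(0)
X = rng.standard_normal((batch_size, seq_length, input_size))
Ht = np.zeros((batch_size, num_directions, hidden_size))
sequence_lengths = np.full((batch_size,), seq_length)
W = rng.standard_normal((num_directions, 3 * hidden_size, input_size))
R = rng.standard_normal((num_directions, 3 * hidden_size, hidden_size))
B = np.zeros((num_directions, 3 * hidden_size))  # linear_before_reset == False
A = rng.random((batch_size, seq_length, 1))      # attention scores

Y = np.zeros((batch_size, num_directions, seq_length, hidden_size))
h = Ht[:, 0, :]
for t in range(seq_length):
    h_next = augru_cell(X[:, t, :], h, A[:, t, :], W[0], R[0], B[0])
    valid = (t < sequence_lengths)[:, None]  # mask of still-running sequences (assumption)
    h = np.where(valid, h_next, h)
    Y[:, 0, t, :] = np.where(valid, h_next, 0.0)
Ho = h[:, None, :]  # [batch_size, num_directions, hidden_size]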
Types
T1: any supported floating-point type.
T2: any supported integer type.
Example
<layer ... type="AUGRUSequence" ...>
<data hidden_size="128"/>
<input>
<port id="0"> <!-- `X` input data -->
<dim>1</dim>
<dim>4</dim>
<dim>16</dim>
</port>
<port id="1"> <!-- `H_t` input -->
<dim>1</dim>
<dim>1</dim>
<dim>128</dim>
</port>
<port id="2"> <!-- `sequence_lengths` input -->
<dim>1</dim>
</port>
<port id="3"> <!-- `W` weights input -->
<dim>1</dim>
<dim>384</dim>
<dim>16</dim>
</port>
<port id="4"> <!-- `R` recurrence weights input -->
<dim>1</dim>
<dim>384</dim>
<dim>128</dim>
</port>
<port id="5"> <!-- `B` bias input -->
<dim>1</dim>
<dim>384</dim>
</port>
<port id="6"> <!-- `A` attention score input -->
<dim>1</dim>
<dim>4</dim>
<dim>1</dim>
</port>
</input>
<output>
<port id="7"> <!-- `Y` output -->
<dim>1</dim>
<dim>1</dim>
<dim>4</dim>
<dim>128</dim>
</port>
<port id="8"> <!-- `Ho` output -->
<dim>1</dim>
<dim>1</dim>
<dim>128</dim>
</port>
</output>
</layer>