Attention Mechanism

Divya G K
Dec 3, 2020

This is Part 4 in the blog series on the evolution of RNNs.

Part 1 — LSTM

Part 2 — GRU

Part 3 — Sequence to Sequence Learning

The attention mechanism was first introduced for machine translation in the paper "Neural Machine Translation by Jointly Learning to Align and Translate". It has evolved over recent years and is now used in almost all Natural Language Processing tasks, in Computer Vision tasks such as image captioning, in medical imaging models such as U-Net, and in speech models such as speech recognition. In fact, GPT-3, like its predecessors, is built on attention-based layers!

Let’s dive into the basics of attention first…

Intuition

In the initial encoder-decoder architecture, the encoder encodes the entire input sequence into a single fixed-length vector, which is later consumed by the decoder.

In most encoder-decoder architectures, LSTM/GRU models are used as encoders. In these basic encoder-decoder models, the intermediate encoder states are discarded, and the decoder is heavily dependent on the context vector output by the encoder. This kind of network often fails when the input sequence is long.
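
To make this bottleneck concrete, here is a minimal PyTorch sketch. The sizes, variable names and the use of nn.LSTM are illustrative assumptions, not code from the original papers. The whole source sentence gets squeezed into the final hidden state, which is all the decoder sees:

import torch
import torch.nn as nn

# Toy sizes, chosen only for illustration
vocab_size, emb_dim, hidden_dim, src_len = 100, 32, 64, 12

embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

src = torch.randint(0, vocab_size, (1, src_len))      # one source sentence (batch of 1)
enc_outputs, (h_n, c_n) = encoder(embedding(src))     # enc_outputs: (1, src_len, hidden_dim)

# Basic seq2seq: the per-step encoder states are thrown away; the decoder
# is initialised only from the final states (h_n, c_n) -- a single
# fixed-length summary of the whole sentence, however long it is.
context = h_n                                         # (1, 1, hidden_dim)

However long the source sentence grows, context stays the same size, which is why quality degrades on long inputs.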

To reduce the encoder's burden, the attention mechanism was introduced: the intermediate encoder output states are also made available to the decoder so that it can understand the context.
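
Continuing the sketch above (still an illustrative assumption, not the paper's code), the structural change is that the decoder now keeps enc_outputs, the full set of intermediate states, and at every decoding step builds a fresh context vector as a weighted sum of them:

# With attention, the decoder keeps ALL intermediate encoder states
# (enc_outputs), not only the final one. At each decoding step it forms
# a new context vector as a weighted sum of those states.
attn_weights = torch.full((1, src_len), 1.0 / src_len)        # placeholder: uniform weights for now
context = torch.bmm(attn_weights.unsqueeze(1), enc_outputs)   # (1, 1, hidden_dim)

How the weights themselves are computed is covered next.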

Attention Weights

Attention weights were introduced to help the decoder decide which parts of the input sentence it needs to pay more attention to at each decoding step.

By passing content-based scores through a softmax function, the decoder is able to figure out which word to focus on.
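
Here is one way to compute those weights, continuing the same sketch. For brevity it uses a simple dot-product score between the current decoder state and each encoder state; the original paper actually uses a small additive alignment network, but the softmax step is the same:

dec_state = h_n.transpose(0, 1)                               # (1, 1, hidden_dim): current decoder state
scores = torch.bmm(dec_state, enc_outputs.transpose(1, 2))    # (1, 1, src_len): similarity to each source word
attn_weights = torch.softmax(scores, dim=-1)                  # softmax -> weights that sum to 1
context = torch.bmm(attn_weights, enc_outputs)                # (1, 1, hidden_dim): weighted sum of encoder states

# The source position with the highest weight is the word the decoder
# "pays attention to" while producing the current output word.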
