Attention Mechanism

Divya G K
Dec 3, 2020

This is Part 4 in the blog series on the evolution of RNNs.

Part 1 — LSTM

Part 2 — GRU

Part 3 — Sequence to Sequence Learning

The attention mechanism was first introduced for machine translation in the paper "Neural Machine Translation by Jointly Learning to Align and Translate". It has evolved over recent years and is now used in almost all Natural Language Processing tasks, in Computer Vision tasks such as image captioning, in medical imaging models such as U-Net, and in speech models such as speech recognition. In fact, GPT-3, like its predecessors, is built on attention-based layers!

Let’s dive into the basics of attention first…

Intuition

In the initial encoder-decoder architecture, the encoder encodes the entire input sequence into a single fixed-length vector, which is later consumed by the decoder.

In most encoder-decoder architectures, LSTM/GRU models are used as encoders. In these basic encoder-decoder models, the intermediate encoder states are discarded, and the decoder is heavily dependent on the context vector output by the encoder. This kind of network often fails when the input sequence is long.
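
To make this bottleneck concrete, here is a minimal PyTorch sketch. The sizes, variable names and the use of nn.LSTM are illustrative assumptions, not code from the original papers. The whole source sentence gets squeezed into the final hidden state, which is all the decoder sees:

import torch
import torch.nn as nn

# Toy sizes, chosen only for illustration
vocab_size, emb_dim, hidden_dim, src_len = 100, 32, 64, 12

embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

src = torch.randint(0, vocab_size, (1, src_len))      # one source sentence (batch of 1)
enc_outputs, (h_n, c_n) = encoder(embedding(src))     # enc_outputs: (1, src_len, hidden_dim)

# Basic seq2seq: the per-step encoder states are thrown away; the decoder
# is initialised only from the final states (h_n, c_n) -- a single
# fixed-length summary of the whole sentence, however long it is.
context = h_n                                         # (1, 1, hidden_dim)

However long the source sentence grows, context stays the same size, which is why quality degrades on long inputs.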

To reduce the encoder's burden, the attention mechanism was introduced: the intermediate encoder output states are also made available to the decoder so that it can understand the context.
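
Continuing the sketch above (still an illustrative assumption, not the paper's code), the structural change is that the decoder now keeps enc_outputs, the full set of intermediate states, and at every decoding step builds a fresh context vector as a weighted sum of them:

# With attention, the decoder keeps ALL intermediate encoder states
# (enc_outputs), not only the final one. At each decoding step it forms
# a new context vector as a weighted sum of those states.
attn_weights = torch.full((1, src_len), 1.0 / src_len)        # placeholder: uniform weights for now
context = torch.bmm(attn_weights.unsqueeze(1), enc_outputs)   # (1, 1, hidden_dim)

How the weights themselves are computed is covered next.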

Attention Weights

Attention weights were introduced to help the decoder decide which parts of the input sentence it needs to pay more attention to at each decoding step.

By passing content-based scores through a softmax function, the decoder is able to figure out which word to focus on.
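
Here is one way to compute those weights, continuing the same sketch. For brevity it uses a simple dot-product score between the current decoder state and each encoder state; the original paper actually uses a small additive alignment network, but the softmax step is the same:

dec_state = h_n.transpose(0, 1)                               # (1, 1, hidden_dim): current decoder state
scores = torch.bmm(dec_state, enc_outputs.transpose(1, 2))    # (1, 1, src_len): similarity to each source word
attn_weights = torch.softmax(scores, dim=-1)                  # softmax -> weights that sum to 1
context = torch.bmm(attn_weights, enc_outputs)                # (1, 1, hidden_dim): weighted sum of encoder states

# The source position with the highest weight is the word the decoder
# "pays attention to" while producing the current output word.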
