GPT-3.5 is built on the GPT (Generative Pre-trained Transformer) architecture, a transformer-based model for natural language processing tasks. Transformers revolutionized the field by introducing attention mechanisms, enabling the model to process information in parallel and capture intricate contextual relationships within vast amounts of data. Transformers play a central role in GPT-3.5's ability to comprehend, generate, and interact with text in a remarkably human-like manner.
Let's look at the transformer architecture.
Encoder:
- The input sentence is converted into vectors (embeddings), forming a numerical matrix.
- Positional encoding adds information about each word's position relative to the other words in the sentence.
- In multi-head attention, the queries, keys, and values are multiplied by learned weight matrices, and a softmax function is applied to the query-key scores. The softmax normalizes the scores to the 0-1 range; these scores then weight the values to produce the final output, amplifying the words most relevant to the input. In self-attention, the queries, keys, and values are all derived from the input embeddings via learned projections. This describes single-head attention; in reality Q, K, and V are split into smaller chunks and processed in parallel, which makes it multi-head.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ)V
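The attention formula can be sketched directly in NumPy. This is a minimal single-head version with toy shapes; the matrices here are hypothetical random embeddings, not values from a trained model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                              # weighted sum of the values

# Toy example: 3 tokens, d_k = 4 (hypothetical shapes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (3, 4)
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

In a multi-head layer, Q, K, and V would be split along the feature dimension, this function applied to each chunk in parallel, and the results concatenated.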
- The next step is a feed-forward network followed by normalization. The feed-forward network applies two linear transformations with a ReLU activation in between:
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
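The feed-forward equation above can be sketched as follows. The sizes here are illustrative (the original paper uses a model dimension of 512 and an inner dimension of 2048), and the weights are random placeholders:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2: two linear maps with a ReLU between."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Hypothetical sizes: model dim 4, hidden dim 8
rng = np.random.default_rng(1)
x = rng.normal(size=(3, 4))                     # 3 token embeddings
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # expand
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)   # project back
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (3, 4): same shape as the input, applied position-wise
```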
- This process produces an intermediate output (the context) that is passed to the decoder.
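The positional-encoding step mentioned above can be sketched using the sinusoidal form from the original paper; the sequence length and model dimension chosen here are arbitrary:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even feature indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims get sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims get cosine
    return pe

pe = positional_encoding(seq_len=5, d_model=8)
# The encoder simply adds this matrix to the word embeddings.
print(pe.shape)  # (5, 8)
```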
Decoder:
- The decoder applies positional encoding to its own output from the previous time step (t−1) and then applies masked multi-head attention (the same as in the encoder, except that future positions are masked out).
- The decoder then applies the same transformations as the encoder (multi-head attention, feed-forward, and normalization, but with different weights and biases) to the encoder's output, which supplies the keys and values, and to the output of the previous decoder step, which supplies the queries.
- The linear layer transforms the output embeddings into scores over the vocabulary (the inverse of converting text to embeddings).
- A softmax converts these scores into probabilities, and the word with the highest probability (the most relevant) is selected.
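The masking and word-selection steps above can be sketched as follows. The causal mask blocks attention to future tokens; the tiny vocabulary and scores at the end are hypothetical:

```python
import numpy as np

def masked_attention_weights(scores):
    """Causal mask: position t may only attend to positions <= t.
    Masked entries are set to -inf before the softmax, so they get weight 0."""
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
w = masked_attention_weights(rng.normal(size=(4, 4)))
print(np.triu(w, k=1).max())  # 0.0: no attention to future tokens

# Final step: the linear layer gives one score per vocabulary word;
# softmax + argmax picks the most likely next word (greedy decoding).
vocab = ["the", "cat", "sat"]               # hypothetical vocabulary
logits = np.array([0.1, 2.0, -1.0])         # hypothetical scores
probs = np.exp(logits) / np.exp(logits).sum()
print(vocab[int(np.argmax(probs))])         # "cat"
```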
References: Vaswani et al., "Attention Is All You Need", https://research.google/pubs/attention-is-all-you-need/