{"id":4817,"date":"2025-09-24T10:48:45","date_gmt":"2025-09-24T10:48:45","guid":{"rendered":"https:\/\/palmer-consulting.com\/transformer-definition\/"},"modified":"2025-09-24T10:48:45","modified_gmt":"2025-09-24T10:48:45","slug":"transformer-definition","status":"publish","type":"post","link":"https:\/\/palmer-consulting.com\/en\/transformer-definition\/","title":{"rendered":"Transformer definition"},"content":{"rendered":"<h1 data-start=\"223\" data-end=\"298\">Introduction: what is a Transformer in artificial intelligence?<\/h1>\n<p data-start=\"300\" data-end=\"596\"><strong data-start=\"303\" data-end=\"318\">Transformer<\/strong> is the architecture that has shaken up artificial intelligence since 2017. Conceived by Google researchers in the seminal article <em data-start=\"456\" data-end=\"485\">&#8220;Attention is All You Need&#8221;,<\/em> it paved the way for <strong data-start=\"511\" data-end=\"556\">large language models (LLMs)<\/strong> like <strong data-start=\"563\" data-end=\"593\">GPT, BERT, LLaMA or Gemini<\/strong>. <\/p>\n<p data-start=\"598\" data-end=\"1008\">Why such an impact? Because Transformer has replaced conventional sequential approaches (RNN, LSTM) with an<strong data-start=\"729\" data-end=\"742\">attention<\/strong> mechanism capable of processing words <strong data-start=\"771\" data-end=\"787\">in parallel<\/strong> and understanding their relationships <strong data-start=\"821\" data-end=\"859\">in the overall context of a text<\/strong>. The result: faster, more accurate and infinitely more powerful models for <strong data-start=\"944\" data-end=\"1005\">translating, generating, summarizing or analyzing natural language<\/strong>.  <\/p>\n<p data-start=\"1010\" data-end=\"1269\">Today, the Transformer is no longer just an academic innovation: it&#8217;s <strong data-start=\"1093\" data-end=\"1121\">the bedrock of modern AI<\/strong>, used in machine translation, virtual assistants, document analysis, code generation and even computer vision.<\/p>\n<p data-start=\"1271\" data-end=\"1536\">\ud83d\udc49 In this article, we&#8217;ll <strong data-start=\"1304\" data-end=\"1351\">clearly define what a Transformer is<\/strong>, understand how it works, its benefits and limitations, and explore its concrete applications in sectors such as <strong data-start=\"1478\" data-end=\"1533\">banking, insurance, healthcare and customer relations<\/strong>.<\/p>\n<h2 data-start=\"234\" data-end=\"272\"><\/h2>\n<h2 data-start=\"234\" data-end=\"272\">1. Background, motivation and origins<\/h2>\n<p data-start=\"274\" data-end=\"588\">Before the Transformer architecture, the dominant architectures for processing sequential data (such as sentences) were <strong data-start=\"408\" data-end=\"415\">RNNs<\/strong> (recurrent neural networks), LSTMs (Long Short-Term Memory) or GRUs, sometimes combined with attention mechanisms. 
These architectures had several limitations:<\/p>
<ul>
<li>
<p>they processed sequences <strong>iteratively<\/strong> (word by word), making parallelization difficult,<\/p>
<\/li>
<li>
<p>long-distance dependencies could fade (the vanishing gradient problem) or be poorly modeled,<\/p>
<\/li>
<li>
<p>convolution-based architectures (used in some variants) had difficulty capturing global relationships across the sequence effectively.<\/p>
<\/li>
<\/ul>
<p>The seminal article <strong>&#8220;Attention Is All You Need&#8221;<\/strong> (Vaswani et al., 2017) proposes an architecture based entirely on attention, without recourse to recurrent or convolutional mechanisms (<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">arXiv:1706.03762<\/a>). The authors show that this model is not only easier to parallelize, but also achieves better performance on machine translation tasks.<\/p>
<p>Since its publication, the Transformer architecture has become the cornerstone of many modern language processing models: BERT, GPT, T5, etc.
<span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/en.wikipedia.org\/wiki\/Attention_Is_All_You_Need?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">Wikipedia<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">Wikipedia<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<hr data-start=\"1626\" data-end=\"1629\">\n<h2 data-start=\"1631\" data-end=\"1681\">2. Global overview: encoder \/ decoder<\/h2>\n<p data-start=\"1683\" data-end=\"1871\">The basic architecture of the Transformer follows an <strong data-start=\"1736\" data-end=\"1757\">encoder-decoder<\/strong> scheme, as in many classic seq2seq (sequence-to-sequence) models. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! 
dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">Jay Alammar<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<h3 data-start=\"1873\" data-end=\"1889\">2.1 Encoder<\/h3>\n<ul data-start=\"1891\" data-end=\"2508\">\n<li data-start=\"1891\" data-end=\"1978\">\n<p data-start=\"1893\" data-end=\"1978\">The encoder takes a sequence as input (for example, a sentence in the source language).<\/p>\n<\/li>\n<li data-start=\"1979\" data-end=\"2148\">\n<p data-start=\"1981\" data-end=\"2148\">It consists of a <strong data-start=\"2004\" data-end=\"2012\">stack of<\/strong> <span class=\"katex\"><span class=\"katex-mathml\">NN<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">N<\/span><\/span><\/span><\/span> identical layers (often <span class=\"katex\"><span class=\"katex-mathml\">N=6N = 6<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">N<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord\">6<\/span><\/span><\/span><\/span> in the &#8220;base&#8221; version of the original article).  <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! 
dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<\/li>\n<li data-start=\"2149\" data-end=\"2338\">\n<p data-start=\"2151\" data-end=\"2218\">Each encoder layer consists of two main sub-blocks:<\/p>\n<ol data-start=\"2224\" data-end=\"2338\">\n<li data-start=\"2224\" data-end=\"2260\">\n<p data-start=\"2227\" data-end=\"2260\"><strong data-start=\"2227\" data-end=\"2258\">Self-Attention (multi-head)<\/strong><\/p>\n<\/li>\n<li data-start=\"2263\" data-end=\"2338\">\n<p data-start=\"2266\" data-end=\"2338\"><strong data-start=\"2266\" data-end=\"2294\">Positional feed-forward<\/strong> (feed-forward applied to each position)<\/p>\n<\/li>\n<\/ol>\n<\/li>\n<li data-start=\"2340\" data-end=\"2508\">\n<p data-start=\"2342\" data-end=\"2508\">Each sub-block is surrounded by a residual connection + layer normalization. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<\/li>\n<\/ul>\n<p data-start=\"2510\" data-end=\"2629\">At the encoder&#8217;s output, contextualized vector representations are obtained for each of the input tokens.<\/p>\n<h3 data-start=\"2631\" data-end=\"2647\">2.2 Decoder<\/h3>\n<ul data-start=\"2649\" data-end=\"3362\">\n<li data-start=\"2649\" data-end=\"2755\">\n<p data-start=\"2651\" data-end=\"2755\">The decoder generates the output (e.g. a translated sentence) autoregressively (word by word).<\/p>\n<\/li>\n<li data-start=\"2756\" data-end=\"2860\">\n<p data-start=\"2758\" data-end=\"2860\">It also includes a stack of <span class=\"katex\"><span class=\"katex-mathml\">NN<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">N<\/span><\/span><\/span><\/span> identical layers. 
<span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<\/li>\n<li data-start=\"2861\" data-end=\"3230\">\n<p data-start=\"2863\" data-end=\"2916\">Each decoder layer contains three sub-blocks:<\/p>\n<ol data-start=\"2920\" data-end=\"3230\">\n<li data-start=\"2920\" data-end=\"3062\">\n<p data-start=\"2923\" data-end=\"3062\"><strong data-start=\"2923\" data-end=\"2949\">Masked self-attention<\/strong>: so that the position <span class=\"katex\"><span class=\"katex-mathml\">ii<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">i<\/span><\/span><\/span><\/span> cannot &#8220;watch&#8221; future positions &gt; <span class=\"katex\"><span class=\"katex-mathml\">ii<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">i<\/span><\/span><\/span><\/span>.<\/p>\n<\/li>\n<li data-start=\"3065\" data-end=\"3194\">\n<p data-start=\"3068\" data-end=\"3194\"><strong data-start=\"3068\" data-end=\"3099\">Encoder-decoder cross-attention<\/strong>: each decoder position can be attached to encoder positions.<\/p>\n<\/li>\n<li data-start=\"3197\" data-end=\"3230\">\n<p data-start=\"3200\" data-end=\"3230\"><strong data-start=\"3200\" data-end=\"3228\">Positional feed-forward<\/strong><\/p>\n<\/li>\n<\/ol>\n<\/li>\n<li data-start=\"3232\" data-end=\"3362\">\n<p data-start=\"3234\" data-end=\"3362\">Here too, each sub-block is equipped with residual connections + layer normalization. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! 
dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<\/li>\n<\/ul>\n<p data-start=\"3364\" data-end=\"3577\">Finally, after the last decoder layer, we apply a linear + softmax layer to obtain the probability distribution on the vocabulary for the next token. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<p data-start=\"3579\" data-end=\"3751\">This architecture is represented visually in numerous tutorials, for example on <em data-start=\"3679\" data-end=\"3708\">The Illustrated Transformer<\/em> website. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/jalammar.github.io\/illustrated-transformer\/?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between overflow-hidden\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">Jay Alammar<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<hr data-start=\"3753\" data-end=\"3756\">\n<h2 data-start=\"3758\" data-end=\"3815\">3. 
<hr>
<h2>3. Key components and how they work mathematically<\/h2>
<p>To fully understand the Transformer, it&#8217;s essential to look at its fundamental mechanisms: <strong>attention (and self-attention)<\/strong>, <strong>multi-head attention<\/strong>, <strong>positional encoding<\/strong>, <strong>feed-forward networks<\/strong>, and <strong>residual connections + normalization<\/strong>.<\/p>
<h3>3.1 Attention (scaled dot-product) and self-attention<\/h3>
<p>The heart of the Transformer is the attention mechanism. The general idea: for each position (token) in a sequence, we compute a weighting (attention) over the other positions in order to aggregate useful information.<\/p>
<p>The formula for scaled dot-product attention is:<\/p>
<p>\\[ \\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{Q K^\\top}{\\sqrt{d_k}}\\right) V \\]<\/p>
<p>where<\/p>
<ul>
<li>
<p><em>Q<\/em> is the matrix of <em>queries<\/em>,<\/p>
<\/li>
class=\"mord mathnormal\">K<\/span><\/span><\/span><\/span> = matrix of <em data-start=\"4647\" data-end=\"4653\">keys<\/em>,<\/p>\n<\/li>\n<li data-start=\"4664\" data-end=\"4707\">\n<p data-start=\"4666\" data-end=\"4707\"><span class=\"katex\"><span class=\"katex-mathml\">VV<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">V<\/span><\/span><\/span><\/span> = matrix of <em data-start=\"4686\" data-end=\"4694\">values<\/em>,<\/p>\n<\/li>\n<li data-start=\"4708\" data-end=\"4809\">\n<p data-start=\"4710\" data-end=\"4809\"><span class=\"katex\"><span class=\"katex-mathml\">dkd_k<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">d<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">k<\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span> = dimension of key vectors (scaling factor). <\/span> <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">DeanHub | Code Is Life<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">Wikipedia<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<\/li>\n<\/ul>\n<p data-start=\"4811\" data-end=\"5038\">The term  <span class=\"katex\"><span class=\"katex-mathml\">1dk\\frac{1}{\\sqrt{d_k}}<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mfrac\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord sqrt mtight\"><span class=\"svg-align\"><span class=\"mord mathnormal mtight\">d<\/span><span class=\"msupsub\"><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mathnormal mtight\">k<\/span><\/span><span class=\"vlist-s\"><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">1<\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>  is a scaling factor designed to prevent scalar products from becoming too large, which would make the softmax too 
<p>The term 1 \/ &#8730;<em>d<sub>k<\/sub><\/em> is a scaling factor that prevents the dot products from becoming too large, which would make the softmax too &#8220;sharp&#8221; and training less stable.<\/p>
<p>In <strong>self-attention<\/strong>, <em>Q<\/em>, <em>K<\/em> and <em>V<\/em> are all derived from the same sequence (the input sequence in the encoder, the output generated so far in the decoder), allowing each position to &#8220;pay attention&#8221; to the other positions in that sequence.<\/p>
<p>In this way, each position receives a representation enriched by its global context.<\/p>
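<p>As an illustration, here is a minimal NumPy sketch of scaled dot-product self-attention. The dimensions and random projection matrices are toy values chosen for the example, not taken from the article.<\/p>
<pre><code class=\"language-python\">
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise similarities, shape (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mix of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))             # 5 tokens, d_model = 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)  # self-attention: Q, K and V all come from X
print(out.shape)                         # (5, 16)
<\/code><\/pre>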
<span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span> <\/p>\n<p data-start=\"5694\" data-end=\"5723\">The process is as follows:<\/p>\n<ol data-start=\"5725\" data-end=\"6110\">\n<li data-start=\"5725\" data-end=\"5887\">\n<p data-start=\"5728\" data-end=\"5887\">For each head <span class=\"katex\"><span class=\"katex-mathml\">ii<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">i<\/span><\/span><\/span><\/span>we project <span class=\"katex\"><span class=\"katex-mathml\">Q,K,VQ, K, V<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">Q<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">K<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">V<\/span><\/span><\/span><\/span> via head-specific linear matrices :  <span class=\"katex\"><span class=\"katex-mathml\">Qi=QWiQQ_i = Q W^{Q}_i<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">Q<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><span class=\"mrel\">=<\/span><\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">Q<\/span><span class=\"mord\"><span class=\"mord mathnormal\">W<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">Q<\/span><\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>,  <span class=\"katex\"><span class=\"katex-mathml\">Ki=KWiKK_i = K W^{K}_i<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">K<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><span class=\"mrel\">=<\/span><\/span><\/span><span class=\"base\"><span 
class=\"mord mathnormal\">K<\/span><span class=\"mord\"><span class=\"mord mathnormal\">W<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">K<\/span><\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>,  <span class=\"katex\"><span class=\"katex-mathml\">Vi=VWiVV_i = V W^{V}_i<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">V<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><span class=\"mrel\">=<\/span><\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">V<\/span><span class=\"mord\"><span class=\"mord mathnormal\">W<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">V<\/span><\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>.<\/p>\n<\/li>\n<li data-start=\"5888\" data-end=\"5970\">\n<p data-start=\"5891\" data-end=\"5970\">We calculate the attention on each head:  <span class=\"katex\"><span class=\"katex-mathml\">Attention(Qi,Ki,Vi)\\text{Attention}(Q_i, K_i, V_i)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord text\"><span class=\"mord\">Attention<\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">Q<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">K<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">V<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>.<\/p>\n<\/li>\n<li data-start=\"5971\" data-end=\"6110\">\n<p data-start=\"5974\" data-end=\"6110\">We concatenate the outputs of all heads, then apply a final linear projection. 
<span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<\/li>\n<\/ol>\n<p data-start=\"6112\" data-end=\"6287\">This approach increases the <strong data-start=\"6149\" data-end=\"6174\">expressiveness of<\/strong> the model, capturing various aspects of the relationship between positions. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/dev.to\/param_ahuja\/how-positional-encoding-multi-head-attention-powers-transformers-588j?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">DEV Community<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">Towards AI<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<h3 data-start=\"6289\" data-end=\"6339\">3.3 Positional encoding<\/h3>\n<p data-start=\"6341\" data-end=\"6605\">Because the Transformer doesn&#8217;t process the sequence in an ordered way (no recursive loop), we need to add <strong data-start=\"6472\" data-end=\"6484\">positional<\/strong> information to the token embeddings so that the model knows &#8220;where&#8221; each token is located. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! 
dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">xmarva.github.io<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<p data-start=\"6607\" data-end=\"6658\">The original article proposes sinusoidal encoding:<\/p>\n<p><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">PE(pos,2i)=sin(pos100002i\/dmodel)PE(pos,2i+1)=cos(pos100002i\/dmodel)\\begin{aligned} PE_{(pos, 2i)} &amp;= \\sin\\left(\\frac{pos}{10000^{2i \/ d_{\\text{model}}}}\\right) \\\\ PE_{(pos, 2i+1)} &amp;= \\cos\\left(\\frac{pos}{10000^{2i \/ d_{\\text{model}}}}\\right) \\end{aligned}<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mtable\"><span class=\"col-align-r\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"mord mathnormal\">P<\/span><span class=\"mord mathnormal\">E<\/span><span class=\"msupsub\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">pos<\/span><span class=\"mpunct mtight\">,<\/span><span class=\"mord mathnormal mtight\">2i<\/span><span class=\"mclose mtight\">)<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"mord mathnormal\">P<\/span><span class=\"mord mathnormal\">E<\/span><span class=\"msupsub\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">pos<\/span><span class=\"mpunct mtight\">,<\/span><span class=\"mbin mtight\">2i+1<\/span><span class=\"mclose mtight\">)<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><span class=\"col-align-l\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"mrel\">=<\/span><span class=\"mop\">sin<\/span><span class=\"minner\"><span class=\"mopen delimcenter\"><span class=\"delimsizing size2\">(<\/span><\/span><span class=\"mfrac\">10000<span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">2i\/d<\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord text mtight\">model<\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><\/span><span class=\"mord mathnormal\">p<\/span><span class=\"mord mathnormal\">bones<\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"mclose delimcenter\"><span class=\"delimsizing size2\">)<\/span><\/span><\/span><span 
class=\"mrel\">=<\/span><span class=\"mop\">cos<\/span><span class=\"minner\"><span class=\"mopen delimcenter\"><span class=\"delimsizing size2\">(<\/span><\/span><span class=\"mfrac\">10000<span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">2i\/d<\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord text mtight\">model<\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><\/span><span class=\"mord mathnormal\">p<\/span><span class=\"mord mathnormal\">bones<\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"mclose delimcenter\"><span class=\"delimsizing size2\">)<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p data-start=\"6856\" data-end=\"7041\">where  <span class=\"katex\"><span class=\"katex-mathml\">pospos<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">p<\/span><span class=\"mord mathnormal\">os<\/span><\/span><\/span><\/span>  is the position in the sequence,  <span class=\"katex\"><span class=\"katex-mathml\">ii<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">i<\/span><\/span><\/span><\/span>  is the index in the encoding vector, and  <span class=\"katex\"><span class=\"katex-mathml\">dmodeld_{\\text{model}}<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">d<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\">model<\/span><\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>  is the dimension of the model.  <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">DeanHub | Code Is Life<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">xmarva.github.io<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<p data-start=\"7043\" data-end=\"7210\">These position vectors are added (summed) to the word embeddings before entering the first encoder\/decoder block. 
<span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">xmarva.github.io<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<p data-start=\"7212\" data-end=\"7486\">An important reason for using sin\/cos functions is to allow the model to generalize to sequence lengths greater than those encountered during training, as sinusoidal patterns are extrapolable. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+1<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<p data-start=\"7488\" data-end=\"7677\">Other positional encoding variants (learnable, rotatable, etc.) have been proposed in later work, but the principle remains the same. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! 
dark:bg-[#303030]!\" href=\"https:\/\/deepwiki.com\/stanford-cs336\/assignment1-basics\/2-transformer-architecture?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">DeepWiki<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+1<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<h3 data-start=\"7679\" data-end=\"7718\">3.4 Positional feed-forward network<\/h3>\n<p data-start=\"7720\" data-end=\"7893\">After each attention block (or cross-attention), each position passes through a small <strong data-start=\"7825\" data-end=\"7839\">individual<\/strong> feed-forward network, applied independently to each position:<\/p>\n<p><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">FFN(x)=max(0,xW1+b1) W2+b2\\text{FFN}(x) = \\max(0, x W_1 + b_1)\\, W_2 + b_2<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord text\"><span class=\"mord\">FFN<\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mop\">max<\/span><span class=\"mopen\">(<\/span><span class=\"mord\">0<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mord\"><span class=\"mord mathnormal\">W<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">1<\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><span class=\"mbin\">+<\/span><\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">b<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">1<\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mord\"><span class=\"mord mathnormal\">W<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2<\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><span class=\"mbin\">+<\/span><\/span><\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">b<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2<\/span><\/span><\/span><span class=\"vlist-s\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p data-start=\"7951\" data-end=\"8131\">This is a small, dense layer (often two linear layers with non-linear activation, often ReLU) applied at each position. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! 
dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<p data-start=\"8133\" data-end=\"8236\">This operation adds local (per position) non-linear capacity to the representations.<\/p>\n<h3 data-start=\"8238\" data-end=\"8309\">3.5 Residual + layer norm connections<\/h3>\n<p data-start=\"8311\" data-end=\"8601\">To facilitate deep network training, each sub-block (attention, feed-forward) uses a <strong data-start=\"8417\" data-end=\"8441\">residual connection<\/strong>: the input of the sub-block is added to its output. Then <strong data-start=\"8509\" data-end=\"8536\">layer normalization<\/strong> is applied. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span> <\/p>\n<p data-start=\"8603\" data-end=\"8620\">Schematically :<\/p>\n<p><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">LayerNorm(x+SousBloc(x))\\text{LayerNorm}(x + \\text{SousBloc}(x))<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord text\"><span class=\"mord\">LayerNorm<\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mbin\">+<\/span><\/span><span class=\"base\"><span class=\"mord text\"><span class=\"mord\">SubBlock<\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">))<\/span><\/span><\/span><\/span><\/span><\/p>\n<p data-start=\"8670\" data-end=\"8822\">This technique stabilizes training, improves gradient propagation, and avoids performance degradation when the model becomes deep.<\/p>\n<hr data-start=\"8824\" data-end=\"8827\">\n<h2 data-start=\"8829\" data-end=\"8874\">4. 
<hr>
<h2>4. Training and inference processes<\/h2>
<h3>4.1 Training<\/h3>
<ul>
<li><p>The model is trained on input sequence\/target sequence pairs (e.g. source and target sentences in a translation task).<\/p><\/li>
<li><p>A <strong>masking technique<\/strong> is applied in the decoder to prevent the model from &#8220;seeing&#8221; future tokens: when generating token <em>i<\/em>, all positions &gt; <em>i<\/em> are masked (<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"noopener\">arXiv<\/a>); see the sketch below.<\/p><\/li>
<li><p>Techniques such as learning-rate <strong>warm-up<\/strong>, <strong>dropout<\/strong> and <strong>label smoothing<\/strong> are used to regularize and stabilize training (<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"noopener\">arXiv<\/a>).<\/p><\/li>
<li><p>The loss function is generally the <strong>cross-entropy<\/strong> between the predicted distribution (softmax) and the true distribution (target token).<\/p><\/li>
<\/ul>
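<p>A minimal sketch of this causal (look-ahead) mask, assuming a square matrix of attention scores of shape (seq_len, seq_len); <code>causal_mask<\/code> and <code>masked_softmax<\/code> are illustrative helpers, and the scores are random stand-ins:<\/p>
<pre><code>import numpy as np

def causal_mask(seq_len):
    # mask[i, j] is True when position j lies in the future of position i
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_softmax(scores, mask):
    # Masked positions get -inf before the softmax, hence zero attention weight
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 4))
print(np.round(masked_softmax(scores, causal_mask(4)), 2))
# Row i spreads its attention only over positions 0..i
<\/code><\/pre>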
dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<\/li>\n<li data-start=\"9506\" data-end=\"9667\">\n<p data-start=\"9508\" data-end=\"9667\">The loss function is generally the <strong data-start=\"9549\" data-end=\"9585\">cross-entropy<\/strong> between the predicted distribution (softmax) and the true distribution (target token).<\/p>\n<\/li>\n<\/ul>\n<p data-start=\"9669\" data-end=\"9799\">In the original article, one of the basic models was trained in 3.5 days on 8 GPUs.  <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between overflow-hidden\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<h3 data-start=\"9801\" data-end=\"9831\">4.2 Inference \/ generation<\/h3>\n<ul data-start=\"9833\" data-end=\"10222\">\n<li data-start=\"9833\" data-end=\"9990\">\n<p data-start=\"9835\" data-end=\"9990\">In generation mode, the model decodes autoregressively: one token is generated at a time, using previously generated tokens as partial input.<\/p>\n<\/li>\n<li data-start=\"9991\" data-end=\"10059\">\n<p data-start=\"9993\" data-end=\"10059\">Masking is applied to prevent &#8220;looking ahead&#8221;.<\/p>\n<\/li>\n<li data-start=\"10060\" data-end=\"10222\">\n<p data-start=\"10062\" data-end=\"10222\">Various strategies can be used to select the next token: <strong data-start=\"10134\" data-end=\"10144\">greedy<\/strong>, <strong data-start=\"10146\" data-end=\"10161\">beam search<\/strong>, <strong data-start=\"10163\" data-end=\"10175\">sampling<\/strong>, <strong data-start=\"10177\" data-end=\"10186\">top-k<\/strong>, <strong data-start=\"10188\" data-end=\"10216\">top-p (nucleus sampling)<\/strong>, etc.<\/p>\n<\/li>\n<\/ul>\n<hr data-start=\"10224\" data-end=\"10227\">\n<h2 data-start=\"10229\" data-end=\"10264\">5. Strengths, limits and trends<\/h2>\n<h3 data-start=\"10266\" data-end=\"10292\">5.1 Strengths and benefits<\/h3>\n<ol data-start=\"10294\" data-end=\"11172\">\n<li data-start=\"10294\" data-end=\"10529\">\n<p data-start=\"10297\" data-end=\"10529\"><strong data-start=\"10297\" data-end=\"10316\">Parallelization<\/strong>: as the Transformer has no strict sequential dependencies, the entire sequence can be processed in parallel &#8211; greatly speeding up training. 
<span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/1706.03762?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">Wikipedia<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+3<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<\/li>\n<li data-start=\"10530\" data-end=\"10761\">\n<p data-start=\"10533\" data-end=\"10761\"><strong data-start=\"10533\" data-end=\"10566\">Long-range modeling<\/strong>: each position can directly &#8220;see&#8221; the whole sequence via attention, making it easier to capture distant dependencies. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! 
dark:bg-[#303030]!\" href=\"https:\/\/neuron-ai.at\/attention-is-all-you-need\/?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">DeanHub | Code Is Life<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<\/li>\n<li data-start=\"10762\" data-end=\"10921\">\n<p data-start=\"10765\" data-end=\"10921\"><strong data-start=\"10765\" data-end=\"10780\">Flexibility<\/strong>: the architecture is generic enough to be adapted to a variety of tasks (translation, text generation, comprehension, etc.).<\/p>\n<\/li>\n<li data-start=\"10922\" data-end=\"11052\">\n<p data-start=\"10925\" data-end=\"11052\"><strong data-start=\"10925\" data-end=\"10940\">Scalability<\/strong>: modern versions can scale up to very large models (LLMs) with billions of parameters.<\/p>\n<\/li>\n<li data-start=\"11053\" data-end=\"11172\">\n<p data-start=\"11056\" data-end=\"11172\"><strong data-start=\"11056\" data-end=\"11082\">Peak performance<\/strong>: many Transformer-based models dominate today&#8217;s NLP benchmarks.<\/p>\n<\/li>\n<\/ol>\n<h3 data-start=\"11174\" data-end=\"11197\">5.2 Limits \/ challenges<\/h3>\n<ul data-start=\"11199\" data-end=\"12037\">\n<li data-start=\"11199\" data-end=\"11376\">\n<p data-start=\"11201\" data-end=\"11376\">The cost in <strong data-start=\"11212\" data-end=\"11233\">memory and calculation<\/strong> can become very high for long sequences (quadratic complexity in the length of the sequence, due to calculations <span class=\"katex\"><span class=\"katex-mathml\">QK\u22a4QK^\\top<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">Q<\/span><span class=\"mord\"><span class=\"mord mathnormal\">K<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u22a4<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>).<\/p>\n<\/li>\n<li data-start=\"11377\" data-end=\"11722\">\n<p data-start=\"11379\" data-end=\"11722\">Some &#8220;pure attention&#8221; versions may suffer from <strong data-start=\"11442\" data-end=\"11468\">rank degeneracy<\/strong> (their outputs may converge to low-rank matrices) if bypass mechanisms (residuals, MLPs) are not added. An article entitled <em data-start=\"11625\" data-end=\"11658\">&#8220;Attention is Not All You Need&#8221;<\/em> explores this phenomenon. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! 
dark:bg-[#303030]!\" href=\"https:\/\/arxiv.org\/abs\/2103.03404?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between overflow-hidden\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">arXiv<\/span><\/span><\/span><\/a><\/span><\/span> <\/p>\n<\/li>\n<li data-start=\"11723\" data-end=\"11848\">\n<p data-start=\"11725\" data-end=\"11848\">The need for massive data and computing power is often very high to achieve peak performance.<\/p>\n<\/li>\n<li data-start=\"11849\" data-end=\"12037\">\n<p data-start=\"11851\" data-end=\"12037\">Masking or full global attention is not always ideal for very long sequences: variants (restricted, hierarchical, compressed attention) have been proposed.<\/p>\n<\/li>\n<\/ul>\n<h3 data-start=\"12039\" data-end=\"12070\">5.3 Evolutions and variants<\/h3>\n<p data-start=\"12072\" data-end=\"12154\">Since the original Transformer, many extensions and variants have emerged:<\/p>\n<ul data-start=\"12156\" data-end=\"12724\">\n<li data-start=\"12156\" data-end=\"12325\">\n<p data-start=\"12158\" data-end=\"12325\"><strong data-start=\"12158\" data-end=\"12176\">Transformer-XL<\/strong>, <strong data-start=\"12178\" data-end=\"12190\">Reformer<\/strong>, <strong data-start=\"12192\" data-end=\"12206\">Longformer<\/strong>, <strong data-start=\"12208\" data-end=\"12219\">BigBird<\/strong>: models adapted to manage very long sequences with restricted or efficient attentions.<\/p>\n<\/li>\n<li data-start=\"12326\" data-end=\"12431\">\n<p data-start=\"12328\" data-end=\"12431\"><strong data-start=\"12328\" data-end=\"12366\">Alternative positional encodings<\/strong> (learnable, rotary, relative) for enhanced flexibility.<\/p>\n<\/li>\n<li data-start=\"12432\" data-end=\"12607\">\n<p data-start=\"12434\" data-end=\"12607\"><strong data-start=\"12434\" data-end=\"12465\">Encoder-only models<\/strong> (like BERT) or decoder-only models (like GPT): the architecture can be simplified to suit the task in hand. <span class=\"\" data-state=\"closed\"><span class=\"ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]\" data-testid=\"webpage-citation-pill\"><a class=\"flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! 
dark:bg-[#303030]!\" href=\"https:\/\/kwanlung.github.io\/posts\/attentionisallyouneed\/?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener\"><span class=\"relative start-0 bottom-0 flex h-full w-full items-center\"><span class=\"flex h-4 w-full items-center justify-between\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">DeanHub | Code Is Life<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><span class=\"flex h-4 w-full items-center justify-between absolute\"><span class=\"max-w-[15ch] grow truncate overflow-hidden text-center\">neuron.ai<\/span><span class=\"-me-1 flex h-full items-center rounded-full px-1 text-[#8F8F8F]\">+2<\/span><\/span><\/span><\/a><\/span><\/span><\/p>\n<\/li>\n<li data-start=\"12608\" data-end=\"12724\">\n<p data-start=\"12610\" data-end=\"12724\"><strong data-start=\"12610\" data-end=\"12638\">Multimodal transformers<\/strong>: applied not only to text, but also to images, audio, graphs, etc.<\/p>\n<\/li>\n<\/ul>\n<hr data-start=\"12726\" data-end=\"12729\">\n<h2 data-start=\"12731\" data-end=\"12787\">6. Example of data flow (with a small example)<\/h2>\n<p data-start=\"12789\" data-end=\"12859\">To illustrate, here is a simplified flow for a translation task:<\/p>\n<ol data-start=\"12861\" data-end=\"13585\">\n<li data-start=\"12861\" data-end=\"12910\">\n<p data-start=\"12864\" data-end=\"12910\">We take a source sentence: &#8220;The cat is sleeping.&#8221;<\/p>\n<\/li>\n<li data-start=\"12911\" data-end=\"13015\">\n<p data-start=\"12914\" data-end=\"13015\">On tokenize (&#8220;Le&#8221;, &#8220;chat&#8221;, &#8220;dort&#8221;, &#8220;.&#8221;) \u2192 embeddings + positional encoding \u2192 encoder input.<\/p>\n<\/li>\n<li data-start=\"13016\" data-end=\"13158\">\n<p data-start=\"13019\" data-end=\"13158\">The encoder processes this sequence via its attention + feed-forward layers: enriched representations are obtained for each token.<\/p>\n<\/li>\n<li data-start=\"13159\" data-end=\"13585\">\n<p data-start=\"13162\" data-end=\"13257\">The decoder starts with a starting token <code data-start=\"13207\" data-end=\"13212\">&lt;s&gt;<\/code> and generates a new token at each step.<\/p>\n<ul data-start=\"13261\" data-end=\"13585\">\n<li data-start=\"13261\" data-end=\"13332\">\n<p data-start=\"13263\" data-end=\"13332\">It performs masked self-attention on tokens already generated.<\/p>\n<\/li>\n<li data-start=\"13336\" data-end=\"13402\">\n<p data-start=\"13338\" data-end=\"13402\">It cross-checks the encoder output.<\/p>\n<\/li>\n<li data-start=\"13406\" data-end=\"13439\">\n<p data-start=\"13408\" data-end=\"13439\">It is a feed-forward process.<\/p>\n<\/li>\n<li data-start=\"13443\" data-end=\"13537\">\n<p data-start=\"13445\" data-end=\"13537\">Produces a distribution on vocabulary \u2192 we choose the following token (e.g. 
<hr>
<h2>Conclusion<\/h2>
<p>The <strong>Transformer<\/strong> is not just a technical building block: it is <strong>the foundation of modern AI<\/strong>. By replacing sequential processing with <strong>attention<\/strong> and <strong>parallel computation<\/strong>, it has made possible the major advances in <strong>LLMs<\/strong> (comprehension, generation, translation, assisted reasoning) and their deployment at enterprise scale.<br>Its strengths &#8211; <strong>scalability<\/strong>, <strong>performance over long contexts<\/strong>, <strong>adaptability across modalities<\/strong> &#8211; make it the reference architecture for critical use cases: <strong>document analysis<\/strong>, <strong>business assistants<\/strong>, <strong>fraud detection<\/strong>, <strong>automation<\/strong>.<\/p>
<p>One imperative remains: to <strong>govern<\/strong> these models (data quality, <strong>RAG<\/strong> for accuracy, explainability, human oversight, security). Organizations that master the triptych of <em>model + data + governance<\/em> are already transforming their operations and customer experience.<\/p>
<p><strong>Remember:<\/strong> understanding the Transformer means having a common language for <strong>evaluating<\/strong>, <strong>integrating<\/strong> and <strong>industrializing<\/strong> AI. What&#8217;s the next step? Map your use cases, define the sources of truth (RAG) and launch a <strong>measurable pilot<\/strong> with clear quality and risk metrics.
To find out more :   <\/p>\n<ul data-start=\"1229\" data-end=\"1493\">\n<li data-start=\"1229\" data-end=\"1366\">\n<p data-start=\"1231\" data-end=\"1366\"><a href=\"https:\/\/palmer-consulting.com\/definition-rag-retrieval-augmented-generation\/\"><strong data-start=\"1237\" data-end=\"1244\">RAG<\/strong> (Retrieval-Augmented Generation) Guide at Palmer Consulting  <\/a><\/p>\n<\/li>\n<li data-start=\"1367\" data-end=\"1493\">\n<p data-start=\"1369\" data-end=\"1493\"><a href=\"https:\/\/palmer-consulting.com\/en\/llm-large-language-model-definition\/\"><strong data-start=\"1375\" data-end=\"1382\">LLM<\/strong> (Large Language Models) Guide at Palmer Consulting \u2192 palmer consulting &#8211; LLM<\/a><\/p>\n<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Introduction: what is a Transformer in artificial intelligence? Transformer is the architecture that has shaken up artificial intelligence since 2017. Conceived by Google researchers in the seminal article &#8220;Attention is All You Need&#8221;, it paved the way for large language models (LLMs) like GPT, BERT, LLaMA or Gemini. Why such an impact? Because Transformer has [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"footnotes":""},"categories":[78],"tags":[],"class_list":["post-4817","post","type-post","status-publish","format-standard","hentry","category-artificial-intelligence"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Transformer definition | Palmer<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/palmer-consulting.com\/en\/transformer-definition\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Transformer definition | Palmer\" \/>\n<meta property=\"og:description\" content=\"Introduction: what is a Transformer in artificial intelligence? Transformer is the architecture that has shaken up artificial intelligence since 2017. Conceived by Google researchers in the seminal article &#8220;Attention is All You Need&#8221;, it paved the way for large language models (LLMs) like GPT, BERT, LLaMA or Gemini. Why such an impact? Because Transformer has [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/palmer-consulting.com\/en\/transformer-definition\/\" \/>\n<meta property=\"og:site_name\" content=\"Palmer\" \/>\n<meta property=\"article:published_time\" content=\"2025-09-24T10:48:45+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/palmer-consulting.com\/wp-content\/uploads\/2023\/09\/social-graph-palmer.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"675\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Laurent Zennadi\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Laurent Zennadi\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/transformer-definition\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/transformer-definition\\\/\"},\"author\":{\"name\":\"Laurent Zennadi\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#\\\/schema\\\/person\\\/7ea52877fd35814d1d2f8e6e03daa3ed\"},\"headline\":\"Transformer definition\",\"datePublished\":\"2025-09-24T10:48:45+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/transformer-definition\\\/\"},\"wordCount\":2048,\"publisher\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#organization\"},\"articleSection\":[\"Artificial intelligence\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/transformer-definition\\\/\",\"url\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/transformer-definition\\\/\",\"name\":\"Transformer definition | Palmer\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#website\"},\"datePublished\":\"2025-09-24T10:48:45+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/transformer-definition\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/transformer-definition\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/transformer-definition\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/home\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Transformer definition\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#website\",\"url\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/\",\"name\":\"Palmer\",\"description\":\"Evolve at the speed of change\",\"publisher\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#organization\",\"name\":\"Palmer\",\"url\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/palmer-consulting.com\\\/wp-content\\\/uploads\\\/2023\\\/08\\\/Palmer_Logo_Full_PenBlue_1x1-2.jpg\",\"contentUrl\":\"https:\\\/\\\/palmer-consulting.com\\\/wp-content\\\/uploads\\\/2023\\\/08\\\/Palmer_Logo_Full_PenBlue_1x1-2.jpg\",\"width\":480,\"height\":480,\"caption\":\"Palmer\"},\"image\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/company\\\/palmer-consulting\\\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#\\\/schema\\\/person\\\/7ea52877fd35814d1d2f8e6e03daa3ed\",\"name\":\"Laurent 
Zennadi\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/110e8a99f01ca2c88c3d23656103640dc17e08eac86e26d0617937a6846b4007?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/110e8a99f01ca2c88c3d23656103640dc17e08eac86e26d0617937a6846b4007?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/110e8a99f01ca2c88c3d23656103640dc17e08eac86e26d0617937a6846b4007?s=96&d=mm&r=g\",\"caption\":\"Laurent Zennadi\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Transformer definition | Palmer","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/palmer-consulting.com\/en\/transformer-definition\/","og_locale":"en_US","og_type":"article","og_title":"Transformer definition | Palmer","og_description":"Introduction: what is a Transformer in artificial intelligence? Transformer is the architecture that has shaken up artificial intelligence since 2017. Conceived by Google researchers in the seminal article &#8220;Attention is All You Need&#8221;, it paved the way for large language models (LLMs) like GPT, BERT, LLaMA or Gemini. Why such an impact? Because Transformer has [&hellip;]","og_url":"https:\/\/palmer-consulting.com\/en\/transformer-definition\/","og_site_name":"Palmer","article_published_time":"2025-09-24T10:48:45+00:00","og_image":[{"width":1200,"height":675,"url":"https:\/\/palmer-consulting.com\/wp-content\/uploads\/2023\/09\/social-graph-palmer.png","type":"image\/png"}],"author":"Laurent Zennadi","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Laurent Zennadi","Est. 
reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/palmer-consulting.com\/en\/transformer-definition\/#article","isPartOf":{"@id":"https:\/\/palmer-consulting.com\/en\/transformer-definition\/"},"author":{"name":"Laurent Zennadi","@id":"https:\/\/palmer-consulting.com\/en\/#\/schema\/person\/7ea52877fd35814d1d2f8e6e03daa3ed"},"headline":"Transformer definition","datePublished":"2025-09-24T10:48:45+00:00","mainEntityOfPage":{"@id":"https:\/\/palmer-consulting.com\/en\/transformer-definition\/"},"wordCount":2048,"publisher":{"@id":"https:\/\/palmer-consulting.com\/en\/#organization"},"articleSection":["Artificial intelligence"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/palmer-consulting.com\/en\/transformer-definition\/","url":"https:\/\/palmer-consulting.com\/en\/transformer-definition\/","name":"Transformer definition | Palmer","isPartOf":{"@id":"https:\/\/palmer-consulting.com\/en\/#website"},"datePublished":"2025-09-24T10:48:45+00:00","breadcrumb":{"@id":"https:\/\/palmer-consulting.com\/en\/transformer-definition\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/palmer-consulting.com\/en\/transformer-definition\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/palmer-consulting.com\/en\/transformer-definition\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/palmer-consulting.com\/en\/home\/"},{"@type":"ListItem","position":2,"name":"Transformer definition"}]},{"@type":"WebSite","@id":"https:\/\/palmer-consulting.com\/en\/#website","url":"https:\/\/palmer-consulting.com\/en\/","name":"Palmer","description":"Evolve at the speed of change","publisher":{"@id":"https:\/\/palmer-consulting.com\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/palmer-consulting.com\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/palmer-consulting.com\/en\/#organization","name":"Palmer","url":"https:\/\/palmer-consulting.com\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/palmer-consulting.com\/en\/#\/schema\/logo\/image\/","url":"https:\/\/palmer-consulting.com\/wp-content\/uploads\/2023\/08\/Palmer_Logo_Full_PenBlue_1x1-2.jpg","contentUrl":"https:\/\/palmer-consulting.com\/wp-content\/uploads\/2023\/08\/Palmer_Logo_Full_PenBlue_1x1-2.jpg","width":480,"height":480,"caption":"Palmer"},"image":{"@id":"https:\/\/palmer-consulting.com\/en\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.linkedin.com\/company\/palmer-consulting\/"]},{"@type":"Person","@id":"https:\/\/palmer-consulting.com\/en\/#\/schema\/person\/7ea52877fd35814d1d2f8e6e03daa3ed","name":"Laurent Zennadi","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/110e8a99f01ca2c88c3d23656103640dc17e08eac86e26d0617937a6846b4007?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/110e8a99f01ca2c88c3d23656103640dc17e08eac86e26d0617937a6846b4007?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/110e8a99f01ca2c88c3d23656103640dc17e08eac86e26d0617937a6846b4007?s=96&d=mm&r=g","caption":"Laurent 
Zennadi"}}]}},"_links":{"self":[{"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/posts\/4817","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/comments?post=4817"}],"version-history":[{"count":0,"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/posts\/4817\/revisions"}],"wp:attachment":[{"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/media?parent=4817"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/categories?post=4817"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/tags?post=4817"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}