Architecture of large language models: understanding their inner workings
LLM foundations
Modern large language models are mainly based on an architecture called the Transformer, introduced by Google researchers in 2017. This architecture revolutionized AI by replacing recurrent networks with an attention mechanism that can process an entire sequence in parallel. LLMs such as GPT, Llama, Claude or Gemini are derived from this lineage and all operate on similar principles: transforming a sequence of tokens into an internal representation, then predicting the next token from this representation.
Processing steps
The LLM pipeline can be broken down into several blocks:
- Tokenization: the input text is broken down into units called tokens (words, sub-words or characters). This process converts a string of characters into a sequence of integers corresponding to a predefined vocabulary.
- Embeddings and positionality: each token is transformed into a dense vector via an embedding. As the Transformer has no sequential memory, positional information is added to indicate the order of the tokens (positional encoding).
- Attention mechanism: at the heart of the architecture, attention enables the model to weight the importance of each token relative to the others. In an attention layer, queries, keys and values are computed for each token. The dot product of queries and keys yields attention scores, which are scaled, normalized with a softmax, and used to weight the values.
- Multi-head attention: to capture several types of relationships, several attention heads operate in parallel. The results are concatenated and transformed.
- Feedforward network: after attention, each position passes through a dense neural network that is identical for all positions. This network provides additional non-linearity and increases representational capacity.
- Layer stacking: the attention and feedforward steps are repeated many times (typically from a few dozen to over a hundred layers), depending on the size of the model. Residual connections and layer normalizations stabilize training.
- Output and decoding: the last layer produces logits, i.e. scores for each token in the vocabulary. After applying a softmax function, a probability distribution is obtained. The model then selects the next token via decoding strategies (greedy, top-k, nucleus sampling, etc.).
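As an illustration, the attention step above can be sketched in a few lines of NumPy. The weight matrices and dimensions here are toy values, not those of a real model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # queries, keys, values per token
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # token-to-token attention scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # weighted sum of the values

# Toy example: a sequence of 4 tokens, model dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one updated vector per token
```

A multi-head layer runs several such computations in parallel on lower-dimensional projections and concatenates the results.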
The table below summarizes these main components and their functions:
| Component | Main function |
| --- | --- |
| Tokenization | Break down input into stable units |
| Embeddings | Represent each token by a continuous vector |
| Positional encoding | Add positional information to each vector |
| Self-attention | Weight relations between all tokens |
| Multi-head attention | Multiply attention subspaces to capture different dependencies |
| Feedforward network | Transform representations with a non-linear function |
| Stacking layers | Increase depth to capture complex relationships |
| Softmax and decoding | Generate scores and select the next token |
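The decoding strategies mentioned above (greedy, top-k, nucleus sampling) can be sketched as a single sampling function over a vector of logits. The logits below are illustrative, not the output of a trained model:

```python
import numpy as np

def sample_next_token(logits, k=None, p=None, temperature=1.0, rng=None):
    """Pick the next token id from logits: greedy, top-k or nucleus sampling."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over the vocabulary
    if k is None and p is None:
        return int(probs.argmax())              # greedy: most likely token
    order = np.argsort(probs)[::-1]             # tokens by decreasing probability
    if k is not None:
        order = order[:k]                       # top-k: keep only the k best
    if p is not None:
        cum = np.cumsum(probs[order])
        order = order[: np.searchsorted(cum, p) + 1]  # smallest set with mass >= p
    sub = probs[order] / probs[order].sum()     # renormalize over the kept tokens
    return int(rng.choice(order, p=sub))

logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next_token(logits))  # greedy -> 0, the highest-scoring token
```

Raising the temperature flattens the distribution and makes sampling more diverse; lowering it makes the choice closer to greedy.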
Pre-training and fine-tuning
LLMs are first pre-trained on massive volumes of text (websites, books, articles) via self-supervised tasks. The main objective is to predict the next word (or fill in masks) from the context. This pre-training enables the model to acquire a general knowledge of the language and common facts. Next, a fine-tuning phase uses labeled data or instructions to adapt the model to specific uses: assisted dialogue, summaries, code, question answering. Reinforcement learning from human feedback (RLHF) is also commonly used to refine behavior and reduce drift.
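The pre-training objective amounts to minimizing the average cross-entropy on the next token. A minimal sketch, with toy logits and targets rather than real model outputs:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy between predicted distributions and the true next tokens."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick the log-probability the model assigned to each actual next token.
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy setup: 3 positions, vocabulary of 5 tokens.
logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                   [0.1, 2.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 2.0, 0.1, 0.1]])
targets = np.array([0, 1, 2])   # the true "next token" at each position
print(next_token_loss(logits, targets))  # low: the model already favors the targets
```

Training consists of adjusting the model's parameters so that this loss decreases across billions of such positions.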
Sizing strategies
Two factors significantly influence the capabilities of an LLM: size (number of parameters) and quantity of training data. Scaling laws show that larger models trained on larger datasets often perform better. However, there are limits to this growth: high energy costs, training time, carbon footprint and deployment difficulties.
Several strategies are used to manage these constraints:
- Data parallelism: distribute the data over several processors so that different parts of the batch are trained on simultaneously.
- Model parallelism: distribute layers or attention heads over several GPUs.
- Mixture of Experts (MoE): activate only a subset of experts for each input, reducing the computation required and enabling specialization by task.
- Quantization: lower numerical precision (from 16 bits to 8 or 4 bits) to shrink model size and speed up inference.
- Distillation: train a smaller model to reproduce the behavior of a larger model, offering a compromise between quality and efficiency.
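As a sketch of the quantization idea, here is a simple symmetric per-tensor int8 scheme; real toolchains use more elaborate per-channel or calibration-based variants:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float weights to int8."""
    scale = np.abs(w).max() / 127.0           # map the largest magnitude to 127
    q = np.round(w / scale).astype(np.int8)   # store weights as small integers
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())   # small reconstruction error, bounded by the step size
```

The int8 tensor takes a quarter of the memory of 32-bit floats, at the cost of a rounding error of at most half a quantization step per weight.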
Ethical issues and limitations
Massive models raise several concerns:
- Bias and discrimination: training data may contain stereotypes, leading to biases reproduced by the model. Studies and de-biasing techniques seek to identify and reduce these effects.
- Hallucinations: because their objective is to predict the next word, models can invent plausible but false facts. Grounding on reliable sources or post-generation filtering is needed to mitigate this.
- Privacy: models sometimes memorize sensitive information present in training data. Differential privacy and retraining on anonymized data are being explored as countermeasures.
- Environmental impact: training and deploying giant models consumes a lot of energy. It is crucial to measure and reduce the carbon footprint, notably by improving algorithms and using decarbonized datacenters.
Conclusion
The Transformer architecture has enabled language models to make remarkable qualitative leaps. Understanding its components – from tokenization to multi-headed attention to feedforward networks – helps to identify the possibilities and limits of LLMs. However, the design and use of these models require ethical reflection and optimizations to reconcile performance, cost and responsibility. With the rise of hybrid techniques (mixture of experts, distillation) and innovations in pre-training, architectures will continue to evolve to meet the growing demand for more accurate and efficient models.