Architecture of large language models: understanding their inner workings
LLM foundations
Modern large language models are mainly based on an architecture called the Transformer, introduced by Google researchers in 2017. This architecture revolutionized AI by replacing recurrent networks with an attention mechanism that can process an entire sequence in parallel. LLMs such as GPT, Llama, Claude or Gemini are derived from this lineage and all operate on similar principles: transforming a sequence of tokens into an internal representation, then predicting the next token from this representation.
Processing steps
The LLM pipeline can be broken down into several blocks:
- Tokenization: the input text is broken down into units called tokens (words, sub-words or characters). This process converts a string of characters into a sequence of integers corresponding to a predefined vocabulary.
- Embeddings and positionality: each token is transformed into a dense vector via an embedding. As the Transformer has no sequential memory, positional information is added to indicate the order of the tokens (positional encoding).
- Attention mechanism: at the heart of the architecture, attention enables the model to weight the importance of each token relative to the others. In an attention layer, queries, keys and values are computed for each token. The dot product of queries and keys yields attention scores, which are scaled, normalized with a softmax, and used to weight the values.
- Multi-head attention: to capture several types of relationships, several attention heads operate in parallel. The results are concatenated and transformed.
- Feedforward network: after attention, each position passes through a dense neural network that is identical for all positions. This network provides additional non-linearity and increases representational capacity.
- Layer stacking: the attention and feedforward steps are repeated many times (typically from a few dozen to over a hundred layers), depending on the size of the model. Residual connections and layer normalizations stabilize training.
- Output and decoding: the last layer produces logits, i.e. scores for each token in the vocabulary. After applying a softmax function, a probability distribution is obtained. The model then selects the next token via decoding strategies (greedy, top-k, nucleus sampling, etc.).
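As an illustration, the attention step above can be sketched in a few lines of NumPy. The weight matrices and dimensions here are toy values, not those of a real model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # queries, keys, values per token
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # token-to-token attention scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # weighted sum of the values

# Toy example: a sequence of 4 tokens, model dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one updated vector per token
```

A multi-head layer runs several such computations in parallel on lower-dimensional projections and concatenates the results.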
The table below summarizes these main components and their functions:
| Component | Main function |
| --- | --- |
| Tokenization | Break down input into stable units |
| Embeddings | Represent each token by a continuous vector |
| Positional encoding | Add positional information to each vector |
| Self-attention | Weight relations between all tokens |
| Multi-head attention | Multiply attention subspaces to capture different dependencies |
| Feedforward network | Transform representations with a non-linear function |
| Stacking layers | Increase depth to capture complex relationships |
| Softmax and decoding | Generate scores and select the next token |
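The decoding strategies mentioned above (greedy, top-k, nucleus sampling) can be sketched as a single sampling function over a vector of logits. The logits below are illustrative, not the output of a trained model:

```python
import numpy as np

def sample_next_token(logits, k=None, p=None, temperature=1.0, rng=None):
    """Pick the next token id from logits: greedy, top-k or nucleus sampling."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over the vocabulary
    if k is None and p is None:
        return int(probs.argmax())              # greedy: most likely token
    order = np.argsort(probs)[::-1]             # tokens by decreasing probability
    if k is not None:
        order = order[:k]                       # top-k: keep only the k best
    if p is not None:
        cum = np.cumsum(probs[order])
        order = order[: np.searchsorted(cum, p) + 1]  # smallest set with mass >= p
    sub = probs[order] / probs[order].sum()     # renormalize over the kept tokens
    return int(rng.choice(order, p=sub))

logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next_token(logits))  # greedy -> 0, the highest-scoring token
```

Raising the temperature flattens the distribution and makes sampling more diverse; lowering it makes the choice closer to greedy.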
Pre-training and fine-tuning
LLMs are first pre-trained on massive volumes of text (websites, books, articles) via self-supervised tasks. The main objective is to predict the next word (or fill in masks) from the context. This pre-training enables the model to acquire a general knowledge of the language and common facts. Next, a fine-tuning phase uses labeled data or instructions to adapt the model to specific uses: assisted dialogue, summaries, code, question answering. Reinforcement learning from human feedback (RLHF) is also commonly used to refine behavior and reduce drift.
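The pre-training objective amounts to minimizing the average cross-entropy on the next token. A minimal sketch, with toy logits and targets rather than real model outputs:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy between predicted distributions and the true next tokens."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick the log-probability the model assigned to each actual next token.
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy setup: 3 positions, vocabulary of 5 tokens.
logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                   [0.1, 2.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 2.0, 0.1, 0.1]])
targets = np.array([0, 1, 2])   # the true "next token" at each position
print(next_token_loss(logits, targets))  # low: the model already favors the targets
```

Training consists of adjusting the model's parameters so that this loss decreases across billions of such positions.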
Sizing strategies
Two factors significantly influence the capabilities of an LLM: size (number of parameters) and quantity of training data. Scaling laws show that larger models trained on larger datasets often perform better. However, there are limits to this growth: high energy costs, training time, carbon footprint and deployment difficulties.
Several strategies are used to manage these constraints:
- Data parallelism: distribute the data over several processors so that different parts of the batch are trained on simultaneously.
- Model parallelism: distribute layers or attention heads over several GPUs.
- Mixture of Experts (MoE): activate only a subset of experts for each input, reducing the computation required and enabling specialization by task.
- Quantization: lower numerical precision (from 16 bits to 8 or 4 bits) to shrink model size and speed up inference.
- Distillation: train a smaller model to reproduce the behavior of a larger model, offering a compromise between quality and efficiency.
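As a sketch of the quantization idea, here is a simple symmetric per-tensor int8 scheme; real toolchains use more elaborate per-channel or calibration-based variants:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float weights to int8."""
    scale = np.abs(w).max() / 127.0           # map the largest magnitude to 127
    q = np.round(w / scale).astype(np.int8)   # store weights as small integers
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())   # small reconstruction error, bounded by the step size
```

The int8 tensor takes a quarter of the memory of 32-bit floats, at the cost of a rounding error of at most half a quantization step per weight.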
Ethical issues and limitations
Massive models raise several concerns:
- Bias and discrimination: training data may contain stereotypes, leading to biases reproduced by the model. Studies and de-biasing techniques seek to identify and reduce these effects.
- Hallucinations: because their objective is to predict the next word, models can invent plausible but false facts. Grounding on reliable sources or post-generation filtering is needed to mitigate this.
- Privacy: models sometimes memorize sensitive information present in training data. Differential privacy and retraining on anonymized data are being explored as countermeasures.
- Environmental impact: training and deploying giant models consumes a lot of energy. It is crucial to measure and reduce the carbon footprint, notably by improving algorithms and using decarbonized datacenters.
Conclusion
The Transformer architecture has enabled language models to make remarkable qualitative leaps. Understanding its components – from tokenization to multi-headed attention to feedforward networks – helps to identify the possibilities and limits of LLMs. However, the design and use of these models require ethical reflection and optimizations to reconcile performance, cost and responsibility. With the rise of hybrid techniques (mixture of experts, distillation) and innovations in pre-training, architectures will continue to evolve to meet the growing demand for more accurate and efficient models.