Summary
We empirically find that the Jacobians of transformer blocks in pre-trained LLMs have highly similar singular vectors, a property we call transformer block coupling. Transformer blocks contribute to the residual of token embeddings with \(X^{l+1} = X^l + f^{l+1}(X^l)\). We linearize each block \(f^l\) via its Jacobian \[J^l = U^l S^l (V^l)^T\] and find that across 30+ pretrained LLMs, the top singular vectors \(U^l, V^l\) are highly consistent across layers and tokens. Coupling is measured by norms on the off-diagonal of the co-diagonalization \[(U^j)^T U^i S^i (V^i)^T V^j\] between layers \(i\) and \(j\).
Abstract
Large Language Models (LLMs) have made significant strides in natural language processing, and a precise understanding of the internal mechanisms driving their success is essential. We analyze the trajectories of token embeddings in LLMs as they pass through transformer blocks, linearizing the system along these trajectories through their Jacobian matrices. For two such Jacobians \(J_1, J_2\), and their singular value decompositions \(J_1 = U_1 S_1 V_1^T\), \(J_2 = U_2 S_2 V_2^T\) we measure the agreement of \(U_1, U_2\) and \(V_1, V_2\). We broadly uncover the transformer block coupling phenomenon in a variety of pretrained LLMs, characterized by the coupling of their top singular vectors across tokens and depth. Our findings reveal that coupling positively correlates with model performance, and that this relationship is stronger than with other hyperparameters such as parameter count, model depth, and embedding dimension.
Coupling Metric
Transformers may be described as a deep composition of functions that iteratively transform token embeddings. By \(x_i^l \in \mathbb{R}^d\) we denote the embedding of the \(i\)-th token at the \(l\)-th layer, and by \(X^l\) the collection of all token embeddings at that layer, which is transformed by \[X^{l+1} = F_{\text{block}}^{l+1}(X^l) = X^l + f^{l+1}(X^l)\] The second equality highlights the residual connection present in the transformer block. To analyze the change in embeddings at layer \(l\), we linearize the block's contribution to the residual by computing the Jacobian of \(f^l\): \[J_{t_1t_2}^l = \frac{\partial}{\partial x_{t_1}^{l-1}}(f^l(X^{l-1}))_{t_2} \in \mathbb{R}^{d \times d}\] where \(t_1, t_2\) denote the (possibly different) input and output tokens of the Jacobian. Given Jacobians \(J_1, J_2\) with singular value decompositions: \[J_1 = U_1S_1V_1^T \quad J_2 = U_2S_2V_2^T\]
We quantify coupling of their top-\(K\) singular vectors using: \[m_K(J_1, J_2) = \frac{\|U_{2,K}^TJ_1V_{2,K} - S_{1,K}\|_F}{\|S_{1,K}\|_p} = \frac{\|U_{2,K}^TU_1S_1V_1^TV_{2,K} - S_{1,K}\|_F}{\|S_{1,K}\|_p}\] This measures how well the top-\(K\) singular vectors of \(J_2\) diagonalize \(J_1\). When the top-\(K\) singular vectors of the two Jacobians coincide, the numerator vanishes, so lower values of \(m_K\) indicate stronger coupling. Strong coupling suggests that transformer blocks coordinate operations in the same basis across layers.
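The metric above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of the `coupling` package: the function name `coupling_metric` and the synthetic matrices are hypothetical, and the denominator is assumed to use the Frobenius norm, since the formula leaves \(p\) unspecified.

```python
import numpy as np

def coupling_metric(J1, J2, K):
    """m_K(J1, J2): how well the top-K singular vectors of J2 diagonalize J1."""
    S1 = np.linalg.svd(J1, compute_uv=False)   # singular values of J1, descending
    U2, _, V2t = np.linalg.svd(J2)
    U2K = U2[:, :K]            # top-K left singular vectors of J2
    V2K = V2t[:K, :].T         # top-K right singular vectors of J2
    S1K = np.diag(S1[:K])      # top-K singular values of J1
    residual = U2K.T @ J1 @ V2K - S1K
    # Assumption: Frobenius norm in the denominator (NumPy's default for matrices).
    return np.linalg.norm(residual) / np.linalg.norm(S1K)

# Synthetic example: J_a and J_b share singular vectors, J_c is unrelated.
rng = np.random.default_rng(0)
d, K = 16, 4
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
J_a = U @ np.diag(np.linspace(2.0, 0.5, d)) @ V.T
J_b = U @ np.diag(np.linspace(3.0, 1.0, d)) @ V.T
J_c = rng.normal(size=(d, d))

m_coupled = coupling_metric(J_a, J_b, K)     # shared bases: close to zero
m_uncoupled = coupling_metric(J_a, J_c, K)   # unrelated bases: much larger
```

Because the SVD fixes singular vectors only up to a joint sign flip of each left/right pair, the product \(U_{2,K}^T J_1 V_{2,K}\) is unaffected by that ambiguity, which is why the metric can be evaluated directly on the SVD output.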
The coupling metric \(m_K(J_1, J_2)\) can be computed for pairs of linearizations \(J_1, J_2\) taken from different layers and from different interactions between tokens:
- Depth-wise coupling: Fixing a token \(t\), we measure the coupling between \(J_1 = J_{tt}^l\), \(J_2 = J_{tt}^{l'}\) across all layers \(l,l' \in \{1, \ldots, L\}\). This captures the effect of distinct layers on the same token.
- Token-wise coupling: We quantify the coupling across tokens in several ways:
  - Self-coupling: By fixing two layers \(l,l' \in \{1,\ldots,L\}\), we analyze the case where the input and output tokens are the same. Explicitly, we compare \(J_{tt}^l\) and \(J_{t't'}^{l'}\) across \(t,t' \in \{1,\ldots,n\}\), which represents the coupling across tokens for a token's effect on its own trajectory.
  - Context coupling: We consider the context tokens' impact on a trajectory by measuring coupling between \(J_{t_1t_2}^l\) and \(J_{t_1t_2'}^{l'}\) across \(t_2,t_2' \geq t_1\) (fixing the input token to be the same) and also between \(J_{t_1t_2}^l\) and \(J_{t_1't_2}^{l'}\) across \(t_1,t_1' \leq t_2\) (fixing the output token to be the same).
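To make the index convention of the Jacobian blocks \(J_{t_1t_2}\) used in the list above concrete, here is a minimal sketch on a toy causal function rather than on an LLM. The names `toy_block`, `W`, and `jacobian_block` are hypothetical and introduced only for illustration, and central finite differences stand in for the automatic differentiation one would normally use with a real model.

```python
import numpy as np

def toy_block(X, W):
    """Hypothetical causal residual contribution f(X): token t only sees tokens 0..t."""
    n = X.shape[0]
    # Running mean over each token's prefix, a crude stand-in for causal attention.
    prefix_mean = np.cumsum(X, axis=0) / np.arange(1, n + 1)[:, None]
    return np.tanh(prefix_mean @ W.T)

def jacobian_block(f, X, t1, t2, eps=1e-6):
    """Central finite-difference estimate of d f(X)[t2] / d X[t1], shape (d, d)."""
    d = X.shape[1]
    J = np.zeros((d, d))
    for k in range(d):
        dX = np.zeros_like(X)
        dX[t1, k] = eps
        J[:, k] = (f(X + dX)[t2] - f(X - dX)[t2]) / (2 * eps)
    return J

rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, d))
f = lambda X: toy_block(X, W)

J_self = jacobian_block(f, X, t1=2, t2=2)      # input and output token coincide
J_context = jacobian_block(f, X, t1=0, t2=2)   # earlier context token, t1 <= t2
J_future = jacobian_block(f, X, t1=3, t2=1)    # t1 > t2: zero under causal masking
```

The zero block for \(t_1 > t_2\) is why the context comparisons are restricted to \(t_2, t_2' \geq t_1\) and \(t_1, t_1' \leq t_2\).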
Results
Depth-wise. In trained LLMs, we observe coupling of the top singular vectors of the Jacobians across depth, evident in the low off-diagonal values with a visible diagonal present in the matrix subplots. This is consistently observed across various LLMs considered. In untrained models, there is no coupling of Jacobians across different depths, and singular vectors are uncorrelated.
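The depth-wise comparison can be illustrated on synthetic matrices. The sketch below is not a model measurement: `depthwise_coupling` and the two synthetic "layer" lists are hypothetical, the denominator norm is assumed to be Frobenius, and the matrices only mimic the two regimes described above (layers sharing singular vectors versus unrelated random layers).

```python
import numpy as np

def coupling_metric(J1, J2, K):
    """m_K(J1, J2), assuming a Frobenius norm in the denominator."""
    S1 = np.linalg.svd(J1, compute_uv=False)[:K]
    U2, _, V2t = np.linalg.svd(J2)
    residual = U2[:, :K].T @ J1 @ V2t[:K].T - np.diag(S1)
    return np.linalg.norm(residual) / np.linalg.norm(S1)

def depthwise_coupling(jacobians, K):
    """Matrix M with M[l, l'] = m_K(J^l, J^{l'}) for one fixed token."""
    L = len(jacobians)
    M = np.zeros((L, L))
    for l in range(L):
        for lp in range(L):
            M[l, lp] = coupling_metric(jacobians[l], jacobians[lp], K)
    return M

rng = np.random.default_rng(0)
d, K, L = 16, 4, 5
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Synthetic layers that share singular vectors but differ in singular values.
shared = [U @ np.diag(np.linspace(2.0 + l, 0.5, d)) @ V.T for l in range(L)]
# Synthetic layers with unrelated random entries.
unrelated = [rng.normal(size=(d, d)) for _ in range(L)]

M_shared = depthwise_coupling(shared, K)        # near zero everywhere
M_unrelated = depthwise_coupling(unrelated, K)  # near zero only on the diagonal
off_diag_mean = M_unrelated[~np.eye(L, dtype=bool)].mean()
```

The same pairwise loop applies to token-wise coupling by passing Jacobians indexed by token pairs instead of by layer.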
Token-wise. We analyze the coupling of singular vectors of Jacobians across tokens. For input and output tokens that are the same (\(J_{tt}^l\) and \(J_{t't'}^{l'}\)), we observe strong coupling, indicating that a token's effect on its own trajectory is coupled with the corresponding effect for other tokens. For context tokens, coupling is examined by fixing the input token (\(J_{t_1t_2}^l\) and \(J_{t_1t_2'}^{l'}\)) or the output token (\(J_{t_1t_2}^l\) and \(J_{t_1't_2}^{l'}\)). Context coupling is present, but its strength varies across token pairs. Untrained models show no such coupling.
Across training. Coupling emerges through training for the evaluated LLMs, including coupling across depth and across tokens. Evaluating layer-wise coupling at intermediate training checkpoints of Pythia 6.9B and 12B (Figure 2b), we observe that coupling is generally low at initialization and increases persistently throughout training.
Implementation
Coupling can be measured on any HuggingFace LLM through a few additional lines of code.
- Install coupling package
pip install git+https://github.com/sugolov/coupling.git
- Add to HuggingFace inference script
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# Coupling imports
from coupling import run_coupling_hf
model_path = "meta-llama/Meta-Llama-3-8B"
model_name = os.path.normpath(os.path.basename(model_path))
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    trust_remote_code=True,
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    use_fast=True,
)
# Run coupling measurements
prompts = ["What is the capital of France? The capital is"]
out = run_coupling_hf(model, tokenizer, model_name, prompts, save=True, verbose=True)
BibTeX
If you find this work useful, please consider citing our paper:
@misc{aubry2025transformerblockcouplingcorrelation,
  title={Transformer Block Coupling and its Correlation with Generalization in LLMs},
  author={Murdock Aubry and Haoming Meng and Anton Sugolov and Vardan Papyan},
  year={2025},
  eprint={2407.07810},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.07810},
}