Transformer Block Coupling & its Correlation with Generalization in LLMs

Murdock Aubry*, Haoming Meng*, Anton Sugolov*, Vardan Papyan

*Equal contribution

ICLR 2025

Abstract

Large Language Models (LLMs) have made significant strides in natural language processing, and a precise understanding of the internal mechanisms driving their success is essential. We analyze the trajectories of token embeddings in LLMs as they pass through transformer blocks, linearizing the system along these trajectories through their Jacobian matrices. For two such Jacobians \(J_1, J_2\) with singular value decompositions \(J_1 = U_1 S_1 V_1^T\), \(J_2 = U_2 S_2 V_2^T\), we measure the agreement of \(U_1, U_2\) and of \(V_1, V_2\). Across a variety of pretrained LLMs we uncover the transformer block coupling phenomenon, characterized by the coupling of their top singular vectors across tokens and depth. Our findings reveal that coupling positively correlates with model performance, and that this relationship is stronger than with other hyperparameters such as parameter count, model depth, and embedding dimension.

Figure 1: (a) Correlation with HuggingFace Open LLM Leaderboard. (b) Measurements on Pythia 6.9B and 12B training checkpoints. (c) Coupling between Pythia 12B transformer blocks at varying depths during training.

Coupling Metric

Transformers may be described as a deep composition of functions that iteratively transform token embeddings. We denote by \(x_i^l \in \mathbb{R}^d\) the embedding of the \(i\)-th token at the \(l\)-th layer. The embeddings are transformed by
\[X^{l+1} = F_{\text{block}}^{l+1}(X^l) = X^l + f^{l+1}(X^l)\]
The second equality highlights the residual connection present in the transformer block.

To analyze the change in embeddings at layer \(l\), we linearize the system by computing the Jacobian of \(f^l\), the block's contribution to the residual:
\[J_{t_1t_2}^l = \frac{\partial}{\partial x_{t_1}^{l-1}}(f^l(X^{l-1}))_{t_2} \in \mathbb{R}^{d \times d}\]
where \(t_1, t_2\) denote the input and output tokens of the Jacobian, which may differ.

Given Jacobians \(J_1, J_2\) with singular value decompositions
\[J_1 = U_1S_1V_1^T \quad J_2 = U_2S_2V_2^T\]
we quantify the coupling of their top-\(K\) singular vectors using
\[m_K(J_1, J_2) = \frac{\|U_{2,K}^TJ_1V_{2,K} - S_{1,K}\|_F}{\|S_{1,K}\|_F} = \frac{\|U_{2,K}^TU_1S_1V_1^TV_{2,K} - S_{1,K}\|_F}{\|S_{1,K}\|_F}\]
This measures how well the top-\(K\) singular vectors of \(J_2\) diagonalize \(J_1\). If the two Jacobians share their top-\(K\) singular vectors, the numerator vanishes, so smaller values of \(m_K\) indicate stronger coupling. Strong coupling suggests that transformer blocks coordinate operations in the same basis across layers.
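The metric can be sketched in a few lines of NumPy. This is an illustrative sketch and not the code of the coupling package: the function name `coupling_metric` is hypothetical, and it assumes that \(U_{2,K}, V_{2,K}\) are the first \(K\) columns of \(U_2, V_2\) and that \(S_{1,K}\) is the \(K \times K\) diagonal matrix of the top-\(K\) singular values of \(J_1\).

```python
import numpy as np


def coupling_metric(J1, J2, K):
    """Sketch of m_K(J1, J2) under the truncation assumptions stated above."""
    # Singular values of J1; NumPy returns them in descending order.
    s1 = np.linalg.svd(J1, compute_uv=False)
    # Singular vectors of J2; the rows of V2t are the right singular vectors.
    U2, _, V2t = np.linalg.svd(J2)

    U2K = U2[:, :K]        # top-K left singular vectors of J2
    V2K = V2t[:K, :].T     # top-K right singular vectors of J2
    S1K = np.diag(s1[:K])  # top-K singular values of J1 as a diagonal matrix

    # Attempt to diagonalize J1 in the top-K singular bases of J2.
    projected = U2K.T @ J1 @ V2K

    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return np.linalg.norm(projected - S1K) / np.linalg.norm(S1K)
```

Two matrices built from the same singular vectors but different singular values give a value near 0, which matches the reading that smaller values mean stronger coupling.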

Figure 2: Measuring coupling through multiple token interactions throughout the transformer block

The coupling metric \(m_K(J_1, J_2)\) may be computed between linearizations \(J_1, J_2\) taken from different token interactions and from different depths.

Depth-wise coupling. Fixing a token \(t\), we measure the coupling between \(J_1 = J_{tt}^l\), \(J_2 = J_{tt}^{l'}\) across all layers \(l,l' \in \{1, \ldots, L\}\). This captures the effect of distinct layers on the same token.
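As a sketch of the depth-wise measurement, the function below fills an \(L \times L\) matrix with \(m_K(J_{tt}^l, J_{tt}^{l'})\) for a fixed token. The names `coupling_metric` and `depthwise_coupling` are hypothetical, the Jacobians are assumed to be precomputed \(d \times d\) NumPy arrays, and \(U_{2,K}, V_{2,K}, S_{1,K}\) are assumed to be the top-\(K\) truncations of the corresponding SVD factors.

```python
import numpy as np


def coupling_metric(J1, J2, K):
    """Sketch of m_K(J1, J2), assuming top-K truncation of the SVD factors."""
    s1 = np.linalg.svd(J1, compute_uv=False)
    U2, _, V2t = np.linalg.svd(J2)
    S1K = np.diag(s1[:K])
    projected = U2[:, :K].T @ J1 @ V2t[:K, :].T
    return np.linalg.norm(projected - S1K) / np.linalg.norm(S1K)


def depthwise_coupling(jacobians, K):
    """Return M with M[a, b] = m_K(jacobians[a], jacobians[b]).

    `jacobians` is a list of d x d arrays, one per layer, all taken for the
    same token t (the J_tt^l of the text).
    """
    L = len(jacobians)
    M = np.zeros((L, L))
    for a in range(L):
        for b in range(L):
            M[a, b] = coupling_metric(jacobians[a], jacobians[b], K)
    return M
```

The diagonal of the result is zero up to rounding. The matrix is not symmetric in general, because \(m_K\) treats its two arguments differently.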

Token-wise coupling. We quantify the coupling across tokens in several ways:
- Self-coupling. Fixing two layers \(l,l' \in \{1,\ldots,L\}\), we analyze the case where the input and output tokens are the same. Explicitly, we compare \(J_{tt}^l\) and \(J_{t't'}^{l'}\) across \(t,t' \in \{1,\ldots,n\}\), which represents the coupling across tokens for a token's effect on its own trajectory.
- Context coupling. We consider the context tokens' impact on a trajectory by measuring coupling between \(J_{t_1t_2}^l\) and \(J_{t_1t_2'}^{l'}\) across \(t_2,t_2' \geq t_1\) (fixing the input token to be the same) and also between \(J_{t_1t_2}^l\) and \(J_{t_1't_2}^{l'}\) across \(t_1,t_1' \leq t_2\) (fixing the output token to be the same).
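The Jacobian blocks \(J_{t_1t_2}^l\) compared above can be illustrated on a toy example. The sketch below approximates one block by central finite differences; `toy_branch` is a made-up causal residual branch (a causal running mean over tokens followed by a linear map and tanh), not a real transformer block, and all names here are hypothetical. Token embeddings are assumed to be stored as the rows of an \(n \times d\) array. For an actual model one would more likely use automatic differentiation.

```python
import numpy as np


def causal_mean(n):
    """Lower-triangular matrix whose row t averages tokens 0..t."""
    return np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]


def toy_branch(X, W):
    """Toy stand-in for f^l: causal token mixing, linear map, tanh."""
    return np.tanh(causal_mean(X.shape[0]) @ X @ W)


def block_jacobian(f, X, t_in, t_out, eps=1e-5):
    """Approximate J[j, i] = d f(X)[t_out, j] / d X[t_in, i]."""
    d = X.shape[1]
    J = np.zeros((d, d))
    for i in range(d):
        dX = np.zeros_like(X)
        dX[t_in, i] = eps
        # Central difference in the i-th coordinate of the input token.
        J[:, i] = (f(X + dX)[t_out] - f(X - dX)[t_out]) / (2 * eps)
    return J
```

Because the toy branch is causal, `block_jacobian` returns a zero matrix whenever `t_out < t_in`, which mirrors the constraints \(t_2 \geq t_1\) in the context coupling definition.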

Implementation

Coupling can be measured on any HuggingFace LLM through a few additional lines of code.

Install coupling package

pip install git+https://github.com/sugolov/coupling.git

Add to HuggingFace inference script

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Coupling imports
from coupling import run_coupling_hf

# Hugging Face model ID; the short name is its last path component ("Meta-Llama-3-8B")
model_path = "meta-llama/Meta-Llama-3-8B"
model_name = os.path.normpath(os.path.basename(model_path))

# Load the model weights in 4-bit precision via bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    use_fast=True,
)

# Run coupling measurements
prompts = ["What is the capital of France? The capital is"]
out = run_coupling_hf(model, tokenizer, model_name, prompts, save=True, verbose=True)

BibTeX

@misc{aubry2025transformerblockcouplingcorrelation,
  title={Transformer Block Coupling and its Correlation with Generalization in LLMs},
  author={Murdock Aubry and Haoming Meng and Anton Sugolov and Vardan Papyan},
  year={2025},
  eprint={2407.07810},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.07810},
}
