Transformer Block Coupling
& its Correlation with Generalization in LLMs

Murdock Aubry*, Haoming Meng*, Anton Sugolov*, Vardan Papyan

*Equal contribution



Abstract


Large Language Models (LLMs) have made significant strides in natural language processing, and a precise understanding of the internal mechanisms driving their success is essential. We analyze the trajectories of token embeddings in LLMs as they pass through transformer blocks, linearizing the system along these trajectories through their Jacobian matrices. For two such Jacobians \(J_1, J_2\) and their singular value decompositions \(J_1 = U_1 S_1 V_1^T\), \(J_2 = U_2 S_2 V_2^T\), we measure the agreement of \(U_1, U_2\) and of \(V_1, V_2\). We broadly uncover the transformer block coupling phenomenon in a variety of pretrained LLMs, characterized by the coupling of their top singular vectors across tokens and depth. Our findings reveal that coupling positively correlates with model performance, and that this relationship is stronger than with other hyperparameters such as parameter count, model depth, and embedding dimension.


Figure 1: (a) Correlation with the HuggingFace Open LLM Leaderboard; (b) measurements on Pythia 6.9B and 12B training checkpoints; (c) coupling between Pythia 12B transformer blocks at varying depths during training.



Coupling Metric


Transformers may be described as a deep composition of functions that iteratively transform token embeddings. Let \(x_i^l \in \mathbb{R}^d\) denote the embedding of the \(i\)-th token at the \(l\)-th layer, and let \(X^l\) collect the embeddings of all tokens at that layer. The embeddings are transformed by \[X^{l+1} = F_{\text{block}}^{l+1}(X^l) = X^l + f^{l+1}(X^l),\] where the second equality highlights the residual connection present in the transformer block.

To analyze the change in embeddings at layer \(l\), we linearize the system along the trajectories by computing the Jacobian of \(f^l\) (the contribution to the residual stream): \[J_{t_1 t_2}^l = \frac{\partial}{\partial x_{t_1}^{l-1}}\big(f^l(X^{l-1})\big)_{t_2} \in \mathbb{R}^{d \times d},\] where \(t_1, t_2\) denote the (possibly distinct) input and output tokens of the Jacobian contribution.

Given Jacobians \(J_1, J_2\) with singular value decompositions \[J_1 = U_1 S_1 V_1^T, \quad J_2 = U_2 S_2 V_2^T,\] we quantify the coupling of their top-\(K\) singular vectors using \[m_K(J_1, J_2) = \frac{\|U_{2,K}^T J_1 V_{2,K} - S_{1,K}\|_F}{\|S_{1,K}\|_F} = \frac{\|U_{2,K}^T U_1 S_1 V_1^T V_{2,K} - S_{1,K}\|_F}{\|S_{1,K}\|_F}.\] This measures how strongly the top-\(K\) singular vectors of \(J_1\) and \(J_2\) are aligned, i.e. how well the top-\(K\) singular vectors of \(J_2\) diagonalize \(J_1\). Strong coupling suggests that transformer blocks coordinate their operations in a shared basis across layers.
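
As a concrete reference, the coupling metric can be computed directly from two Jacobian matrices. The following is a minimal sketch in PyTorch that follows the definition above; the helper name coupling_metric is ours and not part of the released coupling package.

    import torch

    def coupling_metric(J1: torch.Tensor, J2: torch.Tensor, K: int) -> float:
        # Only the singular values of J1 and the singular vectors of J2 are needed
        S1 = torch.linalg.svdvals(J1)
        U2, _, V2h = torch.linalg.svd(J2)

        # Top-K singular vectors of J2 and top-K singular values of J1
        U2K = U2[:, :K]              # d x K
        V2K = V2h[:K, :].T           # d x K
        S1K = torch.diag(S1[:K])     # K x K

        # m_K(J1, J2) = || U2K^T J1 V2K - S1K ||_F / || S1K ||_F
        numerator = torch.linalg.norm(U2K.T @ J1 @ V2K - S1K)
        denominator = torch.linalg.norm(S1K)
        return (numerator / denominator).item()

By construction, \(m_K(J_1, J_2) = 0\) when the top-\(K\) singular vectors of \(J_1\) and \(J_2\) coincide, so smaller values indicate stronger coupling.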

Figure 2: Measuring coupling through multiple token interactions throughout the transformer block (Pythia 12-B).

The coupling metric \(m_K(J_1, J_2)\) may be computed for linearizations \(J_1, J_2\) corresponding to multiple interactions between tokens and across depths.

  1. Depth-wise coupling: Fixing a token \(t\), we measure the coupling between \(J_1 = J_{tt}^l\) and \(J_2 = J_{tt}^{l'}\) across all layers \(l, l' \in \{1, \ldots, L\}\). This captures the effect of distinct layers on the same token; a minimal sketch of this measurement follows the list below.

  2. Token-wise coupling: We quantify the coupling across tokens in several ways, for instance by comparing the Jacobians associated with distinct tokens \(t \neq t'\) at the same or at different depths.
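
Below is a minimal sketch of the depth-wise measurement for a fixed token \(t\), reusing the coupling_metric helper sketched above. It assumes the per-block Jacobians \(J_{tt}^l\) have already been extracted, e.g. with torch.autograd.functional.jacobian applied to each block's residual branch \(f^l\); the variable jacobians below is a hypothetical list of these \(d \times d\) matrices indexed by layer.

    import torch

    def depthwise_coupling_map(jacobians, K):
        # jacobians: hypothetical list of L matrices, jacobians[l] = J_{tt}^{l+1} (d x d),
        # all taken at the same token position t
        L = len(jacobians)
        coupling = torch.zeros(L, L)
        for l1 in range(L):
            for l2 in range(L):
                # Coupling between the blocks at depths l1+1 and l2+1
                coupling[l1, l2] = coupling_metric(jacobians[l1], jacobians[l2], K)
        return coupling

The resulting \(L \times L\) map can be visualized as a heatmap over layer pairs, analogous to the depth-wise coupling shown in Figure 1(c).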



Implementation


Coupling can be measured on any HuggingFace LLM through a few additional lines of code.
  1. Install the coupling package:

    pip install git+https://github.com/sugolov/coupling.git

  2. Add the following to a HuggingFace inference script:

    import os
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Coupling imports
    from coupling import run_coupling_hf

    model_path = "meta-llama/Meta-Llama-3-8B"
    # Model name derived from the repo path (e.g. "Meta-Llama-3-8B")
    model_name = os.path.normpath(os.path.basename(model_path))

    # Load the model in 4-bit precision to reduce memory usage
    bnb_config = BitsAndBytesConfig(load_in_4bit=True)

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="cuda",
        trust_remote_code=True,
        quantization_config=bnb_config,
    )

    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        use_fast=True,
    )

    # Run coupling measurements on the prompt(s)
    prompts = ["What is the capital of France? The capital is"]
    out = run_coupling_hf(model, tokenizer, model_name, prompts, save=True, verbose=True)



BibTeX


@misc{aubry2025transformerblockcouplingcorrelation,
  title={Transformer Block Coupling and its Correlation with Generalization in LLMs},
  author={Murdock Aubry and Haoming Meng and Anton Sugolov and Vardan Papyan},
  year={2025},
  eprint={2407.07810},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.07810},
}