Yesterday I wrote about what happens when you swap softmax for sigmoid in the attention computation. If you missed it, the short version is that the exam mistake I made turned out to be a structural error, not just a notational one. Softmax is the ingredient that creates per-row probability distributions over tokens, induces attention moment matrices, and makes the Hessian of the loss block-heterogeneous and partially indefinite. That piece is here. I was still sitting with the same ICLR 2025 paper by Ormaniec, Dangel, and Singh when another question surfaced, one I had circled many times but never examined directly.
Why do we use two separate matrices for queries and keys?
The standard self-attention computation, with \(X \in \mathbb{R}^{L \times d_V}\) and row-wise softmax, is:

\[ F(X) = \mathrm{softmax}\!\left(\frac{X W_Q W_K^\top X^\top}{\sqrt{d_K}}\right) X W_V. \]
Two weight matrices, \(W_Q \in \mathbb{R}^{d_V \times d_K}\) and \(W_K \in \mathbb{R}^{d_V \times d_K}\), whose product \(W_Q W_K^\top\) enters the score. From a function-class perspective there is nothing forcing this. You could replace \(W_Q W_K^\top / \sqrt{d_K}\) with a single matrix \(W_{QK} \in \mathbb{R}^{d_V \times d_V}\) and express exactly the same functions; in fact more, since \(W_Q W_K^\top\) is rank-limited by \(d_K\) while a general \(W_{QK}\) is not. So why bother splitting it?
The standard explanations are interpretability and parameter efficiency when \(d_K < d_V\). Those are real. But the paper surfaces a third consequence that is not usually mentioned: the dual parameterization changes the curvature of the loss surface in a precise and measurable way, independent of what outputs the network produces.
The two parameterizations side by side
Let me be explicit about what each version looks like. In the standard mechanism, the score matrix is:
\[ T(X) = \frac{X W_Q W_K^\top X^\top}{\sqrt{d_K}}, \]
parameterized by the pair \((W_Q, W_K)\). In the single-matrix version we define \(W_{QK} \in \mathbb{R}^{d_V \times d_V}\) and write:
\[ T(X) = X W_{QK} X^\top. \]
Both produce a score matrix in \(\mathbb{R}^{L \times L}\), which then goes through row-wise softmax and produces the same attention output, assuming the weights satisfy \(W_{QK} = W_Q W_K^\top / \sqrt{d_K}\). The predictions are identical. The gradients, in a first-order sense, are the same. But the Hessian, the second-order structure of the loss, is not.
Figure 1. Two parameterizations of the query-key score matrix that produce identical outputs when weights match. The dual-matrix version \((W_Q, W_K)\) introduces an extra indefinite term in the Hessian, the T-functional Hessian, that is entirely absent in the single-matrix \(W_{QK}\) case.
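The functional equivalence is easy to check numerically. Here is a minimal NumPy sketch (all dimensions are toy values of my own choosing, not from the paper): build the score matrix both ways from matched weights, run it through row-wise softmax, and confirm that the outputs coincide and that the merged matrix is rank-limited by \(d_K\).

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_V, d_K = 6, 8, 4  # sequence length, model dim, head dim (toy sizes)

X = rng.standard_normal((L, d_V))
W_Q = rng.standard_normal((d_V, d_K))
W_K = rng.standard_normal((d_V, d_K))
W_V = rng.standard_normal((d_V, d_V))

def softmax_rows(T):
    # numerically stable row-wise softmax
    T = T - T.max(axis=1, keepdims=True)
    E = np.exp(T)
    return E / E.sum(axis=1, keepdims=True)

# dual parameterization: score via separate W_Q, W_K
T_dual = X @ W_Q @ W_K.T @ X.T / np.sqrt(d_K)
out_dual = softmax_rows(T_dual) @ X @ W_V

# single-matrix parameterization with matched weights
W_QK = W_Q @ W_K.T / np.sqrt(d_K)
T_single = X @ W_QK @ X.T
out_single = softmax_rows(T_single) @ X @ W_V

assert np.allclose(out_dual, out_single)   # identical outputs
assert np.linalg.matrix_rank(W_QK) <= d_K  # merged matrix is rank-limited by d_K
```

The outputs agree to machine precision; the difference between the parameterizations only appears at second order.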
The single-matrix Hessian: what Lemma 4.1 says
The paper derives the Hessian for single-matrix attention explicitly. With \(T(X) = X W_{QK} X^\top\) and row-wise softmax, Lemma 4.1 expresses the Hessian as a sum of two terms: an outer-product term, built from first-order derivatives of the network with the first-order softmax derivatives collected in \(Z_1\), and a functional term, which contracts the second-order softmax derivatives collected in \(Z_2\) against the flattened residual \(\delta_{XY} = \mathrm{vec}_r(F(X) - Y)\). The first term is the outer-product Hessian, which is always positive semi-definite; it is a Gram-type product of Jacobians. The second term is the functional Hessian, which accounts for second derivatives of the network itself.
These two terms together define the entire Hessian for the single-matrix case. Nothing more. No additional indefinite structure. The curvature, however complex it may be due to softmax, is confined to what those two terms produce.
What the dual parameterization adds: the nested decomposition
With separate \(W_Q\) and \(W_K\), the score matrix \(T\) depends on both. Differentiating with respect to \(W_Q\) involves \(W_K\), and vice versa. The Hessian picks up cross-terms that the single-matrix version does not have.
The paper calls this a T-Gauss-Newton decomposition, described in Remark D.1. The query-key portion of the full Hessian decomposes into a T-outer-product Hessian \(H_o^T\) and a T-functional Hessian \(H_f^T\). The T-outer-product term is:

\[ H_o^T = \frac{1}{d_K}\, V^\top U V, \]

where \(U\) is precisely the expression from Lemma 4.1, and \(V = [(W_Q \otimes I_{d_V}) K_{d_K,d_V},\ I_{d_V} \otimes W_K]\) is a structured matrix that encodes the dual-matrix relationship. This T-outer-product term is positive semi-definite. It is the analogue of the single-matrix Hessian, now projected through the dual parameterization.
And then, additionally, there is a term that has no counterpart in the single-matrix case:

\[ H_f^T = \frac{1}{\sqrt{d_K}} \begin{bmatrix} 0 & B^\top \\ B & 0 \end{bmatrix} \otimes I_{d_K}, \]

where \(B = R_{d_V}(I_L \otimes W_V^\top \otimes I_{d_V})(Z_1 \otimes I_{d_V})S\). The diagonal blocks are zero. This is the block-hollow structure. And block-hollow means indefinite.
Figure 2. The nested Hessian decomposition for the dual-matrix parameterization. The query-key portion of the outer-product Hessian decomposes further into a T-outer-product term \(H_o^T\) (positive semi-definite) and a T-functional term \(H_f^T\) (block-hollow, indefinite). The latter vanishes entirely when queries and keys are merged into a single matrix \(W_{QK}\).
Block-hollow means indefinite: the eigenvalue argument
The T-functional Hessian has the form \(\frac{1}{\sqrt{d_K}}\begin{bmatrix} 0 & B^\top \\ B & 0 \end{bmatrix} \otimes I_{d_K}\). The zero blocks on the diagonal make this block-hollow. And block-hollow implies a very specific eigenvalue structure: all eigenvalues come in symmetric \(\pm\lambda\) pairs.
To see this, suppose \([v_1^\top, v_2^\top]^\top\) is an eigenvector with eigenvalue \(\lambda\), so that \(B^\top v_2 = \lambda v_1\) and \(B v_1 = \lambda v_2\). Now apply the matrix to \([-v_1^\top, v_2^\top]^\top\): the top block is \(B^\top v_2 = \lambda v_1 = -\lambda \cdot (-v_1)\), and the bottom block is \(B(-v_1) = -\lambda v_2\). So \([-v_1^\top, v_2^\top]^\top\) is an eigenvector with eigenvalue \(-\lambda\). Every positive eigenvalue has a corresponding negative one of equal magnitude.
The Kronecker product with \(I_{d_K}\) means each such pair \(\pm\lambda_i\) appears with multiplicity \(d_K\). So the T-functional Hessian alone contributes at least \(2 \cdot \mathrm{rank}(B) \cdot d_K\) non-zero eigenvalues, split evenly between positive and negative. This indefiniteness is intrinsic to the dual parameterization. It is not a consequence of softmax specifically; you would get it in linear attention with separate \(W_Q, W_K\) as well.
The paper notes that this is structurally identical to what Singh et al. (2021) found for the functional Hessian of a two-layer linear MLP. That work showed eigenvalues come in \(\pm\lambda\) pairs with multiplicity \(d_K\) at each layer. In self-attention with two query-key matrices, the same structure reappears, not because of depth but because the product \(W_Q W_K^\top\) is itself a two-matrix composition. A single attention layer with dual parameterization inherits the eigenvalue structure of a two-layer MLP, confined entirely to the query-key block.
Figure 3. The T-functional Hessian \(H_f^T\) has zero diagonal blocks and off-diagonal \(B, B^\top\). This block-hollow structure forces eigenvalues into symmetric \(\pm\lambda_i\) pairs, each with multiplicity \(d_K\). The same structure appears in the functional Hessian of a two-layer linear MLP. It is entirely absent when queries and keys are merged into a single \(W_{QK}\).
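Both claims, the \(\pm\lambda\) pairing and the multiplicity \(d_K\), can be verified numerically on a random block-hollow matrix. A small sketch (the block sizes here are arbitrary choices of mine, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, d_K = 5, 3, 2              # illustrative block sizes
B = rng.standard_normal((q, p))  # random off-diagonal block

# block-hollow core: zero diagonal blocks, B and B^T off the diagonal
M = np.block([[np.zeros((p, p)), B.T],
              [B, np.zeros((q, q))]])
H = np.kron(M, np.eye(d_K)) / np.sqrt(d_K)  # shape of the T-functional Hessian

eigs = np.sort(np.linalg.eigvalsh(H))

# eigenvalues come in symmetric +/- lambda pairs
assert np.allclose(eigs, -eigs[::-1])

# each pair appears with multiplicity d_K:
# non-zero eigenvalue count = 2 * rank(B) * d_K
nonzero = int(np.sum(np.abs(eigs) > 1e-10))
assert nonzero == 2 * np.linalg.matrix_rank(B) * d_K
```

The non-zero eigenvalues of the core matrix are exactly \(\pm\) the singular values of \(B\), which is another way of seeing why the pairing is forced.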
Rank bounds: curvature lives in a low-dimensional subspace
There is a second structural consequence that Lemma D.2 makes precise. The T-outer-product Hessian \(H_o^T\), even when \(W_Q\) and \(W_K\) are full rank, has bounded rank:

\[ \mathrm{rank}(H_o^T) \le 2\, d_K d_V - d_K^2. \]

The bound follows from the rank of \(V\) in the expression \(H_o^T = V^\top U V / d_K\). Even if \(U\) is full rank, the rank of the product is capped by the rank of \(V\), and the structured form of \(V\) limits this: each of its two blocks has rank \(d_K d_V\), and their column spaces overlap in a \(d_K^2\)-dimensional subspace.
To see what this means concretely: if \(d_K = d_V / 2\), the bound is \(2 \cdot (d_V/2) \cdot d_V - (d_V/2)^2 = 3d_V^2/4\). So at most three-quarters of the directions in parameter space have any curvature from the T-outer-product term. In multi-head attention where \(d_K = d_V / H\), the bound becomes roughly \(2d_V^2/H\), which is much smaller than \(d_V^2\) for the head counts used in practice. Curvature concentrates in a low-rank subspace; most directions are flat.
Figure 4. The maximum rank of the T-outer-product Hessian \(H_o^T\) as a fraction of \(d_V^2\), for different \(d_K / d_V\) ratios. In typical multi-head attention where \(d_K = d_V / H\), the curvature in the query-key outer-product Hessian is concentrated in a small subspace, with most directions entirely flat.
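The cap on the rank of \(V\) can also be checked numerically by constructing \(V\) directly from its definition, commutation matrix included. A sketch with toy dimensions; for generic full-rank \(W_Q, W_K\) the bound \(2 d_K d_V - d_K^2\) is attained (the `commutation` helper is my own, implementing the standard \(K_{m,n}\,\mathrm{vec}(A) = \mathrm{vec}(A^\top)\) identity):

```python
import numpy as np

rng = np.random.default_rng(2)
d_V, d_K = 8, 4  # toy sizes; d_K = d_V / 2 as in the worked example

W_Q = rng.standard_normal((d_V, d_K))
W_K = rng.standard_normal((d_V, d_K))

def commutation(m, n):
    # K @ vec(A) = vec(A.T) for A in R^{m x n}, column-major vec
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[j + i * n, i + j * m] = 1.0
    return K

I_V = np.eye(d_V)
# V = [(W_Q kron I_V) K_{d_K,d_V},  I_V kron W_K]
V = np.hstack([np.kron(W_Q, I_V) @ commutation(d_K, d_V),
               np.kron(I_V, W_K)])

# rank of V caps the rank of H_o^T = V^T U V / d_K
bound = 2 * d_K * d_V - d_K ** 2
assert np.linalg.matrix_rank(V) == bound  # generically tight
```

With \(d_K = d_V/2 = 4\) the rank comes out to \(48 = 3 d_V^2 / 4\), matching the arithmetic above. The commutation matrix is an invertible permutation, so it does not change the rank; the cap comes entirely from the two Kronecker blocks and their overlap.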
What the combination implies for optimization
Putting the two results together gives a picture of the query-key Hessian that is more intricate than it first appears. The T-outer-product term \(H_o^T\) is positive semi-definite but low-rank: only a small fraction of parameter space experiences curvature from it. The T-functional term \(H_f^T\) is indefinite, with eigenvalues in \(\pm\lambda\) pairs, and exists only because of the dual parameterization. The full query-key Hessian is the sum of these two.
This produces an unusual combination: a landscape that is simultaneously indefinite and sparse in curvature. Most directions in parameter space are flat from the outer-product term. The few directions that are curved have mixed sign because of the T-functional term. There is no single "upward" or "downward" direction: the loss curves up in some directions and down in others, and most directions have zero curvature from the leading term.
This matters for optimizer design in a concrete way. An optimizer like SGD, which updates all parameters with the same learning rate scaled by gradient magnitude, has no mechanism to handle the directional heterogeneity here. Adaptive optimizers like Adam, which maintain per-parameter statistics, are better suited: they can implicitly assign different effective learning rates to directions with high curvature versus flat directions. The paper discusses this connection in more detail in the context of the block-heterogeneity induced by softmax. But the dual parameterization adds another layer: even within the query-key block, the curvature is non-uniform and indefinite in ways that make simple optimization difficult.
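To make the SGD-versus-Adam contrast concrete, here is a deliberately tiny sketch of my own (not an experiment from the paper): a diagonal quadratic with one stiff direction, one negative-curvature direction, and two flat ones. SGD step sizes track curvature, while an Adam-style per-coordinate normalization \(g/\sqrt{g^2}\) (ignoring momentum and moment averaging) equalizes them wherever the gradient is non-zero.

```python
import numpy as np

# diagonal quadratic: loss = 0.5 * sum(h_i * w_i^2), per-direction curvatures h
h = np.array([10.0, -0.5, 0.0, 0.0])  # stiff, negative, and two flat directions
w = np.array([1.0, 1.0, 1.0, 1.0])

g = h * w  # gradient of the quadratic at w

# SGD step sizes scale with gradient magnitude, hence with curvature here
sgd_step = 0.1 * np.abs(g)

# Adam-style normalization (no momentum, no EMA): g / sqrt(g^2 + eps)
adam_step = 0.1 * np.abs(g) / np.sqrt(g * g + 1e-12)

assert np.isclose(sgd_step[0] / sgd_step[1], 20.0)  # 20x step-size spread
assert np.allclose(adam_step[:2], 0.1)              # equalized where g != 0
assert np.allclose(adam_step[2:], 0.0)              # flat directions untouched
```

One learning rate cannot serve a 20x spread in curvature; the normalized update does not see that spread at all. That is the mechanism behind the "better suited" claim, stripped to one line of algebra.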
The functional equivalence that isn't
I keep returning to a broader point this makes. In machine learning we usually say two parameterizations are equivalent if they can represent the same set of functions. And in that sense \((W_Q, W_K)\) and \(W_{QK}\) are equivalent: every product \(W_Q W_K^\top / \sqrt{d_K}\) is a valid \(W_{QK}\), and every \(W_{QK}\) with rank at most \(d_K\) can be written as such a product.
But the loss surface is not determined by function class alone. It is determined by the parameterization. Two parameterizations that cover the same functions can curve differently, be more or less indefinite, have more or less rank. The Hessian is sensitive to how the parameters enter the computation, not only to what computation they perform. \(W_Q W_K^\top\) enters the score computation as a product of two independent matrices, both of which are differentiated separately. \(W_{QK}\) enters as a single matrix. The chain rule produces different second-order terms in the two cases, and those differences accumulate into the T-functional Hessian.
The dual-matrix choice was motivated by interpretability and efficiency. Queries and keys are meaningful objects in the attention picture: a query is what a token is looking for, a key is what a token offers to be found by. Keeping them separate preserves that semantic structure. Using \(d_K < d_V\) reduces parameters and regularizes the score matrix to be low-rank. These are good reasons. But as a side effect, the parameterization introduces \(\pm\lambda\) eigenvalue pairs with multiplicity \(d_K\) and concentrates outer-product curvature into a rank-limited subspace. These are not bugs. They are precise geometric consequences of the design choice.
Connecting the two days
Yesterday's question was about activation functions. Softmax versus sigmoid, and how that choice governs the attention moment structure and the block-heterogeneity of the Hessian across parameter groups. Today's question is about parameterization. Two matrices versus one, and how that choice governs the definiteness and rank of the query-key Hessian itself.
Both are instances of the same observation: the Transformer's training behavior is not separable from its specific structural choices. Function class does not determine optimization geometry. The row normalization of softmax creates distributions over tokens and induces moment matrices that make blocks heterogeneous. The product parameterization of \(W_Q W_K^\top\) creates a T-level composition that introduces indefiniteness and rank concentration. Layer normalization reduces data growth rates. Each choice leaves a signature in the second-order structure of the loss that shapes what optimizers experience when training the model.
The paper's title asks what it means to be a Transformer. Having now spent two days inside a single section of it, my answer is getting more specific. It means committing to a set of structural decisions (softmax row normalization, dual query-key matrices, pre-layer normalization), each of which has measurable consequences for curvature, indefiniteness, and rank. Those consequences are not incidental. They are part of what the Transformer is.
One matrix or two is not a minor implementation detail. It is a curvature choice. And I find it interesting that the original Attention Is All You Need paper almost certainly made that choice for semantic and efficiency reasons, not geometric ones. The geometry came along for free, for better and for worse.
References and links
- Ormaniec, W., Dangel, F., & Singh, S. P. (2025). What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis. International Conference on Learning Representations (ICLR 2025). Code repository: github.com/dalab/transformer-hessian.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
- Singh, S. P., Bachmann, G., & Hofmann, T. (2021). Analytic Insights into Structure and Rank of Neural Network Hessian Maps. Advances in Neural Information Processing Systems.
- Singh, S. P., Hofmann, T., & Schölkopf, B. (2023). The Hessian Perspective into the Nature of Convolutional Neural Networks. International Conference on Machine Learning.
- Ahn, K., Cheng, X., Song, M., Yun, C., Jadbabaie, A., & Sra, S. (2024). Linear Attention is (Maybe) All You Need (to Understand Transformer Optimization). International Conference on Learning Representations.
- My previous post on softmax versus sigmoid in attention: The Sigmoid Mistake.