I still remember the exact moment. It was the Advanced Deep Learning midsemester exam, question three: a lengthy numerical attention problem, mostly easy. I was moving fast, confident, writing the query-key product, scaling by the square root of the key dimension, and then, without thinking, I wrote:
\[ A = \sigma\!\left(\frac{QK^\top}{\sqrt{d_k}}\right). \]
Sigmoid. I wrote sigmoid.
I realized it about forty minutes later, three questions down, when something nagged at the back of my mind and I flipped back. There it was: \(\sigma\) instead of softmax, sitting there, perfectly confident in its wrongness. I fixed it, added a note, moved on. The marks stung, but what lingered longer was the question I could not shake: how bad is it, really? What does sigmoid actually break in the attention mechanism, and is the choice of softmax just convention, or is there something mathematically deep going on?
This post is the answer I eventually built for myself, and it is longer and stranger than I expected.
Two functions that look the same from a distance
Let me start with what I already knew, which was apparently not enough to save me on the exam. Both sigmoid and softmax take real numbers and map them into \((0,1)\). Both are smooth, differentiable, and well-behaved. If you squint, the formulas look related:
\[ \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}. \]
But there is one structural difference that changes everything. Sigmoid is applied elementwise. Softmax is applied rowwise to vectors. When we write softmax inside attention, we mean: take each row of the \(L \times L\) score matrix \(QK^\top / \sqrt{d_k}\), and normalize that row so its entries sum to exactly one.
With sigmoid, every entry gets independently squashed into \((0,1)\), with no interaction between entries, and no guarantee that any row sums to anything meaningful. With softmax, every row becomes a proper probability distribution over the \(L\) tokens in the sequence. That row of attention weights now says: when computing the output for token \(i\), here is how much weight to place on each other token, and those weights must sum to one.
Figure 1. Same score matrix, two different functions. Sigmoid squashes each entry independently: rows have no normalization and their sums vary freely. Softmax normalizes each row into a probability distribution over the sequence tokens, making each attention row a convex combination of the \(L\) positions.
That single property - rows sum to one - is what turns each row of the attention matrix into a distribution over the sequence. And that turns the weighted sum \(AXW_V\) into something geometrically meaningful: for each query token \(i\), the output is a convex combination of the value vectors, a properly weighted expectation. Sigmoid gives you no such structure. You are not computing an expectation under anything. You are just element-wise squashing followed by matrix multiplication.
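The contrast is easy to check numerically. Here is a minimal NumPy sketch (toy dimensions of my own choosing, not from the paper) that applies both functions to the same scaled score matrix and inspects the row sums:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_k = 4, 8                             # toy sequence length and key dimension
Q = rng.standard_normal((L, d_k))
K = rng.standard_normal((L, d_k))
S = Q @ K.T / np.sqrt(d_k)                # scaled score matrix Q K^T / sqrt(d_k)

sig = 1.0 / (1.0 + np.exp(-S))            # sigmoid: each entry squashed independently

E = np.exp(S - S.max(axis=1, keepdims=True))
soft = E / E.sum(axis=1, keepdims=True)   # softmax: each row normalized to sum to 1

print("sigmoid row sums:", sig.sum(axis=1))   # vary freely, anywhere in (0, L)
print("softmax row sums:", soft.sum(axis=1))  # exactly 1.0 for every row
```

Each softmax row is a convex combination of the \(L\) positions; the sigmoid rows carry no such guarantee, and that is precisely the structural difference the rest of this post turns on.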
Normalization is load-bearing structure
After the exam, I kept asking myself: so what? Empirically, sigmoid-based attention and various softmax alternatives have been explored in the literature. Why does the standard persist so firmly? The geometric answer is clear enough: softmax creates simplex-valued rows, a known compact manifold with well-understood properties. Sigmoid gives you an unconstrained hypercube. But I wanted something more precise than geometry. I wanted to know what softmax does to the optimization landscape of the whole model. That question led me somewhere I did not expect.
The paper I found a few weeks too late
While preparing for the final, I came across a paper published at ICLR 2025 by Ormaniec, Dangel, and Singh, titled "What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis." The title felt like it had been written for my post-exam state of mind. The paper does something I had not seen done before: it derives the full, exact Hessian of the loss with respect to all weight matrices in a single self-attention layer, and then uses that Hessian structure to explain the standard training practices of Transformer models. Why Adam and not SGD. Why layer normalization. Why learning rate warmup. All from first principles, all traceable back to the specific mathematical structure of the self-attention mechanism.
The Hessian of the loss, in case you have not spent time with it lately, is the matrix of all second-order derivatives of the loss with respect to the model parameters. It encodes the local curvature of the loss surface: how quickly gradients are changing, which directions are steep or flat, whether we are near a minimum or a saddle. For MLPs and CNNs, this object has been studied extensively going back decades. For Transformers, there was surprisingly little rigorous theoretical treatment before this work. The authors begin to fill that gap, and the main thing they find is that softmax is the core architectural ingredient responsible for what makes the Transformer's Hessian structurally unusual.
The Hessian has blocks, and they are not equal
The paper sets up a single self-attention layer: \(F(X) = A(X)XW_V\) where \(A(X) = \text{softmax}(XW_Q W_K^\top X^\top / \sqrt{d_K})\). The learnable parameters are \(W_Q, W_K \in \mathbb{R}^{d_V \times d_K}\) and \(W_V \in \mathbb{R}^{d_V \times d_V}\). The loss Hessian decomposes into blocks for each pair of parameter matrices, and further splits via the Gauss-Newton decomposition into an outer-product Hessian \(H_o\) and a functional Hessian \(H_f\):
\[ H = H_o + H_f. \]
The outer-product Hessian is always positive semi-definite-it is built from outer products of Jacobians. The functional Hessian carries second-order derivatives of the network itself and can be indefinite. Now, naively, you might expect the diagonal Hessian blocks \(H(W_V, W_V)\) and \(H(W_Q, W_Q)\) to be roughly comparable in magnitude. They are not. They differ by approximately two orders of magnitude.
This is the key empirical finding in the paper: when you look at histograms of the absolute entries of each diagonal block evaluated on a GPT-2 Transformer at initialization, the value block \(H(W_V, W_V)\) has entries concentrated around \(10^{-3}\), while the query block \(H(W_Q, W_Q)\) has entries concentrated around \(10^{-5}\). The two distributions barely overlap. The curvature experienced by the value parameters and the curvature experienced by the query parameters are completely different scales.
Figure 2. Schematic Hessian block magnitude distributions, following the empirical finding of Ormaniec et al. (2025, Figure 1), evaluated on a single-block GPT-2 Transformer at initialization. The query block entries are approximately two orders of magnitude smaller than the value block entries. This block heterogeneity is directly caused by the softmax nonlinearity.
The root of the heterogeneity: data dependence
The theoretical heart of the paper is a precise characterization of how each Hessian block depends on the input data \(X\). Because \(X\) enters self-attention three times-as queries, keys, and values-the data dependence of the Hessian is highly nonlinear and varies significantly across blocks. Writing the dependence in big-\(\mathcal{O}\) notation (where \(\mathcal{O}(X^k)\) means all entries of the block scale as the \(k\)-th power of the entry magnitude of \(X\)), the outer-product Hessian satisfies:
\[ H_o \in \begin{array}{r|ccc} & Q & K & V \\ \hline Q & \mathcal{O}(X^6) & \mathcal{O}(X^6) & \mathcal{O}(X^4) \\ K & \cdot & \mathcal{O}(X^6) & \mathcal{O}(X^4) \\ V & \cdot & \cdot & \mathcal{O}(X^2) \end{array} \]
And for the functional Hessian, omitting a factor involving the residual:
\[ H_f \in \begin{array}{r|ccc} & Q & K & V \\ \hline Q & \mathcal{O}(X^5) & \mathcal{O}(X^5 + X^3) & \mathcal{O}(X^3) \\ K & \cdot & \mathcal{O}(X^5) & \mathcal{O}(X^3) \\ V & \cdot & \cdot & \mathcal{O}(1) \end{array} \]
Look at the extremes. The value diagonal block is dominated by \(\mathcal{O}(X^2)\) from the outer-product Hessian (the functional Hessian's value block is exactly zero). The query and key diagonal blocks grow as \(\mathcal{O}(X^6)\) from the outer-product Hessian. At initialization, when token embeddings \(X\) have small magnitude-say standard deviation \(\sigma \approx 0.1\) or so-a term of order \(\sigma^6 \approx 10^{-6}\) is vastly smaller than a term of order \(\sigma^2 \approx 10^{-2}\). This is precisely why the query Hessian block is so much smaller than the value block at initialization: it lives in a higher-power regime that is numerically tiny when data magnitudes are small. The paper verifies this by systematically varying the initialization scale \(\sigma\) and measuring how the Frobenius norm of each block scales, finding clean agreement with the theoretical predictions.
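You can reproduce the block-scale gap on a toy example without any of the paper's machinery. The sketch below (my own construction, with made-up tiny dimensions and a squared-error loss) builds a single softmax attention layer in NumPy, computes the \(H(W_Q, W_Q)\) and \(H(W_V, W_V)\) blocks by finite differences, and compares their entry magnitudes at a small initialization scale:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 3
sigma = 0.1                                    # small embedding scale at init
X = sigma * rng.standard_normal((L, d))
Y = rng.standard_normal((L, d))                # fixed regression target
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
Wv = rng.standard_normal((d, d)) / np.sqrt(d)

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss(Wq_, Wv_):
    A = softmax_rows(X @ Wq_ @ Wk.T @ X.T / np.sqrt(d))
    return 0.5 * np.sum((A @ X @ Wv_ - Y) ** 2)

def hessian_block(which, h=1e-3):
    """Diagonal Hessian block via central second differences of the loss."""
    W0 = Wq if which == "q" else Wv
    n = W0.size
    H = np.zeros((n, n))
    def f(delta):
        W = W0 + delta.reshape(W0.shape)
        return loss(W, Wv) if which == "q" else loss(Wq, W)
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(ei + ej) - f(ei - ej)
                       - f(-ei + ej) + f(-ei - ej)) / (4 * h * h)
    return H

Hq, Hv = hessian_block("q"), hessian_block("v")
print("mean |H(Wq,Wq)| =", np.abs(Hq).mean())  # orders of magnitude smaller
print("mean |H(Wv,Wv)| =", np.abs(Hv).mean())
```

Even at this tiny scale, the value block's entries dwarf the query block's, exactly as the \(\mathcal{O}(X^2)\) versus \(\mathcal{O}(X^6)\) analysis predicts.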
Figure 3. Data dependence of the outer-product Hessian diagonal blocks for softmax attention versus linear attention. Softmax introduces extreme block heterogeneity: query and key blocks depend on \(\mathcal{O}(X^6)\) while the value block depends on \(\mathcal{O}(X^2)\). Removing softmax makes all blocks depend uniformly on the cube of the intra-sequence covariance, eliminating the heterogeneity entirely.
And this is the key point I had been missing. If you remove softmax-whether you replace it with sigmoid, with a linear identity, or with anything else that does not enforce row normalization-the heterogeneity disappears. The outer-product Hessian blocks all depend uniformly on \(\mathcal{O}(\Sigma_{XX}^3)\) where \(\Sigma_{XX} = X^\top X / L\) is the intra-sequence covariance. Every block grows at the same rate. No two-orders-of-magnitude spread. No structural asymmetry between query-key and value parameters.
Softmax is what breaks the symmetry. The query and key matrices interact with the derivative of the softmax in ways that multiply the data dependence, because those matrices parameterize the argument of the softmax. The value matrix sits outside the softmax entirely and contributes only linearly. That distinction-inside versus outside the normalizing nonlinearity-is the structural cause of the entire block heterogeneity.
Softmax makes the Hessian more indefinite
There is a second consequence that I find even more conceptually striking. In the Gauss-Newton decomposition, the outer-product term \(H_o\) is always positive semi-definite: it is a sum of Gram matrices and can only have non-negative eigenvalues. The functional Hessian \(H_f\), which carries second derivatives of the network itself, can be indefinite.
For MLPs with piecewise linear activations like ReLU, the functional Hessian has what the authors call a block-hollow structure: the diagonal blocks are zero. This means the Hessian diagonal blocks for MLP parameters are determined entirely by \(H_o\), and they are therefore positive semi-definite. Non-negative eigenvalues everywhere.
For self-attention with softmax, this is no longer true. The functional Hessian \(H_f\) has non-zero diagonal blocks for the query and key parameters-specifically, the query diagonal block depends on second derivatives of the softmax, which are non-trivial. These blocks are indefinite, carrying both positive and negative eigenvalues. Park and Kim (2022) had already observed empirically that the Transformer Hessian is more indefinite than that of CNNs. This paper provides the precise theoretical mechanism: it is the softmax nonlinearity that prevents the block-hollow structure and introduces indefiniteness into the diagonal Hessian blocks.
Removing softmax restores the block-hollow structure to the functional Hessian's diagonal, and the Hessian blocks become positive semi-definite again-structurally much more like an MLP. With my sigmoid mistake on the exam, I had essentially written down a model whose Hessian is qualitatively closer to an MLP's than to a proper Transformer's.
Attention moments: a statistical structure I had not seen before
One of the most elegant things in the paper is how the data terms in the Hessian are organized through what the authors call attention moment matrices. Because each row of the softmax attention matrix \(A\) is a probability distribution over the \(L\) input tokens, you can define statistical moments of those per-row distributions-just as you would define the mean and variance of a probability distribution over numbers.
The first attention moment is the attention-weighted average of the token embeddings:
\[ M_1 := AX = \bigl[A_{i,:}^\top X\bigr]_{1 \le i \le L} \in \mathbb{R}^{L \times d_V}. \]
This is exactly the context vector from the standard attention mechanism-the output before the value projection. It is also the quantity that appears in the value outer-product Hessian. The second central moment matrix \(M_2\) captures the attention-weighted covariance of the token embeddings around the attention mean, and the third central moment matrix \(M_3\) captures the corresponding skewness analogue.
These three moment matrices appear in different parts of the Hessian in a beautifully stratified way:
- The value outer-product Hessian \(H_o(W_V, W_V)\) depends on \(M_1\), the first moment-the basic weighted average.
- The query-key outer-product Hessian depends on \(M_2\), the second central moment-which measures how spread out the attention distribution is across the token sequence.
- The query-key functional Hessian depends on \(M_3\), the third central moment-the most complex statistical structure of all.
This means the curvature experienced by the query and key parameters is sensitive to higher-order statistics of the attention distribution than the value parameters. A model that has learned sharply peaked attention-mostly attending to one token-will have \(M_2 \approx 0\) and \(M_3 \approx 0\), giving near-zero curvature in the query-key directions. A model with diffuse, spread-out attention will have substantial \(M_2\) and non-trivial \(M_3\), giving meaningful curvature to the query-key parameters.
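A small NumPy sketch makes the peaked-versus-diffuse contrast concrete. Below, \(M_1\) and \(M_2\) are computed directly from their definitions (the helper name `attention_moments` and the toy dimensions are my own); one-hot attention rows stand in for the sharply peaked case, uniform rows for the maximally diffuse one:

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 5, 3
X = rng.standard_normal((L, d))               # toy token embeddings

def attention_moments(A, X):
    """First moment M1 (attention mean) and second central moment M2
    (attention covariance) of each row's distribution over the L tokens."""
    M1 = A @ X                                 # (L, d): one mean per query row
    M2 = np.empty((A.shape[0], X.shape[1], X.shape[1]))
    for i in range(A.shape[0]):
        C = X - M1[i]                          # tokens centered at row i's mean
        M2[i] = C.T @ (A[i, :, None] * C)      # sum_j A_ij (x_j - m_i)(x_j - m_i)^T
    return M1, M2

A_peaked = np.eye(L)                           # each row attends to exactly one token
A_diffuse = np.full((L, L), 1.0 / L)           # maximally spread-out attention

_, M2_peaked = attention_moments(A_peaked, X)
_, M2_diffuse = attention_moments(A_diffuse, X)
print(np.abs(M2_peaked).max())                 # exactly 0: no spread at all
print(np.abs(M2_diffuse).max())                # nonzero attention covariance
```

With one-hot rows, \(M_2\) vanishes identically, so the query-key curvature terms it feeds vanish with it; with uniform rows, each \(M_2[i]\) is just the (biased) empirical covariance of the token embeddings.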
None of this structure exists if you replace softmax with sigmoid. Sigmoid does not create probability distributions over rows. It does not define an attention mean, or attention covariance, or any higher moment in this precise statistical sense. The entire moment-theoretic characterization of the Hessian-arguably the paper's most elegant contribution-is a direct consequence of the row-normalization that softmax provides and sigmoid does not.
Figure 4. The self-attention Hessian is stratified by attention moment matrices. The value Hessian depends on the first moment (the familiar context vector). The query-key outer-product Hessian depends on the second central moment. The query-key functional Hessian involves the third central moment. This structure is unique to softmax attention and has no analogue when softmax is replaced by sigmoid or any element-wise function.
What this implies for how Transformers are trained
Once you see the Hessian structure, a lot of standard Transformer training practice stops feeling like folklore and starts making sense from first principles.
The block heterogeneity-query-key blocks at \(\mathcal{O}(X^6)\) versus value blocks at \(\mathcal{O}(X^2)\)-means that at initialization, with small token embedding magnitudes, the curvature in the query-key directions is orders of magnitude smaller than in the value directions. These are completely different optimization regimes within the same layer. An optimizer like SGD with a fixed learning rate for all parameters has no way to cope with this: a step size tuned to the value parameters barely moves the nearly flat query-key directions, while a step size tuned to the query-key parameters is catastrophically large for the value directions. You cannot win with one global learning rate.
Adam, by maintaining per-parameter exponential moving averages of squared gradients and dividing by them, naturally assigns different effective learning rates to parameters with different gradient scales. It is not that Adam is theoretically superior for arbitrary optimization problems. It is that Adam is specifically well-matched to the heterogeneous curvature structure that softmax creates. Zhang et al. (2024) showed empirically that the Hessian spectra of Transformer blocks are far more heterogeneous across diagonal blocks than those of CNNs-precisely the block-heterogeneity that this paper explains theoretically.
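A single Adam step already shows the equalizing effect. The sketch below implements the standard update rule from Kingma & Ba (2015) for a scalar parameter and feeds it two toy gradients whose magnitudes mimic the value versus query-key scales (the specific numbers are illustrative, not from the paper):

```python
import numpy as np

def adam_update(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step (Kingma & Ba, 2015) for a scalar parameter."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)                  # bias-corrected second moment
    return lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Gradient scales mimicking the value vs. query-key blocks at initialization.
g_value, g_query = 1e-2, 1e-5
step_value, _, _ = adam_update(g_value, 0.0, 0.0, t=1)
step_query, _, _ = adam_update(g_query, 0.0, 0.0, t=1)

print(g_value / g_query)         # gradients differ by a factor of 1000
print(step_value / step_query)   # effective steps nearly identical (ratio ~1.001)
```

SGD would take steps 1000x apart for these two parameters; Adam's division by the root of the squared-gradient average collapses that spread to a fraction of a percent, which is exactly the property the heterogeneous curvature demands.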
The case for layer normalization follows a similar logic. The paper shows that stacking \(D\) linear self-attention layers (no softmax) makes the Hessian diagonal blocks depend on \(\mathcal{O}(\Sigma_{XX}^{3^D})\): for \(D=2\) that is \(\mathcal{O}(X^{18})\), for \(D=3\) it is \(\mathcal{O}(X^{54})\). Super-exponential growth with depth. With softmax and proper pre-layer normalization, the block-heterogeneity in these growth rates is substantially reduced: the exponents across different blocks become more similar, making the problem more tractable. Layer normalization is not simply a variance stabilization trick. It actively counteracts a structural consequence of the softmax-induced Hessian dynamics.
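The super-exponential exponents come from a simple composition effect: a linear attention layer is exactly degree-3 homogeneous in its input, so stacking \(D\) of them yields a map of degree \(3^D\), and the Hessian's data dependence compounds accordingly. A short NumPy sketch (toy dimensions, my own construction, not the paper's experiment) verifies the homogeneity degree directly:

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 4, 3
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

def linear_attention(X):
    # No softmax: the raw score matrix multiplies the values directly,
    # making the layer exactly degree-3 homogeneous in X.
    return (X @ Wq @ Wk.T @ X.T / np.sqrt(d)) @ X @ Wv

X = rng.standard_normal((L, d))
F1 = lambda X_: linear_attention(X_)
F2 = lambda X_: linear_attention(linear_attention(X_))

# Doubling the input scales the output by 2**(3**D).
r1 = np.linalg.norm(F1(2 * X)) / np.linalg.norm(F1(X))
r2 = np.linalg.norm(F2(2 * X)) / np.linalg.norm(F2(X))
print(r1, r2)   # 8 and 512: degrees 3 and 9
```

Three stacked layers would already give degree 27, which is why normalization between layers is not optional once depth enters the picture.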
Back to the exam mistake
So here is what I would have written, if I had understood all of this before question three.
Sigmoid is not merely wrong in the sense of violating convention. Writing sigmoid instead of softmax in the attention computation breaks the row-normalization structure that makes each row of the attention matrix a probability distribution. It eliminates the attention moments-the precise statistical objects that govern the curvature of the loss surface for query and key parameters. It makes the Hessian blocks uniform across parameter groups rather than heterogeneous, qualitatively transforming the optimization landscape. A model with sigmoid attention does not need Adam in the same essential way a model with softmax attention does. Its Hessian is closer to an MLP's in structure: less indefinite, more homogeneous. Perhaps easier to train with simpler optimizers. But it is not a Transformer in the mathematically meaningful sense.
The paper's title asks: "What does it mean to be a Transformer?" The Hessian answer is: it means having a loss landscape shaped by softmax in ways that propagate through three orders of attention moments, create heterogeneous curvature between parameter blocks by factors of hundreds at initialization, and make certain diagonal Hessian blocks indefinite in ways that make adaptive optimizers effectively necessary. Sigmoid gives you none of that. It gives you a simpler, more uniform curvature structure, and a model that is functionally something else.
A personal note on what bad answers teach you
I think I understand the attention mechanism better for having made this mistake than I would have if I had written softmax automatically and moved on. The automatic answer is "softmax because it normalizes rows and gives a probability distribution." The deeper answer is "softmax because it creates probability distributions over tokens, which induces attention moments, which create a stratified Hessian structure with heterogeneous curvature that is responsible for the characteristic training dynamics of every Transformer ever trained."
Those are very different levels of understanding. The mistake pushed me toward the second one.
I do not recommend failing exam questions as a general learning strategy. But I do think there is something to be said for tracing a wrong answer all the way to its structural roots. "How wrong was I, exactly?" turns out to be a much more productive question than "what was the right answer?" because the first question leads into the mathematics, and the second just leads to a formula that you write correctly next time without knowing why.
The next time someone asks me why we use softmax in attention, I will not say convention, or normalization, or because it works empirically. I will say: softmax is the specific function that creates a probability distribution over tokens at each position, and that distribution generates a stratified Hessian landscape governed by first, second, and third attention moments, block-heterogeneous in curvature, partially indefinite, and requiring adaptive optimization in a structurally precise sense. That is what the choice of softmax actually means. It took writing the wrong symbol in an exam, and a few weeks of reading afterward, to make me see it clearly.
References and links
- Ormaniec, W., Dangel, F., & Singh, S. P. (2025). What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis. International Conference on Learning Representations (ICLR 2025). Code repository: github.com/dalab/transformer-hessian.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems. Paper link.
- Zhang, Y., Chen, C., Ding, T., Li, Z., Sun, R., & Luo, Z.-Q. (2024). Why Transformers Need Adam: A Hessian Perspective. Advances in Neural Information Processing Systems. Paper link.
- Park, N., & Kim, S. (2022). How Do Vision Transformers Work? International Conference on Learning Representations. Paper link.
- Noci, L., Anagnostidis, S., Biggio, L., Orvieto, A., Singh, S. P., & Lucchi, A. (2022). Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse. Advances in Neural Information Processing Systems. Paper link.
- Singh, S. P., Bachmann, G., & Hofmann, T. (2021). Analytic Insights into Structure and Rank of Neural Network Hessian Maps. Advances in Neural Information Processing Systems. Paper link.
- Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations. Paper link.
- Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., & Liu, T. (2020). On Layer Normalization in the Transformer Architecture. International Conference on Machine Learning. Paper link.