Martingales in Optimization: Why SGD Doesn't Drift Randomly

21 March, 2026

It started, as many things do, with a Wikipedia tab I did not intend to open. I was reading through a convergence proof for stochastic gradient descent, the kind of proof that appears in the appendix of a paper you are trying to understand quickly, and I kept encountering the word martingale. Not in passing. It appeared in lemmas, in the core argument, as though it were assumed knowledge. I had a vague recollection from probability: something about gambling, something about fair games. I clicked through. Then through again. By the time I looked up it was well past midnight and I had tabs open on optional stopping, Doob's decomposition, the law of large numbers for martingales, and the Robbins-Siegmund theorem. This is the post I wished had existed before I went down that spiral.

The question I kept returning to was simple. SGD is noisy. It does not follow the gradient exactly; it follows a stochastic estimate of it, computed on a random mini-batch. That noise accumulates over thousands of steps. Yet the algorithm converges. Why does the noise not send the iterates drifting off in some random direction? What is the structure that prevents pure diffusion? The answer, it turns out, is martingales. Specifically, the noise in SGD is not ordinary random noise. It has a very particular zero-mean structure that makes it a martingale difference sequence, and that structure is the precise reason it does not accumulate directional drift.

What a martingale is

The easiest entry point is through gambling. Suppose you are playing a fair game: at each step, you either win or lose one unit, each with probability one-half. Your fortune after \(t\) steps is \(M_t\). The question is: given everything that has happened up to step \(t\), what is your expected fortune at step \(t+1\)?

\[ \mathbb{E}[M_{t+1} \mid M_0, M_1, \ldots, M_t] = M_t. \]

Your expected future fortune equals your current fortune. Not more, not less. The game is fair. This is a martingale. Formally, a sequence \((M_t)_{t \geq 0}\) is a martingale with respect to a filtration \((\mathcal{F}_t)_{t \geq 0}\) if it is adapted (each \(M_t\) is measurable with respect to \(\mathcal{F}_t\), meaning it depends only on information available at time \(t\)), integrable (\(\mathbb{E}[|M_t|] < \infty\) for all \(t\)), and satisfies:

\[ \mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = M_t. \]

The filtration \(\mathcal{F}_t\) is the technical object for "everything known up to time \(t\)." In SGD, it will be the sigma-algebra generated by the initial parameters and all the random data batches seen so far: \(\mathcal{F}_t = \sigma(\theta_0, z_0, z_1, \ldots, z_{t-1})\). The filtration grows over time as more information accumulates.

Two variants are useful here. A supermartingale satisfies \(\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] \leq M_t\) - the process tends to decrease in expectation, like a biased casino game. A submartingale satisfies the reverse inequality. Supermartingales will be the key object in the convergence analysis, because the distance from the optimum is exactly the kind of process that decreases in expectation under a well-behaved optimizer.

A related concept is a martingale difference sequence: a sequence \((\xi_t)\) where each \(\xi_t\) has zero conditional mean given the past, \(\mathbb{E}[\xi_t \mid \mathcal{F}_{t-1}] = 0\). The increments of a martingale form a martingale difference sequence. And crucially, any sum of martingale differences \(M_T = \sum_{t=0}^{T-1} \xi_t\) is itself a martingale: its expected future value equals its current value, because each new term contributes zero in conditional expectation.
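A quick numerical sanity check makes this concrete. The sketch below (plain NumPy, a fair \(\pm 1\) coin game, sizes chosen arbitrarily) sums martingale differences and confirms that the mean of the resulting martingale stays pinned at zero even as the paths spread out:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fair coin game: each increment xi_t is +1 or -1 with probability 1/2,
# so E[xi_t | past] = 0: a martingale difference sequence.
n_paths, n_steps = 100_000, 50
xi = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))

# The running sum M_t of martingale differences is itself a martingale.
M = np.cumsum(xi, axis=1)

# Fair game: E[M_t] = M_0 = 0 at every step t ...
assert np.all(np.abs(M.mean(axis=0)) < 0.1)

# ... yet the paths still spread out: Var(M_t) = t.
assert abs(M[:, -1].var() - n_steps) < 2.0
```

No directional drift, but plenty of spread - exactly the behavior the noise in SGD will turn out to have.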

SGD as a stochastic process

The standard SGD update is:

\[ \theta_{t+1} = \theta_t - \eta_t\, g_t(\theta_t;\, z_t), \]

where \(g_t(\theta_t; z_t)\) is the gradient of the loss computed on a random mini-batch \(z_t\), and \(\eta_t > 0\) is the learning rate at step \(t\). Define the objective function \(f(\theta) = \mathbb{E}_{z}[\ell(\theta; z)]\) as the expected loss over the data distribution. The stochastic gradient is an unbiased estimator of the true gradient:

\[ \mathbb{E}[g_t(\theta_t; z_t) \mid \mathcal{F}_t] = \nabla f(\theta_t). \]

This is the unbiasedness assumption, and it is fundamental. It says that the mini-batch gradient points, on average, in the right direction. Now split the stochastic gradient into its mean and its deviation from the mean:

\[ g_t(\theta_t; z_t) = \nabla f(\theta_t) + \xi_t, \]

where \(\xi_t = g_t(\theta_t; z_t) - \nabla f(\theta_t)\) is the noise term. By unbiasedness, \(\mathbb{E}[\xi_t \mid \mathcal{F}_t] = 0\). The noise has zero conditional mean given the history. This is exactly the definition of a martingale difference sequence.
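The zero-mean property can be verified exhaustively on a toy problem. The sketch below (a small least-squares objective with made-up sizes) averages the size-2 mini-batch gradient over every possible batch, drawn uniformly, and recovers the full gradient exactly:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Toy least-squares problem: f(theta) = (1/n) * sum_i 0.5 * (x_i . theta - y_i)^2.
n, d = 8, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
theta = rng.normal(size=d)

def grad(idx):
    """Gradient of the loss averaged over the examples in idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx)

full_grad = grad(np.arange(n))

# Average the mini-batch gradient over ALL batches of size 2: this is the
# conditional expectation E[g_t | theta_t], and it equals the full gradient,
# so the noise xi_t = g_t - grad f(theta_t) has zero conditional mean.
batches = [list(b) for b in combinations(range(n), 2)]
avg_batch_grad = np.mean([grad(b) for b in batches], axis=0)

assert np.allclose(avg_batch_grad, full_grad)
```

Any individual batch gradient can point far from the true gradient; only the average over batches is constrained.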

Substituting back, the SGD update becomes:

\[ \theta_{t+1} = \theta_t - \eta_t \nabla f(\theta_t) - \eta_t \xi_t. \]

Two terms. The first, \(-\eta_t \nabla f(\theta_t)\), is the deterministic gradient descent step: this is the signal, the part that points toward the optimum. The second, \(-\eta_t \xi_t\), is the noise contribution at step \(t\): this is random, with zero conditional mean. The cumulative noise after \(T\) steps is:

\[ \sum_{t=0}^{T-1} \eta_t \xi_t. \]

Because each \(\eta_t \xi_t\) has zero conditional mean given the past, this sum is a martingale. The noise accumulates, but it accumulates as a fair game. It oscillates without systematic directional bias. This is the first key insight: the stochastic part of SGD is not generic random noise. It is a martingale, which means it cannot drift in any preferred direction in expectation.

The Doob decomposition: splitting drift from noise

There is a general theorem by Doob that formalizes exactly this split. The Doob decomposition says: any adapted integrable process \((X_t)\) can be written uniquely as:

\[ X_t = X_0 + M_t + A_t, \]

where \((M_t)\) is a martingale with \(M_0 = 0\), and \((A_t)\) is a predictable process with \(A_0 = 0\). Predictable means that \(A_t\) is known at time \(t-1\): it depends only on the past, not on the current randomness. The decomposition is unique, and the two parts are given explicitly by:

\[ A_t = \sum_{s=1}^{t} \bigl(\mathbb{E}[X_s \mid \mathcal{F}_{s-1}] - X_{s-1}\bigr), \qquad M_t = X_t - X_0 - A_t. \]

The predictable part \(A_t\) accumulates the conditional expected increments: it is the total systematic drift of the process. The martingale part \(M_t\) is everything that is not predictable: the pure noise, the oscillation that has zero expected contribution at each step.
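A textbook example makes the split concrete: take \(X_t = S_t^2\) for a fair \(\pm 1\) random walk \(S_t\). Since \(\mathbb{E}[S_t^2 \mid \mathcal{F}_{t-1}] = S_{t-1}^2 + 1\), the predictable part is \(A_t = t\) and the martingale part is \(M_t = S_t^2 - t\). The simulation below (sizes arbitrary) checks this numerically:

```python
import numpy as np

rng = np.random.default_rng(2)

# S_t: a fair +/-1 random walk. X_t = S_t^2 grows systematically,
# and Doob splits that growth off cleanly: A_t = t, M_t = S_t^2 - t.
n_paths, n_steps = 200_000, 30
S = np.cumsum(rng.choice([-1.0, 1.0], size=(n_paths, n_steps)), axis=1)
t = np.arange(1, n_steps + 1)

M = S**2 - t  # candidate martingale part, predictable part A_t = t removed

# Martingale part: E[M_t] = M_0 = 0 at every step ...
assert np.all(np.abs(M.mean(axis=0)) < 0.5)
# ... so the entire systematic drift of X_t = S_t^2 lives in A_t = t.
assert np.all(np.abs((S**2).mean(axis=0) - t) < 0.5)
```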

Apply this to \(V_t = \|\theta_t - \theta^*\|^2\), the squared distance to the minimizer \(\theta^*\) of \(f\). Expanding the squared norm after one SGD step:

\[ \|\theta_{t+1} - \theta^*\|^2 = \|\theta_t - \theta^* - \eta_t g_t\|^2 = \|\theta_t - \theta^*\|^2 - 2\eta_t \langle g_t,\, \theta_t - \theta^* \rangle + \eta_t^2 \|g_t\|^2. \]

Taking conditional expectation and using \(\mathbb{E}[g_t \mid \mathcal{F}_t] = \nabla f(\theta_t)\):

\[ \mathbb{E}[V_{t+1} \mid \mathcal{F}_t] = V_t - 2\eta_t \langle \nabla f(\theta_t),\, \theta_t - \theta^* \rangle + \eta_t^2\, \mathbb{E}[\|g_t\|^2 \mid \mathcal{F}_t]. \]

Assuming a bounded second moment for the stochastic gradient, \(\mathbb{E}[\|g_t\|^2 \mid \mathcal{F}_t] \leq G^2\), and using strong convexity with parameter \(\mu > 0\), which gives \(\langle \nabla f(\theta_t), \theta_t - \theta^* \rangle \geq \mu \|\theta_t - \theta^*\|^2 = \mu V_t\):

\[ \mathbb{E}[V_{t+1} \mid \mathcal{F}_t] \leq (1 - 2\mu \eta_t)\, V_t + \eta_t^2 G^2. \]

This is the key inequality. In expectation, the descent term shrinks \(V_t\) by the factor \(1 - 2\mu\eta_t\) at each step. The noise term \(\eta_t^2 G^2\) is the price paid for using a stochastic gradient instead of the true one. The martingale part is the residual \(V_{t+1} - \mathbb{E}[V_{t+1} \mid \mathcal{F}_t]\): random, zero-mean, oscillating, not accumulating in any direction.
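The inequality can be checked by Monte Carlo on the simplest possible instance. The sketch below takes one SGD step on the quadratic \(f(\theta) = \tfrac{\mu}{2}\|\theta\|^2\) with additive Gaussian gradient noise (all constants made up for illustration). For a pure quadratic, both bounds used in the derivation hold with equality, so the simulated \(\mathbb{E}[V_{t+1} \mid \mathcal{F}_t]\) should sit right at the right-hand side:

```python
import numpy as np

rng = np.random.default_rng(3)

# One SGD step on f(theta) = (mu/2) * ||theta||^2, minimized at theta* = 0.
# Stochastic gradient: g_t = mu * theta_t + xi_t with xi_t ~ N(0, sigma^2 I),
# so E[g_t | F_t] = mu * theta_t = grad f(theta_t): unbiased.
mu, eta, sigma, d = 0.5, 0.1, 1.0, 4
theta_t = np.array([2.0, -1.0, 0.5, 1.5])
V_t = np.sum(theta_t**2)

n_samples = 500_000
xi = rng.normal(scale=sigma, size=(n_samples, d))
g = mu * theta_t + xi
theta_next = theta_t - eta * g
V_next = np.sum(theta_next**2, axis=1)

# Conditional second moment G^2 = E[||g_t||^2 | F_t] at this theta_t.
G2 = np.sum((mu * theta_t) ** 2) + d * sigma**2

# The bound from the text: E[V_{t+1} | F_t] <= (1 - 2*mu*eta) * V_t + eta^2 * G2.
# On a quadratic both inequalities in the derivation are tight, so the
# Monte Carlo mean should match the bound up to sampling error.
bound = (1 - 2 * mu * eta) * V_t + eta**2 * G2
assert abs(V_next.mean() - bound) < 0.01
```

For any non-quadratic strongly convex objective the same code would give a strict inequality rather than a near-match.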

The martingale difference in practice

I want to make this concrete. At each step \(t\), the noise \(\xi_t = g_t - \nabla f(\theta_t)\) can be large or small, positive or negative in each coordinate, depending on which data points happened to land in the mini-batch. There is no structure preventing any individual \(\xi_t\) from being large. What is constrained is the conditional expectation: averaged over all possible mini-batches that could be drawn at step \(t\), given the current parameters \(\theta_t\), the noise has zero mean.

This is a strong statement. It says the noise is not biased toward any direction. In particular, it does not systematically point away from the optimum, even when the parameters are close to convergence. Each noise realization is random and may take the iterate in a bad direction, but the next one is just as likely to compensate. The accumulated noise \(\sum_{t=0}^{T-1} \eta_t \xi_t\) is a martingale: its expected value at any future time, given the present, equals its current value. It wanders like a fair game, not like a biased drift.

The Robbins-Siegmund theorem: from martingales to convergence

Having the supermartingale-like inequality for \(V_t\) is the hard part. Converting it into an almost sure convergence statement is handled by a classical result due to Robbins and Siegmund (1971). The theorem is worth knowing.

Robbins-Siegmund theorem. Let \((V_t)\), \((a_t)\), \((b_t)\) be non-negative adapted processes satisfying:

\[ \mathbb{E}[V_{t+1} \mid \mathcal{F}_t] \leq V_t - a_t + b_t \quad \text{for all } t \geq 0. \]

If \(\sum_{t=0}^\infty b_t < \infty\) almost surely, then: (i) \(V_t\) converges almost surely to a finite non-negative limit, and (ii) \(\sum_{t=0}^\infty a_t < \infty\) almost surely.

Apply this to the SGD inequality \(\mathbb{E}[V_{t+1} \mid \mathcal{F}_t] \leq V_t - 2\mu\eta_t V_t + \eta_t^2 G^2\). Set \(a_t = 2\mu\eta_t V_t\) and \(b_t = \eta_t^2 G^2\). The condition \(\sum b_t < \infty\) requires \(\sum \eta_t^2 < \infty\). If this holds, then the theorem gives us two conclusions. First, \(V_t = \|\theta_t - \theta^*\|^2\) converges almost surely to some finite limit \(V_\infty\). Second, \(\sum_t \eta_t V_t < \infty\) almost surely.

The second conclusion, combined with \(\sum \eta_t = \infty\), forces \(V_\infty = 0\). Here is why: if \(V_t\) converged to some positive limit \(c > 0\), then eventually \(V_t \geq c/2\), so the tail of \(\sum \eta_t V_t\) would be at least \((c/2) \sum \eta_t = \infty\), contradicting \(\sum \eta_t V_t < \infty\). So \(V_\infty\) must be zero: \(\theta_t \to \theta^*\) almost surely.

This gives the classical Robbins-Monro conditions on the learning rate schedule:

\[ \sum_{t=0}^\infty \eta_t = \infty \quad \text{and} \quad \sum_{t=0}^\infty \eta_t^2 < \infty. \]

The first condition says the learning rates cannot decay too fast: the algorithm must take steps large enough to cover arbitrary distances in parameter space. The second says the learning rates must eventually go to zero: the noise contribution \(\eta_t^2 G^2\) must be summable. The canonical schedule \(\eta_t = \eta_0 / (t+1)\) satisfies both: the harmonic series diverges but the sum of its squares converges.
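Both conditions are easy to see numerically for the canonical schedule: the partial sums of \(\eta_t\) track \(\log T\) and keep growing, while the partial sums of \(\eta_t^2\) approach the finite limit \(\pi^2/6\):

```python
import numpy as np

# Robbins-Monro check for eta_t = eta0 / (t+1) with eta0 = 1.
T = 1_000_000
eta = 1.0 / np.arange(1, T + 1)

sum_eta = eta.sum()          # harmonic partial sum: ~ log(T) + 0.577, diverges
sum_eta_sq = (eta**2).sum()  # partial sum of squares: approaches pi^2 / 6

assert sum_eta > np.log(T)          # the harmonic sum keeps growing with T
assert sum_eta_sq < np.pi**2 / 6    # the squared sum stays below its finite limit
```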

What I find beautiful about this is that the two conditions are not arbitrary engineering rules. They fall out directly from the martingale structure of the problem. The first condition forces the descent term to dominate. The second keeps the noise finite. Together they ensure the supermartingale \(V_t\) is well-behaved enough for the Robbins-Siegmund argument to close. The learning rate schedule is not a hyperparameter choice in the usual sense. It is a condition required for the proof to work.

Why this is not the same as a random walk

I want to address the thing that confused me longest. A martingale also wanders like a random walk - its expected future value equals its current value. So is SGD just a biased random walk? The answer is no, and the distinction matters.

In a pure random walk, the increments are i.i.d. with zero mean and have no dependence on where you currently are. There is no restoring force. The walk diffuses because there is nothing pulling it back. The expected position stays at the starting point at every step, but that is little consolation: the typical distance from the start grows like \(\sqrt{t}\), and any individual path wanders wherever the noise takes it.

In SGD, the noise \(\xi_t\) has zero conditional mean, yes. But the deterministic gradient term \(-\eta_t \nabla f(\theta_t)\) depends on the current position \(\theta_t\). It points toward the optimum. And the gradient gets larger as \(\theta_t\) moves farther from the optimum (under strong convexity). This is a restoring force that pure random walks do not have. The further the iterate strays, the harder the gradient pulls it back. The martingale structure of the noise means the noise does not fight this restoring force in expectation. The noise oscillates; the gradient descends.

The formal way to see this is in the Doob decomposition. The predictable part \(A_t\) pulls \(V_t\) down whenever the descent term \(2\mu\eta_t V_t\) dominates the noise term \(\eta_t^2 G^2\). The martingale part \(M_t\) fluctuates but does not drift. A random walk has no equivalent of \(A_t\): its predictable part is identically zero. The entire process is its martingale part. SGD is not a random walk; it is a random walk with an additional restoring force, and the restoring force wins in the long run because the noise has no preferred direction to fight it with.
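The contrast can be simulated directly. The sketch below feeds identical noise into a pure random walk and into SGD on \(f(\theta) = \theta^2/2\) (step size and horizon are arbitrary): the walk's variance grows linearly in \(t\), while the SGD iterate settles into a bounded noise ball:

```python
import numpy as np

rng = np.random.default_rng(4)

n_paths, n_steps, eta = 20_000, 2_000, 0.05

walk = np.zeros(n_paths)   # martingale part only: no predictable drift
theta = np.zeros(n_paths)  # SGD on f(theta) = theta^2 / 2: gradient is theta

for _ in range(n_steps):
    noise = rng.normal(size=n_paths)
    walk = walk + eta * noise              # pure diffusion
    theta = theta - eta * (theta + noise)  # same noise + predictable pull toward 0

# Walk variance grows like eta^2 * t; SGD variance saturates at eta / (2 - eta).
assert abs(walk.var() - eta**2 * n_steps) < 0.5   # ~ 5.0 here, growing with t
assert abs(theta.var() - eta / (2 - eta)) < 0.01  # ~ 0.026, independent of t
```

Doubling `n_steps` doubles the walk's variance but leaves the SGD variance essentially unchanged - the restoring force caps the spread.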

Variance and the price of stochasticity

There is one more piece worth examining: the noise term \(\eta_t^2 G^2\) in the supermartingale inequality. This term does not vanish. It is the permanent price paid for using stochastic gradients. In full-batch gradient descent, this term is zero: the gradient is exact. In SGD, even when \(\theta_t = \theta^*\), the stochastic gradient \(g_t(\theta^*; z_t)\) is generally not zero - it depends on the random mini-batch and fluctuates around the true gradient \(\nabla f(\theta^*) = 0\). The variance of the noise does not disappear at the optimum.

This is precisely why the learning rate must decay to zero: \(\eta_t \to 0\). If the learning rate were held constant at \(\eta\), the noise term \(\eta^2 G^2\) would prevent \(V_t\) from converging to zero. The iterate would bounce around a neighborhood of the optimum whose mean squared radius scales like \(\eta G^2 / (2\mu)\). To converge exactly, the learning rate must shrink. But it cannot shrink too fast, or the gradient term no longer accumulates enough progress. The Robbins-Monro conditions pin down exactly the right rate of decay.
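A small simulation illustrates both regimes. Under stated assumptions (a one-dimensional quadratic, additive unit-variance gradient noise, constants chosen for illustration), a constant step size stalls at the predicted noise floor while the \(1/(t+1)\) schedule keeps improving:

```python
import numpy as np

rng = np.random.default_rng(5)

def run_sgd(eta_of_t, n_runs=5_000, n_steps=5_000, mu=1.0, sigma=1.0):
    """Vectorized SGD runs on f(theta) = (mu/2) * theta^2
    with additive N(0, sigma^2) gradient noise."""
    theta = np.full(n_runs, 5.0)
    for t in range(n_steps):
        g = mu * theta + sigma * rng.normal(size=n_runs)  # unbiased gradient
        theta = theta - eta_of_t(t) * g
    return theta

eta = 0.1
mse_const = np.mean(run_sgd(lambda t: eta) ** 2)            # constant step
mse_decay = np.mean(run_sgd(lambda t: 1.0 / (t + 1)) ** 2)  # Robbins-Monro

# Constant step: stalls at the noise floor ~ eta * sigma^2 / (2 mu) = 0.05,
# no matter how many more steps are taken.
assert abs(mse_const - 0.05) < 0.02
# Decaying step: far below the noise floor, and -> 0 as n_steps grows.
assert mse_decay < 0.005
```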

Modern practice often uses constant or piecewise constant learning rates, sometimes with warm restarts, which deliberately do not decay to zero. This is a choice to trade exact convergence for faster initial progress and implicit regularization via the noise. The martingale analysis explains why that trade-off exists: constant \(\eta\) keeps the noise term active forever, which prevents the supermartingale from collapsing to zero; decaying \(\eta\) collapses the noise but slows late-stage progress. The theory makes the trade-off explicit.

What the Wikipedia spiral taught me

The thing I did not expect when I started reading about martingales is how much the formalism clarifies. Before, I had a vague intuition that SGD converges because the gradient signal is stronger than the noise signal on average. That is true but imprecise. The martingale framework makes it precise. The noise is a martingale difference - zero conditional mean, no directional bias. The gradient is the predictable drift. The Doob decomposition separates these cleanly. The Robbins-Siegmund theorem converts the supermartingale inequality into almost sure convergence. Every step in the argument has a name and a reference.

Knowing this changes how I look at variants. Stochastic variance reduction methods like SVRG shrink the variance of the martingale difference term, making the noise smaller without eliminating its zero-mean structure. Adam rescales coordinates to cope with heterogeneous gradient scales, which, as the Hessian analysis in the softmax attention post argues, is particularly relevant for Transformers. Momentum averages the martingale differences over time, smoothing the noise while preserving the gradient signal. Each of these methods is, in part, a manipulation of the martingale structure.

There is also a broader lesson about what makes a convergence proof work. The proof does not argue that the noise is small. It argues that the noise is unbiased. Those are different claims. Large unbiased noise is fine for convergence, as long as the learning rate decays appropriately. Small biased noise can prevent convergence entirely - the algorithm can be confidently wrong in a consistent direction. The martingale property is not about magnitude; it is about direction. The noise has no preferred direction, and that is sufficient.

I started this post with a Wikipedia spiral and I am ending it with a cleaner picture of something I had used for years without really understanding. SGD does not converge despite its noise. It converges because of the precise structure of that noise. The stochastic gradient is unbiased, which makes its deviation from the true gradient a martingale difference, which means the accumulated noise has no directional drift, which means the deterministic gradient signal always wins in expectation, which is why the whole thing works. That is a complete story. It took until now to find it.

References and links

  1. Robbins, H., & Siegmund, D. (1971). A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics, Academic Press. The original Robbins-Siegmund quasi-martingale convergence theorem.
  2. Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3), 400-407. The original stochastic approximation paper establishing the learning rate conditions.
  3. Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2), 223-311. A comprehensive survey of SGD theory including martingale-based analyses.
  4. Doob, J. L. (1953). Stochastic Processes. Wiley. The original statement of the Doob decomposition and martingale convergence theorems.
  5. Bach, F., & Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems. Martingale tools applied to non-strongly-convex SGD.
  6. My previous post on the Hessian structure of softmax attention, which motivates why these optimization properties matter specifically for Transformers: The Sigmoid Mistake.
  7. My previous post on query-key parameterization and how the dual matrix structure changes the Hessian: One Matrix or Two.