Carathéodory's Theorem: Every Interior Point Has Three Witnesses

March 2026

I first encountered Carathéodory's theorem in the middle of reading a proof about the Frank-Wolfe algorithm. There it sat in the opening paragraph, stated as though everyone already knew it: any point in the convex hull of a set in \(\mathbb{R}^d\) can be expressed as a convex combination of at most \(d+1\) points of that set. The author moved on immediately. I stopped there for a long time.

The statement sounds mild until you sit with what it actually says. Take a convex polygon with a hundred vertices. Pick any point in its interior. Carathéodory's theorem guarantees you can always find three vertices of that polygon - just three, never more - such that your chosen point is a weighted average of exactly those three. The other ninety-seven vertices are irrelevant to this point. Three witnesses always suffice.

That feels wrong. If you pick a point somewhere in the middle of a polygon whose vertices are spread all around it, surely you need all of them to locate it precisely? Not at all. The theorem says three is always enough, and the bound is tight. It holds in any dimension and for the convex hull of any set, whether the point sits in the interior or on the boundary. In \(\mathbb{R}^2\), always three. In \(\mathbb{R}^3\), always four. In \(\mathbb{R}^d\), always \(d+1\).

The proof fits in a paragraph

Which is part of why I love it. Suppose \(p = \sum_{i=1}^k \lambda_i x_i\) with each \(\lambda_i > 0\), \(\sum_i \lambda_i = 1\), and \(k > d+1\). We want to reduce the number of active terms. The vectors \(x_1 - x_k, \ldots, x_{k-1} - x_k\) are \(k-1 > d\) vectors in \(\mathbb{R}^d\), so they are linearly dependent: there exist \(\alpha_1, \ldots, \alpha_{k-1}\), not all zero, with \(\sum_{i=1}^{k-1} \alpha_i (x_i - x_k) = 0\). Set \(\alpha_k = -\sum_{i=1}^{k-1} \alpha_i\). Then:

\[ \sum_{i=1}^k \alpha_i x_i = 0 \qquad \text{and} \qquad \sum_{i=1}^k \alpha_i = 0. \]

Now observe that for any \(t\):

\[ \sum_{i=1}^k (\lambda_i - t\alpha_i)\, x_i = p \qquad \text{and} \qquad \sum_{i=1}^k (\lambda_i - t\alpha_i) = 1. \]

So \((\lambda_i - t\alpha_i)\) is a valid set of affine weights for \(p\) at any value of \(t\) - they still sum to one, though they may go negative. Since the \(\alpha_i\) sum to zero and are not all zero, at least one of them is positive. Increase \(t\) from zero until the first coefficient hits zero, which happens at \(t^* = \min_{\alpha_i > 0} \lambda_i / \alpha_i\), attained at some index \(j\). At \(t = t^*\) every coefficient is still non-negative, they still sum to one, and the \(j\)-th coefficient has vanished - so \(p\) is now a convex combination of at most \(k-1\) of the points. Repeat until \(k \leq d+1\).

It is one of those proofs where once you see it, you cannot unsee it. The linear dependence of more than \(d+1\) points in \(\mathbb{R}^d\) is the engine. The reduction step is entirely mechanical. The conclusion is striking.
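The reduction step is mechanical enough to code directly. Here is a minimal sketch in Python/NumPy - my own illustration of the proof's elimination loop, not a library routine - that prunes a convex combination down to at most \(d+1\) points:

```python
import numpy as np

def caratheodory_reduce(points, weights, tol=1e-12):
    """Prune a convex combination to at most d+1 points.

    points  : (k, d) array of the x_i
    weights : (k,) array of positive weights summing to 1
    Returns a (points, weights) pair with at most d+1 entries and
    the same weighted sum, following the proof's elimination step.
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    d = points.shape[1]

    while len(weights) > d + 1:
        # Affine dependence: alpha with sum_i alpha_i x_i = 0 and
        # sum_i alpha_i = 0, i.e. a null-space vector of this system.
        A = np.vstack([points.T, np.ones(len(weights))])   # (d+1, k)
        alpha = np.linalg.svd(A)[2][-1]                    # direction of smallest singular value
        if alpha.max() <= tol:                             # make sure some alpha_i > 0
            alpha = -alpha

        # t* = min over alpha_i > 0 of lambda_i / alpha_i: first coefficient to vanish.
        pos = alpha > tol
        t = np.min(weights[pos] / alpha[pos])
        weights = np.clip(weights - t * alpha, 0.0, None)

        keep = weights > tol                               # drop the vanished term(s)
        points, weights = points[keep], weights[keep]
        weights /= weights.sum()                           # guard against numerical drift
    return points, weights
```

Feeding it, say, ten points on the unit circle with uniform weights returns at most three of them whose weighted average is the same point - the centroid of the ten, in that case.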

Try it: click inside the polygon

The tool below makes this visceral. Click anywhere inside the polygon and the algorithm instantly finds three hull vertices whose convex combination is exactly your point. Drag any hull vertex to reshape the polygon - the decomposition updates live.

[Interactive demo: Carathéodory decomposition. Click anywhere inside the polygon to decompose the point; drag vertices to reshape.]

The polygon always has a fan triangulation into \(n-2\) triangles from vertex 1. Your click lands in exactly one of them. The three triangle vertices are the Carathéodory witnesses, and the barycentric coordinates \(\lambda_1, \lambda_2, \lambda_3\) are the weights. They are always non-negative and always sum to exactly one.

What you are watching

The implementation uses a fan triangulation from the topmost hull vertex. The convex polygon is split into \(n-2\) triangles, each sharing hull vertex 1 as the apex. When you click a point P, the algorithm checks each triangle in turn using barycentric coordinates:

\[ \lambda_1 = \frac{(y_2 - y_3)(P_x - x_3) + (x_3 - x_2)(P_y - y_3)}{(y_2 - y_3)(x_1 - x_3) + (x_3 - x_2)(y_1 - y_3)}, \]

with \(\lambda_2\) defined analogously and \(\lambda_3 = 1 - \lambda_1 - \lambda_2\). The point lies inside the triangle if and only if all three coordinates are non-negative. The first triangle satisfying this condition gives the decomposition.
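Here is a minimal sketch of that scan in Python - my reconstruction of the logic described above, not the tool's actual source:

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of p in triangle (a, b, c).

    Points are (x, y) pairs. Returns (l1, l2, l3) summing to 1;
    all three are non-negative exactly when p lies in the triangle.
    """
    (x1, y1), (x2, y2), (x3, y3) = a, b, c
    px, py = p
    denom = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (px - x3) + (x3 - x2) * (py - y3)) / denom
    l2 = ((y3 - y1) * (px - x3) + (x1 - x3) * (py - y3)) / denom
    return l1, l2, 1.0 - l1 - l2

def caratheodory_witnesses(p, hull, eps=1e-9):
    """Scan the fan triangulation (apex = hull[0]) for the triangle
    containing p; return its three vertices and their weights."""
    for i in range(1, len(hull) - 1):
        tri = (hull[0], hull[i], hull[i + 1])
        lams = barycentric(p, *tri)
        if all(l >= -eps for l in lams):
            return tri, lams
    raise ValueError("point is not inside the polygon")
```

For the unit square with hull order \((0,0), (1,0), (1,1), (0,1)\) and \(p = (0.25, 0.6)\), the scan lands in the second fan triangle and returns witnesses \((0,0), (1,1), (0,1)\) with weights \((0.4, 0.25, 0.35)\).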

The three colored circles are the witnesses. The dashed lines from P to each vertex show the three directions whose weighted combination recovers P exactly. The weights at the bottom are the barycentric coordinates: if one vertex dominates with weight near one, P is close to that vertex. If the three weights are roughly equal, P is near the centroid of the triangle.

Notice that the triangle you see is not unique - a different fan triangulation from a different apex would typically give a different triangle for the same interior point. Carathéodory's theorem guarantees a decomposition exists using at most three vertices; it does not say which three. The fan triangulation picks one valid set quickly. The geometry is always correct regardless of which decomposition you find.

The bound d+1 is tight

A natural question: can we always do it with fewer than three? Sometimes, yes. If P lies on an edge of the polygon, two vertices suffice. If P is a vertex itself, one vertex suffices. But for a generic interior point, three vertices are necessary - and two cannot always work.

The standard example is the centroid of a triangle. Label the vertices \(A, B, C\). The centroid is \(\frac{1}{3}A + \frac{1}{3}B + \frac{1}{3}C\). Can you write the centroid as a convex combination of only two of the three vertices? That would require the centroid to lie on one of the edges, which it does not. The centroid is strictly interior, so no two-vertex combination reaches it. Three are genuinely necessary here. The bound \(d+1\) is achieved, not merely approached.
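To see the obstruction algebraically: if the centroid were a combination of \(A\) and \(B\) alone, say \(\tfrac{1}{3}(A+B+C) = tA + (1-t)B\), then

\[ C = (3t - 1)\,A + (2 - 3t)\,B, \]

and since \((3t-1) + (2-3t) = 1\), the vertex \(C\) would be an affine combination of \(A\) and \(B\), i.e., collinear with them - contradicting that \(A, B, C\) span a triangle.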

Why this mattered beyond the Frank-Wolfe paper

Once I understood the theorem, I started seeing it everywhere.

Linear programming. A basic feasible solution of an LP in standard form \(\{Ax = b,\ x \ge 0\}\) with \(m\) equality constraints has at most \(m\) nonzero variables. The simplex algorithm moves between vertices of the feasible polytope. Every point of a polytope in \(\mathbb{R}^n\) is a convex combination of at most \(n+1\) vertices, and a linear objective evaluated at a convex combination is the same weighted average of its values at those vertices - so no point of the polytope can beat every vertex, and the minimum is always attained at one. Carathéodory is part of why restricting the search to vertices loses nothing: the simplex method does not miss the optimum by ignoring the interior.
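The averaging step, spelled out: if \(p = \sum_i \lambda_i v_i\) is a convex combination of vertices and the objective \(f\) is linear, then

\[ f(p) = \sum_i \lambda_i f(v_i) \;\ge\; \min_i f(v_i), \]

so at least one vertex does as well as \(p\).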

Support vector machines. The maximum-margin hyperplane in \(\mathbb{R}^d\) is determined by at most \(d+1\) of its support vectors. The dual formulation of the SVM writes the normal vector \(w = \sum_i \alpha_i y_i x_i\) as a sparse combination of training points. At optimality, the \(\alpha_i\) are nonzero only on the support vectors - and a Carathéodory-style argument bounds how many you need. In practice the number is often much smaller than \(d+1\), but the theorem gives the worst-case certificate: the rest of the dataset is provably irrelevant to the decision boundary.
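A quick empirical check of that sparsity - a sketch assuming scikit-learn is available; the synthetic data and the large \(C\) (to approximate a hard margin) are arbitrary choices of mine:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two linearly separable blobs in R^2.
X = np.vstack([rng.normal(loc=-2.0, size=(100, 2)),
               rng.normal(loc=+2.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin
print("training points: ", len(X))
print("support vectors: ", len(clf.support_vectors_))
# The boundary is pinned down by this handful of points; the remaining
# training points could be deleted without moving it.
```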

Frank-Wolfe / conditional gradient. This is the algorithm that started everything for me. Frank-Wolfe minimizes a smooth function \(f\) over a convex set \(\mathcal{C}\) by, at each step, finding the vertex of \(\mathcal{C}\) that minimizes the linearized objective, then taking a convex-combination step toward it. After \(k\) steps the iterate is a convex combination of at most \(k+1\) vertices of \(\mathcal{C}\) (the starting point plus one new vertex per step). Carathéodory says this representation can always be pruned to \(d+1\) terms without changing the point. In practice, Frank-Wolfe is used exactly because the iterates stay sparse - a consequence of this structure.
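A minimal sketch of that loop over a polytope given explicitly by its vertices - my own illustrative setup, not the paper's:

```python
import numpy as np

def frank_wolfe(grad, vertices, steps=200):
    """Minimize a smooth f over conv(vertices), given grad f.

    vertices : (m, d) array of polytope vertices
    Returns the final iterate x and its weights over the vertices;
    after k steps at most k+1 of the weights are nonzero.
    """
    m, d = vertices.shape
    weights = np.zeros(m)
    weights[0] = 1.0                        # start at a vertex
    x = vertices[0].copy()
    for k in range(steps):
        s = np.argmin(vertices @ grad(x))   # linear minimization oracle
        gamma = 2.0 / (k + 2)               # standard step-size schedule
        weights *= (1.0 - gamma)
        weights[s] += gamma
        x = (1.0 - gamma) * x + gamma * vertices[s]
    return x, weights

# Twenty vertices on the unit circle; minimize ||x - target||^2 over their hull.
theta = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
V = np.column_stack([np.cos(theta), np.sin(theta)])
target = np.array([0.3, 0.2])               # an interior point
x, w = frank_wolfe(lambda z: 2.0 * (z - target), V)
print(x, np.count_nonzero(w > 1e-9))
# Carathéodory says x, like any point of the hull, can be rewritten as a
# convex combination of at most 3 of the 20 vertices (e.g. via the
# reduction sketched earlier).
```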

Neural network weight space. Consider the loss surface of a neural network. The flat minima hypothesis says that wide, flat regions of the loss are preferable to sharp, narrow ones for generalization. Carathéodory has a subtle role here: if you take a convex combination of two models (parameter averaging, as in model ensembling or stochastic weight averaging), the resulting point lies in the convex hull of those models. If the loss is low at both endpoints and the path between them is also low (the linear mode connectivity hypothesis), then the ensemble inherits low loss by convexity. The theorem bounds how many models you need to mix to cover a region: in the \(d\)-dimensional weight space of a network, \(d+1\) models suffice to represent any point in their convex hull.

Attention and mixture models. Each row of the softmax attention matrix is a probability distribution over the sequence. Any such distribution is a point in the probability simplex \(\Delta_{L-1} \subset \mathbb{R}^L\), whose extreme points are the one-hot vectors. Since the simplex is \((L-1)\)-dimensional, Carathéodory says any probability vector can be written as a convex combination of at most \(L\) extreme points - trivially true here, since the simplex has exactly \(L\) vertices and a point's own coordinates already provide the weights. But if an attention distribution is confined to a lower-dimensional face of the simplex (some tokens receive exactly zero weight), fewer witnesses suffice. Sharp attention, where one token gets nearly all the weight, is close to the extreme-point case: one witness. Uniform attention is the deepest interior point: all \(L\) witnesses with equal weights. Carathéodory formalizes why sharp attention is "simpler" in a representational sense.

Sparse dictionary learning and compressed sensing. In dictionary learning, you want to represent data points \(x \in \mathbb{R}^n\) as sparse combinations of dictionary atoms \(D = \{d_1, \ldots, d_m\}\) with \(m \gg n\). Carathéodory says that any point in conv\((D)\) can be represented with at most \(n+1\) atoms - a bound on the intrinsic sparsity of the representation that is independent of the dictionary size \(m\). This is not the whole story of why \(\ell_1\) regularization recovers sparse solutions, but it does guarantee that a sparse representation exists to be found: if the data lies in the convex hull of the dictionary, at most \(n+1\) atoms always suffice.

K-means and clustering. The centroid of a cluster is a convex combination of its members. If the cluster lives in \(\mathbb{R}^d\), the centroid lies in the convex hull of the cluster points. Carathéodory says the centroid can be expressed as a convex combination of at most \(d+1\) cluster members. In high-dimensional spaces where \(d\) is large, this is not a practical compression - but in the interpretability-motivated setting where you want to find "representative points" for a cluster, the theorem says \(d+1\) data points always suffice to locate the centroid exactly.

Reinforcement learning and policy mixing. A mixed policy in RL is a convex combination of deterministic policies. The set of deterministic policies is finite (for finite state and action spaces), and any mixed policy is in their convex hull. Carathéodory says any mixed policy can be written as a convex combination of at most \(|\mathcal{S}||\mathcal{A}| + 1\) deterministic policies, where \(|\mathcal{S}||\mathcal{A}|\) is the ambient dimension when a policy is written as a table of action probabilities. In practice far fewer usually suffice. But the theorem tells you that mixed policies are not fundamentally more expressive than sparse mixtures - the mixing complexity is bounded by the dimension.

Building this

Writing the tool was its own small pleasure. The hardest part is not the math - the barycentric coordinate formula is a textbook half-page - but the interactive geometry: handling mouse coordinates relative to the canvas, maintaining the convex hull ordering after vertex drags, keeping the fan triangulation consistent with whatever hull order the algorithm returns.

I used Andrew's monotone chain algorithm for the convex hull, which outputs vertices in counterclockwise order when working in standard coordinates. The fan triangulation then follows the hull ordering, and the barycentric coordinates are computed for each triangle until one contains the click point. The whole thing runs in real time because the number of hull vertices is small enough that the linear scan through triangles is imperceptible.
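For reference, a compact version of the monotone chain construction - a standard formulation in Python, not the tool's exact code:

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain. Input: iterable of (x, y) tuples.
    Output: hull vertices in counterclockwise order (standard axes, y up)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:                       # lower hull, left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):             # upper hull, right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]      # drop each chain's duplicated endpoint
```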

What I did not expect when building it: how clearly you can see the triangulation structure as you click around. Near an edge between two triangles of the fan, small clicks move you from one triangle to the other, and the highlighted witnesses switch abruptly. The decomposition is discontinuous at triangle boundaries, even though the geometric object (the polygon and its interior) is smooth. The witnesses jump; the weights jump. P itself moves continuously. This discontinuity is not a bug - it is a property of the particular decomposition algorithm, not of Carathéodory's theorem itself. The theorem only guarantees existence; it says nothing about continuity of the witness selection.

That observation sat with me for a while. The theorem is existential. It tells you witnesses exist, not how to find them canonically. The fan triangulation is one particular selection rule, fast and simple, but arbitrary in the sense that rotating the fan to a different apex gives a different set of witnesses for the same point. There is no canonical set of three witnesses for an interior point of a general polygon. Just the guarantee that three is always enough.

I find that philosophically interesting. In mathematics we often look for canonical representatives - the simplest element of an equivalence class, the minimal description. Carathéodory tells you the description can always be made small (at most \(d+1\) terms), but not that there is a unique small one. The guarantee is about size, not identity.