15 July, 2025
One of the things I love about IISERB is the annual math club fest of Continuum. This year, one of the events was a 7-minute symposium where you could present any topic — as long as you used mathematics. I chose Shannon's kids: KL and JS divergences. It won me second prize and a sipper, now employed for iced tea on lazy afternoons.
Imagine two professors grading differently. BIO101 has Gaussian-like scores: balanced and symmetric. ECS201 or MTH202, however, reveals skewed results. The difference is obvious visually, but how can we measure it rigorously and quantitatively? That's where information theory enters.
Here's what these distributions might look like:
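A minimal Python sketch of two such score distributions; the parameters (a Gaussian around 70 for BIO101, a skewed Beta-shaped pile of marks for MTH202) are invented purely for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores out of 100 (parameters made up for illustration):
# BIO101 roughly symmetric, MTH202 skewed towards lower marks.
bio_scores = np.clip(rng.normal(loc=70, scale=10, size=10_000), 0, 100)
mth_scores = np.clip(100 * rng.beta(a=2, b=5, size=10_000), 0, 100)

plt.hist(bio_scores, bins=40, density=True, alpha=0.6, label="BIO101 (Gaussian-like)")
plt.hist(mth_scores, bins=40, density=True, alpha=0.6, label="MTH202 (skewed)")
plt.xlabel("score")
plt.ylabel("density")
plt.legend()
plt.show()
```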
Not all distributions describe reality equally well. KL divergence tells us how inefficient one distribution is in describing data generated from another. This inefficiency — in bits — is the cost of using the wrong model.
Kullback-Leibler divergence quantifies the discrepancy between two probability distributions:
\[ D_{KL}(P || Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \]
KL divergence is zero iff \(P = Q\) almost everywhere. In this sense, it acts as a quasi-distance measure on the space of distributions.
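As a quick sketch, the sum can be computed directly with numpy and scipy on two made-up discrete distributions (scipy's rel_entr works in nats, so divide by \(\ln 2\) for bits):

```python
import numpy as np
from scipy.special import rel_entr  # rel_entr(p, q) = p * log(p / q), elementwise

# Two made-up distributions over the same four outcomes.
P = np.array([0.1, 0.4, 0.4, 0.1])
Q = np.array([0.25, 0.25, 0.25, 0.25])

kl_nats = np.sum(rel_entr(P, Q))   # D_KL(P || Q) in nats (natural log)
kl_bits = kl_nats / np.log(2)      # convert to bits

print(kl_bits)                     # ~0.278 bits of extra cost for modelling P with Q
print(np.sum(rel_entr(P, P)))      # 0.0: the divergence vanishes when the distributions match
```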
The KL divergence is rooted in Shannon’s definition of entropy, which measures the average uncertainty (or information content) in a distribution:
\[ H(P) = - \sum_x P(x) \log P(x) \]
Cross-entropy is the average number of bits needed to encode samples from \(P\) using a coding scheme optimized for \(Q\):
\[ H(P, Q) = - \sum_x P(x) \log Q(x) \]
The KL divergence is simply the difference:
\[ D_{KL}(P || Q) = H(P, Q) - H(P) \]
This makes its interpretation in coding theory precise: how many more bits are needed, on average, because you assumed the wrong distribution?
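The identity is easy to verify numerically; a small sketch with the same made-up distributions, this time working directly in bits:

```python
import numpy as np

P = np.array([0.1, 0.4, 0.4, 0.1])
Q = np.array([0.25, 0.25, 0.25, 0.25])

H_P  = -np.sum(P * np.log2(P))    # entropy of P, in bits
H_PQ = -np.sum(P * np.log2(Q))    # cross-entropy H(P, Q), in bits
kl   = np.sum(P * np.log2(P / Q)) # D_KL(P || Q), in bits

print(H_PQ - H_P, kl)  # both ~0.278: the extra bits paid for assuming the wrong distribution
```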
Suppose Biology marks are normally distributed and Math marks are skewed. Computing the divergence in both directions gives two very different numbers: \(D_{KL}(\text{Math} || \text{Bio})\) comes out much larger than \(D_{KL}(\text{Bio} || \text{Math})\).
This tells us that modelling the Math scores with the Biology distribution is much worse than the reverse: there is more inefficiency, more surprise.
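Continuing the entirely hypothetical grade example, a sketch that bins simulated Bio and Math scores and computes both directed divergences; the tiny smoothing constant is there because KL blows up on empty bins:

```python
import numpy as np
from scipy.special import rel_entr

rng = np.random.default_rng(0)

# Hypothetical scores, as before: Bio Gaussian-like, Math skewed.
bio = np.clip(rng.normal(70, 10, 10_000), 0, 100)
mth = np.clip(100 * rng.beta(2, 5, 10_000), 0, 100)

bins = np.linspace(0, 100, 21)
eps = 1e-12  # smoothing so that no bin has exactly zero probability
P = np.histogram(mth, bins=bins)[0] + eps  # "reality": Math scores
Q = np.histogram(bio, bins=bins)[0] + eps  # "model": Bio scores
P, Q = P / P.sum(), Q / Q.sum()

# The two directions generally disagree: KL is not symmetric.
print("D_KL(Math || Bio) =", np.sum(rel_entr(P, Q)) / np.log(2), "bits")
print("D_KL(Bio || Math) =", np.sum(rel_entr(Q, P)) / np.log(2), "bits")
```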
Think of guessing a number from 1 to 8. Each binary question like “Is it > 4?” gives 1 bit of information. Since there are 8 options, you need:
\[ \log_2 8 = 3 \text{ bits} \]
Information content of an event with probability \(p\) is:
\[ I(x) = -\log_2 p(x) \]
Thus, rarer events contain more information. Shannon entropy is just the expected value of this surprise.
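A few hypothetical event probabilities make the point:

```python
import numpy as np

# Rarer events carry more bits of surprise.
for p in [0.5, 0.25, 0.01]:
    print(f"p = {p:<5} surprise = {-np.log2(p):.2f} bits")
```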
While powerful, KL divergence has limitations: it is not symmetric, since \(D_{KL}(P || Q) \neq D_{KL}(Q || P)\) in general; it blows up to infinity whenever \(Q\) assigns zero probability to an outcome that \(P\) considers possible; and it is unbounded, which makes values hard to compare across problems.
To address these, a more robust measure was developed.
Jensen-Shannon Divergence is a symmetrized and smoothed version of KL divergence:
\[ D_{JS}(P || Q) = \frac{1}{2} D_{KL}(P || M) + \frac{1}{2} D_{KL}(Q || M), \quad M = \frac{1}{2}(P + Q) \]
JSD can also be written using entropies:
\[ D_{JS}(P || Q) = H(M) - \frac{1}{2} H(P) - \frac{1}{2} H(Q) \]
This measures how much more uncertain the mixture \(M\) is compared to the average uncertainty of \(P\) and \(Q\).
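Both forms are easy to check against each other; a short sketch on two made-up distributions, in bits:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits (0 log 0 treated as 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    """D_KL(p || q) in bits, assuming q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

P = np.array([0.1, 0.4, 0.4, 0.1])
Q = np.array([0.4, 0.1, 0.1, 0.4])
M = 0.5 * (P + Q)

jsd_kl_form = 0.5 * kl(P, M) + 0.5 * kl(Q, M)
jsd_entropy_form = H(M) - 0.5 * H(P) - 0.5 * H(Q)

print(jsd_kl_form, jsd_entropy_form)  # equal, and always between 0 and 1 bit
```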
It also has a nice interpretation: it is the mutual information between a sample and its origin label (whether it came from \(P\) or \(Q\), with both origins equally likely), thus connecting information theory with classification.
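That interpretation can be checked directly under the assumption that the origin label is a fair coin flip: build the joint distribution of (sample, label), compute the mutual information, and it matches the JSD above.

```python
import numpy as np

P = np.array([0.1, 0.4, 0.4, 0.1])
Q = np.array([0.4, 0.1, 0.1, 0.4])

# Joint distribution of (x, z): flip a fair coin z, then draw x from P or Q.
joint = 0.5 * np.vstack([P, Q])  # shape (2 labels, 4 outcomes)
p_x = joint.sum(axis=0)          # marginal over x (this is the mixture M)
p_z = joint.sum(axis=1)          # marginal over z = (0.5, 0.5)

# Mutual information I(X; Z) in bits.
mi = np.sum(joint * np.log2(joint / (p_z[:, None] * p_x[None, :])))
print(mi)  # ~0.278 bits, equal to D_JS(P || Q) computed above
```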
In Latent Dirichlet Allocation (LDA), topics are distributions over words, and documents are mixtures of topics. KL divergence plays a central role there: the variational inference used to fit the model minimizes a KL divergence between an approximate posterior and the true one, and divergences between topic or document distributions give a natural way to compare them.
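As a toy illustration rather than real LDA output, suppose a 3-topic model has reduced documents to topic distributions; JSD then gives a symmetric, bounded score for comparing them (scipy's jensenshannon returns the distance, i.e. the square root of the divergence):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical document-topic distributions from a 3-topic model.
doc_a = np.array([0.70, 0.20, 0.10])
doc_b = np.array([0.60, 0.30, 0.10])
doc_c = np.array([0.05, 0.15, 0.80])

# Square the Jensen-Shannon distance to get the divergence, in bits.
print(jensenshannon(doc_a, doc_b, base=2) ** 2)  # small: similar topic mixtures
print(jensenshannon(doc_a, doc_c, base=2) ** 2)  # larger: very different documents
```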
KL divergence gave us a powerful lens for measuring divergence between beliefs and reality — but it’s not perfect. Jensen-Shannon builds on it with symmetry and robustness. Behind them all, Shannon's insight remains foundational: information is uncertainty, and math can measure it.