RisingBALLER — A Deep Dive into Treating Players as Tokens

4 November 2025

Every now and then, a paper comes along that doesn't just present a new model, but a new way of thinking. I recently stumbled upon one such paper from the StatsBomb Conference 2024 that genuinely made me sit up and reconsider how we can apply modern AI to sports analytics. It's called RisingBALLER, and its core premise is as elegant as it is powerful: what if we treat football matches like sentences and players like tokens?

This blog is my deep dive into that very idea. I'm going to walk you through the entire methodology, the math that underpins it, and the fascinating results, sharing my perspective on why I think this approach is so transformative. Everything here is sourced directly from the original paper—I'm just adding my own narrative as I explore their work.

The Core Idea: Football Through the Lens of NLP

The intuition behind RisingBALLER is what I found most captivating. The authors essentially asked: why can't we use the same foundation model concepts that revolutionized Natural Language Processing (NLP) for football? In NLP, a transformer model learns the meaning of words (tokens) by looking at their context within a sentence. RisingBALLER ports this idea directly to the pitch.

Each player in a match becomes a token, and the match itself becomes the sentence.

By feeding this sequence into a transformer, the model learns deeply contextualized player embeddings that are specific to that single match. This unlocks a whole host of downstream tasks, from predicting future performance and finding stylistically similar players to even estimating abstract concepts like team cohesion. It's a fundamental shift from static player attributes to dynamic, context-aware representations.

Building the Foundation: Data and Preprocessing

As any data scientist knows, an idea is only as good as the data it's built on. For this project, the authors used the incredible StatsBomb event dataset, focusing on the 2015–2016 season across the top 5 European leagues. The raw data for a single match consists of 3,500–4,000 event rows, which isn't directly usable by a transformer.

So, the first crucial step was a heavy dose of preprocessing. I was impressed by how they converted this event stream into a structured, per-player statistics table for each match. Every player in the matchday squad (both the starting XI and the bench) was given a feature vector. For their main downstream task, Next Match Statistics Prediction (NMSP), they didn't just use raw stats. They selected 39 base statistics (like progressive passes, successful dribbles, aerial duels won, interceptions, xG, etc.) and then engineered aggregates (sums, means, and standard deviations over rolling 3- and 5-match windows) to create a rich, 234-variable feature vector for each player. This captures not just what a player did in one match, but their recent form and consistency.
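To make the windowing concrete, here's a minimal sketch of that kind of rolling-aggregate feature engineering in pandas. The toy match log, the three sample statistics, and the column names are my own illustration, not the paper's actual pipeline:

```python
import pandas as pd

# Hypothetical per-player match log, one row per (player, match), in date order.
# Only 3 of the paper's 39 base statistics are shown for brevity.
log = pd.DataFrame({
    "player": ["Kante"] * 6,
    "progressive_passes": [4, 6, 5, 7, 3, 8],
    "interceptions":      [5, 4, 6, 2, 7, 5],
    "xg":                 [0.1, 0.0, 0.2, 0.1, 0.3, 0.0],
})

base_stats = ["progressive_passes", "interceptions", "xg"]
features = log[["player"]].copy()

# For each base stat, add sum/mean/std over rolling 3- and 5-match windows,
# mirroring the paper's aggregates (feature names here are illustrative).
for stat in base_stats:
    for window in (3, 5):
        roll = log.groupby("player")[stat].rolling(window, min_periods=1)
        features[f"{stat}_sum{window}"] = roll.sum().reset_index(level=0, drop=True)
        features[f"{stat}_mean{window}"] = roll.mean().reset_index(level=0, drop=True)
        features[f"{stat}_std{window}"] = roll.std().reset_index(level=0, drop=True)
```

With the full 39 base statistics and all aggregates, this kind of table expands into the 234-variable vector the paper describes.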

The Model Architecture: Deconstructing a "Player-Token"

Now, let's get into the technical core of RisingBALLER. The fundamental building block is how they represent each player within a match. For any given player, the model constructs four separate embeddings and then sums them element-wise. This creates a rich initial "token" embedding that captures multiple facets of the player's context.

If we denote the embedding dimension by \(D\) and the number of player-tokens in the match sequence by \(N\) (the paper fixes the sequence length at 80, padding shorter match squads where necessary), then for a player \(i\), I can write their initial embedding like this:

\[ \mathbf{x}^{(i)}_{init} = \mathbf{e}^{(i)}_{player} + \mathbf{e}^{(i)}_{pos} + \mathbf{e}^{(i)}_{team} + \mathbf{e}^{(i)}_{tpe}. \]
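As a sketch, this four-way sum can be implemented as simple lookup tables indexed by integer ids. All the table sizes and ids below are toy values I chose for illustration; in the real model these tables are learned parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
D = 8  # embedding dimension (toy value; the real model uses a larger D)

# One lookup table per component of the sum; random stand-ins for learned weights.
E_player = rng.normal(size=(100, D))  # player-identity vocabulary
E_pos    = rng.normal(size=(30, D))   # position vocabulary
E_team   = rng.normal(size=(40, D))   # team vocabulary
E_tpe    = rng.normal(size=(80, D))   # table for the fourth term in the equation above

def initial_token(player_id, pos_id, team_id, tpe_id):
    """x_init = e_player + e_pos + e_team + e_tpe (element-wise sum)."""
    return E_player[player_id] + E_pos[pos_id] + E_team[team_id] + E_tpe[tpe_id]

x = initial_token(7, 3, 12, 0)
```

Summing (rather than concatenating) keeps every token at dimension \(D\), exactly as BERT does with its token, segment, and position embeddings.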

The Four Pillars of a Player Embedding

Let me break down what each of these components represents, as this is key to the whole model:

The Transformer's Role

Once these initial embeddings are created for all \(N\) players, they're stacked into a matrix \(X_{init} \in \mathbb{R}^{N\times D}\). This matrix is then fed through a standard transformer encoder. It's here that the real magic happens. Through its multi-head self-attention mechanism, the transformer allows every player-token to 'look' at every other token in the sequence. The model learns who to pay attention to, effectively asking questions like "Given this midfielder's performance, how should I update my understanding of the striker he was passing to?" The output is a matrix of context-aware embeddings \(X_{out}\in\mathbb{R}^{N\times D}\), where each player's vector is now enriched with information about everyone else in that match.
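To show what that attention step actually computes, here's a minimal single-head version in NumPy. A real encoder adds multiple heads, residual connections, layer norm, and feed-forward blocks; the weights below are random stand-ins:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over player tokens."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over the key axis: each player distributes attention across all players.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
N, D = 5, 8  # 5 player-tokens for readability; a real match sequence is padded to 80
X_init = rng.normal(size=(N, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
X_ctx, attn = self_attention(X_init, Wq, Wk, Wv)
```

Each row of `attn` is that player's attention distribution over everyone in the match, and each row of `X_ctx` is the corresponding context-enriched embedding.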

Pre-training: Developing a "Football IQ"

To get the transformer to learn these complex relationships, the authors used a self-supervised pre-training task they call Masked Player Prediction (MPP). If you're familiar with NLP models like BERT, this is directly analogous to Masked Language Modeling (MLM).

For each match, they randomly hide (or "mask") 25% of the players in the input sequence. The model's job is to predict the identities of these masked players based on the context provided by the unmasked players. This forces the model to develop a deep "football IQ." For instance, if it sees a sequence of Real Madrid defenders and midfielders from 2016, and one player is masked, it has to learn that the missing player is likely to be someone like Cristiano Ronaldo, based on the surrounding context.
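The masking procedure itself is simple to sketch. The player ids and the reserved `MASK_ID` below are placeholders of my own, not the paper's conventions:

```python
import random

random.seed(0)
MASK_ID = 0                    # reserved id for the mask token (my convention)
squad = list(range(101, 121))  # 20 hypothetical player ids in one match sequence

n_masked = round(0.25 * len(squad))  # the paper masks 25% of the players
masked_positions = random.sample(range(len(squad)), n_masked)

inputs, targets = squad.copy(), {}
for j in masked_positions:
    targets[j] = inputs[j]  # identity the model must recover at position j
    inputs[j] = MASK_ID     # the model only sees the mask token here
```

The model receives `inputs` and is scored only on how well it recovers the entries of `targets`.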

From a technical perspective, for each masked position \(j\), the model takes the final contextualized output vector \(\mathbf{x}^{(j)}_{out}\) and projects it into a probability distribution over the entire vocabulary of players \(V\). This is done using a standard softmax layer:

\[ P(\hat{y}_j = v \mid X_{out}) = \mathrm{softmax}\left( W^\top \mathbf{x}^{(j)}_{out} + \mathbf{b}\right)_v ,\quad v\in\{1,\dots,|V|\}, \]

where \(W\in\mathbb{R}^{D\times|V|}\) and \(\mathbf{b}\in\mathbb{R}^{|V|}\) are the weights and bias of the output projection, and the softmax is taken over the \(|V|\) logits.

The model is then trained to minimize the cross-entropy loss, which essentially penalizes it for making wrong predictions. For a single match, the MPP loss function is:

\[ \mathcal{L}_{MPP} = -\sum_{j\in M} \log P(y_j\mid X_{out}). \]
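Putting the two formulas together, here is a NumPy sketch of the MPP objective over a couple of masked positions. Random vectors stand in for the transformer outputs, and the vocabulary size and target ids are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
D, V = 8, 50  # toy embedding dimension and player-vocabulary size

W = rng.normal(size=(D, V))      # output projection weights
b = np.zeros(V)                  # output projection bias
x_out = rng.normal(size=(2, D))  # stand-ins for contextual vectors at 2 masked slots
y_true = [7, 31]                 # true player ids at those positions (invented)

# L_MPP = -sum over masked positions j of log P(y_j | X_out)
loss = 0.0
for x_j, y_j in zip(x_out, y_true):
    probs = softmax(W.T @ x_j + b)  # distribution over the |V| players
    loss -= np.log(probs[y_j])
```

Minimizing this loss pushes probability mass onto the true player at each masked slot, which is exactly what forces the contextual embeddings to encode "who fits here".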

Fine-tuning: From General Knowledge to Specific Prediction

After the pre-training phase, the model has a deep, contextual understanding of players. The next step, which I think demonstrates the real utility of this approach, is fine-tuning it for a specific downstream task: Next Match Statistics Prediction (NMSP). They take the pre-trained transformer, remove the MPP head, and attach a new MLP head. This new head is trained to take the contextualized representations of a team's players and predict 18 team-level statistics for the *next* game.

For this task, the model's goal is to predict the next-match team stats vector \(\mathbf{y}\in\mathbb{R}^{2N_{stats}}\) (one set of stats for each of the two teams). The training objective is a straightforward mean squared error, averaged across all the predicted statistics:

\[ \mathcal{L}_{NMSP} = \frac{1}{2N_{stats}} \sum_{k=1}^{2N_{stats}} (\hat{y}_k - y_k)^2. \]
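As a tiny worked example of that objective, with 3 statistics per team instead of the paper's 18 and entirely invented numbers:

```python
# Toy NMSP target: 3 statistics per team (e.g. xG, xGA, shots), home then away.
y_true = [1.2, 0.8, 13.0,  0.9, 1.1, 9.0]
y_pred = [1.0, 1.0, 12.0,  1.0, 1.0, 10.0]

# L_NMSP = (1 / 2*N_stats) * sum_k (y_hat_k - y_k)^2
mse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)
```

Here the squared errors are 0.04, 0.04, 1.0, 0.01, 0.01, and 1.0, so the loss averages to 0.35.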

During fine-tuning, all the model's weights are updated. This allows the model to adapt its general football knowledge specifically to the task of statistical prediction, a process known as transfer learning.

So, Did It Work? The Key Results

This is the moment of truth. After all this clever setup, does the model actually perform well? Based on the paper, the answer is a resounding yes. Here's my summary of their key findings:

Beyond Prediction: Unlocking the Player Embeddings

For me, this is the most exciting part of the paper. The predictive accuracy is great, but the true power of this approach lies in the learned embeddings themselves. They are rich, nuanced representations of players that can be used for all sorts of analysis. The authors showcase a few fantastic examples:

  1. Positional Clustering: When they visualized the learned positional embeddings, they found that the vectors naturally clustered into defenders, midfielders, and attackers. This shows the model learns the tactical structure of a football pitch organically.
  2. Similar Player Retrieval: This is a classic use case with a new twist. By finding the nearest neighbors to a player's embedding, you can find others who perform a similar role. When I saw their example of querying for players similar to the 2016 version of N'Golo Kanté, the results were stunning. It didn't just find other defensive midfielders; it found players with a similar "engine" and defensive work-rate, like Idrissa Gueye and Allan, even across different leagues.
  3. A Metric for Team Cohesion: The authors proposed a fascinating heuristic for team cohesion: calculate the average pairwise similarity (e.g., cosine similarity) of all players in a team's starting lineup. A higher score might indicate a more stylistically coherent unit that "sings from the same hymn sheet."
  4. Attention Analysis: While noted as future work, one could analyze the transformer's attention matrices to see which players "pay attention" to which other players, potentially revealing on-pitch synergies, like the connection between a playmaker and a striker.
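Points 2 and 3 are easy to sketch once you have an embedding matrix. The names below are just labels and the random vectors are stand-ins for learned embeddings; real results would come from the trained model:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
names = ["Kante", "Gueye", "Allan", "Ronaldo", "Messi", "Neuer"]
E = rng.normal(size=(len(names), 16))  # random stand-ins for learned embeddings

# Similar-player retrieval: rank everyone else by cosine similarity to the query.
q = names.index("Kante")
neighbours = sorted(((cosine(E[q], E[i]), names[i])
                     for i in range(len(names)) if i != q), reverse=True)

# Team-cohesion heuristic: mean pairwise cosine similarity of a starting lineup.
lineup = E[:4]
pairs = [cosine(lineup[i], lineup[j])
         for i in range(len(lineup)) for j in range(i + 1, len(lineup))]
cohesion = sum(pairs) / len(pairs)
```

With real RisingBALLER embeddings, the top of `neighbours` is where players like Gueye and Allan would surface for a Kanté query, and `cohesion` is the single number the authors propose per lineup.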

A Frank Look at Limitations and Future Directions

Of course, no model is perfect, and the authors are transparent about the limitations.

Its key strengths, in my view, are:

However, there are important caveats:

My Final Take

RisingBALLER isn't just another model; I see it as a blueprint for the future of sports analytics. It demonstrates that the principles of modern AI—self-supervised pre-training on large datasets followed by task-specific fine-tuning—are just as powerful on the football pitch as they are in natural language. It moves the field from static analysis to a dynamic, context-aware understanding of players and teams. I, for one, am incredibly excited to see where this line of research goes next.