Graph Mamba: The Post-Thesis Idea I Wish I Had Earlier

01 May, 2026

I was very fortunate to be able to pick my own topic for my bachelor's thesis: next event prediction in football. Aspirational and cool, yes, but also hard to pull off in four months while working full-time somewhere else. I did it nonetheless. While working on it, I tried Mamba, and it performed poorly, largely because of a lack of data; in the end, graph attention was what worked. But I never sat down and asked the obvious question: why not both together? Now that the thesis is submitted and I finally have time, I have been reading heavily at the intersection of graph and sequence models, and that reading led to this post.

So I did use the ideas separately. Graphs helped me think about structure. Mamba helped me think about time. Here, I am trying to answer a pretty obvious question: can we use them together?

That question led me to Graph Mamba: Towards Learning on Graphs with State Space Models by Behrouz and Hashemi [1]. The paper is exactly about the awkward border between two worlds. Graphs do not arrive as a sentence. Mamba wants a sequence. A football possession is both: a sequence of actions and a constantly changing graph of players, spaces, and passing options. So the interesting problem is not just "can I put Mamba on graph data?" It is: what must be true before that even makes sense?

The paper's answer is surprisingly practical. It says: do not simply replace attention with Mamba inside an existing graph transformer and hope the architecture forgives you. First decide what the tokens are. Then decide how to order them. Then encode local neighborhoods. Then scan them in both directions using a selective state space model. Positional and structural encodings are useful, but the paper argues they do not always need to be the whole personality of the model.

That is the whole blog in one sentence. But it took me a while to feel why it matters, so let me build it slowly.

The old graph story: messages are local

Most graph neural networks begin with a clean idea. A node updates itself by listening to its neighbors. In a citation graph, a paper listens to nearby papers. In a molecule, an atom listens to adjacent atoms. In soccer, a player could listen to nearby teammates, opponents, the ball carrier, and maybe zones of space. The common message-passing template looks like this:

\[ h_v^{(\ell+1)} = \text{Update}\left(h_v^{(\ell)},\; \text{Aggregate}\{h_u^{(\ell)} : u \in \mathcal{N}(v)\}\right). \]

This is beautiful because it respects the graph. It does not pretend that node order matters. If I rename the players from 1 to 22, the model should not suddenly change its answer. Methods like GCNs made this simple and scalable by learning local graph structure and node features with cost linear in the number of edges [4]. GIN then gave the field a sharper language for discussing how expressive these neighborhood aggregation models can be [5].
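To keep myself honest about what that template actually computes, here is a minimal numpy sketch of one Aggregate/Update step. The mean aggregation, tanh nonlinearity, and weight matrices are my own choices for illustration, not anything specific from the papers above.

```python
import numpy as np

def message_passing_layer(H, adj, W_self, W_neigh):
    """One Aggregate/Update step: mean over neighbors, then a shared linear update.

    H: (num_nodes, d) node features; adj: dict {node: [neighbor indices]};
    W_self, W_neigh: (d, d_out) weight matrices (hypothetical learned parameters).
    """
    H_new = np.zeros((H.shape[0], W_self.shape[1]))
    for v, neighbors in adj.items():
        # Aggregate{h_u : u in N(v)} as a simple mean (isolated nodes get zeros)
        agg = H[neighbors].mean(axis=0) if neighbors else np.zeros(H.shape[1])
        # Update(h_v, aggregated message)
        H_new[v] = np.tanh(H[v] @ W_self + agg @ W_neigh)
    return H_new
```

Stacking several of these layers is what lets information travel more than one hop, which is exactly where the trouble described next begins.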

But local listening has a price. If a node needs information from five hops away, the model must stack enough layers for that information to arrive. By the time it arrives, it has been mixed, compressed, and re-compressed through fixed-size vectors. This is the over-squashing problem: too much distant information gets squeezed through too narrow a channel [6]. In a soccer analogy, imagine trying to understand why the right winger received the ball by only asking the nearest player, then asking that player's nearest player, and so on. The reason might live in a center-back stepping forward ten seconds earlier. By the time that signal reaches the winger through local gossip, it is not crisp anymore.

That was one of the reasons graph transformers became attractive. If attention lets every token talk to every other token, then long-range interactions become direct. The original Transformer made this idea famous for sequences [7]. Graph Transformers then tried to bring the same global mixing into graph learning. GraphGPS, for example, framed a strong recipe as a combination of positional or structural encoding, local message passing, and global attention [8].

The problem is that full attention is expensive. If a graph has \(n\) nodes, all-pairs attention has \(O(n^2)\) pair interactions. For small graphs this can be fine. For large graphs, it becomes the elephant sitting on the GPU. Sparse graph transformers such as Exphormer reduce this by using local edges, expander connections, and virtual global nodes [10]. That is clever. But the broader question remains: do we need attention as the global mixing mechanism, or do we need something that can carry long-range information efficiently?

The Mamba story: memory is selective

Mamba comes from a different direction. It is a sequence model built on state space models. A very simplified state space model keeps a hidden state \(h_t\), updates it as new input \(x_t\) arrives, and emits an output \(y_t\):

\[ h_t = \bar{A}h_{t-1} + \bar{B}x_t, \qquad y_t = Ch_t. \]

S4 made these models exciting again by showing how structured state spaces can handle long sequences efficiently [3]. Mamba's important twist is selectivity [2]. Instead of using fixed transition behavior for every token, Mamba lets parts of the state update depend on the input. In plain language: the model can decide what to remember, what to ignore, and when to reset. A bad touch, a harmless sideways pass, and a line-breaking carry do not have to enter memory with the same force.

A toy version of the idea is:

\[ h_t = A(x_t)h_{t-1} + B(x_t)x_t, \qquad y_t = C(x_t)h_t. \]

That notation hides many implementation details, but it captures the feeling. The transition is not blind. The token can influence how memory changes. This is exactly why Mamba felt natural for event prediction in soccer. Event streams are long, noisy, and uneven. Most events are not decisive. Some are everything. A model that can selectively carry context has the right temperament for the sport.
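To make the "input-dependent transition" concrete, here is a toy, numpy-only version of that selective recurrence. The sigmoid gates and shapes are mine; real Mamba uses a discretized structured state space with a hardware-aware parallel scan, not a Python loop.

```python
import numpy as np

def selective_scan(xs, W_a, W_b, W_in, C):
    """Toy version of h_t = A(x_t) h_{t-1} + B(x_t) x_t, y_t = C h_t.

    xs: list of input vectors of size d_in;
    W_a, W_b, W_in: (d_state, d_in) matrices; C: (d_out, d_state).
    """
    h = np.zeros(W_in.shape[0])
    ys = []
    for x in xs:
        a_t = 1.0 / (1.0 + np.exp(-(W_a @ x)))   # input-dependent retention gate in (0, 1)
        b_t = 1.0 / (1.0 + np.exp(-(W_b @ x)))   # input-dependent write gate
        h = a_t * h + b_t * (W_in @ x)           # the token decides how memory changes
        ys.append(C @ h)                          # readout y_t = C h_t
    return np.stack(ys)
```

The point of the sketch is only the gating: a harmless sideways pass can arrive with both gates near zero, while a line-breaking carry can overwrite a large part of the state.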

But Mamba has a catch: it scans a sequence. Graphs are not sequences. A sentence has a first word, second word, and third word. A graph has nodes and edges, but no canonical order. If we line up players by jersey number, we have injected a fake structure. If we line them up by x-coordinate, we have injected a tactical bias. Sometimes that bias is useful. Sometimes it is nonsense. This is the central tension the Graph Mamba paper takes seriously.

The mistake would be to say: "Mamba is linear, graphs are expensive, let us just replace graph attention with Mamba." The paper's point is subtler. A selective scan is powerful only after we have chosen meaningful tokens and a meaningful order.

Why the obvious combination is not enough

There is a concurrent paper with a very similar name, Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces, which replaces the attention block in GraphGPS with a Mamba-style block and uses node prioritization/order strategies [13]. That is a natural first attempt, and it is useful. But Behrouz and Hashemi argue that this is not the whole story [1]. The paper I am writing about calls its architecture Graph Mamba Networks, or GMNs, and the key distinction is that it treats graph-to-sequence conversion as a first-class design problem.

If each node is simply a token, then Mamba receives a list of nodes. But what does the first node know? In a one-directional scan, early nodes cannot see later nodes. In attention, this is not a problem because every node can attend to every other node. In a recurrent scan, order shapes information flow. If the ordering is arbitrary, the model can lose structural information for boring reasons.

This matters in soccer too. Suppose I order players by distance to the ball. The ball carrier appears first, then nearby pressure, then farther support. That ordering might be good for predicting immediate actions. But if I order players by shirt number, I am basically asking the model to discover football through laundry metadata. It might learn something despite me, but I am not helping.

So Graph Mamba introduces a recipe. The paper describes four required steps and one optional step: tokenization, token ordering, local encoding, bidirectional selective SSM encoding, and optional positional/structural encoding. I like this because it feels less like a single architecture and more like a checklist for not fooling yourself.

Step one: make graph tokens that mean something

The tokenization is the heart of the paper. For a node \(v\), GMN samples random walks starting from \(v\). For each walk length \(\hat{m}\), it samples \(M\) walks, takes the set of visited nodes, and builds the induced subgraph. In the paper's notation, one token is:

\[ G[T_{\hat{m}}(v)] = G\left[\bigcup_{i=1}^{M} T_{\hat{m}, i}(v)\right]. \]

This sounds more complicated than it feels. Pick a player. Walk one relation away, two relations away, three relations away. Each distance gives you a small neighborhood around the player. Instead of using the whole \(k\)-hop ball, which can explode in dense graphs, use sampled walks. Now each node is represented by a sequence of neighborhood snapshots.
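Here is a rough sketch of how I picture this sampling, with my own function names, an unbiased walk, and the node-only token as the \(m = 0\) view; the paper's actual implementation details may differ.

```python
import random

def random_walk(adj, start, length, rng):
    """One unbiased walk of `length` steps over an adjacency dict."""
    node, visited = start, [start]
    for _ in range(length):
        if not adj[node]:
            break
        node = rng.choice(adj[node])
        visited.append(node)
    return visited

def neighborhood_tokens(adj, v, M, m, rng):
    """For node v, build one induced-subgraph token per walk length."""
    tokens = [({v}, [])]                     # walk-length-0 token: the node alone
    for length in range(1, m + 1):
        nodes = set()
        for _ in range(M):                   # union of M sampled walks of this length
            nodes.update(random_walk(adj, v, length, rng))
        # keep only edges with both endpoints inside the sampled node set
        edges = [(a, b) for a in nodes for b in adj[a] if b in nodes and a < b]
        tokens.append((nodes, edges))        # the induced subgraph G[T_m(v)]
    return tokens
```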

There are three knobs:

| Symbol | Plain meaning | Soccer translation |
|--------|---------------|--------------------|
| \(M\) | number of walks per length | how many local routes we sample from a player |
| \(m\) | maximum walk length | how far tactical context can reach |
| \(s\) | how many times we repeat sampling | how many alternate views of the same neighborhood we give the model |

I like \(s\) a lot. Mamba tends to benefit from longer useful sequences because its selection mechanism can filter what does not matter [2]. GMN exploits that by giving the model more sampled subgraphs, not fewer. This is a very Mamba-ish choice. Attention often makes us afraid of token count because every token talks to every other token. A linear scan changes the economics. More context is not free, but it is less terrifying.

The paper also has a nice bridge between node and subgraph tokenization. If \(m = 0\), the token for a node is just the node itself. If \(m \geq 1\), the tokens become sampled subgraphs around the node. So one hyperparameter moves the architecture between a node-token view and a neighborhood-token view. That matters because not all graph tasks want the same bias. Sometimes the global node sequence is enough. Sometimes local structure is the whole game.

Step two: order the tokens without lying too much

Mamba cares about order. That is not a defect. It is what lets the scan behave like memory. But graphs do not hand us an order, so GMN must create one.

For subgraph tokens, the paper gets a useful order almost for free. A larger-hop sampled neighborhood contains broader context. A smaller-hop sampled neighborhood is closer to the target node. GMN orders tokens from farther neighborhoods to nearer neighborhoods. The intuition is: let the broad context flow into the local context, so the final representation of the node can know both the outside world and the immediate surroundings.

For repeated samples with the same walk length, the paper shuffles them to avoid overfitting to an arbitrary order. That small detail matters. If two sampled two-hop neighborhoods have no natural before/after relation, pretending otherwise can leak nonsense into the model.

When \(m = 0\), there are no subgraph tokens, only nodes. Then the ordering problem returns. The paper discusses centrality-style orderings and uses degree ordering in experiments for simplicity and efficiency. In a soccer version, I would not blindly use degree. I would test several task-aware orderings: distance to ball, pitch x-coordinate, possession-team first, pressure score, pass-network centrality, or even a learned ordering. But I would treat this as a modeling assumption, not a harmless preprocessing step.
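Written as code, step two is short, under the same assumptions as the tokenization sketch above; the function names and the degree fallback for \(m = 0\) are my reading of the paper's choices, not its reference implementation.

```python
def order_tokens(tokens_by_length, rng):
    """tokens_by_length: dict {walk_length: [token, ...]} for one target node."""
    ordered = []
    for length in sorted(tokens_by_length, reverse=True):  # broad context first
        group = list(tokens_by_length[length])
        rng.shuffle(group)        # repeats of the same length have no real order
        ordered.extend(group)
    return ordered                # sequence ends nearest the target node

def degree_order(adj):
    """Fallback node ordering for m = 0: sort nodes by degree."""
    return sorted(adj, key=lambda v: len(adj[v]))
```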

Step three: encode local structure

A sampled subgraph is still a graph. Mamba wants vectors. So each neighborhood token must be vectorized. The paper allows different local encoders, including message-passing networks such as Gated-GCN, or random-walk feature encodings inspired by CRaWl [11].

This is where GMN is nicely hybrid. It does not throw away message passing as if local inductive bias suddenly became embarrassing. It uses local encoders where local encoders make sense. Then it uses Mamba for the longer-range scan. I think that is the correct instinct. In soccer, local geometry is not optional. Pressure, marking, passing lanes, and support angles are all local. But the next event also depends on previous tempo, field tilt, transitions, and the possession story. Local structure and long memory are not enemies. They are different organs.

Step four: scan both ways

A normal Mamba block scans in one direction. For language, direction can be part of the task. For graph encoding, a single direction is suspicious. If node A appears before node B, then B can receive information from A, but A cannot receive information from B. That asymmetry may be completely artificial.

GMN uses a bidirectional Mamba block. One scan goes forward. Another scan goes backward. Their outputs are combined. This makes the model less fragile to ordering choices and gives each token a chance to be influenced by tokens on both sides. Vision Mamba models use a similar instinct when adapting sequence scans to images, because images also do not naturally behave like one-dimensional sentences [1].
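In code terms, the bidirectional block is roughly the following, reusing the toy `selective_scan` from earlier; real GMN uses full Mamba blocks with their own parameters per direction, and the combination need not be a plain sum.

```python
def bidirectional_scan(xs, params_fwd, params_bwd):
    """Scan the token sequence in both directions and combine the outputs."""
    y_fwd = selective_scan(xs, *params_fwd)                          # left-to-right pass
    y_bwd = selective_scan(list(reversed(xs)), *params_bwd)[::-1]    # right-to-left, re-aligned
    return y_fwd + y_bwd                                             # each token sees both sides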

The paper's ablation study says this choice matters a lot. On heterophilic datasets, removing bidirectional Mamba causes a large drop. For example, on Roman-empire, the full GMN reports 0.8769 accuracy, while the version without bidirectional Mamba reports 0.8327. On Minesweeper, ROC AUC drops from 0.9101 to 0.8597. That is not a tiny decoration. It is one of the pieces that makes the graph adaptation real.

The optional part: positional and structural encodings

Graph Transformers often lean heavily on positional and structural encodings because attention by itself has weak graph bias. If every node can attend to every other node, the model needs help understanding where nodes sit in the graph. Laplacian eigenvectors, random-walk encodings, shortest-path encodings, and related features are common ways to inject this information.
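To make one of these concrete, here is a hedged sketch of the random-walk structural encoding: the probability of returning to each node after \(k\) steps, read off the diagonal of powers of the random-walk matrix. The dense-matrix form only makes sense for small graphs; it is an illustration, not a production recipe.

```python
import numpy as np

def random_walk_se(A, K):
    """A: dense (n x n) adjacency matrix. Returns an (n x K) structural encoding."""
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(deg, 1.0)        # row-normalized random-walk matrix
    Pk = np.eye(A.shape[0])
    feats = []
    for _ in range(K):
        Pk = Pk @ P
        feats.append(np.diag(Pk))       # return probability after k steps
    return np.stack(feats, axis=1)
```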

GMN does not ban PE/SE. It makes them optional. That is an important difference in tone. If subgraph tokens already carry local structure, and if random-walk sampling already provides meaningful neighborhoods, then complex encodings may not be the only way to give the model graph awareness. This is one reason the authors can make the provocative claim that Transformers, complex message passing, and PE/SE are sufficient for good performance, but not necessary [1].

I would not read that as "never use PE/SE." I would read it as "do not confuse a successful recipe with a law of nature." If positional encodings help, use them. If they become the scalability bottleneck, GMN gives another path.

A useful complexity picture

The computational story is simple enough to keep in your head. Full attention over \(n\) nodes is quadratic: \(O(n^2)\). GMN's sampled-neighborhood version is roughly linear in graph size when \(M\), \(s\), and \(m\) are treated as controlled hyperparameters:

\[ O\left(Ms(m+1)|V| + |E|\right). \]

This does not mean GMN is magically free. Sampling, encoding, and scanning all cost something. Larger \(s\) gives more context but slower training. Larger \(m\) reaches farther but increases token work. Larger \(M\) gives better neighborhood coverage but more computation. The useful part is that these are knobs, not a wall.
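A back-of-the-envelope version of how I read that formula, with numbers I made up:

```python
M, s, m = 4, 3, 2                      # walks per length, repeats, max walk length (assumed values)
n_nodes, n_edges = 10_000, 50_000

walks_per_node = M * s * (m + 1)       # 36 sampled walks: this drives the O(Ms(m+1)|V|) term
tokens_per_node = s * (m + 1)          # 9 neighborhood tokens enter the scan per node
scan_work = tokens_per_node * n_nodes  # 90,000 tokens in total, linear in |V|
attention_pairs = n_nodes ** 2         # 100,000,000 pairs for full all-pairs attention

print(walks_per_node, scan_work, attention_pairs)
```

Three orders of magnitude between the scan work and the attention pairs is the entire argument, even before constants.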

What the experiments say

The paper evaluates GMNs on long-range graph benchmarks, standard GNN benchmarks, heterophilic datasets, and a large OGB dataset. The headline is that GMN is competitive or better while using less memory than attention-heavy graph transformer approaches.

On the Long Range Graph Benchmark [9], GMN reports the best numbers among listed baselines on COCO-SP, PascalVOC-SP, and Peptides-Func, and is essentially tied with the strongest comparator on Peptides-Struct. A few values that stood out to me:

| Dataset | Metric (direction) | GMN | GPS + Mamba | Best non-GMN comparator in table |
|---------|--------------------|-----|-------------|----------------------------------|
| COCO-SP | F1 (higher is better) | 0.3974 | 0.3895 | GPS: 0.3774 |
| PascalVOC-SP | F1 (higher is better) | 0.4393 | 0.4180 | GPS + Mamba: 0.4180 |
| Peptides-Func | AP (higher is better) | 0.7071 | 0.6624 | CRaWl: 0.6963 |
| Peptides-Struct | MAE (lower is better) | 0.2473 | 0.2518 | Graph ViT: 0.2468 |

The comparison to GPS + Mamba is especially important. It is the "obvious" baseline: take a strong graph transformer framework and swap the transformer block for Mamba. GMN beating it is evidence for the paper's main design claim. The win is not only from Mamba. It is from respecting what changes when attention becomes a scan.

On GNN benchmark datasets, GMN is also strong. It reports 0.7576 accuracy on CIFAR10, above Exphormer's 0.7469 in the table, and 0.9415 on MalNet-Tiny, slightly above Exphormer's 0.9402 [10]. On heterophilic datasets, GMN is best on Amazon-ratings, Minesweeper, and Tolokers, while Exphormer is higher on Roman-empire. That is a healthy result. It does not say one model crushes all others everywhere. It says the design is serious.

The efficiency table is the one I would tape to my monitor before trying this on sports data. On OGBN-Arxiv, the paper reports GMN at 3.85 GB GPU memory, GPS + Mamba at 5.02 GB, Gated-GCN at 11.09 GB, and Exphormer at 36.18 GB. GMN also reports the best accuracy among those entries at 0.7248. Memory is not an aesthetic metric. It decides whether an idea gets to run at all.

The theory, in human language

The paper has three theoretical claims that are worth translating carefully.

First, its neighborhood sampling can be more expressive than fixed \(k\)-hop neighborhood sampling when \(M\), \(m\), and \(s\) are large enough. The intuition is that fixed \(k\)-hop neighborhoods give one big view. Random-walk sampling with repeated samples can expose many substructures inside that view. It is like comparing a single aerial photo of a possession to several camera angles through possible passing lanes.

Second, with positional encoding and enough parameters, GMNs can approximate permutation-equivariant functions on graphs. This mirrors the kind of universality argument often made for graph transformers, but the paper leans on results showing that state space models with layer-wise nonlinearities are universal approximators for sequence-to-sequence functions [1].

Third, without PE and without an MPNN, GMNs can still have expressive power not bounded by the Weisfeiler-Leman hierarchy, because the random-walk feature encoding connects to CRaWl [11]. I would be careful not to over-sell this as "GMN solves graph isomorphism." That is not the practical takeaway. The takeaway is better: random-walk subgraph views can expose patterns that standard message passing may miss.

For my purposes, the theory says: this is not only an engineering hack. The sampling and encoding choices change what the model can see.

The soccer version I keep imagining

Now back to the thesis itch. In next-event prediction, the target could be the next action type, next location, next receiver, next possession value, or some joint distribution over all of these. The input is usually an event stream: pass, carry, pressure, duel, shot, clearance. Each event has time, location, team, player, body part, result, and context. Soccer event frameworks such as SPADL and VAEP made this action-language view popular by representing and valuing on-ball actions with context [14].

But every event also lives inside a graph. At a timestamp, players have relations: distance, angle, passing lane visibility, same-team/opponent indicator, role, pressure, velocity alignment. TacticAI is a nice real-world example of football graph modeling: it treats corner-kick situations as graphs of players and uses graph neural networks to reason about first receivers and shot likelihood [15]. That is set-piece focused, but the representation lesson transfers. Football is relational before it is sequential.

So here is the Graph Mamba-flavored next-event model I would now try.

At each event time \(t\), construct a graph \(G_t\). Nodes are players, maybe the ball, maybe tactical zones. Edges encode distances, same-team relations, passing-lane openness, marking pressure, and relative velocity. Node features include player location, velocity, team possession flag, role, fatigue proxy if available, and event-specific context. Then choose a target node or target region depending on the prediction task.

For each key node, sample neighborhoods using random walks over this relational graph. A walk is not a literal player movement. It is a route through influence: ball carrier to nearest presser, ball carrier to passing option, passing option to marker, marker to covering defender. These sampled subgraphs become local tactical tokens. Order them from broader context back toward the player or ball carrier. Encode each token with a local graph encoder. Then use bidirectional Mamba to produce node-level representations.

Now we still need temporal memory. There are two options, and I would test both. Option one: use GMN inside each event frame to produce graph-aware event embeddings, then feed the event sequence into a causal Mamba for next-event prediction. Option two: build tokens that already combine time and graph, such as sampled neighborhoods across recent event frames, and scan those. Option one is cleaner. Option two is more ambitious and probably more chaotic, which means I would secretly want to try it after the clean baseline works.
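For option one, the skeleton I have in mind looks roughly like this. Every name here is a placeholder for the pieces sketched earlier in the post, not the paper's code and not an existing library API.

```python
def next_event_logits(event_frames, gmn_encode, pool, causal_scan, head):
    """event_frames: list of (graph_t, target_nodes) for one possession, in time order."""
    event_vectors = []
    for graph_t, targets in event_frames:
        node_reprs = gmn_encode(graph_t, targets)   # bidirectional scan inside the frame is fine
        event_vectors.append(pool(node_reprs))      # one graph-aware vector per event
    memory = causal_scan(event_vectors)             # forward-only Mamba across events
    return head(memory[-1])                         # distribution over the next event
```

The split matters: everything inside `gmn_encode` may look both ways across the current frame, while `causal_scan` only ever looks backward in time.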

Where this could help in next-event prediction

The most obvious gain is long-range context without flattening the pitch into a huge attention problem. A possession can contain twenty, thirty, forty events. Tracking data can contain hundreds of frames. If each frame has a player graph, full attention over every player at every time becomes expensive quickly. A selective scan gives a way to carry information across this long context while still filtering noise.

The second gain is better inductive bias. Pure sequence models see "pass by player A to player B" as a token with features. But they do not naturally know that player B was between two defenders, that the fullback was free on the touchline, or that the pass opened a triangle. A graph encoder can make those relations visible before the temporal model tries to predict the next event.

The third gain is interpretability of failure. If the model misses a next pass, I can ask whether the error came from the local graph view or the temporal memory. Did the sampled neighborhoods miss the relevant passing option? Did the event-level Mamba forget the earlier overload? Did the ordering make sense? This decomposition gives better debugging handles than a single giant transformer block that consumes everything.

What I would be careful about

First, causality. The paper's bidirectional scan is useful for graph representation tasks where the whole graph is available. In next-event prediction, time must remain causal. I can use bidirectional scanning inside a single current-frame graph, because all players at time \(t\) are observed. But I should not use future event frames when predicting the next event. The temporal layer should be causal unless the task explicitly allows offline smoothing.

Second, leakage through graph construction. Football data is full of traps. If an edge feature uses information only known after the event, the model will look brilliant and be useless. For example, if the graph includes pass destination while predicting pass destination, congratulations, we invented cheating. Every node and edge feature must be timestamp-clean.
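One cheap guard I would keep in the data pipeline, assuming each feature carries the time at which it became observable (a convention I am inventing for this sketch):

```python
def assert_timestamp_clean(features, event_time):
    """Refuse any node or edge feature computed from information after the event."""
    for name, (value, observed_at) in features.items():
        if observed_at > event_time:
            raise ValueError(f"feature '{name}' leaks future information")
```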

Third, ordering. In GMN, ordering is not a cosmetic choice. For soccer, I would run ablations across several orderings and report them honestly. Distance to ball might work for immediate actions. Tactical role order might work for team shape. Learned ordering might work best but be harder to trust. There is no free lunch, only lunch with a better validation protocol.

Fourth, benchmarks. The Long Range Graph Benchmark itself has been re-examined, with later work arguing that some reported gaps between graph transformers and message-passing baselines shrink after stronger hyperparameter tuning [12]. This does not invalidate GMN, but it is a good reminder: architecture papers live inside benchmark culture. I would want soccer-specific ablations, not only imported confidence from graph benchmarks.

Fifth, data regime. Graph Mamba has more moving parts than a plain event-sequence model. If the dataset is small, or if tracking data is missing, the extra structure may overfit. A simple Mamba over SPADL events might beat a fancy graph model if the graph features are noisy. The right question is not "is Graph Mamba cooler?" It is "does the graph view add predictive signal after controlling for complexity?"

The part I find genuinely exciting

What I like most is that Graph Mamba changes the metaphor. A graph transformer says: let everyone talk to everyone, then learn what matters. A message-passing GNN says: let neighbors talk, layer by layer. GMN says: sample meaningful local worlds, order them carefully, and let a selective memory decide what survives.

That feels very football. A player does not process the whole pitch as a dense matrix. They scan. They notice pressure, space, runs, body shapes. They ignore half the noise because the game is too fast. Good players do not see everything equally. They see selectively.

Of course, neural networks are not footballers. I am not making that romantic claim. But as a modeling bias, selectivity feels right. The next event is usually not caused by every previous event. It is caused by a small set of relevant moments, some local and some long-range. Mamba gives a mechanism for selective memory. Graphs give a mechanism for relational structure. Graph Mamba is interesting because it tries to make those mechanisms meet without flattening one into the other.

If I were implementing this tomorrow

I would start modestly. No heroic architecture on day one. The first baseline would be a causal Mamba over event tokens: action type, start/end coordinates, team, player role, time delta, score state, and possession context. That tells me how much sequence alone can do.

The second baseline would add a simple graph encoder per event frame: maybe a GCN/GAT over player positions if tracking exists, or a pass-network/context graph if only event data exists. Pool the graph into an event vector and feed that into the same causal Mamba. That tells me whether graph structure helps at all.

Only then would I try the proper GMN version: random-walk neighborhood tokens around the ball carrier, receiver candidates, and high-pressure defenders; local encoding; bidirectional scan inside the frame; causal Mamba over frames. I would compare tokenization settings \(m = 0\), \(m = 1\), \(m = 2\), and different \(s\). I would track both accuracy and calibration, because next-event models should know when they are uncertain.
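Written down as a plain config, the grid I would actually run looks something like this; the keys and values are my choices, not settings from the paper.

```python
ablation_grid = {
    "tokenization_m": [0, 1, 2],                    # node tokens vs. 1- and 2-hop walk tokens
    "repeats_s": [1, 2, 4],                         # resampled views per neighborhood
    "ordering": ["distance_to_ball", "degree", "pitch_x"],
    "temporal_model": ["causal_mamba"],             # kept fixed across graph ablations
    "metrics": ["top1_accuracy", "log_loss", "calibration_ece"],
}
```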

For outputs, I would avoid predicting only action type. A useful model should predict a distribution over action type, next location, and receiver. The receiver prediction especially benefits from graph structure because the candidate set is relational. The model should learn not just "a pass happens" but "a pass to this player in this lane is plausible."

And I would keep a small qualitative notebook. The best sports models are not only leaderboard entries. They should produce moments where you look at a possession and say: yes, that is why the model thought the switch was coming. If the model cannot be interrogated at the level of football intuition, it may still be useful, but it will be harder to trust.

Closing note

The thesis is done, so naturally I now have a better thesis idea. This is the academic version of thinking of a perfect comeback in the shower two days later.

But I do not find that frustrating anymore. I actually like it. It means the work left a residue. Graphs taught me to care about shape. Mamba taught me to care about memory. Graph Mamba sits in the space between them and asks a very practical question: if the world is relational but our model scans sequences, how do we turn relation into sequence without destroying the relation?

That question is bigger than soccer. It shows up in molecules, recommendation systems, traffic, social networks, brains, and any domain where structure and time refuse to stay separate. But soccer is where I feel it most vividly. A match is not a sentence, and it is not a static graph. It is a moving graph that tells a story one event at a time.

So yes, I think graph and Mamba can be used together. Not by duct-taping them. Not by pretending a graph is secretly just a sentence. But by doing the careful middle work: tokenize neighborhoods, order them honestly, encode local structure, scan selectively, and keep the temporal prediction causal.

If I get to revisit next-event prediction properly, this is the idea I would want on the whiteboard first.

References and links

  1. Behrouz, A., & Hashemi, F. (2024). Graph Mamba: Towards Learning on Graphs with State Space Models. arXiv:2402.08678.
  2. Gu, A., & Dao, T. (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. COLM 2024. OpenReview.
  3. Gu, A., Goel, K., & Re, C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022. OpenReview.
  4. Kipf, T. N., & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017. OpenReview.
  5. Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2019). How Powerful are Graph Neural Networks? ICLR 2019. OpenReview.
  6. Alon, U., & Yahav, E. (2021). On the Bottleneck of Graph Neural Networks and its Practical Implications. ICLR 2021. OpenReview.
  7. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. Proceedings.
  8. Rampasek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., & Beaini, D. (2022). Recipe for a General, Powerful, Scalable Graph Transformer. NeurIPS 2022. Proceedings.
  9. Dwivedi, V. P., Rampasek, L., Galkin, M., Parviz, A., Wolf, G., Luu, A. T., & Beaini, D. (2022). Long Range Graph Benchmark. NeurIPS 2022 Datasets and Benchmarks. Proceedings.
  10. Shirzad, H., Velingker, A., Venkatachalam, B., Sutherland, D. J., & Sinop, A. K. (2023). Exphormer: Sparse Transformers for Graphs. ICML 2023. PMLR.
  11. Tonshoff, J., Ritzert, M., Wolf, H., & Grohe, M. (2023). Walking Out of the Weisfeiler Leman Hierarchy: Graph Learning Beyond Message Passing. TMLR. OpenReview.
  12. Tonshoff, J., Ritzert, M., Rosenbluth, E., & Grohe, M. (2023). Where Did the Gap Go? Reassessing the Long-Range Graph Benchmark. arXiv:2309.00367.
  13. Wang, C., Tsepa, O., Ma, J., & Wang, B. (2024). Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces. arXiv:2402.00789.
  14. Decroos, T., Bransen, L., Van Haaren, J., & Davis, J. (2019). Actions Speak Louder Than Goals: Valuing Player Actions in Soccer. KDD 2019. arXiv.
  15. Wang, Z., Velickovic, P., Hennes, D., et al. (2024). TacticAI: an AI assistant for football tactics. Nature Communications. Article.