AI Attention Mechanisms: A d^2 Dimensional Problem, Not n^2
Elevator Pitch
A new mathematical proof claims that the optimization landscape of attention is fundamentally d^2-dimensional, not n^2, potentially unlocking significant efficiency gains and enabling faster, more stable AI model training and inference.
Full Description
Hello, r/MachineLearning. I am just a regular user from a Korean AI community ("The Singularity Gallery"). I recently came across an anonymous post there with a paper attached. The mathematical proof inside felt too important to stay buried in a local forum, so I used Gemini to help me write this English post and share it with you all.
The author claims they do not work in the LLM industry, but they dropped a paper titled: "The d^2 Pullback Theorem: Why Attention is a d^2-Dimensional Problem".
They argue that the field has been fundamentally misunderstanding the intrinsic geometry of Attention. Here is the core of their mathematical proof:
The d^2 Pullback Theorem (The Core Proof):
The author mathematically proves that if you combine the forward pass (n × n) and the backward gradient (n × n), the actual optimization landscape the parameters explore is strictly d^2-dimensional. The n × n bottleneck is merely an illusion created by the choice of softmax normalization.
- Softmax destroys the Euclidean Matching structure:
Previous O(n) linear-attention models failed because removing exp() (the softmax) destroyed the contrast needed for matching. Softmax creates that matching, but it artificially inflates the rank to n, causing the O(n^2) curse.
- O(nd^3) Squared Attention without the instability:
Because the true optimization geometry is d^2-dimensional, softmax can be swapped for a degree-2 polynomial kernel (x^2) while still exploring the exact same optimization landscape. The author introduces CSQ (Centered Shifted-Quadratic) Attention with soft penalties, which retains the Euclidean matching property, stabilizes training, and drops both training and inference complexity to O(nd^3). (A generic sketch of the underlying kernel trick follows below.)
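To make the complexity claim concrete, here is a minimal numpy sketch of the generic degree-2 polynomial kernel trick that an O(nd^3) attention rests on. To be clear, this is not the paper's CSQ formulation (its centering, shifting, and soft penalties are not reproduced), and the function names are mine; it only shows how a quadratic score lets you skip materializing the n × n matrix.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: materializes the n x n score matrix -> O(n^2 d).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def quadratic_kernel_attention(Q, K, V, eps=1e-6):
    # Degree-2 polynomial scores: (q.k)^2 = phi(q).phi(k), where
    # phi(x) = vec(x x^T) lives in R^(d^2). Accumulating phi(K)^T V
    # first avoids the n x n matrix entirely -> O(n d^3) when d_v = d.
    phi = lambda X: np.einsum('ni,nj->nij', X, X).reshape(len(X), -1)
    phi_Q, phi_K = phi(Q), phi(K)   # (n, d^2) feature maps
    S = phi_K.T @ V                 # (d^2, d) summary, built in O(n d^3)
    z = phi_K.sum(axis=0)           # (d^2,) normalizer
    return (phi_Q @ S) / (phi_Q @ z[:, None] + eps)

# Quick check: same output shape, but no 256 x 256 matrix is ever formed.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 16)) for _ in range(3))
print(quadratic_kernel_attention(Q, K, V).shape)  # (256, 16)
```

Note that the (q·k)^2 weights are non-negative and normalize to 1 just like softmax weights, which is presumably the "Euclidean matching" property the post refers to; whether the centered/shifted variant actually keeps training stable is exactly the part that needs independent verification.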
The author wrote: "I'm not in the LLM industry, so I have nowhere to share this. I'm just posting it here hoping it reaches the researchers who can build better architectures."
I strongly believe this math needs to be verified by the experts here. Could this actually be the theoretical foundation for replacing standard Transformers?
Original PDF: https://drive.google.com/file/d/1IhcjxiiHfRH4_1QIxc7QFxZL3_Jb5dOI/view?usp=sharing
Original Korean Forum Post: https://gall.dcinside.com/mgallery/board/view/?id=thesingularity&no=1016197
From the Reddit thread (12 top comments)
- 84·Reddit commenter·1mo ago
The paper is heavily theoretical and should be sent to experts working at the intersection of pure math and machine learning. I doubt many folks here will understand or validate the results.
- 74·Reddit commenter·1mo ago·reply
From skimming, there really isn't any math you wouldn't encounter in undergrad CS. It's notation-heavy but not conceptually heavy, if that makes sense.
- 55·Reddit commenter·1mo ago
Generally speaking, the paper starts off nice. The core concept, that the optimization landscape of attention is d^2-dimensional rather than n^2, is correct. Simple thought experiment: if you used a really small d (say 2), your attention head has only ~4 knobs to tune, and it fundamentally can't extract fine-grained information from long sequences. It's like doing a regression but only being allowed a handful of free parameters. Works fine when n is small, breaks as the data grows. That said, the low-rank bottleneck of attention heads has been explored before, iirc. (I can search for the papers later …
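The "handful of knobs" point above is easy to check numerically under the usual Q/K projection setup. A toy sketch (variable names are illustrative, not from the paper):

```python
import numpy as np

d_model, d_head = 64, 2   # the commenter's "really small d" case
rng = np.random.default_rng(1)
Wq = rng.standard_normal((d_model, d_head))
Wk = rng.standard_normal((d_model, d_head))

# Attention logits only see x_i^T (Wq Wk^T) x_j. The product Wq Wk^T is
# d_model x d_model but has rank <= d_head, so the interaction core is a
# d_head x d_head matrix: ~d_head^2 knobs (here 4), however large d_model is.
A = Wq @ Wk.T
print(np.linalg.matrix_rank(A))   # -> 2
```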
- 43·Reddit commenter·1mo ago
conflating optimization landscape dimensionality with computational complexity
- 42·Reddit commenter·1mo ago·reply
The beauty of math is that it's objective. If the author's d^2 Pullback Theorem is just a hobbyist's fundamental misunderstanding of differential geometry, an expert should be able to quickly open the PDF, point out the mathematical flaw, and easily put this to rest. I just hope the proof gets evaluated on its mathematical merits, rather than being dismissed solely based on the author's lack of industry affiliation.
- 34·Reddit commenter·1mo ago
I surely don't have enough skills to validate it all, I'm just an engineer after all. But to my understanding, his math and reasoning are very sound. The only thing I would argue is that O(nd^3) is not necessarily better than O(n^2d), even if mathematically he's right. The reason is simple: in modern models, d is also pretty big, 128 or 256. d^3 for a head dim of 128 is 2,097,152. If your sequence length n is 2,048 (a standard block), n^2 is 4,194,304. Therefore, his math practically wins only when n is much bigger than d^2, which is not true for standard and small tasks (while being a…
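The crossover behind this comment falls straight out of the two cost expressions: n·d^3 beats n^2·d exactly when n > d^2. A quick back-of-envelope check (FLOP proxies only, ignoring constants and the hardware effects raised further down the thread):

```python
# FLOP proxies only: kernel attention ~ n*d^3, standard attention ~ n^2*d.
n = 2048  # the comment's "standard block"
for d in (64, 128, 256):
    kernel, softmax = n * d**3, n**2 * d
    print(f"d={d:>3}: kernel ~{kernel:.2e}, softmax ~{softmax:.2e}; "
          f"kernel is cheaper only once n > d^2 = {d**2:,}")
```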
- 30·Reddit commenter·1mo ago·reply
That's a wall of math, probably AI-generated. If you know a little about random projections and how they preserve linear separability, the result is not very surprising. Yet the d^2 seems the right dimension for geometric reasons, and that is new. Everything holds because of their Hypothesis 6.2, which assumes that stacks of quadratic attention can simulate polynomial attention with O(log p) depth. No real practical insight. I would reject at a top conference, yet accept at a workshop.
- 24·Reddit commenter·1mo ago
The framing is doing a lot of work here. n² gets all the attention because sequence length is the variable practitioners actually tune. d is fixed per architecture. If your model has d=4096, d²=16M is a constant you pay once per forward pass. n² scales with every request. That said, the geometry point isn't nothing. Attention compute is O(n²·d), not O(n²). The d factor is always there, just absorbed into the constant multiplier. Whether calling it a "d² problem" constitutes a fundamental misunderstanding or just a different frame depends entirely on what you're optimizing for. If you're com…
- 22·Reddit commenter·1mo ago·reply
On top of that I would argue that from an engineering perspective, matmul is pretty damn efficient now due to how tensor cores are designed. So probably it would still win the race, but that's something we can change by optimizing the chips
- 7·Reddit commenter·1mo ago·reply
That's a rather ivory-tower outlook. Indeed the field will advance, and some of the contributors will come from outside academia and the giant labs. Token-to-token attention is fundamentally deficient: it's bad on big O, and it also entangles relationships with memory without providing a good mechanism for compression. Further, it is opaque and offers minimal leverage for intervention. Just to be a prick, I'm going to have ChatGPT slop this together for me and find out for myself.
- 6·Reddit commenter·1mo ago·reply
If real mathematicians or experts from other fields take a crack at a problem, you get results like this all the time. It's also similar to some earlier result I remember.
- 5·Reddit commenter·1mo ago·reply
I completely agree with you. But to be honest, bringing it here is the absolute limit of what I can do. I'm just a regular community user who found the post, realized it was too important to be ignored, and translated it. I don't have the background or the network to forward this directly to the experts at the intersection of pure math and ML. I've carried the baton as far as I can. Now, I'm just hoping the internet does its thing and eventually gets this PDF onto the right desks.