The Embedding Geometry Hypothesis:
From Fourier Circuits to No-Q Attention

Nathan Rigoni

Phronesis Analytics — www.phronesis-analytics.com

March 2026

PREPRINT — UNDER REVIEW

Abstract

We present a unified account of why transformer token embeddings are the primary geometric foundation of attention and what follows from that observation. Starting from the mechanistic interpretability result that grokking on modular arithmetic coincides with Fourier circuit formation in the embedding, we introduce Prescribed Fourier Frequency Training (PFFT), which steers embedding gradients toward near-Nyquist frequency modes and achieves a 92.7% reduction in epochs-to-grokking (57 vs. 782) with a 97.9% reduction in the memorization phase.

PFFT works because it does two things simultaneously: it respects the embedding's geometric authority and it reduces gradient noise. The Sounding Hammer diagnostic reveals, however, that these gradient-domain techniques do not transfer to language model token embeddings: BPE vocabulary gradients are spectrally flat ($\rho = 0.42$), and applying Fourier steering to them causes a catastrophic BPC regression (2.90→9.47). Behavioral weight trajectory analysis of language models trained on TinyStories and FineWeb shows that Q-weight matrices are the primary locus of representational reorganization: their trajectories are sharper and more clustered than those of K, V, or MLP weights. This motivates No-Q attention: setting $\mathbf{Q} = \mathbf{x}$ (no projection) at every layer. No-Q attention achieves the same two goals as PFFT through an architectural change rather than a gradient-domain filter. The result is a 3.18% relative improvement in validation BPC on TinyStories and 2.24% on FineWeb with 8% fewer parameters, plus a 58.9% epochs-to-grokking (ETG) speedup on modular arithmetic. Taken together, these results support the Embedding Geometry Hypothesis: the token embedding is not a lookup table that feeds into attention; it is the attention query, and the Q projection is a redundant reparameterization of a signal the embedding has already encoded.

Key Results at a Glance:

PFFT (near-Nyquist): 92.7% grokking speedup (57 vs. 782 epochs), 97.9% memorization reduction on $(a+b) \bmod 97$
No-Q Attention: 3.18% relative BPC improvement on TinyStories, 2.24% on FineWeb, 8% fewer parameters
No-Q on grokking: 58.9% ETG speedup (383 vs. 933 epochs)
Sounding Hammer: predicts Fourier steering safety ($\rho = 0.82$ PASS for positional embeddings, $\rho = 0.42$ FAIL for BPE)

1 Introduction

The transformer computes self-attention via three symmetric projections:

$$\text{Attn}(\mathbf{x}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_h}}\right)\mathbf{V}, \quad \mathbf{Q} = \mathbf{x}\mathbf{W}_Q^\top,\; \mathbf{K} = \mathbf{x}\mathbf{W}_K^\top,\; \mathbf{V} = \mathbf{x}\mathbf{W}_V^\top \tag{1}$$

The matrices $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}$ are treated symmetrically in most analyses and architectures. We question this symmetry for $\mathbf{W}_Q$.

The investigation begins with grokking — the phenomenon in which a small transformer trained on modular arithmetic suddenly generalizes after hundreds of epochs of near-perfect training accuracy. The mechanistic interpretability literature has established that generalization coincides with the emergence of a sparse Fourier circuit in the token embedding, not in the attention weights. The attention layers then exploit the geometric structure that the embedding has established.

This suggests a hierarchy: the embedding sets the representational geometry; downstream layers are secondary processors of that geometry. If true, the Q projection — applied to the post-embedding hidden state — is reparameterizing a space the embedding has already structured. It is a redundant degree of freedom that competes with, rather than extends, the embedding's geometric authority.

Research arc. We develop this idea through four stages:

Modular arithmetic and PFFT. We show that prescribing near-Nyquist Fourier modes in the embedding gradient achieves a 92.7% grokking speedup, confirming that the embedding is the geometric root of generalization.
Why PFFT fails on language models. The Sounding Hammer diagnostic reveals that gradient-domain Fourier steering is not safe to apply to BPE vocabulary embeddings ($\rho = 0.42$). A different mechanism is needed for language.
Behavioral diagnosis. Weight trajectory analysis of language models trained on TinyStories and FineWeb reveals that Q-weight matrices — not K, V, or MLP — are the primary site of representational reorganization, pointing directly at the Q projection as the pathological component.
No-Q attention. Removing $\mathbf{W}_Q$ entirely ($\mathbf{Q} = \mathbf{x}$) achieves the same two goals as PFFT (respecting the embedding's geometric authority and reducing parameter noise) but through an architectural change that is safe for language models. Result: a 3.18% relative BPC improvement on TinyStories and 2.24% on FineWeb with 8% fewer parameters, and a 58.9% ETG speedup on modular arithmetic.

2 The Embedding as Geometric Foundation: Evidence from Modular Arithmetic

2.1 Grokking and Fourier Circuits

Power et al. (2022) observed that small transformers trained on modular arithmetic $(a+b) \bmod p$ exhibit delayed generalization: training accuracy reaches ~100% hundreds of epochs before validation accuracy does. Nanda et al. (2023) identified the mechanism: generalization coincides with the emergence of a Fourier circuit in the token embedding using modes $\{1, 14, 41\}$ for $p = 97$. The attention weights do not carry this structure independently — they inherit it from the embedding geometry.

This is strong evidence for the Embedding Geometry Hypothesis: the token embedding is the primary carrier of the representational geometry that determines what downstream attention layers can efficiently compute.

2.2 Prescribed Fourier Frequency Training (PFFT)

We ask: if the embedding is the root, can we accelerate grokking by steering embedding gradients toward Fourier modes that carry more useful structure?

Prescribed Fourier Frequency Training (PFFT). After each backward pass, we project the token embedding gradient onto a prescribed frequency set $S$: $$\nabla\mathbf{E} \;\leftarrow\; \text{irfft}\!\left(\text{rfft}(\nabla\mathbf{E},\,\text{dim}=0) \odot M_S,\; n=p,\; \text{dim}=0\right)$$ where $M_S[k] = \mathbf{1}[k \in S]$. PFFT incurs zero inference cost and negligible training overhead.
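
The projection above can be sketched in a few lines. This is a minimal illustration in NumPy rather than the training framework used for the experiments, which the text does not specify; the function name `pfft_project` and the random stand-in gradient are assumptions made for the example.

```python
import numpy as np

def pfft_project(grad, modes):
    """Keep only the prescribed rfft bins of `grad` along axis 0 (the token axis)."""
    p = grad.shape[0]
    spectrum = np.fft.rfft(grad, axis=0)          # shape (p // 2 + 1, d)
    mask = np.zeros(spectrum.shape[0], dtype=bool)
    mask[list(modes)] = True                      # M_S[k] = 1 if k is in S, else 0
    spectrum[~mask] = 0.0                         # zero every bin outside S
    return np.fft.irfft(spectrum, n=p, axis=0)    # back to shape (p, d)

rng = np.random.default_rng(0)
p, d = 97, 128
near_nyquist = [30, 35, 40, 45, 48]
grad = rng.standard_normal((p, d))                # hypothetical stand-in for the embedding gradient
projected = pfft_project(grad, near_nyquist)
```

In a training loop, the same operation would presumably be applied to the embedding's gradient between the backward pass and the optimizer step. Because the mask is a projection, applying it twice gives the same result as applying it once.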

2.3 Results: 92.7% Grokking Speedup

Table 1: Grokking acceleration on $(a+b)\bmod 97$ (mean ± std, 3 seeds). Mem = memorization epoch (train acc ≥ 99%).
Method	Prescribed Modes $S$	ETG	Mem	Speedup
Baseline	—	782 ± 95	451	—
PFFT Nanda	$\{1, 14, 41\}$	191 ± 5	32	+75.5%
PFFT quint	$\{1, 2, 3, 4, 5\}$	261 ± 3	90	+66.7%
PFFT near-Nyquist	$\mathbf{\{30, 35, 40, 45, 48\}}$	57 ± 4	9	+92.7%
Adaptive $K=5$	top-5 dynamic	97 ± 7	38	+87.6%

Near-Nyquist modes $\{30, 35, 40, 45, 48\}$ achieve ETG = 57 (92.7% speedup) and reduce the memorization phase from 451 to 9 epochs (97.9% reduction). Three findings are especially diagnostic:

Near-Nyquist beats task-correct modes. The mechanistic interpretability literature identifies $\{1, 14, 41\}$ as the "correct" modes for $p = 97$. Prescribing these yields 75.5% speedup; near-Nyquist $\{30, 35, 40, 45, 48\}$ yields 92.7%. The speedup comes from gradient noise reduction, not mode guidance. The task-relevant Fourier structure then emerges naturally via the optimization landscape.

Memorization is almost entirely bypassed. Standard training spends 451 epochs memorizing before generalizing. PFFT near-Nyquist reaches the memorization threshold (train accuracy ≥ 99%) at epoch 9 and groks at epoch 57: the model generalizes before memorizing in any meaningful sense. Memorization is not a necessary stage; it is a symptom of gradient noise.

Table 2: Cross-$p$ validation. "DNF" = did not grok within 1500 epochs.
Setup	Prescription	ETG	vs. Baseline
$p=97$, baseline	—	782	—
$p=97$, near-Nyquist	$\{30,35,40,45,48\}$	57	+92.7%
$p=113$, baseline	—	321	—
$p=113$, near-Nyquist	$\{34,40,46,51,56\}$	75	+76.6%
$p=97$, single Nyquist	$\{48\}$	DNF	—

Figure 1a: Epochs-to-grokking (ETG) across all 14 variants × 3 seeds. Near-Nyquist PFFT achieves the strongest and most consistent speedup.

Figure 1b: Memorization epoch vs. ETG. Near-Nyquist PFFT (lower-left outlier) nearly eliminates the memorization phase entirely — a qualitatively different learning trajectory from all other methods.

2.4 Why PFFT Works: Two Mechanisms

PFFT simultaneously achieves two things:

Embedding geometry authority. By projecting embedding gradients onto a frequency subspace, PFFT prevents the optimizer from diffusing the embedding's representational structure across a noisy gradient landscape. The embedding is free to crystallize its Fourier circuit without the Q, K, and V projections competing to restructure it.
Gradient noise reduction. Constraining the gradient to a low-dimensional frequency subspace removes the out-of-subspace noise component that causes each gradient step to partially undo the previous one. The memorization phase is the optimizer spinning on this noise; removing the noise collapses the memorization phase entirely.

Both mechanisms are necessary. Prescribing only a single mode (even the Nyquist, $k = 48$) fails to grok entirely: a single mode cannot span the multi-dimensional Fourier circuit required by modular arithmetic, which needs ≥3 independent phase components.

3 Why Fourier Steering Fails on Language Models

3.1 The Sounding Hammer

Before applying PFFT to any model, we need to answer two questions: (1) does the gradient along the target parameter axis have structured Fourier content, and (2) if so, which modes dominate? The Sounding Hammer is a pre-training diagnostic that answers both.

Definition (Sounding Hammer). Given a parameter tensor $\mathbf{W}$, collect the aggregate gradient $\bar{G} = \mathbb{E}_\text{batch}[\nabla_\mathbf{W} \mathcal{L}]$ over a representative data sample. Apply the real FFT along the parameter axis of interest to get $\hat{G}[k, \cdot] = \text{rfft}(\bar{G}, \text{dim}=0)[k]$. Define the power spectrum $P(k) = \|\hat{G}[k, \cdot]\|^2$ and the gradient regularity: $$\rho = \frac{\sum_{k \in \text{top-}K} P(k)}{\sum_k P(k)}$$ where the numerator sums over the $K$ highest-power bins and $K$ is a small fraction of the total (e.g., $K=16$ of 257 bins for an axis of length 512).

The Sounding Hammer returns two outputs:

The dominant mode spectrum — ranked frequencies by power $P(k)$. For modular arithmetic, this recovers the Fourier circuit modes identified by mechanistic interpretability directly from the gradient, without requiring a trained model.
The regularity score $\rho$ — a measure of spectral concentration. High $\rho$ (close to 1) means the gradient is spectrally sparse: projection preserves signal while discarding noise. Low $\rho$ means the gradient is spectrally uniform: projection discards signal indiscriminately.
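
The two outputs can be illustrated with a minimal NumPy sketch of the definition. The function name `sounding_hammer` and both synthetic gradients are assumptions for the example; they are not the gradients measured in this paper.

```python
import numpy as np

def sounding_hammer(mean_grad, top_k=16):
    """Return (rho, dominant_modes) for an aggregate gradient of shape (n, d)."""
    spectrum = np.fft.rfft(mean_grad, axis=0)        # shape (n // 2 + 1, d)
    power = np.sum(np.abs(spectrum) ** 2, axis=1)    # P(k) = ||G_hat[k, :]||^2
    ranked = np.argsort(power)[::-1]                 # bins sorted by descending power
    rho = power[ranked[:top_k]].sum() / power.sum()
    return float(rho), ranked[:top_k]

rng = np.random.default_rng(0)
n, d = 512, 64
idx = np.arange(n)[:, None]

# Synthetic "ordered" gradient: each column is a sinusoid at mode 3 or 7 plus small noise.
freqs = rng.choice([3, 7], size=(1, d))
phases = rng.uniform(0, 2 * np.pi, size=(1, d))
structured = np.sin(2 * np.pi * freqs * idx / n + phases)
structured += 0.05 * rng.standard_normal((n, d))

# Synthetic "unordered" gradient: white noise with no structure along the axis.
noise = rng.standard_normal((n, d))

rho_structured, modes = sounding_hammer(structured)
rho_noise, _ = sounding_hammer(noise)
```

On these synthetic inputs the structured gradient concentrates almost all of its power in modes 3 and 7, so its $\rho$ is close to 1. The white-noise gradient spreads power roughly evenly across all 257 bins, so its $\rho$ is small.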

Natural Ordering Condition (NOC). High $\rho$ is possible only if nearby indices along the parameter axis correspond to semantically or geometrically related inputs. If the parameter axis has natural ordering, Fourier steering is safe; if it does not, Fourier steering destroys signal. The modular arithmetic token axis satisfies the NOC by construction. BPE token indices have no such ordering.
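
A toy experiment makes the condition concrete. The sketch below (NumPy, synthetic data, with `regularity` as an illustrative helper name) computes $\rho$ for a signal that is smooth along its index axis and again after the rows are randomly reassigned to indices. The rows are identical in both cases; only the ordering differs.

```python
import numpy as np

def regularity(mean_grad, top_k=16):
    """Fraction of rfft power (along axis 0) held by the top_k strongest bins."""
    power = np.sum(np.abs(np.fft.rfft(mean_grad, axis=0)) ** 2, axis=1)
    return float(np.sort(power)[-top_k:].sum() / power.sum())

rng = np.random.default_rng(0)
n, d = 512, 32
position = np.arange(n)[:, None] / n

# Smooth along the index axis: neighbouring rows are nearly identical.
ordered = np.cos(2 * np.pi * position * rng.integers(1, 5, size=(1, d)))

# Same rows, but assigned to arbitrary indices (no natural ordering).
shuffled = ordered[rng.permutation(n)]

rho_ordered = regularity(ordered)
rho_shuffled = regularity(shuffled)
```

Shuffling leaves every row intact but spreads the power across the whole spectrum, so a projection onto a few bins would discard most of the signal.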

3.2 Sounding Hammer Applied to GPT-2

Table 3: Sounding Hammer results on GPT-2 Small (200 TinyStories documents, top-$K=16$ of 257 frequency bins).
Tensor	Description	$\rho$
`wpe.weight`	Positional embedding (512×768)	0.82 (PASS)
`h.*.mlp.c_proj.weight`	MLP output projections (avg 12 layers)	0.45–0.54
`wte.weight`	BPE token embedding (50,257×768)	0.42 (FAIL)

Figure 2: Sounding Hammer gradient power spectra. The positional embedding (left) shows a clear low-frequency peak ($\rho = 0.82$, PASS). The BPE token embedding (right) is nearly flat ($\rho = 0.42$, FAIL) — no frequency subspace is more informative than any other.

3.3 Failure of Fourier Steering on Language Models

Applying PFFT near-Nyquist to BPE vocabulary gradients causes BPC to increase from 2.90 to 9.47, a catastrophic regression. Projecting onto 5 of the 25,129 frequency bins discards nearly all informative gradient signal. The degradation is not specific to mode choice; it reflects the NOC violation.
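
The scale of the projection can be checked with a short calculation. It uses the standard fact that the real FFT of a length-$n$ signal has $\lfloor n/2 \rfloor + 1$ bins; the variable names are illustrative.

```python
vocab_size = 50_257                  # GPT-2 BPE vocabulary size (Table 3)
n_bins = vocab_size // 2 + 1         # rfft bins for a real signal of this length
highest_bin = n_bins - 1             # bins are indexed from 0
kept = 5                             # modes retained by a five-mode prescription
kept_fraction = kept / n_bins
print(n_bins, highest_bin, f"{kept_fraction:.4%}")
```

The highest bin index is 25,128, which matches the upper end of the near-Nyquist prescription in Table 4. A five-mode mask retains about 0.02% of the bins.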

Table 4: Fourier steering (labeled FGP) applied to BPE vocabulary gradients on TinyStories. NOC failure causes catastrophic regression.
Variant	Description	BPC at 10k steps
Baseline	No FGP	2.90
PFFT near-Nyquist	$\{25124, \ldots, 25128\}$	9.47 (catastrophic)
Adaptive FGP $K=5$	Top-5 gradient modes	~4.0 (degraded)

This leaves a key question: gradient-domain noise reduction works beautifully for modular arithmetic, and the embedding is clearly the geometric root in both settings. But the noise reduction mechanism cannot be applied the same way. What is the right intervention for language models?

4 Behavioral Analysis: Diagnosing the Q Projection

4.1 Weight Trajectory Analysis

To understand which components of the transformer undergo the most significant representational change during training, we collect per-step weight snapshots of the Q, K, V, and MLP weight matrices across all layers for language models trained on TinyStories and FineWeb.

Each snapshot is encoded by a behavioral autoencoder (dual-objective: weight reconstruction loss + training-loss prediction from the bottleneck) and projected to 2D via PyMDE. HDBSCAN clustering reveals the number and structure of distinct behavioral phases.

4.2 Results: Q Weights Are the Primary Reorganization Site

Figure 3: Behavioral autoencoder trajectories for the Q weight matrices across all 5 attention layers of a TinyStories language model. Each point is a per-step weight snapshot projected to 2D by PyMDE; color encodes HDBSCAN cluster membership. Q weights show the sharpest transitions and most structured clustering, correlated with training loss milestones. K, V, MLP-up, and MLP-down weights evolve more smoothly and exhibit fewer distinct behavioral phases.

The pattern is consistent across all layers and replicated on the FineWeb model:

Q weights reorganize sharply. The Q-weight trajectory shows tight, well-separated behavioral clusters with sharp transitions between them. Each transition corresponds to a reorganization of the query geometry — the model abruptly learns a new "what to look for" strategy. These transitions are correlated with drops in training loss.
K and V weights evolve smoothly. The K-weight trajectory shows some structure but with softer cluster boundaries. The V-weight trajectory is smoother still. This is consistent with K playing a "what to compare against" role that updates incrementally.
MLP weights are the most stable. The MLP-up and MLP-down trajectories show the least clustering — these layers perform incremental refinement rather than representational reorganization.

4.3 Interpretation

The Q weights are doing something the K and V weights are not: they are repeatedly reorganizing to match a shifting query geometry. But the query geometry is exactly what the embedding encodes. Under the Embedding Geometry Hypothesis, the Q projection is competing with the embedding for representational ownership of the query space, and the sharpness of the Q trajectory reflects the cost of that competition.

This analysis points directly at a hypothesis: if we remove the Q projection, the embedding can set the query geometry without competition, and the representational reorganization cost disappears.

5 No-Q Attention

5.1 Definition

No-Q attention replaces standard self-attention (Eq. 1) with:

$$\text{NoQ-Attn}(\mathbf{x}) = \text{softmax}\!\left(\frac{\mathbf{x}\,\mathbf{K}^\top}{\sqrt{d_h}}\right)\mathbf{V}, \quad \mathbf{K} = \mathbf{x}\mathbf{W}_K^\top,\; \mathbf{V} = \mathbf{x}\mathbf{W}_V^\top \tag{2}$$

The post-LayerNorm hidden state $\mathbf{x}$ serves directly as the query. $\mathbf{W}_Q$ is removed entirely; $\mathbf{W}_K$, $\mathbf{W}_V$, and $\mathbf{W}_O$ are retained without modification.
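
Eq. 1 and Eq. 2 can be compared side by side in a minimal single-head NumPy sketch. The simplifications are assumptions made for brevity and are not details taken from the experiments: one head with $d_h = d$, no causal mask, no output projection, and random weights.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def standard_attention(x, w_q, w_k, w_v):
    """Eq. 1: queries, keys and values are all learned projections of x."""
    q, k, v = x @ w_q.T, x @ w_k.T, x @ w_v.T
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def noq_attention(x, w_k, w_v):
    """Eq. 2: the hidden state x is used directly as the query."""
    k, v = x @ w_k.T, x @ w_v.T
    return softmax(x @ k.T / np.sqrt(x.shape[-1])) @ v

rng = np.random.default_rng(0)
seq_len, d = 8, 16
x = rng.standard_normal((seq_len, d))
w_k, w_v = rng.standard_normal((2, d, d)) / np.sqrt(d)

out_noq = noq_attention(x, w_k, w_v)
out_std_identity_q = standard_attention(x, np.eye(d), w_k, w_v)
```

In this sketch No-Q attention coincides with standard attention whose $\mathbf{W}_Q$ is fixed to the identity, so the two outputs agree to floating-point precision.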

5.2 Why This Achieves Both Goals

Goal 1: Embedding geometry authority. When $\mathbf{Q} = \mathbf{x}$, the query is the hidden state as shaped by the embedding and all preceding layers. There is no learned projection competing to reshape the query geometry. The embedding's representational choices propagate directly into the query-key dot product. This is the architectural equivalent of PFFT's gradient-domain intervention: instead of filtering gradients to keep the embedding's structure intact, we remove the parameter that would otherwise overwrite it.

Goal 2: Gradient noise reduction. Removing $\mathbf{W}_Q$ eliminates $L \times d^2$ parameters. For our 4-layer, $d = 256$ model: $4 \times 65{,}536 \approx 262$K parameters (8% of total). Fewer parameters means a smaller-dimensional optimization landscape with lower inherent noise. Unlike PFFT, this noise reduction requires no knowledge of the gradient's spectral structure and is always safe to apply.

Why K and V are kept. $\mathbf{W}_K$ is necessary because K must be in a space compatible with the dot product against $\mathbf{x}$: without $\mathbf{W}_K$, the attention logits reduce to the pairwise inner products $\mathbf{x}_i^\top \mathbf{x}_j$, so the attention pattern collapses to pure self-similarity. $\mathbf{W}_V$ is necessary to select what information flows forward, a distinct operation from the query-key relevance computation.
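
The role of $\mathbf{W}_K$ can be seen in a small NumPy sketch of the case where both projections are dropped, so that $\mathbf{Q} = \mathbf{K} = \mathbf{x}$. The unit-norm rows are an assumption standing in for LayerNorm output, and the data is random.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 32
x = rng.standard_normal((seq_len, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # equal-norm rows (stand-in for LayerNorm)

scores = x @ x.T                                # Q = K = x gives a Gram matrix of inner products
is_symmetric = np.allclose(scores, scores.T)
self_is_max = np.array_equal(scores.argmax(axis=1), np.arange(seq_len))
```

With equal-norm rows, the Cauchy-Schwarz inequality makes each token's largest logit its own, and the score matrix is symmetric. A learned $\mathbf{W}_K$ breaks that symmetry and lets a token attend more strongly to something other than itself.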

6 Experiments

6.1 Setup

All language model experiments use a byte-level character language model with $d = 256$, $L = 4$ layers, $H = 4$ heads, $d_\text{MLP} = 1024$, sequence length 256, vocabulary size 256. Total parameter count: 3.35M (baseline) / 3.09M (No-Q, −8%). Trained with AdamW, LR $= 3 \times 10^{-4}$, weight decay $= 0.1$, cosine schedule, batch size 64, 5000 steps.

Datasets:

TinyStories: ~475MB of short children's stories; simple, repetitive distribution.
FineWeb: 2GB byte-sampled web text; diverse distribution covering news, blogs, code, and science. Uses $L = 5$, sequence length 512, batch size 32, 10,000 training steps.
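
The parameter totals quoted above can be reproduced by a short count under one plausible layout. The layout is an assumption, not something the text states: bias-free linear layers, learned positional embeddings, an untied output head, and LayerNorms with weight and bias (two per block plus a final one). The helper name `count_params` is illustrative.

```python
def count_params(d, n_layers, d_mlp, vocab, seq_len, use_q=True):
    attn = (4 if use_q else 3) * d * d     # W_Q (optional), W_K, W_V, W_O; no biases (assumed)
    mlp = 2 * d * d_mlp                    # up and down projections; no biases (assumed)
    norms = 2 * 2 * d                      # two LayerNorms per block, weight + bias (assumed)
    per_layer = attn + mlp + norms
    embeddings = vocab * d + seq_len * d   # token + learned positional embeddings (assumed)
    head = vocab * d + 2 * d               # untied output head + final LayerNorm (assumed)
    return n_layers * per_layer + embeddings + head

# TinyStories configuration: d=256, L=4, d_MLP=1024, vocab=256, seq_len=256
tiny_base = count_params(256, 4, 1024, 256, 256)
tiny_noq = count_params(256, 4, 1024, 256, 256, use_q=False)

# FineWeb configuration: L=5, seq_len=512
fine_base = count_params(256, 5, 1024, 256, 512)
fine_noq = count_params(256, 5, 1024, 256, 512, use_q=False)

tiny_reduction = 1 - tiny_noq / tiny_base
fine_reduction = 1 - fine_noq / fine_base
```

Under these assumptions the counts round to 3.347M and 3.085M for TinyStories and to 4.200M and 3.872M for FineWeb. The reductions are roughly 7.8% in both cases. Agreement with the reported totals does not confirm the layout, since other combinations of choices could yield the same figures.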

6.2 Main Result: TinyStories

Table 5: No-Q attention vs. standard attention on TinyStories.
Variant	Params	Val BPC	Δ BPC	Δ%
Baseline (standard)	3.347M	1.0819	—	—
No-Q attention	3.085M	1.0475	$-0.0344$	+3.18%

No-Q attention improves validation BPC from 1.0819 to 1.0475 — a 3.18% relative improvement — with 8% fewer parameters. The improvement appears early in training and persists throughout. This is the largest improvement of any architectural modification tested, achieved by removing computation rather than adding it.

Figure 4: Training and validation BPC curves for baseline and No-Q attention on TinyStories. Left: val BPC. Center: train BPC. Right: K-pathway alignment metric (leading singular value ratio of $\mathbf{W}_K$ for No-Q; rank-1 Q–K cosine similarity for baseline). No-Q consistently outperforms the baseline throughout training.

6.3 Generalization to FineWeb

Table 6: No-Q attention on FineWeb (5-layer model, seq=512, 10K steps).
Variant	Params	Val BPC	Δ BPC	Δ%
Baseline (standard)	4.200M	1.8942	—	—
No-Q attention	3.872M	1.8518	$-0.0424$	+2.24%

No-Q attention generalizes from TinyStories to the more challenging and diverse FineWeb corpus. The 5-layer model achieves a 2.24% BPC improvement while removing 7.8% of parameters. The consistency of the result across both datasets is important: FineWeb spans news, blogs, science, and code — a much richer distribution than children's stories. The improvement is not an artifact of distributional simplicity.

Figure 5: Training and validation BPC curves for baseline and No-Q attention on FineWeb. The No-Q improvement is consistent throughout training on this more challenging and diverse corpus.

6.4 Connection to Grokking

We test whether No-Q attention changes grokking dynamics on modular arithmetic. Setup: $(a+b)\bmod 97$, $d = 128$, $L = 2$, $H = 4$, 40% train split, AdamW LR $= 10^{-3}$, WD $= 1.0$, up to 1500 epochs, 3 seeds.

Table 7: No-Q attention on modular arithmetic $(a+b)\bmod 97$. ETG = mean epochs-to-grokking (val acc ≥ 0.99). Reference ETG for PFFT near-Nyquist is 57 epochs.
Variant	ETG (mean)	ETG (seeds)	Mem. eps	Speedup
Baseline	933	925, 1125, 750	50	—
No-Q attention	383	325, 350, 475	50	+58.9%
PFFT near-Nyquist (ref.)	57	—	9	+92.7%

No-Q attention reduces mean ETG from 933 to 383 epochs — a 58.9% speedup. The memorization epoch (mem-eps = 50) is identical for both variants. No-Q does not accelerate memorization; it shortens the gap between memorization and generalization. The standard model spends 883 epochs after memorization before generalizing; No-Q spends only 333. Under the Embedding Geometry Hypothesis, removing $\mathbf{W}_Q$ eliminates the competition for representational ownership of the query space, allowing the embedding's Fourier structure to be directly exploited once memorization is complete.
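
The figures in this paragraph follow from the per-seed values in Table 7. The sketch below reproduces them, taking speedup to mean the fractional reduction in ETG relative to the baseline (an assumption that matches the reported numbers); the helper name `speedup` is illustrative.

```python
from statistics import mean

def speedup(etg, baseline_etg):
    """Fractional reduction in epochs-to-grokking relative to the baseline."""
    return 1.0 - etg / baseline_etg

baseline_seeds = [925, 1125, 750]     # Table 7, baseline
noq_seeds = [325, 350, 475]           # Table 7, No-Q attention
mem_epoch = 50                        # memorization epoch, identical for both variants

baseline_etg = mean(baseline_seeds)   # about 933
noq_etg = mean(noq_seeds)             # about 383
noq_speedup = speedup(noq_etg, baseline_etg)

# Epochs between memorization and generalization, using the rounded means.
gap_baseline = round(baseline_etg) - mem_epoch
gap_noq = round(noq_etg) - mem_epoch
```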

Figure 6: Validation accuracy on $(a+b)\bmod 97$ for baseline and No-Q attention (3 seeds each). Dashed line: grokking threshold (0.99 accuracy). No-Q grokking occurs at mean ETG=383 vs. 933 for baseline (58.9% speedup).

7 Discussion

7.1 A Unified View

The results across the investigation tell a coherent story:

Fourier circuits in modular arithmetic confirm that the embedding is the primary locus of generalizing structure. Attention layers inherit the embedding's geometry; they do not create it.
PFFT shows that gradient-domain noise reduction at the embedding level achieves dramatic generalization speedup (92.7%), confirming the two-mechanism picture (embedding geometry authority + noise reduction).
Sounding Hammer + language model failure shows that the gradient-domain approach cannot be applied directly to BPE or byte token embeddings. The intervention must be architectural.
Behavioral analysis pinpoints the Q projection as the component undergoing the most disruptive representational reorganization — a signature of competition with the embedding for query geometry.
No-Q attention resolves the competition architecturally, achieving both goals without any gradient-domain intervention, and generalizes safely to language.

7.2 The Q–K Asymmetry

The standard transformer treats $\mathbf{W}_Q$ and $\mathbf{W}_K$ symmetrically, but there is a fundamental asymmetry in their roles. $\mathbf{W}_K$ produces the key space: it defines "what to compare against" for each position — a genuinely helpful degree of freedom. Without $\mathbf{W}_K$, the attention pattern collapses to pure self-similarity.

$\mathbf{W}_Q$ produces the query space: it defines "what this token is looking for." But this information is already encoded in the embedding geometry. What a token wants is precisely what the embedding encodes. $\mathbf{W}_Q$ reparameterizes a signal that was already present.

7.3 Implications

Standard transformers over-parameterize the query pathway. Every $d^2$ parameters spent on $\mathbf{W}_Q$ per layer is a parameter budget that would be better spent elsewhere — or not spent at all.
No-Q attention is free at byte vocabulary scale. At 256 byte tokens, the embedding geometry fully determines the query. Whether this holds at BPE scale (50K tokens) is an open question: with a much richer vocabulary, the Q projection may serve a more meaningful adaptation role.
No-Q + PFFT may combine additively. On modular arithmetic, No-Q achieves 58.9% ETG speedup and PFFT achieves 92.7%. Whether their combination yields a further speedup is a natural next experiment.

8 Related Work

Grokking and Fourier circuits. Power et al. (2022) introduced grokking. Nanda et al. (2023) showed generalization coincides with Fourier circuit formation in the token embedding. GrokFast amplifies slow-gradient components but requires the memorization phase first; PFFT sidesteps memorization entirely.

Attention simplifications. Multi-query attention (MQA) and grouped-query attention (GQA) share K and V heads across Q heads to reduce KV-cache memory. Linformer and linear attention modify the attention kernel. None of these is equivalent to No-Q attention, which removes the Q projection matrix entirely.

Spectral bias and gradient dynamics. Rahaman et al. (2019) established that gradient descent is biased toward low-frequency solutions. This bias is harmful for modular arithmetic, where the optimal solution requires high-frequency representations near the Nyquist limit.

Embedding structure. Our work argues that the embedding's geometric independence must be preserved, not post-processed, and that the Q projection is the primary threat to that independence.

9 Conclusion

We have presented the Embedding Geometry Hypothesis and traced its implications from modular arithmetic to language modeling.

Starting from Fourier circuit formation in grokking, we showed that prescribing near-Nyquist embedding gradients (PFFT) achieves 92.7% ETG speedup by simultaneously respecting the embedding's geometric authority and reducing gradient noise. The Sounding Hammer revealed that the same gradient-domain technique cannot be applied safely to language model token embeddings. Behavioral weight trajectory analysis identified the Q projection as the primary site of representational reorganization, pointing directly at the intervention: remove $\mathbf{W}_Q$.

No-Q attention — setting $\mathbf{Q} = \mathbf{x}$ at every layer — improves language modeling BPC by +3.18% on TinyStories and +2.24% on FineWeb while removing 8% of parameters, and accelerates grokking by 58.9% on modular arithmetic.

The embedding is not a shallow lookup table that feeds into attention — it is the attention query. K and V provide the comparison structure that the embedding's geometry can exploit. The Q projection is the part we can — and should — remove.

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. (2022). Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177.
Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. (2023). Progress measures for grokking via mechanistic interpretability. ICLR 2023.
Liu, Z., Michaud, E. J., and Tegmark, M. (2023). Omnigrok: Grokking beyond algorithmic data. ICLR 2023.
Shazeer, N. (2019). Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. (2023). GQA: Training generalized multi-query transformer models from multi-head checkpoints. EMNLP 2023.
Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast autoregressive transformers with linear attention. ICML 2020.
Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., and Courville, A. (2019). On the spectral bias of neural networks. ICML 2019.
Agrawal, A. and Boyd, S. (2021). Minimum-distortion embedding. Foundations and Trends in Machine Learning, 14(3):211–378.
McInnes, L., Healy, J., and Astels, S. (2017). hdbscan: Hierarchical density based clustering. Journal of Open Source Software, 2(11):205.
Eldan, R. and Li, Y. (2023). TinyStories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759.
Penedo, G., Kydlíček, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., and Wolf, T. (2024). The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557.
Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization. ICLR 2019.
Rigoni, N. (2026). Spectral alignment: Engineering the Fourier path to generalization in neural networks. Phronesis Analytics preprint.
