The Embedding Geometry Hypothesis:
From Fourier Circuits to No-Q Attention

Nathan Rigoni

Phronesis Analytics — www.phronesis-analytics.com

March 2026

PREPRINT — UNDER REVIEW

Abstract

We present a unified account of why transformer token embeddings are the primary geometric foundation of attention and what follows from that observation. Starting from the mechanistic interpretability result that grokking on modular arithmetic coincides with Fourier circuit formation in the embedding, we introduce Prescribed Fourier Frequency Training (PFFT), which steers embedding gradients toward near-Nyquist frequency modes and achieves a 92.7% reduction in epochs-to-grokking (57 vs. 782) with a 97.9% reduction in the memorization phase.

PFFT works because it does two things simultaneously: it respects the embedding's geometric authority and it reduces gradient noise. The Sounding Hammer diagnostic reveals, however, that these gradient-domain techniques do not transfer to language model token embeddings: BPE vocabulary gradients are spectrally flat ($\rho = 0.42$), causing catastrophic BPC regression (2.90→9.47) when Fourier steering is applied. Behavioral weight trajectory analysis of language models trained on TinyStories and FineWeb shows that Q-weight matrices are the primary locus of representational reorganization — sharper and more clustered than K, V, or MLP trajectories. This motivates No-Q attention: setting $\mathbf{Q} = \mathbf{x}$ (no projection) at every layer. No-Q attention achieves the same two goals as PFFT through an architectural change rather than a gradient-domain filter. The result: +3.18% validation BPC on TinyStories and +2.24% on FineWeb with 8% fewer parameters, plus a 58.9% ETG speedup on modular arithmetic. Taken together, these results support the Embedding Geometry Hypothesis: the token embedding is not a lookup table that feeds into attention — it is the attention query, and the Q projection is a redundant reparameterization of a signal the embedding has already encoded.


Key Results at a Glance:
  • PFFT (near-Nyquist): 92.7% grokking speedup (57 vs. 782 epochs), 97.9% memorization reduction on $(a+b) \bmod 97$
  • No-Q Attention: +3.18% BPC on TinyStories, +2.24% BPC on FineWeb, 8% fewer parameters
  • No-Q on grokking: 58.9% ETG speedup (383 vs. 933 epochs)
  • Sounding Hammer: predicts Fourier steering safety ($\rho = 0.82$ PASS for positional embeddings, $\rho = 0.42$ FAIL for BPE)

1   Introduction

The transformer computes self-attention via three symmetric projections:

$$\text{Attn}(\mathbf{x}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_h}}\right)\mathbf{V}, \quad \mathbf{Q} = \mathbf{x}\mathbf{W}_Q^\top,\; \mathbf{K} = \mathbf{x}\mathbf{W}_K^\top,\; \mathbf{V} = \mathbf{x}\mathbf{W}_V^\top \tag{1}$$

The matrices $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}$ are treated symmetrically in most analyses and architectures. We question this symmetry for $\mathbf{W}_Q$.
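For concreteness, a single-head version of Eq. 1 (without masking or the output projection; function names are ours, not from any particular codebase) can be sketched as:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, W_Q, W_K, W_V):
    """Single-head self-attention per Eq. 1: x is (T, d), each W_* is (d, d)."""
    Q = x @ W_Q.T
    K = x @ W_K.T
    V = x @ W_V.T
    d_h = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_h)           # (T, T) attention logits
    return softmax(scores) @ V                # (T, d) attended values
```

The symmetry in question is visible here: the three projections are structurally interchangeable, differing only in where their outputs enter the dot product.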

The investigation begins with grokking — the phenomenon in which a small transformer trained on modular arithmetic suddenly generalizes after hundreds of epochs of near-perfect training accuracy. The mechanistic interpretability literature has established that generalization coincides with the emergence of a sparse Fourier circuit in the token embedding, not in the attention weights. The attention layers then exploit the geometric structure that the embedding has established.

This suggests a hierarchy: the embedding sets the representational geometry; downstream layers are secondary processors of that geometry. If true, the Q projection — applied to the post-embedding hidden state — is reparameterizing a space the embedding has already structured. It is a redundant degree of freedom that competes with, rather than extends, the embedding's geometric authority.

Research arc. We develop this idea through four stages:

  1. Modular arithmetic and PFFT. We show that prescribing near-Nyquist Fourier modes in the embedding gradient achieves a 92.7% grokking speedup, confirming that the embedding is the geometric root of generalization.
  2. Why PFFT fails on language models. The Sounding Hammer diagnostic reveals that gradient-domain Fourier steering is not safe to apply to BPE vocabulary embeddings ($\rho = 0.42$). A different mechanism is needed for language.
  3. Behavioral diagnosis. Weight trajectory analysis of language models trained on TinyStories and FineWeb reveals that Q-weight matrices — not K, V, or MLP — are the primary site of representational reorganization, pointing directly at the Q projection as the pathological component.
  4. No-Q attention. Removing $\mathbf{W}_Q$ entirely ($\mathbf{Q} = \mathbf{x}$) achieves the same two goals as PFFT — respecting the embedding's geometric authority and reducing parameter noise — but through an architectural change that is safe for language models. Result: +3.18% BPC on TinyStories, +2.24% on FineWeb, 8% fewer parameters, and 58.9% ETG speedup on modular arithmetic.

2   The Embedding as Geometric Foundation: Evidence from Modular Arithmetic

2.1   Grokking and Fourier Circuits

Power et al. (2022) observed that small transformers trained on modular arithmetic $(a+b) \bmod p$ exhibit delayed generalization: training accuracy reaches ~100% hundreds of epochs before validation accuracy does. Nanda et al. (2023) identified the mechanism: generalization coincides with the emergence of a Fourier circuit in the token embedding using modes $\{1, 14, 41\}$ for $p = 97$. The attention weights do not carry this structure independently — they inherit it from the embedding geometry.

This is strong evidence for the Embedding Geometry Hypothesis: the token embedding is the primary carrier of the representational geometry that determines what downstream attention layers can efficiently compute.

2.2   Prescribed Fourier Frequency Training (PFFT)

We ask: if the embedding is the root, can we accelerate grokking by steering embedding gradients toward Fourier modes that carry more useful structure?

Prescribed Fourier Frequency Training (PFFT). After each backward pass, we project the token embedding gradient onto a prescribed frequency set $S$: $$\nabla\mathbf{E} \;\leftarrow\; \text{irfft}\!\left(\text{rfft}(\nabla\mathbf{E},\,\text{dim}=0) \odot M_S,\; n=p,\; \text{dim}=0\right)$$ where $M_S[k] = \mathbf{1}[k \in S]$. PFFT incurs zero inference cost and negligible training overhead.
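A minimal sketch of the projection, written here as a standalone NumPy function rather than the backward-pass hook used in training (function and variable names are ours):

```python
import numpy as np

def pfft_project(grad_E, modes, p):
    """Project the embedding gradient onto a prescribed Fourier mode set.

    grad_E: (p, d) gradient of the token embedding table.
    modes:  prescribed frequency set S, e.g. {30, 35, 40, 45, 48} for p = 97.
    """
    G_hat = np.fft.rfft(grad_E, axis=0)            # (p//2 + 1, d) spectrum along token axis
    mask = np.zeros(G_hat.shape[0])
    mask[list(modes)] = 1.0                        # M_S[k] = 1[k in S]
    return np.fft.irfft(G_hat * mask[:, None], n=p, axis=0)
```

The projection is idempotent, so applying it once per backward pass keeps the embedding update confined to the prescribed subspace at every step.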

2.3   Results: 92.7% Grokking Speedup

Table 1: Grokking acceleration on $(a+b)\bmod 97$ (mean ± std, 3 seeds). Mem = memorization epoch (train acc ≥ 99%).
Method | Prescribed Modes $S$ | ETG | Mem | Speedup
Baseline | — | 782 ± 95 | 451 | —
PFFT Nanda | $\{1, 14, 41\}$ | 191 ± 5 | 32 | +75.5%
PFFT quint | $\{1, 2, 3, 4, 5\}$ | 261 ± 3 | 90 | +66.7%
PFFT near-Nyquist | $\mathbf{\{30, 35, 40, 45, 48\}}$ | 57 ± 4 | 9 | +92.7%
Adaptive $K=5$ | top-5 dynamic | 97 ± 7 | 38 | +87.6%

Near-Nyquist modes $\{30, 35, 40, 45, 48\}$ achieve ETG = 57 (92.7% speedup) and reduce the memorization phase from 451 to 9 epochs (97.9% reduction). Three findings are especially diagnostic:

Near-Nyquist beats task-correct modes. The mechanistic interpretability literature identifies $\{1, 14, 41\}$ as the "correct" modes for $p = 97$. Prescribing these yields 75.5% speedup; near-Nyquist $\{30, 35, 40, 45, 48\}$ yields 92.7%. The speedup comes from gradient noise reduction, not mode guidance. The task-relevant Fourier structure then emerges naturally via the optimization landscape.

Memorization is almost entirely bypassed. Standard training spends 451 epochs memorizing before generalizing. PFFT near-Nyquist reaches the memorization criterion at epoch 9 and grokks at epoch 57: the extended memorization plateau vanishes, so the model never meaningfully memorizes before it generalizes. Memorization is not a necessary stage; it is a symptom of gradient noise.

Table 2: Cross-$p$ validation. "DNF" = did not grok within 1500 epochs.
Setup | Prescription | ETG | vs. Baseline
$p=97$, baseline | — | 782 | —
$p=97$, near-Nyquist | $\{30,35,40,45,48\}$ | 57 | +92.7%
$p=113$, baseline | — | 321 | —
$p=113$, near-Nyquist | $\{34,40,46,51,56\}$ | 75 | +76.6%
$p=97$, single Nyquist | $\{48\}$ | DNF | —
Figure 1a: Epochs-to-grokking (ETG) across all 14 variants × 3 seeds. Near-Nyquist PFFT achieves the strongest and most consistent speedup.
Figure 1b: Memorization epoch vs. ETG. Near-Nyquist PFFT (lower-left outlier) nearly eliminates the memorization phase entirely — a qualitatively different learning trajectory from all other methods.

2.4   Why PFFT Works: Two Mechanisms

PFFT simultaneously achieves two things:

  1. Embedding geometry authority. By projecting embedding gradients onto a frequency subspace, PFFT prevents the optimizer from diffusing the embedding's representational structure across a noisy gradient landscape. The embedding is free to crystallize its Fourier circuit without the Q, K, and V projections competing to restructure it.
  2. Gradient noise reduction. Constraining the gradient to a low-dimensional frequency subspace removes the high-frequency noise component that causes each gradient step to partially undo the previous one. The memorization phase is the optimizer spinning on this noise; removing the noise collapses the memorization phase entirely.

Both mechanisms are necessary. Prescribing only a single mode (even the Nyquist, $k = 48$) fails to grok entirely: a single mode cannot span the multi-dimensional Fourier circuit required by modular arithmetic, which needs ≥3 independent phase components.

3   Why Fourier Steering Fails on Language Models

3.1   The Sounding Hammer

Before applying PFFT to any model, we need to answer two questions: (1) does the gradient along the target parameter axis have structured Fourier content, and (2) if so, which modes dominate? The Sounding Hammer is a pre-training diagnostic that answers both.

Definition (Sounding Hammer). Given a parameter tensor $\mathbf{W}$, collect the aggregate gradient $\bar{G} = \mathbb{E}_\text{batch}[\nabla_\mathbf{W} \mathcal{L}]$ over a representative data sample. Apply the real FFT along the parameter axis of interest to get $\hat{G}[k, \cdot] = \text{rfft}(\bar{G}, \text{dim}=0)[k]$. Define the power spectrum $P(k) = \|\hat{G}[k, \cdot]\|^2$ and the gradient regularity: $$\rho = \frac{\sum_{k \in \text{top-}K} P(k)}{\sum_k P(k)}$$ where $K$ is a small fraction of the total bins (e.g., $K=16$ of 257 for $p=512$).

The Sounding Hammer returns two outputs:

  1. The dominant mode spectrum — ranked frequencies by power $P(k)$. For modular arithmetic, this recovers the Fourier circuit modes identified by mechanistic interpretability directly from the gradient, without requiring a trained model.
  2. The regularity score $\rho$ — a measure of spectral concentration. High $\rho$ (close to 1) means the gradient is spectrally sparse: projection preserves signal while discarding noise. Low $\rho$ means the gradient is spectrally uniform: projection discards signal indiscriminately.
Natural Ordering Condition (NOC). High $\rho$ is possible only if nearby indices along the parameter axis correspond to semantically or geometrically related inputs. If the parameter axis has natural ordering, Fourier steering is safe; if it does not, Fourier steering destroys signal. The modular arithmetic token axis satisfies the NOC by construction. BPE token indices satisfy no ordering at all.
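The diagnostic reduces to a few lines. This NumPy sketch (names are ours) returns both outputs, the regularity score and the power-ranked mode spectrum:

```python
import numpy as np

def gradient_regularity(G_bar, top_k):
    """Sounding Hammer: spectral concentration rho of an aggregate gradient.

    G_bar: (n, d) aggregate gradient; the FFT is taken along axis 0 (the parameter axis).
    Returns (rho, ranked): top-K power fraction and frequency bins sorted by power.
    """
    G_hat = np.fft.rfft(G_bar, axis=0)          # (n//2 + 1, d) spectrum
    P = (np.abs(G_hat) ** 2).sum(axis=1)        # power spectrum P(k)
    ranked = np.argsort(P)[::-1]                # modes ranked by power, descending
    rho = P[ranked[:top_k]].sum() / P.sum()
    return rho, ranked
```

On a spectrally sparse gradient (NOC satisfied), nearly all power lands in a few bins and $\rho \to 1$; on white noise, the expected top-$K$ fraction is roughly $K$ divided by the bin count, i.e. $16/257 \approx 0.06$ for the GPT-2 positional axis.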

3.2   Sounding Hammer Applied to GPT-2

Table 3: Sounding Hammer results on GPT-2 Small (200 TinyStories documents, top-$K=16$ of 257 frequency bins).
Tensor | Description | $\rho$
wpe.weight | Positional embedding (512×768) | 0.82 (PASS)
h.*.mlp.c_proj.weight | MLP output projections (avg 12 layers) | 0.45–0.54
wte.weight | BPE token embedding (50,257×768) | 0.42 (FAIL)
Figure 2: Sounding Hammer gradient power spectra. The positional embedding (left) shows a clear low-frequency peak ($\rho = 0.82$, PASS). The BPE token embedding (right) is nearly flat ($\rho = 0.42$, FAIL) — no frequency subspace is more informative than any other.

3.3   Failure of Fourier Steering on Language Models

Applying PFFT near-Nyquist to BPE vocabulary gradients causes BPC to increase from 2.90 to 9.47 — a catastrophic regression. Projecting onto 5 of 25,129 frequency bins discards nearly all informative gradient signal. The degradation is not specific to mode choice; it reflects the NOC violation.

Table 4: FGP applied to BPE vocabulary gradients on TinyStories. NOC failure causes catastrophic regression.
Variant | Description | BPC at 10k steps
Baseline | No FGP | 2.90
PFFT near-Nyquist | $\{25124, \ldots, 25128\}$ | 9.47 (catastrophic)
Adaptive FGP $K=5$ | Top-5 gradient modes | ~4.0 (degraded)

This leaves a key question: gradient-domain noise reduction works beautifully for modular arithmetic, and the embedding is clearly the geometric root in both settings. But the noise reduction mechanism cannot be applied the same way. What is the right intervention for language models?

4   Behavioral Analysis: Diagnosing the Q Projection

4.1   Weight Trajectory Analysis

To understand which components of the transformer undergo the most significant representational change during training, we collect per-step weight snapshots of the Q, K, V, and MLP weight matrices across all layers for language models trained on TinyStories and FineWeb.

Each snapshot is encoded by a behavioral autoencoder (dual-objective: weight reconstruction loss + training-loss prediction from the bottleneck) and projected to 2D via PyMDE. HDBSCAN clustering reveals the number and structure of distinct behavioral phases.

4.2   Results: Q Weights Are the Primary Reorganization Site

Figure 3: Behavioral autoencoder trajectories for the Q weight matrices across all 5 attention layers of a TinyStories language model. Each point is a per-step weight snapshot projected to 2D by PyMDE; color encodes HDBSCAN cluster membership. Q weights show the sharpest transitions and most structured clustering, correlated with training loss milestones. K, V, MLP-up, and MLP-down weights evolve more smoothly and exhibit fewer distinct behavioral phases.

The pattern is consistent across all layers and is replicated on the FineWeb model.

4.3   Interpretation

The Q weights are doing something the K and V weights are not: they are repeatedly reorganizing to match a shifting query geometry. But the query geometry is exactly what the embedding encodes. Under the Embedding Geometry Hypothesis, the Q projection is competing with the embedding for representational ownership of the query space, and the sharpness of the Q trajectory reflects the cost of that competition.

This analysis points directly at a hypothesis: if we remove the Q projection, the embedding can set the query geometry without competition, and the representational reorganization cost disappears.

5   No-Q Attention

5.1   Definition

No-Q attention replaces standard self-attention (Eq. 1) with:

$$\text{NoQ-Attn}(\mathbf{x}) = \text{softmax}\!\left(\frac{\mathbf{x}\,\mathbf{K}^\top}{\sqrt{d_h}}\right)\mathbf{V}, \quad \mathbf{K} = \mathbf{x}\mathbf{W}_K^\top,\; \mathbf{V} = \mathbf{x}\mathbf{W}_V^\top \tag{2}$$

The post-LayerNorm hidden state $\mathbf{x}$ serves directly as the query. $\mathbf{W}_Q$ is removed entirely; $\mathbf{W}_K$, $\mathbf{W}_V$, and $\mathbf{W}_O$ are retained without modification.
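A minimal single-head sketch of Eq. 2, mirroring the standard-attention computation but with the query projection dropped (no masking or output projection; names are ours):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def noq_attention(x, W_K, W_V):
    """No-Q attention (Eq. 2): the hidden state x itself serves as the query."""
    K = x @ W_K.T
    V = x @ W_V.T
    d_h = x.shape[-1]
    scores = x @ K.T / np.sqrt(d_h)           # Q = x, no learned projection
    return softmax(scores) @ V
```

The only change relative to standard attention is the absent `W_Q`; the forward pass, masking, and multi-head reshaping would otherwise be untouched.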

5.2   Why This Achieves Both Goals

Goal 1: Embedding geometry authority. When $\mathbf{Q} = \mathbf{x}$, the query is the hidden state as shaped by the embedding and all preceding layers. There is no learned projection competing to reshape the query geometry. The embedding's representational choices propagate directly into the query-key dot product. This is the architectural equivalent of PFFT's gradient-domain intervention: instead of filtering gradients to keep the embedding's structure intact, we remove the parameter that would otherwise overwrite it.

Goal 2: Gradient noise reduction. Removing $\mathbf{W}_Q$ eliminates $L \times d^2$ parameters. For our 4-layer, $d = 256$ model: $4 \times 65{,}536 \approx 262$K parameters (8% of total). Fewer parameters means a smaller-dimensional optimization landscape with lower inherent noise. Unlike PFFT, this noise reduction requires no knowledge of the gradient's spectral structure and is always safe to apply.
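The arithmetic behind the 8% figure, using the rounded baseline parameter count from Table 5:

```python
# One d x d query projection is removed per layer.
L, d = 4, 256
removed = L * d * d                   # 4 * 65,536 = 262,144 parameters
baseline_total = 3_347_000            # baseline count from Table 5 (3.347M, rounded)
fraction = removed / baseline_total   # ~0.078, i.e. roughly 8% of all parameters
print(removed, fraction)
```

This also matches Table 5 directly: 3.347M minus 0.262M gives the 3.085M No-Q parameter count.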

Why K and V are kept. $\mathbf{W}_K$ is necessary because removing it collapses the attention logits to the symmetric Gram matrix $\mathbf{x}\mathbf{x}^\top$: pure self-similarity, with no learned notion of which positions are relevant to which. $\mathbf{W}_V$ is necessary to select what information flows forward, a distinct operation from the query-key relevance computation.

6   Experiments

6.1   Setup

All language model experiments use a byte-level character language model with $d = 256$, $L = 4$ layers, $H = 4$ heads, $d_\text{MLP} = 1024$, sequence length 256, vocabulary size 256. Total parameter count: 3.35M (baseline) / 3.09M (No-Q, −8%). Trained with AdamW, LR $= 3 \times 10^{-4}$, weight decay $= 0.1$, cosine schedule, batch size 64, 5000 steps.

Datasets: TinyStories (synthetic children's stories) and FineWeb (a diverse web corpus spanning news, blogs, science, and code).

6.2   Main Result: TinyStories

Table 5: No-Q attention vs. standard attention on TinyStories.
Variant | Params | Val BPC | Δ BPC | Δ%
Baseline (standard) | 3.347M | 1.0819 | — | —
No-Q attention | 3.085M | 1.0475 | $-0.0344$ | +3.18%

No-Q attention improves validation BPC from 1.0819 to 1.0475 — a 3.18% relative improvement — with 8% fewer parameters. The improvement appears early in training and persists throughout. This is the largest improvement of any architectural modification tested, achieved by removing computation rather than adding it.

Figure 4: Training and validation BPC curves for baseline and No-Q attention on TinyStories. Left: val BPC. Center: train BPC. Right: K-pathway alignment metric (leading singular value ratio of $\mathbf{W}_K$ for No-Q; rank-1 Q–K cosine similarity for baseline). No-Q consistently outperforms the baseline throughout training.

6.3   Generalization to FineWeb

Table 6: No-Q attention on FineWeb (5-layer model, seq=512, 10K steps).
Variant | Params | Val BPC | Δ BPC | Δ%
Baseline (standard) | 4.200M | 1.8942 | — | —
No-Q attention | 3.872M | 1.8518 | $-0.0424$ | +2.24%

No-Q attention generalizes from TinyStories to the more challenging and diverse FineWeb corpus. The 5-layer model achieves a 2.24% BPC improvement while removing 7.8% of parameters. The consistency of the result across both datasets is important: FineWeb spans news, blogs, science, and code — a much richer distribution than children's stories. The improvement is not an artifact of distributional simplicity.

Figure 5: Training and validation BPC curves for baseline and No-Q attention on FineWeb. The No-Q improvement is consistent throughout training on this more challenging and diverse corpus.

6.4   Connection to Grokking

We test whether No-Q attention changes grokking dynamics on modular arithmetic. Setup: $(a+b)\bmod 97$, $d = 128$, $L = 2$, $H = 4$, 40% train split, AdamW LR $= 10^{-3}$, WD $= 1.0$, up to 1500 epochs, 3 seeds.

Table 7: No-Q attention on modular arithmetic $(a+b)\bmod 97$. ETG = mean epochs-to-grokking (val acc ≥ 0.99). Reference ETG for PFFT near-Nyquist is 57 epochs.
Variant | ETG (mean) | ETG (seeds) | Mem. eps | Speedup
Baseline | 933 | 925, 1125, 750 | 50 | —
No-Q attention | 383 | 325, 350, 475 | 50 | +58.9%
PFFT near-Nyquist (ref.) | 57 | — | 9 | +92.7%

No-Q attention reduces mean ETG from 933 to 383 epochs — a 58.9% speedup. The memorization epoch (mem-eps = 50) is identical for both variants. No-Q does not accelerate memorization; it shortens the gap between memorization and generalization. The standard model spends 883 epochs after memorization before generalizing; No-Q spends only 333. Under the Embedding Geometry Hypothesis, removing $\mathbf{W}_Q$ eliminates the competition for representational ownership of the query space, allowing the embedding's Fourier structure to be directly exploited once memorization is complete.

Figure 6: Validation accuracy on $(a+b)\bmod 97$ for baseline and No-Q attention (3 seeds each). Dashed line: grokking threshold (0.99 accuracy). No-Q grokking occurs at mean ETG=383 vs. 933 for baseline (58.9% speedup).

7   Discussion

7.1   A Unified View

The results across all four stages of the investigation tell a coherent story: the embedding is the geometric root of generalization (PFFT), gradient-domain steering does not transfer to unordered BPE vocabularies (Sounding Hammer), the Q projection is the primary site of representational competition (trajectory analysis), and removing it delivers the benefits of PFFT architecturally (No-Q attention).

7.2   The Q–K Asymmetry

The standard transformer treats $\mathbf{W}_Q$ and $\mathbf{W}_K$ symmetrically, but there is a fundamental asymmetry in their roles. $\mathbf{W}_K$ produces the key space: it defines "what to compare against" for each position — a genuinely helpful degree of freedom. Without $\mathbf{W}_K$, the attention pattern collapses to pure self-similarity.

$\mathbf{W}_Q$ produces the query space: it defines "what this token is looking for." But this information is already encoded in the embedding geometry. What a token wants is precisely what the embedding encodes. $\mathbf{W}_Q$ reparameterizes a signal that was already present.
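The asymmetry can be checked directly: dropping $\mathbf{W}_K$ leaves the symmetric Gram matrix $\mathbf{x}\mathbf{x}^\top$, while dropping $\mathbf{W}_Q$ leaves $\mathbf{x}\mathbf{W}_K\mathbf{x}^\top$, which is still a directed relevance map. A toy NumPy check (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 16
x = rng.standard_normal((T, d))
W_K = rng.standard_normal((d, d))

no_k_logits = x @ x.T                 # drop W_K: symmetric Gram matrix
no_q_logits = x @ (x @ W_K.T).T       # drop W_Q: x W_K x^T, asymmetric in general

assert np.allclose(no_k_logits, no_k_logits.T)        # pure self-similarity
assert not np.allclose(no_q_logits, no_q_logits.T)    # directed relevance survives
```

Symmetric logits mean position $i$ attends to $j$ exactly as much as $j$ attends to $i$, which cannot express "what to compare against"; the retained $\mathbf{W}_K$ preserves that directionality.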

7.3   Implications

8   Related Work

Grokking and Fourier circuits. Power et al. (2022) introduced grokking. Nanda et al. (2023) showed generalization coincides with Fourier circuit formation in the token embedding. GrokFast amplifies slow-gradient components but requires the memorization phase first; PFFT sidesteps memorization entirely.

Attention simplifications. Multi-query attention (MQA) and grouped-query attention (GQA) share K and V heads across Q heads to reduce KV-cache memory. Linformer and linear attention modify the attention kernel. None of these is equivalent to No-Q attention, which removes the Q projection matrix entirely.

Spectral bias and gradient dynamics. Rahaman et al. (2019) established that gradient descent is biased toward low-frequency solutions. This bias is harmful for modular arithmetic, where the optimal solution requires high-frequency representations near the Nyquist limit.

Embedding structure. Our work argues that the embedding's geometric independence must be preserved, not post-processed, and that the Q projection is the primary threat to that independence.

9   Conclusion

We have presented the Embedding Geometry Hypothesis and traced its implications from modular arithmetic to language modeling.

Starting from Fourier circuit formation in grokking, we showed that prescribing near-Nyquist embedding gradients (PFFT) achieves 92.7% ETG speedup by simultaneously respecting the embedding's geometric authority and reducing gradient noise. The Sounding Hammer revealed that the same gradient-domain technique cannot be applied safely to language model token embeddings. Behavioral weight trajectory analysis identified the Q projection as the primary site of representational reorganization, pointing directly at the intervention: remove $\mathbf{W}_Q$.

No-Q attention — setting $\mathbf{Q} = \mathbf{x}$ at every layer — improves language modeling BPC by +3.18% on TinyStories and +2.24% on FineWeb while removing 8% of parameters, and accelerates grokking by 58.9% on modular arithmetic.

The embedding is not a shallow lookup table that feeds into attention — it is the attention query. K and V provide the comparison structure that the embedding's geometry can exploit. The Q projection is the part we can — and should — remove.

References