Research Publications

Peer-reviewed papers and preprints from the Phronesis Analytics research team.

2026

Preprint — Under Review

The Embedding Geometry Hypothesis: From Fourier Circuits to No-Q Attention

Nathan Rigoni, Phronesis Analytics, March 2026

The token embedding layer is the geometric foundation of transformer attention. We develop this claim in four stages. First, we show that Prescribed Fourier Frequency Training (PFFT), which prescribes near-Nyquist frequency modes in the embedding gradient, achieves a 92.7% reduction in epochs-to-grokking (57 vs. 782) on modular arithmetic, with a 97.9% reduction in the memorization phase. Second, the Sounding Hammer diagnostic reveals that gradient-domain Fourier steering cannot safely transfer to language model embeddings: BPE vocabulary gradients are spectrally flat (p=0.42), and applying the steering to them causes a catastrophic regression in bits per character (BPC, 2.90→9.47). We introduce Natural Ordering Conditions (NOC) to characterize when Fourier steering is safe. Third, we introduce Fourier Gradient Projection (FGP), a dynamic variant of PFFT that follows whichever frequency modes become important during training, as a general gradient-domain tool, though it shares the NOC limitation. Fourth, behavioral weight trajectory analysis of language models trained on TinyStories and FineWeb reveals that all weight matrices (Q, K, V, and MLP) inherit the same two-arm trajectory shape from the token embedding through the residual stream. This universal inheritance motivates No-Q attention: setting Q=x (no query projection) at every layer, which improves validation BPC by 3.18% on TinyStories and 2.24% on FineWeb with 8% fewer parameters and a 51.0% grokking speedup. The token embedding is not a lookup table that feeds into attention; it is the attention query.
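
To make the Q=x idea concrete, the following is a minimal sketch of a single attention head in which the query is the layer input itself rather than a learned projection of it. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the head layout, masking, or scaling, so the single head, the causal mask, the 1/sqrt(d) scaling, the retained key and value projections, and every name here (`no_q_attention`, `w_k`, `w_v`) are hypothetical choices made for the example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def no_q_attention(x, w_k, w_v):
    """Single-head causal attention whose query is the input x itself.

    x:   (seq_len, d) layer input from the residual stream
    w_k: (d, d) key projection   (assumed to be kept)
    w_v: (d, d) value projection (assumed to be kept)
    Returns the attended values and the attention weights.
    """
    d = len(x[0])
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    q = x  # No-Q: the query is x, with no learned W_q
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = []
    for i, row in enumerate(scores):
        # Causal mask (an assumption): position i sees positions 0..i only.
        visible = [s / math.sqrt(d) for s in row[: i + 1]]
        weights.append(softmax(visible) + [0.0] * (len(row) - i - 1))
    return matmul(weights, v), weights

# Toy usage with random inputs; the sizes are arbitrary.
random.seed(0)
seq_len, d = 4, 8
x = [[random.gauss(0, 1) for _ in range(d)] for _ in range(seq_len)]
w_k = [[random.gauss(0, d ** -0.5) for _ in range(d)] for _ in range(d)]
w_v = [[random.gauss(0, d ** -0.5) for _ in range(d)] for _ in range(d)]
out, weights = no_q_attention(x, w_k, w_v)
```

In this sketch, dropping the query projection removes one d×d weight matrix per head of this shape. How that maps onto the parameter reduction reported in the abstract depends on architectural details the abstract does not give.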
