Research Publications
Peer-reviewed papers and preprints from the Phronesis Analytics research team.
The Embedding Geometry Hypothesis: From Fourier Circuits to No-Q Attention
The token embedding layer is the geometric foundation of transformer attention. We develop this claim in four stages. First, we show that prescribing near-Nyquist frequency modes in the embedding gradient, a method we call Prescribed Fourier Frequency Training (PFFT), reduces epochs-to-grokking on modular arithmetic by 92.7% (57 vs. 782 epochs) and shortens the memorization phase by 97.9%. Second, the Sounding Hammer diagnostic reveals that gradient-domain Fourier steering cannot safely transfer to language model embeddings: BPE vocabulary gradients are spectrally flat (p=0.42), and applying the steering causes a catastrophic regression in bits per character (BPC) from 2.90 to 9.47. We introduce Natural Ordering Conditions (NOC) to characterize when Fourier steering is safe. Third, we introduce Fourier Gradient Projection (FGP), a dynamic variant of PFFT that follows whichever frequency modes become important during training, as a general gradient-domain tool, though it shares the NOC limitation. Fourth, behavioral weight trajectory analysis of language models trained on TinyStories and FineWeb reveals that all weight matrices (Q, K, V, and MLP) inherit the same two-arm trajectory shape from the token embedding through the residual stream. This universal inheritance motivates No-Q attention: setting Q=x (no query projection) at every layer, which improves validation BPC by 3.18% on TinyStories and 2.24% on FineWeb with 8% fewer parameters and a 51.0% grokking speedup. The token embedding is not a lookup table that feeds into attention; it is the attention query.
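To make the No-Q idea concrete, the following is a minimal sketch of a single attention head in which the query projection is dropped and the residual-stream input x is used directly as the query. It is an illustration under stated assumptions, not the authors' implementation: the function and parameter names (no_q_attention, w_k, w_v) are hypothetical, the key projection is assumed square so that x and the keys share a dimension, and the causal mask and 1/sqrt(d_model) scaling are standard choices that the abstract does not specify.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def no_q_attention(x, w_k, w_v):
    """Single-head causal attention with the query projection removed (Q = x).

    x   : seq_len x d_model residual-stream activations
    w_k : d_model x d_model key projection (hypothetical name, assumed square)
    w_v : d_model x d_model value projection (hypothetical name)
    Returns the attended output and the attention weight matrix.
    """
    d_model = len(x[0])
    q = x                  # No-Q: the residual stream itself acts as the query
    k = matmul(x, w_k)     # keys are still projected
    v = matmul(x, w_v)     # values are still projected
    scale = 1.0 / math.sqrt(d_model)  # assumed standard scaling
    weights = []
    for i, q_i in enumerate(q):
        # Assumed causal mask: position i attends only to positions 0..i.
        scores = [scale * sum(a * b for a, b in zip(q_i, k[j])) for j in range(i + 1)]
        weights.append(softmax(scores) + [0.0] * (len(x) - i - 1))
    return matmul(weights, v), weights
```

Calling no_q_attention(x, w_k, w_v) on a seq_len x d_model input returns an output of the same shape along with the attention weights. Compared with a conventional head, this sketch has no query matrix to store or train, which is where the parameter saving in this construction would come from.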