SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel

A new class of linear-time attention mechanisms built on the Yat-kernel. Co-authored with Krzysztof Choromanski of Columbia University and Google DeepMind.

Overview

SLAY (Spherical Linearized Attention with Yat-Kernel) introduces a new class of linear-time attention mechanisms based on a relaxed, computationally efficient formulation of the Yat-kernel. The resulting interactions are geometry-aware and inspired by inverse-square interactions in physics.

The work is co-authored by Jose Miguel Luna and Taha Bouhsine of Azetta.ai alongside Krzysztof Choromanski (Columbia University and Google DeepMind), a leading researcher in scalable attention mechanisms and the creator of Performers.

Key contributions

  • Spherical Yat-kernel: Queries and keys are constrained to the unit sphere, so attention depends only on angular alignment, grounding the mechanism in geometric structure rather than arbitrary learned projections.
  • Positive random-feature approximation: Via Bernstein's theorem, the spherical Yat-kernel is expressed as a nonnegative mixture of polynomial-exponential product kernels, yielding a strictly positive random-feature map and linear-time O(L) attention.
  • Theoretical guarantees: Positive definiteness and boundedness are established on the sphere, so the estimator yields well-defined, nonnegative attention scores.
  • State-of-the-art linear attention: Empirically, SLAY is nearly indistinguishable from standard softmax attention while retaining linear time and memory scaling, and it consistently outperforms prior linear-time mechanisms such as Performers and cosFormer.
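The ingredients above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not SLAY's actual kernel: it normalizes queries and keys onto the unit sphere and then uses a generic Performer-style positive random-feature map `phi` (the real SLAY map is the Bernstein-derived mixture of polynomial-exponential kernels described in the paper). The sizes `L`, `d`, and `m` are arbitrary. What the sketch does show faithfully is the structural point: with strictly positive features, attention weights are nonnegative and well-defined, and reassociating the matrix products gives O(L) cost.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, m = 64, 16, 256  # sequence length, head dim, number of random features

Q = rng.standard_normal((L, d))
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))

# Project queries and keys onto the unit sphere, so the kernel
# depends only on angular alignment.
Qs = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
Ks = K / np.linalg.norm(K, axis=-1, keepdims=True)

# Illustrative strictly positive feature map (Performer-style stand-in,
# NOT the Yat feature map): phi(x) = exp(Wx - ||x||^2 / 2) / sqrt(m).
W = rng.standard_normal((m, d))

def phi(X):
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(m)

Qf, Kf = phi(Qs), phi(Ks)    # both (L, m), all entries > 0

# Linear attention: associate phi(K)^T V first -> O(L * m * d), not O(L^2).
KV = Kf.T @ V                # (m, d)
norm = Qf @ Kf.sum(axis=0)   # (L,) row-normalizer
out = (Qf @ KV) / norm[:, None]

assert out.shape == (L, d)
assert np.all(norm > 0)      # positive features => well-defined scores
```

Reassociating the products is the whole trick: the L-by-L attention matrix `phi(Q) phi(K)^T` is never materialized, yet the result is identical to row-normalizing it explicitly and multiplying by `V`.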

Why this matters

Standard softmax attention scales quadratically with sequence length, making long-context Transformers prohibitively expensive. Previous linear-time approximations have sacrificed quality for speed. SLAY closes that gap: to the best of our knowledge, it is the closest linear-time approximation to softmax attention reported to date, enabling scalable Transformers without the usual performance trade-offs of attention linearization.

The paper validates Azetta's core thesis: grounding neural network operations in physics-inspired geometry (here, the Yat-kernel's inverse-square interactions on the sphere) produces systems that are both more efficient and more principled than arbitrary learned transformations.

Resources