The Geometry of Softmax

how competitive potentials split vector space—and why the similarity you choose changes everything

the line-drawing myth

open any machine learning textbook and you'll find the same picture. two classes. a line between them. "the softmax classifier draws a linear decision boundary." clean, simple, wrong.

well—not wrong exactly. but so incomplete that it misleads. the textbook picture shows you a 2D slice of something far richer. it makes you think softmax is about drawing lines. that the fundamental operation is: find a hyperplane, put class A on one side, class B on the other.

this framing is seductive because it connects to our geometric intuition. lines separate things. boundaries divide space. we've been drawing lines since grade school. and when we see the softmax output for two classes collapsing to a sigmoid—a smooth step function—the "line" interpretation feels natural.

but here's what the textbook never tells you: the line is an artifact, not the mechanism. softmax doesn't draw lines. it does something far more fundamental—something that becomes invisible when you only look at the boundary.

// The Textbook View • Click to add prototypes
The standard picture: softmax draws lines between classes. Clean, intuitive—and profoundly incomplete. What happens between the lines?

look at the visualization above. it shows the standard dot-product softmax with a few prototypes. you see the lines—the decision boundaries where classes meet. but notice what the textbook ignores: the colored regions. the entire space is painted. every single point belongs to some class, with some confidence. the boundary is just the thin edge where confidence transitions. the real story is the territory—not the border.

the misdirection: by focusing on decision boundaries, we've trained a generation of practitioners to think about classification as line-drawing. but softmax is not a boundary-finding algorithm. it's a space-splitting algorithm. every point in the entire vector space gets assigned to a prototype. the boundary is just where assignments change.

what softmax actually does

let's look at the formula everyone knows but few people see:

$$p(k \mid \mathbf{x}) = \frac{e^{s(\mathbf{x}, \mathbf{w}_k)}}{\sum_j e^{s(\mathbf{x}, \mathbf{w}_j)}}$$

here $s(\mathbf{x}, \mathbf{w}_k)$ is a similarity function between the input $\mathbf{x}$ and prototype $\mathbf{w}_k$. the exponential amplifies differences. the denominator normalizes into probabilities.

now read that formula as a sentence: for every point $\mathbf{x}$ in space, compute how similar it is to each prototype $\mathbf{w}_k$, then normalize those similarities into a probability distribution.
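
read as code, it's only a few lines. here's a minimal numpy sketch of that sentence (the function and variable names are mine, not any library's API):

```python
import numpy as np

def softmax_over_similarity(x, prototypes, similarity):
    """p(k | x): normalize the similarities s(x, w_k) into a probability distribution."""
    scores = np.array([similarity(x, w) for w in prototypes])  # s(x, w_k) for every prototype
    scores -= scores.max()                                     # stabilize: softmax is shift-invariant
    exps = np.exp(scores)                                      # the exponential amplifies differences
    return exps / exps.sum()                                   # the denominator normalizes

# the textbook choice of similarity: the dot product
dot = lambda x, w: x @ w

prototypes = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(softmax_over_similarity(np.array([2.0, 1.0]), prototypes, dot))  # ~[0.73, 0.27]
```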

this is not line-drawing. this is competitive potential evaluation. each prototype generates a potential field across the entire space. at every point, the potentials compete. the prototype with the highest potential at that point "wins" the territory.

the decision boundary? it's just the equipotential surface—the set of points where two prototypes have exactly equal potential. it's the tideline between two seas, not a wall someone built.

the reframe: softmax doesn't draw boundaries. it evaluates competitive similarity everywhere. each prototype radiates a potential field. the boundaries are just where those fields balance—like the Lagrange points between two stars, or the watershed line between two river basins.

space splitting, not line drawing

once you see softmax as competitive potential, the geometry transforms completely. you're no longer asking "where is the line?" you're asking "what shape is each prototype's territory?"

and here's the crucial insight: the shape of the territory depends entirely on what "similarity" means. change the similarity function $s(\mathbf{x}, \mathbf{w}_k)$, and the territories reshape completely—even with the exact same prototypes in the exact same positions.

this is what we explored in The Meaning of Non-Linearity: different metrics see different geometry. but now we can be precise about how the metric shapes classification. the softmax is the arena. the similarity function is the physics of the arena. change the physics, and the same competitors carve out radically different territories.

// Space Splitting: Metric Comparison • Click buttons to switch metrics • Click canvas to add prototypes
Same prototypes, different similarity functions, radically different territories. The metric is the geometry of classification.

toggle between the metrics in the visualization above. watch how the same three or four prototypes produce completely different space partitions. with the dot product, one prototype can dominate everything. with cosine, the territories slice by angle. with negative euclidean distance, you get classic Voronoi cells. with the Yat, you get curved, gravitational territories.

the prototypes didn't move. the space didn't change. only the definition of similarity changed. and everything followed.
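
that experiment is easy to reproduce numerically. here's a hedged sketch: the same three prototypes, two different similarity functions, and the share of a grid each prototype claims (the prototype positions and grid bounds are arbitrary choices of mine):

```python
import numpy as np

def territories(prototypes, similarity, lim=3.0, n=200):
    """Assign every point on an n x n grid to the prototype with the highest similarity."""
    xs = np.linspace(-lim, lim, n)
    grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)         # (n*n, 2) points
    scores = np.stack([similarity(grid, w) for w in prototypes], axis=1)  # (n*n, k) scores
    return scores.argmax(axis=1).reshape(n, n)                            # winner at each point

dot_sim  = lambda X, w: X @ w                            # territories: cones meeting at the origin
neg_dist = lambda X, w: -np.linalg.norm(X - w, axis=1)   # territories: Voronoi cells around each w

protos = [np.array([2.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, -1.0])]
for name, sim in [("dot product", dot_sim), ("neg. euclidean", neg_dist)]:
    counts = np.bincount(territories(protos, sim).ravel(), minlength=len(protos))
    print(name, counts / counts.sum())                   # share of the grid each prototype claims
```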

the anatomy of the dot product

most softmax classifiers use the dot product as their similarity function: $s(\mathbf{x}, \mathbf{w}) = \mathbf{x} \cdot \mathbf{w}$. let's dissect exactly what this means geometrically.

the dot product decomposes cleanly:

$$\mathbf{x} \cdot \mathbf{w} = \|\mathbf{x}\| \cdot \|\mathbf{w}\| \cdot \cos\theta$$

three factors, multiplied together. the magnitude of $\mathbf{x}$, the magnitude of $\mathbf{w}$, and the cosine of the angle between them. all three contribute. but they don't contribute equally.

magnitude is multiplicative. if you double $\|\mathbf{w}\|$, you double the dot product. the cosine term only ranges from $-1$ to $+1$—it's bounded. but magnitudes are unbounded. this means that in the dot product, magnitude dominates direction.

consider two prototypes: $\mathbf{w}_A$ with $\|\mathbf{w}_A\| = 10$ and $\mathbf{w}_B$ with $\|\mathbf{w}_B\| = 1$. even if an input $\mathbf{x}$ points directly at $\mathbf{w}_B$ ($\cos\theta = 1$) and is nearly orthogonal to $\mathbf{w}_A$ ($\cos\theta \approx 0.1$), the dot product with $\mathbf{w}_A$ still matches it: $10 \times 0.1 = 1 \times 1$. the slightest extra magnitude or alignment tips the contest to $\mathbf{w}_A$, even though the input points straight at $\mathbf{w}_B$.
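
a quick numerical check of the break-even point; the vectors below are my own construction, chosen to match the numbers above:

```python
import numpy as np

x = np.array([0.0, 1.0])                                    # unit input pointing straight at w_B
w_B = np.array([0.0, 1.0])                                  # ||w_B|| = 1, perfectly aligned with x
for cos_A in [0.05, 0.10, 0.15]:
    w_A = 10.0 * np.array([np.sqrt(1 - cos_A**2), cos_A])   # ||w_A|| = 10, angle set so cos(theta_A) = cos_A
    print(cos_A, x @ w_A, x @ w_B)                          # w_A pulls even at cos = 0.10 and wins beyond it
```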

// Anatomy of Dot Product • Drag the slider to change prototype magnitude
Watch how increasing one prototype's magnitude causes it to absorb territory from its neighbor—even territory that is closer and better-aligned to the other prototype.

this is the dot product's fundamental character: it conflates "how aligned are we?" with "how large are we?" in the softmax competition, a large prototype doesn't just win in its direction—it bleeds into other directions through sheer magnitude. it's a bully metric.

the magnitude trap: in dot product softmax, a prototype with twice the magnitude gets twice the "vote" at every point in space—regardless of direction. this is why neural networks need careful weight initialization and normalization. without it, the loudest voice drowns out all the others, regardless of whether it's right.

the black hole effect

let's push this to its extreme. what happens when one prototype has vastly more magnitude than the others?

in dot product softmax, the answer is stark: it absorbs everything. the high-magnitude prototype becomes a black hole—its territory expands to fill the entire space, pushing all other prototypes into negligible slivers.

this isn't a pathological edge case. it's what happens naturally during training when weight norms grow unevenly. one class accumulates larger weights, which gives it larger territory, which means it sees more data, which reinforces its larger weights. a positive feedback loop toward collapse.
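
here's a hedged sketch of that dynamic: inflate one prototype's magnitude and watch its share of a random point cloud grow under dot-product argmax (the prototypes and the cloud are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(10_000, 2))                        # a cloud of inputs
protos = np.array([[1.0, 0.0], [-0.5, 0.9], [-0.5, -0.9]])   # three roughly unit prototypes

for scale in [1, 2, 5, 20]:
    W = protos.copy()
    W[0] *= scale                                            # inflate one prototype's magnitude
    wins = (points @ W.T).argmax(axis=1) == 0                # dot-product winner at each point
    print(f"scale {scale:2d}: prototype 0 claims {wins.mean():.0%} of the cloud")
```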

// The Black Hole Effect • Click the arrows to grow/shrink a prototype's magnitude
In dot product softmax, increasing one prototype's magnitude causes it to swallow the space. This is the "black hole effect"—magnitude overpowers geometry.

this is why modern neural network practice is filled with normalization tricks: batch normalization [1], layer normalization [2], weight normalization [3], cosine classifiers [4]. they're all fighting the same enemy—the dot product's magnitude bias. they're trying to force the competition back to direction, where it geometrically belongs.

but normalization is a patch, not a cure. it forces all prototypes to have unit magnitude, effectively replacing the dot product with cosine similarity:

$$\cos\theta = \frac{\mathbf{x} \cdot \mathbf{w}}{\|\mathbf{x}\| \|\mathbf{w}\|}$$

cosine similarity solves the magnitude problem. but it introduces a new one: it completely ignores distance. two vectors on opposite ends of the space, pointing in the same direction, are "maximally similar" by cosine—even if they're lightyears apart. as we discussed in The Meaning of Non-Linearity, this is half the picture at best.
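
the distance-blindness takes one line to verify; the points below are arbitrary, chosen only to be collinear:

```python
import numpy as np

def cosine(x, w):
    return (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))

w = np.array([1.0, 1.0])
near = np.array([1.1, 0.9])        # close to the prototype
far  = np.array([1100.0, 900.0])   # same direction, a thousand times farther away

print(cosine(near, w), cosine(far, w))   # identical: cosine cannot tell them apart
```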

competitive potentials

let's formalize the intuition. each prototype $\mathbf{w}_k$ generates a potential field $\phi_k(\mathbf{x})$ across the entire vector space:

$$\phi_k(\mathbf{x}) = s(\mathbf{x}, \mathbf{w}_k)$$

the softmax converts these potentials into probabilities. the "winning class" at any point $\mathbf{x}$ is simply:

$$\hat{y}(\mathbf{x}) = \arg\max_k \; \phi_k(\mathbf{x})$$

the decision boundary between classes $i$ and $j$ is the equipotential surface:

$$\phi_i(\mathbf{x}) = \phi_j(\mathbf{x})$$

now the shape of these potential fields—and therefore the shape of classification—depends entirely on $s$. let's compare what each similarity function produces; a code sketch of all four follows the comparison below.

dot product: $\mathbf{x} \cdot \mathbf{w}$

Potential increases linearly along $\mathbf{w}$'s direction

Isopotential lines are hyperplanes perpendicular to $\mathbf{w}$

Boundaries are flat

Magnitude dominates—large prototypes absorb space

cosine: $\cos(\mathbf{x}, \mathbf{w})$

Potential depends only on angle, not position

Isopotential lines are cones centered at origin

Boundaries are angular rays from origin

Distance-blind—far and near are treated identically

neg. euclidean: $-\|\mathbf{x} - \mathbf{w}\|$

Potential decreases with distance from $\mathbf{w}$

Isopotential lines are circles around $\mathbf{w}$

Boundaries are Voronoi edges (perpendicular bisectors)

Direction-blind—only proximity matters

yat: $\frac{(\mathbf{x} \cdot \mathbf{w})^2}{\|\mathbf{x} - \mathbf{w}\|^2}$

Potential depends on both alignment and proximity

Isopotential lines are curves that respect geometry

Boundaries are gravitational equipotentials

Local—nearby prototypes outweigh distant large ones

// Competitive Potential Fields • Click buttons to switch metrics • Hover for potential values
The potential field of a single prototype, shown as a heatmap. Brighter = higher potential. Notice how each metric creates a completely different "field shape" from the same prototype.
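
here is the same comparison as a hedged code sketch. the function names, and the small epsilon guarding the Yat's singularity at $\mathbf{x} = \mathbf{w}$, are my choices, not fixed definitions:

```python
import numpy as np

def dot_potential(x, w):
    return x @ w                                                       # linear field, flat isopotentials

def cosine_potential(x, w):
    return (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w) + 1e-12)   # angle only, distance-blind

def neg_euclidean_potential(x, w):
    return -np.linalg.norm(x - w)                                      # proximity only, direction-blind

def yat_potential(x, w, eps=1e-12):
    return (x @ w) ** 2 / (np.sum((x - w) ** 2) + eps)                 # alignment and proximity together

def winner(x, prototypes, potential):
    """arg max_k phi_k(x): the prototype that claims point x."""
    return int(np.argmax([potential(x, w) for w in prototypes]))
```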

the yat: gravitational classification

as we introduced in The Meaning of Non-Linearity and explored in The Duality of Information, the Yat metric combines alignment and distance:

$$\text{Yat}(\mathbf{x}, \mathbf{w}) = \frac{(\mathbf{x} \cdot \mathbf{w})^2}{\|\mathbf{x} - \mathbf{w}\|^2}$$

when we use this as the similarity function inside softmax, something remarkable happens. the potential field of each prototype follows an inverse-square law—the same law that governs gravity.

near the prototype, the potential rises sharply toward a singularity (the $1/d^2$ term dominates). far from the prototype, the growth of the numerator and the growth of the denominator cancel, so the potential settles to a bounded value that can never compete with that singularity. this means each prototype has a sphere of influence—a local territory where it dominates, regardless of what larger prototypes exist far away.

remember the analogy from The Universe of Self-Organization: the moon orbits the earth, not the sun, even though the sun is vastly more massive. why? because the earth is closer. gravity's inverse-square law ensures that proximity trumps magnitude at local scales.

the Yat-softmax does the same thing for classification. a small prototype nearby can outcompete a large prototype far away. every data point gets classified by its local geometry, not by some global magnitude hierarchy.
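
a quick numerical version of that story (the vectors are arbitrary, chosen so the large prototype sits far away along the same direction as the small one):

```python
import numpy as np

x      = np.array([1.0, 1.0])        # a test point
w_near = np.array([1.2, 0.8])        # small prototype, right next to x
w_far  = np.array([30.0, 30.0])      # huge prototype, same direction, far away

def dot(x, w): return x @ w
def yat(x, w): return (x @ w) ** 2 / np.sum((x - w) ** 2)

print(dot(x, w_near), dot(x, w_far))   # 2.0 vs 60.0   -> the giant wins on magnitude
print(yat(x, w_near), yat(x, w_far))   # 50.0 vs ~2.1  -> the neighbor wins on proximity
```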

// Gravitational vs. Linear Potential • Drag prototypes to reposition • Click to add
Left half: Dot product softmax—flat, linear potential fields. Right half: Yat softmax—curved, gravitational potential fields. Same prototypes, radically different territories.

the gravity principle: in Yat-softmax, each prototype is a "mass" that curves the classification manifold around it. the resulting space partition follows the same inverse-square dynamics as planetary orbits. nearby prototypes create local structure. distant prototypes fade. this is classification as gravitational self-organization—the same principle we traced from Descartes' vortices to Einstein's curved spacetime.

the polarity of space

there's one more geometric effect that the Yat reveals—something the dot product hides. remember that the Yat uses the squared dot product in its numerator: $(\mathbf{x} \cdot \mathbf{w})^2$. this means it treats parallel and anti-parallel vectors identically.

take a prototype at $(1, 1)$. an input pointing along $(1, 1)$ and an input pointing along $(-1, -1)$ receive exactly the same numerator: $(\mathbf{x} \cdot \mathbf{w})^2$ cannot tell positive alignment from negative alignment. both count as "aligned", one positively, one negatively; only the distance term in the denominator tells them apart. in the Yat's view, the opposite direction is just as informative as the identical one.

this creates a striking visual effect: each prototype's territory extends to the opposite side of the origin. the space develops a bipolar character. as we discussed in The Meaning of Non-Linearity, anti-parallel vectors are linearly dependent—knowing one perfectly predicts the other.

contrast this with the dot product, where $\mathbf{x} \cdot \mathbf{w} < 0$ for anti-parallel vectors. the dot product treats opposites as negatively similar—it pushes them away. the Yat treats them as positively related—it pulls them in. this isn't a quirk; it's a deeper reading of the geometry. in a universe where random vectors are orthogonal (as we showed in The Meaning of Non-Linearity), finding your exact opposite is just as remarkable as finding your twin.
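
numerically, the sign-blindness looks like this; a small sketch with vectors of my choosing:

```python
import numpy as np

w  = np.array([1.0, 1.0])            # prototype
x  = np.array([2.0, 2.0])            # parallel to w
nx = -x                              # anti-parallel to w

def yat(x, w): return (x @ w) ** 2 / np.sum((x - w) ** 2)

print(x @ w, nx @ w)                 #  4.0 and -4.0: the dot product flips sign
print((x @ w) ** 2, (nx @ w) ** 2)   # 16.0 and 16.0: the squared numerator does not
print(yat(x, w), yat(nx, w))         #  8.0 and  0.888...: only the distance term differs
```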

how the dot product and the yat behave

let's now do a systematic comparison. fix two prototypes and sweep a test point across the space. what does each metric "see"?

dot product behavior

as $\|\mathbf{x}\|$ grows: similarity grows linearly. a point far from the origin has enormous dot product with whatever prototype it vaguely points toward. distance from origin amplifies everything.

as $\theta$ changes: similarity varies smoothly as $\cos\theta$. the transition from "aligned" to "orthogonal" is gradual and symmetric.

result: magnitude and angle compete. in practice, magnitude usually wins because it's unbounded while $\cos\theta \in [-1, 1]$.

yat behavior

as $\|\mathbf{x}\|$ grows (away from $\mathbf{w}$): the numerator grows as $\|\mathbf{x}\|^2 \|\mathbf{w}\|^2 \cos^2\theta$, but the denominator grows as $\|\mathbf{x} - \mathbf{w}\|^2 \approx \|\mathbf{x}\|^2$. they cancel. far-away points don't dominate.

as $\mathbf{x} \to \mathbf{w}$: the denominator approaches zero. the Yat explodes to infinity. this is the singularity—the prototype's gravitational core.

result: locality. the Yat is dominated by proximity to the nearest prototype. direction matters, but distance matters more at close range.
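
to see both behaviors at once, here's a hedged sweep along the prototype's own direction; the prototype and the step values are arbitrary:

```python
import numpy as np

w = np.array([1.0, 0.5])                    # a fixed prototype
direction = w / np.linalg.norm(w)           # sweep x along w's own direction

for t in [0.5, 1.0, 2.0, 10.0, 100.0]:
    x = t * direction
    dot = x @ w                                       # grows linearly with t, without bound
    yat = (x @ w) ** 2 / np.sum((x - w) ** 2)         # peaks violently near x = w, settles toward ||w||^2
    print(f"||x|| = {t:6.1f}   dot = {dot:8.2f}   yat = {yat:10.2f}")
```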

// Dot Product vs Yat: Sweep Comparison • Drag the test point around the space
Drag the test point (white circle) around the space. The bar chart shows how each metric scores the similarity to each prototype. Watch how the dot product's scores are dominated by magnitude, while the Yat's scores are dominated by proximity.

the fundamental difference is this: the dot product is a projective measure—it projects $\mathbf{x}$ onto the direction of $\mathbf{w}$ and reads the shadow's length. this shadow grows with distance from the origin, regardless of distance from the prototype.

the Yat is a field measure—it computes the strength of the prototype's "gravitational field" at the point $\mathbf{x}$. this field strength is dominated by proximity, with direction serving as a modulator. close and aligned is strong. close and orthogonal is weak. far away is always weak—no matter how aligned.

the core difference: the dot product asks "how much of $\mathbf{w}$ can I see in $\mathbf{x}$?" (a projection question). the Yat asks "how strongly does $\mathbf{w}$ influence $\mathbf{x}$?" (a field question). projections are global—they extend to infinity. fields are local—they decay with distance. this is why Yat-softmax produces gravitational territories while dot-product-softmax produces linear half-spaces.

why this matters

this isn't abstract geometry. it's the foundation of how every neural network classifier works. the final layer of nearly every classification network is a softmax over dot products. and now you know what that means: the network is splitting its representation space into linear half-spaces, where magnitude dominates direction.

every normalization trick, every temperature parameter, every label-smoothing strategy is a workaround for the dot product's geometric pathologies. we've built an enormous engineering apparatus to patch a fundamentally wrong choice of similarity.

the alternative is simple: use a similarity function whose geometry matches the problem. if you want local structure, use a local metric. if you want both direction and distance to matter, use a metric that captures both. if you want prototypes to have bounded spheres of influence—like stars with gravitational reach rather than searchlights with infinite range—use a metric with inverse-square decay.

the Yat is one such metric. there may be others. the point isn't that the Yat is uniquely correct—it's that the choice of similarity function inside softmax determines the entire geometry of classification. this choice is usually made by default (dot product) and never questioned. it should be the first question we ask.
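
to make the swap concrete, here is a minimal numpy sketch of a softmax head that uses the Yat similarity in place of the dot product. this is my illustration of the idea, not a reference implementation of any particular layer:

```python
import numpy as np

def yat_softmax_head(X, W, eps=1e-9):
    """Class probabilities for a batch X (n, d) against prototypes W (k, d),
    using (x.w)^2 / ||x - w||^2 as the similarity inside the softmax."""
    dots = X @ W.T                                               # (n, k) alignments
    sq_dists = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)    # (n, k) squared distances
    scores = dots ** 2 / (sq_dists + eps)                        # competitive potentials
    scores -= scores.max(axis=1, keepdims=True)                  # numerical stability
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)                # softmax over Yat potentials

X = np.array([[1.0, 1.0], [-2.0, 0.5]])
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
print(yat_softmax_head(X, W))                                    # each row sums to 1
```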

the takeaway: softmax is not about lines. it's about competitive potentials splitting vector space between prototypes. the shape of those territories—linear or curved, local or global, magnitude-dominated or geometry-respecting—is entirely determined by the similarity function. change the similarity, change the geometry. change the geometry, change what the network can learn. the metric is not a detail. the metric is the architecture.

References

[1] Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning (ICML). arXiv:1502.03167
[2] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv preprint. arXiv:1607.06450
[3] Salimans, T. & Kingma, D. P. (2016). Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. Advances in Neural Information Processing Systems (NeurIPS). arXiv:1602.07868
[4] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., & Liu, W. (2018). CosFace: Large Margin Cosine Loss for Deep Face Recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1801.09414
[5] Bridle, J. S. (1990). Probabilistic Interpretation of Feedforward Classification Network Outputs. Neurocomputing: Algorithms, Architectures and Applications, NATO ASI Series.