the line-drawing myth
open any machine learning textbook and you'll find the same picture. two classes. a line between them. "the softmax classifier draws a linear decision boundary." clean, simple, wrong.
well—not wrong exactly. but so incomplete that it misleads. the textbook picture shows you a 2D slice of something far richer. it makes you think softmax is about drawing lines. that the fundamental operation is: find a hyperplane, put class A on one side, class B on the other.
this framing is seductive because it connects to our geometric intuition. lines separate things. boundaries divide space. we've been drawing lines since grade school. and when we see the softmax output for two classes collapsing to a sigmoid—a smooth step function—the "line" interpretation feels natural.
but here's what the textbook never tells you: the line is an artifact, not the mechanism. softmax doesn't draw lines. it does something far more fundamental—something that becomes invisible when you only look at the boundary.
look at the visualization above. it shows the standard dot-product softmax with a few prototypes. you see the lines—the decision boundaries where classes meet. but notice what the textbook ignores: the colored regions. the entire space is painted. every single point belongs to some class, with some confidence. the boundary is just the thin edge where confidence transitions. the real story is the territory—not the border.
what softmax actually does
let's look at the formula everyone knows but few people see:
$$p(k \mid \mathbf{x}) = \frac{e^{s(\mathbf{x}, \mathbf{w}_k)}}{\sum_j e^{s(\mathbf{x}, \mathbf{w}_j)}}$$
here $s(\mathbf{x}, \mathbf{w}_k)$ is a similarity function between the input $\mathbf{x}$ and prototype $\mathbf{w}_k$. the exponential amplifies differences. the denominator normalizes into probabilities.
now read that formula as a sentence: for every point $\mathbf{x}$ in space, compute how similar it is to each prototype $\mathbf{w}_k$, then normalize those similarities into a probability distribution.
this is not line-drawing. this is competitive potential evaluation. each prototype generates a potential field across the entire space. at every point, the potentials compete. the prototype with the highest potential at that point "wins" the territory.
the decision boundary? it's just the equipotential surface—the set of points where two prototypes have exactly equal potential. it's the tideline between two seas, not a wall someone built.
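to see that reading as a mechanism rather than a picture, here is a minimal numpy sketch of the formula: a pluggable similarity function, one potential per prototype, and a normalization that turns the competition into probabilities. the function names and the example prototypes are illustrative, not taken from any library.

```python
import numpy as np

def softmax(scores):
    # subtracting the max doesn't change the result (softmax is shift-invariant)
    # but keeps the exponentials from overflowing
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def class_probabilities(x, prototypes, similarity):
    # evaluate every prototype's potential at the point x,
    # then let the potentials compete through normalization
    scores = np.array([similarity(x, w) for w in prototypes])
    return softmax(scores)

# the textbook default: dot-product similarity
dot = lambda x, w: x @ w

prototypes = [np.array([2.0, 0.0]), np.array([0.0, 1.0])]
print(class_probabilities(np.array([1.0, 1.0]), prototypes, dot))
# -> roughly [0.73, 0.27]: both prototypes claim the point, one more strongly
```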
space splitting, not line drawing
once you see softmax as competitive potential, the geometry transforms completely. you're no longer asking "where is the line?" you're asking "what shape is each prototype's territory?"
and here's the crucial insight: the shape of the territory depends entirely on what "similarity" means. change the similarity function $s(\mathbf{x}, \mathbf{w}_k)$, and the territories reshape completely—even with the exact same prototypes in the exact same positions.
this is what we explored in The Meaning of Non-Linearity: different metrics see different geometry. but now we can be precise about how the metric shapes classification. the softmax is the arena. the similarity function is the physics of the arena. change the physics, and the same competitors carve out radically different territories.
toggle between the metrics in the visualization above. watch how the same three or four prototypes produce completely different space partitions. with the dot product, one prototype can dominate everything. with cosine, the territories slice by angle. with negative euclidean distance, you get classic Voronoi cells. with the Yat, you get curved, gravitational territories.
the prototypes didn't move. the space didn't change. only the definition of similarity changed. and everything followed.
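a small sketch of that experiment, using only numpy: fix three prototypes, sweep a grid of points, classify each by the argmax of its similarity, and compare the territory shares under two different metrics. the prototype positions and the grid range are arbitrary choices made for illustration.

```python
import numpy as np

def dot_sim(x, w):
    return x @ w

def neg_euclidean(x, w):
    return -np.linalg.norm(x - w)

def territory_fractions(prototypes, similarity, lim=3.0, n=120):
    # classify every grid point by its most similar prototype,
    # then report what fraction of the grid each prototype wins
    xs = np.linspace(-lim, lim, n)
    grid = np.array([[a, b] for b in xs for a in xs])
    scores = np.array([[similarity(p, w) for w in prototypes] for p in grid])
    winners = scores.argmax(axis=1)
    return np.bincount(winners, minlength=len(prototypes)) / len(grid)

prototypes = [np.array([2.0, 0.0]),     # one large prototype
              np.array([0.0, 0.5]),     # two small ones
              np.array([-0.5, -0.5])]

print("dot product :", territory_fractions(prototypes, dot_sim))
print("neg. euclid :", territory_fractions(prototypes, neg_euclidean))
# the shares change drastically, yet the prototypes never moved
```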
the anatomy of the dot product
most softmax classifiers use the dot product as their similarity function: $s(\mathbf{x}, \mathbf{w}) = \mathbf{x} \cdot \mathbf{w}$. let's dissect exactly what this means geometrically.
the dot product decomposes cleanly:
$$\mathbf{x} \cdot \mathbf{w} = \|\mathbf{x}\| \cdot \|\mathbf{w}\| \cdot \cos\theta$$
three factors, multiplied together. the magnitude of $\mathbf{x}$, the magnitude of $\mathbf{w}$, and the cosine of the angle between them. all three contribute. but they don't contribute equally.
magnitude is multiplicative. if you double $\|\mathbf{w}\|$, you double the dot product. the cosine term only ranges from $-1$ to $+1$—it's bounded. but magnitudes are unbounded. this means that in the dot product, magnitude dominates direction.
consider two prototypes: $\mathbf{w}_A$ with $\|\mathbf{w}_A\| = 10$ and $\mathbf{w}_B$ with $\|\mathbf{w}_B\| = 1$. even if an input $\mathbf{x}$ points directly at $\mathbf{w}_B$ ($\cos\theta = 1$) and is nearly orthogonal to $\mathbf{w}_A$ ($\cos\theta \approx 0.15$), the dot product with $\mathbf{w}_A$ still wins, because $10 \times 0.15 = 1.5 > 1 \times 1 = 1$.
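the arithmetic checks out with concrete coordinates; the vectors below are made up to realize exactly those norms and angles.

```python
import numpy as np

x   = np.array([0.0, 1.0])                              # unit-norm input
w_B = np.array([0.0, 1.0])                              # norm 1, perfectly aligned with x
w_A = 10.0 * np.array([np.sqrt(1 - 0.15**2), 0.15])     # norm 10, cos(theta) = 0.15 with x

print("x . w_A =", x @ w_A)   # 10 * 0.15 = 1.5
print("x . w_B =", x @ w_B)   # 1  * 1.00 = 1.0  -> the misaligned giant still wins
```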
this is the dot product's fundamental character: it conflates "how aligned are we?" with "how large are we?" in the softmax competition, a large prototype doesn't just win in its direction—it bleeds into other directions through sheer magnitude. it's a bully metric.
the black hole effect
let's push this to its extreme. what happens when one prototype has vastly more magnitude than the others?
in dot product softmax, the answer is stark: it absorbs everything. the high-magnitude prototype becomes a black hole—its territory expands to fill the entire space, pushing all other prototypes into negligible slivers.
this isn't a pathological edge case. it's what happens naturally during training when weight norms grow unevenly. one class accumulates larger weights, which gives it larger territory, which means it sees more data, which reinforces its larger weights. a positive feedback loop toward collapse.
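here is a sketch of the geometric half of that loop. to keep the illustration simple the grid is restricted to nonnegative inputs (think post-ReLU features), an assumption of this sketch rather than anything the argument requires: hold every prototype's direction fixed, inflate one norm, and watch its share of the space grow.

```python
import numpy as np

def dot_territory_share(prototypes, k=0, lim=3.0, n=150):
    # fraction of a grid of nonnegative inputs won by prototype k
    # under dot-product similarity (argmax over x . w_j)
    xs = np.linspace(0.0, lim, n)
    grid = np.array([[a, b] for b in xs for a in xs])
    scores = grid @ np.stack(prototypes).T          # shape (points, prototypes)
    return float(np.mean(scores.argmax(axis=1) == k))

directions = [np.array([1.0, 0.0]),
              np.array([0.0, 1.0]),
              np.array([1.0, 1.0]) / np.sqrt(2)]

for scale in [1.0, 2.0, 5.0, 20.0]:
    prototypes = [directions[0] * scale] + directions[1:]
    print(f"||w_0|| = {scale:4.1f} -> share of space claimed: "
          f"{dot_territory_share(prototypes):.2f}")
# the share climbs toward 1.0 as the norm grows: the black hole absorbs the space
```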
this is why modern neural network practice is filled with normalization tricks: batch normalization [1], layer normalization [2], weight normalization [3], cosine classifiers [4]. they're all fighting the same enemy—the dot product's magnitude bias. they're trying to force the competition back to direction, where it geometrically belongs.
but normalization is a patch, not a cure. it forces all prototypes to have unit magnitude, effectively replacing the dot product with cosine similarity:
$$\cos\theta = \frac{\mathbf{x} \cdot \mathbf{w}}{\|\mathbf{x}\| \|\mathbf{w}\|}$$
cosine similarity solves the magnitude problem. but it introduces a new one: it completely ignores distance. two vectors on opposite ends of the space, pointing in the same direction, are "maximally similar" by cosine—even if they're lightyears apart. as we discussed in The Meaning of Non-Linearity, this is half the picture at best.
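the blindness takes three lines to verify; the vectors below are arbitrary, one near the prototype and one a thousand times farther out along the same ray.

```python
import numpy as np

def cosine(x, w):
    return (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))

w    = np.array([1.0, 2.0])
near = np.array([0.5, 1.0])        # same direction, right next to w
far  = np.array([500.0, 1000.0])   # same direction, a thousand times farther away

print(round(cosine(near, w), 6), round(cosine(far, w), 6))
# both are 1.0 (up to floating point): cosine cannot tell them apart
```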
competitive potentials
let's formalize the intuition. each prototype $\mathbf{w}_k$ generates a potential field $\phi_k(\mathbf{x})$ across the entire vector space:
$$\phi_k(\mathbf{x}) = s(\mathbf{x}, \mathbf{w}_k)$$
the softmax converts these potentials into probabilities. the "winning class" at any point $\mathbf{x}$ is simply:
$$\hat{y}(\mathbf{x}) = \arg\max_k \; \phi_k(\mathbf{x})$$
the decision boundary between classes $i$ and $j$ is the equipotential surface:
$$\phi_i(\mathbf{x}) = \phi_j(\mathbf{x})$$
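a small sketch of these definitions in code, with negative euclidean distance as the similarity because its equipotential between two prototypes is just their perpendicular bisector; the prototype positions are chosen arbitrarily.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def neg_euclidean(x, w):
    return -np.linalg.norm(x - w)

w_i = np.array([-1.0, 0.0])
w_j = np.array([ 3.0, 0.0])

x_on_boundary = 0.5 * (w_i + w_j)       # midpoint: phi_i(x) == phi_j(x)
x_inside      = np.array([-0.5, 0.0])   # deep inside w_i's territory

for x in (x_on_boundary, x_inside):
    phi = np.array([neg_euclidean(x, w) for w in (w_i, w_j)])
    print("phi =", phi, "-> p =", softmax(phi))
# on the equipotential the probabilities are exactly [0.5, 0.5];
# away from it, one prototype clearly wins the point
```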
now the shape of these potential fields—and therefore the shape of classification—depends entirely on $s$. let's compare what each similarity function produces.
| similarity | potential field | isopotentials | decision boundaries | character |
|---|---|---|---|---|
| dot product: $\mathbf{x} \cdot \mathbf{w}$ | increases linearly along $\mathbf{w}$'s direction | hyperplanes perpendicular to $\mathbf{w}$ | flat | magnitude dominates—large prototypes absorb space |
| cosine: $\cos(\mathbf{x}, \mathbf{w})$ | depends only on angle, not position | cones centered at the origin | angular rays from the origin | distance-blind—far and near are treated identically |
| neg. euclidean: $-\lVert\mathbf{x} - \mathbf{w}\rVert$ | decreases with distance from $\mathbf{w}$ | circles around $\mathbf{w}$ | Voronoi edges (perpendicular bisectors) | direction-blind—only proximity matters |
| yat: $\frac{(\mathbf{x} \cdot \mathbf{w})^2}{\lVert\mathbf{x} - \mathbf{w}\rVert^2}$ | depends on both alignment and proximity | curves shaped by both angle and distance | gravitational equipotentials | local—nearby prototypes outweigh distant large ones |
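one way to write those four rows as code; the small epsilon guard against the Yat's singularity at $\mathbf{x} = \mathbf{w}$ is an implementation choice of this sketch, not part of the definition.

```python
import numpy as np

def dot_product(x, w):
    return x @ w

def cosine(x, w):
    return (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))

def neg_euclidean(x, w):
    return -np.linalg.norm(x - w)

def yat(x, w, eps=1e-12):
    # squared alignment over squared distance; eps only prevents a
    # division by zero when x coincides exactly with the prototype
    return (x @ w) ** 2 / (np.linalg.norm(x - w) ** 2 + eps)

# the same point scored against the same prototype, four different ways
x, w = np.array([1.0, 1.0]), np.array([2.0, 0.0])
for sim in (dot_product, cosine, neg_euclidean, yat):
    print(f"{sim.__name__:>13}: {sim(x, w): .3f}")
```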
the yat: gravitational classification
as we introduced in The Meaning of Non-Linearity and explored in The Duality of Information, the Yat metric combines alignment and distance:
$$\text{Yat}(\mathbf{x}, \mathbf{w}) = \frac{(\mathbf{x} \cdot \mathbf{w})^2}{\|\mathbf{x} - \mathbf{w}\|^2}$$
when we use this as the similarity function inside softmax, something remarkable happens. the potential field of each prototype follows an inverse-square law—the same law that governs gravity.
near the prototype, the potential rises sharply (the $1/d^2$ term dominates). far from the prototype, the potential falls off rapidly. this means each prototype has a sphere of influence—a local territory where it dominates, regardless of what larger prototypes exist far away.
remember the analogy from The Universe of Self-Organization: the moon orbits the earth, not the sun, even though the sun is vastly more massive. why? because the earth is closer. gravity's inverse-square law ensures that proximity trumps magnitude at local scales.
the Yat-softmax does the same thing for classification. a small prototype nearby can outcompete a large prototype far away. every data point gets classified by its local geometry, not by some global magnitude hierarchy.
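a numerical sketch of that moon-and-sun situation: one small prototype right next to the query, one much larger prototype far away, and both metrics' verdicts. the coordinates are arbitrary, chosen only to make the contrast visible.

```python
import numpy as np

def yat(x, w, eps=1e-12):
    return (x @ w) ** 2 / (np.linalg.norm(x - w) ** 2 + eps)

x      = np.array([1.0, 1.0])      # the query point
w_near = np.array([1.2, 0.9])      # small prototype, right next to x
w_far  = np.array([40.0, 10.0])    # huge prototype, far away

print("dot product:", x @ w_near, "vs", x @ w_far)          # the distant giant wins
print("yat        :", yat(x, w_near), "vs", yat(x, w_far))  # the local neighbor wins
```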
the polarity of space
there's one more geometric effect that the Yat reveals—something the dot product hides. remember that the Yat uses the squared dot product in its numerator: $(\mathbf{x} \cdot \mathbf{w})^2$. this means it treats parallel and anti-parallel vectors identically.
a vector pointing toward $(1, 1)$ and a vector pointing toward $(-1, -1)$ look equally aligned to a prototype at $(1, 1)$: the squared dot product is the same for both. both are "aligned"—one positively, one negatively. in the Yat's view, the opposite is just as informative as the identical.
this creates a striking visual effect: each prototype's territory extends to the opposite side of the origin. the space develops a bipolar character. as we discussed in The Meaning of Non-Linearity, anti-parallel vectors are linearly dependent—knowing one perfectly predicts the other.
contrast this with the dot product, where $\mathbf{x} \cdot \mathbf{w} < 0$ for anti-parallel vectors. the dot product treats opposites as negatively similar—it pushes them away. the Yat treats them as positively related—it pulls them in. this isn't a quirk; it's a deeper reading of the geometry. in a universe where random vectors are orthogonal (as we showed in The Meaning of Non-Linearity), finding your exact opposite is just as remarkable as finding your twin.
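the sign symmetry is a one-line check; the probe point below is placed far from the prototype, where the distance term treats $\mathbf{x}$ and $-\mathbf{x}$ almost identically.

```python
import numpy as np

def yat(x, w, eps=1e-12):
    return (x @ w) ** 2 / (np.linalg.norm(x - w) ** 2 + eps)

w = np.array([1.0, 1.0])
x = np.array([100.0, 100.0])    # far out along w's direction

print("alignment^2:", (x @ w) ** 2, "vs", ((-x) @ w) ** 2)   # exactly equal
print("dot product:", x @ w, "vs", (-x) @ w)                 # +200 vs -200: opposites pushed away
print("yat        :", round(yat(x, w), 3), "vs", round(yat(-x, w), 3))
# roughly 2.04 vs 1.96: far from the prototype, the opposite is pulled in too
```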
how the dot product and the yat behave
let's now do a systematic comparison. fix two prototypes and sweep a test point across the space. what does each metric "see"?
dot product behavior
as $\|\mathbf{x}\|$ grows: similarity grows linearly. a point far from the origin has enormous dot product with whatever prototype it vaguely points toward. distance from origin amplifies everything.
as $\theta$ changes: similarity varies smoothly as $\cos\theta$. the transition from "aligned" to "orthogonal" is gradual and symmetric.
result: magnitude and angle compete. in practice, magnitude usually wins because it's unbounded while $\cos\theta \in [-1, 1]$.
yat behavior
as $\|\mathbf{x}\|$ grows (away from $\mathbf{w}$): the numerator grows as $\|\mathbf{x}\|^2 \|\mathbf{w}\|^2 \cos^2\theta$, while the denominator grows as $\|\mathbf{x} - \mathbf{w}\|^2 \approx \|\mathbf{x}\|^2$. the $\|\mathbf{x}\|^2$ factors cancel, so the Yat approaches the bounded value $\|\mathbf{w}\|^2 \cos^2\theta$. far-away points don't dominate.
as $\mathbf{x} \to \mathbf{w}$: the denominator approaches zero. the Yat explodes to infinity. this is the singularity—the prototype's gravitational core.
result: locality. the Yat is dominated by proximity to the nearest prototype. direction matters, but distance matters more at close range.
the fundamental difference is this: the dot product is a projective measure—it projects $\mathbf{x}$ onto the direction of $\mathbf{w}$ and reads the shadow's length. this shadow grows with distance from the origin, regardless of distance from the prototype.
the Yat is a field measure—it computes the strength of the prototype's "gravitational field" at the point $\mathbf{x}$. this field strength is dominated by proximity, with direction serving as a modulator. close and aligned is strong. close and orthogonal is weak. far away is always weak—no matter how aligned.
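a sketch of that sweep: walk a test point outward along one fixed (arbitrarily chosen) direction and print both scores at each radius.

```python
import numpy as np

def yat(x, w, eps=1e-12):
    return (x @ w) ** 2 / (np.linalg.norm(x - w) ** 2 + eps)

w = np.array([2.0, 0.0])
direction = np.array([1.0, 0.5])
direction = direction / np.linalg.norm(direction)

print(" ||x||      dot      yat")
for r in [0.5, 1.0, 2.0, 5.0, 20.0, 100.0]:
    x = r * direction
    print(f"{r:6.1f}  {x @ w:7.2f}  {yat(x, w):7.3f}")
# the dot product grows without bound as the point leaves the origin;
# the yat peaks near the prototype, then settles toward the bounded
# value ||w||^2 * cos^2(theta): proximity decides, direction modulates
```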
why this matters
this isn't abstract geometry. it's the foundation of how every neural network classifier works. the final layer of nearly every classification network is a softmax over dot products. and now you know what that means: the network is carving its representation space into convex regions bounded by flat hyperplanes, in a competition where magnitude dominates direction.
every normalization trick, every temperature parameter, every label-smoothing strategy is a workaround for the dot product's geometric pathologies. we've built an enormous engineering apparatus to patch a fundamentally wrong choice of similarity.
the alternative is simple: use a similarity function whose geometry matches the problem. if you want local structure, use a local metric. if you want both direction and distance to matter, use a metric that captures both. if you want prototypes to have bounded spheres of influence—like stars with gravitational reach rather than searchlights with infinite range—use a metric with inverse-square decay.
the Yat is one such metric. there may be others. the point isn't that the Yat is uniquely correct—it's that the choice of similarity function inside softmax determines the entire geometry of classification. this choice is usually made by default (dot product) and never questioned. it should be the first question we ask.