Note: I understand the basic math. I understand that the typical perspective
function in various math libraries produces a matrix that converts z values from -z
It's even simpler: the clipping happens after vertex shading. If the vertex shader were allowed (or, more strongly, required) to do the perspective division, clipping would have to happen after the divide, in normalized device coordinates, which would be very inconvenient. Vertex attributes are still linear in clip coordinates, which makes clipping child's play. After the divide you would instead have to interpolate hyperbolically:
v' = 1.0f / (lerp(1.0 / v0, 1.0 / v1, t))
See how division-heavy that would be? In clip coordinates it is simply:
v' = lerp(v0, v1, t)
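To make the comparison concrete, here is a minimal Python sketch. It assumes a single attribute `a` carried along an edge whose endpoints have clip-space w values `w0` and `w1`. All function names are made up for illustration; none of them come from a real API. It shows that a plain lerp in clip space gives the same value as the division-heavy form you would need after the divide:

```python
import math

def lerp(a, b, t):
    """Plain linear interpolation between a and b."""
    return a + (b - a) * t

def attr_in_clip_space(a0, a1, t):
    """Attributes are linear in clip space, so one lerp is enough."""
    return lerp(a0, a1, t)

def clip_t_to_post_divide_s(w0, w1, t):
    """Map a clip-space parameter t to the matching parameter s
    along the same edge after the perspective divide."""
    return t * w1 / lerp(w0, w1, t)

def attr_after_divide(a0, w0, a1, w1, s):
    """Perspective-correct (hyperbolic) interpolation, which is what
    would be needed if the divide had already happened."""
    return lerp(a0 / w0, a1 / w1, s) / lerp(1.0 / w0, 1.0 / w1, s)

# Hypothetical edge: attribute goes 0 -> 10 while w goes 1 -> 4.
a0, w0, a1, w1 = 0.0, 1.0, 10.0, 4.0
t = 0.5
s = clip_t_to_post_divide_s(w0, w1, t)
same = math.isclose(attr_in_clip_space(a0, a1, t),
                    attr_after_divide(a0, w0, a1, w1, s))
```

Both paths agree, but the clip-space one needs no divisions at all.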
It is even better than that: the clipping limits in clip coordinates are:
-w <= x <= w
This means the signed distances to the left and right clip planes are trivial to compute in clip coordinates:
w + x and w - x (both are non-negative when the vertex is inside).

Clipping in clip coordinates is so much simpler and more efficient that it makes all the sense in the world to insist that vertex shader outputs are in clip coordinates. The hardware can then do the clipping and the division by the w-coordinate, since there is no reason left to leave either to the user. It is also simpler because this way we don't need a post-clip vertex shader (which would also have to include the mapping into the viewport, but that is another story). The way they designed it is actually quite nice. :)
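As a rough illustration of how cheap this is, here is a Python sketch that clips a line segment against the left and right planes only. It is an assumption-laden toy, not how any particular GPU does it: points are hypothetical `(x, w)` pairs, the function name is made up, and the signed distances are taken as `w + x` for the left plane and `w - x` for the right plane, so a non-negative value means "inside".

```python
def lerp(a, b, t):
    """Plain linear interpolation between a and b."""
    return a + (b - a) * t

def clip_segment_left_right(p0, p1):
    """Clip the segment p0 -> p1 against x = -w and x = w.

    p0 and p1 are (x, w) pairs in clip coordinates. Returns the
    parameter range (t_in, t_out) that survives, or None if the
    whole segment is outside.
    """
    t_in, t_out = 0.0, 1.0
    plane_distances = (
        lambda p: p[1] + p[0],  # left plane:  w + x >= 0 means inside
        lambda p: p[1] - p[0],  # right plane: w - x >= 0 means inside
    )
    for dist in plane_distances:
        d0, d1 = dist(p0), dist(p1)
        if d0 < 0 and d1 < 0:
            return None                       # both endpoints outside this plane
        if d0 < 0:
            t_in = max(t_in, d0 / (d0 - d1))  # segment enters the inside region
        elif d1 < 0:
            t_out = min(t_out, d0 / (d0 - d1))  # segment leaves the inside region
    if t_in > t_out:
        return None
    return t_in, t_out

# Hypothetical segment that starts inside and pokes out of the right plane.
p0, p1 = (0.0, 1.0), (3.0, 1.0)
t_in, t_out = clip_segment_left_right(p0, p1)
# The clipped endpoint is a plain lerp of the clip-space coordinates.
clipped_end = (lerp(p0[0], p1[0], t_out), lerp(p0[1], p1[1], t_out))
```

The distances are linear along the segment, so each intersection is a single division, and every attribute of the new vertex is a plain lerp at the same `t`.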