Where cost hides on a Kubernetes AI cluster

TODO: This is a placeholder post. It exists only to exercise the /writing index, the post route, and the prose styles end-to-end. Replace it with a real field note from the os/content/ pipeline before launch.

Most teams look for cost in the obvious place — the monthly bill, sorted descending. But the bill is a symptom, not a cause. The cause sits a layer or two beneath it, in a scheduling decision or an idle GPU reservation that nobody owns.

Where it hides

A capable team can fix almost any problem it can see clearly. So when spend keeps climbing and nobody can explain it, the problem isn’t that the fix is hard — it’s that the cost is hidden. A few common hiding places:

Idle reservations. A node pool sized for peak, running at trough.
Over-provisioned requests. Pods that reserve four GPUs to use one.
Cross-zone traffic. Data transfer between zones that is billed as networking, so it never shows up as “compute.”
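The over-provisioned case is the easiest of the three to check mechanically. Here is a minimal sketch, assuming you can export, per pod, how many GPUs it requested and the most it was ever observed using. The `PodUsage` record, its field names, and the sample pods are all hypothetical, not the output of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str            # hypothetical pod name
    gpus_requested: int  # GPUs reserved in the pod spec
    gpus_used_peak: int  # most GPUs observed busy at any one time (assumed metric)


def overprovisioned(pods, min_wasted=1):
    """Return (name, wasted_gpus) for pods that reserve more GPUs than they use.

    Results are sorted with the largest waste first.
    """
    wasted = [(p.name, p.gpus_requested - p.gpus_used_peak) for p in pods]
    flagged = [(name, w) for name, w in wasted if w >= min_wasted]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up sample data for illustration only.
pods = [
    PodUsage("trainer-a", gpus_requested=4, gpus_used_peak=1),
    PodUsage("indexer-b", gpus_requested=2, gpus_used_peak=2),
    PodUsage("notebook-c", gpus_requested=1, gpus_used_peak=0),
]

flagged = overprovisioned(pods)
print(flagged)  # [('trainer-a', 3), ('notebook-c', 1)]
```

Peak usage is a deliberately conservative choice here: a pod is only flagged if it never touched the GPUs it reserved, which keeps bursty workloads off the list.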

Reading it back

Here’s the kind of query that starts to make the invisible visible. It groups the last 30 days of GPU-hours, a rough proxy for spend, by the workload and namespace that consumed them:

SELECT workload, namespace, SUM(gpu_hours) AS gpu_hours
FROM cluster_usage
WHERE day >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY workload, namespace
ORDER BY gpu_hours DESC;

Workload	Namespace	GPU-hours (last 30 days)
`rag-indexer`	`prod`	4,210
`vision-train`	`research`	1,980
`idle (unallocated)`	`—`	1,540

That last row is the one worth staring at: 1,540 GPU-hours that were reserved and allocated to nothing. Once a team can see the idle line, the fix is the easy part.
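To put a number on the idle line, divide it by the total. The sketch below uses the three GPU-hour figures from the table above and assumes those rows are the entire result set. If the real query returns more rows, the share will be smaller.

```python
# GPU-hours per row, copied from the table above.
# Assumption: these three rows are the whole result set.
gpu_hours = {
    "rag-indexer": 4210,
    "vision-train": 1980,
    "idle (unallocated)": 1540,
}


def share_of_total(usage, key):
    """Return the fraction of total GPU-hours attributed to `key`."""
    total = sum(usage.values())
    return usage[key] / total


idle_share = share_of_total(gpu_hours, "idle (unallocated)")
print(f"{idle_share:.1%}")  # 19.9%
```

Under that assumption, roughly one GPU-hour in five went to nothing at all.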