Where cost hides on a Kubernetes AI cluster

TODO: This is a placeholder post. It exists only to exercise the /writing index, the post route, and the prose styles end-to-end. Replace it with a real field note from the os/content/ pipeline before launch.

Most teams look for cost in the obvious place — the monthly bill, sorted descending. But the bill is a symptom, not a cause. The cause sits a layer or two beneath it, in a scheduling decision or an idle GPU reservation that nobody owns.

Where it hides

A capable team can fix almost any problem it can see clearly. So when spend keeps climbing and nobody can explain it, the problem isn’t that the fix is hard — it’s that the cost is hidden. A few common hiding places:

  • Idle reservations. A node pool sized for peak, running at trough.
  • Over-provisioned requests. Pods that reserve four GPUs to use one.
  • Cross-zone traffic. Egress between zones that never shows up as “compute.”
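The second hiding place is the easiest to check mechanically: compare what each pod reserved against what it actually kept busy. The sketch below is a minimal illustration, assuming a hypothetical `PodUsage` record with a GPU request and an average utilization figure. The field names, the threshold, and the sample pods are all invented for the example; none of them come from a real cluster or a specific monitoring API.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    """One pod's GPU request versus what it actually used (hypothetical schema)."""
    name: str
    gpus_requested: int
    gpus_used_avg: float  # average number of GPUs busy over the sampling window


def overprovisioned(pods, min_wasted_gpus=1.0):
    """Return (pod name, wasted GPUs) pairs, largest waste first.

    A pod is flagged when it reserves at least `min_wasted_gpus` more
    GPUs than it used on average. The threshold is an assumption.
    """
    wasted = [(p.name, p.gpus_requested - p.gpus_used_avg) for p in pods]
    flagged = [(name, w) for name, w in wasted if w >= min_wasted_gpus]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Made-up sample data, not measurements from a real cluster.
pods = [
    PodUsage("trainer-a", gpus_requested=4, gpus_used_avg=1.0),
    PodUsage("indexer-b", gpus_requested=2, gpus_used_avg=1.5),
    PodUsage("serving-c", gpus_requested=1, gpus_used_avg=1.0),
]

for name, waste in overprovisioned(pods):
    print(f"{name}: {waste:.1f} GPUs reserved but unused")
```

With this sample data only `trainer-a` is flagged, the "reserve four to use one" case from the list. In practice the utilization figures would have to come from whatever GPU metrics the cluster already exports.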

Reading it back

Here’s the kind of query that starts to make the invisible visible: it totals GPU-hours over the last 30 days, grouped by the workload and namespace that consumed them:

SELECT workload, namespace, SUM(gpu_hours) AS gpu_hours
FROM cluster_usage
WHERE day >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY workload, namespace
ORDER BY gpu_hours DESC;

Workload            Namespace  GPU-hours
rag-indexer         prod           4,210
vision-train        research       1,980
idle (unallocated)                 1,540

That last row is the one worth staring at. Once a team can see the idle line, the fix is the easy part.
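To put a number on the idle line, here is a small sketch that computes its share of the GPU-hours in the three rows shown above. The figures are copied from those rows; the `share_of_total` helper is a hypothetical name, and the result only describes the rows listed, not a full cluster.

```python
# GPU-hours copied from the three result rows shown above.
gpu_hours = {
    "rag-indexer": 4210,
    "vision-train": 1980,
    "idle (unallocated)": 1540,
}


def share_of_total(rows, key):
    """Fraction of the summed GPU-hours attributed to one row."""
    total = sum(rows.values())
    return rows[key] / total


idle_share = share_of_total(gpu_hours, "idle (unallocated)")
print(f"idle share of listed GPU-hours: {idle_share:.1%}")
```

On these rows the idle line is 1,540 of 7,730 GPU-hours, a little under a fifth.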