Where cost hides on a Kubernetes AI cluster
TODO: This is a placeholder post. It exists only to exercise the
/writing index, the post route, and the prose styles end-to-end. Replace it with a real field note from the os/content/ pipeline before launch.
Most teams look for cost in the obvious place — the monthly bill, sorted descending. But the bill is a symptom, not a cause. The cause sits a layer or two beneath it, in a scheduling decision or an idle GPU reservation that nobody owns.
Where it hides
A capable team can fix almost any problem it can see clearly. So when spend keeps climbing and nobody can explain it, the problem isn’t that the fix is hard — it’s that the cost is hidden. A few common hiding places:
- Idle reservations. A node pool sized for peak, running at trough.
- Over-provisioned requests. Pods that reserve four GPUs to use one.
- Cross-zone traffic. Egress between zones that never shows up as “compute.”
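To make the second hiding place concrete, here is a minimal sketch of how over-provisioned requests might be flagged. It assumes you already have, per pod, the number of GPUs requested and the average number actually busy over some window. The `PodUsage` record, the `overprovisioned` helper, the 0.5 threshold, and the sample pods are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    # Hypothetical record; a real cluster would fill this from its own metrics.
    name: str
    gpus_requested: int
    gpus_used_avg: float  # average GPUs actually busy over the window


def overprovisioned(pods, threshold=0.5):
    """Return (name, utilization) for pods using less than `threshold` of their request."""
    flagged = []
    for pod in pods:
        if pod.gpus_requested <= 0:
            continue  # nothing reserved, so nothing to waste
        utilization = pod.gpus_used_avg / pod.gpus_requested
        if utilization < threshold:
            flagged.append((pod.name, round(utilization, 2)))
    # Worst offenders (lowest utilization) first.
    return sorted(flagged, key=lambda item: item[1])


# Illustrative data only, not measurements from a real cluster.
pods = [
    PodUsage("trainer-a", 4, 1.0),   # reserves four GPUs to use one
    PodUsage("indexer-b", 2, 1.8),
    PodUsage("notebook-c", 1, 0.1),
]
print(overprovisioned(pods))
```

The threshold is a judgment call: set it too high and busy pods get flagged, too low and the four-for-one reservations slip through.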
Reading it back
Here’s the kind of query that starts to make the invisible visible — grouping spend by the workload that actually drove it:
SELECT workload, namespace, SUM(gpu_hours) AS gpu_hours
FROM cluster_usage
WHERE day >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY workload, namespace
ORDER BY gpu_hours DESC;
| Workload | Namespace | GPU-hours |
|---|---|---|
| rag-indexer | prod | 4,210 |
| vision-train | research | 1,980 |
| idle (unallocated) | n/a | 1,540 |
That last row is the one worth staring at. Once a team can see the idle line, the fix is the easy part.
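One way to put a number on the idle line is to express it as a share of all GPU-hours. The sketch below uses the three rows from the table and assumes they are the complete result set; if the real query returns more rows, the share would be lower. The `idle_share` helper is hypothetical.

```python
def idle_share(rows, idle_label="idle (unallocated)"):
    """Fraction of total GPU-hours attributed to the idle row."""
    total = sum(hours for _, hours in rows)
    idle = sum(hours for label, hours in rows if label == idle_label)
    return idle / total if total else 0.0


# GPU-hours copied from the table above; assumed to be the full result set.
rows = [
    ("rag-indexer", 4210),
    ("vision-train", 1980),
    ("idle (unallocated)", 1540),
]
share = idle_share(rows)
print(f"{share:.1%}")  # 1,540 of 7,730 GPU-hours, about 19.9%
```

Under that assumption, roughly a fifth of the GPU-hours in the window went to nothing at all.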