Show HN: Orpheus, An Agent runtime that scales on queue depth and not CPU (github.com/arpitnath)
6 points by arpitnath42 5 hours ago | 1 comment


Hey HN! Over the last year, I’ve been running AI agents in production and kept hitting the same issue: responses would suddenly take 30+ seconds even though everything looked healthy. CPU was low, memory was normal, there were no errors. But requests were clearly getting stuck.

The reason turned out to be simple. These agents spend most of their time waiting for AI APIs to respond. While they wait, they barely use the CPU. So the system looks idle even when it’s overloaded. Low CPU doesn’t mean low demand.
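A little arithmetic makes this concrete (the numbers below are illustrative, not from the post): if each request spends seconds blocked on an upstream API and only milliseconds computing, a worker pool can be fully saturated while barely registering on a CPU graph.

```python
# Illustrative: why low CPU doesn't mean low demand for I/O-bound agents.
# Assumed numbers: each request waits ~10 s on an AI API and uses ~50 ms
# of actual CPU, served by a fixed pool of 8 workers.

api_wait_s = 10.0      # time blocked on the upstream model API
cpu_time_s = 0.05      # real compute per request
workers = 8

# Each worker finishes at most 1 request per (wait + cpu) seconds.
throughput_per_s = workers / (api_wait_s + cpu_time_s)

# CPU utilization is only the compute slice of each request's lifetime.
cpu_utilization = cpu_time_s / (api_wait_s + cpu_time_s)

incoming_per_s = 2.0   # demand: 2 requests/second
backlog_growth_per_s = incoming_per_s - throughput_per_s

print(f"throughput ~ {throughput_per_s:.2f} req/s")       # ~0.80 req/s: saturated
print(f"CPU utilization ~ {cpu_utilization:.1%}")         # ~0.5%: looks idle
print(f"queue grows by ~ {backlog_growth_per_s:.2f} req/s")
```

So the pool is overloaded (the queue grows by more than one request per second) while CPU utilization sits under one percent, which is exactly the signal a CPU-based autoscaler acts on.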

I first tried normal autoscaling based on CPU. It never worked. CPU stayed low, so nothing scaled up, and queues kept growing.

Then I moved to scaling on queue size, which worked much better. I tried a few setups with different queues and cloud tools. The idea was right, but the machinery was heavy: extra services, cloud-specific configs, and still slow container startups.

What I really wanted was something simpler: scale based on how much work is waiting, react quickly, and avoid complex infrastructure.
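The core of that idea fits in a few lines. This is a minimal sketch of queue-depth scaling, with names and thresholds of my own choosing (not Orpheus's actual API): pick a worker count from how much work is waiting, clamped to a floor and ceiling.

```python
# Sketch of queue-depth scaling: desired workers = ceil(pending / target),
# clamped to [min_workers, max_workers]. Names here are hypothetical.
import math

def desired_workers(queue_depth: int,
                    target_per_worker: int = 5,
                    min_workers: int = 1,
                    max_workers: int = 50) -> int:
    """Pick a worker count from pending work, not CPU."""
    want = math.ceil(queue_depth / target_per_worker)
    return max(min_workers, min(max_workers, want))

print(desired_workers(0))    # 1  (floor: keep one warm worker)
print(desired_workers(23))   # 5  (ceil(23 / 5))
print(desired_workers(900))  # 50 (capped at max_workers)
```

A real scaler would run this in a loop against the live queue depth and add hysteresis so workers aren't churned on every small fluctuation, but the signal itself is just the backlog.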

After running this approach in production for about seven months, I pulled it out into a small standalone system called Orpheus. It scales based on pending work, starts workers quickly, and keeps agent state on disk so it survives restarts. It doesn’t automatically retry crashed jobs, which helps avoid repeating side effects like emails or payments. It also avoids container overhead and gives each agent a built-in way to connect tools and models.
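The no-auto-retry point is worth unpacking. One common way to get it (all names below are mine, hypothetical, not Orpheus's actual API) is to persist a job's status to disk before running it: after a crash, jobs stuck "in flight" are surfaced for review rather than silently re-run, since their side effect may already have happened.

```python
# Sketch of at-most-once job handling with disk-persisted state.
# All names here are hypothetical, not Orpheus's actual API.
import json
import os
import tempfile

STATE = os.path.join(tempfile.gettempdir(), "agent_jobs.json")

def load_jobs() -> dict:
    if os.path.exists(STATE):
        with open(STATE) as f:
            return json.load(f)
    return {}

def save_jobs(jobs: dict) -> None:
    with open(STATE, "w") as f:
        json.dump(jobs, f)

def run_job(job_id: str, action) -> None:
    jobs = load_jobs()
    # Mark in-flight *before* the side effect, so a crash leaves evidence.
    jobs[job_id] = "in_flight"
    save_jobs(jobs)
    action()                      # e.g. send an email, charge a card
    jobs[job_id] = "done"
    save_jobs(jobs)

def recover() -> list:
    # After a restart: report crashed jobs instead of re-running them,
    # because the side effect may have fired before the crash.
    return [j for j, s in load_jobs().items() if s == "in_flight"]
```

This trades availability (someone has to look at crashed jobs) for safety (no duplicate emails or double charges), which is the right default for agents that perform irreversible actions.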

This isn’t meant to replace Kubernetes. It’s meant for AI agent workloads where CPU-based scaling gives the wrong signal.