A screwup is only real when shared
Optimizing resources and costs can be a real buzzkill, and when done manually, especially in a complex infrastructure environment like K8s, mistakes are bound to happen. It’s only human.
A misconfigured autoscaler here, a resource-devouring pod there, a “just-for-testing”
cluster left running for three weeks.
So let’s embrace our humanity by sharing our most glorious (and expensive) DevOops screwups. Share your most epic fails and the surprise cloud bills they triggered!
The best story will be immortalized in our next video.
Hear it from the best of them
After a migration from a GCP Autopilot cluster to cast.ai, I had a cluster I needed to delete.
The problem? Its name was very similar to the production cluster's.
I messed up and deleted the wrong cluster.
Luckily it happened during a planned maintenance window, so I had the time to DR it back to life.
Never will I again postpone a DR drill because someone higher up thinks it’s not important enough to do _right now_.
A couple of years ago, I was tasked with provisioning resources for a high-performance computing project. I don’t remember why, but I couldn’t really be bothered with it, and in my rush I accidentally provisioned a few metal instances instead of regular virtual ones. It took a couple of weeks to catch, and during that time those instances cost the company over $40K. I’ve been paying close attention to what I’m doing ever since.
I made this automation for creating and deleting clusters with Karpenter, using Python scripts and Terraform, for testing purposes. What I didn’t realize is that new nodes spun up by Karpenter aren’t added to the Terraform state. So every time I created and then destroyed a cluster, there were leftover nodes assigned to nothing at all. It ended up costing the company $14,500 before I figured it out.
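The failure mode above boils down to a state mismatch: Karpenter provisions nodes directly with the cloud provider, so `terraform destroy` only removes what Terraform tracked, and anything Karpenter created survives. A minimal sketch of catching that drift by diffing the two views (all node names here are made up; in practice the first list would come from `terraform state list` and the second from your cloud provider’s instance API):

```python
# Sketch: find nodes the cloud is still running but Terraform never tracked.
# Hypothetical data for illustration only -- real inputs would come from
# `terraform state list` and the cloud provider's API.

def find_orphans(terraform_nodes, running_nodes):
    """Return running nodes that Terraform state knows nothing about."""
    tracked = set(terraform_nodes)
    return sorted(n for n in running_nodes if n not in tracked)

terraform_nodes = ["test-cluster-node-1", "test-cluster-node-2"]
running_nodes = [
    "test-cluster-node-1",
    "test-cluster-node-2",
    "karpenter-node-abc123",  # spun up by Karpenter, never in state
    "karpenter-node-def456",
]

print(find_orphans(terraform_nodes, running_nodes))
```

Running a check like this after each teardown (or tagging Karpenter-launched nodes so a cleanup job can find them) would have surfaced the leftovers long before the bill did.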
© 2025 Zesty. All Rights Reserved
Learn more about us www.zesty.co