A screwup is only real when shared

Optimizing resources and costs can be a real buzzkill, and when done manually, especially in a complex infrastructure environment like K8s, mistakes are bound to happen. It’s only human.

A misconfigured autoscaler here, a resource-devouring pod there, a “just-for-testing” cluster left running for three weeks.

So let’s embrace our humanity by sharing our most glorious (and expensive) DevOops screwups. Share your most epic fails and the surprise cloud bills they triggered!

The best story will be immortalized in our next video.


Hear it from the best of them

After a migration from a GCP Autopilot cluster to cast.ai, I had a cluster I had to delete. The problem? Its name was very similar to the production cluster's. I messed up and deleted the wrong one. Luckily it was during a planned maintenance window and I had time to DR it back to life. Never again will I postpone a DR drill because someone higher up thinks it's not important enough to do _right now_.
A couple of years ago, I was tasked with provisioning resources for a high-performance computing project. I don't remember why, but I couldn't really be bothered with it, and in my rush I accidentally provisioned a metal instance instead of a regular virtual one. It took a couple of weeks to catch, and during that time that single instance cost the company over $40K. I've been trying to pay attention to what I'm doing ever since.
Made this automation for creating and deleting clusters with Karpenter, using Python scripts and Terraform for testing purposes, but I didn't realize that new nodes spun up by Karpenter aren't added to the Terraform state. So when I created and deleted the clusters, there were always leftover nodes not assigned to anything at all. It ended up costing the company $14,500 before I figured it out.
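
One way to catch leftovers like this is to list instances that still carry Karpenter's tags for a cluster you believe is gone. Here's a minimal sketch, assuming Karpenter's default karpenter.sh/nodepool tag, a kubernetes.io/cluster/&lt;name&gt; tag, and a hypothetical cluster name:

```python
# Sketch: find EC2 instances Karpenter launched for a cluster that was torn down.
# The tag keys and cluster name below are assumptions; adjust to your setup.
import boto3

DELETED_CLUSTER = "test-cluster"  # hypothetical name of the deleted test cluster

ec2 = boto3.client("ec2")
resp = ec2.describe_instances(
    Filters=[
        {"Name": "tag-key", "Values": ["karpenter.sh/nodepool"]},
        {"Name": "tag-key", "Values": [f"kubernetes.io/cluster/{DELETED_CLUSTER}"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

orphans = [
    inst["InstanceId"]
    for reservation in resp["Reservations"]
    for inst in reservation["Instances"]
]
print("Leftover Karpenter nodes:", orphans)
# Review the list before terminating anything, e.g.:
# ec2.terminate_instances(InstanceIds=orphans)
```
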
Worked at a media company, and we set up an SFTP tool that started replicating files across AWS regions every 30 minutes, but we didn't realize the costs involved until we got served a $150,000 S3 bill. AWS regions matter.
I do DevOps in e-commerce. We added new error logging in Azure without thinking about the cost. After 3 days we got a $50K bill just for trace logs.
Made one small update to an internal tool at the bank I used to work at. It ended up crashing more than 100k virtual servers. Total downtime was around 3 hours, and we lost over $1 million in revenue. Learned to always test properly before pressing the deploy button.
That time when I messed up an autoscaling config, left unused EBS volumes running for a month, and we got an $8,000 bill 🙁
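
A quick way to spot volumes like those is to list the ones sitting in the "available" state, i.e. not attached to any instance. A minimal sketch with boto3:

```python
# Sketch: list EBS volumes that aren't attached to any instance.
import boto3

ec2 = boto3.client("ec2")
paginator = ec2.get_paginator("describe_volumes")

for page in paginator.paginate(
    Filters=[{"Name": "status", "Values": ["available"]}]
):
    for vol in page["Volumes"]:
        print(vol["VolumeId"], f'{vol["Size"]} GiB', "created", vol["CreateTime"])
```
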
I messed up an AWS Glacier lifecycle policy, moved active data to long-term storage, and ended up with a $30,000 charge
Ran a Terraform destroy command on a production K8s cluster without realizing the cost. Recovery cost us $15,000 + downtime
Deleted a big bunch of S3 objects manually instead of setting them to expire. We ended up paying the delete request fee for each one, when we could've avoided it by just letting them expire.
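
The cheaper route mentioned here is an S3 lifecycle rule that expires objects instead of deleting them request by request. A minimal sketch with boto3; the bucket name, prefix, and retention period are placeholders:

```python
# Sketch: expire objects under a prefix after 30 days instead of issuing
# individual DELETE requests. Bucket, prefix, and days are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-logs-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-objects",
                "Filter": {"Prefix": "tmp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```
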
Once I wanted to test Spark on counting the number of words in a text file, so I provisioned 16 terabytes worth of high-RAM instances… it did manage to count the 6,742 words in the file, but we ended up with a $9,500 bill for unnecessary resources. People were not happy…