Walking the Price-Performance Tightrope: Best Practices for IOPS and Throughput
Balancing performance and cost-effectiveness in cloud systems has evolved into an art form, requiring considerable time and effort from companies determined to avoid compromising on either front. What’s the secret to skillfully walking this tightrope without tipping towards excessive costs or compromised performance? Read the following insights to find out!
In an ideal world, we would never need to compromise between cost and performance. As a developer, I’d have an unlimited budget to squeeze the absolute maximum that technology allows from each and every application.
Yet we don’t live in an ideal world.
And the fact is that I and my team – like so many of our colleagues – spend much of our time walking the price-performance tightrope. We’re constantly trying to keep our balance between delivering optimal Quality of Service (QoS) and user experience (UX), and not falling to the hard ground of budgetary limitations.
What does this look like in real life? How do we optimize storage performance for speed, efficiency, and cost reduction? In this article, we’ll take a deep dive into one instance where my team optimized a storage layer for improved application performance – ensuring superior product quality without breaking the bank.
The Balancing Act
My team was managing a data-intensive application that monitors the comments section of a high-traffic online publication. The app monitors, collects, and stores comments on this site, analyzes them, and then uses the data-driven conclusions both to dial in targeted advertisements and to filter out bad actors.
All this happens in real-time – in theory. The problem was that the app was suffering badly from latency and inconsistent performance – resulting in poor user experience.
It did not take long to find the culprit: subpar optimization of input/output operations per second (IOPS) and throughput in the storage system, Amazon Elastic Block Store (EBS).
As you may know, EBS offers a range of volume types, including general-purpose, provisioned IOPS, and throughput-optimized volumes. Yet while provisioned IOPS and throughput-optimized volumes offer serious performance benefits, they can also increase costs dramatically.
The trick is balance: identifying the EBS volume type that satisfies both our performance requirements and our budget constraints – weighing the high costs of provisioned IOPS and throughput-optimized volumes against the performance limitations of general-purpose volumes.
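To make that trade-off concrete, here is a minimal decision sketch. The function names are mine, and the limits encoded below (gp2 earning 3 IOPS per GiB with a burst to 3,000; gp3 including 3,000 IOPS at any size, provisionable to 16,000; io1/io2 reaching 64,000 provisioned IOPS) reflect AWS's published EBS specifications at the time of writing – always check the current EBS documentation before relying on them.

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 earns 3 IOPS per GiB, floored at 100 and capped at 16,000."""
    return min(max(100, 3 * size_gib), 16_000)

def suggest_volume_type(required_iops: int, size_gib: int) -> str:
    """Cheapest-first selection: gp2 if its baseline covers the need,
    then gp3 (3,000 IOPS included, provisionable to 16,000), then
    provisioned-IOPS (io1/io2) as the expensive last resort."""
    if required_iops <= gp2_baseline_iops(size_gib):
        return "gp2"
    if required_iops <= 16_000:
        return "gp3"
    if required_iops <= 64_000:
        return "io2"
    raise ValueError("beyond a single EBS volume; consider striping")
```

A 100 GiB gp2 volume, for instance, only earns a 300 IOPS baseline, so a workload needing 6,000 sustained IOPS on that capacity already pushes you to gp3 or beyond.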
Backgrounder: Tuning IOPS and Throughput
Although IOPS and throughput highlight different performance aspects, they are interconnected—influenced by factors like block size and queue depth. This means that optimizing both is vital for peak cloud storage performance. However, achieving this balance can be challenging, since you need to take into account QoS, the inherent limitations of hardware and SSDs, and the intricacies of effective monitoring and management. When optimizing, we also need to take into account:
- Cost – Higher-quality storage solutions often incur higher costs, increasing the total cost of ownership (TCO).
- Complexity – The tuning process is intricate and demands careful execution to reach your desired performance level.
- System Stability – Any changes made to improve performance should not compromise the system’s stability or lead to unexpected downtime.
- Bottlenecks – Performance can be constrained by bottlenecks in various infrastructure components.
First Stage: Optimizing I/O size, Latency, CPUs, Sharding
Since we identified that the problem stemmed from EBS, the first thing my team did was take a deep dive into the application’s storage IOPS and throughput. Specifically, we zoomed into I/O size, network latency, CPUs, and sharding:
- I/O size – To optimize storage throughput in the cloud, we checked the application’s I/O size. By raising it to 256 KB, each read or write operation handles a larger amount of data, increasing data throughput.
- Network latency – Cloud storage volumes often face higher latency due to their network connectivity. To mitigate this, we ensured that the application was dispatching large I/O requests in multiples. For instance, instead of sending ten 25 KB requests separately, we bundled them into a single 250 KB request. This approach reduces the number of I/O requests the storage system needs to manage, reducing latency and making better use of the network bandwidth.
- CPUs – Every I/O read and write cycle consumes CPU power. We monitored the application’s CPU usage to ensure that its CPU capacity could handle peak I/O demands without reaching full utilization, thus preventing queuing and throughput reduction.
- Sharding – Lastly, we used sharding – distributing the data across several disks – to enhance performance by allowing simultaneous read and write operations on different shards, speeding data access and processing.
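The I/O-size and batching ideas above can be sketched in a few lines. This is an illustrative buffer, not our production code: it coalesces many small writes into roughly 256 KB requests so the storage layer sees a few large operations instead of many small ones. A real implementation would also need timed flushes, error handling, and durability guarantees.

```python
import os

CHUNK = 256 * 1024  # target I/O size: 256 KB, as in our tuning

class BatchedWriter:
    """Coalesce small writes into ~256 KB I/O requests (illustrative
    sketch of the batching technique described above)."""

    def __init__(self, fd: int, chunk: int = CHUNK):
        self.fd = fd
        self.chunk = chunk
        self.buf = bytearray()

    def write(self, data: bytes) -> None:
        self.buf += data
        # Emit full-sized chunks; anything smaller waits for more data.
        while len(self.buf) >= self.chunk:
            os.write(self.fd, self.buf[:self.chunk])  # one large I/O
            del self.buf[:self.chunk]

    def flush(self) -> None:
        if self.buf:
            os.write(self.fd, bytes(self.buf))
            self.buf.clear()
```

With this in place, ten separate 25 KB writes reach the disk as a single 250 KB request on flush, rather than ten round trips to the volume.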
Next Stage: Using Zesty Disk to Get an Extra Boost
While it’s possible to optimize IOPS and throughput manually, it is a difficult and time-consuming process at scale, especially when managing a multi-tenancy environment.
That’s why our next step was to put Zesty Disk to work. Zesty Disk is a block storage autoscaler primarily used to ensure available disk capacity by scaling volumes up and down in size, but a fringe benefit of the product is that it also optimizes IOPS and throughput for cloud-based workloads.
In brief, this is how it works: Zesty Disk creates a replica filesystem of the original block storage, with all the same cloud-native policies and configurations. The main difference is that instead of consisting of a single volume, the filesystem is composed of several smaller storage volumes. Because IOPS burst capacity is tied to the volume rather than to the filesystem, creating multiple volumes in the filesystem, each with its own burst capacity, lets their combined power be leveraged, automatically providing a boost in IOPS in the range of 300% or more. I like that this extra burst is achieved seamlessly, without the effort of provisioning additional IOPS, dealing with complex configurations, or paying extra costs.
For the application in question, Zesty Disk aggregated each volume’s allocation of IOPS and throughput to deliver a compounded performance improvement. The results: Zesty Disk scaled the filesystem to 3, then 4, then 5 volumes, doubling the prior maximum of 3,000 IOPS to 6,000 (with bursts up to 9,000) and resolving the application’s performance issues at no extra cloud cost.
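The arithmetic behind that boost can be sketched simply: each gp2 volume under 1 TiB can burst to 3,000 IOPS on its own credit bucket, so spreading one filesystem across several such volumes multiplies the available burst ceiling. This is an illustrative model with assumed volume sizes; Zesty Disk’s actual scaling logic is its own.

```python
def gp2_burst_iops(size_gib: int) -> int:
    """gp2 volumes under 1 TiB can burst to 3,000 IOPS; larger volumes'
    baseline (3 IOPS/GiB, capped at 16,000) meets or exceeds that."""
    baseline = min(max(100, 3 * size_gib), 16_000)
    return max(baseline, 3_000)

def filesystem_burst_iops(volume_sizes_gib):
    """Burst buckets are per volume, so a filesystem spread across
    several volumes can draw on all of them at once."""
    return sum(gp2_burst_iops(s) for s in volume_sizes_gib)
```

For example, one 300 GiB volume bursts to 3,000 IOPS, while the same capacity split across three 100 GiB volumes can burst to 9,000 IOPS combined – the same order of gain we saw in practice.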
The Takeaways
Striking a balance between cost and performance in storage optimization is an ongoing challenge for developers like me and my team. To solve the application challenges for this client, we fine-tuned I/O size, managed latency, monitored CPUs, and employed sharding to maximize data throughput and optimize resources. But the game-changer was Zesty Disk – which seamlessly boosted IOPS by up to 300%, doubling the application’s performance without extra costs.
The key lesson? Balancing performance, cost-effectiveness, and system stability is an art. It demands continuous adaptation and leveraging innovative solutions like Zesty Disk to deliver exceptional user experiences within budgetary constraints.