How I Improved Storage Performance and Cost Efficiency for My Model Training on EMR

I was tasked with building a recommendation model for our shop site. We have hundreds of product types (e.g., shirts, pants, shoes, bags, and more), each with different attributes (e.g., color, size, material, etc.), with tens of thousands of customers making hundreds of thousands of purchases every year. This means that we gather tens of terabytes of data each month. While I could have trained a basic model with a subset of the data locally, it was much more effective to use cloud resources to achieve this with the real data.

It seemed straightforward: Create a Spark cluster with Amazon EMR, connect my Jupyter Notebook to the cluster, and train an ML model to give intelligent recommendations. But as it turns out, training on enormous data sets isn’t just about distributing the workload to multiple instances; it’s also vital to give each instance enough storage capacity. Since I wasn’t prepared to throw a pile of money at this problem, I tried a number of different approaches.

In this article, I’ll share the storage-related issues I encountered and how I solved them.

Why I chose Amazon EMR for machine learning

Amazon EMR is a powerful service offering an array of features. It lets you run your workloads on frameworks like Apache Spark, Hadoop, Flink, and more. And it allows you to use an EC2 cluster that uses VMs, or an EKS cluster that uses Docker containers. Recently, AWS has even added a serverless option. It also offers countless configuration options for your processing nodes (e.g., instance types, block storage types, and more).

Overall, it’s a pretty flexible solution that fits many use cases, but what mattered most to me was its integration with other AWS services. We were already using many of them, AWS services work very well with each other, and having some AWS know-how in-house didn’t hurt either. That’s why I chose to go with Amazon EMR.

Challenges with storage performance and cost efficiency in Amazon EMR

When I tried to train my model on EMR for the first time, it didn’t work. I knew the data set was big; that’s why I used Spark, which let me distribute small pieces to each instance. What I hadn’t accounted for was how quickly the data generated on each node can grow.

I saw some of this during my local training but underestimated the issues it could cause when using the whole data set to train the model. The moment the disks ran full, everything came to a screeching halt.

So I increased the size of each disk. That got the first run through smoothly, but the problem wasn’t really solved.
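For reference, enlarging the disks on an EMR instance group comes down to the EBS settings passed when the cluster is created. The sketch below builds that fragment as a plain dictionary in the shape that boto3's `run_job_flow` accepts for an instance group. The helper name, instance type, node count, and sizes are illustrative assumptions, not my actual configuration.

```python
def core_group_with_ebs(size_gb, volumes_per_instance=1, instance_count=4):
    """Build an EMR instance-group entry with explicit EBS storage.

    All concrete values are placeholders for illustration.
    """
    if size_gb <= 0 or volumes_per_instance <= 0:
        raise ValueError("size and volume count must be positive")
    return {
        "Name": "core-nodes",
        "InstanceRole": "CORE",
        "InstanceType": "m5.2xlarge",  # assumed instance type
        "InstanceCount": instance_count,
        "EbsConfiguration": {
            "EbsBlockDeviceConfigs": [
                {
                    "VolumeSpecification": {
                        "VolumeType": "gp3",
                        "SizeInGB": size_gb,
                    },
                    "VolumesPerInstance": volumes_per_instance,
                }
            ],
        },
    }


# Bumping the disk size means changing one number and recreating the group.
group = core_group_with_ebs(size_gb=2000)
print(group["EbsConfiguration"])
```

The catch is visible right in the call: the size is a fixed number chosen before the run starts.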

There’s the preprocessing, with steps like imputing missing values, one-hot encoding, and feature engineering, plus the actual training, which requires you to store intermediate results. Cross-validation and hyperparameter tuning generate even more intermediate data. So any change to the training algorithm or the data set (i.e., adding and removing products with different attributes) would’ve taken me right back to square one.

Performance also needed some improvement. EC2’s general-purpose SSD volumes top out at 16,000 IOPS each, and I was running at that limit. Provisioned IOPS SSDs offer four times the IOPS but cost ten times as much, and I wasn’t convinced that was worth it.
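The trade-off is easy to put in numbers. The snippet below uses only the ratios I quoted (four times the IOPS at ten times the price); the baseline cost is normalized to 1 rather than taken from any price list.

```python
GP_IOPS = 16_000             # general-purpose SSD ceiling quoted in the text
PIOPS_MULTIPLIER = 4         # provisioned IOPS: four times the IOPS...
PIOPS_COST_MULTIPLIER = 10   # ...at ten times the cost


def relative_cost_per_iops(iops_multiplier, cost_multiplier):
    """Cost per IOPS relative to the general-purpose baseline (baseline = 1.0)."""
    return cost_multiplier / iops_multiplier


piops_iops = GP_IOPS * PIOPS_MULTIPLIER
ratio = relative_cost_per_iops(PIOPS_MULTIPLIER, PIOPS_COST_MULTIPLIER)
print(f"{piops_iops} IOPS at {ratio:.1f}x the cost per IOPS")
```

In other words, under these ratios every single IOPS costs 2.5 times as much as on the general-purpose volume.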

My attempts to resolve these issues

I tried two approaches: the first was to invest more time in capacity planning; the second was to do no planning and provision dynamically at runtime instead.

Better capacity planning

The first approach I took was to precalculate the required storage based on the changes in data since the last run and to provision the disks accordingly.

I overprovisioned the first training run and wrote down the disk usage to get to a reasonable starting figure. When planning the next run, I checked which new products we had, which ones were discontinued, and what attributes they’d add to the data set. For example, if we got a new product with four new colors, I’d end up with four new dimensions after one-hot encoding. Ultimately, I could use that data to update my provisioned storage.
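A rough sketch of that bookkeeping follows. It assumes disk usage scales linearly with the number of encoded columns, which is a simplification, and the function name, headroom, and figures are made up for illustration.

```python
def estimate_storage_gb(baseline_gb, baseline_columns,
                        added_columns, removed_columns=0, headroom=0.2):
    """Scale the measured disk usage by the change in encoded columns.

    Assumes usage grows linearly with column count (a simplification)
    and adds a safety margin on top.
    """
    new_columns = baseline_columns + added_columns - removed_columns
    if baseline_columns <= 0 or new_columns <= 0:
        raise ValueError("column counts must be positive")
    scaled = baseline_gb * new_columns / baseline_columns
    return scaled * (1 + headroom)


# A new product with four new colors adds four one-hot columns.
estimate = estimate_storage_gb(baseline_gb=1000, baseline_columns=400,
                               added_columns=4)
print(round(estimate, 1))
```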

This approach worked well for updates to the training data, but less so for updates to the training algorithm. As soon as I threw hyperparameter tuning into the mix, my intermediate results multiplied in size.
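To see why tuning blows up so quickly, it helps to count the model fits an exhaustive grid search with k-fold cross-validation performs. The grid below is hypothetical (the parameter names are typical for a factorization-style recommender, not my actual settings).

```python
from itertools import product


def count_training_runs(param_grid, cv_folds):
    """Number of model fits for an exhaustive grid search with k-fold CV."""
    combinations = list(product(*param_grid.values()))
    return len(combinations) * cv_folds


# Hypothetical hyperparameter grid.
grid = {
    "rank": [10, 50, 100],
    "reg_param": [0.01, 0.1],
    "max_iter": [10, 20],
}
runs = count_training_runs(grid, cv_folds=3)
print(runs)  # 12 combinations x 3 folds = 36 fits
```

Not all of these fits keep their intermediate data on disk at the same time, but depending on parallelism and caching, several of them can, and that is exactly the part a column-based estimate misses.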

Dynamically provisioning volumes

My second approach was to update the storage by adding and removing volumes dynamically.

After some research, I discovered a service that offered dynamic disk resizing for EC2 instances, called Zesty Disk. It adds and removes storage volumes on the fly, allowing the provisioned storage to grow and shrink as needed. It eliminates the headaches of capacity planning and actually works.

Just as garbage collection in programming languages prevents your apps from running out of memory, Zesty Disk prevents the instance from running out of storage. Do I provision more storage to support traffic to the new winter collection, or do I want to try cross-validation? Not an issue: Zesty Disk manages the storage. The best part? It works without the additional costs of overprovisioning since everything is updated on demand.
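To make the idea concrete, here is a minimal sketch of a threshold-based resize decision. This is not how Zesty Disk works internally (I don't know its implementation); the thresholds and function name are assumptions, and a real solution would also have to attach or detach volumes and adjust the filesystem.

```python
import shutil


def resize_decision(used_bytes, total_bytes, grow_above=0.8, shrink_below=0.4):
    """Return 'grow', 'shrink', or 'hold' based on disk utilization."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    utilization = used_bytes / total_bytes
    if utilization > grow_above:
        return "grow"
    if utilization < shrink_below:
        return "shrink"
    return "hold"


# Check the root filesystem of the machine this runs on.
usage = shutil.disk_usage("/")
print(resize_decision(usage.used, usage.total))
```

Keeping a gap between the two thresholds avoids flapping, where the storage would grow and shrink again on every small change in usage.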

I also witnessed an improvement in application performance. Zesty leverages several small volumes instead of the one big volume I’d been using when I was doing capacity planning. This means I get free additional IOPS for each volume that’s used within the Zesty Disk. Sure, it isn’t the fourfold IOPS that provisioned IOPS SSDs bring, and it varies with the number of volumes my training run needs, but it doesn’t cost 10 times as much as the provisioned IOPS SSDs do.
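A back-of-the-envelope view of why several small volumes help: the sketch assumes each volume contributes its full IOPS and that the combined figure is capped by an instance-level limit. The per-volume default of 3,000 is the gp3 baseline; the instance cap is a placeholder that you would need to look up for your instance type.

```python
def aggregate_iops(volume_count, iops_per_volume=3_000, instance_cap=None):
    """Combined IOPS of several volumes, optionally capped per instance."""
    if volume_count < 0 or iops_per_volume < 0:
        raise ValueError("counts and IOPS must be non-negative")
    total = volume_count * iops_per_volume
    if instance_cap is not None:
        total = min(total, instance_cap)
    return total


# The cap below is a placeholder value, not a figure from my cluster.
for n in (1, 4, 8):
    print(n, aggregate_iops(n, instance_cap=18_750))
```

This ignores how the data is spread across the volumes, so treat the result as an upper bound rather than a measurement.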


Machine learning lets you put enormous amounts of data to work, but it is a storage- and I/O-intensive workload, and that comes at a cost. Solutions like Spark and Amazon EMR are awesome tools that definitely make it easier to work with such big data sets, but machine learning remains a challenging task.

The required block storage can multiply when trying different learning methods; for example, when multiple models are created in parallel and compared. At first, 1 TB is sufficient; then all of a sudden you need 5 TB! If you throw in a variable data-set size, it could also fluctuate and end up requiring anywhere between 4 TB and 7 TB. Who knows?

With the dynamic storage resizing option I discovered, I could set a reasonably low starting size. Instead of provisioning 5 TB for each server, I’m now provisioning roughly half of that, saving an average of more than 50% of my initial costs.
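The saving can be sketched with simple arithmetic. Only the 5 TB figure comes from my setup; the average dynamic size, the node count, and the price per GB-month are placeholders for illustration, not quotes from a price list.

```python
def monthly_storage_cost(tb_per_node, nodes, price_per_gb_month):
    """Monthly block-storage bill for a cluster (1 TB counted as 1,000 GB)."""
    return tb_per_node * 1000 * nodes * price_per_gb_month


def savings_fraction(static_tb, dynamic_avg_tb):
    """Share of the static provisioning cost saved by a smaller average size."""
    if static_tb <= 0:
        raise ValueError("static_tb must be positive")
    return 1 - dynamic_avg_tb / static_tb


# Placeholder inputs: 10 nodes, 0.08 per GB-month, 2.4 TB average when dynamic.
static_cost = monthly_storage_cost(5, nodes=10, price_per_gb_month=0.08)
dynamic_cost = monthly_storage_cost(2.4, nodes=10, price_per_gb_month=0.08)
print(round(static_cost), round(dynamic_cost),
      f"{savings_fraction(5, 2.4):.0%}")
```

Because the price and node count cancel out, the saved fraction depends only on how far the average provisioned size drops below the static one.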

So I’ve since bid farewell to capacity planning for storage and taken the dynamic provisioning route. Depending on your requirements, you might be able to build a custom solution for this in a few weeks, but I had neither the time nor the resources, so I went with Zesty Disk and focused on improving the training algorithm instead.