Optimizing Storage for AI Workloads


Machine learning requires extensive resources, so algorithm teams need support from cloud engineering. This article gives some pointers on the challenges AI workloads pose for block storage and what cloud engineers can do to mitigate them.

So I landed this new gig for a blogging platform where the company—following bad experiences with Google and a growing number of privacy-sensitive users—decided to offer its own advertisement product. Since the platform had also experienced considerable growth, they finally had enough data to make this work.

My job was to build an advertisement engine—a system that would select ads from our customers and display them to our users without sharing their browsing data with Google. They had dozens of advertisers interested and over a million users, so the plan was to let AI handle the ad selection, and I was hired to make it happen.

I planned to use online reinforcement learning (RL) with Rudder. The ad selection engine would be an agent that shows users ads and gets rewarded based on their clicks. Later, when a user converted, I’d redistribute delayed rewards. This scheme was supposed to ensure the agent didn’t just get optimized for clicks.

The tricky part was that a usual RL episode terminates when a user clicks an ad, while redistributing delayed rewards keeps the episode open for an unknown length of time, unless you enforce a timeout.
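To make that concrete, here is a minimal sketch of how an episode could collect an immediate click reward and later have a delayed conversion reward spread back over its steps. The dataclasses and the uniform redistribution are my own simplification for illustration; Rudder itself learns the redistribution rather than splitting it evenly.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Step:
    ad_id: str
    shown_at: float
    reward: float = 0.0  # updated on click and again when rewards are redistributed

@dataclass
class Episode:
    user_id: str
    steps: list[Step] = field(default_factory=list)
    open: bool = True

    def record_impression(self, ad_id: str) -> None:
        self.steps.append(Step(ad_id=ad_id, shown_at=time.time()))

    def record_click(self, ad_id: str, click_reward: float = 1.0) -> None:
        # Immediate reward: credit the most recent step that showed the clicked ad.
        for step in reversed(self.steps):
            if step.ad_id == ad_id:
                step.reward += click_reward
                break

    def record_conversion(self, conversion_reward: float) -> None:
        # Delayed reward: spread the conversion value back over the whole episode.
        # (Rudder would learn this redistribution; a uniform split is a stand-in.)
        if self.steps:
            share = conversion_reward / len(self.steps)
            for step in self.steps:
                step.reward += share
        self.open = False  # only now can the episode terminate and be archived
```

Note that the episode only closes once a conversion (or a timeout) arrives, and that is exactly what keeps its data on disk for so long.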

Here’s what I found.

Challenges with Online RL

Online RL, that is, training agents on live data as it comes in, can be more accurate than training on a stale, offline dataset. But depending on the amount of data your users generate, it can be pretty slow. And of course, you need to keep the live data around for training for as long as an RL episode runs.

Variable Storage Requirements

Planning for the right storage capacity is tough since the number of active users determines how much training data you generate. If a single writer publishes a banger article, you get a huge spike in traffic. That's nice; the more data you have, the faster the training goes. But it's also problematic since it can overwhelm your block storage if you don't plan accordingly. Because of this, you usually need to limit the number of active users that feed your training.
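A simple way to enforce that limit is to decide per user whether their episodes get written to training storage at all. The cap, the sample rate, and the set-based admission below are illustrative placeholders, not what the platform actually ran.

```python
import random

MAX_ACTIVE_TRAINING_USERS = 50_000  # illustrative cap sized to the provisioned volume

active_training_users: set[str] = set()

def admit_for_training(user_id: str, sample_rate: float = 0.2) -> bool:
    """Decide whether this user's episodes are written to training storage."""
    if user_id in active_training_users:
        return True
    if len(active_training_users) >= MAX_ACTIVE_TRAINING_USERS:
        return False  # storage is at capacity; keep serving ads, stop recording
    if random.random() < sample_rate:
        active_training_users.add(user_id)
        return True
    return False
```

The trade-off is obvious: every user you turn away is training data you never see, which is exactly the compromise described later.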

Delayed Rewards Extend RL Episodes

Rudder allows you to redistribute delayed rewards to inform your agent about important outcomes of its actions that only show up later. The agent first gets an immediate reward when a user clicks an ad; then, when the user buys, signs up, or does whatever else the advertiser wants, the agent gets a delayed reward. But this kind of reward system requires keeping the episode data around for much longer, which means more episodes are active simultaneously and, in turn, more data sits on block storage.
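It's easy to underestimate how much this stretches storage. A back-of-the-envelope estimate with made-up numbers shows the effect: by Little's law, the number of episodes open at once is roughly the arrival rate times the average episode duration, so stretching episodes from minutes to a day multiplies the live data on disk by a couple of hundred.

```python
def live_episode_storage_gb(users_per_hour: float,
                            avg_episode_hours: float,
                            mb_per_episode: float) -> float:
    """Rough estimate of live episode data held on block storage at once."""
    concurrent_episodes = users_per_hour * avg_episode_hours  # Little's law
    return concurrent_episodes * mb_per_episode / 1024

# Click-terminated episodes: a few minutes long.
print(live_episode_storage_gb(20_000, 0.1, 2))   # ~3.9 GB
# Conversion-extended episodes: kept open for a day.
print(live_episode_storage_gb(20_000, 24, 2))    # ~937.5 GB
```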

Unpredictable User Conversions

Advertisers have different types of conversions, and each user reacts differently to ads. Sometimes, users see something and buy it immediately; sometimes, they have to think about it for a few hours. This makes the length of the RL episodes random. Sure, adding a timeout is a good idea—users probably won’t convert anymore after some time—but choosing that timeout might not be trivial. All this makes capacity planning for storage a guessing game that burdens the RL process with many assumptions.
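Whatever timeout you pick, you still need something that force-closes stale episodes so their storage can be reclaimed. The sweep below reuses the Episode sketch from earlier; the six-hour cutoff and the in-memory dict are assumptions for illustration.

```python
import time

CONVERSION_TIMEOUT_S = 6 * 3600  # assumption: most conversions happen within 6 hours

def sweep_stale_episodes(episodes: dict[str, "Episode"]) -> list[str]:
    """Close episodes whose last activity is older than the timeout so their
    storage can be reclaimed; returns the ids of the episodes that were closed."""
    now = time.time()
    closed = []
    for episode_id, episode in episodes.items():
        last_activity = max((s.shown_at for s in episode.steps), default=0.0)
        if episode.open and now - last_activity > CONVERSION_TIMEOUT_S:
            episode.open = False  # terminate without waiting for a delayed reward
            closed.append(episode_id)
    return closed
```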

Impacting Production Performance

Then, there's the aspect of production performance with online RL. When you train on live data as users generate it, you need to ensure the training works fast enough that it doesn't impact the rest of the business. Users want to read their articles without waiting for ads to load, and they'll leave as soon as they're done reading, so the window to serve an ad is short and fast storage devices are necessary.
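Fast disks help, but you can also keep episode writes off the request path entirely. The sketch below hands records to a background thread through a bounded queue so the ad response never waits on block storage; the queue size, the JSONL file, and the drop-on-full behavior are assumptions, not measured settings.

```python
import json
import queue
import threading

write_queue: queue.Queue = queue.Queue(maxsize=10_000)  # bounded so memory stays flat

def writer_loop(path: str = "episodes.jsonl") -> None:
    """Drain the queue and append records to block storage off the request path."""
    with open(path, "a") as f:
        while True:
            record = write_queue.get()
            f.write(json.dumps(record) + "\n")
            write_queue.task_done()

threading.Thread(target=writer_loop, daemon=True).start()

def log_step(record: dict) -> None:
    """Called from the ad-serving path; never blocks on disk."""
    try:
        write_queue.put_nowait(record)
    except queue.Full:
        pass  # drop the sample rather than delay the ad response
```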

Tackling Online RL Issues

When I started designing the system, I made some assumptions about the storage requirements but then noticed they were highly variable for the reasons mentioned above.

The obvious solution seemed to be to dial up the block storage capacity right from the get-go, but my project manager wasn't too happy with the costs. So I had to compromise by setting shorter timeouts and sampling the user base more aggressively, which reduced the validity of the training results.

Then, there was the issue of disk performance. Faster disks are more expensive, so paying for performance would eat into the budget left for capacity.

All in all, it seemed like a farce. Why do all the work if I couldn't trust my results right from the start? So I abandoned that approach and did some research, which led me to Zesty Disk. The tool did just what I wanted: scaling storage capacity up and down as needed.

It enabled me to start with small volumes that cut costs as long as traffic was low and then to reinvest those savings later when traffic spikes occurred. 

Takeaways

There are multiple factors that can impact storage requirements when training AI models, ranging from general aspects like the training method to very specific factors tied to the application domain.

  • An algorithm change can multiply the training data, even if the source data stays the same.
  • Training online can impact production systems, especially when the agents interact with users.
  • Live data is unpredictable; if you want to plan for it, you need huge margins, which aren’t always available.
  • Reacting to storage requirements dynamically is more effective than predicting them.

When we considered Rudder, we knew our training episodes would become longer, but not all of them. Zesty Disk gave us the ability to dynamically shrink storage on nodes with a lower load so we could cut costs and instead use those savings for storage on other nodes with higher loads.