Optimizing Storage for AI Workloads


Machine learning requires extensive resources, so algorithm teams need support from cloud engineering. This article gives some pointers on the challenges AI workloads pose for block storage and what cloud engineers can do to mitigate them.

So I landed this new gig for a blogging platform where the company—following bad experiences with Google and a growing number of privacy-sensitive users—decided to offer its own advertisement product. Since the platform had also experienced considerable growth, they finally had enough data to make this work.

My job was to build an advertisement engine—a system that would select ads from our customers and display them to our users without sharing their browsing data with Google. They had dozens of advertisers interested and over a million users, so the plan was to let AI handle the ad selection, and I was hired to make it happen.

I planned to use online reinforcement learning (RL) with Rudder. The ad selection engine would be an agent that shows users ads and gets rewarded based on their clicks. Later, when a user converted, I’d redistribute delayed rewards. This scheme was supposed to ensure the agent didn’t just get optimized for clicks.

The tricky part was that a usual RL episode terminates when a user clicks an ad, while redistributing delayed rewards keeps the episode open for an unknown length of time, unless you enforce a timeout.
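To make that concrete, here is a minimal sketch of how an episode could collect an immediate click reward and later have a delayed conversion reward spread back over its steps. The dataclasses and the uniform redistribution are my own simplification for illustration; Rudder itself learns the redistribution rather than splitting it evenly.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Step:
    ad_id: str
    shown_at: float
    reward: float = 0.0  # updated on click and again when rewards are redistributed

@dataclass
class Episode:
    user_id: str
    steps: list[Step] = field(default_factory=list)
    open: bool = True

    def record_impression(self, ad_id: str) -> None:
        self.steps.append(Step(ad_id=ad_id, shown_at=time.time()))

    def record_click(self, ad_id: str, click_reward: float = 1.0) -> None:
        # Immediate reward: credit the most recent step that showed the clicked ad.
        for step in reversed(self.steps):
            if step.ad_id == ad_id:
                step.reward += click_reward
                break

    def record_conversion(self, conversion_reward: float) -> None:
        # Delayed reward: spread the conversion value back over the whole episode.
        # (Rudder would learn this redistribution; a uniform split is a stand-in.)
        if self.steps:
            share = conversion_reward / len(self.steps)
            for step in self.steps:
                step.reward += share
        self.open = False  # only now can the episode terminate and be archived
```

Note that the episode only closes once a conversion (or a timeout) arrives, and that is exactly what keeps its data on disk for so long.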

Here’s what I found.

Challenges with Online RL

Online RL, that is, training agents on live data as it comes in, can be more accurate than training on a stale, offline dataset. But depending on the amount of data your users generate, it can be pretty slow. And of course, you need to keep the live data around for training for as long as an RL episode runs.

Variable Storage Requirements

Planning for the right storage capacity is tough since the number of active users determines how much training data you generate. If a single writer publishes a banger article, you get a huge spike in traffic. That's nice; the more data you have, the faster the training goes. But it's also problematic since it can overwhelm your block storage if you don't plan accordingly. Because of this, you usually need to limit the number of active users that feed your training.
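A simple way to enforce that limit is to decide per user whether their episodes get written to training storage at all. The cap, the sample rate, and the set-based admission below are illustrative placeholders, not what the platform actually ran.

```python
import random

MAX_ACTIVE_TRAINING_USERS = 50_000  # illustrative cap sized to the provisioned volume

active_training_users: set[str] = set()

def admit_for_training(user_id: str, sample_rate: float = 0.2) -> bool:
    """Decide whether this user's episodes are written to training storage."""
    if user_id in active_training_users:
        return True
    if len(active_training_users) >= MAX_ACTIVE_TRAINING_USERS:
        return False  # storage is at capacity; keep serving ads, stop recording
    if random.random() < sample_rate:
        active_training_users.add(user_id)
        return True
    return False
```

The trade-off is obvious: every user you turn away is training data you never see, which is exactly the compromise described later.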

Delayed Rewards Extend RL Episodes

Rudder allows you to redistribute delayed rewards to inform your agent about important outcomes of its actions that only show up later. The agent first gets an immediate reward when a user clicks an ad; then, when the user buys, signs up, or does whatever else the advertiser wants, the agent gets a delayed reward. But this kind of reward system requires keeping the episode data around for much longer, which means more episodes are active simultaneously and, in turn, more data sits on block storage.
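It's easy to underestimate how much this stretches storage. A back-of-the-envelope estimate with made-up numbers shows the effect: by Little's law, the number of episodes open at once is roughly the arrival rate times the average episode duration, so stretching episodes from minutes to a day multiplies the live data on disk by a couple of hundred.

```python
def live_episode_storage_gb(users_per_hour: float,
                            avg_episode_hours: float,
                            mb_per_episode: float) -> float:
    """Rough estimate of live episode data held on block storage at once."""
    concurrent_episodes = users_per_hour * avg_episode_hours  # Little's law
    return concurrent_episodes * mb_per_episode / 1024

# Click-terminated episodes: a few minutes long.
print(live_episode_storage_gb(20_000, 0.1, 2))   # ~3.9 GB
# Conversion-extended episodes: kept open for a day.
print(live_episode_storage_gb(20_000, 24, 2))    # ~937.5 GB
```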

Unpredictable User Conversions

Advertisers have different types of conversions, and each user reacts differently to ads. Sometimes, users see something and buy it immediately; sometimes, they have to think about it for a few hours. This makes the length of the RL episodes random. Sure, adding a timeout is a good idea—users probably won’t convert anymore after some time—but choosing that timeout might not be trivial. All this makes capacity planning for storage a guessing game that burdens the RL process with many assumptions.
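Whatever timeout you pick, you still need something that force-closes stale episodes so their storage can be reclaimed. The sweep below reuses the Episode sketch from earlier; the six-hour cutoff and the in-memory dict are assumptions for illustration.

```python
import time

CONVERSION_TIMEOUT_S = 6 * 3600  # assumption: most conversions happen within 6 hours

def sweep_stale_episodes(episodes: dict[str, "Episode"]) -> list[str]:
    """Close episodes whose last activity is older than the timeout so their
    storage can be reclaimed; returns the ids of the episodes that were closed."""
    now = time.time()
    closed = []
    for episode_id, episode in episodes.items():
        last_activity = max((s.shown_at for s in episode.steps), default=0.0)
        if episode.open and now - last_activity > CONVERSION_TIMEOUT_S:
            episode.open = False  # terminate without waiting for a delayed reward
            closed.append(episode_id)
    return closed
```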

Impacting Production Performance

Then, there's the aspect of production performance with online RL. When you train on live data as users generate it, you need to ensure the training works fast enough that it doesn't impact the rest of the business. Users want to read their articles without waiting for ads to load, and they'll leave as soon as they're done reading, so the window to serve an ad is short and fast storage devices are necessary.
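Fast disks help, but you can also keep episode writes off the request path entirely. The sketch below hands records to a background thread through a bounded queue so the ad response never waits on block storage; the queue size, the JSONL file, and the drop-on-full behavior are assumptions, not measured settings.

```python
import json
import queue
import threading

write_queue: queue.Queue = queue.Queue(maxsize=10_000)  # bounded so memory stays flat

def writer_loop(path: str = "episodes.jsonl") -> None:
    """Drain the queue and append records to block storage off the request path."""
    with open(path, "a") as f:
        while True:
            record = write_queue.get()
            f.write(json.dumps(record) + "\n")
            write_queue.task_done()

threading.Thread(target=writer_loop, daemon=True).start()

def log_step(record: dict) -> None:
    """Called from the ad-serving path; never blocks on disk."""
    try:
        write_queue.put_nowait(record)
    except queue.Full:
        pass  # drop the sample rather than delay the ad response
```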

Tackling Online RL Issues

When I started designing the system, I made some assumptions about the storage requirements but then noticed they were highly variable for the reasons mentioned above.

The obvious solution seemed to be to dial up the block storage capacity right from the get-go, but my project manager wasn't too happy with the costs. So I had to compromise by setting shorter timeouts and sampling the user base more aggressively, which reduced the validity of the training results.

Then, there was the issue of disk performance. Faster disks are more expensive, so paying for performance would eat into the budget left for capacity.

All in all, it seemed like a farce. Why do all the work if I couldn't trust my results right from the start? So I abandoned that approach and did some research, which led me to Zesty Disk. The tool did just what I wanted: scaling storage capacity up and down as needed.

It enabled me to start with small volumes that cut costs as long as traffic was low and then to reinvest those savings later when traffic spikes occurred. 

Takeaways

There are multiple factors that can impact storage requirements when training AI models, ranging from general aspects like the training method to very specific factors tied to the application domain.

  • An algorithm change can multiply the training data, even if the source data stays the same.
  • Training online can impact production systems, especially when the agents interact with users.
  • Live data is unpredictable; if you want to plan for it, you need huge margins, which aren’t always available.
  • Reacting to storage requirements dynamically is more effective than predicting them.

When we considered Rudder, we knew our training episodes would become longer, but not all of them. Zesty Disk gave us the ability to dynamically shrink storage on nodes with a lower load so we could cut costs and instead use those savings for storage on other nodes with higher loads.