Improving Storage Efficiency for Solr & Elasticsearch

By Omer Hamerman
Principal DevOps Engineer

How Do Search Engines Work?

A search engine like Solr or Elasticsearch is a document database that helps users find specific documents. It builds indexes per document field, and each field can have one or more indexes, each enabling a different type of search.

For example, an inverted index allows users to search for documents with specific words. The index works by creating a lookup table of all used words. Each word is linked to a list of the documents that contain it. When searching for a word, the engine doesn’t have to check each document for it; it simply looks up the word in the index and returns the associated documents.
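The lookup table described above (commonly called an inverted index) can be sketched in a few lines of Python. This is only a toy illustration: real engines such as Solr and Elasticsearch add analysis, compression, and scoring on top, and the function names and sample documents here are made up for the example.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, word):
    """Look the word up directly instead of scanning every document."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical sample documents.
docs = {
    1: "Red wool scarf",
    2: "Blue cotton shirt",
    3: "Red cotton scarf with stripes",
}
index = build_inverted_index(docs)
print(search(index, "scarf"))   # [1, 3]
print(search(index, "cotton"))  # [2, 3]
```

Note that every word of every document ends up as an entry in the table, which is why the index can take more space than the raw data.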

Unlike relational databases, which try to eliminate duplicate data to improve consistency and reduce the storage footprint, a search engine is optimized for data searchability and access. This means that a data set usually takes up much less space in a relational database than in a search engine, which has ramifications when it comes to storage costs.

Which Storage Factors Can Impact Search Performance and Costs?

When it comes to search engines, I’ve found that different storage factors influence the performance and costs of my searches.

Disk I/O

Using disks with low I/O specs has a negative impact on search performance. A slow disk can drastically prolong the response times of a search engine, especially when the workload includes huge indices or complex queries.

If you’re running an e-commerce site, you probably know that slow responses correlate with higher bounce rates, so saving on disk I/O can carry the indirect cost of losing customers. To keep your customers happy, you want to keep response times low. But the only way to do this while using slow disks is excessive caching in memory, which, in turn, raises costs again, often to the degree that your savings on I/O evaporate.

Storage Capacity

Another issue I encountered was insufficient storage capacity. Search engines work by building indexes for the data they ingest. One approach for this is to create a lookup table that uses each word in the data as a key and a list of documents including that word as a value. Most words are used in many documents, so these indexes become large quickly.

Insufficient storage capacity can limit the size of these indexes and, in turn, their performance. But more storage can raise your monthly bill.

Take the e-commerce example. Some of these sites have thousands of products, in dozens of categories, and each of them has descriptions that need to be searchable. Since descriptions for products of the same category have a high probability of sharing many words, each index entry for a word can get really big because it’s used in many places. Storage limits will impact the size of the index, only allowing a subset of words or products to be searchable.

Then there’s the question of how long the data needs to be retained and how far back the backups should go. If I choose retention times that are too short, I might save money but reindex too often, which hurts performance. If retention times are too long, I can use my indexes for a longer period, but, again, I’ll increase costs. The same is true for backups. More storage for backups can lower the risk if things go wrong; but while cheaper than live storage for an active search engine, storage for backups isn’t free.

Improving Storage Performance and Cost Efficiency

I used several different optimization methods to help me get the most out of my search engines. Make sure to integrate each method into a recurring process to reevaluate each requirement with up-to-date production data.

Mapping Document Fields

Mapping is the process of defining how a search engine should index each document field. Usually, this process is done automatically when I save a new document, but creating mappings manually is a good idea to improve performance and save costs.

In Elasticsearch, the default dynamic mapping generates two mappings for text fields: one for a full-text search index and one for a keyword (exact-match) search index.

Full-text search is good for fuzzy searches in continuous text because I might want to search for words that are similar to “scarf” and expect to find documents that may only include “scarves.”

Keyword mapping is less flexible, but if I have fields like clothing sizes, I know that we only sell five sizes, meaning I don’t need that fuzziness. Depending on the use case, having two indexes for each text field wastes storage and slows the system down.

I also disable the mapping of specific fields entirely, preventing the search engine from creating an index. This way, the search engine doesn’t need to index the entire data set, lowering the bill accordingly.

I always check what fields my documents have and make sure to choose the best mapping for each.
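As a rough sketch of what a manual mapping can look like, the Python snippet below builds the kind of request body Elasticsearch accepts when creating an index. The field names (description, size, warehouse_note) are hypothetical and only illustrate the three choices discussed above: full-text only, keyword only, and not indexed at all. Solr expresses comparable choices in its schema rather than in a JSON mapping.

```python
import json

# Hypothetical product index; all field names are illustrative only.
product_mapping = {
    "mappings": {
        "properties": {
            # Free text: one full-text index, no extra keyword sub-field.
            "description": {"type": "text"},
            # Small fixed set of values: exact-match keyword index only.
            "size": {"type": "keyword"},
            # Kept in the stored document, but no search index is built.
            "warehouse_note": {"type": "text", "index": False},
        }
    }
}


def indexed_fields(mapping):
    """Return the names of the fields that still get a search index."""
    props = mapping["mappings"]["properties"]
    return sorted(name for name, spec in props.items() if spec.get("index", True))


print(json.dumps(product_mapping, indent=2))
print(indexed_fields(product_mapping))  # ['description', 'size']
```

The helper at the end is just a convenience for reviewing which fields remain searchable; it is not part of any Elasticsearch API.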

Defining Data Retention, Replication, and Backup Policies

There are a few best practices I follow to optimize storage performance and cost efficiency.

First, I define data retention policies based on business requirements and consider implementing data lifecycle management strategies to optimize storage costs. Moving older data into slower storage can save money while keeping the data around if it’s needed in the future. But the cheapest storage is no storage at all, so I check which data is actually worth retaining. I want to keep fast indexes for popular products, but might want to save a bit of money on the more niche inventory.
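A retention policy like this can be written down as a small rule, which makes it easier to review with the business side. The sketch below is a simplified assumption, not a feature of Solr or Elasticsearch: the IndexStats type, the tier names, and the 30-day and 365-day thresholds are all hypothetical and would need to come from real requirements.

```python
from dataclasses import dataclass


@dataclass
class IndexStats:
    name: str
    days_since_last_query: int


def choose_tier(stats, slow_after_days=30, delete_after_days=365):
    """Pick a storage tier from how recently the index was queried."""
    if stats.days_since_last_query >= delete_after_days:
        return "delete"  # cheapest storage is no storage at all
    if stats.days_since_last_query >= slow_after_days:
        return "slow"    # keep the data, but on cheaper disks
    return "fast"        # popular data stays on fast disks


# Hypothetical indexes for a fashion store.
for stats in [
    IndexStats("winter-coats", 2),
    IndexStats("last-season-scarves", 90),
    IndexStats("discontinued-2019", 400),
]:
    print(stats.name, "->", choose_tier(stats))
```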

Replication is important for reliability and performance. After all, I choose a search engine because it lets my users search for data faster. Replicating the data close to users can lower latency, which, in the case of my e-commerce stores, has had a positive impact on revenue. Again, don’t go overboard here; subsecond responses might seem nice on paper but aren’t a requirement for all interactions on my website.

Finally, I assess my backup requirements. Replicas can reduce the risk of needing a backup, but can’t eliminate it completely. I make sure I have a backup to restore past states if my data is destroyed, but I also keep the number and age of backups reasonable.

Automating Recurring Storage Estimations

To get the most out of my search engine, I ensure my storage system can handle the required read and write operations efficiently. This means choosing the right storage type and provisioning it with the correct size. On the other hand, I don’t want to go overboard with my resources. Storage that isn’t used still costs money, so I want to provide as much as necessary, but no more. Usually, this requires a manual resource estimation process, but tools like Zesty Disk automate this chore.

Zesty Disk is a block storage auto-scaler that automatically expands and shrinks block storage. In fashion e-commerce stores, where products change each season, the indexes grow and shrink frequently, and with this, the storage requirements too. Zesty Disk will add volumes to my filesystem so I always have exactly what I need, plus a buffer for new data. And if I remove indexes or documents from the search engine, Zesty Disk will remove volumes and recoup expenses by removing capacity that’s no longer needed.

This behavior perfectly aligns with the need to re-estimate resource requirements regularly. I might not know how much storage a search engine will need in the future, but because Zesty Disk resizes disks in short time frames, I can stay as close as possible to the optimal amount of space.
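To make the re-estimation step concrete, here is the kind of arithmetic I would otherwise repeat by hand: used space plus a buffer, rounded up to a provisioning step, compared against what is currently provisioned. This is only an illustration of the manual calculation, not a description of how Zesty Disk works internally; the 20% buffer, the 10 GB step, and the function names are assumptions.

```python
import math


def target_capacity_gb(used_gb, buffer_ratio=0.2, step_gb=10):
    """Used space plus a buffer, rounded up to the next provisioning step."""
    needed = used_gb * (1 + buffer_ratio)
    return math.ceil(needed / step_gb) * step_gb


def resize_action(provisioned_gb, used_gb):
    """Compare provisioned capacity with the target and suggest a change."""
    target = target_capacity_gb(used_gb)
    if target > provisioned_gb:
        return ("grow", target - provisioned_gb)
    if target < provisioned_gb:
        return ("shrink", provisioned_gb - target)
    return ("keep", 0)


# 130 GB used needs 156 GB with the buffer, rounded up to 160 GB.
print(resize_action(100, 130))  # ('grow', 60)
print(resize_action(200, 130))  # ('shrink', 40)
print(resize_action(160, 130))  # ('keep', 0)
```

Running a check like this on a schedule with current production numbers is the manual equivalent of what an auto-scaler does continuously.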

Summary

Search engines like Solr and Elasticsearch let users query data in a flexible way, which is crucial in times of ever-growing mountains of information. But resource allocation becomes an issue that requires an operator to reevaluate resource requirements continuously. Each season, I had to check how our inventory changed to ensure the resources could handle it.

Automatic scaling solutions made my life much easier. They monitor the load my resources have to handle and decide how to scale them up and down without constant manual intervention by an operator. It’s even better if such a solution is capable of leveraging performance optimizations like burst capacity by provisioning small storage volumes. That way, I’m not only saving money but also ensuring performance is never lacking.