What is Amazon S3?
Amazon Simple Storage Service – or S3 for short – is one of the most widely used cloud-based data storage solutions today. It’s a low-latency, scalable, highly available, secure storage web service.
S3 was an early product from Amazon Web Services (AWS) and has been commercially available since 2006. Many companies use S3 as the backbone of their data storage architecture; several AWS services also use it for their back-end data.
This article is a brief introduction to S3 and is part of a blog series about S3. Here, you will learn how it works, why it is so popular, some typical use cases for it, how applications can access, store, and retrieve data from it, and how it secures data.
How Does S3 Work?
S3 is different from file or block storage platforms. It’s an object storage service.
Object storage stores data in distinct units called objects. The objects are spread across multiple distributed storage devices but organized in a self-contained, logical repository. Each object has a unique identifier that allows it to be easily found. This unique identification lets cloud-based object storage like S3 store a virtually unlimited number of files and still find them quickly.
In S3, the highest level of a logical grouping of objects is called a bucket. A bucket is like a container where data files are saved. The image below shows some examples of S3 buckets:
Each bucket name must be unique across the whole S3 namespace, not just within your own account. If you try to create a bucket with a name that has already been used, S3 will prevent that operation:
Buckets can be created in any supported AWS region. To ensure minimum latency and to keep data transfer costs down, you should create buckets in the same region as your other AWS resources.
Within buckets, you can create logical placeholders for objects, called folders. For example, a data processing application can use three folders in an S3 bucket – one for input files, one for output files, and another for storing output logs:
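S3 folders are not the same as file system folders. S3 has a flat structure: a folder is simply a prefix shared by object keys, and when you create one in the console, S3 stores it as an empty object whose key ends in a slash.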
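To make the idea of folders as key prefixes more concrete, here is a minimal in-memory sketch in Python. It does not call S3 at all; the function name `list_folder` and the sample keys are made up for illustration, and the grouping only approximates what a listing with a prefix and a delimiter returns.

```python
def list_folder(keys, prefix="", delimiter="/"):
    """Split keys under `prefix` into direct objects and sub-"folders".

    An in-memory approximation of a prefix/delimiter listing; not an S3 call.
    """
    objects, folders = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter acts as a "folder".
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        elif rest:
            # A key with nothing left after the prefix is the folder
            # placeholder itself; it is skipped here for brevity.
            objects.append(key)
    return objects, sorted(folders)


# Hypothetical keys, laid out like the three-folder example above.
keys = [
    "input/",                 # empty placeholder object for the folder
    "input/orders.csv",
    "output/summary.csv",
    "logs/2020/run-1.log",
    "readme.txt",
]

top_level_objects, top_level_folders = list_folder(keys)
```

Nothing in the key list is nested: the "folders" appear only because the keys happen to share a prefix that ends in a slash.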
By default, an AWS account has a quota on the number of buckets it can create across all regions. Historically this was 100 buckets, extendable on request to a maximum of 1,000; AWS has since raised these quotas, so check the current service quotas for your account. However, there is no limit on the number of folders you can create or the number of objects you can store.
The maximum size of a file you can upload to S3 in a single write operation is 5 GB. Above that, you have to use multipart upload, where the file is broken into a number of chunks (parts) that can be uploaded in parallel. The largest file S3 can store is 5 TB.
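The chunking behind a multipart upload can be sketched as plain arithmetic. The example below only plans the byte ranges; it uploads nothing. The function names and the 64 MiB default part size are assumptions for illustration, and the limits of 10,000 parts and a 5 MiB minimum part size are taken from the S3 documentation rather than from this article.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
SINGLE_PUT_LIMIT = 5 * GIB   # largest single-request upload, per the text
MAX_PARTS = 10_000           # documented cap on parts per multipart upload
MIN_PART_SIZE = 5 * MIB      # documented minimum for every part but the last


def needs_multipart(total_size):
    """True when a file is too large for a single write operation."""
    return total_size > SINGLE_PUT_LIMIT


def plan_parts(total_size, part_size=64 * MIB):
    """Return (part_number, offset, length) tuples covering total_size bytes."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    # Grow the part size until the file fits within MAX_PARTS parts.
    while -(-total_size // part_size) > MAX_PARTS:
        part_size *= 2
    parts, offset = [], 0
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((len(parts) + 1, offset, length))
        offset += length
    return parts
```

Each tuple describes an independent byte range, which is why the parts can be sent in parallel and reassembled by part number.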
In S3, each object is addressed by its bucket name and key, and it also has an Amazon Resource Name (ARN) that uniquely identifies it across AWS. It looks something like this:
arn:aws:s3:::sample-data-process/input/
Here, we are seeing the ARN for a folder called “input” within a bucket called “sample-data-process”.
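S3 ARNs follow a fixed pattern, so they are easy to build and take apart. The helpers below are a small sketch: the function names are made up, and the code assumes the standard "aws" partition (other partitions use a different second field).

```python
S3_ARN_PREFIX = "arn:aws:s3:::"   # assumes the standard "aws" partition


def s3_arn(bucket, key=""):
    """Build the ARN for a bucket, or for an object key inside it."""
    if key:
        return f"{S3_ARN_PREFIX}{bucket}/{key}"
    return f"{S3_ARN_PREFIX}{bucket}"


def parse_s3_arn(arn):
    """Split an S3 ARN into (bucket, key); key is "" for a bucket ARN."""
    if not arn.startswith(S3_ARN_PREFIX):
        raise ValueError(f"not an S3 ARN: {arn!r}")
    bucket, _, key = arn[len(S3_ARN_PREFIX):].partition("/")
    return bucket, key


# The "input" folder in the "sample-data-process" bucket from the text.
folder_arn = s3_arn("sample-data-process", "input/")
```

Notice that an S3 ARN carries no region or account ID, which fits with bucket names being unique across the whole S3 namespace.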
Objects in object stores also have metadata attached to them. This metadata can include information like security settings, creator name, etc. The image below shows some of the metadata for an S3 file:
Each object in a bucket can have a storage class. We will cover S3 storage classes in a separate article, but for now, just know that the storage class dictates how an object is stored in S3 and how that storage is priced. You can change an object’s storage class at any time:
Why is S3 so popular?
There are a few reasons why S3 is so popular.
The first reason is its unlimited scalability. Because S3 is a managed object storage service, Amazon ensures its backend is always available and growing with demand. When storing data in S3, users don’t need to worry about provisioning disk volumes, formatting, partitioning, or creating disk arrays.
The second reason is its low cost. Any individual or organization can sign up for an AWS account and start using S3 immediately. The price of storing data in the standard storage class starts at roughly $0.023 to $0.025 per gigabyte per month for the first 50 terabytes, depending on the region, and goes down with increasing volume.
The third reason is S3’s object durability. According to Amazon, S3 is designed for 99.999999999% durability of objects over a given year. This figure is often called eleven nines of durability. It means that, on average, you can expect to lose 0.000000001% of your stored objects in a year. Amazon also provides protection against the failure of an entire availability zone (AZ). The standard storage class is designed for 99.99% availability, with somewhat lower design targets for some other storage classes, which means the service is pretty much always available.
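To put eleven nines into perspective, the arithmetic can be worked through directly. This is a back-of-the-envelope sketch that treats the durability figure as an average annual loss rate per object; the function names are made up, and `Decimal` is used only to avoid floating-point noise at this scale.

```python
from decimal import Decimal

DURABILITY = Decimal("0.99999999999")   # eleven nines, as a fraction


def expected_annual_loss(object_count):
    """Average number of objects expected to be lost per year."""
    return (1 - DURABILITY) * object_count


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_loss(object_count)
```

For example, with 10,000,000 objects stored, `expected_annual_loss` gives 0.0001 objects per year, or about one lost object every 10,000 years on average.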
Objects in S3 can be further protected with versioning. Versioning is enabled at the bucket level. Once enabled, S3 will transparently store every version of each object as it is written, overwritten, or deleted. If a file is accidentally or maliciously deleted or changed, it can be easily recovered from a previous version.
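With the Python SDK (boto3), turning on versioning and finding an earlier version might look like the sketch below. The helper names are made up; `s3_client` is assumed to be a client created with `boto3.client("s3")`, and the code ignores pagination, so it only considers the first page of versions returned.

```python
def enable_versioning(s3_client, bucket):
    """Turn on versioning for a bucket."""
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )


def previous_version_id(s3_client, bucket, key):
    """Return the VersionId of the newest non-current version, or None."""
    response = s3_client.list_object_versions(Bucket=bucket, Prefix=key)
    older = [
        version
        for version in response.get("Versions", [])
        if version["Key"] == key and not version["IsLatest"]
    ]
    # Pick the most recently modified of the older versions.
    older.sort(key=lambda version: version["LastModified"], reverse=True)
    return older[0]["VersionId"] if older else None
```

The returned version ID can then be passed as `VersionId` to `get_object` to read the earlier copy.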
What are Some S3 Use-Cases?
Organizations use S3 for many use cases. Here are some typical examples:
- File storage and backup: Enterprises can store non-relational data files in S3 for operational access and long-term archiving. With services like AWS Storage Gateway, S3 can be transparently made part of an on-premises file server or network share.
- Media library: Multimedia files (e.g., video, audio, images, music, 3D animations, large electronic drawings) can be kept in S3, which offers cost-effective, scalable, low-latency storage.
- Data lake: Enterprises can store business-critical information from many sources in a data lake for current and future analysis, processing, and insight. At the heart of any data lake is scalable data storage. Most data lakes built on AWS use S3 for this.
- Software-as-a-Service (SaaS): Many SaaS-based solutions allow customers to save their application data in the cloud. Behind the scenes, this data may be saved in the vendor’s own S3 buckets. S3 is also often used by ISVs as a repository for applications, libraries, and patch updates.
- Static websites: S3 can serve static websites (or websites with client-side scripting). This saves the cost and effort of running a web server.
How Do Users and Applications Access S3?
Like everything else in AWS, S3 is a web service. This means it can be accessed through a defined set of REST APIs. The S3 REST APIs allow applications and users to perform operations such as creating buckets and uploading, listing, downloading, and deleting objects.
The Amazon S3 documentation provides a full list of these actions.
Instead of calling the APIs directly, applications can use different types of “wrappers”:
- AWS Command Line Interface (CLI) is available for Linux, macOS, and Windows, and can be run from shells such as Bash or PowerShell
- SDKs are available for most popular programming languages like Java, Python, C#, etc.
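As an illustration of the SDK route, here is a minimal sketch using the Python SDK (boto3). The helper names are made up, and `s3_client` is assumed to be a client created with `boto3.client("s3")` using credentials that can access the bucket. Passing the client in as a parameter keeps the helpers easy to test.

```python
def save_text(s3_client, bucket, key, text):
    """Upload a string as an object."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=text.encode("utf-8"))


def load_text(s3_client, bucket, key):
    """Download an object and decode it as UTF-8 text."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read().decode("utf-8")


def list_keys(s3_client, bucket, prefix=""):
    """List object keys under a prefix (first page of results only)."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [item["Key"] for item in response.get("Contents", [])]
```

A single `list_objects_v2` call returns at most 1,000 keys, so a real application would page through the results, for example with a boto3 paginator.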
There are also two ways to access S3 without writing code:
- AWS S3 console: This is the web-based user interface for S3.
- Custom applications: There are many third-party applications that allow GUI-based access to S3. For example, S3 Browser is a popular Windows client for S3.
Which AWS services use S3?
Several AWS services use or depend on S3. Here are a few:
- Amazon Athena can interactively run ANSI SQL queries on structured or semi-structured data stored in S3 buckets.
- AWS CloudTrail saves its audit log files in an S3 bucket. These logs are meant to be an unaltered record, so write access to them should be tightly restricted.
- Amazon Elastic Block Store (EBS) volumes’ snapshots are saved in S3.
- Amazon RDS database instances’ snapshots are saved in S3.
- Amazon Redshift Spectrum maps database tables to data files stored in S3. Users can run SQL queries on those tables.
- Amazon CloudWatch Logs can be exported to S3 for offline viewing and analysis.
- Amazon Kinesis Streams and Firehose can capture streaming data to S3.
- Amazon Elastic MapReduce (EMR) clusters can map S3 storage as the underlying file system, known as EMR File System (EMRFS).
- AWS Lambda functions can be invoked in response to an S3 event.
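As an example of the last item, a Lambda function triggered by S3 receives an event describing the affected objects. The sketch below pulls out the bucket names and keys. The function names are illustrative, and the only assumption about the event is the standard S3 notification layout (a `Records` list with `s3.bucket.name` and `s3.object.key`), in which keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def objects_from_event(event):
    """Return (bucket, key) pairs for every S3 record in the event."""
    return [
        (
            record["s3"]["bucket"]["name"],
            # Keys are URL-encoded in the event, e.g. spaces become "+".
            unquote_plus(record["s3"]["object"]["key"]),
        )
        for record in event.get("Records", [])
        if "s3" in record
    ]


def handler(event, context):
    """Entry point Lambda would call; here it just reports what arrived."""
    found = objects_from_event(event)
    for bucket, key in found:
        print(f"object event for s3://{bucket}/{key}")
    return {"processed": len(found)}
```

A real function would replace the `print` call with its own processing, such as reading the new file from the input folder and writing a result to the output folder.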
Is S3 secure?
Amazon S3 offers several layers of data security.
The first layer of security is something called a bucket policy. A bucket policy controls which AWS accounts, IAM users, roles, or AWS services have access to a bucket, and what level of access they have. For example, if a bucket policy denies operations such as PUT or DELETE from an account, user, or role, those entities won’t be able to perform these operations.
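A bucket policy is a JSON document. The sketch below builds one that denies uploads and deletes for a single principal. The function name, the statement ID, and the example user ARN (with the placeholder account ID 111122223333) are made up for illustration; the bucket name is the one used earlier in this article.

```python
import json


def deny_write_policy(bucket, principal_arn):
    """Build a bucket policy that denies PUT and DELETE for one principal."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyPutAndDelete",          # illustrative label
                "Effect": "Deny",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:PutObject", "s3:DeleteObject"],
                # "/*" targets the objects inside the bucket.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


policy_json = json.dumps(
    deny_write_policy(
        "sample-data-process",
        "arn:aws:iam::111122223333:user/example-user",  # hypothetical user
    ),
    indent=2,
)
```

With boto3, the resulting string could be attached with `put_bucket_policy(Bucket=..., Policy=policy_json)`. An explicit Deny like this one overrides any Allow granted elsewhere.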
Similar to bucket policies are IAM policies. An IAM policy controls the access level of an IAM user, group, or role within an AWS account. Custom IAM policies can be created with restricted access to S3.
Another S3 security feature is the ability to block anonymous access from the Internet. Unless a bucket hosts content that should be publicly accessible – like website files – all public access to a bucket and its objects should be blocked.
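Blocking public access can also be done programmatically. In this boto3 sketch, the helper and constant names are made up, and `s3_client` is assumed to be a client created with `boto3.client("s3")`; the four settings are the ones S3 exposes for its public access block feature.

```python
BLOCK_ALL_PUBLIC_ACCESS = {
    "BlockPublicAcls": True,        # reject new public ACLs
    "IgnorePublicAcls": True,       # ignore public ACLs already in place
    "BlockPublicPolicy": True,      # reject bucket policies that grant public access
    "RestrictPublicBuckets": True,  # limit access to buckets with public policies
}


def block_public_access(s3_client, bucket):
    """Switch on all four public access block settings for a bucket."""
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=BLOCK_ALL_PUBLIC_ACCESS,
    )
```

A bucket that serves a public website would need some of these settings relaxed, so this is a starting point for private buckets only.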
Objects stored in S3 can also be protected with encryption keys. There are two ways to encrypt data in S3.
In the first method, a customer-managed key (CMK) is generated in the AWS Key Management Service (KMS). This key is then used to encrypt S3 files. In this case, KMS is responsible for managing the key.
In the second method, an S3-generated encryption key is used to encrypt files. In this case, S3 manages the key, and the key is further encrypted with a master key that’s rotated regularly.
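When uploading through the API, the choice between the two methods comes down to a couple of request parameters. The helper below is a sketch with a made-up name; it only builds the arguments, which could then be passed to boto3 as `s3_client.put_object(**args)`. The key ID in the usage note is a placeholder.

```python
def encrypted_put_args(bucket, key, body, kms_key_id=None):
    """Build put_object arguments for server-side encryption.

    With kms_key_id, the object is encrypted under that KMS key;
    without it, S3 uses a key that it generates and manages itself.
    """
    args = {"Bucket": bucket, "Key": key, "Body": body}
    if kms_key_id:
        args["ServerSideEncryption"] = "aws:kms"
        args["SSEKMSKeyId"] = kms_key_id
    else:
        args["ServerSideEncryption"] = "AES256"
    return args
```

For instance, `encrypted_put_args("sample-data-process", "input/a.csv", b"data", kms_key_id="example-key-id")` selects the first method, while leaving out `kms_key_id` selects the second.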
Finally, enabling server access logging for a bucket ensures that access requests to the bucket are logged.
Similarly, enabling object-level logging ensures all object-level API activities are logged.
Hopefully, by now you have a good understanding of Amazon S3. As you saw, S3 is a truly cloud-native storage platform with many features. However, despite its obvious advantages, it may not be suitable for some applications that need fast access to data and high I/O performance. A separate type of storage – Elastic Block Store, or EBS – can be used in these cases.
For the savvy user though, the question remains – how to make the best use of S3, and more importantly, how much does it cost?
This is what you will see in the next article where we will talk about S3 storage types and costs.