The AWS Data Ecosystem Grand Tour - File Systems
Written by Alex Rasmussen on February 6, 2020
This article is part of a series. Here are the rest of the articles in that series:
- Introduction
- Where Your AWS Data Lives
- Block Storage
- Object Storage
- Relational Databases
- Data Warehouses
- Data Lakes
- Key/Value and "NoSQL" Stores
- Graph Databases
- Time-Series Databases
- Ledger Databases
- SQL on S3 and Federated Queries
- Search
- Streaming Data
- File Systems
- Data Ingestion
- ETL
- Processing
- Data Interfaces
- Training Data for Machine Learning
- Data Security
- Business Intelligence
In this series, we've talked a lot about different ways to store data, ranging from the familiar relational database to the not-so-familiar ledger database. While a lot of our data gets stored in various database systems, the storage system most people know best is the file system, which stores files in a hierarchy of directories. File systems are so familiar and ubiquitous that I don't even have to motivate them; they sit at the bottom of the storage pyramid on virtually every computer you or I have ever used.
We've already talked about file systems way back when we talked about block storage with EBS. EBS only deals with file systems on a single EC2 instance, though. What happens if you need to share a file system across multiple instances at once? A lot of systems, especially scientific computing systems, assume some form of network-accessible shared file system that the entire cluster can access. You could try to use S3 as a shared file system, but S3 is an object store rather than a file system, so it's not really suited for that use case. You could also run a shared file system yourself on an EC2 instance, but that would be a pain to administer. The good news is that there are a number of managed shared file systems to choose from in AWS.
Shared File Systems for Servers
Amazon Elastic File System (EFS) provides a fully managed version of NFS, one of the oldest and most ubiquitous network file systems. EFS volumes are provisioned inside a VPC, a virtual network within AWS. Once the volume is provisioned, EC2 instances in that VPC can mount it and share access to its files. Like S3, EFS has an infrequent access storage tier called EFS IA, and you can define a lifecycle management policy to automatically move infrequently accessed files to that tier. One catch is that EFS IA charges you extra for I/O, but if you have a handful of frequently accessed files and a ton of infrequently accessed ones, moving files to EFS IA may be a good cost-saving move.
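To make the EFS IA trade-off a little more concrete, here's a rough sketch of the arithmetic. It assumes the extra I/O charge can be modeled as a flat price per GB read, and every price in it is a made-up placeholder rather than a real AWS price; the function and parameter names are mine, not anything from an AWS API.

```python
def ia_monthly_savings(gb_stored: float,
                       gb_read_per_month: float,
                       standard_price_per_gb: float,
                       ia_price_per_gb: float,
                       ia_access_price_per_gb: float) -> float:
    """Estimate monthly savings from moving data to an infrequent access tier.

    A positive result means the cheaper storage outweighs the access charges;
    a negative result means the data is read too often for the move to pay off.
    All prices are caller-supplied assumptions.
    """
    storage_savings = gb_stored * (standard_price_per_gb - ia_price_per_gb)
    access_cost = gb_read_per_month * ia_access_price_per_gb
    return storage_savings - access_cost


# Placeholder prices for illustration only (NOT real AWS prices).
STANDARD = 0.5     # $ per GB-month, hypothetical
IA = 0.25          # $ per GB-month, hypothetical
IA_ACCESS = 0.125  # $ per GB read, hypothetical

if __name__ == "__main__":
    for reads in (0, 200, 400):
        savings = ia_monthly_savings(100, reads, STANDARD, IA, IA_ACCESS)
        print(f"{reads} GB read/month -> savings ${savings:.2f}")
```

With these placeholder numbers, 100 GB of rarely read data saves money, but the savings disappear once you're reading about twice the stored amount every month. Plug in current prices from the EFS pricing page before drawing any real conclusions.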
EFS has two throughput modes: bursting and provisioned. Bursting throughput mode gives you 50 KB/s of throughput per GB of storage, or roughly 50 MB/s per TB of storage. Provisioned throughput allows you to pay for throughput independently of the size of the volume, but that additional bandwidth is pricey.
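The bursting-mode arithmetic is simple enough to sketch in a few lines. This is a back-of-the-envelope helper, not an AWS API: the function names are hypothetical, it uses decimal units (1 TB = 1000 GB) to match the rounding above, and it only models the baseline rate described here, not burst credits.

```python
def bursting_baseline_mb_per_s(stored_gb: float,
                               kb_per_s_per_gb: float = 50.0) -> float:
    """Baseline throughput (MB/s) for a bursting-mode volume of the given size."""
    if stored_gb < 0:
        raise ValueError("stored_gb must be non-negative")
    return stored_gb * kb_per_s_per_gb / 1000.0


def storage_needed_gb(target_mb_per_s: float,
                      kb_per_s_per_gb: float = 50.0) -> float:
    """Storage (GB) you'd need to hold to reach a target baseline throughput."""
    if target_mb_per_s < 0:
        raise ValueError("target_mb_per_s must be non-negative")
    return target_mb_per_s * 1000.0 / kb_per_s_per_gb


if __name__ == "__main__":
    print(bursting_baseline_mb_per_s(1000))  # a 1 TB volume
    print(storage_needed_gb(100))            # storage needed for 100 MB/s
```

The second function shows why provisioned throughput exists: if you need 100 MB/s of baseline throughput in bursting mode, you'd have to store about 2 TB of data to get it, whether or not you actually have that much data.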
Amazon FSx provides fully managed versions of Windows File Server and Lustre. Both provide network file systems that are generally more capable than NFS, but each targets a different clientele.
Windows File Server is a popular network file system in Windows-heavy organizations. FSx's Windows File Server variant provides a fully managed file server running on a Windows Server VM with things like replication, failover, backups, and monitoring configured for you.
Lustre is a popular alternative to NFS in scientific computing environments or organizations with a lot of Linux servers. It began life as a research project in the late 1990s, and was roughly contemporaneous with Google's GFS. Like GFS, Lustre decouples the file system into a metadata server that contains the file system's structure and a number of storage servers that store data blocks. This decoupling makes it particularly amenable to integration with S3, and FSx for Lustre takes full advantage of this, even allowing Lustre to provide a POSIX-compliant file system interface to S3 data. As with FSx for Windows File Server, FSx for Lustre provides managed backups, replication, failover, and monitoring.
FSx doesn't expose the servers running your file system to you directly; instead, you specify storage capacity and AWS takes care of the rest. Windows File Server allows you to specify throughput capacity (for which you're charged separately), while Lustre provides a fixed 200 MB/s of throughput per TB of storage.
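Because Lustre's throughput is tied to storage capacity, it's worth sketching what that ratio implies for reading a whole dataset. The helpers below are hypothetical names of my own, they use decimal units (1 TB = 1,000,000 MB), and they assume you can actually drive the file system at its full aggregate rate, which ignores client and network limits.

```python
LUSTRE_MB_PER_S_PER_TB = 200.0  # the fixed ratio quoted above


def lustre_throughput_mb_per_s(storage_tb: float) -> float:
    """Aggregate throughput (MB/s) for a file system of the given capacity."""
    if storage_tb <= 0:
        raise ValueError("storage_tb must be positive")
    return storage_tb * LUSTRE_MB_PER_S_PER_TB


def full_scan_seconds(dataset_tb: float, storage_tb: float) -> float:
    """Idealized time to read dataset_tb once from a storage_tb file system."""
    if dataset_tb < 0:
        raise ValueError("dataset_tb must be non-negative")
    dataset_mb = dataset_tb * 1_000_000.0
    return dataset_mb / lustre_throughput_mb_per_s(storage_tb)


if __name__ == "__main__":
    # A dataset that fills the file system takes the same idealized time
    # to scan no matter how big it is, since throughput grows with capacity.
    for size_tb in (1.0, 4.0):
        print(size_tb, "TB ->", full_scan_seconds(size_tb, size_tb), "seconds")
```

One consequence of the fixed ratio: if a scan is too slow for you, the only lever this model gives you is provisioning more storage than your data strictly needs.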
Shared File Systems for People
The network file systems we've looked at so far are designed for sharing data across a cluster of instances, but that's not the only place where having a file system hosted in the network can come in handy. People need to be able to collaborate on files with colleagues and sync files across devices, and network file systems (in the form of Dropbox, Box, Google Drive, Microsoft OneDrive, and others) have been filling that role for a long time.
Not to be outdone, AWS provides Amazon WorkDocs, which you can think of as roughly analogous to something like Dropbox or OneDrive but hosted in AWS. It's got built-in collaborative editing for some kinds of documents, versioning, search, sharing, and a lot of other features you'd expect from its brand of collaboration-focused network file system.
WorkDocs costs $5 per user per month. Each user gets 1 TB of storage by default, but you can pay for more. Since your data's being stored in S3 under the hood, you're charged for additional data by the GB-month at S3 prices.
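Here's a minimal sketch of that pricing model. The function name is hypothetical, the S3 price is left as a required parameter because it varies by region and changes over time, and the example value used for it is a placeholder, not a real price. I'm also assuming extra storage is simply totaled across all users.

```python
def workdocs_monthly_cost(users: int,
                          extra_storage_gb: float,
                          s3_price_per_gb_month: float,
                          per_user_fee: float = 5.0) -> float:
    """Estimate a monthly WorkDocs bill.

    users: number of WorkDocs users, each charged the flat per-user fee.
    extra_storage_gb: storage beyond the included per-user allotment,
        summed across all users (an assumption of this sketch).
    s3_price_per_gb_month: caller-supplied S3 price per GB-month.
    """
    if users < 0 or extra_storage_gb < 0:
        raise ValueError("users and extra_storage_gb must be non-negative")
    return users * per_user_fee + extra_storage_gb * s3_price_per_gb_month


if __name__ == "__main__":
    PLACEHOLDER_S3_PRICE = 0.5  # $ per GB-month, hypothetical, NOT a real price
    print(workdocs_monthly_cost(10, 0, PLACEHOLDER_S3_PRICE))
    print(workdocs_monthly_cost(10, 100, PLACEHOLDER_S3_PRICE))
```

For most teams the per-user fee will dominate, since each user's included 1 TB is a lot of documents.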
Next: Moving Your Data into AWS
In this article, we took a look at AWS's managed file system offerings. Shockingly, this is the last storage system we'll be covering in this series! Now that we've talked about all the different ways data can be stored and queried in AWS, we'll talk about some of the logistics around moving that data around and processing it.
Next time, we'll look at ways to get your organization's data into AWS, especially if you have a lot of it.