The AWS Data Ecosystem Grand Tour - SQL on S3 and Federated Queries
Written by Alex Rasmussen on January 27, 2020
This article is part of a series. Here are the rest of the articles in that series:
- Introduction
- Where Your AWS Data Lives
- Block Storage
- Object Storage
- Relational Databases
- Data Warehouses
- Data Lakes
- Key/Value and "NoSQL" Stores
- Graph Databases
- Time-Series Databases
- Ledger Databases
- SQL on S3 and Federated Queries
- Search
- Streaming Data
- File Systems
- Data Ingestion
- ETL
- Processing
- Data Interfaces
- Training Data for Machine Learning
- Data Security
- Business Intelligence
Relational, graph, and document databases all have complete control over how the data they manage is stored. This control makes it easier for the database to construct efficient queries and store data compactly, but it also comes with a cost. If you want to leverage the database's capabilities, you have to load data into it. This can be a complicated and expensive proposition, especially when you've got a lot of data under management.
If you've got a lot of data in AWS, it's cheap and easy to store that data in S3. That's one of the things that makes data lakes in AWS so attractive. Wouldn't it be great if we could just query the data in our data lake directly, rather than having to load it into a database first? Turns out, AWS offers several different solutions for doing just that.
SQL on S3
S3 Select and Glacier Select allow you to use SQL to query data within individual files (well, objects and archives, but you get the idea) on S3 and Glacier, respectively. This allows you to read only the subset of a large file that you need, which can save on query costs and improve performance. Both S3 Select and Glacier Select require that a file has enough structure to be queried. CSV, JSON and Parquet files are currently supported, and each file type requires different amounts of context when you query them to extract the requisite structure. Parquet files have an embedded schema, so it's already got all the structure the system needs. JSON is a little more tricky; a file can contain one JSON object or many, so you need to specify which kind of document you're dealing with. CSV is by far the most complicated, since the query has to know how to find header names, how to handle quotes, and how fields are delimited.
Amazon Athena takes the ideas from S3 Select and Glacier Select a lot further. Athena is based on Apache Presto, a project started by Facebook and open-sourced in 2013. Presto was originally designed for interactively querying petabyte-scale data stored on HDFS. At its core, Presto looks a lot like the query planning part of a MPP database; a coordinator translates SQL queries into a logical query plan, optimizes that plan for execution across a collection of workers, and then distributes the job among those workers. As workers complete their portion of the query, they stream records back to the coordinator. This means that queries across large data volumes can be fast, provided your data is laid out and partitioned appropriately.
Where S3 Select only works on one file at a time, Athena can query large collections of objects in S3 at once. Athena also needs to be given information about where and how data is stored in order to know how to query it, but Athena leverages the Glue Data Catalog for this; you can have Glue's crawlers find your datasets or declare your data's location and schema yourself with a CREATE TABLE
statement.
Athena charges per TB of data scanned, which means you're incentivized to store your data partitioned, compressed, and in a columnar format like Parquet, since Athena's query planner is smart enough to take advantage of that. Conveniently, if you're managing your data in Lake Formation, your data is probably already stored this way.
Query Federation
Both Athena and Redshift are previewing support for query federation, which allows you to query data across multiple data stores at once. This means you can query data in an RDS PostgreSQL database or data in S3 from within Redshift or Athena, joining the two data sources together in a single query. Redshift can already reference S3 data with Redshift Spectrum, so its federated querying capability just extends that functionality to support other sources. Of course, this will likely impact the performance of your queries, but it'll almost certainly be faster than whatever you've put together to do this yourself.
Next: Search
In this article, we took a look at SQL queries on S3 and on federated data sources. Next time, we'll take a look at how to efficiently look for how to efficiently execute complex lookups in large datasets with search engines.
If you'd like to get notified when new articles in this series get written, please subscribe to the newsletter by entering your e-mail address in the form below. You can also subscribe to the blog's RSS feed. If you'd like to talk more about any of the topics covered in this series, please contact me.