The AWS Data Ecosystem Grand Tour - Graph Databases
Written by Alex Rasmussen on January 17, 2020
This article is part of a series. Here are the rest of the articles in that series:
- Introduction
- Where Your AWS Data Lives
- Block Storage
- Object Storage
- Relational Databases
- Data Warehouses
- Data Lakes
- Key/Value and "NoSQL" Stores
- Graph Databases
- Time-Series Databases
- Ledger Databases
- SQL on S3 and Federated Queries
- Search
- Streaming Data
- File Systems
- Data Ingestion
- ETL
- Processing
- Data Interfaces
- Training Data for Machine Learning
- Data Security
- Business Intelligence
Relational databases struggle when querying highly interconnected data. Of course, databases can handle structured relationships between entities; in fact, that's a big part of what a relational database does. However, the structure of a database is pretty entity-centric. If you want to add a new relationship between entities, you have to add a foreign key relationship to one of the entities or make a join table, each of which involves a potentially painful data migration. Queries that operate on complex relationships can be hard to express and execute efficiently in SQL due to all the joins involved in getting from one end of a complex relationship to the other.
A classic example of the kind of complexity we're talking about here is the "six degrees of Kevin Bacon" game. The premise of the game is that every actor is linked to prolific actor Kevin Bacon by at most six degrees of separation via co-stars in other films. For example, Elvis Presley was in Change of Habit (1969) with Ed Asner, who was in JFK (1991) with Kevin Bacon, so Ed Asner's Bacon number is 1, and Elvis's Bacon number is 2. The objective of the game is to determine a given actor's "Bacon number", the minimum number of degrees of separation between that actor and Kevin Bacon.
Suppose you had three tables in a relational database: actors
, movies
, and appears_in
. actors
and movies
are simple tables of entities, and appears_in
is the table that joins the two entity tables together in a many-to-many relationship.
If you wanted to find all actors with a Bacon number of 2 with SQL, it would look something like this:
SELECT A1.name, M1.name, A2.name, M2.name, A3.name
FROM
actors A1, actors A2, actors A3,
movies M1, movies M2,
appears_in P1, appears_in P2, appears_in P3, appears_in P4
WHERE
-- Actor A1 appeared in movie M1
A1.id == P1.actor_id AND M1.id == P1.movie_id AND
-- ... and so did actor A2
A2.id == P2.actor_id AND M1.id == P2.movie_id AND
-- Actor A2 appeared in movie M2
A2.id == P3.actor_id AND M2.id == P3.movie_id AND
-- ... and so did actor A3
A3.id == P4.actor_id AND M2.id == P4.movie_id AND
-- Actor A3 is Kevin Bacon
A3.name == 'Kevin Bacon' AND
-- All three actors are different people
A1.id != A2.id AND A2.id != A3.id AND A3.id != A1.id AND
-- The two movies they appeared in are different
M1.id != M2.id;
It's worth noting that this wouldn't tell you if the resulting actors had a Bacon number lower than 2; in order to do that, you'd have to do something exceptionally clever.
You could admittedly make this query less verbose with things like CTEs, but it's still a gnarly query. There's a lot of verbosity and many joins across that join table. Imagine with computing actors with a Bacon number of 6 would look like.
Graph databases are designed for efficiently answering complex queries like these. In a graph database, everything is either a vertex or an edge that connects two vertices together, and users express queries as structural queries about the properties of the graph. Complex relationship-centric queries are much easier to express in this form than they typically are in a relational model. For instance, finding actors with a Bacon number of 2 is simply finding all paths in the graph that look like Actor(name=Kevin Bacon) -> Movie -> Actor -> Movie -> Actor
. You could even efficiently compute the shortest path between the Kevin Bacon actor vertex and all other actor vertices to compute the Bacon number for all actors at once if you really felt like over-achieving.
Graph databases have been around for a long time, and there are a few competing systems with their own models and query languages for working with graphs. Rather than bundle one of the existing open source systems like it did with ElastiCache or RDS MySQL and RDS PostgreSQL, AWS opted to write their own graph database called Amazon Neptune. Neptune supports two popular graph models: property graphs and the Resource Description Framework (RDF). Some structures are easier to express in one model than they are in the other, and both can be used interchangeably on top of the same Neptune graph. This first-class support for multiple models is something of a rarity in graph databases; those few databases that support multiple models tend to have a strong preference as to which model you should use. Notably, Neptune doesn't currently support the Cypher query language used by Neo4j, one of the oldest and most popular open-source graph databases. It's possible that there's either some legal or technological hurdle that they have yet to clear, or that there isn't sufficient customer demand, but either way Cypher support is conspicuous in its absence.
In addition to the base functionality that you'd expect from a graph database - being able to read and write vertices and edges and perform queries on graphs - Neptune exposes updates to graphs as a stream, allowing for change data capture in much the same way that DynamoDB does. Neptune has a built-in integration with Elasticsearch (which we'll talk about more when we talk about search engines) that uses this change data capture stream to propagate updates between the two systems.
Neptune is built on top of AWS's storage platform and provides the same kind of high availability, durability, and encryption features that we've seen in Aurora, DynamoDB, and EBS among others. Neptune also provides snapshots and point-in-time restores in the same way that other systems based on this storage platform do.
Neptune's pricing looks a lot like other similar services. It runs on a cluster of instances, and running on more powerful instances will improve your performance but will also cost more. You pay per GB-month and per million I/Os for storage, and standard data transfer rates also apply, although multi-AZ data transfer for replication is free.
Next: Traveling Through Time
In this article, we've taken a look at graph databases, a data system purpose-built for a specific kind of queries. Next time, we'll look at another kind of purpose-built data system, the time-series database.
If you'd like to get notified when new articles in this series get written, please subscribe to the newsletter by entering your e-mail address in the form below. You can also subscribe to the blog's RSS feed. If you'd like to talk more about any of the topics covered in this series, please contact me.