The Rise of the Distributed SQL Database

2018-02-23, by PJ Hagerty

The pace of technological advancement is fast. This is especially true when it comes to the technologies that underpin websites web applications. What was common practice 20, or even 10 years ago, can appear humorously outdated today.

For example, did you know that the Linux Kernel Mailing List Archive is hosted on a PC that lives in somebody's closet? Quite recently, the website went down because the PC had stopped working. And because the owner was on holiday, the kernel developers had to wait until he was home to fix things. No, really.

Over the past two decades, websites have gone from being hosted on a PC kept in a closet, to being run on physical or virtual machines in colocation data centers, to Platform as a Service (PaaS), to clusters of ephemeral containers running in distributed, horizontally scalable, shared-nothing configurations.

But when we say that, we're really only talking about application servers.

What about databases? For many people, their database is the one part of their application that is still stuck in the 90s.

And I, for one, think that this is going to change.

Distributed Applications

Before I talk about the future of databases, I want to take a quick look at how the state-of-the-art for application architecture has changed over the years.

A lot has changed, and I can't cover all of it, but main thing of interest has been the move away from a pet architecture to a cattle architecture.

But what on earth do I mean by that?

Here's a slide from a presentation given by Gavin McCance at CERN in 2012 that builds upon work by Microsoft’s Bill Baker:

So the basic idea here is as follows.

In the old model, your application would be running on a PC you keep in a closet, or on a dedicated machine or virtual machine at a colocation data center. You would install the operating system by hand, configure it, install the application dependencies by hand, get the application running, and then manually tend to the machine, applying software updates and fixing any issues as they arise.

These machines are like pets that you have to look after. And if you want a bigger pet, you have to fatten it up yourself. This is known as vertical scaling, or scaling up.

This used to be a big problem.

If a machine broke, you might find yourself driving to a neighboring city to fix the problem in person. And when you wanted to scale up, you had to buy and configure new machines. And when you wanted to scale down. Oops. You have a bunch of equipment you no longer need, i.e. a pet you no longer want. 😔

In the new model, you create machine images. You can then spin up or spin down as many instances of these as you want.

In many cases, you can spin them up on somebody else's infrastructure, meaning you no longer have to think too hard about capacity planning or the complexities of infrastructure amortization. Instances boot up and the application start immediately. If an application encounters problems, you can terminate the instance and start a new one.

Using the pets-cattle analogy, these machines are more like cattle.

If you want more meat, no problem. Just add more cows. This is known as horizontal scaling, or scaling out. And then when you want to scale in, you terminate however many machines you like. (It's best if we don't dwell on how this maps to the cattle analogy.)

But to make this work, you need to be able to spin up multiple instances of that machine image, forming a cluster of identical machines (your herd of cattle) and have them work in unison.

To realize this, we saw the adoption of the shared-nothing architecture. Which, in short, is an approach to application programming that completely eliminates local state. That is, each application instance is stateless, and all persistent data is written to an external database. This allows you to create or destroy as many of these application servers as you want, at will.

Heroku famously cemented this approach to application architecture with its development of the 12 factor application.

In fact, it is no coincidence that Heroku (one of the first PaaS companies) were the ones to develop this. You need to be able to run your applications this way if you truly want to be able to reap the most benefit from using containers or a PaaS.

What about the database though? If we're writing all our state to the database, how do we scale our database? And are we able to take full advantage of containers or a PaaS?

Well, not so much.

Distributed Databases, but With a Catch

Let's rule out managed databases for a second.

Of course, many PaaS providers do offer managed databases that you can scale up or down with the press of a button. And if these products work for you, that's great.

But unfortunately, they often come with vendor lock-in. With most of those database offerings, you can't download some software and get a local version of Amazon Redshift (for instance) running that functions on your personal computer. You can't switch to a different PaaS without migrating to a new database.

Rewind a decade or so, and for most people, your options for an open source database that would work with your application, that you could run on-premises, were MySQL or PostgreSQL.

But scaling these traditional relational database management systems (RDBMS) isn't easy because they were originally designed to run on a single node. And you can tell.

A typical approach is have a single machine running the master node. And then to scale up, you either upgrade the machine or buy a new one. Or you stick some read-only database replicas in front of it. Or, if you are feeling adventurous, and your databases supports it or you use a third-party tool to enable it, you might try to configure and maintain a multi-master setup. But that is not for the faint hearted.

So you have all of these applications being written using a shared nothing architecture that can be scaled horizontally. But they're all persisting state to an RDBM that is still decidedly stuck in the past.

There was a gap in the market. And that is where NoSQL came in.

Google created BigTable, and Amazon created DynamoDB. Two databases that were specifically designed to run across thousands of identical machines.

Then, crucially, both companies published research papers detailing how these databases work. And what happened next was an explosion in open source databases that mimicked their designs. Starting with Apache CouchDB in 2004 (also, in part, inspired by Lotus Notes) and then, in 2008, Riak and Apache Cassandra came along, followed by MongoDB, and many others.

Finally, developers were able to scale their applications and databases horizontally, making the most of both PaaS offerings and containers—two technologies which also started to gain immense popularity around this time.

But there was a catch.

Distributed databases cannot be always available and always consistent. When a node becomes separated from a part of the cluster (and this will happen) it must choose between refusing to answer requests or returning potentially stale data.

Most of the NoSQL databases decided to solve this problem by sacrificing strong consistency. But they also threw the baby out with the bathwater, and sacrificed the relational model and the standard language for querying databases, SQL, with it. Hence the name, NoSQL.

So now you have very available, very scalable databases, with neat features like advanced full-text search. But you can no longer use SQL, or the tools, reporting applications, and frameworks that depend on SQL. Instead, you, your application, and all of the engineers you hire, must use a database specific query paradigm. And some of these (like MapReduce) are tricky to wrap your head around.

But what if you could have the best of both worlds? SQL, features like full-text search, dynamic schemas, and so on, as well as horizontal scalability that is perfectly suited to PaaS environments and containers.

Enter the distributed SQL database.

Distributed SQL Databases

CrateDB is one of a few distributed SQL databases to pop up in recent years, and it offers the sorts of features that would typically tempt someone to use a NoSQL database, without sacrificing the SQL.

You really can have the best of both worlds.

The way it works is simple:

  1. You create a cluster by spinning up (or spinning down) as many CrateDB nodes as you wish. You can adjust the size of the cluster whenever you wish.
  2. The CrateDB nodes automatically work in unison to distribute your data (and copies of your data) across the cluster as efficiently as possible.
  3. You query CrateDB with ANSI SQL using any tool or library that supports the PostgreSQL wire protocol. You can also use the HTTP REST API.
  4. CrateDB then uses a proprietary (but 100% open source) distributed query engine to execute the query across the cluster in parallel.

Behind the scenes, CrateDB selects between a few different strategies for running queries across the cluster. MapReduce, for example, is one such strategy! But from your perspective, all you need to think about is SQL.

And because the query is being run in parallel across the cluster, you can add new nodes to linearly increase not only data capacity and redundancy (i.e. copies of your data kept on multiple machines) but also query performance.

So along with everything you might expect from an SQL database, such as tables, partitioning, replication, BLOB support, and so on, CrateDB also comes with features you might expect from a NoSQL database, such as: real-time full-text search, geospatial queries, time-series data, as well as support for dynamic schemas and arbitrary objects (which are fully queryable to any depth).

And, for application developers, a database that is familiar to use, runs easily on your personal computer, and which has been designed, from the ground up, to be horizontally scalable across on-premises infrastructure, private clouds, public clouds, and even across multiple availability zones or regions.

Wrap Up

This is really just scratching the surface of CrateDB. If you want to know more about how it is CrateDB able to deliver all of this and more, check out our resource library.

And if you're asking yourself what sorts of applications CrateDB works best with, check out our collection of use-cases.

For me, personally, finding out about CrateDB and learning how it works was truly mind expanding. And it's left me thinking about the future of SQL databases.

I look at what is common practice, and I look something like CrateDB. And like the story of the Linux Kernel Mailing List Archive at the start of this post, I have to wonder how quickly some of the things we're doing with SQL today (such as monolithic servers hosting the entire master database) will seem humorously outdated.

  • Follow us