Move Data Between MySQL and CrateDB with Shinkansen

2017-03-28, by Naomi Slater

Shinkansen N700 with Mount Fuji

Shinkansen is a high-throughput framework for migrating data between MySQL instances and between MySQL and CrateDB.

With Shinkansen, you can migrate multiple tables, multiple shards of data, and data partitioned by a single column or a SQL expression. You can even incrementally migrate updated data following a complete migration.

You can also use Shinkansen as a system to keep a secondary database eventually consistent with your primary database. This might be useful as a backup, or as a place to run large queries without affecting your primary database.

In this post, we interview Justin Patrin, the creator of Shinkansen, to learn a little more about the software.

Why did you originally start work on Shinkansen?

At Skyhigh Networks, we were using MySQL for our largest data set, but it was becoming a performance and maintenance nightmare due to the volume of data and the number of individual MySQL servers needed.

Eventually, we decided to switch to CrateDB and had to find a way to migrate the data from MySQL to CrateDB.

To help us with the volume of data we had to migrate, we decided to make a migration service that would let us migrate the data incrementally. That way, we were not forced to perform the migration all at once—something that would have taken too long.

What were the technical challenges behind the project?

Initially, we thought we could quiesce updates to our source database while migrating the data into crate. This turned out not to be possible, however, so we needed a solution which could migrate data while it was being updated.

This necessitated the addition of the "delta" migration type, and eventually the "autodelta" and "complete" migration types. These migration types automatically handle requerying for changes and rerunning migrations until no new changes have been found since the beginning of the last migration.

After our original migrations, we realized it might also be useful to move data between CrateDB clusters so we added support for CrateDB as a migration source as well as a migration destination.

To round out the source and destination feature sets, we added MySQL as a supported destination. So, in the event you need to move between MySQL servers or from CrateDB back to MySQL, Shinkansen supports that too.

Are there any specific use cases that work especially well? Any use cases that aren't ideal?

While Shinkansen can work with data that is being updated regularly, the fewer updates the better in terms of efficiency.

If data is being updated rapidly, you may end up running delta migrations forever or until data is no longer being written. However, if you break your data up into multiple logical partitions, each of which you can quiesce individually, you can use the partitioning support in Shinkansen to move the data without having to worry about delta migrations or eventual consistency during the migration.

Shinkansen works best if your source and destination schema are as similar as possible. If there are major differences, you will run into issues. Shinkansen does have an `ignore_columns` feature, which allows you to handle some minor differences, but care must be taken.

Another thing to watch out for is unique keys. MySQL support multiple unique keys on a table but CrateDB supports only one unique key (the primary key).

What would you love to see come out of this project? What kinds of ways would you like to see people use it? Do you welcome contributions?

Shinkansen has evolved into a data migration framework as well as a service which supports horizontal scalability for larger deployments. It would be great to see it expanded to support a number of other datastores.

As with any open source project, the more people using it and contributing to it the better. I always welcome feature requests, pull requests, and bug reports.

The name of the project is great! How did you choose it?

Shinkansen is the name of the Japanese bullet train network that is responsible for moving a large number of people very quickly. Because my project is designed to move as much data as possible as quickly as possible, I thought it was a fitting name.

Can you talk about which features you feel are the most important or valuable?

Shinkansen was built to be resilient, efficient, and idempotent.

RabbitMQ is used as a task queue and tasks are not removed from the queue until they are finished. Redis is used as a backing store for state. As long as these two services are managed well, Shinkansen can be relied on to always finish any task given to it.

Shinkansen gracefully handles, for example:

  • Temporary network failures
  • Service restarts or crashes (both of Shinkansen and the source and destination data stores)
  • Timeouts due to database load
  • Temporary query failures (which can happen if nodes are bounced or indices are undergoing allocation)

For almost all data queries on the destination, Shinkansen uses INSERT… ON DUPLICATE UPDATE. This allows records to be inserted or updated if already present. It also means that records can be migrated multiple times without risk of duplication, making migrations idempotent.

Data is automatically split up into chunks for migration so that it doesn't overload the data stores with requests for large amounts of data and to allow for resumption of migration with a minimum amount of duplicated effort.

When using CrateDB as a data source, Shinkansen will dynamically choose one of a variety of data export strategies depending on the source schema and the nature of the data, so as to maximize performance.

In particular Shinkansen will:

  • Avoid using LIMIT and OFFSET for large datasets, which causes the CrateDB cluster to generate the same large result sets over and over.
  • Detect when the ordering of data won't work due to unsortable columns being in the primary key (such as when the primary key includes a partition column) and use COPY TO as a more efficient alternative.