Migrating from MongoDB

Exporting Data from MongoDB

MongoDB ships with a command-line tool called mongoexport, which can be used to export data in the MongoDB Extended JSON file format.

The Extended JSON format is a custom format from MongoDB that includes type information.
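For example, an ObjectId or a date field is exported as a wrapped value rather than a plain value. A minimal, hypothetical document could look like this (the exact representation depends on the mongoexport version):

{"_id": {"$oid": "55c3793df5da0e27ebb0d2a4"}, "name": "Trillian", "born": {"$date": "1979-10-12T00:00:00.000Z"}}

These type wrappers are not understood by Crate, which is why the exported data has to be converted before it can be imported.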

To export the data, invoke the mongoexport tool:

mongoexport --db <dbname> --collection <collection> --out out.json

Before the data can be imported into Crate, the additional type information must be stripped away. To do so, download the conversion script from the crate-utils repository on GitHub:

curl https://raw.githubusercontent.com/crate/crate-utils/master/migrations/mongodb/convert_json.py -o convert_json.py

Once the script is downloaded, convert the data:

mongoexport --db <dbname> --collection <collection> | python convert_json.py > out.json

Importing Data into Crate

Before the converted file can be imported into Crate, a table has to be created.

A basic CREATE TABLE statement looks as follows:

cr> create table mytable (
...     name string,
...     obj object (dynamic)
... ) clustered into 5 shards with (number_of_replicas = 0);
CREATE OK, 1 row affected (... sec)

In Crate, each field is indexed by default, so it is not necessary to create any additional indices.

However, if some fields are never used for filtering, indexing can be turned off:

cr> create table mytable2 (
...     name string,
...     obj object (dynamic),
...     dummy string INDEX OFF
... ) clustered into 5 shards with (number_of_replicas = 0);
CREATE OK, 1 row affected (... sec)

For fields that contain text, consider using a full-text analyzer. This enables powerful full-text search capabilities. See Indices and Fulltext Search for more information.
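A full-text index with an analyzer can be declared directly in the table statement. The following is only a sketch; the table name, the content column, and the english analyzer are illustrative choices:

cr> create table mytable3 (
...     name string,
...     content string INDEX using fulltext with (analyzer = 'english')
... ) clustered into 5 shards with (number_of_replicas = 0);
CREATE OK, 1 row affected (... sec)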

Crate is able to dynamically extend the schema, so it is not necessary to define all columns up-front.

Given the table above, it is possible to add new columns at the top level of the table and to insert arbitrary objects into the obj column:

cr> insert into mytable2 (name, obj, newcol, dummy) values
... ('Trillian', {gender = 'female'}, 2804, 'dummy');
INSERT OK, 1 row affected (... sec)

cr> refresh table mytable2;
REFRESH OK, 1 row affected (... sec)
cr> select * from mytable2;
+-------+----------+--------+----------------------+
| dummy | name     | newcol | obj                  |
+-------+----------+--------+----------------------+
| dummy | Trillian |   2804 | {"gender": "female"} |
+-------+----------+--------+----------------------+
SELECT 1 row in set (... sec)

However, dynamic schema updates have some limitations. For example, timestamps in long format won't be recognised as timestamps. Due to this limitation it is recommended to specify such fields up-front.
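For example, if the exported documents contain a creation date stored as a long value, the corresponding column can be declared as a timestamp when creating the table. The table and column names below are only illustrative:

cr> create table mytable4 (
...     name string,
...     created timestamp
... ) clustered into 5 shards with (number_of_replicas = 0);
CREATE OK, 1 row affected (... sec)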

After the table has been created, the file can be imported using COPY FROM.
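A minimal sketch of such an import looks as follows; the path to out.json is only an example and the file has to be accessible from the Crate node:

cr> copy mytable2 from '/tmp/out.json';
COPY OK, ... rows affected (... sec)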

There is an entire section dedicated to doing data imports efficiently. Continue reading there: Importing Huge Datasets into Crate.