Data Definition

Table Basics

To create a table use the CREATE TABLE command. You must at least specify a name for the table and names and types of the columns. See Data Types for information about the supported data types.

Let’s create a simple table with two columns of type integer and string:

cr> create table my_table (
...   first_column integer,
...   second_column string
... );
CREATE OK (... sec)

A table can be removed by using the DROP TABLE command:

cr> drop table my_table;
DROP OK (... sec)

The DROP TABLE command takes the optional clause IF EXISTS which prevents the generation of an error if the specified table does not exist:

cr> drop table if exists my_table;
DROP OK (... sec)

Schemas

Tables can be created in different schemas. These are created implicitly on table creation and cannot be created explicitly. If a schema did not exist yet, it will be created:

cr> create table my_schema.my_table (
...   pk int primary key,
...   label string,
...   position geo_point
... );
CREATE OK (... sec)
cr> select schema_name, table_name from information_schema.tables
... where table_name='my_table';
+-------------+------------+
| schema_name | table_name |
+-------------+------------+
| my_schema   | my_table   |
+-------------+------------+
SELECT 1 row in set (... sec)

The following schema names are reserved and may not be used:

  • blob
  • information_schema
  • sys

Note

Schemas are primarily namespaces for tables. There is no notion of access control. Everybody can see and manipulate tables in every schema.

A user created schema exists as long as there are tables with the same schema name. If the last table with that schema is dropped, the schema is gone (except for the blob and doc schema):

cr> drop table my_schema.my_table ;
DROP OK (... sec)

Every table that is created without an explicit schema name, will be created in the doc schema:

cr> create table my_doc_table (
...   a_column byte,
...   another_one geo_point
... );
CREATE OK (... sec)
cr> select schema_name, table_name from information_schema.tables
... where table_name='my_doc_table';
+-------------+--------------+
| schema_name | table_name   |
+-------------+--------------+
| doc         | my_doc_table |
+-------------+--------------+
SELECT 1 row in set (... sec)

Naming restrictions

Table, schema and column identifiers cannot have the same names as reserved key words. Please refer to the Lexical Structure section for more information about naming.

Additionally, table and schema names are restricted in terms of characters and length. They:

  • may not contain one of the following characters: \ / * ? " < > | <whitespace> , #
  • may not contain upper case letters
  • may not start with an underscore: _
  • should not exceed 255 bytes when encoded with utf-8 (this limit applies on the optionally schema-qualified table name)

Column names are restricted in terms of characters. They:

  • may not contain one of the following characters: [ ' ] .
  • may not start with an underscore: _

Generated Columns

It is possible to define columns whose value is computed by applying a generation expression in the context of the current row, where it is possible to reference the values of other columns.

Generated columns are defined by providing a generation expression. Providing a data type is optional. It is inferred by the return type of the supplied expression if omitted:

cr> CREATE TABLE computed (
...   dividend double,
...   divisor double,
...   quotient AS (dividend / divisor)
... );
CREATE OK (... sec)

For a full syntax description, see CREATE TABLE.

Generated columns are read-only as they are computed internally. They are computed upon INSERT and UPDATE:

cr> INSERT INTO computed (dividend, divisor) VALUES (1.7, 1.5), (0.0, 10.0);
INSERT OK, 2 rows affected (... sec)

The generated column is now filled with the computed value:

cr> SELECT dividend, divisor, quotient
... FROM computed
... ORDER BY quotient;
+----------+---------+--------------------+
| dividend | divisor |           quotient |
+----------+---------+--------------------+
|      0.0 |    10.0 | 0.0                |
|      1.7 |     1.5 | 1.1333333333333333 |
+----------+---------+--------------------+
SELECT 2 rows in set (... sec)

If values are supplied for generated columns, these values are validated against the result of applying the generation expression:

cr> INSERT INTO computed (dividend, divisor, quotient) VALUES (100.0, 2.0, 12.0);
SQLActionException[SQLParseException: Given value 12.0 for generated column does not match defined generated expression value 50.0]

Warning

Supplied values for generated columns are not validated when they are imported using copy from.

They can also be used in the PARTITIONED BY Clause in order to compute the value to partition by from existing columns in the table:

cr> CREATE TABLE computed_and_partitioned (
...   huge_cardinality long,
...   big_data string,
...   partition_value AS ( huge_cardinality % 10 )
... ) PARTITIONED BY (partition_value);
CREATE OK (... sec)

Constraints

Primary Key

The primary key constraint combines a unique constraint and a not-null constraint. It also defines the default routing value used for sharding. Example:

cr> create table my_table1 (
...   first_column integer primary key,
...   second_column string
... );
CREATE OK (... sec)

Currently primary keys cannot be auto generated and have to be specified if data is inserted, otherwise an error is returned.

Defining multiple columns with a primary key constraint is also supported:

cr> create table my_table1pk (
...   first_column integer primary key,
...   second_column string primary key,
...   third_column string
... );
CREATE OK (... sec)

Or using a alternate syntax:

cr> create table my_table1pk1 (
...   first_column integer,
...   second_column string,
...   third_column string,
...   primary key (first_column, second_column)
... );
CREATE OK (... sec)

Note

Not all column types can be used as PRIMARY KEY. See PRIMARY KEY Constraint.

Sharding

Number of Shards

Crate supports sharding natively. The number of shards can be defined by using the CLUSTERED INTO <number> SHARDS statement upon the table creation. Example:

cr> create table my_table5 (
...   first_column int
... ) clustered into 10 shards;
CREATE OK (... sec)

If the number of shards is not defined explicitly, the sensible default value is applied, (see CLUSTERED Clause).

Note

The number of shards can only be set on table creation, it cannot be changed later on.

Routing

The column used for routing can be freely defined using the CLUSTERED BY (<column>) statement and is used to route a row to a particular shard. Example:

cr> create table my_table6 (
...   first_column int,
...   second_column string
... ) clustered by (first_column);
CREATE OK (... sec)

If primary key constraints are defined, the routing column definition can be omitted as primary key columns are always used for routing by default. If the routing column is defined explicitly, it must match a primary key column:

cr> create table my_table8 (
...   first_column int primary key,
...   second_column string primary key,
...   third_column string
... ) clustered by (first_column);
CREATE OK (... sec)

Example for combining custom routing and shard definition:

cr> create table my_table9 (
...   first_column int primary key,
...   second_column string primary key,
...   third_column string
... ) clustered by (first_column) into 10 shards;
CREATE OK (... sec)

Replication

Replication of a table in Crate means that each primary shard of a table is stored additionally on so called secondary shards. This might be useful for better read performance and high availability. If not specified, crate creates one replica, which means that a tables content is stored twice across the nodes of a cluster.

Defining the number of replicas is done using the number_of_replicas property.

Example:

cr> create table my_table10 (
...   first_column int,
...   second_column string
... ) with (number_of_replicas = 1);
CREATE OK (... sec)

The number_of_replicas property also accepts an string as parameter that contains a range.

A range is a definition of minimum number of replicas to maximum number of replicas depending on the number of nodes in the cluster. The table below shows some examples.

Range Explanation
0-1 Will create 0 or 1 replicas depending on the number of available nodes
2-4 Table requires at least 2 replicas to be fully replicated. Will create up to 4 if nodes are added.
0-all Will expand the number of replicas to the available number of nodes.

For details of the range syntax refer to number_of_replicas.

Note

The number of replicas can be changed at any time.

Column Policy

The Column Policy defines if a table enforces its defined schema or if it’s allowed to store additional columns which are a not defined in the table schema. If the column policy is not defined within the with clause, dynamic will be used.

strict

The column policy can be configured to be strict, rejecting any column on insert/update/copy_to which is not defined in the schema.

Example:

cr> create table my_table11 (
...   title string,
...   author string
... ) with (column_policy = 'strict');
CREATE OK (... sec)

dynamic

The other option is dynamic which is the default policy. dynamic means that new columns can be added using insert, update or copy from.

Note that adding new columns to a table with a dynamic policy will affect the schema of the table. Once a column is added, it shows up in the information_schema.columns table and its type and attributes are fixed. It will have the type that was guessed by its inserted/updated value and they will always be not_indexed which means they are analyzed with the plain analyzer, which means as-is. If a new column a was added with type boolean, adding strings to this column will result in an error, except the string can be implicit casted to a boolean value.

Examples:

cr> create table my_table12 (
...   title string,
...   author string
... );
CREATE OK (... sec)

which is exactly the same as:

cr> create table my_table13 (
...   title string,
...   author string
... ) with (column_policy = 'dynamic');
CREATE OK (... sec)

New columns added to dynamic tables are, once added, usable as usual columns. One can retrieve them, sort by them and use them in where clauses.

Warning

The mapping update is processed asynchrously on multiple nodes. If a new field gets added to the local mapping of two shards, these shards are sending their mapping to the master. If this mapping update gets delivered later than the next query on the previously added column, it will result in a ColumnUnknownException.

Partitioned Tables

A partitioned table is a virtual table that is internally split up into several internal tables, called partitions, by the value of one or more columns. For every distinct combination of values in those configured columns one separate partition is created.

Creating a partitioned table is done using the PARTITIONED BY Clause:

cr> create table partitioned_table (
... id long,
... title string,
... date timestamp
... ) partitioned by (date);
CREATE OK (... sec)

For further details, please refer to PARTITIONED BY Clause.

System Columns

On every table Crate implements several implicitly defined system columns. Their names are reserved and cannot be used as user-defined column names. All system columns are prefixed with an underscore.

_version
Crate uses an internal versioning for every row, the version number is increased on every write. This column can be used for Optimistic Concurrency Control, see Optimistic Concurrency Control with Crate for usage details.
_score
This internal system column is available on all documents retrieved by a SELECT query. It is representing the scoring ratio of the document related to the used query filter and makes most sense on fulltext searches. The scoring ratio is always related to the highest score determined by a search, thus scores are not directly comparable across searches. If the query does not include a fulltext search the value is 1.0f in most cases.
_id
_id is an internal system column that is available on each indexed document and can be retrieved by a SELECT query from doc schema tables. The value is a unique identifier for each row in a table and is a compound string representation of all primary key values of that row. If no primary keys are defined the id is randomly generated. If no dedicated routing column is defined the _id value is used for distributing the records on the shards.

Alter Table

Updating Parameters

The parameters of a table can be modified using the ALTER TABLE clause:

cr> alter table my_table1 set (number_of_replicas = '0-all');
ALTER OK (... sec)

In order to set a parameter to its default value use reset:

cr> alter table my_table1 reset (number_of_replicas);
ALTER OK (... sec)

Read Alter Partitioned Tables to see how to alter parameters of partitioned tables.

Adding Columns

In order to add a column to an existing table use ALTER TABLE with the ADD COLUMN clause:

cr> alter table my_table1 add column new_column_name string;
ALTER OK (... sec)

The inner schema of object columns can also be extended, as shown in the following example.

First a column of type object is added:

cr> alter table my_table1 add column obj_column object as (age int);
ALTER OK (... sec)

And now a nested column named name is added to the obj_column:

cr> alter table my_table1 add column obj_column['name'] string;
ALTER OK (... sec)
cr> select column_name, data_type from information_schema.columns
... where table_name = 'my_table1' and column_name like 'obj_%';
+--------------------+-----------+
| column_name        | data_type |
+--------------------+-----------+
| obj_column         | object    |
| obj_column['age']  | integer   |
| obj_column['name'] | string    |
+--------------------+-----------+
SELECT 3 rows in set (... sec)

Set and Reset

The crate cluster can be configured at runtime using the SET and RESET statement. See the Cluster Settings configuration section for details about the supported settings.

If SET is used with PERSISTENT the change will survive a cluster restart, if used with TRANSIENT the value will be restored to default or config file value on a restart:

cr> SET GLOBAL PERSISTENT stats.enabled = false;
SET OK (... sec)
cr> select sys.cluster.settings['stats']['enabled'] from sys.cluster;
+------------------------------------------+
| sys.cluster.settings['stats']['enabled'] |
+------------------------------------------+
| FALSE                                    |
+------------------------------------------+
SELECT 1 row in set (... sec)

You can change multiple values at once:

cr> SET GLOBAL TRANSIENT stats.enabled = true,
... stats.jobs_log_size = 1024, stats.operations_log_size = 4096;
SET OK (... sec)
cr> select settings['stats']['enabled'],
...   settings['stats']['jobs_log_size'],
...   settings['stats']['operations_log_size']
... from sys.cluster;
+-...------------+-...------------------+-...------------------------+
| ...['enabled'] | ...['jobs_log_size'] | ...['operations_log_size'] |
+-...------------+-...------------------+-...------------------------+
| TRUE           |                 1024 |                       4096 |
+-...------------+-...------------------+-...------------------------+
SELECT 1 row in set (... sec)

Its also possible to save a complete nested object of settings:

cr> SET GLOBAL TRANSIENT stats = {
...   jobs_log_size = 2048,
...   operations_log_size = 8192
... };
SET OK (... sec)
cr> select settings['stats'] from sys.cluster;
+-----------------------------------------------------------------------+
| settings['stats']                                                     |
+-----------------------------------------------------------------------+
| {"enabled": true, "jobs_log_size": 2048, "operations_log_size": 8192} |
+-----------------------------------------------------------------------+
SELECT 1 row in set (... sec)

Using the RESET statement, a setting will be reset to either on node startup defined configuration file value or to its default value:

cr> RESET GLOBAL stats.enabled, stats.operations_log_size;
RESET OK (... sec)
cr> select settings['stats'] from sys.cluster;
+-------------------------------------------------------------------------+
| settings['stats']                                                       |
+-------------------------------------------------------------------------+
| {"enabled": false, "jobs_log_size": 2048, "operations_log_size": 10000} |
+-------------------------------------------------------------------------+
SELECT 1 row in set (... sec)

RESET can also be done on objects:

cr> RESET GLOBAL stats;
RESET OK (... sec)
cr> select settings['stats'] from sys.cluster;
+--------------------------------------------------------------------------+
| settings['stats']                                                        |
+--------------------------------------------------------------------------+
| {"enabled": false, "jobs_log_size": 10000, "operations_log_size": 10000} |
+--------------------------------------------------------------------------+
SELECT 1 row in set (... sec)

Show Create Table

The SHOW CREATE TABLE statement can be used to print the DDL statement of already existing user-created doc tables in the cluster:

cr> show create table my_table;
+-----------------------------------------------------+
| SHOW CREATE TABLE doc.my_table                      |
+-----------------------------------------------------+
| CREATE TABLE IF NOT EXISTS "doc"."my_table" (       |
|    "first_column" INTEGER,                          |
|    "fourth_column" OBJECT (STRICT) AS (             |
|       "key" STRING,                                 |
|       "value" STRING                                |
|    ),                                               |
|    "second_column" STRING,                          |
|    "third_column" TIMESTAMP,                        |
|    PRIMARY KEY ("first_column")                     |
| )                                                   |
| CLUSTERED BY ("first_column") INTO 5 SHARDS         |
| WITH (                                              |
|    "blocks.metadata" = false,                       |
|    "blocks.read" = false,                           |
|    "blocks.read_only" = false,                      |
|    "blocks.write" = false,                          |
|    column_policy = 'dynamic',                       |
|    number_of_replicas = '1',                        |
|    "recovery.initial_shards" = 'quorum',            |
|    refresh_interval = 1000,                         |
|    "routing.allocation.enable" = 'all',             |
|    "routing.allocation.total_shards_per_node" = -1, |
|    "translog.disable_flush" = false,                |
|    "translog.flush_threshold_ops" = 2147483647,     |
|    "translog.flush_threshold_period" = 1800000,     |
|    "translog.flush_threshold_size" = 209715200,     |
|    "translog.interval" = 5000,                      |
|    "translog.sync_interval" = 5000,                 |
|    "unassigned.node_left.delayed_timeout" = 60000,  |
|    "warmer.enabled" = true                          |
| )                                                   |
+-----------------------------------------------------+
SHOW 1 row in set (... sec)

The table settings returned within the WITH clause of the result are all available table settings showing their respective values at the time of the execution of the SHOW statement. Different versions of Crate may have different default table settings. This means that if you re-create the table using the resulting CREATE TABLE statement the settings of the ‘old’ table may differ from the settings of the ‘new’ table. This is because the table settings are set explicitly on creation time.