COPY FROM

Copy data from files into a table.

Synopsis

COPY table_ident [ PARTITION (partition_column = value [ , ... ]) ]
FROM uri [ WITH ( option = value [, ...] ) ]

where option can be one of:

  • bulk_size integer
  • shared boolean
  • num_readers integer
  • compression string
  • overwrite_duplicates boolean

Description

COPY FROM copies data from a URI to the specified table as a raw data import.

The nodes in the cluster will attempt to access the resources available at the URI and import the data. The data must be provided as file(s) containing one JSON object per line and have to be encoded using UTF-8. Empty lines are skipped and any additional keys in the object will be added as columns, regardless of the previously defined table.

Example JSON data:

{"id": 1, "quote": "Don't panic"}
{"id": 2, "quote": "Ford, you're turning into a penguin. Stop it."}

See also: Import/Export.

Type Casts and Constraints

Crate does not check if the column’s data types match the types from the import file. It does not cast the types but will always import the data as in the source file. Furthermore Crate will only check for primary key duplicates but not for other Constraints like NOT NULL.

For example a WKT string cannot be imported into a column of geo_shape or geo_point type, since there is no implict cast to the GeoJSON format.

Note

In case the COPY FROM statement fails, the log output on the node will provide an error message. Any data that has been imported until then has been written to the table and should be deleted before restarting the import.

URI

A string literal or array of string literals containing URIs. Each URI must be formatted according to the URI Scheme.

In case the URI scheme is missing the value is assumed to be a file path
and will be converted to a file:// URI implicitly.

For example:

/tmp folder/file.json

Is converted to:

'file:///tmp%20folder/file.json'

Supported Schemes

file

The provided (absolute) path should point to files available on every node with read access to the Crate process there.

By default each node will attempt to read the files specified. In case the URI points to a shared folder (where other Crate nodes also have access) the shared option must be set to true in order to avoid importing duplicates.

s3

Can be used to access buckets on the Amazon AWS S3 Service:

s3://[<accesskey>:<secretkey>@]<bucketname>/<path>

If accesskey and secretkey are ommited, Crate attempts to load the credentials from the environment or Java settings.

Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY Java System Properties - aws.accessKeyId and aws.secretKey

If no credentials are set the s3 client will operate in anonymous mode, see AWS Java Documentation.

Using a s3 URI will set the shared option implicitly.

Note

A secretkey provided by Amazon Web Service can contain characters such as ‘/’, ‘+’ or ‘=’. These characters must be URL encoded. For a detailed explanation read the official AWS documentation.

Note

Versions prior to 0.51.x use HTTP for connections to S3. Since 0.51.x these connections are using the HTTPS protocol. Please make sure you update your firewall rules to allow outgoing connections on port 443.

http, https, and jar (Java URL protocols)

In addition to the schemes above, Crate supports all protocols supported by the URL implementation of its JVM (typically http, https, ftp, and jar). Please refer to the documentation of the JVM vendor for an accurate list of supported protocols.

These schemes do not support wildcard expansion.

Parameters

table_ident:The name (optionally schema-qualified) of an existing table where the data should be put.
uri:An expression which evaluates to a URI as defined in RFC2396. The supported schemes are listed above. The last part of the path may also contain * wildcards to match multiple files.

PARTITION Clause

For partitioned tables this clause can be used to import data into the specified partition. This clause takes one or more partition columns and for each column a value.

[ PARTITION ( partition_column = value [ , ... ] ) ]
partition_column:The name of the column by which the table is partitioned. All partition columns that were part of the PARTITIONED BY Clause of the CREATE TABLE statement must be specified.
value:The column’s value.

Note

Partitioned tables do not store the value for the partition column in each row, hence every row will be imported into the specified partition regardless of the value provided for the partition columns.

WITH Clause

The optional WITH clause can specify options for the COPY FROM statement.

[ WITH ( option = value [, ...] ) ]

Options

bulk_size

Crate will process the lines it reads from the path in bulks. This option specifies the size of one batch. The provided value must be greater than 0, the default value is 10000.

shared

This option should be set to true if the URI points to a location accessible by
multiple Crate nodes to prevent them from importing the same file.

The default value depends on the scheme of each URI.

If an array of URIs is passed to COPY FROM this option will overwrite the default for all URIs.

node_filters

A filter expression to select the nodes to run the read operation.

It’s an object in the form of:

{
    name = '<node_name_regex>',
    id = '<node_id_regex>'
}

Only one of the keys is required. The name regular expression is applied on the name of all execution nodes, whereas the id regex is applied on the node id.

If both keys are set, both regular expressions have to match for a node to be included.

If the shared option if false, a strict node filter might exclude nodes with access to the data leading to a partial import.

To verify which nodes match the filter, run the statement with EXPLAIN.

num_readers

The number of nodes that will read the resources specified in the URI. Defaults to the number of nodes available in the cluster. If the option is set to a number greater than the number of available nodes it will still use each node only once to do the import. However, the value must be an integer greater than 0.

If shared is set to false this option has to be used with caution. It might exclude the wrong nodes, causing COPY FROM to read no files or only a subset of the files.

compression

The default value is null, set to gzip to read gzipped files.

overwrite_duplicates

Default: false

COPY FROM by default won’t overwrite rows if a document with the same primary key already exists. Set to true to overwrite duplicate rows.