Using Crate As a BLOB Store

This article is more than 4 years old

Crate provides BLOB storage so you can persistently store and retrieve BLOBs, typically pictures, videos or large unstructured files, in a fully distributed cluster. Crate pools the disk space of all available nodes (whether commodity hardware or cloud computing resources) into one system, and all data is accessible from every node.

BLOB storage is a great way to store content and serve it through an application, be it analytics, a social network or a storage service.

BLOBs are binary large objects. Since BLOBs are binary, they do not conform to the conventional data types stored in databases and are therefore stored separately. Images are a typical example.

Crate frees developers from the need to “glue” several technologies together to store documents, store blobs and support real-time search. Rather than using one open-source system to store and serve blobs (GridFS, Rados or HDFS) together with an open-source database for structured data, Crate provides a unified data store. It is also possible, however, to use Crate for blob storage alone, as a blobstore.

Crate replicates and shards BLOBs just like it does for any other data. One of the great things about Crate is the automatic sharding and replication, which saves datastore administration work. Additionally, other solutions use different sharding and replication rules for BLOBs than for the “regular” data in the datastore, creating inconsistencies and more moving parts to track. Crate applies the same replication and sharding to both the blob store and the other data, simplifying the work required for the data store as a whole.

Here are some things you should know about using Crate with BLOBs:

1. Creating a BLOB Table

Before you add a blob to Crate, you must create a blob table. Using the Crate Shell, CraSh, you can issue a SQL statement as follows:

sh$ crash -c "create blob table myblobs clustered into 3 shards with (number_of_replicas=1)"
CREATE OK (... sec)

The number of replicas is set to 1 here. To change the settings of an existing blob table, use the ALTER BLOB TABLE statement:

sh$ crash -c "alter blob table myblobs set (number_of_replicas=3)"
ALTER OK (... sec)

In this example blobs are managed under the /_blobs/myblobs endpoint.

2. Uploading BLOBs

To upload a blob, its SHA-1 hash must be known up front, because the hash serves as the id of the new blob. Here's a Python one-liner that computes the hash:

sh$ python -c 'import hashlib;print(hashlib.sha1("contents".encode("utf-8")).hexdigest())'
4a756ca07e9487f482465a99e8286abc86ba4dc7
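The one-liner hashes a short string held in memory. For a large file it is usually preferable to hash in chunks. A minimal sketch, assuming the blob lives on disk; the helper name `file_sha1` and the 1 MiB chunk size are illustrative choices, not part of Crate:

```python
import hashlib


def file_sha1(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of the file at `path`, read in chunks.

    `file_sha1` is a hypothetical helper; the chunk size is arbitrary.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be used as the blob id in the request URLs of the following steps.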

The blob can now be uploaded by issuing a PUT request:

sh$ curl -isSX PUT '127.0.0.1:4200/_blobs/myblobs/4a756ca07e9487f482465a99e8286abc86ba4dc7' -d 'contents'
HTTP/1.1 201 Created
Content-Length: 0
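The same upload can be sketched from application code with Python's standard library. This is an illustration under assumptions rather than an official client: `build_blob_put` and `upload_blob` are hypothetical helper names, and the base URL matches the local node used in the curl example.

```python
import hashlib
import urllib.request


def build_blob_put(base_url, table, data):
    """Build (but do not send) a PUT request that uploads `data`.

    Hypothetical helper: the blob id in the URL is the SHA-1 of the payload.
    """
    digest = hashlib.sha1(data).hexdigest()
    url = f"{base_url}/_blobs/{table}/{digest}"
    return urllib.request.Request(url, data=data, method="PUT")


def upload_blob(base_url, table, data):
    """Send the PUT request and return the HTTP status code.

    Needs a reachable node; the curl example above shows 201 on success.
    """
    with urllib.request.urlopen(build_blob_put(base_url, table, data)) as resp:
        return resp.status
```

With a node running locally, `upload_blob("http://127.0.0.1:4200", "myblobs", b"contents")` would issue the same request as the curl command.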

3. Listing & Querying BLOBs

To list all blobs inside a blob table a SELECT statement can be used:

sh$ crash -c "select digest, last_modified from blob.myblobs" 
+------------------------------------------+---------------+
| digest                                   | last_modified | 
+------------------------------------------+---------------+ 
| 4a756ca07e9487f482465a99e8286abc86ba4dc7 | ...           | 
+------------------------------------------+---------------+ 
SELECT 1 row in set (... sec)

For example, to get the number of BLOBs uploaded in the last 24 hours, filter on last_modified. The query below assumes the current timestamp is 1404134357383 (milliseconds since the epoch) and subtracts the 86400000 milliseconds in a day:

sh$ crash -c "select count(*) as num_blobs from blob.myblobs where last_modified > 1404134357383-86400000" 
+-----------+ 
| num_blobs | 
+-----------+ 
|    162873 | 
+-----------+ 
SELECT 1 row in set (... sec) 
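The query above hard-codes the current timestamp. In practice the cutoff would be computed at run time. A small sketch, where `recent_blobs_query` is a hypothetical helper that only builds the SQL string; the table name is interpolated directly, so it must come from a trusted source:

```python
import time

DAY_MS = 24 * 60 * 60 * 1000  # 86400000 milliseconds in one day


def recent_blobs_query(table, now_ms=None):
    """Build a query counting blobs modified in the last 24 hours.

    `now_ms` is the current time in milliseconds since the epoch and
    defaults to the system clock. `table` is not escaped (trusted input).
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    cutoff = now_ms - DAY_MS
    return (
        f"select count(*) as num_blobs from blob.{table} "
        f"where last_modified > {cutoff}"
    )
```

The resulting string can be passed to `crash -c` exactly like the statement shown above.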

4. Downloading BLOBs

Blobs are downloaded through a GET request.

sh$ curl -sS '127.0.0.1:4200/_blobs/myblobs/4a756ca07e9487f482465a99e8286abc86ba4dc7'
contents

Since blobs are sharded, not every node in a cluster holds every blob. If a GET request is sent to a node that does not hold the requested blob, the node responds with a 307 Temporary Redirect that points to a node that does. To determine whether a blob exists without downloading it, use a HEAD request.
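Because a blob's id is the SHA-1 hash of its contents, a client can check a download against the id it asked for. A minimal sketch using only the standard library; `fetch_blob` and its `opener` parameter are hypothetical names introduced here. Python's default `urllib` opener follows a 307 redirect on GET requests, so the redirect to another node needs no extra handling in this sketch.

```python
import hashlib
import urllib.request


def fetch_blob(base_url, table, digest, opener=urllib.request.urlopen):
    """Download a blob and verify that its SHA-1 matches the requested id.

    Hypothetical helper. `opener` is injectable so the function can be
    exercised without a running cluster; by default it performs a real GET.
    """
    url = f"{base_url}/_blobs/{table}/{digest}"
    with opener(url) as resp:
        data = resp.read()
    actual = hashlib.sha1(data).hexdigest()
    if actual != digest:
        raise ValueError(f"digest mismatch: expected {digest}, got {actual}")
    return data
```

Reading the whole response into memory keeps the example short; for very large blobs the response would more sensibly be read and hashed in chunks.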

5. Deleting BLOBs

To delete a blob simply use a DELETE request:

sh$ curl -isS -XDELETE '127.0.0.1:4200/_blobs/myblobs/4a756ca07e9487f482465a99e8286abc86ba4dc7'
HTTP/1.1 204 No Content
Content-Length: 0

To delete an entire blob table, use the following statement in the Crate Shell:

sh$ crash -c "drop blob table myblobs"
DROP OK (... sec)