This post describes how we use Elasticsearch as part of Crate. Crate is an open source data store for any data: a shared-nothing, fully searchable, document-oriented cluster store. As far as we know, we’re the first to use Elasticsearch as a framework.
We were always impressed by the amazing simplicity of Elasticsearch. I began using it in 2010 to support large-scale systems we built at Lovely Systems (our systems integration company prior to founding Crate.IO). Elasticsearch isn’t just simple; it is also ultra fast and scales extremely easily. Jodok has given a talk about how we threw 24 billion records into Elasticsearch.
At the time, we needed special aggregations and other functionality for some of our projects, so, in 2011, we wrote our own plugins(1) for Elasticsearch. This made me like Java again, since Elasticsearch is written in a very direct and clean way.
When we began thinking about the ingredients of a good, modern data store - a no-brainer data store for big data applications - we wanted to take Elasticsearch as a model for simplicity and scalability.
We decided to use Elasticsearch as a framework. Elasticsearch is currently included from source as a git submodule. Unlike Elasticsearch, Crate uses Gradle as its build tool, so we maintain our own project file for the Elasticsearch source tree.
There are two main reasons why we are including the source tree instead of just adding a dependency on the jar:
Elasticsearch shades and minimizes some of its dependencies at distribution time, which makes it hard to extend components that use those dependencies. For example, Crate uses its own Netty HTTPHandler in order to support streaming data from the client. This handler uses Netty classes that are not redistributed with the Elasticsearch jar. Adding Netty as a direct dependency does not work either, as discussed in this thread: http://elasticsearch-users.115913.n3.nabble.com/netty-transport-and-shaded-jar-td4028065.html
We use our own Elasticsearch fork, since it carries some minor patches. In most cases those patches are merged into Elasticsearch (e.g. https://github.com/elasticsearch/elasticsearch/pull/6127). Having our own fork lets us do real-world testing and stabilize a change before opening a pull request upstream.
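To make the shading problem from the first reason more concrete, here is a minimal sketch of how one could check which Netty class names are visible on a given classpath. The helper class `ShadingCheck` is hypothetical, and the relocated package name `org.elasticsearch.common.netty` is an assumption about how the distributed jar renames Netty, not something stated above. The point is only that a handler compiled against `org.jboss.netty.*` will not find those classes if the jar ships them under a different package.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Crate or Elasticsearch.
class ShadingCheck {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ShadingCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Probes several class names and records, in order, whether each is loadable. */
    static Map<String, Boolean> probe(String... classNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : classNames) {
            result.put(name, isLoadable(name));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Boolean> result = probe(
            // Original Netty 3 package, as a custom handler would import it
            "org.jboss.netty.channel.ChannelHandler",
            // ASSUMPTION: package name after relocation inside the shaded jar
            "org.elasticsearch.common.netty.channel.ChannelHandler"
        );
        result.forEach((name, found) ->
            System.out.println((found ? "found:   " : "missing: ") + name));
    }
}
```

Run against a classpath that contains only a shaded jar, a check like this would be expected to report the original name as missing; building from the source tree avoids the relocation altogether.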
Today, Crate depends on many Elasticsearch components, such as:
We still enjoy coding against the Elasticsearch codebase, and we like how contributions and communication are handled in this project. There are many other things we use from Elasticsearch. Feel free to take a look at the source on GitHub or contact us on Slack if you are interested in details.
(1) While most of them were project specific, some have been made open source, like the inout plugin https://github.com/crate/elasticsearch-inout-plugin and the timefacets plugin https://github.com/crate/elasticsearch-timefacets-plugin. The functionality of these plugins is now included in Crate.
(2) The Elasticsearch API can be enabled via configuration. However, this might change in the future, and it is currently only there for debugging purposes.