Some of the Crate.io team attended Berlin Buzzwords 2018 last week. Berlin Buzzwords had tech talks, workshops, and meetups, all centered around searching, scaling, streaming, and storing large amounts of data.
At Crate.io, we are building CrateDB, a distributed SQL database that makes it simple to store and analyze massive amounts of machine data in real time. So this conference was a perfect fit for us! :)
In this post, we will share some of the things we learned.
You can watch the talk here:
The big news was that Elasticsearch now ranks at #8 on DB-Engines Ranking, the de facto league table of databases.
Philip introduced a bunch of new stuff that should be coming to Elasticsearch soon:
- Increased strictness on bootstrap checks.
- Rolling upgrades.
This was a big one. Elasticsearch 6 now supports rolling upgrades between major versions, instead of requiring the typical full cluster restart and the associated downtime.
- Flood stage watermark.
Elasticsearch currently has a low watermark (85%) and high watermark (90%) for disk space warnings.
An additional flood stage watermark is being introduced at 95%. When a node crosses it, Elasticsearch enforces a read-only block on every index with a shard on that node, to prevent the node from running out of disk space entirely.
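Assuming the setting names follow Elasticsearch's existing disk-based allocation settings, the three thresholds would look something like this in elasticsearch.yml (the values shown are the defaults mentioned above):

```yaml
# Disk-based shard allocation watermarks (defaults shown).
cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%
# New in 6.x: past this point, affected indexes are made read-only.
cluster.routing.allocation.disk.watermark.flood_stage: 95%
```

Note that once disk space has been freed up again, the read-only block has to be cleared manually on the affected indexes.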
- Sequence numbers to keep track of operations.
- Removal of mapping types.
Previously, each document stored in Elasticsearch was stored in a single index with a single mapping type. The mapping type represents the type of document or entity being indexed.
The implementation of this type system caused numerous issues, and so it is going to be removed entirely by version 8.x.
- Indexes will have one shard by default, as opposed to five.
Max went over the basic structure of CrateDB, explaining internal subcomponents as well as how CrateDB makes use of third-party open source technologies such as ANTLR, Elasticsearch, and Apache Lucene. Overall, this is an excellent starting point for anyone who wants to dive a little deeper into CrateDB.
You can watch his talk here:
Ted’s talk went over the recent developments in Kubernetes which help facilitate stateful applications. This is a fundamentally tricky problem, because containers, by design, are intended to be stateless and ephemeral.
If you keep state in your containers and one of the application containers goes away, your application has just suffered data loss.
So the typical way to handle this is to write stateful information to an external source. But this can be tricky if you're using files for state, because the complexities and limitations of networked file systems can open up an entirely new can of worms.
If you're using Kubernetes, the standard approach to dealing with persistence is through the use of persistent volumes and data platforms that abstract away persistence. Ted's specific example involved mounting a MapR data volume and using the FlexVolume plugin.
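As a sketch of that standard approach (the names and sizes here are illustrative, not taken from Ted's talk), a pod requests storage through a PersistentVolumeClaim and mounts it like any other volume:

```yaml
# Hypothetical claim: ask the cluster for 10Gi of storage.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
# Pod that mounts the claimed volume at /data.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-data
```

The point of the abstraction is that the pod never names the underlying storage; a plugin such as FlexVolume (as in Ted's MapR example) satisfies the claim behind the scenes.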
You can watch Ted's talk here:
Overall, Ted's talk was an engaging introduction to Kubernetes. He had anecdotes (horror stories, really) about problems encountered over the course of his career, such as having to change parts of a program while it was still running (because they couldn't afford to run it on another machine or stop the application). These anecdotes helped to put into perspective how powerful and useful Kubernetes is.
My coworker, Mika, attended a talk by Marton Elek, titled From Docker to Kubernetes: Running Apache Hadoop in a Cloud Native Way. This talk, somewhat unsurprisingly, focused on how to build scalable Hadoop clusters with Kubernetes. And Marton had some interesting suggestions about how to deploy new services.
You can watch the talk here:
One method suggested was to include a deployment script that builds the configuration file from environment variables within the Docker container, instead of mounting a configuration file via ConfigMap.
What's interesting is that CrateDB already allows you to emulate this. Command line options or environment variables can override settings in the configuration file.
Still, as Mika pointed out to me, for other sorts of deployments, this seems like an interesting approach. And from what we gather, Helm is worth looking at for preprocessing configuration files.
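A minimal sketch of that entrypoint pattern might look like this (all file paths, variable names, and settings here are made up for illustration):

```shell
#!/bin/sh
# Hypothetical container entrypoint: render the config file from
# environment variables at startup, instead of mounting one via ConfigMap.

# Fall back to sensible defaults when the env vars are unset.
: "${APP_CLUSTER_NAME:=demo-cluster}"
: "${APP_HTTP_PORT:=4200}"

CONFIG_FILE="${CONFIG_FILE:-/tmp/demo.yml}"

# Write the rendered configuration.
cat > "$CONFIG_FILE" <<EOF
cluster.name: ${APP_CLUSTER_NAME}
http.port: ${APP_HTTP_PORT}
EOF

echo "wrote $CONFIG_FILE"
# exec "$@"   # then hand off to the real server process
```

Because the configuration is derived at start time, the same image can be redeployed into different environments just by changing its environment variables.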
I attended a talk by Alvaro Videla, a former RabbitMQ developer, titled Lector in Codigo. This talk diverged from the primary themes of the conference but was nevertheless insightful, and explored how literary theory can help us write better code.
Alvaro reminded us that the purpose of programming is not only to provide a computer with instructions. Programming is a form of communication between people.
Programming enables us to share how we solved a problem at one point in time.
Given that, over the lifetime of a piece of software, more time is typically spent reading the code than writing it, this idea seems like an important point to drive home.
Writing accessible and communicative code doesn't take much extra effort and has long-term benefits for future contributors and maintainers. This benefits any business paying people to contribute, and it also makes the project more approachable for open source contributors. A win-win!
Here's a video of his talk:
You can also read more about it in his blog post.
On the whole, the Kubernetes talks at the conference were illuminating, but also a little frustrating.
Much of what was covered at the conference is stuff that we have encountered and found a solution for already. Indeed, the questions that we currently have do not seem to be common enough to have ready-made answers. But I am sure that will change! Kubernetes is still relatively young. :)
Would I go to the conference again?
For the most part, the talks were excellent and highly relevant to the technologies we work with. Even when the talks covered familiar material, it was nice to have the validation that we're doing things right.