Logz.io is focused on creating the best observability service to manage the scale of monitoring, add value on top of AI/ML technologies, and enhance enterprise security. Metrics is one of the pillars of Logz.io, delivered through our Prometheus-as-a-Service offering.
It has been a crucial part of our platform goals, but a year ago our service used only the open-source Elasticsearch database (ES). We decided to build new technologies on top of ES to better store and roll up metrics in order to reduce cost. While this solution worked and let us use the same data source for logs, metrics, and traces, it had two major drawbacks.
First, storing aging metrics was expensive. On top of that, performance was not ideal because Elasticsearch is a document database and not particularly well suited to storing time-series data.
Second, and more importantly, most users who approach Logz.io are already using Prometheus and need a smooth cloud migration to our own metrics service.
Seeing where these two challenges would lead, we knew we had to take the road less traveled, and that has made all the difference.
Databases vs. Data Stores: Pros & Cons
We narrowed the options down to four possible paths that provided compatibility with the Prometheus query language (PromQL) and APIs: VictoriaMetrics, M3DB, Thanos, and Cortex. We had many discussions within the community, then set up and tested each of those solutions. Each had pros and cons in storage, scalability, performance, cost, and the health of its community.
Another key criterion was the ability to integrate and scale the solution across our multi-region cloud presence in Amazon Web Services (AWS) and Microsoft Azure. That presence is only going to expand over time to address more customer demand. We planned to implement and scale the database on our self-managed Kubernetes infrastructure, using the Kubernetes Operator for whichever solution we chose.
One of the fundamental differences among these four options is that two of them are databases while two of them are data stores. The data store approach relies upon an object store with services in front to insert and query data.
Why We Didn’t Go with Thanos & Cortex
We were concerned about the performance of the data store approach, and about vendor-specific object storage differences between cloud providers, which may become an issue as we continue to expand across more cloud providers. The good news with both Thanos and Cortex is that many companies were using them and were also involved in the community. Both projects are backed by the Cloud Native Computing Foundation (CNCF), which is a big positive for their future direction. The team contributing to these projects is world-class and then some.
We felt that a database would be a better approach for us. As an organization, we are comfortable running scale-out databases, as we have been doing for the last 5+ years with Elasticsearch. We needed to scale out, and we preferred to do that with Kubernetes, as that has been our general direction for all of our services.
VictoriaMetrics vs. M3DB
Of the two databases, the first was VictoriaMetrics. It has some impressive performance numbers behind it, and managing it would be quite simple because its storage design draws on ideas from the well-understood ClickHouse database technology. There is a company behind the technology, which can be both good and bad. We prefer a vibrant community behind the technology to a consulting company.
VictoriaMetrics’ major issue is its limited scale-out and clustering capabilities, which would require us to build our own sharding system. Such a sharding system can become difficult to manage over time, and we would also have to build software to handle our multi-tenant requirements. This is precisely what we had to do for Elasticsearch, and doing it on another database would be a large body of work for the team.
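To make the sharding concern concrete, here is a minimal sketch of the kind of tenant-to-shard routing we would have had to build and maintain ourselves. It is an illustration only, not code from our platform: the function names, the tenant IDs, and the shard hostnames are all hypothetical.

```python
import hashlib


def shard_for_tenant(tenant_id: str, num_shards: int) -> int:
    """Map a tenant to a shard index using a stable hash (hypothetical helper)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Use a cryptographic digest so the mapping is stable across processes,
    # unlike Python's built-in hash(), which is randomized for strings.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(tenant_id: str, shards: list[str]) -> str:
    """Pick the backend (hypothetical hostname) that stores this tenant's metrics."""
    return shards[shard_for_tenant(tenant_id, len(shards))]


# Hypothetical backends, for illustration only.
SHARDS = ["metrics-db-0", "metrics-db-1", "metrics-db-2"]
```

Even this simple modulo scheme shows why such a system is hard to manage over time: adding or removing a shard changes the result for most tenants, so data has to be moved or the routing has to become more sophisticated.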
This left one technology that fit all of our technical requirements: M3.
M3DB: A Database for Prometheus
M3DB came from Uber, which open-sourced the project. Several former Uber employees created a startup built on M3DB called Chronosphere, which competes with us but also does a lot of consulting around M3. They do not work with us due to the conflict, which is understandable.
The M3 community is not nearly as diverse as Thanos or Cortex: there are fewer companies using it, and the open source tech is generally controlled by both Uber and Chronosphere. We have had a good working relationship with both companies as we contribute to the open source technology and we have a lot more planned.
We have found that many of M3DB’s capabilities are not as robust as one would expect from a production-level database, so we have been building new capabilities and contributing them back. Chronosphere’s engineering team also works on the open source project, and they are great to work with since both of our organizations want M3DB to succeed. We have continued to invest in scaling and operating the technology efficiently.
During our buildout of the new Prometheus-as-a-Service offering in 2020, Prometheus co-founder and PromLabs founder Julius Volz published a blog post in August showing the compatibility of various backends with the Prometheus query language and API. With the results in that post showing issues with VictoriaMetrics, it seems we avoided a potential problem. M3 is fully compatible with Prometheus, so we are covered on that front.
We will continue evolving, building, and learning how to run more and more open source technologies at scale. We hope that you will learn from our findings, and we are always up for a discussion, so please hit me up on Twitter @jkowall. Thanks for reading.