Processing Data at Scale – Introducing Sawmill


Sawmill is an open source JSON transformation library. It enables you to enrich, transform, and filter your JSON documents. Using Sawmill pipelines, you can integrate your favorite groks, geoip, and user-agent resolving, add or remove fields and tags, and more. Pipelines are defined in a descriptive manner, using configuration files or builders in a simple DSL, which allows you to change transformations dynamically.

We have been using Sawmill in production for over a year now. Today, all of the logs shipped to our ingestion pipelines are ingested using Sawmill. As such, we now feel the tool is mature and stable enough to open source it. We’re happy to contribute it to the community, and are looking forward to getting feedback so we can make it better.

Why we built Sawmill

Sawmill was quite a journey for us. To understand the motivation for embarking on this journey, I will try and provide some contextual background on how ingestion was handled previously.

Initially, we relied heavily on Logstash for handling all data enrichment. This caused us some very specific pain points.

Logstash startup time

Over time, our Logstash configuration files grew, and as a result the load and start time increased. Supporting accounts with heavy and complex configurations meant that Logstash startup sometimes took two minutes or more. This directly affected scaling: responding to data bursts or traffic increases was extremely slow.

Dynamic configurations

Up until Logstash version 2.2, there was no dynamic configuration reload. At the time, we were using version 1.4, and would have benefited greatly from this missing capability. In addition, configuration reload in Logstash requires draining the existing pipelines, loading the new ones, and starting them. This temporarily halts ingestion and causes a queue to build up, so a reload does not go unnoticed.

Processing time

In some cases we found that processing simply took too long. For example, when the log being ingested didn’t match the parsing configuration or the log was totally invalid or too long. In one case, our ingestion pipeline almost came to a full stop due to a burst of invalid logs.

And so we wanted the ability to stop any log processing that takes too long, specifically more than a few hundred milliseconds. (The grok thread interrupt feature was only introduced in Logstash version 5.x, so it was also unavailable to us at the time.)
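To make the requirement concrete, here is a minimal sketch of bounding the processing time of a single log using only the JDK. It is not Sawmill or Logstash code: the class name BoundedProcessing, the Outcome enum, and the 300 millisecond threshold are assumptions chosen for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not Sawmill or Logstash code.
class BoundedProcessing {

    enum Outcome { SUCCEEDED, FAILED, EXPIRED }

    // Runs one unit of log processing and gives up after timeoutMillis.
    static Outcome processWithTimeout(ExecutorService workers, Callable<Void> work, long timeoutMillis)
            throws InterruptedException {
        Future<Void> future = workers.submit(work);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.SUCCEEDED;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread stuck on this log
            return Outcome.EXPIRED;
        } catch (ExecutionException e) {
            return Outcome.FAILED; // the processing itself threw an exception
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(2);
        try {
            Outcome fast = processWithTimeout(workers, () -> null, 300);
            Outcome slow = processWithTimeout(workers, () -> { Thread.sleep(5_000); return null; }, 300);
            System.out.println("fast log: " + fast); // prints "fast log: SUCCEEDED"
            System.out.println("slow log: " + slow); // prints "slow log: EXPIRED"
        } finally {
            workers.shutdownNow();
        }
    }
}
```

Note that cancel(true) only delivers an interrupt. Work that never checks its interrupt flag, such as a long-running regular expression match, keeps running, which is why interruptible matching matters for grok-style parsing.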


In the spirit of the previous pain point, a bad log can also cause Logstash to stop completely — resulting in the process dying, or even worse, a zombie process that seems like it is up but in actuality is not working. In our case, this was of course unacceptable.

Metrics and visibility

We wanted to implement a tight monitoring and troubleshooting mechanism. We needed to know, when something was stuck, what exactly was causing it to get stuck, and being a Java shop, we felt comfortable with the rich debugging tools the JVM has to offer. We needed to have clear metrics on which pipeline was slow and which logs were causing the issues, and the ability to alert when a pipeline is not efficient.  All this was not possible at the time.

These are the main pain points, but there were additional reasons, such as being able to run business logic before and after Logstash processing. If you want to write your logic on top of Logstash, in Ruby, you are good to go. If you wish to use a different language or platform, you need to add a queue before and after Logstash, which means storing the data in queues multiple times, or hack around it in other ways, such as working asynchronously with Logstash directly by sending and receiving over TCP. In short, not an easy implementation, at least not for our use case.

As Sawmill is a Java library, you can have a single process that reads the data from a queue, runs all the business logic and enrichment, and then pushes it forward in a simpler way. But now I’m getting ahead of myself.

Exploring alternatives

What about alternatives? Well, like everything in life, it depends on your use case. Solutions can vary from a simple installation of Logstash or Fluentd to fully controlled stream processing using Kafka Streams.

Logstash and Fluentd

Logstash or Fluentd can fit if all you need is to collect the data, enrich it, and push it to Elasticsearch (or any other data store). If you need to add more complex business logic, say before the transformations, or after the transformations depending on the enriched data, then they might not fit. Having to either write the logic as a plugin/filter or script it via the configuration is very limiting and platform dependent, requires knowledge of the internal pipelining, and means writing it in Ruby (which is fine if that’s your preferred language or your company is polyglot).

Elasticsearch Ingest API

Another option is to use Elasticsearch Ingest API capabilities.

Similar to Logstash filters, Ingest API runs on Elasticsearch Ingest Nodes as part of your cluster, and indexes the data directly after executing the ingest pipeline. With a simpler JSON configuration, REST API and relying on Elasticsearch for persistence, it is a good and valid option.

For our use case, though, it wasn’t a fit.

We needed to execute some actions on the enriched data, or at times even drop an event before indexing it — both not doable using Ingest Node.

Another concern we had was regarding the ability to handle bursts of incoming data and have a good back pressure mechanism. Having a considerable amount of experience managing large Elasticsearch clusters, I can tell you that spinning Elasticsearch instances up and down is not a whole lot of fun — nodes need to communicate with each other, join the cluster, sync the cluster state (which can be huge), on some operations wait for master locks, and more.

You also need to keep in mind that using Ingest Node will only work for indexing to Elasticsearch, and is definitely not a general JSON processing library useful for other use cases.

To sum this up — all of the above are valid alternatives, but it depends on your use case, scale considerations and your preferred stack based on your experience and skill set.

Based on all of the above, we decided that the best option for our use case was to write a Java library that would allow us to integrate Logstash-style enrichment capabilities smoothly into our Log Processing/Ingestion micro-service.

Thus, Sawmill came into the world.

What is Sawmill?

I define Sawmill as a JSON transformation library. It allows you to enrich, transform, and filter JSON documents. Using Sawmill pipelines you can integrate your favorite groks, geoip, user-agent resolving, add or remove fields/tags and more in a descriptive manner, using configuration files or builders, in a simple DSL, allowing you to dynamically change transformations.

Sawmill 101:

  • Sawmill is similar to the ‘filter’ section of Logstash, responsible for performing transformations on the data. Unlike Logstash, Sawmill does not have any inputs or outputs to read and write data. It is only responsible for data transformation.
  • Sawmill performs transformations on documents using processors, which are chained together into pipelines. Using a PipelineExecutor, one can execute a pipeline on a document.
  • Sawmill is written in Java, is thread safe and efficient, and uses caches where needed.
  • Sawmill can be configured in HOCON or JSON (see examples below).
  • Sawmill times out logs that take too long to process, based on a configurable timeout threshold.
  • Sawmill exposes metrics for successful, failed, expired, and dropped executions, and a metric for processing exceeding a defined threshold. All metrics are available per pipeline and processor.
  • Sawmill supports 25+ processors, including: grok, geoip, user-agent, date, drop, key-value, json, math and more.
  • Sawmill supports nine logical conditions, including the basics, and: field-exists, has-value, match-regex and math-compare.
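The per-pipeline execution metrics mentioned in the list above can be pictured with a small sketch. This is not Sawmill's metrics code: the class PipelineMetricsSketch, its methods, and the pipeline id "apache-access" are hypothetical, and only the four outcome names come from the list.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration; Sawmill's real metrics classes are not shown here.
class PipelineMetricsSketch {

    enum Outcome { SUCCEEDED, FAILED, EXPIRED, DROPPED }

    // Key is "<pipelineId>/<outcome>"; LongAdder keeps concurrent increments cheap.
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Records one execution outcome for the given pipeline.
    void record(String pipelineId, Outcome outcome) {
        counters.computeIfAbsent(pipelineId + "/" + outcome, key -> new LongAdder()).increment();
    }

    // Returns how many executions of the pipeline ended with the given outcome.
    long count(String pipelineId, Outcome outcome) {
        LongAdder adder = counters.get(pipelineId + "/" + outcome);
        return adder == null ? 0 : adder.sum();
    }

    public static void main(String[] args) {
        PipelineMetricsSketch metrics = new PipelineMetricsSketch();
        metrics.record("apache-access", Outcome.SUCCEEDED);
        metrics.record("apache-access", Outcome.SUCCEEDED);
        metrics.record("apache-access", Outcome.EXPIRED);
        System.out.println(metrics.count("apache-access", Outcome.SUCCEEDED)); // prints 2
        System.out.println(metrics.count("apache-access", Outcome.DROPPED));   // prints 0
    }
}
```

Counting per pipeline is what makes it possible to alert on a single slow or failing pipeline rather than on the ingestion service as a whole.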

How do I use Sawmill?

Here is a basic example illustrating how to use Sawmill:

Doc doc = new Doc(myLog);
PipelineExecutor pipelineExecutor = new PipelineExecutor();
pipelineExecutor.execute(pipeline, doc);

As you can see above, there are a few entities in Sawmill:

  • Doc – essentially a Map representing a JSON.
  • Processor – a single logical transformation of a document, such as grok-processor, key-value-processor, add-field and so on.
  • Pipeline – specifies a series of processing steps using an ordered list of processors. Each processor transforms the document in some specific way. For example, a pipeline might have one processor that removes a field from the document, followed by another processor that renames a field.
  • PipelineExecutor – executes the processors defined in the pipeline on a document. The PipelineExecutor is responsible for the execution flow: handling onFailure and onSuccess flows, stopping on failure, exposing metrics of the execution and more.
  • PipelineExecutionTimeWatchdog – responsible for warning on long processing times, and for interrupting and stopping processing on timeout (not shown in the example above).
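To show how these entities relate to each other, here is a minimal model written in plain Java. The types below are hypothetical stand-ins that borrow the entity names; they are not Sawmill's real classes, and a plain Map plays the role of Doc. The pipeline mirrors the example given for Pipeline: one processor removes a field and the next renames one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins that mirror the entity names above; not Sawmill's real classes.
class EntitiesSketch {

    // A processor applies one logical transformation to the document in place.
    interface Processor {
        void process(Map<String, Object> doc);
    }

    // A pipeline is an ordered list of processors.
    static final class Pipeline {
        final List<Processor> processors;

        Pipeline(List<Processor> processors) {
            this.processors = new ArrayList<>(processors);
        }
    }

    // The executor runs the processors in order and stops at the first failure.
    static boolean execute(Pipeline pipeline, Map<String, Object> doc) {
        for (Processor processor : pipeline.processors) {
            try {
                processor.process(doc);
            } catch (RuntimeException e) {
                return false;
            }
        }
        return true;
    }

    static Processor removeField(String field) {
        return doc -> doc.remove(field);
    }

    static Processor renameField(String from, String to) {
        return doc -> {
            if (!doc.containsKey(from)) {
                throw new IllegalStateException("missing field: " + from);
            }
            doc.put(to, doc.remove(from));
        };
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("message", "GET /index.html 200");
        doc.put("tmp", "scratch");
        doc.put("host", "web-1");

        Pipeline pipeline = new Pipeline(List.of(removeField("tmp"), renameField("host", "hostname")));
        boolean succeeded = execute(pipeline, doc);
        System.out.println(succeeded + " " + doc);
        // prints: true {message=GET /index.html 200, hostname=web-1}
    }
}
```

In this sketch, a failing processor leaves the document partially transformed. A real executor would also need failure handling, metrics and timeouts, which is what the onFailure flow, the exposed metrics and the watchdog described above are for.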

Sawmill Configuration

A Sawmill pipeline can be built from a HOCON (Human-Optimized Config Object Notation) string. Any JSON is valid HOCON, so if you are not familiar with HOCON, you can always use a JSON configuration.

Here is a simple configuration snippet, to get a feel for it:

"steps": [{
    "grok": {
        "config": {
            "field": "message",
            "overwrite": ["message"],
            "patterns": ["%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}"]
        }
    }
}]

Which is equivalent to the following in HOCON:

steps: [{
    grok.config: {
        field : "message"
        overwrite : ["message"]
        patterns : ["%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}"]
    }
}]

Here’s a simple code sample showing GeoIP resolution:

package io.logz.sawmill;

import io.logz.sawmill.Doc;
import io.logz.sawmill.ExecutionResult;
import io.logz.sawmill.Pipeline;
import io.logz.sawmill.PipelineExecutor;

import static io.logz.sawmill.utils.DocUtils.createDoc;

public class SawmillTesting {

    public static void main(String[] args) {

        // Build a pipeline with a single geoIp processor from a HOCON string
        Pipeline pipeline = new Pipeline.Factory().create(
                "{ steps :[{\n" +
                "    geoIp: {\n" +
                "      config: {\n" +
                "        sourceField: \"ip\"\n" +
                "        targetField: \"geoip\"\n" +
                "        tagsOnSuccess: [\"geo-ip\"]\n" +
                "      }\n" +
                "    }\n" +
                "  }]\n" +
                "}");

        Doc doc = createDoc("message", "testing geoip resolving", "ip", "");
        ExecutionResult executionResult = new PipelineExecutor().execute(pipeline, doc);

        if (executionResult.isSucceeded()) {
            System.out.println("Success! result is:"+doc.toString());
            // will print out:
            // Success! result is:Doc{source={message=testing geoip resolving, ip=, geoip={timezone=America/Los_Angeles, city_name=Mountain View, country_name=United States, ...
        }
    }
}

Looking into the future

The development of Sawmill was done in a somewhat ad-hoc and selfish manner — we only implemented what was needed for our specific use case.

We are extremely happy with how Sawmill is being used in our current architecture, but at the same time we also realize that there are still a lot of features and processing options that can be added (for example — the ability to run scripts in a processor and a nice reference implementation.)

We are super excited to open source this project and are looking forward to getting the community involved. In case you are looking for something specific and find Sawmill lacking, please either give us a shout in GitHub or open a PR and contribute it!

We hope you’ll find Sawmill as useful as we did!
