Logstash Tutorial: How to Get Started


Logstash is the “L” in the ELK Stack, the world’s most popular log analysis platform. It is responsible for aggregating data from different sources, processing it, and sending it down the pipeline, usually to be indexed directly in Elasticsearch.

This Logstash tutorial gives you a crash course in getting started with Logstash, and provides instructions for installing Logstash and configuring it.

Logstash can pull from almost any data source using input plugins, apply a wide variety of data transformations and enhancements using filter plugins, and ship the data to a large number of destinations using output plugins. The role Logstash plays in the stack, therefore, is critical — it allows you to filter, massage, and shape your data so that it’s easier to work with.

Before we begin with Logstash…

Despite its popularity, Logstash has some serious shortcomings – chief among them being its huge computing footprint and tendency to break.

If you’re here because you want to get the most out of your existing Logstash installation, please read on!

If you’re here to evaluate Logstash, we typically recommend other options like Fluentd or FluentBit – which are lightweight log collectors that can handle most Logstash log processing capabilities, without the heavy computing footprint or propensity for breaking.

If you’re struggling with Logstash or the ELK data pipeline more generally, check out Logz.io Log Management to centralize your logs with out-of-the-box log ingestion, processing, storage, and analysis. We manage and scale OpenSearch – the newly forked version of Elasticsearch, maintained by AWS – on our SaaS platform, so you don’t have to do it yourself.

The service includes parsing-as-a-service, which means our Customer engineers will just parse your logs for you. Beats configuring Logstash, eh?!

Anyways, this blog is about Logstash. So let’s get started.

Installing Logstash

Depending on your operating system and your environment, there are various ways of installing Logstash. We will be installing Logstash on an Ubuntu 16.04 machine running on AWS EC2 using apt. Check out other installation options here.

Before you install Logstash, make sure you have either Java 8 or Java 11 installed.

To install Java, use:

sudo apt-get update
sudo apt-get install default-jre
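
Once Java is installed, you can verify the version (the exact output will vary by distribution):

java -version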

First, you need to add Elastic’s signing key so that the downloaded package can be verified (skip this step if you’ve already installed packages from Elastic):

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

The next step is to add the repository definition to your system:

echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo
tee -a /etc/apt/sources.list.d/elastic-7.x.list

It’s worth noting that there is another package containing only features available under the Apache 2.0 license. To install this package, use:

echo "deb https://artifacts.elastic.co/packages/oss-7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list

All that’s left to do is to update your repositories and install Logstash:

sudo apt-get update
sudo apt-get install logstash
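
To confirm the installation, you can print the Logstash version (the path below is the default location for the apt package):

/usr/share/logstash/bin/logstash --version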

Configuring Logstash

Logstash configuration is one of the biggest obstacles users face when working with Logstash. While improvements have been made recently to managing and configuring pipelines, this can still be a challenge for beginners.

We’ll start by reviewing the three main configuration sections in a Logstash configuration file, each responsible for different functions and using different Logstash plugins.

Logstash Inputs

One of the things that makes Logstash so powerful is its ability to aggregate logs and events from various sources. With more than 50 input plugins for different platforms, databases, and applications, Logstash can be configured to collect and process data from these sources and send it to other systems for storage and analysis.

The most commonly used inputs are file, beats, syslog, http, tcp (ideally with SSL enabled), udp, and stdin, but you can ingest data from plenty of other sources.

Inputs are the starting point of any configuration. If you do not define an input, Logstash will automatically create a stdin input. Since you can create multiple inputs, it’s important to type and tag them so that you can properly manipulate them in filters and outputs.
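
As a minimal sketch, two typed and tagged inputs might look like this (the path, port, and labels are just placeholders for illustration):

input {
  file {
    path => "/var/log/apache2/access.log"
    type => "apache_access"
    tags => ["apache"]
  }
  beats {
    port => 5044
    type => "beats_events"
    tags => ["beats"]
  }
}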

Logstash Syslog Input

This input receives syslog messages sent from your machines to Logstash. The syslog input plugin only supports RFC3164 syslog by default. There are other fields you can use to configure the plugin, including the grok_pattern field; note that with a proper grok pattern, non-RFC3164 syslog can be supported. As of version 3.4.1 of the plugin, both the grok_pattern and syslog_field options are configurable.

The default grok pattern is:
"<%{POSINT:priority}>%{SYSLOGLINE}"

Other fields include the strings timezone, locale, and host; the arrays severity_labels and facility_labels; and the booleans proxy_protocol and use_labels. Oh yeah, and the port field is a number.

All Logstash input plugins support the following optional configurations: tags, type, id, enable_metric, codec, and add_field.
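
Putting a few of these options together, a sketch of a syslog input could look like the following (the port, timezone, and type values are illustrative assumptions):

input {
  syslog {
    port => 514
    timezone => "UTC"
    use_labels => true
    type => "syslog"
  }
}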

Logz.io provides a more advanced Logstash tutorial for grok.

Logstash Filters

If Logstash were just a simple pipe between a number of inputs and outputs, you could easily replace it with a service like IFTTT or Zapier. Luckily for us, it isn’t. Logstash supports a number of extremely powerful filter plugins that enable you to manipulate, measure, and create events. It’s the power of these filters that makes Logstash a very versatile and valuable tool.
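
As a simple illustration (the field names and log level here are assumptions, not part of any standard schema), a mutate filter can rename or remove fields and a drop filter can discard events you don’t need:

filter {
  mutate {
    rename => { "clientip" => "client_ip" }
    remove_field => [ "beat" ]
  }
  if [loglevel] == "debug" {
    drop { }
  }
}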

Logstash Outputs

As with the inputs, Logstash supports a number of output plugins that enable you to push your data to various locations, services, and technologies. You can store events using outputs such as File, CSV, and S3, convert them into messages with RabbitMQ and SQS, or send them to various services like HipChat, PagerDuty, or IRC. The number of combinations of inputs and outputs in Logstash makes it a really versatile event transformer.

Logstash events can come from multiple sources, so it’s important to check whether or not an event should be processed by a particular output. If you do not define an output, Logstash will automatically create a stdout output.
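
A common pattern is to wrap each output in a conditional based on the event type or tags, so it only receives the events intended for it. Here is a rough sketch (the type name and index pattern are assumptions):

output {
  if [type] == "apache_access" {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "apache-%{+YYYY.MM.dd}"
    }
  } else {
    stdout { codec => rubydebug }
  }
}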

Logstash Configuration Examples

Logstash has a simple configuration DSL that enables you to specify the inputs, outputs, and filters described above, along with their specific options. Order matters, specifically around filters and outputs, as the configuration is basically converted into code and then executed. Keep this in mind when you’re writing and debugging your configs.

Structure

Your configurations will generally have three sections: inputs, outputs, and filters. You can have multiple instances of each of these sections, which means that you can group related plugins together in a config file instead of grouping them by type. Logstash configs are generally structured as follows:

#/etc/logstash/conf.d/
- apache.conf
- haproxy.conf
- syslog.conf

So you can have a configuration file for each of the functions or integrations that you would like Logstash to perform. Each of those files will contain the necessary inputs, filters, and outputs to perform that function.

Example 1: File → Logstash → Elasticsearch

input {
  file {
    path => "/var/log/apache2/access.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
  geoip {
    source => "clientip"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}

The input section is using the file input plugin to tell Logstash to pull logs from the Apache access log.

In the filter section, we are applying: a) a grok filter that parses the log string and populates the event with the relevant information from the Apache logs, b) a date filter to parse the timestamp field, and c) a geoip filter to enrich the clientip field with geographical data.

Tip! The grok filter is not easy to configure. We recommend testing your filters before starting Logstash using the grok debugger. A rich list of the most commonly used grok patterns is available here.

Lastly, the output section in this case is defined to send the data to a local Elasticsearch instance.

Example 2: Filebeat → Logstash → Kafka

input {
  beats {
    port => "5044"
  }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
  geoip {
    source => "clientip"
  }
}
output {
  kafka {
    bootstrap_servers => "localhost:9092"
    codec => plain {
      format => "%{message}"
    }
    topic_id => "apache"
  }
}

In this case, we’re using the same processing for our Apache logs but instead of pulling them directly from the file, we’re using the beats input plugin to pull them from Filebeat. Likewise, we’re outputting the logs to a Kafka topic instead of our Elasticsearch instance.

Example 3: Beats → Logstash → Logz.io (TCP)

input {
  beats {
    port => "5044"
    type => "apache_access"
  }
}
filter {
  mutate {
    add_field => { "token" => "aaWTINmMspBUetRoGUrxEApzQkkoMWMn" }
  }
}
output {
  tcp {
    host => "listener.logz.io"
    port => 5050
    codec => json_lines
  }
}

In this example, we’re shipping our Apache access logs to Logz.io. Note that since Logz.io applies parsing automatically, we are just using a mutate filter with add_field to add a field containing the Logz.io account token. The tcp output plugin defines the Logz.io listener as the destination.

Example 4: Beats → Logstash → Logz.io (SSL)

input {
  beats {
    port => "5044"
    type => "apache_access"
  }
}
filter {
  mutate {
    add_field => { "token" => "aaWTINmMspBUetRoGUrxEApzQkkoMWMn" }
  }
}
output {
  lumberjack {
    host => "listener.logz.io"
    port => 5006
    ssl_certificate => "/usr/share/logstash/keys/TrustExternalCARoot.crt"
    codec => json_lines
  }
}

While shipping to Logz.io over TCP is possible, we recommend shipping over SSL.

___

Each Logstash configuration file can contain these three sections. Logstash will typically combine all of your configuration files and treat them as one large config. Since you can have multiple inputs, it’s recommended that you tag your events or assign types to them so that it’s easy to identify them at a later stage. Also, ensure that you wrap filters and outputs that are specific to a category or type of event in a conditional; otherwise, you might get some surprising results.

Working with Logstash Plugins

You will find that the most common use cases are covered by the plugins shipped and enabled by default. To see the list of loaded plugins, access the Logstash installation directory and execute the list command:

cd /usr/share/logstash
bin/logstash-plugin list

Installing other plugins is easily accomplished with:

bin/logstash-plugin install logstash-output-kafka

Updating and removing plugins is just as easy, as well as installing a plugin built locally.
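
For example (the plugin name and gem path here are just for illustration):

bin/logstash-plugin update logstash-output-kafka
bin/logstash-plugin remove logstash-output-kafka
bin/logstash-plugin install /path/to/logstash-output-example-0.1.0.gem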

Start stashing!

The only thing that’s left to do is get your hands dirty – start Logstash!

sudo service logstash start

Configuration errors are a frequent occurrence, so using the Logstash logs can be useful to find out what error took place.

sudo tail -f /var/log/logstash/logstash-plain.log
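
You can also ask Logstash to validate your configuration without starting a pipeline, which catches many of these errors early (the settings path below is the default for the apt package):

sudo -u logstash /usr/share/logstash/bin/logstash --path.settings /etc/logstash --config.test_and_exit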

Monitoring Logstash

As powerful as it is, Logstash is notorious for suffering from design-related performance issues. This problem is exacerbated as pipelines get more complex and configuration files begin to get longer.

Luckily, there are some methods you can use to monitor Logstash performance.

Logstash automatically records information and metrics about the node running Logstash, the JVM, and the running pipelines, which can be used to monitor performance. To tap into this information, you can use the monitoring API.

For example, you can use the Hot Threads API to view Java threads with high CPU and extended execution times:

curl -XGET 'localhost:9600/_node/hot_threads?human=true'
Hot threads at 2019-05-27T08:43:05+00:00, busiestThreads=10:
================================================================================
3.16 % of cpu usage, state: timed_waiting, thread name: 'LogStash::Runner', thread id: 1
	java.base@11.0.3/java.lang.Object.wait(Native Method)
	java.base@11.0.3/java.lang.Thread.join(Thread.java:1313)
	app//org.jruby.internal.runtime.NativeThread.join(NativeThread.java:75)
--------------------------------------------------------------------------------
0.61 % of cpu usage, state: timed_waiting, thread name: '[main]>worker5', thread id: 29
	java.base@11.0.3/jdk.internal.misc.Unsafe.park(Native Method)
	java.base@11.0.3/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
--------------------------------------------------------------------------------
0.47 % of cpu usage, state: timed_waiting, thread name: '[main]<file', thread id: 32
	java.base@11.0.3/jdk.internal.misc.Unsafe.park(Native Method)
	java.base@11.0.3/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1079)
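
Another useful endpoint is the node stats API, which exposes JVM, process, and per-pipeline metrics, for example:

curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'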

Alternatively, you can use the monitoring UI within Kibana, available under the Basic license.

Endnotes

Logstash is now increasingly being used in tandem with lighter data collectors called Beats. The different beats, such as Filebeat and Metricbeat, act as lightweight shippers that collect different types of data and subsequently ship it into Logstash for more advanced processing. This has changed the way data pipelines are set up with Logstash and also helped alleviate some of the performance issues mentioned above.

This getting started guide provided you with the steps you’ll need to start using Logstash. After you’ve set up the first pipeline, you will slowly become more acquainted with the ins and outs of using Logstash. Handling multiple and complex data pipelines with Logstash is not easy. Read the docs carefully and test in development before applying in production.
