Kafka

Logs are unpredictable.

Following a production incident, and precisely when you need them the most, logs can suddenly surge and overwhelm your logging infrastructure. To protect Logstash and Elasticsearch against such data bursts, users deploy buffering mechanisms to act as message brokers.

Apache Kafka is the most common broker solution deployed together the ELK Stack. Usually, Kafka is deployed between the shipper and the indexer, acting as an entrypoint for the data being collected:

kafka

In this article, I’ll show how to deploy all the components required to set up a resilient data pipeline with the ELK Stack and Kafka:

  • Filebeat – collects logs and forwards them to a Kafka topic.
  • Kafka – brokers the data flow and queues it.
  • Logstash – aggregates the data from the Kafka topic, processes it and ships to Elasticsearch.
  • Elasticsearch – indexes the data.
  • Kibana – for analyzing the data.

My environment

To perform the steps below, I set up a single Ubuntu 16.04 machine on AWS EC2 using local storage. In real-life scenarios you will probably have all these components running on separate machines.

I started the instance in the public subnet of a VPC and then set up a security group to enable access from anywhere using SSH and TCP 5601 (for Kibana). Finally, I added a new elastic IP address and associated it with the running instance.

The example logs used for the tutorial are Apache access logs.

Step 1: Installing Elasticsearch

We will start with installing the main component in the stack — Elasticsearch. Since version 7.x, Elasticsearch is bundled with Java so we can jump right ahead with adding Elastic’s signing key:

For installing Elasticsearch on Debian, we also need to install the apt-transport-https package:

Our next step is to add the repository definition to our system:

All that’s left to do is to update your repositories and install Elasticsearch:

Before we bootstrap Elasticsearch, we need to apply some basic configurations using the Elasticsearch configuration file at: /etc/elasticsearch/elasticsearch.yml:

Since we are installing Elasticsearch on AWS, we will bind Elasticsearch to localhost. Also, we need to define the private IP of our EC2 instance as a master-eligible node:

Save the file and run Elasticsearch with:

To confirm that everything is working as expected, point curl to: http://localhost:9200, and you should see something like the following output (give Elasticsearch a minute or two before you start to worry about not seeing any response):

Step 2: Installing Logstash

Next up, the “L” in ELK — Logstash. Logstash will requires us to install Java 8 which is fine because this is also a requirement for running Kafka:

Verify java is installed:

Since we already defined the repository in the system, all we have to do to install Logstash is run:

Next, we will configure a Logstash pipeline that pulls our logs from a Kafka topic, process these logs and ships them on to Elasticsearch for indexing.

Let’s create a new config file:

Paste the following configurations:

As you can see — we’re using the Logstash Kafka input plugin to define the Kafka host and the topic we want Logstash to pull from. We’re applying some filtering to the logs and we’re shipping the data to our local Elasticsearch instance.

Save the file.

Step 3: Installing Kibana

Let’s move on to the next component in the ELK Stack — Kibana. As before, we will use a simple apt command to install Kibana:

We will then open up the Kibana configuration file at: /etc/kibana/kibana.yml, and make sure we have the correct configurations defined:

These specific configurations tell Kibana which Elasticsearch to connect to and which port to use.

Now, we can start Kibana with:

Open up Kibana in your browser with: http://localhost:5601. You will be presented with the Kibana home page.

Kibana

Step 4: Installing Filebeat

As mentioned above, we will be using Filebeat to collect the log files and forward them to Kafka.

To install Filebeat, we will use:

Let’s open the Filebeat configuration file at:

Enter the following configurations:

In the input section, we are telling Filebeat what logs to collect — apache access logs. In the output section, we are telling Filebeat to forward the data to our local Kafka server and the relevant topic (to be installed in the next step).

Note the use of the codec.format directive — this is to make sure the message and timestamp fields are extracted correctly. Otherwise, the lines are sent in JSON to Kafka.

Save the file.

Step 4: Installing Kafka

Our last and final installation involves setting up Apache Kafka — our message broker.

Kafka uses ZooKeeper for maintaining configuration information and synchronization so we’ll need to install ZooKeeper before setting up Kafka:

Next, let’s download and extract Kafka:

We are now ready to run kafka, which we will do with this script:

You should begin to see some INFO messages in your console:

Next, we’re going to create a topic for our Apache logs:

We are all set to start the pipeline.

Step 5: Starting the data pipeline

Now that we have all the pieces in place, it’s time to start all the components in charge of running the data pipeline.

First, we’ll start Filebeat:

Then, Logstash:

It will take a few minutes for the pipeline to start streaming logs. To see Kafka in action, enter the following command in a separate tab:

If your Apache web server is indeed serving requests, you should begin to see the messages being forwarded to the Kafka topic by Filebeat in your console:

To make sure Logstash is aggregating the data and shipping it into Elasticsearch, use:

If all is working as expected, you should see a logstash-* index listed:

If you don’t see this index, it’s time to do some debugging I’m afraid. Check out this blog post for some tips on debugging Logstash.

All we have to do now is define the index pattern in Kibana to begin analysis. This is done under Management → Kibana Index Patterns.

step 1

Kibana will identify the index, so simply define it in the relevant field and continue on to the next step of selecting the timestamp field:

step 2

Once you create the index pattern, you’ll see a list of all the parsed and mapped fields:

create index pattern

Open the Discover page to begin analyzing your data!

discover

Summing it up

Resilient data pipelines are a must in any production-grade ELK deployments. Logs are crucial for piecing together events and they are needed most in emergencies. We cannot afford to have our logging infrastructure fail at the exact point in time when it is most required. Kafka, and similar brokers, play a huge part in buffering the data flow so Logstash and Elasticsearch don’t cave under the pressure of a sudden burst.

The example above is a basic setup of course. Production deployments will include multiple Kafka instances, a much larger amount of data and much more complicated pipelines. This will involve a considerable amount of engineering time and resources, something to take into consideration. The instructions here will help you understand how to get started though. I also recommend taking a look at our article explaining how to log Kafka itself.

Good luck!

Looking for a scalable ELK solution?