Many of my fellow engineers ask me what it means to be an SRE (Site Reliability Engineer). 

When I tell them it’s a type of DevOps engineer, they get a glazed look in their eyes and then ask what a DevOps engineer is. I then find myself googling both job titles and reading twelve very different definitions until I reach the conclusion that these definitions vary wildly from company to company and from team to team.

My final answer is that, like a programmer, an SRE writes code; however, we don’t work on the product itself but on the systems surrounding it. In addition to writing code, we also work on operations.

Confused? So is the cat.

So, while I’m still not exactly sure where the fine line between SRE and DevOps lies, what I can do is give you a picture of what one year’s worth of work as an SRE at Logz.io looks like. Hopefully, this will help you decide whether or not you should become one.

Helping deliver better code, faster

As an SRE, I helped develop Apollo, our open-source continuous deployment solution built on top of Kubernetes. With Jenkins and Apollo, we have built a way to fully automate our deployments to production.

Apollo is a web application written in Java. Most of the work was classic backend development, along with some frontend development in JavaScript and Angular and some work on the Jenkins pipelines.

I’ve had the pleasure of developing many of its features and have become the go-to person for every Apollo-related issue (not that it has any issues; Apollo is GREAT).

You are welcome to read about our transition to CD and use Apollo to deploy your software better and faster!

Stabilizing important components in the system

Because we care about getting a good night’s sleep, we are always trying to improve the stability of our system.

For example, we rely heavily on a Slack bot that helps everyone in the company with their day-to-day work. We use it to investigate how shards are spread across our Elasticsearch clusters, or to check the ingestion rate of our Kafka consumers.

We used to run the bot on an EC2 instance in AWS, which meant we had to bring it back to life manually every once in a while and then deploy it by running a dedicated script that we also had to maintain.

So, I dockerized it and it now runs on Kubernetes and is deployed with Apollo.

Much better.

Tightening up our monitoring operation

Monitoring is a major concern for the production team. We use Nagios to run Bash and Python tests on our services and cloud components, and we use Puppet to configure Nagios.

I have worked a lot on our monitoring system: writing new tests, cleaning up and improving existing ones, and fighting to understand why tests failed and woke me up at night.
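To give a feel for what such a test looks like, here is a minimal sketch of a Nagios-style check in Python. It is not one of our actual tests: the metric (a consumer lag), the thresholds, and the `read_lag` helper are all made up for illustration. What it does follow is the standard Nagios plugin convention of a single line of output plus a status code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN).

```python
# Standard Nagios plugin status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a Nagios status code and a one-line message."""
    if value is None:
        return UNKNOWN, "UNKNOWN - no measurement available"
    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    # Text after the pipe is performance data: label=value;warn;crit
    return code, f"{LABELS[code]} - lag is {value} | lag={value};{warn};{crit}"


def read_lag():
    """Hypothetical stand-in for a real measurement, e.g. querying a service."""
    return 120


def main():
    """Run the check, print the status line, and return the status code."""
    code, message = evaluate(read_lag(), warn=1000, crit=5000)
    print(message)
    return code
```

A real plugin would finish with `sys.exit(main())` so that Nagios sees the status as the process exit code. Keeping the threshold logic in its own function makes the check easy to test without a live service.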

Resolving production issues, 24/7

Woken up in the middle of the night again?!

Oh yeah, I take part in an on-call rotation, in which we are the first to be alerted when our monitoring system detects that something is wrong. We are the first warriors to handle real-time production events.

Don’t worry, they let us arrive late to work after a tough night 🙂

Setting up a new database

We needed a multi-region master-master DB cluster, and we chose to use Galera Cluster. Building a usable Galera cluster involved, among other things:

  • Deciding that we prefer it on EC2, not Kubernetes.
  • Writing a new Puppet module.
  • Adjusting our launching scripts.
  • Figuring out a way to back up the data to AWS S3 buckets.
  • Implementing a monitoring system.
  • Designing and building the AWS setup (auto-scaling groups, load balancers, security groups, etc.).

Summing it up

These examples are just a taste of what an SRE’s life looks like at Logz.io, and of course there are many more that I left out. If I had to summarize it in just a few points, I would say that we are basically trying to improve everyone’s life programmatically by:

  • Automating everything we can.
  • Making software releases as seamless and safe as possible.
  • Knowing when things go wrong, fixing them automatically when we can, and letting the right people know when we can’t.
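
The last point can be sketched in a few lines of Python. This is only an illustration of the idea, not code from our systems: the `Check` class, the `handle` function, and the `notify` callback are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Check:
    name: str
    is_healthy: Callable[[], bool]
    # None means there is no known automatic fix for this check.
    remediate: Optional[Callable[[], None]] = None


def handle(check, notify):
    """Return 'healthy', 'fixed', or 'escalated' for a single check."""
    if check.is_healthy():
        return "healthy"
    if check.remediate is not None:
        check.remediate()
        # Only trust the fix if the check passes afterwards.
        if check.is_healthy():
            return "fixed"
    notify(f"{check.name} is unhealthy and could not be fixed automatically")
    return "escalated"


# Example: a service that a (pretend) restart brings back to life.
state = {"up": False}
bot = Check("slack-bot", lambda: state["up"], lambda: state.update(up=True))
alerts = []
result = handle(bot, alerts.append)
```

Here `result` is `"fixed"` and `alerts` stays empty, because the remediation worked. A check with no `remediate` function, or one whose fix does not help, ends up in `notify`, which is where the right people would be paged.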