Many of my fellow engineers ask me what it means to be an SRE (Site Reliability Engineer).
When I tell them it’s a type of DevOps engineer, they get a glazed look in their eyes and then ask what a DevOps engineer is. I then find myself googling both job titles and reading twelve very different definitions until I reach the conclusion that these definitions vary wildly from company to company and from team to team.
My final answer is that, like a programmer, an SRE writes code; however, we work not on the product itself but on the systems surrounding it. In addition to writing code, we also handle operations.
So, while I’m still not exactly sure where the fine line between SRE and DevOps lies, what I can do is give you a picture of what one year’s worth of work as an SRE at Logz.io looks like. Hopefully, this will help you decide whether or not you should become one.
Helping deliver better code, faster
As an SRE, I helped develop Apollo, our open source continuous deployment solution on top of Kubernetes. With Jenkins and Apollo, we have built a way to fully automate our deployment to production.
I’ve had the pleasure of developing many of its features and have become the go-to person for every Apollo-related issue (not that it has any issues; Apollo is GREAT).
You are welcome to read about our transition to CD and use Apollo to deploy your software better and faster!
Stabilizing important components in the system
Worried about our night’s sleep, we are always trying to improve the stability of our system.
For example, we rely heavily on a Slack bot that helps everyone in the company with their day-to-day work. We use it to inspect how shards are spread across our Elasticsearch clusters, or to check the ingestion rate of our Kafka consumers.
In the past, we ran the bot on an EC2 instance in AWS, which meant we had to bring it back to life manually every once in a while and redeploy it with a dedicated script that we also had to maintain.
So, I dockerized it and it now runs on Kubernetes and is deployed with Apollo.
Tightening up our monitoring operation
Monitoring is a major concern for the production team. We use Nagios to run Bash and Python tests on our services and cloud components, and we use Puppet to configure Nagios.
I have worked a lot on our monitoring system: writing new tests, cleaning up and improving existing ones, and fighting to understand why tests failed and woke me up at night.
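To make this concrete, here is a minimal sketch of what a Python test in the Nagios plugin style might look like. It is not one of the actual Logz.io tests: the consumer-lag metric, the thresholds, and the `read_consumer_lag` helper are all hypothetical. The only real convention it relies on is that Nagios reads a plugin's exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and the first line it prints.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(lag, warn, crit):
    """Map a measured value to a Nagios status code and a one-line message."""
    if lag is None:
        return UNKNOWN, "UNKNOWN - could not read consumer lag"
    if lag >= crit:
        code = CRITICAL
    elif lag >= warn:
        code = WARNING
    else:
        code = OK
    # Everything after '|' is optional performance data: label=value;warn;crit
    return code, f"{LABELS[code]} - consumer lag is {lag} | lag={lag};{warn};{crit}"


def read_consumer_lag():
    """Hypothetical placeholder: a real test would query the service here."""
    return 120


def main():
    code, message = evaluate(read_consumer_lag(), warn=1000, crit=5000)
    print(message)
    return code

# When installed as a plugin script, finish with: sys.exit(main())
```

Keeping the threshold logic in a pure function such as `evaluate` makes a test easy to unit-test, which helps when you are trying to work out at 3 a.m. whether the alert or the service is the thing that is broken.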
Resolving production issues, 24/7
Woken up in the middle of the night again?!
Oh yeah, I take part in an on-call rotation, where we are the first to get an alert when our monitoring system discovers that something is wrong. We are the first warriors to handle real-time production events.
Don’t worry, they let us arrive late to work after a tough night 🙂
Setting up a new database
We needed a multi-region master-master DB cluster, and we chose Galera Cluster. Building a usable Galera cluster involved, among other things:
- Deciding that we prefer it on EC2, not Kubernetes.
- Writing a new Puppet module.
- Adjusting our launching scripts.
- Figuring out a way to back up the data to AWS S3 buckets.
- Implementing a monitoring system.
- Designing and building the AWS setup (auto-scaling groups, load balancers, security groups, etc.).
Summing it up
These examples are just a taste of what an SRE’s life looks like at Logz.io and of course, there are many more examples that I left out. If I had to summarize it in just a few points, I would say that we are basically trying to improve everyone’s life programmatically by:
- Automating everything we can.
- Making software releases as seamless and safe as possible.
- Knowing when things go wrong, fixing them automatically when we can, and letting the right people know when we can’t.